Add dataset created from processing State of OA DOIs #13

jglev · 2017-12-01T14:15:51Z

This is an in-progress PR, which will eventually contain the dataset I'm downloading and processing from the State of OA DOIs list.

jglev · 2017-12-01T14:15:55Z

As of this writing, I've queried just over 106,000 DOIs. The full dataset has ~290,000, as I remember. I currently plan to leave the downloader running over the weekend.

dhimmel · 2017-12-04T15:20:54Z

.gitattributes

@@ -1 +1 @@
-*.xz filter=lfs diff=lfs merge=lfs -text
+data/library_coverage_xml_and_fulltext_indicators.db.xz filter=lfs diff=lfs merge=lfs -text


Why did you change this to not track data/library_coverage_xml_and_fulltext_indicators.tsv.xz using Git LFS? I think tracking all .xz files with LFS makes the most sense?

A combination of oversight and confusion about what we agreed to track with lfs : P Reverted in 44aba2d.

jglev · 2017-12-04T17:08:55Z

I've committed the dataset (containing all 290,120 rows), as well as an RMarkdown file for generating a markdown-formatted table that can presumably be worked into your code for the figure you mentioned. How does this look to you?

Here's the current table output from that RMarkdown file:

oa_doi_color	no_access_percent	yes_access_percent	yes_access_rate	oa_color_total
bronze	17.79	82.21	36077	43886
closed	17.43	82.57	150934	182804
gold	3.37	96.63	22018	22786
green	9.78	90.22	23441	25981
hybrid	15.45	84.55	12397	14663

jglev · 2017-12-04T17:13:04Z

To make sure the table's column names make sense, here is how to read the OA "Bronze" row:

There were 43,886 DOIs in that category.
Of those 43,886, UPenn has access to 36,077.
That means that UPenn has access to 36,077 / 43,886 = 82.21% of the DOIs queried in that category.
It also means that UPenn does not have access to 100 - 82.21 = 17.79% of the DOIs queried in that category.
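The percentages in the table can be re-derived from the two count columns. The following is a minimal Python sketch of that arithmetic, not the RMarkdown code from this PR; the counts are copied from the table above, and the function and variable names are made up for illustration.

```python
# Counts copied from the table above: (DOIs with access, total DOIs) per OA color.
counts = {
    "bronze": (36077, 43886),
    "closed": (150934, 182804),
    "gold": (22018, 22786),
    "green": (23441, 25981),
    "hybrid": (12397, 14663),
}


def access_percentages(with_access, total):
    """Return (yes_access_percent, no_access_percent), each rounded to 2 places."""
    yes = round(100 * with_access / total, 2)
    no = round(100 - yes, 2)
    return yes, no


# Print one tab-separated row per category, in the table's column order.
for color, (with_access, total) in counts.items():
    yes, no = access_percentages(with_access, total)
    print(f"{color}\t{no:.2f}\t{yes:.2f}\t{with_access}\t{total}")

# The per-category totals should add up to the number of rows in the dataset.
total_rows = sum(total for _, total in counts.values())
```

Summing the `oa_color_total` column this way is also a quick consistency check against the dataset's overall row count.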

jglev · 2017-12-04T17:17:26Z

Ha, and just to double-check, I just now ran the RMarkdown file, and then ran length(which(original_dataset_with_oa_color_column$oadoi_color == "bronze")), to check from the original dataset itself that the number of Bronze DOIs is 43,886. To confirm, it is. : )

dhimmel · 2017-12-04T17:47:54Z

Great! Excited for these access calls and incorporating them into the Sci-Hub Manuscript.

data/library_coverage_xml_and_fulltext_indicators.tsv.xz still isn't tracked with LFS. Perhaps stop tracking it and then re-add it.

I did confirm that it contained 290121 lines:

curl --location --silent \
  https://github.com/publicus/library-access/raw/5f04fbcacbef4cefcba41c79b23d58294afc6b72/data/library_coverage_xml_and_fulltext_indicators.tsv.xz \
   | xzcat | wc --lines

So that's good.
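For a local checkout, the same line count could be sketched in Python with the standard-library `lzma` module. The function name and the path in the usage comment are illustrative; reading 290,121 lines as 290,120 data rows plus one header line is an assumption.

```python
import lzma


def count_lines_xz(path):
    """Count newline-delimited lines in an xz-compressed text file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Hypothetical usage against a local checkout of the repository:
# count_lines_xz("data/library_coverage_xml_and_fulltext_indicators.tsv.xz")
```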

There's still this problematic line in .gitignore:

./library_coverage_xml_and_fulltext_indicators.db*

Can you change it to

data/library_coverage_xml_and_fulltext_indicators.db

jglev · 2017-12-04T19:48:28Z

I've made a new commit to remove and re-add data/library_coverage_xml_and_fulltext_indicators.tsv.xz. Has the tracking of that file been solved in f2d98d9? I'm having trouble telling.

Re: the line in .gitignore, removing the wildcard will cause git to prompt users to add library_coverage_xml_and_fulltext_indicators.db-shm and library_coverage_xml_and_fulltext_indicators.db-wal, which are created whenever the database is opened (because write-ahead logging is turned on). That seems undesirable to me -- does it seem desirable to you, though?
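To illustrate where the `-shm` and `-wal` files come from, here is a small Python sketch using the standard-library `sqlite3` module. It is not the downloader script from this repo; the file name `example.db` and the table are placeholders. It opens a database in write-ahead-logging mode, writes one row, and lists the files that exist while the connection is still open.

```python
import os
import sqlite3
import tempfile


def sidecar_files_while_open(db_name="example.db"):
    """Open a SQLite database in WAL mode, write to it, and list the directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, db_name)
        conn = sqlite3.connect(path)
        # The PRAGMA returns the journal mode actually in effect.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # List files while the connection is still open.
        files = sorted(os.listdir(tmp))
        conn.close()
        return mode, files


mode, files = sidecar_files_while_open()
print(mode, files)
```

On a local filesystem the listing should include the database plus its `-shm` and `-wal` companions, which is why a pattern ending in `.db*` is convenient for ignoring all three at once.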

I've been thinking more about what the table I posted above is telling us, and have a few thoughts to discuss / figure out together, so I'll type those up next...

dhimmel · 2017-12-04T19:53:26Z

Both XZ files are now tracked with LFS. See "Git LFS file not shown" under Files Changed.

It's wrong (although possible) to track a file that's ignored. How about:

data/library_coverage_xml_and_fulltext_indicators.db*
!data/library_coverage_xml_and_fulltext_indicators.db.xz

I think that should track the XZ file and ignore the others (see https://git-scm.com/docs/gitignore)

jglev · 2017-12-04T20:37:11Z

I spoke with my supervisor this afternoon about the table I posted above, and we came out of our conversation with several questions about the data, and what to draw from them. From our conversation, there are two big points that I think are important to note:

The table shows how much the Library's catalog says users have access to, which is not necessarily the same thing as what users do have access to.

As an example: Our results indicate that the Library's system would tell users that they have access to 82.21% of the "bronze" DOIs -- but by definition, all bronze DOIs should be available to users, since they're openly accessible through the publisher's website. (A similar point applies to gold and green DOIs.)

We can take an example DOI from that remaining bronze 17.79%, 10.1002/2013JD021255. If we go directly in a web browser to doi.org/10.1002/2013JD021255, we get the publisher's webpage for the article, which does have full-text (at least from my system as I write this, on Penn's campus). If a user goes to the Library's search tool, though (click here, then click on "Penn Text Article Finder" at the bottom of the page), and enters 10.1002/2013JD021255 in the DOI field, she'll get this page, which does not reflect that full-text access.

The process of resolving a DOI, comparing it to a list of journal subscriptions, and then figuring out whether full text is available is complicated, and could break down at any of several steps, including:

Something wrong with the metadata the publisher supplies about the article.
The metadata from the publisher was correct, but isn't now (e.g., with bronze DOIs, the DOI may have been free in the past, but the publisher has since locked it down).
Something wrong with the services used by the intermediary the Library uses to resolve DOIs.
Something wrong with Penn Text Search itself.

This is all to say that it's not yet apparent where that 17.79% disconnect comes from. It could also be the case that some of the DOIs themselves don't resolve (as an additional issue alongside those enumerated above).

Similarly, the "green" row of the table shows the percentage of DOIs that the Library has access to through the publisher website.

This is a smaller element to note; it's slightly different from what the State of OA authors (page 6) defined "green" as: "Toll-access on the publisher page, but there is a free copy in an OA repository."

So, what to take from this:

I think there are two main points to keep in mind as we incorporate this into the manuscript:

Rhetorically, the emphasis here should be more on the experience of the user than on the Library's access itself. In cases like the DOI above, the user does have legal access to the article, but is told by the Library's system that there isn't access. That could have implications for users seeking that DOI from alternative sources, including Sci-Hub.
With the system that we queried (which is what the Penn Text Search frontend uses -- hence point 1 above), as with any system, there is going to be some rate of false negatives (and maybe even false positives).
One way that I could look into this point is by taking a couple of hours, taking a small sample of DOIs from each category (e.g., a few dozen), manually resolving the DOI, and recording whether a user has access. Then, I could use the rate estimator that I wrote for PR Adding full text true rate estimator #8 to get an interval around the rate of false negatives. I'd be willing to do this -- it seems useful for clarifying what these data actually tell us.
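As a rough illustration of what an interval around a false-negative rate from a small manual sample could look like, here is a Python sketch using the Wilson score interval. This is not the estimator from PR #8, whose method isn't shown in this thread; the Wilson interval is swapped in as a standard choice for binomial proportions, and the sample numbers in the example are invented.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical sample: 30 "no access" calls checked by hand, 6 found to be
# false negatives (the user could in fact reach the full text).
low, high = wilson_interval(6, 30)
print(f"false-negative rate between {low:.3f} and {high:.3f}")
```

With only a few dozen DOIs per category the interval stays wide, so the sample size would need to be weighed against the hours of manual checking.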

In any case, these seem like things to note explicitly in the write-up wherever these data get incorporated.

Does this all make sense as I'm writing it here? @dhimmel, are there thoughts that you have around this?

jglev · 2017-12-04T20:42:37Z

Oh, I see a place where we may have been talking past each other:
./library_coverage_xml_and_fulltext_indicators.db* (i.e., in the top-level directory of this repo, from git's perspective) is where the untracked database gets saved by our downloader script.

That database, which is untracked, then gets copied in compressed format into data/library_coverage_xml_and_fulltext_indicators.db.xz, which is tracked.

Thus, the .gitignore line ./library_coverage_xml_and_fulltext_indicators.db* shouldn't be affecting the data/ copy of the database. But if it is, I should then add a new line, !data/library_coverage_xml_and_fulltext_indicators.db.xz, as you suggested, correct?

dhimmel · 2017-12-04T20:44:50Z

@Publicus I agree with your commentary above. Can you repost it in a new issue, since this PR isn't the ideal place for that discussion? Coincidentally, I just opened #15 about manually investigating certain calls.

dhimmel · 2017-12-04T20:46:23Z

I see a place where we may have been talking past each other

Got it. I didn't realize the database was in the top-level directory. Wouldn't it make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Otherwise, this all looks good.

dhimmel · 2017-12-04T20:48:00Z

Got it. I didn't realize the database was in the top-level directory. Wouldn't it make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Actually I think we should do this in a separate PR that will be quick after merging this one. Will merge.

dhimmel

Note to consider a follow up PR to move the database to the data directory.

Fixed bug whereby downloader would fail if the config file did not specify row numbers to download.

a2f141c

Jacob Levernier added 2 commits December 4, 2017 09:49

Added dataset to be tracked with Git, and the database copy to be tracked with git-lfs.

7f2f2be

Added database copy file, which I meant to track with git-lfs in the previous commit, but which wasn't yet added to the repo.

f792031

dhimmel reviewed Dec 4, 2017

Jacob Levernier added 5 commits December 4, 2017 11:28

Added beginning of Rmd file for evaluating library access.

cd6a0a2

Finished basic Rmarkdown report.

b420bcb

Reverted .gitattributes so that git lfs will track all xz-compressed files.

44aba2d

Changed knitr::kable format to markdown.

8456b02

Added additional column to markdown output, showing the total number of DOIs for that row.

5f04fbc

Attempting to re-add data/library_coverage_xml_and_fulltext_indicators.tsv.xz with 'git rm --cached data/library_coverage_xml_and_fulltext_indicators.tsv.xz' and then 'git add data/library_coverage_xml_and_fulltext_indicators.tsv.xz'

f2d98d9

dhimmel approved these changes Dec 4, 2017

dhimmel merged commit b7fe08c into greenelab:master Dec 4, 2017

This was referenced Dec 6, 2017

Move the database to the data directory #16

Closed

Accuracy analysis of full_text_indicator calls #15

Closed
