Add dataset created from processing State of OA DOIs #13

@jglev jglev commented Dec 1, 2017

This is an in-progress PR, which will eventually contain the dataset I'm downloading and processing from the State of OA DOIs list.

Author

jglev commented Dec 1, 2017

As of this writing, I've queried just over 106,000 DOIs. The full dataset has ~290,000, as I remember. I currently plan to leave the downloader running over the weekend.

Jacob Levernier added 2 commits December 4, 2017 09:49
.gitattributes Outdated
@@ -1 +1 @@
-*.xz filter=lfs diff=lfs merge=lfs -text
+data/library_coverage_xml_and_fulltext_indicators.db.xz filter=lfs diff=lfs merge=lfs -text
Contributor

Why did you change this to not track data/library_coverage_xml_and_fulltext_indicators.tsv.xz using Git LFS? I think tracking all .xz files with LFS makes the most sense?

Author

A combination of oversight and confusion about what we agreed to track with lfs : P Reverted in 44aba2d.

Author

jglev commented Dec 4, 2017

I've committed the dataset (containing all 290,120 rows), as well as an RMarkdown file for generating a markdown-formatted table that can presumably be worked into your code for the figure you mentioned. How does this look to you?

Here's the current table output from that RMarkdown file:

| oa_doi_color | no_access_percent | yes_access_percent | yes_access_rate | oa_color_total |
|---|---|---|---|---|
| bronze | 17.79 | 82.21 | 36077 | 43886 |
| closed | 17.43 | 82.57 | 150934 | 182804 |
| gold | 3.37 | 96.63 | 22018 | 22786 |
| green | 9.78 | 90.22 | 23441 | 25981 |
| hybrid | 15.45 | 84.55 | 12397 | 14663 |

Author

jglev commented Dec 4, 2017

To make sure the table's column names make sense: for the OA "Bronze" level:

  • There were 43,886 DOIs in that category.
  • Of those 43,886, UPenn has access to 36,077.
  • That means that UPenn has access to 36,077 / 43,886 = 82.21% of the DOIs queried in that category.
  • It also means that UPenn does not have access to 100 - 82.21 = 17.79% of the DOIs queried in that category.
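
The arithmetic in the bullets above can be reproduced for every row of the table. What follows is a minimal Python sketch, not the RMarkdown code from this PR: the counts are copied from the table, and the names `counts` and `access_percentages` are made up for illustration.

```python
# Counts copied from the table above: (DOIs with access, total DOIs) per OA color.
counts = {
    "bronze": (36077, 43886),
    "closed": (150934, 182804),
    "gold": (22018, 22786),
    "green": (23441, 25981),
    "hybrid": (12397, 14663),
}


def access_percentages(with_access, total):
    """Return (yes_access_percent, no_access_percent), rounded to two decimals."""
    yes_percent = round(100 * with_access / total, 2)
    no_percent = round(100 - yes_percent, 2)
    return yes_percent, no_percent


for color, (with_access, total) in counts.items():
    yes_percent, no_percent = access_percentages(with_access, total)
    print(color, no_percent, yes_percent, with_access, total)

# The per-color totals should add up to the number of rows in the dataset.
print(sum(total for _, total in counts.values()))
```

Note that the column named yes_access_rate holds a count of DOIs rather than a rate, which is why it is passed here as `with_access`.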

Author

jglev commented Dec 4, 2017

Ha, and just to double-check, I just now ran the RMarkdown file, and then ran length(which(original_dataset_with_oa_color_column$oadoi_color == "bronze")), to check from the original dataset itself that the number of Bronze DOIs is 43,886. To confirm, it is. : )

Contributor

dhimmel commented Dec 4, 2017

Great! Excited for these access calls and incorporating them into the Sci-Hub Manuscript.

data/library_coverage_xml_and_fulltext_indicators.tsv.xz still isn't tracked with LFS. Perhaps stop tracking it and then re-add it.

I did confirm that it contained 290121 lines:

curl --location --silent \
  https://github.com/publicus/library-access/raw/5f04fbcacbef4cefcba41c79b23d58294afc6b72/data/library_coverage_xml_and_fulltext_indicators.tsv.xz \
   | xzcat | wc --lines

So that's good.
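
For a local copy of the file, the same check could be sketched in Python with the standard-library `lzma` module. The function name `count_lines_xz` is illustrative, and reading 290121 as 290,120 data rows plus one header line is an assumption rather than something stated in this thread.

```python
import lzma


def count_lines_xz(path):
    """Count the lines in an xz-compressed text file without unpacking it to disk."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```

Calling `count_lines_xz("data/library_coverage_xml_and_fulltext_indicators.tsv.xz")` from the repository root should then give the same number as the `wc --lines` pipeline above.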

There's still this problematic line in .gitignore:

./library_coverage_xml_and_fulltext_indicators.db*

Can you change it to

data/library_coverage_xml_and_fulltext_indicators.db

…s.tsv.xz with 'git rm --cached data/library_coverage_xml_and_fulltext_indicators.tsv.xz' and then 'git add data/library_coverage_xml_and_fulltext_indicators.tsv.xz'
Author

jglev commented Dec 4, 2017

I've made a new commit to remove and re-add data/library_coverage_xml_and_fulltext_indicators.tsv.xz. Has the tracking of that file been solved in f2d98d9? I'm having trouble telling.

Re: the line in .gitignore, removing the wildcard will cause git to prompt users to add library_coverage_xml_and_fulltext_indicators.db-shm and library_coverage_xml_and_fulltext_indicators.db-wal, which are created whenever the database is opened (because write-ahead logging is turned on). That seems undesirable to me -- does it seem desirable to you, though?

I've been thinking more about what the table I posted above is telling us, and have a few thoughts to discuss / figure out together, so I'll type those up next...

Contributor

dhimmel commented Dec 4, 2017

Both XZ files are now tracked with LFS. See "Git LFS file not shown" under Files Changed.

It's wrong (although possible) to track a file that's ignored. How about:

data/library_coverage_xml_and_fulltext_indicators.db*
!data/library_coverage_xml_and_fulltext_indicators.db.xz

I think that should track the XZ file and ignore the others (see https://git-scm.com/docs/gitignore)
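
One way to confirm what a given set of rules does, rather than reasoning about it, is to ask git directly with `git check-ignore`. The sketch below is a hypothetical helper (the name `ignored_paths` is not from this repository) that writes the rules into a throwaway repository and reports which candidate paths git would ignore. It assumes `git` is on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def ignored_paths(gitignore_text, candidate_paths):
    """Return the candidate paths that git would ignore under the given rules."""
    if shutil.which("git") is None:
        raise RuntimeError("git executable not found")
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        subprocess.run(["git", "init", "--quiet", str(repo)],
                       check=True, capture_output=True)
        (repo / ".gitignore").write_text(gitignore_text)
        ignored = []
        for path in candidate_paths:
            # Exit status: 0 = ignored, 1 = not ignored, anything else = error.
            result = subprocess.run(
                ["git", "-C", str(repo), "check-ignore", "--quiet", path],
                capture_output=True,
            )
            if result.returncode == 0:
                ignored.append(path)
            elif result.returncode != 1:
                raise RuntimeError(result.stderr.decode())
        return ignored


# The two rules proposed above, and the files discussed in this thread.
RULES = (
    "data/library_coverage_xml_and_fulltext_indicators.db*\n"
    "!data/library_coverage_xml_and_fulltext_indicators.db.xz\n"
)
CANDIDATES = [
    "data/library_coverage_xml_and_fulltext_indicators.db",
    "data/library_coverage_xml_and_fulltext_indicators.db-shm",
    "data/library_coverage_xml_and_fulltext_indicators.db-wal",
    "data/library_coverage_xml_and_fulltext_indicators.db.xz",
]
```

Calling `ignored_paths(RULES, CANDIDATES)` lists the paths git would skip. The same helper can be pointed at the existing `./`-prefixed line to see how git actually treats it instead of assuming.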

Author

jglev commented Dec 4, 2017

I spoke with my supervisor this afternoon about the table I posted above, and we came out of our conversation with several questions about the data, and what to draw from them. From our conversation, there are two big points that I think are important to note:

The table shows how much the Library's catalog says users have access to, which is not necessarily the same thing as what users do have access to.

As an example: Our results indicate that the Library's system would tell users that they have access to 82.21% of the "bronze" DOIs -- but by definition, all bronze DOIs should be available to users, since they're openly accessible through the publisher's website. (A similar point applies to gold and green DOIs.)

We can take an example DOI from that remaining bronze 17.79%, 10.1002/2013JD021255. If we go directly in a web browser to doi.org/10.1002/2013JD021255, we get the publisher's webpage for the article, which does have full-text (at least from my system as I write this, on Penn's campus). If a user goes to the Library's search tool, though (click here, then click on "Penn Text Article Finder" at the bottom of the page), and enters 10.1002/2013JD021255 in the DOI field, she'll get this page, which does not reflect that full-text access.

The process of resolving a DOI, comparing it to a list of journal subscriptions, and then figuring out whether full text is available is complicated, and could break down at any of several steps, including:

  1. Something wrong with the metadata the publisher supplies about the article
  2. The metadata from the publisher was correct, but isn't now (e.g., with bronze DOIs, the DOI may have been free in the past, but the publisher has since locked it down).
  3. Something wrong with the services used by the intermediary the Library uses to resolve DOIs.
  4. Something wrong with Penn Text Search itself.

This is all to say that it's not yet apparent where that 17.79% disconnect comes from. It could also be the case that some of the DOIs themselves don't resolve (as an additional issue alongside those enumerated above).

Similarly, the "green" row of the table shows the percentage of DOIs that the Library has access to through the publisher website.

This is a smaller element to note; it's slightly different from how the State of OA authors (page 6) defined "green": "Toll-access on the publisher page, but there is a free copy in an OA repository."

So, what to take from this:

I think there are two main points to keep in mind as we incorporate this into the manuscript:

  1. Rhetorically, the emphasis here should be more on the experience of the user than on the Library's access itself -- in cases like the DOI above, the user does have legal access to the article, but the Library's system tells her that there isn't access. And that could have implications for users seeking that DOI from alternative sources, including Sci-Hub.
  2. With the system that we queried (which is what the PennText Search frontend uses -- hence point 1 above), and with any system, there is going to be some rate of false negatives (and maybe even false positives).
    One way that I could look into this point is by taking a couple of hours, drawing a small sample of DOIs from each category (e.g., a few dozen), manually resolving each DOI, and recording whether a user has access. Then, I could use the rate estimator that I wrote for PR #8 (Adding full text true rate estimator) to get an interval around the rate of false negatives. I'd be willing to do this -- it seems useful for clarifying what these data actually tell us.
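
The manual check proposed above could be sketched as follows. The estimator from PR #8 is not shown in this thread, so this sketch swaps in a standard Wilson score interval as a stand-in. The function names, the sample size, and the seed are all hypothetical.

```python
import math
import random


def sample_per_category(dois_by_color, n, seed=0):
    """Draw up to n DOIs at random from each OA color category."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        color: rng.sample(list(dois), min(n, len(dois)))
        for color, dois in dois_by_color.items()
    }


def wilson_interval(hits, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Here `hits` would be the number of manually checked DOIs that the Library's system reported as inaccessible but that turned out to be reachable, and `trials` the number of DOIs checked in that category. With only a few dozen DOIs per category the interval will be wide, which is itself worth reporting.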

In any case, these seem like things to note explicitly in the write-up wherever these data get incorporated.

Does this all make sense as I'm writing it here? @dhimmel, are there thoughts that you have around this?

Author

jglev commented Dec 4, 2017

Oh, I see a place where we may have been talking past each other:
./library_coverage_xml_and_fulltext_indicators.db* (i.e., the top-level directory of this repo, from git's perspective) is where the untracked database gets saved by our downloader script.

That database, which is untracked, then gets copied in compressed format into data/library_coverage_xml_and_fulltext_indicators.db.xz, which is tracked.

Thus, the .gitignore line ./library_coverage_xml_and_fulltext_indicators.db* shouldn't be affecting the data/ copy of the database. But if it is, I should then add a new line, !data/library_coverage_xml_and_fulltext_indicators.db.xz, as you suggested, correct?

Contributor

dhimmel commented Dec 4, 2017

@Publicus I agree with your commentary above. Can you repost it in a new issue, since this PR isn't the ideal place for that discussion? Coincidentally, I just opened #15 about manually investigating certain calls.

Contributor

dhimmel commented Dec 4, 2017

I see a place where we may have been talking past each other

Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Otherwise, this all looks good.

Contributor

dhimmel commented Dec 4, 2017

Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Actually I think we should do this in a separate PR that will be quick after merging this one. Will merge.

Contributor

@dhimmel dhimmel left a comment

Note to consider a follow up PR to move the database to the data directory.

@dhimmel dhimmel merged commit b7fe08c into greenelab:master Dec 4, 2017