Add dataset created from processing State of OA DOIs #13
…ecify row numbers to download.
As of this writing, I've queried just over 106,000 DOIs. The full dataset has ~290,000, as I remember. I currently plan to leave the downloader running over the weekend.
…cked with git-lfs.
…previous commit, but which wasn't yet added to the repo.
.gitattributes
@@ -1 +1 @@
-*.xz filter=lfs diff=lfs merge=lfs -text
+data/library_coverage_xml_and_fulltext_indicators.db.xz filter=lfs diff=lfs merge=lfs -text
Why did you change this to not track data/library_coverage_xml_and_fulltext_indicators.tsv.xz using Git LFS? I think tracking all .xz files with LFS makes the most sense?
A combination of oversight and confusion about what we agreed to track with lfs :P Reverted in 44aba2d.
…of DOIs for that row.
I've committed the dataset (containing all 290,120 rows), as well as an RMarkdown file for generating a markdown-formatted table that can presumably be worked into your code for the figure you mentioned. How does this look to you? Here's the current table output from that RMarkdown file:
To make sure the table's column names make sense: for the OA "Bronze" level:
Ha, and just to double-check, I just now ran the RMarkdown file, and then ran
Great! Excited for these access calls and incorporating them into the Sci-Hub Manuscript.
I did confirm that it contained 290121 lines:
curl --location --silent \
  https://github.com/publicus/library-access/raw/5f04fbcacbef4cefcba41c79b23d58294afc6b72/data/library_coverage_xml_and_fulltext_indicators.tsv.xz \
  | xzcat | wc --lines
So that's good. There's still this problematic line in
Can you change it to
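The line-count check above (curl piped through xzcat and wc --lines) can also be run against a local copy of the file. The following is only a minimal sketch using Python's standard `lzma` module; the function name `count_xz_lines` is made up for illustration and is not part of this repository. Like `wc --lines`, it counts newline characters in the decompressed stream.

```python
import lzma


def count_xz_lines(path):
    """Count newline characters in an xz-compressed file (same notion of a line as `wc --lines`)."""
    count = 0
    with lzma.open(path, "rb") as handle:
        # Read in 1 MiB chunks so the whole decompressed file never sits in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count
```

Assuming the repository is checked out and the LFS object has been fetched, `count_xz_lines("data/library_coverage_xml_and_fulltext_indicators.tsv.xz")` should give the same kind of count as the shell pipeline.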
…s.tsv.xz with 'git rm --cached data/library_coverage_xml_and_fulltext_indicators.tsv.xz' and then 'git add data/library_coverage_xml_and_fulltext_indicators.tsv.xz'
I've made a new commit to remove and re-add
Re: the line in
I've been thinking more about what the table I posted above is telling us, and have a few thoughts to discuss / figure out together, so I'll type those up next...
Both XZ files are now tracked with LFS. See "Git LFS file not shown" under Files Changed. It's wrong (although possible) to track a file that's ignored. How about:
I think that should track the XZ file and ignore the others (see https://git-scm.com/docs/gitignore).
I spoke with my supervisor this afternoon about the table I posted above, and we came out of our conversation with several questions about the data and what to draw from them. From our conversation, there are two big points that I think are important to note:
The table shows how much the Library's catalog says users have access to, which is not necessarily the same thing as what users do have access to.
As an example: Our results indicate that the Library's system would tell users that they have access to 82.21% of the "bronze" DOIs -- but by definition, all bronze DOIs should be available to users, since they're openly accessible through the publisher's website. (A similar point applies to gold and green DOIs.) We can take an example DOI from that remaining bronze 17.79%,
The process of resolving a DOI, comparing it to a list of journal subscriptions, and then figuring out whether full text is available is complicated, and could break down at any of several steps, including:
This is all to say that it's not yet apparent where that 17.79% disconnect comes from. It could also be the case that some of the DOIs themselves don't resolve (as an additional issue alongside those enumerated above).
Similarly, the "green" row of the table shows the percentage of DOIs that the Library has access to through the publisher website.
This is a smaller element to note; it's slightly different from what the State of OA authors (page 6) defined "green" as: "Toll-access on the publisher page, but there is a free copy in an OA repository."
So, what to take from this:
I think there are two main points to keep in mind as we incorporate this into the manuscript:
In any case, these seem like things to note explicitly in the write-up wherever these data get incorporated. Does this all make sense as I'm writing it here? @dhimmel, are there thoughts that you have around this?
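To make the arithmetic behind the percentages concrete (for example, 82.21% catalog-reported access for bronze leaves 100 - 82.21 = 17.79% unexplained), here is a rough sketch of how a per-category access rate could be tallied from a TSV. The column names `oa_color` and `full_text_indicator`, the function name, and the toy DOIs are all hypothetical and chosen only for illustration; the real dataset's columns may differ.

```python
import csv
import io
from collections import defaultdict


def access_rate_by_color(tsv_text, color_col="oa_color", access_col="full_text_indicator"):
    """Return {color: percent of rows whose access indicator is "1"} (column names are assumptions)."""
    totals = defaultdict(int)
    with_access = defaultdict(int)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        color = row[color_col]
        totals[color] += 1
        if row[access_col] == "1":
            with_access[color] += 1
    return {color: round(100 * with_access[color] / totals[color], 2) for color in totals}


# Toy rows, invented for illustration only (not taken from the dataset).
toy = (
    "doi\toa_color\tfull_text_indicator\n"
    "10.0/a\tbronze\t1\n"
    "10.0/b\tbronze\t0\n"
    "10.0/c\tgreen\t1\n"
)
rates = access_rate_by_color(toy)
```

A rate computed this way reflects only what the catalog reports, so for an open category such as bronze the remainder (100 minus the rate) measures the catalog's disconnect rather than actual lack of access.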
Oh, I see a place where we may have been talking past each other: That database, which is untracked, then gets copied in compressed format into
Thus, the
Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the
Otherwise, this all looks good.
Actually I think we should do this in a separate PR that will be quick after merging this one. Will merge.
Note to consider a follow-up PR to move the database to the data directory.
This is an in-progress PR, which will eventually contain the dataset I'm downloading and processing from the State of OA DOIs list.