Analyze PennText accuracy on 200 DOIs, Select 500 for additional calls #19

Merged: 3 commits merged into greenelab:master from the penntext-accuracy branch on Jan 12, 2018

@dhimmel dhimmel commented Jan 10, 2018

Refs #15.

Analyzes accuracy on 200 DOIs (100 where PennText was true, 100 where PennText was false).

Selects 500 DOIs for an expanded manual assessment, stratified on PennText to match the proportion in the entire DOI set. Reuses as many DOIs with existing calls as possible.
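The selection described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function name `select_stratified`, its parameters, and the dict/set inputs are all assumptions made for the example, and it uses only the standard library.

```python
import random


def select_stratified(dois, already_called, total=500, seed=0):
    """Pick `total` DOIs, stratified on the PennText flag.

    Hypothetical helper, not taken from the PR:
    dois           -- dict mapping DOI -> PennText boolean
    already_called -- set of DOIs that already have a manual call
    """
    rng = random.Random(seed)
    n_true_all = sum(dois.values())
    # Number of PennText-true DOIs needed to match the overall proportion.
    n_true = round(total * n_true_all / len(dois))
    targets = {True: n_true, False: total - n_true}

    selected = []
    for stratum, target in targets.items():
        members = sorted(doi for doi, flag in dois.items() if flag == stratum)
        reused = [doi for doi in members if doi in already_called]
        fresh = [doi for doi in members if doi not in already_called]
        rng.shuffle(reused)
        rng.shuffle(fresh)
        # Prefer DOIs that already have calls, then fill with new ones.
        selected.extend((reused + fresh)[:target])
    return selected
```

Placing previously called DOIs first in each stratum is one simple way to reuse as many existing calls as the stratum size allows; the notebook may do this differently.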

Todo:

  • Switch notebook to project environment

dhimmel commented Jan 10, 2018

Note that the overall inaccuracy of PennText calls is only 12.4%. This is because accuracy when PennText was true is 94%, and most DOIs have PennText == true.
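The reasoning here is a weighted average over the two strata. A small sketch of that arithmetic, where the function name is an assumption and the proportion and PennText-false accuracy in the example call are hypothetical placeholders (only the 94% figure comes from the comment above):

```python
def overall_accuracy(prop_true, acc_true, acc_false):
    """Weight each stratum's accuracy by its share of the full DOI set."""
    return prop_true * acc_true + (1 - prop_true) * acc_false


# acc_true=0.94 is from the comment; prop_true and acc_false are
# made-up illustrative values, not results from this analysis.
example = overall_accuracy(prop_true=0.9, acc_true=0.94, acc_false=0.5)
overall_inaccuracy = 1 - example
```

The larger the share of PennText-true DOIs, the closer the overall figure sits to the PennText-true accuracy, regardless of how poor the PennText-false stratum is.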

dhimmel commented Jan 10, 2018

The idea here is that curation would continue in manual-doi-checks-500.tsv. Currently, this file doesn't have the date queried columns and uses different column names than before. Let me know if that's a problem. You could always edit the column names or add new ones manually if you wanted.

@dhimmel dhimmel requested a review from jglev January 10, 2018 20:49

dhimmel commented Jan 12, 2018

Pinging @Publicus

jglev commented Jan 12, 2018

I've reviewed your sample, and it looks good to me. I've updated the facilitation script to use manual-doi-checks-500.tsv, as well, and have gotten started. The facilitation script does add the date columns back automatically; I do prefer keeping them if the data are going to be public, since they can help if there's a question later about journal subscription timelines. My understanding from your comment above is that you're fine with those date columns being retained; is that correct?

dhimmel commented Jan 12, 2018

> My understanding from your comment above is that you're fine with those date columns being retained; is that correct?

Yep.

> I've updated the facilitation script to use manual-doi-checks-500.tsv

So this PR is ready to merge? If everything looks good to you, "approve" it under "files changed" > "review changes".

jglev commented Jan 12, 2018

A quick logistics question: Is your idea that the edits to the facilitation script, and the results from the 500, be in their own PR? If so, yes, this is ready, and I'll mark it as approved.

dhimmel commented Jan 12, 2018

> A quick logistics question: Is your idea that the edits to the facilitation script, and the results from the 500, be in their own PR? If so, yes, this is ready, and I'll mark it as approved.

Yes

@dhimmel dhimmel merged commit dcf151b into greenelab:master Jan 12, 2018
@dhimmel dhimmel deleted the penntext-accuracy branch January 12, 2018 21:52