New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate the syntax of DOI values #4643

Merged
merged 2 commits into from Sep 11, 2017

Conversation

Projects
None yet
3 participants
@robertknight
Contributor

robertknight commented Sep 7, 2017

Depends on #4644

For context, this relates to the work done to support annotating EPUBs recently. See #4611


Validate the syntax of DOI values before creating document equivalence claim entries in the DB which are labeled as being of type dc-doi or highwire-doi.

Not all Dublin Core identifier field values (from <meta name="dc.identifier"> tags in HTML documents) are DOIs.

Additionally, the identifier value is only intended to be "unambiguous within a given context" [1] which I take to imply that it is not necessarily globally unique and indeed, looking at actual values we have captured so far from these tags in the prod h DB, this has been the case for a small number of entries. See comments below for details.

In the process of making this change, I discovered that one of the example DOIs we were using in many test cases was wrong, perhaps due to a typo that got copy-pasta'd.

[1] http://dublincore.org/documents/dcmi-terms/#terms-identifier

Fixes #4611

@codecov

This comment has been minimized.

Show comment
Hide comment
@codecov

codecov bot Sep 7, 2017

Codecov Report

Merging #4643 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4643      +/-   ##
==========================================
+ Coverage   95.19%   95.19%   +<.01%     
==========================================
  Files         373      373              
  Lines       20438    20442       +4     
  Branches     1171     1171              
==========================================
+ Hits        19455    19459       +4     
  Misses        879      879              
  Partials      104      104
Impacted Files Coverage Δ
h/util/document_claims.py 100% <100%> (ø) ⬆️
tests/h/util/document_claims_test.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e491760...4ff6601. Read the comment docs.

codecov bot commented Sep 7, 2017

Codecov Report

Merging #4643 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4643      +/-   ##
==========================================
+ Coverage   95.19%   95.19%   +<.01%     
==========================================
  Files         373      373              
  Lines       20438    20442       +4     
  Branches     1171     1171              
==========================================
+ Hits        19455    19459       +4     
  Misses        879      879              
  Partials      104      104
Impacted Files Coverage Δ
h/util/document_claims.py 100% <100%> (ø) ⬆️
tests/h/util/document_claims_test.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e491760...4ff6601. Read the comment docs.

@robertknight

This comment has been minimized.

Show comment
Hide comment
@robertknight

robertknight Sep 7, 2017

Contributor

Here is a dump of all the DOI values (not unique) that have been captured from annotated documents on the prod instance of h in the document_uri table as dc-doi or highwire-doi URLs: https://gist.github.com/robertknight/9232b5b3cc6c6ba29c62e64cc80e1513 . As expected, there are a lot of non-DOI values in there.

However, there were 1122 values which are URLs that point to the DOI resolver, eg. https://doi.org/10.1016/j.stem.2017.05.001 . We probably want to extract the DOI from those and normalize them to doi:<doi>. This should enable a page with <meta name="dc.identifier" content="https://doi.org/10.1016/j.stem.2017.05.001"> to show the same annotations as a page with <meta name="dc.identifier" content="10.1016/j.stem.2017.05.001">.

This would imply a bunch of extra work however, such as a migration of existing data. For the time being I'd propose that we simply continue to capture such values.

Contributor

robertknight commented Sep 7, 2017

Here is a dump of all the DOI values (not unique) that have been captured from annotated documents on the prod instance of h in the document_uri table as dc-doi or highwire-doi URLs: https://gist.github.com/robertknight/9232b5b3cc6c6ba29c62e64cc80e1513 . As expected, there are a lot of non-DOI values in there.

However, there were 1122 values which are URLs that point to the DOI resolver, eg. https://doi.org/10.1016/j.stem.2017.05.001 . We probably want to extract the DOI from those and normalize them to doi:<doi>. This should enable a page with <meta name="dc.identifier" content="https://doi.org/10.1016/j.stem.2017.05.001"> to show the same annotations as a page with <meta name="dc.identifier" content="10.1016/j.stem.2017.05.001">.

This would imply a bunch of extra work however, such as a migration of existing data. For the time being I'd propose that we simply continue to capture such values.

@robertknight robertknight added the WIP label Sep 7, 2017

@robertknight

This comment has been minimized.

Show comment
Hide comment
@robertknight

robertknight Sep 7, 2017

Contributor

I'm going to rebase this on #4644

Contributor

robertknight commented Sep 7, 2017

I'm going to rebase this on #4644

@robertknight robertknight removed the WIP label Sep 7, 2017

@dwhly

This comment has been minimized.

Show comment
Hide comment
@dwhly

dwhly Sep 8, 2017

Member

This would imply a bunch of extra work however, such as a migration of existing data. For the time being I'd propose that we simply continue to capture such values.

We could start normalizing them now, and clean up the data later though right?

Member

dwhly commented Sep 8, 2017

This would imply a bunch of extra work however, such as a migration of existing data. For the time being I'd propose that we simply continue to capture such values.

We could start normalizing them now, and clean up the data later though right?

@robertknight

This comment has been minimized.

Show comment
Hide comment
@robertknight

robertknight Sep 8, 2017

Contributor

We could start normalizing them now, and clean up the data later though right?

If you change the part of a system that captures data, you also have to consider what happens in the parts of the system that query data and what happens when querying data that was captured before & after the change. I haven't thought it through fully in this context, but just filtering what we already capture avoids those concerns in this PR.

Contributor

robertknight commented Sep 8, 2017

We could start normalizing them now, and clean up the data later though right?

If you change the part of a system that captures data, you also have to consider what happens in the parts of the system that query data and what happens when querying data that was captured before & after the change. I haven't thought it through fully in this context, but just filtering what we already capture avoids those concerns in this PR.

robertknight added some commits Sep 7, 2017

Validate the syntax of DOI values
Validate the syntax of DOI values before creating document equivalence
claim entries in the DB which are labeled as being of type `dc-doi` or
`highwire-doi`.

Not all Dublin Core identifier field values (from `<meta
name="dc.identifier">` tags in HTML documents) are DOIs.

Additionally, the identifier value is only intended to be "unambiguous
within a given context" [1] which I take to imply that it is not
necessarily globally unique. If the value is a DOI then it is however.

[1] http://dublincore.org/documents/dcmi-terms/#terms-identifier
Support DOI URLs
Examining existing DOI metadata stored in the h prod DB shows that
"dc.identifier" and "citation_doi" fields are often used to store DOIs
as URLs rather than just identifiers.

Ideally we would normalize the different representations to the same
form, but for now we just continue to accept them as-is.

@seanh seanh self-assigned this Sep 11, 2017

@seanh

This comment has been minimized.

Show comment
Hide comment
@seanh

seanh Sep 11, 2017

Contributor

Rebased

Contributor

seanh commented Sep 11, 2017

Rebased

@seanh

seanh approved these changes Sep 11, 2017

# See https://www.doi.org/overview/DOI_article_ELIS3.pdf.
#
# This also accepts DOIs represented as URLs (eg.
# "https://doi.org/10.1000/123456").

This comment has been minimized.

@seanh

seanh Sep 11, 2017

Contributor

Elsewhere in h we seem to use the #: comment style for docstrings of constants and things like this. (autodoc understands it)

@seanh

seanh Sep 11, 2017

Contributor

Elsewhere in h we seem to use the #: comment style for docstrings of constants and things like this. (autodoc understands it)

'', ' ', 'doi:', 'doi: ', ' doi:', ' doi: '.
If the given string, minus any `doi:` prefix does not match the syntax
for DOI names ("10.NNNN/.*") then `None` is returned.

This comment has been minimized.

@seanh

seanh Sep 11, 2017

Contributor

I think these should be:

``...``

not

`...`

because they're rSt not markdown. We don't actually extract and parse them as rSt anywhere currently though so no biggie.

@seanh

seanh Sep 11, 2017

Contributor

I think these should be:

``...``

not

`...`

because they're rSt not markdown. We don't actually extract and parse them as rSt anywhere currently though so no biggie.

@seanh seanh merged commit db7fab4 into master Sep 11, 2017

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
hound No violations found. Woof!

@seanh seanh deleted the doi-syntax-check branch Sep 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment