Validate the syntax of DOI values #4643
Depends on #4644
For context, this relates to the recent work to support annotating EPUBs. See #4611
Validate the syntax of DOI values before creating document equivalence claim entries in the DB which are labeled as being of type
Not all Dublin Core identifier field values (from
Additionally, the identifier value is only intended to be "unambiguous within a given context", which I take to imply that it is not necessarily globally unique. Indeed, looking at the actual values we have captured so far from these tags in the prod h DB, a small number of entries are not globally unique. See comments below for details.
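To illustrate the kind of syntax check described here, a minimal sketch follows. It is not the code from this PR: the name `is_valid_doi` and the pattern are assumptions, loosely based on the general DOI shape (a `10.` prefix, a numeric registrant code, a slash, and a non-empty suffix).

```python
import re

# Hypothetical pattern, not taken from the h codebase: "10.", a registrant
# code of at least four digits (optionally with dotted subdivisions), a slash,
# and a non-empty suffix containing no whitespace.
DOI_PATTERN = re.compile(r"10\.[0-9]{4,}(?:\.[0-9]+)*/\S+")


def is_valid_doi(value):
    """Return True if `value` looks like a bare DOI such as 10.1016/j.stem.2017.05.001."""
    if not isinstance(value, str):
        return False
    # fullmatch rejects values that merely contain a DOI, e.g. resolver URLs.
    return DOI_PATTERN.fullmatch(value.strip()) is not None
```

With a check like this, an identifier value that is not DOI-shaped (an ISBN URN, for example) would be skipped rather than stored as a DOI equivalence claim.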
In the process of making this change, I discovered that one of the example DOIs we were using in many test cases was wrong, perhaps due to a typo that got copy-pasta'd.
@@            Coverage Diff            @@
##           master    #4643     +/-   ##
==========================================
+ Coverage   95.19%   95.19%   +<.01%
==========================================
  Files         373      373
  Lines       20438    20442      +4
  Branches     1171     1171
==========================================
+ Hits        19455    19459      +4
  Misses        879      879
  Partials      104      104
Here is a dump of all the DOI values (not unique) that have been captured from annotated documents on the prod instance of h in the
However, there were 1122 values which are URLs that point to the DOI resolver, e.g. https://doi.org/10.1016/j.stem.2017.05.001. We probably want to extract the DOI from those and normalize them to
However, this would imply a bunch of extra work, such as a migration of existing data. For the time being I'd propose that we simply continue to capture such values.
If you change the part of a system that captures data, you also have to consider what happens in the parts of the system that query data and what happens when querying data that was captured before & after the change. I haven't thought it through fully in this context, but just filtering what we already capture avoids those concerns in this PR.
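For reference, extracting a DOI from a resolver URL like the one mentioned above could look roughly like the sketch below. This is not part of this PR; the function name `doi_from_resolver_url` and the set of resolver hosts are assumptions for illustration only.

```python
from urllib.parse import unquote, urlparse

# Assumed resolver hosts; a real implementation might need a different list.
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def doi_from_resolver_url(value):
    """Return the DOI embedded in a resolver URL, or None if there is none.

    Limitation of this sketch: DOIs containing "?" or "#" would be truncated,
    because urlparse treats those as query and fragment separators.
    """
    parsed = urlparse(value.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    # parsed.hostname is lowercased, or None when the URL has no host.
    if parsed.hostname not in RESOLVER_HOSTS:
        return None
    doi = unquote(parsed.path.lstrip("/"))
    # Only accept paths that at least look like "10.<registrant>/<suffix>".
    if doi.startswith("10.") and "/" in doi:
        return doi
    return None
```

Applying something like this at capture time would still leave the question of values captured before the change, which is the concern described above.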