search anomalies involving the new urn:x-dc uri formed from dc.relation.ispartof and dc.identifier #4611

Closed
judell opened this Issue Jul 24, 2017 · 4 comments

judell commented Jul 24, 2017

Consider two documents A and B:

A: http://jonudell.net/h/publisher_a_chap_1_sec_1.html

<html>
<title>Publisher A / Book 1 / Chapter 1 / Section 1 </title>
<head>
<meta name="dc.relation.ispartof" content="http://jonudell.net/h/publisher_a_chap_1_sec_1.html">
<meta name="dc.identifier" content="chapter1/section1">
</head>
<body>
<p>
This is content from publisher A.
</p>
</body>
</html>

B: http://jonudell.net/h/publisher_b_chap_1_sec_1.html

<html>
<title>Publisher B / Book 1 / Chapter 1 / Section 1 </title>
<head>
<meta name="dc.relation.ispartof" content="http://jonudell.net/h/publisher_a_chap_1_sec_1.html">
<meta name="dc.identifier" content="chapter1/section1">
</head>
<body>
<p>
This is content from publisher A.
</p>
<p>
This is content from publisher B.
</p>
</body>
</html>

I annotate both.

I expect results for both from this query but get none:

/api/search?uri=urn:x-dc:http%3A%2F%2Fjonudell.net%2Fh%2Fpublisher_a_chap_1_sec_1.html/chapter1%2Fsection1

I do /not/ expect results from this query:

/api/search?uri=doi:chapter1/section1

But it returns both.

Clues as to why:

image

I think the backend assumes dc:identifier is a doi?

Absence of type on the claimant seems suspicious.

@robertknight robertknight self-assigned this Jul 25, 2017

robertknight commented Sep 7, 2017

> I expect results for both from this query but get none:
>
> /api/search?uri=urn:x-dc:http%3A%2F%2Fjonudell.net%2Fh%2Fpublisher_a_chap_1_sec_1.html/chapter1%2Fsection1

This does not work because the third :-delimited part of the URN is already URI-encoded when the URN is constructed. When the resulting URN is then encoded as a query parameter, the part after x-dc must end up double-URI-encoded, e.g. /api/search?uri=urn%3Ax-dc%3Ahttp%253A%252F%252Fpublisher.org%252Fbook%2Fchapter1.

The client does this when constructing the fingerprint that is used to search for annotations.
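
To illustrate the double encoding, here is a minimal Python sketch. It is not the client's actual code (the client is a separate codebase), and the helper names `dc_urn` and `search_path` are made up for this example. It assumes the URN is formed by percent-encoding the dc.relation.ispartof and dc.identifier values separately and joining them with a literal "/", which is what the URNs quoted in this thread suggest.

```python
from urllib.parse import quote


def dc_urn(is_part_of: str, identifier: str) -> str:
    """Build a urn:x-dc URI (hypothetical helper).

    Each part is percent-encoded on its own, then joined by a literal '/'.
    """
    return "urn:x-dc:" + quote(is_part_of, safe="") + "/" + quote(identifier, safe="")


def search_path(uri: str) -> str:
    """Percent-encode the whole URN again for use as the `uri` query parameter."""
    return "/api/search?uri=" + quote(uri, safe="")


urn = dc_urn("http://publisher.org/book", "chapter1")
print(urn)
# urn:x-dc:http%3A%2F%2Fpublisher.org%2Fbook/chapter1
print(search_path(urn))
# /api/search?uri=urn%3Ax-dc%3Ahttp%253A%252F%252Fpublisher.org%252Fbook%2Fchapter1
```

The second encoding pass turns each `%` from the first pass into `%25`, while the literal "/" separator is encoded only once, as `%2F`.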

> I do /not/ expect results from this query:

This is a bug: our code assumes that dc.identifier meta tags are DOIs, which is not always the case, nor, AIUI, are they guaranteed to be globally unique. I think the best thing to do here is to validate that the dc.identifier value conforms to the DOI syntax before storing it as a DOI.
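
A rough sketch of what such a check could look like, in Python. The regular expression is an assumption (the prefix "10.", a 4 to 9 digit registrant code, a "/", then a non-empty suffix) that covers common modern DOIs; it is not a complete statement of DOI syntax, and `doi_uri` is a hypothetical name rather than a function in our codebase.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_uri(value: str):
    """Return a doi: URI if the value looks like a DOI, else None."""
    candidate = value.strip()
    # Tolerate an optional "doi:" prefix on the meta tag value.
    if candidate.lower().startswith("doi:"):
        candidate = candidate[4:]
    return "doi:" + candidate if DOI_RE.match(candidate) else None


print(doi_uri("10.1000/182"))        # doi:10.1000/182
print(doi_uri("chapter1/section1"))  # None
```

With a check like this, the `chapter1/section1` value from the example documents would no longer be stored as `doi:chapter1/section1`.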

judell commented Sep 8, 2017

> I think the best thing to do here is to validate that the dc.identifier value conforms to the DOI syntax before storing it as a DOI.

That seems good for now, thanks!

I presume it leaves open the possibility that we can later invent a new equivalence type (something like, say, chapter-section), to augment the set of types in document_uri?

The reason, as we've discussed, is so that book publishers can coalesce annotations across syndicated copies of works in the same way that scientific publishers now can using DOIs.

Existing document_uri types:

rel-bookmark
dc-doi
self-claim
rel-shortlink
highwire-pdf
rel-alternate
rel-canonical
highwire-doi

@judell judell closed this Sep 8, 2017

robertknight commented Sep 8, 2017

> I presume it leaves open the possibility that we can later invent a new equivalence type (something like, say, chapter-section), to augment the set of types in document_uri?

Yes. There are several cases to consider for the value of dc.identifier:

  1. The value is a globally unique identifier from some recognized system (e.g. ISBN) but is not a DOI. We can add support for these other systems incrementally, since there are a limited number of them in common use.
  2. The value is not a globally unique identifier on its own, in which case we cannot form a URI from it unless there is some other information that it can be combined with.
  3. The value's type is unrecognized, in which case we probably have to assume case (2).
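
The three cases above could be handled along these lines. This is only a sketch under assumptions: `identifier_uri` and `is_isbn13` are hypothetical names, the DOI pattern is a simplified approximation, and ISBN-13 stands in for "some recognized system" purely as an example. Anything that is not recognized falls through to `None`, i.e. it is treated as case (2).

```python
import re

# Simplified, assumed DOI pattern; not a complete definition of DOI syntax.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def is_isbn13(value: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def identifier_uri(value: str):
    """Map a dc.identifier value to a URI, or None if it is not recognized."""
    value = value.strip()
    if DOI_RE.match(value):
        return "doi:" + value
    if is_isbn13(value):
        # Case 1: a recognized, globally unique system other than DOI.
        return "urn:isbn:" + value.replace("-", "").replace(" ", "")
    # Cases 2 and 3: not globally unique on its own, or type unrecognized.
    return None


print(identifier_uri("978-3-16-148410-0"))  # urn:isbn:9783161484100
print(identifier_uri("chapter1/section1"))  # None
```

Support for further identifier systems could be added as extra branches before the final `return None`.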

judell commented Sep 8, 2017

I can check, but if you happen to know offhand: if we are not now saving a DOI as the result of parsing dc.relation.ispartof in conjunction with dc.identifier, are we saving that raw data for possible later use? If not, and we do later invent a new equivalence, I suppose we can reconstruct from sources.
