global: fix unicode issues #34
tests/test_idutils.py
Outdated
@@ -51,6 +51,12 @@
('doi.org/10.1016/j.epsl.2011.11.037', ['doi', 'handle'],
'10.1016/j.epsl.2011.11.037',
'https://doi.org/10.1016/j.epsl.2011.11.037'),
('10.1016/üникóδé-дôΐ', ['doi', 'handle'],
Interesting case - is this an actual DOI?
Based on the spec it could be, though a brief search (of Zenodo records and the web in general) didn't turn up any solid example of such DOIs.
(I discovered the error by accident: a user had left an em-dash in a DOI, and I then saw that the spec supports UTF-8 characters in DOIs.)
tests/test_idutils.py
Outdated
@@ -51,6 +51,12 @@
('doi.org/10.1016/j.epsl.2011.11.037', ['doi', 'handle'],
'10.1016/j.epsl.2011.11.037',
'https://doi.org/10.1016/j.epsl.2011.11.037'),
('10.1016/üникóδé-дôΐ', ['doi', 'handle'],
'10.1016/üникóδé-дôΐ',
'https://doi.org/10.1016/üникóδé-дôΐ'),
As far as I understand the DOI character encoding specification, we would need to URL-encode the non-ASCII characters.
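A minimal sketch of what URL-encoding the non-ASCII characters could look like, using only the standard library. The helper name `doi_to_url` and the `resolver` default are assumptions made for illustration, not the idutils API. Here the forward slash is left as is and only characters outside the unreserved ASCII set are percent-encoded as UTF-8 bytes.

```python
from urllib.parse import quote, unquote


def doi_to_url(doi, resolver="https://doi.org/"):
    """Hypothetical helper (not part of idutils): build a resolver URL.

    quote() encodes str input as UTF-8 and percent-encodes every byte
    that is not an unreserved ASCII character; safe="/" keeps the slash
    between the DOI prefix and suffix unencoded.
    """
    return resolver + quote(doi, safe="/")


# Plain ASCII DOIs pass through unchanged.
print(doi_to_url("10.1016/j.epsl.2011.11.037"))
# → https://doi.org/10.1016/j.epsl.2011.11.037

# "\u00fc" is LATIN SMALL LETTER U WITH DIAERESIS; its UTF-8 bytes are C3 BC.
print(doi_to_url("10.1016/\u00fc"))
# → https://doi.org/10.1016/%C3%BC

# The encoding is reversible, so the original DOI can be recovered.
print(unquote(doi_to_url("10.1016/\u00fc")))
```

Under this assumption the expected URL in the test case above would contain percent-escapes rather than the raw Unicode characters.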
tests/test_idutils.py
Outdated
@@ -32,25 +32,38 @@
('http://www.example.org/ark:/13030/tqb3kh97gh8w', ['ark', 'url'], '', ''),
('10.1016/j.epsl.2011.11.037', ['doi', 'handle'],
'10.1016/j.epsl.2011.11.037',
'https://doi.org/10.1016/j.epsl.2011.11.037'),
'https://doi.org/10.1016%2Fj.epsl.2011.11.037'),
Is this correct according to the DOI spec? I think all ASCII characters should probably stay as is.
According to this part of the spec (and the final sentence of this section), it's better for forward slashes to be URL-encoded.
This is funny - so 2.5.2.4 puts the slash in the recommended category for encoding. Just below, in 2.6.2, it says:
The DOI name "10.1006/jmbi.1998.2354" would be made an actionable link as "https://doi.org/10.1006/jmbi.1998.2354".
(i.e. no encoding) :S
Let's discuss IRL - not fully sure what the impact would be.
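To make the two candidate forms concrete, here is a small sketch that builds both from the DOI name quoted in the spec example. The function name `doi_url_variants` is hypothetical and only for illustration; the only difference between the two results is the `safe` argument passed to `urllib.parse.quote`. It does not settle which form is preferable.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"  # resolver prefix used in the test cases above


def doi_url_variants(doi):
    """Hypothetical helper: return (slash_kept, slash_encoded) URLs."""
    # safe="/" leaves the forward slash untouched (the 2.6.2 example form).
    slash_kept = RESOLVER + quote(doi, safe="/")
    # safe="" percent-encodes the slash as %2F (the recommended-encoding reading).
    slash_encoded = RESOLVER + quote(doi, safe="")
    return slash_kept, slash_encoded


kept, encoded = doi_url_variants("10.1006/jmbi.1998.2354")
print(kept)     # → https://doi.org/10.1006/jmbi.1998.2354
print(encoded)  # → https://doi.org/10.1006%2Fjmbi.1998.2354
```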
c9ebe91 to ed803c0
* Handles unicode characters in DOI.
* URL-encodes generated DOI URLs