Thank you @gdower, this is a good catch. I think it makes sense to add the non-breaking hyphen, as it is something we know appears 'in the wild', while I would postpone other hyphens until they are encountered for real, to save some CPU cycles.
Some publishers use non-breaking hyphens (U+2011) instead of the more commonly used hyphen-minus (U+002D) in author strings in typeset PDFs. People then copy and paste those strings into their databases, which breaks parsing. For example, compare these two outputs:
https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on
Perhaps atypical hyphens should be standardized to U+002D hyphens prior to parsing?
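To make the suggestion concrete, here is a minimal sketch of such a pre-parse normalization step in Python. It is only an illustration, not the parser's actual implementation: `normalize_hyphens` and `_HYPHEN_TABLE` are hypothetical names, and the sketch maps only U+2011, the one character reported here.

```python
# Hypothetical pre-parse normalization step; these names are illustrative
# and are not part of any existing parser API.
NON_BREAKING_HYPHEN = "\u2011"
HYPHEN_MINUS = "\u002d"

# A precomputed translation table lets str.translate do the replacement
# in a single pass over the string.
_HYPHEN_TABLE = str.maketrans({NON_BREAKING_HYPHEN: HYPHEN_MINUS})


def normalize_hyphens(name: str) -> str:
    """Replace non-breaking hyphens (U+2011) with hyphen-minus (U+002D)."""
    return name.translate(_HYPHEN_TABLE)


# The first name-string from the comparison link, with U+2011 in the authorship.
raw = "Passalus (Pertinax) gaboi Jim\u00e9nez\u2011Ferbans & Reyes\u2011Castillo, 2022"
clean = normalize_hyphens(raw)
print(clean)
```

Other characters could be added to the table later if they turn up in real data, at the cost of a larger table rather than an extra pass.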
Here's the PDF, although they don't use the non-breaking hyphens in the web version.
Here are some other atypical hyphens that might also occasionally be an issue, introduced by publishers or bad OCR:
https://www.fileformat.info/info/unicode/category/Pd/list.htm
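For reference, the same "Dash Punctuation" (Pd) category can be listed locally with Python's standard `unicodedata` module. This is a rough sketch: `dash_punctuation` is a hypothetical helper name, and the exact set depends on the Unicode version bundled with the interpreter. The category also contains en and em dashes, so treating every Pd character as a hyphen in author strings would be an assumption that needs checking against real data.

```python
import sys
import unicodedata


def dash_punctuation() -> list[str]:
    """Return every code point whose Unicode general category is Pd."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


# Print each candidate with its code point and official Unicode name.
for ch in dash_punctuation():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```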
If it hurts performance too much, it's probably okay to not bother with handling it. It's not a frequently encountered problem.