As a user I want atypical hyphens standardized and parsed #237

Closed
gdower opened this issue Nov 9, 2022 · 2 comments
Comments

Contributor

gdower commented Nov 9, 2022

Some publishers use non-breaking hyphens (U+2011) instead of the more common hyphen-minus (U+002D) in author strings in typeset PDFs. People then copy and paste them into their databases, which breaks parsing. For example, compare these two outputs:

https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on

Perhaps atypical hyphens should be standardized to U+002D hyphens prior to parsing?
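
A minimal sketch of that pre-parsing step, written in Python purely for illustration. It is not taken from the parser's code base, and `normalize_hyphens` is a hypothetical helper name. It only swaps U+2011 for U+002D and leaves everything else untouched:

```python
# Hypothetical pre-processing step; not the parser's actual implementation.
NON_BREAKING_HYPHEN = "\u2011"
HYPHEN_MINUS = "\u002d"


def normalize_hyphens(name: str) -> str:
    """Replace non-breaking hyphens (U+2011) with hyphen-minus (U+002D)."""
    return name.replace(NON_BREAKING_HYPHEN, HYPHEN_MINUS)


# The two name strings from the comparison link above.
pasted = "Passalus (Pertinax) gaboi Jim\u00e9nez\u2011Ferbans & Reyes\u2011Castillo, 2022"
typed = "Passalus (Pertinax) gaboi Jim\u00e9nez-Ferbans & Reyes-Castillo, 2022"

# After normalization the pasted string is identical to the typed one.
print(normalize_hyphens(pasted) == typed)
```

A single `str.replace` for one known code point should cost very little per name, which fits the concern about performance.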

Here's the PDF, although they don't use the non-breaking hyphens in the web version.

Here are some other atypical hyphens that might also occasionally cause issues, introduced by publishers or bad OCR:

https://www.fileformat.info/info/unicode/category/Pd/list.htm
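
To check whether a pasted string contains any of these, one option is to test each character against the Unicode general category `Pd` (dash punctuation), which is what that list enumerates. The sketch below uses Python's standard `unicodedata` module; the function names are hypothetical. Mapping every `Pd` character to U+002D is broader than the non-breaking-hyphen case and is shown only as an assumption about what "standardizing" could mean:

```python
import unicodedata


def atypical_dashes(text: str) -> list[tuple[int, str]]:
    """Return (position, 'U+XXXX') for each Pd character other than U+002D."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch != "-" and unicodedata.category(ch) == "Pd"
    ]


def normalize_all_dashes(text: str) -> str:
    """Map every dash-punctuation (Pd) character to hyphen-minus (U+002D)."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch for ch in text
    )


print(atypical_dashes("Reyes\u2011Castillo"))
print(normalize_all_dashes("Jim\u00e9nez\u2011Ferbans"))
```

The per-character category lookup is slower than replacing one known code point, so a check like this may be better suited to cleaning data offline than to running on every parse.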

If it hurts performance too much, it's probably okay to not bother with handling it. It's not a frequently encountered problem.

Member

dimus commented Nov 10, 2022

Thank you @gdower, this is a good catch. I think it makes sense to add the non-breaking hyphen, since we know it appears 'in the wild'. I would postpone handling other hyphens until they are actually encountered, to save some CPU cycles.

@dimus dimus closed this as completed in 057a468 Nov 10, 2022