As a user I want atypical hyphens standardized and parsed #237

gdower · 2022-11-09T19:03:32Z

Some publishers use non-breaking hyphens (U+2011) instead of the more common hyphen-minus (U+002D) in author strings in typeset PDFs. People then copy and paste those strings into their databases, which breaks parsing. For example, compare these two outputs:

https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on

Perhaps atypical hyphens should be standardized to U+002D hyphens prior to parsing?
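To illustrate the idea, here is a minimal Python sketch of such a pre-parsing step. It is not gnparser's actual code; `normalize_hyphens` is a hypothetical helper name, and the sketch only assumes that U+2011 should be mapped to U+002D before the string reaches the parser.

```python
# Hypothetical pre-parsing step: map the non-breaking hyphen (U+2011)
# to the ordinary hyphen-minus (U+002D). Not taken from gnparser.
NON_BREAKING_HYPHEN = "\u2011"
HYPHEN_MINUS = "\u002d"


def normalize_hyphens(name: str) -> str:
    """Return name with every U+2011 replaced by U+002D."""
    return name.replace(NON_BREAKING_HYPHEN, HYPHEN_MINUS)


# The name from the example above, as copied from the PDF.
raw = "Passalus (Pertinax) gaboi Jim\u00e9nez\u2011Ferbans & Reyes\u2011Castillo, 2022"
clean = normalize_hyphens(raw)
print(clean)
```

After this step both author names contain only U+002D, so the two inputs in the comparison link become identical strings.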

Here's the PDF, although they don't use the non-breaking hyphens in the web version.

Here are some other atypical hyphens that might occasionally be an issue, introduced by publishers or bad OCR:

https://www.fileformat.info/info/unicode/category/Pd/list.htm

If it hurts performance too much, it's probably okay to not bother with handling it. It's not a frequently encountered problem.
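For the broader case, a rough Python sketch of normalizing every character in the Unicode dash punctuation category (Pd) could look like the following. This is only an assumption about how it might be done, not how gnparser works; `normalize_dashes` is a hypothetical name. Mapping all Pd characters, including en and em dashes, to a hyphen-minus is deliberately broad and may be more than a name parser wants. The ASCII check is one cheap way to skip the work for the common case where no atypical hyphen can be present.

```python
import unicodedata


def normalize_dashes(name: str) -> str:
    """Replace any character in Unicode category Pd with U+002D."""
    # Fast path: a pure-ASCII string cannot contain an atypical dash.
    if name.isascii():
        return name
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in name
    )


# U+2010 (hyphen) and U+2011 (non-breaking hyphen) are both in Pd.
sample = "Jim\u00e9nez\u2010Ferbans & Reyes\u2011Castillo"
result = normalize_dashes(sample)
print(result)
```

Accented letters such as "é" fall outside Pd and pass through unchanged; only the dash characters are rewritten.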

dimus · 2022-11-10T16:26:22Z

Thank you @gdower, this is a good catch. I think it makes sense to add the non-breaking hyphen, as it is something we know appears 'in the wild', while I would postpone other hyphens until they are actually encountered, to save some CPU cycles.

dimus · 2022-11-10T17:45:10Z

Hopefully this now parses correctly:

https://parser.globalnames.org/?format=html&names=Passalus+%28Pertinax%29+gaboi+Jim%C3%A9nez%E2%80%91Ferbans+%26+Reyes%E2%80%91Castillo%2C+2022%0D%0APassalus+%28Pertinax%29+gaboi+Jim%C3%A9nez-Ferbans+%26+Reyes-Castillo%2C+2022&with_details=on

gdower mentioned this issue Nov 9, 2022

Parsing atypical hyphens CatalogueOfLife/backend#1178

Closed

dimus closed this as completed in 057a468 Nov 10, 2022
