Parsing hyphenated genus names starting with a 2-letter segment #205

Closed
tobymarsden opened this issue Nov 14, 2021 · 3 comments

Comments

@tobymarsden

Parsing fails for genera that start with a 2-letter segment, e.g. Le-monniera.

tobymarsden added a commit to amazingplants/gnparser that referenced this issue Nov 14, 2021
@dimus
Member

dimus commented Nov 14, 2021

I think it is a good feature. I do have a concern though. GNparser serves not only as a parser, but also as a sort of 'linter' which should break on strings that were entered as scientific names by mistake.

If you check GNverifier name-strings for names with 2 letters before a dash, most of the results are junk. So I propose limiting 2-letter prefixes to a hardcoded subset and disallowing anything else.
If more names show up later, they can be added to the list. A similar approach already exists for 2-letter generic names. From the file below, it looks like only these "prefixes" occur in the wild:

De-
Eu-
Le-
Ne-

2char-dash.txt
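
For illustration, the allow-list idea could be sketched in Go roughly as follows. This is only a sketch under stated assumptions, not gnparser's actual code: the names `allowedPrefixes` and `hasAllowedTwoLetterPrefix` are hypothetical, and the check covers only the four prefixes listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedPrefixes is a hypothetical hardcoded allow-list holding the four
// 2-letter prefixes named in this thread. New prefixes would be added here.
var allowedPrefixes = map[string]struct{}{
	"De": {},
	"Eu": {},
	"Le": {},
	"Ne": {},
}

// hasAllowedTwoLetterPrefix reports whether genus starts with an allow-listed
// 2-letter segment followed by a hyphen and at least one more character.
// Any other 2-letter prefix is rejected, so junk strings keep failing.
func hasAllowedTwoLetterPrefix(genus string) bool {
	prefix, rest, found := strings.Cut(genus, "-")
	if !found || len(prefix) != 2 || rest == "" {
		return false
	}
	_, ok := allowedPrefixes[prefix]
	return ok
}

func main() {
	// "Xy-zus" is a made-up string used only to show a rejected prefix.
	for _, name := range []string{"Le-monniera", "Xy-zus", "Lemonniera"} {
		fmt.Printf("%s: %v\n", name, hasAllowedTwoLetterPrefix(name))
	}
}
```

The match is case-sensitive, so a lowercase `le-` would be rejected. Keeping the list in a map means a newly discovered prefix is a one-line addition.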

@tobymarsden
Author

@dimus Thanks!

I've now completed parsing all of the Kew names and indeed it turns out that Le-monniera was the only one like this gnparser struggled with. This means that, excluding six names which are wrong in the source data, gnparser will parse all 1,197,503 names in the Kew dataset once this issue and #203 are resolved.

I'll update the PR to special-case these four prefixes.

tobymarsden added a commit to amazingplants/gnparser that referenced this issue Nov 14, 2021
@dimus dimus closed this as completed Nov 14, 2021
@dimus
Member

dimus commented Nov 14, 2021

> I've now completed parsing all of the Kew names and indeed it turns out that Le-monniera was the only one like this gnparser struggled with.

Great news @tobymarsden! I closed this with dc67aaf, but by mistake put the PR instead of the issue in the comment. Making a release now.
