Hybrid character sometimes missing #174

Closed
tobymarsden opened this issue Jun 23, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@tobymarsden

When parsing Magnolia x soulangeana, the words section of the details looks like this:

  "words": [
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 0,
      "end": 8
    },
    {
      "verbatim": "×",
      "normalized": "×",
      "wordType": "HYBRID_CHAR",
      "start": 9,
      "end": 10
    },
    {
      "verbatim": "soulangeana",
      "normalized": "soulangeana",
      "wordType": "SPECIES",
      "start": 11,
      "end": 22
    }
  ],

(I wonder whether verbatim should be x rather than ×, in the same way that a verbatim subsp is currently kept as typed while normalized becomes subsp., but that's a nitpick.)

However, when parsing Magnolia denudata x Magnolia liliiflora, the output is:

  "words": [
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 0,
      "end": 8
    },
    {
      "verbatim": "denudata",
      "normalized": "denudata",
      "wordType": "SPECIES",
      "start": 9,
      "end": 17
    },
    {
      "verbatim": "",
      "normalized": "",
      "wordType": "HYBRID_CHAR",
      "start": 18,
      "end": 19
    },
    {
      "verbatim": "Magnolia",
      "normalized": "Magnolia",
      "wordType": "GENUS",
      "start": 20,
      "end": 28
    },
    {
      "verbatim": "liliiflora",
      "normalized": "liliiflora",
      "wordType": "SPECIES",
      "start": 29,
      "end": 39
    }
  ]

i.e. the HYBRID_CHAR word has empty verbatim and normalized properties.

The same applies to names like × Sorbopyrus auricularis.
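One way to spot the affected words is to compare each verbatim value with the slice of the input between start and end. The Python sketch below does that for the second output above. The helper name find_verbatim_mismatches is hypothetical, and the sketch assumes that start and end are character offsets into the input string, which is consistent with the offsets shown.

```python
def find_verbatim_mismatches(name, words):
    """Return (wordType, verbatim, expected) for each word whose verbatim
    differs from the input slice name[start:end].

    Assumption: start/end are character offsets into `name`.
    """
    mismatches = []
    for word in words:
        expected = name[word["start"]:word["end"]]
        if word["verbatim"] != expected:
            mismatches.append((word["wordType"], word["verbatim"], expected))
    return mismatches


name = "Magnolia denudata x Magnolia liliiflora"

# The "words" list from the second output above, trimmed to the fields used.
words = [
    {"verbatim": "Magnolia", "wordType": "GENUS", "start": 0, "end": 8},
    {"verbatim": "denudata", "wordType": "SPECIES", "start": 9, "end": 17},
    {"verbatim": "", "wordType": "HYBRID_CHAR", "start": 18, "end": 19},
    {"verbatim": "Magnolia", "wordType": "GENUS", "start": 20, "end": 28},
    {"verbatim": "liliiflora", "wordType": "SPECIES", "start": 29, "end": 39},
]

for word_type, verbatim, expected in find_verbatim_mismatches(name, words):
    print(f"{word_type}: verbatim={verbatim!r}, input slice={expected!r}")
```

Run against the Magnolia x soulangeana output, the same check would also flag the × verbatim, since the input slice there is x.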

Is this a bug, and if so, would you consider a PR?

@dimus
Member

dimus commented Jun 23, 2021

Definitely a bug, and yes, a PR would be fantastic if you are up to it.

@dimus dimus added the bug Something isn't working label Jun 23, 2021
tobymarsden added a commit to amazingplants/gnparser that referenced this issue Jun 23, 2021
@tobymarsden
Author

@dimus PR at #175

@dimus
Member

dimus commented Jun 24, 2021

I see that preprocessing adds to the problem, because it replaces every hybrid character with ×. I will need to think a bit about how to reorganize the code to get the correct verbatim.
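To illustrate the interaction between normalizing hybrid characters and keeping verbatim, the Python sketch below replaces standalone hybrid characters with × while still taking verbatim from the untouched input. It is only an illustration under stated assumptions, not gnparser's actual preprocessing: the regular expression and the functions preprocess and hybrid_words are hypothetical, and the approach relies on the substitution being one character for one character so that offsets into the normalized string remain valid for the original.

```python
import re

HYBRID_SIGN = "×"

# Hypothetical pattern: a lone x, X or × that starts the string or follows
# whitespace, and is itself followed by whitespace.
_HYBRID_RE = re.compile(r"(?<!\S)[xX×](?=\s)")


def preprocess(name):
    """Replace every standalone hybrid character with the × sign."""
    return _HYBRID_RE.sub(HYBRID_SIGN, name)


def hybrid_words(name):
    """Build HYBRID_CHAR entries whose verbatim comes from the original input.

    Assumption: the replacement is one character for one character, so the
    match offsets are the same in the original and the preprocessed string.
    """
    words = []
    for match in _HYBRID_RE.finditer(name):
        words.append({
            "verbatim": name[match.start():match.end()],  # as typed by the user
            "normalized": HYBRID_SIGN,                     # canonical form
            "wordType": "HYBRID_CHAR",
            "start": match.start(),
            "end": match.end(),
        })
    return words


print(preprocess("Magnolia denudata x Magnolia liliiflora"))
print(hybrid_words("Magnolia denudata x Magnolia liliiflora"))
```

Keeping the original string alongside the preprocessed one is the design choice being sketched here: normalized can be canonical while verbatim still reflects what was actually typed.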

@dimus
Member

dimus commented Jun 24, 2021

The problem was largely caused by code debt: an unnecessary legacy struct, parser.wordNode, was shoehorned into parsed.Word. I removed the legacy struct. I also added test_data_cultivars.md to tools/gentest.go to simplify test generation when many changes are introduced. I am adding a section on how to use the tool to CONTRIBUTING.md.
