Advanced search: filter on taxonomic rank #95

Closed
larsgw opened this issue Aug 8, 2022 · 18 comments


larsgw commented Aug 8, 2022

Searching for "Cicadomorpha" with Catalogue of Life as a source returns two results with an equal score: the infraorder Cicadomorpha and the genus Cicadomorpha Martynov, 1927. In my case I have the rank available. The feature list in the README includes

Fine-tuning the match score by matching authors, years, ranks etc.

But I do not see how to do this in the advanced search query language. Is this possible to do?


Adafede commented Aug 8, 2022

I think what you are looking for is: https://github.com/gnames/gnverifier#advanced-search-query-language

You can limit the parents, or look for genera or lower taxa, but not filter for a specific rank afaik

Author

larsgw commented Aug 8, 2022

You can [...] not filter for a specific rank afaik

Yes, that's what I meant. I was hoping there would be some (hidden) way to do so but I guess rank matching is only used when the rank is implied (with binomial names & subspecific ranks)?

Author

larsgw commented Aug 10, 2022

Could I help with adding support for this in some way?

Member

dimus commented Aug 23, 2022

Hi @larsgw @Adafede, I am back from vacation now. @larsgw you are right, the rank match is implicit.

Adding a constraint by rank should be possible, but I do wonder how often such a use case would come up. I can see your example, but I would assume it is quite a rare situation? Let me know if I am wrong.

Could I help with adding support for this in some way?

Adding bug reports, suggesting new features, letting us know about useful public data sources, discussing ideas for further development, and citing the app are all helpful. And for the adventurous, contributing code to "scratch an itch" is the best help :)

Author

larsgw commented Aug 23, 2022

Adding a constraint by rank should be possible, but I do wonder how often such a use case would come up. I can see your example, but I would assume it is quite a rare situation? Let me know if I am wrong.

I have encountered other situations that could be solved by the same solution:

  • Errors in GBIF and CoL taxa, like the "genera" Asilidae or Scarabaeoidea returning before the respective family and superfamily. Ideally those data errors should just be fixed but this is blocking my work.
  • Subgenera with the same name as the parent genus (i.e. Genus s.s.)

And for the adventurous, contributing code to "scratch an itch" is the best help :)

That's what I was suggesting, but I haven't had any luck navigating the various repositories involved in this. (I also don't think I can set up a testing environment at the moment due to limited disk space)

Author

larsgw commented Aug 23, 2022

My use case is that I have manually extracted lists of taxa (with their taxonomic ranks) from older and newer sources, and I want to match these taxa to GBIF. https://twitter.com/larswillighagen/status/1557875955301056512

(I have the feeling that the results now are a bit worse than a few months ago, sometimes adding the author & year does not seem to change the top result even if there's an exact match, but that's only for a few taxa)

Member

dimus commented Aug 23, 2022

And for the adventurous, contributing code to "scratch an itch" is the best help :)

That's what I was suggesting, but I haven't had any luck navigating the various repositories involved in this. (I also don't think I can set up a testing environment at the moment due to limited disk space)

I'll be happy to help set up the gnresolver development environment when/if you are ready to help with the code. There is also a way to help that works with limited disk space: modifying the advanced search query library https://github.com/gnames/gnquery to include rank.

Member

dimus commented Aug 23, 2022

(I have the feeling that the results now are a bit worse than a few months ago, sometimes adding the author & year does not seem to change the top result even if there's an exact match, but that's only for a few taxa)

I'd be curious to learn more about this. I work with the CoL guys @yroskov and @gdower, and I alert them about possible issues. CoL went through big changes recently and became more integrated with the GBIF backbone taxonomy.

Author

larsgw commented Aug 23, 2022

(I have the feeling that the results now are a bit worse than a few months ago, sometimes adding the author & year does not seem to change the top result even if there's an exact match, but that's only for a few taxa)

I'd be curious to learn more about this. I work with the CoL guys @yroskov and @gdower, and I alert them about possible issues. CoL went through big changes recently and became more integrated with the GBIF backbone taxonomy.

Hm, I looked into it a bit more (I was rushing a bit when I worked on it last week), and that aspect might be my fault.

Author

larsgw commented Aug 23, 2022

Hm, I looked into it a bit more (I was rushing a bit when I worked on it last week), and that aspect might be my fault.

Yep, I hadn't set up unit tests for the custom name parsing code and accidentally introduced a regression that seems to have discarded all author names in processing. I have the input still so it's fine, but oops.


gdower commented Aug 23, 2022

It's possible that Systema Dipterorum updated Asilidae cristatus recently and the changes just haven't made it into CoL and GBIF yet. In the Systema Dipterorum V3.8 data it is Asilidae cristatus.

Member

dimus commented Aug 24, 2022

@larsgw, I thought more about the rank constraint, and I do not think it is a good idea. Ranks are not normalized and are a mess of varied strings; sometimes they are given and sometimes not, so such an option would cause confusion and misreading of the results. Sorry about that.

I guess a postprocessing of the results is the best we can do at this point.

dimus closed this as completed Aug 24, 2022

Author

larsgw commented Aug 24, 2022

Ah I see as well that the JSON output does (kind of, in the classificationRanks field) show the rank of the result, while this isn't available in the CSV. That's enough for me to continue, thank you very much.
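
The post-processing described here could look roughly like the following. This is a minimal sketch, assuming each result is a dict whose `classificationRanks` value is a pipe-separated string ending in the rank of the matched taxon; the `matchedName` key and the sample records are made up for illustration and are not real output.

```python
def result_rank(result):
    """Return the last (most specific) rank in classificationRanks, or None."""
    ranks = result.get("classificationRanks") or ""
    last = ranks.split("|")[-1].strip().lower()
    return last or None


def filter_by_rank(results, wanted_rank):
    """Keep only results whose terminal rank equals wanted_rank (case-insensitive)."""
    wanted = wanted_rank.strip().lower()
    return [r for r in results if result_rank(r) == wanted]


# Illustrative records only (assumed structure, not real API output).
results = [
    {
        "matchedName": "Cicadomorpha Martynov, 1927",
        "classificationRanks": "kingdom|phylum|class|order|family|genus",
    },
    {
        "matchedName": "Cicadomorpha",
        "classificationRanks": "kingdom|phylum|class|order|infraorder",
    },
]

infraorders = filter_by_rank(results, "infraorder")
print([r["matchedName"] for r in infraorders])
```

Since ranks are not normalized across data sources, a real script would probably need to map variant spellings to one vocabulary before comparing.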

Member

dimus commented Aug 25, 2022

I moved gnverifier and gnfinder to v1.0.0. If you use the API, you need to change /api/v0/ to /api/v1/ in the API URL for your scripts to work again.
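
For scripts that keep the API URL in a variable, the change can be as small as the sketch below. The host and the `verifications` path are placeholders chosen for illustration, not confirmed endpoints.

```python
def migrate_api_url(url):
    """Rewrite a v0 API URL to v1; URLs without /api/v0/ are returned unchanged."""
    return url.replace("/api/v0/", "/api/v1/", 1)


# Placeholder URL (hypothetical host and path).
old_url = "https://example.org/api/v0/verifications"
new_url = migrate_api_url(old_url)
print(new_url)
```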

Author

larsgw commented Aug 25, 2022

Thank you for letting me know. I noticed that the CLI broke for me as well until I uncommented the API URL setting in the config file.

Member

dimus commented Aug 25, 2022

Updating gnverifier via brew or a GitHub download should also do the trick.

Author

larsgw commented Aug 18, 2023

Actually classificationRanks does not work because it contains the classification of the accepted name, so a species name that is now a synonym for a subspecies is seen as a rank mismatch. Some other examples that I am running into:

  • "Diptera" matches the former genus Diptera Borkh., synonym of Saxifraga L.
  • "Nomada cypria Mavromoustakis" matches Nomada (the genus) first. This can be solved with the cardinality score though, and I cannot reproduce it anymore today.
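
One way to sidestep the synonym problem is to treat the rank as unknown whenever the matched name is a synonym, rather than reporting a mismatch. A minimal sketch, assuming each result carries a boolean `isSynonym` flag and a pipe-separated `classificationRanks` string; the flag name and the sample records are assumptions for illustration, so substitute whatever your output actually provides.

```python
def rank_check(result, expected_rank):
    """Return 'match', 'mismatch', or 'unknown' for a result against expected_rank.

    For synonyms the classification describes the accepted name, so its
    terminal rank says nothing reliable about the matched name.
    """
    if result.get("isSynonym"):
        return "unknown"
    ranks = (result.get("classificationRanks") or "").split("|")
    last = ranks[-1].strip().lower()
    if not last:
        return "unknown"
    return "match" if last == expected_rank.strip().lower() else "mismatch"


# Illustrative records only (assumed structure, not real output).
synonym = {
    "isSynonym": True,
    "classificationRanks": "kingdom|phylum|class|order|family|genus|species|subspecies",
}
accepted = {
    "isSynonym": False,
    "classificationRanks": "kingdom|phylum|class|order|family",
}

print(rank_check(synonym, "species"))   # synonym: rank cannot be trusted
print(rank_check(accepted, "family"))
```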

Author

larsgw commented Aug 18, 2023

I've improved my algorithm to detect short common prefixes. Before it already noticed that Diptera Borkh. was unusual because of the short common prefix of the classificationPath, but now it can figure out that (although "Diptera" is listed first) the actual most likely intended common prefix is Animalia|Arthropoda|Insecta|Diptera, and grab the Diptera result that matches it from the list of results.
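
The idea can be sketched as a majority vote over the top results of a whole list of names. This is not the actual implementation, only one possible reading of it: it assumes each candidate is a dict with a pipe-separated `classificationPath`, and the helper names, the 50% threshold, and the sample records are all made up for illustration.

```python
from collections import Counter


def split_path(result):
    """Split a pipe-separated classificationPath into a list of taxon names."""
    path = result.get("classificationPath") or ""
    return [part for part in path.split("|") if part]


def majority_prefix(paths, threshold=0.5):
    """Longest prefix shared by more than `threshold` of all given paths."""
    total = len(paths)
    remaining = list(paths)
    prefix = []
    depth = 0
    while remaining:
        counts = Counter(p[depth] for p in remaining if len(p) > depth)
        if not counts:
            break
        taxon, n = counts.most_common(1)[0]
        if n <= threshold * total:
            break
        prefix.append(taxon)
        remaining = [p for p in remaining if len(p) > depth and p[depth] == taxon]
        depth += 1
    return prefix


def pick(candidates, prefix):
    """First candidate whose path starts with prefix, else the first candidate."""
    for candidate in candidates:
        if split_path(candidate)[: len(prefix)] == prefix:
            return candidate
    return candidates[0] if candidates else None


# Illustrative candidates per searched name, in the order returned.
batch = {
    "Diptera": [
        {"classificationPath": "Plantae|Tracheophyta|Magnoliopsida|Saxifragales|Saxifragaceae|Saxifraga"},
        {"classificationPath": "Animalia|Arthropoda|Insecta|Diptera"},
    ],
    "Asilidae": [
        {"classificationPath": "Animalia|Arthropoda|Insecta|Diptera|Asilidae"},
    ],
    "Mycetophilidae": [
        {"classificationPath": "Animalia|Arthropoda|Insecta|Diptera|Mycetophilidae"},
    ],
}

top_paths = [split_path(cands[0]) for cands in batch.values() if cands]
prefix = majority_prefix(top_paths)
chosen = pick(batch["Diptera"], prefix)
print("|".join(prefix))
print(chosen["classificationPath"])
```

The vote uses only the first result per name, so a single odd top hit such as the plant genus gets outvoted by the rest of the list. A list dominated by wrong top hits would defeat it.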

There are still some other issues: "Mycetophilidae" first matches Mycetophilidae MyceoIntGen [sic] instead of the family. Because the genus is not a synonym it's possible to reliably get the rank and compare to my own data, but that probably won't always be the case.
