Better taxonomic resolution between family and genus on export by including SubFamily, Tribe, and Subtribe #159
Hi @whitfarnum, good to hear from you, and thanks for your detailed examples of ways to make Nomer a little more useful for your work. Just curious: how do you typically use Nomer? |
@jhpoelen
I am currently working on Scarabaeidae, and Catalogue of Life has the best online resource for that, so I am only searching Catalogue of Life. I am looking at the Catalogue of Life API to fill in higher taxonomy, but it feels like it is still under development. |
@whitfarnum Thanks for detailing your workflow. Hoping to do what I can to add the additional taxon ranks as separate fields sooner rather than later. Thanks for being patient. |
@whitfarnum I've expanded the alignment schema to include subfamily, tribe and subtribe, as you suggested. When running your example, Aegialia arenaria, I found that the expected values appeared when matching against ITIS / NCBI. For the report, see https://github.com/globalbioticinteractions/name-alignment-template/actions/runs/5682620836 and the attachment. For an example snippet from the report, see below. Can you confirm?
|
@jhpoelen |
@whitfarnum great! Looking forward to hearing your notes. |
@jhpoelen
Adoretosoma elegans
Search catalog of life for Adoretosoma elegans
Anisoplia baetica
Paracotalpa ursina Works
Lagochile trigona Works
Strigoderma arboricola Works
Nomer: alignedExternalId:
Chlorota aulica Works
Adoretus semperi Works
I have some numbers. I submitted ~650 names. 550 of the names matched in Catalogue of Life. I then tested the URL associated with each name. ~13% of the names gave a 404 error. The results file is attached. |
@whitfarnum thanks for sharing your detailed notes. As I might have told you, Nomer uses a versioned copy of Catalogue of Life (i.e., accessed on 2022-09-09T20:05:22.601Z at https://download.catalogueoflife.org/col/latest_coldp.zip with signature hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63 or hash://md5/ce89c200aab5be1b439647c1ac72813f as part of [1]). In this versioned (and signed) copy, I was able to locate the taxon ids that Catalogue of Life appears to have forgotten in their more recent version.
See results attached and below in markdown table. Note that, in the versioned copy of Catalogue of Life, the taxon ids This suggests that Catalogue of Life "forgot" or stopped using / redirecting at least 2 previously issued taxon ids.
suspicious-taxa-with-404s-or-replacement-id.tsv.txt
[1] Poelen, Jorrit H. (2023). Nomer Corpus of Taxonomic Resources hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda hash://md5/91dd844e787ffae8f0a2bbb8c1f29192 (0.16) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8125362 |
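The `hash://sha256/...` and `hash://md5/...` signatures quoted above can be checked against a local copy of the archive. A minimal sketch, assuming the signature is simply the hex digest of the file's bytes; the helper names `content_id` and `matches` are hypothetical and not part of Nomer:

```python
import hashlib


def content_id(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a hash://<algorithm>/<hex digest> identifier for the file at path."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so a large archive does not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"hash://{algorithm}/{digest.hexdigest()}"


def matches(path, expected):
    """Check a file against an identifier such as hash://sha256/9ac2...1f63."""
    # "hash://sha256/abc".split("/") gives ["hash:", "", "sha256", "abc"].
    algorithm = expected.split("/")[2]
    return content_id(path, algorithm) == expected.lower()
```

For example, `matches("latest_coldp.zip", "hash://sha256/9ac28297a996e02f6026c40d24e67f59f7f39d495bb45759ebc4adb475d51f63")` would be expected to return `True` only for a byte-identical copy of the 2022-09-09 download; the local file name here is an assumption.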
Now, I wonder if the Catalogue of Life team can share what happened to taxon ids This seems especially relevant because GBIF is moving to adopt the Catalogue of Life as their backbone taxonomy. from https://www.gbif.org/publisher/f4ce3c03-7b38-445e-86e6-5f6b04b649d4 -
@mdoering @gdower Can you please help us understand the policy on taxonomic ids issued by Catalogue of Life? |
Note, by the way, that an identical query to #159 (comment) applied to the current version of Catalogue of Life yielded only the "newer" taxonomic identifiers, and the older ones were not to be found. I am probably looking in the wrong location, so I'd appreciate any insight that you may have on the internals of Catalogue of Life.
2023-07-28-col-suspicious-taxa-with-404s-or-replacement-id.tsv.txt
|
As far as I can tell, the original issue (adding subfamily, tribe and subtribe) has been resolved. I am transferring the other issue to a more suitable location: the Catalogue of Life tracker. |
I've transferred the Catalogue of Life taxonomic id question to CatalogueOfLife/general#98 . |
The author string changed for Adoretosoma elegans, which resulted in a new ID being minted: https://api.checklistbank.org/dataset/COL2022/taxon/64SCW
The 2022 annual checklist was released in August 2022, a month before @jhpoelen harvested it.
@whitfarnum, if you point to the API as in the link above, which uses the COL2022 alias as the dataset_id, your links should keep working: the annual checklists won't ever be deleted, and although ChecklistBank is still under development, the taxon endpoint has been stable for at least 3 years. Unfortunately, there's no way to load the COL2022 annual checklist into the catalogueoflife.org portal to link to it there. You may be interested in: |
@whitfarnum, I think @jhpoelen harvested the 2022 COL Annual Checklist, so you should see no 404 errors if you use the API links with COL2022 as the dataset_id. If it was the September release instead, you could expect IDs to be broken in COL2022 for the names listed in this diff of the Scarabs dataset between the Aug 2022 and Sept 2022 releases: https://www.checklistbank.org/dataset/1027/diff?attempts=92..93
200: https://api.checklistbank.org/dataset/COL2022/taxon/3SGCB
new url:
But that should still be far fewer 404 errors than with links to the current COL release. I wouldn't link to monthly COL releases, because eventually they will get deleted. |
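The advice above amounts to building links of the form `https://api.checklistbank.org/dataset/COL2022/taxon/<id>`. A minimal sketch of that URL pattern, taken from the links in this thread; the function names are hypothetical, and a taxon ID copied from the current portal is not guaranteed to exist in COL2022:

```python
from urllib.parse import urlparse

API_BASE = "https://api.checklistbank.org"


def col_taxon_api_url(taxon_id, dataset="COL2022"):
    """Build a ChecklistBank taxon URL pinned to a dataset alias such as COL2022."""
    return f"{API_BASE}/dataset/{dataset}/taxon/{taxon_id}"


def portal_to_api_url(portal_url, dataset="COL2022"):
    """Rewrite a catalogueoflife.org /data/taxon/<id> link to the pinned API form."""
    path = urlparse(portal_url).path.rstrip("/")
    prefix, _, taxon_id = path.rpartition("/")
    if not prefix.endswith("/data/taxon") or not taxon_id:
        raise ValueError(f"not a portal taxon link: {portal_url}")
    return col_taxon_api_url(taxon_id, dataset)
```

For instance, `col_taxon_api_url("64SCW")` reproduces the Adoretosoma elegans link quoted earlier in the thread.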
@whitfarnum Your careful observations led to an exchange with one of the Catalogue of Life developers @ CatalogueOfLife/general#98 (comment) . Curious to hear your thoughts on this. |
@jhpoelen
I am using the Catalogue of Life data to curate our scarabs. Because I am using a single trusted source, I can automate a lot of the name update procedures using the export results from Nomer. Here are some proposed features that I think will make data updating smoother.
In the higher taxonomy, the data jumps from alignedFamilyName down to alignedGenusName. This ignores SubFamily, Tribe, and Subtribe. Those data are available in the alignedPath and alignedPathNames fields. Having them already parsed out, like family and genus, would save me a lot of manual copy and paste.
Examples:
Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea | Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria
unranked | kingdom | phylum | class | order | superfamily | family | subfamily | tribe | genus | species
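Until such fields exist in the export, the two pipe-delimited lines above can be paired up position by position. A minimal sketch, assuming the first line is the alignedPath value and the second is the alignedPathNames value; the function name `ranks_from_path` is hypothetical and not part of Nomer:

```python
def ranks_from_path(aligned_path, aligned_path_names,
                    wanted=("subfamily", "tribe", "subtribe")):
    """Pick selected ranks out of Nomer's pipe-delimited path columns.

    aligned_path holds the taxon names and aligned_path_names holds the
    matching rank labels, in the same order.
    """
    names = [part.strip() for part in aligned_path.split("|")]
    ranks = [part.strip().lower() for part in aligned_path_names.split("|")]
    if len(names) != len(ranks):
        raise ValueError("path and rank lists differ in length")
    by_rank = dict(zip(ranks, names))
    # Ranks missing from the path (here: subtribe) come back as empty strings.
    return {rank: by_rank.get(rank, "") for rank in wanted}


path = ("Biota | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeoidea"
        " | Scarabaeidae | Aegialiinae | Aegialiini | Aegialia | Aegialia arenaria")
rank_names = ("unranked | kingdom | phylum | class | order | superfamily"
              " | family | subfamily | tribe | genus | species")
print(ranks_from_path(path, rank_names))
# {'subfamily': 'Aegialiinae', 'tribe': 'Aegialiini', 'subtribe': ''}
```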
Catalogue of Life, at least, has the author attached to higher taxa; if it could be included in the export, that would save a lot of copy and paste on the back end. Here is a sample higher taxon link from Catalogue of Life.
https://www.catalogueoflife.org/data/taxon/8RYS7