Name occurrence verification needs #67

Open
Teinostoma opened this issue Jun 14, 2023 · 9 comments

@Teinostoma

OCR often does very poorly on documents in BHL, and the list of names being searched for is very incomplete, at least when it comes to fossil mollusks. Authors also did not make this easy, often using idiosyncratic ways of abbreviating. As a result, both the false positive and false negative rates are very high in the documents that I am reading on BHL. A few ideas:

Is there a way to take the date of the publication into consideration? Names published after a publication was written will not be found in that publication (for example, the word lens will not be a reference to the genus Lens Simpson, 1900 in publications from the 1800s). This would help decrease false positives.
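To make the date idea concrete, here is a minimal sketch in Python. It assumes each candidate name carries the year it was published; the `NameRecord` type and the function names are hypothetical and are not part of BHLindex. Names with an unknown year are kept, since dropping them would raise the false negative rate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NameRecord:
    # Hypothetical record type, used only for this illustration.
    name: str
    year_published: Optional[int]  # None when the year is unknown


def plausible_in(record: NameRecord, publication_year: int) -> bool:
    """Keep a name unless it is known to postdate the scanned publication."""
    if record.year_published is None:
        return True
    return record.year_published <= publication_year


def filter_by_date(candidates: List[NameRecord], publication_year: int) -> List[NameRecord]:
    """Drop candidate names that were published after the document itself."""
    return [c for c in candidates if plausible_in(c, publication_year)]


candidates = [
    NameRecord("Lens", 1900),         # Lens Simpson, 1900, as in the example above
    NameRecord("Auricularia", None),  # year deliberately left unknown here
]
kept = filter_by_date(candidates, publication_year=1850)
print([c.name for c in kept])
```

For a document from 1850 the match on Lens is dropped, while the name with no recorded year passes through for other checks to handle.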

Is there a way to allow users to quickly indicate "here is a name missed by the system", "this is correct", "this name finding is spurious", etc.? It would require verification to protect against trolling or errors, but could be a useful way to improve the name finding.

Is there a way to take context into account to identify higher taxonomic levels? This is especially of value for homonyms. For example, being able to search for references that contain both Auricularia and Mollusca would avoid the huge number of hits for the fungus Auricularia.
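A rough sketch of the context idea in Python: keep only the pages where the name and a higher taxon both occur. The page texts are invented for illustration and `mentions_both` is a hypothetical helper, not an existing BHL or BHLindex search feature.

```python
import re


def has_word(text: str, word: str) -> bool:
    """Case-insensitive whole-word search."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def mentions_both(text: str, name: str, context_taxon: str) -> bool:
    """True when both the name and the higher taxon occur in the text."""
    return has_word(text, name) and has_word(text, context_taxon)


# Invented page texts, for illustration only.
pages = {
    "page-1": "Auricularia is listed here among the Mollusca of the Eocene.",
    "page-2": "The fungus Auricularia grows on dead wood.",
}

hits = [
    page_id
    for page_id, text in pages.items()
    if mentions_both(text, "Auricularia", "Mollusca")
]
print(hits)
```

Plain co-occurrence is a blunt test: a page could mention Mollusca for unrelated reasons, and a mollusk paper might never use the word. It would serve as a first filter, not as proof of which homonym is meant.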

@dimus
Member

dimus commented Jun 15, 2023

Thank you for your feedback @Teinostoma. Currently BHL names come from a hybrid approach: an earlier name-finding method that used n-grams, and BHLindex. BHLindex creates very few false positives, but it misses many names with OCR errors. The names found earlier with the n-gram method include many names with OCR errors, but they also contain a lot of false positives. We are thinking about ways to fix this problem.
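The trade-off can be illustrated with a small Python sketch. This is not the algorithm used by BHLindex or by the earlier n-gram method; it swaps in the standard library's `difflib` as a stand-in for any fuzzy matcher, and the name list and the OCR-damaged token are made up.

```python
import difflib

# Made-up reference list; a real one would hold millions of names.
known_names = ["Auricularia", "Teinostoma"]


def exact_lookup(token: str) -> bool:
    """Strict matching: few false positives, but misses OCR-damaged names."""
    return token in known_names


def fuzzy_lookup(token: str, cutoff: float = 0.8) -> list:
    """Tolerant matching: recovers OCR errors, at the risk of spurious hits."""
    return difflib.get_close_matches(token, known_names, n=3, cutoff=cutoff)


ocr_token = "Auricu1aria"  # the letter "l" misread as the digit "1"
print(exact_lookup(ocr_token))
print(fuzzy_lookup(ocr_token))
```

Lowering `cutoff` recovers more damaged names, but it also lets more ordinary words match a name by accident, which is the false positive problem described above.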

Do you know a good resource for fossil mollusk names? Currently, I think, there are 3 sources of mollusk names: WoRMS, PaleoBioDB, and, for old names, Sherborn's Index Animalium.

Context, use of dates, and figuring out how to reconcile abbreviated names are definitely ways to improve name-finding. I am actually writing a grant about exactly that.

Implementing user-feedback for names would be out of scope of BHLindex, and more of a decision for BHL folks (@mlichtenberg, @cajunjoel). @gdower also might be interested.

@Teinostoma
Author

Teinostoma commented Jun 15, 2023 via email

@dimus
Member

dimus commented Jun 15, 2023

Does Ruhoff exist as curated, digitized data? I tried to OCR it using Adobe Acrobat and found that the final result contains a fair number of errors:
F_A_Ruhoff_Mollusca_1850_1870.txt

If the OCR errors are corrected in the file (from the species epithet to the year), it would be fairly easy to convert it into a data-source

@Teinostoma
Author

Teinostoma commented Jun 15, 2023 via email

@dimus
Member

dimus commented Jun 18, 2023

@Teinostoma, I tried my best to clean up the OCR for the names in Ruhoff. Would you be able to look at the result and tell me what you think?

https://github.com/gnames/ds-ruhoff-mollusca/blob/master/data/07-fmt-names.csv

To avoid problems with UTF-8, it is better to open the file with LibreOffice than with Excel.

I only paid attention to the names themselves (1st and 2nd columns); the metadata after the names is not as clean.
If/when they are clean enough, I can add them to https://verifier.globalnames.org and use these names in bhlindex.

I tried to reconcile them against other datasets, and it looks like about half of them are new to my data.

@Teinostoma
Author

Teinostoma commented Jun 19, 2023 via email

@dimus
Member

dimus commented Jun 20, 2023

Thank you @Teinostoma! I added a fix: gnames/ds-ruhoff-mollusca@1299450

I currently have 3 levels of curation quality: "curated", when I am pretty sure that specialists have made a significant effort to scrutinize the data; "auto-curated", when cleaning is done mostly by scripts; and the rest, when the curation status is unknown.

For example, IRMNG is considered to be curated, GBIF is auto-curated, and ION is not curated.

Do you think it is good enough to apply "auto-curated" to the data? It would push its matching results above 'non-curated' names.
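As a rough illustration of what that push would mean, here is a Python sketch that sorts match results by curation level. The `Curation` enum and `rank_matches` are hypothetical and do not reflect how https://verifier.globalnames.org actually scores results. The levels for IRMNG, GBIF and ION come from the examples above, while labelling Ruhoff as auto-curated is only the proposal under discussion.

```python
from enum import IntEnum


class Curation(IntEnum):
    # Hypothetical encoding of the three levels described above.
    NOT_CURATED = 0
    AUTO_CURATED = 1
    CURATED = 2


SOURCE_CURATION = {
    "IRMNG": Curation.CURATED,
    "GBIF": Curation.AUTO_CURATED,
    "ION": Curation.NOT_CURATED,
}


def rank_matches(matches, levels):
    """Sort (source, name) pairs so that better-curated sources come first.

    Sources missing from `levels` are treated as not curated.
    """
    return sorted(
        matches,
        key=lambda match: levels.get(match[0], Curation.NOT_CURATED),
        reverse=True,
    )


# Assumption for illustration: Ruhoff gets the proposed "auto-curated" label.
levels = dict(SOURCE_CURATION, Ruhoff=Curation.AUTO_CURATED)
matches = [("ION", "Aus bus"), ("Ruhoff", "Aus bus"), ("IRMNG", "Aus bus")]
ranked = rank_matches(matches, levels)
print([source for source, _ in ranked])
```

With the label applied, the Ruhoff match sorts above the ION match and below the IRMNG one.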

@Teinostoma
Author

Teinostoma commented Jun 20, 2023 via email

@dimus
Member

dimus commented Jul 25, 2023

I attempted to detect more elusive typos. It looks like about 25% of the names in the publication are new to https://verifier.globalnames.org/

https://raw.githubusercontent.com/gnames/ds-ruhoff-mollusca/master/data/08-reconsile.csv
