Name occurrence verification needs #67

Teinostoma · 2023-06-14T21:31:20Z

OCR often does very poorly on documents in BHL, and the list of names being searched for is very incomplete, at least when it comes to fossil mollusks. Authors also did not make this easy, often using idiosyncratic ways of abbreviating. As a result, both the false positive and false negative rates are very high in the documents that I am reading on BHL. A few ideas:

Is there a way to take the date of the publication into consideration? Names published after a publication was written will not be found in that publication (for example, the word lens will not be a reference to the genus Lens Simpson, 1900 in publications from the 1800s). This would help decrease false positives.
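
A minimal sketch of the date idea, assuming a lookup table from name to year of first publication is available. The table `NAME_YEAR`, the function `plausible_names`, and the entry `Examplea` are hypothetical and only illustrate the filter; they are not part of BHLindex.

```python
# Hypothetical lookup: name -> year the name was first published.
# "Lens" (1900) is the example from this issue; "Examplea" is made up.
NAME_YEAR = {
    "Lens": 1900,
    "Examplea": 1850,
}


def plausible_names(candidates, publication_year, name_year=NAME_YEAR):
    """Keep candidate names that could already exist in a document of that year."""
    kept = []
    for name in candidates:
        year = name_year.get(name)
        # Names with an unknown year are kept, so that a gap in the
        # lookup table does not turn into a false negative.
        if year is None or year <= publication_year:
            kept.append(name)
    return kept


# A document from 1880 cannot refer to a genus published in 1900.
print(plausible_names(["Lens", "Examplea", "Unknownus"], 1880))
```

Keeping names with no known year is a deliberate choice here: the filter only removes hits that are provably anachronistic.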

Is there a way to allow users to quickly indicate "here is a name missed by the system", "this is correct", "this name finding is spurious", etc.? It would require verification to protect against trolling or errors, but could be a useful way to improve the name finding.

Is there a way to take context into account to identify higher taxonomic levels? This is especially of value for homonyms. For example, being able to search for references that contain both Auricularia and Mollusca would avoid the huge number of hits for the fungus Auricularia.
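
The co-occurrence search could look roughly like the sketch below. It assumes page text is available as a plain dictionary of page ID to OCR text; `mentions_all`, `filter_pages`, and the sample pages are hypothetical, and real OCR text would need fuzzier matching than exact word lookup.

```python
import re


def mentions_all(page_text, required_terms):
    """Return True if every required term occurs as a whole word in the text."""
    words = set(re.findall(r"[A-Za-z]+", page_text.lower()))
    return all(term.lower() in words for term in required_terms)


def filter_pages(pages, name, context_terms):
    """Return IDs of pages that contain the name and all context terms."""
    required = [name, *context_terms]
    return [page_id for page_id, text in pages.items()
            if mentions_all(text, required)]


# Made-up sample pages, for illustration only.
pages = {
    "p1": "Auricularia growing on dead wood.",
    "p2": "Mollusca: Auricularia sp. from the Eocene.",
    "p3": "Mollusca of the Pliocene.",
}
print(filter_pages(pages, "Auricularia", ["Mollusca"]))
```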

dimus · 2023-06-15T12:52:09Z

Thank you for your feedback, @Teinostoma. Currently BHL names come from a hybrid approach: an older name-finding method that used an ngram approach, and BHLindex. BHLindex creates very few false positives but misses many names with OCR errors. The names found before BHLindex with the ngram method include many names with OCR errors, but they also contain a lot of false positives. We are thinking about ways to fix this problem.

Do you know a good resource for fossil mollusk names? Currently, I think, there are 3 sources of mollusk names: WoRMS, PaleoBioDB, and, for old names, Sherborn's Index Animalium.

Context, use of dates, and figuring out how to reconcile abbreviated names are definitely ways to improve name-finding. I am actually writing a grant about exactly that.

Implementing user feedback for names would be out of scope for BHLindex, and is more of a decision for the BHL folks (@mlichtenberg, @cajunjoel). @gdower also might be interested.

Teinostoma · 2023-06-15T14:31:38Z

I believe that WoRMS has all the names from MolluscaBase, so I don't think MolluscaBase would need separate attention. Paleobiology Database doesn't have very thorough coverage of many mollusc faunas; most of the attention has gone to "what are things you can do with this data" rather than to supporting data generation and quality control (a common problem of large biodiversity databases). Ruhoff (https://repository.si.edu/handle/10088/5331) adds a couple of decades beyond Sherborn, though it is not quite as thorough. Fossils were not included in the Zoological Register for a while, so it does not help with them for the first few decades.

-- Dr. David Campbell Associate Professor, Geology Department of Natural Sciences 110 S Main St, #7270 Gardner-Webb University Boiling Springs NC 28017

dimus · 2023-06-15T21:07:53Z

Does Ruhoff exist as curated, digitized data? I tried to OCR it using Adobe Acrobat and found that the final result contains a fair number of errors:
F_A_Ruhoff_Mollusca_1850_1870.txt (https://github.com/gnames/bhlindex/files/11763335/F_A_Ruhoff_Mollusca_1850_1870.txt)

If the OCR errors are corrected in the file (from the species epithet to the year), it would be fairly easy to convert it into a data source.

Teinostoma · 2023-06-15T21:22:54Z

I don't know of a curated version of Ruhoff; I mostly use my print copy, which doesn't help much with what you need.

dimus · 2023-06-18T23:57:21Z

@Teinostoma, I did my best to clean up the OCR for names in Ruhoff. Would you be able to look at the result and tell me what you think?

https://github.com/gnames/ds-ruhoff-mollusca/blob/master/data/07-fmt-names.csv

To avoid problems with UTF-8, it is better to use LibreOffice instead of Excel.

I only paid attention to the names themselves (1st and 2nd columns); the metadata after the names is not as clean.
If/when they are clean enough, I can add them to https://verifier.globalnames.org and use these names in bhlindex.

I tried to reconcile them against other datasets, and it looks like about half of them are new to my data.

Teinostoma · 2023-06-19T16:50:31Z

It looks like a good start. I noticed two corrections for the first page: in *Nucula hammen aalensis*, *hammen* is an error for *hammeri*, and *Architectonica abbottii* Gabb, 1861 is missing. But that's far better than the OCR.

dimus · 2023-06-20T11:14:09Z

Thank you, @Teinostoma! I added a fix: gnames/ds-ruhoff-mollusca@1299450

I currently have 3 levels of curation quality: "curated", when I am pretty sure that specialists have made a significant effort to scrutinize the data; "auto-curated", when cleaning is done mostly by scripts; and the rest, when curation is unknown.

For example, IRMNG is considered to be curated, GBIF is auto-curated, and ION is not curated.

Do you think it is good enough to apply "auto-curated" to the data? It would push its matching results above 'non-curated' names.
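
As an illustration of what "push above" could mean, here is a sketch that sorts match results first by curation level and then by match score. How the verifier actually combines curation with score is not stated in this thread, so the `Match` type, the `CURATION_RANK` table, the `not-curated` label, and the scores are all assumptions.

```python
from dataclasses import dataclass

# Lower rank sorts first. Labels follow the three levels described above;
# the exact label strings are assumptions.
CURATION_RANK = {"curated": 0, "auto-curated": 1, "not-curated": 2}


@dataclass
class Match:
    name: str
    source: str
    curation: str
    score: float  # hypothetical match quality, higher is better


def rank_matches(matches):
    """Order matches by curation level, then by descending score."""
    return sorted(
        matches,
        # Unknown labels are treated like the lowest level.
        key=lambda m: (CURATION_RANK.get(m.curation, 2), -m.score),
    )


matches = [
    Match("Aus bus", "ION", "not-curated", 0.99),
    Match("Aus bus", "GBIF", "auto-curated", 0.90),
    Match("Aus bus", "IRMNG", "curated", 0.80),
]
for m in rank_matches(matches):
    print(m.source, m.curation, m.score)
```

Under this ordering, a dataset tagged "auto-curated" would always sort above uncurated sources, regardless of score.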

Teinostoma · 2023-06-20T13:09:24Z

That seems the right level to me.

dimus · 2023-07-25T16:24:46Z

I attempted to detect more elusive typos. It looks like about 25% of the names in the publication are new to https://verifier.globalnames.org/

https://raw.githubusercontent.com/gnames/ds-ruhoff-mollusca/master/data/08-reconsile.csv
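
For readers wondering how near-miss typos can be separated from genuinely new names, here is a rough sketch using Python's standard `difflib`. It is not the method used to build the file above; `classify_names`, the 0.85 cutoff, and the sample lists are assumptions chosen for illustration.

```python
import difflib


def classify_names(candidate_names, known_names, cutoff=0.85):
    """Split names into exact matches, likely typos of known names, and new names."""
    known = set(known_names)
    result = {"exact": [], "near": [], "new": []}
    for name in candidate_names:
        if name in known:
            result["exact"].append(name)
            continue
        # Closest known name by similarity ratio, if any is above the cutoff.
        close = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
        if close:
            result["near"].append((name, close[0]))
        else:
            result["new"].append(name)
    return result


# The hammen/hammeri pair is the OCR error reported earlier in this thread;
# "Teinostoma exampleum" is a made-up name.
known_names = ["Nucula hammeri aalensis", "Architectonica abbottii"]
candidates = [
    "Architectonica abbottii",
    "Nucula hammen aalensis",
    "Teinostoma exampleum",
]
print(classify_names(candidates, known_names))
```

Names in the "near" group are candidates for manual review rather than automatic correction, since two valid names can differ by a single letter.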
