BigBatch processing: issues to cover #73

myrmoteras · 2023-01-03T17:36:15Z

Batch processing of all the treatments should include the following issues

Accession numbers
SubSubSection types normalization
vernacularName
COL IDs on treatment taxon names and cited taxon names

flsimoes · 2023-01-11T15:01:30Z

Taxonomic status normalization (e.g. "sp. nov.", "sp. n." and "n. sp.")
Normalization of Animalia/Metazoa
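
A minimal sketch of what the status normalization could look like. The mapping table and the choice of "sp. nov." as the canonical label are assumptions for illustration; the actual reference vocabulary still has to be agreed on.

```python
import re

# Hypothetical variant table. The canonical form "sp. nov." is an assumption;
# the thread only lists the three spellings as examples of the same status.
CANONICAL_STATUS = {
    "sp. nov.": "sp. nov.",
    "sp. n.": "sp. nov.",
    "n. sp.": "sp. nov.",
}

def normalize_status(raw: str) -> str:
    """Return the canonical label for a known variant, else the stripped input."""
    # Collapse runs of whitespace and ignore case for the lookup only.
    key = re.sub(r"\s+", " ", raw.strip()).lower()
    return CANONICAL_STATUS.get(key, raw.strip())
```

Unknown values are passed through unchanged so they can be reviewed rather than silently rewritten.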

gsautter · 2023-01-12T17:46:06Z

A few more points:

remove generic number annotations
normalize annotation types (see full type list for details)

gsautter · 2023-01-25T09:20:17Z

One more point that just came up in our discussion here in Oslo

annotate individual keywords
add keywords to article stats

flsimoes · 2023-01-26T10:51:38Z

Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead

myrmoteras · 2023-01-26T10:56:08Z

> Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead

No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.

myrmoteras · 2023-01-26T10:59:09Z

Clean up taxonomic status and come up with a reference vocabulary for at least the common terms, like sp. nov.: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=tax.status&groupingFields=tax.status&format=HTML

myrmoteras · 2023-01-26T11:01:36Z

Find a solution to normalize types (typeStatus in material citation attributes) to enable, among other things, more efficient searches: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.typeStatus&groupingFields=matCit.typeStatus&format=HTML

If we do the attributes, we should start using a reference vocabulary there.
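
As a rough sketch of how a reference vocabulary could be applied to typeStatus values: both the vocabulary and the variant table below are assumptions for illustration, not an agreed list. Values that cannot be mapped are counted for manual review instead of being guessed at.

```python
from collections import Counter

# Assumed reference vocabulary; the real list would have to be agreed on.
TYPE_STATUS_VOCABULARY = {"holotype", "paratype", "syntype", "lectotype", "neotype"}

# Hypothetical spelling variants mapped to vocabulary terms.
TYPE_STATUS_VARIANTS = {
    "holotypus": "holotype",
    "paratypes": "paratype",
    "paratypus": "paratype",
}

def normalize_type_status(raw):
    """Return the vocabulary term for a raw value, or None if it does not map."""
    value = raw.strip().lower()
    value = TYPE_STATUS_VARIANTS.get(value, value)
    return value if value in TYPE_STATUS_VOCABULARY else None

def unmatched_values(raw_values):
    """Count raw values that do not map to the vocabulary, for manual review."""
    return Counter(v for v in raw_values if normalize_type_status(v) is None)
```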

gsautter · 2023-01-26T11:04:13Z

> > Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead
>
> No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.

@myrmoteras that's a lot harder to filter by, and far more prone to incurring incorrect cross-article matches ... species="undefined" and species="undetermined" are way easier to catch and filter than a somewhat arbitrary scheme like "sp. 1, sp. 2, sp. 3" or "sp. A, sp. B, sp. C" or "sp. ", as the latter require pattern matching to a certain extent, which vastly complicates query processing.
Plus, we had agreed on the "undefined", "unknown", "undetermined", "uncertain" scheme, and implemented respective mechanics throughout our systems ... it's only older data that we need to catch up to that level.
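
To illustrate the filtering argument: the four placeholder values can be caught with a plain set lookup, while informal labels need a pattern. The regular expression below is an assumption and covers only the variants quoted in this thread; real data will contain more.

```python
import re

# The four placeholder values named above.
PLACEHOLDERS = {"undefined", "unknown", "undetermined", "uncertain"}

# Illustrative pattern for "sp. 1", "sp. A" or a bare "sp."; an assumption,
# not a complete description of the labels found in the corpus.
INFORMAL_SPECIES = re.compile(r"^sp\.\s*([0-9]+|[A-Za-z])?$")

def is_placeholder(species: str) -> bool:
    """Exact match against the fixed placeholder set."""
    return species in PLACEHOLDERS

def is_informal_label(species: str) -> bool:
    """Pattern match for informal labels such as 'sp. 1' or 'sp. A'."""
    return INFORMAL_SPECIES.match(species.strip()) is not None
```

The first check translates directly into an equality or IN filter in a query; the second requires pattern matching for every candidate value.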

flsimoes · 2023-01-26T11:09:07Z

> > > Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead
> >
> > No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.
>
> @myrmoteras that's a lot harder to filter by, and far more prone to incurring incorrect cross-article matches ... species="undefined" and species="undetermined" are way easier to catch and filter than a somewhat arbitrary scheme like "sp. 1, sp. 2, sp. 3" or "sp. A, sp. B, sp. C" or "sp. ", as the latter require pattern matching to a certain extent, which vastly complicates query processing. Plus, we had agreed on the "undefined", "unknown", "undetermined", "uncertain" scheme, and implemented respective mechanics throughout our systems ... it's only older data that we need to catch up to that level.

That's what I had understood as well...

myrmoteras · 2023-01-26T11:10:18Z

@flsimoes @gsautter can you point out a couple of examples, to make sure we are talking about the same thing?

In my view, in the taxonomicName attributes, Corvus sp. 1 should be attributed as follows:

genus = Corvus
species = sp. 1
status = undefined

In taxonomy, "sp. 1" has a connotation: there is a species that has no proper name. It is later referred to as Corvus sp. 1 in Treatment UUID=XXX in publication UUID=YYYY.

This also makes it possible to differentiate between sp. 1 and sp. 2 in a genus in the same publication.
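
A small sketch of the attribute scheme proposed in this comment. The function name and the dict representation are hypothetical, and the input is assumed to be already known to be an informal "Genus sp. N" name.

```python
def attributes_for(name: str) -> dict:
    """Split 'Genus sp. N' into genus, species and status attributes.

    Assumes the caller has already established that the name is an
    informal label; no validation is done here.
    """
    genus, _, label = name.strip().partition(" ")
    return {"genus": genus, "species": label.strip(), "status": "undefined"}
```

Under this scheme `attributes_for("Corvus sp. 1")` and `attributes_for("Corvus sp. 2")` differ in the species attribute, so the two stay distinguishable within one publication.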

flsimoes · 2023-01-26T12:34:08Z

> @flsimoes @gsautter can you point out a couple of examples, to make sure we are talking about the same thing?
>
> In my view, in the taxonomicName attributes, Corvus sp. 1 should be attributed as follows:
>
> genus = Corvus
> species = sp. 1
> status = undefined
>
> In taxonomy, "sp. 1" has a connotation: there is a species that has no proper name. It is later referred to as Corvus sp. 1 in Treatment UUID=XXX in publication UUID=YYYY.
>
> This also makes it possible to differentiate between sp. 1 and sp. 2 in a genus in the same publication.

Taxonomically speaking, I fully agree with your rationale.

We discussed the same point in this issue:
https://github.com/plazi/conversion/issues/13#issuecomment-1072609723

gsautter · 2023-01-26T12:41:52Z

Well, the status attribute normally is for status labels like "spec. nov." or "comb. nov.", and only for that.

And while I do see the citation part, the implied connection between "Corvus sp. 1 Smith 1900" and "Corvus sp. 1 Jones 1986" just doesn't exist, generating a pattern of homonyms that is way harder to filter downstream than a well-defined single placeholder "undefined" ...
Plus, if going for the species in the citation context is the goal, there always is the verbatim annotation value "Corvus sp. 1", so attributes aren't the only way of accessing this.
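
The homonym concern can be sketched with two made-up records. The docId values and the record layout are hypothetical; the point is only that grouping by genus and species alone merges names from unrelated articles.

```python
from collections import defaultdict

# Hypothetical records standing in for two unrelated articles.
RECORDS = [
    {"docId": "doc-smith-1900", "genus": "Corvus", "species": "sp. 1"},
    {"docId": "doc-jones-1986", "genus": "Corvus", "species": "sp. 1"},
]

def group_by(records, *keys):
    """Group document IDs by the given attribute names."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["docId"])
    return dict(groups)

# One group with two documents: an apparent match that does not exist.
by_name = group_by(RECORDS, "genus", "species")

# Two groups: the match disappears once the source document is part of the key.
by_name_and_doc = group_by(RECORDS, "genus", "species", "docId")
```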

gsautter · 2023-01-31T13:23:31Z

Another point: remove URL prefix from DOIs to simplify comparison.
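
A minimal sketch of the prefix removal, assuming the prefixes in the data are the usual doi.org resolver URLs. The DOIs in the usage comment are made up. Lower-casing is added because DOI names are case-insensitive, which keeps comparisons stable.

```python
import re

# Resolver prefixes assumed to occur in the data: http or https,
# with or without the older "dx." host name.
DOI_URL_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)

def bare_doi(value: str) -> str:
    """Strip a resolver URL prefix and lower-case the DOI for comparison."""
    return DOI_URL_PREFIX.sub("", value.strip()).lower()

# Usage with made-up DOIs:
#   bare_doi("https://doi.org/10.1234/Example") == bare_doi("10.1234/example")
```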

gsautter · 2023-02-14T21:25:20Z

Another point:

normalize taxonomicName status attribute to lower case, especially getting rid of all-caps (example: https://tb.plazi.org/GgServer/html/875F87E22D65FFC7D359FA429E1C79F5)
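
In its simplest form this is a one-attribute rewrite. The sketch below assumes the taxonomicName attributes are available as a plain dict, which is an illustration rather than the actual data model.

```python
def lowercase_status(attributes: dict) -> dict:
    """Return a copy of the attributes with 'status' lower-cased, if present."""
    cleaned = dict(attributes)
    if "status" in cleaned:
        cleaned["status"] = cleaned["status"].lower()
    return cleaned
```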

gsautter · 2023-03-06T08:34:04Z

Yet another point:

remove stale approvalRequiredFor_<XYZ> document attributes to free them up for use by user certification authority
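
A sketch of the attribute cleanup, assuming document attributes are available as a dict and that every attribute with this prefix counts as stale at the time of the batch run. The suffix used in the usage comment is hypothetical.

```python
STALE_PREFIX = "approvalRequiredFor_"

def drop_stale_approvals(doc_attributes: dict) -> dict:
    """Return a copy of the document attributes without the prefixed keys."""
    return {
        key: value
        for key, value in doc_attributes.items()
        if not key.startswith(STALE_PREFIX)
    }

# Usage with a made-up attribute name:
#   drop_stale_approvals({"docTitle": "X", "approvalRequiredFor_someUser": "true"})
#   keeps only "docTitle".
```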

myrmoteras · 2023-03-23T18:50:43Z

@flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

tcatapano · 2023-03-23T18:57:57Z

> @flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

@myrmoteras: it's not clear to me what the task is. I haven't been involved in this issue, so I don't have any context.

gsautter · 2023-03-23T19:10:14Z

> @myrmoteras: it's not clear to me what the task is. I haven't been involved in this issue, so I don't have any context.

@tcatapano the whole effort is basically a full-corpus cleanup operation to get rid of erroneous annotation types, normalize values of certain attributes, run now-standard linking jobs and QC on the older portion of our data, etc. The sheer number of documents to process justifies more thorough planning than a 200-document job would require.

Meaning to say: if you can think of anything we kind of did wrong in the past, but never got around to cleaning up the existing mess after resolving the issue for prospective documents, cleaning up the earlier documents is something to add to this list ...

flsimoes · 2023-03-23T19:12:26Z

> @flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

Yes, no further comments on my part.

tcatapano · 2023-03-27T19:13:20Z

I don't think it is worth delaying anything. Go ahead.

gsautter · 2023-10-23T02:05:56Z

All implemented now, just need to (a) figure out how to bundle up all the gizmos and (b) run a few tests afterwards.

myrmoteras · 2023-10-23T02:57:33Z

Thanks Guido. Good luck, and tell us when you need us to check the output. Donat


gsautter · 2023-10-27T16:17:15Z

Intermediate result after some 60 hours: about 12K IMFs processed, another 7K in the feed hopper, basically all IMFs uploaded before Jan 01, 2019 (first half of 2016 alone is 14K IMFs).

Looking good so far: the impact on day-to-day operations is minimal (according to POA), only the export queues are quite full (as expected), and there is the current issue with Zenodo, which is why the updates are currently held back from going there (to be addressed at the sprint).
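
For a rough sense of scale, the figures in this report give a back-of-the-envelope estimate. It assumes throughput stays constant, which the report does not claim.

```python
def estimate_remaining_hours(processed: int, elapsed_hours: float, queued: int) -> float:
    """Estimate hours left for the queue, assuming constant throughput."""
    rate_per_hour = processed / elapsed_hours
    return queued / rate_per_hour

# Figures from the report: about 12K IMFs in some 60 hours, another 7K queued.
rate = 12_000 / 60                                       # 200.0 IMFs per hour
remaining = estimate_remaining_hours(12_000, 60, 7_000)  # 35.0 hours
```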

myrmoteras added the processing label Jan 3, 2023

myrmoteras mentioned this issue May 17, 2023

Reptilia disappeared: does this have any impact on higher taxonomy in TB? #94

Open

myrmoteras assigned gsautter Oct 9, 2023
