BigBatch processing: issues to cover #73

Open
myrmoteras opened this issue Jan 3, 2023 · 23 comments

@myrmoteras
Contributor

myrmoteras commented Jan 3, 2023

Batch processing of all the treatments should cover the following issues:

  • Accession numbers
  • SubSubSection types normalization
  • vernacularName
  • COL IDs on treatment taxon names and cited taxon names
@flsimoes

flsimoes commented Jan 11, 2023

  • Taxonomic status normalization (e.g. "sp. nov.", "sp. n." and "n. sp.")
  • Normalization of Animalia/Metazoa
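
A minimal sketch of what the status normalization could look like, assuming a simple lookup table is enough. The choice of "sp. nov." as the canonical spelling and the function name `normalize_status` are assumptions for illustration, not an agreed vocabulary or existing code.

```python
import re

# Hypothetical mapping from observed variants to one canonical label.
# Which spelling is canonical is an assumption made for this sketch.
CANONICAL_STATUS = {
    "sp. nov.": "sp. nov.",
    "sp. n.": "sp. nov.",
    "n. sp.": "sp. nov.",
    "spec. nov.": "sp. nov.",
}


def normalize_status(raw: str) -> str:
    """Return the canonical status label, or the cleaned input if unknown."""
    # Lower-case and collapse runs of whitespace before the lookup.
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    return CANONICAL_STATUS.get(cleaned, cleaned)
```

Unknown labels pass through unchanged (apart from case and whitespace), so they can be collected and reviewed rather than silently rewritten.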

@gsautter

A few more points:

  • remove generic number annotations
  • normalize annotation types (see full type list for details)

@gsautter

gsautter commented Jan 25, 2023

One more point that just came up in our discussion here in Oslo

  • annotate individual keywords
  • add keywords to article stats

@flsimoes

  • Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead

@myrmoteras
Contributor Author

> • Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead

No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.

@myrmoteras
Contributor Author

Clean up taxonomic status and come up with a reference vocabulary for at least the common terms, like sp. nov.: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=tax.status&groupingFields=tax.status&format=HTML

@myrmoteras
Contributor Author

myrmoteras commented Jan 26, 2023

Find a solution to normalize types (typeStatus in material citation attributes) to enable, among other things, more efficient searches: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.typeStatus&groupingFields=matCit.typeStatus&format=HTML

If we do the attributes, we should start using a reference vocabulary there.
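
A rough sketch of how a reference vocabulary for typeStatus could be applied, assuming a small closed list of terms plus a table of known variants. The vocabulary entries, the variant table, and the function name `normalize_type_status` are all illustrative assumptions; the real list would have to come from the statistics linked above.

```python
from typing import Optional

# Assumed reference vocabulary; the actual term list is still to be agreed.
TYPE_STATUS_VOCABULARY = {"holotype", "paratype", "lectotype", "neotype", "syntype"}

# Hypothetical variant spellings mapped onto vocabulary terms.
TYPE_STATUS_VARIANTS = {
    "holotypus": "holotype",
    "paratypes": "paratype",
    "paratypus": "paratype",
    "syntypes": "syntype",
}


def normalize_type_status(raw: str) -> Optional[str]:
    """Map a raw typeStatus value onto the vocabulary, or None if unknown."""
    key = raw.strip().lower().rstrip(".")
    key = TYPE_STATUS_VARIANTS.get(key, key)
    return key if key in TYPE_STATUS_VOCABULARY else None
```

Returning `None` for values outside the vocabulary keeps them visible for manual review instead of guessing a term.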

@gsautter

>> • Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead
>
> No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.

@myrmoteras that's a lot harder to filter by, and far more prone to incurring incorrect cross-article matches ... species="undefined" and species="undetermined" are way easier to catch and filter than a somewhat arbitrary scheme like "sp. 1, sp. 2, sp. 3" or "sp. A, sp. B, sp. C" or "sp. ", as the latter requires pattern matching to a certain extent, which vastly complicates query processing.
Plus, we had agreed on the "undefined", "unknown", "undetermined", "uncertain" scheme, and implemented the respective mechanics throughout our systems ... it's only older data that we need to bring up to that level.
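
To make the difference concrete, here is a small sketch contrasting the two kinds of check. The placeholder set comes from the scheme named above; the regular expression and both function names are assumptions for illustration and will not cover every ad-hoc label found in older data.

```python
import re

# The agreed placeholder values: a plain set-membership test is enough.
AGREED_PLACEHOLDERS = {"undefined", "unknown", "undetermined", "uncertain"}

# Assumed pattern for ad-hoc labels such as "sp. 1", "sp. A" or a bare "sp.".
ADHOC_LABEL = re.compile(r"^(?:sp|spec)\.\s*(?:\d+|[A-Za-z])?$", re.IGNORECASE)


def is_agreed_placeholder(species: str) -> bool:
    """Exact match against the fixed placeholder vocabulary."""
    return species.strip().lower() in AGREED_PLACEHOLDERS


def looks_like_adhoc_label(species: str) -> bool:
    """Pattern match needed to catch 'sp. 1'-style epithets."""
    return ADHOC_LABEL.match(species.strip()) is not None
```

The first check can be expressed as a simple equality filter in a query, while the second needs pattern matching and has to be kept in sync with whatever labelling schemes authors come up with.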

@flsimoes

>>> • Species defined as "sp. 1", "sp. 2", "sp. 3", etc should be "undetermined" instead
>>
>> No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.
>
> @myrmoteras that's a lot harder to filter by, and far more prone to incurring incorrect cross-article matches ... species="undefined" and species="undetermined" are way easier to catch and filter than a somewhat arbitrary scheme like "sp. 1, sp. 2, sp. 3" or "sp. A, sp. B, sp. C" or "sp. ", as the latter requires pattern matching to a certain extent, which vastly complicates query processing. Plus, we had agreed on the "undefined", "unknown", "undetermined", "uncertain" scheme, and implemented the respective mechanics throughout our systems ... it's only older data that we need to bring up to that level.

That's what I had understood as well...

@myrmoteras
Contributor Author

myrmoteras commented Jan 26, 2023

@flsimoes @gsautter can you point out a couple of examples, to make sure we are talking about the same thing?

In my view, in the taxonomicName attributes, Corvus sp. 1 should be given the following attributes:

  • genus = Corvus
  • species = sp. 1
  • status = undefined

In taxonomy, "sp. 1" has a connotation: there is a species that has no proper name. It is later referred to as Corvus sp. 1 in Treatment UUID=XXX in publication UUID=YYYY.

This also makes it possible to differentiate between sp. 1 and sp. 2 in a genus in the same publication.

@flsimoes

flsimoes commented Jan 26, 2023

> @flsimoes @gsautter can you point out a couple of examples, to make sure we are talking about the same thing?
>
> In my view, in the taxonomicName attributes, Corvus sp. 1 should be given the following attributes:
>
>   • genus = Corvus
>   • species = sp. 1
>   • status = undefined
>
> In taxonomy, "sp. 1" has a connotation: there is a species that has no proper name. It is later referred to as Corvus sp. 1 in Treatment UUID=XXX in publication UUID=YYYY.
>
> This also makes it possible to differentiate between sp. 1 and sp. 2 in a genus in the same publication.

Taxonomically speaking, I fully agree with your rationale.

We discussed the same thing in this issue:
https://github.com/plazi/conversion/issues/13#issuecomment-1072609723

@gsautter

Well, the status attribute normally is for status labels like "spec. nov." or "comb. nov.", and only for that.

And while I do see the citation part, the implied connection between "Corvus sp. 1 Smith 1900" and "Corvus sp. 1 Jones 1986" just doesn't exist, generating a pattern of homonyms that is way harder to filter downstream than a well-defined single placeholder "undefined" ...
Plus, if getting at the species in the citation context is the goal, there always is the verbatim annotation value "Corvus sp. 1", so attributes aren't the only way of accessing this.

@gsautter

Another point: remove URL prefix from DOIs to simplify comparison.
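
A minimal sketch of such a prefix removal, assuming the values carry one of the common resolver prefixes or a `doi:` label. The prefix list and the function name `strip_doi_prefix` are assumptions, and the DOI in the usage note is made up.

```python
import re

# Assumed set of prefixes: doi.org / dx.doi.org resolver URLs and a "doi:" label.
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def strip_doi_prefix(value: str) -> str:
    """Return the bare DOI, leaving values without a known prefix untouched."""
    return DOI_PREFIX.sub("", value.strip())
```

With this, `strip_doi_prefix("https://doi.org/10.1234/example")` and `strip_doi_prefix("doi:10.1234/example")` both yield the same bare string, so the two values compare equal afterwards.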

@gsautter

gsautter commented Feb 14, 2023

Another point:

@gsautter

gsautter commented Mar 6, 2023

Yet another point:

  • remove stale approvalRequiredFor_<XYZ> document attributes to free them up for use by user certification authority
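
As a sketch only, and assuming document attributes can be handled as a plain name-to-value mapping, the removal could be a simple prefix filter. The function name and the attribute names in the usage note are hypothetical; only the `approvalRequiredFor_` prefix comes from the point above.

```python
STALE_PREFIX = "approvalRequiredFor_"


def drop_stale_approval_attributes(attributes: dict) -> dict:
    """Return a copy of the attributes without any approvalRequiredFor_* entries."""
    return {
        name: value
        for name, value in attributes.items()
        if not name.startswith(STALE_PREFIX)
    }
```

For example, a mapping with the made-up keys `docTitle` and `approvalRequiredFor_example` would come back with only `docTitle`; the input mapping itself is not modified.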

@myrmoteras
Contributor Author

@flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

@tcatapano
Member

> @flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

@myrmoteras: it's not clear to me what the task is. I haven't been involved in this issue so don't have any context.

@gsautter

> @myrmoteras: it's not clear to me what the task is. I haven't been involved in this issue so don't have any context.

@tcatapano the whole effort is basically a full-corpus cleanup operation to get rid of erroneous annotation types, normalize values of certain attributes, run now-standard linking jobs and QC on the older portion of our data, etc. The sheer number of documents to process justifies more thorough planning than a 200-document job would require.

Meaning to say: if you can think of anything we kind of did wrong in the past, but never got around to cleaning up the existing mess after resolving the issue for prospective documents, cleaning up the earlier documents is something to add to this list ...

@flsimoes

> @flsimoes @tcatapano please check whether we can go ahead and run the big batch. Please check by end of March 2023.

Yes, no further comments on my part.

@tcatapano
Member

I don't think it is worth delaying anything. Go ahead.

@gsautter

All implemented now, just need to (a) figure out how to bundle up all the gizmos and (b) run a few tests afterwards.

@myrmoteras
Contributor Author

myrmoteras commented Oct 23, 2023 via email

@gsautter

Intermediate result after some 60 hours: about 12K IMFs processed, another 7K in the feed hopper, basically all IMFs uploaded before Jan 01, 2019 (first half of 2016 alone is 14K IMFs).

Looking good so far: the impact on day-to-day operations is minimal (according to POA), only the export queues are (as expected) quite full, and there is the current issue with Zenodo, which is why the updates are currently held back from going there (to be addressed at the sprint).
