Granularity levels, verification and Quality Control #195

Open
flsimoes opened this issue Sep 30, 2022 · 16 comments
@flsimoes
Collaborator

flsimoes commented Sep 30, 2022

We need to define what we understand as Granularity, Verification and Quality Control, and what levels to assign to each of them.
EDIT: QC and Verification are dropped. We'll focus on general granularity + a QCd/Not-QCd tag

Granularity

The level of processing applied to a given document, i.e. the level that the batch (or an individual extraction) reaches.
At the moment, batch processing does not allow partial processing: it activates all the treatment, treatmentCitation, and materialsCitation macros. To apply the levels below, we would need to implement elements that enable the template creator to signal at what point a batch process should stop.
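A minimal sketch of such a stop signal, assuming the batch runs its stages in a fixed order. The stage list mirrors the macros named above; `run_batch`, `stop_after` and the placeholder macros are hypothetical names for illustration, not part of the existing batch tooling.

```python
from typing import Callable, Dict, List, Optional

# Ordered stages, named after the macros the batch currently runs in full.
STAGES: List[str] = ["treatment", "treatmentCitation", "materialsCitation"]


def run_batch(
    doc: Dict[str, list],
    macros: Dict[str, Callable[[Dict[str, list]], None]],
    stop_after: Optional[str] = None,
) -> List[str]:
    """Run the stages in order and stop once `stop_after` has run.

    `stop_after=None` reproduces the current behaviour (all stages run).
    Returns the names of the stages that were executed.
    """
    if stop_after is not None and stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after}")
    executed: List[str] = []
    for stage in STAGES:
        macros[stage](doc)
        executed.append(stage)
        if stage == stop_after:
            break
    return executed


def _make_macro(name: str) -> Callable[[Dict[str, list]], None]:
    """Build a placeholder macro that only records that it ran (hypothetical)."""
    def macro(doc: Dict[str, list]) -> None:
        doc.setdefault("applied", []).append(name)
    return macro


MACROS = {name: _make_macro(name) for name in STAGES}
```

With this shape, a template could carry a single value such as `stop_after="treatment"`, and a batch run without that value would behave as it does today.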

  • Level 0
    • Objective: Make a document Machine-actionable
    • Product:
      • PDF made FAIR and Machine-actionable (e.g. PDF uploaded to BLR/Zenodo)
      • Metadata
    • Checked by default (users need to have the correct metadata to upload it)
  • Level 1
    • Objective: Make IMFs available (JATS), document structure, figures and tables
    • Product:
      • PDF made Machine-readable
      • Figures
      • Tables
    • Checked/Not-Checked: Blocks, paragraphs, line breaks, captions, etc
      • Blocks Transits to: Zenodo, GBIF
    • Templated/Not-templated
  • Level 2
    • Objective: Provide taxonomicNames, sections, treatments, subSubSections and references
    • Product:
      • treatment
        • nomenclature subSubSections (mandatory)
        • other subSubSections
      • taxonomicNames
      • bibliographic references
    • Checked/Not-Checked: boundaries and attributes of subSubSections, taxonomicNames and bibRefs
      • Blocks Transits to:
  • Level 3 (a, b)
    • Objective: Provide treatmentCitations (3a) OR materialsCitations (3b); if both are provided, the level is "3"
    • Product:
      • treatmentCitations (3a)
      • materialsCitations (3b)
    • Checked/Not-Checked: boundaries of treatmentCitations and/or materialsCitations
      • Blocks Transits to:
  • Level 4
    • Objective: Parse treatmentCitations (4a) OR materialsCitations (4b); if both are parsed, the level is "4"
      • Distinction between types-only and full matCit
    • Product:
      • treatmentCitations (4a)
      • materialsCitations (4b)
    • Checked/Not-Checked: attributes/parsing of treatmentCitations and/or materialsCitations
      • Blocks Transits to:
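The level scheme above, including the a/b sub-levels, could be derived from what a document contains. The following is only a sketch under stated assumptions: the function name `granularity_label` and its parameters are hypothetical, and it assumes that a document is labelled with the highest level it reaches (the list above does not say how mixed cases, such as parsed treatmentCitations with unparsed materialsCitations, should be labelled).

```python
from typing import Iterable

# Sub-level letters from the list above: 3a/4a = treatmentCitations,
# 3b/4b = materialsCitations; when both are present the letter is dropped.
CITATION_LETTER = {"treatmentCitation": "a", "materialsCitation": "b"}


def _suffix(kinds: Iterable[str]) -> str:
    """Return "a" or "b" if only one citation type is present, else ""."""
    letters = sorted({CITATION_LETTER[k] for k in kinds if k in CITATION_LETTER})
    return letters[0] if len(letters) == 1 else ""


def granularity_label(
    has_structure: bool,
    has_treatments: bool,
    marked: Iterable[str] = (),
    parsed: Iterable[str] = (),
) -> str:
    """Return "0", "1", "2", "3a", "3b", "3", "4a", "4b" or "4".

    `marked` lists the citation types whose boundaries are annotated (level 3);
    `parsed` lists the citation types whose attributes are parsed (level 4).
    Assumption: the highest level reached wins.
    """
    marked_set, parsed_set = set(marked), set(parsed)
    citation_types = set(CITATION_LETTER)
    if parsed_set & citation_types:
        return "4" + _suffix(parsed_set)
    if marked_set & citation_types:
        return "3" + _suffix(marked_set)
    if has_treatments:
        return "2"
    if has_structure:
        return "1"
    return "0"
```

The QCd/Not-QCd tag mentioned in the EDIT note would be stored next to this label as a separate value, not encoded in it.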

(DEPRECATED)

Quality Control

The amount of verification and parsing Plazi applies to a given document. This QC level system is a translation of what we currently dub "Granularity levels"

  • Level 0 (none)
    • No QC protocol carried out after standard processing
  • Level 1 (low)
    • Standard QC currently applied to the likes of Zootaxa and Phytotaxa. Plazi only clears blocker flags so that publications, treatments and any other products can transit to GBIF, LOD, etc.
  • Level 2 (medium)
    • Currently applied to the MNHN journals (Adansonia, Anthropozoologica, Geodiversitas and Zoosystema).
    • The new specifications need to be discussed, but it will involve looking at treatments, subSubSections and holotypes.
  • Level 3 (high)
    • Currently applied solely to EJT.
    • Highest level of QC, involves checking all treatments, materialsCitations and treatmentCitations.

Verification

The level to which a document has been fully checked by a user. The inspiration for this model is iNaturalist.

  • Not Verified
    • Document was processed, but no one made any further checks (QC level 0)
  • Plazi Verified
  • Expert Verified
    • With this tier we expect to get verification of the accuracy of the extracted data from experts in the related field
@flsimoes
Collaborator Author

flsimoes commented Oct 3, 2022

I'm adding the specifications for each level and further descriptions

@flsimoes
Collaborator Author

flsimoes commented Oct 3, 2022

@myrmoteras please check it out

@flsimoes
Collaborator Author

flsimoes commented Oct 4, 2022

It would also be really useful to have these tags findable through the TB Stats

@gsautter

gsautter commented Oct 4, 2022

It would also be really useful to have these tags findable through the TB Stats

Sure thing ... once they are defined and we start assigning them, that is ... beforehand, there's precious little sense in adding a bunch of empty stats fields.

@flsimoes
Collaborator Author

flsimoes commented Oct 4, 2022

It would also be really useful to have these tags findable through the TB Stats

Sure thing ... once they are defined and we start assigning them, that is ... beforehand, there's precious little sense in adding a bunch of empty stats fields.

Completely agree, and that's why we are here discussing them :)

@myrmoteras
Contributor

@flsimoes let's define the granularity levels, and let's have 3-5 "gold standard" publications where we can show this, and which include the variation of treatments, i.e.

  • for new species,
  • redescriptions with treatment citations,
  • redescriptions with treatment citations including synonymy,
  • redescriptions with treatment citations including synonymy and links to the cited treatments
  • complete and less complete material citations.
  • Articles with 1 to 5 treatments

@flsimoes
Collaborator Author

flsimoes commented Nov 8, 2022

We're currently gathering examples

@myrmoteras
Contributor

myrmoteras commented Nov 22, 2022

example papers:
article stats; treatmentStats; JSON

  1. Cipola, Nikolas G. & Katz, Aron D., 2021, Morphological and molecular analysis of Willowsia nigromaculata (Collembola, Entomobryidae, Entomobryinae) reveals a new cryptic species from the United States, European Journal of Taxonomy 739 (1), pp. 92-116 FFCEFFA71D0DD960FF886D158538FF88
  2. Silva, Ruan Felipe Da, Caron, Edilson & Carvalho-Filho, Fernando Da Silva, 2022, An update on Termitomorpha Wasman (Coleoptera: Staphylinidae) including a new species, species redescriptions and geographic extension, Zootaxa 5205 (1), pp. 1-25
    https://tb.plazi.org/GgServer/summary/9F3EB341FFF2FF82B345D672FF811F58
  3. Subedi, Madan, 2022, A new genus and a new groundhopper species from Nepal (Orthoptera: Tetriginae Skejotettix netrajyoti gen. et sp. nov.), Zootaxa 5205 (1), pp. 35-54
    https://tb.plazi.org/GgServer/summary/737C7525576CFFF6FFBEFF8F3E49FFC1
  4. Cutrim, Marcelo, Silva-Neto, Alberto Moreira da, Rafael, José Albertino & García Aldrete, Alfonso Nery, 2022, The genus Ptiloneura Enderlein, 1901 (Psocodea, ‘Psocoptera’, Ptiloneuridae) in the Brazilian Amazon Forest and Atlantic Forest: new species, variations in forewings and a key to the species, Zoosystema 44 (20), pp. 493-501
    https://tb.plazi.org/GgServer/summary/244CDA14FFD3BB1B2820FF8C5E190A55
  5. Štepánek, Jan & Kirschner, Jan, 2013, A revision of mountain species of the genus Taraxacum F. H. Wigg. (Compositae) in Corsica, Candollea 68 (1), pp. 29-39
    https://tb.plazi.org/GgServer/summary/FFD01C59FFE65F4EBF6AFFD7FFB5CA6C

@flsimoes
Collaborator Author

We need to flesh out a few things, such as the difference between automatic and manual bibRef parsing and the level of detail of matCit parsing.

@myrmoteras
Contributor

@flsimoes here is the milestone "Initial scoping and assessment of optimal degree of automation for processing workflow" in BiCIKL that also deals with granularity issues.

@myrmoteras
Contributor

myrmoteras commented Nov 28, 2022

Annotations for level 4, covering treatments, NOT tables yet.

See also Agosti et al., 2022 for further explanation of the annotations.

annotation | element name / comment
treatment | Treatment
subSubSection | Treatment sub-sections
treatmentCitationGroup | A group of treatment citations for the same taxon concept
treatmentCitation | A citation of a previous treatment
taxonomicName | Scientific name
taxonomicNameLabel | Designator for a new or changed scientific name
materialsCitation | Citation of a physical specimen
collectingCountry | Country where the specimen has been collected
location | Location where the specimen has been collected
date | Date of the collection of the specimen
specimenCode | Code assigned to a specimen by an institution
geoCoordinate | Geographic coordinates
elevation | Elevation of the collection of the specimen
collectorName | Person who collected the specimen
collectionCode | Code of the institution hosting the specimen
accessionCode | Code of the DNA sequence isolated from the specimen

Nesting of the annotations

The tags have specific positions in a treatment.

Nesting levels: annotation > sub-annotation > subsub-annotation > subSubSub-annotation

  • taxonomicName: Scientific name; can occur anywhere in a treatment or article, in all sections, including material citations
  • taxonomicNameLabel: Designator for a new or changed scientific name
  • treatment
    • subSubSection: Treatment sub-sections
      • treatmentCitationGroup: A group of treatment citations for the same taxon concept
        • treatmentCitation: A citation of a previous treatment; can be alone, or nested in a treatmentCitationGroup
      • materialsCitation: Citation of a physical specimen
        • collectingCountry: Country where the specimen has been collected
        • location: Location where the specimen has been collected
        • date
        • specimenCode
        • geoCoordinate
        • elevation
        • collectorName
        • collectionCode
        • accessionCode
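A check of these positions could look like the sketch below. It assumes the nesting reads treatment > subSubSection > treatmentCitationGroup / treatmentCitation / materialsCitation > collection details, and that the annotations are XML elements named as in the table. The `ALLOWED_PARENTS` map, the function `nesting_errors` and the sample document are all hypothetical; elements that are not listed in the table (for example `paragraph`, or `taxonomicName`, which may occur anywhere) are treated as transparent.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

MATCIT_DETAILS = [
    "collectingCountry", "location", "date", "specimenCode", "geoCoordinate",
    "elevation", "collectorName", "collectionCode", "accessionCode",
]

# Annotation -> annotations it may be nested in (assumed reading of the table).
ALLOWED_PARENTS = {
    "subSubSection": {"treatment"},
    "treatmentCitationGroup": {"subSubSection"},
    "treatmentCitation": {"subSubSection", "treatmentCitationGroup"},
    "materialsCitation": {"subSubSection"},
}
ALLOWED_PARENTS.update({name: {"materialsCitation"} for name in MATCIT_DETAILS})

# Annotations that define a nesting context for their descendants.
CONTEXT_TAGS = set(ALLOWED_PARENTS) | {"treatment"}


def nesting_errors(root: ET.Element) -> List[str]:
    """List annotations whose nearest enclosing annotation is not allowed."""
    errors: List[str] = []

    def walk(node: ET.Element, context: Optional[str]) -> None:
        allowed = ALLOWED_PARENTS.get(node.tag)
        if allowed is not None and context not in allowed:
            errors.append(f"{node.tag} found under {context or 'document root'}")
        inner = node.tag if node.tag in CONTEXT_TAGS else context
        for child in node:
            walk(child, inner)

    walk(root, None)
    return errors


# Invented sample: the second geoCoordinate sits outside any materialsCitation.
SAMPLE = """
<document>
  <treatment>
    <subSubSection>
      <paragraph>
        <materialsCitation>
          <collectingCountry>Country X</collectingCountry>
          <geoCoordinate>0.0</geoCoordinate>
        </materialsCitation>
      </paragraph>
    </subSubSection>
  </treatment>
  <paragraph><geoCoordinate>0.0</geoCoordinate></paragraph>
</document>
"""

print(nesting_errors(ET.fromstring(SAMPLE)))
```

Tracking the nearest enclosing annotation, instead of the direct parent element, keeps the check tolerant of structural elements that sit between two annotation levels.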

@myrmoteras
Contributor

example papers: article stats; treatmentStats; JSON

@flsimoes can you please make sure that these examples are all high level and checked - so they work as examples.

@flsimoes
Collaborator Author

We'll make sure of that

@flsimoes
Collaborator Author

All high-level now

@flsimoes
Collaborator Author

flsimoes commented Nov 1, 2023

TaxPub - level-1
plazi/ggxml2taxpub#21

@flsimoes
Collaborator Author

flsimoes commented Nov 1, 2023

From Patrick Ruch

"Document granularity is one of the subjects of BioHackathon n26 that we are organizing; therefore I am cc-ing Alexandre and Julien for, respectively, the Elixir BioHackathon, which is currently going on, and the BioC format used in SIBiLS and displayed in Pam's module.

Let's add it to the agenda for Thursday!"
