Reference database curation workflow to accompany case study on taxonomic harmonization and crowdsourcing

monagrland/taxo-harmo

Curation of vascular plant species list for reference database

A case study illustrating how to link taxonomic identifiers and incorporate Wikidata into a curation workflow. Please see the paper cited below for further details.

Aim: Build a reference database of barcode marker sequences for vascular plants from Germany by matching a species checklist to NCBI taxonIDs.

We use a checklist dataset from the Bundesamt für Naturschutz published on GBIF. This appears to be based on a BfN publication (Buttler, May, Metzing, 2018).

Challenges:

  • The species occurrence checklist is published on GBIF, whereas the sequences are published on NCBI. The two use different taxonomies, so their taxonIDs have to be mapped to each other.
  • A mapping of GBIF to NCBI taxonIDs is not generally available. One is partially crowdsourced on Wikidata, but some errors have been observed.
  • Matching canonical names (binomials) is error-prone because of homonyms, orthographic differences, and differing accepted synonyms.
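As an illustration of the Wikidata part of this mapping, the sketch below builds a SPARQL query that looks up NCBI taxonomy IDs for a list of GBIF taxon IDs. It relies on the Wikidata properties P846 (GBIF taxon ID) and P685 (NCBI taxonomy ID) and on the public query service endpoint. The function names and example IDs are placeholders and are not taken from the notebook, which may retrieve the records differently.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_gbif_to_ncbi_query(gbif_ids):
    """Return a SPARQL query mapping GBIF taxon IDs (P846) to NCBI taxonomy IDs (P685).

    Hypothetical helper for illustration; not part of this repository.
    """
    for gbif_id in gbif_ids:
        # GBIF taxon IDs are numeric; reject anything else before building the query.
        if not str(gbif_id).isdigit():
            raise ValueError(f"not a numeric GBIF taxon ID: {gbif_id!r}")
    values = " ".join(f'"{gbif_id}"' for gbif_id in gbif_ids)
    return (
        "SELECT ?item ?gbif ?ncbi WHERE {\n"
        f"  VALUES ?gbif {{ {values} }}\n"
        "  ?item wdt:P846 ?gbif .\n"
        # OPTIONAL keeps items that have a GBIF ID but no NCBI ID yet.
        "  OPTIONAL { ?item wdt:P685 ?ncbi . }\n"
        "}"
    )


def build_request_url(query):
    """Encode the query as a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Placeholder IDs, used only to show the shape of the query.
example_query = build_gbif_to_ncbi_query(["1111111", "2222222"])
example_url = build_request_url(example_query)
print(example_query)
```

Items returned without an `?ncbi` value are candidates for a missing link on Wikidata, while items whose `?ncbi` value disagrees with the name-based match need manual review.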

To reduce errors, we match names with gndiff, a specialized tool for comparing scientific names that accounts for common issues. In addition to canonical names, we compare authorities and retrieve the Wikidata records corresponding to the GBIF taxonIDs. Unambiguous exact matches (name, authority, and Wikidata all agree) are approved automatically; the remaining name matches are sorted and manually curated to screen out spurious matches.
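The approval rule can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the record fields and the `triage` function are hypothetical, plain string equality stands in for the comparison that gndiff performs, and the taxon IDs are placeholders rather than real NCBI records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateMatch:
    """One checklist name paired with an NCBI candidate (all field names hypothetical)."""
    checklist_name: str
    checklist_authority: str
    ncbi_name: str
    ncbi_authority: str
    ncbi_taxid: str
    # NCBI ID stated on the Wikidata item for the GBIF taxon, if any.
    wikidata_ncbi_taxid: Optional[str]


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def triage(match):
    """Return 'auto-approve' only when name, authority, and Wikidata all agree."""
    name_ok = normalize(match.checklist_name) == normalize(match.ncbi_name)
    authority_ok = normalize(match.checklist_authority) == normalize(match.ncbi_authority)
    wikidata_ok = match.wikidata_ncbi_taxid == match.ncbi_taxid
    return "auto-approve" if (name_ok and authority_ok and wikidata_ok) else "manual"


# Placeholder records: the taxon IDs are made up for illustration.
agreeing = CandidateMatch("Bellis perennis", "L.", "Bellis perennis", "L.", "0001", "0001")
no_wikidata = CandidateMatch("Bellis perennis", "L.", "Bellis perennis", "L.", "0001", None)

print(triage(agreeing))
print(triage(no_wikidata))
```

Requiring all three signals to agree keeps the automatic path conservative: a name-only match, which is where homonyms and differing synonyms cause trouble, always falls through to manual curation.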

Errors in the source databases are also noted: errors in GBIF are reported in their web interface, while Wikidata can be edited directly.
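For Wikidata, corrections can be batched as QuickStatements commands. The sketch below formats one such command in QuickStatements V1 syntax (tab-separated item, property, and value, with string values in double quotes), adding an NCBI taxonomy ID (P685) to an item. The helper name is hypothetical, and the example uses the Wikidata sandbox item with a placeholder ID; any real batch should be reviewed before submission.

```python
def quickstatements_add_ncbi_id(item_qid, ncbi_taxid):
    """Format a QuickStatements V1 command that adds an NCBI taxonomy ID to an item.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not (item_qid.startswith("Q") and item_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {item_qid!r}")
    if not str(ncbi_taxid).isdigit():
        raise ValueError(f"not a numeric NCBI taxonomy ID: {ncbi_taxid!r}")
    # V1 syntax: item, property, and value separated by tabs; strings in double quotes.
    return f'{item_qid}\tP685\t"{ncbi_taxid}"'


# Q4115189 is the Wikidata sandbox item; "12345" is a placeholder, not a curated value.
command = quickstatements_add_ncbi_id("Q4115189", "12345")
print(command)
```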

File organization:

.
├── compare-names.ipynb     # Notebook with curation steps
├── data                    # Input dataset (git ignored)
├── env                     # Conda environments (git ignored)
├── environment.yml         # Conda environment definition file
├── name-match              # Intermediate working files (git ignored)
├── paths.json              # Paths to input files for notebook
├── README.md               # This Readme file
├── resources               # Source databases (git ignored)
└── results                 # Results files for manual curation

Versions of databases used:

Re-running the code

Re-running the code will produce different results because Wikidata is continuously updated. The Wikidata QuickStatements commands in this repository should not be run as-is!

First set up the Conda environment with the required dependencies (we recommend using Mamba instead of Conda for this).

mkdir -p env
mamba env create -f environment.yml -p ./env/taxo-harmo

Download and extract the dataset and database files:

bash get_data.sh

Run the notebook either interactively or at the command line. The following example executes the notebook and writes the output in HTML format. The Conda environment has to be activated first.

conda activate ./env/taxo-harmo
mkdir -p results name-match
jupyter nbconvert --to html --ExecutePreprocessor.kernel_name=python3 \
  --execute compare-names.ipynb --output=compare-names.html

Sources

Bundesamt für Naturschutz / Netzwerk Phytodiversität Deutschland. Flora von Deutschland (Phanerogamen). Occurrence dataset https://doi.org/10.15468/0fxsox accessed via GBIF.org on 2023-03-16.

  • GBIF.org (16 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.gv8n69
  • GBIF dataset e6fab7b3-c733-40b9-8df3-2a03e49532c1
  • Exported in "species list" format to the file 0097412-230224095556074.zip; unzip it into the folder data/

Buttler, Karl P., Rudolf May, and Detlev Metzing. Liste der Gefäßpflanzen Deutschlands. DE: Bundesamt für Naturschutz, 2018.

Citation

Seah B. (2023) Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodiversity Data Journal 11: e114076. https://doi.org/10.3897/BDJ.11.e114076
