# Whip Natuurpunt/Natagora species checklist

This script tests whether a Darwin Core Archive conforms to the defined [whip specifications](https://github.com/inbo/whip).

In [1]:
import pandas as pd
import numpy as np
import yaml
from pywhip import whip_csv
from IPython.display import HTML, display_html

## Read data

In [2]:
taxa_file = "../data/raw/P-CO-000730_TrIAS_Soortenlijst_NaVerwerkingAanwLijsten_Resultaat_dumpTem20220206_V20220513.csv"
dist_file = "../data/raw/P-CO-000730_TrIAS_Aanwlijst_Resultaat_dumpTem20220206_v20220513.csv"

In [3]:
taxa = pd.read_csv(taxa_file, delimiter=";", dtype=object)

In [4]:
dist = pd.read_csv(dist_file, delimiter=";", dtype=object) if dist_file else False

In [5]:
taxa.rename(columns={
    "language": "language",
    "license": "license",
    "rightsholder": "rightsHolder",
    "accessrights": "accessRights",
    "references": "references",
    "datasetid": "datasetID",
    "datasetname": "datasetName",
    "taxonid": "taxonID",
    "scientificname": "scientificName",
    "kingdom": "kingdom",
    "taxonrank": "taxonRank"
}, inplace=True)

In [6]:
# dist is False when no distribution file is configured, so guard the rename
if dist_file:
    dist.rename(columns={
        "taxonid": "taxonID",
        "locationid": "locationID",
        "locality": "locality",
        "countrycode": "countryCode",
        "occurrencestatus": "occurrenceStatus",
        "establishmentmeans": "establishmentMeans",
        "eventdate": "eventDate",
    }, inplace=True)
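The two rename cells above map lower-cased column headers back to the camelCase Darwin Core term names. As a minimal sketch, such a mapping could also be derived from a list of term names instead of being typed out by hand. The helper name `lowercase_mapping` and the shortened term list are illustrative assumptions, not part of the original notebook.

```python
def lowercase_mapping(terms):
    """Map each lower-cased header to its camelCase Darwin Core term."""
    return {term.lower(): term for term in terms}


# Illustrative subset of the terms renamed in the cells above.
DIST_TERMS = ["taxonID", "locationID", "countryCode", "occurrenceStatus"]

mapping = lowercase_mapping(DIST_TERMS)
print(mapping)
```

In the notebook, the result could be passed on as `dist.rename(columns=mapping, inplace=True)`.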

## Some stats

Number of records:

In [7]:
len(taxa)

18785

In [8]:
len(dist) if dist_file else False

52849

In [9]:
dist["eventDate"].min() if dist_file else False

'1643-01-01/2022-01-04'

In [10]:
dist["eventDate"].max() if dist_file else False

'2022-01-28/2022-01-28'
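Because the files are read with `dtype=object`, `eventDate` holds strings, and `min()` and `max()` compare them lexicographically rather than as dates. The sketch below shows one way to obtain the earliest start and latest end instead. It assumes every value has the form `YYYY-MM-DD/YYYY-MM-DD`, which is only confirmed for the two values shown above; the helper name `parse_interval` is hypothetical.

```python
from datetime import date


def parse_interval(value):
    """Split 'YYYY-MM-DD/YYYY-MM-DD' into a (start, end) pair of dates."""
    start, end = value.split("/")
    return date.fromisoformat(start), date.fromisoformat(end)


# The two values shown in the outputs above.
intervals = [
    parse_interval(value)
    for value in ["1643-01-01/2022-01-04", "2022-01-28/2022-01-28"]
]

earliest_start = min(start for start, _ in intervals)
latest_end = max(end for _, end in intervals)
print(earliest_start, latest_end)
```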

In [11]:
taxa["scientificName"].sort_values().unique()

array(['Abax ovalis', 'Abax parallelepipedus', 'Abax parallelus', ...,
       'Zygodon conoideus', 'Zygodon rupestris', 'Zygodon viridissimus'],
      dtype=object)

## Verify data

### Relationships between files

In [12]:
taxa_dist = pd.merge(taxa, dist, how="left") if dist_file else False

Taxa that have no matching distribution record after the merge (empty `countryCode`). Should be `False` or an empty array.

In [13]:
taxa_dist[taxa_dist["countryCode"].isnull()]["taxonID"].unique() if dist_file else False

array([], dtype=object)
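The left merge above only reveals taxa that lack a distribution record. A minimal sketch of the same relationship check in both directions, using plain sets, could look as follows. The function name `unmatched_ids` and the toy IDs are assumptions for illustration, and the reverse direction (distribution records pointing to an unknown taxon) is an addition not present in the original notebook.

```python
def unmatched_ids(taxon_ids, distribution_ids):
    """Return (taxa without distribution, distribution IDs without taxon)."""
    taxon_set = set(taxon_ids)
    dist_set = set(distribution_ids)
    return sorted(taxon_set - dist_set), sorted(dist_set - taxon_set)


# Toy data: t2 has no distribution record, t9 refers to an unknown taxon.
taxa_without_dist, dist_without_taxon = unmatched_ids(
    ["t1", "t2", "t3"],
    ["t1", "t1", "t3", "t9"],
)
print(taxa_without_dist, dist_without_taxon)
```

In the notebook, the inputs would be `taxa["taxonID"]` and `dist["taxonID"]`; both lists should come back empty.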

### Unique IDs

Number of records with a duplicate ID. Should be `0`.

In [14]:
taxa[taxa["taxonID"].duplicated(keep=False)]["taxonID"].sort_values().count()

0
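When the count above is not `0`, it helps to see which IDs are repeated and how often. The following is a small sketch of that follow-up using `collections.Counter`; the function name `duplicated_ids` and the sample IDs are hypothetical.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return a dict of each ID that occurs more than once and its count."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}


# Toy data: "a" occurs three times, "b" twice, "c" once.
dupes = duplicated_ids(["a", "b", "a", "c", "b", "a"])
print(dupes)
```

In the notebook this could be called as `duplicated_ids(taxa["taxonID"])`, which should return an empty dict for a valid checklist.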

## Export data

In [15]:
taxa.to_csv('../data/processed/taxon.tsv', sep='\t', index=False)

In [16]:
dist.to_csv('../data/processed/distribution.tsv', sep='\t', index=False) if dist_file else False

## Whip data

### Taxa

In [17]:
taxa_spec = yaml.safe_load(open("../specification/dwc-taxon.yml").read())

In [18]:
taxa_whipped = whip_csv("../data/processed/taxon.tsv", taxa_spec, delimiter="\t")

Dataset does not comply the specifications, check reportsfor a more detailed information.


In [19]:
display_html(HTML(taxa_whipped.get_report("html")), metadata=dict(isolated=True))

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | https://doi.org/to-be-defined | unallowed value https://doi.org/to-be-defined | 18785 | 1 |
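The report lists each offending value once, with the number of failing rows and the first row where it occurs. To illustrate how such a summary can arise from an "allowed values" rule, here is a rough sketch in plain Python. It is not pywhip's implementation; the function name `check_allowed`, the 1-based row numbering, and the allowed DOI are assumptions made for the example.

```python
def check_allowed(values, allowed):
    """Summarise values outside `allowed`: message, failed rows, first row."""
    report = {}
    for row_number, value in enumerate(values, start=1):
        if value in allowed:
            continue
        # Create the entry at the first failure, then count every failure.
        entry = report.setdefault(value, {
            "message": f"unallowed value {value}",
            "failed_rows": 0,
            "first_row": row_number,
        })
        entry["failed_rows"] += 1
    return report


# Hypothetical allowed value; every row still carries the placeholder DOI.
allowed = {"https://doi.org/10.0000/example"}
values = ["https://doi.org/to-be-defined"] * 3

report = check_allowed(values, allowed)
print(report)
```

With a single placeholder value in every row, as in the report above, the number of failed rows equals the number of records in the file.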


### Distribution

In [20]:
dist_spec = yaml.safe_load(open("../specification/dwc-distribution.yml").read()) if dist_file else False

In [21]:
dist_whipped = whip_csv("../data/processed/distribution.tsv", dist_spec, delimiter="\t") if dist_file else False

Hooray, your data set is according to the guidelines!


In [22]:
display_html(HTML(dist_whipped.get_report("html")), metadata=dict(isolated=True)) if dist_file else False