# Whip a Darwin Core Archive

This script tests whether a Darwin Core Archive conforms to defined [whip specifications](https://github.com/inbo/whip).

1. Define whip specs at `datasets/<dataset_dir>/specs/` (copy/paste and adapt from other datasets)
2. Place an unzipped Darwin Core Archive in the `data` directory
3. Set `dataset_dir` to the name of the dataset directory from which to load the specs
4. Indicate which core and extension files are part of the archive
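As a rough sketch of how the flags in step 4 relate to the files read later in this notebook, the helpers below list the file names implied by the flags and report which ones are missing. The function names `expected_files` and `missing_files` are hypothetical and not part of the notebook; the file names and the `../data` path are the ones used in the cells that follow.

```python
from pathlib import Path


def expected_files(event_core, occ_core, occ_ext, mof_ext):
    """Return the data file names implied by the core/extension flags."""
    files = []
    if event_core:
        files.append("event.txt")
    if occ_core or occ_ext:
        files.append("occurrence.txt")
    if mof_ext:
        files.append("measurementorfact.txt")
    return files


def missing_files(data_dir, names):
    """Return the names that are not present as files in data_dir."""
    return [name for name in names if not (Path(data_dir) / name).is_file()]


# Flags as set in this notebook: event core with occurrence and MoF extensions.
names = expected_files(event_core=True, occ_core=False, occ_ext=True, mof_ext=True)
print(names)
print(missing_files("../data", names))
```

Running this before the read cells would surface a missing or misnamed file early, instead of failing partway through the notebook.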

In [28]:
dataset_dir = "29 meetnetten-dagvlinders-algemene-occurrences"
event_core = True
occ_core = False
occ_ext = True
mof_ext = True

In [29]:
import pandas as pd
import numpy as np
import yaml
from pywhip import whip_csv
from IPython.display import HTML, display_html

In [30]:
occ_core_ext = occ_core or occ_ext

## Read data

In [31]:
event = pd.read_csv("../data/event.txt", delimiter="\t", dtype=object) if event_core else False

In [32]:
occ = pd.read_csv("../data/occurrence.txt", delimiter="\t", dtype=object) if occ_core_ext else False

In [33]:
mof = pd.read_csv("../data/measurementorfact.txt", delimiter="\t", dtype=object) if mof_ext else False

## Some stats

Number of records:

In [34]:
len(event) if event_core else False

69745

In [35]:
len(occ) if occ_core_ext else False

136668

In [36]:
len(mof) if mof_ext else False

96854

In [37]:
event["eventDate"].min() if event_core else occ["eventDate"].min()

'1991-04-01'

In [38]:
event["eventDate"].max() if event_core else occ["eventDate"].max()

'2020-04-05'

In [39]:
occ["scientificName"].sort_values().unique()

array(['Aglais io', 'Aglais urticae', 'Amata phegea',
       'Anthocharis cardamines', 'Apatura iris', 'Aphantopus hyperantus',
       'Araschnia levana', 'Arethusana arethusa', 'Argynnis paphia',
       'Aricia agestis', 'Callophrys rubi', 'Carterocephalus palaemon',
       'Celastrina argiolus', 'Coenonympha pamphilus',
       'Colias alfacariensis', 'Colias croceus', 'Colias hyale',
       'Favonius quercus', 'Gonepteryx rhamni', 'Hesperia comma',
       'Hipparchia semele', 'Issoria lathonia', 'Lasiommata megera',
       'Leptidea sinapis', 'Limenitis camilla', 'Lycaena phlaeas',
       'Maniola jurtina', 'Melitaea cinxia', 'Nymphalis antiopa',
       'Nymphalis polychloros', 'Ochlodes sylvanus', 'Papilio machaon',
       'Pararge aegeria', 'Phengaris alcon', 'Pieris brassicae',
       'Pieris napi', 'Pieris rapae', 'Pieris spec.', 'Plebejus argus',
       'Polygonia c-album', 'Polyommatus coridon', 'Polyommatus icarus',
       'Pyrgus malvae', 'Pyronia tithonus', 'Satyrium ilicis'

In [40]:
occ.groupby(["scientificName","taxonRank","vernacularName"])["occurrenceID"].count().reset_index()

| | scientificName | taxonRank | vernacularName | occurrenceID |
|---|---|---|---|---|
| 0 | Aglais io | species | Dagpauwoog | 7043 |
| 1 | Aglais urticae | species | Kleine vos | 2938 |
| 2 | Amata phegea | species | Phegeavlinder | 8 |
| 3 | Anthocharis cardamines | species | Oranjetipje | 2516 |
| 4 | Apatura iris | species | Grote weerschijnvlinder | 7 |
| 5 | Aphantopus hyperantus | species | Koevinkje | 2970 |
| 6 | Araschnia levana | species | Landkaartje | 6092 |
| 7 | Arethusana arethusa | species | Oranje steppevlinder | 3 |
| 8 | Argynnis paphia | species | Keizersmantel | 3 |
| 9 | Aricia agestis | species | Bruin blauwtje | 1197 |


## Verify data

### Relationships between files

In [41]:
occ_event = pd.merge(occ, event, how="left") if occ_ext else False
mof_event = pd.merge(mof, event, how="left") if mof_ext else False

IDs of extension records that have no matching event after the merge. Should be `False` or an empty array for all.

In [42]:
occ_event[occ_event["type"].isnull()]["id"].unique() if occ_ext else False

array([], dtype=object)

In [43]:
mof_event[mof_event["type"].isnull()]["id"].unique() if mof_ext else False

array([], dtype=object)
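The same kind of check can be sketched with the `indicator` parameter of pandas' `merge`, which labels each row by where it was found. This is an alternative illustration, not the notebook's method: the toy data frames are hypothetical, and merging explicitly on `id` is an assumption about the shared key.

```python
import pandas as pd

# Hypothetical toy data: one occurrence points to an event ID that does not exist.
event = pd.DataFrame({"id": ["e1", "e2"], "type": ["Event", "Event"]})
occ = pd.DataFrame({"id": ["e1", "e2", "e3"], "occurrenceID": ["o1", "o2", "o3"]})

# indicator=True adds a "_merge" column telling where each row was found.
merged = occ.merge(event, on="id", how="left", indicator=True)

# Rows marked "left_only" have no matching event record.
orphans = merged.loc[merged["_merge"] == "left_only", "id"].unique()
print(list(orphans))
```

Unlike testing a column such as `type` for nulls, this does not depend on that column being filled in for every event.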

### Unique IDs

Number of records with duplicate IDs. Should be `False` or `0` for all.

In [44]:
event[event["eventID"].duplicated(keep=False)]["eventID"].sort_values().count() if event_core else False

0

In [45]:
occ[occ["occurrenceID"].duplicated(keep=False)]["occurrenceID"].sort_values().count() if occ_core_ext else False

0
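To see what the duplicate count above measures, here is a minimal sketch on hypothetical data. With `keep=False`, `duplicated` marks every member of a duplicated group, so two rows sharing one ID count as two.

```python
import pandas as pd

# Hypothetical toy data: the ID "o2" appears twice.
occ = pd.DataFrame({"occurrenceID": ["o1", "o2", "o2", "o3"]})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = occ[occ["occurrenceID"].duplicated(keep=False)]

n_duplicate_rows = len(dupes)
duplicate_ids = sorted(dupes["occurrenceID"].unique())
print(n_duplicate_rows, duplicate_ids)
```

Listing the offending IDs alongside the count makes it easier to trace duplicates back to the source data.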

## Whip data

### Event

In [46]:
event_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_event.yaml")) if event_core else False

In [47]:
event_whipped = whip_csv("../data/event.txt", event_spec, delimiter="\t") if event_core else False

Dataset does not comply the specifications, check reportsfor a more detailed information.


In [48]:
display_html(HTML(event_whipped.get_report("html")), metadata=dict(isolated=True)) if event_core else False

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | https://xxxxxxx | unallowed value https://xxxxxxx | 69745 | 1 |


### Occurrence

In [49]:
occ_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_occurrence.yaml")) if occ_core_ext else False

In [50]:
occ_whipped = whip_csv("../data/occurrence.txt", occ_spec, delimiter="\t") if occ_core_ext else False

Hooray, your data set is according to the guidelines!


In [51]:
display_html(HTML(occ_whipped.get_report("html")), metadata=dict(isolated=True)) if occ_core_ext else False

### Measurement or fact

In [55]:
mof_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_mof.yaml")) if mof_ext else False

In [56]:
mof_whipped = whip_csv("../data/measurementorfact.txt", mof_spec, delimiter="\t") if mof_ext else False

Hooray, your data set is according to the guidelines!


In [57]:
display_html(HTML(mof_whipped.get_report("html")), metadata=dict(isolated=True)) if mof_ext else False