# Whip a Darwin Core Archive

This script tests whether a Darwin Core Archive conforms to defined [whip specifications](https://github.com/inbo/whip).

1. Define whip specs at `datasets/<dataset_dir>/specification/` (copy/paste and adapt from other datasets)
2. Place an unzipped Darwin Core Archive in the `data` directory
3. Set `dataset_dir` to the name of the dataset directory from which to load the specs
4. Indicate which core and extension files are part of the archive

In [None]:
dataset_dir = "34 meetnetten-rugstreeppad-zichtwaarneming-occurrences"
event_core = True
occ_core = False
occ_ext = True
mof_ext = False
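
Before reading the data, it can help to confirm that the files implied by these flags are actually present. The following is a minimal sketch and not part of the original workflow: the helper names `expected_files` and `missing_files` are hypothetical, and the file names are taken from the read calls used further down.

```python
from pathlib import Path


def expected_files(event_core, occ_core, occ_ext, mof_ext):
    """Return the archive file names implied by the core/extension flags."""
    files = []
    if event_core:
        files.append("event.txt")
    if occ_core or occ_ext:
        files.append("occurrence.txt")
    if mof_ext:
        files.append("measurementorfact.txt")
    return files


def missing_files(data_dir, files):
    """Return the names in `files` that do not exist in `data_dir`."""
    return [name for name in files if not (Path(data_dir) / name).is_file()]


# Flags as set above: event core with an occurrence extension
files = expected_files(event_core=True, occ_core=False, occ_ext=True, mof_ext=False)
print(files)  # ['event.txt', 'occurrence.txt']
print(missing_files("../data", files))  # empty list when all files are in place
```

An empty list from `missing_files` suggests the archive was unzipped into the expected directory.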

In [None]:
import pandas as pd
import numpy as np
import yaml
from pywhip import whip_csv
from IPython.display import HTML, display_html

In [None]:
occ_core_ext = occ_core or occ_ext

## Read data

In [None]:
event = pd.read_csv("../data/event.txt", delimiter="\t", dtype=object) if event_core else False

In [None]:
occ = pd.read_csv("../data/occurrence.txt", delimiter="\t", dtype=object) if occ_core_ext else False

In [None]:
mof = pd.read_csv("../data/measurementorfact.txt", delimiter="\t", dtype=object) if mof_ext else False

## Some stats

Number of records:

In [None]:
len(event) if event_core else False

In [None]:
len(occ) if occ_core_ext else False

In [None]:
len(mof) if mof_ext else False

In [None]:
event["eventDate"].min() if event_core else occ["eventDate"].min()

In [None]:
event["eventDate"].max() if event_core else occ["eventDate"].max()

In [None]:
occ["scientificName"].unique() if occ_core_ext else False

In [None]:
occ.groupby(["scientificName", "taxonRank", "vernacularName"])["occurrenceID"].count().reset_index() if occ_core_ext else False

## Verify data

### Relationships between files

In [None]:
occ_event = pd.merge(occ, event, how="left") if occ_ext else False
mof_event = pd.merge(mof, event, how="left") if mof_ext else False

IDs of extension records that have no matching event after merging (their `type` is empty). The result should be empty for all.

In [None]:
occ_event[occ_event["type"].isnull()]["id"].unique() if occ_ext else False

In [None]:
mof_event[mof_event["type"].isnull()]["id"].unique() if mof_ext else False
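
As an alternative way to see which extension records lack a parent event, pandas can label each merged row with its origin through the `indicator` parameter of `merge`. The sketch below uses made-up toy data and assumes the files are linked on the `id` column, as in the checks above. It is an illustration only, not the method used in this notebook.

```python
import pandas as pd

# Toy data (invented for illustration): event "e3" does not exist in the core
event_toy = pd.DataFrame({"id": ["e1", "e2"], "type": ["Event", "Event"]})
occ_toy = pd.DataFrame({"id": ["e1", "e2", "e3"], "occurrenceID": ["o1", "o2", "o3"]})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = pd.merge(occ_toy, event_toy, how="left", on="id", indicator=True)

# Extension records whose id has no match in the event core
orphan_ids = merged.loc[merged["_merge"] == "left_only", "id"].unique().tolist()
print(orphan_ids)  # ['e3']
```

Compared with testing `type` for nulls, this approach does not depend on a particular event column being filled in.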

### Unique IDs

Number of records with duplicate IDs. Should be 0 for all.

In [None]:
event[event["eventID"].duplicated(keep=False)]["eventID"].sort_values().count() if event_core else False

In [None]:
occ[occ["occurrenceID"].duplicated(keep=False)]["occurrenceID"].sort_values().count() if occ_core_ext else False
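
To make explicit what these checks count: `duplicated(keep=False)` flags every member of a duplicate group, not only the repeats. A small sketch on made-up IDs (the values are invented for illustration):

```python
import pandas as pd

# Toy IDs: "a" appears twice, "b" and "c" once
ids = pd.Series(["a", "b", "a", "c"], name="occurrenceID")

# keep=False marks all occurrences of a duplicated value as True
is_dup = ids.duplicated(keep=False)
print(int(is_dup.sum()))  # 2

# Listing the offending IDs with their frequency helps locate the problem
dup_counts = ids[is_dup].value_counts()
print(dup_counts.to_dict())  # {'a': 2}
```

A non-zero count therefore reports the total number of affected records, so a single duplicated ID contributes at least 2.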

## Whip data

### Event

In [None]:
event_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specification/dwc-event.yaml")) if event_core else False

In [None]:
event_whipped = whip_csv("../data/event.txt", event_spec, delimiter="\t") if event_core else False

In [None]:
display_html(HTML(event_whipped.get_report("html")), metadata=dict(isolated=True)) if event_core else False

### Occurrence

In [None]:
occ_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specification/dwc-occurrence.yaml")) if occ_core_ext else False

In [None]:
occ_whipped = whip_csv("../data/occurrence.txt", occ_spec, delimiter="\t") if occ_core_ext else False

In [None]:
display_html(HTML(occ_whipped.get_report("html")), metadata=dict(isolated=True)) if occ_core_ext else False

### Measurement or fact

In [None]:
mof_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specification/dwc-mof.yaml")) if mof_ext else False

In [None]:
mof_whipped = whip_csv("../data/measurementorfact.txt", mof_spec, delimiter="\t") if mof_ext else False

In [None]:
display_html(HTML(mof_whipped.get_report("html")), metadata=dict(isolated=True)) if mof_ext else False