# Introduction

[pywhip](https://inbo.github.io/pywhip/) provides the ability to **validate a dataset** and receive a **report to identify potential issues** using [whip specifications](https://github.com/inbo/whip), a human- and machine-readable syntax to express specifications for data.

This notebook introduces the pywhip functionality. We first walk through the pywhip workflow using a dummy example. Next, we show its use in a real-world example from the [TrIAS project](https://github.com/trias-project/). Finally, we briefly discuss some envisioned applications of pywhip.

If you want to run the code in the code blocks (called "cells"), select the cell and press `Shift + Enter` (or use `Cell > Run Cells` from the menu at the top).

In [1]:
import yaml
from pywhip import whip_csv

## pywhip workflow with dummy data

Assume we received the following dataset from a research project that took place in Belgium and The Netherlands between 2016 and 2018:

In [2]:
!head ../docs/_static/observations_data.csv

eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL

Within the scope of the project, we do know the following about the data:

- The project took place in Belgium (BE) and The Netherlands (NL)
- The project ran from 2016 until 2018, so date values should be in this range
- Individual counts cannot be higher than 100 and should be at least 1
- Empty values are not allowed

These specifications can be translated to [whip specifications](https://github.com/inbo/whip):

In [3]:
project_specs = """
    country:
        allowed: [BE, NL]
    eventDate:
        dateformat: '%Y-%m-%d'
        mindate: 2016-01-01
        maxdate: 2018-12-31
    individualCount:
        numberformat: x  # needs to be an integer value
        min: 1
        max: 100
"""
# safe_load needs no explicit Loader argument and only builds plain Python objects
specifications = yaml.safe_load(project_specs)
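To make explicit what these rules express, the sketch below applies the same four checks to the sample rows in plain Python. It only illustrates the meaning of the specifications and is not how pywhip works internally; the helper names `check_row` and `check_csv` are hypothetical, and the messages do not mirror pywhip's wording.

```python
import csv
import io
from datetime import date, datetime

# Sample rows copied from the dataset preview above.
SAMPLE_CSV = """eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL
"""


def check_row(row):
    """Return a list of problems found in one row (hypothetical helper)."""
    # Treat missing cells as empty strings so every check sees a string.
    row = {field: (value or "") for field, value in row.items()}
    problems = []
    # Empty values are not allowed in any field.
    for field, value in row.items():
        if value.strip() == "":
            problems.append(f"{field}: empty value")
    # country must be one of the allowed codes.
    if row["country"] not in {"BE", "NL"}:
        problems.append(f"country: unallowed value {row['country']}")
    # eventDate must parse as %Y-%m-%d and fall inside the project period.
    try:
        event_date = datetime.strptime(row["eventDate"], "%Y-%m-%d").date()
    except ValueError:
        problems.append("eventDate: does not match %Y-%m-%d")
    else:
        if not date(2016, 1, 1) <= event_date <= date(2018, 12, 31):
            problems.append(f"eventDate: {row['eventDate']} outside 2016-2018")
    # individualCount must be an integer between 1 and 100.
    try:
        count = int(row["individualCount"])
    except ValueError:
        problems.append("individualCount: not an integer")
    else:
        if not 1 <= count <= 100:
            problems.append(f"individualCount: {count} outside 1-100")
    return problems


def check_csv(text):
    """Map 1-based data row numbers to their problems, skipping clean rows."""
    reader = csv.DictReader(io.StringIO(text))
    results = {}
    for number, row in enumerate(reader, start=1):
        problems = check_row(row)
        if problems:
            results[number] = problems
    return results


failures = check_csv(SAMPLE_CSV)
for number, problems in failures.items():
    print(number, problems)
```

Writing the checks by hand like this ties the rules to the code. Expressing them as whip specifications keeps them readable and reusable across datasets.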

**pywhip** provides the ability to **test** these specifications:

In [4]:
observations_whip = whip_csv("../docs/_static/observations_data.csv", specifications, delimiter=',')

Dataset does not comply the specifications, check reports by using the `get_report` method for a more detailed information.


and **report** the issues to the user:

In [5]:
html_report = observations_whip.get_report('html')

Using the built-in functionality of the Jupyter Notebook, we can show the HTML report inline here, but this page could be served elsewhere as well:

In [6]:
from IPython.display import HTML, display_html
display_html(HTML(html_report), 
             metadata=dict(isolated=True))

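| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | BA | unallowed value BA | 1 | 1 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | 1018-01-08 | date '1018-01-08' is before min limit '2016-01-01' | 1 | 5 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | 3300 | max value is 100 | 1 | 3 |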


The report can also be served in JSON format, allowing integration with other websites and services:

In [7]:
observations_whip.get_report('json')

{'errors': [],
 'executed_at': '2018-08-28 10:49',
 'results': {'failed_rows': 3,
  'passed_row_ids': [2, 4],
  'passed_rows': 2,
  'specified_fields': {'country': {'allowed': {'constraint': 'BE, NL',
     'failed_rows': 1,
     'passed_rows': 4,
     'samples': {'BA': {'failed_rows': 1,
       'first_row': 1,
       'message': 'unallowed value BA'}}},
    'empty': {'constraint': 'False',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}}},
   'eventDate': {'dateformat': {'constraint': '%Y-%m-%d',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'empty': {'constraint': 'False',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'maxdate': {'constraint': '2018-12-31',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'mindate': {'constraint': '2016-01-01',
     'failed_rows': 1,
     'passed_rows': 4,
     'samples': {'1018-01-08': {'failed_rows': 1,
       'first_row': 5,
       'message': "date '1018-01-08' 
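Because the report is a nested dictionary, it can be summarized with a few lines of Python. The sketch below assumes the structure shown in the output above (`results` > `specified_fields` > field > rule) and works on a trimmed excerpt of it. The `mindate` message is taken from the HTML report, since the JSON output above is cut off. The helper name `failing_rules` is hypothetical.

```python
# Trimmed excerpt of the report structure shown above; the full report
# contains more fields and rules.
report = {
    "results": {
        "failed_rows": 3,
        "passed_rows": 2,
        "passed_row_ids": [2, 4],
        "specified_fields": {
            "country": {
                "allowed": {
                    "constraint": "BE, NL",
                    "failed_rows": 1,
                    "passed_rows": 4,
                    "samples": {
                        "BA": {"failed_rows": 1, "first_row": 1,
                               "message": "unallowed value BA"},
                    },
                },
                "empty": {"constraint": "False", "failed_rows": 0,
                          "passed_rows": 5, "samples": {}},
            },
            "eventDate": {
                "mindate": {
                    "constraint": "2016-01-01",
                    "failed_rows": 1,
                    "passed_rows": 4,
                    "samples": {
                        "1018-01-08": {
                            "failed_rows": 1,
                            "first_row": 5,
                            "message": "date '1018-01-08' is before "
                                       "min limit '2016-01-01'",
                        },
                    },
                },
            },
        },
    }
}


def failing_rules(report):
    """Yield (field, rule, failed_rows, messages) for each rule with failures."""
    fields = report["results"]["specified_fields"]
    for field, rules in fields.items():
        for rule, outcome in rules.items():
            if outcome["failed_rows"] > 0:
                messages = [s["message"] for s in outcome["samples"].values()]
                yield field, rule, outcome["failed_rows"], messages


summary = list(failing_rules(report))
for field, rule, failed, messages in summary:
    print(f"{field}.{rule}: {failed} failed row(s): {'; '.join(messages)}")
```

In a notebook session, the same function could be applied to the value returned by `observations_whip.get_report('json')`, provided it has the shape shown above.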

## Real-world case

The [Tracking Invasive Alien Species (TrIAS)](http://trias-project.be) project aims to build an open data-driven framework to support policy on invasive species. One of its aims is to standardize and publish occurrence and checklist data so these can be harvested by [GBIF](http://www.gbif.org/).

The [GBIF data validator](https://www.gbif.org/tools/data-validator) makes it possible to validate TrIAS data for and within the context of GBIF. However, those tests are predefined. Using whip + pywhip, the publisher can **define, document and test their own rules** (whether these are generic or specific), complementing what the GBIF data validator offers.

As an example, the [alien-macroinvertebrates repository](https://github.com/trias-project/alien-macroinvertebrates) contains the functionality to standardize the data of Boets et al. (2016) to both a Darwin Core checklist and a Darwin Core occurrence dataset. For each of the datasets, whip specifications were defined and added to the repository, see [this link](https://github.com/trias-project/alien-macroinvertebrates/tree/master/specification).

We can use these specifications to validate each of the datasets. To illustrate this, we'll use the (small) `taxon.csv` dataset.

First, we read the specifications from the URL:

In [8]:
import requests
alien_macroinvertebrates_yaml = 'https://raw.githubusercontent.com/trias-project/alien-macroinvertebrates/ccd9025adfed3ee6a710da73213d439f5ac89506/specification/dwc_taxon.yaml'
response = requests.get(alien_macroinvertebrates_yaml)
response.raise_for_status()  # stop early if the download failed
alien_macroinvertebrates_specifications = yaml.safe_load(response.text)

A user working on the data can check the data against the specifications:

In [9]:
checklist_whip = whip_csv("alien-taxon.csv", 
                          alien_macroinvertebrates_specifications, delimiter=',')

Dataset does not comply the specifications, check reports by using the `get_report` method for a more detailed information.


And check the report:

In [10]:
display_html(HTML(checklist_whip.get_report('html')), 
             metadata=dict(isolated=True))

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | Tubficida | unallowed value Tubficida | 9 | 22 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | alien-macroinvertebrates-checklist:taxon:06b1921ec41577a8c4516c91837ea594 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 3 |
| 2 | alien-macroinvertebrates-checklist:taxon:b874beef91a87e42a7512e6d93de1b89 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 49 |
| 3 | alien-macroinvertebrates-checklist:taxon:64adea5e00ac03a99d28288ebc0e52dd | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 25 |
| 4 | alien-macroinvertebrates-checklist:taxon:15e890b3c78f52beed30d155c9eb7010 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 7 |
| 5 | alien-macroinvertebrates-checklist:taxon:b74cb0f47f65095d9d919e2a9c1fc61c | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 48 |
| 6 | alien-macroinvertebrates-checklist:taxon:5a7d3a2d0f5058bef64c0e463c16f835 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 22 |
| 7 | alien-macroinvertebrates-checklist:taxon:561d4e5573f8471fb2a071995dec3d52 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 20 |
| 8 | alien-macroinvertebrates-checklist:taxon:563893a5246109760d8b15c5692dab7f | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 21 |
| 9 | alien-macroinvertebrates-checklist:taxon:4e75374d4157cc047fea212a1f0774c7 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 18 |
| 10 | alien-macroinvertebrates-checklist:taxon:e290ce86d71a55d91f917e211570869d | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 64 |


The report provides a quick overview of the issues that require action, similar to the reports of the current GBIF data validator.
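Several of the reported identifiers have a digit right after `taxon:` and still fail the regex rule. One plausible reading, which is our assumption and not something the report states, is that the pattern has to match the whole value and not just its beginning. The sketch below reproduces that difference with Python's `re` module, using the pattern and the first identifier from the report. `HEX_PATTERN` is a hypothetical alternative and not the pattern used in the repository.

```python
import re

# Pattern and identifier copied from the report above.
PATTERN = r"alien-macroinvertebrates-checklist:taxon:\d"
VALUE = "alien-macroinvertebrates-checklist:taxon:06b1921ec41577a8c4516c91837ea594"

# Prefix matching succeeds: the character right after "taxon:" is a digit.
prefix_ok = re.match(PATTERN, VALUE) is not None

# Whole-value matching fails: \d covers a single character, while the
# identifier continues with 31 more characters.
full_ok = re.fullmatch(PATTERN, VALUE) is not None

# Hypothetical pattern that accepts a 32-character lowercase hex suffix.
HEX_PATTERN = r"alien-macroinvertebrates-checklist:taxon:[0-9a-f]{32}"
hex_ok = re.fullmatch(HEX_PATTERN, VALUE) is not None

print(prefix_ok, full_ok, hex_ok)
```

If whole-value matching is indeed what happens, the report points at a specification that is stricter than the identifiers in the data. It is then up to the publisher to decide whether to adjust the specification or the identifiers.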

## pywhip envisioned applications

In essence, pywhip is a data validation tool, but one where you can define your own rules (in [whip](https://github.com/inbo/whip)). We envision the usefulness of whip + pywhip in a number of applications:

- Data publishers can use whip to **document decisions** made about data representation/standardization (e.g. in discussions with the data owner). These specifications could even be included in the published dataset (e.g. in a Darwin Core Archive). Using pywhip, they can test these specifications in an iterative process to improve the standardization and quality of the published dataset.
- Data users can use whip to express their custom or **community-defined** data quality needs. Using pywhip, data that meets those requirements can be _filtered_ and extracted.
- Pywhip functionalities could be **integrated** into the [GBIF data validator](https://www.gbif.org/tools/data-validator), so that in addition to the predefined rules, user-defined rules could be tested and reported upon as well.
- The TDWG [Biodiversity Data Quality (BDQ) Interest Group](https://github.com/tdwg/bdq) is defining a fixed set of _Tests and Assertions_ to assess data quality and provide a common ground for data aggregators to report these issues. Whip could be considered to define existing or additional conformance tests.
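The filtering idea from the second bullet can be sketched with the `passed_row_ids` entry of the JSON report shown in the dummy example. The sketch assumes that row ids count data rows from 1, excluding the header. This is inferred from the `first_row` values in that report and not from documented behaviour. The helper name `keep_passed_rows` is hypothetical.

```python
import csv
import io

# Dataset preview and passed row ids copied from the dummy example above.
SAMPLE_CSV = """eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL
"""
PASSED_ROW_IDS = [2, 4]


def keep_passed_rows(text, passed_row_ids):
    """Return CSV text with the header and only the rows that passed.

    Assumes row ids count data rows from 1 (header excluded).
    """
    wanted = set(passed_row_ids)
    reader = csv.reader(io.StringIO(text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header unchanged
    for number, row in enumerate(reader, start=1):
        if number in wanted:
            writer.writerow(row)
    return output.getvalue()


filtered = keep_passed_rows(SAMPLE_CSV, PASSED_ROW_IDS)
print(filtered)
```

Only the two rows that satisfy every rule remain, so a data user could continue an analysis with the subset that meets the stated quality requirements.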