In [1]:
import yaml
from pywhip import whip_csv

# Introduction

[pywhip](https://inbo.github.io/pywhip/) provides the ability to **validate a data set** and **receive a report** that identifies potential issues, using [whip specifications](https://github.com/inbo/whip), a human- and machine-readable syntax to express specifications for data.

In this notebook, we introduce the pywhip functionalities using different data examples...

#### TODO

## Introductory example

From a research project that ran from 2016 to 2018 in Belgium and the Netherlands, we received the following data set:

In [2]:
!head ../docs/_static/observations_data.csv

eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL

Within the scope of the project, we do know the following about the data:

- The project ran from 2016 until 2018, so date values should be in this range
- The project took place in Belgium (BE) and the Netherlands (NL)
- Individual counts cannot be higher than 100 and should be at least 1
- Empty values are not allowed

These specifications can be translated to appropriate [whip specifications](https://github.com/inbo/whip):

In [3]:
project_specs = """
    country:
        allowed: [BE, NL]
    eventDate:
        dateformat: '%Y-%m-%d'
        mindate: 2016-01-01
        maxdate: 2018-12-31
    individualCount:
        numberformat: x  # needs to be an integer value
        min: 1
        max: 100
"""
# safe_load only builds plain Python objects and needs no explicit Loader
specifications = yaml.safe_load(project_specs)
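
To make the meaning of these rules concrete, the following is a minimal plain-Python sketch, not pywhip's implementation, that applies equivalent checks to the five sample rows shown earlier. The helper name `check_row` and the issue messages are illustrative assumptions.

```python
import csv
import io
from datetime import date, datetime

# Sample rows copied from the data set shown earlier.
SAMPLE = """eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL
"""


def check_row(row):
    """Return a list of rule violations for one record (hypothetical helper)."""
    issues = []
    # country: only BE and NL are allowed
    if row["country"] not in ("BE", "NL"):
        issues.append("country: unallowed value")
    # eventDate: must parse as %Y-%m-%d and fall within the project period
    try:
        day = datetime.strptime(row["eventDate"], "%Y-%m-%d").date()
        if not date(2016, 1, 1) <= day <= date(2018, 12, 31):
            issues.append("eventDate: out of range")
    except ValueError:
        issues.append("eventDate: bad format")
    # individualCount: must be an integer between 1 and 100
    count = row["individualCount"]
    if not count.isdigit() or not 1 <= int(count) <= 100:
        issues.append("individualCount: out of range or not an integer")
    return issues


# Collect the violations per 1-based data row number.
failed = {}
for number, row in enumerate(csv.DictReader(io.StringIO(SAMPLE)), start=1):
    issues = check_row(row)
    if issues:
        failed[number] = issues

print(failed)
```

Rows 1, 3 and 5 are flagged, corresponding to the `BA` country code, the count of 3300 and the year 1018 in the sample.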

**pywhip** provides the ability to **test** these specifications:

In [4]:
observations_whip = whip_csv("../docs/_static/observations_data.csv",
                             specifications, delimiter=',')

Dataset does not comply the specifications, check reports by using the `get_report` method for a more detailed information.


and **report** the issues to the user:

In [5]:
html_report = observations_whip.get_report('html')

Using the built-in functionality of the Jupyter notebook, we can show the HTML report inline here, but this page could be served elsewhere as well:

In [6]:
from IPython.display import HTML, display_html
display_html(HTML(html_report), 
             metadata=dict(isolated=True))

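| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | BA | unallowed value BA | 1 | 1 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | 1018-01-08 | date '1018-01-08' is before min limit '2016-01-01' | 1 | 5 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | 3300 | max value is 100 | 1 | 3 |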


By requesting the information in JSON format, integration with other websites or services is possible as well:

In [7]:
observations_whip.get_report('json')

{'errors': [],
 'executed_at': '2018-08-27 15:08',
 'results': {'failed_rows': 3,
  'passed_row_ids': [2, 4],
  'passed_rows': 2,
  'specified_fields': {'country': {'allowed': {'constraint': 'BE, NL',
     'failed_rows': 1,
     'passed_rows': 4,
     'samples': {'BA': {'failed_rows': 1,
       'first_row': 1,
       'message': 'unallowed value BA'}}},
    'empty': {'constraint': 'False',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}}},
   'eventDate': {'dateformat': {'constraint': '%Y-%m-%d',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'empty': {'constraint': 'False',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'maxdate': {'constraint': '2018-12-31',
     'failed_rows': 0,
     'passed_rows': 5,
     'samples': {}},
    'mindate': {'constraint': '2016-01-01',
     'failed_rows': 1,
     'passed_rows': 4,
     'samples': {'1018-01-08': {'failed_rows': 1,
       'first_row': 5,
       'message': "date '1018-01-08' 
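
The nested structure of this report lends itself to programmatic summaries. The sketch below assumes the report is a Python dict shaped like the output above; the `report` literal is a small fragment copied from that output, and `failing_rules` is a hypothetical helper name, not part of pywhip.

```python
# Fragment of the report shown above, reduced to the `country` field.
report = {
    "results": {
        "specified_fields": {
            "country": {
                "allowed": {
                    "constraint": "BE, NL",
                    "failed_rows": 1,
                    "passed_rows": 4,
                    "samples": {
                        "BA": {
                            "failed_rows": 1,
                            "first_row": 1,
                            "message": "unallowed value BA",
                        }
                    },
                },
                "empty": {
                    "constraint": "False",
                    "failed_rows": 0,
                    "passed_rows": 5,
                    "samples": {},
                },
            }
        }
    }
}


def failing_rules(report):
    """Yield (field, rule, failed_rows) for every rule with failures."""
    for field, rules in report["results"]["specified_fields"].items():
        for rule, outcome in rules.items():
            if outcome["failed_rows"] > 0:
                yield field, rule, outcome["failed_rows"]


summary = list(failing_rules(report))
print(summary)
```

A summary like this could, for example, feed a dashboard or a continuous integration check that fails when any rule reports failed rows.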

## Real-world case

The Tracking Invasive Alien Species (TrIAS) project aims to build an open data-driven framework to support policy on invasive species. Part of the project consists of standardization of data towards checklist and occurrence data sets that can be harvested by [GBIF](http://www.gbif.org/).

The [GBIF data validator](https://www.gbif.org/tool/81281/gbif-data-validator) reports on the syntactical correctness and the validity of the content of a data set to identify potential issues. Still, more specific guidelines and specifications apply to this particular project and its data sets. By extending the current data validation with a whip-based validation using pywhip, the general checks and reporting can be combined with the pywhip-generated reporting.

As an example, the [alien-macroinvertebrates repository](https://github.com/trias-project/alien-macroinvertebrates) contains the functionality to standardize the data of Boets et al. (2016) to both a Darwin Core checklist and a Darwin Core occurrence data set. For each of the data sets, whip specifications were defined and added to the repository, see [this link](https://github.com/trias-project/alien-macroinvertebrates/tree/master/specification).

We can use these specifications to validate the status of each of the data sets. To illustrate this, we'll use the (small) taxon.csv data set:

Reading the specifications from the URL:

In [38]:
import requests
alien_macroinvertebrates_yaml = 'https://raw.githubusercontent.com/trias-project/alien-macroinvertebrates/ccd9025adfed3ee6a710da73213d439f5ac89506/specification/dwc_taxon.yaml'
response = requests.get(alien_macroinvertebrates_yaml)
response.raise_for_status()  # stop early if the download failed
alien_macroinvertebrates_specifications = yaml.safe_load(response.text)

A user working on the data can check the data against the specifications:

In [39]:
checklist_whip = whip_csv("alien-taxon.csv", 
                          alien_macroinvertebrates_specifications, delimiter=',')

Dataset does not comply the specifications, check reports by using the `get_report` method for a more detailed information.


And check the report:

In [35]:
display_html(HTML(checklist_whip.get_report('html')), 
             metadata=dict(isolated=True))

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | Tubficida | unallowed value Tubficida | 9 | 22 |

| # | Data value | Message | Failed rows | First row |
|---|---|---|---|---|
| 1 | alien-macroinvertebrates-checklist:taxon:b557e6ee22bcbca0621096f94b9eab42 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 45 |
| 2 | alien-macroinvertebrates-checklist:taxon:3db73d0aa90d26487c229ea067087531 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 15 |
| 3 | alien-macroinvertebrates-checklist:taxon:54cca150e1e0b7c0b3f5b152ae64d62b | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 19 |
| 4 | alien-macroinvertebrates-checklist:taxon:88d79edc97d544cd21a0923c87025404 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 34 |
| 5 | alien-macroinvertebrates-checklist:taxon:78b7dd71b7f4d54200f0ea10d3b232de | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 31 |
| 6 | alien-macroinvertebrates-checklist:taxon:ad8ff08319ff0e9051102e6d1722d423 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 41 |
| 7 | alien-macroinvertebrates-checklist:taxon:771544745a06d6420acf65d4caf558d3 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 30 |
| 8 | alien-macroinvertebrates-checklist:taxon:85ea227fedd73b5f44e27d7499bed4b3 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 33 |
| 9 | alien-macroinvertebrates-checklist:taxon:e9666c346f1d57a5c70f3565b265c7a6 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 65 |
| 10 | alien-macroinvertebrates-checklist:taxon:b25905a58f0d4cc12455803326d0d7e0 | value does not match regex 'alien-macroinvertebrates-checklist:taxon:\d' | 1 | 43 |
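
Several of the failing identifiers start with a digit, which suggests that the `regex` rule requires the complete value to match rather than only its beginning. This is an assumption inferred from the report, not confirmed pywhip behaviour. The sketch below uses Python's `re` module to show the difference, together with a hypothetical stricter pattern (`HEX_ID`, not part of the repository specifications) that would accept the 32-character hexadecimal identifiers.

```python
import re

# Pattern as quoted in the report messages.
PATTERN = r"alien-macroinvertebrates-checklist:taxon:\d"

# One of the reported values; its identifier starts with the digit 3.
value = "alien-macroinvertebrates-checklist:taxon:3db73d0aa90d26487c229ea067087531"

# Matching only the start of the value succeeds...
prefix_match = re.match(PATTERN, value) is not None

# ...but requiring the whole value to match fails, because the pattern
# allows exactly one digit after "taxon:".
full_match = re.fullmatch(PATTERN, value) is not None

# Hypothetical alternative: 32 lowercase hexadecimal characters.
HEX_ID = r"alien-macroinvertebrates-checklist:taxon:[0-9a-f]{32}"
hex_match = re.fullmatch(HEX_ID, value) is not None

print(prefix_match, full_match, hex_match)
```

Whether the data or the specification should change in such a case is a decision for the data publisher; the report only makes the mismatch visible.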


The report provides a quick overview of the issues for which action needs to be taken, similar to the current GBIF data validator reporting functionalities.

Integrating pywhip into the existing GBIF data validator would extend its data validation use cases, serving both general data quality requirements and data set, publisher, or community-specific requirements.

## pywhip envisioned applications

In essence, pywhip is a data validation tool that reports on data quality issues according to a set of rules.

However, by making use of **whip specifications** instead of a base set of fixed rules, more **flexibility** is provided to both data publishers and data users. Hence, we envision the usefulness of pywhip in a wider range of applications:

* As a data publisher working with external partners' data, standardizing data is an iterative process with continuous feedback to and from the external partner. During this process, decisions are made about the data representation, which can be explicitly specified as whip specifications. pywhip provides the ability to continuously check these specifications against the data set and report the existing issues. As the specifications can be data set specific, pywhip supports the required flexibility.
* The [Biodiversity Data Quality (BDQ) Interest Group](https://github.com/tdwg/bdq) is defining a fixed set of *Tests and Assertions* to assess data quality and provide a common ground for data aggregators to report these issues. Whereas these can be considered a common ground, the use of **whip specifications** provides the ability to define additional conformance tests agreed on at **community level** to address specific data quality requirements.
* As a data user, pywhip can be used to **filter** data records using a custom-defined set of whip specifications. As different research questions have different data quality requirements, the combination of **whip specifications** and the **pywhip** validator supports different researchers in filtering data records to their needs.
* ...
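
The filtering use case in the list above could be sketched as follows, reusing the `passed_row_ids` values `[2, 4]` from the JSON report of the introductory example. The sketch assumes these IDs are 1-based data row numbers (header excluded), which is consistent with the `first_row` values in that report. It relies only on the standard library rather than on a dedicated pywhip filtering API.

```python
import csv
import io

# Sample rows copied from the introductory example.
SAMPLE = """eventDate,individualCount,country
2018-01-03,5,BA
2018-04-02,20,NL
2016-07-06,3300,BE
2017-03-02,2,BE
1018-01-08,1,NL
"""

# Row IDs reported as passing in the JSON report of the introductory example.
passed_row_ids = {2, 4}

# Keep only the records whose 1-based row number passed all rules.
rows = csv.DictReader(io.StringIO(SAMPLE))
kept = [row for number, row in enumerate(rows, start=1)
        if number in passed_row_ids]

for row in kept:
    print(row["eventDate"], row["individualCount"], row["country"])
```

Different research questions would simply use different specifications, and therefore obtain a different set of passing row IDs from the same data.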