## Demo of the `data_linter` package

In this notebook we demonstrate some uses of our metadata format:
- Producing a validation report for a dataset that validates successfully against a metadata schema
- Producing a validation report for a dataset that fails to validate
- Auto-creating a draft metadata schema from an existing dataset
- Attempting to impose the column data types specified in the metadata onto a dataframe

## Create validation report for valid data

In [1]:
from data_linter.lint import Linter

Load in data

In [2]:
import pandas as pd 
df = pd.read_csv("data_valid.csv", parse_dates=["mydatetime"])
df

|    |   myint |   myfloat | mychar   | mydatetime          |
|---:|--------:|----------:|:---------|:--------------------|
|  0 |       1 |      1.0  | a        | 2019-01-01 00:00:00 |
|  1 |       3 |      2.3  | b        | 2019-01-01 10:00:00 |
|  2 |       4 |      3.33 | c        | 2010-01-02 00:00:00 |


Open the associated metadata

In [3]:
import json
with open("metadata.json") as f:
    meta_data = json.load(f)
print(json.dumps(meta_data, indent=4))

{
    "columns": [
        {
            "name": "myint",
            "type": "int",
            "description": "An integer column"
        },
        {
            "name": "myfloat",
            "type": "float",
            "description": "A float column"
        },
        {
            "name": "mychar",
            "type": "character",
            "description": "A character/string column",
            "enum": [
                "a",
                "b",
                "c"
            ]
        },
        {
            "name": "mydatetime",
            "type": "datetime",
            "description": "A datetime column"
        }
    ]
}
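
Because the metadata is plain JSON, it can be inspected with ordinary dictionary operations once loaded. The snippet below is a small sketch of that idea using a copy of the structure printed above; the variable names `types_by_name` and `enum_columns` are illustrative and not part of `data_linter`.

```python
# A copy of the metadata structure shown above (descriptions omitted for brevity).
meta_data = {
    "columns": [
        {"name": "myint", "type": "int"},
        {"name": "myfloat", "type": "float"},
        {"name": "mychar", "type": "character", "enum": ["a", "b", "c"]},
        {"name": "mydatetime", "type": "datetime"},
    ]
}

# Map each column name to its declared type, preserving the declared order.
types_by_name = {col["name"]: col["type"] for col in meta_data["columns"]}

# Collect only the columns that restrict their values with an "enum" list.
enum_columns = {
    col["name"]: col["enum"] for col in meta_data["columns"] if "enum" in col
}

print(types_by_name)
print(enum_columns)
```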


Use `data_linter` to run all automated validation checks with `check_all`. Within Jupyter, the `Linter` object automatically pretty-prints as a summary markdown table.

In [4]:
l = Linter(df, meta_data)
l.check_all()
l

| col_name   | validation_description        | success   |
|:-----------|:------------------------------|:----------|
| mychar     | check_column_exists_and_order | ✅        |
| mychar     | check_data_type               | ✅        |
| mychar     | check_enums                   | ✅        |
| mydatetime | check_column_exists_and_order | ✅        |
| mydatetime | check_data_type               | ✅        |
| myfloat    | check_column_exists_and_order | ✅        |
| myfloat    | check_data_type               | ✅        |
| myint      | check_column_exists_and_order | ✅        |
| myint      | check_data_type               | ✅        |
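
To give a feel for what a check like `check_column_exists_and_order` involves, here is a minimal sketch. It is not the package's implementation: the helper name, its return shape and the message wording are assumptions made for illustration. It compares each column's position in the data with its position in the metadata, counting from 1.

```python
def check_column_exists_and_order_sketch(data_columns, meta_data):
    """Hypothetical helper: return {column name: (success, message)}."""
    results = {}
    for expected_pos, col in enumerate(meta_data["columns"], start=1):
        name = col["name"]
        if name not in data_columns:
            results[name] = (False, "Column does not exist in input data.")
            continue
        # list.index is 0-based, so add 1 to get a 1-based position.
        actual_pos = data_columns.index(name) + 1
        if actual_pos == expected_pos:
            results[name] = (True, "")
        else:
            results[name] = (
                False,
                f"Column is in position {actual_pos}, expected position {expected_pos}.",
            )
    return results


meta_data = {
    "columns": [
        {"name": "myint", "type": "int"},
        {"name": "myfloat", "type": "float"},
        {"name": "mychar", "type": "character"},
        {"name": "mydatetime", "type": "datetime"},
    ]
}

# All four columns present and in the declared order: every check succeeds.
valid = check_column_exists_and_order_sketch(
    ["myint", "myfloat", "mychar", "mydatetime"], meta_data
)

# Dropping myfloat makes it fail and shifts the columns that follow it.
missing = check_column_exists_and_order_sketch(
    ["myint", "mychar", "mydatetime"], meta_data
)
for name, (success, message) in missing.items():
    print(name, success, message)
```

Note that under this scheme a single missing column also causes every later column to fail its position check.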

Create a markdown report. Note that we import `Markdown` from `IPython.display` so that the report is rendered properly within the Jupyter notebook.

In [5]:
from IPython.display import Markdown
Markdown(l.markdown_report())

# Validation report 
## Summary
👍😎😎😎😎😎👍
✅**All tests on your dataset passed**
👍😎😎😎😎😎👍

## Your data

Here's a sample of your data:

|   myint |   myfloat | mychar   | mydatetime          |
|--------:|----------:|:---------|:--------------------|
|       1 |       1   | a        | 2019-01-01 00:00:00 |
|       3 |       2.3 | b        | 2019-01-01 10:00:00 |

and a summary of your metadata:

| myint   | myfloat   | mychar    | mydatetime   |
|:--------|:----------|:----------|:-------------|
| int     | float     | character | datetime     |

### Results for column myint

The metadata for this column is: `{"name": "myint", "type": "int", "description": "An integer column"}`.

**✅ check_column_exists_and_order was a success**



**✅ check_data_type was a success**


---
### Results for column myfloat

The metadata for this column is: `{"name": "myfloat", "type": "float", "description": "A float column"}`.

**✅ check_column_exists_and_order was a success**



**✅ check_data_type was a success**


---
### Results for column mychar

The metadata for this column is: `{"name": "mychar", "type": "character", "description": "A character/string column", "enum": ["a", "b", "c"]}`.

**✅ check_column_exists_and_order was a success**



**✅ check_enums was a success**

**✅ check_data_type was a success**


---
### Results for column mydatetime

The metadata for this column is: `{"name": "mydatetime", "type": "datetime", "description": "A datetime column"}`.

**✅ check_column_exists_and_order was a success**



**✅ check_data_type was a success**


---


## Create validation report for invalid data

Perform the same steps for an invalid dataset. Note that `mychar` contains the value `d`, which is not among the `enum` options in the metadata, and that the `myfloat` column is missing from the data entirely.

In [6]:
invalid_df = pd.read_csv("data_invalid.csv", parse_dates=["mydatetime"])
invalid_df

|    |   myint | mychar   | mydatetime          |
|---:|--------:|:---------|:--------------------|
|  0 |     1.0 | a        | 2019-01-01 00:00:00 |
|  1 |     3.0 | b        | 2019-01-01 10:00:00 |
|  2 |     4.1 | d        | 2010-01-02 00:00:00 |


In [7]:
l = Linter(invalid_df, meta_data)
l.check_all()
Markdown(l.markdown_report())

# Validation report 
## Summary
🔥🔥🔥🔥🔥🔥🔥
❌**Some tests failed, see below for details**
🔥🔥🔥🔥🔥🔥🔥

## Your data

Here's a sample of your data:

|   myint | mychar   | mydatetime          |
|--------:|:---------|:--------------------|
|       1 | a        | 2019-01-01 00:00:00 |
|       3 | b        | 2019-01-01 10:00:00 |

and a summary of your metadata:

| myint   | myfloat   | mychar    | mydatetime   |
|:--------|:----------|:----------|:-------------|
| int     | float     | character | datetime     |

### Results for column myint

The metadata for this column is: `{"name": "myint", "type": "int", "description": "An integer column"}`.

**✅ check_column_exists_and_order was a success**



**✅ check_data_type was a success**


---
### Results for column myfloat

The metadata for this column is: `{"name": "myfloat", "type": "float", "description": "A float column"}`.

**❌ check_column_exists_and_order was a failure**


Column does not exist in input data.

---
### Results for column mychar

The metadata for this column is: `{"name": "mychar", "type": "character", "description": "A character/string column", "enum": ["a", "b", "c"]}`.

**❌ check_column_exists_and_order was a failure**


Column exists but is in position 2 rather than the expected position 3.

**❌ check_enums was a failure**
Here's a sample of some rows which failed:

|   index |   myint | mychar   | mydatetime          |
|--------:|--------:|:---------|:--------------------|
|       2 |       4 | d        | 2010-01-02 00:00:00 |

**✅ check_data_type was a success**


---
### Results for column mydatetime

The metadata for this column is: `{"name": "mydatetime", "type": "datetime", "description": "A datetime column"}`.

**❌ check_column_exists_and_order was a failure**


Column exists but is in position 3 rather than the expected position 4.

**✅ check_data_type was a success**


---
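
The `check_enums` failure in this report lists the rows whose `mychar` value is not among the allowed values. A check of this kind can be sketched with pandas' `Series.isin`. The helper below is hypothetical and only illustrates the idea; how `data_linter` implements the check internally is not shown in this notebook.

```python
import pandas as pd


def check_enums_sketch(df, column_meta):
    """Hypothetical helper: return the rows whose value is not in the column's enum."""
    allowed = column_meta.get("enum")
    if allowed is None:
        # No enum declared for this column, so no row can fail.
        return df.iloc[0:0]
    # isin() is False for values outside the list (including missing values),
    # so negating it selects the failing rows.
    return df[~df[column_meta["name"]].isin(allowed)]


invalid_example = pd.DataFrame(
    {"myint": [1.0, 3.0, 4.1], "mychar": ["a", "b", "d"]}
)
mychar_meta = {"name": "mychar", "type": "character", "enum": ["a", "b", "c"]}

failed = check_enums_sketch(invalid_example, mychar_meta)
print(failed)
```

An empty result means the column passes; otherwise the returned rows can be shown as a sample of failures.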


## Auto-generate some draft metadata from a dataset

We're going to generate draft metadata for the following dataset:

In [8]:
df

|    |   myint |   myfloat | mychar   | mydatetime          |
|---:|--------:|----------:|:---------|:--------------------|
|  0 |       1 |      1.0  | a        | 2019-01-01 00:00:00 |
|  1 |       3 |      2.3  | b        | 2019-01-01 10:00:00 |
|  2 |       4 |      3.33 | c        | 2010-01-02 00:00:00 |


In [9]:
from data_linter.generate_meta_data import generate_from_pd_df
generated_metadata = generate_from_pd_df(df)
print(json.dumps(generated_metadata, indent=4))

{
    "columns": [
        {
            "name": "myint",
            "description": "",
            "type": "int"
        },
        {
            "name": "myfloat",
            "description": "",
            "type": "float"
        },
        {
            "name": "mychar",
            "description": "",
            "type": "character"
        },
        {
            "name": "mydatetime",
            "description": "",
            "type": "datetime"
        }
    ]
}
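
One plausible way to produce a draft like this is to map each column's pandas dtype to a metadata type name. The sketch below does that with the dtype predicates in `pandas.api.types`. The function name and the fallback to `"character"` for any other dtype are assumptions for illustration, not a description of how `generate_from_pd_df` works.

```python
import pandas as pd
from pandas.api import types as pdt


def draft_metadata_sketch(df):
    """Hypothetical helper: infer a draft metadata dict from a dataframe's dtypes."""
    columns = []
    for name in df.columns:
        series = df[name]
        if pdt.is_datetime64_any_dtype(series):
            col_type = "datetime"
        elif pdt.is_integer_dtype(series):
            col_type = "int"
        elif pdt.is_float_dtype(series):
            col_type = "float"
        else:
            # Assumption: anything else is treated as a string column.
            col_type = "character"
        columns.append({"name": name, "description": "", "type": col_type})
    return {"columns": columns}


example_df = pd.DataFrame(
    {
        "myint": [1, 3, 4],
        "myfloat": [1.0, 2.3, 3.33],
        "mychar": ["a", "b", "c"],
        "mydatetime": pd.to_datetime(
            ["2019-01-01 00:00:00", "2019-01-01 10:00:00", "2010-01-02 00:00:00"]
        ),
    }
)
draft = draft_metadata_sketch(example_df)
print(draft)
```

A draft generated this way still needs descriptions and any `enum` constraints filled in by hand.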


## Attempt to impose the column data types specified in metadata onto a dataframe

In [34]:
from data_linter.impose_data_types import impose_metadata_types_on_pd_df

In [35]:
with open("metadata.json") as f:
    meta_data = json.load(f)

In [36]:
df_strings = pd.read_csv("data_valid.csv", dtype=object)
df_strings.dtypes

myint         object
myfloat       object
mychar        object
mydatetime    object
dtype: object

In [37]:
df_imposed = impose_metadata_types_on_pd_df(df=df_strings, meta_data=meta_data, errors='raise')

In [38]:
df_imposed.dtypes

myint                  Int64
myfloat              float64
mychar                object
mydatetime    datetime64[ns]
dtype: object
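
For readers curious about what such a conversion might look like underneath, here is a minimal pandas-only sketch. It is an illustration under stated assumptions, not the package's code: `impose_types_sketch` is a hypothetical name, and only the four type names used in this notebook are handled.

```python
import pandas as pd


def impose_types_sketch(df, meta_data):
    """Hypothetical helper: cast each column to the type declared in the metadata."""
    out = df.copy()
    for col in meta_data["columns"]:
        name, col_type = col["name"], col["type"]
        if col_type == "int":
            # "Int64" is pandas' nullable integer dtype.
            out[name] = pd.to_numeric(out[name]).astype("Int64")
        elif col_type == "float":
            out[name] = pd.to_numeric(out[name]).astype("float64")
        elif col_type == "datetime":
            out[name] = pd.to_datetime(out[name])
        # "character" columns are left unchanged.
    return out


strings_df = pd.DataFrame(
    {
        "myint": ["1", "3", "4"],
        "myfloat": ["1.0", "2.3", "3.33"],
        "mychar": ["a", "b", "c"],
        "mydatetime": [
            "2019-01-01 00:00:00",
            "2019-01-01 10:00:00",
            "2010-01-02 00:00:00",
        ],
    },
    dtype=object,
)
meta_data = {
    "columns": [
        {"name": "myint", "type": "int"},
        {"name": "myfloat", "type": "float"},
        {"name": "mychar", "type": "character"},
        {"name": "mydatetime", "type": "datetime"},
    ]
}

imposed = impose_types_sketch(strings_df, meta_data)
print(imposed.dtypes)
```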

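But what happens if a column cannot be converted to the type declared in the metadata? Below we change the declared type of `mychar` to `int`; with `errors='raise'`, the conversion fails with a `ValueError`.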

In [39]:
meta_data["columns"][2]

{'name': 'mychar',
 'type': 'character',
 'description': 'A character/string column',
 'enum': ['a', 'b', 'c']}

In [40]:
meta_data["columns"][2]["type"] = "int"

In [41]:
impose_metadata_types_on_pd_df(df=df_strings, meta_data=meta_data, errors='raise')

ValueError: invalid literal for int() with base 10: 'a'
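
When a conversion fails like this, it helps to find out which values are responsible. The sketch below uses pandas' own `pd.to_numeric`, whose `errors="coerce"` option replaces unparseable values with `NaN` instead of raising. Whether `impose_metadata_types_on_pd_df` accepts a comparable option is not shown in this notebook, so the example deliberately sticks to plain pandas.

```python
import pandas as pd

mychar = pd.Series(["a", "b", "c"], dtype=object, name="mychar")

# errors="raise" is the pandas default: a non-numeric string raises ValueError.
try:
    pd.to_numeric(mychar, errors="raise")
    raised = False
except ValueError:
    raised = True

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(mychar, errors="coerce")

# Values that were present in the input but are missing after coercion
# are the ones that could not be converted.
failed_values = mychar[coerced.isna() & mychar.notna()]
print(failed_values.tolist())
```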