# InputChecker Example Notebook

This notebook shows the functionality of the InputChecker class. The fit method stores information about a benchmark dataset, such as the data type and range for each column, and the transformer checks that a new/comparison dataset matches these expectations.

The transformer performs 5 separate checks:
* The columns are of the correct type
* The correct columns are allowed to contain nulls
* Categorical columns contain only accepted values
* Numerical columns contain only values within the minimum and maximum limits
* Datetime columns contain only dates beyond the minimum date limit and (optionally) before the maximum date limit

There are currently no checks on relationships between columns (e.g. values in column1 must always be higher than those in column2), although these may be added in the future.
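Until such checks exist, a relationship between columns can be validated by hand before calling the checker. The following is a minimal sketch using plain pandas; `check_column_greater` and the example dataframe are hypothetical and are not part of input_checker.

```python
import pandas as pd


def check_column_greater(df, higher, lower):
    """Return the rows where `higher` is not strictly greater than `lower`.

    Hypothetical helper for illustration only; not part of input_checker.
    Rows with a null in either column are ignored.
    """
    both_present = df[higher].notna() & df[lower].notna()
    violates = both_present & ~(df[higher] > df[lower])
    return df.loc[violates]


df_example = pd.DataFrame(
    {"column1": [5.0, 3.0, 9.0, None], "column2": [1.0, 4.0, 2.0, 7.0]}
)

violations = check_column_greater(df_example, "column1", "column2")
print(violations)
```

Only the second row is reported here, because the row with a missing `column1` value is left to the null check.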

## Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

import input_checker
from input_checker.checker import InputChecker
from datetime import timedelta, datetime

In [2]:
input_checker.__version__

'0.3.7'

## Prepare data

We will use the wine dataset from sklearn to demonstrate the input_checker functionality. We will add some categorical and datetime fields as well as missing values to existing fields.

In [3]:
wine = load_wine()

In [4]:
df_wine = pd.DataFrame(wine['data'], columns = wine['feature_names'])
df_wine['target'] = wine['target']

### Add categorical columns

In [5]:
np.random.seed(seed=123)
df_wine['polytunnel'] = np.random.choice([0, 1], df_wine.shape[0])

np.random.seed(seed=123)
df_wine['pesticide'] = np.random.choice(['contact', 'systemic', 'pre-emergence', 'post-emergence', 'selective', 'nonselective', None], df_wine.shape[0])

### Add datetime columns

In [6]:
def random_date(min_date, max_date, sample_size, seed=0):
    """
    Function to generate a random array of dates
    """

    np.random.seed(seed=seed)

    if (max_date - min_date).total_seconds() < 0:
        raise ValueError("max_date must be at a later date than min_date")

    prop = np.random.random(size=sample_size)

    rand_dates = np.array(
        [min_date + p * (max_date - min_date) for p in prop], dtype="datetime64[s]"
    )

    return rand_dates

In [7]:
df_wine['date_logged'] = random_date(pd.Timestamp('01/01/2021'), pd.Timestamp('01/06/2021'), df_wine.shape[0], seed=123)
df_wine['first_harvest'] = random_date(pd.Timestamp('01/08/2020'), pd.Timestamp('30/09/2020'), df_wine.shape[0], seed=123)

# add nulls: blank out dates whose seconds component is above 50
df_wine['first_harvest'] = df_wine['first_harvest'].mask(df_wine['first_harvest'].dt.second > 50)

# remove time and just keep date
df_wine['first_harvest'] = pd.to_datetime(df_wine['first_harvest'].dt.date)

### Add missing values to existing fields

In [8]:
df_wine.loc[df_wine.sample(5, random_state=123).index.values, 'ash'] = np.nan
df_wine.loc[df_wine.sample(21, random_state=123).index.values, 'magnesium'] = np.nan
df_wine.loc[df_wine.sample(105, random_state=123).index.values, 'hue'] = np.nan

### Split into train and test

In [9]:
df_train, df_test = train_test_split(df_wine, test_size=0.2)

In [10]:
df_train.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target,polytunnel,pesticide,date_logged,first_harvest
123,13.05,5.8,2.13,21.5,86.0,2.62,2.65,0.3,2.01,2.6,0.73,3.1,380.0,1,1,selective,2021-01-03 14:24:04,2020-05-25
121,11.56,2.05,3.23,28.5,119.0,3.18,5.08,0.47,1.87,6.0,,3.69,465.0,1,0,systemic,2021-01-02 00:54:58,2020-03-03
167,12.82,3.37,2.3,19.5,88.0,1.48,0.66,0.4,0.97,10.26,,1.75,685.0,2,0,nonselective,2021-01-05 15:18:33,NaT
144,12.25,3.88,2.2,18.5,112.0,1.38,0.78,0.29,1.14,8.21,,2.0,855.0,2,1,systemic,2021-01-03 06:53:21,2020-05-08
41,13.41,3.84,2.12,18.8,,2.45,2.68,0.27,1.48,4.28,,3.0,1035.0,0,1,selective,2021-01-01 13:52:27,2020-02-07


## Initialize input checker

When initializing the input checker object, the user can specify the following parameters:

* `columns`: the list of columns in the dataframe to run the type and null checks on. The default value is <b>None</b>, which means all columns are included in the checks.
* `categorical_columns`: the list of columns to run the categorical check on. The default value is <b>None</b>, which skips this check. If `categorical_columns` is set to <b>"infer"</b>, the object automatically runs the categorical check on any columns of the following types: 'category', 'boolean' and 'object'.
* `numerical_columns`: the list of columns to run the numerical check on. The default value is <b>None</b>, which skips this check. The <b>"infer"</b> option is also available and automatically runs the check on 'float' and 'int' columns. Alternatively, a dictionary may be passed in which each key is a column name to check and each value is itself a dictionary with 'maximum' and 'minimum' keys, each set to True or False to indicate whether a maximum and/or minimum value check is wanted.
* `datetime_columns`: the list of columns to run the datetime check on. The default value is <b>None</b>, which skips this check. The <b>"infer"</b> option is also available and automatically runs the check on datetime columns. Alternatively, a dictionary may be passed in which each key is a column name to check and each value is itself a dictionary with 'maximum' and 'minimum' keys, each set to True or False to indicate whether a maximum and/or minimum value check is wanted.
* `skip_infer_columns`: the list of columns to exclude from the <b>"infer"</b> column selection for the `categorical_columns`, `numerical_columns` and `datetime_columns` parameters. These columns are still included in the null and type checks. The default value is <b>None</b>, which means that all columns included in the `columns` parameter are considered by the <b>"infer"</b> selection.
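As an illustration of the dictionary form described for `numerical_columns` and `datetime_columns`, the sketch below builds two such dictionaries for columns of the wine dataframe. The choice of columns and flags is arbitrary, and `has_expected_shape` is a hypothetical sanity check written for this example, not a function provided by input_checker.

```python
# One entry per column, each with boolean 'minimum' and 'maximum' flags.
numerical_columns_dict = {
    "alcohol": {"minimum": True, "maximum": True},
    "proline": {"minimum": True, "maximum": False},  # lower bound only
}

datetime_columns_dict = {
    "date_logged": {"minimum": True, "maximum": False},
}


def has_expected_shape(column_spec):
    """Hypothetical check that every column maps to boolean min/max flags."""
    return all(
        set(flags) == {"minimum", "maximum"}
        and all(isinstance(value, bool) for value in flags.values())
        for flags in column_spec.values()
    )


print(has_expected_shape(numerical_columns_dict))
print(has_expected_shape(datetime_columns_dict))

# The dictionaries would then be passed in place of the column lists, e.g.
# InputChecker(numerical_columns=numerical_columns_dict,
#              datetime_columns=datetime_columns_dict)
```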


<b>IMPORTANT: INFER FRINGE CASES!</b>

With the `numerical_columns` infer option, numerical columns such as an index or ID will, in a typical use case, raise an exception on transform because new entries fall outside the accepted range.

With the `categorical_columns` infer option, columns such as free-text fields will be treated as categorical. This can lead to very long execution times and will very likely cause an InputCheckerError on transform.

In these cases it is recommended that the user lists any ID or free-text fields in `skip_infer_columns` so that they are skipped by the infer column selection. Alternatively, columns that are not listed in the `columns` parameter will not be picked up by the infer option either, but they then won't have any type or null checks applied to them.
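One way to find candidates for `skip_infer_columns` is to look for integer or text columns whose values are almost all unique. The sketch below is a rough heuristic under that assumption; `suggest_skip_infer_columns`, its threshold and the example dataframe are hypothetical, and the result should be reviewed by hand rather than used blindly.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_object_dtype, is_string_dtype


def suggest_skip_infer_columns(df, max_unique_ratio=0.9):
    """Hypothetical heuristic: flag ID-like and free-text-like columns.

    Only integer and text columns are considered, because continuous
    float columns are expected to have many distinct values.
    """
    suggestions = []
    for column in df.columns:
        series = df[column].dropna()
        if series.empty:
            continue
        id_or_text = (
            is_integer_dtype(series)
            or is_object_dtype(series)
            or is_string_dtype(series)
        )
        if id_or_text and series.nunique() / len(series) >= max_unique_ratio:
            suggestions.append(column)
    return suggestions


df_example = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "notes": ["late delivery", "ok", "damaged box", "n/a"],
        "pesticide": ["contact", "contact", "systemic", "systemic"],
        "alcohol": [13.1, 12.4, 13.1, 14.0],
    }
)

skip_candidates = suggest_skip_infer_columns(df_example)
print(skip_candidates)
```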

In [11]:
categorical_columns = ['polytunnel', 'pesticide']

numerical_columns = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline']

datetime_columns = ['date_logged', 'first_harvest']

In [12]:
checker = InputChecker(columns=categorical_columns + numerical_columns + datetime_columns,
                       categorical_columns=categorical_columns, 
                       numerical_columns=numerical_columns, 
                       datetime_columns=datetime_columns)

## Fit checker

The InputChecker is built on the tubular BaseTransformer, so it has fit and transform methods similar to those of sklearn transformers.

In [13]:
checker.fit(df_train)

InputChecker(categorical_columns=['polytunnel', 'pesticide'],
             columns=['polytunnel', 'pesticide', 'alcohol', 'malic_acid', 'ash',
                      'alcalinity_of_ash', 'magnesium', 'total_phenols',
                      'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
                      'color_intensity', 'hue', 'od280/od315_of_diluted_wines',
                      'proline', 'date_logged', 'first_harvest'],
             datetime_columns=['date_logged', 'first_harvest'],
             numerical_columns=['alcohol', 'malic_acid', 'ash',
                                'alcalinity_of_ash', 'magnesium',
                                'total_phenols', 'flavanoids',
                                'nonflavanoid_phenols', 'proanthocyanins',
                                'color_intensity', 'hue',
                                'od280/od315_of_diluted_wines', 'proline'],
             skip_infer_columns=[])

We can check whether the infer option selects the same columns as those we defined explicitly above.

In [14]:
checker_inferred = InputChecker(categorical_columns='infer', 
                       numerical_columns='infer', 
                       datetime_columns='infer',
                       skip_infer_columns = ['target'])

checker_inferred.fit(df_train)

InputChecker(categorical_columns=['pesticide'],
             columns=['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
                      'magnesium', 'total_phenols', 'flavanoids',
                      'nonflavanoid_phenols', 'proanthocyanins',
                      'color_intensity', 'hue', 'od280/od315_of_diluted_wines',
                      'proline', 'target', 'polytunnel', 'pesticide',
                      'date_logged', 'first_harvest'],
             datetime_columns=['date_logged', 'first_harvest'],
             numerical_columns=['alcohol', 'malic_acid', 'ash',
                                'alcalinity_of_ash', 'magnesium',
                                'total_phenols', 'flavanoids',
                                'nonflavanoid_phenols', 'proanthocyanins',
                                'color_intensity', 'hue',
                                'od280/od315_of_diluted_wines', 'proline',
                                'polytunnel'],
             skip_infer_colum

Notice that in this case, input_checker has inferred that `polytunnel` is a numerical rather than categorical field, as it contains the values 0 and 1.
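If a 0/1 flag such as `polytunnel` should be treated as categorical by the infer option, one approach is to cast it to a categorical dtype before fitting. The sketch below approximates a dtype-based selection with `select_dtypes`, assuming the inference depends only on the column dtypes listed earlier; `split_by_dtype` is a hypothetical helper and does not reproduce the library's own logic.

```python
import pandas as pd

df_example = pd.DataFrame(
    {"polytunnel": [0, 1, 1, 0], "alcohol": [13.05, 11.56, 12.82, 12.25]}
)


def split_by_dtype(df):
    """Hypothetical approximation of a dtype-based column selection."""
    categorical = df.select_dtypes(
        include=["category", "bool", "object"]
    ).columns.tolist()
    numerical = df.select_dtypes(include=["number"]).columns.tolist()
    return categorical, numerical


before = split_by_dtype(df_example)
after = split_by_dtype(df_example.astype({"polytunnel": "category"}))

print(before)
print(after)
```

Note that the type check compares dtypes, so a column cast to 'category' for fitting would also need to arrive as 'category' in any dataframe passed to transform.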

### Class attributes

We can check that the checker has been correctly fitted by examining the dictionaries that are now stored within the class attributes for each check as follows:

* `null_map` stores the null check values
* `expected_values` stores the categorical check values
* `column_classes` stores the type check values
* `numerical_values` stores the numerical check values
* `datetime_values` stores the datetime check values


#### Null check dictionary

In [15]:
checker.null_map

{'polytunnel': 0,
 'pesticide': 1,
 'alcohol': 0,
 'malic_acid': 0,
 'ash': 1,
 'alcalinity_of_ash': 0,
 'magnesium': 1,
 'total_phenols': 0,
 'flavanoids': 0,
 'nonflavanoid_phenols': 0,
 'proanthocyanins': 0,
 'color_intensity': 0,
 'hue': 1,
 'od280/od315_of_diluted_wines': 0,
 'proline': 0,
 'date_logged': 0,
 'first_harvest': 1}

In [16]:
for column in df_train.columns:
    if df_train[column].isna().any():
        print(column)

ash
magnesium
hue
pesticide
first_harvest


The checker correctly stored which columns are allowed to contain nulls (flagged with 1) and which are not (flagged with 0). In a live setting, the columns flagged with 0 would be the features that do not have imputers in the pipeline.
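The idea behind the null check can be reproduced with a few lines of pandas. This is a sketch of the concept on a toy dataframe, not the library's implementation; `null_map_sketch` and the example data are made up for illustration.

```python
import numpy as np
import pandas as pd

df_fit = pd.DataFrame({"ash": [2.1, np.nan, 2.3], "alcohol": [13.0, 12.5, 14.1]})
df_new = pd.DataFrame({"ash": [np.nan, 2.2], "alcohol": [np.nan, 13.3]})

# 1 if the column contained nulls when fitting, 0 otherwise
null_map_sketch = df_fit.isna().any().astype(int).to_dict()

# A column fails if it was null-free when fitting but has nulls now
failed_null_columns = [
    column
    for column, nulls_allowed in null_map_sketch.items()
    if nulls_allowed == 0 and df_new[column].isna().any()
]

print(null_map_sketch)
print(failed_null_columns)
```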

#### Categorical check dictionary

In [17]:
checker.expected_values

{'polytunnel': [1, 0],
 'pesticide': ['selective',
  'systemic',
  'nonselective',
  'post-emergence',
  'pre-emergence',
  'contact',
  None]}

In [18]:
df_train['polytunnel'].unique(), df_train['pesticide'].unique()

(array([1, 0]),
 array(['selective', 'systemic', 'nonselective', 'post-emergence',
        'pre-emergence', 'contact', None], dtype=object))
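Conceptually, the categorical check compares the values in a new column against the set stored at fit time. The sketch below shows that comparison in plain pandas on made-up data; nulls are dropped first because they are covered by the null check in this simplified version. It is an illustration only, not the library's code.

```python
import pandas as pd

fit_values = pd.Series(["contact", "systemic", None, "selective"])
new_values = pd.Series(["systemic", 2, "organic", "systemic"])

# Values seen when fitting (nulls handled separately in this sketch)
expected_values_sketch = fit_values.dropna().unique().tolist()

# Anything in the new data that was not seen when fitting
candidates = new_values.dropna()
unexpected = candidates[~candidates.isin(expected_values_sketch)].unique().tolist()

print(unexpected)
```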

#### Type checker dictionary

In [19]:
checker.column_classes

{'polytunnel': dtype('int32'),
 'pesticide': dtype('O'),
 'alcohol': dtype('float64'),
 'malic_acid': dtype('float64'),
 'ash': dtype('float64'),
 'alcalinity_of_ash': dtype('float64'),
 'magnesium': dtype('float64'),
 'total_phenols': dtype('float64'),
 'flavanoids': dtype('float64'),
 'nonflavanoid_phenols': dtype('float64'),
 'proanthocyanins': dtype('float64'),
 'color_intensity': dtype('float64'),
 'hue': dtype('float64'),
 'od280/od315_of_diluted_wines': dtype('float64'),
 'proline': dtype('float64'),
 'date_logged': dtype('<M8[ns]'),
 'first_harvest': dtype('<M8[ns]')}

In [20]:
for column in df_train.columns:
    print(column, df_train[column].dtype)

alcohol float64
malic_acid float64
ash float64
alcalinity_of_ash float64
magnesium float64
total_phenols float64
flavanoids float64
nonflavanoid_phenols float64
proanthocyanins float64
color_intensity float64
hue float64
od280/od315_of_diluted_wines float64
proline float64
target int32
polytunnel int32
pesticide object
date_logged datetime64[ns]
first_harvest datetime64[ns]


The type checker correctly stores the type of each column. It is important that columns have the correct type in live data: for example, a float32 column will raise an exception if the checker was fitted on a float64 column. Similarly, an int column containing only 1s and 0s will raise an exception if the checker was fitted on a categorical column, regardless of whether the values are the same.
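The float32 versus float64 case can be seen directly by comparing pandas dtypes. The following sketch uses a toy dataframe and a hypothetical `column_classes_sketch` dictionary; it mirrors the idea of the type check rather than the library's implementation.

```python
import pandas as pd

df_fit = pd.DataFrame({"hue": [1.04, 0.73], "polytunnel": [1, 0]})

# Same values, but 'hue' is stored with lower precision
df_live = df_fit.astype({"hue": "float32"})

column_classes_sketch = df_fit.dtypes.to_dict()

type_mismatches = {
    column: (str(expected), str(df_live[column].dtype))
    for column, expected in column_classes_sketch.items()
    if df_live[column].dtype != expected
}

print(type_mismatches)
```

Casting the live dataframe back with `astype` before calling transform is one way to avoid this kind of failure when the values themselves are acceptable.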

#### Numerical checker dictionary

In [21]:
checker.numerical_values

{'alcohol': {'maximum': 14.83, 'minimum': 11.03},
 'malic_acid': {'maximum': 5.8, 'minimum': 0.74},
 'ash': {'maximum': 3.23, 'minimum': 1.36},
 'alcalinity_of_ash': {'maximum': 30.0, 'minimum': 10.6},
 'magnesium': {'maximum': 151.0, 'minimum': 80.0},
 'total_phenols': {'maximum': 3.88, 'minimum': 0.98},
 'flavanoids': {'maximum': 5.08, 'minimum': 0.34},
 'nonflavanoid_phenols': {'maximum': 0.66, 'minimum': 0.13},
 'proanthocyanins': {'maximum': 3.58, 'minimum': 0.41},
 'color_intensity': {'maximum': 13.0, 'minimum': 1.28},
 'hue': {'maximum': 1.71, 'minimum': 0.57},
 'od280/od315_of_diluted_wines': {'maximum': 4.0, 'minimum': 1.27},
 'proline': {'maximum': 1547.0, 'minimum': 278.0}}

In [22]:
for column in numerical_columns:
    print(column, df_train[column].max(), df_train[column].min())

alcohol 14.83 11.03
malic_acid 5.8 0.74
ash 3.23 1.36
alcalinity_of_ash 30.0 10.6
magnesium 151.0 80.0
total_phenols 3.88 0.98
flavanoids 5.08 0.34
nonflavanoid_phenols 0.66 0.13
proanthocyanins 3.58 0.41
color_intensity 13.0 1.28
hue 1.71 0.57
od280/od315_of_diluted_wines 4.0 1.27
proline 1547.0 278.0


The numerical checker stored the correct values. If a dictionary had been passed instead of a list of columns, any check set to False would contain a None value. This is useful for cases such as vehicle age, where a minimum value of 0 may be useful whilst a maximum value might not be.
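To make the role of a None limit concrete, the sketch below applies a range check in which a bound of None is simply skipped. `out_of_range` and the vehicle age data are hypothetical; this shows the idea only and is not how input_checker is implemented.

```python
import pandas as pd


def out_of_range(series, limits):
    """Hypothetical helper: values outside the limits, keyed by row index.

    A limit of None means that side of the range is not checked.
    Nulls never fail here because comparisons with NaN are False.
    """
    failed = pd.Series(False, index=series.index)
    if limits["minimum"] is not None:
        failed = failed | (series < limits["minimum"])
    if limits["maximum"] is not None:
        failed = failed | (series > limits["maximum"])
    return series[failed].to_dict()


vehicle_age = pd.Series([3, -1, 45, 12], index=[10, 11, 12, 13])

min_only = out_of_range(vehicle_age, {"minimum": 0, "maximum": None})
min_and_max = out_of_range(vehicle_age, {"minimum": 0, "maximum": 40})

print(min_only)
print(min_and_max)
```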

#### Datetime checker dictionary

In [23]:
checker.datetime_values

{'date_logged': {'maximum': None, 'minimum': Timestamp('2021-01-01 00:19:21')},
 'first_harvest': {'maximum': None,
  'minimum': Timestamp('2020-01-08 00:00:00')}}

In [24]:
for column in datetime_columns:
    print(column, df_train[column].min())

date_logged 2021-01-01 00:19:21
first_harvest 2020-01-08 00:00:00


## Transform

The transform method runs the checks on a new dataframe and returns the dataframe that was just validated. Let's see what happens if it runs on a row of the training dataframe.

In [25]:
df_val = df_train.tail(1)

In [26]:
df_val_checked = checker.transform(df_val)

In [27]:
df_val_checked.equals(df_val)

True

No checks failed, so all failed-check dictionaries remain empty and no exception is raised. The original dataframe is returned.

### Failed checks

We will now check the test set.

In [28]:
df_test_checked = checker.transform(df_test)

InputCheckerError: Failed maximum value check for column: magnesium; Values above maximum: {95: 162.0}
Failed minimum value check for column: magnesium; Values below minimum: {67: 78.0, 66: 78.0, 89: 70.0}
Failed minimum value check for column: hue; Values below minimum: {153: 0.56}
Failed maximum value check for column: proline; Values above maximum: {18: 1680.0}


Some of the values in the numerical columns fall outside the minimum/maximum limits, so an exception is raised.

Let's try changing some other values and examine the InputCheckerError output.

In [29]:
df_test_mod = df_test.copy()

In [30]:
# Failed numerical check
df_test_mod.loc[10, 'proline'] = 10000

# Failed categorical check
df_test_mod.loc[13, 'pesticide'] = 2

# Failed type check
df_test_mod['flavanoids'] = df_test_mod['flavanoids'].apply(str)

# Failed null check
df_test_mod.loc[171, 'alcohol'] = np.nan

# Failed datetime check
df_test_mod.loc[43, 'first_harvest'] = pd.to_datetime('01/01/2017')

In [31]:
df_test_mod_checked = checker.transform(df_test_mod)

InputCheckerError: Failed null check for column: alcohol
Failed type check for column: flavanoids; Expected: float64, Found: object
Failed categorical check for column: pesticide; Unexpected values: [2]
Failed maximum value check for column: magnesium; Values above maximum: {95: 162.0}
Failed minimum value check for column: magnesium; Values below minimum: {67: 78.0, 66: 78.0, 89: 70.0}
Failed minimum value check for column: hue; Values below minimum: {153: 0.56}
Failed maximum value check for column: proline; Values above maximum: {18: 1680.0, 10: 10000.0}
Failed minimum value check for column: first_harvest; Values below minimum: {43: Timestamp('2017-01-01 00:00:00')}


Every type of check failed for the new dataset, and the results are reported in the exception message.

### Batch mode

Alternatively, the transform can be applied with `batch_mode` set to True. In this case, rows which fail the input_checker checks are separated from rows which pass and returned as separate dataframes. No exceptions are raised.

In [32]:
# change flavanoids field back to float, otherwise all rows will fail
df_test_mod['flavanoids'] = df_test_mod['flavanoids'].apply(float)

In [33]:
df_pass, df_fail = checker.transform(df_test_mod, batch_mode=True)

In [34]:
print(df_pass.shape)
print(df_fail.shape)

(26, 18)
(10, 19)


In [36]:
df_pass.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target,polytunnel,pesticide,date_logged,first_harvest
15,13.63,1.81,2.7,17.2,112.0,2.85,2.91,0.3,1.46,7.3,1.28,2.88,1310.0,0,1,systemic,2021-01-04 16:33:34,2020-07-22
165,13.73,4.36,2.26,22.5,88.0,1.28,0.47,0.52,1.15,6.62,0.78,1.75,520.0,2,1,contact,2021-01-05 01:46:52,2020-08-11
11,14.12,1.48,2.32,16.8,95.0,2.2,2.43,0.26,1.57,5.0,,2.82,1280.0,0,1,systemic,2021-01-04 15:29:09,2020-07-19
134,12.51,1.24,2.25,17.5,85.0,2.0,0.58,0.6,1.25,5.45,,1.51,650.0,2,1,systemic,2021-01-05 22:01:21,2020-09-25
33,13.76,1.53,2.7,19.5,,2.95,2.74,0.5,1.35,5.4,,3.0,1235.0,0,1,pre-emergence,2021-01-03 11:14:31,2020-05-18


In [37]:
df_fail.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target,polytunnel,pesticide,date_logged,first_harvest,failed_checks
13,14.75,1.73,2.39,11.4,91.0,3.1,3.69,0.43,2.81,5.4,,2.73,1150.0,0,1,2,2021-01-01 07:09:40,2020-01-23,Failed categorical check for column: pesticide...
67,12.37,1.17,1.92,19.6,78.0,2.11,2.0,0.27,1.04,4.68,,3.48,510.0,1,0,nonselective,2021-01-02 05:14:23,NaT,Failed minimum value check for column: magnesi...
95,12.47,1.52,2.2,19.0,162.0,2.5,2.27,0.32,3.28,2.6,,2.63,937.0,1,1,selective,2021-01-04 11:00:15,2020-07-09,Failed maximum value check for column: magnesi...
66,13.11,1.01,1.7,15.0,78.0,2.98,3.18,0.26,2.28,5.3,1.12,3.18,502.0,1,0,post-emergence,2021-01-04 19:38:30,2020-07-29,Failed minimum value check for column: magnesi...
153,13.23,3.3,2.28,18.5,98.0,1.8,0.83,0.61,1.87,10.52,0.56,1.51,675.0,2,1,,2021-01-02 12:46:34,2020-03-29,Failed minimum value check for column: hue; Va...


Rows in the failed dataframe have an extra column called 'failed_checks'. This contains information about why the input values in that row failed the input checks:

* Failed type checks - columns with the expected and actual types
* Failed null checks - columns which have nulls and are expected not to
* Failed categorical checks - categorical columns with unexpected values
* Failed numerical checks - numerical columns with values outside minimum and/or maximum
* Failed datetime checks - datetime columns with values outside minimum and/or maximum

In [38]:
for i in df_fail.index:
    print(i, df_fail['failed_checks'].loc[i])

13 Failed categorical check for column: pesticide. Unexpected values are [2]
Failed type check for column: pesticide; Expected: str, Found: int
67 Failed minimum value check for column: magnesium; Value below minimum: 78.0
95 Failed maximum value check for column: magnesium; Value above maximum: 162.0
66 Failed minimum value check for column: magnesium; Value below minimum: 78.0
153 Failed minimum value check for column: hue; Value below minimum: 0.56
89 Failed minimum value check for column: magnesium; Value below minimum: 70.0
18 Failed maximum value check for column: proline; Value above maximum: 1680.0
171 Failed null check for column: alcohol
10 Failed maximum value check for column: proline; Value above maximum: 10000.0
43 Failed minimum value check for column: first_harvest; Value below minimum: 2017-01-01
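The pass/fail split performed in batch mode can be approximated in plain pandas. The sketch below covers only minimum/maximum checks on a toy dataframe; `split_pass_fail` and its `limits` argument are hypothetical and the message wording merely imitates the output above, so it should be read as an illustration of the idea and not as the library's implementation.

```python
import pandas as pd


def split_pass_fail(df, limits):
    """Hypothetical batch-style split covering min/max checks only."""
    messages = {index: [] for index in df.index}
    for column, bounds in limits.items():
        for index, value in df[column].items():
            if pd.isna(value):
                continue  # nulls are left to a separate null check
            if value < bounds["minimum"]:
                messages[index].append(
                    f"Failed minimum value check for column: {column}; "
                    f"Value below minimum: {value}"
                )
            elif value > bounds["maximum"]:
                messages[index].append(
                    f"Failed maximum value check for column: {column}; "
                    f"Value above maximum: {value}"
                )
    failed_index = [index for index, found in messages.items() if found]
    df_fail = df.loc[failed_index].copy()
    df_fail["failed_checks"] = ["\n".join(messages[i]) for i in failed_index]
    df_pass = df.drop(index=failed_index)
    return df_pass, df_fail


df_example = pd.DataFrame(
    {"magnesium": [112.0, 78.0, 162.0], "proline": [1310.0, 510.0, 937.0]},
    index=[15, 67, 95],
)
limits_example = {
    "magnesium": {"minimum": 80.0, "maximum": 151.0},
    "proline": {"minimum": 278.0, "maximum": 1547.0},
}

df_pass_sketch, df_fail_sketch = split_pass_fail(df_example, limits_example)
print(df_pass_sketch.shape, df_fail_sketch.shape)
```

As in the real output, the failed dataframe has one more column than the passed dataframe because of 'failed_checks'.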
