# OutOfRangeNullTransformer
This notebook demonstrates the `OutOfRangeNullTransformer` class. It works in a similar way to the `CappingTransformer`, but instead of capping, it sets any value outside the 'cap range' to `null` rather than to the limit of that range. <br>
For more examples of how to specify the limits, refer to the `CappingTransformer` examples notebook, as `OutOfRangeNullTransformer` is configured in the same way.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.capping import OutOfRangeNullTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note that the load_boston script modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage
First we demonstrate the case where the user has predetermined the limits and codes these directly into the transformer.

### Initialising OutOfRangeNullTransformer

The `OutOfRangeNullTransformer` should be initialised (in the same way as the `CappingTransformer`) by specifying either a `capping_values` dict or a `quantiles` dict, but not both. <br>
Each must be a `dict` where each key is a column to apply the limits to and each value is a list of length 2, containing either `None` or numeric values:
- in the case of `capping_values`, the user directly specifies the lower and upper limits
- if `quantiles` is specified, the values in each list are the quantiles to use as limits (the corresponding values are determined when running the `fit` method)
- if a `None` value is present, the corresponding lower or upper limit is not applied. <br>

In the example below both lower and upper limits are set: any values above `0.66` or below `0.44` will be replaced with `np.NaN`. If `0.66` were replaced with `None`, only values below `0.44` would be set to null.

In [7]:
cap_1 = OutOfRangeNullTransformer(capping_values={'NOX': [0.44, 0.66]})
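
To make the lower/upper and `None` semantics concrete, here is a minimal pandas-only sketch that mimics the effect on a single column. It is not the tubular implementation: the helper `null_out_of_range` and the toy series are hypothetical, and it assumes values exactly equal to a limit are kept.

```python
import numpy as np
import pandas as pd


def null_out_of_range(values, lower=None, upper=None):
    """Return a copy of `values` with entries outside [lower, upper] set to NaN.

    Hypothetical helper for illustration only; a limit of None is not applied.
    """
    keep = pd.Series(True, index=values.index)
    if lower is not None:
        keep &= values >= lower
    if upper is not None:
        keep &= values <= upper
    # Series.where keeps entries where the mask is True and puts NaN elsewhere.
    return values.where(keep)


toy_nox = pd.Series([0.40, 0.44, 0.50, 0.66, 0.70, np.nan])

# Both limits applied: 0.40 and 0.70 become NaN, the existing NaN stays NaN.
both_bounds = null_out_of_range(toy_nox, lower=0.44, upper=0.66)

# Upper limit disabled with None: only 0.40 becomes NaN.
lower_only = null_out_of_range(toy_nox, lower=0.44, upper=None)
```

Comparing the null counts of `both_bounds` and `lower_only` shows how dropping one limit leaves values on that side untouched.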

### OutOfRangeNullTransformer fit
If the user specifies `capping_values`, there is no need to call the `fit` method; it is only required if the user specifies the `quantiles` argument when initialising the transformer. In fact, if `capping_values` was specified and `fit` is called, a warning will be generated.
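
The reason `fit` matters for `quantiles` is that the limits have to be learned from the data first. The following pandas-only sketch shows that two-step idea on a toy series; it is an assumption-laden illustration of the concept, not the transformer's actual code, and the variable names are made up.

```python
import pandas as pd

toy = pd.Series(range(1, 11), dtype="float64")  # 1.0, 2.0, ..., 10.0

# "Fit" step: derive the limits from the data (here the 10% and 90% quantiles).
lower, upper = toy.quantile([0.1, 0.9])

# "Transform" step: set anything outside the learned limits to NaN.
result = toy.where(toy.between(lower, upper))
```

With pandas' default linear interpolation the learned limits fall between the two smallest and the two largest values, so only the minimum and maximum of `toy` end up null.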

### OutOfRangeNullTransformer transform
Like the `CappingTransformer`, if the transformer is applied to non-numeric columns an exception will be raised. <br>
Below we demonstrate values outside the set range of `[0.44, 0.66]` being set to null.

In [8]:
boston_df['NOX'].quantile([0, 0.2, 0.8, 1.0])

0.0    0.38500
0.2    0.44218
0.8    0.66620
1.0    0.87100
Name: NOX, dtype: float64

In [9]:
((boston_df['NOX'] < 0.44) | (boston_df['NOX'] > 0.66)).sum()

183

In [10]:
boston_df['NOX'].isnull().sum()

44

In [11]:
boston_df_2 = cap_1.transform(boston_df)

In [12]:
boston_df_2['NOX'].quantile([0, 1.0])

0.0    0.442
1.0    0.659
Name: NOX, dtype: float64

In [13]:
boston_df_2['NOX'].isnull().sum()

227

In [14]:
((boston_df_2['NOX'] < 0.44) | (boston_df_2['NOX'] > 0.66)).sum()

0
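
The counts above fit together: the 227 nulls after the transform are the 44 nulls already present plus the 183 out-of-range values. The same bookkeeping can be sketched on a toy series with plain pandas (the data and variable names here are made up for illustration):

```python
import numpy as np
import pandas as pd

before = pd.Series([0.40, 0.45, np.nan, 0.60, 0.70, np.nan, 0.90])
lower, upper = 0.44, 0.66

# Comparisons with NaN are False, so existing nulls are not counted here.
out_of_range = ((before < lower) | (before > upper)).sum()
nulls_before = before.isnull().sum()

# Mimic the transform: keep in-range values, set everything else to NaN.
after = before.where(before.between(lower, upper))
nulls_after = after.isnull().sum()

# Nulls after = nulls before + values that were out of range.
assert nulls_after == nulls_before + out_of_range
```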