# CappingTransformer
This notebook shows the functionality of the `CappingTransformer` class. This transformer caps numeric columns at a minimum value, a maximum value, or both. <br>

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.capping import CappingTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note, the `prepare_boston_df` function modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

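|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | ZN_cat | CHAS_cat | RAD_cat |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|--------|--------|----------|---------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 |  | 4.09 |  | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 | 18.0 | 0.0 |  |
| 1 | 0.02731 |  | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |  | 0.0 | 2.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 |  | 17.8 | 392.83 | 4.03 | 34.7 | 0.0 | 0.0 | 2.0 |
| 3 |  |  | 2.18 | 0.0 | 0.458 |  | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 |  |  | 33.4 |  | 0.0 | 3.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 |  |  | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 | 0.0 | 0.0 | 3.0 |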


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage
First we will demonstrate the case where the user has predetermined the values to cap at, and codes these directly into the transformer.

### Initialising CappingTransformer

The `CappingTransformer` should be initialised by specifying either a `capping_values` dict or a `quantiles` dict, but not both. <br>
Either argument must be a `dict` where each key is a column to apply capping to and each value is a list of length 2, containing either `None` or numeric values:
- in the case of `capping_values`, the user directly specifies the lower and upper values to cap at
- if `quantiles` is specified, the values in each list should be the quantiles to cap at (the corresponding capping values will be determined when running the `fit` method)
- if a `None` value is present then the relevant min or max capping is not applied. <br>

In the example below both min and max capping will be applied. If `0.66` were replaced with `None` then only min capping would be applied.

In [7]:
cap_1 = CappingTransformer(capping_values = {'NOX': [0.44, 0.66]})
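
To make the `[min, max]` semantics concrete, the sketch below applies the same pair of bounds with plain pandas. It is only an illustration of what capping means, using made-up values in place of the real `NOX` column; it is not a claim about how `CappingTransformer` is implemented internally.

```python
import pandas as pd

# Toy values standing in for the NOX column (made up for illustration).
nox = pd.Series([0.385, 0.44, 0.538, 0.66, 0.871], name="NOX")

lower, upper = 0.44, 0.66  # the same [min, max] pair passed to cap_1

# Values below `lower` are raised to it; values above `upper` are lowered to it.
capped = nox.clip(lower=lower, upper=upper)

print(capped.tolist())  # [0.44, 0.44, 0.538, 0.66, 0.66]
```

Passing `None` for either bound to `Series.clip` leaves that side uncapped, which mirrors the `None` convention described above.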

### CappingTransformer fit
If the user specifies `capping_values` then there is no need to use the `fit` method; it is only required if the user specifies the `quantiles` argument when initialising the transformer. In fact, if `capping_values` was specified and `fit` is called, a warning will be generated.

### CappingTransformer transform
Note, if the transformer is applied to non-numeric columns then an exception will be raised.

In [8]:
boston_df['NOX'].quantile([0, 0.2, 0.8, 1.0])

0.0    0.38500
0.2    0.44218
0.8    0.66620
1.0    0.87100
Name: NOX, dtype: float64

In [9]:
boston_df_2 = cap_1.transform(boston_df)

In [10]:
boston_df_2['NOX'].quantile([0, 1.0])

0.0    0.44
1.0    0.66
Name: NOX, dtype: float64

## Transform with nulls
Nulls are not imputed in the transform method. There are other transformers in the package to impute null values.

In [11]:
cap_2 = CappingTransformer({'CRIM': [0.44, 0.66]})

In [12]:
boston_df['CRIM'].isnull().sum()

55

In [13]:
boston_df_3 = cap_2.transform(boston_df)

In [14]:
boston_df_3['CRIM'].isnull().sum()

55
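
The same behaviour can be reproduced on a small scale with pandas alone. This is a sketch on made-up data, assuming that capping behaves like `Series.clip`, which leaves missing values untouched.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made up for illustration).
crim = pd.Series([0.00632, np.nan, 0.5, np.nan, 88.9762], name="CRIM")

capped = crim.clip(lower=0.44, upper=0.66)

# The null count is unchanged: only non-null values are capped.
nulls_before = int(crim.isnull().sum())
nulls_after = int(capped.isnull().sum())
print(nulls_before, nulls_after)  # 2 2
```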

## Transforming multiple columns
The specific capping values are applied to each column. The user has the option to set different combinations of min and max for each column.

In [15]:
cap_3 = CappingTransformer(capping_values = {'INDUS': [None, 4], 'RM': [None, 3]})

In [16]:
boston_df[['INDUS', 'RM']].max()

INDUS    27.74
RM        8.78
dtype: float64

In [17]:
boston_df_4 = cap_3.transform(boston_df)

In [18]:
boston_df_4[['INDUS', 'RM']].max()

INDUS    4.0
RM       3.0
dtype: float64
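
As a rough illustration of per-column bounds, the sketch below loops over a dict with the same structure as `capping_values` and clips each column with pandas. The data and the bounds are made up, and the loop is an assumption about the general idea rather than the transformer's actual code.

```python
import pandas as pd

# Toy frame (values made up for illustration).
df = pd.DataFrame(
    {
        "INDUS": [0.46, 2.31, 18.1, 27.74],
        "RM": [3.561, 5.593, 6.575, 8.78],
    }
)

# Max-only capping for INDUS, min-only capping for RM.
capping_values = {"INDUS": [None, 4], "RM": [5, None]}

capped = df.copy()
for column, (lower, upper) in capping_values.items():
    # A bound of None means that side is left uncapped.
    capped[column] = capped[column].clip(lower=lower, upper=upper)

print(capped["INDUS"].tolist())  # [0.46, 2.31, 4.0, 4.0]
print(capped["RM"].tolist())     # [5.0, 5.593, 6.575, 8.78]
```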

## Capping at quantiles
If the user does not want to specify capping values up front, but rather set them at quantiles of the data, then the `quantiles` argument can be used instead of `capping_values` when initialising the transformer. <br>
This takes the same structure as `capping_values` but the user specifies quantiles to cap at instead of the values directly. <br>
The `fit` method of the transformer calculates the requested quantiles for each column. If the user also specified `weights_column` when initialising the transformer then weighted quantiles will be calculated using that column in `X`. <br>
The user must run `fit` before running `transform` when using quantiles, otherwise an exception will be raised.

In [19]:
cap_4 = CappingTransformer(quantiles = {'INDUS': [None, 0.8], 'RM': [0.1, None], 'CRIM': [0.2, 0.8]})

In [20]:
cap_4.fit(boston_df)

CappingTransformer(capping_values={'CRIM': [0.065938, 5.6493100000000025],
                                   'INDUS': [None, 18.1], 'RM': [5.593, None]},
                   quantiles={'CRIM': [0.2, 0.8], 'INDUS': [None, 0.8],
                              'RM': [0.1, None]},
                   weights_column=None)

In [21]:
boston_df[['INDUS', 'RM', 'CRIM']].min()

INDUS    0.46000
RM       3.56100
CRIM     0.00632
dtype: float64

In [22]:
boston_df[['INDUS', 'RM', 'CRIM']].max()

INDUS    27.7400
RM        8.7800
CRIM     88.9762
dtype: float64

In [23]:
boston_df_5 = cap_4.transform(boston_df)

In [24]:
boston_df_5[['INDUS', 'RM', 'CRIM']].min()

INDUS    0.460000
RM       5.593000
CRIM     0.065938
dtype: float64

In [25]:
boston_df_5[['INDUS', 'RM', 'CRIM']].max()

INDUS    18.10000
RM        8.78000
CRIM      5.64931
dtype: float64
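
The fit-then-transform pattern shown above can be sketched with a small stand-in class. `QuantileCapper` is a hypothetical name introduced here for illustration, and it relies on `Series.quantile` with pandas' default linear interpolation, which may not match how tubular computes its quantiles.

```python
import pandas as pd


class QuantileCapper:
    """Hypothetical stand-in for quantile-based capping (not part of tubular)."""

    def __init__(self, quantiles):
        self.quantiles = quantiles
        self.capping_values = None  # learned in fit

    def fit(self, X):
        # Turn each requested quantile into a concrete capping value.
        self.capping_values = {
            column: [None if q is None else float(X[column].quantile(q)) for q in pair]
            for column, pair in self.quantiles.items()
        }
        return self

    def transform(self, X):
        if self.capping_values is None:
            raise RuntimeError("fit must be called before transform")
        X = X.copy()
        for column, (lower, upper) in self.capping_values.items():
            X[column] = X[column].clip(lower=lower, upper=upper)
        return X


df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
capper = QuantileCapper({"x": [0.25, 0.75]}).fit(df)
capped = capper.transform(df)

print(capper.capping_values)  # {'x': [1.0, 3.0]}
print(capped["x"].tolist())   # [1.0, 1.0, 2.0, 3.0, 3.0]
```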

## Weighted quantiles
The user can also specify the `weights_column` argument when initialising the transformer, so that `fit` will use the column in the input data `X` when calculating the quantiles.

### Equal weights
First we will use an equal weight column and demonstrate we can recover the same quantiles from `cap_4` where no `weights_column` was specified.

In [26]:
boston_df['weights'] = 1

In [27]:
cap_5 = CappingTransformer(quantiles = {'CRIM': [0.2, 0.8]}, weights_column = 'weights')

In [28]:
cap_5.fit(boston_df)

CappingTransformer(capping_values={'CRIM': [0.065938, 5.6493100000000025]},
                   quantiles={'CRIM': [0.2, 0.8]}, weights_column='weights')

In [29]:
cap_5.capping_values['CRIM'] == cap_4.capping_values['CRIM']

True

### Non-equal weights
Next we will set weights to zero for values that are beyond the `[0.2, 0.8]` quantiles identified previously, and check that the `[0, 1]` weighted quantiles identified are the min and max, where the weight is > 0.

In [30]:
boston_df['weights2'] = 1
boston_df.loc[boston_df['CRIM'] < 0.065938, 'weights2'] = 0
boston_df.loc[boston_df['CRIM'] > 5.6493100000000025, 'weights2'] = 0

In [31]:
cap_6 = CappingTransformer(quantiles = {'CRIM': [0.0, 1.0]}, weights_column = 'weights2')

In [32]:
cap_6.fit(boston_df)

CappingTransformer(capping_values={'CRIM': [0.06617, 5.58107]},
                   quantiles={'CRIM': [0.0, 1.0]}, weights_column='weights2')

In [33]:
cap_6.capping_values['CRIM']

[0.06617, 5.58107]

In [34]:
boston_df.loc[boston_df['weights2'] > 0, 'CRIM'].min(), boston_df.loc[boston_df['weights2'] > 0, 'CRIM'].max()

(0.06617, 5.58107)
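
One simple way to think about a weighted quantile is as the smallest value whose cumulative share of the total weight reaches the requested level. The sketch below implements that definition with NumPy; `weighted_quantile` here is a hypothetical helper written for this illustration, it does not interpolate between values, and tubular's own calculation may differ. It does reproduce the property checked above: rows with zero weight are ignored, so the `[0, 1]` quantiles are the min and max of the positively weighted rows.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative share of the total weight reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Rows with zero weight (or a null value) cannot influence the result.
    keep = (weights > 0) & ~np.isnan(values)
    values, weights = values[keep], weights[keep]
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights) / weights.sum()
    # First position where the cumulative weight share is >= q.
    index = min(int(np.searchsorted(cumulative, q, side="left")), len(values) - 1)
    return float(values[index])


values = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [0, 1, 1, 1, 0]  # the extremes are given zero weight

low = weighted_quantile(values, weights, 0.0)
high = weighted_quantile(values, weights, 1.0)
print(low, high)  # 2.0 4.0
```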

### Weights column restrictions
The weights column can have 0 values in it, but it cannot have `null` values, `np.Inf`, `-np.Inf` or negative values.

In [35]:
cap_7 = CappingTransformer(quantiles = {'CRIM': [0.1, 0.8]}, weights_column = 'RM')

In [36]:
# RM contains nulls, which are not allowed in a weights column, so fill them with 0
boston_df['RM'] = boston_df['RM'].fillna(0)

In [37]:
cap_7.fit(boston_df)

CappingTransformer(capping_values={'CRIM': [0.03682051473851029,
                                            5.658848128316456]},
                   quantiles={'CRIM': [0.1, 0.8]}, weights_column='RM')

In [38]:
cap_7.capping_values['CRIM']

[0.03682051473851029, 5.658848128316456]
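
The restrictions on the weights column can be checked up front before calling `fit`. The helper below is a hypothetical pre-check written for this notebook, not tubular's own validation, and the error messages are illustrative only.

```python
import numpy as np
import pandas as pd


def check_weights(weights: pd.Series) -> None:
    """Raise ValueError if the column is unsuitable as a weights column."""
    if weights.isnull().any():
        raise ValueError("weights column contains null values")
    if np.isinf(weights).any():
        raise ValueError("weights column contains infinite values")
    if (weights < 0).any():
        raise ValueError("weights column contains negative values")


# Zero weights are allowed, so this passes silently.
check_weights(pd.Series([0.0, 1.0, 2.5]))

# Nulls are rejected until they are filled.
raw = pd.Series([6.575, np.nan, 7.185])
try:
    check_weights(raw)
except ValueError as err:
    print(err)  # weights column contains null values

check_weights(raw.fillna(0))
```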