# CutTransformer
This notebook demonstrates the `CutTransformer` class. This transformer discretises a numeric column, i.e. it bins the column into discrete intervals. `CutTransformer` uses the [pd.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) method.
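
As a quick illustration of what binning with `pd.cut` looks like on its own, here is a minimal sketch on a toy series. The values and bin edges are made up for illustration and are not taken from the Boston data.

```python
import pandas as pd

# Toy numeric column (hypothetical values, for illustration only)
ages = pd.Series([3, 17, 25, 41, 68])

# Explicit bin edges give the right-closed intervals (0, 18], (18, 40], (40, 100]
binned = pd.cut(ages, bins=[0, 18, 40, 100])

# The result is a categorical series whose categories are the intervals
print(binned)
print(binned.cat.codes.tolist())
```

Each value is replaced by the interval it falls into, so the binned column has a categorical dtype rather than a numeric one.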

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.numeric import CutTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note, the load_boston script modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising CutTransformer

The user must specify the following:
- `column` giving the column to bin
- `new_column_name` giving the name of the column to assign the converted column to
- `cut_kwargs` a dictionary of keyword arguments that will be passed to `pd.cut` when it is called in the `transform` method. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for all the arguments that `pd.cut` can accept.

In [7]:
cut_1 = CutTransformer(
    column='CRIM',
    new_column_name='CRIM_cut',
    cut_kwargs={'bins': 10, 'precision': 5},
)
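
To make the role of `cut_kwargs` concrete, the sketch below forwards the same dictionary to `pd.cut` directly using keyword unpacking. It is a rough stand-in for what happens in `transform`, under the assumption that the dictionary is passed through unchanged; it is not the library's actual implementation, and the toy DataFrame is hypothetical.

```python
import pandas as pd

# Toy stand-in for the CRIM column (hypothetical values)
df = pd.DataFrame({"CRIM": [0.00632, 0.02731, 8.5, 40.0, 88.9762]})

# Same keyword arguments as given to the transformer above
cut_kwargs = {"bins": 10, "precision": 5}

# Assumed equivalent of the transform step: unpack the dict into pd.cut
df["CRIM_cut"] = pd.cut(df["CRIM"], **cut_kwargs)

print(df)
```

Passing an integer for `bins` asks `pd.cut` for that many equal-width bins over the range of the column, and `precision` controls how many decimal places are used in the interval labels.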

### CutTransformer fit
`CutTransformer` has no `fit` method; nothing is learnt from the input data `X`.

### CutTransformer transform
Note, if the transformer is applied to non-numeric columns then an exception will be raised.
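
If it helps to catch this before calling `transform`, a column's dtype can be checked up front. The helper below is a hypothetical pre-check written for this notebook (the name `check_numeric` and the choice of `TypeError` are assumptions); it is not part of `tubular` and may not match the exception the transformer itself raises.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numeric(df, column):
    """Hypothetical guard: raise if the column does not have a numeric dtype."""
    if not is_numeric_dtype(df[column]):
        raise TypeError(f"column {column!r} has non-numeric dtype {df[column].dtype}")


# Toy frame: one float column and one object column (illustrative values)
toy = pd.DataFrame({"CRIM": [0.1, 0.2], "ZN": ["18.0", "0.0"]})

check_numeric(toy, "CRIM")  # passes silently

try:
    check_numeric(toy, "ZN")
except TypeError as err:
    print(err)
```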

In [8]:
boston_df['CRIM'].isnull().sum()

55

In [9]:
boston_df_2 = cut_1.transform(boston_df)

In [10]:
boston_df_2['CRIM_cut'].value_counts(dropna = False).sort_index()

(-0.08265, 8.90331]     390
(8.90331, 17.8003]       38
(17.8003, 26.69728]      14
(26.69728, 35.59427]      1
(35.59427, 44.49126]      3
(44.49126, 53.38825]      2
(53.38825, 62.28524]      0
(62.28524, 71.18222]      1
(71.18222, 80.07921]      1
(80.07921, 88.9762]       1
NaN                      55
Name: CRIM_cut, dtype: int64

Note, `pd.cut` extends the range of `x` by 0.1% to ensure the min and max values are captured, hence the negative lower edge of the lowest bin. See the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for more details.

In [11]:
boston_df['CRIM'].quantile([0, 1.0])

0.0     0.00632
1.0    88.97620
Name: CRIM, dtype: float64
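
The bin edges shown earlier can be reproduced by hand from these two values. The sketch below assumes the 0.1% range adjustment described in the `pd.cut` documentation and applies it to the lowest edge only, which is what the output above shows. It uses plain arithmetic rather than calling pandas.

```python
# Min and max of CRIM, taken from the quantile output above
crim_min = 0.00632
crim_max = 88.9762
n_bins = 10

# Equal-width edges between min and max
width = (crim_max - crim_min) / n_bins
edges = [crim_min + i * width for i in range(n_bins + 1)]

# Assumed adjustment: shift the lowest edge down by 0.1% of the range
# so that the minimum value falls inside the first right-closed interval
edges[0] = crim_min - 0.001 * (crim_max - crim_min)

print(round(edges[0], 5))  # lower edge of the first bin
print(round(edges[1], 5))  # upper edge of the first bin
```

Rounded to five decimal places, these match the first interval `(-0.08265, 8.90331]` in the `value_counts` output.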