# CutTransformer
This notebook demonstrates the `CutTransformer` class. The transformer discretises a numeric column, i.e. it bins the column's values into discrete intervals. `CutTransformer` uses the [pd.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

In [2]:
import tubular
from tubular.numeric import CutTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load California housing dataset from sklearn

In [4]:
cali = fetch_california_housing()
cali_df = pd.DataFrame(cali["data"], columns=cali["feature_names"])
cali_df["Population"] = cali_df["Population"].sample(frac=0.995, random_state=123)

In [5]:
cali_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [6]:
cali_df.dtypes

MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object

## Simple usage

### Initialising CutTransformer

The user must specify the following:
- `column` giving the name of the column to bin
- `new_column_name` giving the name of the column that will hold the binned values
- `cut_kwargs` a dictionary of keyword arguments that will be passed to `pd.cut` when it is called in the `transform` method. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for all the arguments that `pd.cut` can accept.

In [7]:
cut_1 = CutTransformer(
    column="Population",
    new_column_name="Population_cut",
    cut_kwargs={"bins": 10, "precision": 5},
)
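To make the role of `cut_kwargs` more concrete, the sketch below forwards a dictionary of keyword arguments straight to `pd.cut` using plain pandas. It is only an approximation of what the transform step is described as doing, not the library's implementation. The toy values, the explicit bin edges and the labels are made up for illustration.

```python
import pandas as pd

# Toy data: the column name mirrors the notebook, the values are invented.
toy_df = pd.DataFrame({"Population": [3.0, 250.0, 1200.0, 4000.0, None]})

# Same kind of dictionary as the one passed to CutTransformer.
# Explicit bin edges and labels are used here instead of a bin count.
cut_kwargs = {"bins": [0, 500, 2000, 5000], "labels": ["small", "medium", "large"]}

# Assumed equivalent of the transform step: unpack the kwargs into pd.cut
# and assign the result to the new column.
toy_df["Population_cut"] = pd.cut(toy_df["Population"], **cut_kwargs)

print(toy_df)
```

Any argument that `pd.cut` accepts, such as `right` or `include_lowest`, can be supplied through the dictionary in the same way. Missing values in the input stay missing in the output.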

### CutTransformer fit
`CutTransformer` has no `fit` method; nothing is learnt from the input data `X`.

### CutTransformer transform
Note, if the transformer is applied to non-numeric columns then an exception will be raised.
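As a rough illustration of this kind of guard, the sketch below checks a column's dtype before binning. The helper `check_numeric` is hypothetical. The choice of `TypeError` is also an assumption, since the notebook does not say which exception the transformer raises.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numeric(df, column):
    """Hypothetical helper: raise if `column` does not have a numeric dtype."""
    if not is_numeric_dtype(df[column]):
        # TypeError is an assumption made for this sketch.
        raise TypeError(
            f"{column} should be a numeric dtype but got {df[column].dtype}"
        )


toy_df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

check_numeric(toy_df, "a")  # numeric column: passes silently

try:
    check_numeric(toy_df, "b")  # string column: rejected
    raised = False
except TypeError as err:
    raised = True
    print(err)
```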

In [8]:
cali_df["Population"].isnull().sum()

103

In [9]:
cali_df_2 = cut_1.transform(cali_df)

In [10]:
cali_df_2["Population_cut"].value_counts(dropna=False).sort_index()

(-32.679, 3570.9]     19716
(3570.9, 7138.8]        709
(7138.8, 10706.7]        94
(10706.7, 14274.6]       12
(14274.6, 17842.5]        4
(17842.5, 21410.4]        0
(21410.4, 24978.3]        0
(24978.3, 28546.2]        0
(28546.2, 32114.1]        1
(32114.1, 35682.0]        1
NaN                     103
Name: Population_cut, dtype: int64

Note, `pd.cut` extends the range of `x` to ensure the min and max values are captured. Hence the negative value in the lowest bin. See the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) for more details.

In [11]:
cali_df["Population"].quantile([0, 1.0])

0.0        3.0
1.0    35682.0
Name: Population, dtype: float64
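The negative lower edge can be reproduced by hand. The `pd.cut` documentation states that, when `bins` is an integer, the range of `x` is extended by 0.1% to include the minimum and maximum. With the default `right=True` this shows up on the lowest edge: 3 - 0.001 × (35682 - 3) = -32.679. The sketch below checks this on a made-up three-value series that shares the column's min and max. It assumes the installed pandas version applies the adjustment in this way.

```python
import pandas as pd

# Three values spanning the same min and max as the Population column.
x = pd.Series([3.0, 1000.0, 35682.0])

# Same keyword arguments as in the transformer above.
binned = pd.cut(x, bins=10, precision=5)

# The categories are an IntervalIndex; take the first (lowest) interval.
lowest_bin = binned.cat.categories[0]

# 0.1% of the range, subtracted from the minimum.
expected_left = x.min() - 0.001 * (x.max() - x.min())

print(lowest_bin.left, expected_left)
```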