# MeanResponseTransformer
This notebook demonstrates the `MeanResponseTransformer` class. The transformer applies mean response encoding: each level of a categorical column is replaced with the average value of the response (target) over the rows that have that level.
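
As a rough illustration of the idea, the following pandas-only sketch encodes a toy column by hand. The frame `toy` and its column names are made up for this example, and the code is not tubular's implementation, only the arithmetic that mean response encoding refers to.

```python
import pandas as pd

# Toy data: one categorical column and a numeric response (illustrative values).
toy = pd.DataFrame(
    {
        "colour": ["red", "red", "blue", "blue", "blue"],
        "response": [10.0, 20.0, 1.0, 2.0, 3.0],
    }
)

# "Fit": mean of the response for each categorical level.
level_means = toy.groupby("colour")["response"].mean().to_dict()

# "Transform": replace each level with its mean response.
toy["colour_encoded"] = toy["colour"].map(level_means)

print(level_means)
print(toy)
```

Here `red` becomes the mean of 10 and 20, and `blue` becomes the mean of 1, 2 and 3.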


In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_diabetes

In [2]:
import tubular
from tubular.nominal import MeanResponseTransformer

In [3]:
tubular.__version__

'0.3.4'

## Load diabetes dataset from sklearn
We also create a categorical column from `bmi` and treat it as unordered for demonstration purposes in this notebook.

In [4]:
diabetes, target = load_diabetes(return_X_y=True, as_frame=True)

In [5]:
diabetes["bmi_cut"] = pd.cut(diabetes["bmi"], bins=20)

In [6]:
diabetes["target"] = target

In [7]:
diabetes.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,bmi_cut,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,"(0.0532, 0.0662]",151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,"(-0.0642, -0.0512]",75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,"(0.0401, 0.0532]",141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,"(-0.012, 0.00102]",206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,"(-0.0381, -0.0251]",135.0


## Simple usage

### Initialising MeanResponseTransformer
The response column must not contain nulls; otherwise an exception is raised.
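
One way to catch this early is to count nulls in the response before fitting. The snippet below is a plain pandas check on a made-up series, not something tubular provides.

```python
import numpy as np
import pandas as pd

# Hypothetical response with one missing value.
response = pd.Series([1.0, np.nan, 3.0], name="target")

# Count nulls up front so the problem is visible before fitting.
n_null = int(response.isnull().sum())
if n_null > 0:
    print(f"response has {n_null} null value(s); drop or fill them before fitting")
```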

For a binary or continuous response, leave the `level` parameter as `None` (default).

In [8]:
mre_1 = MeanResponseTransformer(columns="bmi_cut", copy=True, verbose=True)

BaseTransformer.__init__() called


### MeanResponseTransformer fit
The `fit` method calculates the average response value for each level; it must be run before the `transform` method. The first parameter of `fit` is the input dataframe and the second is the target/response column. <br>
The mappings are stored in an attribute called `mappings`.

In [9]:
mre_1.fit(diabetes, diabetes["target"])

BaseTransformer.fit() called


In [10]:
pprint(mre_1.mappings)

{'bmi_cut': {Interval(-0.0905, -0.0772, closed='right'): 95.1,
             Interval(-0.0772, -0.0642, closed='right'): 92.9090909090909,
             Interval(-0.0642, -0.0512, closed='right'): 96.39285714285714,
             Interval(-0.0512, -0.0381, closed='right'): 108.52631578947368,
             Interval(-0.0381, -0.0251, closed='right'): 117.28571428571429,
             Interval(-0.0251, -0.012, closed='right'): 127.38775510204081,
             Interval(-0.012, 0.00102, closed='right'): 142.82692307692307,
             Interval(0.00102, 0.0141, closed='right'): 154.6315789473684,
             Interval(0.0141, 0.0271, closed='right'): 194.63888888888889,
             Interval(0.0271, 0.0401, closed='right'): 194.64,
             Interval(0.0401, 0.0532, closed='right'): 181.36,
             Interval(0.0532, 0.0662, closed='right'): 195.07142857142858,
             Interval(0.0662, 0.0793, closed='right'): 215.75,
             Interval(0.0793, 0.0923, closed='right'): 265.2857142

### MeanResponseTransformer transform

In [11]:
diabetes_2 = mre_1.transform(diabetes)

BaseTransformer.transform() called


In [12]:
diabetes_2["bmi_cut"].value_counts(dropna=False)

142.826923    52
127.387755    49
117.285714    49
154.631579    38
108.526316    38
194.638889    36
96.392857     28
195.071429    28
194.640000    25
181.360000    25
92.909091     22
215.750000    16
95.100000     10
234.888889     9
265.285714     7
297.250000     4
277.000000     3
294.000000     2
233.000000     1
Name: bmi_cut, dtype: int64

## Transform with nulls
Null values are not handled by the `MeanResponseTransformer`: as the example below shows, `transform` raises a `ValueError` when the column contains nulls. Other transformers in the package can be used to impute nulls first.
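
The sketch below shows one possible imputation step in plain pandas: nulls get their own placeholder level before the means are computed. The label `"missing"` and the toy frame are assumptions for this example, and `fillna` stands in for whichever imputation transformer is actually used.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "level": ["a", np.nan, "b", "b"],
        "response": [1.0, 2.0, 3.0, 5.0],
    }
)

# Give nulls their own placeholder level ("missing" is an arbitrary label).
toy["level_filled"] = toy["level"].fillna("missing")

# Mean response per level, now including the placeholder level.
level_means = toy.groupby("level_filled")["response"].mean()

# Every row now has a level that exists in the mapping, so no nulls remain.
toy["level_encoded"] = toy["level_filled"].map(level_means)
print(toy)
```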

In [13]:
diabetes["bmi_cut_str"] = diabetes["bmi_cut"].astype(str)

In [14]:
diabetes.loc[0, "bmi_cut_str"] = np.nan

In [15]:
diabetes["bmi_cut_str"].isnull().sum()

1

In [16]:
mre_2 = MeanResponseTransformer(columns=["bmi_cut_str"], copy=True, verbose=True)

BaseTransformer.__init__() called


In [17]:
mre_2.fit(diabetes, diabetes["target"])

BaseTransformer.fit() called


In [18]:
try:
    mre_2.transform(diabetes)
except Exception as err:
    print(type(err), err)

<class 'ValueError'> MeanResponseTransformer: nulls would be introduced into column bmi_cut_str from levels not present in mapping


## Weights column
A weights column can be specified with the `weights_column` argument when initialising the transformer. <br>
`fit` then calculates a weighted mean of the response for each level.
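
For intuition, a weighted mean per level can be sketched in pandas as sum(weight × response) / sum(weight). This assumes the usual definition of a weighted mean; the toy frame is made up and the code is not taken from tubular.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "level": ["a", "a", "b", "b"],
        "response": [1.0, 3.0, 10.0, 20.0],
        "weights": [1.0, 3.0, 1.0, 1.0],
    }
)

# Numerator: sum of weight * response within each level.
weighted_response = toy["response"] * toy["weights"]
numerator = weighted_response.groupby(toy["level"]).sum()

# Denominator: sum of weights within each level.
denominator = toy["weights"].groupby(toy["level"]).sum()

weighted_means = (numerator / denominator).to_dict()
print(weighted_means)
```

Level `a` gives (1·1 + 3·3) / (1 + 3) = 2.5, while level `b` has equal weights and so reduces to the plain mean of 15.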

In [19]:
diabetes["weights"] = diabetes["bp"].abs()

In [20]:
mre_3 = MeanResponseTransformer(columns="bmi_cut", weights_column="weights")

In [21]:
mre_3.fit(diabetes, diabetes["target"])

In [22]:
diabetes_4 = mre_3.transform(diabetes)

In [23]:
diabetes_4["bmi_cut"].value_counts(dropna=False)

152.094173    52
138.093321    49
121.853416    49
147.126566    38
107.557094    38
193.381536    36
92.294201     28
214.058486    28
212.598650    25
193.878729    25
88.381850     22
205.983233    16
91.181897     10
251.926906     9
271.976121     7
291.866679     4
289.963993     3
320.849412     2
233.000000     1
Name: bmi_cut, dtype: int64

## Multi-level response

Use of the `MeanResponseTransformer` with a multi-level response is controlled by the `level` parameter, which defaults to `None`.

A mean response encoded column is created for each selected level of the response. To encode against a subset of levels, pass a list of those levels; to encode against every level in the response, set `level` to `'all'`.

Note that any weights or prior are applied to the encoding against each response level.


In [24]:
data = pd.DataFrame(
    {
        "column_1": ["a", "b", "a", "a"],
        "column_2": ["d", "d", "c", "c"],
        "column_3": ["yellow", "yellow", "blue", "green"],
    }
)

In [25]:
t = MeanResponseTransformer(columns=["column_1", "column_2"], level=["yellow", "blue"])

In [26]:
t = t.fit(data, data["column_3"])

In [27]:
t.transform(data)

Unnamed: 0,column_3,column_1_yellow,column_2_yellow,column_1_blue,column_2_blue
0,yellow,0.333333,1.0,0.333333,0.0
1,yellow,1.0,1.0,0.0,0.0
2,blue,0.333333,0.0,0.333333,0.5
3,green,0.333333,0.0,0.333333,0.5
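
The numbers in the table above are consistent with a one-vs-rest reading: for each response level, build a 0/1 indicator and take its mean per categorical level. The pandas sketch below reproduces the table under that assumption; it is not tubular's code.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "column_1": ["a", "b", "a", "a"],
        "column_2": ["d", "d", "c", "c"],
        "column_3": ["yellow", "yellow", "blue", "green"],
    }
)

encoded = data[["column_3"]].copy()
for response_level in ["yellow", "blue"]:
    # One-vs-rest indicator: 1.0 where the response equals this level, else 0.0.
    indicator = (data["column_3"] == response_level).astype(float)
    for column in ["column_1", "column_2"]:
        # Mean of the indicator for each level of the categorical column.
        level_means = indicator.groupby(data[column]).mean()
        encoded[f"{column}_{response_level}"] = data[column].map(level_means)

print(encoded)
```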


## Unseen level handling

The `MeanResponseTransformer` also supports unseen level handling, i.e. handling values/categories that appear in the data passed to `transform` but not in the data the transformer was fit on. This is controlled by the `unseen_level_handling` parameter, which defaults to `None`. The following options are available:
1. **Mean**: Use the mean of the encoded values of the column to replace unseen levels of that column in the data to transform
2. **Median**: Use the median of the encoded values of the column to replace unseen levels of that column in the data to transform
3. **Highest**: Use the highest encoded value of the column to replace unseen levels of that column in the data to transform
4. **Lowest**: Use the lowest encoded value of the column to replace unseen levels of that column in the data to transform
5. **Arbitrary int/float**: Use the int/float value passed by the user to replace all unseen levels in the data to transform

**Note** that with the default value of `None`, an error is raised if `transform` is passed unseen levels.

**Example of unseen level handling:**

The example below shows how the `unseen_level_handling` parameter can be used to replace unseen levels in data with a single-level response. The same approach also works for data with a multi-level response.

In [28]:
# data to fit the MeanResponseTransformer on
df = pd.DataFrame(
    {
        "Col1": ["s1", "s1", "s1", "s1", "s4", "s3", "s2", "s1", "s2", "s4", "s1"],
        "Col2": ["A", "A", "B", "A", "C", "C", "A", "C", "C", "B", "A"],
        "Target": [1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    }
)

In [29]:
print(df)

   Col1 Col2  Target
0    s1    A       1
1    s1    A       0
2    s1    B       1
3    s1    A       1
4    s4    C       1
5    s3    C       0
6    s2    A       0
7    s1    C       1
8    s2    C       1
9    s4    B       1
10   s1    A       0


In [30]:
# data with unseen levels to use for transform
df1 = pd.DataFrame(
    {
        "Col1": [
            "s1",
            "s2",
            "s3",
            "s6",
            "s4",
            "s3",
            "s6",
            "s1",
            "s2",
            "s4",
            "s7",
            "s5",
        ],
        "Col2": ["A", "A", "B", "E", "C", "D", "A", "C", "F", "B", "A", "A"],
        "Target": [1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1],
    }
)

In [31]:
df1

Unnamed: 0,Col1,Col2,Target
0,s1,A,1
1,s2,A,0
2,s3,B,1
3,s6,E,1
4,s4,C,1
5,s3,D,0
6,s6,A,0
7,s1,C,1
8,s2,F,1
9,s4,B,1


In [32]:
# setting unseen_level_handling to 'Mean'. Can also use other options mentioned above
t = MeanResponseTransformer(columns=["Col1", "Col2"], unseen_level_handling="Mean")

In [33]:
t.fit(df, df["Target"])

In [34]:
# replacing unseen levels with mean of encoded values
t.transform(df1)

Unnamed: 0,Col1,Col2,Target
0,0.666667,0.4,1
1,0.5,0.4,0
2,0.0,1.0,1
3,0.636364,0.636364,1
4,1.0,0.75,1
5,0.0,0.636364,0
6,0.636364,0.4,0
7,0.666667,0.75,1
8,0.5,0.636364,1
9,1.0,1.0,1
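
The value 0.636364 used for the unseen levels above matches the mean of the encoded `Col1` values over the training rows (7/11). The pandas sketch below reproduces the first few `Col1` results under that reading; the names `train` and `new` are made up for this example, and the code is not tubular's implementation.

```python
import pandas as pd

# Training data: the Col1 and Target values used for fitting above.
train = pd.DataFrame(
    {
        "Col1": ["s1", "s1", "s1", "s1", "s4", "s3", "s2", "s1", "s2", "s4", "s1"],
        "Target": [1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    }
)

# New data containing a level ("s6") that was not seen during fitting.
new = pd.DataFrame({"Col1": ["s1", "s2", "s3", "s6", "s4"]})

# Fit: mean of the target for each level seen in the training data.
level_means = train.groupby("Col1")["Target"].mean()

# Fallback for unseen levels: mean of the encoded training column.
fallback = train["Col1"].map(level_means).mean()

# Transform: unseen levels map to NaN first, then receive the fallback.
new["Col1_encoded"] = new["Col1"].map(level_means).fillna(fallback)
print(new)
```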
