
# MLC 23/24 week 9: Categorical Encoding, Data Leakage

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
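# pass rcParams through `rc`: a dict given positionally is treated as the
# plotting context, in which case `font_scale` is not applied
sns.set(
    rc={ "figure.figsize": (6, 4) },
    style='ticks',
    color_codes=True,
    font_scale=0.8
)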
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore')

!pip install --upgrade scikit-learn==1.3.2 -q

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, ParameterGrid

import sys
if 'google.colab' in sys.modules:
    !pip install -q dtreeviz
import dtreeviz

## Datasets

In [None]:
income = fetch_openml("adult", as_frame=True, parser="pandas")['data']
income.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country
0,2,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,1,0,2,United-States


In [None]:
rides = pd.read_csv(
    'https://raw.githubusercontent.com/gerberl/6G7V0015-2324/main/datasets/bike.csv'
)
rides.head(1)

Unnamed: 0,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,hum,windspeed,cnt,days_since_2011
0,WINTER,2011,JAN,NO HOLIDAY,SAT,NO WORKING DAY,MISTY,8.175849,80.5833,10.749882,985,0


## Categorical Encoding

Two main strategies for encoding categorical features at this early stage:

* **one-hot encoding**: one binary feature for each categorical value. OK for low-cardinality features.
* **target encoding**: replace each categorical value with the target's mean value within that category's group. Often a better alternative for high-cardinality features.
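
As a rough illustration of the two strategies, here is a minimal sketch using plain pandas rather than the scikit-learn encoders used below. The `toy` frame and its values are made up for this example.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a numeric target.
toy = pd.DataFrame({
    "season": ["WINTER", "WINTER", "SUMMER", "SUMMER", "FALL"],
    "cnt": [100, 200, 500, 700, 400],
})

# One-hot encoding: one binary column per category value.
one_hot = pd.get_dummies(toy["season"], prefix="season", dtype=int)

# Target (mean) encoding: replace each value with the mean target of its group.
target_mean = toy.groupby("season")["cnt"].transform("mean")

print(one_hot)
print(target_mean.tolist())  # [150.0, 150.0, 600.0, 600.0, 400.0]
```

One-hot encoding adds a column per category, so the width of the data grows with cardinality, whereas target encoding always yields a single numeric column.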

### One-Hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# the usual interface: instantiate, fit, transform (apply on the data)
rides_ohe = OneHotEncoder(
    sparse_output=False
).set_output(transform='pandas')

In [None]:
rides.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   season           731 non-null    object 
 1   yr               731 non-null    int64  
 2   mnth             731 non-null    object 
 3   holiday          731 non-null    object 
 4   weekday          731 non-null    object 
 5   workingday       731 non-null    object 
 6   weathersit       731 non-null    object 
 7   temp             731 non-null    float64
 8   hum              731 non-null    float64
 9   windspeed        731 non-null    float64
 10  cnt              731 non-null    int64  
 11  days_since_2011  731 non-null    int64  
dtypes: float64(3), int64(3), object(6)
memory usage: 68.7+ KB


In [None]:
rides['season'].value_counts()

SUMMER    188
SPRING    184
WINTER    181
FALL      178
Name: season, dtype: int64

In [None]:
rides['workingday'].value_counts()

WORKING DAY       500
NO WORKING DAY    231
Name: workingday, dtype: int64

In [None]:
rides['weekday'].value_counts()

SAT    105
SUN    105
MON    105
TUE    104
WED    104
THU    104
FRI    104
Name: weekday, dtype: int64

In [None]:
cat_feat = [ 'season', 'workingday', 'weekday']

In [None]:
rides_ohe.fit(rides[cat_feat])
rides_ohe.transform(rides[cat_feat]).head(1)

Unnamed: 0,season_FALL,season_SPRING,season_SUMMER,season_WINTER,workingday_NO WORKING DAY,workingday_WORKING DAY,weekday_FRI,weekday_MON,weekday_SAT,weekday_SUN,weekday_THU,weekday_TUE,weekday_WED
0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [None]:
rides[cat_feat].head(1)

Unnamed: 0,season,workingday,weekday
0,WINTER,NO WORKING DAY,SAT


In [None]:
rides_ohe.fit_transform(rides[cat_feat]).columns

Index(['season_FALL', 'season_SPRING', 'season_SUMMER', 'season_WINTER',
       'workingday_NO WORKING DAY', 'workingday_WORKING DAY', 'weekday_FRI',
       'weekday_MON', 'weekday_SAT', 'weekday_SUN', 'weekday_THU',
       'weekday_TUE', 'weekday_WED'],
      dtype='object')

In [None]:
# if a categorical feature is already binary, we only need to keep one
# numeric binary feature for it
rides_ohe_d = OneHotEncoder(
    sparse_output=False,
    drop='if_binary'
).set_output(transform='pandas')

In [None]:
rides_ohe_d.fit_transform(rides[cat_feat]).columns

Index(['season_FALL', 'season_SPRING', 'season_SUMMER', 'season_WINTER',
       'workingday_WORKING DAY', 'weekday_FRI', 'weekday_MON', 'weekday_SAT',
       'weekday_SUN', 'weekday_THU', 'weekday_TUE', 'weekday_WED'],
      dtype='object')

In [None]:
rides_ohe.fit_transform(rides[cat_feat]).columns

Index(['season_FALL', 'season_SPRING', 'season_SUMMER', 'season_WINTER',
       'workingday_NO WORKING DAY', 'workingday_WORKING DAY', 'weekday_FRI',
       'weekday_MON', 'weekday_SAT', 'weekday_SUN', 'weekday_THU',
       'weekday_TUE', 'weekday_WED'],
      dtype='object')

### Target Encoding

In [None]:
cat_feat = [ 'season', 'workingday', 'weekday']

In [None]:
rides.groupby('season')['cnt'].mean()

season
FALL      4728.162921
SPRING    4992.331522
SUMMER    5644.303191
WINTER    2604.132597
Name: cnt, dtype: float64

In [None]:
rides.groupby('workingday')['cnt'].mean()

workingday
NO WORKING DAY    4330.168831
WORKING DAY       4584.820000
Name: cnt, dtype: float64

In [None]:
from sklearn.preprocessing import TargetEncoder

In [None]:
rides_te = TargetEncoder(
    target_type='continuous'
).set_output(transform='pandas')

In [None]:
rides_te.fit(rides[cat_feat], rides['cnt'])
rides_te.transform(rides[cat_feat]).sample(5, random_state=0)

Unnamed: 0,season,workingday,weekday
196,5640.8839,4331.008405,4549.988517
187,5640.8839,4584.669058,4665.717278
14,2609.576783,4331.008405,4549.988517
31,2609.576783,4584.669058,4510.610358
390,2609.576783,4584.669058,4665.717278


In [None]:
rides[cat_feat].sample(5, random_state=0)

Unnamed: 0,season,workingday,weekday
196,SUMMER,NO WORKING DAY,SAT
187,SUMMER,WORKING DAY,THU
14,WINTER,NO WORKING DAY,SAT
31,WINTER,WORKING DAY,TUE
390,WINTER,WORKING DAY,THU


In [None]:
# the values produced by TargetEncoder might not be exactly the group means
# in the training data (some "smoothing" is applied to try to prevent
# overfitting; more on that in the AML unit)
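
To give a feel for what "smoothing" means, here is a minimal sketch that shrinks each group mean towards the global mean. It assumes a fixed smoothing strength `m` and a made-up `toy` frame; `smoothed_target_mean` is a hypothetical helper, not part of scikit-learn. `TargetEncoder` chooses the amount of shrinkage automatically by default, so its output will not match these numbers exactly.

```python
import pandas as pd

# Hypothetical toy data: WINTER is rare (2 rows), SUMMER is common (8 rows).
toy = pd.DataFrame({
    "season": ["WINTER"] * 2 + ["SUMMER"] * 8,
    "cnt": [100, 300, 500, 500, 500, 500, 500, 500, 500, 500],
})

def smoothed_target_mean(df, cat_col, target_col, m=2.0):
    """Blend each group mean with the global mean; m is an assumed strength."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Groups with few rows get a small weight, so they move towards the global mean.
    weight = stats["count"] / (stats["count"] + m)
    return weight * stats["mean"] + (1 - weight) * global_mean

encoding = smoothed_target_mean(toy, "season", "cnt", m=2.0)
print(encoding)
```

Here the rare WINTER group (raw mean 200) is pulled noticeably towards the global mean of 440, while the larger SUMMER group (raw mean 500) barely moves.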

### Dealing with Infrequent Categories

In [None]:
income.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country
0,2,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,1,0,2,United-States


In [None]:
income_cat_feat = [ 'native-country' ]

In [None]:
income[income_cat_feat].value_counts()

native-country            
United-States                 43832
Mexico                          951
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
...

In [None]:
len(income[income_cat_feat].value_counts())

41

In [None]:
income_ohe = OneHotEncoder(
    sparse_output=False,
    min_frequency=100
).set_output(transform='pandas')

In [None]:
income_ohe.fit(income[income_cat_feat])

In [None]:
income_ohe.transform(income[income_cat_feat])

Unnamed: 0,native-country_Canada,native-country_China,native-country_Cuba,native-country_Dominican-Republic,native-country_El-Salvador,native-country_England,native-country_Germany,native-country_India,native-country_Italy,native-country_Jamaica,native-country_Mexico,native-country_Philippines,native-country_Puerto-Rico,native-country_South,native-country_United-States,native-country_nan,native-country_infrequent_sklearn
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
48838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
48839,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
48840,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
income_ohe.transform(income[income_cat_feat]).columns

Index(['native-country_Canada', 'native-country_China', 'native-country_Cuba',
       'native-country_Dominican-Republic', 'native-country_El-Salvador',
       'native-country_England', 'native-country_Germany',
       'native-country_India', 'native-country_Italy',
       'native-country_Jamaica', 'native-country_Mexico',
       'native-country_Philippines', 'native-country_Puerto-Rico',
       'native-country_South', 'native-country_United-States',
       'native-country_nan', 'native-country_infrequent_sklearn'],
      dtype='object')

## Q: Why is `OrdinalEncoder()` Not Usually a Good Approach?

* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
* https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
* https://scikit-learn.org/stable/auto_examples/preprocessing/plot_target_encoder.html#sphx-glr-auto-examples-preprocessing-plot-target-encoder-py
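
One possible answer, sketched on a made-up `seasons` frame: `OrdinalEncoder` maps categories to integers, which implies an order and distances between categories. For a nominal feature, the default order is simply the sorted order of the labels, so the numbers carry no real meaning unless we supply an order ourselves.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one row per season, in calendar order.
seasons = pd.DataFrame({"season": ["WINTER", "SPRING", "SUMMER", "FALL"]})

# Default behaviour: categories are sorted (alphabetically here) and numbered from 0.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(seasons).ravel()
print(default_enc.categories_)
print(default_codes)  # [3. 1. 2. 0.] -> WINTER ends up "further" from FALL than SUMMER is

# Supplying an explicit order only helps if the feature really is ordinal.
ordered_enc = OrdinalEncoder(categories=[["WINTER", "SPRING", "SUMMER", "FALL"]])
ordered_codes = ordered_enc.fit_transform(seasons).ravel()
print(ordered_codes)  # [0. 1. 2. 3.]
```

Models that treat the input as a magnitude (for example, linear or distance-based models) would read `WINTER=3` as "three times `SPRING=1`", which is rarely what we want.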

## Preventing Data Leakage with Transformers!

Always bear in mind that the purpose of the test set is to estimate the model's performance on unseen data (simulating, as far as possible, the scenarios the model will face in deployment).

So, in most situations, we would like to **fit** the **transformer** (e.g., scaler, categorical encoder) on the **training data only** and then apply it (`.transform`) to the training, validation, and test sets.
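
To see why this matters, here is a minimal sketch using plain group means instead of `TargetEncoder`. The `toy` frame and the hand-picked held-out rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; row 2 is an unusually busy Saturday.
toy = pd.DataFrame({
    "weekday": ["SAT", "SAT", "SAT", "SUN", "SUN", "SUN", "MON", "MON"],
    "cnt": [100, 120, 900, 200, 220, 240, 300, 320],
})

# Hold out two rows as a "test set" (chosen by hand so the example is deterministic).
test = toy.loc[[2, 7]]
train = toy.drop(index=[2, 7])

# Leaky: the encoding is learned from ALL rows, including the test targets.
leaky = toy.groupby("weekday")["cnt"].mean()

# Clean: the encoding is learned from the training rows only.
clean = train.groupby("weekday")["cnt"].mean()

print(test["weekday"].map(leaky).tolist())
print(test["weekday"].map(clean).tolist())  # [110.0, 300.0]
```

In the leaky version, the encoded value for the held-out Saturday already reflects its own target of 900, which is information a deployed model would never have.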

In [None]:
X, y = rides[cat_feat], rides['cnt']

In [None]:
X.head(1)

Unnamed: 0,season,workingday,weekday
0,WINTER,NO WORKING DAY,SAT


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
rides_te = TargetEncoder(
    target_type='continuous'
).set_output(transform='pandas')

In [None]:
rides_te.fit(X_train, y_train)

In [None]:
X_train_enc = rides_te.transform(X_train)

In [None]:
X_train_enc.head(1)

Unnamed: 0,season,workingday,weekday
452,4828.304827,4564.513133,4490.149344


In [None]:
X_test_enc = rides_te.transform(X_test)

In [None]:
X_test_enc.head(1)

Unnamed: 0,season,workingday,weekday
196,5739.604061,4365.98017,4488.863886


## Further Learning

* sklearn **pipelines** help organise data transformations and model building into a single estimator, which also helps prevent data leakage. We will look at these later in the unit.
    - https://scikit-learn.org/stable/modules/compose.html

* (optional) extra packages offer many useful, more sophisticated approaches to pre-processing data:
    - [Category Encoders](https://contrib.scikit-learn.org/category_encoders/index.html)
    - [Feature Engine](https://feature-engine.trainindata.com/)