## Learn how to handle categorical variable encoding using One Hot Encoder, Ordinal Encoder and Rare Label Encoder

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

## One Hot Encoder

This technique replaces a categorical variable with a set of binary variables (each taking the value 0 or 1), where each new binary variable corresponds to one label of the categorical variable. The transformer is called `OneHotEncoder()` and its documentation is found [here](https://feature-engine.trainindata.com/en/latest/user_guide/encoding/OneHotEncoder.html).
* For example, imagine our variable is `Colour` and it has three labels: Yellow, Blue and Green.
* When you One Hot Encode (OHE) `Colour`, it is replaced by three binary variables: `Colour_Yellow`, `Colour_Blue` and `Colour_Green`.
* Imagine a given row of `Colour` is Yellow. Once One Hot Encoded, this row is transformed to Colour_Yellow = 1, Colour_Blue = 0 and Colour_Green = 0.
* There is a concept called a redundant feature. Stop for a moment: do we need three binary variables to represent the variable `Colour`?
  * The answer is no. With two binary variables for Colour, say Colour_Yellow and Colour_Blue, you can represent all possibilities as:
    * Colour_Yellow = 1 and Colour_Blue = 0, meaning yellow
    * Colour_Yellow = 0  and Colour_Blue = 1, meaning blue
    * Colour_Yellow = 0  and Colour_Blue = 0, meaning green
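
The redundancy argument above can be checked on a toy column. The sketch below uses pandas' `get_dummies` as a stand-in for the encoder (it is not the feature-engine transformer used later in this notebook), and the four-row `Colour` frame is made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
colours = pd.DataFrame({"Colour": ["Yellow", "Blue", "Green", "Yellow"]})

# Full one hot encoding: one binary column per label
full = pd.get_dummies(colours, columns=["Colour"], dtype=int)

# Drop one column; the remaining two still identify every label
reduced = full.drop(columns=["Colour_Green"])

# A row with no 1 in the remaining columns must be Green
has_flag = reduced.sum(axis=1) == 1
recovered = (
    reduced.idxmax(axis=1)
    .str.replace("Colour_", "", regex=False)
    .where(has_flag, "Green")
)

print(reduced)
print(recovered.tolist())
```

Since the original labels can be rebuilt from two columns, the third column adds no information.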

In [29]:
from feature_engine.encoding import OneHotEncoder

Let's consider only categorical variables from the 'insurance' dataset, which contains information on the relationship between personal attributes (age, gender, BMI: body mass index, family size, smoking habits), geographic factors, and their impact on medical insurance charges.

In [3]:
df = pd.read_csv('insurance.csv')
print(df.shape)
df.head()

(1338, 7)


Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [94]:
df = df.filter(['sex', 'smoker', 'region'])
df.head()

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest


Let's create the pipeline with two steps (handling missing data and categorical encoding), and then use `.fit_transform()`.
* Note: we can't encode a categorical variable that has missing data. For this exercise, we drop the missing data using the transformer from the previous unit (`DropMissingData`).
* Using OneHotEncoder, we pass a list of variables that we are interested in one hot encoding.
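
To see why missing data is handled before encoding, here is a small sketch with plain pandas. It is only an analogy: `get_dummies` and `dropna` stand in for the feature-engine transformers, and the four-row `smoker` column is made up. Unlike the note above, pandas does not refuse to encode; it quietly produces a row with no label at all.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, made up for illustration
toy = pd.DataFrame({"smoker": ["yes", "no", np.nan, "no"]})

# pandas does not raise here: the missing row becomes all zeros
encoded_with_nan = pd.get_dummies(toy, columns=["smoker"], dtype=int)

# Dropping incomplete rows first keeps exactly one 1 per row
clean = toy.dropna()
encoded_clean = pd.get_dummies(clean, columns=["smoker"], dtype=int)

print(encoded_with_nan)
print(encoded_clean)
```

An all-zero row is easy to confuse with a dropped label, which is one reason to deal with missing values as an explicit first step.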

In [15]:
from feature_engine.imputation import DropMissingData

In [97]:
pipeline = Pipeline([
    ('drop_na', DropMissingData()),
    ('ohe', OneHotEncoder(variables=['sex', 'smoker', 'region']))
])

df = pipeline.fit_transform(df)
df

Unnamed: 0,sex_female,sex_male,smoker_yes,smoker_no,region_southwest,region_southeast,region_northwest,region_northeast
0,1,0,1,0,1,0,0,0
1,0,1,0,1,0,1,0,0
2,0,1,0,1,0,1,0,0
3,0,1,0,1,0,0,1,0
4,0,1,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...
1333,0,1,0,1,0,0,1,0
1334,1,0,0,1,0,0,0,1
1335,1,0,0,1,0,1,0,0
1336,1,0,0,1,1,0,0,0


In [99]:
df = pd.read_csv('insurance.csv').filter(['sex', 'smoker', 'region'])
df.head()

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest


Then set up the same pipeline, but now add `drop_last=True`. Compare with the previous transformation and check which binary variables were removed.
* Note there is now only one binary variable each for sex and smoker, and three binary variables for region. This smaller set of variables carries the same amount of information as the previous OHE transformation.
* You probably noticed that this transformation has the potential to generate a lot of new columns. That increases the feature space and may increase the chance of overfitting your model. To manage that, you may add, when possible, a feature selection step to your pipeline to keep the most relevant features in your dataset. This topic will be covered in the next lesson.
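
The column counts can be worked out before encoding: a variable with k labels yields k binary columns, or k - 1 when one is dropped. The sketch below checks this with pandas' `get_dummies` on a made-up four-row frame that covers the labels seen in the insurance columns. Note that pandas' `drop_first=True` drops the first label of each variable rather than the last, so it is only an analogue of `drop_last=True`.

```python
import pandas as pd

# Toy rows covering every label shown in the insurance columns
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

labels_per_variable = toy.nunique()                     # sex=2, smoker=2, region=4
full_columns = int(labels_per_variable.sum())           # k columns per variable
reduced_columns = int((labels_per_variable - 1).sum())  # k - 1 columns per variable

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full_columns, full.shape[1])
print(reduced_columns, reduced.shape[1])
```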

In [86]:
pipeline = Pipeline([
    ('drop_na', DropMissingData()),
    ('ohe', OneHotEncoder(variables=['sex', 'smoker', 'region'], drop_last=True))
])

df = pipeline.fit_transform(df)

df

Unnamed: 0,sex_female,smoker_yes,region_southwest,region_southeast,region_northwest
0,1,1,1,0,0
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,0,1
4,0,0,0,0,1
...,...,...,...,...,...
1333,0,0,0,0,1
1334,1,0,0,0,0
1335,1,0,0,1,0
1336,1,0,1,0,0


## Ordinal Encoder

It replaces categories with ordinal numbers, such as 0, 1, 2, 3.
* The numbers can be assigned in order of first appearance.
* You can pass a list of variables to encode; otherwise, it will encode all categorical variables.

The function is `OrdinalEncoder()` and its documentation is found [here](https://feature-engine.trainindata.com/en/latest/user_guide/encoding/OrdinalEncoder.html)

The encoding method can be set to `ordered` or `arbitrary`.
* When set to `ordered`, the categories are numbered in ascending order of the target mean value per category.
* When set to `arbitrary`, the categories are numbered arbitrarily.

When using `ordered`, remember that your ML task must have a target (as in regression or classification). In a clustering task, for example, there is no target, so the transformer will not work with this method.

For the teaching examples in this course, we will set the encoding method to `arbitrary`. scikit-learn offers a similar transformer, which will be covered in a future lesson; multiple packages can engineer your variables, including both scikit-learn and feature-engine. You may try different options in your personal project or in the workplace. After all, this transformation is part of your feature engineering strategy and, as we studied, there is no fixed recipe when engineering your variables. It is a trial-and-error approach.
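
To make the difference between the two methods concrete, here is a minimal pandas sketch. It does not call feature-engine; it only mimics the two numbering rules described above, assuming that `arbitrary` numbers the labels in order of first appearance. The five-row frame and its `expenses` values are made up.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "southwest"],
    "expenses": [300.0, 100.0, 120.0, 200.0, 320.0],
})

# 'arbitrary' style: number the labels in order of first appearance
codes, labels = pd.factorize(toy["region"])
arbitrary_map = {label: code for code, label in enumerate(labels)}

# 'ordered' style: number the labels by ascending target mean
target_means = toy.groupby("region")["expenses"].mean().sort_values()
ordered_map = {label: code for code, label in enumerate(target_means.index)}

print(arbitrary_map)  # {'southwest': 0, 'southeast': 1, 'northwest': 2}
print(ordered_map)    # {'southeast': 0, 'northwest': 1, 'southwest': 2}
```

The `ordered` mapping needs the `expenses` column, which is why that method is unavailable when there is no target.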

In [104]:
from feature_engine.encoding import OrdinalEncoder

In [125]:
df = pd.read_csv('insurance.csv').filter(['sex', 'smoker', 'region'])
df.head()

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest


Let's create the pipeline with two steps (handling missing data and ordinal encoding), and then use `.fit_transform()`.
* We will not pass a variable list to `OrdinalEncoder()`, which means it will encode all categorical variables. We set `encoding_method='arbitrary'`.

In [128]:
pipeline = Pipeline([
    ('drop_na', DropMissingData()),
    # the step is still named 'ohe', but it now holds an OrdinalEncoder
    ('ohe', OrdinalEncoder(encoding_method='arbitrary'))
])

df = pipeline.fit_transform(df)
df

Unnamed: 0,sex,smoker,region
0,0,0,0
1,1,1,1
2,1,1,1
3,1,1,2
4,1,1,2
...,...,...,...
1333,1,1,2
1334,0,1,3
1335,0,1,1
1336,0,1,0


Let's check the frequencies of the encoded labels.
* We use a for loop on the DataFrame columns and print the variable name and the value counts for that variable.
* Note the labels were replaced by numbers. For example, female and male were replaced by 0 and 1, respectively.

In [150]:
# df.columns, gives Index
# df.columns.to_list(), gives List

for col in df.columns.to_list():
  print(f"\n{df[col].value_counts()} \n\n")


sex
1    676
0    662
Name: count, dtype: int64 



smoker
1    1064
0     274
Name: count, dtype: int64 



region
1    364
0    325
2    325
3    324
Name: count, dtype: int64 




Let's check the encoder dictionary to see how the transformer mapped the labels to numbers.

In [148]:
pipeline['ohe'].encoder_dict_

{'sex': {'female': 0, 'male': 1},
 'smoker': {'yes': 0, 'no': 1},
 'region': {'southwest': 0, 'southeast': 1, 'northwest': 2, 'northeast': 3}}
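
The same dictionary can be inverted to turn the numbers back into labels. The sketch below is plain Python and pandas; the mapping is copied from the output above, while `decoder_dict` and the two-row `encoded` frame are illustrative names, not part of feature-engine.

```python
import pandas as pd

# Mapping copied from the encoder_dict_ output shown above
encoder_dict = {
    "sex": {"female": 0, "male": 1},
    "smoker": {"yes": 0, "no": 1},
    "region": {"southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3},
}

# Invert each per-variable mapping: number -> label
decoder_dict = {
    var: {code: label for label, code in mapping.items()}
    for var, mapping in encoder_dict.items()
}

# First two rows of the encoded output shown earlier
encoded = pd.DataFrame({"sex": [0, 1], "smoker": [0, 1], "region": [0, 1]})
decoded = encoded.apply(lambda col: col.map(decoder_dict[col.name]))

print(decoded)
```

The decoded rows match the first two rows of the original data (female, yes, southwest and male, no, southeast).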

## Rare Label Encoder

This encoder groups **infrequent categories into a new category** called 'Rare' (or another name you define).
* __For example, if your variable is Fruit, and the labels banana, grape and apple each account for less than 6% of the rows, all these labels will be replaced by 'Rare'. That helps to decrease the chance of a model overfitting.__
* The transformer is `RareLabelEncoder()` and its documentation is found [here](https://feature-engine.trainindata.com/en/latest/user_guide/encoding/RareLabelEncoder.html). The arguments are:
  * `tol`: the tolerance, i.e. the minimum frequency a label should have to be considered frequent. Categories with frequencies lower than `tol` will be replaced by 'Rare'.
  * `n_categories`: the minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.
  * `variables`: the list of variables you would like to apply this transformation to. If you don't pass anything, it will select all categorical variables.
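
The interplay of `tol` and `n_categories` can be sketched with a few lines of pandas. The helper `group_rare_labels` below is hypothetical (it is not the feature-engine implementation), the Fruit counts are made up, and the handling of the boundary cases (a frequency exactly equal to `tol`, a category count exactly equal to `n_categories`) is an assumption.

```python
import pandas as pd

def group_rare_labels(series, tol, n_categories, replace_with="Rare"):
    """Hypothetical helper: replace labels rarer than `tol` with `replace_with`."""
    # Too few categories: treat every label as frequent and change nothing
    if series.nunique() <= n_categories:
        return series.copy()
    freq = series.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    return series.where(series.isin(frequent), replace_with)

# Toy data: banana, grape and apple are each below 6% of the 100 rows
fruit = pd.Series(
    ["orange"] * 50 + ["pear"] * 38 + ["banana"] * 5 + ["grape"] * 4 + ["apple"] * 3,
    name="Fruit",
)

grouped = group_rare_labels(fruit, tol=0.06, n_categories=2)
untouched = group_rare_labels(fruit, tol=0.06, n_categories=5)

print(grouped.value_counts(normalize=True))
print(untouched.value_counts(normalize=True))
```

With `n_categories=2` the three infrequent fruits are merged into 'Rare' (12% of rows); with `n_categories=5` the variable does not have enough categories, so nothing is replaced.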

In [13]:
from feature_engine.encoding import RareLabelEncoder

In [5]:
df = pd.read_csv('CarPrice_Assignment.csv').filter(['fueltype','carbody','drivewheel','fuelsystem']).astype('object')
print(df.shape)
df.head()

(205, 4)


Unnamed: 0,fueltype,carbody,drivewheel,fuelsystem
0,gas,convertible,rwd,mpfi
1,gas,convertible,rwd,mpfi
2,gas,hatchback,rwd,mpfi
3,gas,sedan,fwd,mpfi
4,gas,sedan,4wd,mpfi


In [166]:
df.isnull().sum()

fueltype      0
carbody       0
drivewheel    0
fuelsystem    0
dtype: int64

Now let's check the label frequencies for each variable.
* We loop over each variable and compute its label frequencies using `.value_counts(normalize=True)`.
* We note that some labels are infrequent, like 'convertible' for carbody.

In [7]:
for col in df.columns.to_list():
    print(f'{df[col].value_counts(normalize=True)}\n\n')

fueltype
gas       0.902439
diesel    0.097561
Name: proportion, dtype: float64


carbody
sedan          0.468293
hatchback      0.341463
wagon          0.121951
hardtop        0.039024
convertible    0.029268
Name: proportion, dtype: float64


drivewheel
fwd    0.585366
rwd    0.370732
4wd    0.043902
Name: proportion, dtype: float64


fuelsystem
mpfi    0.458537
2bbl    0.321951
idi     0.097561
1bbl    0.053659
spdi    0.043902
4bbl    0.014634
mfi     0.004878
spfi    0.004878
Name: proportion, dtype: float64




Let's create a pipeline with three steps (handling missing data, followed by two rare label encoders), and then use `.fit_transform()`. This shows a use case where rare label encoding is applied separately to 2 chosen variables.
* The first RareLabelEncoder deals with 'carbody' and sets the tolerance to 10% (this is an arbitrary number used to explain the concept). In the end, any carbody label that is less frequent than 10% will be replaced by 'Rare'.
* The second RareLabelEncoder deals with 'fuelsystem' and also sets the tolerance to 10% (again, an arbitrary number to illustrate the concept). In the end, any fuelsystem label that is less frequent than 10% will be replaced by 'Rare'.
* Note: you can perform this technique with a set of variables in a single step. We created the example with one encoder per variable so that each variable can have its own settings. In the workplace, the tol level will be selected based on the business context.
* We set `n_categories` to a low value (1 for carbody and 2 for fuelsystem). Both variables have more categories than that, so the encoder looks for rare labels in each of them.

In [21]:
pipeline = Pipeline([
    ('drop_na', DropMissingData()),
    ('rle_carbody', RareLabelEncoder(tol=0.1,
                                    n_categories=1,
                                    variables=['carbody'])),
    ('rle_fuelsystem', RareLabelEncoder(tol=0.1,
                                    n_categories=2,
                                    variables=['fuelsystem']))
])

df = pipeline.fit_transform(df)
df

Unnamed: 0,fueltype,carbody,drivewheel,fuelsystem
0,gas,Rare,rwd,mpfi
1,gas,Rare,rwd,mpfi
2,gas,hatchback,rwd,mpfi
3,gas,sedan,fwd,mpfi
4,gas,sedan,4wd,mpfi
...,...,...,...,...
200,gas,sedan,rwd,mpfi
201,gas,sedan,rwd,mpfi
202,gas,sedan,rwd,mpfi
203,diesel,sedan,rwd,Rare


Now let's check the label frequencies for each variable again.
* Note the infrequent labels were grouped into a label called 'Rare' according to the rules defined in the pipeline.

In [192]:
for col in df.columns.to_list():
    print(f'{df[col].value_counts(normalize=True)}') # Now we can see the RARE in Carbody and Fuelsystem output

fueltype
gas       0.902439
diesel    0.097561
Name: proportion, dtype: float64
carbody
sedan        0.468293
hatchback    0.341463
wagon        0.121951
Rare         0.068293
Name: proportion, dtype: float64
drivewheel
fwd    0.585366
rwd    0.370732
4wd    0.043902
Name: proportion, dtype: float64
fuelsystem
mpfi    0.458537
2bbl    0.321951
Rare    0.219512
Name: proportion, dtype: float64
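
The 'Rare' shares above can be checked by hand: each one should equal the sum of the proportions of the labels that fell below the tolerance. The sketch below copies the proportions from the earlier `value_counts` output; `rare_share` is an illustrative helper, not a library function. The code cell passes `tol=0.1` to both encoders, and the last line shows what a tolerance of 0.08 would give for fuelsystem instead.

```python
# Proportions copied from the value_counts output before encoding
carbody = {"sedan": 0.468293, "hatchback": 0.341463, "wagon": 0.121951,
           "hardtop": 0.039024, "convertible": 0.029268}
fuelsystem = {"mpfi": 0.458537, "2bbl": 0.321951, "idi": 0.097561,
              "1bbl": 0.053659, "spdi": 0.043902, "4bbl": 0.014634,
              "mfi": 0.004878, "spfi": 0.004878}

def rare_share(proportions, tol):
    """Total share of the labels whose frequency is below `tol`."""
    return sum(p for p in proportions.values() if p < tol)

print(round(rare_share(carbody, 0.10), 4))     # 0.0683
print(round(rare_share(fuelsystem, 0.10), 4))  # 0.2195
print(round(rare_share(fuelsystem, 0.08), 4))  # 0.122
```

The first two values match the 'Rare' rows in the output. With a tolerance of 0.08, 'idi' (about 9.8%) would have stayed as its own label.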


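In one cell, we will do the following tasks:
* Create a pipeline with four steps: drop missing data, two rare label encoders and an ordinal encoder.
* Then we fit and transform the data.
* Finally, we loop over the variables to check label frequencies.
* Note: `df` still holds the output of the previous pipeline, so the rare labels are already grouped when this cell runs. That is why fuelsystem keeps a 'Rare' share of about 22% even though `tol=0.08` is set here (on the raw data, 'idi' at about 9.8% would stay as its own label).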

In [196]:
pipeline = Pipeline([
      ('drop_na', DropMissingData() ),
      ('rle_carbody', RareLabelEncoder(tol=0.1,
                                     n_categories=2, # minimum number of categories needed before rare labels are grouped
                                     variables=['carbody']) ), 
      ('rle_fuelsystem', RareLabelEncoder(tol=0.08,
                                     n_categories=2,
                                     variables=['fuelsystem']) ),
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary',
                                         variables= ['fueltype','carbody','drivewheel','fuelsystem']) )
])

df = pipeline.fit_transform(df)

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

fueltype 
fueltype
0    0.902439
1    0.097561
Name: proportion, dtype: float64 


carbody 
carbody
2    0.468293
1    0.341463
3    0.121951
0    0.068293
Name: proportion, dtype: float64 


drivewheel 
drivewheel
1    0.585366
0    0.370732
2    0.043902
Name: proportion, dtype: float64 


fuelsystem 
fuelsystem
0    0.458537
1    0.321951
2    0.219512
Name: proportion, dtype: float64 


