# EDA and Feature Engineering

## Encoding Techniques

In this part of the series, we focus on `Encoding` techniques, another piece of `EDA and Feature Engineering`. Feature engineering is a critical step in the data science process, and encoding is a useful part of data pre-processing. Most machine learning algorithms expect numeric input, so handling non-numeric data is something we cannot avoid, and it presents its own set of challenges.

## 0. Import Libraries and Load sample dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
my_data = {'Id':[1001,1003,1004,1007,1008,1012],
           'Gender':['Male','Female','Female',np.nan,'Male','Male'],
           'Age':[24,np.nan,31,19,43,22],
           'COVID_Result':['Positive','Negative','Positive','Negative','Positive','Positive']}
df = pd.DataFrame(my_data)
df

Unnamed: 0,Id,Gender,Age,COVID_Result
0,1001,Male,24.0,Positive
1,1003,Female,,Negative
2,1004,Female,31.0,Positive
3,1007,,19.0,Negative
4,1008,Male,43.0,Positive
5,1012,Male,22.0,Positive


### Handle Missing values

In [3]:
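# Impute each missing value with the value from the previous row (forward fill).
# fillna(method='pad') is deprecated in recent pandas versions; ffill() is the equivalent call.

df1 = df.copy()
df1 = df1.ffill()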

In [4]:
df1

Unnamed: 0,Id,Gender,Age,COVID_Result
0,1001,Male,24.0,Positive
1,1003,Female,24.0,Negative
2,1004,Female,31.0,Positive
3,1007,Female,19.0,Negative
4,1008,Male,43.0,Positive
5,1012,Male,22.0,Positive
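
Forward fill depends on row order, so the imputed value is whatever happens to sit in the row above. As a hedged alternative, the sketch below fills the categorical column with its most frequent value and the numeric column with its median. The names `sample` and `imputed` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample values as above, minus the target column
sample = pd.DataFrame({
    'Id': [1001, 1003, 1004, 1007, 1008, 1012],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male'],
    'Age': [24, np.nan, 31, 19, 43, 22],
})

imputed = sample.copy()

# Categorical column: fill with the most frequent value (the mode)
imputed['Gender'] = imputed['Gender'].fillna(imputed['Gender'].mode()[0])

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())

print(imputed)
```

Which strategy is appropriate depends on the data; forward fill is usually best suited to ordered data such as time series.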


## 1. One-Hot Encoding

### 1.1 With sample data created above - using pd.get_dummies

In [5]:
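# dtype=int keeps the 0/1 output shown below; newer pandas versions return True/False by default
one_hot_encoded_data = pd.get_dummies(df1, columns=['Gender'], drop_first=True, dtype=int)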
one_hot_encoded_data

Unnamed: 0,Id,Age,COVID_Result,Gender_Male
0,1001,24.0,Positive,1
1,1003,24.0,Negative,0
2,1004,31.0,Positive,0
3,1007,19.0,Negative,0
4,1008,43.0,Positive,1
5,1012,22.0,Positive,1


In [6]:
# Encode two columns at once; dtype=int keeps the 0/1 output shown below
one_hot_encoded_data = pd.get_dummies(df1, columns=['Gender', 'COVID_Result'], drop_first=True, dtype=int)
one_hot_encoded_data

Unnamed: 0,Id,Age,Gender_Male,COVID_Result_Positive
0,1001,24.0,1,1
1,1003,24.0,0,0
2,1004,31.0,0,1
3,1007,19.0,0,0
4,1008,43.0,1,1
5,1012,22.0,1,1
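
The calls above pass `drop_first=True`, which is why only `Gender_Male` appears and there is no `Gender_Female` column. A small sketch of the difference, using an illustrative dataframe named `genders`:

```python
import pandas as pd

genders = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Male']})

# One column per category
full = pd.get_dummies(genders, columns=['Gender'], dtype=int)

# First category (alphabetically) dropped; it is implied when all remaining columns are 0
reduced = pd.get_dummies(genders, columns=['Gender'], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Gender_Female', 'Gender_Male']
print(reduced.columns.tolist())  # ['Gender_Male']
```

In `full`, the two columns always sum to 1, so one of them is redundant. Dropping it avoids perfectly collinear features, which matters mainly for linear models.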


### 1.2 With another sample dataset - country dataset - using sklearn

In [7]:
# Creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')

In [8]:
# Creating initial dataframe
Country_Names = ('Belgium','Japan','India','Argentina','Australia','Qatar')
country_df = pd.DataFrame(Country_Names, columns=['Country_Names'])

country_df

Unnamed: 0,Country_Names
0,Belgium
1,Japan
2,India
3,Argentina
4,Australia
5,Qatar


In [9]:
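# fit_transform returns a sparse matrix; toarray() converts it to a dense array
enc_df = pd.DataFrame(enc.fit_transform(country_df[['Country_Names']]).toarray())

# Join the encoded columns back onto the original dataframe by row index
country_df = country_df.join(enc_df)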
country_df

Unnamed: 0,Country_Names,0,1,2,3,4,5
0,Belgium,0.0,0.0,1.0,0.0,0.0,0.0
1,Japan,0.0,0.0,0.0,0.0,1.0,0.0
2,India,0.0,0.0,0.0,1.0,0.0,0.0
3,Argentina,1.0,0.0,0.0,0.0,0.0,0.0
4,Australia,0.0,1.0,0.0,0.0,0.0,0.0
5,Qatar,0.0,0.0,0.0,0.0,0.0,1.0


### 1.3 Another sample dataset - using sklearn

In [10]:
# Create sample data in the dataframe
my_data = {'Id':[7493,7494,7495,7496,7497,7498,7499,7500,7501,7502],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df2 = pd.DataFrame(my_data)
df2

Unnamed: 0,Id,Chiller_Temp,Model,Target
0,7493,Hot,RX,1
1,7494,Cold,YX,1
2,7495,Very Hot,BX,1
3,7496,Warm,BX,0
4,7497,Hot,RX,1
5,7498,Warm,YX,0
6,7499,Warm,RX,1
7,7500,Hot,BX,0
8,7501,Hot,YX,1
9,7502,Hot,YX,1


In [11]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore') # Creating instance of one-hot-encoder

ohet = ohe.fit_transform(df2.Chiller_Temp.values.reshape(-1,1)).toarray()

dfOneHot = pd.DataFrame(ohet, columns=["ChillerTemp_" + str(ohe.categories_[0][i])
                           for i in range(len(ohe.categories_[0]))])
dfh = pd.concat([df2, dfOneHot], axis=1)

dfh

Unnamed: 0,Id,Chiller_Temp,Model,Target,ChillerTemp_Cold,ChillerTemp_Hot,ChillerTemp_Very Hot,ChillerTemp_Warm
0,7493,Hot,RX,1,0.0,1.0,0.0,0.0
1,7494,Cold,YX,1,1.0,0.0,0.0,0.0
2,7495,Very Hot,BX,1,0.0,0.0,1.0,0.0
3,7496,Warm,BX,0,0.0,0.0,0.0,1.0
4,7497,Hot,RX,1,0.0,1.0,0.0,0.0
5,7498,Warm,YX,0,0.0,0.0,0.0,1.0
6,7499,Warm,RX,1,0.0,0.0,0.0,1.0
7,7500,Hot,BX,0,0.0,1.0,0.0,0.0
8,7501,Hot,YX,1,0.0,1.0,0.0,0.0
9,7502,Hot,YX,1,0.0,1.0,0.0,0.0
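
Both encoders above are created with `handle_unknown='ignore'`. The sketch below shows what that option does when new data contains a category that was not present during fitting. The value `'Freezing'` is a made-up category used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Chiller_Temp': ['Hot', 'Cold', 'Very Hot', 'Warm']})
new = pd.DataFrame({'Chiller_Temp': ['Hot', 'Freezing']})  # 'Freezing' was never seen in training

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# Known category -> a single 1 in its column; unknown category -> a row of all zeros
new_encoded = ohe.transform(new).toarray()
print(ohe.categories_[0])
print(new_encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error instead.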


## 2. Label Encoding

### 2.1 Using Category Codes Approach

This approach requires the column to have the ‘category’ datatype. By default, a non-numerical column has the ‘object’ type, so you may need to convert it to ‘category’ before using this approach.

In [12]:
# Create sample data in the dataframe
my_data = {'Id':['AIX-7493','AIX-7494','AIX-7495','AIX-7496','AIX-7497','AIX-7498','AIX-7499','AIX-7500','AIX-7501','AIX-7502'],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(my_data)
df

Unnamed: 0,Id,Chiller_Temp,Model,Target
0,AIX-7493,Hot,RX,1
1,AIX-7494,Cold,YX,1
2,AIX-7495,Very Hot,BX,1
3,AIX-7496,Warm,BX,0
4,AIX-7497,Hot,RX,1
5,AIX-7498,Warm,YX,0
6,AIX-7499,Warm,RX,1
7,AIX-7500,Hot,BX,0
8,AIX-7501,Hot,YX,1
9,AIX-7502,Hot,YX,1


In [13]:
# Converting type of columns to 'category'
df['Chiller_Temp'] = df['Chiller_Temp'].astype('category')

# Assigning numerical values and storing in another column
df['Chiller_Temp_Cat'] = df['Chiller_Temp'].cat.codes
df

Unnamed: 0,Id,Chiller_Temp,Model,Target,Chiller_Temp_Cat
0,AIX-7493,Hot,RX,1,1
1,AIX-7494,Cold,YX,1,0
2,AIX-7495,Very Hot,BX,1,2
3,AIX-7496,Warm,BX,0,3
4,AIX-7497,Hot,RX,1,1
5,AIX-7498,Warm,YX,0,3
6,AIX-7499,Warm,RX,1,3
7,AIX-7500,Hot,BX,0,1
8,AIX-7501,Hot,YX,1,1
9,AIX-7502,Hot,YX,1,1


Note that `Chiller_Temp_Cat` holds the code assigned to each category. The codes follow the alphabetical order of the category names rather than any meaningful sequence (`Warm` gets a higher code than `Very Hot`), so they act only as labels:
- Cold : encoded to 0
- Hot : encoded to 1
- Very Hot : encoded to 2
- Warm : encoded to 3
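
If the codes should reflect the actual temperature order, one option is to declare the category order explicitly. A minimal sketch, assuming the intended order is Cold < Warm < Hot < Very Hot (the original does not state an order, so this is an assumption):

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])

# Assumed ordering from coldest to hottest
order = ['Cold', 'Warm', 'Hot', 'Very Hot']

# An ordered categorical assigns codes by position in `order`, not alphabetically
ordered = pd.Categorical(temps, categories=order, ordered=True)
codes = pd.Series(ordered.codes, index=temps.index)

print(codes.tolist())  # [2, 0, 3, 1, 2]
```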

We can get the same result with the sklearn library, as another approach to `Label Encoding`.

### 2.2 Using sklearn library (with the same servers and chiller temperature example)

In [14]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# Create sample data in the dataframe
my_data = {'Id':['AIX-7493','AIX-7494','AIX-7495','AIX-7496','AIX-7497','AIX-7498','AIX-7499','AIX-7500','AIX-7501','AIX-7502'],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(my_data)
df

Unnamed: 0,Id,Chiller_Temp,Model,Target
0,AIX-7493,Hot,RX,1
1,AIX-7494,Cold,YX,1
2,AIX-7495,Very Hot,BX,1
3,AIX-7496,Warm,BX,0
4,AIX-7497,Hot,RX,1
5,AIX-7498,Warm,YX,0
6,AIX-7499,Warm,RX,1
7,AIX-7500,Hot,BX,0
8,AIX-7501,Hot,YX,1
9,AIX-7502,Hot,YX,1


In [15]:
# Creating instance of labelencoder
labelencoder = LabelEncoder()

In [16]:
# Assigning numerical values and storing in another column
df['ChillerTempEncoded'] = labelencoder.fit_transform(df['Chiller_Temp'])
df

Unnamed: 0,Id,Chiller_Temp,Model,Target,ChillerTempEncoded
0,AIX-7493,Hot,RX,1,1
1,AIX-7494,Cold,YX,1,0
2,AIX-7495,Very Hot,BX,1,2
3,AIX-7496,Warm,BX,0,3
4,AIX-7497,Hot,RX,1,1
5,AIX-7498,Warm,YX,0,3
6,AIX-7499,Warm,RX,1,3
7,AIX-7500,Hot,BX,0,1
8,AIX-7501,Hot,YX,1,1
9,AIX-7502,Hot,YX,1,1


Here too, `ChillerTempEncoded` holds the code assigned to each category. The codes are labels in alphabetical order, not a meaningful sequence. The result is the same as with the previous approach; the only difference is that it is produced by sklearn's `LabelEncoder`.
- Cold : encoded to 0
- Hot : encoded to 1
- Very Hot : encoded to 2
- Warm : encoded to 3
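
The mapping listed above does not have to be read off the output by hand. A fitted `LabelEncoder` exposes it through `classes_`, and `inverse_transform` converts codes back to labels. A short sketch with illustrative variable names:

```python
from sklearn.preprocessing import LabelEncoder

temps = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']

le = LabelEncoder()
encoded = le.fit_transform(temps)

# classes_ is sorted; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Cold': 0, 'Hot': 1, 'Very Hot': 2, 'Warm': 3}

# Recover the original labels from the codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` is the counterpart that works on 2D data.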

## 3. Binary Encoding

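`Binary encoding` can be used when the number of categories is large. Each category is first assigned an integer, and that integer is then written in binary, with one column per binary digit. It is often described as a combination of `Hash encoding` and `One Hot encoding`. The number of encoded columns is therefore much smaller than with `One Hot encoding`, which makes it more memory-efficient.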

We can use `category_encoders` to leverage this technique.

### Using Category_Encoders

In [17]:
# Please install if not installed in your environment
# !pip install category_encoders

In [18]:
import category_encoders as ce
import pandas as pd

In [19]:
# Create sample data in the dataframe
my_data = {'Id':['AIX-7493','AIX-7494','AIX-7495','AIX-7496','AIX-7497','AIX-7498','AIX-7499','AIX-7500','AIX-7501','AIX-7502'],
           'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot'],
           'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX'],
           'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(my_data)
df

Unnamed: 0,Id,Chiller_Temp,Model,Target
0,AIX-7493,Hot,RX,1
1,AIX-7494,Cold,YX,1
2,AIX-7495,Very Hot,BX,1
3,AIX-7496,Warm,BX,0
4,AIX-7497,Hot,RX,1
5,AIX-7498,Warm,YX,0
6,AIX-7499,Warm,RX,1
7,AIX-7500,Hot,BX,0
8,AIX-7501,Hot,YX,1
9,AIX-7502,Hot,YX,1


In [20]:
# Create object for binary encoding
encoder = ce.BinaryEncoder(cols=['Chiller_Temp'], return_df=True)

In [21]:
# Fit and transform data
df_binaryencoding = encoder.fit_transform(df)
df_binaryencoding

Unnamed: 0,Id,Chiller_Temp_0,Chiller_Temp_1,Chiller_Temp_2,Model,Target
0,AIX-7493,0,0,1,RX,1
1,AIX-7494,0,1,0,YX,1
2,AIX-7495,0,1,1,BX,1
3,AIX-7496,1,0,0,BX,0
4,AIX-7497,0,0,1,RX,1
5,AIX-7498,1,0,0,YX,0
6,AIX-7499,1,0,0,RX,1
7,AIX-7500,0,0,1,BX,0
8,AIX-7501,0,0,1,YX,1
9,AIX-7502,0,0,1,YX,1


With a larger number of categories, `Binary encoding` would be much more efficient than `One Hot encoding` in the example above.
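
To make the saving concrete, the sketch below compares column counts for a few category counts. It assumes the codes start at 1, as the 3 columns for 4 categories above suggest, so `n` categories need as many columns as `n` has binary digits. The helper names are hypothetical.

```python
def one_hot_columns(n_categories: int) -> int:
    # One column per category (without drop_first)
    return n_categories


def binary_columns(n_categories: int) -> int:
    # Binary digits needed to represent the codes 1..n_categories
    return n_categories.bit_length()


for n in (4, 10, 100, 1000):
    print(n, one_hot_columns(n), binary_columns(n))
# 4 categories    -> 4 vs 3 columns
# 10 categories   -> 10 vs 4 columns
# 100 categories  -> 100 vs 7 columns
# 1000 categories -> 1000 vs 10 columns
```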

That's all for now. Other encoding methods can be implemented in a similar way.