# Encoding Techniques 

Encoding is the process of converting data from one form to another. In machine learning, most models expect numeric input and do not work well with categorical data, so the categorical columns are encoded as numbers.

Skip directly to [Encoding](#encode).

Data Set - https://www.kaggle.com/tasneemabdulrahim/tips-dataset

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[About Data Set](#about)**
3. **[Data Preparation](#prep)**
    - 3.1 - **[Read Data](#read)**
    - 3.2 - **[Analysing Missing Values](#miss)**
    - 3.3 - **[Encoding](#encode)**
        - 3.3.1 - **[Dummy Encoding](#dum)**
        - 3.3.2 - **[One Hot Encoding](#one)**
        - 3.3.3 - **[Label Encoding](#label)**
        - 3.3.4 - **[Ordinal Encoding](#ordinal)**
        - 3.3.5 - **[Frequency Encoding](#frequency)**
    
    

<a id="lib"></a>
## 1. Import Libraries

In [3]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt



<a id="about"></a>
## 2. About the Dataset

<a id="prep"></a>
## 3. Data Preparation

<a id="read"></a>
## 3.1 Read Data

In [6]:
df=pd.read_csv('tips.csv')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


<a id="miss"></a>
## 3.2 Analysing Missing Values

In [7]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

<a id="encode"></a>
## 3.3 Encoding

<a id="dum"></a>
### 3.3.1 Dummy Encoding

Also called n-1 encoding: for a column with n categories it creates n-1 encoded columns, and the remaining category (the first one, which `drop_first=True` drops) is represented by 0s in all of the encoded columns for that feature. Its advantage over one-hot encoding is that it avoids the dummy variable trap.

Note - The dummy variable trap is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. With all n encoded columns present, any one of them equals 1 minus the sum of the rest.
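The trap can be made concrete with a small sketch. The toy `day` column below is a hypothetical stand-in for the tips data, and the intercept column is an assumption about how a linear model would use the features. With all n columns, every row sums to 1, so the design matrix loses rank; with n-1 columns it does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values mirror the `day` categories in the tips data.
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sat", "Sun"]})

full = pd.get_dummies(toy, drop_first=False).astype(int)    # n = 4 columns
reduced = pd.get_dummies(toy, drop_first=True).astype(int)  # n-1 = 3 columns

# Each row of the full encoding has exactly one 1, so the columns sum to 1.
row_sums = full.sum(axis=1)

# Design matrices with an intercept column of ones prepended.
X_full = np.column_stack([np.ones(len(toy)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(toy)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

A rank lower than the number of columns means one column is a linear combination of the others, which is exactly the multicollinearity described above.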

In [9]:
pd.get_dummies(df,drop_first=True)

Unnamed: 0,total_bill,tip,size,sex_Male,smoker_Yes,day_Sat,day_Sun,day_Thur,time_Lunch
0,16.99,1.01,2,0,0,0,1,0,0
1,10.34,1.66,3,1,0,0,1,0,0
2,21.01,3.50,3,1,0,0,1,0,0
3,23.68,3.31,2,1,0,0,1,0,0
4,24.59,3.61,4,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,3,1,0,1,0,0,0
240,27.18,2.00,2,0,1,1,0,0,0
241,22.67,2.00,2,1,1,1,0,0,0
242,17.82,1.75,2,1,0,1,0,0,0


<a id="one"></a>
### 3.3.2 One Hot Encoding

Creates n columns of encoded data, one per category.

Disadvantage - The dummy variable trap may occur.

In [10]:
pd.get_dummies(df,drop_first=False)

Unnamed: 0,total_bill,tip,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,2,1,0,1,0,0,0,1,0,1,0
1,10.34,1.66,3,0,1,1,0,0,0,1,0,1,0
2,21.01,3.50,3,0,1,1,0,0,0,1,0,1,0
3,23.68,3.31,2,0,1,1,0,0,0,1,0,1,0
4,24.59,3.61,4,1,0,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,3,0,1,1,0,0,1,0,0,1,0
240,27.18,2.00,2,1,0,0,1,0,1,0,0,1,0
241,22.67,2.00,2,0,1,0,1,0,1,0,0,1,0
242,17.82,1.75,2,0,1,1,0,0,1,0,0,1,0


Alternatively, using scikit-learn's `OneHotEncoder`:

In [11]:
from sklearn.preprocessing import OneHotEncoder

In [32]:
encoder=OneHotEncoder()
pd.DataFrame(encoder.fit_transform(df[['smoker']]).toarray(),columns=['smoker_no','smoker_yes'])

Unnamed: 0,smoker_no,smoker_yes
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0
...,...,...
239,1.0,0.0
240,0.0,1.0
241,0.0,1.0
242,1.0,0.0


<a id="label"></a>
### 3.3.3 Label Encoding

Dummy encoding and one-hot encoding increase the dimensionality of the data; label encoding does not.

Label encoding assigns a numeric value to each category.

Note: Label encoding gives each category an arbitrary numeric weight (an implied order), which may bias the model.

In [33]:
from sklearn.preprocessing import LabelEncoder

In [35]:
encoder=LabelEncoder()
encoder.fit_transform(df.day)

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3])

In [39]:
df.day.value_counts().index

Index(['Sat', 'Sun', 'Thur', 'Fri'], dtype='object')
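To see which integer stands for which category, the fitted encoder's `classes_` attribute can be inspected. A short sketch on a hypothetical sample of day names follows; `LabelEncoder` sorts the classes, so the codes follow alphabetical order rather than weekday order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of day names.
days = ["Sun", "Sat", "Thur", "Fri", "Sun"]

le = LabelEncoder()
codes = le.fit_transform(days)

# classes_ holds the sorted categories; a category's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Here `Fri` gets 0 and `Thur` gets 3 purely because of sorting, which illustrates why the implied order can mislead a model.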

<a id="ordinal"></a>
### 3.3.4 Ordinal Encoding

Ordinal encoding is very similar to label encoding; the only difference is that the user can choose the value that replaces each category. This comes in handy when an order needs to be specified, hence the name ordinal encoding.

Dimensionality remains the same after encoding.
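The same idea can be sketched in pandas alone with an ordered categorical. This is an alternative to scikit-learn's encoder, not a replacement for it, and the sample values are hypothetical. The position of each category in the `order` list becomes its code.

```python
import pandas as pd

# Hypothetical sample of day names.
days = pd.Series(["Sun", "Thur", "Sat", "Fri", "Sun"])

# The desired order: Thur -> 0, Fri -> 1, Sat -> 2, Sun -> 3.
order = ["Thur", "Fri", "Sat", "Sun"]

ordered = pd.Categorical(days, categories=order, ordered=True)
codes = pd.Series(ordered.codes, index=days.index)
print(codes.tolist())

# Values missing from `order` are not an error in pandas; they get the code -1.
unseen = pd.Categorical(["Mon"], categories=order, ordered=True).codes
print(unseen.tolist())
```

The output is still a single column, so the dimensionality is unchanged.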

In [37]:
from sklearn.preprocessing import OrdinalEncoder

In [48]:
# Categories are listed in the desired order: Thur -> 0, Fri -> 1, Sat -> 2, Sun -> 3
encoder=OrdinalEncoder(categories=[['Thur', 'Fri','Sat', 'Sun']])
encoder.fit_transform(df[['day']])

array([[3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],

<a id="frequency"></a>
### 3.3.5 Frequency Encoding

Frequency encoding tries to avoid assigning arbitrary weights to the categories by replacing each category with its frequency, here the fraction of rows in which it occurs.

Dimensionality also remains the same after frequency encoding.

In [49]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [63]:
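# Count the rows in each category of 'day'
encoding=df.groupby('day')['day'].agg('count')
# Convert the counts to relative frequencies
encoding=encoding/len(df)
# Replace each category with its relative frequency
df.day.map(encoding)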

0      0.311475
1      0.311475
2      0.311475
3      0.311475
4      0.311475
         ...   
239    0.356557
240    0.356557
241    0.356557
242    0.356557
243    0.254098
Name: day, Length: 244, dtype: float64
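An equivalent and slightly shorter route is `value_counts(normalize=True)`. The sketch below uses a hypothetical toy column and also shows a caveat: categories that occur equally often receive the same encoded value and can no longer be told apart.

```python
import pandas as pd

# Hypothetical toy column: Sat and Sun occur twice each, Thur and Fri once each.
toy = pd.DataFrame({"day": ["Sat", "Sat", "Sun", "Sun", "Thur", "Fri"]})

# Relative frequency of each category (count / number of rows).
freq = toy["day"].value_counts(normalize=True)

# Replace each category with its frequency.
toy["day_freq"] = toy["day"].map(freq)
print(toy)

# Four categories collapse into only two distinct encoded values.
n_categories = toy["day"].nunique()
n_encoded = toy["day_freq"].nunique()
print(n_categories, "categories ->", n_encoded, "distinct values")
```

When such collisions matter, a different encoding may be a better fit for that column.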