# **Handling Categorical Variables**

Handling categorical (qualitative) variables is an important step in data preprocessing. Many machine learning algorithms cannot handle categorical variables directly; the variables must first be converted to numerical values.<br>
The performance of an ML algorithm also depends on how the categorical variables are encoded.
The results produced by a model can vary with the encoding technique used.

Categorical variables can be divided into two categories:<br>
1. Nominal (no particular order)
2. Ordinal (the categories have a natural order)
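
As a small illustration of the difference, the sketch below declares one nominal and one ordinal variable with `pd.Categorical`. The sample values and the chosen temperature order are assumptions made for the example, not part of the dataset used later.

```python
import pandas as pd

# Nominal: the categories have no inherent order (colors).
color = pd.Categorical(["Red", "Yellow", "Blue"], ordered=False)

# Ordinal: the categories carry an order, declared explicitly here (assumed order).
temperature = pd.Categorical(
    ["Hot", "Cold", "Very Hot", "Warm"],
    categories=["Cold", "Warm", "Hot", "Very Hot"],
    ordered=True,
)

# For the ordinal variable, the integer codes follow the declared order.
temperature_codes = [int(c) for c in temperature.codes]
print(color.ordered, temperature.ordered)
print(temperature_codes)
```

Only the ordinal variable supports meaningful comparisons such as "Cold is less than Hot"; for the nominal one, any numeric code is arbitrary.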

In [1]:
# We will use the following libraries to perform the encoding.
!pip install scikit-learn
!pip install category-encoders



In [2]:
import numpy as np
import pandas as pd
import category_encoders as ce

In [14]:
data = {'Temperature':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color':['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data)
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


# 1. One Hot Encoding

One-hot encoding creates a new column/feature for each category in the categorical variable and fills it with either 1 (the category is present) or 0 (the category is absent). The number of new columns/features therefore equals the number of categories in the categorical variable. This method can slow down learning significantly when the number of categories is very high.

In [15]:
# Using get_dummies method in pandas
df_ohe = df.copy()
one_hot_1 = pd.get_dummies(df_ohe,prefix = 'Temp' ,columns=['Temperature'],drop_first=False)
one_hot_1.insert(loc=2, column='Temperature', value=df.Temperature.values)
one_hot_1

Unnamed: 0,Color,Target,Temperature,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,Hot,0,1,0,0
1,Yellow,1,Cold,1,0,0,0
2,Blue,1,Very Hot,0,0,1,0
3,Blue,0,Warm,0,0,0,1
4,Red,1,Hot,0,1,0,0
5,Yellow,0,Warm,0,0,0,1
6,Red,1,Warm,0,0,0,1
7,Yellow,0,Hot,0,1,0,0
8,Yellow,1,Hot,0,1,0,0
9,Yellow,1,Cold,1,0,0,0


In [5]:
# Using OneHotEncoder in sklearn
from sklearn.preprocessing import OneHotEncoder
# ohe = OneHotEncoder(drop='first')
ohe = OneHotEncoder()
oh_array = ohe.fit_transform(df['Temperature'].values.reshape(-1, 1)).toarray()
oh_df = pd.DataFrame(oh_array,columns=['Temp_Cold','Temp_Hot','Temp_Very_Hot','Temp_Warm'])
pd.concat([df,oh_df],axis=1)

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_Very_Hot,Temp_Warm
0,Hot,Red,1,0.0,1.0,0.0,0.0
1,Cold,Yellow,1,1.0,0.0,0.0,0.0
2,Very Hot,Blue,1,0.0,0.0,1.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0
4,Hot,Red,1,0.0,1.0,0.0,0.0
5,Warm,Yellow,0,0.0,0.0,0.0,1.0
6,Warm,Red,1,0.0,0.0,0.0,1.0
7,Hot,Yellow,0,0.0,1.0,0.0,0.0
8,Hot,Yellow,1,0.0,1.0,0.0,0.0
9,Cold,Yellow,1,1.0,0.0,0.0,0.0


In [6]:
# Using category_encoders OneHotEncoder
import category_encoders as ce
ohe = ce.OneHotEncoder(cols=['Temperature'])
ce_ohe = ohe.fit_transform(df.iloc[:,0], df.iloc[:,-1])
ce_ohe.columns = ['Temp_Hot','Temp_Cold','Temp_Very_Hot','Temp_Warm']
pd.concat([df,ce_ohe],axis=1)

Unnamed: 0,Temperature,Color,Target,Temp_Hot,Temp_Cold,Temp_Very_Hot,Temp_Warm
0,Hot,Red,1,1,0,0,0
1,Cold,Yellow,1,0,1,0,0
2,Very Hot,Blue,1,0,0,1,0
3,Warm,Blue,0,0,0,0,1
4,Hot,Red,1,1,0,0,0
5,Warm,Yellow,0,0,0,0,1
6,Warm,Red,1,0,0,0,1
7,Hot,Yellow,0,1,0,0,0
8,Hot,Yellow,1,1,0,0,0
9,Cold,Yellow,1,0,1,0,0


1. For regression, we can use N-1 columns (drop the first or last of the new one-hot encoded features), since the dropped category is implied when all the remaining columns are 0.
2. For classification, the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available variables.
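
The difference between N and N-1 columns can be sketched with the `drop_first` argument of `pd.get_dummies`. The small `temps` frame is a stand-in made up for this example.

```python
import pandas as pd

# Illustrative stand-in for the Temperature column (4 distinct categories).
temps = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# All N columns: one indicator per category.
full = pd.get_dummies(temps, columns=["Temperature"], prefix="Temp")

# N-1 columns: the first category in sorted order ("Cold") is dropped
# and becomes the implicit baseline.
reduced = pd.get_dummies(
    temps, columns=["Temperature"], prefix="Temp", drop_first=True
)

full_cols = list(full.columns)
reduced_cols = list(reduced.columns)
print(full_cols)
print(reduced_cols)
```

A row whose remaining indicators are all 0 in `reduced` corresponds to the dropped category, so no information is lost.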

**Disadvantages:** 
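1. Tree-based algorithms can perform poorly on one-hot encoded data with many categories, since the encoding creates a sparse matrix in which each column carries little information for a split.
2. When the feature contains many unique values, an equally large number of features is created, which may result in overfitting.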
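
To get a feel for how quickly the feature count grows, here is a rough sketch with a hypothetical high-cardinality column (`user_id` with 500 distinct values, invented for this example).

```python
import pandas as pd

# Hypothetical high-cardinality feature: 500 distinct user IDs.
users = pd.DataFrame({"user_id": [f"user_{i}" for i in range(500)]})
encoded = pd.get_dummies(users, columns=["user_id"])

n_rows, n_cols = encoded.shape
# Each row has exactly one non-zero entry, so the matrix is almost all zeros.
nonzero_fraction = encoded.to_numpy().sum() / (n_rows * n_cols)
print(n_rows, n_cols, nonzero_fraction)
```

One original column turns into 500 columns, and only a tiny fraction of the entries is non-zero.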

# 2. Label Encoding

1. In this encoding, each label/category is assigned a unique integer value.<br>
2. One major issue with `sklearn.preprocessing.LabelEncoder` is that it assigns the values based on the alphabetical order of the labels.<br>
Ex: Cold < Hot < Very Hot < Warm is encoded as 0 < 1 < 2 < 3

In [7]:
# Using sklearn LabelEncoder()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_ohe['Temperature_encoded'] = le.fit_transform(df.Temperature)
df_ohe

Unnamed: 0,Temperature,Color,Target,Temperature_encoded
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


In [8]:
# Using Pandas factorize()
fact = df.copy()
fact['Temperature_factor'] = pd.factorize(df.Temperature)[0]
fact

Unnamed: 0,Temperature,Color,Target,Temperature_factor
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1


**Disadvantages:** 
1. It can mislead the model: values are assigned based on alphabetical order (`LabelEncoder`) or order of first appearance (`pd.factorize`) instead of the actual label order.
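
The ordering used by each approach can be inspected directly. The following is a minimal sketch on a four-value series chosen for illustration; neither ordering matches the natural Cold < Warm < Hot < Very Hot order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm"])

le = LabelEncoder().fit(temps)
# classes_ is sorted, so the code of a label is its alphabetical rank.
label_order = [str(c) for c in le.classes_]

codes, uniques = pd.factorize(temps)
# factorize numbers the labels in order of first appearance instead.
factorize_order = [str(u) for u in uniques]

print(label_order)
print(factorize_order)
```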

# 3. Ordinal Encoding

In [9]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe_val = oe.fit_transform(df['Temperature'].values.reshape(-1, 1))
pd.concat([df,pd.DataFrame(oe_val,columns=['Temperature_Oe'])],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_Oe
0,Hot,Red,1,1.0
1,Cold,Yellow,1,0.0
2,Very Hot,Blue,1,2.0
3,Warm,Blue,0,3.0
4,Hot,Red,1,1.0
5,Warm,Yellow,0,3.0
6,Warm,Red,1,3.0
7,Hot,Yellow,0,1.0
8,Hot,Yellow,1,1.0
9,Cold,Yellow,1,0.0


In [10]:
# Using category_encoders OrdinalEncoder
import category_encoders as ce
# cols selects the column to encode; codes start at 1 in order of first appearance
ord_enc = ce.OrdinalEncoder(cols=['Temperature'])
df['Temp_ce_oe'] = ord_enc.fit_transform(df.iloc[:,0], df.iloc[:,-1])
df

Unnamed: 0,Temperature,Color,Target,Temp_ce_oe
0,Hot,Red,1,1
1,Cold,Yellow,1,2
2,Very Hot,Blue,1,3
3,Warm,Blue,0,4
4,Hot,Red,1,1
5,Warm,Yellow,0,4
6,Warm,Red,1,4
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,2


In [11]:
# A more reliable way is to map the labels according to their actual order
# Ex: Cold < Warm < Hot < Very Hot maps to 1 < 2 < 3 < 4
Temp_order = {'Cold' : 1 , 'Warm' : 2 , 'Hot' : 3 , 'Very Hot' : 4}
df['Temperature_Order'] = df.Temperature.map(Temp_order)
df

Unnamed: 0,Temperature,Color,Target,Temp_ce_oe,Temperature_Order
0,Hot,Red,1,1,3
1,Cold,Yellow,1,2,1
2,Very Hot,Blue,1,3,4
3,Warm,Blue,0,4,2
4,Hot,Red,1,1,3
5,Warm,Yellow,0,4,2
6,Warm,Red,1,4,2
7,Hot,Yellow,0,1,3
8,Hot,Yellow,1,1,3
9,Cold,Yellow,1,2,1
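
The same explicit ordering can also be passed to scikit-learn's `OrdinalEncoder` through its `categories` parameter. This is a sketch under the assumption that Cold < Warm < Hot < Very Hot is the intended order; note that the resulting codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

temps = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm"]})

# One list of categories per input column, given in the intended order.
oe = OrdinalEncoder(categories=[["Cold", "Warm", "Hot", "Very Hot"]])
encoded = oe.fit_transform(temps[["Temperature"]]).ravel().tolist()
print(encoded)
```

Compared with a plain dictionary mapping, the fitted encoder can be reused on new data and placed inside a scikit-learn pipeline.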


# 4. Frequency or Count Encoder

In frequency encoding, each category in the feature is replaced with the frequency of that category.<br>
When the frequency of the categories is related to the target variable, this encoding helps the model understand and assign weights in direct or inverse proportion, depending on the nature of the data.



$$\text{Frequency encoding}(category) = \frac{\text{Frequency}(category)}{\text{Size}(data)}$$

Category refers to each of the unique values in a feature.
1. **Frequency(category)** = Number of values in that category
2. **Size(data)** = Size of the entire dataset.

In [12]:
# Using Pandas groupby()
cat_freq = df.groupby('Temperature').size() / len(df)
df['Temp_Freq_Enc'] = df.Temperature.map(cat_freq)
df

Unnamed: 0,Temperature,Color,Target,Temp_ce_oe,Temperature_Order,Temp_Freq_Enc
0,Hot,Red,1,1,3,0.4
1,Cold,Yellow,1,2,1,0.2
2,Very Hot,Blue,1,3,4,0.1
3,Warm,Blue,0,4,2,0.3
4,Hot,Red,1,1,3,0.4
5,Warm,Yellow,0,4,2,0.3
6,Warm,Red,1,4,2,0.3
7,Hot,Yellow,0,1,3,0.4
8,Hot,Yellow,1,1,3,0.4
9,Cold,Yellow,1,2,1,0.2


In [13]:
# Using category_encoders CountEncoder
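import category_encoders as ce
# Use a separate name so the encoder does not shadow the `ce` module
count_enc = ce.CountEncoder(cols=['Temperature'])
# CountEncoder is unsupervised, so only the Temperature column is passed
df['Temp_Count_Enc'] = count_enc.fit_transform(df[['Temperature']])['Temperature']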
df

Unnamed: 0,Temperature,Color,Target,Temp_ce_oe,Temperature_Order,Temp_Freq_Enc,Temp_Count_Enc
0,Hot,Red,1,1,3,0.4,4
1,Cold,Yellow,1,2,1,0.2,2
2,Very Hot,Blue,1,3,4,0.1,1
3,Warm,Blue,0,4,2,0.3,3
4,Hot,Red,1,1,3,0.4,4
5,Warm,Yellow,0,4,2,0.3,3
6,Warm,Red,1,4,2,0.3,3
7,Hot,Yellow,0,1,3,0.4,4
8,Hot,Yellow,1,1,3,0.4,4
9,Cold,Yellow,1,2,1,0.2,2


**Disadvantage**:
1. If two categories have the same frequency, they receive the same encoded value, so the model cannot distinguish between them.
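
This collision is easy to reproduce. The sketch below uses a hypothetical color column (invented for this example) in which "Red" and "Blue" occur equally often.

```python
import pandas as pd

# Hypothetical column where "Red" and "Blue" occur equally often.
colors = pd.Series(["Red", "Blue", "Red", "Blue", "Green"])

# Relative frequency of each category, then map it back onto the column.
freq = colors.value_counts(normalize=True)
encoded = colors.map(freq)

# "Red" and "Blue" collapse to the same encoded value.
red_value = float(freq["Red"])
blue_value = float(freq["Blue"])
print(encoded.tolist())
```

Three distinct categories are reduced to two distinct encoded values, so the original labels cannot be recovered from the encoded column.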