# One Hot Encoding:

#### Dealing with Categorical Data:

Most real-life datasets we encounter in data science projects have columns of mixed data types: some columns are categorical and others are numerical.

However, many machine learning models cannot work with categorical data directly, so this data must be converted into numerical data before it can be fed to the model.

*For example*, suppose a dataset has a Gender column with categorical values such as Male and Female. These labels have no inherent order, and because they are stored as strings, most machine learning models cannot use them as they are.

One approach to this problem is label encoding, where we assign a numerical value to each label, for example mapping Male and Female to 0 and 1. But this can add bias to our model: it may read the numbers as a ranking and treat Female as greater than Male because 1 > 0, even though both labels are equally important in the dataset. To deal with this issue we will use the **One Hot Encoding technique**, which gives each category its own binary (0/1) column.
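
To make the difference concrete, here is a minimal sketch on a toy Gender column. The data is hypothetical, and it uses `pd.get_dummies` purely as a quick illustration rather than the `OneHotEncoder` used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy column, not part of the dataset used in this notebook
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# Label encoding: each category is replaced by an arbitrary integer,
# which a model may read as an ordering (Female > Male)
label_map = {"Male": 0, "Female": 1}
label_encoded = gender.map(label_map)

# One hot encoding: each category gets its own 0/1 column,
# so no category is numerically "larger" than another
one_hot = pd.get_dummies(gender, dtype=int)

print(label_encoded.tolist())  # [0, 1, 1, 0]
print(one_hot)
```

In the one hot version every row contains exactly one 1, so the two categories are treated symmetrically.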

### OneHotEncoder:

Scikit-learn (sklearn) is a popular machine-learning library in Python that provides numerous tools for data preprocessing. One of them is the **OneHotEncoder** class, which we use for encoding categorical variables into binary vectors.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [2]:
# loading the dataset
df = pd.read_csv('data.csv')
df.head()

,Id,Colour,Country
0,1,Red,USA
1,2,Red,USA
2,3,Blue,UK
3,4,Green,Canada
4,5,Green,Canada


In [3]:
# bottom five rows
df.tail()

,Id,Colour,Country
295,296,Blue,USA
296,297,Green,USA
297,298,Green,UK
298,299,Green,Canada
299,300,Blue,USA


In [4]:
# checking the data types of each column
df.dtypes

Id          int64
Colour     object
Country    object
dtype: object
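
With only three columns it is easy to read the categorical ones off the output above. For wider datasets, a sketch like the following could pick them out programmatically. The miniature DataFrame `toy_df` is a hypothetical stand-in for `df`, and selecting "everything that is not numeric" is an assumption that suits this dataset but may not suit every one.

```python
import pandas as pd

# Hypothetical miniature of the dataset, with the same column names
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Colour": ["Red", "Blue", "Green"],
    "Country": ["USA", "UK", "Canada"],
})

# Keep every column that is not numeric; here that leaves the string columns
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)  # ['Colour', 'Country']
```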

We are going to apply **OneHotEncoder** to the object-typed columns 'Country' and 'Colour'. Let's check the unique values in both columns.

In [5]:
# for colours
df["Colour"].unique()

array(['Red', 'Blue', 'Green'], dtype=object)

In the output you can see 'Red', 'Blue' and 'Green'. These are the unique colours present in the variable 'Colour'.
<br>
When we apply the **OneHotEncoder** to this variable we will get three new columns, one for each category: 'Blue', 'Green' and 'Red' (the encoder sorts the categories).

In [6]:
# for country
df["Country"].unique()

array(['USA', 'UK', 'Canada'], dtype=object)

In the output you can see 'USA', 'UK' and 'Canada'. These are the unique countries present in the variable 'Country'.
<br>
When we apply the **OneHotEncoder** to this variable we will get three new columns, one for each category: 'Canada', 'UK' and 'USA'.

In [7]:
# creating the encoder object
ohe = OneHotEncoder()
print(ohe)

OneHotEncoder()


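I'm going to apply 'fit' and 'transform' in a single step, using the method 'fit_transform', on the variables 'Colour' and 'Country'. By default the encoder returns a sparse matrix, so we call 'toarray' to get a regular NumPy array.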

In [8]:
feature_arr = ohe.fit_transform(df[["Colour", "Country"]]).toarray()
feature_arr

array([[0., 0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 1.]])
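
The role of `.toarray()` in the cell above can be checked on a small example. This is a sketch on a hypothetical three-row DataFrame (`toy_df`), assuming the encoder is created with its default settings as in this notebook.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the two categorical columns
toy_df = pd.DataFrame({
    "Colour": ["Red", "Blue", "Green"],
    "Country": ["USA", "UK", "Canada"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_df)  # SciPy sparse matrix by default
dense = encoded.toarray()                # regular NumPy array

print(sparse.issparse(encoded))  # True
print(dense.shape)               # (3, 6)
print(dense[0])                  # first row: Red, USA
```

Each row of the dense array contains exactly two ones, one for the colour and one for the country.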

In [9]:
ohe.categories_

[array(['Blue', 'Green', 'Red'], dtype=object),
 array(['Canada', 'UK', 'USA'], dtype=object)]

These are two arrays, one for 'Colour' and one for 'Country'.
<br>
We want to combine them into one single array!

In [10]:
feature_labels = ohe.categories_

In [11]:
# np.concatenate also works when the columns have different numbers of categories
np.concatenate(feature_labels)

array(['Blue', 'Green', 'Red', 'Canada', 'UK', 'USA'], dtype=object)

Now, you can see that all the categories are stored in one single array.

In [12]:
feature_labels = np.concatenate(feature_labels)
print(feature_labels)

['Blue' 'Green' 'Red' 'Canada' 'UK' 'USA']


In [13]:
# Now, let's make a DataFrame
features = pd.DataFrame(feature_arr, columns = feature_labels)
features

,Blue,Green,Red,Canada,UK,USA
0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...
295,1.0,0.0,0.0,0.0,0.0,1.0
296,0.0,1.0,0.0,0.0,0.0,1.0
297,0.0,1.0,0.0,0.0,1.0,0.0
298,0.0,1.0,0.0,1.0,0.0,0.0
299,1.0,0.0,0.0,0.0,0.0,1.0


We can join this DataFrame with the original DataFrame 'df'. We will use the pandas function 'concat' with axis=1, which places the two DataFrames side by side.

In [14]:
new_df = pd.concat([df, features], axis=1)
new_df.head()

,Id,Colour,Country,Blue,Green,Red,Canada,UK,USA
0,1,Red,USA,0.0,0.0,1.0,0.0,0.0,1.0
1,2,Red,USA,0.0,0.0,1.0,0.0,0.0,1.0
2,3,Blue,UK,1.0,0.0,0.0,0.0,1.0,0.0
3,4,Green,Canada,0.0,1.0,0.0,1.0,0.0,0.0
4,5,Green,Canada,0.0,1.0,0.0,1.0,0.0,0.0
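
One thing to watch for: `pd.concat` with `axis=1` aligns rows by index, not by position. That is fine here because `df` has the default 0..299 index, but if rows had been filtered or re-indexed, the encoded DataFrame would need the same index. The sketch below shows this on a hypothetical DataFrame with a non-default index, and also drops the original string column, which is an optional step this notebook does not take.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame whose index does not start at 0 (e.g. after filtering)
toy_df = pd.DataFrame(
    {"Id": [1, 2, 3], "Colour": ["Red", "Blue", "Green"]},
    index=[10, 11, 12],
)

encoder = OneHotEncoder()
arr = encoder.fit_transform(toy_df[["Colour"]]).toarray()

# Reuse the original index so that concat lines the rows up correctly
features = pd.DataFrame(arr, columns=encoder.categories_[0], index=toy_df.index)

# Optionally drop the original column once it has been encoded
result = pd.concat([toy_df.drop(columns=["Colour"]), features], axis=1)
print(result)
```

Without `index=toy_df.index`, the encoded rows would get the index 0, 1, 2 and the concatenation would produce rows filled with NaN instead of matching them up.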


This is how you can apply **One Hot Encoding** to categorical variables.