## One Hot Encoding
* One hot encoding splits a column of categorical data into several columns, one for each category present in that column.
* Each new column contains “1” for the rows that belong to its category and “0” for all other rows.
* It converts categorical variables into a numeric form that ML algorithms can use, which often helps them do a better job in prediction.
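
To make the idea concrete before touching the real data, here is a minimal sketch in plain Python. The `colors` list is a made-up toy column, not part of the dataset used below, and the names `categories` and `encoded` are only illustrative.

```python
# Toy column with three distinct categories (hypothetical data)
colors = ["red", "green", "blue", "green"]

# One output column per distinct category, in sorted order
categories = sorted(set(colors))

# Each row gets a 1 in the column matching its category and 0 elsewhere
encoded = [[1 if value == cat else 0 for cat in categories] for value in colors]

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Every row ends up with exactly one 1, which is where the name "one hot" comes from.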



In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("/home/ubuntu/stuff/projects/machine_learning_algorithms/extra/data/mercedes_data/train.csv", usecols=["X0", "X1", "X2", "X3", "X4", "X5"])

In [3]:
df
# As we can see, our columns contain categorical values, so we can apply one hot encoding before feeding them to an ML model

Unnamed: 0,X0,X1,X2,X3,X4,X5
0,k,v,at,a,d,u
1,k,t,av,e,d,y
2,az,w,n,c,d,x
3,az,t,n,f,d,x
4,az,v,n,f,d,h
...,...,...,...,...,...,...
4204,ak,s,as,c,d,aa
4205,j,o,t,d,d,aa
4206,ak,v,r,a,d,aa
4207,al,r,e,f,d,aa


In [4]:
# Print the number of unique categories in each column
for col in df.columns:
    print(col, ":", len(df[col].unique()))

X0 : 47
X1 : 27
X2 : 44
X3 : 7
X4 : 4
X5 : 29
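
These counts determine how wide the encoded table will be. As a rough worked example, the sketch below copies the counts printed above into a dictionary (the name `n_unique` is only illustrative) and adds up the resulting columns, with and without dropping one category per variable as `drop_first=True` does in the next cell.

```python
# Unique-category counts copied from the output above
n_unique = {"X0": 47, "X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29}

# Full one hot encoding: one column per category
full_columns = sum(n_unique.values())

# With drop_first=True, one category per variable is dropped
reduced_columns = sum(k - 1 for k in n_unique.values())

print(full_columns, reduced_columns)  # 158 152
```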


In [5]:
# One hot encoding can be done with either pandas or sklearn
# Using pandas here:
pd.get_dummies(df, drop_first=True)

Unnamed: 0,X0_aa,X0_ab,X0_ac,X0_ad,X0_af,X0_ai,X0_aj,X0_ak,X0_al,X0_am,...,X5_o,X5_p,X5_q,X5_r,X5_s,X5_u,X5_v,X5_w,X5_x,X5_y
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4206,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4207,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


- None of the columns here has very many categories, so this is fine. But sometimes a column has far more, say 500 categories; with `drop_first=True` that creates 499 columns for that one variable, which can hurt the model.
- Instead, we can take the n (say 10) most frequent labels of the variable and make a binary variable for each of those labels only.
- The remaining labels can either be dropped or grouped together into a separate category.


In [6]:
# Display the 10 most frequent categories of variable X0
# (value_counts already sorts in descending order)
df.X0.value_counts().head(10)

z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
Name: X0, dtype: int64

In [7]:
# Build the list of the 10 most frequent categories in the selected column
t10 = df.X0.value_counts().head(10).index.tolist()
t10

['z', 'ak', 'y', 'ay', 't', 'x', 'o', 'f', 'n', 'w']

In [8]:
# Now making 10 binary variables
for label in t10:
    df[label] = np.where(df["X0"] == label, 1, 0)

In [9]:
# Show the first 10 rows of X0 alongside the new binary columns
df[["X0"] + t10].head(10)

Unnamed: 0,X0,z,ak,y,ay,t,x,o,f,n,w
0,k,0,0,0,0,0,0,0,0,0,0
1,k,0,0,0,0,0,0,0,0,0,0
2,az,0,0,0,0,0,0,0,0,0,0
3,az,0,0,0,0,0,0,0,0,0,0
4,az,0,0,0,0,0,0,0,0,0,0
5,t,0,0,0,0,1,0,0,0,0,0
6,al,0,0,0,0,0,0,0,0,0,0
7,o,0,0,0,0,0,0,1,0,0,0
8,w,0,0,0,0,0,0,0,0,0,1
9,j,0,0,0,0,0,0,0,0,0,0
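
In the table above, rows whose `X0` is not among the top labels (for example `k` and `az`) are all zeros, so they are effectively dropped. The other option mentioned earlier is to group them into one extra category. A minimal sketch of that, on a made-up toy column and assuming the label `"other"` does not already occur in the data:

```python
import pandas as pd

# Hypothetical toy column; "a" and "b" are the frequent labels
s = pd.Series(["a", "b", "a", "c", "d", "a", "b", "e"], name="X0")

# Keep the top_n most frequent labels (top_n=2 is an arbitrary choice here)
top_n = 2
top_labels = s.value_counts().head(top_n).index.tolist()

# Replace every label outside the top list with a single "other" label
grouped = s.where(s.isin(top_labels), "other")

# dtype=int keeps the 0/1 output regardless of pandas version
dummies = pd.get_dummies(grouped, prefix="X0", dtype=int)
print(dummies)
```

With this variant every row has exactly one 1, at the cost of one additional column per variable.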


In [10]:
# Now apply the same idea to any feature:
# a helper that creates dummy variables for a chosen set of labels
def one_hot_encoding(df, variable, top_x_labels):
    """Add one 0/1 column per label in top_x_labels to df, in place.

    The number of most frequent labels to encode can be varied by
    passing a longer or shorter list.
    """
    for label in top_x_labels:
        df[variable + '_' + label] = np.where(df[variable] == label, 1, 0)

In [11]:
df = pd.read_csv('/home/ubuntu/stuff/projects/machine_learning_algorithms/extra/data/mercedes_data/train.csv', usecols=['X1', 'X2'])
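# Encode X2 using the label list t10.
# Note: t10 was built from X0 above, so these are X0's most frequent labels;
# to encode X2's own 10 most frequent categories, build the list from df['X2'] instead.
one_hot_encoding(df, 'X2', t10)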
df.head()

Unnamed: 0,X1,X2,X2_z,X2_ak,X2_y,X2_ay,X2_t,X2_x,X2_o,X2_f,X2_n,X2_w
0,v,at,0,0,0,0,0,0,0,0,0,0
1,t,av,0,0,0,0,0,0,0,0,0,0
2,w,n,0,0,0,0,0,0,0,0,1,0
3,t,n,0,0,0,0,0,0,0,0,1,0
4,v,n,0,0,0,0,0,0,0,0,1,0
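
To cover all categorical variables, the label list has to come from each column itself. The sketch below is one possible way to do that; `encode_top_labels` is a hypothetical helper (not part of the notebook) and `toy` is made-up data standing in for the `X1` and `X2` columns.

```python
import numpy as np
import pandas as pd

def encode_top_labels(df, variable, n=10):
    # Derive the top-n list from the same column that is being encoded
    top_labels = df[variable].value_counts().head(n).index.tolist()
    for label in top_labels:
        df[variable + "_" + label] = np.where(df[variable] == label, 1, 0)
    return top_labels

# Hypothetical toy frame standing in for the X1 and X2 columns
toy = pd.DataFrame({
    "X1": ["v", "t", "v", "w", "v", "t"],
    "X2": ["n", "n", "at", "n", "av", "at"],
})

# Loop over a fixed list of the original columns, since new ones are added in place
for col in ["X1", "X2"]:
    encode_top_labels(toy, col, n=2)

print(toy)
```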
