# One hot encoding

- Converts categorical data into numeric data.
- A new binary (0/1) column is added for each unique category in the feature.
- In label encoding I assign a rank to each category, so the categories are treated as having an ordered relationship with each other. ex: Good, Average, Bad
- In OHE, no relationship is assumed between the categories within the feature.
- If there are n categories, n new columns are created, 1 column per category. With drop_first=True, only (n-1) columns are created.
- This creates extra dimensions (features/columns). drop_first reduces them by one per feature.
- If a feature has many categories, drop_first still drops just 1 column, so it does not help much.
- Too many columns lead to the CURSE OF DIMENSIONALITY.
- Features like pincode, or any feature with many categories, create a very large number of columns.
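
The difference between the two encodings, and the effect of drop_first, can be sketched on a toy column. The column name `quality` and the order mapping are made up for illustration; `dtype=int` is passed so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy data; the column name and values are illustrative only.
toy = pd.DataFrame({"quality": ["Good", "Average", "Bad", "Good"]})

# Label (ordinal) encoding: an explicit order Bad < Average < Good.
order = {"Bad": 0, "Average": 1, "Good": 2}
toy["quality_label"] = toy["quality"].map(order)

# One-hot encoding: one 0/1 column per category, no order implied.
full = pd.get_dummies(toy["quality"], prefix="quality", dtype=int)

# drop_first=True removes the first (alphabetical) category: n - 1 columns remain.
reduced = pd.get_dummies(toy["quality"], prefix="quality", drop_first=True, dtype=int)

print(toy["quality_label"].tolist())
print(full.columns.tolist())
print(reduced.columns.tolist())
```

With 3 categories, `full` has 3 columns and `reduced` has 2; the dropped category is the one whose row is all zeros.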


In [1]:
import numpy as np
import pandas as pd

In [2]:
df1= pd.read_csv('titanic.csv')

In [3]:
df1.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [4]:
df= pd.read_csv('titanic.csv', usecols= ['Sex'])
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [6]:
pd.get_dummies(df).head()
# OHE -> 2 new columns are created, sorted alphabetically -> Sex_female, Sex_male
# each row gets 1 in the column matching its category and 0 in the other


Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


In [7]:
# As I have just 2 categories, I can drop one column and keep the other,
# because a single 0/1 column is enough to indicate both categories.

# drop_first=True drops the first column (Sex_female).

# This also reduces the risk of the curse of dimensionality.

pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [8]:
# lets take next category Embarked 

df= pd.read_csv('titanic.csv', usecols= ['Embarked'])
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [12]:
df['Embarked'].unique()

# there are 3 categories plus NaN (missing values)

array(['S', 'C', 'Q', nan], dtype=object)

In [14]:
# Right now we are only looking at the OHE process, so NaN values can be dropped or filled
# (value_counts() already excludes NaN by default)

df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [20]:
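# let's replace NaN with the most frequent category, "S"
# note: fillna returns a new DataFrame; the result is not assigned here, so df is unchanged

df.fillna("S")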

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S
...,...
886,S
887,S
888,S
889,C


In [23]:
df.isnull().sum()

Embarked    2
dtype: int64

In [24]:
# we can also drop NA
# note: dropna also returns a new object, so df still contains the 2 NaN values

df['Embarked'].dropna()

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 889, dtype: object

In [25]:
df.isnull().sum()

Embarked    2
dtype: int64
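
The null count stays at 2 because neither fillna nor dropna modifies df unless the result is assigned. A minimal sketch on a toy stand-in for the Embarked column (the values below are made up), which also looks up the most frequent value instead of hard-coding "S":

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column; values are illustrative only.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# fillna returns a new object; without assignment, toy is unchanged.
toy.fillna("S")
missing_before = int(toy["Embarked"].isnull().sum())

# Look up the most frequent value, then keep the returned DataFrame.
most_frequent = toy["Embarked"].mode()[0]
filled = toy.fillna({"Embarked": most_frequent})
missing_after = int(filled["Embarked"].isnull().sum())

# dropna works the same way: keep the returned object.
dropped = toy.dropna(subset=["Embarked"])

print(missing_before, missing_after, len(dropped))
```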

In [26]:
# applying OHE (get_dummies ignores NaN, so those rows get 0 in every column)

pd.get_dummies(df, drop_first=False).head()  # keep all 3 columns for now

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [27]:
pd.get_dummies(df, drop_first= True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
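
Because the NaN values were never actually removed from df, it helps to see how get_dummies treats them. The sketch below uses a toy column (made-up values) and the `dummy_na` parameter of `pd.get_dummies`. With drop_first=True, a NaN row and a row of the dropped category both come out as all zeros, so they can no longer be told apart.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column; values are illustrative only.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Default: NaN is ignored, so row 2 is 0 in every column.
default = pd.get_dummies(toy, dtype=int)

# drop_first=True: row 1 ('C', the dropped category) and row 2 (NaN) look identical.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# dummy_na=True adds one extra indicator column for missing values.
with_na = pd.get_dummies(toy, dummy_na=True, dtype=int)

print(default.iloc[2].tolist())
print(reduced.iloc[1].tolist(), reduced.iloc[2].tolist())
print(with_na.shape[1])
```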


# One hot encoding for features with many categories

In [29]:
df1= pd.read_csv('mercedes.csv')
df1

# take X0 to X6, which hold categorical data

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,ak,s,as,c,d,aa,d,q,...,1,0,0,0,0,0,0,0,0,0
4205,8406,108.77,j,o,t,d,d,aa,h,h,...,0,1,0,0,0,0,0,0,0,0
4206,8412,109.22,ak,v,r,a,d,aa,g,e,...,0,0,1,0,0,0,0,0,0,0
4207,8415,87.48,al,r,e,f,d,aa,l,u,...,0,0,0,0,0,0,0,0,0,0


In [30]:
df= pd.read_csv('mercedes.csv', usecols= ['X0','X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [32]:
df['X0'].value_counts()

z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
j     181
az    175
aj    151
s     106
ap    103
h      75
d      73
al     67
v      36
af     35
m      34
ai     34
e      32
ba     27
at     25
a      21
ax     19
aq     18
am     18
i      18
u      17
aw     16
l      16
ad     14
au     11
k      11
b      11
r      10
as     10
bc      6
ao      4
c       3
aa      2
q       2
ac      1
g       1
ab      1
Name: X0, dtype: int64

In [35]:
df['X0'].unique()

array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
      dtype=object)

In [36]:
len(df['X0'].unique())

47

In [38]:
# print the number of unique categories in each feature
for i in df.columns:
    print(len(df[i].unique()))

# X0 has 47, X1 has 27, etc.

47
27
44
7
4
29
12
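
The same counts can be obtained without a loop using `DataFrame.nunique()`. One difference worth knowing: `unique()` includes NaN as a value, while `nunique()` excludes it unless `dropna=False` is passed. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "X0": ["k", "k", "az", "az", "t"],
    "X1": ["v", "t", "w", np.nan, "v"],
})

counts = toy.nunique()                       # NaN is not counted
counts_with_nan = toy.nunique(dropna=False)  # NaN counts as one more value

print(counts.to_dict())
print(counts_with_nan.to_dict())
```

On a column without missing values both approaches give the same number.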


- As there is a huge number of categories, applying OHE will add many dimensions to my dataset.

- It is a waste of resources if I create OHE columns for all these categories.

- In a Kaggle competition, one idea that was selected and implemented gave the best results with good accuracy.

- Only the top 10 most frequent categories within each feature were encoded, and the rest were dropped.


In [39]:
df.X1.value_counts()
# here we can see that the categories after u have very few occurrences

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
y      23
f      23
j      22
n      19
k      17
p       9
g       6
d       3
q       3
ab      3
Name: X1, dtype: int64

In [40]:
# take top 10 (value_counts() already sorts in descending order, so sort_values is optional)
df.X1.value_counts().sort_values(ascending= False).head(10)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

In [44]:
# apply OHE for just these top 10; we can also take more or fewer than 10 depending on the data

# to get the top 10:
# take the index of the previous result, since it holds the actual category values like aa, s, b etc.

top_10= df.X1.value_counts().sort_values(ascending= False).head(10).index
top_10

Index(['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o'], dtype='object')

In [45]:
top_10.dtype  # top_10 is a pandas Index; its elements have dtype object ('O')
# the previous output shows this clearly as Index([...], dtype='object')

dtype('O')

In [52]:
# converting the Index into a list
list(top_10)

# the result is a plain Python list, denoted by [...]; top_10 itself is still an Index

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

In [67]:
for categories in top_10:
    df[categories]= np.where(df["X1"]== categories, 1,0)

# for each category in top_10, a new column is created: 1 where df["X1"] equals that category, else 0
# i.e., rows whose category is not in top_10 (ex: w, z etc.) get 0 in all of the new columns.
# Rows whose category is in top_10 (ex: aa, s, b etc.) get 1 in the matching column.
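
As an alternative to the loop, the same result can be sketched with get_dummies by first masking every category outside the top list as NaN, since get_dummies skips NaN. The data below is made up, and only the top 2 categories are kept to keep the example small.

```python
import pandas as pd

# Toy column; values are illustrative only.
toy = pd.DataFrame({"X1": ["aa", "s", "aa", "w", "b", "z", "s", "aa"]})
top_2 = toy["X1"].value_counts().head(2).index  # 'aa' (3 rows) and 's' (2 rows)

# Keep values that are in top_2; everything else becomes NaN.
masked = toy["X1"].where(toy["X1"].isin(top_2))

# get_dummies ignores NaN, so rare categories get 0 in every column.
encoded = pd.get_dummies(masked, dtype=int)

print(encoded.columns.tolist())
print(encoded.iloc[3].tolist())  # row 3 is 'w', not in top_2
```

Note that get_dummies orders the columns alphabetically, while the loop keeps them in frequency order.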




In [68]:
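# trying to add 'X1' to the column list; Index.append only accepts another Index, so a string fails
top_10.append('X1')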

TypeError: all inputs must be Index

In [63]:
df[top_10]

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o
0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
4204,0,1,0,0,0,0,0,0,0,0
4205,0,0,0,0,0,0,0,0,0,1
4206,0,0,0,0,1,0,0,0,0,0
4207,0,0,0,0,0,1,0,0,0,0


In [71]:
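# DataFrame.append only accepts Series/DataFrame objects, so a column label fails here too
# (DataFrame.append was also removed in pandas 2.0)
df[top_10].append('X1')
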
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
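
Both errors come from passing a bare string where pandas expects an Index or a DataFrame. To view the original X1 column next to the new indicator columns, one option is to build the list of column labels first. A sketch on toy data (made-up values, 2 categories instead of 10):

```python
import numpy as np
import pandas as pd

# Toy frame with indicator columns built the same way as in the loop above.
toy = pd.DataFrame({"X1": ["aa", "s", "w", "aa"]})
top = pd.Index(["aa", "s"])
for category in top:
    toy[category] = np.where(toy["X1"] == category, 1, 0)

# Option 1: a plain Python list of column labels.
view_a = toy[["X1"] + list(top)]

# Option 2: Index.append works when given another Index.
view_b = toy[top.append(pd.Index(["X1"]))]

print(view_a.columns.tolist())
print(view_b.columns.tolist())
```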

In [72]:
df[top_10].head()

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o
0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0
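
The steps above can be wrapped into a small helper and applied to every feature. This is a sketch, not the competition's actual code: the function name `one_hot_top_k` is hypothetical, and the new columns are prefixed with the feature name (an addition not in the cells above) because short category labels such as 'a' or 's' can appear in more than one feature and would otherwise overwrite each other.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(frame, column, k=10):
    """Return 0/1 indicator columns for the k most frequent values of `column`."""
    top = frame[column].value_counts().head(k).index
    out = pd.DataFrame(index=frame.index)
    for category in top:
        # Prefix with the feature name to avoid clashes between features.
        out[f"{column}_{category}"] = np.where(frame[column] == category, 1, 0)
    return out

# Toy frame; column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "X1": ["aa", "s", "aa", "w", "s", "aa"],
    "X3": ["a", "c", "a", "a", "f", "c"],
})

# Encode the top 2 categories of each feature and join the results side by side.
encoded = pd.concat([one_hot_top_k(toy, col, k=2) for col in toy.columns], axis=1)

print(encoded.columns.tolist())
print(encoded.iloc[3].tolist())
```

Rows whose category falls outside the top k are 0 in all columns of that feature, as in the loop version.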
