# Feature Engineering - Encoding

Most machine learning models cannot work with categorical data directly, so we use encoding techniques to convert categorical data to numeric data.

### There are many types of encoding techniques

The important ones discussed here are:

1) One hot encoding 
2) One hot encoding with many unique values
3) Target guided ordinal encoding
4) Mean encoding
5) Count or Frequency encoding
6) Label encoding
7) Probability ratio encoding

For more information on encoders, see:

* category_encoders: https://contrib.scikit-learn.org/category_encoders/
* feature-engine: https://feature-engine.readthedocs.io/en/latest/encoding/index.html

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('titanic.csv', usecols = ['Sex', 'Embarked','Survived'])

In [3]:
df.head()

Unnamed: 0,Survived,Sex,Embarked
0,0,male,S
1,1,female,C
2,1,female,S
3,1,female,S
4,0,male,S


# 1) One hot encoding

* One-hot encoding is one of the most common encoding methods in machine learning. 
* This method spreads the values in a column to multiple flag columns and assigns 0 or 1 to them.

<img src="one hot encoding image.png">

In [4]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [5]:
df['Sex'].unique()

array(['male', 'female'], dtype=object)

In [6]:
pd.get_dummies(df, drop_first = False).head()

Unnamed: 0,Survived,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1,0,0,1
1,1,1,0,1,0,0
2,1,1,0,0,0,1
3,1,1,0,0,0,1
4,0,0,1,0,0,1


In [7]:
pd.get_dummies(df, drop_first = True).head()

Unnamed: 0,Survived,Sex_male,Embarked_Q,Embarked_S
0,0,1,0,1
1,1,0,0,0
2,1,0,0,1
3,1,0,0,1
4,0,1,0,1
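One detail worth checking: `Embarked` contains missing values, and by default `pd.get_dummies` gives such rows 0 in every dummy column. Below is a minimal sketch on a small made-up Series (not the Titanic file) that compares the default behaviour with the `dummy_na=True` option. I pass `dtype=int` because, as far as I know, recent pandas versions (2.0 and later) return boolean columns by default instead of the 0/1 values shown above.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Embarked' (values are made up for illustration)
embarked = pd.Series(['S', 'C', 'Q', np.nan], name='Embarked')

# Default: the missing value gets 0 in every dummy column
default_dummies = pd.get_dummies(embarked, dtype=int)

# dummy_na=True adds an explicit indicator column for missing values
with_nan_dummies = pd.get_dummies(embarked, dummy_na=True, dtype=int)

print(default_dummies)
print(with_nan_dummies)
```

Whether to keep an explicit missing-value column depends on the model and on how the missing values were handled earlier.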


# 2) One hot encoding with many unique values

In [8]:
data = pd.read_csv('mercedes.csv', usecols = ['X1', 'X2'])

In [9]:
data.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [10]:
data.shape

(4209, 2)

In [11]:
data.X1.unique()

array(['v', 't', 'w', 'b', 'r', 'l', 's', 'aa', 'c', 'a', 'e', 'h', 'z',
       'j', 'o', 'u', 'p', 'n', 'i', 'y', 'd', 'f', 'm', 'k', 'g', 'q',
       'ab'], dtype=object)

In [12]:
data.X2.unique()

array(['at', 'av', 'n', 'e', 'as', 'aq', 'r', 'ai', 'ak', 'm', 'a', 'k',
       'ae', 's', 'f', 'd', 'ag', 'ay', 'ac', 'ap', 'g', 'i', 'aw', 'y',
       'b', 'ao', 'al', 'h', 'x', 'au', 't', 'an', 'z', 'ah', 'p', 'am',
       'j', 'q', 'af', 'l', 'aa', 'c', 'o', 'ar'], dtype=object)

In [13]:
for col in data.columns:
    print(col, ': ', len(data[col].unique()), 'unique counts')

X1 :  27 unique counts
X2 :  44 unique counts


##### If we one-hot encode these columns we get many new columns (27 for X1 and 44 for X2), which triggers the curse of dimensionality

### KDD Cup Orange Challenge
What can we do instead?

One option comes from the winning solution of the KDD Cup 2009, described in the paper "Winning the KDD Cup Orange Challenge with Ensemble Selection" (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf).

The team suggested taking the 10 most frequent labels of a variable and converting only those into dummy variables with one-hot encoding.

How can we do that in Python?

In [14]:
# let's look at the 20 most frequent categories of X2 to see where the top 10 cut off
data.X2.value_counts().sort_values(ascending=False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
i       25
k       25
b       21
ao      20
ag      19
z       19
Name: X2, dtype: int64

In [15]:
# let's make a list with the most frequent categories of the variable
top_10_labels = [y for y in data.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10_labels

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [16]:
# create dummy variables only for the most frequent labels of one variable

def one_hot_encoding_top_x(df, variable, top_x_labels):
    # adds one 0/1 column per label in top_x_labels, modifying df in place;
    # rows whose label is not in the list get 0 in every new column
    
    for label in top_x_labels:
        df[variable + '_' + label] = np.where(df[variable] == label, 1, 0)

In [17]:
one_hot_encoding_top_x(data, 'X2', top_10_labels)

In [18]:
data.head()

Unnamed: 0,X1,X2,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,0,0,0,0,0,0,0,0,0,0
1,t,av,0,0,0,0,0,0,0,0,0,0
2,w,n,0,0,0,0,0,0,1,0,0,0
3,t,n,0,0,0,0,0,0,1,0,0,0
4,v,n,0,0,0,0,0,0,1,0,0,0
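In practice the list of top labels should probably be learned on the training data only and then reused on new data. Here is a minimal sketch of that idea. The helper names `top_k_labels` and `encode_top_labels` and the toy frames are my own assumptions for illustration, not part of the Mercedes dataset or of any library.

```python
import numpy as np
import pandas as pd

def top_k_labels(series, k=10):
    """Return the k most frequent labels of a Series as a list."""
    return series.value_counts().head(k).index.tolist()

def encode_top_labels(df, variable, labels):
    """Return a copy of df with one 0/1 column per label in `labels`."""
    out = df.copy()
    for label in labels:
        out[f'{variable}_{label}'] = np.where(out[variable] == label, 1, 0)
    return out

# Toy data (made up): 'x' is the most frequent label, 'z' is rare
train = pd.DataFrame({'X2': ['x', 'x', 'x', 'y', 'y', 'z']})
test = pd.DataFrame({'X2': ['x', 'z', 'w']})

labels = top_k_labels(train['X2'], k=2)       # learned from train only
train_enc = encode_top_labels(train, 'X2', labels)
test_enc = encode_top_labels(test, 'X2', labels)
print(test_enc)
```

Unlike the function above, this version returns a copy instead of modifying the frame in place, and labels that were rare or unseen in training (`z` and `w` here) simply get 0 in every dummy column.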


# 3) Target guided ordinal encoding

In [19]:
df = pd.read_csv('titanic.csv', usecols = ['Sex', 'Embarked','Survived','Cabin'])

In [20]:
# replace NaN values with 'Missing'
df['Cabin'] = df['Cabin'].fillna('Missing')

In [21]:
# take the first letter of each Cabin value ('M' stands for 'Missing')
df['First_letter_in_Cabin'] = df['Cabin'].astype(str).str[0]

In [22]:
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin
0,0,male,Missing,S,M
1,1,female,C85,C,C
2,1,female,Missing,S,M
3,1,female,C123,S,C
4,0,male,Missing,S,M


In [23]:
# group by the cabin letter and take the mean of Survived;
# the mean of a 0/1 target is the survival rate of each group

#  survival rate for cabin letter A is 0.466667
#  survival rate for cabin letter B is 0.744681
#  survival rate for cabin letter C is 0.593220
# ..

df.groupby(['First_letter_in_Cabin'])['Survived'].mean()

First_letter_in_Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [24]:
df.groupby(['First_letter_in_Cabin'])['Survived'].mean().sort_values()

# to order the result alphabetically by cabin letter (A, B, C, ...) instead, use:

# df.groupby(['First_letter_in_Cabin'])['Survived'].mean().sort_index()

First_letter_in_Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [25]:
# keep only the index: the cabin letters ordered from lowest to highest survival rate
ordinal_labels = df.groupby(['First_letter_in_Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='First_letter_in_Cabin')

In [26]:
# create a dictionary that maps each cabin letter to its rank
ordinal_dict = {k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_dict

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [27]:
# create a new column by mapping the dictionary onto the cabin letter column
df['Cabin_ordinal_labels']=df['First_letter_in_Cabin'].map(ordinal_dict)
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin,Cabin_ordinal_labels
0,0,male,Missing,S,M,1
1,1,female,C85,C,C,4
2,1,female,Missing,S,M,1
3,1,female,C123,S,C,4
4,0,male,Missing,S,M,1
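The cells above can be condensed into one small helper: compute the mean target per category, sort the categories by that mean, and use the position in the sorted order as the code. This is only a sketch. The function name `target_guided_ordinal_map` and the toy data are assumptions of mine for illustration.

```python
import pandas as pd

def target_guided_ordinal_map(df, variable, target):
    """Map each category to its rank when sorted by mean target (0 = lowest)."""
    ordered = df.groupby(variable)[target].mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Toy data (made up): mean target is 0.5 for A, 1.0 for B and 0.0 for C
toy = pd.DataFrame({
    'deck':     ['A', 'A', 'B', 'B', 'C', 'C'],
    'survived': [0,   1,   1,   1,   0,   0],
})

mapping = target_guided_ordinal_map(toy, 'deck', 'survived')
toy['deck_ordinal'] = toy['deck'].map(mapping)
print(mapping)
```

Here `C` gets 0, `A` gets 1 and `B` gets 2, so a larger code always means a higher mean target.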


# 4) Mean encoding

* The MeanEncoder() replaces categories by the mean value of the target for each category.
* For example in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.
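The colour example can be reproduced with plain pandas. The toy data below is made up so that the target means come out as 0.5, 0.8 and 0.1. It is a sketch of the idea, not the `MeanEncoder()` implementation itself.

```python
import pandas as pd

# Toy data (made up): 2 blue rows, 5 red rows, 10 grey rows
toy = pd.DataFrame({
    'colour': ['blue'] * 2 + ['red'] * 5 + ['grey'] * 10,
    'target': [1, 0] + [1, 1, 1, 1, 0] + [1] + [0] * 9,
})

# Mean of the target per colour, then map it back onto the column
colour_means = toy.groupby('colour')['target'].mean().to_dict()
toy['colour_mean_encoded'] = toy['colour'].map(colour_means)
print(colour_means)
```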

In [28]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [29]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.336957
Name: Survived, dtype: float64

In [30]:
# converting to dictionary (helps us to map)

mean_ordinal = df.groupby(['Embarked'])['Survived'].mean().to_dict()
mean_ordinal

{'C': 0.5535714285714286, 'Q': 0.38961038961038963, 'S': 0.33695652173913043}

In [31]:
# create a new column by mapping the dictionary onto the Embarked column (NaN stays NaN)
df['mean_ordinal_encode'] = df['Embarked'].map(mean_ordinal)
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin,Cabin_ordinal_labels,mean_ordinal_encode
0,0,male,Missing,S,M,1,0.336957
1,1,female,C85,C,C,4,0.553571
2,1,female,Missing,S,M,1,0.336957
3,1,female,C123,S,C,4,0.336957
4,0,male,Missing,S,M,1,0.336957


# 5) Count or Frequency encoding

* Works like mean encoding, but each category is replaced by its count (the number of rows it appears in) instead of the target mean.

In [32]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [33]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [34]:
count_dict = df['Embarked'].value_counts().to_dict()
count_dict

{'S': 644, 'C': 168, 'Q': 77}

In [35]:
# create a new column by mapping the dictionary onto the Embarked column
df['count_encoding'] = df['Embarked'].map(count_dict)
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin,Cabin_ordinal_labels,mean_ordinal_encode,count_encoding
0,0,male,Missing,S,M,1,0.336957,644.0
1,1,female,C85,C,C,4,0.553571,168.0
2,1,female,Missing,S,M,1,0.336957,644.0
3,1,female,C123,S,C,4,0.336957,644.0
4,0,male,Missing,S,M,1,0.336957,644.0
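The cells above show the count variant. For the frequency variant, one option is `value_counts(normalize=True)`, which gives the share of each category among the non-missing rows. A minimal sketch on a made-up Series follows. The column names are my own choice.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Embarked' (values are made up)
port = pd.Series(['S', 'S', 'S', 'C', 'Q', np.nan], name='Embarked')

# value_counts ignores NaN by default
count_map = port.value_counts().to_dict()
freq_map = port.value_counts(normalize=True).to_dict()

encoded = pd.DataFrame({
    'Embarked': port,
    'count_encoding': port.map(count_map),
    'frequency_encoding': port.map(freq_map),
})
print(encoded)
```

The missing value is not in either dictionary, so it maps to NaN. That is also why `count_encoding` shows up as a float column (`644.0`) in the output above.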


# 6) Label Encoding

* Works like mean encoding, but here we define the dictionary ourselves and map it onto the column.
* Mainly used for ordinal data, e.g. good, better, best (good: 1, better: 2, best: 3).
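For the good/better/best example, the important part is that we fix the order ourselves instead of relying on the order in which the values happen to appear. A minimal sketch with a made-up `rating` column:

```python
import pandas as pd

# Toy ordinal column (made up for illustration)
ratings = pd.Series(['good', 'best', 'better', 'good'], name='rating')

# Explicit order from lowest to highest; the codes follow this order
order = ['good', 'better', 'best']
rating_map = {label: code for code, label in enumerate(order, start=1)}

ratings_encoded = ratings.map(rating_map)
print(rating_map)
print(ratings_encoded.tolist())
```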

In [36]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [37]:
# create the dictionary manually (the next cell overwrites it with a generated one)
label_dict = {'C':1, 'S':2, 'Q':3}

In [38]:
# create the dictionary in code

# drop NaN so that the keys and the codes line up one to one
keys = list(df['Embarked'].dropna().unique())
values = range(len(keys))

label_dict = dict(zip(keys, values))
print(label_dict)

{'S': 0, 'C': 1, 'Q': 2}


In [39]:
# another way to create the dictionary; note that unique() keeps NaN, so NaN gets a code too
ordinal_labels2 = {k:i for i, k in enumerate(df['Embarked'].unique(),0)}
ordinal_labels2

{'S': 0, 'C': 1, 'Q': 2, nan: 3}

In [40]:
df['label_encoding'] = df['Embarked'].map(label_dict)
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin,Cabin_ordinal_labels,mean_ordinal_encode,count_encoding,label_encoding
0,0,male,Missing,S,M,1,0.336957,644.0,0.0
1,1,female,C85,C,C,4,0.553571,168.0,1.0
2,1,female,Missing,S,M,1,0.336957,644.0,0.0
3,1,female,C123,S,C,4,0.336957,644.0,0.0
4,0,male,Missing,S,M,1,0.336957,644.0,0.0


# 7) Probability ratio encoding

* The PRatioEncoder() replaces categories by the ratio of the probability of the target = 1 and the probability of the target = 0.
* This categorical encoding is exclusive for binary classification.

1. Probability of Survived for each category of the feature (the code below uses Embarked)
2. Probability of Not Survived: 1 - pr(Survived)
3. Ratio: pr(Survived) / pr(Not Survived)
4. Dictionary that maps each category to its ratio
5. Replace the categorical feature with the mapped ratio

In [41]:
prob_df = df.groupby(['Embarked'])['Survived'].mean()
# converting to data frame
prob_df = pd.DataFrame(prob_df)

In [42]:
prob_df

Embarked,Survived
C,0.553571
Q,0.38961
S,0.336957


In [43]:
# creating new column died (i.e. 1 - survived)
prob_df['Died'] = 1 - prob_df['Survived']

In [44]:
prob_df

Embarked,Survived,Died
C,0.553571,0.446429
Q,0.38961,0.61039
S,0.336957,0.663043


In [45]:
#creating a new column and calculating the ratio
prob_df['Probability_ratio'] = prob_df['Survived']/prob_df['Died']

In [46]:
prob_df

Embarked,Survived,Died,Probability_ratio
C,0.553571,0.446429,1.24
Q,0.38961,0.61039,0.638298
S,0.336957,0.663043,0.508197


In [47]:
# converting to dict
probability_encoded = prob_df['Probability_ratio'].to_dict()
probability_encoded

{'C': 1.2400000000000002, 'Q': 0.6382978723404256, 'S': 0.5081967213114753}

In [48]:
df['probablity_ratio_encoding'] = df['Embarked'].map(probability_encoded)
df.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked,First_letter_in_Cabin,Cabin_ordinal_labels,mean_ordinal_encode,count_encoding,label_encoding,probablity_ratio_encoding
0,0,male,Missing,S,M,1,0.336957,644.0,0.0,0.508197
1,1,female,C85,C,C,4,0.553571,168.0,1.0,1.24
2,1,female,Missing,S,M,1,0.336957,644.0,0.0,0.508197
3,1,female,C123,S,C,4,0.336957,644.0,0.0,0.508197
4,0,male,Missing,S,M,1,0.336957,644.0,0.0,0.508197
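The five steps can also be wrapped into one helper. One case the Titanic columns above do not hit is a category where every row has target = 1: the probability of target = 0 is then 0 and the ratio is undefined. In the sketch below I chose to return NaN for that case. That choice, the function name `probability_ratio_map` and the toy data are all assumptions of mine, not the behaviour of `PRatioEncoder()`.

```python
import numpy as np
import pandas as pd

def probability_ratio_map(df, variable, target):
    """Map each category to P(target=1) / P(target=0) within that category."""
    p1 = df.groupby(variable)[target].mean()
    p0 = 1 - p1
    # A category with p0 == 0 has no defined ratio; mark it as NaN
    ratio = p1 / p0.replace(0, np.nan)
    return ratio.to_dict()

# Toy data (made up): every 'Q' row has survived = 1
toy = pd.DataFrame({
    'port':     ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q', 'Q'],
    'survived': [1,   1,   0,   1,   0,   0,   0,   1,   1],
})

ratios = probability_ratio_map(toy, 'port', 'survived')
toy['port_ratio'] = toy['port'].map(ratios)
print(ratios)
```

Here `C` gets a ratio of about 2, `S` about 0.33, and `Q` gets NaN, which would still need to be handled before modelling.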
