# Encoding

**Categorical data:**
Variables that are usually represented as strings or categories and take a finite number of values.
There are two kinds of categorical data:

   * Ordinal data: the categories have an inherent order, and the encoding should retain it
      (for example, the highest degree a person possesses).
   * Nominal data: the categories have no notion of order; only the presence or absence of a category matters
      (for example, the city a person lives in).


**Label Encoding or Ordinal Encoding:**
This encoding technique is used when retaining the order of the categories is important, so the encoding should reflect the correct sequence. In label encoding, each label is converted into an integer value.

In [8]:
import category_encoders as ce
import pandas as pd
train_df=pd.DataFrame({'Degree':['High school','Masters','Diploma','Bachelors','Bachelors','Masters','Phd','High school','High school']})

# create object of Ordinalencoding
encoder= ce.OrdinalEncoder(cols=['Degree'],return_df=True,
                           mapping=[{'col':'Degree',
'mapping':{'None':0,'High school':1,'Diploma':2,'Bachelors':3,'Masters':4,'Phd':5}}])

#Original data
train_df

Unnamed: 0,Degree
0,High school
1,Masters
2,Diploma
3,Bachelors
4,Bachelors
5,Masters
6,Phd
7,High school
8,High school


In [12]:
#fit and transform train data 
df_train_transformed = encoder.fit_transform(train_df)
df_train_transformed

Unnamed: 0,Degree
0,1
1,4
2,2
3,3
4,3
5,4
6,5
7,1
8,1
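
The same ordered mapping can be sketched with plain pandas, which makes the idea independent of any encoder library. This is a minimal illustration, assuming the ranking used in the cell above; the names `degrees` and `degree_rank` are introduced here only for the example.

```python
import pandas as pd

# Same toy data as in the cell above
degrees = pd.Series(['High school', 'Masters', 'Diploma', 'Bachelors',
                     'Bachelors', 'Masters', 'Phd', 'High school', 'High school'])

# Explicit low-to-high ranking (illustrative name); a label missing
# from this dictionary would be mapped to NaN
degree_rank = {'None': 0, 'High school': 1, 'Diploma': 2,
               'Bachelors': 3, 'Masters': 4, 'Phd': 5}

# Replace each label with its rank
encoded = degrees.map(degree_rank)
print(encoded.tolist())
```

Because the integers follow the ranking, comparisons such as "Masters > Diploma" remain meaningful after encoding.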


**One Hot Encoding:**
This encoding technique is used when the features are nominal (they do not have any order). For each level of a categorical feature, we create a new variable. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category. These newly created binary features are known as dummy variables. The number of dummy variables equals the number of levels present in the categorical variable.

In [13]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hydrabad','Chennai','Bangalore','Delhi','Hydrabad','Bangalore','Delhi'
]})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='City',handle_unknown='return_nan',return_df=True,use_cat_names=True)

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hydrabad
3,Chennai
4,Bangalore
5,Delhi
6,Hydrabad
7,Bangalore
8,Delhi


In [16]:
#Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded


Unnamed: 0,City_Delhi,City_Mumbai,City_Hydrabad,City_Chennai,City_Bangalore
0,1.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0
5,1.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0
8,1.0,0.0,0.0,0.0,0.0
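
To see what the encoder does under the hood, one-hot encoding can also be written by hand. The following is a small sketch on a shortened, made-up list of cities, not the library's implementation; `levels` and `one_hot` are names chosen for this example.

```python
import numpy as np
import pandas as pd

cities = pd.Series(['Delhi', 'Mumbai', 'Chennai', 'Delhi'])

# Unique levels in order of first appearance
levels = list(dict.fromkeys(cities))

# One row per observation, one column per level:
# 1 where the observation equals that level, 0 otherwise
one_hot = np.array([[int(city == level) for level in levels] for city in cities])

one_hot_df = pd.DataFrame(one_hot, columns=[f'City_{level}' for level in levels])
print(one_hot_df)
```

Every row contains exactly one 1, which is the defining property of a one-hot representation.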


**Dummy Encoding:**
The dummy coding scheme is similar to one-hot encoding. The only difference is that dummy encoding uses N-1 features to represent N labels/categories, whereas one-hot encoding uses N binary variables for the N categories of a variable. The dropped category is represented by a row of all 0s.

In [18]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [19]:
#encode the data
data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded

Unnamed: 0,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,0,1,0,0
1,0,0,0,1
2,0,0,1,0
3,1,0,0,0
4,0,0,0,0
5,0,1,0,0
6,0,0,1,0


**Drawbacks of One-Hot and Dummy Encoding**

  1. If a column has 30 different values, one-hot encoding requires 30 new variables (29 with dummy encoding). This increases the number of columns and introduces sparsity: most entries are 0s and only a few are 1s, so many dummy features are created without adding much information.

  2. They might lead to the dummy variable trap, a phenomenon where features are highly correlated, meaning that the value of one variable can easily be predicted from the others. With one-hot encoding the N columns always sum to 1, so any one of them is fully determined by the remaining N-1; dummy encoding avoids this by dropping one column.

  3. The large increase in the number of columns slows down the learning of the model and can deteriorate overall performance, making the model computationally expensive. Further, these encodings are often not an optimum choice for tree-based models.
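
The dummy variable trap can be checked numerically. The sketch below, which assumes a linear model with an intercept column, compares the rank of the design matrix for full one-hot encoding and for dummy encoding; `with_intercept` is a helper defined only for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})

full = pd.get_dummies(data, dtype=int)                      # N = 5 columns
reduced = pd.get_dummies(data, drop_first=True, dtype=int)  # N-1 = 4 columns

def with_intercept(df):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(df)), df.to_numpy()])

X_full = with_intercept(full)        # 6 columns
X_reduced = with_intercept(reduced)  # 5 columns

# The one-hot columns sum to the intercept column, so one column is redundant
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

A rank lower than the number of columns means the columns are linearly dependent; dropping one level restores full column rank.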

**Effect Encoding:**
This encoding technique is also known as **Deviation Encoding** or **Sum Encoding**. Effect encoding is almost the same as dummy encoding, with one difference: dummy coding uses only 0 and 1 to represent the data, whereas effect encoding uses three values: 1, 0, and -1.

The row containing only 0s in dummy encoding is encoded as -1s in effect encoding. In the dummy encoding example, the city Bangalore at index 4 was encoded as (0, 0, 0, 0), whereas in effect encoding it is represented by (-1, -1, -1, -1).

In [21]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']}) 
encoder=ce.SumEncoder(cols=['City'],verbose=False)

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [22]:
encoder.fit_transform(data)


Unnamed: 0,intercept,City_0,City_1,City_2,City_3
0,1,1.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0
3,1,0.0,0.0,0.0,1.0
4,1,-1.0,-1.0,-1.0,-1.0
5,1,1.0,0.0,0.0,0.0
6,1,0.0,0.0,1.0,0.0
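
The relationship between dummy and effect encoding can be sketched by hand with pandas: start from one-hot columns, drop the reference level, and overwrite its rows with -1. This is an illustration only; the choice of Bangalore as the reference level is an assumption made to mirror the output above, and the column order differs from the library's.

```python
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})
reference = 'Bangalore'  # assumed reference level for this sketch

# One-hot columns without the reference level (this is dummy encoding)
effect = pd.get_dummies(data['City'], dtype=int).drop(columns=reference)

# Rows that belong to the reference level become -1 in every column
effect.loc[data['City'] == reference, :] = -1
print(effect)
```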


**Hash Encoder:**
To understand hash encoding, it is necessary to know about hashing. Hashing is the transformation of an input of arbitrary size into a fixed-size value. Hashing algorithms generate the hash value of an input. Further, hashing is a one-way process; in other words, one cannot recover the original input from its hash representation.

Hashing has several applications, such as data retrieval, checking for data corruption, and cryptography. Multiple hash functions are available, for example the Message Digest family (MD2, MD4, MD5) and the Secure Hash Algorithm family (SHA-0, SHA-1, SHA-2), among many others.

Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, the user can fix the number of dimensions after transformation using the n_components argument. Here is what I mean: a feature with 5 categories can be represented using N new features, and a feature with 100 categories can also be transformed using the same N new features. Doesn’t this sound amazing?

By default, the hashing encoder uses the md5 hashing algorithm, but a user can pass any algorithm of their choice through the hash_method argument.

In [24]:
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols='Month',n_components=6)

#Original Data
data

Unnamed: 0,Month
0,January
1,April
2,March
3,April
4,Februay
5,June
6,July
7,June
8,September


In [25]:
#Fit and Transform Data
encoder.fit_transform(data)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5
0,0,0,0,0,1,0
1,0,0,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0
5,0,1,0,0,0,0
6,1,0,0,0,0,0
7,0,1,0,0,0,0
8,0,0,0,0,1,0


Since hashing maps the data into fewer dimensions, it may lead to loss of information. Another issue faced by the hashing encoder is collision: because a large number of categories is mapped into a smaller number of dimensions, multiple values can be represented by the same hash value. In the output above, for example, January, March, and September all land in col_4.
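
The idea behind hash encoding and collisions can be illustrated with the standard library alone. The sketch below is not the algorithm used by category_encoders; it simply hashes each label with md5 and takes the result modulo the number of columns. The helper names `hash_bucket` and `hash_encode` are made up for this example.

```python
import hashlib

def hash_bucket(value, n_components=6):
    """Map a string to a column index in [0, n_components)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(value, n_components=6):
    """Return a row with a single 1 in the hashed column."""
    row = [0] * n_components
    row[hash_bucket(value, n_components)] = 1
    return row

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

buckets = {month: hash_bucket(month) for month in months}
for month, bucket in buckets.items():
    print(month, bucket, hash_encode(month))

# 12 labels are squeezed into 6 columns, so some labels must share a column
n_distinct = len(set(buckets.values()))
print('labels:', len(months), 'distinct columns used:', n_distinct)
```

With 12 labels and only 6 columns, at least two labels are guaranteed to collide, whichever hash function is chosen.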

**Binary Encoding:**
Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding scheme, the categorical feature is first converted into numbers using an ordinal encoder. The numbers are then written in binary, and the binary digits are split into separate columns.

Binary encoding works really well when there is a high number of categories, for example the cities in a country where a company supplies its products.

In [28]:
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the Dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create object for binary encoding
encoder= ce.BinaryEncoder(cols=['City'],return_df=True)

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [29]:
#Fit and Transform Data 
data_encoded=encoder.fit_transform(data) 
data_encoded


Unnamed: 0,City_0,City_1,City_2,City_3
0,0,0,0,1
1,0,0,1,0
2,0,0,1,1
3,0,1,0,0
4,0,1,0,1
5,0,0,0,1
6,0,0,1,1
7,0,0,1,0
8,0,1,1,0


Binary encoding is a memory-efficient encoding scheme as it uses fewer features than one-hot encoding: N categories need roughly log2(N) columns instead of N. Further, it reduces the curse of dimensionality for data with high cardinality.
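
The two steps of binary encoding (ordinal codes, then binary digits) can be reproduced by hand. This is a sketch under the assumption that codes start at 1 and follow the order of first appearance; the library output above carries one additional leading column of zeros, so only the last three columns correspond.

```python
import pandas as pd

cities = pd.Series(['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                    'Delhi', 'Hyderabad', 'Mumbai', 'Agra'])

# Step 1: ordinal codes in order of first appearance, starting at 1
codes = pd.factorize(cities)[0] + 1      # Delhi=1, Mumbai=2, ..., Agra=6

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
bits = [[int((code >> shift) & 1) for shift in range(n_bits - 1, -1, -1)]
        for code in codes]
binary_df = pd.DataFrame(bits, columns=[f'City_{i}' for i in range(n_bits)])
print(binary_df)
```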

**Base N Encoding:**
For binary encoding, the base is 2, which means it converts the numerical value of a category into its binary form. If you want to change the base of the encoding scheme, you may use the Base N encoder. When there are many categories and binary encoding still produces too many columns, a larger base such as 4 or 8 can be used.

In [31]:
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)

#Original Data
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [32]:
#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded


Unnamed: 0,City_0,City_1,City_2
0,0,0,1
1,0,0,2
2,0,0,3
3,0,0,4
4,0,1,0
5,0,0,1
6,0,0,3
7,0,0,2
8,0,1,1


In the above example, I have used base 5, also known as the quinary system. It is similar to the example of binary encoding. While binary encoding represents the same data with 4 new features, the BaseN encoding uses only 3 new variables.

Hence, the BaseN encoding technique further reduces the number of features required to represent the data, improving memory usage. The default base for Base N is 2, which is equivalent to binary encoding.
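
How the base affects the number of columns can be worked out with a few lines of plain Python. The helpers `digits_needed` and `to_base_n` below are hypothetical names written for this illustration, assuming ordinal codes that start at 1; the library output above includes one extra leading column of zeros compared with these counts.

```python
def digits_needed(max_code, base):
    """Smallest number of base-`base` digits that can represent max_code."""
    width = 1
    while base ** width <= max_code:
        width += 1
    return width

def to_base_n(code, base, width):
    """Digits of `code` in the given base, most significant first, zero padded."""
    digits = []
    for _ in range(width):
        digits.append(code % base)
        code //= base
    return digits[::-1]

max_code = 6  # six distinct cities, coded 1..6

for base in (2, 5):
    width = digits_needed(max_code, base)
    print(f'base {base}: {width} columns, code 6 ->', to_base_n(6, base, width))
```

A larger base needs fewer digits per code, at the cost of columns that hold values other than 0 and 1.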

**Target Encoding:**
Target encoding is a Bayesian encoding technique.

In target encoding, we calculate the mean of the target variable for each category and replace the category with that mean value. For a categorical target variable, each category is replaced by the posterior probability of the target given that category.
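
The per-category mean described above can be computed directly with a pandas groupby. This is a plain, unsmoothed sketch on the same toy data used in the next cell; library encoders such as TargetEncoder additionally blend each category mean with the overall mean, so their numbers will generally differ from these.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                   'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

# Mean of the target for each category
class_means = df.groupby('class')['Marks'].mean()

# Replace each category with its mean
encoded = df['class'].map(class_means)
print(class_means)
print(encoded.tolist())
```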

In [35]:
#import the libraries
import pandas as pd
import category_encoders as ce

#Create the Dataframe
Data=pd.DataFrame({'class':['A','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})

#Create target encoding object
encoder=ce.TargetEncoder(cols='class') 

#Original Data
Data

Unnamed: 0,class,Marks
0,A,50
1,B,30
2,C,70
3,B,80
4,C,45
5,A,97
6,A,80
7,A,68


In [36]:
#Fit and Transform Train Data
encoder.fit_transform(Data['class'],Data['Marks'])


Unnamed: 0,class
0,65.0
1,57.689414
2,59.517061
3,57.689414
4,59.517061
5,79.679951
6,79.679951
7,79.679951


We fit the target encoder on the training data only and encode the test data using the results obtained from the training dataset. Although it is a very efficient coding system, it has the following issues that can deteriorate model performance:

1. It can lead to target leakage or overfitting. To address overfitting we can use different techniques.

     * In leave-one-out encoding, the current row's target value is excluded when computing the mean for its category, to avoid leakage.
     * In another method, we may introduce some Gaussian noise into the target statistics. The scale of this noise is a hyperparameter of the model.

2. The second issue we may face is an uneven distribution of categories between train and test data. In such a case, the means of rare categories may assume extreme values. Therefore, the target mean of each category is blended with the marginal (overall) mean of the target.
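
Both remedies can be sketched with pandas on the toy data from the target encoding example. The leave-one-out part follows the description above; the smoothing formula is one common way of blending with the overall mean, and the weight `m` is a hypothetical hyperparameter chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                   'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

grouped = df.groupby('class')['Marks']
cat_sum = grouped.transform('sum')      # per-row sum of the row's category
cat_count = grouped.transform('count')  # per-row size of the row's category

# Leave-one-out: mean of the *other* rows in the same category.
# Undefined for categories with a single row (division by zero).
loo = (cat_sum - df['Marks']) / (cat_count - 1)

# Smoothing: pull each category mean towards the overall mean.
# m acts like a number of pseudo-observations (assumed value).
global_mean = df['Marks'].mean()
m = 2
smoothed = (cat_sum + m * global_mean) / (cat_count + m)

print(pd.DataFrame({'class': df['class'], 'loo': loo, 'smoothed': smoothed}))
```

Small categories are pulled more strongly towards the overall mean than large ones, which limits extreme values for rare categories.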

# Encoding using Sklearn

**Label Encoding in Python using the category codes approach:**

In [37]:
# import required libraries
import pandas as pd
import numpy as np
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# converting type of columns to 'category'
bridge_df['Bridge_Types'] = bridge_df['Bridge_Types'].astype('category')
# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = bridge_df['Bridge_Types'].cat.codes
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied Arch,5
5,Suspension,4
6,Cable,2
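
The codes in the table above are not arbitrary: pandas sorts the category labels and uses each label's position in that sorted list as its code. The short sketch below inspects that list and decodes the codes back to labels; the variable names are chosen for this example.

```python
import pandas as pd

bridge_types = ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable']
s = pd.Series(bridge_types).astype('category')

# Sorted list of labels; a label's index in this list is its code
categories = list(s.cat.categories)
codes = s.cat.codes.tolist()
print(categories)
print(codes)

# Decode: look each code up in the categories index
decoded = list(s.cat.categories.take(s.cat.codes.to_numpy()))
print(decoded)
```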


**Using the scikit-learn library approach:**

In [38]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied Arch,5
5,Suspension,4
6,Cable,2


Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them. This ordering issue is addressed in another common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column and assigned a 1 or 0 (notation for true/false) value. Let’s consider the previous example of bridge types with one-hot encoding.

Though this approach eliminates the hierarchy/order issue, it has the downside of adding more columns to the data set. It can cause the number of columns to expand greatly if you have many unique values in a category column. In the example below, it is manageable, but it gets really challenging when the encoding produces many columns.

**One-Hot Encoding in Python using the scikit-learn library approach:**
Older versions of scikit-learn's OneHotEncoder (before 0.20) accepted only integer-coded categorical values, so any string column had to be label encoded before being one-hot encoded; newer versions also accept string columns directly. Taking the dataframe from the previous example, we apply OneHotEncoder to the column Bridge_Types_Cat.

In [53]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
# passing bridge-types-cat column (label encoded values of bridge_types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Types_Cat']]).toarray())
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(enc_df)
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Cat,0,1,2,3,4,5,6
0,Arch,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Beam,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Truss,6,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,Cantilever,3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,Tied Arch,5,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,Suspension,4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,Cable,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0


**Using the dummies values approach:**
This approach is more flexible because it allows you to encode as many category columns as you would like and to choose how to label the new columns using a prefix. Proper naming will make the rest of the analysis just a little bit easier.

In [54]:
import pandas as pd
import numpy as np
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# generate binary values using get_dummies
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"] )
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(dum_df)
bridge_df

Unnamed: 0,Bridge_Types,Type_is_Arch,Type_is_Beam,Type_is_Cable,Type_is_Cantilever,Type_is_Suspension,Type_is_Tied Arch,Type_is_Truss
0,Arch,1,0,0,0,0,0,0
1,Beam,0,1,0,0,0,0,0
2,Truss,0,0,0,0,0,0,1
3,Cantilever,0,0,0,1,0,0,0
4,Tied Arch,0,0,0,0,0,1,0
5,Suspension,0,0,0,0,1,0,0
6,Cable,0,0,1,0,0,0,0
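
The flexibility mentioned above shows when several categorical columns are encoded in one call, each with its own prefix. In this sketch the Material and Span_m columns are hypothetical additions for illustration, and `dtype=int` is passed so that the result contains 0/1 integers regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Bridge_Types': ['Arch', 'Beam', 'Arch'],
                   'Material': ['Steel', 'Stone', 'Stone'],  # hypothetical column
                   'Span_m': [120, 45, 80]})                 # hypothetical numeric column

# Encode two columns at once, with one prefix per column;
# columns not listed (Span_m) are passed through unchanged
dum_df = pd.get_dummies(df,
                        columns=['Bridge_Types', 'Material'],
                        prefix={'Bridge_Types': 'Type_is', 'Material': 'Made_of'},
                        dtype=int)
print(dum_df)
```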
