In [2]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.4.1-py2.py3-none-any.whl (80 kB)

Installing collected packages: category-encoders
Successfully installed category-encoders-2.4.1


In [10]:
import category_encoders as ce
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

## Label Encoding or Ordinal Encoding
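We use this categorical data encoding technique when the categorical feature is ordinal, that is, when its categories have a natural order.
In this case, retaining the order is important, so each category is mapped to an integer that reflects its rank.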

In [14]:
train_df = pd.DataFrame({'Degree': ['Phd','Masters','Bachelors','Masters','Bachelors']})

In [15]:
train_df

      Degree
0        Phd
1    Masters
2  Bachelors
3    Masters
4  Bachelors


In [25]:
or_encoder = ce.OrdinalEncoder(cols=['Degree'],
           return_df=True, 
           mapping=[{'col':'Degree','mapping': {'Phd': 0, 'Masters': 1,'Bachelors': 2}}])

In [26]:
result = or_encoder.fit_transform(train_df)
print(result)

   Degree
0       0
1       1
2       2
3       1
4       2
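
The mapping above gives `Phd` the smallest code and `Bachelors` the largest, so the codes run from the highest degree to the lowest. That still preserves the ranking, but if larger codes should mean higher degrees, the mapping can be reversed. A minimal plain-Python sketch of that idea follows; the ascending order and the names `degree_order` and `encode_ordinal` are assumptions made only for illustration.

```python
# Rank the categories from lowest to highest (an assumed ordering).
degree_order = ['Bachelors', 'Masters', 'Phd']

# Build the category -> integer mapping from the ranking.
degree_to_code = {degree: code for code, degree in enumerate(degree_order)}


def encode_ordinal(values, mapping):
    """Replace each category with its integer rank (hypothetical helper)."""
    return [mapping[value] for value in values]


degrees = ['Phd', 'Masters', 'Bachelors', 'Masters', 'Bachelors']
encoded_degrees = encode_ordinal(degrees, degree_to_code)
print(encoded_degrees)  # [2, 1, 0, 1, 0]
```

The same dictionary could be passed as the `mapping` value in the `ce.OrdinalEncoder` call above.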


## One Hot Encoding

We use this categorical data encoding technique when the features are nominal (do not have any order).
In one-hot encoding, we create a new variable for each level of a categorical feature.
Each category is mapped to a binary variable containing either 0 or 1.
Here, 0 represents the absence and 1 represents the presence of that category.

These newly created binary features are known as dummy variables.

In [27]:
train_one_hot_en = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [34]:
encoder_one_hot_en = ce.OneHotEncoder(cols=['City'],return_df=True,use_cat_names=True)

In [35]:
## Original Data
train_one_hot_en

      City
0    Delhi
1  Chennai
2  Kolkata
3    Delhi
4  Chennai
5  Kolkata


In [36]:
encoded_one_hot_enc = encoder_one_hot_en.fit_transform(train_one_hot_en)
encoded_one_hot_enc

   City_Delhi  City_Chennai  City_Kolkata
0           1             0             0
1           0             1             0
2           0             0             1
3           1             0             0
4           0             1             0
5           0             0             1
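
To make the mechanics concrete, here is a minimal plain-Python sketch of the same idea: one 0/1 column per distinct category, with exactly one 1 in every row. The helper `one_hot_encode` is a hypothetical name, and ordering the columns by first appearance is an assumption chosen to mirror the output above.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    # Keep the categories in order of first appearance.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


cities = ['Delhi', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Kolkata']
categories, one_hot_rows = one_hot_encode(cities)
print(categories)
for row in one_hot_rows:
    print(row)
```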


## Dummy Encoding
The dummy coding scheme is similar to one-hot encoding.
This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables).
One-hot encoding uses N binary variables for N categories in a variable.
Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories.
The dropped category is represented by a row in which all N-1 dummy variables are 0.


In [37]:
train_dummy_encod = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [38]:
train_dummy_encod

      City
0    Delhi
1  Chennai
2  Kolkata
3    Delhi
4  Chennai
5  Kolkata


In [39]:
encoded_dummy = pd.get_dummies(data=train_dummy_encod,drop_first=True)

In [40]:
encoded_dummy

   City_Delhi  City_Kolkata
0           1             0
1           0             0
2           0             1
3           1             0
4           0             0
5           0             1
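
In the output above, Chennai is the dropped category: rows 1 and 4 contain only zeros. The plain-Python sketch below shows the same rule by hand. The helper `dummy_encode` is a hypothetical name, and dropping the alphabetically first category is an assumption chosen to match that output.

```python
def dummy_encode(values):
    """Return (reference, kept_categories, rows) using N-1 columns."""
    categories = sorted(set(values))
    # The first category becomes the reference and gets no column.
    reference, kept_categories = categories[0], categories[1:]
    rows = [[1 if value == category else 0 for category in kept_categories]
            for value in values]
    return reference, kept_categories, rows


cities = ['Delhi', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Kolkata']
reference, kept_categories, dummy_rows = dummy_encode(cities)
print(reference)        # Chennai
print(kept_categories)  # ['Delhi', 'Kolkata']
for row in dummy_rows:
    print(row)
```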


## Effect Encoding

This encoding technique is also known as Deviation Encoding or Sum Encoding.
Effect encoding is very similar to dummy encoding, with one difference.
In dummy coding, we use 0 and 1 to represent the data, but in effect encoding
we use three values: 1, 0, and -1.
The row containing only 0s in dummy encoding is encoded as a row of -1s in effect encoding.


In [41]:
train_effect_encod = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [42]:
train_effect_encod

      City
0    Delhi
1  Chennai
2  Kolkata
3    Delhi
4  Chennai
5  Kolkata
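
The `category_encoders` package provides `ce.SumEncoder` for this scheme. The plain-Python sketch below does not reproduce that class's exact output; it only illustrates the rule described above, in which the reference category's all-zero row becomes a row of -1s. The helper `effect_encode` is a hypothetical name, and using Chennai as the reference category is an assumption chosen to match the dummy-encoding example.

```python
def effect_encode(values, reference):
    """Return (kept_categories, rows) with values in {1, 0, -1}."""
    kept_categories = [c for c in sorted(set(values)) if c != reference]
    rows = []
    for value in values:
        if value == reference:
            # The reference category is marked with -1 in every column.
            rows.append([-1] * len(kept_categories))
        else:
            rows.append([1 if value == category else 0
                         for category in kept_categories])
    return kept_categories, rows


cities = ['Delhi', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Kolkata']
effect_columns, effect_rows = effect_encode(cities, reference='Chennai')
print(effect_columns)  # ['Delhi', 'Kolkata']
for row in effect_rows:
    print(row)
```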


## Hash Encoder
Hashing is the transformation of an arbitrary-size input into a fixed-size value.
We use hashing algorithms to perform hashing operations, i.e. to generate the hash value of an input.
Further, hashing is a one-way process; in other words, one cannot recover the original input from its hash representation.
By default, the Hashing encoder uses the md5 hashing algorithm.

Just like one-hot encoding, the Hash encoder represents categorical features using new dimensions.
Here, the user can fix the number of dimensions after transformation using the n_components argument.

Since hashing maps the data into fewer dimensions, it may lead to loss of information. Another issue faced by the hashing encoder is collision: two different categories can be mapped to the same hash value and therefore to the same column.


In [44]:
train_hash_encod = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [47]:
hash_encoded = ce.HashingEncoder(cols=['City'],n_components=3)

In [48]:
hash_encoded.fit_transform(train_hash_encod)

   col_0  col_1  col_2
0      0      1      0
1      1      0      0
2      1      0      0
3      0      1      0
4      1      0      0
5      1      0      0
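
In the output above, Chennai and Kolkata both land in `col_0`, which is a collision. The sketch below illustrates the general idea with the standard-library `hashlib`: hash the category with md5, then take the result modulo the number of columns. It is a simplified illustration under that assumption, not a copy of the `ce.HashingEncoder` internals, so the column assignments may differ from the output above. The helpers `hash_bucket` and `hash_encode` are hypothetical names, and `Mumbai` is an extra category added only to force a collision.

```python
import hashlib


def hash_bucket(value, n_components):
    """Map a category to a column index via md5 (illustrative only)."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(values, n_components):
    """Return one row per value with a single 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows


cities = ['Delhi', 'Chennai', 'Kolkata', 'Mumbai']
hashed_rows = hash_encode(cities, n_components=3)
buckets = [hash_bucket(city, 3) for city in cities]

# Four categories share three columns, so at least two must collide.
has_collision = len(set(buckets)) < len(cities)
print(has_collision)  # True
```

Raising `n_components` lowers the chance of collisions at the cost of more columns.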


## Binary Encoding

Binary encoding is a combination of hash encoding and one-hot encoding.
In this encoding scheme, the categorical feature is first converted into
numerical form using an ordinal encoder.
Then the numbers are converted into binary numbers.
After that, each binary number is split into its digits, one digit per column.

In [49]:
train_Binary_encod = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [55]:
binary_encoder = ce.BinaryEncoder(cols=['City'],return_df=True)

In [56]:
bin_encoded= binary_encoder.fit_transform(train_Binary_encod)

In [57]:
bin_encoded

   City_0  City_1
0       0       1
1       1       0
2       1       1
3       0       1
4       1       0
5       1       1
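
The three steps described above (ordinal code, binary number, one column per digit) can be sketched in plain Python. The helper `binary_encode` is a hypothetical name, and numbering the categories from 1 in order of first appearance is an assumption that happens to reproduce the output above (Delhi = 1 = `01`, Chennai = 2 = `10`, Kolkata = 3 = `11`).

```python
def binary_encode(values):
    """Ordinal-encode the values, then split each code into binary digits."""
    # Step 1: assign ordinal codes 1, 2, 3, ... in order of first appearance.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Step 2: find how many binary digits the largest code needs.
    width = max(codes.values()).bit_length()
    # Step 3: one column per binary digit, most significant digit first.
    return [[int(bit) for bit in format(codes[value], f'0{width}b')]
            for value in values]


cities = ['Delhi', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Kolkata']
binary_rows = binary_encode(cities)
for row in binary_rows:
    print(row)
```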


## Base N Encoding

For binary encoding, the base is 2, which means it converts the numerical value of a category into its binary form.
If you want to change the base of the encoding scheme, you can use the Base N encoder.
When there are many categories and binary encoding still produces too many columns,
we can use a larger base such as 4 or 8.


In [54]:
train_BaseN_encod = pd.DataFrame({'City':['Delhi','Chennai','Kolkata','Delhi','Chennai','Kolkata']})

In [61]:
base_encoder = ce.BaseNEncoder(cols=['City'],return_df=True,base=2)

In [62]:
base_encoded = base_encoder.fit_transform(train_BaseN_encod)

In [63]:
base_encoded

   City_0  City_1
0       0       1
1       1       0
2       1       1
3       0       1
4       1       0
5       1       1
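
With `base=2`, the output above is identical to the binary-encoding result. To illustrate how a larger base reduces the number of columns, here is a plain-Python sketch that writes ordinal codes in an arbitrary base. It is a simplified illustration, not the `ce.BaseNEncoder` implementation, so the exact column count produced by the library may differ. The helpers `to_base_n` and `base_n_encode` and the `cat_0` ... `cat_9` labels are hypothetical.

```python
def to_base_n(number, base, width):
    """Return the digits of `number` in `base`, padded to `width` digits."""
    digits = []
    for _ in range(width):
        number, digit = divmod(number, base)
        digits.append(digit)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base):
    """Ordinal-encode the values, then split each code into base-N digits."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Count the digits needed to write the largest code in this base.
    largest = max(codes.values())
    width = 1
    while base ** width <= largest:
        width += 1
    return [to_base_n(codes[value], base, width) for value in values]


cities = ['Delhi', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Kolkata']
base2_rows = base_n_encode(cities, base=2)  # same digits as binary encoding

# With ten categories, base 2 needs four columns but base 4 needs only two.
many_categories = [f'cat_{i}' for i in range(10)]
base2_width = len(base_n_encode(many_categories, base=2)[0])
base4_width = len(base_n_encode(many_categories, base=4)[0])
print(base2_width, base4_width)  # 4 2
```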
