# Types of Encoding

### 1. Nominal Encoding - the categories have no inherent order. E.g.: gender, states

### 2. Ordinal Encoding - the categories have an inherent order. E.g.: Education - PhD > MTech > BTech

Types of Nominal Encoding - 
1. One Hot Encoding
2. One Hot Encoding with many categories
3. Mean Encoding


Types of Ordinal Encoding - 
1. Label Encoding
2. Target Guided Ordinal Encoding

# One Hot Encoding - 

| Country | Germany | France | Spain |
|---------|---------|--------|-------|
| Germany | 1 | 0 | 0 |
| France  | 0 | 1 | 0 |
| Spain   | 0 | 0 | 1 |

To avoid a redundant column, we can drop Spain: a 0 in both the Germany and France columns already implies that the Country is Spain.
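
As a minimal sketch of the idea above, assuming pandas is available, `pd.get_dummies` builds the indicator columns and `drop_first=True` removes one of them. Note that pandas drops the first category in sorted order (France here) rather than Spain; the principle is the same. The toy `countries` frame is made up for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only
countries = pd.DataFrame({"Country": ["Germany", "France", "Spain"]})

# Full one hot encoding: one indicator column per category
full = pd.get_dummies(countries["Country"], dtype=int)

# Drop one column; the dropped category is implied when all others are 0
reduced = pd.get_dummies(countries["Country"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

With k categories, the reduced encoding needs only k-1 columns, which is why the notebook below passes `drop_first=True`.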

# One Hot Encoding with Multiple Category - 

Create One Hot columns only for the top-n most frequent categories; every other category is encoded as 0 in all n columns.


In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('res/mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])
data.head()

,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [3]:
#print the number of distinct labels in each column

for col in data.columns:
    print(col, ":", len(data[col].unique()), "labels")

X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels


In [4]:
#shape after obtaining one hot encoding 

pd.get_dummies(data, drop_first=True).shape

(4209, 117)

In [5]:
# a feasible alternative is to encode only the most frequent categories of the variable

top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]

print("List of top 10: {}".format(top_10))

#make 10 binary variables
for label in top_10:
    data[label] = np.where(data['X2']==label, 1,  0)

data[['X2']+top_10].head(10)

List of top 10: ['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']


,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


## One hot top encoding code: 

In [6]:
#add one dummy variable per top label for the given categorical variable

def one_hot_top_X(df, variable, top_x_labels):

    for label in top_x_labels:
        #compare against df (the argument), not the global data frame
        df[variable+'_'+label] = np.where(df[variable] == label, 1, 0)
        
#read the data
data = pd.read_csv('res/mercedesbenz.csv', usecols=['X1','X2','X3','X4','X5','X6'])

#find the top-n categories
top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]

#encode X2 using indicator columns for its 10 most frequent categories
one_hot_top_X(data, 'X2', top_10)
data.head()

,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0
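
The helper above encodes one column per call. A minimal sketch of applying the same top-n idea to several columns follows; the `one_hot_top_n` name, the `n` parameter, and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_top_n(df, variable, n=10):
    """Add indicator columns for the n most frequent labels of `variable`."""
    top_labels = df[variable].value_counts().head(n).index
    for label in top_labels:
        df[f"{variable}_{label}"] = np.where(df[variable] == label, 1, 0)
    return list(top_labels)

# Toy frame, assumed for illustration (not the Mercedes-Benz data)
toy = pd.DataFrame({
    "X1": ["a", "a", "b", "c", "a", "b"],
    "X2": ["p", "q", "q", "q", "r", "p"],
})

# Encode every categorical column, keeping its 2 most frequent labels
for col in ["X1", "X2"]:
    one_hot_top_n(toy, col, n=2)

print(toy.columns.tolist())
```

Rows whose label falls outside the top n (for example `X1 == "c"`) end up with 0 in every new column for that variable.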


## One Hot Encoding of Top Categories:

### Advantages
1. Straightforward to implement
2. Does not require hours of variable exploration
3. Does not massively expand the feature space (number of columns in the dataset)

### Disadvantages
1. Does not add any information that may make the variable more predictive
2. Does not keep the information of the ignored labels


# Mean Encoding  - 

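Each category of f1 is replaced by the mean of the output (O/P) computed over all rows of that category.

| f1 | O/P | Label = Mean of O/P for the category |
|----|-----|--------------------------------------|
| A  | 1   | 0.73 |
| B  | 1   | 0.4  |
| C  | 0   | 0.6  |
| D  | 1   |      |
| A  | 0   |      |
| B  | 0   |      |
| ...| ... |      |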
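
A minimal sketch of mean encoding with pandas follows. The toy rows are assumed for illustration, so the resulting means differ from the values in the table above, which summarises rows that are not shown. The column names `target` and `f1_mean_encoded` are also assumptions.

```python
import pandas as pd

# Toy data, assumed for illustration
df = pd.DataFrame({
    "f1":     ["A", "B", "C", "D", "A", "B"],
    "target": [1,   1,   0,   1,   0,   0],
})

# Mean of the target within each category of f1
category_means = df.groupby("f1")["target"].mean()

# Replace each category by the mean of its group
df["f1_mean_encoded"] = df["f1"].map(category_means)

print(df)
```

In practice the means are usually computed on the training split only and then mapped onto validation or test data, so that target information does not leak across splits.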


# Label Encoding - 

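| Education | Rank |
|-----------|------|
| BE        | 3    |
| MTech     | 2    |
| PhD       | 1    |

Ranks are assigned according to priority; here 1 is the highest rank (PhD) and 3 the lowest (BE).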
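
The ranking above can be sketched with an explicit mapping, assuming pandas. An explicit dictionary is used because scikit-learn's `LabelEncoder` numbers classes in sorted order rather than by a priority you choose. The toy frame and the `education_rank` name are assumptions for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration
df = pd.DataFrame({"Education": ["BE", "PhD", "MTech", "BE"]})

# Explicit rank per category, mirroring the ranks listed above
education_rank = {"PhD": 1, "MTech": 2, "BE": 3}

# Map each category to its rank
df["Education_encoded"] = df["Education"].map(education_rank)

print(df)
```

Any category missing from the dictionary would map to NaN, which makes unexpected labels easy to spot.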


# Target Guided Ordinal Encoding - 

| f1 | O/P | Mean |
|----|-----|------|
| A  | 1   | 0.73 |
| B  | 1   | 0.4  |
| C  | 0   | 0.6  |
| D  | 1   |      |
| A  | 0   |      |
| B  | 0   |      |

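Now rank the categories by their mean and assign labels in that order, so that the category with the highest mean gets the highest label.

| f1 | O/P | Mean | Labels |
|----|-----|------|--------|
| A  | 1   | 0.73 | 4      |
| B  | 1   | 0.4  | 2      |
| C  | 0   | 0.6  | 3      |
| D  | 1   | 0.1  | 1      |
| A  | 0   |      |        |
| B  | 0   |      |        |
| ...| ... |      |        |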
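
A minimal sketch of target guided ordinal encoding with pandas follows. The toy rows are assumed for illustration and chosen so that the category order matches the example above (D lowest, then B, C, A); the exact means differ from the table. The names `target`, `labels`, and `f1_encoded` are assumptions.

```python
import pandas as pd

# Toy data, assumed for illustration
df = pd.DataFrame({
    "f1":     ["A", "B", "C", "D", "A", "B", "C", "A", "B"],
    "target": [1,   1,   0,   0,   1,   0,   1,   1,   0],
})

# Step 1: mean of the target per category
means = df.groupby("f1")["target"].mean()

# Step 2: order categories from lowest to highest mean
ordered = means.sort_values().index

# Step 3: assign integer labels 1..k in that order
labels = {category: rank for rank, category in enumerate(ordered, start=1)}

# Step 4: replace each category by its label
df["f1_encoded"] = df["f1"].map(labels)

print(labels)
print(df)
```

Unlike mean encoding, only the order of the means is kept, so the encoded values are evenly spaced integers.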