##### Datasets Used:
[Titanic Dataset](https://www.kaggle.com/c/titanic/overview)

[Mercedes Dataset](https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/overview)

In [1]:
# Importing all required packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

##### Why should categorical variables be encoded?
Most machine learning algorithms and deep learning neural networks take numbers as input and produce numbers as output.
This means that categorical data must be encoded as numbers before it can be used to fit and evaluate a model.

#### Types of Encoding:
- Nominal Encoding
 - One Hot Encoding
 - One Hot Encoding with many categories
 - Mean Encoding
- Ordinal Encoding
 - Label Encoding
 - Target Guided Ordinal Encoding
 - Count or Frequency Encoding
 - Probability Ratio Encoding

**Nominal Encoding** : Used for categorical features whose categories have no meaningful order.
- E.g.: Gender (Male, Female), State (AP, TS)

**Ordinal Encoding** : Used for categorical features whose categories follow a specific order based on their ranking.
- E.g.: Grade (A, B, C, D), Education (PhD, MS, BE, BCom)

For example, an interviewer's order of preference for a job may be PhD > MS > BE > BCom, and a student's grades rank the effort put in as A > B > C > D.

### 1. One Hot Encoding
Create a new column for each category in the feature.

*Dummy Variable Trap* : If N dummy variables are created, the value of any one of them can be derived from the values of the remaining N-1. So one dummy variable can be dropped without losing any information.

In [2]:
df=pd.read_csv('titanic_dataset.csv',usecols=['Sex'])
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [3]:
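# Creating dummy variables for all categories in Sex and dropping the column of the first category (to avoid the Dummy Variable Trap)
# dtype=int keeps the 0/1 output shown below; recent pandas versions return booleans by default
pd.get_dummies(df,drop_first=True,dtype=int).head()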

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [4]:
df=pd.read_csv('titanic_dataset.csv',usecols=['Embarked'])

In [5]:
# Checking unique values in Embarked
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [6]:
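# Removing records with NaN values
df.dropna(inplace=True)
# dtype=int keeps the 0/1 output shown below; recent pandas versions return booleans by default
pd.get_dummies(df,drop_first=True,dtype=int).head()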

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
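
The Dummy Variable Trap can be checked directly. The sketch below uses a small made-up `Embarked` column (the values are illustrative, not taken from the Titanic file) and shows that the dropped dummy column can be rebuilt from the remaining ones.

```python
import pandas as pd

# Toy data standing in for the Embarked column (illustrative values only)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

full = pd.get_dummies(toy, dtype=int)                      # N = 3 dummy columns
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # N-1 = 2 dummy columns

# The dropped column (Embarked_C) is 1 exactly when every kept column is 0
recovered = 1 - reduced.sum(axis=1)
print((recovered == full["Embarked_C"]).all())
```

Because the dropped column is fully determined by the others, keeping it would only add a redundant, perfectly collinear feature.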


### 2. One Hot Encoding with many categories
One Hot Encoding a feature with many categories creates many new columns, which leads to the Curse of Dimensionality.

*KDD Cup Orange Challenge winners' approach*: Limit One Hot Encoding to the 10 most frequent categories of the feature, creating one new column for each of them (10 new columns in total). All remaining categories are grouped under a single new category, which needs no column of its own: its rows have 0 in every new column, the same reasoning as in the Dummy Variable Trap.

In [7]:
df=pd.read_csv("mercedes.csv",usecols=["X0","X1","X2","X3","X4","X5","X6"])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [8]:
# Getting number of unique categories present in each column
for i in df.columns:
    print(i,len(df[i].unique()))

X0 47
X1 27
X2 44
X3 7
X4 4
X5 29
X6 12


Most of these features have a large number of categories.

In [9]:
# Descending order based on frequency of each category in feature X1
df['X1'].value_counts().sort_values(ascending=False).head(10)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

In [10]:
# getting list of top 10 frequent categories in feature X1
top_10=df['X1'].value_counts().sort_values(ascending=False).head(10).index
top_10=list(top_10)
top_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

In [11]:
# Creating a column for each category, filled with 1 if the record has that category and 0 otherwise
for category in top_10:
    df[category]=np.where(df['X1']==category,1,0)

In [12]:
df[top_10+["X1"]].head()

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v


In [13]:
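# Repeating the same steps for every feature with many categories
# Caution: columns are named after the category alone, so a category name shared by
# two features overwrites the earlier column; prefixing the column name with the
# feature name avoids this
for feature in ["X0","X1","X2","X5","X6"]:
    top_10=df[feature].value_counts().sort_values(ascending=False).head(10).index
    top_10=list(top_10)
    for category in top_10:
        df[category]=np.where(df[feature]==category,1,0)
    df.drop(feature,axis=1,inplace=True)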

In [14]:
df.head()

Unnamed: 0,X3,X4,aa,s,b,l,v,r,i,a,...,ai,m,e,q,d,p,g,j,h,k
0,a,d,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,e,d,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,c,d,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,f,d,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,f,d,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
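
The loop above names each new column after the category alone. A minimal sketch of the same top-N idea with feature-prefixed column names is given below; `one_hot_top_n` is a hypothetical helper and the toy data is made up for illustration.

```python
import pandas as pd

def one_hot_top_n(df, feature, n=10):
    """Hypothetical helper: one-hot encode only the n most frequent categories."""
    # value_counts() already sorts by frequency, most frequent first
    top = df[feature].value_counts().head(n).index
    out = df.drop(columns=[feature])
    for category in top:
        # Prefix with the feature name so equal category names cannot collide
        out[f"{feature}_{category}"] = (df[feature] == category).astype(int)
    return out

# Toy data: category 'a' occurs in both features
toy = pd.DataFrame({"X1": ["a", "a", "b", "c", "a", "b"],
                    "X2": ["a", "d", "d", "d", "e", "a"]})

encoded = toy
for feature in ["X1", "X2"]:
    encoded = one_hot_top_n(encoded, feature, n=2)

print(encoded.columns.tolist())
```

Rows whose category falls outside the top n get 0 in every new column of that feature, which is how the remaining categories end up grouped together.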


### 3. Mean Encoding
Calculate the mean of the output variable for each category and replace each category with that mean.

In [15]:
df=pd.read_csv("titanic_dataset.csv",usecols=["Cabin","Survived"])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [16]:
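# Filling NaN with 'Missing' so that missing cabins are encoded as a category of their own
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
df["Cabin"]=df["Cabin"].fillna('Missing')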

In [17]:
# Keeping only the first letter of each cabin value
df['Cabin']=df['Cabin'].astype(str).str[0]

In [18]:
# Mean of Survived for each cabin letter, as a {category: mean} dictionary
mean_nominal=df.groupby(['Cabin'])['Survived'].mean().to_dict()
mean_nominal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [19]:
df['mean_nominal_encoded']=df['Cabin'].map(mean_nominal)
df.head()

Unnamed: 0,Survived,Cabin,mean_nominal_encoded
0,0,M,0.299854
1,1,C,0.59322
2,1,M,0.299854
3,1,C,0.59322
4,0,M,0.299854
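
The notebook computes the means on the full dataset. In a train/test setting, one common precaution is to compute the means on the training rows only and fall back to the overall training mean for categories that were never seen. The sketch below assumes made-up `train` and `test` frames; the fallback choice is an assumption, not part of the original notebook.

```python
import pandas as pd

# Toy data (illustrative values only)
train = pd.DataFrame({"Cabin": ["C", "C", "M", "M", "M", "B"],
                      "Survived": [1, 0, 0, 0, 1, 1]})
test = pd.DataFrame({"Cabin": ["C", "M", "Z"]})  # "Z" never appears in train

# Learn the category means from the training data only
category_means = train.groupby("Cabin")["Survived"].mean()
global_mean = train["Survived"].mean()

# Unseen categories map to NaN, so fill them with the overall training mean
test["Cabin_mean_encoded"] = test["Cabin"].map(category_means).fillna(global_mean)
print(test)
```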


### 4. Label Encoding
Replace each category in the feature with its rank.

In [20]:
df=pd.DataFrame(pd.Series(["MS","BCom","PhD","BE"],index=[0,1,2,3]),columns=["Education"])
df

Unnamed: 0,Education
0,MS
1,BCom
2,PhD
3,BE


In [21]:
# Ranking based on the level and importance of specialization
# The higher the rank, the higher the salary
education_map={"BCom":1,"BE":2,"MS":3,"PhD":4}

In [22]:
df["Education"]=df["Education"].map(education_map)
df

Unnamed: 0,Education
0,3
1,1
2,4
3,2
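
As an alternative to a hand-written dictionary, the same ranking can be sketched with an ordered pandas `Categorical`. The `levels` list below is an assumption that mirrors the order used in `education_map`.

```python
import pandas as pd

# Lowest to highest, mirroring the order in education_map
levels = ["BCom", "BE", "MS", "PhD"]
education = pd.Series(["MS", "BCom", "PhD", "BE"])

cat = pd.Categorical(education, categories=levels, ordered=True)

# codes start at 0, so add 1 to get ranks starting at 1
ranks = pd.Series(cat.codes, index=education.index) + 1
print(ranks.tolist())
```

Values that are not listed in `levels` receive the code -1, so they would show up as rank 0 here and are easy to spot.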


### 5. Target Guided Ordinal Encoding
Calculate the mean of the output variable for each category and replace each category with its rank based on that mean.

In [23]:
df=pd.read_csv("titanic_dataset.csv",usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [24]:
# Filling NaN with 'Missing' (assigned back instead of using a chained inplace call)
df['Cabin']=df['Cabin'].fillna('Missing')

In [25]:
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [26]:
df["Cabin"].unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [27]:
df.groupby(['Cabin'])['Survived'].mean().sort_values()

Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [28]:
labels_order=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
labels_order

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [29]:
ordinal_labels={k:i for i,k in enumerate(labels_order)}
ordinal_labels

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [30]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
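
The steps above can be wrapped into one function. This is a minimal sketch; `target_guided_ordinal` is a hypothetical helper name and the toy data is made up.

```python
import pandas as pd

def target_guided_ordinal(df, feature, target):
    """Hypothetical helper: map each category to the rank of its target mean."""
    # Categories ordered from lowest to highest target mean
    ordered = df.groupby(feature)[target].mean().sort_values().index
    mapping = {category: rank for rank, category in enumerate(ordered)}
    return df[feature].map(mapping), mapping

# Toy data (illustrative values only)
toy = pd.DataFrame({"Cabin": ["A", "A", "B", "B", "C", "C"],
                    "Survived": [1, 1, 0, 0, 1, 0]})

encoded, mapping = target_guided_ordinal(toy, "Cabin", "Survived")
print(mapping)
print(encoded.tolist())
```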


### 6. Count or Frequency Encoding
Count or Frequency Encoding is useful when the feature has many categories.

*High Cardinality* : A variable with a large number of categories.

Replace each category of the variable with its count (or its relative frequency).

In [31]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data' , header = None,index_col=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [32]:
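# Keeping only the categorical feature columns
# .copy() makes df an independent frame, so the later column assignment does not trigger SettingWithCopyWarning
columns=[1,3,5,6,7,8,9,13]
df=df[columns].copy()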

In [33]:
df.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']
df.head()

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [34]:
for feature in df.columns[:]:
    print(feature,":",len(df[feature].unique()),'labels')

Employment : 9 labels
Degree : 16 labels
Status : 7 labels
Designation : 15 labels
family_job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


The Country feature has the most categories (42).

In [35]:
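# Mapping each country to the number of records in which it appears
country_map=df['Country'].value_counts().to_dict()

In [36]:
# Replacing each country with its count
df['Country']=df['Country'].map(country_map)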
df.head(10)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170


##### Advantages:
* Does not increase the number of features.

##### Disadvantages:
* If two or more categories have the same count, they receive the same encoded value and can no longer be told apart, so valuable information is lost.
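
The sketch below shows the frequency variant (relative frequency instead of raw count) and makes the disadvantage visible. The toy `Country` values are made up for illustration.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({"Country": ["US", "US", "US", "Cuba", "India", "US"]})

# normalize=True returns relative frequencies instead of raw counts
freq_map = toy["Country"].value_counts(normalize=True)
toy["Country_freq"] = toy["Country"].map(freq_map)

# Categories that share a frequency get the same encoded value
collisions = freq_map[freq_map.duplicated(keep=False)].index.tolist()
print(toy)
print(sorted(collisions))
```

Here Cuba and India each appear once, so both are encoded with the same number and the distinction between them is lost.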

### 7. Probability Ratio Encoding
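For each category, calculate the mean of the output variable, i.e. the probability of the positive outcome (Good Probability). The Bad Probability is 1 minus the Good Probability.

Replace each category with the ratio of its Good Probability to its Bad Probability.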

In [37]:
df=pd.read_csv("titanic_dataset.csv",usecols=["Survived","Cabin"])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [38]:
# Deleting records with NaN values
df.dropna(inplace=True)

In [39]:
df["Cabin"]=df["Cabin"].astype(str).str[0]

In [40]:
df["Cabin"]

1      C
3      C
6      E
10     G
11     C
      ..
871    D
872    B
879    C
887    B
889    C
Name: Cabin, Length: 204, dtype: object

In [41]:
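# Probability of survival (Good Probability) for each cabin letter
prod_df=df.groupby("Cabin")["Survived"].mean()

In [42]:
prod_df=pd.DataFrame(prod_df)

In [43]:
# Probability of death (Bad Probability)
prod_df["Died"]=1-prod_df["Survived"]

In [44]:
# Ratio of Good to Bad Probability (infinite if no passenger in a category died)
prod_df["Probability Ratio"]=prod_df["Survived"]/prod_df["Died"]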

In [45]:
prob_map=prod_df["Probability Ratio"].to_dict()
prob_map

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'T': 0.0}

In [46]:
df["Cabin"]=df["Cabin"].map(prob_map)
df.head()

Unnamed: 0,Survived,Cabin
1,1,1.458333
3,1,1.458333
6,0,3.0
10,1,1.0
11,1,1.458333
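
If every record in a category has the positive outcome, the Bad Probability is 0 and the ratio becomes infinite. One possible guard is to put a small lower bound on the denominator, as sketched below. The constant `eps` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy data (illustrative values only): everyone in cabin B survived
toy = pd.DataFrame({"Cabin": ["A", "A", "B", "B", "C", "C"],
                    "Survived": [1, 0, 1, 1, 0, 0]})

p_survived = toy.groupby("Cabin")["Survived"].mean()  # Good Probability
p_died = 1 - p_survived                               # Bad Probability

eps = 1e-6  # assumed lower bound for the denominator
ratio = p_survived / p_died.clip(lower=eps)

toy["Cabin_ratio"] = toy["Cabin"].map(ratio)
print(ratio)
```

The bound keeps the encoded values finite, at the cost of producing a very large number for such categories; the right value of `eps` depends on the data.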
