##### One Hot Encoding

- It is a method for transforming the string labels of a categorical variable into numbers.
- It replaces the categorical variable with a set of boolean variables (dummy variables), one per label, each taking the value 0 or 1.
- Example: "Gender", with labels 'female' and 'male'.

In [1]:
import pandas as pd
# Raw string avoids the invalid escape sequences '\P' and '\D' in the Windows path
data = pd.read_csv(r'C:\Users\admin\PP_programs\DataRepo\titanic\train.csv', usecols=['Sex'])
data.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [2]:
pd.get_dummies(data).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


In [3]:
pd.concat([data, pd.get_dummies(data)], axis=1).head()

Unnamed: 0,Sex,Sex_female,Sex_male
0,male,0,1
1,female,1,0
2,female,1,0
3,female,1,0
4,male,0,1


In [4]:
pd.get_dummies(data, drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [5]:
data = pd.read_csv(r'C:\Users\admin\PP_programs\DataRepo\titanic\train.csv', usecols=['Embarked'])
data.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [6]:
data.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [7]:
pd.get_dummies(data).head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [8]:
pd.get_dummies(data, drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


In [9]:
pd.get_dummies(data, drop_first=True, dummy_na=True).head()

Unnamed: 0,Embarked_Q,Embarked_S,Embarked_nan
0,0,1,0
1,0,0,0
2,0,1,0
3,0,1,0
4,0,1,0


In [10]:
pd.get_dummies(data, drop_first=True, dummy_na=True).sum(axis=0)

Embarked_Q       77
Embarked_S      644
Embarked_nan      2
dtype: int64

#### When should you use k and when k-1?

One hot encoding into k-1 dummy variables:
 - Use it in linear regression: with an intercept, the k-th dummy is fully determined by the other k-1, so dropping it keeps the correct number of degrees of freedom (k-1).
 - It is also the usual choice for support vector machines, neural networks and clustering algorithms.

One hot encoding into k dummy variables:
 - For tree-based learning algorithms, it is good practice to encode into k binary variables instead of k-1, so that every label remains available as a split candidate.
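
The degrees-of-freedom argument for k-1 can be checked numerically. The sketch below uses a small made-up column (not the Titanic file) and a hypothetical helper `design_matrix` that prepends an intercept. With k dummies the matrix has more columns than its rank, so one column is redundant. With k-1 dummies it is full rank.

```python
import numpy as np
import pandas as pd

# Toy categorical column with k = 3 labels (hypothetical data, not the Titanic file)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C', 'S']})

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    values = dummies.to_numpy(dtype=float)
    intercept = np.ones((len(values), 1))
    return np.hstack([intercept, values])

# Intercept + k dummies versus intercept + k-1 dummies
X_k = design_matrix(pd.get_dummies(toy))
X_k_minus_1 = design_matrix(pd.get_dummies(toy, drop_first=True))

rank_k = np.linalg.matrix_rank(X_k)
rank_k_minus_1 = np.linalg.matrix_rank(X_k_minus_1)

print('k dummies:   shape {}, rank {}'.format(X_k.shape, rank_k))
print('k-1 dummies: shape {}, rank {}'.format(X_k_minus_1.shape, rank_k_minus_1))
```

In the k case the three dummy columns sum to the intercept column, which is why the rank is one less than the number of columns.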

In [11]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [12]:
data = pd.read_csv(r'C:\Users\admin\PP_programs\DataRepo\titanic\train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
data_OHE = pd.concat([data[['Pclass', 'Age', 'SibSp','Parch', 'Survived']], # numerical variables 
                      pd.get_dummies(data.Sex, drop_first=True),   # binary categorical variable
                      pd.get_dummies(data.Embarked, drop_first=True)],  # k categories in categorical
                    axis=1)

data_OHE.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Survived,male,Q,S
0,3,22.0,1,0,0,1,0,1
1,1,38.0,1,0,1,0,0,0
2,3,26.0,0,0,1,0,0,1
3,1,35.0,1,0,1,0,0,1
4,3,35.0,0,0,0,1,0,1


In [14]:
X_train, X_test, y_train, y_test = train_test_split(data_OHE[['Pclass', 'Age', 'SibSp',
                                                              'Parch', 'male', 'Q', 'S']].fillna(0),
                                                    data_OHE.Survived,
                                                    test_size=0.3,
                                                    random_state=12)
X_train.shape, X_test.shape

((623, 7), (268, 7))

In [15]:
rf = RandomForestClassifier(n_estimators=20, random_state=12, max_depth=3)
rf.fit(X_train, y_train)
pred = rf.predict_proba(X_train)
print('Train: Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
pred = rf.predict_proba(X_test)
print('Test: Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train: Random Forests roc-auc: 0.8659464794911165
Test: Random Forests roc-auc: 0.8429500203169443


In [16]:
ada = AdaBoostClassifier(n_estimators=20, random_state=12)
ada.fit(X_train, y_train)
pred = ada.predict_proba(X_train)
print('Train:AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
pred = ada.predict_proba(X_test)
print('Test:AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train:AdaBoost roc-auc: 0.8793266067119982
Test:AdaBoost roc-auc: 0.8267254890578744


In [17]:
logit = LogisticRegression(random_state=12)
logit.fit(X_train, y_train)
pred = logit.predict_proba(X_train)
print('Train: Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
pred = logit.predict_proba(X_test)
print('Test: Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train: Logistic Regression roc-auc: 0.8514476858960298
Test: Logistic Regression roc-auc: 0.8222847855111163




Observation:
    If a dataset has a few categorical variables with many labels each, one hot encoding can produce datasets with thousands of columns or more.
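
A rough way to anticipate this growth is to count the distinct labels per column before encoding. The sketch below uses a synthetic frame, and the column names and cardinalities are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two high-cardinality columns and one binary column
n_rows = 1000
toy = pd.DataFrame({
    'ticket': ['T{}'.format(i % 500) for i in range(n_rows)],   # 500 distinct labels
    'cabin': ['C{}'.format(i % 120) for i in range(n_rows)],    # 120 distinct labels
    'sex': ['male' if i % 2 else 'female' for i in range(n_rows)],
})

# Number of distinct labels per column = number of dummy columns each will produce
labels_per_column = toy.nunique()
encoded = pd.get_dummies(toy)

print(labels_per_column)
print('{} -> {}'.format(toy.shape, encoded.shape))
```

Three input columns become as many columns as there are distinct labels in total, so checking `nunique()` first is a cheap guard before calling `pd.get_dummies`.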

## Other Encoding Methods:

### Variables with many categories   

- If a categorical variable contains many labels, re-encoding it with one hot encoding expands the feature space dramatically.
- Categorical variables often have a few dominating categories, while the remaining labels add mostly noise.
- One option is to create dummy variables only for the most frequent labels, for example the top 5 or the top 20.
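
Keeping dummy columns only for the most frequent labels can be sketched as follows. The helper `one_hot_top_n` and the toy series are hypothetical and only illustrate the idea.

```python
import pandas as pd

def one_hot_top_n(series, n):
    """Return 0/1 columns for the n most frequent labels of `series` only."""
    # value_counts() sorts labels from most to least frequent
    top_labels = series.value_counts().index[:n]
    encoded = pd.DataFrame(index=series.index)
    for label in top_labels:
        encoded['{}_{}'.format(series.name, label)] = (series == label).astype(int)
    return encoded

toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd', 'e'], name='var')
top2 = one_hot_top_n(toy, n=2)
print(top2)
```

Rows whose label is outside the top n get zeros in every column, so all infrequent labels are treated as a single implicit group.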

### Ordinal numbering encoding


Categorical variables whose categories can be meaningfully ordered are called ordinal.

For example:
- Student's grade in an exam (A, B, C or Fail).
- Days of the week can be ordinal, with Monday = 1 and Sunday = 7.
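
The exam-grade example could be encoded with an explicit mapping, as in this minimal sketch. The ordering Fail < C < B < A and the toy data are assumptions made for illustration.

```python
import pandas as pd

grades = pd.DataFrame({'grade': ['B', 'A', 'Fail', 'C', 'A']})

# Assumed ordering, from worst to best: Fail < C < B < A
grade_order = ['Fail', 'C', 'B', 'A']
grade_map = {label: rank for rank, label in enumerate(grade_order, start=1)}

# Labels missing from the mapping would become NaN
grades['grade_ordinal'] = grades['grade'].map(grade_map)
print(grades)
```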

In [18]:
import pandas as pd
import datetime

In [19]:
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 30)]
df = pd.DataFrame(date_list)
df.columns = ['day']
df

Unnamed: 0,day
0,2020-04-25 20:19:18.348692
1,2020-04-24 20:19:18.348692
2,2020-04-23 20:19:18.348692
3,2020-04-22 20:19:18.348692
4,2020-04-21 20:19:18.348692
5,2020-04-20 20:19:18.348692
6,2020-04-19 20:19:18.348692
7,2020-04-18 20:19:18.348692
8,2020-04-17 20:19:18.348692
9,2020-04-16 20:19:18.348692


In [21]:
df['day_of_week'] = df['day'].dt.weekday  # Monday=0, ..., Sunday=6
df.head()

Unnamed: 0,day,day_of_week
0,2020-04-25 20:19:18.348692,5
1,2020-04-24 20:19:18.348692,4
2,2020-04-23 20:19:18.348692,3
3,2020-04-22 20:19:18.348692,2
4,2020-04-21 20:19:18.348692,1


In [28]:
df['day_of_week'] = df['day'].dt.day_name()

df.head()

Unnamed: 0,day,day_of_week
0,2020-04-25 20:19:18.348692,Saturday
1,2020-04-24 20:19:18.348692,Friday
2,2020-04-23 20:19:18.348692,Thursday
3,2020-04-22 20:19:18.348692,Wednesday
4,2020-04-21 20:19:18.348692,Tuesday


In [29]:
weekday_map = {'Monday':1,
               'Tuesday':2,
               'Wednesday':3,
               'Thursday':4,
               'Friday':5,
               'Saturday':6,
               'Sunday':7
}

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head(10)

Unnamed: 0,day,day_of_week,day_ordinal
0,2020-04-25 20:19:18.348692,Saturday,6
1,2020-04-24 20:19:18.348692,Friday,5
2,2020-04-23 20:19:18.348692,Thursday,4
3,2020-04-22 20:19:18.348692,Wednesday,3
4,2020-04-21 20:19:18.348692,Tuesday,2
5,2020-04-20 20:19:18.348692,Monday,1
6,2020-04-19 20:19:18.348692,Sunday,7
7,2020-04-18 20:19:18.348692,Saturday,6
8,2020-04-17 20:19:18.348692,Friday,5
9,2020-04-16 20:19:18.348692,Thursday,4


### Frequency Encoding

If a categorical variable contains many labels, re-encoding it with one hot encoding expands the feature space dramatically.

An alternative is to replace each of the 10 most frequent labels with its count, group all the other labels under a single label (for example "Rare"), and replace "Rare" with its count.
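
A minimal sketch of this procedure is shown below. The helper `frequency_encode`, its parameters and the toy series are hypothetical, and the sketch assumes that no real label is already called "Rare".

```python
import pandas as pd

def frequency_encode(series, n_top=10, rare_label='Rare'):
    """Replace labels by their counts, pooling labels outside the top n_top."""
    top_labels = series.value_counts().index[:n_top]
    # Keep the top labels; everything else (including NaN) becomes rare_label
    grouped = series.where(series.isin(top_labels), rare_label)
    counts = grouped.value_counts()
    return grouped.map(counts)

toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd', 'e'])
encoded = frequency_encode(toy, n_top=2)
print(encoded.tolist())
```

One caveat of this encoding is that two labels with the same count receive the same number and can no longer be told apart. In the toy series, 'a' and the pooled "Rare" group both map to 3.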

## Some More Methods

#### Target Guided Ordinal Encoding

#### Mean Encoding


#### Probability Ratio Encoding


#### Weight of Evidence

It is a measure of the "strength" of a grouping technique to separate good and bad risk (default).
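
One common formulation, assumed here, takes the weight of evidence of a category to be the natural log of its share of all good cases divided by its share of all bad cases. The sketch below applies that to made-up data with a hypothetical `default` target (1 = bad, 0 = good). Other sources flip the ratio, which only changes the sign.

```python
import numpy as np
import pandas as pd

# Hypothetical data: default = 1 is a bad risk, default = 0 is a good risk
toy = pd.DataFrame({
    'grade':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'default': [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Rows: categories; columns: counts of good (0) and bad (1) cases
counts = pd.crosstab(toy['grade'], toy['default'])
good_share = counts[0] / counts[0].sum()   # share of all goods in each category
bad_share = counts[1] / counts[1].sum()    # share of all bads in each category

woe = np.log(good_share / bad_share)
toy['grade_woe'] = toy['grade'].map(woe)
print(woe)
```

A category with no good or no bad cases would give an infinite value, so real data usually needs some smoothing or grouping of sparse categories first.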