## Types of Categorical Data

1. Nominal data (the categories have no inherent order)

2. Ordinal data (the categories follow a rank order)
    
    
**1. Nominal Data:**

* Nominal data is also called labelled or named data.

* The categories can be listed in any order; reordering them does not change their meaning.

*For example,*

* Gender (Male/Female/Other),

* Recruitment channel (sourcing/referred/other),

* States (TN,DL,AP,other), etc.


**2. Ordinal Data:**

* Ordinal data consists of discrete, ordered units.

* It is like nominal data, except that the categories have a rank order, so reordering them changes the meaning.

*For example,*

* Ranks (1st/2nd/3rd),

* Education (High School/Undergraduate/Postgraduate/Doctorate)



## Ways to Handle Categorical Data

In [3]:
import numpy as np
import pandas as pd

### 1. One-Hot Encoding

One-hot encoding applies to nominal categorical data.

Create a dummy (binary) column for each category in an object- or category-typed feature. A row's value is 1 if the row belongs to that category and 0 otherwise. To create the dummies, use the pandas get_dummies() function.

In [4]:
train = pd.read_csv('train.csv')

In [5]:
train.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [6]:
# In the data above, gender is a nominal categorical variable, so one-hot encoding is a suitable choice.

In [7]:
Gender_dum = train['gender']

In [8]:
Gender_dum.head()

0    f
1    m
2    m
3    m
4    m
Name: gender, dtype: object

In [9]:
# drop_first=True keeps only the 'm' column; a 0 there implies 'f'
Gender_dum = pd.get_dummies(Gender_dum, drop_first=True)

In [10]:
Gender_dum.head()

Unnamed: 0,m
0,0
1,1
2,1
3,1
4,1


In [11]:
train = pd.concat([train, Gender_dum], axis=1)

In [12]:
train.head(4)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0,1
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0,1
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0,1


In [13]:
train = train.drop(['gender'],axis=1)

In [14]:
train.head(4)

Unnamed: 0,employee_id,department,region,education,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m
0,65438,Sales & Marketing,region_7,Master's & above,sourcing,1,35,5.0,8,1,0,49,0,0
1,65141,Operations,region_22,Bachelor's,other,1,30,5.0,4,0,0,60,0,1
2,7513,Sales & Marketing,region_19,Bachelor's,sourcing,1,34,3.0,7,0,0,50,0,1
3,2542,Sales & Marketing,region_23,Bachelor's,other,2,39,1.0,10,0,0,50,0,1


***Advantages:***

* Easy to use.
* A fast way to handle categorical column values.

***Disadvantages:***

* get_dummies becomes unwieldy when the data has many categorical columns.

* A column with many categories adds many new features to the dataset.

* Hence, this method is most useful when the data has few categorical columns, each with few categories.

In [15]:
# One-hot encoding with many categorical features
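When several nominal columns need encoding, get_dummies can take the whole DataFrame together with a `columns` list instead of being called once per column. The sketch below uses a small made-up frame in place of train.csv; the values are illustrative only, and the column names follow the notebook.

```python
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["f", "m", "m", "f"],
    "recruitment_channel": ["sourcing", "other", "referred", "other"],
    "age": [35, 30, 34, 39],
})

# Encode several nominal columns in one call.
# drop_first=True removes one redundant dummy per encoded column.
encoded = pd.get_dummies(
    toy,
    columns=["gender", "recruitment_channel"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

The original columns are replaced by prefixed dummy columns, so the separate concat and drop steps are not needed in this form.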

### 2. Ordinal Number Encoding

When a categorical variable is ordinal, the simplest approach is to replace each category with an integer that reflects its rank.

In our data, education is an ordinal feature with the values Master's & above, Bachelor's, and Below Secondary, so each category is replaced by its rank, i.e. 1, 2, 3 respectively.

In [16]:
train.head(4)

Unnamed: 0,employee_id,department,region,education,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m
0,65438,Sales & Marketing,region_7,Master's & above,sourcing,1,35,5.0,8,1,0,49,0,0
1,65141,Operations,region_22,Bachelor's,other,1,30,5.0,4,0,0,60,0,1
2,7513,Sales & Marketing,region_19,Bachelor's,sourcing,1,34,3.0,7,0,0,50,0,1
3,2542,Sales & Marketing,region_23,Bachelor's,other,2,39,1.0,10,0,0,50,0,1


In [17]:
label_encode = train['education']

In [18]:
label_encode.value_counts()

Bachelor's          36669
Master's & above    14925
Below Secondary       805
Name: education, dtype: int64

In [19]:
# Rank 1 is the highest education level
dic = {
    "Master's & above": 1,
    "Bachelor's": 2,
    "Below Secondary": 3,
}

In [20]:
# Values not in dic (such as missing entries) map to NaN, so the new column is float
train['ordinal_edu'] = train['education'].map(dic)

In [21]:
train.head(4)

Unnamed: 0,employee_id,department,region,education,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,Sales & Marketing,region_7,Master's & above,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,Operations,region_22,Bachelor's,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,Sales & Marketing,region_19,Bachelor's,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,Sales & Marketing,region_23,Bachelor's,other,2,39,1.0,10,0,0,50,0,1,2.0


In [22]:
train=train.drop('education',axis=1)

In [23]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,Sales & Marketing,region_7,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,Operations,region_22,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,Sales & Marketing,region_19,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,Sales & Marketing,region_23,other,2,39,1.0,10,0,0,50,0,1,2.0


***Advantage:***

* The simplest way to handle an ordinal feature.

***Disadvantage:***

* Unsuitable for nominal features, because the integers impose an order that the categories do not have.
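As an alternative to a hand-written dictionary, the rank order can be declared once with an ordered pandas Categorical. This is a minimal sketch on made-up values. The level order (lowest to highest) is an assumption, and it runs in the opposite direction to the dictionary used in this notebook, where 1 is the highest level.

```python
import pandas as pd

# Toy values mirroring the education column; None stands in for a missing entry.
education = pd.Series(["Bachelor's", "Master's & above", "Below Secondary", None])

# The order is stated explicitly, lowest to highest (assumed ranking direction).
levels = ["Below Secondary", "Bachelor's", "Master's & above"]
edu_cat = pd.Categorical(education, categories=levels, ordered=True)

# .codes holds the position of each value in `levels`; missing values get -1.
ordinal_edu = pd.Series(edu_cat.codes, index=education.index)
print(ordinal_edu.tolist())
```

Because missing values become -1 rather than NaN, the result stays an integer column, but the -1 rows still need to be handled deliberately before modelling.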

### 3. Count / Frequency Encoding

Replace each category with its frequency, i.e. the number of times that category occurs in the column.

In [24]:
train.head()

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,Sales & Marketing,region_7,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,Operations,region_22,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,Sales & Marketing,region_19,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,Sales & Marketing,region_23,other,2,39,1.0,10,0,0,50,0,1,2.0
4,48945,Technology,region_26,other,1,45,3.0,2,0,0,73,0,1,2.0


In [25]:
train['region'].value_counts()

region_2     12343
region_22     6428
region_7      4843
region_15     2808
region_13     2648
region_26     2260
region_31     1935
region_4      1703
region_27     1659
region_16     1465
region_28     1318
region_11     1315
region_23     1175
region_29      994
region_32      945
region_19      874
region_20      850
region_14      827
region_25      819
region_17      796
region_5       766
region_6       690
region_30      657
region_8       655
region_10      648
region_1       610
region_24      508
region_12      500
region_9       420
region_21      411
region_3       346
region_34      292
region_33      269
region_18       31
Name: region, dtype: int64

In [26]:
region_dic = train['region'].value_counts().to_dict()

In [27]:
train['region'] = train['region'].map(region_dic)

In [28]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,Sales & Marketing,4843,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,Operations,6428,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,Sales & Marketing,874,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,Sales & Marketing,1175,other,2,39,1.0,10,0,0,50,0,1,2.0


***Advantages:***

* Easy to implement.

* Adds no extra features.

***Disadvantage:***

* Categories that occur equally often receive the same value and become indistinguishable. In this dataset, Technology and Procurement both appear 7138 times in the department column.
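The collision problem can be checked before encoding. The sketch below uses a made-up series and relative frequencies (`value_counts(normalize=True)`) rather than raw counts; both choices are illustrative assumptions, and the check works the same way with raw counts.

```python
import pandas as pd

# Made-up column in which two categories occur equally often.
toy = pd.Series(["Technology"] * 3 + ["Procurement"] * 3 + ["Analytics"] * 2)

# Relative frequency of each category.
freq = toy.value_counts(normalize=True)
encoded = toy.map(freq)

# Categories that share a frequency cannot be told apart after encoding.
collisions = sorted(freq[freq.duplicated(keep=False)].index)
print(collisions)
```

If `collisions` is non-empty, the affected categories will be merged by the encoding, which may or may not be acceptable for the model at hand.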
    

### 4. Target/Guided Encoding:

Here, each category is replaced with its rank when the categories are ordered by the mean of the target column, i.e. by the probability of the target within that category.
 

In [29]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,Sales & Marketing,4843,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,Operations,6428,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,Sales & Marketing,874,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,Sales & Marketing,1175,other,2,39,1.0,10,0,0,50,0,1,2.0


In [30]:
train['department'].value_counts()

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

In [31]:
# Abbreviate each department to its first letter (the nine initials are all distinct)
train['department'] = train['department'].astype(str).str[0]

In [32]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,S,4843,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,O,6428,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,S,874,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,S,1175,other,2,39,1.0,10,0,0,50,0,1,2.0


In [34]:
train.groupby(['department'])['is_promoted'].mean().sort_values().index

Index(['L', 'H', 'R', 'S', 'F', 'O', 'A', 'P', 'T'], dtype='object', name='department')

In [36]:
ordinal_labels = train.groupby(['department'])['is_promoted'].mean().sort_values().index
ordinal_labels

Index(['L', 'H', 'R', 'S', 'F', 'O', 'A', 'P', 'T'], dtype='object', name='department')

In [39]:
# Map each department to its position in the promotion-rate ordering (0 = lowest rate)
ordinal_dic = {k: i for i, k in enumerate(ordinal_labels)}

In [40]:
ordinal_dic

{'L': 0, 'H': 1, 'R': 2, 'S': 3, 'F': 4, 'O': 5, 'A': 6, 'P': 7, 'T': 8}

In [41]:
train['department'] = train['department'].map(ordinal_dic)

In [42]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,3,4843,sourcing,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,5,6428,other,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,3,874,sourcing,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,3,1175,other,2,39,1.0,10,0,0,50,0,1,2.0


***Advantages:***

* It does not increase the size of the data, i.e. it adds no extra features.

* The encoded values increase with the category's target mean, which can make the feature easier for a model to learn from.

***Disadvantages:***

* Because the encoding is derived from the target, it tends to overfit.

* Hence, cross-validation or a similar safeguard is usually needed to keep the overfitting in check.
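One simple safeguard is to learn the ranking from a training split only and then apply it to held-out rows. The sketch below uses made-up rows. The names `train_part` and `valid_part` and the -1 sentinel for unseen departments are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
train_part = pd.DataFrame({
    "department": ["HR", "HR", "Legal", "Legal", "Technology", "Technology"],
    "is_promoted": [0, 1, 0, 0, 1, 1],
})
valid_part = pd.DataFrame({"department": ["Technology", "HR", "R&D"]})

# Rank departments by promotion rate using the training rows only.
rates = train_part.groupby("department")["is_promoted"].mean().sort_values()
rank = {dept: i for i, dept in enumerate(rates.index)}

# Apply the learned ranking; departments unseen in training get -1 (arbitrary sentinel).
valid_part["department_rank"] = (
    valid_part["department"].map(rank).fillna(-1).astype(int)
)
print(valid_part["department_rank"].tolist())
```

Keeping the target of the held-out rows out of the mapping means the validation score is not inflated by the encoding itself.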

### 5. Mean Encoding

Similar to target-guided encoding; the only difference is that each category is replaced with the mean of the target column for that category rather than with its rank.

In [43]:
train['recruitment_channel'].value_counts()

other       30446
sourcing    23220
referred     1142
Name: recruitment_channel, dtype: int64

In [44]:
train['recruitment_channel']=train['recruitment_channel'].astype(str).str[0]

In [45]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,3,4843,s,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,5,6428,o,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,3,874,s,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,3,1175,o,2,39,1.0,10,0,0,50,0,1,2.0


In [46]:
mean_ordinal = train.groupby(['recruitment_channel'])['is_promoted'].mean().to_dict()

In [47]:
mean_ordinal

{'o': 0.0839519148656638, 'r': 0.12084063047285463, 's': 0.08501291989664082}

In [48]:
train['recruitment_channel'] = train['recruitment_channel'].map(mean_ordinal)

In [49]:
train.head(4)

Unnamed: 0,employee_id,department,region,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,m,ordinal_edu
0,65438,3,4843,0.085013,1,35,5.0,8,1,0,49,0,0,1.0
1,65141,5,6428,0.083952,1,30,5.0,4,0,0,60,0,1,2.0
2,7513,3,874,0.085013,1,34,3.0,7,0,0,50,0,1,2.0
3,2542,3,1175,0.083952,2,39,1.0,10,0,0,50,0,1,2.0


***Advantages:***

* Captures information about the target within the categories, yielding more predictive features.

* Creates a monotonic relationship between the encoded variable and the target variable.

***Disadvantage:***

* May cause the model to overfit; cross-validation is commonly used to counter this.
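One common way to combine mean encoding with cross-validation is out-of-fold encoding, where each row is encoded with category means computed from the other folds. The sketch below is one possible implementation under stated assumptions: `oof_mean_encode` is a hypothetical helper, the fold assignment is a simple random split, and the data is made up.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(df, col, target, n_splits=5, seed=0):
    """Encode each row with category means that exclude the row's own fold."""
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(df)) % n_splits  # random fold id per row
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_of_row == fold
        # Category means computed on the other folds only.
        means = df.loc[~held_out].groupby(col)[target].mean()
        encoded.loc[held_out] = df.loc[held_out, col].map(means)
    # Categories absent from the other folds fall back to the global mean.
    return encoded.fillna(global_mean)

# Made-up data; column names follow the notebook.
toy = pd.DataFrame({
    "recruitment_channel": ["other", "sourcing", "referred", "other", "sourcing", "other"] * 5,
    "is_promoted": [0, 0, 1, 1, 0, 0] * 5,
})
toy["channel_mean_oof"] = oof_mean_encode(toy, "recruitment_channel", "is_promoted")
print(toy.head())
```

Since no row's own target value contributes to its encoding, the encoded feature leaks less target information than a mean computed over the full column.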

### 6. Probability Ratio Encoding

Here, each category is replaced with the ratio of the probability of the positive target class to the probability of the negative class, P(1)/P(0), within that category.

In [63]:
train = pd.read_csv('train.csv')

In [64]:
train.head(4)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0


In [65]:
train['department'].value_counts()

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

In [66]:
label_dep = train['department'].astype(str).str[0]

In [67]:
label_dep.value_counts()

S    16840
O    11348
T     7138
P     7138
A     5352
F     2536
H     2418
L     1039
R      999
Name: department, dtype: int64

In [68]:
# Promotion rate per department
Prob_promoted = train.groupby(['department'])['is_promoted'].mean()

In [69]:
Prob_promoted = pd.DataFrame(Prob_promoted)

In [70]:
Prob_promoted 

department,is_promoted
Analytics,0.095665
Finance,0.08123
HR,0.056245
Legal,0.051011
Operations,0.090148
Procurement,0.096386
R&D,0.069069
Sales & Marketing,0.072031
Technology,0.107593


In [61]:
Prob_not_promoted = 1-Prob_promoted 

In [71]:
Prob_promoted['is_not_promoted'] = 1 - Prob_promoted['is_promoted']

In [72]:
Prob_promoted

department,is_promoted,is_not_promoted
Analytics,0.095665,0.904335
Finance,0.08123,0.91877
HR,0.056245,0.943755
Legal,0.051011,0.948989
Operations,0.090148,0.909852
Procurement,0.096386,0.903614
R&D,0.069069,0.930931
Sales & Marketing,0.072031,0.927969
Technology,0.107593,0.892407


In [73]:
Prob_promoted['prob_ratio'] = Prob_promoted['is_promoted']/Prob_promoted['is_not_promoted'] 

In [74]:
Prob_promoted

department,is_promoted,is_not_promoted,prob_ratio
Analytics,0.095665,0.904335,0.105785
Finance,0.08123,0.91877,0.088412
HR,0.056245,0.943755,0.059597
Legal,0.051011,0.948989,0.053753
Operations,0.090148,0.909852,0.09908
Procurement,0.096386,0.903614,0.106667
R&D,0.069069,0.930931,0.074194
Sales & Marketing,0.072031,0.927969,0.077622
Technology,0.107593,0.892407,0.120565


In [75]:
Encode_prob_ratio = Prob_promoted['prob_ratio'].to_dict()

In [76]:
Encode_prob_ratio

{'Analytics': 0.10578512396694216,
 'Finance': 0.08841201716738196,
 'HR': 0.0595968448729185,
 'Legal': 0.0537525354969574,
 'Operations': 0.09907990314769977,
 'Procurement': 0.10666666666666667,
 'R&D': 0.07419354838709677,
 'Sales & Marketing': 0.0776220643757599,
 'Technology': 0.1205651491365777}

In [77]:
train['Encode_prob_ratio'] = train['department'].map(Encode_prob_ratio)

In [78]:
train.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,Encode_prob_ratio
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0,0.077622
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0,0.09908
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0,0.077622
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0,0.077622
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0,0.120565


***Advantages:***

* Adds no extra features.

* Captures information within the categories and hence creates more predictive features.

* Creates a monotonic relationship between the variable and the target, so it is suitable for linear models.

***Disadvantages:***

* Undefined when the denominator is 0, i.e. when every row in a category belongs to the positive class.

* Like the previous two methods, it can lead to overfitting; cross-validation is usually performed to detect and limit it.
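The zero-denominator case can be avoided by clipping the probabilities slightly away from 0 and 1 before taking the ratio. The sketch below uses made-up rows; the constant `eps` is a hypothetical choice, and clipping is only one of several possible workarounds.

```python
import numpy as np
import pandas as pd

# Made-up rows: every Legal row is promoted, so P(not promoted) would be 0.
toy = pd.DataFrame({
    "department": ["HR", "HR", "Legal", "Legal", "Technology", "Technology"],
    "is_promoted": [1, 0, 1, 1, 0, 0],
})

# Promotion rate per department.
p = toy.groupby("department")["is_promoted"].mean()

eps = 1e-6  # hypothetical clipping constant
p_clipped = p.clip(eps, 1 - eps)

# P(promoted) / P(not promoted), now finite for every department.
ratio = p_clipped / (1 - p_clipped)
toy["prob_ratio"] = toy["department"].map(ratio)
print(ratio)
```

Clipping keeps the ratio finite but produces very large values for categories that are almost always positive, so the choice of `eps` affects the scale of the encoded feature.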