#### -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Task:

1. Count or frequency encoding
2. Ordinal_encoding
3. Target Guided Ordinal Encoding
4. Mean Encoding
5. Probability Ratio Encoding

#### -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### 1. Count or frequency encoding

#### High Cardinality


Variables that have a multitude of categories are also said to have **high cardinality**.

If a categorical variable has many labels (high cardinality), then one hot encoding expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the fraction of observations that carry the label. The two carry the same information, because the frequency is just the count divided by the total number of rows.
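
As a minimal sketch of the idea on a toy column (the `colours` data is hypothetical and is not part of the Mercedes-Benz dataset), the count and the frequency of each label can be obtained from `value_counts()` and `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy column (hypothetical data, used only for illustration)
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Count: number of rows that carry each label
count_map = colours.value_counts().to_dict()
# Frequency: fraction of rows that carry each label
freq_map = colours.value_counts(normalize=True).to_dict()

encoded = pd.DataFrame({
    "label": colours,
    "count": colours.map(count_map),
    "frequency": colours.map(freq_map),
})
print(encoded)

# Frequency is count / n, so the two encodings differ only by a constant factor
max_gap = (encoded["count"] / len(colours) - encoded["frequency"]).abs().max()
print(max_gap < 1e-12)
```

Because the two columns differ only by the factor `len(colours)`, a model that is insensitive to feature scale should treat them the same way.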


Let's see how this works:

In [9]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# let's open the mercedes benz dataset for demonstration
# Download the dataset from the below link
#https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration/data

data = pd.read_csv('C:\\Users\\krish\\OneDrive\\Desktop\\Kaggle\\mercedes-benz-greener-manufacturing\\train.csv')
# take a copy so that the later in-place encoding does not modify a view of `data`
df = data[['X1', 'X2', 'X3', 'X4', 'X5']].copy()
print(df.shape)
df.head()


(4209, 5)


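|   | X1 | X2 | X3 | X4 | X5 |
|---|----|----|----|----|----|
| 0 | v | at | a | d | u |
| 1 | t | av | e | d | y |
| 2 | w | n | c | d | x |
| 3 | t | n | f | d | x |
| 4 | v | n | f | d | h |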


#### One hot Encoding

In [10]:
pd.get_dummies(df).shape

(4209, 111)

In [11]:
print("X1-unique:", df['X1'].nunique())
print("X2-unique:", df['X2'].nunique())
print("X3-unique:", df['X3'].nunique())
print("X4-unique:", df['X4'].nunique())
print("X5-unique:", df['X5'].nunique())

for col in df.columns:
    print(col, ":", df[col].unique())
    print("*"*100)

X1-unique: 27
X2-unique: 44
X3-unique: 7
X4-unique: 4
X5-unique: 29
X1 : ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
****************************************************************************************************
X2 : ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
****************************************************************************************************
X3 : ['a' 'e' 'c' 'f' 'd' 'b' 'g']
****************************************************************************************************
X4 : ['d' 'b' 'c' 'a']
****************************************************************************************************
X5 : ['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
**************
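
The one hot width reported above (111 columns) is the sum of the per-column label counts: 27 + 44 + 7 + 4 + 29 = 111. The sketch below checks the same relationship on a small, hypothetical DataFrame (`toy` and its columns `A` and `B` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: column A has 3 labels, column B has 2
toy = pd.DataFrame({
    "A": ["x", "y", "z", "x"],
    "B": ["p", "q", "p", "p"],
})

# One hot encoding creates one column per (variable, label) pair
one_hot_width = pd.get_dummies(toy).shape[1]
total_labels = sum(toy[c].nunique() for c in toy.columns)

print(one_hot_width, total_labels)
```

Both numbers are equal here, which is why high cardinality columns inflate the feature space so quickly under one hot encoding.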

In [12]:
# let's obtain the counts for each one of the labels in variable X2
# let's capture this in a dictionary that we can use to re-map the labels

df['X2'].value_counts().to_dict()

In [14]:
# And now let's replace each label in column by its count

def Count_frequency_encoding(df, col):
    # first we make a dictionary that maps each label to the counts
    df_frequency_map = df[col].value_counts().to_dict()

    # and now we replace column labels in the dataset df
    df[col] = df[col].map(df_frequency_map)

In [15]:
for col in df.columns:
    Count_frequency_encoding(df, col)

In [16]:
print(df.shape)
df.head()

(4209, 5)


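|   | X1 | X2 | X3 | X4 | X5 |
|---|----|----|----|----|----|
| 0 | 408 | 6 | 440 | 4205 | 1 |
| 1 | 31 | 4 | 163 | 4205 | 1 |
| 2 | 52 | 137 | 1942 | 4205 | 2 |
| 3 | 31 | 137 | 1076 | 4205 | 2 |
| 4 | 408 | 137 | 1076 | 4205 | 1 |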


Count or frequency encoding has both advantages and disadvantages:

##### Advantages

1. It is very simple to implement
2. Does not increase the feature dimensional space

##### Disadvantages

1. If some of the labels have the same count, they are replaced with the same number, so the model can no longer tell them apart and valuable information may be lost.

2. Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
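
The first disadvantage is easy to reproduce. In this small sketch (the Series `s` is hypothetical data), the labels `a` and `b` both appear twice, so they collapse into the same encoded value:

```python
import pandas as pd

# Hypothetical column in which "a" and "b" have the same count
s = pd.Series(["a", "a", "b", "b", "c"])

count_map = s.value_counts().to_dict()
encoded = s.map(count_map)
print(encoded.tolist())

# Number of distinct original labels hiding behind each encoded value
labels_per_code = s.groupby(encoded).nunique()
print(labels_per_code)
```

Any encoded value that stands for more than one label marks a collision.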

Follow this thread in Kaggle for more information:
https://www.kaggle.com/general/16927


### 2. Ordinal_encoding

#### Ordinal numbering encoding or Label Encoding

**Ordinal categorical variables**

Ordinal data is a categorical, statistical data type in which the variables have natural, ordered categories but the distances between the categories are not known.



For example:

- Student's grade in an exam (A, B, C or Fail).
- Educational level, with the categories Elementary school, High school, College graduate and PhD, ranked from 1 to 4.

When the categorical variables are ordinal, the most straightforward approach is to replace the labels with ordinal numbers based on their ranks.
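
For the educational level example, a minimal sketch could look like this. The `education` Series is hypothetical data, and the two options shown are just two common ways of doing the replacement in pandas:

```python
import pandas as pd

# Categories listed from lowest to highest rank
levels = ["Elementary school", "High school", "College graduate", "PhD"]

# Hypothetical observations
education = pd.Series(
    ["High school", "PhD", "Elementary school", "College graduate", "High school"]
)

# Option 1: explicit dictionary, ranks 1 to 4
rank_map = {label: rank for rank, label in enumerate(levels, start=1)}
encoded_map = education.map(rank_map)

# Option 2: ordered categorical; codes start at 0, so add 1 to match option 1
ordered = pd.Categorical(education, categories=levels, ordered=True)
encoded_cat = pd.Series(ordered.codes + 1, index=education.index)

print(encoded_map.tolist())
print(encoded_cat.tolist())
```

Both options give the same ranks. The dictionary keeps the mapping visible, while the ordered categorical also allows comparisons such as `ordered > "High school"`.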




In [17]:
import pandas as pd
import datetime

# create a variable with dates, and from that extract the weekday
# build a list of the last 20 days, counting back from today,
# and then transform it into a dataframe

df_base = datetime.datetime.today()
df_date_list = [df_base - datetime.timedelta(days=x) for x in range(0, 20)]
df = pd.DataFrame(df_date_list)
df.columns = ['day']
df

|   | day |
|---|-----|
| 0 | 2021-05-13 20:20:21.068680 |
| 1 | 2021-05-12 20:20:21.068680 |
| 2 | 2021-05-11 20:20:21.068680 |
| 3 | 2021-05-10 20:20:21.068680 |
| 4 | 2021-05-09 20:20:21.068680 |
| 5 | 2021-05-08 20:20:21.068680 |
| 6 | 2021-05-07 20:20:21.068680 |
| 7 | 2021-05-06 20:20:21.068680 |
| 8 | 2021-05-05 20:20:21.068680 |
| 9 | 2021-05-04 20:20:21.068680 |


In [18]:
# extract the week day name

df['day_of_week'] = df['day'].dt.day_name()
df.head()

|   | day | day_of_week |
|---|-----|-------------|
| 0 | 2021-05-13 20:20:21.068680 | Thursday |
| 1 | 2021-05-12 20:20:21.068680 | Wednesday |
| 2 | 2021-05-11 20:20:21.068680 | Tuesday |
| 3 | 2021-05-10 20:20:21.068680 | Monday |
| 4 | 2021-05-09 20:20:21.068680 | Sunday |


In [19]:
# Engineer categorical variable by ordinal number replacement

weekday_map = {'Monday':1,
               'Tuesday':2,
               'Wednesday':3,
               'Thursday':4,
               'Friday':5,
               'Saturday':6,
               'Sunday':7
}

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head(20)

|   | day | day_of_week | day_ordinal |
|---|-----|-------------|-------------|
| 0 | 2021-05-13 20:20:21.068680 | Thursday | 4 |
| 1 | 2021-05-12 20:20:21.068680 | Wednesday | 3 |
| 2 | 2021-05-11 20:20:21.068680 | Tuesday | 2 |
| 3 | 2021-05-10 20:20:21.068680 | Monday | 1 |
| 4 | 2021-05-09 20:20:21.068680 | Sunday | 7 |
| 5 | 2021-05-08 20:20:21.068680 | Saturday | 6 |
| 6 | 2021-05-07 20:20:21.068680 | Friday | 5 |
| 7 | 2021-05-06 20:20:21.068680 | Thursday | 4 |
| 8 | 2021-05-05 20:20:21.068680 | Wednesday | 3 |
| 9 | 2021-05-04 20:20:21.068680 | Tuesday | 2 |
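
For dates specifically, pandas already exposes the weekday as a number, so the dictionary is not strictly needed. The sketch below uses a fixed, hypothetical date range (starting on 2021-05-04 instead of `today()`, so that the result is reproducible) and checks that `dt.dayofweek + 1` agrees with the manual `weekday_map`:

```python
import pandas as pd

# Fixed date range so the example is reproducible (an assumption, not the original data)
days = pd.DataFrame({"day": pd.date_range("2021-05-04", periods=10, freq="D")})
days["day_of_week"] = days["day"].dt.day_name()

# Manual mapping, as in the cell above
weekday_map = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
               "Friday": 5, "Saturday": 6, "Sunday": 7}
days["day_ordinal"] = days["day_of_week"].map(weekday_map)

# Built-in alternative: dt.dayofweek uses Monday=0 ... Sunday=6, so add 1
days["day_ordinal_dt"] = days["day"].dt.dayofweek + 1

print((days["day_ordinal"] == days["day_ordinal_dt"]).all())
```

The explicit dictionary remains the more general tool, since most ordinal variables have no built-in numeric accessor.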


##### Ordinal Measurement Advantages

Ordinal measurement is normally used for surveys and questionnaires. Statistical analysis is applied to the responses once they are collected to place the people who took the survey into the various categories. The data is then compared to draw inferences and conclusions about the whole surveyed population with regard to the specific variables. The advantage of using ordinal measurement is ease of collation and categorization. If you ask a survey question without providing the variables, the answers are likely to be so diverse they cannot be converted to statistics.

With Respect to Machine Learning

- Keeps the semantic information of the variable (human readable content)
- Straightforward

##### Ordinal Measurement Disadvantages
The same characteristics of ordinal measurement that create its advantages also create certain disadvantages. The responses are often so narrow in relation to the question that they create or magnify bias that is not factored into the survey. For example, on the question about satisfaction with the governor, people might be satisfied with his job performance but upset about a recent sex scandal. The survey question might lead respondents to state their dissatisfaction about the scandal, in spite of satisfaction with his job performance -- but the statistical conclusion will not differentiate.

With Respect to Machine Learning

- Does not add information that is valuable for machine learning


### 3. Target Guided Ordinal Encoding

1. Order the labels according to the target (here, by the mean of `Survived` for each label, which is the probability of the target being 1)
2. Replace each label with its position in that ordering

In [1]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | NaN |
| 1 | 1 | C85 |
| 2 | 1 | NaN |
| 3 | 1 | C123 |
| 4 | 0 | NaN |


In [2]:
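# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Cabin'] = df['Cabin'].fillna('Missing')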
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | M |
| 1 | 1 | C |
| 2 | 1 | M |
| 3 | 1 | C |
| 4 | 0 | M |


In [3]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [4]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [5]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
print(ordinal_labels)

ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
print(ordinal_labels2)

df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2)
df.head()

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')
{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}


|   | Survived | Cabin | Cabin_ordinal_labels |
|---|----------|-------|----------------------|
| 0 | 0 | M | 1 |
| 1 | 1 | C | 4 |
| 2 | 1 | M | 1 |
| 3 | 1 | C | 4 |
| 4 | 0 | M | 1 |
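
The steps above can be wrapped in a small helper. This is only a sketch: the function name `target_guided_ordinal_map` and the toy columns `deck` and `survived` are hypothetical and do not come from the Titanic file.

```python
import pandas as pd

def target_guided_ordinal_map(frame, col, target):
    """Return {label: rank}, with ranks following the ascending mean of the target."""
    ordered_labels = frame.groupby(col)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered_labels)}

# Hypothetical toy data: target means are A=0.5, B=1.0, C=0.0
toy = pd.DataFrame({
    "deck": ["A", "A", "B", "B", "B", "C", "C"],
    "survived": [0, 1, 1, 1, 1, 0, 0],
})

mapping = target_guided_ordinal_map(toy, "deck", "survived")
toy["deck_ordinal"] = toy["deck"].map(mapping)
print(mapping)
print(toy)

# A label that was not present when the mapping was built becomes NaN
unseen = pd.Series(["A", "Z"]).map(mapping)
print(unseen)
```

Labels that are missing from the mapping come back as `NaN` from `Series.map`, so new data may need a fallback value chosen separately.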


### 4. Mean Encoding

In [13]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])

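# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Cabin'] = df['Cabin'].fillna('Missing')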
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | M |
| 1 | 1 | C |
| 2 | 1 | M |
| 3 | 1 | C |
| 4 | 0 | M |


In [14]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [15]:
df['mean_ordinal_encode']=df['Cabin'].map(mean_ordinal)
df.head()

|   | Survived | Cabin | mean_ordinal_encode |
|---|----------|-------|---------------------|
| 0 | 0 | M | 0.299854 |
| 1 | 1 | C | 0.59322 |
| 2 | 1 | M | 0.299854 |
| 3 | 1 | C | 0.59322 |
| 4 | 0 | M | 0.299854 |
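
The dictionary plus `map` pattern used above can also be written in a single step with `groupby(...).transform("mean")`. The sketch below compares the two on hypothetical toy data (the columns `deck` and `survived` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: target means are A=0.5, B=1.0, C=0.0
toy = pd.DataFrame({
    "deck": ["A", "A", "B", "B", "B", "C", "C"],
    "survived": [0, 1, 1, 1, 1, 0, 0],
})

# Two steps: build a {label: mean} dictionary, then map it onto the column
mean_map = toy.groupby("deck")["survived"].mean().to_dict()
toy["deck_mean_map"] = toy["deck"].map(mean_map)

# One step: transform broadcasts each group's mean back to its rows
toy["deck_mean_transform"] = toy.groupby("deck")["survived"].transform("mean")

print(toy)
```

The dictionary version has the advantage that the mapping can be stored and reused on other data, which `transform` does not offer directly.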


### 5. Probability Ratio Encoding

- Compute the probability of survival for each cabin label (the categorical feature)
- Compute the probability of not surviving: 1 - pr(survived)
- Take the ratio pr(survived) / pr(not survived)
- Build a dictionary that maps each cabin label to its ratio
- Replace the labels of the categorical feature with the mapped ratios
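
A minimal sketch of these steps on hypothetical toy data follows (the columns `deck` and `survived` are made up). It also shows one edge case worth being aware of: if every observation of a label survived, pr(not survived) is 0 and the ratio becomes infinite, so such labels may need separate handling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every passenger on deck "D" survived
toy = pd.DataFrame({
    "deck": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
})

p_survived = toy.groupby("deck")["survived"].mean()   # pr(survived) per label
p_died = 1 - p_survived                               # pr(not survived) per label
ratio = p_survived / p_died                           # probability ratio per label

# Map the ratio back onto the categorical column
toy["deck_encoded"] = toy["deck"].map(ratio.to_dict())

print(ratio)
print(np.isinf(ratio["D"]))  # division by zero yields inf for deck "D"
```

How to treat an infinite ratio (for example capping it or grouping rare labels first) is a modelling decision that this sketch leaves open.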


In [16]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | NaN |
| 1 | 1 | C85 |
| 2 | 1 | NaN |
| 3 | 1 | C123 |
| 4 | 0 | NaN |


In [17]:
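# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Cabin'] = df['Cabin'].fillna('Missing')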
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | M |
| 1 | 1 | C |
| 2 | 1 | M |
| 3 | 1 | C |
| 4 | 0 | M |


In [18]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [26]:
prob_df=df.groupby('Cabin')['Survived'].mean()
prob_df = pd.DataFrame(prob_df)
prob_df.head()

| Cabin | Survived |
|-------|----------|
| A | 0.466667 |
| B | 0.744681 |
| C | 0.59322 |
| D | 0.757576 |
| E | 0.75 |


In [27]:
prob_df['Died'] = 1-prob_df['Survived']
prob_df.head()

| Cabin | Survived | Died |
|-------|----------|------|
| A | 0.466667 | 0.533333 |
| B | 0.744681 | 0.255319 |
| C | 0.59322 | 0.40678 |
| D | 0.757576 | 0.242424 |
| E | 0.75 | 0.25 |


In [28]:
prob_df['probability_ratio'] = prob_df['Survived']/prob_df['Died']
prob_df.head()

| Cabin | Survived | Died | probability_ratio |
|-------|----------|------|-------------------|
| A | 0.466667 | 0.533333 | 0.875 |
| B | 0.744681 | 0.255319 | 2.916667 |
| C | 0.59322 | 0.40678 | 1.458333 |
| D | 0.757576 | 0.242424 | 3.125 |
| E | 0.75 | 0.25 | 3.0 |


In [31]:
probability_encoded=prob_df['probability_ratio'].to_dict()
df['cabin_encoded'] = df['Cabin'].map(probability_encoded)
df.head()

|   | Survived | Cabin | cabin_encoded |
|---|----------|-------|---------------|
| 0 | 0 | M | 0.428274 |
| 1 | 1 | C | 1.458333 |
| 2 | 1 | M | 0.428274 |
| 3 | 1 | C | 1.458333 |
| 4 | 0 | M | 0.428274 |
