# Probability Ratio Encoding

- I have to find the percentage of my dependent feature (the target) for each value of the categorical variable I am considering.
- In this case, I need the percentage of Survived for each value of the Cabin column.

Steps: 

- Probability ratio = (probability of survived) / (probability of died)
- In general, this is (probability of yes) / (probability of no)
- The probability ratios are then converted into a dictionary, so that we can map them onto our categories.
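
As a quick arithmetic check of the formula above, here is a minimal sketch with made-up outcomes for a single hypothetical category (the numbers are not taken from the Titanic data):

```python
# Toy outcomes for one hypothetical category: 1 = yes (survived), 0 = no (died).
survived_flags = [1, 1, 1, 0]

p_yes = sum(survived_flags) / len(survived_flags)  # probability of yes
p_no = 1 - p_yes                                   # probability of no
ratio = p_yes / p_no                               # probability ratio

print(p_yes, p_no, ratio)
```

A ratio above 1 means the positive outcome is more likely than the negative one in that category, and a ratio below 1 means the opposite.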

In [2]:
import pandas as pd
import numpy as np 


In [5]:
df= pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

# Survived is the dependent feature
# The two columns are related: for many people who did not survive, the Cabin value is NaN (though not in every case).

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [6]:
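# handle missing values
# by replacing them with a new category
# (assigning the result back avoids inplace=True on a column selection, which newer pandas versions warn about)

df['Cabin'] = df['Cabin'].fillna("Missing")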

In [7]:
df.isnull().sum()

# all NaN are replaced

Survived    0
Cabin       0
dtype: int64

In [9]:
df.head(20)
# the Cabin values are categories that start with a letter

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing
5,0,Missing
6,0,E46
7,0,Missing
8,1,Missing
9,1,Missing


In [11]:
df['Cabin'].unique()

array(['Missing', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62

- Here df["Cabin"] holds the cabin codes as strings (unique() just returns them as an array). To get the first character, we first make sure the column is string-typed with astype(str), and then take the first character using str[0].
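
To see what this first-character step does in isolation, here is a small sketch on a hand-made Series (the values are illustrative, chosen to resemble the cabin codes above):

```python
import pandas as pd

# Hand-made examples resembling the cabin codes, including the "Missing" placeholder.
cabins = pd.Series(["C85", "B57 B59", "Missing", "T"])

# astype(str) guards against non-string entries; .str[0] takes the first character.
first_letter = cabins.astype(str).str[0]

print(first_letter.tolist())  # ['C', 'B', 'M', 'T']
```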


In [14]:
# there are too many unique values to encode each one separately,
# so I keep only the first character of each value (e.g. A, B, C)

df['Cabin']= df['Cabin'].astype(str).str[0]
df.head()

# here I am replacing every Cabin value with its first character.
# make sure that no real cabin number starts with M, because M now stands for "Missing" (the new category).

# if there were cabin numbers starting with M (e.g. M123), replace the NaN values with "Unknown" instead;
# after taking the first character, I would then get U.


# M stands for Missing

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [15]:
df["Cabin"].unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

## Probability Ratio Encoding
Steps: 

1. Find the probability of Survived for each Cabin value
2. Find the probability of "not survived" (1 - probability of Survived)
3. Divide (probability of Survived) / (probability of not survived) to get the probability ratio
4. Create a dictionary that maps each Cabin value to its probability ratio
5. Create a new column by mapping the categorical feature through the dictionary
6. Delete the original categorical feature
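
The six steps above can be sketched end to end on a tiny made-up DataFrame. The helper name `probability_ratio_map` and the toy data are assumptions for illustration only; the notebook itself performs the same steps cell by cell below.

```python
import pandas as pd

def probability_ratio_map(frame, cat_col, target_col):
    """Hypothetical helper: return {category: P(target=1) / P(target=0)}."""
    p_yes = frame.groupby(cat_col)[target_col].mean()  # step 1
    p_no = 1 - p_yes                                   # step 2
    return (p_yes / p_no).to_dict()                    # steps 3 and 4

# Toy data, not the Titanic file.
toy = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "B", "B"],
    "Survived": [1, 0, 1, 1, 1, 0],
})

mapping = probability_ratio_map(toy, "Cabin", "Survived")
toy["Cabin_encoded"] = toy["Cabin"].map(mapping)  # step 5
toy = toy.drop(columns=["Cabin"])                 # step 6

print(mapping)  # {'A': 1.0, 'B': 3.0}
```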

In [17]:
df.groupby(['Cabin'])['Survived'].mean()

# nobody in cabin group T survived; my guess is that the T cabins were at the bottom of the ship



Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [18]:
# treat this as the probability and save it in a new variable

prob_df= df.groupby(['Cabin'])['Survived'].mean()

In [20]:
# converting the probabilities into a DataFrame
prob_df= pd.DataFrame(prob_df)
prob_df

Cabin,Survived
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [24]:
prob_df["Died"]= 1- prob_df["Survived"]
prob_df.head()

# Survived and Died in each row should sum to 1

Cabin,Survived,Died
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


In [25]:
prob_df['Probablity_ratio']= prob_df["Survived"]/ prob_df["Died"]

In [26]:
prob_df.head()

Cabin,Survived,Died,Probablity_ratio
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0
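
One edge case is worth keeping in mind: if every member of a category survived, the probability of died is 0 and the ratio becomes infinite. The sketch below shows this on made-up probabilities and one possible workaround, flooring the denominator at a small constant. The value of `eps` is an assumption, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up survival probabilities for three hypothetical categories.
p_survived = pd.Series({"X": 1.0, "Y": 0.5, "Z": 0.0})
p_died = 1 - p_survived

raw_ratio = p_survived / p_died  # X becomes inf because its denominator is 0

eps = 1e-6  # assumed floor for the denominator; tune for your data
safe_ratio = p_survived / p_died.clip(lower=eps)

print(raw_ratio)
print(safe_ratio)
```

With the floor in place, category X gets a very large but finite value, while Y and Z are unchanged.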


In [29]:
# converting the probability ratios into a dictionary so that I can map them onto the categories

probability_encoded= prob_df["Probablity_ratio"].to_dict()

In [32]:
# mapping into the dataframe as a new column called Cabin_encoded

df['Cabin_encoded']= df["Cabin"].map(probability_encoded)
df.head(20)

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274
5,0,M,0.428274
6,0,E,3.0
7,0,M,0.428274
8,1,M,0.428274
9,1,M,0.428274


Now I can use the <b>Cabin_encoded</b> column to train the model and delete the <b>Cabin</b> column.
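
When the data is split into training and test sets, one common approach is to compute the mapping on the training rows only and then apply it to the test rows. A category that never appeared in training maps to NaN, so it needs a fallback. The sketch below uses made-up frames, and the fallback (the ratio computed from the overall training survival rate) is an assumption, not part of the notebook.

```python
import pandas as pd

# Made-up train/test frames; "T" appears only in the test set.
train = pd.DataFrame({
    "Cabin": ["C", "C", "M", "M", "M", "M"],
    "Survived": [1, 0, 1, 0, 0, 0],
})
test = pd.DataFrame({"Cabin": ["C", "M", "T"]})

# Learn the mapping from the training rows only.
p = train.groupby("Cabin")["Survived"].mean()
mapping = (p / (1 - p)).to_dict()

# Assumed fallback for unseen categories: ratio from the overall training rate.
overall = train["Survived"].mean()
fallback = overall / (1 - overall)

test["Cabin_encoded"] = test["Cabin"].map(mapping).fillna(fallback)
print(test)
```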