## Target Encoding

 - Target encoding converts a categorical variable into a numerical one by using the target variable: each category is replaced with the mean value of the target for that category.
 - For example, suppose a categorical variable x takes three values (a, b, and c) and the target variable y is binary (0 or 1). You compute the mean of y for each value of x and use that mean as the new representation of x.
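
The x/y example above can be sketched by hand with pandas. The toy data and the names `toy`, `category_means`, and `x_encoded` are made up for illustration and are not part of the Titanic demonstration that follows.

```python
import pandas as pd

# Toy data (invented for illustration): categorical x, binary target y.
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b", "b", "c"],
    "y": [1, 0, 1, 1, 0, 0],
})

# Mean of y within each category of x.
category_means = toy.groupby("x")["y"].mean()

# Replace each category with its mean target value.
toy["x_encoded"] = toy["x"].map(category_means)
print(toy)
```

This is the plain, unsmoothed form of the idea. In practice the means should be computed on training data only, so that the target values of held-out rows do not leak into their own features.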
 
 Demonstration of target encoding using the Titanic survival dataset:

In [1]:
import pandas as pd

df = pd.read_csv("../data/titanic.csv")

In [9]:
df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,0,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,1,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,0,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,1,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,0,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [3]:
X = df[["Sex", "Embarked"]]
y = df["Survived"]

In [4]:
from sklearn.model_selection import train_test_split
import category_encoders as ce

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [5]:
encoder = ce.TargetEncoder(cols=["Sex", "Embarked"])
encoder.fit(X_train, y_train)

TargetEncoder(cols=['Sex', 'Embarked'])

In [6]:
X_test = encoder.transform(X_test)

The categorical features have been replaced by per-category mean values of the target, estimated from the training split. The encoder applies smoothing, so the values are close to, but not exactly, the raw category means.
For example:
- The `Sex` feature is encoded as 9.999845e-01 for female and 2.581593e-09 for male, which means that females had a much higher survival rate than males in this data.
- The `Embarked` feature is encoded as 0.342723 for S, 0.536005 for Q, and 0.383712 for C, which means that passengers who embarked at Q had a higher survival rate than those who embarked at S or C.

In [7]:
X_test.head(10)

,Sex,Embarked
358,2.581593e-09,0.536005
164,2.581593e-09,0.342723
17,2.581593e-09,0.383712
67,2.581593e-09,0.342723
4,0.9999845,0.342723
377,2.581593e-09,0.342723
214,0.9999845,0.342723
290,2.581593e-09,0.342723
381,2.581593e-09,0.536005
5,2.581593e-09,0.342723


In [11]:
df.groupby("Embarked")["Survived"].mean()

Embarked
C    0.392157
Q    0.521739
S    0.325926
Name: Survived, dtype: float64
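
The raw means from `groupby` are close to the encoded values but not identical (for C, 0.392157 versus 0.383712). Two things plausibly account for the gap: the encoder was fit on the training split only, while the `groupby` uses the full dataset, and target encoders usually shrink each category mean toward the overall mean. Below is a minimal sketch of one such shrinkage scheme, a sigmoid weight on the category count, which to my understanding resembles what `category_encoders` does. The function name `smoothed_mean`, its default parameter values, and the counts in the usage lines are assumptions for illustration; the library's actual defaults may differ between versions.

```python
import math

def smoothed_mean(cat_mean, cat_count, prior, min_samples_leaf=20, smoothing=10.0):
    """Blend a category mean with the global mean (prior).

    The weight rises toward 1 as the category count grows, so frequent
    categories keep their own mean and rare ones fall back to the prior.
    Default parameter values are illustrative assumptions.
    """
    weight = 1.0 / (1.0 + math.exp(-(cat_count - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + cat_mean * weight

# Hypothetical numbers, not taken from the Titanic data:
prior = 0.36  # overall mean of the target
frequent = smoothed_mean(cat_mean=1.0, cat_count=120, prior=prior)
rare = smoothed_mean(cat_mean=1.0, cat_count=5, prior=prior)
print(frequent, rare)
```

With these hypothetical inputs, the frequent category stays very near its own mean of 1.0, while the rare category is pulled most of the way toward the prior. This is the same pattern as the `Sex` encodings above, which sit just short of 1 and just above 0.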