https://medium.com/@sujathamudadla1213/target-guided-ordinal-encoding-with-example-450323fea78e

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load data and print random sample

In [2]:
tips = sns.load_dataset("tips")
tips.sample(6)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
222,8.58,1.92,Male,Yes,Fri,Lunch,1
156,48.17,5.0,Male,No,Sun,Dinner,6
112,38.07,4.0,Male,No,Sun,Dinner,3
171,15.81,3.16,Male,Yes,Sat,Dinner,2
144,16.43,2.3,Female,No,Thur,Lunch,2
6,8.77,2.0,Male,No,Sun,Dinner,2


With label encoding, each category gets an arbitrary integer, which implicitly ranks one category above another. We can set that order ourselves from prior experience, but what if the order should depend on another column? In that case we can replace each category with a value derived from another column, usually the target: the mean or median of the target per category, or an ordinal rank based on it.

In our tips dataset, we are going to encode time based on the total bill paid.
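To make the contrast concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data and its values are assumptions for illustration, not taken from the tips dataset). It compares plain label codes, which follow alphabetical order, with codes derived from the target column.

```python
import pandas as pd

# Hypothetical toy data, chosen so the means are easy to check by hand.
toy = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    "total_bill": [10.0, 30.0, 14.0, 26.0],
})

# Plain label encoding: codes follow alphabetical order (Dinner=0, Lunch=1),
# which says nothing about the bill amounts.
label_codes = toy["time"].astype("category").cat.codes

# Target-guided encoding: each category is replaced by the mean of the target.
mean_by_time = toy.groupby("time")["total_bill"].mean()
target_codes = toy["time"].map(mean_by_time)

print(label_codes.tolist())
print(target_codes.tolist())
```

Here Lunch gets the larger label code even though its bills are smaller, while the target-guided values keep the relationship to total_bill.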

Part 1: based on mean

In [7]:
tips.groupby("time")["total_bill"].mean()
# tips.groupby("time")["total_bill"].mean().plot(kind="bar")


time
Lunch     17.168676
Dinner    20.797159
Name: total_bill, dtype: float64

Convert the result to a dictionary so it can be passed to map

In [10]:
tips_dict = tips.groupby("time")["total_bill"].mean().to_dict()
tips_dict


{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [13]:
tips['encoded'] = tips["time"].map(tips_dict)

In [15]:
tips[["time", "total_bill", "encoded"]]

Unnamed: 0,time,total_bill,encoded
0,Dinner,16.99,20.797159
1,Dinner,10.34,20.797159
2,Dinner,21.01,20.797159
3,Dinner,23.68,20.797159
4,Dinner,24.59,20.797159
...,...,...,...
239,Dinner,29.03,20.797159
240,Dinner,27.18,20.797159
241,Dinner,22.67,20.797159
242,Dinner,17.82,20.797159
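The same idea works with the median instead of the mean, and the dictionary step can be skipped with `groupby(...).transform`. The sketch below uses a small made-up frame (an assumption for illustration) in which one large lunch bill pulls the mean up but leaves the median unchanged.

```python
import pandas as pd

# Hypothetical toy data: the 30.0 lunch bill is an outlier.
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [8.0, 10.0, 30.0, 20.0, 22.0, 24.0],
})

# transform returns one value per row, aligned with the original index,
# so no dictionary or map call is needed.
toy["encoded_mean"] = toy.groupby("time")["total_bill"].transform("mean")
toy["encoded_median"] = toy.groupby("time")["total_bill"].transform("median")

print(toy)
```

Which statistic to use is a judgment call; the median is less sensitive to a few very large bills.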


Part 2: Put priority

Here we will do ordinal encoding, with the category order determined by the mean (or sum) of the target.

In [36]:
time_with_priority = tips.groupby("time")["total_bill"].mean().reset_index().sort_values("total_bill", ascending=False)['time'].values.tolist()

time_with_priority


['Dinner', 'Lunch']

In [38]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[time_with_priority[::-1]])
tips['encoded_ordinal'] = enc.fit_transform(tips[['time']])
tips[["time", "total_bill", "encoded_ordinal"]].sample(6)

Unnamed: 0,time,total_bill,encoded_ordinal
194,Lunch,16.58,0.0
237,Dinner,32.83,1.0
185,Dinner,20.69,1.0
191,Lunch,19.81,0.0
82,Lunch,10.07,0.0
84,Lunch,15.98,0.0
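For comparison, the same ordinal codes can be produced with pandas alone. This is a sketch on made-up data (the `toy` frame and the `rank_map` name are assumptions for illustration), not a replacement for the OrdinalEncoder approach above.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    "total_bill": [10.0, 30.0, 14.0, 26.0],
})

# Sort categories by mean target, lowest first.
means = toy.groupby("time")["total_bill"].mean().sort_values()

# Position in the sorted order becomes the ordinal code.
rank_map = {category: rank for rank, category in enumerate(means.index)}
toy["encoded_ordinal"] = toy["time"].map(rank_map)

print(rank_map)
print(toy)
```

Unlike OrdinalEncoder, this yields integer codes rather than floats, and the mapping is a plain dictionary that is easy to inspect.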


Let's break down the code above:
```
tips.groupby("time")["total_bill"].mean().reset_index().sort_values("total_bill", ascending=False)['time'].values
```

In [26]:
tips.groupby("time")["total_bill"].mean()


time
Lunch     17.168676
Dinner    20.797159
Name: total_bill, dtype: float64

The result above is a Series indexed by time, so the means are not in a named column we can sort on. Calling reset_index() turns it into a DataFrame with time and total_bill columns.

In [27]:
tips.groupby("time")["total_bill"].mean().reset_index()


Unnamed: 0,time,total_bill
0,Lunch,17.168676
1,Dinner,20.797159


We need to sort time by total_bill so that the category with the highest bill comes last. Ordinal encoding then assigns it the highest value.

In [32]:
tips.groupby("time")["total_bill"].mean().reset_index().sort_values("total_bill")


Unnamed: 0,time,total_bill
0,Lunch,17.168676
1,Dinner,20.797159


In the main code, however, we sorted in descending order, which is why the list is reversed with [::-1] before it is passed to the encoder.

Now we need the values of the time column as a list.

In [35]:
tips.groupby("time")["total_bill"].mean().reset_index().sort_values("total_bill")['time'].values.tolist()


['Lunch', 'Dinner']
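The steps above can be wrapped in a small helper. This is only a sketch: the function name `target_guided_order`, its `agg` parameter, and the toy data are assumptions for illustration. It sorts the grouped Series directly and reads the categories from its index, so reset_index() is not needed.

```python
import pandas as pd

def target_guided_order(df, cat_col, target_col, agg="mean"):
    """Return the categories of cat_col sorted by the aggregated target, lowest first."""
    stats = df.groupby(cat_col)[target_col].agg(agg)
    # Sorting the Series keeps the categories in its index.
    return stats.sort_values().index.tolist()

# Hypothetical toy data: three small lunch bills, two larger dinner bills.
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 12.0, 14.0, 17.0, 17.0],
})

order_by_mean = target_guided_order(toy, "time", "total_bill")
order_by_sum = target_guided_order(toy, "time", "total_bill", agg="sum")

print(order_by_mean)
print(order_by_sum)
```

On this toy data the two aggregates disagree: Dinner has the higher mean but Lunch has the higher sum, so the choice between mean and sum changes the resulting order.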