# Label Encoding

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.DataFrame({
    'color':['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [3]:
df

   color
0    red
1   blue
2  green
3  green
4    red
5   blue


In [4]:
# create an instance of LabelEncoder
encoder = LabelEncoder()

In [6]:
encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

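The problem with label encoding is that it imposes an order on classes that have none. LabelEncoder sorts the class labels and numbers them from 0, so the integers reflect alphabetical order rather than anything meaningful about the categories. A model that treats the encoded column as numeric will consider a class with a higher value greater than a class with a lower value (here, red > green > blue). For this reason scikit-learn intends LabelEncoder for target labels rather than input features.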
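
To see where the integers come from, the fitted encoder can be inspected. The sketch below is a standalone illustration (the colors list is an assumed stand-in for the column above): the classes_ attribute lists the learned labels in sorted order, and each label's position in that list is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed sample data, mirroring the 'color' column used above
colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the learned labels in sorted order;
# the index of a label in this array is the integer it is encoded as
print(encoder.classes_)   # ['blue' 'green' 'red']
print(codes)              # [2 0 1 1 2 0]

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes carry no meaning beyond this lookup, so comparisons such as 2 > 0 say nothing about the colors themselves.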

# Ordinal Encoding

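Ordinal Encoding is used for encoding the classes of a categorical variable that have a natural order or rank, such as small < medium < large. The order is passed to the encoder explicitly so that the integers follow it.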

In [8]:
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})

In [9]:
df

     size
0   small
1  medium
2   large
3  medium
4   small
5   large


In [10]:
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [11]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
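
The categories argument is what makes the encoding ordinal. As a rough comparison (a sketch with an assumed sizes frame that mirrors the one above), leaving categories at its default lets the encoder infer the classes and sort them alphabetically, which puts large before small:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed sample data, mirroring the 'size' column used above
sizes = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

# Default behaviour: categories are inferred and sorted alphabetically,
# giving large=0, medium=1, small=2 (not the real-world order)
default_codes = OrdinalEncoder().fit_transform(sizes[['size']]).ravel()
print(default_codes)   # [2. 1. 0. 1. 2. 0.]

# Explicit order: one list per encoded column, lowest rank first
ordered = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered.fit_transform(sizes[['size']]).ravel()
print(ordered_codes)   # [0. 1. 2. 1. 0. 2.]
```

Only the second result preserves small < medium < large, so the explicit list is worth writing out whenever the alphabetical order differs from the intended rank.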

# Target Guided Ordinal Encoding

Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we replace each category with a numerical value based on the mean or median of the target variable for that category. The encoded feature then increases monotonically with the typical target value of each category, which can improve the predictive power of our model. Because the encoding is derived from the target, the per-category statistics should be computed on the training data only; otherwise information about the target leaks into the evaluation.

In [12]:
df = pd.DataFrame({
    'city':['New York', 'London', 'Paris', 'Tokyo','New York', 'Paris'],
    'price':[200, 150, 300, 250, 180, 320]
})

In [13]:
df

       city  price
0  New York    200
1    London    150
2     Paris    300
3     Tokyo    250
4  New York    180
5     Paris    320


In [15]:
# calculate the mean price for each city
df.groupby('city')['price'].mean()

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [16]:
df.groupby('city')['price'].mean().to_dict()

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [17]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [18]:
# replace each city with its mean price
df['city_encoded'] = df['city'].map(mean_price)

In [19]:
df

       city  price  city_encoded
0  New York    200         190.0
1    London    150         150.0
2     Paris    300         310.0
3     Tokyo    250         250.0
4  New York    180         190.0
5     Paris    320         310.0
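
The cells above replace each city with its mean price directly. A common variant, sketched here under the assumption that integer ranks are wanted instead of the raw means, sorts the cities by mean price and assigns 0, 1, 2, ... in that order. The names prices, city_means, city_rank and the city_rank column are illustrative and not part of the notebook above.

```python
import pandas as pd

# Assumed sample data, mirroring the city/price frame used above
prices = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest
city_means = prices.groupby('city')['price'].mean().sort_values()

# Rank of each city in that ordering: the cheapest city gets 0
city_rank = {city: rank for rank, city in enumerate(city_means.index)}
print(city_rank)   # {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}

# Replace each city with its rank
prices['city_rank'] = prices['city'].map(city_rank)
print(prices['city_rank'].tolist())   # [1, 0, 3, 2, 1, 3]
```

With either variant, map returns NaN for a city that was not present when the mapping was built, so new data needs a fallback value chosen for the task at hand.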
