# **Data Encoding**

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

## **1. One-Hot Encoding**
- One-hot encoding, also known as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms. Each category becomes a binary vector in which exactly one position is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:
- Example:
   - Categories: [Red, Green, Blue]
   - Encoded: 
      ```
      Red   Green   Blue
      1     0       0
      0     1       0
      0     0       1
      ```

One-hot encoding is useful when there is no ordinal relationship among the categories.
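To make the mapping concrete, here is a minimal hand-rolled sketch of the idea using NumPy. The sample values and the names `colors`, `categories`, `index` and `one_hot` are assumptions for illustration only, and the column order follows the Red, Green, Blue layout of the table above rather than any library default.

```python
import numpy as np

# Hypothetical sample values; the category list mirrors the table above.
colors = ["Red", "Green", "Blue", "Blue"]
categories = ["Red", "Green", "Blue"]

# Map each category to the column that will hold its 1.
index = {cat: i for i, cat in enumerate(categories)}

# Start from all zeros and set exactly one 1 per row.
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, color in enumerate(colors):
    one_hot[row, index[color]] = 1

print(one_hot)
```

Every row sums to 1, which is what distinguishes a one-hot vector from an arbitrary binary vector.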

In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
df = pd.DataFrame({
    'color':["Red", "Green", "Blue", "Blue", "Green", "Red"]
})

In [5]:
df

Unnamed: 0,color
0,Red
1,Green
2,Blue
3,Blue
4,Green
5,Red


In [7]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [24]:
# Fit the encoder and transform the column; toarray() converts the sparse result to a dense array
encoded = encoder.fit_transform(df[['color']]).toarray()

In [11]:
print(encoded)

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [16]:
# Check which encoding is assigned to each color
encoder.transform([['Red']]).toarray()



array([[0., 0., 1.]])

In [17]:
encoder.transform([['Blue']]).toarray()



array([[1., 0., 0.]])

In [18]:
encoder.transform([['Green']]).toarray()



array([[0., 1., 0.]])

In [23]:
# Wrap the encoded array in a DataFrame, using the encoder's generated column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
encoder_df

Unnamed: 0,color_Blue,color_Green,color_Red
0,0.0,0.0,1.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,1.0,0.0
5,0.0,0.0,1.0


In [26]:
pd.concat([df, encoder_df], axis=1)

Unnamed: 0,color,color_Blue,color_Green,color_Red
0,Red,0.0,0.0,1.0
1,Green,0.0,1.0,0.0
2,Blue,1.0,0.0,0.0
3,Blue,1.0,0.0,0.0
4,Green,0.0,1.0,0.0
5,Red,0.0,0.0,1.0
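As an aside, the same combined table can be sketched with pandas alone. This is an alternative shown for comparison, not the approach used in the cells above; `df_colors`, `dummies` and `result` are names chosen for this sketch.

```python
import pandas as pd

# Same sample data as the cells above, under a separate name.
df_colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Blue", "Green", "Red"]})

# get_dummies builds one indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(df_colors["color"], prefix="color", dtype=float)

# Place the indicator columns next to the original column.
result = pd.concat([df_colors, dummies], axis=1)
print(result)
```

A fitted `OneHotEncoder` has the advantage that it can be reused to transform new data with the same column layout.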


# Label Encoding 
#### Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.
Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or by category frequency; scikit-learn's `LabelEncoder` uses sorted (alphabetical) order and starts at 0.
The drawback is that a model may read these integers as a ranking and give more weight to a category simply because it was assigned a larger number, even though the categories have no inherent order.
For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows (the numbers are illustrative; the `LabelEncoder` cells below produce Blue = 0, Green = 1, Red = 2):

1. Red: 1
2. Green: 2
3. Blue: 3
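The sketch below shows how to inspect the mapping that scikit-learn's `LabelEncoder` actually learns. The variable names (`colors`, `encoder`, `codes`, `mapping`, `decoded`) are chosen for this sketch; `classes_` and `inverse_transform` are part of the `LabelEncoder` API.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Green", "Blue", "Blue", "Green", "Red"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; a category's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes come from sorted order, "Red" receives the largest value purely for alphabetical reasons, which is exactly the spurious ranking described above.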

In [27]:
df

Unnamed: 0,color
0,Red
1,Green
2,Blue
3,Blue
4,Green
5,Red


In [28]:
from sklearn.preprocessing import LabelEncoder
# Create instance of LabelEncoder
lbl_encoder = LabelEncoder()

In [53]:
# LabelEncoder expects 1-D input, so pass the column as a Series rather than a DataFrame
lbl_encoder.fit_transform(df['color'])

array([2, 1, 0, 0, 1, 2])

In [54]:
df['encoded_feature'] = lbl_encoder.fit_transform(df['color'])
df

Unnamed: 0,color,encoded_feature
0,Red,2
1,Green,1
2,Blue,0
3,Blue,0
4,Green,1
5,Red,2


In [36]:
lbl_encoder.transform(['Red'])

array([2])

In [31]:
lbl_encoder.transform(['Blue'])

array([0])

In [32]:
lbl_encoder.transform(['Green'])

array([1])

# Ordinal Encoding
#### Ordinal encoding assigns a rank to each category of a variable that has a natural order. Each category receives a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
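The education-level mapping above can be sketched with a plain dictionary and `Series.map`. The sample rows and the names `df_edu` and `education_rank` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sample data using the education levels listed above.
df_edu = pd.DataFrame({
    "education": ["college", "high school", "post-graduate", "graduate", "college"]
})

# Explicit rank for each level, from lowest to highest.
education_rank = {"high school": 1, "college": 2, "graduate": 3, "post-graduate": 4}

df_edu["education_encoded"] = df_edu["education"].map(education_rank)
print(df_edu)
```

An explicit dictionary gives full control over the numbers. `OrdinalEncoder` with a `categories` list, used in the cells that follow, expresses the same order but numbers the categories from 0.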

In [55]:
from sklearn.preprocessing import OrdinalEncoder

In [57]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [61]:
# Create an OrdinalEncoder with the category order given explicitly, lowest to highest
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [63]:
df['feature_encoded'] = ordinal_encoder.fit_transform(df[['size']])
df

Unnamed: 0,size,feature_encoded
0,small,0.0
1,medium,1.0
2,large,2.0
3,medium,1.0
4,small,0.0
5,large,2.0


In [71]:
# Check the value assigned to a single category
ordinal_encoder.transform([['small']])



array([[0.]])

# Target Guided Ordinal Encoding 
#### Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

#### Each category is replaced with a numerical value derived from the mean or median of the target variable for that category. The encoded values then rise and fall with the per-category target statistic, giving a monotonic relationship between the encoded feature and the target that can improve the predictive power of the model.

In [74]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [86]:
mean_price = df.groupby('city')['price'].mean()
mean_price

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [87]:
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [88]:
df['city_encoded'] = df['city'].map(mean_price)
df['city_encoded']

0    190.0
1    150.0
2    310.0
3    250.0
4    190.0
5    310.0
Name: city_encoded, dtype: float64

In [89]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
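The cells above map each city directly to its mean price. A variant, sketched below under the assumption that integer ranks are wanted instead of raw statistics, orders the cities by their median price and numbers them from 0. The names `df_city`, `median_price`, `ordered_cities` and `rank_map` are chosen for this sketch.

```python
import pandas as pd

# Same sample data as above, under a separate name.
df_city = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Median price per city (the mean could be used in exactly the same way).
median_price = df_city.groupby("city")["price"].median()

# Sort the cities by that statistic and number them from 0 upward.
ordered_cities = median_price.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}

df_city["city_rank"] = df_city["city"].map(rank_map)
print(rank_map)
print(df_city)
```

With this data the cheapest city (London) gets rank 0 and the most expensive (Paris) gets rank 3, so the encoded feature increases with the target statistic.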


# Practice Problems

#### Use the tips dataset from seaborn. Use the `time` feature together with `total_bill` to create an encoded feature using target guided ordinal encoding.

In [90]:
import seaborn as sns

In [93]:
df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [94]:
# `time` is a categorical column; passing `observed` explicitly avoids the pandas FutureWarning
mean_total_bill = df.groupby('time', observed=True)['total_bill'].mean().to_dict()


In [95]:
mean_total_bill

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [97]:
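# Mapping a categorical column can yield a categorical result, so cast to float for a numeric feature
df['time_encoded'] = df['time'].map(mean_total_bill).astype(float)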

In [100]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,time_encoded
0,16.99,1.01,Female,No,Sun,Dinner,2,20.797159
1,10.34,1.66,Male,No,Sun,Dinner,3,20.797159
2,21.01,3.50,Male,No,Sun,Dinner,3,20.797159
3,23.68,3.31,Male,No,Sun,Dinner,2,20.797159
4,24.59,3.61,Female,No,Sun,Dinner,4,20.797159
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.797159
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.797159
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.797159
242,17.82,1.75,Male,No,Sat,Dinner,2,20.797159


In [110]:
df_lunch = df[df['time'] == 'Lunch']
df_lunch

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,time_encoded
77,27.20,4.00,Male,No,Thur,Lunch,4,17.168676
78,22.76,3.00,Male,No,Thur,Lunch,2,17.168676
79,17.29,2.71,Male,No,Thur,Lunch,2,17.168676
80,19.44,3.00,Male,Yes,Thur,Lunch,2,17.168676
81,16.66,3.40,Male,No,Thur,Lunch,2,17.168676
...,...,...,...,...,...,...,...,...
222,8.58,1.92,Male,Yes,Fri,Lunch,1,17.168676
223,15.98,3.00,Female,No,Fri,Lunch,3,17.168676
224,13.42,1.58,Male,Yes,Fri,Lunch,2,17.168676
225,16.27,2.50,Female,Yes,Fri,Lunch,2,17.168676


In [114]:
df_lunch_time_encoded = df[df['time'] == 'Lunch'][['time', 'time_encoded']]
df_lunch_time_encoded

Unnamed: 0,time,time_encoded
77,Lunch,17.168676
78,Lunch,17.168676
79,Lunch,17.168676
80,Lunch,17.168676
81,Lunch,17.168676
...,...,...
222,Lunch,17.168676
223,Lunch,17.168676
224,Lunch,17.168676
225,Lunch,17.168676
