# Data encoding

Data encoding converts raw data into a numerical format that an algorithm can read and interpret. In this notebook, that means turning categorical variables into numerical ones.

categorical -> numerical

# 1. Nominal Encoding

Nominal (one-hot) encoding transforms a categorical variable into a numerical one. Each category is represented as a binary vector in which each bit corresponds to one unique category, and only the bit for the row's own category is set to 1.
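Before using scikit-learn, the binary-vector idea can be sketched by hand. This is a minimal illustration in plain Python, not how `OneHotEncoder` is implemented; the names `categories` and `one_hot` are made up for this sketch.

```python
colors = ['Red', 'blue', 'green']

# The sorted unique categories define the bit positions of each vector.
# Uppercase letters sort before lowercase ones, so 'Red' comes first.
categories = sorted(set(colors))

# One vector per value: 1 in the position of its own category, 0 elsewhere.
one_hot = [[int(value == category) for category in categories]
           for value in colors]

for value, vector in zip(colors, one_hot):
    print(value, vector)
```

Each vector contains exactly one 1, which is where the name "one-hot" comes from.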

In [16]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [17]:
df = pd.DataFrame({
    'color' : ['Red','blue','green','green','Red','blue']
})

In [18]:
df

Unnamed: 0,color
0,Red
1,blue
2,green
3,green
4,Red
5,blue


In [19]:
# create an instance of the one-hot encoder
encoder = OneHotEncoder()

In [20]:
# fit the encoder on the color column and transform it (the result is a sparse matrix)
encoded = encoder.fit_transform(df[['color']])

In [22]:
# convert the sparse result to a dense array and label the columns by category
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())

In [26]:

# place the encoded columns next to the original column
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,color,color_Red,color_blue,color_green
0,Red,1.0,0.0,0.0
1,blue,0.0,1.0,0.0
2,green,0.0,0.0,1.0
3,green,0.0,0.0,1.0
4,Red,1.0,0.0,0.0
5,blue,0.0,1.0,0.0


# 2. Label and Ordinal Encoding

In [28]:
from sklearn.preprocessing import LabelEncoder

In [29]:
df = pd.DataFrame({
    'color' : ['Red','blue','green','green','Red','blue']
})

In [30]:
df

Unnamed: 0,color
0,Red
1,blue
2,green
3,green
4,Red
5,blue


In [31]:
# create an instance of the label encoder
encoder = LabelEncoder()

In [32]:
encoder.fit_transform(df['color'])

array([0, 1, 2, 2, 0, 1])
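The integer codes above come from the sorted order of the labels. As a small sketch (the variable names `mapping` and `decoded` are illustrative), the fitted encoder's `classes_` attribute shows which label received which code, and `inverse_transform` maps the codes back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'blue', 'green', 'green', 'Red', 'blue']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

Note that these codes carry no meaningful order: `green` is not "greater than" `Red`. The scikit-learn documentation describes `LabelEncoder` as intended for target labels rather than input features.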

In [33]:
# OrdinalEncoder = rank-wise encoding (the categories have a meaningful order)
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    'size' : ['small','medium','large','medium','small','large']
})

In [34]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [35]:
# create an instance of OrdinalEncoder, passing the category order explicitly
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [37]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
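The `categories` argument is what makes the codes follow the intended rank. The following sketch compares the default behaviour with the explicit order; it assumes the same three sizes as above, and the names `default_codes` and `ranked_codes` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large']})

# Without `categories`, the encoder sorts the values alphabetically:
# large -> 0, medium -> 1, small -> 2, which inverts the real size order
default_codes = OrdinalEncoder().fit_transform(sizes).ravel().tolist()

# With an explicit order, the codes match the rank: small < medium < large
ranked_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked_encoder.fit_transform(sizes).ravel().tolist()

print(default_codes)
print(ranked_codes)
```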

# 3. Target Guided Ordinal Encoding

Target Guided Ordinal Encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category with a numerical value derived from the mean or median of the target variable for that category. Because the encoded values follow that statistic, the encoded feature has a monotonic relationship with the target, which can improve the predictive power of the model.

In [38]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [39]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [40]:
## calculate the mean price for each city
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [42]:
## replace each city with its mean price
df['city_encoded']=df['city'].map(mean_price)

In [43]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
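The cells above map each city directly to its mean price. A common variant, sketched here as an assumption rather than as part of the original notebook, sorts the categories by that mean and uses the integer rank as the encoded value. The names `city_rank` and the `city_rank` column are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320],
})

# Mean price per city, sorted from cheapest to most expensive
mean_price = df.groupby('city')['price'].mean().sort_values()

# The position of each city in the sorted order becomes its ordinal code
city_rank = {city: rank for rank, city in enumerate(mean_price.index)}

df['city_rank'] = df['city'].map(city_rank)
print(df)
```

With this data, London gets rank 0 and Paris rank 3, so the codes increase with the mean price of the city.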


In [50]:
import seaborn as sns
df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [51]:
# calculate the mean total bill for each day
mean_total_bill = df.groupby('day', observed=True)['total_bill'].mean().to_dict()

In [52]:
# replace each day with its mean total bill
df['day_encoded'] = df['day'].map(mean_total_bill)

In [53]:
df.head()


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,day_encoded
0,16.99,1.01,Female,No,Sun,Dinner,2,21.41
1,10.34,1.66,Male,No,Sun,Dinner,3,21.41
2,21.01,3.5,Male,No,Sun,Dinner,3,21.41
3,23.68,3.31,Male,No,Sun,Dinner,2,21.41
4,24.59,3.61,Female,No,Sun,Dinner,4,21.41
