### Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

#### Nominal/OHE Encoding
One Hot Encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which most machine learning algorithms require. Each category is represented as a binary vector with one position per unique category: the position for the row's category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1,0,0]
2. Green: [0,1,0]
3. Blue: [0,0,1]
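
To make the mechanics concrete, here is a minimal pure-Python sketch of how such vectors can be built. The helper name `one_hot` and the fixed category order (red, green, blue, as in the list above) are assumptions for illustration; library encoders may order the categories differently.

```python
# Category order is an assumption taken from the list above: red, green, blue.
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Build the vector for every category
encoded = {c: one_hot(c, categories) for c in categories}
print(encoded)
# {'red': [1, 0, 0], 'green': [0, 1, 0], 'blue': [0, 0, 1]}
```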

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple DataFrame
df = pd.DataFrame({
    'color': ['red','blue', 'green','green', 'red', 'blue']
})

In [3]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [4]:
## Create an instance of One Hot Encoder
encoder = OneHotEncoder()

In [5]:
## Perform fit and transform
encoded_values = encoder.fit_transform(df[['color']]).toarray()

In [6]:
## Build a DataFrame from the encoded array, using the encoder's feature names as column names
encoded_df = pd.DataFrame(encoded_values, columns=encoder.get_feature_names_out())

In [7]:
encoded_df.head()

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0


In [8]:
df = pd.concat([df,encoded_df],axis=1)

In [9]:
df.head()

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0


### Practicing OHE on the tips Dataset

In [10]:
import seaborn as sns
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [11]:
df['sex'].value_counts()

sex
Male      157
Female     87
Name: count, dtype: int64

In [12]:
## Encode each categorical column in turn, starting with `sex`
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_values_sex = encoder.fit_transform(df[['sex']]).toarray()
sex_encoded_df = pd.DataFrame(encoded_values_sex,columns=encoder.get_feature_names_out())
sex_encoded_df.head()


Unnamed: 0,sex_Female,sex_Male
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [13]:
## Creating the dataframe for each of the encoded values
encoded_values_smoker = encoder.fit_transform(df[['smoker']]).toarray()
smoker_encoded_df = pd.DataFrame(encoded_values_smoker,columns=encoder.get_feature_names_out())
smoker_encoded_df.head()

Unnamed: 0,smoker_No,smoker_Yes
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


In [14]:
encoded_values_day = encoder.fit_transform(df[['day']]).toarray()
day_encoded_df = pd.DataFrame(encoded_values_day,columns=encoder.get_feature_names_out())
day_encoded_df.head()

Unnamed: 0,day_Fri,day_Sat,day_Sun,day_Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0


In [15]:
encoded_values_time = encoder.fit_transform(df[['time']]).toarray()
time_encoded_df = pd.DataFrame(encoded_values_time,columns=encoder.get_feature_names_out())
time_encoded_df.head()

Unnamed: 0,time_Dinner,time_Lunch
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


In [16]:
## Concatenating the Encoded features in the original DF
df = pd.concat([df,sex_encoded_df,smoker_encoded_df,day_encoded_df,time_encoded_df],axis=1)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [17]:
df.drop(columns=df.select_dtypes(include=['object', 'category']).columns,inplace=True)

In [18]:
df.head()

Unnamed: 0,total_bill,tip,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.5,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


### Label Encoding
Label encoding and ordinal encoding are two techniques for encoding categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories. For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows (an illustrative assignment; scikit-learn's `LabelEncoder`, used in the cells that follow, numbers the sorted categories starting from 0):

1. Red: 1
2. Green: 2
3. Blue: 3
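
As a rough illustration of the alphabetical scheme, the sketch below assigns labels by sorting the distinct categories and numbering them from 0. This is a hand-rolled approximation for intuition, not the library's implementation; the variable names are assumptions.

```python
colors = ["red", "blue", "green", "green", "red", "blue"]

# Sort the distinct categories, then number them starting at 0
classes = sorted(set(colors))
label_of = {category: index for index, category in enumerate(classes)}

labels = [label_of[c] for c in colors]
print(classes)  # ['blue', 'green', 'red']
print(labels)   # [2, 0, 1, 1, 2, 0]
```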

In [19]:
from sklearn.preprocessing import LabelEncoder
## Create the instance for the LabelEncoder
lbl_encoder = LabelEncoder()

In [20]:
df = pd.DataFrame({
    'color': ['red','blue', 'green','green', 'red', 'blue']
})

In [21]:
## LabelEncoder expects a 1-D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [22]:
lbl_encoder.transform(['red'])

array([2])

In [23]:
lbl_encoder.transform(['blue'])

array([0])
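
To see which integer belongs to which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps integers back to the original labels. A short self-contained sketch (the `mapping` and `decoded` names are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(["red", "blue", "green", "green", "red", "blue"])

# classes_ holds the sorted categories; the position of each one is its label
mapping = {str(c): int(i) for i, c in enumerate(lbl_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform turns integer labels back into category names
decoded = lbl_encoder.inverse_transform([2, 0])
print(list(decoded))
```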

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be represented with ordinal encoding as follows:

1. High School: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
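
The mapping above can be sketched with an ordered pandas `Categorical`, whose integer codes follow the category order you supply. The sample values in `education` are hypothetical, and 1 is added to the zero-based codes only to match the numbering in the list.

```python
import pandas as pd

# Order taken from the list above, lowest to highest
levels = ["High School", "College", "Graduate", "Post-graduate"]

# Hypothetical sample column
education = pd.Series(["College", "High School", "Post-graduate", "Graduate", "College"])

# codes are zero-based positions within `levels`; add 1 to match the list
cat = pd.Categorical(education, categories=levels, ordered=True)
education_encoded = pd.Series(cat.codes + 1, index=education.index)

print(education_encoded.tolist())  # [2, 1, 4, 3, 2]
```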

In [26]:
## Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    "size":['small','medium','large','medium','small','large']
})

In [28]:
df.head(6)

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [29]:
## Create an instance of the ordinal encoder with an explicit category order, then perform fit_transform
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [30]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
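
The explicit `categories` argument matters: without it, `OrdinalEncoder` sorts the categories, which for strings means alphabetical order rather than the intended small < medium < large. The sketch below compares both; `df_size`, `default_codes`, and `ordered_codes` are names chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_size = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small", "large"]
})

# Default: categories are sorted, so large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(df_size[["size"]]).ravel()

# Explicit order: small=0, medium=1, large=2
ordered_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_encoder.fit_transform(df_size[["size"]]).ravel()

print(default_codes.tolist())  # [2.0, 1.0, 0.0, 1.0, 2.0, 0.0]
print(ordered_codes.tolist())  # [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]
```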

### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, each category in the categorical variable is replaced with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of the model.
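
The choice between mean and median can change the resulting order when a category contains outliers. The sketch below uses hypothetical data (groups `A` and `B` with made-up target values) to show one such case; it is an illustration, not a recommendation for either statistic.

```python
import pandas as pd

# Hypothetical data: group A contains one large outlier (100)
data = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "target": [10, 12, 100, 20, 22, 24],
})

mean_map = data.groupby("group")["target"].mean().to_dict()
median_map = data.groupby("group")["target"].median().to_dict()

print(mean_map)    # A is pulled above B by the outlier
print(median_map)  # {'A': 12.0, 'B': 22.0}

# Encode the column with the median-based mapping
data["group_encoded"] = data["group"].map(median_map)
```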

In [31]:
df = pd.DataFrame({
    "city":['New York','London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    "price": [200,150,300,250,180,320]
})

In [32]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [34]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [35]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [36]:
df['city_encoded'] = df['city'].map(mean_price)

In [37]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [38]:
df[['city_encoded','price']]

Unnamed: 0,city_encoded,price
0,190.0,200
1,150.0,150
2,310.0,300
3,250.0,250
4,190.0,180
5,310.0,320
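
A common variant, shown here as a sketch rather than as the method used in this notebook, replaces each category with its rank after sorting the categories by mean target, which yields small integers instead of the raw means. The column name `city_rank` is an assumption; the data is the same city/price example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Sort cities by mean price, lowest first
mean_price = df.groupby("city")["price"].mean().sort_values()

# Assign rank 0 to the cheapest city, 1 to the next, and so on
rank_map = {city: rank for rank, city in enumerate(mean_price.index)}
df["city_rank"] = df["city"].map(rank_map)

print(rank_map)  # {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}
print(df["city_rank"].tolist())  # [1, 0, 3, 2, 1, 3]
```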


#### Practice on tips dataset

In [39]:
import seaborn as sns

In [40]:
df = sns.load_dataset("tips")

In [41]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [44]:
## Task: Perform target guided ordinal encoding of `time`, using `total_bill` as the target
## `time` is a categorical column; observed=False is passed explicitly to keep the current grouping behavior
mean_total_bill = df.groupby('time', observed=False)['total_bill'].mean().to_dict()


In [46]:
mean_total_bill

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [47]:
df['encoded_time']  = df['time'].map(mean_total_bill)
df.drop(columns=['time'],inplace=True)

In [48]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,size,encoded_time
0,16.99,1.01,Female,No,Sun,2,20.797159
1,10.34,1.66,Male,No,Sun,3,20.797159
2,21.01,3.5,Male,No,Sun,3,20.797159
3,23.68,3.31,Male,No,Sun,2,20.797159
4,24.59,3.61,Female,No,Sun,4,20.797159
