

## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### Nominal/OHE Encoding
One-hot encoding (OHE) is a technique for representing nominal categorical data, meaning categories with no inherent order, as numerical data that machine learning algorithms can work with. Each category is represented as a binary vector in which exactly one bit is 1, and each position in the vector corresponds to one unique category. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
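
The mapping above can be reproduced by hand before reaching for a library. This is a minimal sketch in plain Python; the helper name `one_hot` is hypothetical and exists only for illustration.

```python
# Hypothetical helper for illustration; not part of any library.
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# The order of this list decides which bit stands for which category.
categories = ["red", "green", "blue"]
vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
```

Each vector contains a single 1, so no ordering between the colors is implied.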



In [73]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [74]:
# Create a simple single-column DataFrame
df=pd.DataFrame({
    'color':['red','blue','green','black','yellow']
})

In [75]:
df.head()

,color
0,red
1,blue
2,green
3,black
4,yellow


In [76]:
# Create an instance of OneHotEncoder

In [77]:
encoder=OneHotEncoder()

In [78]:
# Fit the encoder and transform the column; toarray() converts the sparse result to a dense array
encoder.fit_transform(df[['color']]).toarray()

array([[0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])
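
The columns of the array above follow the sorted order of the categories, not the order in which they appear in the data. A short sketch, rebuilding the same small DataFrame so it runs on its own, inspects the fitted `categories_` attribute to see which column belongs to which color.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "black", "yellow"]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()

# categories_ holds one array per input column; string categories are sorted
column_order = list(encoder.categories_[0])
print(column_order)

# The first row is 'red', so its 1 sits at the index of 'red' in column_order
red_position = column_order.index("red")
print(red_position, encoded[0])
```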

In [79]:
encoded=encoder.fit_transform(df[['color']]).toarray()

In [80]:
encoded_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [81]:
encoded_df

,color_black,color_blue,color_green,color_red,color_yellow
0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0
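
A one-hot matrix can also be mapped back to the original labels, which is a handy sanity check on the encoding. The sketch below assumes the same five-color DataFrame and uses the encoder's `inverse_transform` method.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "black", "yellow"]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()

# inverse_transform returns a 2-D array with one column per original feature
recovered = encoder.inverse_transform(encoded)
recovered_colors = recovered.ravel().tolist()
print(recovered_colors)
```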


In [82]:
# Encode new data; pass a DataFrame with the same column name used during fit
encoder.transform(pd.DataFrame({'color':['blue']})).toarray()



array([[0., 1., 0., 0., 0.]])
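
New data may contain a category the encoder never saw during fitting. With default settings `OneHotEncoder` raises an error in that case; passing `handle_unknown="ignore"` encodes the unseen value as a row of zeros instead. The sketch below assumes "purple" as a hypothetical unseen color.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "black", "yellow"]})
unseen = pd.DataFrame({"color": ["purple"]})  # hypothetical unseen category

# Default behaviour (handle_unknown="error"): transform raises ValueError
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True
print("default encoder raised ValueError:", raised)

# With handle_unknown="ignore" the unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train)
unknown_row = lenient.transform(unseen).toarray()
print(unknown_row)
```

Whether silently zeroing unknown categories is acceptable depends on the model and the data, so treat this as one option rather than a default.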

In [83]:
pd.concat([df,encoded_df],axis=1)

,color,color_black,color_blue,color_green,color_red,color_yellow
0,red,0.0,0.0,0.0,1.0,0.0
1,blue,0.0,1.0,0.0,0.0,0.0
2,green,0.0,0.0,1.0,0.0,0.0
3,black,1.0,0.0,0.0,0.0,0.0
4,yellow,0.0,0.0,0.0,0.0,1.0


In [84]:
# Apply the same steps to the tips dataset
import seaborn as sns

In [85]:
df=sns.load_dataset('tips')

In [86]:
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [87]:
df

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [88]:
days=pd.DataFrame(df['day'])


In [89]:
encoder=OneHotEncoder()

In [90]:
encoded=encoder.fit_transform(days[['day']]).toarray()

In [91]:
encoded

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...])


In [92]:
len(encoded)

244

In [93]:
encoded_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [94]:
encoded_df

,day_Fri,day_Sat,day_Sun,day_Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
239,0.0,1.0,0.0,0.0
240,0.0,1.0,0.0,0.0
241,0.0,1.0,0.0,0.0
242,0.0,1.0,0.0,0.0


In [95]:
pd.concat([days,encoded_df],axis=1)

,day,day_Fri,day_Sat,day_Sun,day_Thur
0,Sun,0.0,0.0,1.0,0.0
1,Sun,0.0,0.0,1.0,0.0
2,Sun,0.0,0.0,1.0,0.0
3,Sun,0.0,0.0,1.0,0.0
4,Sun,0.0,0.0,1.0,0.0
...,...,...,...,...,...
239,Sat,0.0,1.0,0.0,0.0
240,Sat,0.0,1.0,0.0,0.0
241,Sat,0.0,1.0,0.0,0.0
242,Sat,0.0,1.0,0.0,0.0
