#### Data Encoding 
1. Nominal / One-Hot Encoding (OHE):
- One-hot encoding, also known as nominal encoding, is a technique for representing categorical data as numerical data, which is more suitable for ML algorithms. In this technique, each category is represented as a binary vector in which each bit corresponds to one unique category. For example, a categorical variable 'color' with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red:[1,0,0]
2. Green:[0,1,0]
3. Blue:[0,0,1]
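
The mapping above can be reproduced by hand before turning to a library. The following is a minimal sketch in plain Python; the helper `one_hot` and the fixed category order are assumptions made for illustration and are not part of scikit-learn.

```python
# Hand-rolled one-hot encoding, for illustration only (not scikit-learn).
categories = ["red", "green", "blue"]  # category order taken from the example above


def one_hot(value, categories):
    """Return a binary list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Build the vector for every category in the example.
vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
```

Note that the position of the 1 depends entirely on the chosen category order, which is why a library encoder may produce the columns in a different order than this list.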

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder 

In [6]:
## Create a simple dataframe

df = pd.DataFrame({'color':['red','blue','green','red','blue']})
df

Unnamed: 0,color
0,red
1,blue
2,green
3,red
4,blue


In [8]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [11]:
## Perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()  # columns follow the sorted (alphabetical) category order

In [13]:
## Wrap the encoded array in a DataFrame with readable column names
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [14]:
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0


In [16]:
## for new data
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])
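
Transforming new data works only for categories the encoder saw during fitting; with the default settings an unseen category raises an error. The sketch below shows one way to tolerate unseen values using the `handle_unknown="ignore"` option of `OneHotEncoder`. The category "purple" and the variable names are hypothetical and chosen only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue"]})

# handle_unknown="ignore" encodes categories not seen during fit as an all-zero row
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["color"]])

# "purple" is a hypothetical category that was not present in the training data
new_data = pd.DataFrame({"color": ["blue", "purple"]})
encoded_new = tolerant_encoder.transform(new_data).toarray()
print(encoded_new)
```

The first row matches the encoding of "blue", while the unseen "purple" row contains only zeros.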

In [17]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,red,0.0,0.0,1.0
4,blue,1.0,0.0,0.0
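
For quick exploration, pandas offers a shortcut that gives a similar table without a separate encoder object. This is a sketch of that alternative using `pd.get_dummies`, not the approach used in this notebook; unlike a fitted `OneHotEncoder`, it does not remember the categories for later use on new data.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(colors["color"], prefix="color", dtype=int)

# Place the indicator columns next to the original column
combined = pd.concat([colors, dummies], axis=1)
print(combined)
```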


In [20]:
import seaborn as sns
data=sns.load_dataset('tips')
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [None]:
# fit and transform
encoded_tips = encoder.fit_transform(data[['day']]).toarray()
encoded_tips

In [28]:
encoder_df_data = pd.DataFrame(encoded_tips,columns=encoder.get_feature_names_out())
encoder_df_data

Unnamed: 0,day_Fri,day_Sat,day_Sun,day_Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
239,0.0,1.0,0.0,0.0
240,0.0,1.0,0.0,0.0
241,0.0,1.0,0.0,0.0
242,0.0,1.0,0.0,0.0


In [31]:
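## Join the encoded columns to the tips data (not to the earlier color dataframe `df`)
pd.concat([data,encoder_df_data],axis=1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,day_Fri,day_Sat,day_Sun,day_Thur
0,16.99,1.01,Female,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,0.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,0.0,0.0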


2. Label Encoding:

- Label encoding and ordinal encoding are two techniques for encoding categorical data as numerical data.
- Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories. scikit-learn's LabelEncoder sorts the categories and numbers them from 0, so for the colors used here it produces blue = 0, green = 1, red = 2.
- For example, for a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

    1. Red: 1
    2. Green: 2
    3. Blue: 3


In [32]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,red
4,blue


In [33]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()


In [34]:
label_encoder.fit_transform(df['color'])  # pass a 1-D Series; a 2-D DataFrame triggers a DataConversionWarning


array([2, 0, 1, 2, 0])

In [37]:
label_encoder.transform(['red'])  # 1-D list of labels avoids the DataConversionWarning


array([2])
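
To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below uses the `classes_` attribute and `inverse_transform` of `LabelEncoder`; the variable names (`le`, `mapping`, `decoded`) are chosen for this example only.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "red", "blue"]

le = LabelEncoder()
codes = le.fit_transform(colors)  # 1-D input, as LabelEncoder expects

# classes_ holds the sorted categories; the position of each one is its label
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original categories from the integer labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted, the integers carry no meaningful order, which is the main difference from ordinal encoding.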

3. Ordinal Encoding:

It is used to encode categorical data that has an intrinsic order or ranking.
In this technique, each category is assigned a unique number based on its position in the order.
For example, a categorical variable 'education' with four possible values ('high school', 'college', 'graduate', 'post graduate') can be encoded as follows:

    1. High school: 1
    2. College: 2
    3. Graduate: 3
    4. Post graduate: 4
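
The education ranking above can be sketched with scikit-learn by passing the order explicitly through the `categories` parameter of `OrdinalEncoder`. The sample rows and the column name `education_code` are made up for illustration. Note that scikit-learn numbers the categories from 0, so the codes are shifted by one relative to the list above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit ranking, lowest to highest (taken from the list above)
education_order = ["high school", "college", "graduate", "post graduate"]

# Hypothetical sample data for illustration
edu = pd.DataFrame(
    {"education": ["college", "high school", "post graduate", "graduate"]}
)

edu_encoder = OrdinalEncoder(categories=[education_order])
# fit_transform returns a 2-D array; ravel() flattens it into a single column
edu["education_code"] = edu_encoder.fit_transform(edu[["education"]]).ravel()
print(edu)
```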

In [38]:
## ordinal encoding

from sklearn.preprocessing import OrdinalEncoder


In [39]:
## create a sample dataframe with an ordinal variable

df = pd.DataFrame({
    'size':['small', 'medium', 'large', 'small', 'medium', 'large']
})

In [None]:
## Create an instance of OrdinalEncoder 
ordinal_encoder = OrdinalEncoder(categories=[['small','medium','large']])
ordinal_encoder

In [None]:
## Now fit_transform
ordinal_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [0.],
       [1.],
       [2.]])

In [44]:
ordinal_encoder.transform([['small']])



array([[0.]])
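
With the default settings, transforming a category that is not in the declared order raises an error. The following sketch shows one way to map such values to a sentinel instead, using the `handle_unknown="use_encoded_value"` and `unknown_value` options of `OrdinalEncoder` (available in reasonably recent scikit-learn versions). The category "extra large" and the sentinel -1 are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large"]})

# Unknown categories are encoded as -1 instead of raising an error
safe_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
safe_encoder.fit(sizes[["size"]])

# "extra large" is a hypothetical category that is not in the declared order
new_sizes = pd.DataFrame({"size": ["large", "extra large"]})
encoded_sizes = safe_encoder.transform(new_sizes)
print(encoded_sizes)
```

A sentinel outside the range 0 to 2 keeps unknown values distinguishable from the real ranks, though a downstream model will still treat -1 as a number below "small".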