## Data Encoding
- #### Convert categorical data to numerical values

  1. Nominal/One Hot Encoding
  2. Label and Ordinal Encoding
  3. Target Guided Ordinal Encoding

### Nominal/One-Hot Encoding

One-hot encoding, also known as nominal encoding, is a technique for representing categorical data as numerical data, which most machine learning algorithms require. Each category is represented as a binary vector with one position per unique category: the position for the observed category is 1 and all other positions are 0.

For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using one hot encoding as follows:

1. Red: [1,0,0]
2. Green: [0,1,0]
3. Blue: [0,0,1]
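
To make the mapping above concrete, here is a minimal sketch in plain Python that builds the same vectors by hand. The helper name `one_hot` and the fixed category order are assumptions made for illustration; library encoders such as scikit-learn's choose their own column order.

```python
# Category order chosen to match the example above (an assumption for this sketch).
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

vectors = {c: one_hot(c, categories) for c in categories}
print(vectors)
```

Each vector has exactly one 1, so the length of the vector always equals the number of categories.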

#### Disadvantages:
- If we have 100 categories, then we will need to create 100 new features.
- The encoded matrix is mostly 0s with a single 1 per row, which is called a sparse matrix. With many categories, the resulting high-dimensional sparse features can contribute to overfitting.
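
The following sketch illustrates both points on a made-up column with 100 distinct categories. The column name `category`, the labels `cat_000` to `cat_099`, and the variable names are hypothetical. It only measures the width and density of the encoded matrix; it does not demonstrate overfitting itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 100 distinct categories, 10 rows each.
labels = [f"cat_{i:03d}" for i in range(100)]
wide_df = pd.DataFrame({"category": np.repeat(labels, 10)})

# By default OneHotEncoder returns a SciPy sparse matrix.
wide_encoder = OneHotEncoder()
wide_encoded = wide_encoder.fit_transform(wide_df[["category"]])

n_rows, n_cols = wide_encoded.shape
# Fraction of entries that are non-zero.
density = wide_encoded.nnz / (n_rows * n_cols)
print(n_rows, n_cols)
print(density)
```

One column becomes 100 columns, and only one entry per row is non-zero, which is why the encoder stores the result in sparse form unless `.toarray()` is called.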


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple df

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [3]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [4]:
## Create an instance of OneHotEncoder

encoder = OneHotEncoder()

In [15]:
## perform fit and transform

encoder.fit_transform(df[['color']]).toarray()  # the encoder expects 2D input, so select with df[['color']], not df['color']

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [9]:
# The output columns are ordered alphabetically by category: [blue, green, red]

In [10]:
encoded = encoder.fit_transform(df[['color']]).toarray() 

In [11]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [13]:
encoder_df.head()

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |


In [16]:
# New data must contain only categories seen during fit (red, blue, or green); with the default settings an unseen category raises an error

In [17]:
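encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()  # pass a DataFrame with the fitted column name to avoid a feature-name warning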



array([[1., 0., 0.]])
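
The restriction to known categories comes from the encoder's default `handle_unknown='error'` setting. The sketch below contrasts that default with `handle_unknown='ignore'`, which encodes an unseen category as an all-zero row. The category `yellow` and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})
unseen = pd.DataFrame({"color": ["yellow"]})  # hypothetical category not seen during fit

# Default behaviour: an unseen category raises ValueError at transform time.
strict_encoder = OneHotEncoder()
strict_encoder.fit(train[["color"]])
try:
    strict_encoder.transform(unseen[["color"]])
    raised = False
except ValueError:
    raised = True

# With handle_unknown="ignore", the unseen category becomes a row of zeros.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["color"]])
unseen_encoded = tolerant_encoder.transform(unseen[["color"]]).toarray()

print(raised)
print(unseen_encoded)
```

Whether to raise or to ignore depends on the use case: raising surfaces data problems early, while ignoring keeps a deployed pipeline running at the cost of silently dropping information about the new category.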

In [19]:
pd.concat([df, encoder_df], axis=1)

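|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |
| 5 | blue  | 1.0        | 0.0         | 0.0       |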


#### Internal Assignment

In [22]:
import seaborn as sns
tips_df = sns.load_dataset('tips')
tips_df.head()

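|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |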


In [29]:
encoder = OneHotEncoder()
encoder_tips = encoder.fit_transform(tips_df[['sex', 'smoker','day','time']]).toarray()
print(encoder_tips)

[[1. 0. 1. ... 0. 1. 0.]
 [0. 1. 1. ... 0. 1. 0.]
 [0. 1. 1. ... 0. 1. 0.]
 ...
 [0. 1. 0. ... 0. 1. 0.]
 [0. 1. 1. ... 0. 1. 0.]
 [1. 0. 1. ... 1. 1. 0.]]


In [30]:
encoded_tips_df = pd.DataFrame(encoder_tips, columns=encoder.get_feature_names_out())
encoded_tips_df.head()

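|   | sex_Female | sex_Male | smoker_No | smoker_Yes | day_Fri | day_Sat | day_Sun | day_Thur | time_Dinner | time_Lunch |
|---|------------|----------|-----------|------------|---------|---------|---------|----------|-------------|------------|
| 0 | 1.0        | 0.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 1 | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 2 | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 3 | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 4 | 1.0        | 0.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |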
