## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which most machine learning algorithms require. Each category is represented as a binary vector with one position per unique category: the position for the observed category is 1 and all other positions are 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
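
The mapping above can be reproduced by hand before reaching for a library. The following is a minimal sketch in plain Python; the helper name `one_hot` and the category order are assumptions made for illustration, not part of pandas or scikit-learn.

```python
# Hypothetical helper for illustration only; not part of pandas or scikit-learn.
def one_hot(value, categories):
    """Return a binary list with a 1 at the position of `value` in `categories`."""
    return [1 if value == category else 0 for category in categories]

# Category order is an assumption chosen to match the list above.
categories = ["red", "green", "blue"]
vectors = {color: one_hot(color, categories) for color in categories}

for color, vector in vectors.items():
    print(color, vector)
```

Each vector has exactly one 1, so its length always equals the number of distinct categories.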

In [None]:
## One-hot encoding creates a sparse matrix of 0s and 1s. With many categories, the large number of columns can lead to overfitting when training a model.
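
To make the sparsity point concrete, here is a small sketch that one-hot encodes a column with many distinct values. The column name `item_id` and the 1000 made-up values are assumptions for illustration only.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1000 rows, each with a distinct category.
df_wide = pd.DataFrame({"item_id": [f"item_{i}" for i in range(1000)]})

# By default OneHotEncoder returns a SciPy sparse matrix rather than a dense array.
encoded_wide = OneHotEncoder().fit_transform(df_wide[["item_id"]])

print(sparse.issparse(encoded_wide))  # whether the result is stored sparsely
print(encoded_wide.shape)             # one column per distinct category
print(encoded_wide.nnz)               # number of non-zero entries (one per row)
```

A single column turns into as many columns as there are categories, with only one non-zero entry per row, which is why high-cardinality features are usually handled with care.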

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [4]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [6]:
df.color.unique()

array(['red', 'blue', 'green'], dtype=object)

In [8]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [9]:
# Perform fit and transform

encoder.fit_transform(df[['color']]).toarray()

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [11]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [12]:
# Wrap the encoded array in a DataFrame, using the encoder's feature names as column names
encoder_df = pd.DataFrame(encoded, columns = encoder.get_feature_names_out())

In [13]:
encoder_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


## For new data

In [14]:
# Pass a DataFrame with the same column name used during fit to avoid the feature-names warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()

array([[1., 0., 0.]])
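
New data may also contain a category the encoder never saw during fitting. With default settings `OneHotEncoder` raises an error in that case. The sketch below uses the `handle_unknown="ignore"` option instead; the value `'purple'` and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["color"]])

# 'purple' is a made-up category that was not present during fit.
new_data = pd.DataFrame({"color": ["blue", "purple"]})
encoded_new = tolerant_encoder.transform(new_data).toarray()
print(encoded_new)
```

The known value `blue` keeps its usual vector, while the unseen value becomes a row of zeros.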

In [17]:
pd.concat([df, encoder_df], axis = 1)

,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or based on category frequency; scikit-learn's LabelEncoder sorts the categories and numbers them starting from 0. For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows:

1. Red: 1
2. Green: 2
3. Blue: 3

In [18]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [20]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [21]:
# LabelEncoder expects a 1D array-like, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [23]:
## For new values

In [24]:
lbl_encoder.transform(['red'])

array([2])

In [25]:
lbl_encoder.transform(['blue'])

array([0])

In [26]:
lbl_encoder.transform(['green'])

array([1])
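
The integers returned above follow the sorted order of the categories. As a small sketch, the fitted encoder's `classes_` attribute shows that order, and `inverse_transform` maps integers back to the original labels. The variable names here are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

color_encoder = LabelEncoder()
codes = color_encoder.fit_transform(colors)

# classes_ lists the categories in sorted order; a label's position is its integer code.
print(color_encoder.classes_)

# inverse_transform recovers the original labels from integer codes.
decoded = color_encoder.inverse_transform([2, 0, 1])
print(decoded)
```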

#### Problem: a machine learning model may treat these labels as ordered (red > green > blue), even though colors have no intrinsic order

### Ordinal Encoding
Ordinal encoding is used to encode categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be represented using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
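
The education-level mapping above can be applied directly with a dictionary and pandas. This is a minimal sketch; the DataFrame `people` and its sample rows are made up for illustration.

```python
import pandas as pd

# Explicit rank for each level, matching the list above.
education_order = {
    "high school": 1,
    "college": 2,
    "graduate": 3,
    "post-graduate": 4,
}

# Hypothetical sample data.
people = pd.DataFrame({
    "education": ["college", "high school", "post-graduate", "graduate"]
})

# Series.map replaces each category with its rank from the dictionary.
people["education_encoded"] = people["education"].map(education_order)
print(people)
```

Writing the order out explicitly matters here: sorting the labels alphabetically would not give the intended ranking.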

In [29]:
from sklearn.preprocessing import OrdinalEncoder

In [30]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [31]:
df

,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [32]:
# Create an instance of OrdinalEncoder, listing the categories from lowest to highest rank

encoder = OrdinalEncoder(categories = [['small','medium','large']])

In [38]:
encoded = encoder.fit_transform(df[['size']])
encoded

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [34]:
# Pass a DataFrame with the same column name used during fit to avoid the feature-names warning
encoder.transform(pd.DataFrame({'size': ['small']}))

array([[0.]])

In [44]:
encoded_df = pd.DataFrame(encoded, columns = ['size_encoded'])

In [45]:
encoded_df

,size_encoded
0,0.0
1,1.0
2,2.0
3,1.0
4,0.0
5,2.0


In [47]:
pd.concat([df, encoded_df], axis = 1)

,size,size_encoded
0,small,0.0
1,medium,1.0
2,large,2.0
3,medium,1.0
4,small,0.0
5,large,2.0
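
As with one-hot encoding, new data can contain a size that was not listed when the encoder was created. A possible way to handle this, sketched below, is the `handle_unknown="use_encoded_value"` option of `OrdinalEncoder` together with `unknown_value`. The value `'extra-large'`, the choice of `-1`, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small", "large"]
})

# Unseen categories are encoded as -1 instead of raising an error.
size_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
size_encoder.fit(sizes[["size"]])

# 'extra-large' is a made-up category that the encoder has not seen.
new_sizes = pd.DataFrame({"size": ["large", "extra-large"]})
encoded_sizes = size_encoder.transform(new_sizes)
print(encoded_sizes)
```

The sentinel `-1` falls outside the valid range 0 to 2, so unknown sizes remain distinguishable from the known ones downstream.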
