## Data Encoding

1. Nominal/OHE Encoding

2. Label and Ordinal Encoding

3. Target Guided Ordinal Encoding


### Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data so that machine learning algorithms can consume it. Each category becomes a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]

2. Green: [0, 1, 0]

3. Blue: [0, 0, 1]
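
The mapping listed above can be reproduced by hand before reaching for a library. The following is a minimal sketch using NumPy; the helper name `one_hot` and the fixed category order are assumptions made for illustration only.

```python
import numpy as np

# Illustrative category order (an assumption; any fixed order works)
categories = ["red", "green", "blue"]
index = {name: i for i, name in enumerate(categories)}


def one_hot(value):
    """Return a binary vector with a single 1 at the category's position."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec


for name in categories:
    print(name, one_hot(name).tolist())
```

The order of the positions depends entirely on the order chosen for `categories`, so a different order yields different (but equally valid) vectors.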




In [702]:
import numpy   as np
import pandas  as pd
import seaborn as sns
from   sklearn.preprocessing import OneHotEncoder

In [703]:
## Create a DataFrame

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [704]:
df

   color
0    red
1   blue
2  green
3  green
4    red
5   blue


In [705]:
### One-hot encode the 'color' column to convert the labels into numbers

### Create an instance of the encoder

encoder = OneHotEncoder()


### Fit the encoder and transform the categories into a dense 0/1 array
### (fit_transform returns a sparse matrix by default, hence .toarray())

encoded = encoder.fit_transform(df[['color']]).toarray()

In [706]:
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [707]:
# Build a DataFrame from the encoded array, using the column names generated by the encoder

df_new = pd.DataFrame(encoded, columns = encoder.get_feature_names_out())

In [708]:
df_new

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
5         1.0          0.0        0.0
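
In the output above, red maps to the last column rather than the first, because scikit-learn orders the categories alphabetically (blue, green, red). The sketch below shows one way to inspect that order and to map the encoded rows back to labels; it rebuilds the same sample data so it can run on its own, and the variable name `decoded` is introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']]).toarray()

# categories_ holds one array per input column; its order is the column order
print(encoder.categories_[0])

# inverse_transform maps the one-hot rows back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```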


In [709]:
### Another Example

df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})


print(df)

     size
0   small
1  medium
2   large
3  medium
4   small
5   large


In [710]:
df['size']

0     small
1    medium
2     large
3    medium
4     small
5     large
Name: size, dtype: object

In [711]:
### Perform one-hot encoding on the 'size' column of the DataFrame

from sklearn.preprocessing import OneHotEncoder


encoder = OneHotEncoder()

In [712]:
ans = encoder.fit_transform(df[['size']]).toarray()

In [713]:
### Convert the encoded array into a DataFrame

df = pd.DataFrame(ans, columns = encoder.get_feature_names_out())

In [714]:
df

   size_large  size_medium  size_small
0         0.0          0.0         1.0
1         0.0          1.0         0.0
2         1.0          0.0         0.0
3         0.0          1.0         0.0
4         0.0          0.0         1.0
5         1.0          0.0         0.0


### Label Encoding 

Label encoding and ordinal encoding are two techniques for encoding categorical data as integers.

Label encoding assigns a unique integer label to each category in the variable. Labels are commonly assigned in alphabetical order or by category frequency; scikit-learn's LabelEncoder, used below, sorts the categories and numbers them starting from 0. For example, a categorical variable "color" with three possible values (red, green, blue) is label encoded as follows:

1. Blue: 0
2. Green: 1
3. Red: 2
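
To make the idea of sorted, zero-based labels concrete, here is a minimal sketch that builds such labels with NumPy alone. It illustrates the concept and is not meant as a description of how any library implements it; the names `classes` and `codes` are chosen for this example.

```python
import numpy as np

colors = np.array(['red', 'blue', 'green', 'green', 'red', 'blue'])

# np.unique returns the sorted unique values and, with return_inverse=True,
# the index of each original element within that sorted array
classes, codes = np.unique(colors, return_inverse=True)

print(classes)
print(codes)
```

Each entry of `codes` is the position of the corresponding color in the sorted `classes` array, which is exactly a label encoding in alphabetical order.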

In [715]:
from sklearn.preprocessing import LabelEncoder       

In [716]:
### object for Label Encoder

label = LabelEncoder()

In [717]:
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [718]:
### Perform label encoding on the 'color' column
### (LabelEncoder expects a 1-D input, so pass the Series df['color'] rather than the DataFrame df[['color']])

ans = label.fit_transform(df['color'])


In [719]:
ans

array([2, 0, 1, 1, 2, 0])

In [720]:
label.transform(['red'])

array([2])

In [721]:
label.transform(['blue'])

array([0])

In [722]:
label.transform(['green'])

array([1])

### OBSERVATIONS:


1. After performing label encoding on the above dataset, each color has been assigned its own integer label, in alphabetical order:

      1.   blue ----------->  0

      2.   green ---------->  1

      3.   red  ----------->  2
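
The mapping observed above does not have to be discovered by transforming one value at a time. A fitted encoder exposes it through its `classes_` attribute, and `inverse_transform` converts integers back to category names. The following is a small sketch on the same sample data; the `mapping` dictionary is built here purely for display.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

label = LabelEncoder()
codes = label.fit_transform(df['color'])

# classes_ lists the categories; the position of each one is its integer label
mapping = {category: code for code, category in enumerate(label.classes_)}
print(mapping)

# inverse_transform converts integer labels back to the original categories
restored = label.inverse_transform(codes)
print(restored)
```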

### Ordinal Encoding

Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1

2. College: 2

3. Graduate: 3

4. Post-graduate: 4
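
The ranking listed above can be applied directly with a dictionary and pandas, which makes the chosen order explicit. This is a minimal sketch; the sample rows, the DataFrame name `df_edu`, and the column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_edu = pd.DataFrame({
    'education': ['college', 'high school', 'post-graduate', 'graduate', 'college']
})

# Explicit rank for each level, from lowest to highest
education_rank = {
    'high school': 1,
    'college': 2,
    'graduate': 3,
    'post-graduate': 4,
}

# Series.map looks up each value in the dictionary
df_edu['education_encoded'] = df_edu['education'].map(education_rank)
print(df_edu)
```

Any value missing from the dictionary would become NaN after `map`, which makes unexpected categories easy to spot.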

In [723]:
from sklearn.preprocessing import OrdinalEncoder

In [724]:
ordinal = OrdinalEncoder()

In [725]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [726]:
ordinal.fit_transform(df[['size']])

array([[2.],
       [1.],
       [0.],
       [1.],
       [2.],
       [0.]])

In [727]:
ordinal.transform([['small']])



array([[2.]])

In [728]:
ordinal.transform([['medium']])



array([[1.]])

In [729]:
ordinal.transform([['large']])



array([[0.]])
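
In the cells above, the encoder was created without a category order, so scikit-learn sorted the values alphabetically (large = 0, medium = 1, small = 2), which does not follow the natural small < medium < large ranking. One way to impose the intended order is the `categories` parameter of `OrdinalEncoder`. The sketch below assumes that ranking; the names `df_size` and `ordered` are chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_size = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

# One list of categories per input column, given from lowest to highest rank
ordered = OrdinalEncoder(categories=[['small', 'medium', 'large']])
codes = ordered.fit_transform(df_size[['size']])

print(codes.ravel())
```

With the order stated explicitly, small maps to 0, medium to 1, and large to 2, so the encoded values increase with size.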