
# Label Encoding

Label encoding converts categorical data into numerical format by assigning each category a unique integer.

### Use case:
1. Dealing with ordinal data
2. Using tree-based models

In [1]:
data = {'Category': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B']}

In [2]:
# using pandas
import pandas as pd

df = pd.DataFrame(data)
pd.factorize(df['Category'])

(array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1]),
 Index(['A', 'B', 'C', 'D'], dtype='object'))

In [3]:
# using scikit learn
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(df['Category'])
original_labels = label_encoder.inverse_transform(encoded_labels)

print(f"{encoded_labels=}")
print(f"{original_labels=}")

encoded_labels=array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])
original_labels=array(['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B'], dtype=object)
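
The two approaches above agree here only because the sample data is already in alphabetical order. As a minimal sketch (the `sizes` column is a hypothetical example, not part of the original data), the following shows that `pd.factorize` numbers categories in order of first appearance, while `LabelEncoder` numbers them in sorted order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical unsorted column, used only for illustration
sizes = pd.Series(['medium', 'small', 'large', 'small'])

# pd.factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(sizes)
factorize_codes = codes.tolist()
factorize_uniques = list(uniques)

# LabelEncoder assigns codes according to the sorted class labels
encoder = LabelEncoder()
sklearn_codes = encoder.fit_transform(sizes).tolist()
sklearn_classes = encoder.classes_.tolist()

print(factorize_codes, factorize_uniques)
print(sklearn_codes, sklearn_classes)
```

Neither ordering reflects the natural ranking small < medium < large, so for ordinal data the category order generally has to be supplied explicitly.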


# One Hot Encoding

One-hot encoding converts categorical data into numerical format by creating one binary column per category. This avoids the problem of ordinality, where integer codes imply an ordering (e.g. “small” < “medium” < “large”) that a nominal variable does not actually have.

### Disadvantages:
1. Increases dimensionality
2. It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.
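
To make the dimensionality point concrete, here is a small sketch assuming a hypothetical `city` column with 50 distinct values in 100 rows. A single categorical column turns into one column per category:

```python
import pandas as pd

# Hypothetical high-cardinality column: 50 distinct cities over 100 rows
cities = pd.DataFrame({'city': [f'city_{i % 50}' for i in range(100)]})

one_hot = pd.get_dummies(cities, columns=['city'], dtype=int)

n_before = cities.shape[1]  # number of columns before encoding
n_after = one_hot.shape[1]  # number of columns after encoding
print(n_before, n_after)
```

With only two rows per category, each new column carries very little signal, which is the situation where overfitting becomes more likely.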

### Use case:

1. Dealing with nominal data
2. Using linear or other non-tree-based models

In [4]:
df.head(2)

|   | Category |
|---|----------|
| 0 | A |
| 1 | B |


In [5]:
# using pandas
# drop_first=True drops the first category ('A'), which becomes the all-zeros row
pd.get_dummies(df, prefix='encoded', columns=['Category'], drop_first=True, dtype=int)

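|   | encoded_B | encoded_C | encoded_D |
|---|-----------|-----------|-----------|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 |
| 9 | 1 | 0 | 0 |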


In [6]:
# using scikit learn
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False)

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
one_hot_encoded = one_hot_encoder.fit_transform(df[categorical_columns])
one_hot_encoded_columns = one_hot_encoder.get_feature_names_out()
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoded_columns)

one_hot_df.head(2)

|   | Category_A | Category_B | Category_C | Category_D |
|---|------------|------------|------------|------------|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |


In [7]:
df_one_hot_encoded = pd.concat([df, one_hot_df], axis=1)
df_one_hot_encoded = df_one_hot_encoded.drop(categorical_columns, axis=1)
df_one_hot_encoded.head(4)

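|   | Category_A | Category_B | Category_C | Category_D |
|---|------------|------------|------------|------------|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |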


# Binary Encoding

Binary encoding first converts each category to an integer and then writes that integer in binary, with one column per binary digit.
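
The two steps can be sketched by hand as follows. This is an illustrative version only: the `demo` series and the `Category_<i>` column names are assumptions, and codes start at 0 here, so the result need not match what a dedicated library produces.

```python
import pandas as pd

# Hypothetical sample column used only for this sketch
demo = pd.Series(['A', 'B', 'C', 'D', 'A'], name='Category')

# Step 1: category -> integer code (order of first appearance)
codes, _ = pd.factorize(demo)

# Step 2: integer code -> fixed-width binary digits, most significant bit first
n_bits = max(int(codes.max()).bit_length(), 1)
bits = [
    [(int(code) >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
    for code in codes
]
binary_df = pd.DataFrame(bits, columns=[f'Category_{i}' for i in range(n_bits)])
print(binary_df)
```

Four categories fit in two bit columns here, compared with four columns under one-hot encoding.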

In [8]:
df.head(2)

|   | Category |
|---|----------|
| 0 | A |
| 1 | B |


In [9]:
# using pandas
label_encoded_df = pd.DataFrame(encoded_labels, columns=['Category'])
binary_encoded_df = label_encoded_df['Category'].apply(lambda x: bin(x)[2:])
binary_encoded_df.head(3)

0     0
1     1
2    10
Name: Category, dtype: object

In [10]:
# using category_encoders
!pip install --upgrade -qq category_encoders

In [11]:
import category_encoders as ce

binary_encoder = ce.BinaryEncoder(cols=['Category'], return_df=True)
binary_encoded_df = binary_encoder.fit_transform(df)
binary_encoded_df.head(3)

|   | Category_0 | Category_1 | Category_2 |
|---|------------|------------|------------|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 |


# Ordinal Encoding

### Use case:
1. Dealing with categorical variables where the categories have a natural order or ranking.
2. Tree-based models (decision tree, random forest) and non-linear models (SVM) handle ordinal encoding effectively.

In [12]:
from sklearn.preprocessing import OrdinalEncoder

In [13]:
data = [
    ['good'], ['bad'], ['excellent'], ['average'],
    ['good'], ['average'], ['excellent'], ['bad'],
    ['average'], ['good']
]

df = pd.DataFrame(data, columns=['reviews'])
df.head(4)

|   | reviews |
|---|---------|
| 0 | good |
| 1 | bad |
| 2 | excellent |
| 3 | average |


In [14]:
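# list the categories explicitly from lowest to highest rank;
# df['reviews'].unique() would use order of appearance instead
review_order = ['bad', 'average', 'good', 'excellent']
ordinal_encoder = OrdinalEncoder(categories=[review_order])
df['reviews'] = ordinal_encoder.fit_transform(df[['reviews']])
df.head()

|   | reviews |
|---|---------|
| 0 | 2.0 |
| 1 | 0.0 |
| 2 | 3.0 |
| 3 | 1.0 |
| 4 | 2.0 |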
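
For comparison with the pandas variants shown in the earlier sections, a similar encoding can be sketched with an ordered `pd.Categorical`. The ranking bad < average < good < excellent is an assumption about how the reviews should be ordered:

```python
import pandas as pd

# First five reviews from the example above
reviews = pd.Series(['good', 'bad', 'excellent', 'average', 'good'])

# Assumed ranking, from lowest to highest
order = ['bad', 'average', 'good', 'excellent']

# An ordered Categorical stores each value as its position in `order`
ordered = pd.Categorical(reviews, categories=order, ordered=True)
codes = ordered.codes.tolist()
print(codes)
```

Because the order is stated explicitly, the integer codes increase with review quality instead of depending on which value happens to appear first in the data.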
