# Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

In machine learning, data encoding is the process of converting data from one format or structure to another so that machine learning algorithms can consume it. It is particularly important for categorical data, which represents variables that fall into distinct categories, because most machine learning models require numerical input.

## Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data that machine learning algorithms can work with. Each category is represented as a binary vector with one position per unique category: the position for that category is 1 and all others are 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:
1. Red : [1, 0, 0]
2. Green : [0, 1, 0]
3. Blue : [0, 0, 1]
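
As a minimal sketch of the mapping above, the vectors can be built by hand with NumPy. The category order (red, green, blue) is taken from the list above; the helper name `one_hot` is hypothetical and exists only for illustration.

```python
import numpy as np

# Category order follows the example above; the index of each category
# decides which position in the vector is set to 1.
categories = ["red", "green", "blue"]
index = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    """Return the one-hot vector for a single category (illustrative helper)."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

for cat in categories:
    print(cat, one_hot(cat).tolist())
```

Note that scikit-learn's `OneHotEncoder` sorts the categories it finds, so for these values its columns come out in the order blue, green, red rather than the order used in this sketch.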

### Disadvantages
- When a categorical variable has many unique values (high cardinality), one-hot encoding creates many new features, significantly increasing the dimensionality of your dataset.
- The resulting matrices are often sparse, meaning they contain mostly zeros, which some machine learning algorithms process inefficiently.
- The increased number of features and sparse data lead to higher computational costs and increased memory usage, which can slow down model training and inference.
- One-hot encoding treats each category as independent, so it cannot capture any inherent relationship or order between categories (for example, low < medium < high).
- For small datasets, the increased number of features from one-hot encoding can make models more prone to overfitting, where the model learns the training data too well and performs poorly on new data.
- The technique doesn't scale well to very large vocabularies or features with a vast number of categories, making it impractical for certain large-scale applications.
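
The dimensionality and sparsity points above can be checked with a rough sketch. The data here is synthetic and purely an assumption for illustration: a hypothetical `user_id` column with 1,000 distinct values spread over 5,000 rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality column (assumed for illustration):
# 5,000 rows cycling through 1,000 distinct IDs.
ids = np.arange(5000) % 1000
high_card = pd.DataFrame({"user_id": [f"user_{i}" for i in ids]})

# By default OneHotEncoder returns a SciPy sparse matrix.
encoded_sparse = OneHotEncoder().fit_transform(high_card[["user_id"]])

n_rows, n_cols = encoded_sparse.shape
density = encoded_sparse.nnz / (n_rows * n_cols)

print(n_rows, n_cols)  # one new column per unique category
print(density)         # fraction of entries that are non-zero
```

A single input column becomes 1,000 columns, and only one entry per row is non-zero, which is why the encoder keeps the result in sparse form unless `.toarray()` is called.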

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple dataframe
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})

In [3]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | red   |
| 4 | blue  |


In [9]:
## create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [10]:
## perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [11]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [12]:
encoder_df.head()

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 0.0         | 1.0       |
| 4 | 1.0        | 0.0         | 0.0       |


In [14]:
## for new data (passed as a DataFrame so the column name matches the fitted data)
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()



array([[1., 0., 0.]])
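
New data may also contain a category the encoder never saw during fitting. With the default settings `OneHotEncoder` raises an error in that case. One possible way to handle it, sketched here under the assumption that an all-zero row is an acceptable encoding for unknown values, is the `handle_unknown="ignore"` option. The names `train`, `safe_encoder`, and `new_data` are illustrative and not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train[["color"]])

# "purple" was never seen during fit.
new_data = pd.DataFrame({"color": ["blue", "purple"]})
result = safe_encoder.transform(new_data).toarray()
print(result)
```

The first row is the usual encoding for blue, while the row for the unseen value contains only zeros.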

In [17]:
pd.concat([df,encoder_df],axis=1)

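|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | red   | 0.0        | 0.0         | 1.0       |
| 4 | blue  | 1.0        | 0.0         | 0.0       |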


In [20]:
import seaborn as sns
df = sns.load_dataset('tips')

In [22]:
df.head()

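|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |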


In [21]:
enc = OneHotEncoder()

In [25]:
encode = enc.fit_transform(df[['sex']])