# One-Hot Encoding
- G.J.O
  
One-hot encoding is a technique used to convert categorical data into a binary format where each category is represented by a separate column, with a 1 indicating its presence and 0s for all other categories.

A common challenge in machine learning is dealing with categorical variables such as colors, product types, or locations, because most algorithms require numerical input.

# Why Use One-Hot Encoding?
One-hot encoding is an essential technique in data preprocessing for several reasons. It transforms categorical data into a format that machine learning models can easily understand and use. This transformation allows each category to be treated independently without implying any false relationships between them.

Additionally, many data processing and machine learning libraries support one-hot encoding. It fits smoothly into the data preprocessing workflow, making it easier to prepare datasets for various machine learning algorithms.
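
As a rough sketch of how one-hot encoding can slot into a preprocessing workflow, the example below combines scikit-learn's `ColumnTransformer` and `Pipeline`. The toy data, the column names `color` and `weight`, and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy dataset: one categorical and one numeric feature
X = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red", "Green", "Blue"],
    "weight": [1.0, 2.5, 0.7, 1.2, 2.4, 0.6],
})
y = [0, 1, 0, 0, 1, 0]

# One-hot encode 'color' and pass 'weight' through unchanged
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)

# Chain preprocessing and a classifier so both are fitted together
model = Pipeline(steps=[("preprocess", preprocess), ("classifier", LogisticRegression())])
model.fit(X, y)

# Three one-hot columns plus the numeric column
features = model.named_steps["preprocess"].transform(X)
print(features.shape)
predictions = model.predict(X)
```

Keeping the encoder inside the pipeline means the categories learned from the training data are reused unchanged when the model later sees new rows.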

# Avoiding Ordinality
Label encoding is another method to convert categorical data into numerical values by assigning each category a unique number. However, this approach can create problems because it might suggest an order or ranking among categories that doesn't actually exist.

For example, assigning 1 to Red, 2 to Green, and 3 to Blue could make the model think that Green is greater than Red and Blue is greater than both. This misunderstanding can negatively affect the model's performance.

One-hot encoding solves this problem by creating a separate binary column for each category. This way, the model can see that each category is distinct and unrelated to the others. 

Label encoding is useful when the categorical data has an inherent ordinal relationship, meaning the categories have a meaningful order or ranking. In such cases, the numerical values assigned by label encoding can effectively represent this order, making it a suitable choice.
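
For ordered categories, one possible approach is scikit-learn's `OrdinalEncoder` with the order spelled out explicitly. The `Small < Medium < Large` feature below is a hypothetical example, not data from this article.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature: Small < Medium < Large
sizes = [["Medium"], ["Small"], ["Large"], ["Small"]]

# Listing the categories fixes their order; without it they are sorted alphabetically
ordinal_enc = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal_enc.fit_transform(sizes)

print(size_codes.ravel())  # [1. 0. 2. 0.]
```

Here the integer codes carry real meaning, because a larger code corresponds to a larger size.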

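For nominal categories such as colors, pandas provides `get_dummies`, which one-hot encodes a DataFrame in a single call:

```python
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Apply one-hot encoding; dtype=int yields 0/1 instead of True/False
df_encoded = pd.get_dummies(df, dtype=int)

# Display the encoded DataFrame
df_encoded
```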

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
```


# Using Scikit-learn's OneHotEncoder
For flexibility and control over the encoding process, Scikit-learn offers the OneHotEncoder class. This class provides advanced options, such as handling unknown categories and fitting the encoder to the training data.

```python
from sklearn.preprocessing import OneHotEncoder

# Create the encoder; categories unseen during fit are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')

# Sample data
X = [['Red'], ['Green'], ['Blue']]

# Fit the encoder to the data
enc.fit(X)

# Transform new data (the result is sparse, so convert it to a dense array)
result = enc.transform([['Red']]).toarray()

# Display the encoded result
result
```

```
array([[0., 0., 1.]])
```
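
To illustrate the unknown-category handling mentioned above, here is a minimal sketch that transforms a value the encoder never saw during fitting. The category `'Purple'` is an assumption chosen for the demonstration.

```python
from sklearn.preprocessing import OneHotEncoder

# Encoder that tolerates unseen categories
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["Red"], ["Green"], ["Blue"]])

# Learned categories are sorted, which is why 'Red' maps to the last column
print(enc.categories_)

# 'Purple' was not seen during fit, so every column is 0
unknown = enc.transform([["Purple"]]).toarray()
print(unknown)  # [[0. 0. 0.]]

# With the default handle_unknown="error", the same input raises ValueError
strict = OneHotEncoder().fit([["Red"], ["Green"], ["Blue"]])
try:
    strict.transform([["Purple"]])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```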

# Handling Categorical Features With Many Unique Values
One significant challenge with one-hot encoding is high cardinality. A categorical feature with k unique values produces k columns, so a feature with many unique values leads to an explosion in the number of columns and contributes to the "curse of dimensionality." The resulting dataset is sparse and can be computationally expensive to process.
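
The growth in columns can be seen in a small sketch. The `user_…` identifiers below are synthetic values generated for illustration; the point is only the shape and sparsity of the result.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality feature: 10,000 rows, 1,000 distinct IDs
user_ids = np.array([f"user_{i % 1000}" for i in range(10_000)]).reshape(-1, 1)

# OneHotEncoder returns a SciPy sparse matrix by default
encoded = OneHotEncoder().fit_transform(user_ids)

# One column per distinct ID, one non-zero entry per row
print(encoded.shape)  # (10000, 1000)
print(encoded.nnz)    # 10000

# Fraction of entries that are non-zero
density = encoded.nnz / (encoded.shape[0] * encoded.shape[1])
print(density)
```

Keeping the output sparse avoids storing the zeros, but the model still has to cope with one feature per category.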

# Feature Hashing
Feature hashing, also known as the hashing trick, can help reduce dimensionality by hashing categories into a fixed number of columns. This approach maintains efficiency while controlling the number of features.


```python
from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Yellow']}
df = pd.DataFrame(data)

# Initialize FeatureHasher
hasher = FeatureHasher(n_features=3, input_type='string')

# Apply feature hashing; with input_type='string' each sample must be
# an iterable of strings, so every value is wrapped in a list
hashed_features = hasher.transform(df['Color'].apply(lambda x: [x]))
hashed_df = pd.DataFrame(hashed_features.toarray(), columns=['Feature1', 'Feature2', 'Feature3'])

# Display the hashed features DataFrame
print("Hashed Features DataFrame:")
print(hashed_df)
```


```
Hashed Features DataFrame:
   Feature1  Feature2  Feature3
0       0.0       0.0       1.0
1       0.0       1.0       0.0
2      -1.0       0.0       0.0
3       0.0       0.0       1.0
4       0.0      -1.0       0.0
```


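The code imports FeatureHasher from sklearn.feature_extraction along with pandas, then creates a DataFrame with a categorical feature 'Color'. FeatureHasher is initialized with the desired number of output features (n_features=3) and the input type 'string'. The transform method is then applied to the 'Color' column, with each value wrapped in a list because every sample must be an iterable of strings. The resulting sparse matrix is converted to a dense array and then to a DataFrame, which is printed.

Two details of the output are worth noting. The negative entries come from FeatureHasher's default alternate_sign=True setting, which flips the sign of some hashed values. And because four distinct categories are squeezed into three columns, a collision is unavoidable: here Green and Yellow both land in Feature2 and differ only in sign.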
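
The fixed output width is easier to see with more categories. The following sketch hashes 1,000 synthetic `city_…` values (made up for this example) into 16 columns, then transforms a value that was never seen before.

```python
from sklearn.feature_extraction import FeatureHasher

# Synthetic feature with 1,000 distinct categories, one token per sample
cities = [[f"city_{i}"] for i in range(1000)]

# FeatureHasher is stateless, so no fit step is needed
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform(cities)

# The column count stays at n_features regardless of the number of categories
print(hashed.shape)  # (1000, 16)

# A new category is hashed into the same 16 columns
new_row = hasher.transform([["city_9999"]])
print(new_row.shape)  # (1, 16)
```

The trade-off is that many categories share each column, so a smaller `n_features` means more collisions.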

# Dimensionality Reduction
After one-hot encoding, techniques like Principal Component Analysis (PCA) can be applied to reduce the number of dimensions while preserving most of the variance in the dataset.

PCA compresses the high-dimensional data into a lower-dimensional space, making it more manageable for machine learning algorithms. The resulting components are linear combinations of the original columns, so they no longer correspond to individual categories.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Yellow']}
df = pd.DataFrame(data)

# Instantiate encoder
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(df[['Color']])

# Feature names
feature_names = encoder.get_feature_names_out(['Color'])
df_encoded = pd.DataFrame(one_hot_encoded, columns=feature_names)

# PCA transformation
pca = PCA(n_components=2)
pca_transformed = pca.fit_transform(df_encoded)
df_pca = pd.DataFrame(pca_transformed, columns=['PCA1', 'PCA2'])

# Output
print("PCA-Transformed DataFrame:")
print(df_pca)
```


```
PCA-Transformed DataFrame:
      PCA1          PCA2
0  0.69282  1.036858e-16
1 -0.46188 -4.082483e-01
2 -0.46188  8.164966e-01
3  0.69282  2.308544e-16
4 -0.46188 -4.082483e-01
```
