# An Overview of Encoding Techniques

# 1. Customer Churn Prediction (One-Hot Encoding)
## Scenario:
* A company wants to predict if a customer will churn based on their demographic details and interaction history. The dataset contains both numerical and categorical features (e.g., city, product type, and payment method).

## Why Use One-Hot Encoding:
* The categorical features (e.g., city or product type) are nominal (i.e., they have no inherent ranking). One-Hot Encoding works well here to convert these categories into a format suitable for algorithms like Logistic Regression, Decision Trees, and Neural Networks.

In [20]:
import pandas as pd

# Example data for customer churn prediction
data = {'City': ['New York', 'London', 'Paris', 'London', 'New York'],
        'Product': ['Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet'],
        'Churn': [0, 1, 0, 1, 0]}

df = pd.DataFrame(data)

# Apply One-Hot Encoding to City and Product columns
one_hot_encoded = pd.get_dummies(df[['City', 'Product']])

# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, one_hot_encoded], axis=1)

print(df)


       City Product  Churn  City_London  City_New York  City_Paris  \
0  New York   Phone      0        False           True       False   
1    London  Tablet      1         True          False       False   
2     Paris  Laptop      0        False          False        True   
3    London   Phone      1         True          False       False   
4  New York  Tablet      0        False           True       False   

   Product_Laptop  Product_Phone  Product_Tablet  
0           False           True           False  
1           False          False            True  
2            True          False           False  
3           False           True           False  
4           False          False            True  


## When to Use:
* Use One-Hot Encoding when your categorical features are nominal (no ranking) and have relatively few unique values, since each value becomes its own column.
* This works well with algorithms like Logistic Regression, SVM, Neural Networks, and Tree-based models, because no artificial order is imposed on the categories.
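
For linear models such as Logistic Regression, a full set of dummy columns per feature sums to 1 in every row and is therefore collinear with the intercept. The following is a minimal sketch, reusing the example data above, that drops one reference level per feature and emits 0/1 integers instead of booleans. Which level gets dropped (the alphabetically first one) is simply how `drop_first` behaves, not a modelling recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'London', 'Paris', 'London', 'New York'],
    'Product': ['Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet'],
    'Churn': [0, 1, 0, 1, 0],
})

# columns=... replaces the original text columns with their dummy columns;
# drop_first=True removes the first (alphabetical) level of each feature,
# dtype=int yields 0/1 instead of False/True
encoded = pd.get_dummies(df, columns=['City', 'Product'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Here `London` and `Laptop` become the implicit reference levels: a row with zeros in all `City_*` columns is a London customer.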

# 2. House Price Prediction (Label Encoding)
## Scenario:
* In a housing price prediction model, you have features like the condition of the house (Excellent, Good, Fair, Poor), which are ordinal in nature.

## Why Use Label Encoding:
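* Since the house condition is ordinal (i.e., Excellent > Good > Fair > Poor), we can assign these categories numeric values that reflect their ranking. Label Encoding is appropriate for ordinal features, provided the integer codes follow that ranking. Note that scikit-learn's `LabelEncoder` assigns codes in alphabetical order, so the ranking has to be specified explicitly.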



In [25]:
import pandas as pd

# Example data for house price prediction
data = {'Condition': ['Excellent', 'Good', 'Fair', 'Poor'],
        'Price': [400000, 350000, 300000, 250000]}

df = pd.DataFrame(data)

# LabelEncoder sorts labels alphabetically (Excellent=0, Fair=1, Good=2, Poor=3),
# which does not match the ranking, so map the order explicitly instead
condition_order = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}
df['Condition_encoded'] = df['Condition'].map(condition_order)

print(df)

   Condition   Price  Condition_encoded
0  Excellent  400000                  3
1       Good  350000                  2
2       Fair  300000                  1
3       Poor  250000                  0


## When to Use:
* Use Label Encoding for ordinal categories (those with a meaningful order) such as education levels, ratings (e.g., low, medium, high), and more.
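* Integer codes suit tree-based models such as Decision Trees, Random Forests, and Gradient Boosting, which only split on thresholds. Linear models and Neural Networks also treat the gaps between codes as equal, which may not hold for every ordinal scale.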
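
One way to keep the ranking attached to the data itself is an ordered pandas categorical. This is a sketch under the assumption that the worst-to-best order below is the intended one; `condition_levels` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Condition': ['Excellent', 'Good', 'Fair', 'Poor'],
                   'Price': [400000, 350000, 300000, 250000]})

# Explicit ranking, worst to best (assumed for this example)
condition_levels = ['Poor', 'Fair', 'Good', 'Excellent']

ordered = pd.Categorical(df['Condition'], categories=condition_levels, ordered=True)

# .codes follows the position in condition_levels: Poor=0 ... Excellent=3.
# Values not listed in condition_levels would get the code -1.
df['Condition_encoded'] = ordered.codes

print(df)
```

Because unknown labels surface as `-1` rather than raising an error, it may be worth checking for that value before training.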


# 3. Fraud Detection (Frequency Encoding)
## Scenario:
* You are building a fraud detection model, and one of the features is the merchant where a transaction occurred. Some merchants appear frequently (common stores), while others are rare.

## Why Use Frequency Encoding:
* In this scenario, the frequency of a merchant can be an important indicator of fraud risk. Merchants with many transactions may have different risk profiles compared to merchants with very few transactions. Frequency Encoding replaces categories with how often they occur in the dataset.


In [30]:
import pandas as pd

# Example data for fraud detection
data = {'Merchant': ['Store_A', 'Store_B', 'Store_A', 'Store_C', 'Store_A', 'Store_B'],
        'Amount': [100, 200, 150, 400, 300, 250]}

df = pd.DataFrame(data)

# Frequency encoding for the 'Merchant' column:
# count how often each merchant occurs, then map the counts back onto the rows
merchant_frequency = df['Merchant'].value_counts().to_dict()
df['Merchant_encoded'] = df['Merchant'].map(merchant_frequency)

print(df)

  Merchant  Amount  Merchant_encoded
0  Store_A     100                 3
1  Store_B     200                 2
2  Store_A     150                 3
3  Store_C     400                 1
4  Store_A     300                 3
5  Store_B     250                 2


## When to Use:
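* Use Frequency Encoding when you believe the frequency of a category is relevant to the target outcome. This can be useful in fraud detection, customer segmentation, and recommendation systems. Keep in mind that categories with the same count receive the same code, so the model can no longer tell them apart.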
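
When the encoding is applied to data that arrives later, the frequencies would normally be learned from the training data and then reused. The sketch below uses relative frequencies and assigns 0.0 to merchants never seen in training. Both the toy data and the choice of 0.0 as the fallback are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Merchant': ['Store_A', 'Store_A', 'Store_A',
                                   'Store_B', 'Store_B', 'Store_C']})
# Store_D does not occur in the training data
new = pd.DataFrame({'Merchant': ['Store_A', 'Store_C', 'Store_D']})

# Share of training rows belonging to each merchant
freq = train['Merchant'].value_counts(normalize=True)

train['Merchant_freq'] = train['Merchant'].map(freq)

# Unseen merchants map to NaN; fall back to a frequency of 0.0
new['Merchant_freq'] = new['Merchant'].map(freq).fillna(0.0)

print(new)
```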


# 4. Insurance Premium Prediction (Target Encoding)
## Scenario:
* You are predicting insurance premiums, and one feature is the type of vehicle (e.g., SUV, sedan, truck). Premium amounts may vary significantly based on the type of vehicle.

## Why Use Target Encoding:
* Target Encoding is beneficial when the categorical variable is highly predictive of the target. In this case, the type of vehicle directly impacts the premium, and we can use the average premium for each type of vehicle as the encoding.


In [35]:
import pandas as pd

# Example data for insurance premium prediction
data = {'Vehicle_Type': ['SUV', 'Sedan', 'Truck', 'Sedan', 'SUV'],
        'Premium': [1200, 1000, 800, 1100, 1250]}

df = pd.DataFrame(data)

# Calculate mean premium for each vehicle type (Target Encoding)
mean_premium = df.groupby('Vehicle_Type')['Premium'].mean()
df['Vehicle_encoded'] = df['Vehicle_Type'].map(mean_premium)

print(df)

  Vehicle_Type  Premium  Vehicle_encoded
0          SUV     1200           1225.0
1        Sedan     1000           1050.0
2        Truck      800            800.0
3        Sedan     1100           1050.0
4          SUV     1250           1225.0


## When to Use:
* Use Target Encoding when the categorical variable has a strong relationship with the target (dependent variable). Be careful to avoid overfitting, especially on small datasets. Because the encoding is computed from the target itself, it can leak target information into the features; compute the category means on the training data only (ideally out-of-fold) and apply them unchanged to validation and test data.
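
A common way to reduce overfitting on rare categories is to shrink each category mean toward the global mean. The following is a minimal sketch of that idea on the example data above. The smoothing strength `m = 2` is a hypothetical value, and the statistics are computed on the full toy dataset only for brevity; in practice they would come from the training folds only.

```python
import pandas as pd

df = pd.DataFrame({'Vehicle_Type': ['SUV', 'Sedan', 'Truck', 'Sedan', 'SUV'],
                   'Premium': [1200, 1000, 800, 1100, 1250]})

m = 2  # smoothing strength (hypothetical; larger values pull harder toward the global mean)
global_mean = df['Premium'].mean()

# Per-category mean and number of rows
stats = df.groupby('Vehicle_Type')['Premium'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['Vehicle_smoothed'] = df['Vehicle_Type'].map(smoothed)

print(df)
```

With a single observation, `Truck` moves furthest from its raw mean of 800 toward the global mean of 1070, which is the intended effect for sparsely observed categories.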


# 5. Customer Segmentation (Binary Encoding)
## Scenario:
* You are segmenting customers based on different product categories (e.g., Electronics, Furniture, Apparel, etc.). The number of product categories is large, and you want to avoid creating too many features.

## Why Use Binary Encoding:
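* Binary Encoding is useful when you have high-cardinality categorical features (i.e., many unique categories). Instead of creating one column per category (as with One-Hot Encoding), Binary Encoding assigns each category an integer and writes that integer in binary, one column per bit. A feature with n categories therefore needs roughly log2(n) columns instead of n, which is more memory-efficient.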


In [39]:
import category_encoders as ce
import pandas as pd

# Example data for customer segmentation
data = {'Product_Category': ['Electronics', 'Furniture', 'Apparel', 'Electronics', 'Apparel']}

df = pd.DataFrame(data)

# Apply Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=['Product_Category'])
df_encoded = binary_encoder.fit_transform(df)

print(df_encoded)

   Product_Category_0  Product_Category_1
0                   0                   1
1                   1                   0
2                   1                   1
3                   0                   1
4                   1                   1


## When to Use:
* Use Binary Encoding when the categorical feature has many unique categories and you want to keep the feature space smaller.
* It’s a good choice for models like Tree-based algorithms (Random Forest, Gradient Boosting), or even Neural Networks.
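
To make the mechanics concrete, here is a hand-rolled sketch of the same idea in plain pandas. It assumes categories are numbered from 1 in order of first appearance, which reproduces the output shown above for this data; it is not claimed to mirror the internals of `category_encoders` in general.

```python
import pandas as pd

df = pd.DataFrame({'Product_Category': ['Electronics', 'Furniture', 'Apparel',
                                        'Electronics', 'Apparel']})

# Step 1: integer code per category, starting at 1 in order of first appearance
codes = pd.factorize(df['Product_Category'])[0] + 1

# Step 2: number of bits needed to write the largest code in binary
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
for i in range(n_bits):
    shift = n_bits - 1 - i
    df[f'Product_Category_{i}'] = (codes >> shift) & 1

print(df)
```

Three categories fit in two bit columns here; by the same logic, 1,000 categories would need 10 columns rather than 1,000.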

# 6. Text Classification (Hashing Encoding)
## Scenario:
* You are building a spam email classifier where one of the features is the domain of the email sender. Since there can be thousands of unique email domains, using One-Hot Encoding would result in an unmanageable number of features.

## Why Use Hashing Encoding:
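* For extremely high-cardinality features like email domains, Hashing Encoding is a fast and memory-efficient way to map categories into a fixed number of columns without storing a category-to-column vocabulary. The trade-off is that different categories can collide in the same column, and collisions become more likely as the number of columns shrinks. This is especially helpful for real-time applications, because previously unseen categories can be encoded without refitting anything.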


In [43]:
from sklearn.feature_extraction import FeatureHasher

# Example data for text classification (email domains)
data = [{'Domain': 'gmail.com'}, {'Domain': 'yahoo.com'}, {'Domain': 'hotmail.com'}]

# Hashing Encoding
hasher = FeatureHasher(input_type='dict', n_features=5)
hashed_features = hasher.transform(data).toarray()

print(hashed_features)

[[ 0.  0.  0. -1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.]]


## When to Use:
* Use Hashing Encoding when dealing with very high-cardinality categorical features like URLs, email domains, user IDs, etc.
* Works well with algorithms like Logistic Regression, SVM, and Neural Networks.
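
The underlying idea can be sketched with the standard library alone. This illustration uses SHA-256 for a stable hash, not the hash function that `FeatureHasher` uses, so the bucket positions will differ from the output above. The helper names `hash_bucket` and `hash_encode` are hypothetical.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 5) -> int:
    """Map a string to a bucket index in [0, n_buckets) using a stable hash."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value: str, n_buckets: int = 5) -> list:
    """Return a fixed-length vector with a 1 in the hashed bucket."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

# The last domain stands in for a value that was never seen before;
# it is encoded like any other, with no vocabulary to update
domains = ['gmail.com', 'yahoo.com', 'hotmail.com', 'never-seen-before.example']
for d in domains:
    print(d, hash_encode(d))
```

With only 5 buckets, collisions between distinct domains are likely; increasing `n_buckets` trades memory for fewer collisions.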

## How to Choose the Right Encoding Technique:
* One-Hot Encoding: Use when your categorical variable has no natural order (nominal) and there are a small number of unique categories.

* Label Encoding: Use for ordinal variables with a clear ranking/order (education levels, ratings).

* Frequency Encoding: Use when the frequency of a category carries meaning, as in fraud detection or sales prediction.

* Target Encoding: Use when there is a strong relationship between the categorical variable and the target variable. Be cautious about overfitting.

* Binary Encoding: Use for high-cardinality categorical features when you want a compact encoding method that works efficiently.

* Hashing Encoding: Use when dealing with extremely high-cardinality features where the number of unique categories is very large.