Target Encoding:

* This is a more advanced encoding technique used for dealing with high cardinality categorical features, i.e., features with many unique categories.

* Each category is replaced with the average target value observed for that category.

* This has the advantage of considering the relationship between the target and the categorical feature, but it can also lead to overfitting if not used with caution.

1. Calculate the mean of the target variable for each category.
2. Replace the category with its corresponding mean value.

**Pros:** Target encoding leverages the relationship between a categorical variable and the target variable, making it a powerful technique when that relationship is significant. It encodes each feature as a single numeric column regardless of the number of categories, so it is memory-efficient compared with one-hot encoding.

**Cons:** A significant drawback of target encoding is the potential for overfitting, especially on small datasets or for categories with few samples. It also suffers from target leakage: the target variable is used directly to encode the input feature, and that encoded feature is then used to fit a model on the same target.
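One common way to reduce the leakage described above is out-of-fold encoding, where each row is encoded using target means computed from the other folds only. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name `out_of_fold_target_encode`, the `Color_oof` column, and the choice of 3 folds are all hypothetical and chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Label': [1, 0, 1, 1, 0, 0, 1],
})

def out_of_fold_target_encode(frame, column, target, n_splits=3, seed=0):
    """Encode `column` so that no row sees its own target value."""
    encoded = pd.Series(index=frame.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(frame):
        fit_part = frame.iloc[fit_idx]
        # Category means and fallback mean come from the other folds only.
        means = fit_part.groupby(column)[target].mean()
        fallback = fit_part[target].mean()
        values = frame.iloc[enc_idx][column].map(means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

df['Color_oof'] = out_of_fold_target_encode(df, 'Color', 'Label')
print(df)
```

Categories that do not appear in the fitting folds fall back to the mean target of those folds, which is one reasonable choice among several.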

**When to use:** Target encoding suits categorical features with a large number of categories.

For multi-class classification, first one-hot encode the target variable. This yields n binary columns, one per class, of which only n-1 are linearly independent, so any one column can be dropped. Then apply the standard target encoding procedure to each categorical feature once per remaining binary label. A single categorical feature therefore produces n-1 target-encoded features, and a dataset with k categorical features produces k × (n-1) features in total.
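The multi-class procedure can be sketched in plain pandas. This is an illustrative example only: the `Species` target and its three classes are hypothetical, and no smoothing is applied.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    # Hypothetical 3-class target, used only for illustration.
    'Species': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'bird'],
})

# One-hot encode the target and drop one class, leaving n-1 binary columns.
target_dummies = pd.get_dummies(
    df['Species'], prefix='Species', drop_first=True, dtype=float
)

# Mean of each binary target column within each Color category.
class_means = target_dummies.groupby(df['Color']).mean()

# Attach the n-1 encoded columns to every row via its Color value.
encoded = (
    df[['Color']]
    .join(class_means.add_prefix('Color_'), on='Color')
    .drop(columns='Color')
)
print(encoded)
```

With n = 3 classes, the single `Color` feature yields 2 encoded columns, matching the n-1 count described above.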

In [None]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.4


In [None]:
import category_encoders as ce
import pandas as pd

In [None]:
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Small', 'Medium'],
    'Label': [1, 0, 1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
df

,Color,Size,Label
0,Red,Small,1
1,Blue,Medium,0
2,Green,Large,1
3,Red,Medium,1
4,Red,Small,0
5,Blue,Small,0
6,Green,Medium,1


In [None]:
column_names = df.select_dtypes(include=['object']).columns.to_list()
column_names

['Color', 'Size']

In [None]:
encoder = ce.TargetEncoder()
df[column_names] = encoder.fit_transform(df[column_names], df['Label'])
df

,Color,Size,Label
0,0.58614,0.534651,1
1,0.490371,0.58614,0
2,0.632222,0.627189,1
3,0.58614,0.58614,1
4,0.58614,0.534651,0
5,0.490371,0.534651,0
6,0.632222,0.58614,1
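The encoded values above are not the raw per-category means (for example, `Red` has a raw mean of 2/3 but is encoded as 0.58614). The difference is consistent with the encoder shrinking each category mean toward the overall target mean, with less weight given to categories that have few samples. The sketch below reproduces the `Color` values by hand, assuming `min_samples_leaf=20` and `smoothing=10` are the defaults in this version of `category_encoders`. Those two values are an assumption inferred from the output, so check them against the installed version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Label': [1, 0, 1, 1, 0, 0, 1],
})

# Assumed TargetEncoder defaults (not confirmed by the notebook itself).
min_samples_leaf = 20
smoothing = 10.0

prior = df['Label'].mean()  # overall target mean
stats = df.groupby('Color')['Label'].agg(['count', 'mean'])

# Sigmoid weight: near 0 for rare categories, near 1 for frequent ones.
weight = 1.0 / (1.0 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))

# Blend the category mean with the prior according to the weight.
smoothed = prior * (1.0 - weight) + stats['mean'] * weight
print(smoothed.round(6))
```

Because every category here has far fewer than 20 samples, the weights are small and all encoded values stay close to the overall mean of 4/7.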


In [None]:
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Small', 'Medium'],
    'Label': [1, 0, 1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
df

,Color,Size,Label
0,Red,Small,1
1,Blue,Medium,0
2,Green,Large,1
3,Red,Medium,1
4,Red,Small,0
5,Blue,Small,0
6,Green,Medium,1


In [None]:
color_mean = df.groupby('Color')['Label'].mean()
df['color_label'] = df['Color'].map(color_mean)

In [None]:
color_mean

Color,Label
Blue,0.0
Green,1.0
Red,0.666667


In [None]:
size_mean = df.groupby('Size')['Label'].mean()
df['size_label'] = df['Size'].map(size_mean)

In [None]:
df

,Color,Size,Label,color_label,size_label
0,Red,Small,1,0.666667,0.333333
1,Blue,Medium,0,0.0,0.666667
2,Green,Large,1,1.0,1.0
3,Red,Medium,1,0.666667,0.666667
4,Red,Small,0,0.666667,0.333333
5,Blue,Small,0,0.0,0.333333
6,Green,Medium,1,1.0,0.666667


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import category_encoders as ce

# Generate a dummy dataset with categorical variables
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Small', 'Medium'],
    'Label': [1, 0, 1, 1, 0, 0, 1]
}

df = pd.DataFrame(data)

# Split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Initialize the target (mean) encoder
mean_encoder = ce.TargetEncoder()

# Fit the encoder on the training data
mean_encoder.fit(train_df[['Color', 'Size']], train_df['Label'])

# Transform both the training and test datasets
train_encoded = mean_encoder.transform(train_df[['Color', 'Size']])
test_encoded = mean_encoder.transform(test_df[['Color', 'Size']])

# Display the encoded datasets
print("Training Data (After Mean Encoding):\n", train_encoded)
print("\nTest Data (After Mean Encoding):\n", test_encoded)

Training Data (After Mean Encoding):
       Color      Size
5  0.521935  0.514889
2  0.656740  0.652043
4  0.585815  0.514889
3  0.585815  0.656740
6  0.656740  0.656740

Test Data (After Mean Encoding):
       Color      Size
0  0.585815  0.514889
1  0.521935  0.656740
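Fitting on the training split only raises the question of what to do when the test data contains a category that was never seen during fitting. A minimal pandas sketch of one option, falling back to the overall training mean, is shown below. The `Yellow` category, the `Color_encoded` column, and the small hand-made split are hypothetical and used only for illustration. This is not a description of how `category_encoders` handles the case.

```python
import pandas as pd

train_df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Label': [1, 0, 1, 0, 0],
})
# 'Yellow' is a hypothetical category that never appears in training.
test_df = pd.DataFrame({'Color': ['Red', 'Yellow']})

# Learn the statistics from the training data only.
train_prior = train_df['Label'].mean()
color_means = train_df.groupby('Color')['Label'].mean()

# Map known categories to their mean; unseen ones get the training prior.
test_df['Color_encoded'] = test_df['Color'].map(color_means).fillna(train_prior)
print(test_df)
```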
