# 🧩 Data Encoding — Transforming Categorical Data into Numerical Format

---

## 🧠 What is Data Encoding?

In any machine learning workflow, our models can only **understand numerical data**.  
However, most real-world datasets contain **categorical variables** — features that represent *labels or categories* rather than numbers.

For example:

| Item ID | Color |
|----------|--------|
| 1 | Red |
| 2 | Green |
| 3 | Blue |
| 4 | Red |
| 5 | Green |

These text-based categorical values need to be **converted into numeric form** before training the model — a process known as **data encoding**.

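Mathematically, the encoding function can be represented as:

$$
f: C \rightarrow \mathbb{R}^d
$$

Where:  
- $C$ → Set of categorical values  
- $\mathbb{R}^d$ → Numeric representation with $d$ values per category ($d = 1$ for label or ordinal encoding, $d = k$ for one-hot encoding with $k$ categories)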

---

## 🚀 Why Is Encoding Needed?

Machine learning algorithms like **Linear Regression**, **Decision Trees**, **SVM**, and **Neural Networks** work only with **numbers**.  
They calculate distances, gradients, and weights — all of which require numeric values.

If we feed categorical strings like `"Red"`, `"Green"`, or `"Blue"` directly, the model cannot make sense of them.

👉 Hence, **data encoding** converts these categories into numbers, allowing the model to learn patterns effectively.

---

## 🧭 Types of Data Encoding

There are different types of encoding techniques — each used based on whether your categorical data has an **order** or not.

Let’s go through them one by one 👇

---

### 1️⃣ **Nominal / One-Hot Encoding**

One-Hot Encoding (OHE), also known as Nominal Encoding, is a technique used to represent categorical data as numerical data, which is more suitable for machine learning algorithms. In this technique, each category is represented as a binary vector in which each position corresponds to one unique category.

**When to use:**  
Use this when categories have **no natural order** — for example, `"Color" = {Red, Green, Blue}`.

In One-Hot Encoding, we create a **new column for each category** and mark it with `1` or `0` to show its presence or absence.

| Color | color_Red | color_Green | color_Blue |
|--------|------------|--------------|-------------|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |

So, each color becomes its own binary flag — like a yes/no indicator.

🧠 **Intuition:**  
The model now understands colors as independent attributes rather than trying to order them.

Mathematically, for $k$ unique categories:

$$
\text{OHE}(x_i) = [I(x_i = c_1), I(x_i = c_2), \dots, I(x_i = c_k)]
$$

Where $I(\cdot)$ is the indicator function that returns `1` if true, else `0`.
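
As a minimal sketch of the formula above, the indicator function can be written directly in plain Python. The helper name `one_hot` and the fixed category order are assumptions made for illustration; library encoders choose the column order themselves.

```python
def one_hot(value, categories):
    """Return the indicator vector [I(value == c_1), ..., I(value == c_k)]."""
    return [1 if value == category else 0 for category in categories]

# The category order is an assumption for this sketch; it fixes the column order.
categories = ["Red", "Green", "Blue"]
colors = ["Red", "Green", "Blue", "Red", "Green"]

encoded = [one_hot(color, categories) for color in colors]
for color, vector in zip(colors, encoded):
    print(color, vector)
```

Each vector contains exactly one `1`, matching the table shown earlier in this section.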

⛔️ **Disadvantages:**
- **High dimensionality:** a feature with hundreds of unique categories produces hundreds of new columns.
- __Sparse Matrix:__ most entries are zero, which can contribute to overfitting.

### ⚠️ Drawback of One-Hot Encoding — Sparse Matrix and Overfitting

While **One-Hot Encoding (OHE)** is simple and effective, it can create a **very large and sparse feature matrix**, especially when a categorical column has **many unique values** (called *high cardinality*).

---

#### 🧩 What is a Sparse Matrix?

A **sparse matrix** is one that contains mostly zeros.  
In One-Hot Encoding, for each observation, only **one value is 1** and all other category columns are **0**.

Example:  
If we have 100 categories, each row will have **1 value = 1** and **99 values = 0**.

Mathematically, if $n$ = number of samples and $k$ = number of categories:

$$
X \in \mathbb{R}^{n \times k}
$$

and most entries of $X$ are zeros:

$$
x_{ij} = 
\begin{cases} 
1, & \text{if sample } i \text{ belongs to category } j \\
0, & \text{otherwise}
\end{cases}
$$

This results in a matrix that is **high-dimensional** and **sparse**.

---

#### 💣 Why is this a Problem?

1. **Memory Inefficiency:** When stored as a regular (dense) array, a one-hot matrix spends most of its memory on zeros. Sparse storage formats avoid this, but not every tool or model accepts them.  
2. **Overfitting Risk:** Models may start memorizing rare categories instead of learning general patterns.  
3. **Computational Cost:** The model must handle many zero-valued features, which can slow down training.
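
To make the memory point concrete, here is a rough sketch using NumPy. The sizes (1,000 samples, 100 categories) and the random category assignment are assumptions chosen only for illustration.

```python
import numpy as np

n_samples, n_categories = 1_000, 100
rng = np.random.default_rng(0)

# One integer category id per sample (hypothetical data).
category_ids = rng.integers(0, n_categories, size=n_samples)

# Dense one-hot matrix: one row per sample, one column per category.
dense = np.zeros((n_samples, n_categories), dtype=np.float64)
dense[np.arange(n_samples), category_ids] = 1.0

zero_fraction = 1.0 - dense.sum() / dense.size
dense_bytes = dense.nbytes
# Storing only the position of the single 1 in each row is far more compact.
compact_bytes = category_ids.astype(np.int32).nbytes

print(f"Fraction of zeros: {zero_fraction:.2%}")
print(f"Dense storage:     {dense_bytes:,} bytes")
print(f"Index-only storage: {compact_bytes:,} bytes")
```

With 100 categories, 99% of the entries are zero, and the dense array is much larger than the list of category positions it encodes.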

---

#### 💡 Example

If you One-Hot Encode a feature like `"City"` with 1000 unique values,  
you’ll end up with **1000 new columns** — most of which will be zeros for any given row.

As an informal rule of thumb (not an exact formula), such matrices raise the risk of overfitting:

$$
\text{Overfitting} \; \propto \; \text{Model Complexity} \; + \; \text{Number of Sparse Features}
$$

---

✅ **In short:**  
> One-Hot Encoding is great for small categorical features,  
> but for large ones — it can make the dataset **sparse**, **high-dimensional**, and prone to **overfitting**.


---

### 2️⃣ **Label and Ordinal Encoding**

**When to use:**  
Use this when categories have a **clear order or ranking** — for example, `"Education" = {High School, Bachelor, Master, PhD}`.

This method simply assigns a **unique integer** to each category.

| Education | Encoded |
|------------|----------|
| High School | 0 |
| Bachelor | 1 |
| Master | 2 |
| PhD | 3 |

🧠 **Intuition:**  
The model can now understand the progression or hierarchy between categories.  
For example, `PhD > Master > Bachelor > High School`.

⚠️ **Be careful:**  
If your data has **no real order** (like colors or cities), Label Encoding can be misleading — the model may wrongly assume one category is greater than another.

Mathematically:

$$
x_i \in C \Rightarrow \text{Encode}(x_i) = \text{Index}(x_i)
$$

**Ordinal Encoding:**  
It is used to encode categorical data that has an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in that order.
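
A minimal sketch of this idea in plain Python, assuming the education order from the table above. The names `education_order` and `rank` are illustrative only.

```python
# The order is defined explicitly, from lowest to highest level.
education_order = ["High School", "Bachelor", "Master", "PhD"]

# Map each category to its position in the order: Index(x_i).
rank = {level: position for position, level in enumerate(education_order)}

samples = ["Master", "High School", "PhD", "Bachelor"]
encoded = [rank[sample] for sample in samples]

print(rank)
print(encoded)
```

Because the order is written out by hand, the integers follow the real ranking rather than, say, alphabetical order.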

---

### 3️⃣ **Target Guided Ordinal Encoding**

It is a technique used to encode categorical variables based on their relationship with the target variable. This encoding technique is useful when we have a categorical variable with a large number of unique categories and we want to use it as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

**When to use:**  
Use this when you have a **target variable (supervised learning)** and want to encode categories based on how strongly they relate to the target.

Example: Suppose you’re predicting **Sales**, and your dataset looks like this:

| Color | Average Sales |
|--------|----------------|
| Red | 120 |
| Green | 90 |
| Blue | 60 |

We can rank colors by their average sales and assign numbers accordingly:

| Color | Mean(Sales) | Encoded |
|--------|--------------|----------|
| Red | 120 | 1 |
| Green | 90 | 2 |
| Blue | 60 | 3 |

🧠 **Intuition:**  
Here, the encoding isn’t arbitrary — it’s based on how each color affects the target value.  
So, “Red” has the highest average sales and gets the top rank (1).

Mathematically:

$$
\text{Encoded}(x_i) = \text{Rank}\left( \mathbb{E}[Y \mid X = x_i] \right)
$$

Where:  
- $Y$ = Target variable (e.g., Sales)  
- $X$ = Categorical feature (e.g., Color)
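
The ranking formula can be sketched with `pandas` as follows. The individual sales figures are hypothetical, chosen so that the per-color means match the table above (Red 120, Green 90, Blue 60).

```python
import pandas as pd

# Hypothetical raw data whose per-color means are 120, 90 and 60.
df = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Red", "Green", "Blue"],
    "Sales": [110, 80, 50, 130, 100, 70],
})

# E[Y | X = x]: mean of the target for each category.
mean_sales = df.groupby("Color")["Sales"].mean()

# Rank the means so that the highest mean gets rank 1.
ranks = mean_sales.rank(ascending=False, method="dense").astype(int)

# Replace each category with its rank.
df["Color_encoded"] = df["Color"].map(ranks)
print(df)
```

Here `method="dense"` is one possible choice for handling ties; categories with equal means would share a rank.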

⚠️ **Important:**  
Always fit this encoding on the **training set only** (after the train-test split), then apply the learned mapping to the test set, to avoid data leakage.

---

## ✅ Summary

| Encoding Type | When to Use | Pros | Cons |
|----------------|--------------|------|------|
| **One-Hot Encoding** | Nominal data (no order) | Easy to interpret, widely used | Increases dimensionality |
| **Label/Ordinal Encoding** | Ordered categories | Simple, compact | Can mislead models if order doesn’t exist |
| **Target Guided Ordinal** | When you have a target variable | Captures relation with target | Can cause data leakage if misused |

---

💡 **Next Step:**  
Let’s now implement these encoding techniques on a simple `"Color"` dataset in Python 🐍 and see how each one transforms the data 🔍📊


In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

**See More:**
[One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

In [None]:
# ---------------------------------------------------------
# 🧩 Step 1 — Create a simple DataFrame
# ---------------------------------------------------------
# Let's create a small sample dataset with a single categorical column 'color'
color_df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

# Display the original dataset
display(color_df.head())

# ---------------------------------------------------------
# 🧠 Step 2 — Create an instance of OneHotEncoder
# ---------------------------------------------------------
# The OneHotEncoder from scikit-learn will convert each category
# (like 'red', 'blue', 'green') into a separate binary column (1 or 0)
encoder = OneHotEncoder()

# ---------------------------------------------------------
# ⚙️ Step 3 — Fit and Transform the data
# ---------------------------------------------------------
# fit_transform() does two things:
#   1. 'fit' → Learns the unique categories from the 'color' column
#   2. 'transform' → Converts them into a one-hot encoded (binary) format
# The output is a NumPy array (sparse matrix converted to dense using .toarray())
encoded = encoder.fit_transform(color_df[['color']]).toarray()

# ---------------------------------------------------------
# ℹ️ Note:
# OneHotEncoder sorts categories **alphabetically** by default.
# So even though 'red' appears first in the data,
# the columns will be ordered as ['blue', 'green', 'red']
# This explains why 'red' corresponds to the **3rd column** in the output.

# ---------------------------------------------------------
# 🧱 Step 4 — Convert encoded array into a DataFrame
# ---------------------------------------------------------
# encoder.get_feature_names_out() returns column names in the format 'color_<category>'
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

# Display the encoded DataFrame
display(encoder_df)

# ---------------------------------------------------------
# 🧩 Step 5 — Encode a new sample
# ---------------------------------------------------------
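# You can use .transform() on new data to get the same encoding pattern.
# Passing a DataFrame with the same column name matches what the encoder was fitted on.
# Example: encoding 'blue'
print("Encoding for 'blue':")
print(encoder.transform(pd.DataFrame({'color': ['blue']})).toarray())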

# ---------------------------------------------------------
# 🧮 Step 6 — Combine original and encoded DataFrames
# ---------------------------------------------------------
# Concatenate the original and encoded columns side by side for comparison
final_df = pd.concat([color_df, encoder_df], axis=1)

# Display the final combined DataFrame
display(final_df)


|  | color |
|---|-------|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |


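|  | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |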


Encoding for 'blue':
[[1. 0. 0.]]




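|  | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | blue | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |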
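
For quick exploration, a similar result can be obtained with `pandas` alone. This is an alternative sketch, not the approach used in the cell above: `pd.get_dummies` does not remember the categories it saw, so it is less convenient than a fitted `OneHotEncoder` when new data must be encoded consistently.

```python
import pandas as pd

color_df = pd.DataFrame({
    "color": ["red", "blue", "green", "green", "red", "blue"]
})

# Create one 0/1 column per category; columns come out in sorted category order.
dummies = pd.get_dummies(color_df, columns=["color"], dtype=int)
print(dummies)
```

The resulting columns are named `color_blue`, `color_green`, and `color_red`, the same layout the encoder produced above.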


In [None]:
# ---------------------------------------------------------
# 📘 Step 0 — Import Required Libraries
# ---------------------------------------------------------
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display

# ---------------------------------------------------------
# 🧩 Step 1 — Load the Dataset
# ---------------------------------------------------------
# Let's load the 'tips' dataset from seaborn.
# This dataset includes categorical columns like:
# - 'sex' (Male/Female)
# - 'smoker' (Yes/No)
# - 'day' (Thur/Fri/Sat/Sun)
# - 'time' (Lunch/Dinner)
tips_data = sns.load_dataset('tips')

# Display the first few rows of the dataset
display(tips_data.head())

# ---------------------------------------------------------
# 🧠 Step 2 — Create an instance of OneHotEncoder
# ---------------------------------------------------------
# The OneHotEncoder from scikit-learn will convert each category of a column
# (like 'Male' and 'Female' in 'sex') into a separate binary column (1 or 0)
encoder = OneHotEncoder()

# ---------------------------------------------------------
# ⚙️ Step 3 — Fit and Transform the Data
# ---------------------------------------------------------
# We'll apply One-Hot Encoding to the 'sex' column.
# fit_transform() does two things:
#   1️⃣ 'fit' → Learns the unique categories from the column ('Male', 'Female')
#   2️⃣ 'transform' → Converts them into a one-hot encoded (binary) matrix
# The result is a sparse matrix, which we convert to a dense NumPy array using `.toarray()`
encoded_tips_data = encoder.fit_transform(tips_data[['sex']]).toarray()

# ---------------------------------------------------------
# ℹ️ Note:
# OneHotEncoder sorts categories **alphabetically** by default.
# So even if 'Male' appears first in the data,
# the columns will be ordered as ['Female', 'Male'] alphabetically.
# This is why 'Male' corresponds to the **second column** in the encoded matrix.

# ---------------------------------------------------------
# 🧱 Step 4 — Convert Encoded Array into a DataFrame
# ---------------------------------------------------------
# encoder.get_feature_names_out() returns column names like 'sex_Female', 'sex_Male'
encoded_tips_data_df = pd.DataFrame(
    encoded_tips_data, 
    columns=encoder.get_feature_names_out()
)

# Display the encoded version of the 'sex' column
display(encoded_tips_data_df.head())

# ---------------------------------------------------------
# 🧩 Step 5 — Combine Original and Encoded Data
# ---------------------------------------------------------
# Let's merge the encoded columns back with the original DataFrame for comparison
tips_encoded_final = pd.concat([tips_data, encoded_tips_data_df], axis=1)

# Display a few rows to compare before vs after encoding
display(tips_encoded_final[['sex', 'sex_Female', 'sex_Male']].head())

# ---------------------------------------------------------
# 🧩 Step 6 — Encode a New Sample 
# ---------------------------------------------------------
# To encode new data safely, pass it as a DataFrame with the same column name.
new_sample = pd.DataFrame({'sex': ['Male']})
print("Encoding for 'Male':")
print(encoder.transform(new_sample).toarray())


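|  | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |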


|  | sex_Female | sex_Male |
|---|------------|----------|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 |


|  | sex | sex_Female | sex_Male |
|---|-----|------------|----------|
| 0 | Female | 1.0 | 0.0 |
| 1 | Male | 0.0 | 1.0 |
| 2 | Male | 0.0 | 1.0 |
| 3 | Male | 0.0 | 1.0 |
| 4 | Female | 1.0 | 0.0 |


Encoding for 'Male':
[[0. 1.]]


**See More:**
[Label Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [30]:
from sklearn.preprocessing import LabelEncoder

In [35]:
# ---------------------------------------------------------
# 🧩 Step 1 — Create a Simple DataFrame
# ---------------------------------------------------------
# Let's create a small sample dataset with one categorical column: 'color'
# This column contains three unique categories: 'red', 'blue', and 'green'
color_df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

# Display the original dataset
display(color_df.head())

# ---------------------------------------------------------
# 🧠 Step 2 — Initialize the LabelEncoder
# ---------------------------------------------------------
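# LabelEncoder converts each category (string) into a unique numeric code.
# It assigns integers from 0 to n-1, based on the sorted order of the categories.
# Note: scikit-learn documents LabelEncoder as a tool for target labels (y);
# it is applied to a feature column here for illustration.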
label_encoder = LabelEncoder()

# ---------------------------------------------------------
# ⚙️ Step 3 — Fit and Transform the 'color' column
# ---------------------------------------------------------
# The 'fit' step learns the unique categories and assigns numeric labels.
# The 'transform' step replaces each original category with its corresponding label.
encoded_colors = label_encoder.fit_transform(color_df['color'])

# Display the encoded numeric values
print("Encoded Values:", encoded_colors)

# ---------------------------------------------------------
# ℹ️ Note:
# LabelEncoder automatically sorts categories **alphabetically** before assigning numbers.
# So, the mapping will be:
#   'blue'  → 0
#   'green' → 1
#   'red'   → 2

# Let's verify the mapping:
print("\nClass Mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

# ---------------------------------------------------------
# 🧩 Step 4 — Add Encoded Column to the Original DataFrame
# ---------------------------------------------------------
color_df['color_encoded'] = encoded_colors

# Display the updated DataFrame
display(color_df)

# ---------------------------------------------------------
# 🧮 Step 5 — Encode a New Sample
# ---------------------------------------------------------
# You can transform new data using the same fitted encoder.
# ⚠️ Make sure the new value exists in the fitted categories, otherwise it will raise an error.
new_encoded = label_encoder.transform(['red'])

print("\nEncoding for 'red':", new_encoded)


|  | color |
|---|-------|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |


Encoded Values: [2 0 1 1 2 0]

Class Mapping: {'blue': np.int64(0), 'green': np.int64(1), 'red': np.int64(2)}


|  | color | color_encoded |
|---|-------|---------------|
| 0 | red | 2 |
| 1 | blue | 0 |
| 2 | green | 1 |
| 3 | green | 1 |
| 4 | red | 2 |
| 5 | blue | 0 |



Encoding for 'red': [2]


### 💡 Key Takeaways
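- `LabelEncoder` assigns integer labels based on the sorted (alphabetical) order of the categories, not on any real-world ranking.
- It's simple and compact, but it only suits ordinal data when the alphabetical order happens to match the true order; otherwise define the order explicitly (for example with `OrdinalEncoder`).
- ⚠️ For nominal data (like colors or cities), it may mislead models into assuming order or magnitude that doesn't exist.
- Always verify your category mapping using: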

In [36]:
label_encoder.classes_


array(['blue', 'green', 'red'], dtype=object)

### ⚠️ Drawback of Label Encoding — False Sense of Order

While **Label Encoding** is simple and memory-efficient, it can unintentionally introduce a **false numerical relationship** between categories.

---

#### 🧩 The Problem

LabelEncoder assigns **integer values** to categories based on their **alphabetical order**.

Example:

| Color | Encoded |
|--------|----------|
| Blue | 0 |
| Green | 1 |
| Red | 2 |

Now, the machine learning model “sees” this as:

$$
\text{Blue} < \text{Green} < \text{Red}
$$

This implies that:
- “Red” is **greater than** “Green”
- “Green” is **greater than** “Blue”

➡️ But in reality, **colors have no natural order** — these numeric labels **don’t carry any real-world meaning**.

---

#### 💣 Why This is Dangerous

1. **Models may assume hierarchy:**  
   Algorithms like **Linear Regression** or **SVM** may treat these numeric labels as having a mathematical relationship.

   Example:  
   The model might think that “Red (2)” has *twice the impact* of “Green (1)”, which is meaningless for nominal data.

2. **Distorted feature importance:**  
   The model might assign higher or lower weights based purely on label numbers, leading to **biased learning**.

3. **Misleading distance calculations:**  
   Models that rely on distance (e.g., KNN) will compute distances between labels as if they were ordered numbers.
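
The distance issue can be illustrated with a small sketch in plain Python. The mappings below mirror the Blue/Green/Red table above; the helper names `label_distance` and `one_hot_distance` are illustrative only.

```python
import math

# Label-encoded values, as in the table above.
label = {"Blue": 0, "Green": 1, "Red": 2}

# One-hot vectors for the same three colors.
one_hot = {"Blue": (1, 0, 0), "Green": (0, 1, 0), "Red": (0, 0, 1)}

def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label[a] - label[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return math.dist(one_hot[a], one_hot[b])

for pair in [("Blue", "Green"), ("Blue", "Red"), ("Green", "Red")]:
    print(pair, label_distance(*pair), round(one_hot_distance(*pair), 3))
```

With label codes, Blue appears twice as far from Red as from Green. With one-hot vectors, every pair of distinct colors is equally far apart.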

---

#### 🧠 In Simple Words

> Label Encoding gives every category a number —  
> but the model might think those numbers mean **ranking** or **priority**,  
> even when they don’t.

---

#### ✅ When It’s Safe to Use
Use Label Encoding only when the categories have a **clear, natural order** (ordinal data) and the assigned integers actually follow that order.

Examples:
- Education: High School < Bachelor < Master < PhD  
- Size: Small < Medium < Large  

For **unordered (nominal)** data like colors, cities, or countries, prefer **One-Hot Encoding** instead.
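
Even for ordered data, it is worth checking the mapping. The sketch below, which uses a hypothetical `sizes` list, shows that `LabelEncoder` sorts the categories alphabetically, so the codes do not follow small < medium < large. An explicit mapping is shown alongside for comparison.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "medium"]

# LabelEncoder sorts the unique values, so 'large' comes first.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(sizes)
mapping = {label: int(code) for code, label in enumerate(label_encoder.classes_)}
print("LabelEncoder mapping:", mapping)

# An explicit mapping keeps the natural order small < medium < large.
natural_order = {"small": 0, "medium": 1, "large": 2}
ordered_codes = [natural_order[size] for size in sizes]
print("Explicit mapping:   ", ordered_codes)
```

Here `large` receives code 0 and `small` code 2, the reverse of the intended ranking.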


**See More:**
[Ordinal Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)

In [49]:
from sklearn.preprocessing import OrdinalEncoder

In [50]:
# ---------------------------------------------------------
# 🧩 Step 1 — Create a Simple DataFrame
# ---------------------------------------------------------
# Let's create a small dataset with a categorical column 'size'
# This column has an inherent order: small < medium < large
size_data = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

# Display the original dataset
display(size_data.head())

# ---------------------------------------------------------
# 🧠 Step 2 — Initialize the OrdinalEncoder
# ---------------------------------------------------------
# The OrdinalEncoder converts categories into ordered numbers.
# We explicitly define the order using the 'categories' parameter.
# If not provided, it defaults to alphabetical order, which might not be correct.
size_data_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# ---------------------------------------------------------
# ⚙️ Step 3 — Fit and Transform
# ---------------------------------------------------------
# 'fit' learns the category order, and 'transform' converts them to numbers
encoded_size = size_data_encoder.fit_transform(size_data[['size']])

# Display encoded values as a new DataFrame
encoded_size_df = pd.DataFrame(encoded_size, columns=['size_encoded'])
display(pd.concat([size_data, encoded_size_df], axis=1))

# ---------------------------------------------------------
# ℹ️ Note:
# Here, we defined the order as:
#   small → 0
#   medium → 1
#   large → 2
#
# This order is preserved in the encoding.

# ---------------------------------------------------------
# 🧮 Step 4 — Encode a New Sample
# ---------------------------------------------------------
# To encode new data, pass it as a DataFrame (2D array-like structure)
new_sample = pd.DataFrame({'size': ['small']})
new_encoded = size_data_encoder.transform(new_sample)

print("Encoding for 'small':", new_encoded)


|  | size |
|---|------|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |


|  | size | size_encoded |
|---|------|--------------|
| 0 | small | 0.0 |
| 1 | medium | 1.0 |
| 2 | large | 2.0 |
| 3 | medium | 1.0 |
| 4 | small | 0.0 |
| 5 | large | 2.0 |


Encoding for 'small': [[0.]]
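
New data may contain a category the encoder never saw. As a sketch, `OrdinalEncoder` can be told to map such values to a placeholder code through its `handle_unknown` and `unknown_value` parameters. The value `-1` and the category `'extra-large'` are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"size": ["small", "medium", "large", "medium"]})

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoder.fit(train)

new_data = pd.DataFrame({"size": ["large", "extra-large"]})
codes = encoder.transform(new_data)
print(codes)
```

Whether a placeholder code is appropriate depends on the model; the default behaviour of raising an error can be the safer choice when unknown values signal a data problem.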


### 💡 Key Points to Remember — Ordinal Encoding

---

- **Ordinal Encoding** assigns numbers to categories that have a **meaningful order**.  
  This helps the model understand that one category is greater or smaller than another.

---

#### 🧩 Example

For the feature `"size"`:

$$
\text{small} < \text{medium} < \text{large}
$$

The encoded values might look like:

| size | encoded |
|-------|----------|
| small | 0 |
| medium | 1 |
| large | 2 |

---

#### 🧠 Model Interpretation

Now, models can **interpret the order correctly**,  
since higher numbers represent a “greater” or “larger” value.

For example:
- `large (2)` > `medium (1)` > `small (0)`

This makes sense for ordered (ordinal) categories such as **sizes, education levels, or ratings**.

---

#### ⚠️ Important Caution

Do **not** use Ordinal Encoding for **unordered (nominal)** data like:
- Colors (Red, Blue, Green)
- Cities (Delhi, London, Tokyo)
- Animal Types (Cat, Dog, Fish)

Because the model may incorrectly assume an **order** or **magnitude** between categories.

---

#### 🧮 Tip

You can always check the order learned by the encoder using:

```python
size_data_encoder.categories_
```

In [48]:
size_data_encoder.categories_

[array(['small', 'medium', 'large'], dtype=object)]

**See More:**
[Target Guided Ordinal Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html)

In [54]:
from sklearn.preprocessing import TargetEncoder

In [61]:
# ---------------------------------------------------------
# 🧩 Step 1 — Create a Simple DataFrame
# ---------------------------------------------------------
# Let's create a dataset with two columns:
# 'city'  → categorical variable
# 'price' → numeric target variable
location_data = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Display the original dataset
display(location_data)

# ---------------------------------------------------------
# 🧠 Step 2 — Compute Mean of Target Variable per Category
# ---------------------------------------------------------
# We calculate the mean 'price' for each unique city.
# This gives us an idea of how each city is associated with the target.
mean_price = location_data.groupby('city')['price'].mean().to_dict()

# Display the mean target value for each city
display(mean_price)

# ---------------------------------------------------------
# ⚙️ Step 3 — Encode the Categories using Target Mean
# ---------------------------------------------------------
# Replace each city with its corresponding mean price.
# This creates an encoding that reflects the target variable's behavior.
location_data['city_encoded'] = location_data['city'].map(mean_price)

# Display the encoded dataset
display(location_data)

# ---------------------------------------------------------
# 🧩 Step 4 — Compare 'price' and Encoded Values
# ---------------------------------------------------------
# We can now observe how encoding preserves the target relationship.
display(location_data[['price', 'city_encoded']])


|  | city | price |
|---|------|-------|
| 0 | New York | 200 |
| 1 | London | 150 |
| 2 | Paris | 300 |
| 3 | Tokyo | 250 |
| 4 | New York | 180 |
| 5 | Paris | 320 |


{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

|  | city | price | city_encoded |
|---|------|-------|--------------|
| 0 | New York | 200 | 190.0 |
| 1 | London | 150 | 150.0 |
| 2 | Paris | 300 | 310.0 |
| 3 | Tokyo | 250 | 250.0 |
| 4 | New York | 180 | 190.0 |
| 5 | Paris | 320 | 310.0 |


|  | price | city_encoded |
|---|-------|--------------|
| 0 | 200 | 190.0 |
| 1 | 150 | 150.0 |
| 2 | 300 | 310.0 |
| 3 | 250 | 250.0 |
| 4 | 180 | 190.0 |
| 5 | 320 | 310.0 |


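Each city is replaced with its average price.  
For instance:
- Paris → mean price = 310
- New York → mean price = 190
- London → mean price = 150

This captures how each city correlates with the target variable. Ranking these means (London < New York < Tokyo < Paris) would give the ordinal codes described in the Target Guided Ordinal Encoding section.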

### ⚠️ Important Note (Data Leakage Warning)
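If you perform this encoding before splitting your dataset into train and test sets, you'll leak target information from the test set into training, which leads to overly optimistic evaluation and overfitting. Compute the category means on the training set only, then map them onto the test set.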
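
A leakage-aware version of the manual encoding might look like the sketch below. The extra rows, the fixed split (first six rows for training), and the fallback to the overall training mean for unseen cities are all assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris",
             "London", "Berlin"],
    "price": [200, 150, 300, 250, 180, 320, 170, 210],
})

# Hypothetical split: first six rows for training, last two for testing.
train = data.iloc[:6].copy()
test = data.iloc[6:].copy()

# Learn the mapping from the training rows only.
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

train["city_encoded"] = train["city"].map(city_means)
# Test rows reuse the training means; unseen cities fall back to the training mean.
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

The test prices (170 and 210) never influence the encoding: London gets the training mean of 150, and Berlin, which was not seen in training, gets the overall training mean.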

### 🧩 Using `category_encoders` Library for Target Guided Encoding

While we can manually perform Target Guided Encoding using `pandas` (`groupby` + `map`),  
the **`category_encoders`** library provides a ready-made transformer called **`TargetEncoder`**.

This encoder follows the same `fit` / `transform` interface as `sklearn` transformers, so the mapping can be learned on training data and then applied to new data.

---

#### ⚙️ Installation
If you don’t already have it installed, run:

```bash
pip install category_encoders
```

In [62]:
!pip install category_encoders




In [64]:
import category_encoders as ce

# Step 1️⃣ — Create the encoder
# Specify which column(s) to encode using the 'cols' parameter.
encoder = ce.TargetEncoder(cols=['city'])

# Step 2️⃣ — Fit the encoder on feature (X) and target (y)
# It learns a smoothed average of the target for each category in 'city'
# (the category mean blended with the overall mean) and encodes accordingly.
encoded_df = encoder.fit_transform(location_data['city'], location_data['price'])

# Step 3️⃣ — View the encoded output
display(encoded_df.head())


|  | city |
|---|------|
| 0 | 227.186454 |
| 1 | 222.49096 |
| 2 | 244.208582 |
| 3 | 235.501808 |
| 4 | 227.186454 |


### 💡 What Happens Internally

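- The encoder computes the **mean of the target (`price`)** for each category (`city`).  
- It then **blends each category mean with the overall target mean** (here about 233.33) using smoothing, so categories with few samples are pulled toward the overall mean. This is why the output above shows values such as 227.19 for New York rather than the raw mean of 190.  
- It follows the `fit` / `transform` pattern, so it works with **train/test splits** and **cross-validation**. To avoid **data leakage**, fit it on the training data only and use `transform` on validation or test data.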
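
The encoded values above differ from the raw per-city means, which points to smoothing. The sketch below assumes a sigmoid-weighted blend of the category mean and the overall mean, with `min_samples_leaf=20` and `smoothing=10` (believed to be the defaults in recent `category_encoders` releases; check your installed version). The helper `smoothed_target_means` is hypothetical and is not part of the library.

```python
import math
from collections import defaultdict

cities = ["New York", "London", "Paris", "Tokyo", "New York", "Paris"]
prices = [200, 150, 300, 250, 180, 320]

def smoothed_target_means(categories, targets, min_samples_leaf=20, smoothing=10.0):
    """Blend each category mean with the overall mean using a sigmoid weight."""
    prior = sum(targets) / len(targets)  # overall target mean

    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)

    encoded = {}
    for category, values in groups.items():
        n = len(values)
        category_mean = sum(values) / n
        # The weight grows toward 1 as the category has more samples.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoded[category] = prior * (1.0 - weight) + category_mean * weight
    return encoded

encoding = smoothed_target_means(cities, prices)
for city, value in encoding.items():
    print(f"{city}: {value:.6f}")
```

Under these assumptions the sketch gives approximately 227.19 for New York and 222.49 for London, in line with the encoder output shown above. With so few samples per city, the weight on the category mean is small, so every value stays close to the overall mean of about 233.33.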
