# **11. Advanced Features & Optimization**

## 🧮 5. **Categorical Data for Optimization**

In [1]:
import pandas as pd

## 📌 1. **Purpose & When to Use It**

### 🎯 **Purpose**:

The `Categorical` data type in pandas is designed to **reduce memory use** and **speed up common operations** when a column contains **repeated string values or a limited set of unique values** (low cardinality). Instead of storing the raw string in every row, pandas stores each distinct value once and represents the column as small integer codes that point to those values.

### 📅 **When to Use It**:

* When columns contain **repeating string values** like country names, categories, regions, etc.
* When performing **grouping, sorting, filtering**, or **merging** operations on such fields
* When working with **string/object** columns whose values come from a small, mostly fixed set
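
One rough way to check these criteria is to compare the number of distinct values in a column with the number of rows. The sketch below is only illustrative: the helper name `category_candidates` and the 0.5 threshold are assumptions made for this example, not part of pandas or a documented rule.

```python
import pandas as pd

def category_candidates(df, max_unique_ratio=0.5):
    """Return names of object columns whose share of unique values is low.

    Hypothetical helper: the name and the default threshold are
    illustrative assumptions, not pandas API.
    """
    candidates = []
    for col in df.select_dtypes(include="object").columns:
        ratio = df[col].nunique() / len(df)
        if ratio <= max_unique_ratio:
            candidates.append(col)
    return candidates

# Made-up sample data; dtype=object mirrors the object columns used in this section
sample = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North", "South"],
    "order_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
}, dtype=object)

print(category_candidates(sample))
```

Here `region` repeats two values across six rows and qualifies, while `order_id` is unique per row and does not.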

## 🧠 2. **Different Methods and Techniques**

| Technique                         | Description                                      |
| --------------------------------- | ------------------------------------------------ |
| `.astype('category')`             | Convert object/string column to categorical type |
| `pd.Categorical()`                | Manually create a categorical object             |
| `df.select_dtypes()`              | Filter columns by type to target categoricals    |
| `df['col'].cat.categories`        | Access the list of categories                    |
| `df['col'].cat.codes`             | View underlying integer representation           |
| `.cat.set_categories()`           | Manually define or reorder categories            |
| `.cat.remove_unused_categories()` | Clean up unused levels                           |
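
Two of the techniques in the table, `select_dtypes()` and `.cat.set_categories()`, can be sketched as follows. The `df_demo` frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up demo frame with one categorical and one numeric column
df_demo = pd.DataFrame({
    "size": pd.Categorical(["small", "large", "small"]),
    "qty": [1, 2, 3],
})

# Select only the categorical columns
cat_cols = df_demo.select_dtypes(include="category").columns.tolist()

# Redefine the category list; any value absent from the new list would become NaN
resized = df_demo["size"].cat.set_categories(["small", "medium", "large"])

print(cat_cols)
print(list(resized.cat.categories))
print(resized.cat.codes.tolist())
```

Note that `set_categories()` returns a new Series, so the result has to be assigned back if the change should persist.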


## 🧪 3. **Examples with Code**

### 🔹 a) Basic conversion to category

In [3]:
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai']
})

display(df)
print(df.dtypes)

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Delhi
3,Chennai
4,Mumbai


City    object
dtype: object


In [5]:
df['City'] = df['City'].astype('category')
display(df)
print(df.dtypes)

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Delhi
3,Chennai
4,Mumbai


City    category
dtype: object


### 🔹 b) Access categories and codes

In [7]:
df['City'].cat.categories

Index(['Chennai', 'Delhi', 'Mumbai'], dtype='object')

In [9]:
display(df)
display(df['City'].cat.codes)

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Delhi
3,Chennai
4,Mumbai


0    1
1    2
2    1
3    0
4    2
dtype: int8

### 🔹 c) Manually define categorical values

In [13]:
df['City1'] = pd.Categorical(
    df['City'], categories=['Chennai', 'Delhi', 'Mumbai'], ordered=True
)

df

Unnamed: 0,City,City1
0,Delhi,Delhi
1,Mumbai,Mumbai
2,Delhi,Delhi
3,Chennai,Chennai
4,Mumbai,Mumbai
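
Passing `ordered=True` is what makes order-based operations available. A minimal sketch, using a made-up `sizes` Series where the order small < medium < large is meaningful:

```python
import pandas as pd

# Made-up data; the category list defines the order small < medium < large
sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
))

# Element-wise comparison against a category uses the declared order
print((sizes > "small").tolist())

# min/max and sorting also follow the declared order, not alphabetical order
print(sizes.min(), sizes.max())
print(sizes.sort_values().tolist())
```

On an unordered categorical the `>` comparison and `min()`/`max()` would raise a `TypeError` instead.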


### 🔹 d) Remove unused categories

In [14]:
df = df[df['City'] != 'Delhi']

df

Unnamed: 0,City,City1
1,Mumbai,Mumbai
3,Chennai,Chennai
4,Mumbai,Mumbai


In [17]:
df['City'].cat.remove_unused_categories()

1     Mumbai
3    Chennai
4     Mumbai
Name: City, dtype: category
Categories (2, object): ['Chennai', 'Mumbai']

In [16]:
# df is a filtered slice of the original frame, so build a new DataFrame
# with .assign() instead of writing into it (avoids SettingWithCopyWarning)
df = df.assign(City=df['City'].cat.remove_unused_categories())

df


Unnamed: 0,City,City1
1,Mumbai,Mumbai
3,Chennai,Chennai
4,Mumbai,Mumbai
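
Why this cleanup matters: a filtered categorical still remembers every original level, so summaries keep reporting the removed value. A small sketch with a standalone Series (the variable names are illustrative):

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"], dtype="category")
subset = cities[cities != "Delhi"]

# The filtered-out level is still listed, with a count of zero
counts_before = subset.value_counts().to_dict()

# After cleanup only the levels that actually occur remain
counts_after = subset.cat.remove_unused_categories().value_counts().to_dict()

print(counts_before)
print(counts_after)
```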


## ⚡ 4. **Performance Considerations**

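| Optimization              | Memory Impact                                                          | Performance / Usage Impact                           |
| ------------------------- | ---------------------------------------------------------------------- | ---------------------------------------------------- |
| Categorical vs object     | Large reduction for low-cardinality strings; varies with string length | Typically faster groupby and sorting                 |
| Encoding with `cat.codes` | Small integer codes (e.g. `int8` for a few categories)                 | Numeric values that ML libraries can consume         |
| Ordered categoricals      | Same storage as unordered categoricals                                 | Enables `<`, `>`, `min()`, `max()` on category order |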

### ✅ Benchmark Example:

```python
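df = pd.DataFrame({'Fruit': ['Apple']*1000000 + ['Banana']*1000000})
before = df['Fruit'].memory_usage(deep=True)   # bytes used with object dtype
df['Fruit'] = df['Fruit'].astype('category')
after = df['Fruit'].memory_usage(deep=True)    # bytes used with category dtype
# Measure rather than assume: the ratio depends on cardinality and string length
print(f"object: {before:,} B, category: {after:,} B, ratio: {before / after:.1f}x")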
```

## ⚠️ 5. **Common Pitfalls & Mistakes**

| Mistake                                              | Problem                                                                  |
| ---------------------------------------------------- | ------------------------------------------------------------------------ |
| ❌ Converting **high-cardinality** strings            | Little to no memory gain, may slow down                                  |
| ❌ Using `category` on columns that frequently change | Overhead in updating categories                                          |
| ❌ Not removing unused categories                     | Leads to bloated category lists                                          |
| ❌ Expecting the same gains on numeric columns        | Works on any dtype, but the largest savings come from repeated strings   |
| ❌ Forgetting to set `ordered=True` when needed       | Comparisons such as `<` and `min()`/`max()` raise `TypeError`            |
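
The first pitfall can be checked directly by measuring memory before and after conversion. The sketch below uses made-up data and a hypothetical helper named `saving`; exact figures will vary by pandas version and platform, so none are quoted here.

```python
import pandas as pd

n = 10_000
low = pd.Series(["Apple", "Banana"] * (n // 2), dtype=object)   # 2 unique values
high = pd.Series([f"id_{i}" for i in range(n)], dtype=object)   # every value unique

def saving(s):
    """Fraction of memory saved by converting s to category (can be negative)."""
    before = s.memory_usage(deep=True)
    after = s.astype("category").memory_usage(deep=True)
    return 1 - after / before

print(f"low cardinality:  {saving(low):.0%}")
print(f"high cardinality: {saving(high):.0%}")
```

With all-unique values the categorical has to store every string once plus an integer code per row, so the saving is close to zero or negative.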


## ✅ 6. **Best Practices**

* ✅ Use `category` for **low-cardinality text** columns (e.g., gender, product type)
* ✅ Convert early in your pipeline (right after loading)
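* ✅ For ML preprocessing, use `cat.codes` to get integer features; keep in mind that the codes imply an order only if the categorical is ordered, and that missing values are encoded as `-1`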
* ✅ Use `remove_unused_categories()` when subsetting the data
* ✅ Set `ordered=True` when doing ordered comparisons (e.g., small < medium < large)
* ✅ Avoid on columns with unique or near-unique values
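
Two of these practices, converting at load time and exporting codes for modelling, can be sketched together. The CSV text and column names below are made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Stand-in for a real file; the columns are hypothetical
csv_text = "gender,product_type,price\nF,book,12.5\nM,toy,8.0\n,book,3.2\n"

# Declare categorical columns at load time instead of converting afterwards
orders = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"gender": "category", "product_type": "category"},
)

# Integer codes for model input; the missing gender in row 3 is encoded as -1
gender_codes = orders["gender"].cat.codes

print(orders.dtypes)
print(gender_codes.tolist())
```

Declaring the dtype in `read_csv` means the full object column never has to be materialised first, which is the point of converting early.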


## 💼 7. **Use Cases in Real Projects**

| Domain                    | Use Case                                                   |
| ------------------------- | ---------------------------------------------------------- |
| 🛍️ E-commerce            | Product category, brand, user gender for profiling         |
| 🚚 Logistics              | Route zone, warehouse names (low cardinality strings)      |
| 🏥 Healthcare             | Diagnosis codes, departments, patient group types          |
| 📈 Finance                | Account type, transaction category for aggregation         |
| 🧪 ML Pipelines           | Convert target and input features to codes for model input |
| 📊 Surveys & Demographics | Age group, education level, region, etc.                   |


## ✅ Summary

The use of **categorical data** is one of the **most impactful memory optimization strategies** in pandas. It can reduce memory usage drastically for repeated string values and often **improves performance for grouping, filtering, and sorting**. It pays off most on **low-cardinality textual data**; on columns with unique or near-unique values it brings little benefit and can add overhead.

