# **11. Advanced Features & Optimization**

## ⚡ 3. **Performance Optimization**

import pandas as pd

## 📌 1. **Purpose & When to Use It**

### 🔍 **Purpose**:

Performance optimization in pandas refers to techniques aimed at:

* **Speeding up** data processing
* **Reducing memory usage**
* **Avoiding inefficient patterns** like loops or excessive function calls

### 📅 **When to Use It**:

* When you're working with **large datasets** (millions of rows)
* When code is running **slow or crashing** due to memory overload
* In **production pipelines** that must be fast and efficient
* During **exploratory data analysis (EDA)** where interactive performance matters

## 🧠 2. **Different Methods and Techniques**

| Technique                            | Description                                                           |
| ------------------------------------ | --------------------------------------------------------------------- |
| ✅ **Vectorization**                  | Use pandas/NumPy operations instead of Python loops                   |
| ✅ **Use of `np.where`, `np.select`** | Vectorized conditional logic                                          |
| ✅ **Use `.query()` & `.eval()`**     | Readable row filtering & expressions; can be faster on large frames   |
| ✅ **Use `.memory_usage()`**          | Inspect the memory footprint to find columns worth optimizing         |
| ✅ **Categorical data types**         | Reduce memory for repetitive string/object columns                    |
| ✅ **Avoid unnecessary copies**       | Chain methods rather than storing intermediate DataFrames             |
| ✅ **Downcasting**                    | Convert large numeric types to smaller equivalents                    |
| ✅ **Chunking**                       | Process large files in small batches using `chunksize` in `read_csv`  |


## 🧪 3. **Examples with Code**

### 🔹 a) ✅ **Vectorized Operation vs. Loop**

```python
# Inefficient: Python-level loop (list comprehension)
df['squared'] = [x**2 for x in df['value']]

# Efficient: Vectorized
df['squared'] = df['value'] ** 2
```
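
To see the difference on your own machine, the two approaches can be timed side by side. This is a minimal sketch on made-up data: the column name `value` follows the snippet above, the row count is arbitrary, and the timings will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical sample data; int64 is set explicitly so squaring cannot overflow.
df = pd.DataFrame({"value": np.arange(200_000, dtype="int64")})

# Python-level loop: every element is squared by the interpreter.
start = time.perf_counter()
loop_result = pd.Series([x ** 2 for x in df["value"]])
loop_seconds = time.perf_counter() - start

# Vectorized: the whole column is squared in one NumPy operation.
start = time.perf_counter()
vectorized_result = df["value"] ** 2
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

Both versions produce the same values; only the time taken differs.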

---

### 🔹 b) ✅ **Using `np.where()` for fast conditional logic**

```python
import numpy as np
df['flag'] = np.where(df['amount'] > 1000, 'High', 'Low')
```
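
`np.where` handles a single condition. For several conditions, `np.select` takes a list of conditions and a matching list of choices, and the first condition that matches wins. A small sketch, where the `tier` column, the thresholds, and the labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"amount": [50, 700, 1500, 12000]})

# Conditions are checked in order, so the strictest one goes first.
conditions = [
    df["amount"] > 10000,
    df["amount"] > 1000,
]
choices = ["Very High", "High"]

# Rows matching no condition receive the default label.
df["tier"] = np.select(conditions, choices, default="Low")
print(df)
```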

---

### 🔹 c) ✅ **Using `.query()` for readable & fast filtering**

```python
df_filtered = df.query("Region == 'East' and Sales > 5000")
```
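
The same filter can be written with a boolean mask, and `.eval()` applies the same string syntax to column arithmetic. The sketch below uses made-up data. The `Cost` and `Profit` columns are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Region": ["East", "West", "East"],
    "Sales": [6000, 7000, 3000],
    "Cost": [4000, 6500, 1000],
})

# String expression and boolean mask select the same rows.
df_filtered = df.query("Region == 'East' and Sales > 5000")
mask_filtered = df[(df["Region"] == "East") & (df["Sales"] > 5000)]

# .eval() evaluates a column expression and returns a Series.
df["Profit"] = df.eval("Sales - Cost")
print(df_filtered)
print(df["Profit"])
```

On small frames like this one the boolean mask is usually at least as fast. Any speed advantage of `.query()` and `.eval()` tends to show up only on large frames.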

---

### 🔹 d) ✅ **Converting column to `category` to save memory**

```python
df['ProductType'] = df['ProductType'].astype('category')
```
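
How much memory the conversion saves depends on how repetitive the column is. A quick way to check is to measure before and after, as in this sketch with made-up product labels (four distinct values repeated across 100,000 rows):

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct labels, 100,000 rows.
df = pd.DataFrame({"ProductType": ["Book", "Toy", "Game", "Food"] * 25_000})

before_bytes = df["ProductType"].memory_usage(deep=True)
df["ProductType"] = df["ProductType"].astype("category")
after_bytes = df["ProductType"].memory_usage(deep=True)

print(f"before: {before_bytes:,} bytes, after: {after_bytes:,} bytes")
```

A column where nearly every value is unique gains little from `category` and may even use more memory, so it is worth measuring first.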

---

### 🔹 e) ✅ **Inspecting memory usage**

```python
df.memory_usage(deep=True)
```

---

### 🔹 f) ✅ **Downcasting numeric columns**

```python
df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')
```
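
The sketch below shows what downcasting does to the dtypes. The sample values are assumptions, chosen so that the integers fit in `int8` and the floats are exactly representable in `float32`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data stored in the default 64-bit types.
df = pd.DataFrame({
    "int_column": np.array([1, 50, 120], dtype="int64"),
    "float_column": np.array([0.5, 1.25, 3.0], dtype="float64"),
})

# to_numeric picks the smallest dtype that can hold every value.
df["int_column"] = pd.to_numeric(df["int_column"], downcast="integer")
df["float_column"] = pd.to_numeric(df["float_column"], downcast="float")

print(df.dtypes)
```

Downcasting floats to `float32` reduces precision to roughly seven significant digits, so it suits columns where that loss is acceptable.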

---

### 🔹 g) ✅ **Reading large files in chunks**

```python
chunk_iter = pd.read_csv('big_data.csv', chunksize=100000)

for chunk in chunk_iter:
    process(chunk)  # Your logic here
```
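
In practice, the per-chunk logic usually computes a partial result that is combined across chunks. The sketch below sums sales per region. It assumes a tiny in-memory CSV in place of `big_data.csv`, and the column names and chunk size are made up so the example runs as-is.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = "region,sales\n" + "\n".join(
    f"{'East' if i % 2 == 0 else 'West'},{i}" for i in range(10)
)

totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Aggregate within the chunk, then fold into the running totals.
    partial = chunk.groupby("region")["sales"].sum()
    for region, subtotal in partial.items():
        totals[region] = totals.get(region, 0) + int(subtotal)

print(totals)
```

Only one chunk is held in memory at a time, so peak usage depends on `chunksize` rather than on the size of the file.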


## ⚡ 4. **Performance Considerations**

| Strategy                 | Benefit                                                                      |
| ------------------------ | ---------------------------------------------------------------------------- |
| Vectorization            | Typically far faster than Python-level loops                                 |
| Categorical types        | Can save 50–90% memory on repetitive (low-cardinality) string/object columns |
| `np.where` / `np.select` | Much faster than row-wise `apply()`                                          |
| `.query()` / `.eval()`   | Better readability; possible speed-up on large frames (uses `numexpr` if installed) |
| Chunking                 | Prevents memory crashes when loading large files                             |
| Downcasting              | Reduces DataFrame memory footprint                                           |


## ⚠️ 5. **Common Pitfalls & Mistakes**

1. ❌ **Using loops (`for`) over DataFrame rows** – very slow
2. ❌ **Applying complex logic inside `apply()` instead of vectorizing**
3. ❌ **Forgetting to downcast float/int columns after import**
4. ❌ **Keeping repetitive strings as `object` instead of `category`**
5. ❌ **Overusing `copy()` or storing too many intermediate DataFrames**


## ✅ 6. **Best Practices**

* ✅ Always prefer **vectorized operations** (avoid row-wise loops)
* ✅ Use **NumPy functions** where possible (`np.where`, `np.select`, `np.log`)
* ✅ Convert string/object columns to **`category`** when repetitive
* ✅ Monitor memory with `.memory_usage(deep=True)`
* ✅ Use **downcasting** for numeric types when full precision isn’t needed
* ✅ Use `.query()` and `.eval()` for readable filtering, and measure whether they are actually faster on your data
* ✅ Break down large files into **chunks** for processing
* ✅ Use `.loc[]` or `.iloc[]` rather than `apply()` for conditional assignments
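
The last point can be sketched as follows: a boolean mask with `.loc[]` assigns the same labels as a row-wise `apply()`, without calling a Python function for every row. The data and the column names `flag` and `flag_apply` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"amount": [200, 1500, 900, 4000]})

# Row-wise apply: calls the lambda once per row (slow on large frames).
df["flag_apply"] = df.apply(
    lambda row: "High" if row["amount"] > 1000 else "Low", axis=1
)

# Boolean mask with .loc[]: set a default, then overwrite the matching rows.
df["flag"] = "Low"
df.loc[df["amount"] > 1000, "flag"] = "High"

print(df)
```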


## 💼 7. **Use Cases in Real Projects**

| Domain               | Use Case                                                                      |
| -------------------- | ----------------------------------------------------------------------------- |
| 💳 Finance           | Downcasting portfolio data to float32 where the lost precision is acceptable  |
| 📦 Logistics         | Converting location codes to `category` to reduce memory usage                |
| 🏥 Healthcare        | Filtering and transforming large patient datasets using `.query()`            |
| 🛍️ E-commerce       | Replacing expensive `apply()` calls with `np.where()` for faster segmentation |
| 📊 Big Data Analysis | Reading 10M+ rows CSV files using `chunksize` to avoid memory overflow        |
| 🤖 ML Pipelines      | Optimizing memory during data preparation and feature transformation          |


## ✅ Summary

Performance optimization in pandas is essential when handling large datasets or building efficient pipelines. Through **vectorization**, **memory profiling**, and **method selection**, you can make your workflows significantly faster and more memory-efficient.

