# **11. Advanced Features & Optimization**

## ⚡ 3. **Performance Optimization**

```python
import pandas as pd
```

## 📌 1. **Purpose & When to Use It**

### 🔍 **Purpose**:

Performance optimization in pandas refers to techniques aimed at:

* **Speeding up** data processing
* **Reducing memory usage**
* **Avoiding inefficient patterns** like loops or excessive function calls

### 📅 **When to Use It**:

* When you're working with **large datasets** (millions of rows)
* When code is running **slow or crashing** due to memory overload
* In **production pipelines** that must be fast and efficient
* During **exploratory data analysis (EDA)** where interactive performance matters

## 🧠 2. **Different Methods and Techniques**

| Technique                       | Description                                                                  |
| ------------------------------- | ---------------------------------------------------------------------------- |
| ✅ **Vectorization**             | Use pandas/NumPy operations instead of Python loops                          |
| ✅ **`np.where`, `np.select`**   | Efficient conditional logic                                                  |
| ✅ **`.query()` & `.eval()`**    | Readable row filtering and expressions; can be faster on large frames        |
| ✅ **`.memory_usage()`**         | Inspect the memory footprint to find columns worth optimizing                |
| ✅ **Categorical data types**    | Reduce memory for repetitive string/object columns                           |
| ✅ **Avoid unnecessary copies**  | Chain methods; skip redundant `.copy()` calls and intermediate DataFrames    |
| ✅ **Downcasting**               | Convert large numeric types to smaller equivalents                           |
| ✅ **Chunking**                  | Process large files in small batches using `chunksize` in `read_csv`         |


## 🧪 3. **Examples with Code**

### 🔹 a) ✅ **Vectorized Operation vs. Loop**

```python
# Inefficient: element-by-element Python loop
df['squared'] = [x**2 for x in df['value']]

# Efficient: Vectorized
df['squared'] = df['value'] ** 2
```
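
To check the difference on your own machine, a rough timing sketch follows. The sample DataFrame, its size, and the helper names `with_loop` and `vectorized` are assumptions made for illustration; absolute timings depend on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data: 200,000 random floats in a 'value' column.
df = pd.DataFrame({"value": np.random.default_rng(0).random(200_000)})

def with_loop():
    # Squares each element in a Python-level loop.
    return pd.Series([x ** 2 for x in df["value"]], index=df.index)

def vectorized():
    # Squares the whole column in one NumPy-backed operation.
    return df["value"] ** 2

loop_time = timeit.timeit(with_loop, number=3)
vec_time = timeit.timeit(vectorized, number=3)
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")

# Both approaches should produce the same values.
same_result = bool(np.allclose(with_loop(), vectorized()))
```

Timing a small, representative sample like this is usually enough to decide whether a rewrite is worth it.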

---

### 🔹 b) ✅ **Using `np.where()` for fast conditional logic**

```python
import numpy as np
df['flag'] = np.where(df['amount'] > 1000, 'High', 'Low')
```
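
When there are more than two outcomes, `np.select` can take the place of nested `np.where` calls. The sketch below uses made-up data, thresholds, and labels purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the thresholds and labels are illustrative only.
df = pd.DataFrame({"amount": [250, 1000, 4800, 12000]})

conditions = [
    df["amount"] > 10_000,
    df["amount"] > 1_000,
]
choices = ["Very High", "High"]

# Conditions are evaluated in order; the first one that matches wins.
# Rows matching no condition receive the default.
df["tier"] = np.select(conditions, choices, default="Low")
print(df)
```

Because the first matching condition wins, the strictest condition should be listed first.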

---

### 🔹 c) ✅ **Using `.query()` for readable & fast filtering**

```python
df_filtered = df.query("Region == 'East' and Sales > 5000")
```
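
The techniques table also lists `.eval()`, which evaluates a string expression over columns. A minimal sketch, assuming a small hypothetical frame with `Region`, `Sales`, and `Cost` columns:

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "Region": ["East", "West", "East"],
    "Sales": [7000, 9000, 3000],
    "Cost": [4000, 5000, 2500],
})

# eval() with an assignment returns a new DataFrame containing the derived column.
df = df.eval("Profit = Sales - Cost")

# Local Python variables can be referenced inside query() with the @ prefix.
min_sales = 5000
east_high = df.query("Region == 'East' and Sales > @min_sales")
print(east_high)
```

On small frames like this one, the parsing overhead can outweigh any gain, so any speed benefit is something to measure rather than assume.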

---

### 🔹 d) ✅ **Converting column to `category` to save memory**

```python
df['ProductType'] = df['ProductType'].astype('category')
```
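
How much this saves depends on how often values repeat. The sketch below compares the footprint before and after conversion on a hypothetical column with only three distinct labels; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
products = pd.Series(["Laptop", "Phone", "Tablet"] * 100_000, name="ProductType")

bytes_as_strings = products.memory_usage(deep=True)
bytes_as_category = products.astype("category").memory_usage(deep=True)

print(f"strings:  {bytes_as_strings:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

A column where nearly every value is unique (IDs, free text) would see little or no benefit, because the category lookup table would be as large as the data itself.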

---

### 🔹 e) ✅ **Inspecting memory usage**

```python
df.memory_usage(deep=True)
```
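
As a sketch of how the output might be used, the example below compares the default (shallow) report with `deep=True` and totals the result in megabytes. The frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame mixing a numeric and a string column.
df = pd.DataFrame({
    "id": range(1000),
    "city": ["Springfield"] * 1000,
})

# Shallow: for object-dtype columns, only the pointers are counted.
shallow = df.memory_usage()
# Deep: also inspects the stored Python objects, such as string contents.
deep = df.memory_usage(deep=True)

total_mb = deep.sum() / 1024 ** 2
print(deep)
print(f"total: {total_mb:.3f} MB")
```

Sorting the deep report (`deep.sort_values(ascending=False)`) is a quick way to find the columns that are worth converting or downcasting first.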

---

### 🔹 f) ✅ **Downcasting numeric columns**

```python
df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')
```
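
To confirm that downcasting did something, compare dtypes and memory before and after. This is a minimal sketch on invented data whose values fit comfortably in narrower types.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are chosen to fit in narrower types without loss.
df = pd.DataFrame({
    "int_column": np.arange(1000, dtype="int64"),
    "float_column": np.arange(1000, dtype="float64") * 0.25,
})

before = df.memory_usage(deep=True).sum()

df["int_column"] = pd.to_numeric(df["int_column"], downcast="integer")
df["float_column"] = pd.to_numeric(df["float_column"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"{before:,} -> {after:,} bytes")
```

Keep in mind that `float32` carries roughly seven significant decimal digits, so downcasting floats is only appropriate when that precision is enough.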

---

### 🔹 g) ✅ **Reading large files in chunks**

```python
chunk_iter = pd.read_csv('big_data.csv', chunksize=100000)

for chunk in chunk_iter:
    process(chunk)  # Your logic here
```
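
One possible shape for the per-chunk logic is to aggregate each chunk and combine the partial results, so that only one chunk is in memory at a time. In the sketch below, an in-memory `StringIO` buffer stands in for a large file, and the `region` and `sales` columns are hypothetical.

```python
import io

import pandas as pd

# Stand-in for a large file on disk; a real pipeline would pass a file path.
csv_data = io.StringIO(
    "region,sales\nEast,100\nWest,250\nEast,50\nWest,25\nEast,10\n"
)

totals = pd.Series(dtype="float64")

# chunksize is tiny here only so that the loop runs more than once.
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial = chunk.groupby("region")["sales"].sum()
    # fill_value=0 handles regions missing from either side.
    totals = totals.add(partial, fill_value=0)

totals = totals.astype("int64")
print(totals)
```

This pattern works for aggregations that can be combined from partial results (sums, counts, min/max); statistics such as medians need a different approach.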


## ⚡ 4. **Performance Considerations**

| Strategy                 | Benefit                                                                  |
| ------------------------ | ------------------------------------------------------------------------ |
| Vectorization            | Typically far faster than Python-level loops                             |
| Categorical types        | Can save 50–90% memory on repetitive (low-cardinality) string columns    |
| `np.where` / `np.select` | Much faster than row-wise `apply()`                                      |
| `.query()` / `.eval()`   | Better readability; modest speed-up, mainly on large frames              |
| Chunking                 | Prevents memory crashes when loading large files                         |
| Downcasting              | Reduces DataFrame memory footprint                                       |


## ⚠️ 5. **Common Pitfalls & Mistakes**

1. ❌ **Using loops (`for`) over DataFrame rows**: very slow
2. ❌ **Applying complex logic inside `apply()` instead of vectorizing**
3. ❌ **Forgetting to downcast float/int columns after import**
4. ❌ **Keeping repetitive, low-cardinality strings as `object` instead of `category`**
5. ❌ **Overusing `copy()` or storing too many intermediate DataFrames**


## ✅ 6. **Best Practices**

* ✅ Prefer **vectorized operations** over row-wise loops
* ✅ Use **NumPy functions** where possible (`np.where`, `np.select`, `np.log`)
* ✅ Convert string/object columns to **`category`** when values repeat often
* ✅ Monitor memory with `.memory_usage(deep=True)`
* ✅ Use **downcasting** for numeric types when full precision isn't needed
* ✅ Use `.query()` and `.eval()` for readable filtering, and measure before counting on a speed-up
* ✅ Break down large files into **chunks** for processing
* ✅ Use `.loc[]` or `.iloc[]` rather than `apply()` for conditional assignments
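
The last point can be sketched as a boolean mask combined with `.loc[]`. The data and the 10% discount rule below are invented for illustration.

```python
import pandas as pd

# Hypothetical data; the discount rule is illustrative only.
df = pd.DataFrame({"amount": [500, 1500, 3000], "discount": [0.0, 0.0, 0.0]})

# Row-wise version that this replaces:
# df["discount"] = df.apply(lambda r: 0.1 if r["amount"] > 1000 else 0.0, axis=1)

# Build the condition once for the whole column, then assign to matching rows.
mask = df["amount"] > 1000
df.loc[mask, "discount"] = 0.1
print(df)
```

The mask is computed in one vectorized comparison, so no Python function is called per row.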


## 💼 7. **Use Cases in Real Projects**

| Domain               | Use Case                                                                               |
| -------------------- | -------------------------------------------------------------------------------------- |
| 💳 Finance           | Downcasting high-precision decimals to float32 where reduced precision is acceptable   |
| 📦 Logistics         | Converting location codes to `category` to reduce memory usage                         |
| 🏥 Healthcare        | Filtering and transforming large patient datasets using `.query()`                     |
| 🛍️ E-commerce        | Replacing expensive `apply()` calls with `np.where()` for faster segmentation          |
| 📊 Big Data Analysis | Reading CSV files of 10M+ rows using `chunksize` to avoid running out of memory        |
| 🤖 ML Pipelines      | Optimizing memory during data preparation and feature transformation                   |


## ✅ Summary

Performance optimization in pandas is essential when handling large datasets or building efficient pipelines. Through **vectorization**, **memory profiling**, and **method selection**, you can make your workflows significantly faster and more memory-efficient.

