# **11. Advanced Features & Optimization**

## 📉 4. **Memory Usage & Efficiency**


## 📌 1. **Purpose & When to Use It**

### 🎯 **Purpose**:

Memory usage optimization focuses on **reducing the RAM** consumed by pandas DataFrames, which is critical when working with **large datasets**. By controlling data types and memory representations, you can:

* Load large datasets without crashing
* Speed up operations due to smaller memory footprint
* Reduce computational cost and latency in production environments

### 📅 **When to Use It**:

* When your dataset has **millions of rows**
* When you're hitting **memory errors**
* Before **saving DataFrames to disk** in a typed binary format such as Parquet, where smaller dtypes can reduce file size
* When preparing **datasets for machine learning**

## 🧠 2. **Different Methods and Techniques**

| Technique                    | Description                                                        |
| ---------------------------- | ------------------------------------------------------------------ |
| `.memory_usage()`            | Inspect memory consumption of each column                          |
| `astype()`                   | Convert columns to more efficient data types                       |
| **Downcasting**              | Shrink numeric columns to the smallest type that holds the values (e.g., `int64` → `int8`, `float64` → `float32`) |
| **Categorical data**         | Optimize memory for repetitive strings/objects                     |
| `convert_dtypes()`           | Convert columns to nullable extension dtypes that support `pd.NA`; does not necessarily reduce memory |
| **Read CSV with types**      | Define `dtype` argument while reading files                        |
| **Drop unnecessary columns** | Free up space early in your pipeline                               |
| **Use `chunksize`**          | Load large files in small parts to avoid memory peaks              |


## 🧪 3. **Examples with Code**

### 🔹 a) Check memory usage

```python
import pandas as pd

df = pd.read_csv('large_dataset.csv')
print(df.memory_usage(deep=True))  # Shows memory per column
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")  # Total in MB
```
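
The difference between `deep=True` and the default `deep=False` is easiest to see on a small frame. The sketch below uses synthetic data with hypothetical column names, and forces `object` dtype for the string column so the behaviour does not depend on the pandas version's default string type.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real file (column names are hypothetical)
demo = pd.DataFrame({
    "user_id": np.arange(1_000, dtype="int64"),
    "city": pd.Series(["Paris", "Berlin", "Madrid", "Rome"] * 250, dtype="object"),
})

shallow = demo.memory_usage(deep=False)  # object columns counted as pointers only
deep = demo.memory_usage(deep=True)      # also counts the Python string payloads

print(shallow)
print(deep)
print(f"Total: {deep.sum() / 1024**2:.3f} MB")
```

For numeric columns both calls report the same size; only `object` columns are underestimated by the shallow measurement.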

---

### 🔹 b) Downcasting numeric columns

```python
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
```
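
To see which types `pd.to_numeric` actually picks, here is a minimal sketch on made-up values. The column names are illustrative only; the float values were chosen so that they are exactly representable in `float32`.

```python
import numpy as np
import pandas as pd

nums = pd.DataFrame({
    "small_int": np.array([1, 50, 120], dtype="int64"),     # fits in int8
    "large_int": np.array([1, 50, 70_000], dtype="int64"),  # too big for int16
    "price": np.array([1.5, 2.25, 1000.125], dtype="float64"),
})

# downcast= picks the smallest type of that kind that can hold every value
nums["small_int"] = pd.to_numeric(nums["small_int"], downcast="integer")
nums["large_int"] = pd.to_numeric(nums["large_int"], downcast="integer")
nums["price"] = pd.to_numeric(nums["price"], downcast="float")

print(nums.dtypes)
```

Because the target type is chosen from the data, integer downcasting this way does not overflow. The chosen type can differ between files with different value ranges, which matters if you later concatenate them.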

---

### 🔹 c) Convert object columns to `category`

```python
df['city'] = df['city'].astype('category')
```
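
A quick way to check whether the conversion pays off is to measure both representations. This sketch assumes a low-cardinality column (four distinct city names, invented for the example) stored as `object`.

```python
import pandas as pd

cities = pd.Series(["Paris", "Berlin", "Madrid", "Rome"] * 25_000, dtype="object")
as_category = cities.astype("category")

object_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes / 1024**2:.2f} MB")
print(f"category: {category_bytes / 1024**2:.2f} MB")
print(as_category.cat.categories.tolist())  # the unique labels, stored once
print(as_category.cat.codes.dtype)          # one small integer code per row
```

A categorical column stores each distinct label once and keeps a compact integer code per row, which is why the saving grows with the amount of repetition.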

---

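### 🔹 d) Automatic dtype conversion with `convert_dtypes()`

`convert_dtypes()` switches columns to pandas' nullable extension dtypes (such as `Int64`, `string`, and `boolean`) that support `pd.NA`. It makes dtypes more consistent, but it does not by itself shrink numeric columns.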

```python
df = df.convert_dtypes()
```
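
The following sketch shows what `convert_dtypes()` does to columns that contain missing values. The frame and its column names are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "count": [1, 2, None],        # becomes float64 because of the missing value
    "name": ["a", "b", None],
    "flag": [True, False, None],  # becomes object because of the missing value
})

converted = raw.convert_dtypes()

print(raw.dtypes)
print(converted.dtypes)
```

The integer column is restored to an integer type (`Int64`) instead of staying as a float. Nullable dtypes carry an extra mask per value, so treat this as a correctness and consistency tool and measure with `.memory_usage(deep=True)` before assuming any saving.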

---

### 🔹 e) Optimize while reading CSV

```python
dtypes = {'user_id': 'int32', 'price': 'float32', 'category': 'category'}
df = pd.read_csv('sales.csv', dtype=dtypes)
```
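
The same idea can be tried without a file on disk. In this sketch an in-memory CSV stands in for `sales.csv`, and a hypothetical `notes` column is added to show how `usecols` skips columns you do not need at read time.

```python
import io
import pandas as pd

# In-memory CSV standing in for 'sales.csv'; the 'notes' column is hypothetical
csv_text = """user_id,price,category,notes
1,9.99,books,first order
2,19.50,games,gift
3,5.25,books,
"""

dtypes = {"user_id": "int32", "price": "float32", "category": "category"}
sales = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["user_id", "price", "category"],  # unused columns are never loaded
    dtype=dtypes,
)

print(sales.dtypes)
```

Declaring types up front means the 64-bit defaults are never materialized, so there is no second conversion pass after loading.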

---

### 🔹 f) Load large file in chunks

```python
# With chunksize, read_csv yields DataFrames of up to 50,000 rows each
chunks = pd.read_csv('big.csv', chunksize=50000)
for chunk in chunks:
    print(len(chunk))  # Replace with your own per-chunk logic
```
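
Chunking is most useful when each chunk can be reduced to a small partial result that is combined at the end. Below is a minimal sketch of that pattern, assuming a per-store sum is what you need; the `store` and `amount` columns and the tiny chunk size are invented for the example.

```python
import io
import pandas as pd

# In-memory CSV standing in for 'big.csv' (columns are hypothetical)
rows = [f"{'A' if i % 2 == 0 else 'B'},{i}" for i in range(10)]
csv_text = "store,amount\n" + "\n".join(rows)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Reduce each chunk to a small per-store sum before moving on
    partials.append(chunk.groupby("store")["amount"].sum())

# Combine the partial sums into the final result
totals = pd.concat(partials).groupby(level=0).sum()
print(len(partials))
print(totals)
```

Only one chunk plus the list of small partial results is held in memory at a time. Operations that need the whole dataset at once, such as a global sort or an exact median, do not decompose this simply.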

## ⚡ 4. **Performance Considerations**

| Optimization             | Memory Saved             | Speed Impact                     |
| ------------------------ | ------------------------ | -------------------------------- |
| Downcasting integers     | High                     | Slight speed-up                  |
| Using categories         | Very high for low-cardinality string columns | Often faster groupby/sorting |
| Dropping unused columns  | High                     | Faster overall processing        |
| Reading with `dtype`     | Medium                   | Avoids upcasting during read     |
| Using `convert_dtypes()` | Little to none           | Adds nullable dtypes; not a size optimization |
| Chunking                 | Caps peak memory         | Slightly slower but safer        |


## ⚠️ 5. **Common Pitfalls & Mistakes**

1. ❌ **Leaving default data types**: `int64`, `float64`, and `object` often take more memory than the data needs
2. ❌ **Ignoring string/object columns**: these are frequently the largest consumers of memory
3. ❌ **Downcasting with `astype()` without checking the value range**: integers can **overflow** silently, and `float32` can **lose precision**
4. ❌ **Using category for high-cardinality columns**: may actually **increase memory**, because every unique value is stored in addition to a code per row
5. ❌ **Reading large CSVs without `dtype` or `chunksize`**: can exhaust available memory
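
Pitfalls 3 and 4 can be reproduced in a few lines. This is a sketch on made-up values; the silent wrap-around of `astype()` reflects how current pandas releases behave for integer-to-integer casts, and integer IDs are used for the category example so the result does not depend on how strings are stored.

```python
import pandas as pd

# Pitfall 3: astype() does not check whether the values fit the target type
values = pd.Series([100, 200, 300], dtype="int64")
wrapped = values.astype("int8")                   # out-of-range values wrap around
safe = pd.to_numeric(values, downcast="integer")  # chooses a type that fits

print(wrapped.tolist(), wrapped.dtype)
print(safe.tolist(), safe.dtype)

# float32 keeps roughly 7 significant digits, so large values get rounded
precise = pd.Series([16_777_217.0], dtype="float64")
rounded = precise.astype("float32")
print(precise.iloc[0], float(rounded.iloc[0]))

# Pitfall 4: with all-unique values, category stores every value plus a code per row
ids = pd.Series(range(10_000), dtype="int64")
plain_bytes = ids.memory_usage(deep=True)
category_bytes = ids.astype("category").memory_usage(deep=True)
print(plain_bytes, category_bytes)
```

Checking `series.min()` and `series.max()` before an explicit `astype()`, and `series.nunique()` before converting to `category`, catches both problems early.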


## ✅ 6. **Best Practices**

* ✅ Always **inspect memory** using `.memory_usage(deep=True)`
* ✅ Downcast numeric columns with `pd.to_numeric(..., downcast=...)`, or with `astype()` after checking the value range
* ✅ Convert repeated strings to `category` when cardinality is low
* ✅ Drop or exclude **unnecessary columns** early
* ✅ Pass `dtype` explicitly when reading large files
* ✅ Use **`chunksize`** for massive datasets
* ✅ Combine memory optimization with performance techniques such as vectorization
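
Several of these practices can be bundled into one helper. The function below is a sketch, not a standard pandas API: the name `shrink_dataframe` and the `max_category_ratio` threshold of 0.5 are assumptions chosen for illustration, and the right threshold depends on your data.

```python
import numpy as np
import pandas as pd


def shrink_dataframe(df: pd.DataFrame, max_category_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where that looks safe (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Only use category when values repeat often enough to pay off
            ratio = series.nunique(dropna=True) / max(len(series), 1)
            if ratio <= max_category_ratio:
                out[col] = series.astype("category")
    return out


# Synthetic frame with hypothetical column names
demo = pd.DataFrame({
    "qty": np.array([1, 2, 3, 4] * 250, dtype="int64"),
    "price": np.array([1.5, 2.5, 3.5, 4.5] * 250, dtype="float64"),
    "city": pd.Series(["Paris", "Rome"] * 500, dtype="object"),
})

small = shrink_dataframe(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The helper works on a copy so the original frame stays available for comparison. Float downcasting trades precision for space, so exclude columns where exact values matter.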

## 💼 7. **Use Cases in Real Projects**

| Domain                 | Use Case                                                                            |
| ---------------------- | ----------------------------------------------------------------------------------- |
| 🏥 Healthcare          | Large EHR (Electronic Health Records) datasets with categorical codes               |
| 📦 Logistics           | Shipment logs with repeated locations, product types                                |
| 🛍️ Retail             | Millions of order rows stored with default `object` types → converted to `category` |
| 📈 Finance             | High-volume tick data downcast to `float32` and `int32`                             |
| 📊 Marketing Analytics | Read billions of clickstream rows using `chunksize` and optimized types             |


## ✅ Summary

Memory usage and efficiency techniques are **critical** for working with large datasets. By applying **smart data typing, chunking, and memory profiling**, you can handle big data even on modest hardware, write scalable pipelines, and reduce runtime and resource costs.

