# **11. Advanced Features & Optimization**

## 📉 4. **Memory Usage & Efficiency**


## 📌 1. **Purpose & When to Use It**

### 🎯 **Purpose**:

Memory usage optimization focuses on **reducing the RAM** consumed by pandas DataFrames, which is critical when working with **large datasets**. By controlling data types and memory representations, you can:

* Load large datasets without crashing
* Speed up operations due to smaller memory footprint
* Reduce computational cost and latency in production environments

### 📅 **When to Use It**:

* When your dataset has **millions of rows**
* When you're hitting **memory errors**
* Before **saving DataFrames to disk** in a binary format such as Parquet, where compact dtypes are preserved and can reduce file size (plain CSV files do not store dtypes)
* When preparing **datasets for machine learning**

## 🧠 2. **Different Methods and Techniques**

| Technique                    | Description                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------ |
| `.memory_usage()`            | Inspect memory consumption of each column                                            |
| `astype()`                   | Convert columns to more efficient data types                                         |
| **Downcasting**              | Store floats/ints in a narrower type (e.g., `int64` → `int8`) when the values fit    |
| **Categorical data**         | Optimize memory for repetitive strings/objects                                       |
| `convert_dtypes()`           | Convert to nullable extension dtypes that support `pd.NA`; does not downcast         |
| **Read CSV with types**      | Define `dtype` argument while reading files                                          |
| **Drop unnecessary columns** | Free up space early in your pipeline                                                 |
| **Use `chunksize`**          | Load large files in small parts to avoid memory peaks                                |


## 🧪 3. **Examples with Code**

### 🔹 a) Check memory usage

```python
import pandas as pd

df = pd.read_csv('large_dataset.csv')
print(df.memory_usage(deep=True))  # Shows memory per column
print(df.memory_usage(deep=True).sum() / 1024**2, "MB")  # Total in MB
```
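To see why `deep=True` matters, here is a minimal sketch on a small synthetic DataFrame (the column names and values are made up for illustration, standing in for a real file). Without `deep=True`, an `object` column is counted only by its pointers, so string-heavy columns look much cheaper than they are.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real file (an assumption for illustration).
df = pd.DataFrame({
    "id": np.arange(1000, dtype="int64"),
    "city": pd.Series(["Paris", "Berlin", "Madrid", "Rome"] * 250, dtype=object),
})

shallow = df.memory_usage(deep=False)  # object columns: pointer storage only
deep = df.memory_usage(deep=True)      # also counts the Python string objects

total_mb = deep.sum() / 1024**2
print(deep)
print(f"Total: {total_mb:.3f} MB")
```

The numeric column reports the same size either way; only the string column changes, which is usually where the hidden memory sits.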

---

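### 🔹 b) Downcasting numeric columns

```python
# Picks the smallest integer type that can hold every value in the column
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
# float32 is the smallest float type used here; precision drops to about 7 digits
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
```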
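As a rough illustration of how `pd.to_numeric` chooses the target type, the sketch below uses three made-up columns. The names `big_int_col` and the sample values are assumptions, chosen so that each column lands on a different dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "int_col": np.array([0, 50, 100], dtype="int64"),              # fits in int8
    "big_int_col": np.array([0, 50_000, 100_000], dtype="int64"),  # needs int32
    "float_col": np.array([0.5, 1.5, 2.5], dtype="float64"),
})

small = pd.to_numeric(df["int_col"], downcast="integer")
medium = pd.to_numeric(df["big_int_col"], downcast="integer")
floats = pd.to_numeric(df["float_col"], downcast="float")

print(small.dtype, medium.dtype, floats.dtype)
```

Because the chosen type depends on the values present, a later batch with larger values may end up with a different dtype, which is worth keeping in mind when concatenating data.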

---

### 🔹 c) Convert object columns to `category`

```python
# Stores each distinct string once and keeps small integer codes per row
df['city'] = df['city'].astype('category')
```
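The benefit depends heavily on cardinality. The following sketch compares a column with four distinct values against one where every value is unique. Both Series are synthetic, and `mem` is a small helper introduced only for this example.

```python
import pandas as pd

n = 10_000
# Four distinct values repeated many times (low cardinality).
low_card = pd.Series(["Paris", "Berlin", "Madrid", "Rome"] * (n // 4), dtype=object)
# Every value is different (high cardinality).
high_card = pd.Series([f"user_{i}" for i in range(n)], dtype=object)

def mem(s: pd.Series) -> int:
    """Bytes used by the values, excluding the index (helper for this example)."""
    return s.memory_usage(index=False, deep=True)

low_before, low_after = mem(low_card), mem(low_card.astype("category"))
high_before, high_after = mem(high_card), mem(high_card.astype("category"))

print(f"low cardinality:  {low_before} -> {low_after} bytes")
print(f"high cardinality: {high_before} -> {high_after} bytes")
```

The low-cardinality column should shrink dramatically, while the unique-valued column gains little and, with object-backed strings, typically ends up larger because the codes are stored on top of every distinct string.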

---

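### 🔹 d) Automatic data type conversion

```python
# Converts to nullable extension dtypes (Int64, string, boolean) that support pd.NA.
# This helps with missing values but does not downcast and may not save memory.
df = df.convert_dtypes()
```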
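To check what `convert_dtypes()` actually does to memory, a small sketch with synthetic columns (the names `count` and `score` are illustrative) can compare usage before and after the call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),   # plain NumPy int64
    "score": [1.0, 2.0, np.nan, 4.0] * 250,    # integer-valued floats with gaps
})

converted = df.convert_dtypes()

before = df.memory_usage(index=False, deep=True)
after = converted.memory_usage(index=False, deep=True)

print(converted.dtypes)
print(before, after, sep="\n")
```

Both columns become the nullable `Int64` type, which keeps the same 64-bit width and adds a validity mask, so memory does not go down. Treat the method as a convenience for missing-value handling and pair it with explicit downcasting when memory is the goal.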

---

### 🔹 e) Optimize while reading CSV

```python
dtypes = {'user_id': 'int32', 'price': 'float32', 'category': 'category'}
df = pd.read_csv('sales.csv', dtype=dtypes)
```
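The same idea can be tried without a file on disk. This sketch reads an in-memory CSV (the contents and the extra `notes` column are invented for the example) and combines `dtype` with `usecols`, so unneeded columns are never loaded at all.

```python
import io
import pandas as pd

# In-memory CSV standing in for 'sales.csv'; the rows are illustrative.
csv_text = """user_id,price,category,notes
1,9.99,books,first order
2,19.50,toys,gift wrap
3,4.25,books,none
"""

dtypes = {"user_id": "int32", "price": "float32", "category": "category"}

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype=dtypes,
    usecols=list(dtypes),  # skip columns that are not needed, such as 'notes'
)
print(df.dtypes)
```

Declaring types at read time avoids first materializing the wider default types and converting afterwards.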

---

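### 🔹 f) Load large file in chunks

```python
# With chunksize, read_csv returns an iterator of DataFrames instead of one frame
chunks = pd.read_csv('big.csv', chunksize=50000)
for chunk in chunks:
    process(chunk)  # Placeholder: replace with your own per-chunk logic
```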
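One common shape for the per-chunk logic is to reduce each chunk to a small partial result and combine the partials at the end. The sketch below assumes a hypothetical file with `region` and `amount` columns and uses an in-memory CSV in its place.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for 'big.csv' (hypothetical columns).
csv_text = "region,amount\n" + "\n".join(
    f"{'north' if i % 2 == 0 else 'south'},{i}" for i in range(10)
)

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4, dtype={"amount": "int64"})

partials = []
for chunk in reader:
    # Keep only the aggregated result, not the chunk itself.
    partials.append(chunk.groupby("region")["amount"].sum())

# Combine the per-chunk sums into one total per region.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works for aggregations that can be merged from partial results (sums, counts, min/max). Statistics such as medians need a different approach.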

## ⚡ 4. **Performance Considerations**

| Optimization             | Memory Saved                                         | Speed Impact                                   |
| ------------------------ | ---------------------------------------------------- | ---------------------------------------------- |
| Downcasting integers     | High                                                 | Slight speed-up                                |
| Using categories         | Very high for low-cardinality strings (up to 80–90%) | Faster groupby/sorting                         |
| Dropping unused columns  | High                                                 | Faster overall processing                      |
| Reading with `dtype`     | Medium                                               | Avoids upcasting during read                   |
| Using `convert_dtypes()` | Little or none; may increase slightly                | Automated, but not tuned for memory            |
| Chunking                 | Caps peak memory, which helps avoid crashes          | Slightly slower but safer                      |


## ⚠️ 5. **Common Pitfalls & Mistakes**

1. ❌ **Leaving default data types**: `int64`, `float64`, and `object` often take more memory than necessary
2. ❌ **Ignoring string/object columns**: these usually consume the most memory
3. ❌ **Downcasting without checking value range**: could cause **overflow or precision loss**
4. ❌ **Using category for high-cardinality columns**: may actually **increase memory**
5. ❌ **Reading large CSVs without `dtype` or `chunksize`**: can exhaust memory and crash the process
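
Pitfall 3 can be guarded against with a range check before casting. The helper below, `safe_int_cast`, is a hypothetical function written for this example and not part of pandas. It assumes a non-empty integer Series without missing values.

```python
import numpy as np
import pandas as pd

def safe_int_cast(series: pd.Series, target: str) -> pd.Series:
    """Cast to an integer dtype only if every value fits (hypothetical helper)."""
    info = np.iinfo(target)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError(
            f"values span [{series.min()}, {series.max()}], "
            f"outside {target} range [{info.min}, {info.max}]"
        )
    return series.astype(target)

ok = safe_int_cast(pd.Series([1, 50, 100]), "int8")

try:
    safe_int_cast(pd.Series([1, 50, 300]), "int8")  # 300 exceeds the int8 maximum of 127
    overflow_caught = False
except ValueError:
    overflow_caught = True

# float32 keeps roughly 7 significant decimal digits, so large integers stored
# as floats can silently collapse onto a neighboring value.
precision_lost = np.float32(16_777_217) == np.float32(16_777_216)
print(ok.dtype, overflow_caught, precision_lost)
```

`pd.to_numeric(..., downcast=...)` performs a similar check on its own, so the guard matters mainly when calling `astype()` with a hard-coded target type.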


## ✅ 6. **Best Practices**

* ✅ Always **inspect memory** using `.memory_usage(deep=True)`
* ✅ Use `astype()` or `pd.to_numeric(..., downcast=...)` to downcast numeric types after checking value ranges
* ✅ Convert repeated strings to `category` when cardinality is low
* ✅ Drop or exclude **unnecessary columns** early
* ✅ Pass `dtype` explicitly when reading large files
* ✅ Use **`chunksize`** for massive datasets
* ✅ Combine memory optimization with performance techniques (vectorization, method chaining)
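
Several of these practices can be folded into one pass over a DataFrame. The function below is a minimal sketch: `reduce_memory` and its `max_unique_ratio` threshold are assumptions made for illustration. It only considers `object` columns for conversion to `category`, and the float downcast trades precision for space.

```python
import numpy as np
import pandas as pd

def reduce_memory(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with downcast numerics and low-cardinality strings as category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast="float")  # float32: ~7 digits
        elif s.dtype == object and s.nunique() / max(len(s), 1) <= max_unique_ratio:
            out[col] = s.astype("category")
    return out

# Synthetic example data (column names are illustrative).
df = pd.DataFrame({
    "qty": np.arange(1000, dtype="int64") % 100,
    "price": np.linspace(0.0, 10.0, 1000),
    "city": pd.Series(["Paris", "Berlin"] * 500, dtype=object),
})

small = reduce_memory(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} -> {after} bytes")
```

The threshold of 0.5 is arbitrary; a stricter value is safer when columns hold mostly unique identifiers.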

## 💼 7. **Use Cases in Real Projects**

| Domain                 | Use Case                                                                            |
| ---------------------- | ----------------------------------------------------------------------------------- |
| 🏥 Healthcare          | Large EHR (Electronic Health Records) datasets with categorical codes               |
| 📦 Logistics           | Shipment logs with repeated locations, product types                                |
| 🛍️ Retail              | Millions of order rows stored with default `object` types → converted to `category` |
| 📈 Finance             | High-volume tick data downcast to `float32` and `int32`                             |
| 📊 Marketing Analytics | Reading billions of clickstream rows using `chunksize` and optimized types          |


## ✅ Summary

Memory usage and efficiency techniques are **critical** for working with large datasets. By applying **smart data typing, chunking, and memory profiling**, you can handle big data even on modest hardware, write scalable pipelines, and reduce runtime and resource costs.

