# Performance Optimization in Pandas

## Overview

**Performance Optimization** = Making Pandas operations faster and more memory-efficient

### Why Optimize?

```
❌ Slow & Memory-Hungry:
- Minutes to hours for operations
- Out of memory errors
- Poor user experience

✅ Optimized:
- Seconds instead of minutes
- 50-90% less memory
- Handle larger datasets
- Better scalability
```

### What We'll Learn

**1. Memory Optimization** 💾
- Efficient data types (int8 vs int64)
- Categorical data
- Memory profiling
- Reducing DataFrame size

**2. Vectorization** ⚡
- Avoid loops (apply vs vectorized)
- NumPy operations
- Built-in methods
- When to use what

**3. Efficient Operations** 🚀
- Query vs boolean indexing
- eval() for expressions
- Efficient groupby
- Index optimization

**4. Large Data Handling** 📊
- Chunk processing
- Column selection (usecols)
- Data type specification
- Sampling strategies

**5. Advanced Techniques** 🎯
- Parallel processing
- Numba acceleration
- Sparse data structures
- Copy vs inplace

### Performance Principles

```
1. Vectorize > Apply > Loop
2. Right data types = Less memory
3. Sorted index = Fast selection
4. Read only what you need
5. Process in chunks if needed
```

### Speed Comparison (Typical)

```
Vectorized:      0.001s  ⚡⚡⚡ (baseline)
Built-in apply:  0.1s    ⚡⚡   (~100x slower)
Custom apply:    1s      ⚡    (~1,000x slower)
For loop:        10s     🐌   (~10,000x slower)
```

### What You'll Master

1. ✅ Memory profiling and optimization
2. ✅ Choosing optimal data types
3. ✅ Vectorization techniques
4. ✅ Efficient filtering and selection
5. ✅ Large file reading strategies
6. ✅ Chunked processing
7. ✅ Index optimization
8. ✅ Parallel processing basics
9. ✅ Performance measurement
10. ✅ Best practices for production

In [1]:
import pandas as pd
import numpy as np
import time
import sys
from functools import wraps

# Display settings
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

# Timer decorator for measuring performance
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # Monotonic, high-resolution clock
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__}: {end - start:.4f}s")
        return result
    return wrapper

print("✅ Libraries imported")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ Libraries imported
Pandas version: 2.2.3
NumPy version: 2.1.3


## 1. Memory Optimization

### Data Type Sizes

```python
# Integer types
int8:    -128 to 127                       (1 byte)
int16:   -32,768 to 32,767                 (2 bytes)
int32:   about -2.1B to 2.1B               (4 bytes)
int64:   about -9.2 quintillion to 9.2 quintillion (8 bytes) ← Default

# Float types
float16: ~3 significant digits             (2 bytes)
float32: ~7 significant digits             (4 bytes)
float64: ~15 significant digits            (8 bytes) ← Default

# Other
bool:    True/False                        (1 byte)
category: Depends on # unique values       (Usually 1-2 bytes per value + category table)
```

### Automatic Downcasting

```python
def optimize_dtypes(df):
    """Downcast numeric columns (modifies df in place and returns it)."""
    # Integers: pick the smallest type whose range covers [col_min, col_max]
    for col in df.select_dtypes(include=['integer']).columns:
        col_min = df[col].min()
        col_max = df[col].max()

        if col_min >= 0:
            candidates = ['uint8', 'uint16', 'uint32']
        else:
            candidates = ['int8', 'int16', 'int32']

        for dtype in candidates:
            info = np.iinfo(dtype)
            if info.min <= col_min and col_max <= info.max:
                df[col] = df[col].astype(dtype)
                break

    # Floats: float32 halves memory but keeps only ~7 significant digits
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype('float32')

    return df
```

### Memory Profiling

```python
# Check memory usage
df.info(memory_usage='deep')  # Detailed memory info
df.memory_usage(deep=True)    # Per column

# Total memory
df.memory_usage(deep=True).sum() / 1024**2  # MB
```
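
A per-column report makes it easier to see which columns are worth optimizing first. The following is a minimal sketch: `memory_report` and the sample column names are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def memory_report(df):
    """Return dtype and deep memory usage (MB) per column, largest first."""
    usage_bytes = df.memory_usage(deep=True, index=False)
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "mb": usage_bytes / 1024**2,
    })
    return report.sort_values("mb", ascending=False)

# Hypothetical sample data: one numeric column and one string column
sample = pd.DataFrame({
    "id": np.arange(1000, dtype="int64"),
    "name": [f"customer_name_{i:06d}" for i in range(1000)],
})
report = memory_report(sample)
print(report)
```

String columns usually land at the top of such a report, which is why categorical conversion tends to give the largest savings.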

### Categorical Conversion

```python
# When to use categorical
unique_ratio = df['col'].nunique() / len(df)

if unique_ratio < 0.5:  # Less than 50% unique
    df['col'] = df['col'].astype('category')
```
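
The uniqueness ratio is only a heuristic. A more direct check is to measure both representations and keep the smaller one. This is a sketch under that assumption; `to_category_if_smaller` is a hypothetical helper name, and the two Series are made-up test data.

```python
import numpy as np
import pandas as pd

def to_category_if_smaller(s):
    """Convert to category only when it actually reduces deep memory usage."""
    as_cat = s.astype("category")
    if as_cat.memory_usage(index=False, deep=True) < s.memory_usage(index=False, deep=True):
        return as_cat
    return s

rng = np.random.default_rng(0)
low_card = pd.Series(rng.choice(["A", "B", "C"], 10_000))    # values repeat often
high_card = pd.Series([f"id_{i}" for i in range(10_000)])    # every value is unique

low_result = to_category_if_smaller(low_card)
high_result = to_category_if_smaller(high_card)
print(low_result.dtype, high_result.dtype)
```

The trade-off is that the conversion is performed once just to measure it, which costs time on very large columns.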

### Memory Savings Example

```
Before optimization:
int64:    8 bytes × 1M rows = 8 MB
float64:  8 bytes × 1M rows = 8 MB
object:   50 bytes × 1M rows = 50 MB
Total: 66 MB

After optimization:
int8:     1 byte × 1M rows = 1 MB     (87.5% savings)
float32:  4 bytes × 1M rows = 4 MB    (50% savings)
category: 2 bytes × 1M rows = 2 MB    (96% savings)
Total: 7 MB (89% savings!)
```

In [2]:
print("=== MEMORY OPTIMIZATION ===\n")

# Example 1: Data type impact
print("Example 1: Integer data type comparison\n")
n = 1_000_000
values = np.random.randint(0, 100, n)

df_int64 = pd.DataFrame({'value': values})
df_int8 = pd.DataFrame({'value': values.astype('int8')})

mem_int64 = df_int64.memory_usage(deep=True).sum() / 1024**2
mem_int8 = df_int8.memory_usage(deep=True).sum() / 1024**2

print(f"1M integers (0-100):")
print(f"  int64:  {mem_int64:.2f} MB")
print(f"  int8:   {mem_int8:.2f} MB")
print(f"  Savings: {(1 - mem_int8/mem_int64)*100:.1f}%")
print()

# Example 2: Float precision
print("="*70)
print("Example 2: Float precision impact\n")
float_values = np.random.randn(n)

df_float64 = pd.DataFrame({'value': float_values})
df_float32 = pd.DataFrame({'value': float_values.astype('float32')})

mem_float64 = df_float64.memory_usage(deep=True).sum() / 1024**2
mem_float32 = df_float32.memory_usage(deep=True).sum() / 1024**2

print(f"1M floats:")
print(f"  float64: {mem_float64:.2f} MB")
print(f"  float32: {mem_float32:.2f} MB")
print(f"  Savings: {(1 - mem_float32/mem_float64)*100:.1f}%")
print()

# Example 3: Categorical for strings
print("="*70)
print("Example 3: String vs Categorical\n")
categories = ['A', 'B', 'C', 'D', 'E']
strings = np.random.choice(categories, n)

df_object = pd.DataFrame({'category': strings})
df_category = pd.DataFrame({'category': pd.Categorical(strings)})

mem_object = df_object.memory_usage(deep=True).sum() / 1024**2
mem_category = df_category.memory_usage(deep=True).sum() / 1024**2

print(f"1M strings (5 unique values):")
print(f"  object:    {mem_object:.2f} MB")
print(f"  category:  {mem_category:.2f} MB")
print(f"  Savings: {(1 - mem_category/mem_object)*100:.1f}%")
print()

# Example 4: Automatic optimization function
print("="*70)
print("Example 4: Automatic dtype optimization\n")

def optimize_dtypes(df):
    """Optimize DataFrame data types (modifies df in place and returns it)"""
    # Integers: smallest type whose range covers [col_min, col_max]
    for col in df.select_dtypes(include=['integer']).columns:
        col_min = df[col].min()
        col_max = df[col].max()
        
        if col_min >= 0:
            candidates = ['uint8', 'uint16', 'uint32']
        else:
            candidates = ['int8', 'int16', 'int32']
        
        for dtype in candidates:
            info = np.iinfo(dtype)
            if info.min <= col_min and col_max <= info.max:
                df[col] = df[col].astype(dtype)
                break
    
    # Floats: float32 keeps only ~7 significant digits
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype('float32')
    
    # Strings: category pays off when values repeat often
    for col in df.select_dtypes(include=['object']).columns:
        num_unique = df[col].nunique()
        if num_unique / len(df) < 0.5:
            df[col] = df[col].astype('category')
    
    return df

# Create test DataFrame
df_test = pd.DataFrame({
    'age': np.random.randint(0, 100, 10000),
    'score': np.random.randn(10000),
    'category': np.random.choice(['A', 'B', 'C'], 10000)
})

print("Before optimization:")
print(df_test.dtypes)
mem_before = df_test.memory_usage(deep=True).sum() / 1024
print(f"Memory: {mem_before:.2f} KB")

df_optimized = optimize_dtypes(df_test.copy())
print("\nAfter optimization:")
print(df_optimized.dtypes)
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024
print(f"Memory: {mem_after:.2f} KB")
print(f"\nSavings: {(1 - mem_after/mem_before)*100:.1f}%")

=== MEMORY OPTIMIZATION ===

Example 1: Integer data type comparison

1M integers (0-100):
  int64:  7.63 MB
  int8:   0.95 MB
  Savings: 87.5%

Example 2: Float precision impact

1M floats:
  float64: 7.63 MB
  float32: 3.81 MB
  Savings: 50.0%

Example 3: String vs Categorical

1M strings (5 unique values):
  object:    47.68 MB
  category:  0.95 MB
  Savings: 98.0%

Example 4: Automatic dtype optimization

Before optimization:
age           int64
score       float64
category     object
dtype: object
Memory: 644.66 KB

After optimization:
age            uint8
score        float32
category    category
dtype: object
Memory: 58.97 KB

Savings: 90.9%


## 2. Vectorization - Avoid Loops!

### Performance Hierarchy

```
⚡⚡⚡ FASTEST (100-1000x)
1. Built-in Pandas methods
   df['result'] = df['col1'] + df['col2']

2. NumPy vectorized operations
   df['result'] = np.sqrt(df['col'])

⚡⚡ FAST (10-100x)
3. Pandas accessor methods (gains on object-dtype strings are often small)
   df['result'] = df['col'].str.upper()

⚡ SLOW (baseline)
4. Series.apply() (element-wise)
   df['result'] = df['col'].apply(lambda x: x**2)

🐌 VERY SLOW (10-1000x slower)
5. DataFrame.apply() with axis=1 (row-wise)
   df['result'] = df.apply(complex_function, axis=1)

🐌🐌 EXTREMELY SLOW
6. For loops (iterrows, itertuples)
   for idx, row in df.iterrows():
```

### Vectorization Examples

**❌ Loop (Slow)**
```python
result = []
for i in range(len(df)):
    result.append(df.loc[i, 'a'] + df.loc[i, 'b'])
df['result'] = result
```

**❌ iterrows (Slow)**
```python
result = []
for idx, row in df.iterrows():
    result.append(row['a'] + row['b'])
df['result'] = result
```

**✅ Vectorized (Fast)**
```python
df['result'] = df['a'] + df['b']
```

### Common Operations

**Arithmetic**
```python
# ❌ Slow
df['result'] = df['col'].apply(lambda x: x * 2 + 5)

# ✅ Fast
df['result'] = df['col'] * 2 + 5
```

**Conditional Logic**
```python
# ❌ Slow
df['category'] = df['value'].apply(
    lambda x: 'High' if x > 100 else 'Low'
)

# ✅ Fast
df['category'] = np.where(df['value'] > 100, 'High', 'Low')

# ✅ Fast (multiple conditions)
# Conditions are checked in order; the first match wins
conditions = [
    df['value'] > 100,
    df['value'] > 50,
    df['value'] > 0
]
choices = ['High', 'Medium', 'Low']
df['category'] = np.select(conditions, choices, default='None')
```

**String Operations**
```python
# ❌ Slow
df['upper'] = df['name'].apply(lambda x: x.upper())

# ✅ Fast
df['upper'] = df['name'].str.upper()
```

**Math Operations**
```python
# ❌ Slow
df['sqrt'] = df['value'].apply(lambda x: x**0.5)

# ✅ Fast
df['sqrt'] = np.sqrt(df['value'])
```

### When apply() is OK

```python
# Complex logic that cannot be vectorized
def complex_calculation(row):
    # Multiple column interactions
    # Complex conditions
    # External API calls
    result = ...  # Placeholder: replace with your row-level logic
    return result

# Use apply when truly needed
df['result'] = df.apply(complex_calculation, axis=1)
```
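
When a row loop is truly unavoidable, `itertuples()` is generally a faster choice than `iterrows()`, because it yields lightweight namedtuples instead of building a Series for every row. A minimal sketch, assuming a hypothetical `row_logic` function and toy columns `a` and `b`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def row_logic(a, b):
    """Hypothetical row-level rule standing in for logic that cannot be vectorized."""
    return a * 2 if b > 15 else a + b

# index=False skips the index field; columns are available as attributes
result = [row_logic(row.a, row.b) for row in df.itertuples(index=False)]
df["result"] = result
print(df)
```

Attribute access (`row.a`) only works for column names that are valid Python identifiers; other names are exposed positionally.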

In [3]:
print("=== VECTORIZATION PERFORMANCE ===\n")

# Create test data
n = 100_000
df = pd.DataFrame({
    'a': np.random.randint(0, 100, n),
    'b': np.random.randint(0, 100, n),
    'value': np.random.randn(n),
    'name': np.random.choice(['alice', 'bob', 'charlie'], n)
})

# Example 1: Simple arithmetic
print("Example 1: Arithmetic operations\n")

# Method 1: Loop (slowest)
start = time.time()
result = []
for i in range(len(df)):
    result.append(df.loc[i, 'a'] + df.loc[i, 'b'])
df['result_loop'] = result
time_loop = time.time() - start
print(f"Loop:       {time_loop:.4f}s")

# Method 2: iterrows
start = time.time()
result = []
for idx, row in df.iterrows():
    result.append(row['a'] + row['b'])
df['result_iterrows'] = result
time_iterrows = time.time() - start
print(f"iterrows:   {time_iterrows:.4f}s")

# Method 3: apply
start = time.time()
df['result_apply'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
time_apply = time.time() - start
print(f"apply:      {time_apply:.4f}s")

# Method 4: Vectorized
start = time.time()
df['result_vectorized'] = df['a'] + df['b']
time_vec = time.time() - start
print(f"Vectorized: {time_vec:.4f}s")

print(f"\nSpeedup: {time_loop/time_vec:.0f}x faster!")
print()

# Example 2: Conditional logic
print("="*70)
print("Example 2: Conditional operations\n")

# Method 1: apply
start = time.time()
df['category_apply'] = df['value'].apply(
    lambda x: 'Positive' if x > 0 else 'Negative'
)
time_apply = time.time() - start
print(f"apply:      {time_apply:.4f}s")

# Method 2: np.where (vectorized)
start = time.time()
df['category_where'] = np.where(df['value'] > 0, 'Positive', 'Negative')
time_where = time.time() - start
print(f"np.where:   {time_where:.4f}s")

print(f"\nSpeedup: {time_apply/time_where:.0f}x faster!")
print()

# Example 3: Multiple conditions
print("="*70)
print("Example 3: Multiple conditions\n")

# Method 1: apply with if-elif-else
def categorize_apply(x):
    if x > 1:
        return 'High'
    elif x > 0:
        return 'Medium'
    elif x > -1:
        return 'Low'
    else:
        return 'Very Low'

start = time.time()
df['level_apply'] = df['value'].apply(categorize_apply)
time_apply = time.time() - start
print(f"apply:      {time_apply:.4f}s")

# Method 2: np.select (vectorized)
start = time.time()
conditions = [
    df['value'] > 1,
    df['value'] > 0,
    df['value'] > -1
]
choices = ['High', 'Medium', 'Low']
df['level_select'] = np.select(conditions, choices, default='Very Low')
time_select = time.time() - start
print(f"np.select:  {time_select:.4f}s")

print(f"\nSpeedup: {time_apply/time_select:.0f}x faster!")
print()

# Example 4: String operations
print("="*70)
print("Example 4: String operations\n")

# Method 1: apply
start = time.time()
df['name_upper_apply'] = df['name'].apply(lambda x: x.upper())
time_apply = time.time() - start
print(f"apply:      {time_apply:.4f}s")

# Method 2: str accessor (vectorized)
start = time.time()
df['name_upper_str'] = df['name'].str.upper()
time_str = time.time() - start
print(f"str.upper:  {time_str:.4f}s")

print(f"\nSpeedup: {time_apply/time_str:.0f}x faster!")
print()

# Example 5: Mathematical operations
print("="*70)
print("Example 5: Math operations\n")

# Create positive values for sqrt
df['positive'] = np.abs(df['value']) + 1

# Method 1: apply
start = time.time()
df['sqrt_apply'] = df['positive'].apply(lambda x: x**0.5)
time_apply = time.time() - start
print(f"apply:      {time_apply:.4f}s")

# Method 2: NumPy (vectorized)
start = time.time()
df['sqrt_numpy'] = np.sqrt(df['positive'])
time_numpy = time.time() - start
print(f"np.sqrt:    {time_numpy:.4f}s")

print(f"\nSpeedup: {time_apply/time_numpy:.0f}x faster!")

=== VECTORIZATION PERFORMANCE ===

Example 1: Arithmetic operations

Loop:       0.6336s
iterrows:   1.2246s
apply:      0.2777s
Vectorized: 0.0007s

Speedup: 866x faster!

Example 2: Conditional operations

apply:      0.0103s
np.where:   0.0065s

Speedup: 2x faster!

Example 3: Multiple conditions

apply:      0.0130s
np.select:  0.0083s

Speedup: 2x faster!

Example 4: String operations

apply:      0.0094s
str.upper:  0.0080s

Speedup: 1x faster!

Example 5: Math operations

apply:      0.0148s
np.sqrt:    0.0003s

Speedup: 43x faster!


## 3. Efficient Operations

### Query vs Boolean Indexing

```python
# Boolean indexing: explicit, and often just as fast on small or medium frames
df[(df['age'] > 25) & (df['city'] == 'NYC') & (df['score'] > 80)]

# ✅ query(): more readable; can be faster on large DataFrames when numexpr is installed
df.query('age > 25 and city == "NYC" and score > 80')
```

### eval() for Complex Expressions

```python
# Standard arithmetic (NumPy allocates a temporary array per operation)
df['result'] = df['a'] + df['b'] * df['c'] - df['d'] / df['e']

# ✅ eval(): can avoid temporaries via numexpr; tends to pay off only on large frames
df['result'] = df.eval('a + b * c - d / e')
```
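
Whether `eval()` or `query()` helps depends on frame size and on whether numexpr is installed, so it is worth measuring on your own data. The sketch below uses the standard-library `timeit` module, which repeats each call and is less noisy than a single `time.time()` measurement. The frame size and repeat counts are arbitrary assumptions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 5)), columns=list("abcde"))

def standard():
    return df["a"] + df["b"] * df["c"] - df["d"] / df["e"]

def with_eval():
    return df.eval("a + b * c - d / e")

# Both forms should produce the same values
same_values = np.allclose(standard(), with_eval())

# Best of 5 repeats, 10 calls per repeat, reported per call
t_standard = min(timeit.repeat(standard, number=10, repeat=5)) / 10
t_eval = min(timeit.repeat(with_eval, number=10, repeat=5)) / 10

print(f"Same values: {same_values}")
print(f"Standard: {t_standard:.5f}s per call")
print(f"eval():   {t_eval:.5f}s per call")
```

Taking the minimum over repeats filters out interference from other processes, which matters for operations that finish in under a millisecond.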

### Efficient GroupBy

```python
# ✅ Use built-in aggregations
df.groupby('category')['value'].sum()      # Fast
df.groupby('category')['value'].mean()     # Fast

# ❌ Custom aggregation (slower)
df.groupby('category')['value'].apply(custom_func)

# ✅ Multiple aggregations at once
df.groupby('category').agg({
    'value': ['sum', 'mean', 'count']
})
```

### Index Optimization

```python
# ✅ Sort index for fast slicing
df = df.sort_index()
df.loc['A':'Z']  # Fast with sorted index

# ✅ Set index for frequent lookups
df = df.set_index('id')  # If you filter by 'id' often
df.loc[12345]  # Fast lookup
```
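
Before relying on fast label slicing, it can help to confirm that the index really is sorted and unique. A small sketch with made-up data; the column names `id` and `value` are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": rng.permutation(1000),          # shuffled ids 0..999
    "value": rng.standard_normal(1000),
}).set_index("id")

print("Sorted before:", df.index.is_monotonic_increasing)

df_sorted = df.sort_index()
print("Sorted after: ", df_sorted.index.is_monotonic_increasing)
print("Unique:       ", df_sorted.index.is_unique)

# Label slicing with .loc includes both endpoints
window = df_sorted.loc[100:199]
print(len(window))
```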

### Copy vs Inplace

```python
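# Returns a new DataFrame; the original stays in memory while `df` still refers to it
df_new = df.drop('col', axis=1)

# ⚠️ Inplace (modifies original), but most methods still build a new
# object internally, so it rarely saves memory
df.drop('col', axis=1, inplace=True)

# Note: the pandas developers discourage inplace, and it breaks method chaining
# Better to rebind the name so the old object can be garbage-collected:
df = df.drop('col', axis=1)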
```

### Efficient Merging

```python
# ✅ Set the index once if you join on the same key repeatedly
df1 = df1.set_index('key')
df2 = df2.set_index('key')
result = df1.join(df2)  # Index-based join; often faster than merging on columns

# ✅ Specify merge columns explicitly
pd.merge(df1, df2, on='key', how='inner')  # Avoids joining on every shared column name
```

### Method Chaining

```python
# ✅ Efficient chaining
result = (df
    .query('age > 25')
    .groupby('city')['sales']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
```

In [4]:
print("=== EFFICIENT OPERATIONS ===\n")

# Create test data
n = 100_000
df = pd.DataFrame({
    'age': np.random.randint(20, 70, n),
    'city': np.random.choice(['NYC', 'LA', 'Chicago'], n),
    'score': np.random.randint(50, 100, n),
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.randn(n),
    'd': np.random.randn(n),
    'e': np.random.randn(n) + 1  # Avoid division by zero
})

# Example 1: query() vs boolean indexing
print("Example 1: query() vs boolean indexing\n")

# Boolean indexing
start = time.time()
result1 = df[(df['age'] > 25) & (df['city'] == 'NYC') & (df['score'] > 80)]
time_bool = time.time() - start
print(f"Boolean:  {time_bool:.4f}s")

# query()
start = time.time()
result2 = df.query('age > 25 and city == "NYC" and score > 80')
time_query = time.time() - start
print(f"query():  {time_query:.4f}s")

print(f"\nResults match: {len(result1) == len(result2)}")
if time_bool > time_query:
    print(f"query() is {time_bool/time_query:.1f}x faster")
else:
    print(f"Boolean is {time_query/time_bool:.1f}x faster")
print()

# Example 2: eval() for complex expressions
print("="*70)
print("Example 2: eval() for expressions\n")

# Standard operation
start = time.time()
df['result_standard'] = df['a'] + df['b'] * df['c'] - df['d'] / df['e']
time_standard = time.time() - start
print(f"Standard: {time_standard:.4f}s")

# eval()
start = time.time()
df['result_eval'] = df.eval('a + b * c - d / e')
time_eval = time.time() - start
print(f"eval():   {time_eval:.4f}s")

print(f"\neval() is {time_standard/time_eval:.1f}x faster")
print()

# Example 3: Efficient groupby
print("="*70)
print("Example 3: GroupBy performance\n")

# Single aggregation
start = time.time()
result1 = df.groupby('city')['score'].sum()
time1 = time.time() - start
print(f"Single agg:    {time1:.4f}s")

# Multiple separate aggregations
start = time.time()
sum_result = df.groupby('city')['score'].sum()
mean_result = df.groupby('city')['score'].mean()
count_result = df.groupby('city')['score'].count()
time2 = time.time() - start
print(f"Separate aggs: {time2:.4f}s")

# Combined aggregation (efficient)
start = time.time()
result3 = df.groupby('city')['score'].agg(['sum', 'mean', 'count'])
time3 = time.time() - start
print(f"Combined agg:  {time3:.4f}s")

print(f"\nCombined is {time2/time3:.1f}x faster than separate")
print()

# Example 4: Index optimization
print("="*70)
print("Example 4: Index optimization\n")

# Create DataFrame with random order
df_unsorted = df.copy()
df_unsorted.index = np.random.permutation(df.index)

# Selection without sorted index
start = time.time()
for i in range(100):
    _ = df_unsorted.loc[df_unsorted['city'] == 'NYC']
time_unsorted = time.time() - start
print(f"Unsorted index: {time_unsorted:.4f}s (100 selections)")

# Set and sort index
df_sorted = df.set_index('city').sort_index()

# Selection with sorted index
start = time.time()
for i in range(100):
    _ = df_sorted.loc['NYC']
time_sorted = time.time() - start
print(f"Sorted index:   {time_sorted:.4f}s (100 selections)")

print(f"\nSorted index is {time_unsorted/time_sorted:.1f}x faster")
print()

# Example 5: Method chaining
print("="*70)
print("Example 5: Efficient method chaining\n")

# Chained operations
start = time.time()
result = (df
    .query('age > 30')
    .groupby('city')['score']
    .mean()
    .sort_values(ascending=False)
)
time_chain = time.time() - start

print("Chained operation result:")
print(result)
print(f"\nTime: {time_chain:.4f}s")

=== EFFICIENT OPERATIONS ===

Example 1: query() vs boolean indexing

Boolean:  0.0050s
query():  0.0058s

Results match: True
Boolean is 1.1x faster

Example 2: eval() for expressions

Standard: 0.0006s
eval():   0.0016s

eval() is 0.4x faster

Example 3: GroupBy performance

Single agg:    0.0042s
Separate aggs: 0.0122s
Combined agg:  0.0054s

Combined is 2.3x faster than separate

Example 4: Index optimization

Unsorted index: 0.4978s (100 selections)
Sorted index:   0.0076s (100 selections)

Sorted index is 65.6x faster

Example 5: Efficient method chaining

Chained operation result:
city
LA         74.45
NYC        74.39
Chicago    74.30
Name: score, dtype: float64

Time: 0.0074s


## 4. Large Data Handling

### Reading Large Files Efficiently

**1. Specify dtypes**
```python
# ❌ Pandas infers dtypes (slow, memory-hungry)
df = pd.read_csv('large.csv')

# ✅ Specify dtypes upfront
dtypes = {
    'id': 'int32',
    'value': 'float32',
    'category': 'category'
}
df = pd.read_csv('large.csv', dtype=dtypes)
```

**2. Select only needed columns**
```python
# ❌ Read all columns
df = pd.read_csv('large.csv')  # 50 columns

# ✅ Read only what you need
df = pd.read_csv('large.csv', usecols=['id', 'name', 'value'])
```

**3. Parse dates efficiently**
```python
# ❌ Parse after reading
df = pd.read_csv('large.csv')
df['date'] = pd.to_datetime(df['date'])

# ✅ Parse during reading
df = pd.read_csv('large.csv', parse_dates=['date'])
```

### Chunked Processing

**Process in chunks**
```python
# Process 10,000 rows at a time
chunk_size = 10_000
results = []

for chunk in pd.read_csv('large.csv', chunksize=chunk_size):
    # Process chunk
    processed = chunk[chunk['value'] > 100]
    results.append(processed)

# Combine results
df = pd.concat(results, ignore_index=True)
```

**Aggregate while chunking**
```python
# Calculate sum without loading all data
total = 0

for chunk in pd.read_csv('large.csv', chunksize=10_000):
    total += chunk['value'].sum()

print(f"Total: {total}")
```
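
A running total works for sums, but a per-group mean needs both sums and counts from every chunk. The sketch below keeps one small partial result per chunk and combines them at the end. It reads from an in-memory CSV so it runs as-is; the `category` and `value` columns and the tiny chunk size are assumptions for illustration.

```python
import io

import pandas as pd

csv_text = "category,value\nA,1\nB,10\nA,3\nB,20\nA,5\n"

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only the per-group sum and count of this chunk
    partials.append(chunk.groupby("category")["value"].agg(["sum", "count"]))

# Add up the partial sums and counts, then divide once
totals = pd.concat(partials).groupby(level=0).sum()
group_means = totals["sum"] / totals["count"]
print(group_means)
```

Averaging the per-chunk means directly would give the wrong answer whenever chunks contain different numbers of rows per group, which is why sums and counts are carried separately.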

### Sampling

```python
# ✅ Random sample for development
df = pd.read_csv('large.csv', 
                 skiprows=lambda i: i > 0 and np.random.random() > 0.01)
# Reads ~1% of rows

# ✅ First N rows for testing
df = pd.read_csv('large.csv', nrows=1000)
```
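
The `skiprows` approach draws new random numbers on every run, so the sample changes each time. One reproducible alternative, sketched here under the assumption that a fixed fraction per chunk is acceptable, is to sample each chunk with a seeded `random_state`. The in-memory CSV, seed, and chunk size are made up for the example.

```python
import io

import pandas as pd

csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

seed = 42
pieces = []
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=100)):
    # A different but deterministic seed per chunk
    pieces.append(chunk.sample(frac=0.1, random_state=seed + i))

sample = pd.concat(pieces, ignore_index=True)
print(len(sample))
```

Unlike `skiprows`, this still parses every row, so it trades some read time for a sample that is identical across runs.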

### Compression

```python
# Read compressed files (auto-detected)
df = pd.read_csv('data.csv.gz')  # gzip
df = pd.read_csv('data.csv.bz2')  # bzip2
df = pd.read_csv('data.csv.zip')  # zip

# Save with compression
df.to_csv('data.csv.gz', compression='gzip')
```

### Alternative Formats

```python
# Parquet (fast, compressed, preserves dtypes; requires pyarrow or fastparquet)
df.to_parquet('data.parquet')  # Save
df = pd.read_parquet('data.parquet')  # Load

# Feather (very fast, for temporary storage; requires pyarrow)
df.to_feather('data.feather')
df = pd.read_feather('data.feather')

# Pickle (preserves everything, Python-only)
df.to_pickle('data.pkl')
df = pd.read_pickle('data.pkl')
```

### Format Comparison

```
Format      Speed    Size    Cross-Lang  Best For
-------     -----    ----    ----------  --------
CSV         Slow     Large   ✅          Sharing, human-readable
Parquet     Fast     Small   ✅          Production, archiving
Feather     Fastest  Medium  ✅          Temporary, inter-process
Pickle      Fast     Medium  ❌          Python-only, quick save
HDF5        Fast     Small   ✅          Large datasets, append
```

In [5]:
print("=== LARGE DATA HANDLING ===\n")

# Create sample large CSV for demonstration
n = 1_000_000
sample_data = pd.DataFrame({
    'id': range(n),
    'category': np.random.choice(['A', 'B', 'C'], n),
    'value': np.random.randn(n),
    'amount': np.random.randint(0, 1000, n),
    'date': pd.date_range('2020-01-01', periods=n, freq='min')
})
sample_data.to_csv('large_sample.csv', index=False)
print("✅ Created large_sample.csv (1M rows)\n")

# Example 1: Default vs optimized reading
print("Example 1: Optimized CSV reading\n")

# Default reading
start = time.time()
df_default = pd.read_csv('large_sample.csv')
time_default = time.time() - start
mem_default = df_default.memory_usage(deep=True).sum() / 1024**2
print(f"Default read:")
print(f"  Time: {time_default:.2f}s")
print(f"  Memory: {mem_default:.1f} MB")

# Optimized reading
start = time.time()
df_optimized = pd.read_csv('large_sample.csv',
                           dtype={'id': 'int32',
                                  'category': 'category',
                                  'value': 'float32',
                                  'amount': 'int16'},
                           parse_dates=['date'])
time_optimized = time.time() - start
mem_optimized = df_optimized.memory_usage(deep=True).sum() / 1024**2
print(f"\nOptimized read:")
print(f"  Time: {time_optimized:.2f}s")
print(f"  Memory: {mem_optimized:.1f} MB")

print(f"\nMemory savings: {(1 - mem_optimized/mem_default)*100:.1f}%")
print()

# Example 2: Select only needed columns
print("="*70)
print("Example 2: Column selection (usecols)\n")

# Read all columns
start = time.time()
df_all = pd.read_csv('large_sample.csv')
time_all = time.time() - start
mem_all = df_all.memory_usage(deep=True).sum() / 1024**2
print(f"All columns (5):")
print(f"  Time: {time_all:.2f}s")
print(f"  Memory: {mem_all:.1f} MB")

# Read only 2 columns
start = time.time()
df_subset = pd.read_csv('large_sample.csv', usecols=['id', 'value'])
time_subset = time.time() - start
mem_subset = df_subset.memory_usage(deep=True).sum() / 1024**2
print(f"\nOnly 2 columns:")
print(f"  Time: {time_subset:.2f}s")
print(f"  Memory: {mem_subset:.1f} MB")

print(f"\nMemory savings: {(1 - mem_subset/mem_all)*100:.1f}%")
print(f"Speed improvement: {time_all/time_subset:.1f}x")
print()

# Example 3: Chunked processing
print("="*70)
print("Example 3: Chunked processing\n")

# Process in chunks and filter
chunk_size = 100_000
results = []

start = time.time()
for i, chunk in enumerate(pd.read_csv('large_sample.csv', chunksize=chunk_size)):
    # Filter chunk
    filtered = chunk[chunk['amount'] > 500]
    results.append(filtered)
    if i == 0:
        print(f"Processing chunk {i+1}: {len(filtered)} rows kept")

df_filtered = pd.concat(results, ignore_index=True)
time_chunk = time.time() - start

print(f"\nProcessed {len(sample_data)} rows in {time_chunk:.2f}s")
print(f"Result: {len(df_filtered)} rows (filtered)")
print()

# Example 4: Aggregation without loading all data
print("="*70)
print("Example 4: Chunk aggregation\n")

# Calculate sum and count without loading full DataFrame
total_sum = 0
total_count = 0

start = time.time()
for chunk in pd.read_csv('large_sample.csv', chunksize=100_000):
    total_sum += chunk['amount'].sum()
    total_count += len(chunk)

mean_value = total_sum / total_count
time_agg = time.time() - start

print(f"Total sum: {total_sum:,.0f}")
print(f"Mean: {mean_value:.2f}")
print(f"Time: {time_agg:.2f}s")
print()

# Example 5: Sampling
print("="*70)
print("Example 5: Sampling strategies\n")

# Read first N rows
start = time.time()
df_head = pd.read_csv('large_sample.csv', nrows=10_000)
time_head = time.time() - start
print(f"First 10K rows: {time_head:.4f}s")

# Read random sample (~1%)
start = time.time()
df_sample = pd.read_csv('large_sample.csv',
                        skiprows=lambda i: i > 0 and np.random.random() > 0.01)
time_sample = time.time() - start
print(f"Random 1% sample: {time_sample:.4f}s ({len(df_sample)} rows)")
print()

# Example 6: File format comparison
print("="*70)
print("Example 6: File format performance\n")

# Prepare smaller dataset for format comparison
df_format = sample_data.head(100_000)

# CSV
start = time.time()
df_format.to_csv('test.csv', index=False)
time_csv_write = time.time() - start
start = time.time()
_ = pd.read_csv('test.csv')
time_csv_read = time.time() - start
import os
size_csv = os.path.getsize('test.csv') / 1024**2

# Parquet
start = time.time()
df_format.to_parquet('test.parquet')
time_parquet_write = time.time() - start
start = time.time()
_ = pd.read_parquet('test.parquet')
time_parquet_read = time.time() - start
size_parquet = os.path.getsize('test.parquet') / 1024**2

# Pickle
start = time.time()
df_format.to_pickle('test.pkl')
time_pickle_write = time.time() - start
start = time.time()
_ = pd.read_pickle('test.pkl')
time_pickle_read = time.time() - start
size_pickle = os.path.getsize('test.pkl') / 1024**2

print(f"{'Format':<10} {'Write':<10} {'Read':<10} {'Size (MB)':<10}")
print("-" * 40)
print(f"{'CSV':<10} {time_csv_write:<10.3f} {time_csv_read:<10.3f} {size_csv:<10.1f}")
print(f"{'Parquet':<10} {time_parquet_write:<10.3f} {time_parquet_read:<10.3f} {size_parquet:<10.1f}")
print(f"{'Pickle':<10} {time_pickle_write:<10.3f} {time_pickle_read:<10.3f} {size_pickle:<10.1f}")

# Cleanup (os is already imported above)
for f in ['large_sample.csv', 'test.csv', 'test.parquet', 'test.pkl']:
    if os.path.exists(f):
        os.remove(f)
print("\n✅ Cleanup complete")

=== LARGE DATA HANDLING ===





✅ Created large_sample.csv (1M rows)

Example 1: Optimized CSV reading

Default read:
  Time: 0.50s
  Memory: 135.4 MB

Optimized read:
  Time: 1.76s
  Memory: 18.1 MB

Memory savings: 86.6%

Example 2: Column selection (usecols)

All columns (5):
  Time: 0.46s
  Memory: 135.4 MB

Only 2 columns:
  Time: 0.18s
  Memory: 15.3 MB

Memory savings: 88.7%
Speed improvement: 2.6x

Example 3: Chunked processing

Processing chunk 1: 49788 rows kept

Processed 1000000 rows in 0.48s
Result: 499053 rows (filtered)

Example 4: Chunk aggregation

Total sum: 499,645,967
Mean: 499.65
Time: 0.59s

Example 5: Sampling strategies

First 10K rows: 0.0055s
Random 1% sample: 0.4565s (9968 rows)

Example 6: File format performance

Format     Write      Read       Size (MB) 
----------------------------------------
CSV        0.273      0.048      4.9       
Parquet    2.533      3.308      2.6       
Pickle     0.004      0.002      3.2       

✅ Cleanup complete


## 5. Advanced Techniques

### Parallel Processing

**Using swifter (auto-parallelizes apply)**
```python
# Install: pip install swifter
import swifter

# ❌ Single-core apply
df['result'] = df['col'].apply(complex_function)

# ✅ swifter picks a strategy (vectorize, parallelize, or plain apply)
df['result'] = df['col'].swifter.apply(complex_function)
```

**Using Dask (parallel Pandas)**
```python
# Install: pip install "dask[dataframe]"
import dask.dataframe as dd

# Convert to Dask DataFrame
ddf = dd.from_pandas(df, npartitions=4)

# Operations are parallelized
result = ddf.groupby('category')['value'].mean().compute()
```

**Using multiprocessing**
```python
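from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # Your processing logic
    return chunk[chunk['value'] > 0]

if __name__ == '__main__':  # Needed where worker processes are spawned (Windows, macOS)
    # Split row positions, then slice; np.array_split(df, 4) warns in recent pandas
    chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), 4)]

    # Process in parallel
    with Pool(4) as pool:
        results = pool.map(process_chunk, chunks)

    # Combine results
    df_result = pd.concat(results)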
```

### Numba Acceleration

```python
# Install: pip install numba
from numba import jit

# ❌ Slow Python function
def calculate_slow(arr):
    result = np.zeros_like(arr)
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + arr[i] ** 0.5
    return result

# ✅ JIT-compiled (10-100x faster)
@jit(nopython=True)
def calculate_fast(arr):
    result = np.zeros_like(arr)
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + arr[i] ** 0.5
    return result

df['result'] = calculate_fast(df['value'].values)
```
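Before reaching for Numba, it is often worth checking whether the loop collapses into a single NumPy expression. The sketch below does that for the formula above and confirms both versions agree; `calculate_loop` and `calculate_vectorized` are illustrative names, and the input is restricted to non-negative values so the square root is defined.

```python
import numpy as np

def calculate_loop(arr):
    """Element-by-element version (the kind of code Numba is meant to speed up)."""
    result = np.zeros_like(arr)
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + arr[i] ** 0.5
    return result

def calculate_vectorized(arr):
    """Same formula as one NumPy expression, with no Python-level loop."""
    return arr ** 2 + np.sqrt(arr)

# Non-negative floats so the square root is defined everywhere
values = np.linspace(0.0, 10.0, 1_000)

loop_result = calculate_loop(values)
vectorized_result = calculate_vectorized(values)
same = np.allclose(loop_result, vectorized_result)
print(f"Results match: {same}")
```

Numba tends to pay off when the loop body has branches or state that NumPy cannot express directly.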

### Sparse Data Structures

```python
# When data is mostly zeros/NaN
import pandas as pd

# Create sparse array (zeros are the fill value and are not stored)
sparse_arr = pd.arrays.SparseArray([0, 0, 1, 0, 0, 2, 0, 0])
df = pd.DataFrame({'sparse': sparse_arr})

# Large memory savings for sparse data
# Example: 1M int64 rows, 99% zeros
# Dense: ~8 MB
# Sparse: ~0.1 MB (about 98-99% savings; stored values plus their positions)
```
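An existing dense column can also be converted in place of building a `SparseArray` by hand. The following is a minimal sketch on made-up data (about 99% zeros, generated with a fixed seed); it uses `pd.SparseDtype` and the `.sparse` accessor to compare memory and inspect the density.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
# Roughly 99% zeros, 1% ones
dense = pd.Series((rng.random(n) < 0.01).astype("int64"))

# Convert an existing column; fill_value=0 means zeros are not stored
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
density = sparse.sparse.density  # fraction of values actually stored

print(f"Dense:   {dense_bytes / 1024:.1f} KB")
print(f"Sparse:  {sparse_bytes / 1024:.1f} KB")
print(f"Density: {density:.3f}")

# Convert back to a regular column when an operation needs dense data
restored = sparse.sparse.to_dense()
```

The savings shrink as density grows, because each stored value also carries its position.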

### String Performance

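```python
# ✅ Use string dtype for text columns
df['text'] = df['text'].astype('string')  # Pandas 1.0+

# Benefits:
# - Consistent NA handling (pd.NA)
# - Explicit text dtype instead of generic object
# - Memory and speed gains come mainly from the PyArrow backend:
#   df['text'].astype('string[pyarrow]')  # Pandas 1.3+, requires pyarrow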
```
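The NA-handling difference is easiest to see on a column with a missing value. This small sketch (illustrative data only) compares the result dtype of `.str.len()` for an object column and a `string` column.

```python
import pandas as pd

names = ["alice", None, "bob"]

as_object = pd.Series(names, dtype="object")
as_string = pd.Series(names, dtype="string")

# Object dtype: the missing value forces a float result with NaN
object_lengths = as_object.str.len()

# String dtype: missing stays pd.NA and the result is nullable Int64
string_lengths = as_string.str.len()

print(object_lengths.dtype, string_lengths.dtype)
```

Keeping integer results as integers avoids a silent upcast to float further down the pipeline.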

### Copy-on-Write (COW)

```python
# Enable COW mode (Pandas 2.0+)
pd.options.mode.copy_on_write = True

# Benefits:
# - Avoids unnecessary copies
# - Better memory usage
# - More predictable behavior
```

### Efficient String Operations

```python
# ✅ Vectorized regex
df['matches'] = df['text'].str.contains(r'pattern', regex=True)

# ✅ Extract groups
df[['first', 'last']] = df['name'].str.extract(r'(\w+)\s(\w+)')

# ✅ Replace efficiently
df['clean'] = df['text'].str.replace(r'[^a-zA-Z]', '', regex=True)
```
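Two details of the string accessor are easy to miss: named groups in `str.extract` become column names, and rows that do not match produce NaN. A minimal sketch with made-up names:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing", "Prince"]})

# Named groups become column names; rows that do not match yield NaN
parts = people["name"].str.extract(r"(?P<first>\w+)\s(?P<last>\w+)")

# na=False reports missing values as False instead of NaN
has_two_words = people["name"].str.contains(r"^\w+\s\w+$", regex=True, na=False)

print(parts)
print(has_two_words.tolist())
```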

### Window Operations

```python
# Rolling operations (efficient C implementation)
df['rolling_mean'] = df['value'].rolling(window=7).mean()
df['rolling_sum'] = df['value'].rolling(window=7).sum()

# Expanding operations
df['cumulative_mean'] = df['value'].expanding().mean()
```
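The leading NaN values that `rolling()` produces are a common surprise. The sketch below, on a five-element series, shows how `min_periods` changes that and how `expanding()` relates to a running mean.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: the first window-1 results are NaN because the window is incomplete
strict = values.rolling(window=3).mean()

# min_periods=1 emits a result as soon as one observation is available
lenient = values.rolling(window=3, min_periods=1).mean()

# expanding() grows the window from the start, i.e. a running mean
running = values.expanding().mean()

print(pd.DataFrame({"strict": strict, "lenient": lenient, "running": running}))
```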

In [6]:
print("=== ADVANCED TECHNIQUES ===\n")

# Example 1: Numba acceleration (if available)
print("Example 1: Numba acceleration\n")

try:
    from numba import jit
    
    n = 1_000_000
    arr = np.random.randn(n)
    
    # Regular Python function
    def calculate_python(arr):
        result = np.zeros_like(arr)
        for i in range(len(arr)):
            result[i] = arr[i] ** 2 + np.sqrt(np.abs(arr[i]))
        return result
    
    # JIT-compiled version
    @jit(nopython=True)
    def calculate_numba(arr):
        result = np.zeros_like(arr)
        for i in range(len(arr)):
            result[i] = arr[i] ** 2 + np.sqrt(np.abs(arr[i]))
        return result
    
    # Warm up JIT
    _ = calculate_numba(arr[:100])
    
    # Benchmark
    start = time.time()
    result_python = calculate_python(arr)
    time_python = time.time() - start
    print(f"Python loop:  {time_python:.4f}s")
    
    start = time.time()
    result_numba = calculate_numba(arr)
    time_numba = time.time() - start
    print(f"Numba JIT:    {time_numba:.4f}s")
    
    print(f"\nNumba is {time_python/time_numba:.0f}x faster!")
    
except ImportError:
    print("⚠️ Numba not installed (pip install numba)")
    print("Skipping Numba example")

print()

# Example 2: Sparse arrays
print("="*70)
print("Example 2: Sparse arrays for sparse data\n")

n = 1_000_000
# Create mostly zero array (95% zeros)
sparse_data = np.random.choice([0, 1, 2, 3], n, p=[0.95, 0.03, 0.01, 0.01])

# Dense array
df_dense = pd.DataFrame({'values': sparse_data})
mem_dense = df_dense.memory_usage(deep=True).sum() / 1024**2
print(f"Dense array:  {mem_dense:.2f} MB")

# Sparse array
df_sparse = pd.DataFrame({
    'values': pd.arrays.SparseArray(sparse_data)
})
mem_sparse = df_sparse.memory_usage(deep=True).sum() / 1024**2
print(f"Sparse array: {mem_sparse:.2f} MB")
print(f"\nMemory savings: {(1 - mem_sparse/mem_dense)*100:.1f}%")
print()

# Example 3: String dtype
print("="*70)
print("Example 3: String dtype performance\n")

n = 100_000
strings = ['text_' + str(i % 100) for i in range(n)]

# Object dtype
df_object = pd.DataFrame({'text': strings})
mem_object = df_object.memory_usage(deep=True).sum() / 1024**2
print(f"Object dtype:  {mem_object:.2f} MB")

# String dtype
df_string = pd.DataFrame({'text': pd.array(strings, dtype='string')})
mem_string = df_string.memory_usage(deep=True).sum() / 1024**2
print(f"String dtype:  {mem_string:.2f} MB")

# Performance test
start = time.time()
_ = df_object['text'].str.upper()
time_object = time.time() - start
print(f"\nObject upper:  {time_object:.4f}s")

start = time.time()
_ = df_string['text'].str.upper()
time_string = time.time() - start
print(f"String upper:  {time_string:.4f}s")
print()

# Example 4: Window operations
print("="*70)
print("Example 4: Efficient window operations\n")

n = 1_000_000
df_window = pd.DataFrame({
    'value': np.random.randn(n)
})

# Rolling mean (efficient C implementation)
start = time.time()
df_window['rolling_mean'] = df_window['value'].rolling(window=100).mean()
time_rolling = time.time() - start
print(f"Rolling mean (window=100): {time_rolling:.4f}s")

# Expanding mean
start = time.time()
df_window['expanding_mean'] = df_window['value'].expanding().mean()
time_expanding = time.time() - start
print(f"Expanding mean:            {time_expanding:.4f}s")
print()

# Example 5: Efficient regex operations
print("="*70)
print("Example 5: Vectorized string operations\n")

n = 10_000
emails = [f"user{i}@example.com" for i in range(n)]
df_email = pd.DataFrame({'email': emails})

# Extract username (vectorized)
start = time.time()
df_email['username'] = df_email['email'].str.extract(r'(.+)@')
time_extract = time.time() - start
print(f"Extract username:  {time_extract:.4f}s")

# Check pattern (vectorized)
start = time.time()
df_email['is_valid'] = df_email['email'].str.contains(r'^[\w.]+@[\w.]+\.[a-z]+$', regex=True)
time_pattern = time.time() - start
print(f"Pattern matching:  {time_pattern:.4f}s")

# Replace (vectorized)
start = time.time()
df_email['domain'] = df_email['email'].str.replace(r'.+@', '', regex=True)
time_replace = time.time() - start
print(f"Replace pattern:   {time_replace:.4f}s")
print()

# Example 6: Copy-on-Write mode
print("="*70)
print("Example 6: Copy-on-Write optimization\n")

# Check if COW is available (Pandas 2.0+)
try:
    original_cow = pd.options.mode.copy_on_write
    print(f"Copy-on-Write mode: {original_cow}")
    print("\nCOW Benefits:")
    print("  - Avoids unnecessary data copies")
    print("  - Reduces memory usage")
    print("  - More predictable behavior")
    print("\nTo enable: pd.options.mode.copy_on_write = True")
except AttributeError:
    print("⚠️ Copy-on-Write not available (requires Pandas 2.0+)")
    print("Your version:", pd.__version__)

=== ADVANCED TECHNIQUES ===

Example 1: Numba acceleration

Python loop:  1.2026s
Numba JIT:    0.0010s

Numba is 1191x faster!

Example 2: Sparse arrays for sparse data

Dense array:  7.63 MB
Sparse array: 0.57 MB

Memory savings: 92.5%

Example 3: String dtype performance

Object dtype:  5.33 MB
String dtype:  5.33 MB

Object upper:  0.0078s
String upper:  0.0085s

Example 4: Efficient window operations

Rolling mean (window=100): 0.0140s
Expanding mean:            0.0099s

Example 5: Vectorized string operations

Extract username:  0.0059s
Pattern matching:  0.0034s
Replace pattern:   0.0044s

Example 6: Copy-on-Write optimization

Copy-on-Write mode: False

COW Benefits:
  - Avoids unnecessary data copies
  - Reduces memory usage
  - More predictable behavior

To enable: pd.options.mode.copy_on_write = True


## 6. Best Practices & Optimization Checklist

### Memory Optimization Checklist ✅

```
# 1. Use optimal data types
☐ int64 → int8/int16/int32 (if range allows)
☐ float64 → float32 (if precision allows)
☐ object → category (for repeated values)
☐ object → string (for text data)

# 2. Read efficiently
☐ Specify dtype in read_csv()
☐ Use usecols to read only needed columns
☐ Parse dates during reading
☐ Use chunksize for large files

# 3. Remove unused data
☐ Drop unnecessary columns
☐ Filter early in pipeline
☐ Use sparse arrays for sparse data
```
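The data-type items in this checklist can be automated. Below is one possible sketch, not a definitive implementation: `shrink_dtypes` and its `category_threshold` parameter are hypothetical names introduced here, and the sample frame is made up. It relies on `pd.to_numeric(..., downcast=...)`, which picks the smallest dtype that still holds the values.

```python
import numpy as np
import pandas as pd

def shrink_dtypes(df, category_threshold=0.5):
    """Return a copy of df with smaller dtypes where the values allow it.

    category_threshold is a hypothetical knob: text columns whose ratio of
    unique values to rows is below it are converted to category.
    """
    types = pd.api.types
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif types.is_float_dtype(series):
            # Note: float32 keeps only about 7 significant digits
            out[col] = pd.to_numeric(series, downcast="float")
        elif types.is_object_dtype(series) or types.is_string_dtype(series):
            if series.nunique() / max(len(series), 1) < category_threshold:
                out[col] = series.astype("category")
    return out

raw = pd.DataFrame({
    "small_int": np.arange(1000, dtype="int64") % 100,
    "price": np.arange(1000) * 0.25,
    "city": pd.Series(["NYC", "LA", "SF", "NYC"] * 250, dtype="object"),
})
optimized = shrink_dtypes(raw)

before = raw.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"Memory: {before:,} -> {after:,} bytes")
```

Downcasting floats is lossy in general, so it is worth checking the results that depend on those columns afterwards.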

### Speed Optimization Checklist ⚡

```
# 1. Vectorize operations
☐ Use built-in Pandas/NumPy functions
☐ Avoid iterrows(), itertuples()
☐ Replace apply() with vectorized alternatives
☐ Use np.where() for conditionals
☐ Use np.select() for multiple conditions

# 2. Optimize data access
☐ Sort index if slicing frequently
☐ Set index for frequent lookups
☐ Use query() for complex filters
☐ Use eval() for expressions

# 3. Efficient groupby
☐ Use built-in aggregations (sum, mean)
☐ Combine multiple aggregations with agg()
☐ Avoid custom functions in groupby if possible
```
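The groupby items can be illustrated on a tiny, made-up sales table. The sketch computes a per-group range twice, once with a Python lambda and once with built-in aggregations, and then shows named aggregation for readable output columns.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Slower pattern: a Python function is called once per group
spread_apply = sales.groupby("region")["amount"].apply(lambda s: s.max() - s.min())

# Faster pattern: built-in aggregations run in compiled code,
# then the subtraction is vectorized across groups
stats = sales.groupby("region")["amount"].agg(["min", "max", "sum"])
spread_builtin = stats["max"] - stats["min"]

# Named aggregation gives readable output column names
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
)
print(summary)
```

With only two groups the difference is invisible; it grows with the number of groups, since the lambda runs once per group.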

### Development Workflow 🔄

```python
# 1. Start small
df_sample = pd.read_csv('large.csv', nrows=1000)  # Test on small sample
# Develop and debug on sample

# 2. Profile performance
%timeit operation  # Jupyter magic
# Or use timer decorator

# 3. Check memory
df.info(memory_usage='deep')
df.memory_usage(deep=True).sum() / 1024**2  # MB

# 4. Optimize iteratively
# - Identify bottlenecks
# - Apply optimizations
# - Measure improvements
# - Repeat

# 5. Scale up
# Apply to full dataset once optimized
```
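Outside Jupyter, the standard-library `timeit` module can stand in for `%timeit`. The sketch below is one way to do step 2 in a plain script; `best_time`, `with_apply`, and `vectorized` are illustrative names, and the frame is kept small so the slow variant finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

df_bench = pd.DataFrame({"a": np.arange(2_000), "b": np.arange(2_000)})

def with_apply():
    # Row-wise apply: one Python call per row
    return df_bench.apply(lambda row: row["a"] + row["b"], axis=1)

def vectorized():
    # Single column-wise addition
    return df_bench["a"] + df_bench["b"]

def best_time(func, repeat=3, number=1):
    """Best of several runs; the minimum is usually the least noisy estimate."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

t_apply = best_time(with_apply)
t_vec = best_time(vectorized)
print(f"apply:      {t_apply:.4f}s")
print(f"vectorized: {t_vec:.4f}s")
```

Absolute numbers depend on the machine and pandas version, so compare variants within one run and not across runs.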

### Common Anti-Patterns ❌

```python
# ❌ DON'T: Loop over rows
for idx, row in df.iterrows():
    df.loc[idx, 'new'] = row['a'] + row['b']

# ✅ DO: Vectorize
df['new'] = df['a'] + df['b']

# ❌ DON'T: Grow DataFrame in loop (copies all rows on every iteration)
df = pd.DataFrame()
for i in range(1000):
    new_row = pd.DataFrame([{'id': i}])  # placeholder row
    df = pd.concat([df, new_row])

# ✅ DO: Build list, then create DataFrame
rows = []
for i in range(1000):
    rows.append({'id': i})  # placeholder row
df = pd.DataFrame(rows)

# ❌ DON'T: Use apply for simple operations
df['result'] = df['value'].apply(lambda x: x * 2)

# ✅ DO: Use vectorized operations
df['result'] = df['value'] * 2

# ❌ DON'T: Read all columns if you need few
df = pd.read_csv('large.csv')  # 50 columns
df = df[['col1', 'col2']]  # Use only 2

# ✅ DO: Read only needed columns
df = pd.read_csv('large.csv', usecols=['col1', 'col2'])

# ❌ DON'T: Use object dtype for categories
df['category'] = df['category'].astype('object')

# ✅ DO: Use category dtype
df['category'] = df['category'].astype('category')
```

### Performance Targets 🎯

```
Data Size    Operations Should Be
---------    --------------------
< 10K rows   Instant (< 0.1s)
< 100K       Fast (< 1s)
< 1M         Reasonable (< 10s)
> 1M         Consider chunking or Dask

If slower, optimize!
```

### When to Use What

```
# Vectorized operations
When: Always try first
Example: df['result'] = df['a'] + df['b']

# np.where() / np.select()
When: Conditional logic
Example: df['category'] = np.where(df['value'] > 0, 'Pos', 'Neg')

# apply() with built-in function
When: Need to apply built-in function
Example: df['result'] = df['col'].apply(len)

# apply() with custom function
When: Complex logic, can't vectorize
Example: df['result'] = df.apply(complex_logic, axis=1)

# Numba
When: Custom numeric computations, need speed
Example: @jit(nopython=True) def compute(arr): ...

# Dask/parallel
When: Very large datasets, embarrassingly parallel
Example: ddf.groupby('key').agg('sum').compute()

# Chunking
When: Dataset doesn't fit in memory
Example: for chunk in pd.read_csv(..., chunksize=10000):
```
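The chunking entry can be made concrete with running totals. In this sketch an in-memory CSV (built with `io.StringIO` from made-up data) stands in for a large file; only one chunk is held in memory at a time, and partial group sums are merged as the loop goes.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a large file on disk (hypothetical data)
csv_text = pd.DataFrame({
    "key": ["a", "b", "c", "d"] * 250,
    "value": np.arange(1000),
}).to_csv(index=False)

total = 0
count = 0
per_key = None

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=300):
    # Running totals: only the current chunk is in memory
    total += chunk["value"].sum()
    count += len(chunk)

    # Partial group sums are combined across chunks
    partial = chunk.groupby("key")["value"].sum()
    per_key = partial if per_key is None else per_key.add(partial, fill_value=0)

mean = total / count
print(f"rows={count}, sum={total}, mean={mean:.2f}")
print(per_key)
```

Sums and counts combine cleanly across chunks; statistics such as the median do not and need a different approach.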

## Practice Exercises

### Beginner Level (1-5)

1. **Memory Profiling**
   - Check memory usage of a DataFrame
   - Identify which columns use most memory
   - Calculate total memory in MB

2. **Data Type Optimization**
   - Convert int64 column (values 0-255) to uint8 (int8 only covers -128 to 127)
   - Convert float64 to float32
   - Convert object column to category

3. **Vectorization Basics**
   - Replace loop with vectorized arithmetic
   - Use np.where() for conditional
   - Time both approaches

4. **Efficient Reading**
   - Read CSV with dtype specification
   - Use usecols to read subset
   - Compare memory usage

5. **String Operations**
   - Replace apply() with str accessor
   - Measure performance difference

### Intermediate Level (6-10)

6. **Automatic Optimization**
   - Write function to optimize all dtypes
   - Handle int, float, and object columns
   - Report memory savings

7. **Query vs Boolean**
   - Implement same filter with both methods
   - Benchmark on large DataFrame
   - Compare performance

8. **Chunked Processing**
   - Read large CSV in chunks
   - Filter each chunk
   - Combine results

9. **Efficient GroupBy**
   - Compare custom vs built-in aggregation
   - Use agg() for multiple aggregations
   - Benchmark performance

10. **eval() Usage**
    - Create complex expression
    - Implement with standard operations
    - Implement with eval()
    - Compare performance

### Advanced Level (11-15)

11. **File Format Comparison**
    - Save DataFrame in CSV, Parquet, Pickle
    - Compare write/read times
    - Compare file sizes

12. **Sparse Arrays**
    - Create DataFrame with 90% zeros
    - Compare dense vs sparse memory
    - Perform operations on sparse

13. **Parallel Processing**
    - Split DataFrame into chunks
    - Process chunks in parallel (multiprocessing)
    - Compare with single-threaded

14. **Index Optimization**
    - Benchmark lookups without index
    - Set and sort index
    - Benchmark lookups with index
    - Calculate speedup

15. **Complete Optimization Pipeline**
    - Read large CSV
    - Optimize dtypes
    - Vectorize operations
    - Use efficient aggregations
    - Save in optimal format

### Challenge Problems (16-20)

16. **Memory Budget Challenge**
    - Given: 10M row DataFrame, 100MB memory limit
    - Optimize to fit in budget
    - Maintain functionality

17. **Speed Challenge**
    - Given: Slow data pipeline (1 minute)
    - Apply all optimization techniques
    - Target: < 5 seconds

18. **Large File Processing**
    - Process 1GB+ CSV file
    - Don't load all into memory
    - Perform aggregations
    - Save results

19. **Real-World Pipeline**
    - Read from multiple CSVs
    - Merge efficiently
    - Apply transformations (vectorized)
    - Aggregate results
    - Optimize entire pipeline

20. **Numba Integration**
    - Implement custom function
    - Compare Python, NumPy, Numba versions
    - Integrate into Pandas workflow

## Quick Reference Card

### Memory Optimization

```python
# Check memory
df.info(memory_usage='deep')
df.memory_usage(deep=True).sum() / 1024**2  # MB

# Optimize integers
df['col'] = df['col'].astype('int8')    # -128 to 127
df['col'] = df['col'].astype('int16')   # -32K to 32K
df['col'] = df['col'].astype('int32')   # -2B to 2B

# Optimize floats
df['col'] = df['col'].astype('float32')  # 7 digits precision

# Categoricals
df['col'] = df['col'].astype('category')  # For repeated values

# String dtype
df['col'] = df['col'].astype('string')  # Dedicated text dtype with pd.NA support
```

### Vectorization

```python
# Arithmetic (always vectorized)
df['result'] = df['a'] + df['b'] * df['c']

# Conditionals
df['cat'] = np.where(df['val'] > 0, 'Pos', 'Neg')

# Multiple conditions
conditions = [df['val'] > 100, df['val'] > 50]
choices = ['High', 'Medium']
df['cat'] = np.select(conditions, choices, default='Low')

# String operations
df['upper'] = df['text'].str.upper()
df['len'] = df['text'].str.len()
df['contains'] = df['text'].str.contains('pattern')

# Math operations
df['sqrt'] = np.sqrt(df['val'])
df['log'] = np.log(df['val'])
```

### Efficient Reading

```python
# Specify dtypes
df = pd.read_csv('file.csv', dtype={'id': 'int32', 'cat': 'category'})

# Select columns
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])

# Parse dates
df = pd.read_csv('file.csv', parse_dates=['date'])

# Chunked reading
for chunk in pd.read_csv('file.csv', chunksize=10000):
    process(chunk)

# Sampling
df = pd.read_csv('file.csv', nrows=1000)  # First 1000
df = pd.read_csv('file.csv',  # ~1% sample
                 skiprows=lambda i: i > 0 and np.random.random() > 0.01)
```
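The reading options above can be combined in one call. The following sketch uses a small in-memory CSV with made-up columns so it runs without a file; with a real file, the first argument would be the path.

```python
import io

import pandas as pd

csv_text = (
    "id,category,date,price,notes\n"
    "1,a,2024-01-01,9.5,first\n"
    "2,b,2024-01-02,3.0,second\n"
    "3,a,2024-01-03,7.25,third\n"
)

# Baseline: every column, inferred dtypes
full = pd.read_csv(io.StringIO(csv_text))

# Tuned: skip unused columns, fix dtypes, parse dates while reading
lean = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "category", "date", "price"],
    dtype={"id": "int32", "category": "category", "price": "float32"},
    parse_dates=["date"],
)

print(full.dtypes)
print(lean.dtypes)
```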

### Efficient Operations

```python
# Query (often faster)
df.query('age > 25 and city == "NYC"')

# eval() for expressions
df['result'] = df.eval('a + b * c - d / e')

# Efficient groupby
df.groupby('cat')['val'].agg(['sum', 'mean', 'count'])

# Sort index for slicing
df = df.sort_index()
df.loc['A':'Z']  # Fast

# Set index for lookups
df = df.set_index('id')
df.loc[12345]  # Fast
```
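To see the index-based patterns end to end, here is a minimal sketch on synthetic data (a shuffled `id` column with a fixed seed). It checks that a boolean-mask lookup and an index lookup return the same value, then sorts the index for range slicing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ids = rng.permutation(100_000)
users = pd.DataFrame({"id": ids, "score": rng.random(100_000)})

# Column scan: compares every row against the target
target = int(ids[0])
by_mask = users.loc[users["id"] == target, "score"].iloc[0]

# Index lookup: hash-based for a unique index
indexed = users.set_index("id")
by_index = indexed.loc[target, "score"]

# Sorting makes range slices on the index efficient
indexed_sorted = indexed.sort_index()
window = indexed_sorted.loc[1000:1009]  # label slice, both ends inclusive

print(by_mask == by_index, len(window))
```

`index.is_unique` and `index.is_monotonic_increasing` are quick ways to confirm an index is in the state these lookups benefit from.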

### File Formats

```python
# Parquet (recommended for production)
df.to_parquet('file.parquet')
df = pd.read_parquet('file.parquet')

# Feather (fast temporary storage)
df.to_feather('file.feather')
df = pd.read_feather('file.feather')

# Pickle (Python-only)
df.to_pickle('file.pkl')
df = pd.read_pickle('file.pkl')

# CSV with compression
df.to_csv('file.csv.gz', compression='gzip')
df = pd.read_csv('file.csv.gz')
```

### Performance Hierarchy

```python
# From fastest to slowest:
1. Vectorized:           df['c'] = df['a'] + df['b']
2. NumPy:                df['c'] = np.sqrt(df['a'])
3. Built-in methods:     df['c'] = df['a'].str.upper()
4. apply() with lambda:  df['c'] = df['a'].apply(lambda x: x**2)
5. apply() custom:       df['c'] = df.apply(func, axis=1)
6. itertuples():         for row in df.itertuples(): ...
7. iterrows():           for idx, row in df.iterrows(): ...
8. Loop with loc:        for i in range(len(df)): df.loc[i, ...]
```

### Benchmarking

```python
# Jupyter magic
%timeit operation

# Python timing
import time
start = time.time()
operation
print(f"Time: {time.time() - start:.4f}s")

# Decorator
from functools import wraps
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__}: {time.time()-start:.4f}s")
        return result
    return wrapper
```

## Summary

### Key Optimization Principles 🎯

**1. Right Data Types = Less Memory**
- int64 → int8/int16/int32 (87.5% / 75% / 50% savings)
- float64 → float32 (50% savings)
- object → category (up to 96% savings)
- Use string dtype for text data
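The numeric percentages follow directly from bytes per value. A short sketch using NumPy's `itemsize` makes the arithmetic explicit; `savings` is an illustrative helper name.

```python
import numpy as np

def savings(from_dtype, to_dtype):
    """Percent of memory saved per value when switching dtypes."""
    old = np.dtype(from_dtype).itemsize
    new = np.dtype(to_dtype).itemsize
    return (1 - new / old) * 100

for target in ("int32", "int16", "int8"):
    print(f"int64 -> {target}: {savings('int64', target):.1f}% saved")
print(f"float64 -> float32: {savings('float64', 'float32'):.1f}% saved")
```

Category savings cannot be computed this way because they depend on how many distinct values the column holds.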

**2. Vectorize Everything**
- Built-in operations: 100-1000x faster
- NumPy functions: 50-500x faster
- Avoid loops: They are 10-1000x slower
- Use np.where() and np.select()

**3. Read Smart, Not Hard**
- Specify dtypes upfront
- Use usecols for column subset
- Parse dates during reading
- Use chunksize for large files

**4. Efficient Operations**
- query() often faster than boolean
- eval() for complex expressions
- Built-in groupby aggregations
- Sort index for fast slicing

**5. Right Format Matters**
- Parquet: Production use (fast, small)
- Feather: Temporary storage (fastest)
- CSV: Sharing only (slowest, largest)

---

### Performance Checklist ✅

**Before optimization:**
```
☐ Profile memory usage
☐ Identify bottlenecks
☐ Measure current performance
```

**Data types:**
```
☐ Downcast integers
☐ Use float32 instead of float64
☐ Convert to category where appropriate
☐ Use string dtype for text
```

**Operations:**
```
☐ Replace loops with vectorization
☐ Replace apply with built-in methods
☐ Use np.where/np.select for conditionals
☐ Use query() for complex filters
☐ Use eval() for expressions
```

**I/O:**
```
☐ Specify dtypes in read_csv
☐ Use usecols to read subset
☐ Use chunksize for large files
☐ Save in Parquet format
```

**After optimization:**
```
☐ Measure improvements
☐ Verify results unchanged
☐ Document optimizations
```
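"Verify results unchanged" can be done with pandas' own testing helpers. The sketch below compares a slow reference implementation against a vectorized one on made-up data, and shows a tolerance-based check for the case where a downcast makes exact equality too strict.

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 0.25, 0.125]})

# Reference implementation: slow but obviously correct
reference = pd.Series(
    [row.a + row.b for row in df_check.itertuples()],
    index=df_check.index,
)

# Optimized implementation under test
optimized_result = df_check["a"] + df_check["b"]

# Raises AssertionError with a diff if values, dtype, or index differ
pd.testing.assert_series_equal(reference, optimized_result, check_names=False)

# After a float64 -> float32 downcast, compare with a tolerance instead
as_float32 = optimized_result.astype("float32")
close_enough = np.allclose(as_float32, reference, rtol=1e-6)
print(f"Within tolerance after downcast: {close_enough}")
```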

---

### Typical Improvements 📊

```
Optimization          Typical Speedup    Memory Savings
----------------      ---------------    --------------
Data types            N/A                50-90%
Vectorization         10-1000x           N/A
query() vs boolean    1.5-3x             N/A
eval()                1.5-2x             Reduces temps
Efficient reading     2-5x               50-90%
Parquet vs CSV        5-10x              70-90%
Index optimization    10-100x            N/A
Chunking              Enables process    100% (no OOM)
```

---

### Decision Tree 🌳

**Is it slow?**
```
├─ Using loops?
│  └─ Vectorize!
├─ Using apply()?
│  └─ Use built-in method or np.where
├─ Complex filter?
│  └─ Try query()
├─ Complex expression?
│  └─ Try eval()
└─ Still slow?
   └─ Consider Numba or parallel
```

**Out of memory?**
```
├─ Optimize data types first
├─ Read only needed columns
├─ Use category for repeated strings
├─ Process in chunks
└─ Use Parquet/Feather format
```

---

### Remember

- 🎯 **Measure first** - Don't optimize blindly
- ⚡ **Vectorize always** - Avoid loops at all costs
- 💾 **Right dtypes** - Can save 50-90% memory
- 📊 **Use Parquet** - For production data storage
- 🔍 **Profile often** - Find real bottlenecks
- 📖 **Read smart** - Specify dtypes, select columns
- 🚀 **Start small** - Test on samples
- ✅ **Verify results** - After optimization

---

### Next Steps

After mastering performance optimization:
1. **Dask** - For distributed computing
2. **Vaex** - For out-of-core DataFrames
3. **Polars** - Alternative fast DataFrame library
4. **Arrow** - For zero-copy data sharing
5. **GPU acceleration** - cuDF for NVIDIA GPUs

---

**Make Pandas Fast! ⚡🐼**