## Step 1 — Load Data (Default Settings)

In [1]:
import pandas as pd

# Load with default settings
df = pd.read_csv('customer_data.csv')

# Check the shape
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

# Measure memory usage
memory_before = df.memory_usage(deep=True).sum() / 1024**2  # Convert to MB
print(f"\nMemory usage (before optimization): {memory_before:.2f} MB")

Shape: (10000, 10)

Data types:
customer_id             int64
age                     int64
region                 object
customer_type          object
total_purchases         int64
total_spent           float64
avg_purchase_value    float64
satisfaction_score    float64
account_status         object
referral_source        object
dtype: object

Memory usage (before optimization): 2.59 MB
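To see where the baseline memory goes, a per-column breakdown can help. The sketch below does not read customer_data.csv; it builds a small hypothetical frame (`demo`, with made-up values) and assumes the string columns are stored as object dtype, as in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer_data.csv: two numeric and two string columns.
rng = np.random.default_rng(0)
n_rows = 10_000
demo = pd.DataFrame({
    "customer_id": np.arange(1, n_rows + 1, dtype="int64"),
    "total_spent": rng.uniform(0, 10_000, n_rows),
    "region": pd.Series(
        rng.choice(["Northeast", "Midwest", "Southeast", "West"], n_rows), dtype=object
    ),
    "customer_type": pd.Series(
        rng.choice(["Bronze", "Silver", "Gold", "Platinum"], n_rows), dtype=object
    ),
})

# Per-column memory in MB; deep=True also counts the Python string objects.
per_column_mb = demo.memory_usage(deep=True, index=False) / 1024**2
print(per_column_mb.sort_values(ascending=False).round(3))

# Share of the total held by the non-numeric (string) columns.
string_cols = demo.select_dtypes(exclude="number").columns
string_share = per_column_mb[string_cols].sum() / per_column_mb.sum()
print(f"String columns hold {string_share:.0%} of the memory")
```

Running the same two `memory_usage` lines on the real `df` would show which columns are worth optimizing first.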



## Step 2 — Optimize Memory Usage

(a) Select only the required columns:

customer_id, age, region, customer_type, total_spent, satisfaction_score

(b) Apply optimal dtypes:

- customer_id → int32
- age → int8
- region → category
- customer_type → category
- total_spent → float32
- satisfaction_score → float32
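Before committing to a smaller integer type, it is worth confirming that the values fit. The following is a minimal sketch under stated assumptions: `fits_integer_dtype` is a hypothetical helper, not part of pandas, and the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

def fits_integer_dtype(values: pd.Series, dtype: str) -> bool:
    """Return True if every value in `values` lies within the range of `dtype`."""
    info = np.iinfo(dtype)
    return bool(values.min() >= info.min and values.max() <= info.max)

# Hypothetical sample values, not taken from customer_data.csv.
ages = pd.Series([18, 56, 69, 46, 32, 60, 95])
customer_ids = pd.Series([1, 2, 3, 9_999, 10_000])

print(fits_integer_dtype(ages, "int8"))           # int8 covers -128..127
print(fits_integer_dtype(customer_ids, "int8"))   # 10,000 is out of range for int8
print(fits_integer_dtype(customer_ids, "int32"))

# category pays off when there are few distinct values relative to the row count.
regions = pd.Series(["Northeast", "Midwest", "West", "Northeast", "West", "West"])
print(f"{regions.nunique()} distinct values in {len(regions)} rows")
```

The same check on the real columns would justify int8 for age and int32 for customer_id rather than assuming they fit.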

In [2]:
import pandas as pd

# Select only needed columns
use_cols = [
    "customer_id", "age", "region", 
    "customer_type", "total_spent", "satisfaction_score"
]

# Define optimal dtypes
opt_dtypes = {
    "customer_id": "int32",
    "age": "int8",
    "region": "category",
    "customer_type": "category",
    "total_spent": "float32",
    "satisfaction_score": "float32"
}

# Load with optimizations
df_optimized = pd.read_csv(
    "customer_data.csv",
    usecols=use_cols,
    dtype=opt_dtypes
)

# Measure optimized memory
memory_after = df_optimized.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory usage (after optimization): {memory_after:.4f} MB")

# Calculate improvement
reduction_percent = ((memory_before - memory_after) / memory_before) * 100
print(f"\nMemory reduction: {reduction_percent:.2f}%")


Memory usage (after optimization): 0.1440 MB

Memory reduction: 94.45%
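If the data has already been loaded with default settings, a similar result can be reached after the fact with `astype`. This is only a sketch on a hypothetical four-row frame (`loaded`); unlike passing `usecols` and `dtype` to `read_csv`, it still pays the full memory cost during the initial load.

```python
import pandas as pd

# Hypothetical frame standing in for the default-loaded `df`.
loaded = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [56, 69, 46, 32],
    "region": ["Northeast", "Northeast", "Midwest", "Southeast"],
    "total_spent": [246.13, 7928.11, 20.57, 3439.13],
    "account_status": ["Active", "Active", "Closed", "Active"],
})

# Keep the needed columns, then cast with the same kind of mapping used for read_csv.
converted = loaded[["customer_id", "age", "region", "total_spent"]].astype({
    "customer_id": "int32",
    "age": "int8",
    "region": "category",
    "total_spent": "float32",
})
print(converted.dtypes)

# pd.to_numeric can pick the smallest integer type that fits the data.
auto_age = pd.to_numeric(loaded["age"], downcast="integer")
print(auto_age.dtype)
```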


## Step 3 — Verify Optimized DataFrame

In [3]:
print(f"\nOptimized shape: {df_optimized.shape}")
print(f"\nOptimized dtypes:\n{df_optimized.dtypes}")
print(f"\nFirst few rows:\n{df_optimized.head()}")


Optimized shape: (10000, 6)

Optimized dtypes:
customer_id              int32
age                       int8
region                category
customer_type         category
total_spent            float32
satisfaction_score     float32
dtype: object

First few rows:
   customer_id  age     region customer_type  total_spent  satisfaction_score
0            1   56  Northeast          Gold   246.130005                 1.1
1            2   69  Northeast        Silver  7928.109863                 3.5
2            3   46    Midwest        Bronze    20.570000                 3.8
3            4   32  Southeast        Bronze  3439.129883                 2.6
4            5   60       West      Platinum  4945.830078                 1.7
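The trailing digits in total_spent (for example 246.130005) are a side effect of float32, which keeps roughly 6 to 7 significant decimal digits. The sketch below assumes the first value in the CSV was 246.13, which is inferred from the display above, and shows the rounding introduced by the cast.

```python
import numpy as np

value = 246.13                   # assumed original value of the first total_spent entry
as_float32 = np.float32(value)   # nearest representable 32-bit float

# Printed with six decimals, this matches the value in the head() output.
print(f"{float(as_float32):.6f}")

# Absolute rounding error introduced by the cast.
error = abs(float(as_float32) - value)
print(f"rounding error: {error:.2e}")

# Approximate number of decimal digits each type preserves, as reported by NumPy.
print(np.finfo(np.float32).precision, np.finfo(np.float64).precision)
```

For currency totals this error is usually negligible, but float64 may be the safer choice when many values are summed or exact cents matter.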


## Final Results

| Metric                               | Result      |
|--------------------------------------|-------------|
| Memory usage before optimization     | 2.59 MB     |
| Memory usage after optimization      | 0.1440 MB   |
| Memory reduction percentage          | 94.45%      |
| Number of columns (before)           | 10          |
| Number of columns (after)            | 6           |

## Reflection Questions

### 1. Which optimization technique had the biggest impact on memory usage? Why?

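The biggest impact came from dtype optimization, especially converting the object columns region and customer_type to category. At default settings the six numeric columns occupy only about 0.46 MB (6 columns × 8 bytes × 10,000 rows), so the four object columns account for most of the 2.59 MB. A category column stores each distinct string once and keeps a small integer code per row, which is far cheaper than one Python string object per row. Downcasting the numeric columns from int64/float64 to int32, int8, and float32 cuts each of those columns to half its size or less. Dropping the four unused columns, two of which were object columns, also helped.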
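One way to back up this answer is to apply each technique on its own to the same baseline and compare the savings. The sketch below uses hypothetical random data (`base`), not the real file, and assumes the string column is stored as object dtype; `size_mb` is a helper defined here for illustration.

```python
import numpy as np
import pandas as pd

def size_mb(frame: pd.DataFrame) -> float:
    """Deep memory footprint of a DataFrame in MB."""
    return frame.memory_usage(deep=True).sum() / 1024**2

# Hypothetical data shaped like a slice of the customer table.
rng = np.random.default_rng(42)
n_rows = 10_000
base = pd.DataFrame({
    "customer_id": np.arange(1, n_rows + 1, dtype="int64"),
    "age": rng.integers(18, 90, n_rows).astype("int64"),
    "total_spent": rng.uniform(0, 10_000, n_rows),
    "region": pd.Series(
        rng.choice(["Northeast", "Midwest", "Southeast", "West"], n_rows), dtype=object
    ),
})

# Apply one technique at a time to the same baseline.
only_category = base.astype({"region": "category"})
only_downcast = base.astype(
    {"customer_id": "int32", "age": "int8", "total_spent": "float32"}
)

baseline_mb = size_mb(base)
saved_by_category = baseline_mb - size_mb(only_category)
saved_by_downcast = baseline_mb - size_mb(only_downcast)
print(f"Baseline:          {baseline_mb:.3f} MB")
print(f"Saved by category: {saved_by_category:.3f} MB")
print(f"Saved by downcast: {saved_by_downcast:.3f} MB")
```

On this synthetic frame a single category conversion saves more than downcasting three numeric columns; the exact split on the real data may differ.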

### 2. If the dataset were 100× larger, how would these optimizations help your workflow?

For a dataset 100× larger, the same settings would matter far more. Extrapolating linearly from this run, the default load would need roughly 259 MB and the optimized load roughly 14 MB. The smaller footprint lowers the risk of running out of memory and leaves headroom for the temporary copies that filtering, grouping, and model training create. Parsing only six of the ten columns also shortens load time, and a smaller machine can handle the job, which reduces cloud compute costs on very large datasets.
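When even the optimized frame would not fit in memory, the same `usecols` and `dtype` settings can be combined with the `chunksize` parameter of `read_csv`. The following is a rough sketch: the CSV is a hypothetical in-memory stand-in for a much larger file, and the per-region total is just one example of an aggregation that can be folded chunk by chunk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a much larger customer_data.csv.
csv_text = "customer_id,age,region,total_spent,referral_source\n" + "".join(
    f"{i},{20 + i % 50},{['West', 'Midwest'][i % 2]},{i * 1.5},web\n"
    for i in range(1, 1001)
)

use_cols = ["customer_id", "region", "total_spent"]
opt_dtypes = {"customer_id": "int32", "region": "category", "total_spent": "float32"}

# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
totals = {}
rows_seen = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=use_cols,
                         dtype=opt_dtypes, chunksize=250):
    rows_seen += len(chunk)
    # Aggregate each chunk, then fold the result into the running totals.
    chunk_sums = chunk.groupby("region", observed=True)["total_spent"].sum()
    for region, amount in chunk_sums.items():
        totals[region] = totals.get(region, 0.0) + float(amount)

print(rows_seen, totals)
```

Only one chunk of 250 rows is held in memory at a time, so peak usage depends on the chunk size rather than the file size.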