# 🚀 Pandas Notebook 6: Supercharged Data
"Handling a MILLION customers without crashing!"

## 🎯 Today's Power-Ups  
1. **Dtypes**: Shrinking giant DataFrames  
2. **Chunking**: Processing in bite-sized pieces  
3. **Parallel Processing**: Speed hacks  
4. **Real Use**: Analyzing 1M+ sales records  

## 🧠 Shrinking Your Data Footprint  
Like packing a suitcase efficiently instead of throwing everything in!  

In [1]:
import pandas as pd
import numpy as np

# Big dataset example
data = {
    "CustomerID": np.arange(1, 1_000_001),  # Default: 64-bit int (huge!)
    "Liters": np.random.uniform(1, 5, 1_000_000),
    "Rating": np.random.randint(1, 6, 1_000_000, dtype=np.int8)  # Tiny dtype!
}

df = pd.DataFrame(data)
print("Original Memory:", df.memory_usage(deep=True).sum() / 1024**2, "MB")

# Optimize dtypes
df["CustomerID"] = pd.to_numeric(df["CustomerID"], downcast="unsigned")
print("Optimized Memory:", df.memory_usage(deep=True).sum() / 1024**2, "MB")

Original Memory: 16.212589263916016 MB
Optimized Memory: 12.397891998291016 MB
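The same trick works on every numeric column, not just `CustomerID`. Here is a minimal sketch, assuming a hypothetical helper called `downcast_numeric` (it is not a pandas function) and a smaller 100,000-row stand-in frame so it runs quickly:

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each numeric column downcast to the smallest dtype that fits.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Unsigned types fit larger positive values in the same number of bytes
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves the size but keeps only ~7 significant digits
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


rng = np.random.default_rng(0)
small = pd.DataFrame({
    "CustomerID": np.arange(1, 100_001, dtype=np.int64),
    "Liters": rng.uniform(1, 5, 100_000),
    "Rating": rng.integers(1, 6, 100_000, dtype=np.int8),
})
slim = downcast_numeric(small)

print(slim.dtypes)
print("Before:", small.memory_usage(deep=True).sum(), "bytes")
print("After: ", slim.memory_usage(deep=True).sum(), "bytes")
```

Downcasting floats trades precision for space, so it is worth checking that `float32` is accurate enough for your numbers before applying it everywhere.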


## 🍕 Eating Pizza One Slice at a Time  
How to read a 10GB file on a laptop with 8GB RAM:  

In [None]:
# Simulate reading a huge CSV in chunks
chunk_iter = pd.read_csv("million_sales.csv", chunksize=100_000)
total_sales = 0

for chunk in chunk_iter:
    total_sales += chunk["Liters"].sum()

print("Total liters sold:", total_sales)
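Since `million_sales.csv` may not exist on your machine, here is a self-contained sketch of the same loop. It writes a small stand-in file (the name `demo_sales.csv` and the 10,000-row size are assumptions for the demo), then reads it back in chunks. Passing `usecols` and `dtype` is optional, but it keeps each chunk as small as possible:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small stand-in file so the example runs anywhere
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "CustomerID": np.arange(1, 10_001),
    "Liters": rng.uniform(1, 5, 10_000),
})
path = os.path.join(tempfile.mkdtemp(), "demo_sales.csv")
demo.to_csv(path, index=False)

total_liters = 0.0
rows_seen = 0

# usecols skips columns we never touch; dtype avoids per-chunk type inference
reader = pd.read_csv(
    path,
    usecols=["Liters"],
    dtype={"Liters": "float64"},
    chunksize=2_500,
)
for chunk in reader:
    total_liters += chunk["Liters"].sum()
    rows_seen += len(chunk)

print("Rows processed:", rows_seen)
print("Total liters:", round(total_liters, 2))
```

Only one chunk lives in memory at a time, so peak usage depends on `chunksize` rather than on the size of the file.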

## ⚡ Turbo Boost for Apply()  
Like having 4 cashiers instead of 1 at your stand:  

In [4]:
!pip install swifter


Collecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... done
  Created wheel for swifter: filename=swifter-1.4.0-py3-none-any.whl size=16505 sha256=754c45c0bd1612b2dc0966b6678380291d13198945f64060b8a1cea64d2f585d
  Stored in directory: /root/.cache/pip/wheels/ef/7f/bd/9bed48f078f3ee1fa75e0b29b6e0335ce1cb03a38d3443b3a3
Successfully built swifter
Installing collected packages: swifter
Successfully installed swifter-1.4.0


In [5]:
# Install first: pip install swifter
import swifter

# Slow apply:
# df["Rating"].apply(lambda x: x + 1)

# Fast apply:
df["Rating"] = df["Rating"].swifter.apply(lambda x: x + 1)
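For simple arithmetic like `x + 1`, a plain vectorized expression is usually the quickest option of all, because the loop runs in compiled NumPy code instead of calling a Python function once per row. A rough sketch of the comparison, using only pandas and a 200,000-row stand-in Series (timings depend on your machine, so none are quoted here):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ratings = pd.Series(rng.integers(1, 6, 200_000, dtype=np.int8), name="Rating")

start = time.perf_counter()
via_apply = ratings.apply(lambda x: x + 1)  # one Python call per row
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
via_vector = ratings + 1  # single vectorized operation
vector_seconds = time.perf_counter() - start

# Both approaches produce the same values
same_values = bool((via_apply.to_numpy() == via_vector.to_numpy()).all())
print("Same values:", same_values)
print(f"apply: {apply_seconds:.4f}s  vectorized: {vector_seconds:.4f}s")
```

Parallel `apply` tools earn their keep when the function cannot be expressed as column arithmetic, for example custom string parsing or branching logic per row.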

## 📈 Analyzing 1M Rows Efficiently  

In [6]:
# Optimized aggregation
results = (
    df
    .astype({"Rating": "category"})  # Low-cardinality column → category
    .groupby("Rating", observed=True)  # Only report ratings that actually occur
    .agg({"Liters": ["sum", "mean"], "CustomerID": "count"})
)
print(results)

               Liters           CustomerID
                  sum      mean      count
Rating                                    
2       598964.746999  2.997387     199829
3       600299.862527  3.001514     199999
4       599399.751519  3.000374     199775
5       602345.697219  3.001987     200649
6       597546.419174  2.991501     199748
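The `category` dtype saves the most memory on repeated text values, since each distinct string is stored once and the rows hold small integer codes. A small sketch, using a made-up `Flavor` column that is not part of the dataset above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical text column with only three distinct values
flavors = pd.Series(
    rng.choice(["Classic", "Strawberry", "Mint"], size=100_000),
    name="Flavor",
).astype("object")

as_category = flavors.astype("category")

object_bytes = flavors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object dtype:  ", object_bytes, "bytes")
print("category dtype:", category_bytes, "bytes")
print("Distinct values stored once:", list(as_category.cat.categories))
```

For a column that is already a tiny integer type, such as an `int8` rating, converting to `category` saves little or nothing, so it pays to measure with `memory_usage(deep=True)` first.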


## ✏️ Optimization Drills  
1. Convert these columns to minimal dtypes:  
   - `pd.Series([1,2,3], dtype='float64')`  
   - `pd.Series(['a','b','a'], dtype='object')`  
2. Read a file in chunks to find average rating  
3. Bonus: Time normal vs. swifter apply on 1M rows  

*(Solutions next cell!)*  

In [None]:
# 1
# Start from the dtypes given in the drill, then shrink them
float_series = pd.Series([1, 2, 3], dtype="float64").astype("float32")
cat_series = pd.Series(["a", "b", "a"], dtype="object").astype("category")

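# 2
chunk_iter = pd.read_csv("sales.csv", chunksize=50_000)
rating_sum = 0
rating_count = 0
for chunk in chunk_iter:
    # Track sum and count: averaging chunk means is wrong when the last chunk is shorter
    rating_sum += chunk["Rating"].sum()
    rating_count += chunk["Rating"].count()  # Counts non-null ratings only
print("Overall avg:", rating_sum / rating_count)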

# 3
%timeit df["Rating"].apply(lambda x: x+1)
%timeit df["Rating"].swifter.apply(lambda x: x+1)
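A note on drill 2: averaging the per-chunk means only matches the true average when every chunk has the same number of rows, and the last chunk of a file is usually shorter. A tiny sketch with an in-memory CSV (five made-up ratings, read two rows at a time) shows the gap:

```python
import io

import pandas as pd

# Five ratings read in chunks of 2 rows -> chunk sizes 2, 2, 1
csv_text = "Rating\n1\n1\n1\n1\n5\n"

chunk_means = []
running_sum = 0
running_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_means.append(chunk["Rating"].mean())
    running_sum += chunk["Rating"].sum()
    running_count += chunk["Rating"].count()

mean_of_means = sum(chunk_means) / len(chunk_means)
true_mean = running_sum / running_count

print("Mean of chunk means:", mean_of_means)
print("Sum / count:        ", true_mean)
```

The single-row final chunk gets the same weight as the full chunks, so the mean of means comes out near 2.33 while the real average is 1.8.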