# From Python to Production — Day N  
## Notebook X: Pandas — Beginner to Advanced  
By **Prerna Joshi** | #25DaysOfDataTech | #PythonToProduction

> "Treat data as tables, think in transformations, and let Pandas do the heavy lifting."


### What you will learn today
- Core Pandas objects: Series, DataFrame, Index, dtypes
- Reading and writing data: CSV, Parquet, and chunked I/O
- Selecting data correctly: loc, iloc, boolean masks, query
- Cleaning and type handling: assign, pipe, to_datetime, to_numeric, string ops
- Missing data strategies
- Groupby aggregations you can trust
- Joins with validation and post-join checks
- Reshaping data: melt, pivot, pivot_table, MultiIndex
- Time series: resample, rolling windows, expanding stats
- Text, categoricals, and performance-aware dtypes
- Method chaining and pipeline patterns for production
- Working with large files using chunks
- Light-weight validation and assertions


### Prerequisites
You should be comfortable with Python basics, functions, and file paths. If you feel rusty, quickly skim your earlier notebooks before diving in.


In [1]:
# Setup
import pandas as pd
import numpy as np

pd.set_option("display.max_rows", 12)
pd.set_option("display.width", 120)

print(pd.__version__)


2.2.1


### Sample dataset for this notebook
We will synthesize small, realistic tables so you can run everything locally without external files.


In [2]:
# Create small sample datasets
rng = np.random.default_rng(42)

dates = pd.date_range("2025-01-01", periods=60, freq="D")
n = 500

df_orders = pd.DataFrame({
    "order_id": np.arange(1, n+1),
    "customer_id": rng.integers(1001, 1101, size=n),
    "date": rng.choice(dates, size=n),
    "channel": rng.choice(["online","store","partner"], size=n, p=[0.6, 0.3, 0.1]),
    "state": rng.choice(["OH","MI","PA","NY","IN"], size=n),
    "amount": np.round(rng.normal(120, 50, size=n).clip(5, None), 2),
    "status": rng.choice(["completed","completed","completed","cancelled"], size=n, p=[0.7,0.15,0.1,0.05]),
    "discount": rng.choice(["0","5","10","N/A",""], size=n, p=[0.5,0.2,0.15,0.1,0.05]),
    "email": [f"user{i}@example.com" for i in rng.integers(1, 5000, size=n)]
})

df_customers = pd.DataFrame({
    "customer_id": np.arange(1001, 1101),
    "signup_date": pd.date_range("2024-06-01", periods=100, freq="D"),
    "segment": pd.Categorical(
        np.where(rng.random(100) > 0.7, "Premium", "Standard"),
        categories=["Standard","Premium"], ordered=True
    )
})

df_orders.head(), df_customers.head()


(   order_id  customer_id       date  channel state  amount     status discount                 email
 0         1         1009 2025-01-07    store    IN    5.00  cancelled       10  user2543@example.com
 1         2         1078 2025-02-09    store    OH   51.10  completed        0  user3492@example.com
 2         3         1066 2025-02-27  partner    MI   29.47  completed        0  user2628@example.com
 3         4         1044 2025-02-22   online    PA    7.51  completed        5  user3371@example.com
 4         5         1044 2025-01-17   online    NY   60.23  completed           user2835@example.com,
    customer_id signup_date   segment
 0         1001  2024-06-01  Standard
 1         1002  2024-06-02  Standard
 2         1003  2024-06-03   Premium
 3         1004  2024-06-04  Standard
 4         1005  2024-06-05   Premium)

## 1 Pandas objects and dtypes
- Series vs DataFrame vs Index  
- Nullable dtypes and casting with astype and select_dtypes


In [3]:
# Explore objects and dtypes
df_orders.info()
df_orders.dtypes

# Cast discount safely to numeric with errors='coerce'
df_orders["discount_num"] = pd.to_numeric(df_orders["discount"], errors="coerce")
df_orders[["discount","discount_num"]].head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   order_id     500 non-null    int32         
 1   customer_id  500 non-null    int64         
 2   date         500 non-null    datetime64[ns]
 3   channel      500 non-null    object        
 4   state        500 non-null    object        
 5   amount       500 non-null    float64       
 6   status       500 non-null    object        
 7   discount     500 non-null    object        
 8   email        500 non-null    object        
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1), object(5)
memory usage: 33.3+ KB


Unnamed: 0,discount,discount_num
0,10.0,10.0
1,0.0,0.0
2,0.0,0.0
3,5.0,5.0
4,,
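
The bullets above also mention nullable dtypes, `astype`, and `select_dtypes`, which the cell does not exercise. A minimal sketch on a tiny hypothetical frame (`demo` and its values are made up for illustration and only mirror the orders table's column names):

```python
import pandas as pd

# Hypothetical toy frame; not the notebook's df_orders.
demo = pd.DataFrame({
    "order_id": [1, 2, 3],
    "discount": ["10", "N/A", ""],
    "channel": ["online", "store", "online"],
})

# Coerce to numeric (unparseable strings become NaN), then cast to the nullable
# Int64 dtype so missing values show as <NA> while valid values stay integers.
demo["discount_int"] = pd.to_numeric(demo["discount"], errors="coerce").astype("Int64")

# select_dtypes picks columns by dtype family; "number" covers ints and floats.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

print(demo.dtypes)
print(numeric_cols)
```

Compared with plain `float64`, the nullable `Int64` column keeps integer semantics while still representing missing values.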


## 2 Read and write data
- CSV and Parquet  
- dtype and parse_dates on read  
- chunksize for large files


In [4]:
# Write sample files
df_orders.to_csv("orders.csv", index=False)
df_customers.to_csv("customers.csv", index=False)

# Read with types
types = {"order_id":"int64","customer_id":"int64","channel":"category","state":"category","status":"category"}
orders = pd.read_csv("orders.csv", dtype=types, parse_dates=["date"])

orders.head()


Unnamed: 0,order_id,customer_id,date,channel,state,amount,status,discount,email,discount_num
0,1,1009,2025-01-07,store,IN,5.0,cancelled,10.0,user2543@example.com,10.0
1,2,1078,2025-02-09,store,OH,51.1,completed,0.0,user3492@example.com,0.0
2,3,1066,2025-02-27,partner,MI,29.47,completed,0.0,user2628@example.com,0.0
3,4,1044,2025-02-22,online,PA,7.51,completed,5.0,user3371@example.com,5.0
4,5,1044,2025-01-17,online,NY,60.23,completed,,user2835@example.com,


In [5]:
# Chunked reading example (simulate a large file)
totals = []
for chunk in pd.read_csv("orders.csv", chunksize=200, parse_dates=["date"]):
    total = chunk["amount"].sum()
    totals.append(total)

sum(totals)


58413.63

## 3 Selecting data correctly
- loc (label) and iloc (position)  
- Boolean masks  
- query for readability


In [6]:
recent = orders.loc[orders["date"] >= "2025-01-20", ["order_id","customer_id","amount","status","channel"]]
first10 = orders.iloc[:10]

big_online = orders.query("amount >= 150 and channel == 'online'")
recent.head(), first10.head(), big_online.head()


(   order_id  customer_id  amount     status  channel
 1         2         1078   51.10  completed    store
 2         3         1066   29.47  completed  partner
 3         4         1044    7.51  completed   online
 5         6         1086  186.25  completed    store
 6         7         1009  117.78  completed   online,
    order_id  customer_id       date  channel state  amount     status  discount                 email  discount_num
 0         1         1009 2025-01-07    store    IN    5.00  cancelled      10.0  user2543@example.com          10.0
 1         2         1078 2025-02-09    store    OH   51.10  completed       0.0  user3492@example.com           0.0
 2         3         1066 2025-02-27  partner    MI   29.47  completed       0.0  user2628@example.com           0.0
 3         4         1044 2025-02-22   online    PA    7.51  completed       5.0  user3371@example.com           5.0
 4         5         1044 2025-01-17   online    NY   60.23  completed       NaN  user2835

## 4 Cleaning and type handling
- rename, assign, pipe  
- to_datetime, to_numeric  
- vectorized string ops with .str and datetime via .dt


In [7]:
clean = (
    orders
    .rename(columns=str.lower)
    .assign(
        date=lambda d: pd.to_datetime(d["date"]),
        domain=lambda d: d["email"].str.split("@").str[-1]
    )
    .pipe(lambda d: d[d["status"].ne("cancelled")])
)
clean.head()


Unnamed: 0,order_id,customer_id,date,channel,state,amount,status,discount,email,discount_num,domain
1,2,1078,2025-02-09,store,OH,51.1,completed,0.0,user3492@example.com,0.0,example.com
2,3,1066,2025-02-27,partner,MI,29.47,completed,0.0,user2628@example.com,0.0,example.com
3,4,1044,2025-02-22,online,PA,7.51,completed,5.0,user3371@example.com,5.0,example.com
4,5,1044,2025-01-17,online,NY,60.23,completed,,user2835@example.com,,example.com
5,6,1086,2025-01-28,store,OH,186.25,completed,0.0,user2625@example.com,0.0,example.com


## 5 Missing data strategies
- isna, fillna, dropna  
- choosing sensible defaults vs real NAs


In [8]:
clean["discount_num"] = pd.to_numeric(clean["discount"], errors="coerce")
clean["discount_num"] = clean["discount_num"].fillna(0.0)

clean[["discount","discount_num"]].head(10)


Unnamed: 0,discount,discount_num
1,0.0,0.0
2,0.0,0.0
3,5.0,5.0
4,,0.0
5,0.0,0.0
6,0.0,0.0
7,5.0,5.0
8,0.0,0.0
10,5.0,5.0
11,10.0,10.0
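
Filling with a default is a modelling decision, not just cleanup: it changes downstream statistics. A small sketch with made-up values, assuming that a missing discount might mean either "unknown" or "no discount":

```python
import numpy as np
import pandas as pd

# Hypothetical discounts: two real values and two unknown ones.
s = pd.Series([10.0, np.nan, 5.0, np.nan], name="discount_num")

n_missing = int(s.isna().sum())       # how many values are unknown
mean_skip_na = s.mean()               # NaN is skipped: (10 + 5) / 2
mean_filled = s.fillna(0.0).mean()    # NaN treated as "no discount": (10 + 0 + 5 + 0) / 4
only_known = s.dropna()               # keep only rows with a real value

print(n_missing, mean_skip_na, mean_filled, len(only_known))
```

If "missing" really means "no discount", `fillna(0.0)` is the right call; if it means "not recorded", keeping the NA avoids understating the average.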


## 6 Groupby and aggregations
- named aggregations  
- nunique vs count  
- custom functions when truly needed


In [9]:
summary = (
    clean
    .groupby("customer_id", as_index=False)
    .agg(
        total_amount=("amount","sum"),
        n_orders=("order_id","nunique"),
        first_order=("date","min"),
        last_order=("date","max")
    )
    .sort_values("total_amount", ascending=False)
)
summary.head()


Unnamed: 0,customer_id,total_amount,n_orders,first_order,last_order
43,1044,1571.98,13,2025-01-15,2025-02-26
49,1050,1239.46,9,2025-01-08,2025-03-01
54,1055,1150.07,7,2025-01-15,2025-02-24
46,1047,1135.32,10,2025-01-17,2025-02-21
78,1079,1110.35,9,2025-01-07,2025-02-13
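
The `nunique` vs `count` bullet is easiest to see on data with repeats and gaps. A sketch on a hypothetical toy table (the duplicate and missing `order_id` values are invented to show the difference):

```python
import numpy as np
import pandas as pd

# Hypothetical orders with a repeated order_id and a missing one.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_id": [100, 100, np.nan, 200],
})

per_customer = toy.groupby("customer_id").agg(
    n_rows=("order_id", "size"),        # all rows, NaN included
    n_non_null=("order_id", "count"),   # non-missing values only
    n_unique=("order_id", "nunique"),   # distinct non-missing values
)
print(per_customer)
```

For customer 1 the three aggregations give 3, 2, and 1, so the choice matters whenever keys can repeat or be missing.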


## 7 Joins you can trust
- merge variants and keys  
- validate to protect against duplicates  
- post-join assertions


In [10]:
joined = summary.merge(df_customers, on="customer_id", how="left", validate="one_to_one")
assert len(joined) == len(summary), "Row-count changed unexpectedly after one-to-one join"

joined.head()


Unnamed: 0,customer_id,total_amount,n_orders,first_order,last_order,signup_date,segment
0,1044,1571.98,13,2025-01-15,2025-02-26,2024-07-14,Premium
1,1050,1239.46,9,2025-01-08,2025-03-01,2024-07-20,Standard
2,1055,1150.07,7,2025-01-15,2025-02-24,2024-07-25,Standard
3,1047,1135.32,10,2025-01-17,2025-02-21,2024-07-17,Premium
4,1079,1110.35,9,2025-01-07,2025-02-13,2024-08-18,Standard
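
To see what `validate` protects against, here is a sketch with a deliberately broken lookup table. The frames `left` and `right_dup` are hypothetical; the point is that the same merge either fails loudly or silently fans out.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": [1001, 1002], "total_amount": [50.0, 75.0]})
# Hypothetical bad lookup table: customer 1002 appears twice.
right_dup = pd.DataFrame({
    "customer_id": [1001, 1002, 1002],
    "segment": ["Standard", "Premium", "Standard"],
})

try:
    left.merge(right_dup, on="customer_id", how="left", validate="one_to_one")
    caught = False
except pd.errors.MergeError as exc:
    caught = True
    print("validate caught the duplicate key:", exc)

# Without validate, the same merge quietly returns an extra row.
fanned = left.merge(right_dup, on="customer_id", how="left")
print(len(left), "->", len(fanned))
```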


## 8 Reshaping
- melt and pivot  
- pivot_table with aggfunc  
- MultiIndex indexing basics


In [11]:
# Make a simple wide table
month = clean.assign(month=lambda d: d["date"].dt.to_period("M").dt.to_timestamp())
monthly = month.groupby(["customer_id","month"], as_index=False).agg(spend=("amount","sum"))

wide = monthly.pivot(index="customer_id", columns="month", values="spend").fillna(0.0)
wide.head()


month,2025-01-01,2025-02-01,2025-03-01
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1001,361.96,83.7,0.0
1002,178.78,416.41,0.0
1003,264.28,0.0,79.53
1004,249.22,360.89,0.0
1005,427.08,88.26,0.0


In [12]:
# Back to long form with melt
long_again = wide.reset_index().melt(id_vars=["customer_id"], var_name="month", value_name="spend")
long_again.head()


Unnamed: 0,customer_id,month,spend
0,1001,2025-01-01 00:00:00,361.96
1,1002,2025-01-01 00:00:00,178.78
2,1003,2025-01-01 00:00:00,264.28
3,1004,2025-01-01 00:00:00,249.22
4,1005,2025-01-01 00:00:00,427.08
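
The cells above cover `pivot` and `melt`; the remaining two bullets can be sketched on a small hypothetical table. `pivot` raises on duplicate index/column pairs, whereas `pivot_table` aggregates them, and a two-key groupby gives a MultiIndex to practise on.

```python
import pandas as pd

# Hypothetical long table with a repeated (state, channel) pair.
toy = pd.DataFrame({
    "state":   ["OH", "OH", "OH", "MI", "MI"],
    "channel": ["online", "online", "store", "online", "store"],
    "amount":  [10.0, 20.0, 5.0, 7.0, 3.0],
})

# pivot_table aggregates duplicates; fill_value replaces empty cells.
table = toy.pivot_table(index="state", columns="channel", values="amount",
                        aggfunc="sum", fill_value=0)

# Grouping on two keys returns a Series indexed by a (state, channel) MultiIndex.
by_pair = toy.groupby(["state", "channel"])["amount"].sum()
oh_online = by_pair.loc[("OH", "online")]     # one cell, addressed by a tuple
oh_all = by_pair.xs("OH", level="state")      # every channel for one state

print(table)
print(oh_online, oh_all.to_dict())
```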


## 9 Time series and windows
- resample to daily totals  
- rolling windows for moving averages  
- exponentially weighted (ewm) and expanding statistics


In [13]:
daily = (
    clean.set_index("date")
    .resample("D")["amount"].sum()
    .to_frame("amount")
    .assign(ma7=lambda d: d["amount"].rolling(7, min_periods=1).mean(),
            exp_ma=lambda d: d["amount"].ewm(span=7, adjust=False).mean())
)
daily.head(12)


Unnamed: 0_level_0,amount,ma7,exp_ma
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2025-01-01,1078.84,1078.84,1078.84
2025-01-02,561.98,820.41,949.625
2025-01-03,959.24,866.686667,952.02875
2025-01-04,959.72,889.945,953.951562
2025-01-05,258.11,763.578,779.991172
2025-01-06,1800.26,936.358333,1035.058379
2025-01-07,544.38,880.361429,912.388784
2025-01-08,1470.84,936.361429,1052.001588
2025-01-09,1782.87,1110.774286,1234.718691
2025-01-10,986.87,1114.721429,1172.756518
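
Note that `exp_ma` in the cell above is an exponentially weighted mean (`ewm`), which is not the same as an expanding statistic. A sketch of the difference on a short made-up series:

```python
import pandas as pd

# Hypothetical daily amounts, chosen so the arithmetic is easy to check by hand.
s = pd.Series([10.0, 20.0, 30.0, 40.0],
              index=pd.date_range("2025-01-01", periods=4, freq="D"))

stats = pd.DataFrame({
    "amount": s,
    "rolling2": s.rolling(2, min_periods=1).mean(),   # mean of the last 2 observations
    "expanding": s.expanding().mean(),                # mean of everything seen so far
    "cumulative": s.expanding().sum(),                # running total
})
print(stats)
```

A rolling window forgets old observations, an expanding window never does, and `ewm` sits in between by down-weighting older values.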


## 10 Text and categoricals
- fast vectorized .str methods  
- category dtype for speed and ordering


In [14]:
clean["state"] = pd.Categorical(clean["state"], categories=["IN","MI","NY","OH","PA"], ordered=True)
clean["tld"] = clean["domain"].str.split(".").str[-1]

clean[["state","domain","tld"]].head()


Unnamed: 0,state,domain,tld
1,OH,example.com,com
2,MI,example.com,com
3,PA,example.com,com
4,NY,example.com,com
5,OH,example.com,com
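
Ordering pays off when the categories have a meaningful rank. A sketch with hypothetical size labels (not part of the notebook's data), where alphabetical order would be wrong:

```python
import pandas as pd

# Hypothetical labels with a business-defined order S < M < L.
sizes = pd.Series(pd.Categorical(
    ["M", "S", "L", "M"], categories=["S", "M", "L"], ordered=True
))

sorted_sizes = sizes.sort_values().tolist()   # follows category order, not the alphabet
at_least_m = (sizes >= "M").tolist()          # comparisons work on ordered categoricals
codes = sizes.cat.codes.tolist()              # small integer codes stored per row

print(sorted_sizes, at_least_m, codes)
```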


## 11 Performance and memory
- choosing dtypes early  
- downcasting numeric columns  
- selecting columns with usecols and select_dtypes


In [15]:
# Estimate memory usage quickly
before = clean.memory_usage(deep=True).sum()

opt = clean.copy()
opt["amount"] = pd.to_numeric(opt["amount"], downcast="float")
opt["customer_id"] = pd.to_numeric(opt["customer_id"], downcast="integer")

after = opt.memory_usage(deep=True).sum()
{"before_bytes": int(before), "after_bytes": int(after), "saved_bytes": int(before - after)}


{'before_bytes': 114344, 'after_bytes': 109614, 'saved_bytes': 4730}
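
Downcasting after the fact helps, but choosing columns and dtypes at read time avoids allocating the larger frame at all. A sketch of `usecols` plus `dtype`, using a hypothetical in-memory CSV so it runs without files:

```python
import io
import pandas as pd

# Hypothetical CSV text; in practice this would be a file path.
csv_text = (
    "order_id,customer_id,amount,email\n"
    "1,1001,12.5,a@example.com\n"
    "2,1002,40.0,b@example.com\n"
)

# usecols skips unneeded columns while parsing; dtype avoids a later cast.
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "amount"],
    dtype={"order_id": "int32", "amount": "float32"},
)

# select_dtypes then targets columns by type, e.g. everything numeric.
numeric_cols = slim.select_dtypes(include="number").columns.tolist()
print(slim.dtypes.to_dict(), numeric_cols)
```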

## 12 Method chaining and .pipe
- write one readable pipeline  
- keep transformations local and testable


In [16]:
def keep_completed(d):
    return d[d["status"].eq("completed")]

features = (
    orders
    .pipe(keep_completed)
    .assign(week=lambda d: pd.to_datetime(d["date"]).dt.to_period("W").dt.start_time)
    .groupby(["customer_id","week"], as_index=False)
    .agg(spend=("amount","sum"), n_orders=("order_id","nunique"))
    .sort_values(["customer_id","week"])
)
features.head()


Unnamed: 0,customer_id,week,spend,n_orders
0,1001,2025-01-06,277.7,2
1,1001,2025-01-27,84.26,1
2,1001,2025-02-24,83.7,1
3,1002,2025-01-06,178.78,1
4,1002,2025-02-03,180.12,2


## 13 Large files with chunks
- process-in-chunks and incremental concat  
- avoid holding everything in memory when not needed


In [17]:
chunk_summaries = []
for chunk in pd.read_csv("orders.csv", chunksize=200, parse_dates=["date"]):
    s = (
        chunk
        .query("status == 'completed'")
        .assign(week=lambda d: d["date"].dt.to_period("W").dt.start_time)
        .groupby("week", as_index=False)["amount"].sum()
        .rename(columns={"amount":"weekly_amount"})
    )
    chunk_summaries.append(s)

weekly_totals = pd.concat(chunk_summaries).groupby("week", as_index=False)["weekly_amount"].sum()
weekly_totals.head()


Unnamed: 0,week,weekly_amount
0,2024-12-30,3817.89
1,2025-01-06,8786.07
2,2025-01-13,5335.59
3,2025-01-20,4449.73
4,2025-01-27,7688.07


## 14 Data quality checks
- validate join shapes  
- assert monotonic indexes where needed  
- basic sanity checks before saving


In [18]:
# Example sanity checks
assert features["spend"].ge(0).all(), "Spend should be non-negative"
assert features.groupby("customer_id")["week"].is_monotonic_increasing.all(), "Weeks should be sorted within each customer"

# Save final artifacts
features.to_parquet("features.parquet", index=False)
weekly_totals.to_parquet("weekly_totals.parquet", index=False)

"Saved features.parquet and weekly_totals.parquet"


'Saved features.parquet and weekly_totals.parquet'

## Practice exercises
Work through these directly in the notebook. Keep solutions in separate cells.

1. Selection  
   Select only in-store orders from February 2025 with amount between 50 and 200 inclusive. Keep order_id, customer_id, amount, and date.

2. Cleaning  
   Create a clean_discount column that is numeric, has no missing values, and is clipped to the range [0, 30].

3. Groupby  
   For each state, compute total revenue, average order value, and number of unique customers.

4. Join + Validation  
   Join the customer segment from df_customers onto the per-customer summary. Validate the join shape and assert that the row count is unchanged.

5. Reshape  
   Build a customer-month matrix of spend similar to 'wide' but only for completed orders. Replace missing values with 0 and sort customers by their total spend descending.

6. Time Series  
   Using weekly_totals, compute a 4-week moving average and a simple anomaly flag when weekly_amount is more than 2 std devs above the rolling mean.

7. Chunks  
   Recreate the weekly_totals pipeline using a chunksize of 100 and confirm the results match exactly.


### Optional reference solutions
Below are example approaches. Yours may differ yet still be correct if logic and results match.



In [19]:
# 1 Selection — reference
sel = (
    orders
    .query("channel == 'store' and amount >= 50 and amount <= 200")
    .loc[lambda d: d["date"].between('2025-02-01','2025-02-28')]
    [["order_id","customer_id","amount","date"]]
)
sel.head()


Unnamed: 0,order_id,customer_id,amount,date
1,2,1078,51.1,2025-02-09
11,12,1098,146.2,2025-02-14
19,20,1046,124.33,2025-02-05
23,24,1093,148.53,2025-02-19
30,31,1046,172.24,2025-02-05


In [20]:
# 2 Cleaning — reference
clean2 = orders.assign(
    clean_discount=lambda d: pd.to_numeric(d["discount"], errors="coerce").fillna(0.0).clip(0, 30)
)
clean2[["discount","clean_discount"]].head()


Unnamed: 0,discount,clean_discount
0,10.0,10.0
1,0.0,0.0
2,0.0,0.0
3,5.0,5.0
4,,0.0


In [21]:
# 3 Groupby — reference
state_summary = (
    clean
    .groupby("state", as_index=False, observed=True)  # state is categorical; be explicit
    .agg(total_revenue=("amount","sum"),
         avg_order_value=("amount","mean"),
         unique_customers=("customer_id","nunique"))
    .sort_values("total_revenue", ascending=False)
)
state_summary.head()




Unnamed: 0,state,total_revenue,avg_order_value,unique_customers
1,MI,11636.19,110.820857,63
2,NY,11549.58,120.308125,62
3,OH,11445.63,120.480316,64
4,PA,11277.84,122.585217,63
0,IN,9818.89,115.516353,55


In [22]:
# 4 Join + Validation — reference
cust_seg = summary.merge(
    df_customers[["customer_id","segment"]],
    on="customer_id",
    how="left",
    validate="one_to_one"   # each customer has exactly one segment
)

assert len(cust_seg) == len(summary), "Row count changed unexpectedly."

cust_seg.head()


Unnamed: 0,customer_id,total_amount,n_orders,first_order,last_order,segment
0,1044,1571.98,13,2025-01-15,2025-02-26,Premium
1,1050,1239.46,9,2025-01-08,2025-03-01,Standard
2,1055,1150.07,7,2025-01-15,2025-02-24,Standard
3,1047,1135.32,10,2025-01-17,2025-02-21,Premium
4,1079,1110.35,9,2025-01-07,2025-02-13,Standard


In [23]:
# 5 Reshape — reference
cust_month = (
    orders.query("status == 'completed'")
    .assign(month=lambda d: pd.to_datetime(d["date"]).dt.to_period("M").dt.to_timestamp())
    .groupby(["customer_id","month"], as_index=False)["amount"].sum()
    .pivot(index="customer_id", columns="month", values="amount")
    .fillna(0.0)
)

cust_month.assign(_total=cust_month.sum(axis=1)).sort_values("_total", ascending=False).drop(columns="_total").head()


month,2025-01-01 00:00:00,2025-02-01 00:00:00,2025-03-01 00:00:00
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1044,974.85,597.13,0.0
1050,805.62,281.33,152.51
1055,458.54,691.53,0.0
1047,428.37,706.95,0.0
1079,949.13,161.22,0.0


In [24]:
# 6 Time Series — reference
w = weekly_totals.copy().sort_values("week")
w["ma4"] = w["weekly_amount"].rolling(4, min_periods=2).mean()
w["std4"] = w["weekly_amount"].rolling(4, min_periods=2).std()
w["anomaly"] = (w["weekly_amount"] > (w["ma4"] + 2*w["std4"]))
w.head(10)


Unnamed: 0,week,weekly_amount,ma4,std4,anomaly
0,2024-12-30,3817.89,,,False
1,2025-01-06,8786.07,6301.98,3513.033768,False
2,2025-01-13,5335.59,5979.85,2545.978464,False
3,2025-01-20,4449.73,5597.32,2215.09692,False
4,2025-01-27,7688.07,6564.865,2014.965196,False
5,2025-02-03,6961.75,6108.785,1480.088331,False
6,2025-02-10,6071.77,6292.83,1395.234002,False
7,2025-02-17,6724.88,6861.6175,667.234709,False
8,2025-02-24,5892.38,6412.695,511.819767,False


In [25]:
# 7 Chunks — reference
totals_v2 = []
for chunk in pd.read_csv("orders.csv", chunksize=100, parse_dates=["date"]):
    s = (
        chunk
        .query("status == 'completed'")
        .assign(week=lambda d: d["date"].dt.to_period("W").dt.start_time)
        .groupby("week", as_index=False)["amount"].sum()
        .rename(columns={"amount":"weekly_amount"})
    )
    totals_v2.append(s)

weekly_totals_v2 = pd.concat(totals_v2).groupby("week", as_index=False)["weekly_amount"].sum()

# Check equality
pd.testing.assert_frame_equal(
    weekly_totals.sort_values("week").reset_index(drop=True),
    weekly_totals_v2.sort_values("week").reset_index(drop=True)
)
"Match confirmed"


'Match confirmed'

## Interview Questions to practice
- Explain loc vs iloc and when query improves code readability.
- Show how you validated a join and what checks you did post-join.
- How did you reduce memory by 40%+ on a large CSV job?
- Why would you use categoricals and how can they change groupby results.
- Debug a groupby that changed after filtering — root causes and fixes.



---

# Interview Q&A


## Q1 — Explain `loc` vs `iloc`, and when `query` improves readability

**`.loc` (label-based)**
- Selects rows/columns using labels (names).
- Slice end **inclusive**.
- Safer for semantic column referencing.

**`.iloc` (position-based)**
- Selects rows/columns using integer positions.
- Slice end **exclusive**.
- Good for fixed index/offset logic.

**Example**
```python
df.loc[df["amount"] >= 100, ["order_id","amount"]]   # label-based
df.iloc[:10, :3]                                     # position-based
```
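
The inclusive/exclusive difference in slice ends is a common source of off-by-one errors. A sketch on a hypothetical four-row frame with string labels:

```python
import pandas as pd

# Hypothetical frame; labels a..d make the two slicing rules easy to tell apart.
toy = pd.DataFrame({"amount": [5, 10, 15, 20]}, index=["a", "b", "c", "d"])

by_label = toy.loc["a":"c"]      # labels a, b, c: the end label is included
by_position = toy.iloc[0:2]      # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```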

**When `.query` helps readability**
Use `.query` when the filter expression is long and uses only column names.

```python
df.query("amount >= 100 and channel in ['online','store'] and status != 'cancelled'")
```
Benefits:
- Cleaner syntax
- No repeated `df[...]`
- Easier to read complex boolean conditions

Use `@var` to reference Python variables inside `.query`.
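
A short sketch of the `@var` form, using a hypothetical frame and made-up threshold values:

```python
import pandas as pd

toy = pd.DataFrame({
    "amount": [50, 150, 200],
    "channel": ["online", "store", "partner"],
})

min_amount = 100                 # ordinary Python variables
allowed = ["online", "store"]

# @name pulls the variable from the surrounding Python scope into the expression.
picked = toy.query("amount >= @min_amount and channel in @allowed")
print(picked)
```

This keeps thresholds out of the query string, so they can come from configuration or function arguments.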


## Q2 — How do you validate a join and what checks do you do post-join?

**Before / during join**
1. Ensure key uniqueness:
```python
assert not left["key"].duplicated().any()
assert not right["key"].duplicated().any()
```

2. Enforce relationship rules using `validate=`  
- `"one_to_one"`, `"one_to_many"`, `"many_to_one"`
```python
out = left.merge(right, on="customer_id", how="left", validate="one_to_one")
```

---

**Post-join checks**

1. **Row-count consistency**
```python
assert len(out) == len(left)
```

2. **Missing matches check**
```python
miss_rate = out["right_col"].isna().mean()
```

3. **Anti-join inspection**
```python
missing = left.loc[~left["customer_id"].isin(right["customer_id"])]
```

4. **Sanity checks**
Compare before/after:
- Sum of numeric columns
- Unique counts
- Date ranges
- No overwritten columns (check suffixes)
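
The checks above can be combined in one pass. The sketch below uses hypothetical frames and adds `indicator=True`, which makes `merge` write a `_merge` column so unmatched rows can be inspected without a separate anti-join:

```python
import pandas as pd

# Hypothetical inputs: customer 3 has no row in the lookup table.
left = pd.DataFrame({"customer_id": [1, 2, 3], "spend": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"customer_id": [1, 2], "segment": ["Standard", "Premium"]})

out = left.merge(right, on="customer_id", how="left",
                 validate="many_to_one", indicator=True)

assert len(out) == len(left), "left join on a unique right key must not add rows"
assert out["spend"].sum() == left["spend"].sum(), "numeric totals should survive the join"

# _merge is 'both' for matched rows and 'left_only' for unmatched ones.
unmatched = out.loc[out["_merge"] == "left_only", "customer_id"].tolist()
miss_rate = out["segment"].isna().mean()
print(unmatched, round(miss_rate, 3))
```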


## Q3 — How did you reduce memory by 40%+ on a large CSV job?

### Techniques that together can reach a 40%+ reduction (the actual saving depends on the data):

**1) Read with explicit dtypes**
```python
dtypes = {"customer_id":"int32", "channel":"category", "state":"category"}
df = pd.read_csv("data.csv", dtype=dtypes, parse_dates=["date"])
```

**2) Downcast numeric columns**
```python
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")
```

**3) Use categoricals for low-cardinality strings**
Each distinct string is stored once and rows hold small integer codes, which saves memory and usually speeds up groupby operations.

**4) Chunked processing instead of loading entire CSV**
```python
totals = []
for chunk in pd.read_csv("data.csv", chunksize=200_000):
    totals.append(chunk["amount"].sum())
```

**5) Save transformed output as Parquet**
Columnar + compression = smaller + faster.
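
Rather than quoting a percentage from memory, measure it. A sketch for technique 3 on a hypothetical low-cardinality column; the exact numbers depend on the pandas version and string backend, so none are claimed here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical column: 100,000 rows drawn from three labels.
channel = pd.Series(rng.choice(["online", "store", "partner"], size=100_000))

as_text = channel.memory_usage(deep=True)
as_category = channel.astype("category").memory_usage(deep=True)

print(f"text: {as_text:,} bytes, category: {as_category:,} bytes")
print(f"reduction: {1 - as_category / as_text:.0%}")
```

Running the same before/after comparison on the real job gives a figure you can defend in an interview.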


## Q4 — Why would you use categoricals and how can they change groupby results?

### Why categoricals are useful:
- Significant memory reduction for repeated strings.
- Faster `groupby`, `merge`, and `value_counts`.
- Custom ordering (e.g., Mon → Sun).

### How they change groupby results:

**1) Unobserved categories**
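- With `observed=False` (the default in pandas 2.x), a groupby on a categorical key returns every declared category, including ones absent from the filtered data.
- To report only the categories that actually occur, pass `observed=True`: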
```python
df.groupby("state", observed=True)
```

**2) Ordering**
- Ordered categoricals sort/group in the defined order.

**3) Category mismatch**
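Concatenating or merging DataFrames whose categorical columns have different category sets loses the `category` dtype, so align them to a shared set first.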
```python
df["state"] = df["state"].cat.set_categories(master_states)
```

**4) Filtering + categoricals**
After filtering, remove unused category levels:
```python
df["state"] = df["state"].cat.remove_unused_categories()
```
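
A compact sketch of points 1 and 4 on a hypothetical three-row frame, showing how the same filtered data yields different group counts:

```python
import pandas as pd

# Hypothetical data: category IN is declared but never used.
df = pd.DataFrame({
    "state": pd.Categorical(["OH", "MI", "OH"], categories=["IN", "MI", "OH"]),
    "amount": [10.0, 20.0, 30.0],
})

only_oh = df[df["state"] == "OH"]

# observed=False keeps every declared category, so IN and MI appear with a sum of 0.
all_levels = only_oh.groupby("state", observed=False)["amount"].sum()
# observed=True reports only categories that occur in the data.
seen_levels = only_oh.groupby("state", observed=True)["amount"].sum()

# Dropping unused levels gives one group regardless of the observed setting.
trimmed = only_oh.assign(state=only_oh["state"].cat.remove_unused_categories())
trimmed_levels = trimmed.groupby("state", observed=False)["amount"].sum()

print(all_levels.to_dict(), seen_levels.to_dict(), trimmed_levels.to_dict())
```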


## Q5 — Debug a groupby that changed after filtering (root causes & fixes)

Common reasons and fixes:

### 1) **Row-count drift**
Filtering removes rows, so sums and counts shrink. Confirm how many rows were dropped:
```python
len(df_before), len(df_after)
```

### 2) **Missing values in groupby key**
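`groupby` drops rows whose key is `NaN` by default (`dropna=True`).
Fix: pass `dropna=False`, or fill the key with a sentinel (for a categorical key, add the sentinel with `.cat.add_categories` first):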
```python
df["state"] = df["state"].fillna("UNKNOWN")
```
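
A sketch of how much can go missing, on a hypothetical frame where half the keys are `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have no state.
df = pd.DataFrame({
    "state": ["OH", np.nan, "OH", np.nan],
    "amount": [10.0, 5.0, 20.0, 1.0],
})

# Default behaviour: rows with a NaN key are silently excluded.
default_total = df.groupby("state")["amount"].sum().sum()
# dropna=False keeps NaN as its own group.
kept_total = df.groupby("state", dropna=False)["amount"].sum().sum()
# Filling the key with a sentinel has the same effect on the total.
filled_total = (
    df.assign(state=df["state"].fillna("UNKNOWN"))
      .groupby("state")["amount"].sum().sum()
)

print(default_total, kept_total, filled_total, df["amount"].sum())
```

Comparing the grouped total against `df["amount"].sum()` is a cheap way to detect dropped keys.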

### 3) **Categorical behavior (observed vs unobserved levels)**
```python
df.groupby("state", observed=True).agg(...)
```
After filtering, also drop category levels that no longer occur:
```python
df["state"] = df["state"].cat.remove_unused_categories()
```

### 4) **Dtype mismatch**
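`object`, `string`, and `category` keys can group differently: `category` may emit unobserved levels, and mixed-type `object` keys such as `"1"` and `1` land in separate groups.
Enforce one dtype for the key before grouping.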

### 5) **Duplicate keys or unexpected granularity**
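Filtering cannot add rows, but it can expose a grain you did not expect, for example duplicates introduced by an earlier join. Count repeated keys: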
```python
df.duplicated(subset=["key"]).sum()
```

### 6) **Timestamp granularity differences**
Floor/round timestamps consistently.
```python
df["day"] = df["ts"].dt.floor("D")
```

### 7) **Order-of-operations**
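Filtering rows before aggregating and filtering groups after aggregating give different answers. Make sure both runs apply the filter at the same step.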

### Debug helper snippet:
```python
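import pandas as pd

pre = df.groupby("state", observed=True)["amount"].sum()

# Filter, then drop category levels that no longer occur.
df2 = df.query("status == 'completed'").copy()
# is_categorical_dtype is deprecated since pandas 2.1; check the dtype directly.
if "state" in df2 and isinstance(df2["state"].dtype, pd.CategoricalDtype):
    df2["state"] = df2["state"].cat.remove_unused_categories()

post = df2.groupby("state", observed=True)["amount"].sum()

# Compare per-state sums and the grand totals (display is available in Jupyter).
display(pre, post, (pre.sum(), post.sum()))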
```
