# Stage 09 — Homework Starter Notebook

In the lecture, we learned how to create engineered features. Now it’s your turn to apply those ideas to your own project data.

In [1]:
# Stage 09 — Feature Engineering Notebook

import pandas as pd
import numpy as np

# === Load dataset from Stage08 processed output ===
PROCESSED = "../data/processed/stage08_clean.csv"
df = pd.read_csv(PROCESSED)

print("Data loaded:", df.shape)
df.head()


Data loaded: (192, 6)


Unnamed: 0,date,region,age,income,transactions,spend
0,2025-02-01,West,41.2,51712.11,0,109.42
1,2025-02-03,South,42.7,29900.82,2,39.73
2,2025-02-04,South,43.0,34212.69,5,125.46
3,2025-02-05,South,45.7,67315.9,5,257.72
4,2025-02-06,East,30.4,32664.85,1,83.56
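Before building features, it can help to parse `date` as a datetime and sort by it, because the rolling features later in this notebook operate on row order. The sketch below is only an illustration: it reads an in-memory CSV built from two rows of the preview above (deliberately placed out of order) rather than the real Stage08 file, and the name `toy` is hypothetical.

```python
import io

import pandas as pd

# Two rows copied from the preview above, intentionally out of order.
csv_text = """date,region,age,income,transactions,spend
2025-02-03,South,42.7,29900.82,2,39.73
2025-02-01,West,41.2,51712.11,0,109.42
"""

# parse_dates turns the 'date' column into datetime64 instead of plain strings.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Sort chronologically so that row-based windows follow time order.
toy = toy.sort_values("date").reset_index(drop=True)

print(toy.dtypes)
print(toy)
```

The same two steps (`parse_dates=["date"]` and `sort_values("date")`) could be applied when reading `PROCESSED`, assuming the `date` column there is stored in the same format.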


## Feature 1: Spend-to-Income Ratio

In [2]:

df['spend_income_ratio'] = df['spend'] / df['income']

df[['income', 'spend', 'spend_income_ratio']].head()


Unnamed: 0,income,spend,spend_income_ratio
0,51712.11,109.42,0.002116
1,29900.82,39.73,0.001329
2,34212.69,125.46,0.003667
3,67315.9,257.72,0.003829
4,32664.85,83.56,0.002558


Rationale:

This feature captures how much of a person’s income is being spent.

In EDA (Stage08), we observed that higher spending often scales with higher income, but the proportion differs between individuals.

This ratio normalizes spend by income and helps reveal over-spending behavior.
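One practical caveat with a ratio feature is that a zero or missing denominator produces `inf` or `NaN`. The sketch below shows one possible guard. The first two rows reuse values from the preview above; the last two rows are made up purely to exercise the edge cases, and `safe_income` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Rows 0-1 come from the preview above; rows 2-3 are invented edge cases.
toy = pd.DataFrame({
    "income": [51712.11, 29900.82, 0.0, np.nan],
    "spend": [109.42, 39.73, 50.0, 20.0],
})

# Keep income only where it is strictly positive; everything else becomes NaN,
# so the ratio is NaN (not inf) for zero or missing income.
safe_income = toy["income"].where(toy["income"] > 0)
toy["spend_income_ratio"] = toy["spend"] / safe_income

print(toy)
```

Whether to leave those rows as `NaN`, impute them, or drop them is a modeling decision that depends on how the downstream stage handles missing values.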

In [3]:
# TODO: Add another feature
# Example (using this dataset's 'spend' column): df['rolling_spend_mean'] = df['spend'].rolling(3).mean()

## Feature 2: Rolling Mean of Spend

In [4]:
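# Rolling mean of spend over 5 consecutive rows.
# Note: the dates have gaps (e.g. no 2025-02-02), so this is a 5-row window, not a strict 5-day one.
# The first 4 rows are NaN because the window is not yet full.
df['rolling_spend_mean'] = df['spend'].rolling(5).mean()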

df[['spend', 'rolling_spend_mean']].head(10)


Unnamed: 0,spend,rolling_spend_mean
0,109.42,
1,39.73,
2,125.46,
3,257.72,
4,83.56,123.178
5,78.24,116.942
6,59.91,120.978
7,147.43,125.372
8,150.38,103.904
9,148.05,116.802


Rationale:

Captures short-term spending trends by smoothing over 5 consecutive records.

This reduces noise from daily fluctuations and helps the model pick up spending patterns over time.
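Because the dates in the preview skip some days, a window of 5 rows and a window of 5 calendar days are not the same thing. The sketch below contrasts the two on the first five rows of the preview. The column names `roll_5_rows` and `roll_5_days` are hypothetical, and the time-based variant is offered as an alternative, not as what the cell above does.

```python
import pandas as pd

# Dates and spend values are the first five rows of the preview above.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-02-01", "2025-02-03", "2025-02-04", "2025-02-05", "2025-02-06"]
    ),
    "spend": [109.42, 39.73, 125.46, 257.72, 83.56],
}).sort_values("date")

# Row-based window: the last 5 rows, however many days they span.
toy["roll_5_rows"] = toy["spend"].rolling(5).mean()

# Time-based window: every row in the trailing 5 calendar days.
# Offset windows need a datetime index and default to min_periods=1.
by_date = toy.set_index("date")["spend"]
toy["roll_5_days"] = by_date.rolling("5D").mean().to_numpy()

print(toy)
```

On the last row the two columns differ: the row-based mean uses all five rows, while the 5-day window excludes 2025-02-01 because offset windows are open on the left by default.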

## Feature 3: Age Grouping

In [5]:
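# Categorize age into decade bins. right=False makes each bin [lower, upper),
# so fractional ages such as 29.5 stay in '20s'; np.inf keeps ages above 70 in '60+'.
# Ages below 17 still map to NaN.
df['age_group'] = pd.cut(df['age'],
                         bins=[17, 30, 40, 50, 60, np.inf],
                         right=False,
                         labels=['20s','30s','40s','50s','60+'])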

df[['age','age_group']].head()


Unnamed: 0,age,age_group
0,41.2,40s
1,42.7,40s
2,43.0,40s
3,45.7,40s
4,30.4,30s



Rationale:

Converts continuous age into categorical age_group.

Helps capture non-linear effects of age on spending.

Stage08 EDA showed that spending varies with age (younger vs. older groups).
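Many models cannot consume a categorical column directly, so a common follow-up is one-hot encoding. The sketch below is an assumption about how `age_group` might be used downstream, not a step from this notebook. It rebuilds the five `age_group` values from the preview above as an ordered categorical, and the `age` prefix is a hypothetical naming choice.

```python
import pandas as pd

labels = ["20s", "30s", "40s", "50s", "60+"]

# The five age_group values shown in the preview above.
age_group = pd.Series(
    pd.Categorical(["40s", "40s", "40s", "40s", "30s"],
                   categories=labels, ordered=True),
    name="age_group",
)

# One indicator column per category, including categories with no rows yet.
dummies = pd.get_dummies(age_group, prefix="age")

print(dummies)
```

Because the input is categorical, `get_dummies` creates a column for every declared category, which keeps the feature layout stable even when a group is absent from a given sample.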

## Feature 4: Regional Spend Share

In [6]:
# Calculate region spend share
region_spend = df.groupby('region')['spend'].transform('sum')
df['region_spend_share'] = df['spend'] / region_spend


Rationale:

Normalizes individual spend by regional total spend.

Highlights whether someone is a "big spender" relative to peers in the same region.

Connects to EDA where spending distributions varied by region.
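A share of the regional total depends on how many records a region has: a region with a single record always gets a share of 1.0. The sketch below reproduces the share on the five preview rows and, as an optional alternative that is not part of the original notebook, also computes spend relative to the regional mean (the hypothetical column `spend_vs_region_mean`), which does not shrink as a region gains rows.

```python
import pandas as pd

# Region and spend values are the first five rows of the preview above.
toy = pd.DataFrame({
    "region": ["West", "South", "South", "South", "East"],
    "spend": [109.42, 39.73, 125.46, 257.72, 83.56],
})

grouped = toy.groupby("region")["spend"]

# Same computation as the cell above: each row's share of its region's total.
toy["region_spend_share"] = toy["spend"] / grouped.transform("sum")

# Alternative: ratio to the regional mean (1.0 means "typical for the region").
toy["spend_vs_region_mean"] = toy["spend"] / grouped.transform("mean")

print(toy)
# Shares within each region sum to 1.
print(toy.groupby("region")["region_spend_share"].sum())
```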

## Feature 5: Income Variability Proxy

In [7]:
# Rolling variance of income over 5 consecutive rows (proxy for variability).
# This is a row-based window; the first 4 rows are NaN.
df['rolling_income_var'] = df['income'].rolling(5).var()


Rationale:

Captures local variability in income across 5 consecutive records.

Even if average income is stable, local variance may signal irregular patterns (e.g., bonuses, missing records).

Ties to EDA observation of outliers and missingness in income.
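Rolling variance is expressed in squared income units, which can be hard to compare across income levels. As a hedged sketch, the block below computes the same rolling variance on the five preview incomes with a relaxed `min_periods`, plus a unit-free coefficient of variation. Both the `min_periods=3` setting and the `rolling_cv` name are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Income values are the first five rows of the preview above.
income = pd.Series(
    [51712.11, 29900.82, 34212.69, 67315.9, 32664.85], name="income"
)

# min_periods=3 yields a value as soon as 3 observations are in the window,
# so only the first 2 rows are NaN instead of the first 4.
window = income.rolling(5, min_periods=3)

rolling_var = window.var()                  # sample variance (ddof=1)
rolling_cv = window.std() / window.mean()   # unit-free: std relative to mean

print(pd.DataFrame({"income": income, "var": rolling_var, "cv": rolling_cv}))
```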

In [8]:
# Save to processed folder
FEATURED = "../data/processed/stage09_features.csv"
df.to_csv(FEATURED, index=False)
print(f"Feature-engineered dataset saved to {FEATURED}")


Feature-engineered dataset saved to ../data/processed/stage09_features.csv
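CSV files do not store dtypes, so a categorical column such as `age_group` does not come back as a categorical when the file is reloaded. The sketch below demonstrates this with an in-memory buffer instead of the real output path, then restores the ordered categorical. The two-row frame reuses values from the previews above.

```python
import io

import pandas as pd

labels = ["20s", "30s", "40s", "50s", "60+"]

# Two rows taken from the age_group preview above.
toy = pd.DataFrame({"age": [41.2, 30.4]})
toy["age_group"] = pd.Categorical(["40s", "30s"], categories=labels, ordered=True)

# Round-trip through CSV text, mirroring to_csv(..., index=False) + read_csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(toy["age_group"].dtype)       # categorical before saving
print(reloaded["age_group"].dtype)  # no longer categorical after reload

# Restore the ordered categorical explicitly after loading.
reloaded["age_group"] = pd.Categorical(
    reloaded["age_group"], categories=labels, ordered=True
)
```

If preserving dtypes matters for the next stage, a format that stores them (for example Parquet or pickle) is another option, at the cost of an extra dependency or reduced portability.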
