# Stage 09 — Homework Starter Notebook

In the lecture, we learned how to create engineered features. Now it’s your turn to apply those ideas to your own project data.

In [1]:
# === Stage 09 — Setup & Synthetic Data (same schema as Stage 08) ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

sns.set(context="talk", style="whitegrid")
np.random.seed(8)
pd.set_option("display.max_columns", 100)

# Recreate Stage 08 dataset
n = 160
df = pd.DataFrame({
    "date": pd.date_range("2021-02-01", periods=n, freq="D"),
    "region": np.random.choice(["North","South","East","West"], size=n),
    "age": np.random.normal(40, 8, size=n).clip(22, 70).round(1),
    "income": np.random.lognormal(mean=10.6, sigma=0.3, size=n).round(2),
    "transactions": np.random.poisson(lam=3, size=n),
})
base = df["income"] * 0.0015 + df["transactions"] * 18 + np.random.normal(0, 40, size=n)
df["spend"] = np.maximum(0, base).round(2)

# Inject a bit of missingness and outliers (same as Stage 08)
df.loc[np.random.choice(df.index, 5, replace=False), "income"] = np.nan
df.loc[np.random.choice(df.index, 3, replace=False), "spend"] = np.nan
df.loc[np.random.choice(df.index, 2, replace=False), "transactions"] = df["transactions"].max() + 12

df.head()

| | date | region | age | income | transactions | spend |
|---|---|---|---|---|---|---|
| 0 | 2021-02-01 | West | 37.6 | 28086.81 | 4 | 73.35 |
| 1 | 2021-02-02 | North | 43.0 | 33034.75 | 1 | 52.37 |
| 2 | 2021-02-03 | South | 38.2 | 50045.39 | 2 | 131.85 |
| 3 | 2021-02-04 | South | 24.9 | 39467.28 | 4 | 147.58 |
| 4 | 2021-02-05 | South | 59.8 | 31201.65 | 1 | 86.76 |


In [2]:
# === Feature 1: Spend per Transaction ===
# Purpose: normalize spend by activity level to capture spending intensity
df['spend_per_txn'] = df['spend'] / df['transactions'].replace(0, np.nan)

# Optional quick check (correlation with target "spend")
tmp = df[['spend','spend_per_txn']].dropna()
corr_1 = tmp.corr().loc['spend','spend_per_txn']
corr_1

np.float64(0.19240508300219825)

**Feature 1 — `spend_per_txn`**  
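*Rationale:* Divides total spend by the number of transactions to capture spending intensity per purchase. This helps when raw counts vary widely (we observed outliers in `transactions` during EDA). Rows with zero transactions become `NaN` instead of triggering a division by zero. The quick check above shows only a weak correlation with overall spend (about 0.19), which suggests the feature measures something other than total volume.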
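
To make the zero-transaction guard concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` values are invented for illustration and are not drawn from the notebook's dataset). It compares the guarded division used for `spend_per_txn` with an unguarded one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the third customer has zero transactions.
toy = pd.DataFrame({
    "spend": [120.0, 45.0, 30.0, 90.0],
    "transactions": [4, 1, 0, 3],
})

# Guarded version: zero counts become NaN, so the ratio is NaN for that row.
toy["spend_per_txn"] = toy["spend"] / toy["transactions"].replace(0, np.nan)

# Unguarded version: a positive spend divided by zero yields inf in pandas.
unguarded = toy["spend"] / toy["transactions"]

print(toy)
print(unguarded)
```

Keeping `NaN` instead of `inf` means the row can be dropped or imputed with the usual missing-data tools rather than distorting means and correlations.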

In [3]:
# === Feature 2: Spend to Income Ratio ===
# Purpose: proportionality of spend relative to earning capacity
df['spend_income_ratio'] = df['spend'] / df['income']

# Optional quick check: correlation with spend (drop NA)
tmp = df[['spend','spend_income_ratio']].dropna()
corr_2 = tmp.corr().loc['spend','spend_income_ratio']
corr_2

np.float64(0.7800118564894746)

**Feature 2 — `spend_income_ratio`**  
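*Rationale:* Measures how much a customer spends relative to their income. This captures proportional behavior and can be more informative than raw income or spend alone, especially when income is right-skewed (observed in EDA). Rows with missing `income` or `spend` produce `NaN`. The strong correlation with spend (about 0.78) is partly mechanical, because `spend` is the numerator of the ratio.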
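
As a small illustration of the "common scale" idea, the sketch below uses three hypothetical customers (values invented for this example). Two of them have different incomes but the same proportional spend, and the third has a missing income.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; the last one has no recorded income.
toy = pd.DataFrame({
    "income": [30000.0, 60000.0, np.nan],
    "spend": [60.0, 120.0, 80.0],
})

# Same ratio definition as in the cell above.
toy["spend_income_ratio"] = toy["spend"] / toy["income"]

# Missing income propagates to the ratio, so count the affected rows.
n_missing = int(toy["spend_income_ratio"].isna().sum())

print(toy)
print("rows with missing ratio:", n_missing)
```

The first two rows differ by a factor of two in raw spend yet share the same ratio, which is the comparison the feature is meant to enable.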

## Feature Engineering Summary

**Feature 1 – `spend_per_txn`**  
- *Why:* Normalizes spend by activity level; we saw outliers in `transactions` during EDA.  
- *Expectation:* Captures spending intensity per purchase; less sensitive to extreme transaction counts.

**Feature 2 – `spend_income_ratio`**  
- *Why:* Income was right-skewed and related to spend; ratio captures proportional spending behavior.  
- *Expectation:* Helps compare customers with different income levels on a common scale.

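**Next steps (if modeling):** consider robust scaling or winsorizing outliers, handle missing `income`/`spend` before training, and evaluate feature importance. Note that both features are derived from `spend`, so they would leak the target if `spend` is what the model predicts.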
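
The next steps could be sketched roughly as follows. This is one possible approach under stated assumptions, not the course's prescribed method: the helper names `winsorize` and `robust_scale`, the 5%/95% cutoffs, and median imputation are all choices made for this example, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def winsorize(s, lower=0.05, upper=0.95):
    """Clip values to the given quantiles (hypothetical helper, assumed cutoffs)."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

def robust_scale(s):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    return (s - med) / (q3 - q1)

# Toy transaction counts with one injected outlier, mimicking the setup cell.
txn = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 22], dtype=float)
txn_w = winsorize(txn)            # the outlier is pulled in to the 95th percentile
txn_scaled = robust_scale(txn_w)  # median maps to 0, IQR maps to 1

# Median imputation for missing income (an assumption; other strategies exist).
income = pd.Series([28000.0, np.nan, 50000.0, 39000.0])
income_filled = income.fillna(income.median())

print(txn_w.max(), txn_scaled.median(), income_filled.tolist())
```

In a real pipeline the quantiles and the median would normally be computed on the training split only and then reused on validation and test data.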