# Stage 09 — Feature Engineering: Homework Starter

Use this notebook to prototype engineered features and validate their impact on EDA and downstream models.

**Checklist**:
- [ ] Load your dataset
- [ ] Create engineered features (ratios, flags, interactions, logs)
- [ ] Inspect distributions & correlations
- [ ] Document assumptions and risks


In [None]:
# Setup
import sys
from pathlib import Path
import pandas as pd
import numpy as np

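# Make src importable.
# Assumes the notebook is run from a direct subfolder of the stage folder,
# so that the parent of the working directory is the project root.
PROJECT_ROOT = Path.cwd().parents[0]  # stage09_feature-engineering/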
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.append(str(SRC_PATH))

from utils import (
    create_income_to_balance_ratio,
    create_transactions_per_year,
    flag_recent_activity,
)

## 1) Load or Create Data
Replace the synthetic example with your actual dataset (e.g., `data/raw/customers.csv`).

In [None]:
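# Synthetic example: replace with your real data.
# Note: clipping at 0 can leave zero balances, and tenure_years can be close to 0,
# so ratio features built on these columns may need guarding against division by (near) zero.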
n = 1_000
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'customer_id': np.arange(1, n+1),
    'income': rng.normal(60000, 15000, n).clip(1000, None).astype(int),
    'account_balance': rng.normal(8000, 3000, n).clip(0, None).astype(int),
    'num_transactions': rng.poisson(30, n),
    'tenure_years': rng.uniform(0, 10, n),
    'last_login_days_ago': rng.integers(0, 180, n)
})
df.head()

## 2) Engineer Features
These mirror the utility functions in `src/utils.py`.

In [None]:
df = create_income_to_balance_ratio(df)
df = create_transactions_per_year(df)
df = flag_recent_activity(df)
df.head()
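`src/utils.py` is not reproduced in this notebook, so the following is only a sketch of what the three helpers could look like. The `_sketch` function names, the function bodies, the 30-day activity window, and the choice to return NaN for zero denominators are all assumptions made for illustration, not the actual implementation. Only the output column names are taken from the notebook's EDA cells.

```python
import numpy as np
import pandas as pd


def income_to_balance_ratio_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: income / balance, NaN where balance is 0."""
    out = df.copy()
    balance = out["account_balance"].replace(0, np.nan)  # avoid inf from x / 0
    out["income_to_balance_ratio"] = out["income"] / balance
    return out


def transactions_per_year_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: transactions / tenure, NaN where tenure <= 0."""
    out = df.copy()
    tenure = out["tenure_years"].where(out["tenure_years"] > 0)
    out["transactions_per_year"] = out["num_transactions"] / tenure
    return out


def flag_recent_activity_sketch(df: pd.DataFrame, max_days: int = 30) -> pd.DataFrame:
    """Hypothetical stand-in: 1 if the last login is within max_days, else 0."""
    out = df.copy()
    out["is_active"] = (out["last_login_days_ago"] <= max_days).astype(int)
    return out


# Tiny frame with one well-behaved row and one row that hits both edge cases
toy = pd.DataFrame({
    "income": [50_000, 80_000],
    "account_balance": [10_000, 0],
    "num_transactions": [20, 5],
    "tenure_years": [4.0, 0.0],
    "last_login_days_ago": [3, 90],
})
toy = flag_recent_activity_sketch(
    transactions_per_year_sketch(income_to_balance_ratio_sketch(toy))
)
print(toy[["income_to_balance_ratio", "transactions_per_year", "is_active"]])
```

Returning NaN instead of `inf` keeps `describe()` and `corr()` usable, at the cost of having to decide later how to impute or drop those rows.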

## 3) Quick EDA
Check distributions and pairwise correlations to spot skew, outliers, and engineered features that are largely redundant with their inputs.

In [None]:
# Columns to inspect; reused by the correlation cell
eda_cols = ['income', 'account_balance', 'num_transactions', 'tenure_years',
            'income_to_balance_ratio', 'transactions_per_year', 'is_active']
desc = df[eda_cols].describe()
desc

In [None]:
corr = df[eda_cols].corr(numeric_only=True)
corr
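
The checklist also mentions logs and interactions, which the utility functions above do not cover. Here is a minimal sketch on a small synthetic frame. The names `demo`, `log_income`, and `income_x_tenure` are illustrative assumptions and are not part of `src/utils.py`.

```python
import numpy as np
import pandas as pd

rng_demo = np.random.default_rng(0)
demo = pd.DataFrame({
    # Log-normal income is right-skewed on purpose
    "income": rng_demo.lognormal(mean=11, sigma=0.6, size=500),
    "tenure_years": rng_demo.uniform(0.5, 10, size=500),
})

# log1p compresses the long right tail and is defined at 0
demo["log_income"] = np.log1p(demo["income"])

# Simple multiplicative interaction between two inputs
demo["income_x_tenure"] = demo["income"] * demo["tenure_years"]

# Compare skewness before and after the transform
print(demo[["income", "log_income"]].skew())
```

Comparing skewness before and after is a quick way to judge whether a log transform is doing useful work on a given column.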

## 4) "So What?" — Notes
Use this section to connect features to hypotheses.

- **income_to_balance_ratio** may help separate customers with high earnings but low balances.
- **transactions_per_year** normalizes raw counts by tenure, enabling fair comparisons across customers with different tenures.
- **is_active** captures current engagement; useful for churn-like targets or for cohorting users by recent behavior.

**Assumptions & Risks**: Describe data quality assumptions (e.g., "tenure_years > 0"), potential leakage, and what happens if they don't hold.
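
One way to make an assumption such as `tenure_years > 0` explicit is to count the rows that violate it before dividing. The sketch below uses a hypothetical helper, `check_ratio_inputs`, and a small made-up frame; neither comes from `src/utils.py`.

```python
import numpy as np
import pandas as pd


def check_ratio_inputs(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count rows that would break the ratio features."""
    return {
        "zero_balance": int((df["account_balance"] == 0).sum()),
        "nonpositive_tenure": int((df["tenure_years"] <= 0).sum()),
    }


sample = pd.DataFrame({
    "income": [40_000, 55_000, 70_000],
    "account_balance": [5_000, 0, 12_000],
    "tenure_years": [2.0, 0.01, 0.0],
    "num_transactions": [10, 3, 8],
})

print(check_ratio_inputs(sample))

# Unguarded division shows what happens when the assumptions fail:
# pandas returns inf rather than raising an error.
ratio = sample["income"] / sample["account_balance"]
per_year = sample["num_transactions"] / sample["tenure_years"]
print(int(np.isinf(ratio).sum()), int(np.isinf(per_year).sum()))
```

A very small but positive tenure, like the 0.01 years in the second row, passes the check yet still produces an extreme rate, so a minimum-tenure threshold may be worth documenting alongside the `> 0` assumption.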