# 03 – Feature Engineering & Risk Index

**Author**: Namora Fernando
**Date**: 2025-08-18
**Objective**: Engineer modeling-ready features and build a per-year inflation risk index from World Bank indicators:

1. Create change/volatility features (FX change, CPI volatility, Money Supply change).
2. Impute small gaps conservatively (country-wise).
3. Winsorize outliers per year.
4. Normalize per year and combine into a composite **Risk Score (0–100)**.
5. Export dataset for next steps and Power BI.

## 1. Imports & Paths (and Repro Setup)

In [1]:
import os
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

# Paths
INPUT_FILE  = "data_intermediate/cleaned_merged_inflation_data.csv"   # from 01
OUTPUT_DIR  = "data_intermediate"
OUTPUT_FILE = os.path.join(OUTPUT_DIR, "inflation_data_with_risk_index.csv")

os.makedirs(OUTPUT_DIR, exist_ok=True)

## 2. Load Data & Basic Checks

In [2]:
df = pd.read_csv(INPUT_FILE)

# Ensure proper types & ordering
df["Year"] = df["Year"].astype(int)
df = df.sort_values(["Country Name", "Year"]).reset_index(drop=True)

print(df.shape)
df.head()

(15918, 7)


Unnamed: 0,Country Name,Country Code,Year,CPI_AnnualChange,GDP_Growth,MoneySupply_GDPpct,ExchangeRate_LCUperUSD
0,Afghanistan,AFG,1960,,,,17.196561
1,Afghanistan,AFG,1961,,,,17.196561
2,Afghanistan,AFG,1962,,,,17.196561
3,Afghanistan,AFG,1963,,,,35.109645
4,Afghanistan,AFG,1964,,,,38.692262


## 3. Feature Engineering (Changes, Volatility, Trends)

**Why these features?**

- **FX depreciation** is a classic inflation risk signal → use **YoY % change** of exchange rate.
- **CPI volatility** (rolling std) captures instability → use **3-year rolling std**.
- **Money supply growth** reflects monetary expansion → use **YoY % change**.
- **GDP growth** is already in the data as a level; we also compute a 3-year moving average for context.
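
To make these definitions concrete, here is a small sketch on made-up numbers (the `toy` frame and its values are illustrative, not World Bank data). It computes the YoY % change with an explicit group-wise `shift`, which is arithmetically the same as `pct_change() * 100` when a series has no gaps, and a 3-year rolling standard deviation.

```python
import pandas as pd

# Hypothetical toy panel: two countries, four years each (values are made up)
toy = pd.DataFrame({
    "Country Name": ["A"] * 4 + ["B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "ExchangeRate_LCUperUSD": [10.0, 11.0, 11.0, 22.0, 5.0, 5.0, 4.0, 4.0],
    "CPI_AnnualChange": [2.0, 4.0, 6.0, 6.0, 1.0, 1.0, 1.0, 7.0],
})

grouped = toy.groupby("Country Name")

# YoY % change: (x_t / x_{t-1} - 1) * 100, never looking across countries
prev_fx = grouped["ExchangeRate_LCUperUSD"].shift(1)
toy["ExchangeRate_ChangePct"] = (toy["ExchangeRate_LCUperUSD"] / prev_fx - 1) * 100

# 3-year rolling sample std (ddof=1 by default); needs at least 2 observations
toy["CPI_RollingVol_3y"] = grouped["CPI_AnnualChange"].transform(
    lambda s: s.rolling(window=3, min_periods=2).std()
)

print(toy)
```

The first year of each country is NaN in both new columns because there is no earlier observation to compare against. Depending on the pandas version, `pct_change` may forward-fill missing values before computing the change, which gives 0.0 instead of NaN across a gap; the explicit `shift` leaves those rows as NaN.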

In [3]:
# Exchange Rate % Change (YoY) per country
df["ExchangeRate_ChangePct"] = (
    df.groupby("Country Name")["ExchangeRate_LCUperUSD"]
      .pct_change() * 100
)

# CPI rolling volatility (3-year rolling std)
df["CPI_RollingVol_3y"] = (
    df.groupby("Country Name")["CPI_AnnualChange"]
      .transform(lambda s: s.rolling(window=3, min_periods=2).std())
)

# Money Supply % Change (YoY)
df["MoneySupply_ChangePct"] = (
    df.groupby("Country Name")["MoneySupply_GDPpct"]
      .pct_change() * 100
)

# GDP 3y moving average (context only, not in index)
df["GDP_Growth_MA_3y"] = (
    df.groupby("Country Name")["GDP_Growth"]
      .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)

df.head(10)



Unnamed: 0,Country Name,Country Code,Year,CPI_AnnualChange,GDP_Growth,MoneySupply_GDPpct,ExchangeRate_LCUperUSD,ExchangeRate_ChangePct,CPI_RollingVol_3y,MoneySupply_ChangePct,GDP_Growth_MA_3y
0,Afghanistan,AFG,1960,,,,17.196561,,,,
1,Afghanistan,AFG,1961,,,,17.196561,0.0,,,
2,Afghanistan,AFG,1962,,,,17.196561,0.0,,,
3,Afghanistan,AFG,1963,,,,35.109645,104.166667,,,
4,Afghanistan,AFG,1964,,,,38.692262,10.204082,,,
5,Afghanistan,AFG,1965,,,,38.692262,0.0,,,
6,Afghanistan,AFG,1966,,,,38.692262,0.0,,,
7,Afghanistan,AFG,1967,,,,38.692262,0.0,,,
8,Afghanistan,AFG,1968,,,,38.692262,0.0,,,
9,Afghanistan,AFG,1969,,,,38.692262,0.0,,,


## 4. Conservative Imputation (Country-wise, small gaps only)

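**Policy**: the output of 01 stays the non-imputed “truth”. <br>
Here we impute gaps so the index can be computed, interpolating only across small gaps:

- Linear interpolation within country (`limit=2` years from each side of a gap).
- Then `ffill`/`bfill` inside country for whatever is still missing. These fills carry no `limit`, so they also cover longer gaps and series edges. If an entire country series is missing for an indicator, values stay NaN.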
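
The two steps behave differently on a long gap. The sketch below uses a made-up single-country series (not taken from the dataset) with five consecutive missing years, to show what each step fills.

```python
import numpy as np
import pandas as pd

# Hypothetical single-country series with a 5-year gap (values are made up)
s = pd.Series(
    [1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 7.0],
    index=range(2000, 2007),
)

# Step 1: linear interpolation, at most 2 values filled from each side of the gap
step1 = s.interpolate(method="linear", limit=2, limit_direction="both")

# Step 2: forward fill, then backward fill, with no limit
step2 = step1.ffill().bfill()

print(pd.DataFrame({"raw": s, "interpolated": step1, "filled": step2}))
```

After step 1 the middle year (2003) is still NaN; step 2 then copies the 2002 value forward into it. If a stricter policy is wanted, `ffill` and `bfill` both accept a `limit` argument.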

In [4]:
cols_to_impute = [
    "CPI_AnnualChange",
    "GDP_Growth",
    "ExchangeRate_ChangePct",
    "MoneySupply_ChangePct"
]

# Ensure sorting before interpolation
df = df.sort_values(["Country Name", "Year"]).reset_index(drop=True)

for col in cols_to_impute:
    # Linear interpolation within each country (at most 2 years from each side of a gap)
    df[col] = df.groupby("Country Name")[col].transform(
        lambda s: s.interpolate(method="linear", limit=2, limit_direction="both")
    )
    # Fill whatever is still missing within the country (no limit: longer gaps and edges too)
    df[col] = df.groupby("Country Name")[col].transform(lambda s: s.ffill().bfill())

# Missing summary after impute
df[cols_to_impute].isna().sum()

CPI_AnnualChange          1430
GDP_Growth                  65
ExchangeRate_ChangePct    3114
MoneySupply_ChangePct     3022
dtype: int64

## 5. Per-Year Outlier Handling (Winsorization)

We clip **within each year** at the 1st–99th percentiles to reduce the influence of hyper-outliers, while preserving cross-country comparability in that year.
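
As a quick illustration of what the clipping does, the sketch below builds one hypothetical year with 101 made-up values, one of them extreme, and clips at that year's own 1st and 99th percentiles. The helper name `clip_1_99` is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical single year with 101 countries; values 0..99 plus one extreme reading
values = np.arange(101, dtype=float)
values[-1] = 10_000.0  # made-up hyper-outlier
toy = pd.DataFrame({"Year": 2020, "x": values})

def clip_1_99(s):
    # Percentile bounds are computed from this year's values only
    lo, hi = s.quantile(0.01), s.quantile(0.99)
    return s.clip(lo, hi)

toy["x_win"] = toy.groupby("Year")["x"].transform(clip_1_99)

print("max before/after:", toy["x"].max(), toy["x_win"].max())
print("min before/after:", toy["x"].min(), toy["x_win"].min())
```

The extreme value is pulled down to the 99th percentile rather than dropped, so the country stays in the year's ranking but no longer dominates the mean and standard deviation used later.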

In [5]:
def winsorize_by_year(frame, col, lower=1, upper=99):
    def _clip(s):
        lo, hi = s.quantile(lower/100), s.quantile(upper/100)
        return s.clip(lo, hi)
    return frame.groupby("Year")[col].transform(_clip)

winsor_cols = {
    "CPI_AnnualChange": (1, 99),
    "GDP_Growth": (1, 99),
    "ExchangeRate_ChangePct": (1, 99),
    "MoneySupply_ChangePct": (1, 99)
}

for c, (lo, hi) in winsor_cols.items():
    df[f"{c}_win"] = winsorize_by_year(df, c, lower=lo, upper=hi)

df[[c for c in df.columns if c.endswith("_win")]].head()

Unnamed: 0,CPI_AnnualChange_win,GDP_Growth_win,ExchangeRate_ChangePct_win,MoneySupply_ChangePct_win
0,12.686269,-9.431974,0.0,2.098195
1,12.686269,-9.431974,0.0,2.098195
2,12.686269,-9.431974,0.0,2.098195
3,12.686269,-9.431974,64.972387,2.098195
4,12.686269,-9.431974,10.204082,2.098195


## 6. Per-Year Normalization (Z-Scores)

We normalize indicators **within each year** so the composite score compares countries *in the same macro context (year)*.
- CPI (↑) → higher z = higher risk
- FX change (↑ depreciation) → higher z = higher risk
- Money Supply growth (↑) → higher z = higher risk
- GDP growth (↑) → lower risk ⇒ we invert sign
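
A minimal sketch of the per-year z-score and the sign inversion, on made-up values (the `toy` frame is hypothetical). The two years have the same relative spread at different scales, so their z-scores coincide. This sketch omits the zero-variance guard that a full implementation would need.

```python
import pandas as pd

# Hypothetical two-year panel with three countries per year (values are made up)
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "GDP_Growth_win": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

by_year = toy.groupby("Year")["GDP_Growth_win"]

# Population z-score (ddof=0) using each year's own mean and std
year_mean = by_year.transform("mean")
year_std = by_year.transform(lambda s: s.std(ddof=0))
toy["z_GDP"] = (toy["GDP_Growth_win"] - year_mean) / year_std

# Higher growth means lower risk, so the sign is flipped
toy["z_GDP_inv"] = -toy["z_GDP"]

print(toy)
```

Because each year is standardized separately, a z-score says how far a country sits from that year's average, not how high its growth is in absolute terms.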

In [6]:
def zscore_by_year(frame, col):
    def _z(s):
        mu = s.mean()
        sd = s.std(ddof=0)
        if sd == 0 or np.isnan(sd):
            return pd.Series(np.zeros(len(s)), index=s.index)
        return (s - mu) / sd
    return frame.groupby("Year")[col].transform(_z)

df["z_CPI"]   = zscore_by_year(df, "CPI_AnnualChange_win")
df["z_FX"]    = zscore_by_year(df, "ExchangeRate_ChangePct_win")
df["z_MS"]    = zscore_by_year(df, "MoneySupply_ChangePct_win")
df["z_GDP"]   = zscore_by_year(df, "GDP_Growth_win")

# invert GDP direction (higher growth => lower risk)
df["z_GDP_inv"] = -df["z_GDP"]

df[["Year","z_CPI","z_FX","z_MS","z_GDP","z_GDP_inv"]].head()

Unnamed: 0,Year,z_CPI,z_FX,z_MS,z_GDP,z_GDP_inv
0,1960,0.033103,-0.162884,-0.057358,-2.388371,2.388371
1,1961,0.016928,-0.162596,-0.008434,-2.331492,2.331492
2,1962,0.004785,-0.204221,-0.052471,-2.704354,2.704354
3,1963,-0.007727,6.65728,-0.068666,-2.587929,2.587929
4,1964,-0.011513,0.365811,0.075538,-2.803802,2.803802


## 7. Composite Risk Index (per Year), then 0–100 Scaling

Weights (sum to 1; can be tuned later):
- CPI: **0.40**
- FX change: **0.25**
- Money Supply change: **0.20**
- GDP growth (inverted): **0.15**
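
The arithmetic can be checked by hand on a tiny example. The sketch below applies the weights above to made-up z-scores for three hypothetical countries in a single year, then rescales the weighted sum to 0–100.

```python
import pandas as pd

# Weights as stated in the text
weights = {"z_CPI": 0.40, "z_FX": 0.25, "z_MS": 0.20, "z_GDP_inv": 0.15}
assert abs(sum(weights.values()) - 1.0) < 1e-12

# Hypothetical z-scores for three countries in one year (values are made up)
toy = pd.DataFrame(
    {
        "z_CPI": [1.0, 0.0, -1.0],
        "z_FX": [2.0, 0.0, -1.0],
        "z_MS": [0.0, 0.0, 0.0],
        "z_GDP_inv": [1.0, 0.0, -2.0],
    },
    index=["High", "Middle", "Low"],
)

# Weighted sum across the four components
toy["Risk_Index_Z"] = sum(w * toy[col] for col, w in weights.items())

# Min-max scaling to 0-100 within the (single) year
lo, hi = toy["Risk_Index_Z"].min(), toy["Risk_Index_Z"].max()
toy["Risk_Score_0_100"] = (toy["Risk_Index_Z"] - lo) / (hi - lo) * 100.0

print(toy[["Risk_Index_Z", "Risk_Score_0_100"]])
```

For "High" the index is 0.40·1 + 0.25·2 + 0.20·0 + 0.15·1 = 1.05. Because the 0–100 score is anchored to each year's own minimum and maximum, it ranks countries within a year; the same score in two different years does not imply the same underlying index value.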

In [7]:
weights = {
    "z_CPI": 0.40,
    "z_FX":  0.25,
    "z_MS":  0.20,
    "z_GDP_inv": 0.15
}

df["Risk_Index_Z"] = (
    weights["z_CPI"]     * df["z_CPI"] +
    weights["z_FX"]      * df["z_FX"]  +
    weights["z_MS"]      * df["z_MS"]  +
    weights["z_GDP_inv"] * df["z_GDP_inv"]
)

def minmax_0_100_by_year(frame, col):
    def _scale(s):
        lo, hi = s.min(), s.max()
        if hi == lo:
            return pd.Series(np.full(len(s), 50.0), index=s.index)
        return (s - lo) / (hi - lo) * 100.0
    return frame.groupby("Year")[col].transform(_scale)

df["Risk_Score_0_100"] = minmax_0_100_by_year(df, "Risk_Index_Z")

df[["Country Name","Year","Risk_Index_Z","Risk_Score_0_100"]].head(10)

Unnamed: 0,Country Name,Year,Risk_Index_Z,Risk_Score_0_100
0,Afghanistan,1960,0.319304,26.132792
1,Afghanistan,1961,0.314159,27.676863
2,Afghanistan,1962,0.346018,28.848937
3,Afghanistan,1963,2.035685,63.576891
4,Afghanistan,1964,0.522525,38.485362
5,Afghanistan,1965,0.347099,22.387804
6,Afghanistan,1966,0.343218,23.199435
7,Afghanistan,1967,0.271065,29.761782
8,Afghanistan,1968,0.234562,36.10794
9,Afghanistan,1969,0.331099,33.801294


## 8. Quick Sanity Checks (Top Risky Countries in Latest Year)

In [8]:
latest_year = int(df["Year"].max())
cols_view = [
    "Country Name", "Year", "Risk_Score_0_100",
    "CPI_AnnualChange", "ExchangeRate_ChangePct", "MoneySupply_ChangePct", "GDP_Growth"
]
top_latest = (
    df[df["Year"] == latest_year]
      .sort_values("Risk_Score_0_100", ascending=False)
      .head(15)[cols_view]
)
top_latest

Unnamed: 0,Country Name,Year,Risk_Score_0_100,CPI_AnnualChange,ExchangeRate_ChangePct,MoneySupply_ChangePct,GDP_Growth
13351,South Sudan,2024,100.0,91.440822,132.508954,0.0,-10.793365
8333,Lebanon,2024,75.217563,45.243042,545.01599,0.0,-0.760584
13998,Sudan,2024,71.866749,138.80846,0.0,0.0,-13.493292
10837,Nigeria,2024,64.124029,33.242097,129.22796,0.0,3.426439
15917,Zimbabwe,2024,62.300991,104.705171,-6.920159,0.0,2.029484
14816,Turkiye,2024,59.086236,58.506451,38.196474,0.0,3.184024
4026,"Egypt, Arab Rep.",2024,41.671984,28.27059,47.909123,-4.253268,2.399169
6069,Haiti,2024,38.527445,26.949056,-6.540831,0.0,-4.169634
9046,Malawi,2024,37.471634,32.17965,0.0,0.0,1.82685
6972,"Iran, Islamic Rep.",2024,36.539775,32.455871,0.0,0.0,3.04
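
A complementary structural check, sketched on a toy frame: every year should span exactly 0 to 100 after min-max scaling. `check_score_range` is a hypothetical helper for illustration, not part of the notebook, and the toy values are made up.

```python
import pandas as pd

def check_score_range(frame, col="Risk_Score_0_100"):
    """Per-year min, max, and non-missing count of the score (hypothetical helper)."""
    return frame.groupby("Year")[col].agg(["min", "max", "count"])

# Hypothetical scores for two years (values are made up)
toy = pd.DataFrame({
    "Year": [2023, 2023, 2023, 2024, 2024],
    "Risk_Score_0_100": [0.0, 40.0, 100.0, 0.0, 100.0],
})

summary = check_score_range(toy)
bad_years = summary[(summary["min"] != 0.0) | (summary["max"] != 100.0)]

print(summary)
print("Years outside the expected range:", list(bad_years.index))
```

The `count` column excludes NaN, so it also shows how many countries actually received a score in each year. A year in which all index values are identical gets a constant 50.0 under the scaling rule above and would be flagged by this check.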


## 9. Save Output

In [9]:
df.to_csv(OUTPUT_FILE, index=False)
print(f"Saved with risk index → {OUTPUT_FILE}")
print(f"Rows: {len(df):,} | Columns: {df.shape[1]}")

Saved with risk index → data_intermediate\inflation_data_with_risk_index.csv
Rows: 15,918 | Columns: 22


## 10. Summary

- Engineered features: FX % change, CPI 3y volatility, Money Supply % change, GDP 3y MA.
- Imputation within countries: linear interpolation (`limit=2`), then `ffill`/`bfill`.
- Per-year winsorization (1–99th pct) and z-score normalization.
- Composite Risk Index (CPI↑, FX↑, MS↑, GDP↓) with weights (0.40, 0.25, 0.20, 0.15).
- Scaled to `Risk_Score_0_100` per year for clear cross-country comparison.
- Exported `inflation_data_with_risk_index.csv` for next steps.