# Reinforcement Learning for Monetary Policy Optimisation

**Authors:** Leonardo Luksic, Krisha Chandnani, Ignacio Orueta  
**Institution:** London School of Economics  
**Date:** February 2026

---

This notebook trains a Deep Q-Network (DQN) agent on monthly US macroeconomic data
(1971–2025) to set nominal interest rates. The agent's policy is benchmarked against
the standard Taylor Rule and actual Federal Reserve decisions.

**Approach:**

The pipeline has two stages. First, a neural network learns to forecast inflation
12 months ahead using lagged macroeconomic indicators. Second, that forecast model
serves as the transition function inside a Gymnasium environment where the DQN agent
learns a rate-setting policy by minimising a weighted loss over inflation deviations,
unemployment deviations, and interest-rate volatility.
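
As a minimal sketch of that weighted loss, assuming the quadratic form and the weights and targets set later in the configuration (`omega_pi = 1.0`, `omega_u = 0.5`, `omega_smooth = 0.1`, targets of 2 % and 5 %). The function name `policy_loss` is illustrative only:

```python
def policy_loss(inflation, unemployment, rate, prev_rate,
                pi_target=2.0, u_target=5.0,
                w_pi=1.0, w_u=0.5, w_smooth=0.1):
    """Quadratic loss over inflation, unemployment, and rate changes."""
    return (w_pi * (inflation - pi_target) ** 2
            + w_u * (unemployment - u_target) ** 2
            + w_smooth * (rate - prev_rate) ** 2)

# Inflation 1 pp above target, unemployment 1 pp above the natural rate,
# and a 0.5 pp rate move: 1.0*1 + 0.5*1 + 0.1*0.25 = 1.525
loss = policy_loss(inflation=3.0, unemployment=6.0, rate=4.5, prev_rate=4.0)
reward = -loss  # the agent maximises the negative loss
```

Because every term is squared, large misses dominate: a 3 pp inflation overshoot costs nine times as much as a 1 pp overshoot.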

The inflation forecast model is the only learned part of the transition: unemployment
and capacity utilisation follow the historical record. This avoids the compounding-error
problem that arises when multiple transition models are chained together.

**Information structure:**

We test five feature specifications with different lag horizons. Three use only
lags of 18–30 months, approximating a realistic central-bank information set
(publication delays of ~6 months plus a 12-month lookback). The other two draw on
intermediate lags (3, 6, 12 months), which provide more recent signal at the
cost of realism: one adds them to the full variable set, the other keeps a small
core set. Comparing these quantifies the forecasting penalty imposed
by realistic information constraints.

**Validation:**

All models are validated with expanding-window time-series cross-validation
across five distinct economic regimes (1990s recession, dot-com era, 2000s,
Global Financial Crisis, and the post-2015 period including COVID).

## 1. Setup

In [None]:
import os
import warnings
import random
from collections import deque, namedtuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import gymnasium as gym
from gymnasium import spaces
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

warnings.filterwarnings("ignore", message="X has feature names")
warnings.filterwarnings("ignore", category=FutureWarning)

print(f"PyTorch: {torch.__version__}  |  Gymnasium: {gym.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

## 2. Configuration

In [None]:
DATA_PATH = '/Users/leoss/Downloads'
OUTPUT_PATH = '/Users/leoss/Desktop/Portfolio/Website-/Central bank/Outputs'
os.makedirs(OUTPUT_PATH, exist_ok=True)

CFG = {
    # Targets
    "inflation_target": 2.0,
    "unemployment_natural": 5.0,

    # Environment
    "max_steps": 36,          # 3-year episodes
    "n_actions": 41,          # 0–20 % in 0.5 pp increments
    "min_rate": 0.0,
    "max_rate": 20.0,
    "omega_pi": 1.0,          # weight on inflation deviation
    "omega_u": 0.5,           # weight on unemployment deviation
    "omega_smooth": 0.1,      # weight on rate smoothing

    # DQN
    "buffer_capacity": 15_000,
    "batch_size": 64,
    "gamma": 0.99,
    "epsilon_start": 1.0,
    "epsilon_end": 0.02,
    "epsilon_decay_steps": 8_000,
    "lr": 3e-4,
    "target_update_freq": 400,
    "hidden_dim": 128,
    "train_start_step": 500,
    "n_episodes": 600,

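    # Visualisation
    "fig_dpi": 200,

    # Reproducibility
    "seed": 42,
}

# Seed every RNG the notebook draws from (exploration, replay sampling, network init)
random.seed(CFG["seed"])
np.random.seed(CFG["seed"])
torch.manual_seed(CFG["seed"])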

# Cross-validation folds (expanding window)
CV_FOLDS = [
    {"train_end": "1990-12", "test_start": "1991-01",
     "test_end": "1995-12", "name": "Early 1990s"},
    {"train_end": "1995-12", "test_start": "1996-01",
     "test_end": "2000-12", "name": "Late 1990s"},
    {"train_end": "2000-12", "test_start": "2001-01",
     "test_end": "2007-12", "name": "2000s"},
    {"train_end": "2007-12", "test_start": "2008-01",
     "test_end": "2015-12", "name": "GFC & aftermath"},
    {"train_end": "2015-12", "test_start": "2016-01",
     "test_end": "2024-12", "name": "Recent"},
]

## 3. Data Loading

Monthly economic indicators from FRED:

| Variable | Series | Frequency | Role |
|----------|--------|-----------|------|
| CPI | CPIAUCSL | Monthly | Compute YoY inflation |
| Unemployment | UNRATE | Monthly | Labour market slack |
| Fed Funds Rate | FEDFUNDS | Monthly | Policy instrument |
| Capacity Utilisation | TCU | Monthly | Real-economy output measure |
| 10-Year Treasury | GS10 | Monthly | Term structure / expectations |
| Financial Conditions | NFCI | Weekly → Monthly | Financial stress indicator |
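
NFCI is published weekly, so it has to be collapsed to one value per month before merging. A minimal sketch on made-up weekly values, assuming the last reading of each month is the one kept and that months are labelled by their first day (`MS`) to match the other series:

```python
import pandas as pd

# Hypothetical weekly readings (values are made up for illustration)
weekly = pd.Series(
    [0.10, 0.12, 0.15, 0.11, 0.20, 0.22],
    index=pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19",
                          "2024-01-26", "2024-02-02", "2024-02-09"]),
    name="fin_conditions",
)

# Keep the last weekly reading in each calendar month, indexed at month start
monthly = weekly.resample("MS").last()
print(monthly)
```

Taking the last reading rather than the monthly mean keeps the value closest to what was known at month end; a mean would be an equally defensible choice.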

In [None]:
def load_data(path):
    """Load and merge monthly FRED CSVs."""
    def _read(names):
        for n in names:
            p = os.path.join(path, n)
            if os.path.exists(p):
                return pd.read_csv(p, parse_dates=["observation_date"],
                                   index_col="observation_date")
        raise FileNotFoundError(f"None of {names} found in {path}")

    cpi     = _read(["CPIAUCSL.csv"])
    unrate  = _read(["UNRATE.csv"])
    ff      = _read(["FEDFUNDS.csv", "FEDFUNDS-1.csv"])
    tcu     = _read(["TCU.csv"])
    gs10    = _read(["GS10.csv"])
    nfci_w  = _read(["NFCI.csv"])
    nfci    = nfci_w.resample("MS").last()

    merged = pd.concat([
        cpi.rename(columns={cpi.columns[0]: "cpi"}),
        unrate.rename(columns={unrate.columns[0]: "unemployment"}),
        ff.rename(columns={ff.columns[0]: "fed_funds"}),
        tcu.rename(columns={tcu.columns[0]: "capacity_util"}),
        gs10.rename(columns={gs10.columns[0]: "treasury_10y"}),
        nfci.rename(columns={nfci.columns[0]: "fin_conditions"}),
    ], axis=1).dropna()

    print(f"Merged panel: {len(merged)} monthly obs  "
          f"({merged.index.min():%Y-%m} to {merged.index.max():%Y-%m})")
    return merged

data = load_data(DATA_PATH)

## 4. Feature Engineering

**Derived indicators:**
- Inflation: year-over-year CPI percentage change
- Unemployment gap: deviation from the 5 % natural rate
- Capacity gap: deviation from full capacity (100 %)
- Term spread: 10-year Treasury minus fed funds rate

**Lag structure:** We create lags at 3, 6, 12, 18, 24, and 30 months
for each indicator. The shorter lags (3–12) provide more recent information;
the longer lags (18–30) approximate the realistic information set available
after accounting for publication delays.

**Target:** Inflation 12 months ahead — the standard central-bank forecast horizon.
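
To make the direction of the shifts concrete, here is a small sketch on a toy series whose value equals the month number. It assumes the same convention as the lagged columns (`shift(k)` looks back `k` months, `shift(-12)` looks forward); the data are invented:

```python
import pandas as pd

# Toy monthly series: the value equals the month number, so shifts are easy to read
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
toy = pd.DataFrame({"inflation": [float(m) for m in range(1, 49)]}, index=idx)

toy["L18_inflation"] = toy["inflation"].shift(18)          # value 18 months earlier
toy["inflation_12m_ahead"] = toy["inflation"].shift(-12)   # value 12 months later

# Rows without a full lag history or a realised target are dropped
usable = toy.dropna()
row = usable.loc[pd.Timestamp("2002-01-01")]   # month 25
print(row)
```

The row for month 25 pairs the value from month 7 (the lag) with the value from month 37 (the target), and only 18 of the 48 rows survive `dropna()`. The same trimming is why the engineered dataset is shorter than the merged panel.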

In [None]:
def engineer_features(data, natural_rate=5.0):
    df = data.copy()

    # Derived indicators
    df["inflation"]    = df["cpi"].pct_change(12) * 100
    df["unemp_gap"]    = df["unemployment"] - natural_rate
    df["capacity_gap"] = df["capacity_util"] - 100.0
    df["term_spread"]  = df["treasury_10y"] - df["fed_funds"]

    # Lagged features at both horizons
    lag_vars = ["inflation", "unemp_gap", "capacity_gap",
                "fed_funds", "term_spread", "fin_conditions"]
    for var in lag_vars:
        for lag in [3, 6, 12, 18, 24, 30]:
            df[f"L{lag}_{var}"] = df[var].shift(lag)

    # Forward target
    df["inflation_12m_ahead"] = df["inflation"].shift(-12)

    df = df.dropna()
    print(f"Final dataset: {len(df)} obs  "
          f"({df.index.min():%Y-%m} to {df.index.max():%Y-%m})")
    return df

historical = engineer_features(data, CFG["unemployment_natural"])

## 5. Data Overview

In [None]:
C1, C2, C3 = "#2563eb", "#dc2626", "#7c3aed"
C_BG = "#fafafa"

plt.rcParams.update({
    "figure.facecolor": C_BG, "axes.facecolor": C_BG,
    "axes.grid": True, "grid.color": "#e5e7eb", "grid.linewidth": 0.5,
    "font.size": 11, "axes.spines.top": False, "axes.spines.right": False,
})

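# NBER recession dates (peak to trough); 1980 and 1981-82 are separate recessions
RECESSIONS = [
    ("1980-01", "1980-07"), ("1981-07", "1982-11"),
    ("1990-07", "1991-03"), ("2001-03", "2001-11"),
    ("2007-12", "2009-06"), ("2020-02", "2020-04"),
]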

fig, axes = plt.subplots(3, 2, figsize=(15, 11))
fig.suptitle(f"US Economic Indicators (Monthly, {historical.index.min():%Y}-{historical.index.max():%Y})",
             fontweight="bold", fontsize=14, y=0.995)

panels = [
    ("inflation", "Inflation (YoY %)", C1, {"hline": 2.0, "hl": "Target"}),
    ("unemployment", "Unemployment rate (%)", C2, {"hline": 5.0, "hl": "Natural rate"}),
    ("capacity_util", "Capacity utilisation (%)", "#c2410c", {"hline": 100, "hl": "Full capacity"}),
    (["fed_funds", "treasury_10y"], "Interest rates (%)", None, {}),
    ("term_spread", "Term spread (10Y - FFR, pp)", "#7c3aed", {"hline": 0}),
    ("fin_conditions", "Financial Conditions Index", "#c2410c", {"hline": 0, "hl": "Neutral"}),
]

for ax, (col, title, color, opts) in zip(axes.flat, panels):
    if isinstance(col, list):
        ax.plot(historical.index, historical[col[0]], lw=1.5, color=C1, label="Fed funds")
        ax.plot(historical.index, historical[col[1]], lw=1.5, color=C2, alpha=0.7, label="10Y Treasury")
        ax.legend(fontsize=9, frameon=False)
    else:
        ax.plot(historical.index, historical[col], lw=1.5, color=color)
    if "hline" in opts:
        ax.axhline(opts["hline"], color="grey", ls="--", lw=1, alpha=0.6, label=opts.get("hl"))
        if "hl" in opts:
            ax.legend(fontsize=9, frameon=False)
    ax.set_title(title, fontweight="bold", fontsize=11)

fig.tight_layout()
fig.savefig(f"{OUTPUT_PATH}/data_overview.png", dpi=CFG["fig_dpi"], bbox_inches="tight")
plt.show()

## 6. Model Specifications

Five specifications test the effect of lag horizon, variable breadth, and
dimensionality on forecast accuracy:

| Spec | Features | Lags | Description |
|------|----------|------|-------------|
| SIMPLE | Inflation, fed funds | 18, 24, 30 | Core variables, realistic lags |
| EXPANDED | + unemployment, capacity | 18, 24, 30 | Real-economy measures added |
| FULL | + term spread, NFCI | 18, 24, 30 | All variables, realistic lags |
| INFORMATIVE | All variables | 3-30 | Intermediate + realistic lags |
| LEAN | Inflation, rate, unemp gap, spread | 3, 6, 12 (gap and spread: 6 only) | Focused set, fewer parameters |

The first three use only realistic lags (>=18 months). INFORMATIVE adds shorter
lags but has 36 features — a concern with ~460 training observations. LEAN targets
the best ratio of signal to model complexity: 8 features covering the core
macro variables at intermediate lags.
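
The feature counts quoted above follow from crossing variables with lags. A small sketch, assuming the `L{lag}_{variable}` naming used for the lagged columns; `lagged_names` is a hypothetical helper, not part of the notebook:

```python
def lagged_names(variables, lags):
    """Column names for every variable at every lag, e.g. 'L18_inflation'."""
    return [f"L{lag}_{var}" for var in variables for lag in lags]

ALL_VARS = ["inflation", "unemp_gap", "capacity_gap",
            "fed_funds", "term_spread", "fin_conditions"]

simple      = lagged_names(["inflation", "fed_funds"], [18, 24, 30])
full        = lagged_names(ALL_VARS, [18, 24, 30])
informative = lagged_names(ALL_VARS, [3, 6, 12, 18, 24, 30])

print(len(simple), len(full), len(informative))  # 6 18 36
```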

In [None]:
SPECS = {
    "SIMPLE": {
        "features": [
            "L18_inflation", "L24_inflation", "L30_inflation",
            "L18_fed_funds", "L24_fed_funds", "L30_fed_funds",
        ],
        "desc": "Core variables, realistic lags only",
    },
    "EXPANDED": {
        "features": [
            "L18_inflation", "L24_inflation", "L30_inflation",
            "L18_unemp_gap", "L24_unemp_gap", "L30_unemp_gap",
            "L18_capacity_gap", "L24_capacity_gap", "L30_capacity_gap",
            "L18_fed_funds", "L24_fed_funds", "L30_fed_funds",
        ],
        "desc": "With real-economy measures, realistic lags only",
    },
    "FULL": {
        "features": [
            "L18_inflation", "L24_inflation", "L30_inflation",
            "L18_unemp_gap", "L24_unemp_gap", "L30_unemp_gap",
            "L18_capacity_gap", "L24_capacity_gap", "L30_capacity_gap",
            "L18_fed_funds", "L24_fed_funds", "L30_fed_funds",
            "L18_term_spread", "L24_term_spread", "L30_term_spread",
            "L18_fin_conditions", "L24_fin_conditions", "L30_fin_conditions",
        ],
        "desc": "Full specification, realistic lags only",
    },
    "INFORMATIVE": {
        "features": [
            "L3_inflation", "L6_inflation", "L12_inflation",
            "L3_unemp_gap", "L6_unemp_gap", "L12_unemp_gap",
            "L3_capacity_gap", "L6_capacity_gap", "L12_capacity_gap",
            "L3_fed_funds", "L6_fed_funds", "L12_fed_funds",
            "L3_term_spread", "L6_term_spread", "L12_term_spread",
            "L3_fin_conditions", "L6_fin_conditions", "L12_fin_conditions",
            "L18_inflation", "L24_inflation", "L30_inflation",
            "L18_unemp_gap", "L24_unemp_gap", "L30_unemp_gap",
            "L18_capacity_gap", "L24_capacity_gap", "L30_capacity_gap",
            "L18_fed_funds", "L24_fed_funds", "L30_fed_funds",
            "L18_term_spread", "L24_term_spread", "L30_term_spread",
            "L18_fin_conditions", "L24_fin_conditions", "L30_fin_conditions",
        ],
        "desc": "All variables, intermediate + realistic lags (L3-L30)",
    },
    "LEAN": {
        "features": [
            "L3_inflation", "L6_inflation", "L12_inflation",
            "L3_fed_funds", "L6_fed_funds", "L12_fed_funds",
            "L6_unemp_gap", "L6_term_spread",
        ],
        "desc": "Focused: inflation + rate + unemployment + spread (8 features)",
    },
}

TARGET_COL = "inflation_12m_ahead"

for name, spec in SPECS.items():
    print(f"{name:14s}  {len(spec['features']):2d} features  |  {spec['desc']}")

## 7. Time-Series Cross-Validation

Expanding-window CV trains on all data up to each cutoff date, then tests
on the subsequent period. This avoids look-ahead bias and checks whether
the model generalises across structurally different economic regimes.
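
One detail matters with a 12-month-ahead target: a training row dated at the cutoff has a target that is only realised 12 months later. The sketch below shows expanding-window splitting with a gap equal to the forecast horizon, assuming month-start timestamps. `expanding_splits` and `horizon` are illustrative names, and the two folds are examples only:

```python
import pandas as pd

def expanding_splits(index, folds, horizon=12):
    """Yield (name, train_idx, test_idx) for expanding-window CV.

    Training rows stop `horizon` months before `train_end`, so that their
    forward-looking targets are realised no later than the cutoff.
    """
    for fold in folds:
        cutoff = pd.Timestamp(fold["train_end"]) - pd.DateOffset(months=horizon)
        train_idx = index[index <= cutoff]
        test_idx = index[(index >= pd.Timestamp(fold["test_start"]))
                         & (index <= pd.Timestamp(fold["test_end"]))]
        yield fold["name"], train_idx, test_idx

# Two illustrative folds on a plain monthly index
index = pd.date_range("1975-01-01", "2000-12-01", freq="MS")
folds = [
    {"train_end": "1990-12", "test_start": "1991-01",
     "test_end": "1995-12", "name": "Early 1990s"},
    {"train_end": "1995-12", "test_start": "1996-01",
     "test_end": "2000-12", "name": "Late 1990s"},
]
splits = list(expanding_splits(index, folds))
for name, tr, te in splits:
    print(f"{name:12s} train to {tr.max():%Y-%m} ({len(tr)} obs), "
          f"test {te.min():%Y-%m} to {te.max():%Y-%m} ({len(te)} obs)")
```

The training window grows from fold to fold while each test window stays fixed, which is what distinguishes an expanding window from a rolling one.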

In [None]:
cv_results = {}

for spec_name, spec in SPECS.items():
    print(f"\n--- {spec_name} ({len(spec['features'])} features) ---")
    fold_metrics = []

    for fold in CV_FOLDS:
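        # Stop training 12 months before the cutoff: the target is inflation
        # 12 months ahead, so later rows would carry test-period outcomes.
        train_cutoff = pd.Timestamp(fold["train_end"]) - pd.DateOffset(months=12)
        train = historical.loc[:train_cutoff]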
        test  = historical.loc[fold["test_start"]:fold["test_end"]]
        if len(test) < 10:
            continue

        scaler = StandardScaler()
        X_tr = scaler.fit_transform(train[spec["features"]])
        X_te = scaler.transform(test[spec["features"]])
        y_tr, y_te = train[TARGET_COL], test[TARGET_COL]

        model = MLPRegressor(
            hidden_layer_sizes=(64, 32), activation="relu",
            solver="adam", alpha=0.01, batch_size=32,
            learning_rate="adaptive", learning_rate_init=0.001,
            max_iter=500, early_stopping=True,
            validation_fraction=0.15, n_iter_no_change=20,
            random_state=42, verbose=False,
        )
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)

        mse = mean_squared_error(y_te, y_pred)
        mae = mean_absolute_error(y_te, y_pred)
        r2  = r2_score(y_te, y_pred)
        fold_metrics.append({"fold": fold["name"], "mse": mse, "mae": mae, "r2": r2})
        print(f"  {fold['name']:18s}  MSE={mse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")

    avg = {k: np.mean([f[k] for f in fold_metrics]) for k in ("mse", "mae", "r2")}
    cv_results[spec_name] = {"folds": fold_metrics, **avg}
    print(f"  {'Average':18s}  MSE={avg['mse']:.4f}  MAE={avg['mae']:.4f}  R2={avg['r2']:.4f}")

best_spec = min(cv_results, key=lambda s: cv_results[s]["mse"])
print(f"\nBest specification: {best_spec}  (avg CV MSE = {cv_results[best_spec]['mse']:.4f})")

### CV Results

In [None]:
specs_list = list(cv_results.keys())
folds_list = [f["fold"] for f in cv_results[specs_list[0]]["folds"]]
n_folds, n_specs = len(folds_list), len(specs_list)

fig, ax = plt.subplots(figsize=(13, 5.5))
x = np.arange(n_folds)
width = 0.8 / n_specs
colors = [C1, C2, C3, "#f59e0b", "#10b981"]

for i, spec in enumerate(specs_list):
    mses = [f["mse"] for f in cv_results[spec]["folds"]]
    ax.bar(x + i * width, mses, width, label=spec,
           color=colors[i], alpha=0.7, edgecolor="white", lw=1)

ax.set_xticks(x + width * (n_specs - 1) / 2)
ax.set_xticklabels(folds_list, fontsize=10)
ax.set_ylabel("Test MSE")
ax.set_title("Time-series CV: forecast error by specification and period", fontweight="bold")
ax.legend(frameon=False, fontsize=9)
fig.tight_layout()
fig.savefig(f"{OUTPUT_PATH}/cv_results.png", dpi=CFG["fig_dpi"], bbox_inches="tight")
plt.show()

## 8. Final Inflation Model

In [None]:
feats = SPECS[best_spec]["features"]
X, y = historical[feats], historical[TARGET_COL]
split = int(len(X) * 0.8)

final_scaler = StandardScaler()
X_tr = final_scaler.fit_transform(X.iloc[:split])
X_val = final_scaler.transform(X.iloc[split:])
y_tr, y_val = y.iloc[:split], y.iloc[split:]

final_model = MLPRegressor(
    hidden_layer_sizes=(64, 32), activation="relu",
    solver="adam", alpha=0.01, batch_size=32,
    learning_rate="adaptive", learning_rate_init=0.001,
    max_iter=500, early_stopping=True,
    validation_fraction=0.15, n_iter_no_change=20,
    random_state=42, verbose=False,
)
final_model.fit(X_tr, y_tr)

y_pred = final_model.predict(X_val)
print(f"Specification: {best_spec} ({len(feats)} features)")
print(f"Validation MSE: {mean_squared_error(y_val, y_pred):.4f}")
print(f"Validation R2:  {r2_score(y_val, y_pred):.4f}")

### Model Validation: Predicted vs Actual

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5.5))

y_tr_pred = final_model.predict(X_tr)
ax1.scatter(y_tr, y_tr_pred, alpha=0.4, s=15, color=C1, edgecolors="white", lw=0.3)
ax1.plot([y_tr.min(), y_tr.max()], [y_tr.min(), y_tr.max()], "k--", lw=1.5, alpha=0.5)
ax1.set_xlabel("Actual inflation (%)")
ax1.set_ylabel("Predicted inflation (%)")
ax1.set_title(f"Training (R2 = {r2_score(y_tr, y_tr_pred):.3f})", fontweight="bold")

ax2.scatter(y_val, y_pred, alpha=0.4, s=15, color=C2, edgecolors="white", lw=0.3)
ax2.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], "k--", lw=1.5, alpha=0.5)
ax2.set_xlabel("Actual inflation (%)")
ax2.set_ylabel("Predicted inflation (%)")
ax2.set_title(f"Validation (R2 = {r2_score(y_val, y_pred):.3f})", fontweight="bold")

fig.suptitle(f"{best_spec}: Predicted vs Actual Inflation (12-month ahead)", fontweight="bold")
fig.tight_layout()
fig.savefig(f"{OUTPUT_PATH}/model_validation.png", dpi=CFG["fig_dpi"], bbox_inches="tight")
plt.show()

## 9. RL Environment

The agent observes `(inflation, unemployment, capacity utilisation, current rate)`
and picks a discrete interest rate from 0–20 % in 0.5 pp increments.

The agent's rate choices are tracked in a rolling buffer and substituted into the
`fed_funds` lag features when building the inflation prediction input. This means
the agent's decisions propagate through the forecast model with the appropriate
delay — short lags (L3) reflect recent choices, longer lags (L18+) still come
from the historical record early in an episode.

Unemployment and capacity utilisation evolve from the historical record.
Only inflation is model-predicted.
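
The lag propagation can be illustrated with a plain `deque`. This is a sketch under two assumptions: position `k` in the buffer holds the rate in force `k` months before the current month, and a rate chosen in one step takes effect from the following month, so lags are read before the new choice is pushed. The rates are invented:

```python
from collections import deque

# Position k holds the rate in force k months ago; start from a flat 5 % history
history = deque([5.0] * 31, maxlen=31)

def lagged_rate(buffer, lag):
    """Rate that was in force `lag` months before the current month."""
    return buffer[lag]

seen_at_lag3 = []
for chosen in [1.0, 1.5, 2.0, 2.5, 3.0]:
    # Read the lag-3 input for this month, then record this month's choice
    seen_at_lag3.append(lagged_rate(history, 3))
    history.appendleft(chosen)

print(seen_at_lag3)  # [5.0, 5.0, 5.0, 5.0, 1.0]
```

The first choice only reaches the lag-3 input in the fifth month. For an 18-month lag, half of a 36-step episode passes before any agent decision affects the forecast at all.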

In [None]:
class MonetaryPolicyEnv(gym.Env):
    metadata = {"render_modes": []}

    def __init__(self, model, scaler, feature_cols, historical_df, cfg):
        super().__init__()
        self.model = model
        self.scaler = scaler
        self.feature_cols = feature_cols
        self.hist = historical_df.reset_index(drop=True)

        self.pi_target  = cfg["inflation_target"]
        self.u_target   = cfg["unemployment_natural"]
        self.w_pi       = cfg["omega_pi"]
        self.w_u        = cfg["omega_u"]
        self.w_smooth   = cfg["omega_smooth"]
        self.max_steps  = cfg["max_steps"]

        self.n_actions  = cfg["n_actions"]
        self.rate_grid  = np.linspace(cfg["min_rate"], cfg["max_rate"], self.n_actions)
        self.action_space = spaces.Discrete(self.n_actions)
        self.observation_space = spaces.Box(-np.inf, np.inf, (4,), dtype=np.float32)
        self.reset()

    def reset(self, seed=None, options=None):
        if seed is not None:
            super().reset(seed=seed)
            np.random.seed(seed)

        min_start = 30
        max_start = len(self.hist) - self.max_steps - 12 - 1
        self.start_idx = (min_start if max_start <= min_start
                          else np.random.randint(min_start, max_start))
        self.idx = self.start_idx
        self.step_count = 0
        self._done = False

        row = self.hist.iloc[self.idx]
        self.state = np.array([row["inflation"], row["unemployment"],
                               row["capacity_util"], row["fed_funds"]], dtype=np.float32)
        self.prev_rate = row["fed_funds"]

        # Rolling buffer for agent's rate history (lag propagation)
        self._rate_buffer = deque(
            [self.hist.iloc[max(0, self.idx - i)]["fed_funds"] for i in range(31)],
            maxlen=31)

        self.episode_history = []
        self.cumulative_reward = 0.0
        return self.state, {}

    def step(self, action):
        if self._done:
            raise RuntimeError("Episode finished")

        rate = float(self.rate_grid[int(action)])
        self.step_count += 1

        # Build feature vector, overriding fed_funds lags with the agent's buffer.
        # The buffer is read before the new rate is pushed, so position k lines up
        # with the L{k}_fed_funds column of the current row.
        row = self.hist.iloc[self.idx]
        feat_dict = {}
        for col in self.feature_cols:
            if col in row.index:
                if "fed_funds" in col:
                    lag = int(col.split("_")[0][1:])
                    feat_dict[col] = [self._rate_buffer[lag] if lag < len(self._rate_buffer)
                                      else row[col]]
                else:
                    feat_dict[col] = [row[col]]
            else:
                feat_dict[col] = [0.0]
        self._rate_buffer.appendleft(rate)

        features_scaled = self.scaler.transform(pd.DataFrame(feat_dict))
        next_pi = np.clip(float(self.model.predict(features_scaled)[0]), -2.0, 15.0)

        # Other variables from historical record
        next_idx = min(self.idx + 1, len(self.hist) - 1)
        next_u   = self.hist.iloc[next_idx]["unemployment"]
        next_cap = self.hist.iloc[next_idx]["capacity_util"]

        reward = -(self.w_pi * (next_pi - self.pi_target) ** 2
                   + self.w_u * (next_u - self.u_target) ** 2
                   + self.w_smooth * (rate - self.prev_rate) ** 2)

        self.episode_history.append({
            "inflation": next_pi, "unemployment": next_u,
            "capacity": next_cap, "rate": rate, "reward": reward})
        self.cumulative_reward += reward

        self.state = np.array([next_pi, next_u, next_cap, rate], dtype=np.float32)
        self.prev_rate = rate
        self.idx = next_idx
        self._done = self.step_count >= self.max_steps
        return self.state, reward, self._done, False, {}

env = MonetaryPolicyEnv(final_model, final_scaler, feats, historical, CFG)
print(f"Environment created  |  Actions: {env.n_actions}  |  Episode: {env.max_steps} months")

# Smoke test
state, _ = env.reset(seed=42)
next_state, reward, done, _, _ = env.step(env.action_space.sample())
print(f"Test step: reward = {reward:.2f}")

## 10. DQN Agent

Standard Deep Q-Network with experience replay, target network, and
epsilon-greedy exploration. The Q-network is a 3-layer MLP (128-128-41)
trained with Huber loss.
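
For reference, the one-step target and Huber loss that this setup optimises can be written out in a few lines. The sketch uses plain Python rather than PyTorch and assumes the standard DQN target $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$ for non-terminal transitions (Mnih et al., 2015). The Q-values are invented for illustration:

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step DQN target; no bootstrapping past the end of an episode."""
    return reward if done else reward + gamma * max(next_q_values)

def huber(prediction, target, delta=1.0):
    """Huber (smooth L1) loss: quadratic near zero, linear in the tails."""
    err = abs(prediction - target)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

# Invented numbers: target-network Q-values for three actions in the next state
y = td_target(reward=-2.0, next_q_values=[-40.0, -35.0, -50.0], done=False)
small = huber(prediction=y + 0.4, target=y)   # inside the quadratic zone
large = huber(prediction=y + 3.0, target=y)   # in the linear zone
print(y, small, large)
```

The linear tail is the reason for preferring Huber over squared error here: rewards are negative squared deviations and can be large, so early TD errors would otherwise produce very large gradients.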

In [None]:
Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def push(self, *args):
        self.buf.append(Transition(*args))
    def sample(self, n):
        return random.sample(self.buf, n)
    def __len__(self):
        return len(self.buf)

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

class DQNAgent:
    def __init__(self, env, cfg):
        self.env = env
        sd = env.observation_space.shape[0]
        ad = env.action_space.n
        self.buffer = ReplayBuffer(cfg["buffer_capacity"])
        self.batch_size = cfg["batch_size"]
        self.gamma = cfg["gamma"]
        self.epsilon = cfg["epsilon_start"]
        self.eps_min = cfg["epsilon_end"]
        self.eps_decay = (cfg["epsilon_end"] / cfg["epsilon_start"]) ** (1.0 / cfg["epsilon_decay_steps"])
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = QNetwork(sd, ad, cfg["hidden_dim"]).to(self.device)
        self.target_net = QNetwork(sd, ad, cfg["hidden_dim"]).to(self.device)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        self.optimiser = optim.Adam(self.policy_net.parameters(), lr=cfg["lr"])
        self.loss_fn = nn.SmoothL1Loss()
        self._steps = 0
        self._target_freq = cfg["target_update_freq"]

    def choose_action(self, state, greedy=False):
        if not greedy and np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        with torch.no_grad():
            t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            return self.policy_net(t).argmax().item()

    def store(self, s, a, r, s2, done):
        self.buffer.push(s, a, r, s2, done)

    def update(self):
        if len(self.buffer) < self.batch_size:
            return None
        batch = Transition(*zip(*self.buffer.sample(self.batch_size)))
        s  = torch.FloatTensor(np.array(batch.state)).to(self.device)
        a  = torch.LongTensor(batch.action).unsqueeze(1).to(self.device)
        r  = torch.FloatTensor(batch.reward).unsqueeze(1).to(self.device)
        s2 = torch.FloatTensor(np.array(batch.next_state)).to(self.device)
        d  = torch.BoolTensor(batch.done).unsqueeze(1).to(self.device)
        q = self.policy_net(s).gather(1, a)
        with torch.no_grad():
            q2 = self.target_net(s2).max(1)[0].unsqueeze(1)
            target = r + self.gamma * q2 * (~d)
        loss = self.loss_fn(q, target)
        self.optimiser.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.policy_net.parameters(), 10.0)
        self.optimiser.step()
        self._steps += 1
        if self._steps % self._target_freq == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        return loss.item()

    def decay_epsilon(self):
        self.epsilon = max(self.eps_min, self.epsilon * self.eps_decay)

agent = DQNAgent(env, CFG)
print(f"DQN agent on {agent.device}  |  epsilon: {CFG['epsilon_start']} -> {CFG['epsilon_end']}")

## 11. Training

In [None]:
episode_rewards, episode_losses = [], []
total_steps = 0
warmup = CFG["train_start_step"]

for ep in range(CFG["n_episodes"]):
    s, _ = env.reset()
    ep_r, ep_l, n_upd = 0.0, 0.0, 0

    for _ in range(env.max_steps):
        a = agent.choose_action(s)
        s2, r, done, _, _ = env.step(a)
        agent.store(s, a, r, s2, done)

        if total_steps > warmup:
            loss = agent.update()
            if loss is not None:
                ep_l += loss
                n_upd += 1

        s = s2
        ep_r += r
        total_steps += 1
        agent.decay_epsilon()
        if done:
            break
    episode_rewards.append(ep_r)
    episode_losses.append(ep_l / max(n_upd, 1))

    if (ep + 1) % 100 == 0:
        avg = np.mean(episode_rewards[-100:])
        print(f"  ep {ep+1:4d}  |  avg reward {avg:8.2f}  |  eps {agent.epsilon:.4f}")

print(f"\nFinal 50-episode avg reward: {np.mean(episode_rewards[-50:]):.2f}")

### Training Curves

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 7.5))
w = 30

ax1.plot(episode_rewards, alpha=0.2, color=C1, lw=0.8)
if len(episode_rewards) >= w:
    sm = np.convolve(episode_rewards, np.ones(w)/w, mode="valid")
    ax1.plot(range(w-1, len(episode_rewards)), sm, color=C1, lw=2.5, label=f"{w}-episode avg")
ax1.set_ylabel("Episode reward")
ax1.set_title("Training: cumulative reward per episode", fontweight="bold")
ax1.legend(frameon=False)

valid = [(i, l) for i, l in enumerate(episode_losses) if l and l > 0]
if valid and len(valid) >= w:
    ix, vals = zip(*valid)
    sm = np.convolve(vals, np.ones(w)/w, mode="valid")
    ax2.plot(range(ix[0]+w-1, ix[0]+w-1+len(sm)), sm, color=C2, lw=2.5)
ax2.set_ylabel("TD loss")
ax2.set_xlabel("Episode")
ax2.set_title("Training: Huber loss", fontweight="bold")

fig.tight_layout(h_pad=2.5)
fig.savefig(f"{OUTPUT_PATH}/training_curves.png", dpi=CFG["fig_dpi"], bbox_inches="tight")
plt.show()

## 12. Policy Comparison

Generate rate recommendations for the full historical record under each policy:
- **Federal Reserve** (actual decisions)
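- **Taylor Rule**: $i_t = r^* + \pi_t + 0.5(\pi_t - \pi^*) - 0.5(u_t - u^*)$, i.e. Taylor's (1993) rule with the output gap replaced by the negative of the unemployment gap
- **DQN agent** (greedy policy)

In [None]:
def taylor_rule(pi, u, r_star=2.0, pi_star=2.0, u_star=5.0, a_pi=0.5, a_u=0.5):
    # Unemployment above its natural rate signals slack, so it lowers the prescribed rate
    return np.clip(r_star + pi + a_pi * (pi - pi_star) - a_u * (u - u_star), 0, 20)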

df = historical.copy()
df["taylor_rate"] = df.apply(lambda r: taylor_rule(r["inflation"], r["unemployment"]), axis=1)
df["dqn_rate"] = np.nan

agent.epsilon = 0.0  # greedy
for i in range(len(df)):
    if i < 30:
        continue
    row = df.iloc[i]
    state = np.array([row["inflation"], row["unemployment"],
                      row["capacity_util"], row["fed_funds"]], dtype=np.float32)
    action = agent.choose_action(state, greedy=True)
    df.iloc[i, df.columns.get_loc("dqn_rate")] = env.rate_grid[action]

print(f"Generated policy recommendations for {df['dqn_rate'].notna().sum()} months")

### Historical Rate Comparison

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)
fig.suptitle("Monetary policy comparison: DQN vs Taylor Rule vs Federal Reserve "
             f"({df.index.min():%Y}-{df.index.max():%Y})",
             fontweight="bold", fontsize=13, y=0.995)

axes[0].plot(df.index, df["fed_funds"], lw=2, color=C1, label="Federal Reserve (actual)", alpha=0.9)
axes[0].plot(df.index, df["taylor_rate"], lw=1.8, color=C2, ls="--", label="Taylor Rule", alpha=0.8)
axes[0].plot(df.index, df["dqn_rate"], lw=2, color=C3, ls=":", label="DQN agent", alpha=0.85)
for s, e in RECESSIONS:
    # Convert to timestamps so the spans sit on the datetime x-axis
    axes[0].axvspan(pd.Timestamp(s), pd.Timestamp(e), alpha=0.12, color="#fca5a5", zorder=0)
axes[0].set_ylabel("Nominal interest rate (%)")
axes[0].set_ylim(-1, 22)
axes[0].legend(loc="upper right", frameon=True, fontsize=10)
axes[0].set_title("Interest-rate policies", fontweight="bold")

t_dev = df["taylor_rate"] - df["fed_funds"]
d_dev = df["dqn_rate"] - df["fed_funds"]
axes[1].fill_between(df.index, t_dev, 0, alpha=0.15, color=C2)
axes[1].fill_between(df.index, d_dev, 0, alpha=0.15, color=C3)
axes[1].plot(df.index, t_dev, lw=1.5, color=C2, label="Taylor deviation", alpha=0.8)
axes[1].plot(df.index, d_dev, lw=1.5, color=C3, label="DQN deviation", alpha=0.8)
axes[1].axhline(0, color=C1, lw=1, alpha=0.4)
axes[1].set_ylabel("Deviation from Fed rate (pp)")
axes[1].set_xlabel("Year")
axes[1].legend(loc="upper right", frameon=True, fontsize=10)
axes[1].set_title("Policy deviations from actual Federal Reserve decisions", fontweight="bold")

fig.tight_layout()
fig.savefig(f"{OUTPUT_PATH}/policy_comparison.png", dpi=CFG["fig_dpi"], bbox_inches="tight")
plt.show()

## 13. Performance Metrics

In [None]:
print("Deviation from actual Federal Reserve decisions:\n")
# Compare both policies on the same months (the DQN series skips the first 30 rows)
common = df.dropna(subset=["dqn_rate"])
mads = {}
for name, col in [("Taylor Rule", "taylor_rate"), ("DQN Agent", "dqn_rate")]:
    dev = common[col] - common["fed_funds"]
    mads[col] = np.abs(dev).mean()
    rmse = np.sqrt((dev ** 2).mean())
    print(f"  {name:15s}  MAD = {mads[col]:.3f} pp  |  RMSE = {rmse:.3f} pp")

# Relative difference in MAD, reported whichever way it goes
t_mad, d_mad = mads["taylor_rate"], mads["dqn_rate"]
change = (t_mad - d_mad) / t_mad * 100
direction = "reduces" if change > 0 else "increases"
print(f"\n  DQN {direction} MAD by {abs(change):.1f}% relative to Taylor Rule")

## 14. Discussion

**What the model does:** A neural network forecasts inflation 12 months ahead
using lagged macroeconomic indicators. A DQN agent then uses that forecast,
together with current unemployment and capacity data, to learn a rate-setting
policy over 3-year simulated episodes. The agent is evaluated by comparing
its recommendations against the Taylor Rule and actual Fed decisions across
50 years of history.

**Information-lag tradeoff:** The cross-validation results quantify the cost
of realistic information constraints. Specifications restricted to 18–30 month
lags perform substantially worse on the Early 1990s fold, where the model
must extrapolate from the volatile 1970s–80s into a structurally different
disinflation period. Adding intermediate lags (3–12 months) reduces forecast
error, particularly for more stable post-2000 regimes.

**Limitations:**
- The inflation forecast model has limited out-of-sample accuracy, especially
  across regime changes. The agent's policy quality is bounded by model quality.
- Unemployment and capacity evolve from historical data rather than responding
  to the agent's rate choices. A full general-equilibrium model would capture
  these feedback effects but introduces the compounding-error problem.
- The agent's tendency to anchor near the current fed funds rate partly reflects
  the rate-smoothing penalty rather than learned macro intuition.

**References:**
1. Taylor, J. B. (1993). Discretion versus policy rules in practice. *Carnegie-Rochester Conference Series on Public Policy.*
2. Hinterlang, N. & Tanzer, P. M. (2021). Monetary policy using reinforcement learning. *arXiv:2108.01195.*
3. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. *Nature.*

**Data:** Monthly FRED series, 1971–2025. Retrieved February 2026.