# Synthetic CDS Data Generation

This notebook generates synthetic CDS spread data for basket CDS pricing analysis.

**Outputs:**
- Time series of 5Y CDS spreads (monthly)
- CDS curve snapshot (term structure)

**Model:** Log-normal mean-reverting (OU on log-spreads)
```
d(log S) = κ * (log θ - log S) * dt + σ * dW
```
This ensures positive spreads and percentage-based volatility.
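
As a rough illustration of the model above, the sketch below simulates a single monthly path using the exact one-step transition of an OU process applied to log-spreads. It is not the code inside `CDSDataGenerator`: the function name `simulate_log_ou` and its arguments are hypothetical, and regimes and cross-entity correlation are ignored.

```python
import numpy as np

def simulate_log_ou(s0, theta, kappa, sigma, n_steps, dt=1.0 / 12.0, seed=0):
    """Simulate one spread path under an OU process on log-spreads.

    Hypothetical helper for illustration only. It uses the exact
    one-step transition of the OU process, so the step size dt does
    not introduce discretisation bias.
    """
    rng = np.random.default_rng(seed)
    decay = np.exp(-kappa * dt)
    # Standard deviation of log S over one step of length dt
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * kappa))

    log_s = np.empty(n_steps + 1)
    log_s[0] = np.log(s0)
    for t in range(n_steps):
        # Conditional mean pulls log S back toward log(theta)
        mean = np.log(theta) + (log_s[t] - np.log(theta)) * decay
        log_s[t + 1] = mean + step_std * rng.standard_normal()

    # Exponentiating guarantees strictly positive spreads
    return np.exp(log_s)

# Ten years of monthly observations for a 75 bps name with 35% vol
path = simulate_log_ou(s0=75.0, theta=75.0, kappa=1.0, sigma=0.35, n_steps=120)
```

Because the simulation runs in log space, every simulated spread is positive and `sigma` acts as a proportional (percentage) volatility.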

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import importlib
import data.generator
importlib.reload(data.generator)
from data.generator import CDSDataGenerator, EntityConfig, Regime

---
## 1. Configuration

### 1.1 Output Files

In [None]:
# Output file paths
OUTPUT_TIME_SERIES = "data/synthetic_cds_5y_monthly.csv"
OUTPUT_CURVE = "data/synthetic_cds_curve.csv"

# Random seed for reproducibility
SEED = 42

### 1.2 Reference Entities

Using dummy names to keep synthetic data separate from real data.

**Parameters:**
- `base_spread_5y`: Long-run level θ that the 5Y spread reverts to, in bps
- `volatility_pct`: Annualised volatility as decimal (0.30 = 30%)
- `mean_reversion_speed`: κ (annualised). κ=1.0 gives half-life ≈ 8 months
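
The half-life quoted above follows from the OU decay factor: a deviation from the long-run level shrinks by `exp(-κ t)`, so it halves after `ln(2) / κ` years. A minimal sketch (the helper name `half_life_months` is illustrative, not part of `data.generator`):

```python
import numpy as np

def half_life_months(kappa):
    """Half-life of a deviation from the long-run level, in months."""
    return 12.0 * np.log(2.0) / kappa

# Mean-reversion speeds used for the entities in this notebook
for kappa in (0.7, 0.8, 1.0, 1.2):
    print(f"kappa={kappa:.1f} -> half-life ~ {half_life_months(kappa):.1f} months")
```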

In [None]:
entities = [
    EntityConfig(
        name="Alpha Bank Corp",
        ticker="ALPHA",
        sector="financials",
        base_spread_5y=75,       # bps
        volatility_pct=0.35,     # 35% annual vol
        mean_reversion_speed=1.0
    ),
    EntityConfig(
        name="Beta Energy Inc",
        ticker="BETA",
        sector="energy",
        base_spread_5y=55,
        volatility_pct=0.40,     # energy more volatile
        mean_reversion_speed=0.8
    ),
    EntityConfig(
        name="Gamma Tech Ltd",
        ticker="GAMMA",
        sector="tech",
        base_spread_5y=45,       # lower spread (higher quality)
        volatility_pct=0.30,
        mean_reversion_speed=1.2
    ),
    EntityConfig(
        name="Delta Industrial Co",
        ticker="DELTA",
        sector="industrials",
        base_spread_5y=100,      # higher spread (more risk)
        volatility_pct=0.45,
        mean_reversion_speed=0.7
    ),
    EntityConfig(
        name="Epsilon Telecom",
        ticker="EPSILON",
        sector="telecoms",
        base_spread_5y=65,
        volatility_pct=0.25,     # telecoms more stable
        mean_reversion_speed=1.0
    ),
]

# Display configuration
pd.DataFrame([{
    "Ticker": e.ticker,
    "Sector": e.sector,
    "Base 5Y (bps)": e.base_spread_5y,
    "Vol (annual %)": f"{e.volatility_pct:.0%}",
    "κ": e.mean_reversion_speed
} for e in entities])

### 1.3 Correlation Matrix

Target correlation for spread changes. This is what we expect to recover in copula calibration.

**Design choices:**
- Higher correlation between related sectors (e.g. energy and industrials at 0.55)
- All positive (credit risk is systemic)

In [None]:
# Correlation matrix for spread changes
# Order: ALPHA, BETA, GAMMA, DELTA, EPSILON

correlation_matrix = np.array([
    #  ALPHA   BETA  GAMMA  DELTA  EPSILON
    [  1.00,  0.35,  0.40,  0.45,   0.50],  # ALPHA (financials)
    [  0.35,  1.00,  0.25,  0.55,   0.30],  # BETA (energy)
    [  0.40,  0.25,  1.00,  0.35,   0.45],  # GAMMA (tech)
    [  0.45,  0.55,  0.35,  1.00,   0.40],  # DELTA (industrials)
    [  0.50,  0.30,  0.45,  0.40,   1.00],  # EPSILON (telecoms)
])

# Verify symmetry and PSD
assert np.allclose(correlation_matrix, correlation_matrix.T), "Matrix not symmetric"
eigenvalues = np.linalg.eigvalsh(correlation_matrix)
print(f"Eigenvalues: {eigenvalues.round(4)}")
is_psd = bool(np.all(eigenvalues >= -1e-10))
print(f"Matrix is PSD: {is_psd}")
assert is_psd, "Matrix not positive semi-definite"
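
A target correlation matrix that passes these checks can be imposed on independent Gaussian draws through its Cholesky factor. The sketch below shows the standard technique on a small 3×3 example. Whether `CDSDataGenerator` uses Cholesky or another factorisation is an assumption, and the matrix and variable names here are illustrative.

```python
import numpy as np

# Small illustrative target (not the notebook's 5x5 matrix)
target = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])

rng = np.random.default_rng(0)
chol = np.linalg.cholesky(target)       # lower-triangular L with L @ L.T == target
z = rng.standard_normal((200_000, 3))   # independent N(0, 1) draws
shocks = z @ chol.T                     # rows are now correlated draws

# With many draws the sample correlation should sit close to the target
sample_corr = np.corrcoef(shocks, rowvar=False)
max_abs_error = np.abs(sample_corr - target).max()
print(f"Largest deviation from target: {max_abs_error:.4f}")
```

Note that `np.linalg.cholesky` requires a strictly positive-definite matrix, which is slightly stronger than the PSD check above.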

In [None]:
# Visualise correlation matrix
tickers = [e.ticker for e in entities]

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt=".2f",
    cmap="RdYlGn",
    center=0,
    xticklabels=tickers,
    yticklabels=tickers,
    vmin=-1,
    vmax=1,
    ax=ax
)
ax.set_title("Target Correlation Matrix (Spread Changes)")
plt.tight_layout()
plt.show()

### 1.4 Regime Structure

Define periods of normal and stressed market conditions.

**Parameters:**
- `vol_multiplier`: Scales volatility (2.0 = double normal vol)
- `spread_shift_pct`: Shifts mean level (0.5 = 50% higher mean)
- `correlation_multiplier`: Scales correlations (values above 1.0 push them toward 1, mimicking contagion)
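
How the multiplier is applied is internal to `data.generator` and not shown in this notebook. One plausible scheme, sketched below purely as an assumption, scales the off-diagonal entries, caps them below 1, and repairs the matrix if it stops being positive semi-definite. The function `stress_correlation` and its `cap` argument are hypothetical.

```python
import numpy as np

def stress_correlation(corr, multiplier, cap=0.95):
    """Scale off-diagonal correlations by `multiplier`, capped at `cap`.

    Hypothetical helper: if the scaled matrix is no longer positive
    semi-definite, negative eigenvalues are clipped to a small floor
    and the result is rescaled back to a unit diagonal.
    """
    corr = np.asarray(corr, dtype=float)
    stressed = np.clip(corr * multiplier, -cap, cap)
    np.fill_diagonal(stressed, 1.0)

    eigvals, eigvecs = np.linalg.eigh(stressed)
    if eigvals.min() < 0.0:
        eigvals = np.clip(eigvals, 1e-8, None)
        stressed = (eigvecs * eigvals) @ eigvecs.T
        # Rescale so the diagonal is exactly 1 again
        scale = 1.0 / np.sqrt(np.diag(stressed))
        stressed = stressed * np.outer(scale, scale)
        stressed = (stressed + stressed.T) / 2.0
    return stressed

base = np.array([[1.0, 0.5], [0.5, 1.0]])
stressed = stress_correlation(base, multiplier=1.5)
print(stressed)
```

The cap matters because a multiplier applied naively can produce entries above 1, which is not a valid correlation.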

In [None]:
regimes = [
    Regime(
        name="pre_covid_normal",
        start="2016-01",
        end="2020-02",
        vol_multiplier=1.0,
        spread_shift_pct=0.0,
        correlation_multiplier=1.0
    ),
    Regime(
        name="covid_stress",
        start="2020-03",
        end="2020-06",
        vol_multiplier=2.5,          # elevated volatility
        spread_shift_pct=0.8,        # spreads 80% higher
        correlation_multiplier=1.5   # correlations increase
    ),
    Regime(
        name="covid_recovery",
        start="2020-07",
        end="2021-06",
        vol_multiplier=1.5,
        spread_shift_pct=0.3,
        correlation_multiplier=1.2
    ),
    Regime(
        name="post_covid_normal",
        start="2021-07",
        end="2023-02",
        vol_multiplier=1.0,
        spread_shift_pct=0.0,
        correlation_multiplier=1.0
    ),
    Regime(
        name="banking_stress",
        start="2023-03",
        end="2023-06",
        vol_multiplier=1.8,
        spread_shift_pct=0.4,
        correlation_multiplier=1.3
    ),
    Regime(
        name="final_normal",
        start="2023-07",
        end="2025-12",
        vol_multiplier=1.0,
        spread_shift_pct=0.0,
        correlation_multiplier=1.0
    ),
]

# Display regime configuration
pd.DataFrame([{
    "Regime": r.name,
    "Start": r.start,
    "End": r.end,
    "Vol ×": r.vol_multiplier,
    "Spread Shift": f"+{r.spread_shift_pct:.0%}" if r.spread_shift_pct else "0%",
    "Corr ×": r.correlation_multiplier
} for r in regimes])

### 1.5 Curve Configuration

**Spread-dependent curvature:**
- Low spread (high quality) → lower alpha → more curvature (steeper short-end rise)
- High spread (lower quality) → higher alpha → less curvature

Formula: `alpha = BASE_CURVATURE + SPREAD_SENSITIVITY × spread_5y`

In [None]:
# Curve snapshot parameters
CURVE_AS_OF_DATE = "2025-12-31"
TENORS = [1, 2, 3, 4, 5]  # years
RECOVERY_RATE = 0.40

# Spread-dependent curve shape
# alpha = BASE_CURVATURE + SPREAD_SENSITIVITY * spread_5y
# Lower alpha = more curvature (steeper short-end rise)
BASE_CURVATURE = 0.35       # base exponent
SPREAD_SENSITIVITY = 0.003  # each 100bps adds 0.3 to alpha
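
To make the alpha formula concrete, the sketch below evaluates it for a few 5Y spread levels. The power-law shape `S(T) = S_5Y * (T / 5) ** alpha` used in `power_law_curve` is an assumption chosen to match the comments above (lower alpha means a steeper short-end rise, and the 5Y point equals the 5Y spread). The actual shape is defined inside `generate_curve_snapshot`, which is not shown here.

```python
BASE_CURVATURE = 0.35
SPREAD_SENSITIVITY = 0.003

def curve_alpha(spread_5y_bps):
    """Exponent from the notebook's formula: base + sensitivity * spread."""
    return BASE_CURVATURE + SPREAD_SENSITIVITY * spread_5y_bps

def power_law_curve(spread_5y_bps, tenors=(1, 2, 3, 4, 5)):
    """ASSUMED curve shape: S(T) = S_5Y * (T / 5) ** alpha."""
    alpha = curve_alpha(spread_5y_bps)
    return {t: spread_5y_bps * (t / 5.0) ** alpha for t in tenors}

for spread in (45.0, 75.0, 100.0):
    curve = power_law_curve(spread)
    ratio_1y = curve[1] / curve[5]
    print(f"5Y={spread:5.1f}bps  alpha={curve_alpha(spread):.3f}  1Y/5Y={ratio_1y:.2f}")
```

Under this assumed shape, the 1Y/5Y ratio shows how much of the 5Y spread is already reached at the one-year point.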

---
## 2. Generate Data

In [None]:
# Initialize generator
gen = CDSDataGenerator(seed=SEED)

# Configure
gen.set_entities(entities)
gen.set_correlation_matrix(correlation_matrix)
gen.set_regimes(regimes)

print("Generator configured.")

In [None]:
# Generate time series
ts_df = gen.generate_time_series(
    start="2016-01-31",
    end="2025-12-31",
    freq="M"
)

print(f"Generated {len(ts_df)} rows")
print(f"Date range: {ts_df['date'].min().date()} to {ts_df['date'].max().date()}")
print(f"Entities: {ts_df['ticker'].nunique()}")
ts_df.head(10)

In [None]:
# Generate curve snapshot
curve_df = gen.generate_curve_snapshot(
    as_of=CURVE_AS_OF_DATE,
    tenors=TENORS,
    recovery_rate=RECOVERY_RATE,
    base_curvature=BASE_CURVATURE,
    spread_sensitivity=SPREAD_SENSITIVITY
)

print(f"Generated curve with {len(curve_df)} rows")
curve_df

---
## 3. Validation

### 3.1 Spread Time Series

In [None]:
# Pivot for plotting
ts_wide = ts_df.pivot(index="date", columns="ticker", values="cds_5y_spread_bps")

# Plot spread paths
fig, ax = plt.subplots(figsize=(12, 6))
ts_wide.plot(ax=ax, linewidth=1.5)

# Mark stress periods
stress_periods = [
    ("2020-03-01", "2020-06-30", "COVID"),
    ("2023-03-01", "2023-06-30", "Banking"),
]
for start, end, label in stress_periods:
    ax.axvspan(pd.to_datetime(start), pd.to_datetime(end), alpha=0.2, color="red")

ax.set_xlabel("Date")
ax.set_ylabel("5Y CDS Spread (bps)")
ax.set_title("Synthetic 5Y CDS Spreads")
ax.legend(loc="upper left")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
ts_wide.describe().round(1)

### 3.2 Correlation: Target vs Realised

In [None]:
# Get correlation comparison
corr_comparison = gen.get_correlation_comparison()
print("Target vs Realised Correlation:")
corr_comparison

In [None]:
# Visual comparison
realised_corr = gen.get_realised_correlation()

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Target
sns.heatmap(
    correlation_matrix, annot=True, fmt=".2f", cmap="RdYlGn", center=0,
    xticklabels=tickers, yticklabels=tickers, vmin=-1, vmax=1, ax=axes[0]
)
axes[0].set_title("Target Correlation")

# Realised
sns.heatmap(
    realised_corr, annot=True, fmt=".2f", cmap="RdYlGn", center=0,
    xticklabels=tickers, yticklabels=tickers, vmin=-1, vmax=1, ax=axes[1]
)
axes[1].set_title("Realised Correlation")

# Difference
diff = realised_corr - correlation_matrix
sns.heatmap(
    diff, annot=True, fmt=".2f", cmap="RdBu_r", center=0,
    xticklabels=tickers, yticklabels=tickers, vmin=-0.3, vmax=0.3, ax=axes[2]
)
axes[2].set_title("Difference (Realised - Target)")

plt.tight_layout()
plt.show()
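
When reading the difference panel, it helps to know how much sampling noise to expect. With 120 month-end observations there are roughly 119 monthly changes, and the Fisher z-transform gives an approximate confidence interval for a sample correlation. The sketch below assumes i.i.d. Gaussian changes, which the stress regimes violate, so treat the interval as a rough guide only. The helper `fisher_interval` is illustrative.

```python
import numpy as np

def fisher_interval(r, n_obs, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation via Fisher z."""
    z = np.arctanh(r)                   # Fisher transform of the correlation
    se = 1.0 / np.sqrt(n_obs - 3)       # standard error in z-space
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))

lo, hi = fisher_interval(0.45, n_obs=119)
print(f"Target 0.45 with 119 observations: roughly [{lo:.2f}, {hi:.2f}]")
```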

### 3.3 Distribution of Spread Changes

In [None]:
# Calculate spread changes
changes = ts_wide.diff().dropna()

# One panel per entity, sized from the ticker list rather than hard-coded
fig, axes = plt.subplots(1, len(tickers), figsize=(3 * len(tickers), 3))

for i, ticker in enumerate(tickers):
    axes[i].hist(changes[ticker], bins=30, edgecolor="black", alpha=0.7)
    axes[i].axvline(0, color="red", linestyle="--", linewidth=1)
    axes[i].set_title(ticker)
    axes[i].set_xlabel("Monthly Δ (bps)")

plt.suptitle("Distribution of Monthly Spread Changes", y=1.02)
plt.tight_layout()
plt.show()
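
The histograms above use absolute changes in bps. Because the model is defined on log-spreads, log changes are the quantity that is conditionally Gaussian, so it can be useful to look at both. A small sketch on toy data (the tickers and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy spread levels in bps, indexed by month-end dates
toy = pd.DataFrame(
    {"AAA": [50.0, 55.0, 52.0, 60.0], "BBB": [100.0, 90.0, 95.0, 110.0]},
    index=pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"]),
)

abs_changes = toy.diff().dropna()           # change in bps
log_changes = np.log(toy).diff().dropna()   # log-return of the spread

print(abs_changes)
print(log_changes.round(4))
```

In the notebook itself the same transformation would be `np.log(ts_wide).diff().dropna()`.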

### 3.4 Scatter Matrix (Spread Changes)

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(changes, figsize=(10, 10), diagonal="hist", alpha=0.6)
plt.suptitle("Pairwise Spread Changes", y=1.02)
plt.tight_layout()
plt.show()

### 3.5 CDS Curves

In [None]:
# Plot term structures
fig, ax = plt.subplots(figsize=(8, 5))

for ticker in tickers:
    entity_curve = curve_df[curve_df["ticker"] == ticker]
    alpha = entity_curve["curve_alpha"].iloc[0]
    ax.plot(
        entity_curve["tenor_years"],
        entity_curve["cds_spread_bps"],
        marker="o",
        label=f"{ticker} (α={alpha:.2f})"
    )

ax.set_xlabel("Tenor (years)")
ax.set_ylabel("CDS Spread (bps)")
ax.set_title(f"CDS Term Structures as of {CURVE_AS_OF_DATE}")
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xticks(TENORS)
plt.tight_layout()
plt.show()

In [None]:
# Show curve alphas by entity
curve_df[["ticker", "cds_5y_spread_bps", "curve_alpha"]].drop_duplicates()

In [None]:
# Verify 5Y spread matches terminal time series value
terminal_ts = ts_df[ts_df["date"] == ts_df["date"].max()][["ticker", "cds_5y_spread_bps"]]
terminal_curve = curve_df[curve_df["tenor_years"] == 5][["ticker", "cds_spread_bps"]]

check = terminal_ts.merge(terminal_curve, on="ticker")
check["match"] = np.isclose(check["cds_5y_spread_bps"], check["cds_spread_bps"])
print("Terminal spreads match curve 5Y points:")
check

---
## 4. Save Data

In [None]:
# Save time series
gen.save_time_series(OUTPUT_TIME_SERIES)
print(f"✓ Saved time series to: {OUTPUT_TIME_SERIES}")

# Save curve
gen.save_curve(curve_df, OUTPUT_CURVE)
print(f"✓ Saved curve to: {OUTPUT_CURVE}")

---
## 5. Summary

**Generated datasets:**

| File | Description |
|------|-------------|
| `synthetic_cds_5y_monthly.csv` | Monthly 5Y CDS spreads, 2016-2025 |
| `synthetic_cds_curve.csv` | Term structure snapshot at 2025-12-31 |

In [None]:
print("=" * 50)
print("GENERATION PARAMETERS")
print("=" * 50)

print("\n[Entities]")
for e in entities:
    print(f"  {e.ticker:8} | base={e.base_spread_5y:3}bps | vol={e.volatility_pct:.0%} | κ={e.mean_reversion_speed}")

print("\n[Target Correlation Matrix]")
print(pd.DataFrame(correlation_matrix, index=tickers, columns=tickers).to_string())

print("\n[Regimes]")
for r in regimes:
    shift_str = f"+{r.spread_shift_pct:.0%}" if r.spread_shift_pct else "0%"
    print(f"  {r.start} → {r.end} | {r.name:20} | vol×{r.vol_multiplier:.1f} | shift={shift_str} | corr×{r.correlation_multiplier:.1f}")

print("\n[Curve]")
print(f"  Recovery rate: {RECOVERY_RATE}")
print(f"  Base curvature: {BASE_CURVATURE}")
print(f"  Spread sensitivity: {SPREAD_SENSITIVITY}")

print(f"\n[Seed: {SEED}]")