# Implied-Vol Surface Completion (FC-NN + BS layer)
## Status: work in progress. This project is NOT completed yet.


**Goal (snapshot t_0):** Given a few option quotes at one timestamp, complete an **arbitrage-free** surface of prices/IV across strikes K and tenors T.  
**Method:** Fully-connected MLP → predicted IV 𝜎̂(K,T) → **Black–Scholes** price layer → fit to observed quotes.  
**Stretch:** Small **PINN** term (BS-PDE residual + simple boundary/terminal penalties) as regularizer.  
**Deliverables:** error/arb metrics, plots, ablation (baseline / +no-arb / +PINN / ensemble).
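
As a rough illustration of the pipeline in **Method**, the forward pass can be sketched in plain NumPy. This is only a sketch under assumptions: the input features (log-moneyness and tenor), the layer sizes, the softplus output, and the names `mlp_iv` and `bs_call` are all hypothetical, the weights are random, and nothing is trained. Actual fitting would need an autodiff framework.

```python
import numpy as np
from scipy.stats import norm


def bs_call(S, K, T, r, sigma):
    """Black-Scholes call price (no dividends), vectorized over NumPy arrays."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)


def mlp_iv(features, params):
    """Tiny MLP: features -> tanh hidden layer -> softplus, so the IV is positive."""
    W1, b1, W2, b2 = params
    hidden = np.tanh(features @ W1 + b1)
    raw = hidden @ W2 + b2
    return np.logaddexp(0.0, raw).ravel()  # softplus(raw) = log(1 + exp(raw))


rng = np.random.default_rng(0)
S, r = 100.0, 0.02
K = np.array([80.0, 100.0, 120.0])
T = np.array([0.25, 0.5, 1.0])

# Assumed inputs: log-moneyness and tenor (not specified in this README).
features = np.column_stack([np.log(K / S), T])
params = (rng.normal(size=(2, 8)), np.zeros(8), rng.normal(size=(8, 1)), np.zeros(1))

sigma_hat = mlp_iv(features, params)        # predicted IV, one value per (K, T)
price_hat = bs_call(S, K, T, r, sigma_hat)  # BS layer maps IV to a price
```

The loss would then compare `price_hat` with the observed quotes, so the network is fit in price space while its output stays interpretable as an implied vol.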


We will build a Black–Scholes pricer in-house and follow this methodology:
 - Generate synthetic data with the Black–Scholes model for pretraining
 - Use Kaggle dataset(s) for fine-tuning


We may also experiment with a joint curriculum: train on a 90% synthetic / 10% real mix, and validate/test on real data only.
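
One way the 90/10 mix could be implemented is at the batch level. The sketch below assumes the synthetic and real sets are indexed separately; the function name `mixed_batch` and its parameters are hypothetical, not part of this project.

```python
import numpy as np


def mixed_batch(n_synth, n_real, batch_size=100, real_frac=0.10, rng=None):
    """Return (synthetic_idx, real_idx) for one batch with a fixed real-data share."""
    rng = np.random.default_rng(rng)
    n_real_in_batch = int(round(real_frac * batch_size))
    n_synth_in_batch = batch_size - n_real_in_batch
    synth_idx = rng.choice(n_synth, size=n_synth_in_batch, replace=False)
    # The real set is usually small, so sample with replacement if it is
    # smaller than the number of real rows requested.
    real_idx = rng.choice(n_real, size=n_real_in_batch,
                          replace=n_real < n_real_in_batch)
    return synth_idx, real_idx


synth_idx, real_idx = mixed_batch(n_synth=50_000, n_real=40, batch_size=100, rng=0)
```

Validation and test batches would skip this sampler and draw from the real set only.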


## Data generation using the Black-Scholes model
Use Black–Scholes to generate "snapshots", each containing many quotes. Within each snapshot, pick 10-20 quotes as "observed" and treat the rest as targets for completion. This selection is fixed once (with a seed) so that comparisons are fair. We then use this phase to debug the model and tune rough ranges.

Black–Scholes formula (call):
$$ C = S N(d_1) - K e^{-rT} N(d_2) $$
Where:
- $C$: Price of the European call option
- $S$: Current price of the underlying asset (spot)
- $K$: Strike price of the option 
- $r$: Risk-free interest rate 
- $T$: Time to expiration (in years) 
- $\sigma$: Volatility of the underlying asset's returns 
- $N(\cdot)$: The standard normal cumulative distribution function, i.e. the probability that a standard normal variable is less than the given value; it is evaluated at $d_1$ and $d_2$
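
The list above does not spell out $d_1$ and $d_2$. For reference, the standard definitions are:

$$ d_1 = \frac{\ln(S/K) + \left(r + \tfrac{1}{2}\sigma^2\right)T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T} $$

With a continuous dividend yield $q$ (which the code in this section also takes as an argument), $r$ becomes $r - q$ inside $d_1$ and $S$ becomes $S e^{-qT}$ in the price formula.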

The put uses a different formula (write-up to come later); `bs_price` below already handles it via `option="put"`.

In [1]:
import numpy as np
import pandas as pd  # used by make_snapshot and make_dataset below
from scipy.stats import norm


def bs_price(spot, strike, years, r, q, sigma, option="call"):
    """Black-Scholes price of a European call or put with continuous dividend yield q."""
    T = np.asarray(years, dtype=float)
    S = np.asarray(spot, dtype=float)
    K = np.asarray(strike, dtype=float)
    sig = np.asarray(sigma, dtype=float)

    eps = 1e-12  # floor T and sigma to avoid division by zero in d1

    T = np.maximum(T, eps)
    sig = np.maximum(sig, eps)

    d1 = (np.log(S / K) + (r - q + 0.5 * sig**2) * T) / (sig * np.sqrt(T))
    d2 = d1 - sig * np.sqrt(T)

    Nd1 = norm.cdf(d1)
    Nd2 = norm.cdf(d2)

    if option == "call":
        return S * np.exp(-q * T) * Nd1 - K * np.exp(-r * T) * Nd2
    if option == "put":
        return K * np.exp(-r * T) * norm.cdf(-d2) - S * np.exp(-q * T) * norm.cdf(-d1)
    raise ValueError(f"option must be 'call' or 'put', got {option!r}")
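
A quick sanity check for any call/put pricer is put-call parity, $C - P = S e^{-qT} - K e^{-rT}$. The sketch below uses a compact stand-in called `bs` (a hypothetical name) so the block runs on its own; the same check could be pointed at `bs_price`.

```python
import numpy as np
from scipy.stats import norm


def bs(S, K, T, r, q, sigma, option="call"):
    """Compact Black-Scholes pricer with continuous dividend yield q."""
    d1 = (np.log(S / K) + (r - q + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    if option == "call":
        return S * np.exp(-q * T) * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
    return K * np.exp(-r * T) * norm.cdf(-d2) - S * np.exp(-q * T) * norm.cdf(-d1)


S, T, r, q, sigma = 100.0, 0.5, 0.03, 0.01, 0.25
K = np.array([80.0, 100.0, 120.0])

call = bs(S, K, T, r, q, sigma, "call")
put = bs(S, K, T, r, q, sigma, "put")

# Put-call parity: C - P = S * exp(-q T) - K * exp(-r T)
parity_gap = call - put - (S * np.exp(-q * T) - K * np.exp(-r * T))
```

`parity_gap` should be zero up to floating-point error at every strike.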


In [2]:
def make_snapshot(snapshot_id, n_strikes=30, tenors_days=(7,14,30,60,90,180,365),
                  smile=False, rng=None):
    """Build one synthetic snapshot: a strikes x tenors grid of call quotes sharing S0, r, q."""
    rng = np.random.default_rng(rng)

    S0 = rng.uniform(50, 500)
    r  = rng.uniform(0.00, 0.05)
    q  = rng.uniform(0.00, 0.03)

    m = rng.uniform(0.6, 1.4, size=n_strikes)  # moneyness K / S0
    K = np.sort(S0 * m)                        # strikes, ascending
    T = np.array(tenors_days) / 365.0          # tenors in years

    K_grid, T_grid = np.meshgrid(K, T, indexing="xy")
    S_grid = np.full_like(K_grid, S0)
    r_grid = np.full_like(K_grid, r)
    q_grid = np.full_like(K_grid, q)

    # Volatility: constant or simple smile
    if not smile:
        sigma_snap = rng.uniform(0.10, 0.60)
        sigma_grid = np.full_like(K_grid, sigma_snap)
    else:
        ell = np.log(S0 / K_grid)
        a = rng.uniform(0.10, 0.50)
        b = rng.uniform(-0.20, 0.20)
        c = rng.uniform(0.00, 0.20)
        d = rng.uniform(-0.05, 0.10)
        sigma_grid = np.clip(a + b*ell + c*ell**2 + d*np.sqrt(T_grid), 0.05, 2.0)

    P_mid = bs_price(S_grid, K_grid, T_grid, r_grid, q_grid, sigma_grid, option="call")

    # Quote noise: std is 1% of the mid price, floored at 0.001.
    noise = rng.normal(loc=0.0, scale=0.01*np.maximum(0.1, P_mid))
    P_obs = np.clip(P_mid + noise, 0.0, None)  # keep observed prices non-negative

    df = pd.DataFrame({
        "snapshot_id": snapshot_id,
        "S0": S_grid.ravel(),
        "K": K_grid.ravel(),
        "T": T_grid.ravel(),
        "r": r_grid.ravel(),
        "q": q_grid.ravel(),
        "sigma_true": sigma_grid.ravel(),
        "price_mid": P_mid.ravel(),
        "price_obs": P_obs.ravel(),
    })
    return df
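
Because `price_obs` carries noise, observed quotes can violate static no-arbitrage conditions even when the mid prices do not. A minimal sketch of one possible "arb metric" follows. It checks only two model-free conditions across strikes at a single tenor (call prices non-increasing and convex in K); the function name `strike_arb_violations` and the tolerance are assumptions, and calendar conditions are left out.

```python
import numpy as np


def strike_arb_violations(K, C, tol=1e-8):
    """Count violations of two static no-arbitrage conditions for calls at one tenor.

    K must be strictly increasing; C holds the call prices at those strikes.
    """
    K = np.asarray(K, dtype=float)
    C = np.asarray(C, dtype=float)
    slopes = np.diff(C) / np.diff(K)                # finite-difference dC/dK
    n_monotone = int(np.sum(slopes > tol))          # calls must not increase in K
    n_convex = int(np.sum(np.diff(slopes) < -tol))  # slopes must not decrease
    return {"monotonicity": n_monotone, "convexity": n_convex}


K = np.array([90.0, 100.0, 110.0, 120.0])
clean = np.maximum(130.0 - K, 0.0)  # decreasing, convex toy prices
bumped = clean.copy()
bumped[2] += 15.0                   # bump one quote so both conditions break

clean_report = strike_arb_violations(K, clean)
bumped_report = strike_arb_violations(K, bumped)
```

Applied per `(snapshot_id, T)` group, the counts could be summed into a single violation rate for the ablation table.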


In [3]:
def make_dataset(n_snapshots=2000, smile=False, seed=42):
    """Stack n_snapshots synthetic snapshots into one DataFrame, all drawn from one seeded RNG."""
    rng = np.random.default_rng(seed)
    dfs = []
    for sid in range(n_snapshots):
        df = make_snapshot(sid, smile=smile, rng=rng)
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

In [4]:
def split_snapshots(n_snapshots, train=0.8, val=0.1, seed=7):
    """Shuffle snapshot IDs and split them into train/val/test; test gets the remainder."""
    rng = np.random.default_rng(seed)
    ids = np.arange(n_snapshots)
    rng.shuffle(ids)
    n_train = int(train*n_snapshots)
    n_val = int(val*n_snapshots)
    return {
        "train": ids[:n_train],
        "val":   ids[n_train:n_train+n_val],
        "test":  ids[n_train+n_val:]
    }
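
Since the split is over snapshot IDs rather than rows, every quote from a snapshot lands in the same partition. A small usage sketch, with a toy DataFrame and a hand-written split dict standing in for the real outputs:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 10 snapshots with 3 quotes each, and a split dict shaped like
# the one returned by split_snapshots.
df = pd.DataFrame({
    "snapshot_id": np.repeat(np.arange(10), 3),
    "price_obs": np.arange(30.0),
})
splits = {"train": np.arange(0, 8), "val": np.array([8]), "test": np.array([9])}

# Select rows by snapshot ID so no snapshot is shared between partitions.
frames = {name: df[df["snapshot_id"].isin(ids)] for name, ids in splits.items()}
```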

In [5]:
def make_observed_mask(df, observed_per_snapshot=15, seed=99):
    """Map each snapshot ID to the sorted row indices of its randomly chosen observed quotes."""
    rng = np.random.default_rng(seed)
    mask = {}
    for sid, df_s in df.groupby("snapshot_id"):
        idx = df_s.index.values
        choose = min(observed_per_snapshot, len(idx))
        obs_idx = rng.choice(idx, size=choose, replace=False)
        mask[int(sid)] = np.sort(obs_idx)
    return mask
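
The mask stores row index labels, so the observed quotes and the completion targets of a snapshot can be separated with `.loc` and `.drop`. A small sketch, with a toy snapshot and a hand-written mask standing in for the real outputs:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: one snapshot with 6 quotes and a mask shaped like the output
# of make_observed_mask (snapshot ID -> sorted row indices of observed quotes).
df = pd.DataFrame({
    "snapshot_id": 0,
    "K": np.linspace(80, 120, 6),
    "price_obs": np.linspace(22, 1, 6),
})
mask = {0: np.array([1, 4])}

snap = df[df["snapshot_id"] == 0]
observed = snap.loc[mask[0]]        # quotes the model gets to see
targets = snap.drop(index=mask[0])  # quotes it must complete
```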