### 05 — RL Hedging Environment

In this notebook you turn the components from Notebooks 02–04 into a **Markov Decision Process (MDP)** environment for an **RL-based option hedging policy**.

---

#### 5.1 MDP Structure (Fixed Components)

We model the hedging problem as a finite-horizon MDP
$$
\mathcal{M} = (S, A, P, R, \gamma).
$$

- **State** $s_n$: contains market, market-making (MM), and option risk information at decision time $t_n$.
- **Action** $a_n$: choose a **target net delta / net vega bucket** for the overall portfolio, which is then implemented via option trades in a fixed option universe.
- **Transition** $P$: driven by the BTC QED+Hawkes simulator and the MM strategy.
- **Reward** $r_n$: change in portfolio equity, penalised by option costs and risk measures.
- **Discount** $\gamma$: you may use $\gamma=1$ for episodic training on finite horizons.
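
With $\gamma = 1$ the episodic return reduces to the plain sum of per-step rewards. A minimal sketch of that calculation follows; the helper name `episode_return` and the reward values are illustrative and not part of the notebook's code.

```python
import numpy as np

def episode_return(rewards, gamma=1.0):
    """Discounted return G_0 = sum_n gamma^n * r_n for one finite episode."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Illustrative per-step rewards for a three-step episode.
rewards = [1.0, -0.5, 2.0]
undiscounted = episode_return(rewards)            # gamma = 1: plain sum
discounted = episode_return(rewards, gamma=0.5)   # later rewards weigh less
```

On a finite horizon the undiscounted sum is always well defined, which is why $\gamma = 1$ is acceptable here.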




---

In [1]:
"""
In code:

- Create a simple container (class or dict) to hold:
  - current simulator state,
  - MM state (inventory, cash, equity),
  - option portfolio state (positions, greeks).
- Make sure you can:
  - advance the simulator by one hedge decision step,
  - re-evaluate greeks and portfolio equity at each step.

"""   
import numpy as np
import pandas as pd

class HedgingEnv:
    """
    RL Environment for hedging BTC option exposures using a fixed option universe
    and a baseline market-making perpetual strategy.

    State  s_n  includes:
        - market state (S_t, sigma_loc, recent returns etc.)
        - option greeks (portfolio delta, vega)
        - MM inventory & equity
        - time to maturity info

    Action a_n:
        - target delta bucket OR target vega bucket
        (This code supports both via a simple discrete mapping)

    Reward r_n:
        - change in total equity minus trading costs & risk penalties
    """

    def __init__(
        self,
        simulator,
        option_book,
        mm_engine,
        dt=5/1440,            # 5 minutes in days
        horizon_days=14,
        gamma=1.0,
        delta_actions=(-2, -1, 0, 1, 2),   # target Δ buckets (tuple avoids a shared mutable default)
        vega_actions=(-2, -1, 0, 1, 2),    # target V buckets
        mode="delta"          # "delta" or "vega"
    ):
        self.simulator = simulator
        self.option_book = option_book
        self.mm = mm_engine
        self.dt = dt
        self.N = int(round(horizon_days / dt))   # number of steps, consistent with dt
        self.gamma = gamma

        # RL action settings
        self.delta_actions = delta_actions
        self.vega_actions = vega_actions
        self.mode = mode

        # internal
        self.t = 0
        self.done = False

    # -------------------------------------------------------------
    # Reset episode
    # -------------------------------------------------------------
    def reset(self):
        self.simulator.reset()
        self.mm.reset()
        self.option_book.reset()

        self.t = 0
        self.done = False

        state = self._get_state()
        return state

    # -------------------------------------------------------------
    # Perform one hedge step
    # -------------------------------------------------------------
    def step(self, action):
        if self.done:
            raise ValueError("Episode already done.")

        # ---- 0. Snapshot equity before acting (last marked value) ----
        equity_before = self.mm.equity

        # ---- 1. Apply hedge decision ----
        self.apply_hedge_action(action)

        # ---- 2. Advance simulator by dt ----
        self.advance_simulator()

        # ---- 3. Recompute Greeks and equity ----
        self.compute_greeks_and_equity()
        equity_after = self.mm.equity

        # Reward = change in total equity over the whole step
        reward = equity_after - equity_before

        # ---- 4. Build next state ----
        state = self._get_state()

        # ---- 5. Termination check ----
        self.t += 1
        if self.t >= self.N:
            self.done = True

        return state, reward, self.done, {}

    # -------------------------------------------------------------
    # Advance BTC simulator by 5-minute step
    # -------------------------------------------------------------
    def advance_simulator(self):
        """
        Runs one simulator step:
        - BTC price path (QED + Hawkes)
        - Perpetual market making PnL
        - Option book marking-to-market
        """
        # BTC price
        S_new, sigma_loc = self.simulator.step()

        # MM engine PnL
        self.mm.update(S_new)

        # Option repricing + expiry handling
        self.option_book.update_prices(S_new, sigma_loc)

    # -------------------------------------------------------------
    # Compute portfolio greeks + equity
    # -------------------------------------------------------------
    def compute_greeks_and_equity(self):
        """
        Compute:
            - total delta
            - total vega
            - option mtm
            - total equity = MM cash + inventory*price + option mtm
        """
        self.option_book.compute_greeks()
        option_mtm = self.option_book.total_value

        S = self.simulator.S
        inv = self.mm.inventory

        # Equity = cash + perpetual inventory MTM + option MTM
        self.mm.equity = self.mm.cash + inv * S + option_mtm

    # -------------------------------------------------------------
    # Apply hedge action: target delta or target vega
    # -------------------------------------------------------------
    def apply_hedge_action(self, action):
        """
        action ∈ {0,1,2,3,4} (index)
        Map to target Delta or Vega bucket
        """
        if self.mode == "delta":
            target = self.delta_actions[action]
            current_delta = self.option_book.total_delta + self.mm.inventory
            delta_to_trade = target - current_delta

            # Send delta-to-trade into option_book hedge trader
            self.option_book.trade_delta(delta_to_trade)

        elif self.mode == "vega":
            target = self.vega_actions[action]
            current_vega = self.option_book.total_vega
            vega_to_trade = target - current_vega

            # Vega hedging via options
            self.option_book.trade_vega(vega_to_trade)

        else:
            raise ValueError("mode must be 'delta' or 'vega'")

    # -------------------------------------------------------------
    # Construct state vector for RL
    # -------------------------------------------------------------
    def _get_state(self):
        S = self.simulator.S
        sigma = self.simulator.sigma_loc
        opt_delta = self.option_book.total_delta
        opt_vega = self.option_book.total_vega
        inv = self.mm.inventory
        eq = self.mm.equity

        # Basic feature vector; can be extended
        state = np.array([
            S,
            sigma,
            opt_delta,
            opt_vega,
            inv,
            eq
        ], dtype=float)

        return state





#### 5.2 Hedge Universe: Strikes and Maturities

The RL agent trades only within a **small, liquid universe** of BTC options:

* **Moneyness / strikes**:
  $$
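  K \in \{0.9 S_0,\; 1.0 S_0,\; 1.1 S_0\},
  $$
  i.e. one strike 10% below spot, one at the money (ATM), and one 10% above spot. For calls the $1.1 S_0$ strike is out of the money (OTM); for puts the $0.9 S_0$ strike is OTM.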

* **Maturities**:
  $$
  T \in \{1\text{d},\; 7\text{d}\}.
  $$

* **Types**:

  * Calls and puts on the BTC simulator price $S_n$.

This gives a natural universe of up to:

* $3$ strikes $\times$ $2$ maturities $\times$ $2$ types (call/put)
  $= 12$ distinct option contracts.

You may restrict to a smaller subset (e.g. ATM options only) for computational reasons, but the default assumption is that the agent **has access to all** of these contracts.
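
The count of 12 can be checked by enumerating the Cartesian product of strikes, maturities, and types. This is a small sketch; `build_universe` and the value of `S0` are illustrative assumptions, not names used elsewhere in the notebook.

```python
from itertools import product

def build_universe(S0):
    """Enumerate (strike, maturity_days, option_type) for the hedge universe."""
    strikes = [0.9 * S0, 1.0 * S0, 1.1 * S0]
    maturities_days = [1, 7]
    option_types = ["call", "put"]
    # The first argument varies slowest and the last fastest, i.e. the same
    # order as nested loops over strike, then maturity, then type.
    return list(product(strikes, maturities_days, option_types))

S0 = 50_000.0                       # illustrative initial BTC price
universe = build_universe(S0)

# Optional restriction to the ATM strike only (4 contracts).
atm_only = [c for c in universe if c[0] == S0]
```

Fixing the enumeration order matters because contracts are later addressed by index, for example in the action mapping.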

---

#### 5.3 Portfolio, Delta and Vega

Let:

* $I_n$: MM inventory in the BTC perpetual at time $n$.
* $Q_n^{(i)}$: position (in lots) in option contract $i$ at time $n$
  (e.g. “number of contracts” or a normalised lot size).
* $P_n^{(i)}$: price of option $i$ at time $n$.

The **total equity** (MM + options) is:

$$
\Pi_n^{\text{total}}
= \Pi_n^{\text{MM}} + \sum_i Q_n^{(i)} P_n^{(i)},
$$

where $\Pi_n^{\text{MM}}$ is the equity of the MM engine alone.

We define:

* $\Delta^{\text{MM}}_n$: delta of MM position (essentially $I_n$ if perp is 1:1 delta).
* $\Delta^{(i)}_n$: delta of option $i$ at time $n$.
* $\Delta^{\text{opt}}_n = \sum_i Q_n^{(i)} \Delta^{(i)}_n$: aggregate option delta.
* $\Delta^{\text{port}}_n = \Delta^{\text{MM}}_n + \Delta^{\text{opt}}_n$: net portfolio delta.

Similarly for vega:

* $V^{(i)}_n$: vega of option $i$ at time $n$.
* $V^{\text{opt}}_n = \sum_i Q_n^{(i)} V^{(i)}_n$.
* $V^{\text{port}}_n = V^{\text{opt}}_n$ (perpetual has negligible vega).

These quantities are part of the **risk state** the RL agent must learn to control.
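
The aggregation formulas above are dot products between the position vector and the per-contract prices and greeks. A minimal sketch, assuming the per-contract values are already available as arrays (the function name `portfolio_risk` and all numbers are illustrative):

```python
import numpy as np

def portfolio_risk(inventory, Q, deltas, vegas, prices, mm_equity):
    """Return (net delta, net vega, total equity) for MM plus option book."""
    Q = np.asarray(Q, dtype=float)
    delta_opt = float(Q @ np.asarray(deltas, dtype=float))   # sum_i Q_i * Delta_i
    vega_port = float(Q @ np.asarray(vegas, dtype=float))    # sum_i Q_i * V_i
    delta_port = inventory + delta_opt                       # perp assumed 1:1 delta
    equity_total = mm_equity + float(Q @ np.asarray(prices, dtype=float))
    return delta_port, vega_port, equity_total

# Illustrative book: long 2 lots of one contract, short 1 lot of another.
delta_port, vega_port, equity_total = portfolio_risk(
    inventory=0.5,
    Q=[2.0, -1.0],
    deltas=[0.5, -0.25],
    vegas=[10.0, 4.0],
    prices=[100.0, 40.0],
    mm_equity=1000.0,
)
```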

---

#### 5.4 State Representation

At each hedge decision time $n$ (e.g. every few 5-minute steps), the agent observes a state vector $s_n$.
A reasonable baseline state includes:

$$
s_n = (
S_n,\;
I_n,\;
\Pi_n^{\text{total}},\;
\hat{\sigma}_{\text{loc}}(n),\;
\Delta^{\text{port}}_n,\;
V^{\text{port}}_n,\;
\text{TTM features},\;
\text{moneyness features},\;
\Delta S_n,\;
\Delta V_n
).
$$

Where:

* $S_n$: BTC price.
* $I_n$: MM inventory (BTC).
* $\Pi_n^{\text{total}}$: current equity of MM + options.
* $\hat{\sigma}_{\text{loc}}(n)$: local realised volatility from Section 3.
* $\Delta^{\text{port}}_n$: net portfolio delta.
* $V^{\text{port}}_n$: net portfolio vega.
* TTM features: time to maturity of relevant contracts (e.g. normalised).
* Moneyness features: e.g. $\log(K/S_n)$ for representative strikes.
* $\Delta S_n$: recent price change(s).
* $\Delta V_n$: recent changes in option prices or IV.

You may add/remove features (e.g. regime indicators, jump flags, realised variance windows), as long as you justify your design choices.
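
One possible way to assemble this vector is sketched below. The function `build_state`, the normalisation of TTM by the longest maturity, and the use of the change in local volatility as $\Delta V_n$ are assumptions made for illustration; other feature choices are equally valid if justified.

```python
import numpy as np

def build_state(S, inventory, equity, sigma_loc, delta_port, vega_port,
                taus_days, strikes, S_prev, sigma_prev, max_tau_days=7.0):
    """Concatenate market, risk, TTM and moneyness features into one vector."""
    ttm = np.asarray(taus_days, dtype=float) / max_tau_days   # normalised TTM
    moneyness = np.log(np.asarray(strikes, dtype=float) / S)  # log(K / S_n)
    dS = S - S_prev                  # most recent price change
    dV = sigma_loc - sigma_prev      # most recent change in local vol (assumed proxy)
    return np.concatenate([
        [S, inventory, equity, sigma_loc, delta_port, vega_port],
        ttm,
        moneyness,
        [dS, dV],
    ])

# Illustrative inputs only.
state = build_state(
    S=50_000.0, inventory=0.5, equity=1_000.0, sigma_loc=0.6,
    delta_port=1.75, vega_port=16.0,
    taus_days=[1.0, 7.0], strikes=[45_000.0, 50_000.0, 55_000.0],
    S_prev=49_900.0, sigma_prev=0.58,
)
```

Raw price and equity levels are on a much larger scale than the greeks, so scaling the features before feeding them to a function approximator is usually worth considering.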

---

#### 5.5 Action Space: Discrete Option Trades

At each decision time, the agent chooses **one discrete action** from a finite set $A$.

The natural design, given our universe, is:

* **No trade:**

  * Do nothing this step.

* **Option trades:**

  * For each option contract $i$ in the universe
    (strike $K \in \{0.9 S_0, 1.0 S_0, 1.1 S_0\}$,
    maturity $T \in \{1\text{d}, 7\text{d}\}$,
    call or put), define:

    * “buy 1 lot of option $i$”,
    * “sell 1 lot of option $i$”.

Let $q_{\text{opt}}$ be the **fixed lot size** per trade.
Then a “buy” action increases $Q_n^{(i)}$ by $q_{\text{opt}}$,
and a “sell” action decreases $Q_n^{(i)}$ by $q_{\text{opt}}$.

You may optionally restrict the action space to a smaller subset of contracts
(e.g. ATM 1d and ATM 7d only) if needed.

---

#### 5.6 Transaction Costs (Size-Aware)

Each option trade incurs a transaction cost **proportional to notional**.
If at time $n$ we execute trades $\Delta Q_n^{(i)}$ in each contract $i$, then:

$$
TC_n = c_{\text{opt}} \sum_i \left| \Delta Q_n^{(i)} P_n^{(i)} \right|,
$$

where $c_{\text{opt}}$ is a cost rate (e.g. $0.0005$ for $0.05\%$).

This cost term is **size-aware**:

* larger lots or more expensive options
  $\Rightarrow$ larger notional
  $\Rightarrow$ larger $TC_n$ penalty.
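
The cost formula can be sketched as a short vectorised function. The name `transaction_cost` and the example trades and prices are illustrative; only the rate $c_{\text{opt}} = 0.0005$ comes from the text above.

```python
import numpy as np

def transaction_cost(dQ, prices, c_opt=0.0005):
    """TC_n = c_opt * sum_i |dQ_i * P_i| for the trades executed at one step."""
    dQ = np.asarray(dQ, dtype=float)
    prices = np.asarray(prices, dtype=float)
    return float(c_opt * np.sum(np.abs(dQ * prices)))

prices = [200.0, 50.0]                                     # illustrative option prices
tc_small = transaction_cost(dQ=[1.0, 0.0], prices=prices)  # buy 1 lot of contract 0
tc_large = transaction_cost(dQ=[2.0, -1.0], prices=prices) # bigger trade, both legs
```

Because the absolute value is taken per contract, buying and selling are charged symmetrically and offsetting trades in different contracts do not net out.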

---

#### 5.7 Reward design

Design and justify your own RL reward function.
The RL agent should not be forced to keep the book exactly delta- and vega-neutral.  
We want to **allow under-hedging / over-hedging** as long as the **overall portfolio is profitable** and **tail risk is controlled**.

Your reward design must satisfy the following principles:

* **Profit focus.**
  The main positive signal should be **realised PnL**, net of transaction costs for hedging trades.
  Make clear what PnL you are using (per-step or cumulative increment).

* **Cost awareness.**
  Transaction costs for option hedges must enter the reward with the correct sign
  (higher costs should reduce reward).

* **Risk exposure control, but not hard neutrality.**
  You may expose the book to delta and vega risk, and the agent is allowed to under-hedge or over-hedge.
  However, your reward should **discourage extremely large risk exposures**
  (for example via soft penalties once $|\Delta^{\text{port}}_n|$ or $|V^{\text{port}}_n|$ exceed some comfort band).

* **Tail-risk awareness.**
  Include at least one component that penalises **bad tail outcomes over the whole episode**,
  such as large final loss, large drawdown, or a risk measure like downside variance or CVaR.
  This should make “rare but very large losses” unattractive even if average PnL is high.

* **No trivial solutions.**
  Check that your reward does **not** make degenerate policies obviously optimal
  (e.g. “never hedge” or “always fully hedge to zero risk” regardless of market conditions).

What you need to hand in:

* A **mathematical expression** of your reward (per-step and/or terminal), with all symbols defined.
* A short **written justification** explaining:

  * how your reward trades off profit vs risk and transaction costs;
  * why it allows meaningful under-hedging / over-hedging;
  * why it is suitable for controlling tail risk in this assignment.
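
As a starting point only, the principles above could be combined as in the sketch below: per-step PnL net of costs, a penalty that is zero inside a comfort band and quadratic outside it, and a terminal penalty on maximum drawdown. All function names, band widths, and penalty weights are illustrative assumptions; this is one possible design, not the required answer.

```python
import numpy as np

def soft_band_penalty(x, band):
    """Zero inside the comfort band, quadratic in the excess outside it."""
    excess = max(abs(x) - band, 0.0)
    return excess ** 2

def step_reward(pnl, tc, delta_port, vega_port,
                delta_band=1.0, vega_band=50.0,
                lam_delta=0.1, lam_vega=0.001):
    """Per-step reward: PnL (gross of costs here) minus costs and soft risk penalties.

    If costs are already deducted from cash (and hence from pnl), pass tc=0
    to avoid counting them twice.
    """
    return (pnl - tc
            - lam_delta * soft_band_penalty(delta_port, delta_band)
            - lam_vega * soft_band_penalty(vega_port, vega_band))

def max_drawdown(equity_path):
    """Largest peak-to-trough fall in equity over the episode."""
    equity_path = np.asarray(equity_path, dtype=float)
    running_peak = np.maximum.accumulate(equity_path)
    return float(np.max(running_peak - equity_path))

def terminal_penalty(equity_path, lam_dd=0.5):
    """Episode-level tail-risk term, added to the final step's reward."""
    return -lam_dd * max_drawdown(equity_path)

# Illustrative values only.
r_inside = step_reward(pnl=10.0, tc=1.0, delta_port=0.5, vega_port=10.0)
r_outside = step_reward(pnl=10.0, tc=1.0, delta_port=3.0, vega_port=10.0)
dd = max_drawdown([100.0, 110.0, 90.0, 105.0])
r_terminal = terminal_penalty([100.0, 110.0, 90.0, 105.0])
```

Inside the band the agent is free to under-hedge or over-hedge at no charge, while the quadratic growth outside it and the drawdown term make very large exposures and rare large losses unattractive.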

---

In [2]:
# write your code and analysis here
class OptionContract:
    def __init__(self, K, T, opt_type):
        """
        K: strike
        T: maturity in days
        opt_type: "call" or "put"
        """
        self.K = K
        self.T = T
        self.type = opt_type

        # dynamic values
        self.tau = T       # time to maturity (days, decreasing)
        self.price = 0.0
        self.delta = 0.0
        self.vega = 0.0


class HedgeUniverse:
    """
    Construct the fixed universe of 12 BTC options:
    K = {0.9 S0, S0, 1.1 S0}
    T = {1d, 7d}
    type = call/put
    """

    def __init__(self, S0):
        Ks = [0.9*S0, S0, 1.1*S0]
        Ts = [1, 7]             # in days
        types = ["call", "put"]

        self.contracts = []
        for K in Ks:
            for T in Ts:
                for typ in types:
                    self.contracts.append(OptionContract(K, T, typ))

        # positions Q_i for each contract
        self.Q = np.zeros(len(self.contracts))

        # store values
        self.prices = np.zeros(len(self.contracts))
        self.deltas = np.zeros(len(self.contracts))
        self.vegas  = np.zeros(len(self.contracts))

    def reset(self):
        self.Q[:] = 0.0
        for c in self.contracts:
            c.tau = c.T
            c.price = 0.0
            c.delta = 0.0
            c.vega = 0.0
    
    def compute_portfolio_greeks(self):
        self.port_delta = np.sum(self.Q * self.deltas)
        self.port_vega  = np.sum(self.Q * self.vegas)



In [3]:
def _get_state(self):
    S = self.simulator.S
    sigma = self.simulator.sigma_loc

    self.option_book.compute_portfolio_greeks()
    delta_p = self.option_book.port_delta
    vega_p  = self.option_book.port_vega

    inv = self.mm.inventory
    eq = self.mm.equity

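    # sample TTM and moneyness features
    # contracts are ordered (K, T, type) with type varying fastest:
    # 0 = (0.9*S0, 1d, call), 2 = (0.9*S0, 7d, call), 4 = (S0, 1d, call)
    tau_1d  = self.option_book.contracts[0].tau
    tau_7d  = self.option_book.contracts[2].tau

    moneyness_ATM = np.log(self.option_book.contracts[4].K / S)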

    # last price and vol changes
    dS = S - self.simulator.S_prev
    dV = sigma - self.simulator.prev_sigma_loc

    state = np.array([
        S,
        inv,
        eq,
        sigma,
        delta_p,
        vega_p,
        tau_1d,
        tau_7d,
        moneyness_ATM,
        dS,
        dV
    ], dtype=float)

    return state


In [4]:
class ActionSpace:
    def __init__(self, num_contracts, lot=1.0):
        self.lot = lot
        self.n_actions = 1 + 2 * num_contracts
        # mapping: 0 = no trade
        # 1 → buy contract 0
        # 2 → sell contract 0
        # 3 → buy contract 1
        # 4 → sell contract 1
        # etc.

    def decode(self, action):
        if action == 0:
            return None
        i = (action - 1) // 2
        direction = +1 if (action - 1) % 2 == 0 else -1
        return i, direction




In [5]:
def apply_option_trade(self, contract_id, direction):
    """
    direction = +1 buy, -1 sell
    """
    c = self.option_book.contracts[contract_id]
    lot = self.action_space.lot

    ΔQ = direction * lot
    price = c.price

    # transaction cost
    cost = self.c_opt * abs(ΔQ * price)
    self.mm.cash -= cost

    # adjust book
    self.option_book.Q[contract_id] += ΔQ
    self.mm.cash -= ΔQ * price   # pay for purchases / receive for sells


In [6]:
def compute_reward(self):
    pnl = self.mm.equity - self.prev_equity

    # portfolio greeks
    delta = abs(self.option_book.port_delta + self.mm.inventory)
    vega  = abs(self.option_book.port_vega)

    risk_penalty = self.c_risk * (
        delta / self.delta_max +
        vega  / self.vega_max
    )

    reward = pnl - risk_penalty

    self.prev_equity = self.mm.equity
    return reward
