# Execution Training & Testing

This notebook runs the Signature Q-Learning experiment for the optimal execution problem. It covers the full pipeline from environment setup to saving results:

1. **Environment setup** — register the custom execution environment as a Gym environment and define its parameters.
2. **Baseline policy** — run a simple sell-inventory baseline and save its results.
3. **Training** — learn approximate Q-functions with Signature Q-Learning over multiple seeded runs.
4. **Testing** — evaluate the learned Q-functions on unseen episodes and save the test results.

All results (baseline, training, testing) are saved as pickle files in the `../results` directory. Each file name includes a `date_id` (e.g. `20250127_A`) that is generated at the start of the notebook and serves as a unique identifier. The companion notebook `execution_results_analysis.ipynb` loads these files for analysis.

Under the hood, each environment episode runs a full multi-agent market simulation powered by the ABIDES submodule (`abides-jpmc-public`). The environment wraps ABIDES as an OpenAI Gym environment via the custom class `SubGymMarketsCustomExecutionEnv`.

## Imports

In [None]:
%load_ext autoreload
%autoreload 2

from tqdm.notebook import tqdm
import sys
sys.path.insert(0, '../src') # to import from src directory

import gym

from abides_gym_custom_execution_environment import SubGymMarketsCustomExecutionEnv
from sigqlearning_qfunctions import SigQFunction
from sigqlearning_test_execution import test
from sigqlearning_train_execution import train
from sigqlearning_baseline_execution import run_baseline
import utils

---
# Environment setup

## Register Gym environment

Register the custom execution environment for Gym use and define a helper function to generate an environment instance with a given seed and parameters. Each call to `generate_env` creates a fresh environment backed by an ABIDES market simulation.

In [None]:
from gym.envs.registration import register

register(
    id='custom-execution-v0',
    entry_point=SubGymMarketsCustomExecutionEnv,
)

def generate_env(seed=None, **env_params):
    """
    Create the custom execution environment with the given parameters
    and, if a seed is provided, seed it.
    """
    env = gym.make(id='custom-execution-v0', **env_params)
    if seed is not None:
        env.seed(seed)
    return env

## Environment parameters

Set the execution environment parameters used for training and testing. These parameters correspond to the environment configuration detailed in **Section 6.3** of the master's thesis.

In [None]:
env_params = dict(
    background_config = 'rmsc04',
    mkt_close = '10:05:00',
    timestep_duration = '10s',
    first_interval = '00:05:00',
    observation_interval = '00:00:00',            
    order_fixed_size = 50,
    max_inventory = 1000,
    starting_inventory = 700,
    terminal_inventory_reward = -0.7, # reward or penalty
    terminal_inventory_mode = 'quadratic', # quadratic, linear, flat
    running_inventory_reward_dampener = 0., # 0.6, 1.0
    damp_mode = None, # asymmetric
    debug_mode = False,
    reward_multiplier = 'quadratic_positive', # running reward mode
    reward_multiplier_float = None, #1.5,
)

## Date identifier

Generate a `date_id` (e.g. `'20250127_A'`) that uniquely identifies all results saved during this notebook run. The identifier is based on the current date and an incrementing letter to avoid overwriting existing files in `../results`.

In [None]:
# returns current date plus letter as string
date_id = utils.get_date_id() 
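
As an illustration of the date-plus-letter scheme described above, here is a minimal sketch. The function `make_date_id` and its `existing_ids` argument are hypothetical; the real `utils.get_date_id` may work differently, for example by scanning `../results` for existing file names.

```python
from datetime import date
from string import ascii_uppercase

def make_date_id(existing_ids, today=None):
    """Return '<YYYYMMDD>_<letter>' using the first letter not already taken."""
    today = today or date.today()
    stamp = today.strftime('%Y%m%d')
    for letter in ascii_uppercase:
        candidate = f'{stamp}_{letter}'
        if candidate not in existing_ids:
            return candidate
    raise RuntimeError(f'all 26 letters are already used for {stamp}')

# 'A' is taken for this date, so the next free letter is used
print(make_date_id({'20250127_A'}, today=date(2025, 1, 27)))  # 20250127_B
```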

---
# Baseline policy

The baseline policy sells the starting inventory at a fixed rate while the absolute inventory exceeds 5% of the maximum inventory, then stops trading until the end of the episode. This is the policy the RL agent is supposed to learn, and the resulting baseline reward serves as a benchmark for comparison. Results are saved to `../results` via `utils.save_results`.
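
The baseline rule can be sketched without any market simulation. The functions below are hypothetical and are not the code in `run_baseline`. They assume that "fixed rate" means one order of `order_fixed_size` shares per timestep and that every order fills completely, which the real environment does not guarantee.

```python
def baseline_action(inventory, max_inventory, threshold_pct=5):
    """Return 'SELL' while |inventory| exceeds threshold_pct percent of max_inventory, else 'HOLD'."""
    # integer arithmetic avoids floating-point edge cases at the threshold
    return 'SELL' if 100 * abs(inventory) > threshold_pct * max_inventory else 'HOLD'

def simulate_baseline(starting_inventory, max_inventory, order_size, n_steps):
    """Apply the rule for n_steps, ignoring market dynamics; return (final inventory, sell count)."""
    inventory, n_sells = starting_inventory, 0
    for _ in range(n_steps):
        if baseline_action(inventory, max_inventory) == 'SELL':
            inventory -= order_size  # assumes the full order is executed
            n_sells += 1
    return inventory, n_sells

final_inventory, n_sells = simulate_baseline(700, 1000, 50, n_steps=180)
print(final_inventory, n_sells)  # 50 13
```

Under these assumptions, an agent starting with 700 shares stops selling once it holds 50 shares, because 50 is not strictly above 5% of 1000.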

In [None]:
# run baseline
env = generate_env(1000, **env_params) # seed different from any test run seed
baseline_results_dict = run_baseline(env, episodes=10)

In [None]:
# save baseline results
date_id = utils.save_results(baseline_results_dict, date_id, results_type='baseline')

---
# Training

Train approximate Q-functions using Signature Q-Learning. The training is performed over multiple independent runs with different seeds to assess variability. Each run produces a training results dictionary and a final Q-function state dict.

## Training parameters

Define the signature Q-function parameters (truncation depth, basepoint, initial bias) and the training hyper-parameters (number of episodes, discount factor, learning rate and its decay, exploration strategy with epsilon-greedy decay).
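
To give a sense of how the truncation depth drives model size, the sketch below counts the terms of a truncated path signature. The helper `truncated_signature_dim` is hypothetical. The two-channel input is an assumption inferred from the two-element basepoint; the actual `SigQFunction` may use a different number of channels or handle the constant term differently.

```python
def truncated_signature_dim(n_channels, depth, include_constant=True):
    """Number of terms in a signature truncated at `depth` for a path with n_channels."""
    # level k of the signature has n_channels**k terms
    n_terms = sum(n_channels ** k for k in range(1, depth + 1))
    return n_terms + 1 if include_constant else n_terms

print(truncated_signature_dim(2, 7))  # 255
```

The count grows geometrically with depth, so each extra signature level roughly doubles the feature count for a two-channel path.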

In [None]:
# signature parameters
sigq_params = dict(
    sig_depth = 7,
    basepoint = [0., 0.], 
    initial_bias = 0.01,
)

# training parameters
training_params = dict(
    episodes = 2,
    discount = 1.0,
    learning_rate = 5e-5,
    learning_rate_decay = dict(mode='exponential', factor=0.999),
    exploration = 'greedy',
    epsilon = 1,
    epsilon_decay = dict(mode='exponential', factor=0.997),
    decay_mode = 'episodes',
    debug_mode = None,
    progress_display = 'tqdm', # set to 'livelossplot' for live charts during each training run
)
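
The effect of the exponential decay settings can be previewed with a short calculation. This is a sketch that assumes the decay is applied multiplicatively once per episode (as `decay_mode = 'episodes'` suggests); the helper `exponential_decay` is hypothetical and the actual schedule inside `train` may differ.

```python
def exponential_decay(initial, factor, step):
    """Value after `step` decay steps: initial * factor**step."""
    return initial * factor ** step

# preview epsilon and learning rate at a few episode counts
for episode in (0, 100, 500, 1000):
    eps = exponential_decay(1.0, 0.997, episode)
    lr = exponential_decay(5e-5, 0.999, episode)
    print(f'episode {episode:4d}: epsilon={eps:.4f}, learning_rate={lr:.2e}')
```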

## Learn Q-function estimates

Each run learns an approximate Q-function with Signature Q-Learning. The runs use distinct prime-number seeds so that results are reproducible and variance across runs can be assessed.

**Note**: This cell takes a long time to run, as each episode involves a full ABIDES market simulation. Consider copying this cell into a standalone Python script and running it in the background (e.g. with [screen](https://linuxize.com/post/how-to-use-linux-screen/) or [tmux](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/)).

In [None]:
# number of training runs
n_runs = 2

# dict to store results
training_results_dict = {run : [] for run in range(n_runs)}
training_seeds = {
    run : seed for run, seed 
    in zip(training_results_dict, utils.generate_prime_seeds(n_runs, random=False))
}

final_Q_functions = {}

# training runs
runs_pbar = tqdm(training_results_dict.keys(), desc='Training run')
for run in runs_pbar:
    env = generate_env(training_seeds[run], **env_params)
    sigqfunction = SigQFunction(env, **sigq_params)
    training_results_dict[run] = train(env, sigqfunction, **training_params)
    final_Q_functions[run] = sigqfunction.state_dict()
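
For readers without access to `utils`, the idea of deterministic prime seeds can be sketched as follows. The function `first_primes` is hypothetical; `utils.generate_prime_seeds(n_runs, random=False)` may return different primes (for example, larger ones).

```python
def first_primes(n):
    """Return the first n prime numbers by trial division against primes found so far."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # only primes up to sqrt(candidate) need to be checked
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

# map run index -> seed, mirroring the structure of training_seeds
seeds_by_run = dict(enumerate(first_primes(3)))
print(seeds_by_run)  # {0: 2, 1: 3, 2: 5}
```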

## Save training results

Save the training results (per-run results, final Q-function state dicts, and all parameters) to `../results` as a single pickle file. The file name follows the pattern `execution_training_results_<date_id>.pkl`.

In [None]:
results_collected=dict(
    training_results=training_results_dict, 
    final_Q_functions=final_Q_functions,
    sig_params=sigq_params, 
    training_params=training_params, 
    env_params=env_params,
    training_seeds=training_seeds
)

date_id = utils.save_results(results_collected, date_id, results_type='training')
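
A pickle round trip with the file-name pattern above might look like the sketch below. Both helpers are hypothetical stand-ins, not the implementations in `utils`: they take an explicit `results_dir` and build the name directly from `results_type`, so they reproduce the training pattern but not necessarily the baseline or test file names.

```python
import pickle
import tempfile
from pathlib import Path

def save_results_sketch(results, date_id, results_type, results_dir):
    """Pickle `results` to <results_dir>/execution_<results_type>_results_<date_id>.pkl."""
    path = Path(results_dir) / f'execution_{results_type}_results_{date_id}.pkl'
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as f:
        pickle.dump(results, f)
    return path

def load_results_sketch(results_type, date_id, results_dir):
    """Load the pickle written by save_results_sketch."""
    path = Path(results_dir) / f'execution_{results_type}_results_{date_id}.pkl'
    with path.open('rb') as f:
        return pickle.load(f)

# round trip in a temporary directory so nothing is written to ../results
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_results_sketch(
        {'training_seeds': {0: 2, 1: 3}}, '20250127_A', 'training', tmp
    )
    restored = load_results_sketch('training', '20250127_A', tmp)
    print(saved_path.name)  # execution_training_results_20250127_A.pkl
```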

---
# Testing

Evaluate the learned Q-functions on unseen environment episodes. To test Q-functions from a previous training run instead of the ones just trained, set `load_training_results` to `True` and specify the corresponding `date_id`.

In [None]:
load_training_results = False

if load_training_results:
    date_id = '20250127_A'
    training_data = utils.load_results('training', date_id)
    
    # Unpack with helper function
    (training_results_dict, final_Q_functions, sigq_params, 
     training_params, env_params, training_seeds, n_runs) = utils.unpack_training_results(training_data)
    
    # Display parameters
    print(f'Loaded {n_runs} training runs with parameters:')
    from pprint import pprint
    pprint({
        k: v for k, v in training_data.items() 
        if k not in ('training_results', 'final_Q_functions')
    })

## Test Q-function estimates

Run the learned Q-functions greedily (no exploration) on fresh environment episodes. By default the final Q-function from each training run is used (`checkpoint = -1`). To test an intermediate checkpoint saved during training, set `checkpoint` to the desired index (checkpoints are saved every 10 episodes during training, so `0 <= checkpoint < n_training_episodes / 10`).
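
Greedy evaluation means always taking the action with the highest estimated Q-value. The sketch below shows this as the `epsilon = 0` case of epsilon-greedy selection. The function `select_action` and the list of Q-values are hypothetical; how `test` and `SigQFunction` select actions internally is not shown in this notebook.

```python
import random

def select_action(q_values, epsilon=0.0, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# with epsilon = 0 the choice is purely greedy
print(select_action([0.1, 0.7, 0.3]))  # 1
```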

In [None]:
checkpoint = -1 # final

n_test_episodes = 2000
test_results_dict = {}
test_seeds = utils.generate_prime_seeds(100, random=True)

runs_pbar = tqdm(training_results_dict.keys(), desc='Test run')
for run in runs_pbar:
    env = generate_env(test_seeds[run], **env_params)
    sigqfunction = SigQFunction(env, **sigq_params)
    sigq_state_dict = final_Q_functions[run] if checkpoint == -1 \
        else training_results_dict[run]['intermediate'][checkpoint]
    sigqfunction.load_state_dict(sigq_state_dict)
    test_results_dict[run] = test(env, sigqfunction, n_test_episodes, debug_mode='info')
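
Because Python accepts negative list indices, a mistyped `checkpoint` such as `-2` would silently select a checkpoint counted from the end. A stricter selection could be sketched as below; `select_state_dict` is a hypothetical helper, and the plain lists stand in for the state dicts stored under `'intermediate'`.

```python
def select_state_dict(final_state, intermediate_states, checkpoint=-1):
    """Return the final state for checkpoint == -1, else the intermediate one at that index."""
    if checkpoint == -1:
        return final_state
    if not 0 <= checkpoint < len(intermediate_states):
        raise IndexError(
            f'checkpoint must be -1 or in [0, {len(intermediate_states)}), got {checkpoint}'
        )
    return intermediate_states[checkpoint]

# placeholder strings instead of real state dicts
print(select_state_dict('final', ['ckpt_0', 'ckpt_1'], checkpoint=1))  # ckpt_1
```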

## Save test results

Save the test results (per-run results, test seeds, and the checkpoint used) to `../results` as a pickle file. The file name follows the pattern `execution_test_results_<date_id>.pkl`.

In [None]:
testing_data = {
    'test_results': test_results_dict,
    'test_seeds': test_seeds,
    'checkpoint': checkpoint,
}

date_id = utils.save_results(testing_data, date_id, results_type='testing')