# Execution Results Analysis

This notebook analyses the results obtained from the Signature Q-Learning execution experiment, as carried out in the companion notebook `execution_training_testing.ipynb`. It covers:

1. **Baseline policy analysis** — summary statistics, reward distribution, and confidence intervals for the baseline (sell-inventory) policy.
2. **Training analysis** — per-run and averaged training trajectories (reward, loss, cash, terminal inventory), observation-action histories, and convergence diagnostics via first-observation values.
3. **Testing analysis** — summary statistics, inventory/action trajectories, and reward comparisons across test runs.

The results loaded here correspond to `date_id = '20250127_A'` and are stored in the `../results` directory. A detailed discussion of these results is provided in **Section 6.3** of the master's thesis. To analyse results from a different experiment, change the `date_id` variable in the cell below; note, however, that the inline comments in this notebook refer to the `20250127_A` run.

**Note**: To save any plot, set `save=True` in the respective plotting function call. Plots are saved to the `../figures` directory with the `date_id` included in the file name.

## Imports

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(0, '../src') # to import from src directory

import matplotlib.pyplot as plt
import numpy as np
from tabulate import tabulate
import scipy.stats as stats 

import plotting_utils
import utils

# Load results

Load baseline, training, and test results from saved pickle files with selected `date_id`. Set the `date_id` in the cell below to select which results to load. All results are expected to reside in the `../results` directory and are loaded via `utils.load_results`.

Results from the `../results` directory that are used in the master's thesis and in the analysis below:

- baseline policy: `execution_baseline_results_20250127_A.pkl`
- training: `execution_training_results_20250127_A.pkl`
- testing: `execution_test_results_20250127_A.pkl`
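
The file names above follow a common pattern, so the mapping from `date_id` and `results_type` to a pickle file can be sketched as below. This is only an illustration under assumptions: `load_results_sketch` and `FILE_STEMS` are hypothetical names, and the actual `utils.load_results` may be implemented differently.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical mapping, inferred only from the three file names listed above
FILE_STEMS = {
    'baseline': 'execution_baseline_results',
    'training': 'execution_training_results',
    'testing': 'execution_test_results',
}


def load_results_sketch(date_id, results_type, results_dir='../results'):
    """Load one pickled results object (illustrative, not the real utils.load_results)."""
    path = Path(results_dir) / f'{FILE_STEMS[results_type]}_{date_id}.pkl'
    with path.open('rb') as f:
        return pickle.load(f)


# Round trip with a throwaway file so that the sketch runs without the real results
with tempfile.TemporaryDirectory() as tmp:
    demo = {'rewards': [-1.0, -2.0]}
    with open(Path(tmp) / 'execution_baseline_results_demo.pkl', 'wb') as f:
        pickle.dump(demo, f)
    loaded = load_results_sketch('demo', 'baseline', results_dir=tmp)

print(loaded)
```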

In [None]:
# set date_id of results to be loaded
date_id = '20250127_A'

In [None]:
# load baseline policy results
baseline_results_dict = utils.load_results(date_id, results_type='baseline')

In [None]:
# load training results
training_data = utils.load_results(date_id, results_type='training')

# Unpack with helper function
(training_results_dict, final_Q_functions, sigq_params, 
    training_params, env_params, training_seeds, n_runs) = utils.unpack_training_results(training_data)

# Display parameters
print(f'\nLoaded {n_runs} training runs with parameters:')
from pprint import pprint
pprint({
    k: v for k, v in training_data.items() 
    if k not in ('training_results', 'final_Q_functions')
})

In [None]:
# load test results
testing_data = utils.load_results(date_id, results_type='testing')

# Display seeds and checkpoint
print(f'Loaded {n_runs} test runs for training checkpoint: {testing_data["checkpoint"]},' \
      f' \n and with environment seeds: {testing_data["test_seeds"]}')

# We only need test runs results
test_results_dict, _, _ = testing_data.values()
del testing_data

---
# Baseline policy analysis

The baseline policy sells the starting inventory at a fixed rate until the absolute inventory falls below a threshold (5 % of the maximum inventory) and then stops trading. It serves as a benchmark against which the Signature Q-Learning agent is compared. In this section we compute summary statistics, plot reward and inventory trajectories, inspect the reward distribution, and construct confidence intervals for the mean baseline reward.
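
The rule described above can be sketched in a few lines. This is a simplified illustration, not the environment's actual implementation: the function name, the per-step sell size, and the numbers used are assumptions, and price dynamics and rewards are left out entirely.

```python
def run_baseline_sketch(initial_inventory, max_inventory, sell_per_step,
                        n_steps, threshold_frac=0.05):
    """Inventory path of a fixed-rate selling rule (hypothetical, simplified)."""
    threshold = threshold_frac * max_inventory
    inventory = initial_inventory
    trajectory = [inventory]
    for _ in range(n_steps):
        # stop trading once the absolute inventory is below the threshold
        if abs(inventory) >= threshold:
            inventory -= sell_per_step
        trajectory.append(inventory)
    return trajectory


# Assumed numbers for illustration only
trajectory = run_baseline_sketch(initial_inventory=100, max_inventory=100,
                                 sell_per_step=8, n_steps=20)
print(trajectory)
```

With these assumed numbers the inventory falls in steps of 8 until it drops below the threshold of 5 and then stays constant for the remaining steps.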

## Baseline statistics

Compute and display summary statistics of the baseline policy: mean, standard deviation, and median of the episode rewards, as well as mean, standard deviation, minimum, and maximum of the terminal inventory.

In [None]:
baseline_stats = []
baseline_stats.append(np.mean(baseline_results_dict['rewards'])) # mean reward
baseline_stats.append(np.std(baseline_results_dict['rewards'])) # std reward
baseline_stats.append(np.median(baseline_results_dict['rewards'])) # median reward
baseline_stats.append(np.mean(baseline_results_dict['terminal_inventories'])) # mean terminal inventory
baseline_stats.append(np.std(baseline_results_dict['terminal_inventories'])) # std terminal inventory
baseline_stats.append(int(np.min(baseline_results_dict['terminal_inventories']))) # min terminal inventory
baseline_stats.append(int(np.max(baseline_results_dict['terminal_inventories']))) # max terminal inventory
    
columns = [
    'Mean\nreward', 'Std\nreward', 'Median\nreward', 'Mean terminal\ninventory', 
    'Std terminal\ninventory', 'Min terminal\ninventory', 'Max terminal\ninventory' 
]
print(tabulate([baseline_stats], headers=columns, tablefmt='fancy_grid', floatfmt='.5f'))

## Baseline plots

Overview plot of the baseline run (rewards, terminal inventories, actions, and inventories for a selected episode) and separate trajectory plots for selected episodes.

In [None]:
episode_id = -1
plotting_utils.plot_baseline_results(baseline_results_dict, episode_id, ma_window=20, 
                                     figsize=(6, 4), date_id=date_id, save=False)

plotting_utils.plot_baseline_trajectories(baseline_results_dict, episode_ids=[-2, -3], 
                                          date_id=date_id, save=True, show=False)

## Confidence intervals of baseline rewards

We inspect the distribution of rewards from the baseline run with a histogram and a boxplot and find that it is moderately right-skewed.

In [None]:
baseline_rewards = np.array(baseline_results_dict['rewards'])
plotting_utils.plot_baseline_reward_distribution(baseline_rewards, figsize=(10, 3.5), 
                                                 date_id=date_id, save=False)

We compute confidence intervals for the mean reward using three different methods: 
1. Normal-based CI (Large Sample Theory),   
2. Bootstrap CI,
3. Log-Transformed CI.

The log-transformed CI is computed by fitting a log-normal distribution to the negated rewards (which are positive) and following the approach for confidence intervals for three-parameter log-normal distributions in the paper by Olsson (2017). The resulting CI is then transformed back to the original reward scale. The bootstrap CI is obtained by resampling the rewards with replacement, computing the mean of each resample, and taking the 2.5th and 97.5th percentiles of these means.
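
For reference, the log-transformed interval as it is computed in the code cell below can be written out as follows. This is a transcription of the code, not a quotation from the paper. Here $\hat{\sigma}$ is the fitted `shape`, $\hat{\mu} = \log(\texttt{scale})$, $\hat{c}$ is the fitted `loc`, and $t_{n-1,\,1-\alpha/2}$ is the Student-$t$ quantile. For the mean of the shifted negated rewards $Y = -R - \hat{c}$, the interval $[L, U]$ is

$$
\exp\left( \hat{\mu} + \frac{\hat{\sigma}^2}{2} \pm t_{n-1,\,1-\alpha/2} \sqrt{\frac{\hat{\sigma}^2}{n} + \frac{\hat{\sigma}^4}{2(n-1)}} \right),
$$

and the interval for the mean reward follows by reversing the sign and the shift, which also swaps the bounds:

$$
\left[\, -U - \hat{c},\; -L - \hat{c} \,\right].
$$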

In [None]:
baseline_rewards = np.array(baseline_results_dict['rewards']) # Ensure the data is a NumPy array

alpha = 0.05 # significance level; the confidence level is 1 - alpha
n = len(baseline_rewards)
t_critical = stats.t.ppf(1 - alpha / 2, df=n-1)  # two-sided t critical value (95% confidence level)

# compute skewness
sample_skewness = stats.skew(baseline_rewards)
print('Reward distribution skewness:', round(sample_skewness, 5))

# 1. Normal-based CI (Large Sample Theory)
sample_mean = np.mean(baseline_rewards)
sample_std = np.std(baseline_rewards, ddof=1)
print('Standard error:', round(sample_std / np.sqrt(n), 5))

normal_CI = (sample_mean - t_critical * sample_std / np.sqrt(n), 
             sample_mean + t_critical * sample_std / np.sqrt(n))

# 2. Bootstrap CI
np.random.seed(1) 
bootstrap_samples = np.random.choice(baseline_rewards, size=(1000, n), replace=True)
bootstrap_means = np.mean(bootstrap_samples, axis=1)
bootstrap_CI = np.percentile(bootstrap_means, [2.5, 97.5])

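# 3. Log-Transformed CI
# loc=0.0 is only the starting guess for the optimiser; loc is still fitted (floc would fix it)
shape, loc, scale = stats.lognorm.fit(-baseline_rewards, loc=0.0)
stdized_rewards = (np.log(-baseline_rewards - loc) - np.log(scale)) / shape # data on the standard normal scale
fitted_mean = np.log(scale) # mean of log(-rewards - loc) under the fitted log-normal
fitted_std_estimate = np.sqrt(shape ** 2 / n + shape ** 4 / (2*(n - 1))) # standard error of fitted_mean + shape**2 / 2, based on Olsson paper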

lognorm_CI = (np.exp(fitted_mean + shape ** 2 / 2 - t_critical * fitted_std_estimate), 
              np.exp(fitted_mean + shape ** 2 / 2 + t_critical * fitted_std_estimate))
lognorm_CI = (-lognorm_CI[1] - loc, -lognorm_CI[0] - loc)

# print results
CI_results = [normal_CI, bootstrap_CI, lognorm_CI]
print(tabulate(CI_results, headers=[f'{int(100*(1-alpha))}% Confidence intervals', 'lower bound', 'upper bound'], 
               showindex=['normal CI', 'bootstrap CI', 'lognormal CI'], tablefmt='fancy_grid', floatfmt='.6f'))

lognorm_params = np.array([shape, loc, scale])
print(tabulate([lognorm_params], headers=['shape', 'loc', 'scale'], showindex=['lognormal fit'],
               tablefmt='fancy_grid', floatfmt='.5f'))

## Baseline log-normal fit

We inspect the log-normal fit visually. No statistical goodness-of-fit test is performed. One could, however, transform the data to the normal scale with the fitted parameters and test for normality with, e.g., the Anderson-Darling, Shapiro-Wilk, or Jarque-Bera test.
Note that performing these tests with parameters fitted from the data changes the distribution of the test statistic under the null hypothesis, so the standard p-values may be unreliable.

Alternatively, parametric bootstrapping could be performed with a chosen test, to obtain the empirical distribution of the test value under the null distribution with the fitted parameters.
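
A minimal sketch of such a parametric bootstrap is given below. It is not part of the analysis in the thesis. The Kolmogorov-Smirnov statistic is an arbitrary choice of test, the function name is hypothetical, and the data are synthetic stand-ins for the negated rewards. The example fixes the location with `floc=0` to keep the fit fast and stable; omitting it gives the three-parameter fit, which is slower.

```python
import numpy as np
import scipy.stats as stats


def lognorm_ks_bootstrap(x, n_boot=200, seed=0, **fit_kwargs):
    """Parametric-bootstrap p-value for a KS test of a fitted log-normal.

    fit_kwargs are forwarded to stats.lognorm.fit (e.g. floc=0 fixes the location).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    def ks_statistic(sample):
        params = stats.lognorm.fit(sample, **fit_kwargs)  # (shape, loc, scale)
        return stats.kstest(sample, 'lognorm', args=params).statistic, params

    observed, params = ks_statistic(x)
    # Simulate from the fitted model and refit on every replicate, so that the
    # reference distribution accounts for the parameter estimation step.
    simulated = np.empty(n_boot)
    for b in range(n_boot):
        sample = stats.lognorm.rvs(*params, size=x.size, random_state=rng)
        simulated[b], _ = ks_statistic(sample)
    p_value = (1 + np.sum(simulated >= observed)) / (n_boot + 1)
    return observed, p_value


# Synthetic stand-in for the negated rewards (assumption, not the real data)
data_rng = np.random.default_rng(1)
x = data_rng.lognormal(mean=0.0, sigma=0.5, size=300)
ks_stat, p_value = lognorm_ks_bootstrap(x, n_boot=100, floc=0)
print(f'KS statistic: {ks_stat:.4f}, bootstrap p-value: {p_value:.3f}')
```

Refitting the parameters on every simulated sample is the essential step: it reproduces the estimation effect that makes the tabulated p-values unreliable.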

In [None]:
# visual inspection of log-normal fit
plotting_utils.plot_baseline_lognormal_fit(baseline_rewards, shape, loc, scale, 
                                           figsize=(10, 9), date_id=date_id, save=False)

Judging from the graphs above, the three-parameter log-normal fit seems to reasonably model the distribution of negative rewards. The Q-Q plot indicates a lighter right tail and a more left-skewed distribution of negative rewards than the theoretical log-normal fit, meaning that the fit models higher values than observed. On the reward scale, this corresponds to lower rewards than what was actually observed. The lighter right tail of the negative reward distribution is also apparent from the boxplot. This assessment does not replace a formal goodness-of-fit test. However, the calculated confidence interval based on the log-normal fit very closely resembles the CIs from the other two methods.

---
# Training analysis

This section analyses the training phase of the Signature Q-Learning algorithm. We inspect individual training runs, the averaged training metrics across all runs, the observation-action trajectories at selected episodes, and the convergence of the first-observation value towards the mean episode reward.

## Single training run

Plot the training trajectory (reward, loss, cash, terminal inventory) of a selected run. Additionally, a closer look at the terminal inventory in the final episodes is shown to assess whether the agent learned to liquidate its position.

In [None]:
# select training run number for plotting
run_id = 3
plotting_utils.plot_training_run_results(training_results_dict, run_id, 
                                         figsize=None, date_id=date_id, save=True)

# a closer look at terminal inventory trajectory of the selected run
start, end = -1500, -1
plt.plot(training_results_dict[run_id]['terminal_inventory'][start:end])
# horizontal reference lines at zero and at +/-100 units of inventory
for level in (0, 100, -100):
    plt.axhline(level, color='black')
plt.show()

## Averaged training results

Plot the mean training metrics (reward, loss, cash, terminal inventory) averaged over all runs, together with ±1 standard deviation bands. When `save=True`, each subplot is also saved as an individual file in `../figures` with the `date_id` in the file name.
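
The mean and the ±1 standard deviation band can be computed as in the following sketch. It assumes that a metric is available as an array of shape `(n_runs, n_episodes)`; the helper name and the synthetic data are illustrative, and `plotting_utils.plot_mean_training_results` may compute the band differently.

```python
import numpy as np


def mean_std_band(per_run_series):
    """Return the per-episode mean and the mean -/+ 1 std across runs."""
    data = np.asarray(per_run_series, dtype=float)  # shape (n_runs, n_episodes)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return mean, mean - std, mean + std


# Synthetic rewards for 10 runs of 3000 episodes (assumption, not the real data)
rng = np.random.default_rng(0)
rewards = rng.normal(loc=-1.0, scale=0.1, size=(10, 3000))
mean, lower, upper = mean_std_band(rewards)
print(mean.shape, lower.shape, upper.shape)
```

The band can then be drawn with `plt.fill_between(np.arange(mean.size), lower, upper, alpha=0.3)` on top of `plt.plot(mean)`.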

In [None]:
plotting_utils.plot_mean_training_results(training_results_dict, figsize=None, 
                                          date_id=date_id, save=False)

## Observation-action histories

To gain a more qualitative insight into the policies based on the learned Q-functions, we plot the observation and action trajectories of specific training episodes.

In [None]:
run_ids = 'all'
episode_ids = [0, 999, 1999, 2999] 
plotting_utils.plot_inventory_action_histories(training_results_dict, run_ids, episode_ids, 
                                               figsize=(8,8), date_id=date_id, save=False)

## First observation value convergence

A key convergence diagnostic: compare the first-observation value $\hat{Q}(o_0, \cdot)$ provided by the learned Q-function at the start of each training episode with the mean episode reward. If both quantities converge towards the same value, this indicates that the Q-function approximation at the initial observation is consistent with the observed returns. Note that this does not guarantee convergence of the approximate Q-function $\hat{Q}$ to the true Q-function at all time steps, since only the value at the fixed starting observation is considered.

The first plot shows the trajectory of the first-observation value across training episodes (mean ± std over all runs), overlaid with the mean baseline reward. The second plot compares the mean episode reward and the mean first-observation value over a window of the final training episodes.
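
The comparison over the final training episodes can also be reduced to a single number, as in the sketch below. The helper name, the array layout `(n_runs, n_episodes)`, and the synthetic data are assumptions for illustration and do not reflect the actual structure of `training_results_dict`.

```python
import numpy as np


def first_obs_gap(rewards, first_obs_values, window=1000):
    """Compare mean reward and mean first-observation value over the last episodes."""
    rewards = np.asarray(rewards, dtype=float)[:, -window:]
    values = np.asarray(first_obs_values, dtype=float)[:, -window:]
    mean_reward = rewards.mean()
    mean_value = values.mean()
    return mean_reward, mean_value, abs(mean_value - mean_reward)


# Synthetic data (assumption): rewards scatter around -1, while the value
# estimate starts near 0 and decays towards -1 during training.
rng = np.random.default_rng(0)
n_runs, n_episodes = 5, 3000
rewards = rng.normal(-1.0, 0.2, size=(n_runs, n_episodes))
values = (-1.0 + np.exp(-np.arange(n_episodes) / 300.0)
          + rng.normal(0.0, 0.01, size=(n_runs, n_episodes)))

mean_reward, mean_value, gap = first_obs_gap(rewards, values, window=1000)
print(f'mean reward: {mean_reward:.4f}, mean first-obs value: {mean_value:.4f}, gap: {gap:.4f}')
```

A small gap only indicates consistency at the starting observation, in line with the caveat above.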

In [None]:
mean_baseline_reward = baseline_stats[0]
plotting_utils.plot_first_observation_values(training_results_dict, run_ids='all', mean=True, 
                                             std=True, figsize=(8,3.2), line=mean_baseline_reward,
                                             date_id=date_id, save=False)

In [None]:
mean_baseline_reward = baseline_stats[0]
plotting_utils.plot_reward_vs_first_obs_value(training_results_dict, episode_window=(-1000,-1), 
                                              figsize=(8,3), line=mean_baseline_reward,
                                              date_id=date_id, save=False)

---
# Testing analysis

This section evaluates the learned policies by running the final Q-functions on unseen environment episodes. Summary statistics, inventory and action trajectories, and reward distributions are reported and compared to the baseline policy.

## Test statistics

The following statistics are reported for each test run:
- Mean, standard deviation, and median of episode rewards
- Mean and standard deviation of terminal inventory
- Minimum and maximum terminal inventory
- Percentage of episodes with terminal inventory in $[-\rho/2,\, \rho/2]$ and $[-\rho,\, \rho]$, with $\rho$ being 10 % of maximum inventory
- First-observation value of the Q-function at the last test episode

In [None]:
test_stats = []
rho = 50 # 10 % of maximum inventory
for test_run in test_results_dict.values():
    rewards = np.asarray(test_run['rewards'])
    terminal_inv = np.asarray(test_run['terminal_inventories'])
    run_stats = [
        np.mean(rewards), # mean reward
        np.std(rewards), # std reward
        np.median(rewards), # median reward
        np.mean(terminal_inv), # mean terminal inventory
        np.std(terminal_inv), # std terminal inventory
        int(np.min(terminal_inv)), # min terminal inventory
        int(np.max(terminal_inv)), # max terminal inventory
        np.mean(np.abs(terminal_inv) <= rho // 2), # share in [-rho/2, rho/2], as a fraction
        np.mean(np.abs(terminal_inv) <= rho), # share in [-rho, rho], as a fraction
        test_run['first_obs_values'][-1], # first-observation value at the last test episode
    ]
    test_stats.append(run_stats)

columns = ['\nRun', 'Mean\nreward', 'Std\nreward', 'Median\nreward','Mean terminal\ninventory', 
           'Std terminal\ninventory', 'Min terminal\ninventory', 'Max terminal\ninventory', 
           'Pct in\n[-rho/2,rho/2]', 'Pct in\n[-rho,rho]', 'First obs\nvalue']
print(tabulate(test_stats, headers=columns, showindex=test_results_dict.keys(), tablefmt='simple', floatfmt='.5f'))

print(f'\nmean of mean rewards:    {round(np.mean([run_stats[0] for run_stats in test_stats]), 5)}' \
      f'\nmean of median rewards:  {round(np.mean([run_stats[2] for run_stats in test_stats]), 5)}')


## Test plots

### Single test run and episode

Plot rewards, terminal inventories, actions, and inventories for a selected test run and episode.

In [None]:
run_id = 0
episode_id = -1
plotting_utils.plot_test_run_results(test_results_dict, run_id, episode_id, ma_window=10, 
                                     figsize=(6, 4), date_id=date_id, save=False)

### Inventory and action trajectories across test runs

Plot the step-by-step inventory and action trajectories for all test episodes of the selected runs. These plots give a visual impression of how consistently the learned policies liquidate inventory across different environment seeds.

In [None]:
# trajectories of inventories for all test runs, takes a while to plot
runs = list(range(10)) # which runs to plot
plotting_utils.plot_test_inventory_trajectories(test_results_dict, runs=runs, figsize=(9, 14), 
                                                date_id=date_id, save=False)

In [None]:
# trajectories of actions for all test runs, takes a while to plot
runs = list(range(10)) # which runs to plot
plotting_utils.plot_test_action_trajectories(test_results_dict, figsize=(8, 14), runs=runs, 
                                             date_id=date_id, save=False)

### Reward box-plots

Box-plots comparing the reward distributions of each test run with the baseline reward distribution. This gives a direct visual comparison of the learned policy's performance against the benchmark.

In [None]:
plotting_utils.plot_test_rewards_boxplot(test_results_dict, baseline_results_dict, 
                                         date_id=date_id, save=False)