# Evaluate the demo experiments

This notebook demonstrates how the experiment results can be loaded and analyzed.

## Prerequisites

You need to download a week of Puffer data and run the demo experiments first:

```bash
./demo.py download
./demo.py run-demo
```

The demo contains an evaluation and a replay. The evaluation is rather quick
(less than an hour, depending on the machine), but the replay can take several
hours due to the required model training.

If you want to run only the evaluation or only the replay, use the respective command:

```bash
./demo.py run-demo analysis
./demo.py run-demo replay
```

See `README.md` for additional CLI options such as verbosity, running experiments in parallel, or scheduling jobs with Slurm.

## Imports and Setup

In [None]:
import sys
sys.path.append('..')  # To allow imports from parent directory
%load_ext autoreload
%autoreload 2

import experiment_helpers as eh
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Plot import and configuration.
%matplotlib inline

# Set theme and parameters
eh.plot_utils.setup(latex=False)

# Load the config, including the local config.
from config import Config as LocalConfig  # Load from file: ../config.py
from experiments.puffer.config import PufferExperimentConfig
config = PufferExperimentConfig.with_updates(LocalConfig)

# Puffer Data Analysis

In the following, we show the evaluation results obtained from running the demo.

For more in-depth visualization and to recreate the paper plots, go to the [puffer-analysis notebook](./puffer-analysis.ipynb).

In [None]:
# Load the evaluation results.
experiment_directory = config.output_directory / "run-demo/analysis"

_frames = []
for daydir in experiment_directory.iterdir():
    _frames.append(eh.data.read_csv(daydir / "results.csv.gz"))
results = pd.concat(_frames, ignore_index=True)

# Have a peek
results

The data is stored in [long (also called narrow) format][1]. The context of each value is given by:
- `day`: The evaluated day.
- `abr`: The evaluated algorithm.
- `selection`: We evaluate three subsets of data.
  - `all`: All data.
  - `stalled`: Only results of sessions experiencing stalls.
  - `unstalled`: Only results of sessions not experiencing stalls.
- `variable`: Which aspect of the data is evaluated. Some important variables are:
  - `ssim`: Image quality of all video chunks.
  - `stream`: Time info for the stream, e.g. playtime and rebuffering time.
- `metric`: For each variable, we compute several aggregate statistics. E.g. for SSIM, we evaluate mean, std, and several percentiles.
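
To make the long format concrete, here is a minimal sketch on a toy table. The values and day labels are made up for illustration (they are not demo results), and only a subset of the columns listed above is used. It shows how `pivot` turns one-row-per-value data into a wide table.

```python
import pandas as pd

# Made-up values for illustration only; they are not results from the demo.
long_df = pd.DataFrame(
    {
        "day": ["day-1", "day-1", "day-2", "day-2"],
        "abr": ["fugu-feb", "memento", "fugu-feb", "memento"],
        "variable": ["ssim"] * 4,
        "metric": ["mean"] * 4,
        "value": [0.90, 0.89, 0.92, 0.91],
    }
)

# Long format: one row per (day, abr, variable, metric) combination.
# Pivoting moves the context columns into the column index (wide format).
wide_df = long_df.pivot(
    index="day",
    columns=["variable", "metric", "abr"],
    values="value",
)
print(wide_df)
```

Each row of `wide_df` is one day, and each column is one combination of variable, metric, and algorithm.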

As an example, let's look at the average SSIM, the fraction of time spent stalled, and the prediction scores per algorithm and day. We use all sessions and compare Fugu (static, labeled `fugu-feb`) with the default version of Memento.

[1]: https://en.wikipedia.org/wiki/Wide_and_narrow_data

In [None]:
aggregated = (
    results
    # Select data.
    .query("(selection == 'all') and (abr in ('fugu-feb', 'memento'))")
    .drop(columns=["selection"])
    .query(
        "(variable == 'ssim' and metric == 'mean') or "
        "(variable == 'stream' and metric == 'stalled') or "
        "(variable == 'logscore' and metric == 'mean') or "
        "(variable == 'logscore' and metric == '0.01')"
    )
    # Reshape and round values for nicer visualization
    .pivot(
        index=["day"],
        columns=["variable", "metric", "abr"],
        values="value",
    )
    .round(2)
)
display(aggregated)
display(aggregated.mean().to_frame("Mean").T)

For an almost on-par image quality (on average 0.5% worse), Memento spends a smaller fraction of streamtime stalled than Fugu (on average 18% less).
The prediction score is slightly worse for Memento on average and significantly better at the tail.
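
Percentages like the ones above can be read as relative changes with respect to a baseline. The following is a minimal sketch of one plausible way to compute such a number. The helper `relative_change` and the input values are hypothetical, and the notebook may compute its percentages differently (for example, by averaging per-day changes).

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return the change of `candidate` relative to `baseline` as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


# Made-up numbers for illustration only; they are not results from the demo.
example_baseline = 0.020   # e.g. fraction of time stalled for the baseline
example_candidate = 0.015  # e.g. fraction of time stalled for the candidate

change = relative_change(example_baseline, example_candidate)
print(f"{change:+.0%}")
```

A negative value means the candidate is lower than the baseline, which is desirable for stalls and undesirable for SSIM.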

Both ABRs use the same algorithm and ML model architecture, but Memento selects training samples to cover the whole sample space.
As a result, we are more robust to rare cases, which shows in the reduction of stalls. We pay a small price in image quality, as we learn to predict the most common cases a little worse.


# Puffer Data Replay

In this experiment, we replay the data from the demo week, to train a Fugu model from scratch using samples selected by Memento.

For each day of the week, we feed the data into Memento, which updates a training data set of at most one million samples.

A model is then trained on this set and evaluated on the data of the next day.
The Fugu (static) model is also evaluated on this data as a comparison point.

Note that during this experiment, we train and evaluate every day to get a close look at the samples selected at every step. In deployment, Memento would not retrain if the sample set does not change significantly.
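
The replay loop described above can be sketched as follows. This is only an illustration of the control flow under simplifying assumptions: all function names are hypothetical, the "model" is just a mean, and `keep_most_recent` is a trivial stand-in that is not Memento's sample selection.

```python
from statistics import mean
from typing import Callable, List, Sequence

Samples = List[float]


def replay(
    days: Sequence[Samples],
    update_samples: Callable[[Samples, Samples], Samples],
    train: Callable[[Samples], float],
    evaluate: Callable[[float, Samples], float],
) -> List[float]:
    """Update the sample set with each day, retrain, and score on the next day."""
    samples: Samples = []
    scores: List[float] = []
    for today, next_day in zip(days, days[1:]):
        samples = update_samples(samples, today)  # bounded training set
        model = train(samples)                    # retrain from scratch
        scores.append(evaluate(model, next_day))  # evaluate on the next day
    return scores


def keep_most_recent(samples: Samples, new: Samples, max_samples: int = 4) -> Samples:
    # Stand-in selection: keep only the newest samples (NOT Memento's method).
    return (samples + new)[-max_samples:]


def train_mean(samples: Samples) -> float:
    # Toy "model": predict the mean of the training samples.
    return mean(samples)


def mean_abs_error(model: float, data: Samples) -> float:
    # Toy evaluation: mean absolute error of the constant prediction.
    return mean(abs(x - model) for x in data)


# Made-up data for three "days".
toy_days = [[1.0, 2.0, 3.0], [2.0, 4.0], [6.0]]
toy_scores = replay(toy_days, keep_most_recent, train_mean, mean_abs_error)
print(toy_scores)
```

Because each model is scored on the following day, replaying N days yields N - 1 evaluations in this sketch.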

We replay data to analyze the impact of various parameters on the quality of Memento. For a more in-depth analysis and to recreate the figures from the paper, see the [puffer-replay notebook](./puffer-replay.ipynb).

In [None]:
replay_directory = config.output_directory / "run-demo/replay"
replay_results = eh.data.read_csv(replay_directory / "results.csv.gz")

# Have a look at the raw data.
display(replay_results.head())

The results contain a range of aggregation metrics (in the `metric` column) for different variables (in separate columns).

As an example, we plot the mean and tail prediction score over iterations for both the Memento-trained model and the Fugu (static) model.

In [None]:
selected_results = (
    replay_results
    .query("metric in ('mean', '0.01')")
    .melt(id_vars=["iteration", "metric"],
          value_vars=["logscore", "fugu_feb_logscore"])
)
selected_results.head()

In [None]:
sns.relplot(
    data=selected_results,
    kind="line",
    x="iteration",
    y="value",
    style="metric",
    hue="variable",
).set(
    xlabel="Days replayed",
    ylabel="Logscore",
)
plt.show()

We can see that the mean prediction scores are almost equal, with Fugu (static) being slightly better.

At the tail, however, Memento is significantly better, as it has seen more rare cases during training, even after only a couple of days.

Finally, let's have a look at `stats.csv.gz`, which contains additional info about the retraining decision process.

In the demo, we retrain every day to see the effects of sample selection.
In deployment, we only retrain if the estimated coverage of the sample space increases by more than 10%. Let's plot this value.

In [None]:
stats = eh.data.read_csv(replay_directory / "stats.csv.gz")
display(stats.head())

In [None]:
sns.relplot(
    data=stats,
    kind="line",
    x="iteration",
    y="coverage_increase"
).set(
    xlabel="Days replayed",
    ylabel="Relative coverage increase",
)
plt.show()

Note that day 0 is excluded. Without previous samples, we cannot estimate a difference in coverage and always retrain.

We can see that the coverage increase is above 0.1 only for day 1. Memento would therefore have retrained on days 0 and 1 and stopped afterwards: newly observed samples are already covered by the sample selection, so retraining would be unlikely to yield improvements.
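
The retraining rule discussed above can be sketched as a small function. The function name, the constant, and the example values are hypothetical and only illustrate the 0.1 threshold and the special case for day 0. This is not Memento's actual implementation.

```python
from typing import Optional

RETRAIN_THRESHOLD = 0.1  # relative coverage increase of 10%, as described above


def should_retrain(
    iteration: int,
    coverage_increase: Optional[float],
    threshold: float = RETRAIN_THRESHOLD,
) -> bool:
    """Decide whether to retrain after the given iteration (day)."""
    # Without previous samples there is no coverage estimate: always retrain.
    if iteration == 0 or coverage_increase is None:
        return True
    return coverage_increase > threshold


# Made-up coverage increases per day; they are not results from the demo.
example_increase = {0: None, 1: 0.25, 2: 0.04, 3: 0.01}
decisions = {day: should_retrain(day, inc) for day, inc in example_increase.items()}
print(decisions)
```

With the `stats` frame loaded above, a comparable check would be `stats["coverage_increase"] > 0.1`.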