In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# The setup

Imagine you have a research question, a dataset, and a novel model that you are sure will do better than the existing ones.
How do you robustly compare it to your baselines, so that you have convincing results to present?

Let us create some mock data to represent the problem, and analyze the use of statistical tests and confidence intervals in this context.

## Simulating an experiment

We have a dataset, two baselines (B1 and B2), and our proposed model (M).

We split the dataset into a training set and a test set, train the models, and measure the Mean Squared Error (MSE) of each.
The measured performance of a model is a random variable. Here we skip training altogether and simply sample each MSE
from an underlying distribution, so that we have some results to play with.

In [15]:
# Fix the seed so that this first "experiment" is reproducible.
np.random.seed(42)

# One simulated run: draw a single MSE per model from its own normal distribution.
results = []
results.append({"Model": "B1", "MSE": np.random.normal(loc=0.87, scale=0.021)})
results.append({"Model": "B2", "MSE": np.random.normal(loc=0.85, scale=0.034)})
results.append({"Model": "M", "MSE": np.random.normal(loc=0.79, scale=0.10)})
results = pd.DataFrame(results)
results

Unnamed: 0,Model,MSE
0,B1,0.880431
1,B2,0.845299
2,M,0.854769


Notice that if we repeat the procedure, we get different results, which might even flip any ranking we derive from them.
This is not just an artifact of this notebook but a real feature of research projects: if you repeat the same experiment with a different
train/test split, a different initialization of the parameters, or possibly even just a different seed, you will get a different value of the metric.

In [16]:
# Same sampling as above, but the seed is not reset, so the draws differ.
results = []
results.append({"Model": "B1", "MSE": np.random.normal(loc=0.87, scale=0.021)})
results.append({"Model": "B2", "MSE": np.random.normal(loc=0.85, scale=0.034)})
results.append({"Model": "M", "MSE": np.random.normal(loc=0.79, scale=0.10)})
results = pd.DataFrame(results)
results

Unnamed: 0,Model,MSE
0,B1,0.901984
1,B2,0.842039
2,M,0.766586
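
To get a feel for how often a single run can mislead, here is a minimal sketch that repeats the sampling above many times and counts how often each model ends up with the lowest MSE. It assumes the same means and standard deviations as the cells above. `MODELS`, `simulate_runs`, the number of repetitions, and the seed are illustrative names and choices, not part of the original experiment.

```python
import numpy as np
import pandas as pd

# Assumed (mean, std) of the MSE for each model, copied from the cells above.
MODELS = {"B1": (0.87, 0.021), "B2": (0.85, 0.034), "M": (0.79, 0.10)}


def simulate_runs(n_runs, seed=0):
    """Draw one MSE per model for each of n_runs independent repetitions."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            name: rng.normal(loc=mu, scale=sigma, size=n_runs)
            for name, (mu, sigma) in MODELS.items()
        }
    )


runs = simulate_runs(n_runs=10_000)

# Lower MSE is better, so the "winner" of a run is the column with the smallest value.
winners = runs.idxmin(axis=1)

# Fraction of runs in which each model comes out on top.
win_share = winners.value_counts(normalize=True).reindex(list(MODELS), fill_value=0.0)
print(win_share)
```

Under these assumed distributions, M has the lowest expected MSE but also the largest spread, so it does not win every run. A single evaluation can therefore rank a baseline above it.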


## Repeated evaluations

To solve 