Like any good analysis, I am doing this in a [Jupyter Notebook](https://jupyter.org/). First we load the CSV data:

In [None]:
import pandas

baseline = pandas.read_csv("baseline.csv")["response_time"]
task_handler = pandas.read_csv("task_handler.csv")["response_time"]

We can then plot the distribution of response times, to see how they compare:

In [None]:
from matplotlib import pyplot

pyplot.hist(baseline, density=True, bins=1000, label="Baseline", alpha=0.5)
pyplot.hist(task_handler, density=True, bins=1000, label="Task Handler", alpha=0.5)
pyplot.ylabel("Request Density")
pyplot.xlabel("Response Time")
pyplot.legend(loc="upper right")
pyplot.xlim(0, 25)

pyplot.show()

From the plot we can see that the response times appear to have improved, with the median response time dropping by almost 30%:

In [None]:
# Relative change in the median; a negative value means the task handler is faster
(task_handler.median() - baseline.median()) / baseline.median()

Now we need to show that the change is unlikely to be random chance. For that we are going to use statistics, in this case the unpaired t-test, which, like all good statistics, comes from an interesting source: Guinness! We will be using the [Welch's t-test](https://en.wikipedia.org/wiki/Welch's_t-test) variant, as it does not assume that the two samples have equal variance and it copes well with a different number of samples in each group, which we will have.
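To make the test a little less of a black box, here is a minimal sketch of Welch's t statistic computed by hand. The `welch_t` helper and the synthetic `sample_a` and `sample_b` arrays are made up for illustration and are not the benchmark data; the result is compared against `scipy.stats.ttest_ind` as a sanity check.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Return Welch's t statistic and degrees of freedom (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean: sample variance divided by sample size
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df


# Synthetic samples with different sizes and spreads (not the real measurements)
rng = np.random.default_rng(0)
sample_a = rng.normal(10.0, 2.0, size=500)
sample_b = rng.normal(9.5, 3.0, size=800)

t, df = welch_t(sample_a, sample_b)
p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value from the t distribution

reference = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(t, reference.statistic)
print(p, reference.pvalue)
```

Each group contributes its own variance term, which is why the two groups do not need to share a variance or a sample count.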

The statistic we care about is called the p-value, which we calculate like so:

In [None]:
from scipy import stats

# equal_var=False selects Welch's t-test rather than Student's t-test
stats.ttest_ind(baseline, task_handler, equal_var=False).pvalue

Now what does this mean? A p-value is a number between `0` and `1` that helps us decide whether the results are likely due to chance or suggest a real effect.

Think of it this way: if we assume there is no real difference or effect (the null hypothesis), the p-value tells us how likely it is that we would see results at least as extreme as the ones we observed through random chance alone.
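One way to build intuition for this is a permutation test, which is a different technique from the t-test used in this post: shuffle the group labels many times and count how often chance alone produces a difference as large as the observed one. The sketch below uses synthetic data, and `permutation_p_value` is a hypothetical helper written for illustration.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Count the observed arrangement too, so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Synthetic data: one pair with identical means, one pair with a real shift
rng = np.random.default_rng(1)
same = permutation_p_value(rng.normal(10, 2, 300), rng.normal(10, 2, 300))
different = permutation_p_value(rng.normal(10, 2, 300), rng.normal(9, 2, 300))
print(same, different)
```

When the two groups really come from the same distribution, shuffled differences regularly match the observed one; when there is a real shift, they almost never do, and the estimated p-value is small.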

A p-value below `0.05` is often considered "statistically significant", meaning that if there were no real effect, random variation alone would produce results this extreme less than 5% of the time.

For our data the value comes out as 0 (probably because the true value is too small to represent as a floating-point number), so we are well below `0.05`, which is strong evidence that performance has improved.
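As a rough illustration of how a p-value can end up as exactly 0, the sketch below uses a made-up test statistic and the standard normal distribution as a stand-in for the t distribution (both are assumptions, not values from this analysis). The tail probability is too small for a 64-bit float, but its logarithm is still representable.

```python
import numpy as np
from scipy import stats

# Hypothetical, very large test statistic (not taken from the benchmark data)
z = 50.0

# The tail probability underflows to exactly 0.0 in double precision
tail = stats.norm.sf(z)

# Working in log space keeps the magnitude representable
log_tail = stats.norm.logsf(z)
log10_tail = log_tail / np.log(10)

print(tail)
print(log10_tail)
```

So a reported 0 does not mean the probability is literally zero, only that it is smaller than the smallest number a float can hold.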