# Exercises for Section 3.3 Introduction to hypothesis testing

This notebook contains the solutions to the exercises
from [Section 3.3 Introduction to hypothesis testing]()
in the **No Bullshit Guide to Statistics**.

### Notebook setup

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Pandas setup
pd.set_option("display.precision", 2)

In [3]:
# Plot helper functions
from plot_helpers import savefigure

In [4]:
# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({'figure.figsize': (5, 1.6)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# Useful colors
snspal = sns.color_palette()
blue, orange, purple = snspal[0], snspal[1], snspal[4]

# High-resolution please
%config InlineBackend.figure_format = 'retina'


### Estimator functions defined in Section 3.1

In [5]:
def mean(sample):
    return sum(sample) / len(sample)

def var(sample):
    xbar = mean(sample)
    sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
    return sumsqdevs / (len(sample)-1)

def std(sample):
    s2 = var(sample)
    return np.sqrt(s2)

def dmeans(xsample, ysample):
    dhat = mean(xsample) - mean(ysample)
    return dhat
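
The `var` and `std` functions above divide by `len(sample)-1`, so they correspond to NumPy's estimators with `ddof=1`, not to the NumPy defaults. Here is a quick sanity check; the function definitions are repeated so the snippet runs on its own, and the data is the scooter sample `ds` from the last exercise in this notebook.

```python
import numpy as np

def mean(sample):
    return sum(sample) / len(sample)

def var(sample):
    xbar = mean(sample)
    sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
    return sumsqdevs / (len(sample)-1)

def std(sample):
    return np.sqrt(var(sample))

# scooter ranges from the "Electric scooters" exercise
ds = [43.5, 44.8, 42.9, 46.2, 44.1, 43.8, 45.5,
      43.4, 44.9, 42.7, 44.6, 47.0, 43.0, 44.2, 45.3]

# agreement with NumPy when using the n-1 denominator (ddof=1)
same_mean = np.isclose(mean(ds), np.mean(ds))
same_var = np.isclose(var(ds), np.var(ds, ddof=1))
same_std = np.isclose(std(ds), np.std(ds, ddof=1))

# NumPy's default (ddof=0) divides by n and gives a smaller value
same_as_default = np.isclose(var(ds), np.var(ds))
```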

## Exercises

### Mean test on batch 05 (estimated variance)

In [6]:
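# Requires the `kombucha` DataFrame and the `bootstrap_test_mean` helper,
# neither of which is defined earlier in this notebook.
ksample05 = kombucha[kombucha["batch"]==5]["volume"]
bootstrap_test_mean(ksample05, mu0=1000)[1]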

NameError: name 'kombucha' is not defined

The test correctly rejects the null hypothesis $\mu = 1000$ and flags batch 05 as irregular.

In [None]:
sns.kdeplot(ksample05)
ksample05.describe()
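
The internals of `bootstrap_test_mean` are not shown in this notebook, so here is a minimal sketch of one common way to run a bootstrap test for the mean: shift the sample so that its mean equals `mu0`, resample with replacement, and count how often the bootstrap means land at least as far from `mu0` as the observed mean. The function name `bootstrap_test_mean_sketch`, its parameters `B` and `seed`, and the data `fake_volumes` are all made up for illustration; the data is a synthetic stand-in, not the kombucha dataset.

```python
import numpy as np

def bootstrap_test_mean_sketch(sample, mu0, B=10000, seed=42):
    """Two-sided bootstrap test of H0: mean == mu0 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    obsmean = sample.mean()
    # shift the sample so that its mean equals mu0 (i.e., H0 holds)
    shifted = sample - obsmean + mu0
    # draw B bootstrap samples of size n (with replacement) and average each
    idx = rng.integers(0, n, size=(B, n))
    bmeans = shifted[idx].mean(axis=1)
    # proportion of bootstrap means at least as extreme as the observed mean
    obsdev = abs(obsmean - mu0)
    return np.mean(np.abs(bmeans - mu0) >= obsdev)

# synthetic stand-in data (NOT the kombucha dataset)
rng = np.random.default_rng(0)
fake_volumes = rng.normal(loc=1010, scale=10, size=40)

pvalue = bootstrap_test_mean_sketch(fake_volumes, mu0=1000)
```

A small `pvalue` means the observed mean would be unusual if the true mean were `mu0`.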

### Cohen's d via bootstrap

### Electric scooters

An electric scooter manufacturer launched a new model with a large battery and claims that the scooter's range is 45 km. A distributor wants to put this claim to the test and has measured the maximum range of 15 different scooters to see how long the battery lasts. We want to know if the data supports the manufacturer's claim.

They obtained the data $\texttt{ds} = [43.5, 44.8, 42.9, 46.2, 44.1, 43.8, 45.5, 43.4, 44.9, 42.7, 44.6, 47.0, 43.0, 44.2, 45.3]$.

Assume the theoretical distribution is normal, $D_0 \sim \mathcal{N}(45, \sigma_{D_0})$,
where $\sigma_{D_0}$ is unknown, so we'll estimate it using the sample standard deviation: $\sigma_{D_0} \approx \texttt{dstd} = \texttt{std(ds)}$.


In [None]:
# sample
ds = [43.5, 44.8, 42.9, 46.2, 44.1, 43.8, 45.5,
      43.4, 44.9, 42.7, 44.6, 47.0, 43.0, 44.2, 45.3]

In [None]:
from stats_helpers import mean, std
dbar = mean(ds)
dstd = std(ds)

In [None]:
from scipy.stats import norm
muD0 = 45
# model for individual scooter ranges under H0 (sigma estimated by dstd)
Dbar0 = norm(muD0, dstd)

In [None]:
from stats_helpers import gen_sampling_dist

np.random.seed(42)
dbars0 = gen_sampling_dist(Dbar0, estfunc=mean, n=15)

In [None]:
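# two-sided p-value: proportion of simulated means that are
# at least as far from muD0 as the observed mean dbar
obsdev = abs(dbar - muD0)
tails = [v for v in dbars0 if abs(v-muD0) >= obsdev]
pvalue = len(tails) / len(dbars0)
pvalue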
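
For readers who don't have `stats_helpers` installed, the same calculation can be sketched using only NumPy and SciPy. This assumes that `gen_sampling_dist` draws many samples of size `n` from the model and applies `estfunc` to each one; the helper's actual implementation and its default number of simulations are not shown here, so `N = 10000` is an assumption and the result need not match the helper's output exactly.

```python
import numpy as np
from scipy.stats import norm

ds = [43.5, 44.8, 42.9, 46.2, 44.1, 43.8, 45.5,
      43.4, 44.9, 42.7, 44.6, 47.0, 43.0, 44.2, 45.3]
n = len(ds)
dbar = np.mean(ds)
dstd = np.std(ds, ddof=1)   # sample standard deviation (n-1 denominator)
muD0 = 45

# model for individual ranges under H0
rvD0 = norm(muD0, dstd)

# simulate N samples of size n under H0 and compute the mean of each
N = 10000                   # assumed number of simulations
samples = rvD0.rvs(size=(N, n), random_state=42)
dbars0_sketch = samples.mean(axis=1)

# two-sided p-value
obsdev = abs(dbar - muD0)
pvalue_sketch = np.mean(np.abs(dbars0_sketch - muD0) >= obsdev)
```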

In [None]:
from stats_helpers import simulation_test_mean

np.random.seed(42)
simulation_test_mean(ds, mu0=muD0, sigma0=dstd)
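
As a cross-check on the simulation, the p-value can also be approximated with a formula. If we treat `dstd` as the true standard deviation, the sample mean under $H_0$ is normally distributed with mean 45 and standard error `dstd / sqrt(n)`, so the two-sided p-value is twice the tail probability beyond the observed $z$-score. This is only a sketch under that assumption: it ignores the extra uncertainty from estimating $\sigma_{D_0}$ from the sample.

```python
import numpy as np
from scipy.stats import norm

ds = [43.5, 44.8, 42.9, 46.2, 44.1, 43.8, 45.5,
      43.4, 44.9, 42.7, 44.6, 47.0, 43.0, 44.2, 45.3]
n = len(ds)
dbar = np.mean(ds)
dstd = np.std(ds, ddof=1)   # sample standard deviation (n-1 denominator)
muD0 = 45

# standard error of the mean and standardized observed deviation
se = dstd / np.sqrt(n)
z = (dbar - muD0) / se

# two-sided p-value from the standard normal distribution
pvalue_formula = 2 * norm.sf(abs(z))
```

The simulation-based p-value should land close to `pvalue_formula`, up to simulation noise.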