# Exercises for Section 3.2 Confidence intervals

This notebook contains the solutions to the exercises
from [Section 3.2 Confidence intervals]()
in the **No Bullshit Guide to Statistics**.

### Notebook setup

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Estimator functions defined in Section 3.1

In [2]:
def mean(sample):
    return sum(sample) / len(sample)

def var(sample):
    xbar = mean(sample)
    sumsqdevs = sum([(xi-xbar)**2 for xi in sample])
    return sumsqdevs / (len(sample)-1)

def std(sample):
    s2 = var(sample)
    return np.sqrt(s2)

def dmeans(xsample, ysample):
    dhat = mean(xsample) - mean(ysample)
    return dhat
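As a quick sanity check, the hand-written estimators can be compared with NumPy's built-in versions. The sketch below repeats the `mean` and `var` definitions so it runs on its own, and uses a made-up toy sample (an assumption for illustration, not one of the book's datasets). The `ddof=1` option asks NumPy for the same $n-1$ denominator used in `var`.

```python
import numpy as np

def mean(sample):
    return sum(sample) / len(sample)

def var(sample):
    xbar = mean(sample)
    sumsqdevs = sum((xi - xbar)**2 for xi in sample)
    return sumsqdevs / (len(sample) - 1)

# hypothetical toy sample, used only for this check
toy = [2.0, 4.0, 4.0, 5.0, 7.0]

custom_var = var(toy)
numpy_var = np.var(toy, ddof=1)    # ddof=1 selects the n-1 denominator
custom_std = np.sqrt(custom_var)
numpy_std = np.std(toy, ddof=1)

print(custom_var, numpy_var)
print(custom_std, numpy_std)
```

Note that `np.var` and `np.std` default to `ddof=0` (division by $n$), so the two approaches only agree when `ddof=1` is passed explicitly.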

## Exercises

### Exercise 3.17

Compute a 90% confidence interval for the population mean
based on the sample from Batch 04 of the `kombucha` dataset.

In [3]:
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample04 = kombucha[kombucha["batch"]==4]["volume"]

#### a) analytical approximation

In [4]:
from scipy.stats import t as tdist

n04 = ksample04.count()
kbar04 = mean(ksample04)
seKbar04 = std(ksample04) / np.sqrt(n04)

t_l = tdist(df=n04-1).ppf(0.05)
t_u = tdist(df=n04-1).ppf(0.95)

[kbar04 + t_l*seKbar04, kbar04 + t_u*seKbar04]

[1001.7416639464577, 1005.9253360535422]
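The same steps can be wrapped in a small reusable function. This is only a sketch: the name `ci_mean_sketch` and the toy volumes are hypothetical and do not come from the book or from `ministats`. The parameter `alpha` is the total tail probability, so `alpha=0.1` corresponds to a 90% confidence interval.

```python
import numpy as np
from scipy.stats import t as tdist

def ci_mean_sketch(sample, alpha=0.1):
    """Hypothetical helper: (1-alpha) t-based CI for the population mean."""
    x = np.asarray(sample, dtype=float)
    n = len(x)
    xbar = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)          # estimated standard error
    t_l = tdist(df=n-1).ppf(alpha/2)         # lower critical value (negative)
    t_u = tdist(df=n-1).ppf(1 - alpha/2)     # upper critical value
    return [xbar + t_l*se, xbar + t_u*se]

# made-up volumes for illustration, not the Batch 04 data
toy = [1003.1, 998.7, 1005.4, 1001.9, 1004.2, 1000.8]
ci90 = ci_mean_sketch(toy, alpha=0.1)
ci95 = ci_mean_sketch(toy, alpha=0.05)
print(ci90)
print(ci95)
```

Raising the confidence level from 90% to 95% widens the interval, since the critical values move further into the tails.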

#### b) bootstrap estimation

In [5]:
from ministats import gen_boot_dist

np.random.seed(42)
kbars04_boot = gen_boot_dist(ksample04, estfunc=mean)
[np.quantile(kbars04_boot, 0.05), np.quantile(kbars04_boot, 0.95)]

[1001.8019875000002, 1005.8771124999998]
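The internals of `gen_boot_dist` are not shown in this notebook. The sketch below is an assumption about what such a function could look like, not the actual `ministats` code: it resamples the data with replacement `B` times and evaluates the estimator on each resample. The name `boot_dist_sketch`, the parameters `B` and `seed`, and the toy data are all hypothetical.

```python
import numpy as np

def boot_dist_sketch(sample, estfunc, B=5000, seed=None):
    """Hypothetical stand-in for a bootstrap-distribution generator."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    stats = np.empty(B)
    for i in range(B):
        # resample n observations with replacement from the original sample
        bsample = rng.choice(sample, size=n, replace=True)
        stats[i] = estfunc(bsample)
    return stats

# made-up volumes for illustration, not the Batch 04 data
toy = [1003.1, 998.7, 1005.4, 1001.9, 1004.2, 1000.8]
boot_means = boot_dist_sketch(toy, estfunc=np.mean, B=2000, seed=42)

# percentile bootstrap interval: 5th and 95th percentiles give a 90% CI
ci90 = [np.quantile(boot_means, 0.05), np.quantile(boot_means, 0.95)]
print(ci90)
```

The percentile interval takes its endpoints directly from the bootstrap distribution, so it requires no distributional formula for the estimator.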

### Exercise 3.18

Calculate a 90% confidence interval for the population variance
based on the sample from Batch 05 of the `kombucha` dataset.

In [6]:
kombucha = pd.read_csv("datasets/kombucha.csv")
ksample05 = kombucha[kombucha["batch"]==5]["volume"]

#### a) analytical approximation

In [7]:
n05 = ksample05.count()
kvar05 = var(ksample05)

from scipy.stats import chi2
x2_l = chi2(df=n05-1).ppf(0.05)
x2_u = chi2(df=n05-1).ppf(0.95)

[(n05-1)*kvar05/x2_u, (n05-1)*kvar05/x2_l]

[30.692160944915106, 65.18443858816488]
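The expression used in the cell above can be motivated as follows, assuming the volumes are normally distributed. In that case the quantity $(n-1)S^2/\sigma^2$ follows a chi-square distribution with $n-1$ degrees of freedom, and inverting the probability statement for $\sigma^2$ gives the interval. Here $\chi^2_{\ell}$ and $\chi^2_{u}$ denote the 0.05 and 0.95 quantiles (`x2_l` and `x2_u` in the code), $s^2$ is `kvar05`, and $n$ is `n05`.

$$
\Pr\!\left( \chi^2_{\ell} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{u} \right) = 0.9
\quad\Rightarrow\quad
\left[ \frac{(n-1)s^2}{\chi^2_{u}}, \; \frac{(n-1)s^2}{\chi^2_{\ell}} \right]
$$

The larger quantile $\chi^2_{u}$ ends up in the denominator of the lower limit because dividing by a larger number gives a smaller value.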

#### b) bootstrap estimation

In [8]:
np.random.seed(43)
kvars05_boot = gen_boot_dist(ksample05, estfunc=var)
[np.quantile(kvars05_boot, 0.05), np.quantile(kvars05_boot, 0.95)]

[20.34142641025637, 70.91556086217975]

### Exercise 3.19

Compute a 95% confidence interval for the difference between rural and city sleep scores in the `doctors` dataset. **a)** Use the analytical approximation formula in terms of Student's $t$-distribution. **b)** Use bootstrap estimation.

Hint: Use the code `doctors[doctors["location"]=="rural"]` to select
the subset of the doctors working in a `rural` location.

In [9]:
doctors = pd.read_csv("datasets/doctors.csv")
scoresR = doctors[doctors["location"]=="rural"]["score"]
scoresU = doctors[doctors["location"]=="urban"]["score"]

# observed difference between scores
dscores = dmeans(scoresR,scoresU)
dscores

2.2236048265460084

#### a) analytical approximation

In [10]:
# obtain the sample sizes and stds of the two groups
nR, stdR = scoresR.count(), scoresR.std()
nU, stdU = scoresU.count(), scoresU.std()

# standard error of the difference between group means
seDscores = np.sqrt(stdU**2/nU + stdR**2/nR)

# calculate the degrees of freedom
from ministats import calcdf
dfD = calcdf(stdU, nU, stdR, nR)

# Student's t-distribution with df degrees of freedom
from scipy.stats import t as tdist
t_l = tdist(df=dfD).ppf(0.025)
t_u = tdist(df=dfD).ppf(0.975)

[dscores + t_l*seDscores, dscores + t_u*seDscores]

[0.48541688303387676, 3.96179277005814]

In [11]:
# ALT. using t-distribution with custom `loc` and `scale` params
rvDscores = tdist(df=dfD, loc=dscores, scale=seDscores)
[rvDscores.ppf(0.025), rvDscores.ppf(0.975)]  # = rvDscores.interval(0.95)

[0.48541688303387676, 3.96179277005814]
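The helper `calcdf` is imported from `ministats` and its implementation is not shown here. A common choice for the degrees of freedom in the two-sample setting with unequal variances is the Welch-Satterthwaite approximation, and the sketch below assumes that formula. The name `welch_df_sketch` and the input values are hypothetical.

```python
def welch_df_sketch(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation for the degrees of freedom.

    Assumption: this is the formula behind `calcdf`; the source does not confirm it.
    """
    v1 = s1**2 / n1    # variance of the first group mean
    v2 = s2**2 / n2    # variance of the second group mean
    return (v1 + v2)**2 / (v1**2/(n1-1) + v2**2/(n2-1))

# equal spreads and equal sizes: df reduces to n1 + n2 - 2
df_equal = welch_df_sketch(2.0, 10, 2.0, 10)

# unequal spreads and sizes: df falls between min(n1,n2)-1 and n1+n2-2
df_unequal = welch_df_sketch(1.0, 8, 5.0, 30)

print(df_equal, df_unequal)
```

The resulting value is generally not an integer, which is fine because `tdist(df=...)` accepts fractional degrees of freedom.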

#### b) bootstrap estimation

In [12]:
# compute bootstrap estimates for mean in each group
np.random.seed(43)
meanR_boot = gen_boot_dist(scoresR, estfunc=mean)
meanU_boot = gen_boot_dist(scoresU, estfunc=mean)

# compute the difference between means of bootstrap samples
dmeans_boot = np.subtract(meanR_boot, meanU_boot)

[np.quantile(dmeans_boot, 0.025),
 np.quantile(dmeans_boot, 0.975)]

[0.5319758672699874, 3.9718042986425353]

### Exercise 3.20

Calculate an 80% confidence interval for the difference between the mean scores of the debate and lecture groups in the `students` dataset.

In [13]:
students = pd.read_csv("datasets/students.csv")
scoresD = students[students["curriculum"]=="debate"]["score"]
scoresL = students[students["curriculum"]=="lecture"]["score"]

# observed difference between scores
dhat = dmeans(scoresD, scoresL)

#### a) analytical approximation

In [14]:
# obtain the sample sizes and stds of the two groups
nD, stdD = scoresD.count(), scoresD.std()
nL, stdL = scoresL.count(), scoresL.std()

# standard error of the difference between group means
seDscores = np.sqrt(stdD**2/nD + stdL**2/nL)

# calculate the degrees of freedom
from ministats import calcdf
dfD = calcdf(stdD, nD, stdL, nL)

# Student's t-distribution with df degrees of freedom
from scipy.stats import t as tdist
t_l = tdist(df=dfD).ppf(0.1)
t_u = tdist(df=dfD).ppf(0.9)

[dhat + t_l*seDscores, dhat + t_u*seDscores]

[1.916499988422145, 14.722785725863563]

#### b) bootstrap estimation

In [15]:
np.random.seed(42)
meanD_boot = gen_boot_dist(scoresD, estfunc=mean)
meanL_boot = gen_boot_dist(scoresL, estfunc=mean)

# compute the difference between means of bootstrap samples
dmeans_boot = np.subtract(meanD_boot, meanL_boot)

[np.quantile(dmeans_boot, 0.1),
 np.quantile(dmeans_boot, 0.9)]

[2.7105357142857045, 14.039464285714285]

### Exercise 3.21

As part of a lab experiment,
sixty-four two-week old rats were given a vitamin D supplement for a period of one month,
and their weights were recorded at the end of the month (30 days).
The sample mean was $89.60$ g with standard deviation $12.96$ g.
Calculate a 95% confidence interval for the mean weight of rats undergoing this treatment based on: **a)** The normal model. **b)** Student's $t$-distribution. **c)** Compare your answers in a) and b) and comment on the relevance of using Student's $t$-distribution in this case.

In [16]:
n = 64
xbar = 89.60
xstd = 12.96

# estimated standard error
sehat = xstd / np.sqrt(n)
sehat

1.62

#### a) Using normal approximation

In [17]:
from scipy.stats import norm
z_l = norm.ppf(0.025)
z_u = norm.ppf(0.975)
[xbar + z_l*sehat, xbar + z_u*sehat]

[86.42485834504511, 92.77514165495488]

In [18]:
# ALT. using normal with custom `loc` and `scale` params
rvNXbar = norm(loc=xbar, scale=sehat)
[rvNXbar.ppf(0.025), rvNXbar.ppf(0.975)]  # = rvNXbar.interval(0.95)

[86.42485834504511, 92.77514165495488]

#### b) Using Student's $t$-distribution

In [19]:
from scipy.stats import t as tdist
t_l = tdist(df=n-1).ppf(0.025)
t_u = tdist(df=n-1).ppf(0.975)
[xbar + t_l*sehat, xbar + t_u*sehat]

[86.36268832232903, 92.83731167767095]

In [20]:
# ALT. using t-distribution with custom `loc` and `scale` params
rvTXbar = tdist(n-1, loc=xbar, scale=sehat)
[rvTXbar.ppf(0.025), rvTXbar.ppf(0.975)]  # = rvTXbar.interval(0.95)

[86.36268832232903, 92.83731167767095]
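#### c) Comparing the two intervals

The $t$-based interval $[86.36, 92.84]$ is slightly wider than the normal-based interval $[86.42, 92.78]$. One way to quantify the difference is to compare the critical values and the interval widths, as in the sketch below, which recomputes the quantities from the summary statistics given in the exercise.

```python
from scipy.stats import norm, t as tdist

n = 64
sehat = 12.96 / n**0.5              # estimated standard error

z_u = norm.ppf(0.975)               # normal critical value
t_u = tdist(df=n-1).ppf(0.975)      # Student's t critical value, 63 df

width_norm = 2 * z_u * sehat        # width of the normal-based interval
width_t = 2 * t_u * sehat           # width of the t-based interval
ratio = width_t / width_norm

print(z_u, t_u)
print(width_norm, width_t, ratio)
```

With $n = 64$ the $t$-distribution has 63 degrees of freedom and is already close to the standard normal, so the $t$-based interval is only about 2% wider. Using Student's $t$-distribution makes little practical difference here; it matters more for small samples, where its heavier tails account for the extra uncertainty from estimating the standard deviation.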