# Hypothesis test: a two-sided $z$-test

**date**
: 2021-04-17

**data**
: `practical-test.csv`

**ref**
: Computer book B, Activity 23

**desc**
: Performing a two-sided **$z$-test of a population mean**

In [1]:
from scripts.data import Data
from statsmodels.stats.weightstats import ztest

In [2]:
practical = Data.load_practical_test()

The sample contains the pass rates for 316 UK driving practical test centres over the period April 2014 to March 2015.
The first column (`Centre`) lists the 316 test centres; the second and third columns (`Male` and `Female`) contain the pass rates (%) for males and females, respectively, at each centre; and the fourth column (`Total`) contains the overall pass rates (%) for each centre.

During the period April 2013 to March 2014, the national driving practical test pass rate was 47.1%.

Let the hypotheses be

$$
\begin{aligned}
  &H_{0} : \mu = 47.1\% \\
  &H_{1} : \mu \neq 47.1\%,
\end{aligned}
$$

where $\mu$ is the mean total pass rate for the driving practical test nationally across all UK test centres during the period April 2014 to March 2015.

In [3]:
# extract the overall pass rates as a local variable for easier coding
total = practical["Total"]

In [4]:
total.describe()

count    316.000000
mean      49.630380
std        7.165444
min       30.300000
25%       44.975000
50%       49.650000
75%       54.500000
max       71.300000
Name: Total, dtype: float64

We will use `statsmodels.stats.weightstats.ztest` to help with this analysis.
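For reference, the statistic behind a one-sample two-sided $z$-test can be sketched as follows. This is the standard textbook form, assuming the sample standard deviation $s$ stands in for the population standard deviation; the symbols $\bar{x}$, $s$, $n$, $\mu_{0}$ and $\Phi$ (the standard normal cumulative distribution function) are notation introduced here, not taken from the activity.

$$
\begin{aligned}
  z &= \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}}, \\
  p &= 2 \bigl( 1 - \Phi(|z|) \bigr).
\end{aligned}
$$

Here $\mu_{0} = 47.1$, $n = 316$, and $\bar{x}$ and $s$ are the `mean` and `std` values reported by `total.describe()`.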

In [5]:
# returns (z, p)
ztest(
    x1=total,
    value=47.1
)

(6.277491911091587, 3.440781584872767e-10)

Since $p < 0.01$, there is strong evidence against the null hypothesis that the mean pass rate for the driving practical test nationally from April 2014 to March 2015 is the same as the national pass rate for the preceding year (April 2013 to March 2014).

Further, since $z \simeq 6.28$ is positive, or equivalently because the sample mean (49.63) is greater than the hypothesised population mean (47.1), the test suggests that the mean pass rate nationally over the period April 2014 to March 2015 was higher than the national pass rate for the preceding year.
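As a sanity check, the values returned by `ztest` can be approximately reproduced by hand from the summary statistics. The snippet below is a minimal sketch, not part of the original activity: it assumes the rounded figures printed by `total.describe()` and uses only the standard library, so the result should agree with the `ztest` output only up to rounding.

```python
import math

# Summary statistics copied from the total.describe() output (rounded values)
n = 316
sample_mean = 49.630380
sample_std = 7.165444  # sample standard deviation as reported by describe() (ddof=1)
mu_0 = 47.1            # hypothesised population mean under H0

# Standard error of the mean and the z statistic
standard_error = sample_std / math.sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
p = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, p = {p:.3e}")
```

Using `math.erfc` avoids the loss of precision that comes from computing `1 - Phi(|z|)` directly when the tail probability is this small.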

In [6]:
help(ztest)

Help on function ztest in module statsmodels.stats.weightstats:

ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)
    test for mean based on normal distribution, one or two samples
    
    In the case of two samples, the samples are assumed to be independent.
    
    Parameters
    ----------
    x1 : array_like, 1-D or 2-D
        first of the two independent samples
    x2 : array_like, 1-D or 2-D
        second of the two independent samples
    value : float
        In the one sample case, value is the mean of x1 under the Null
        hypothesis.
        In the two sample case, value is the difference between mean of x1 and
        mean of x2 under the Null hypothesis. The test statistic is
        `x1_mean - x2_mean - value`.
    alternative : str
        The alternative hypothesis, H1, has to be one of the following
    
           'two-sided': H1: difference in means not equal to value (default)
           'larger' :   H1: difference in means la