# Practical 3: Establishing and testing the hypothesis

This week is focussed on defining research hypotheses and using
statistical tests to evaluate them. In particular, we will use
Student’s t-test and the Kolmogorov-Smirnov (KS) test.

> **Note**
>
> This practical follows on from practical 2, so if you haven’t done
> that yet I suggest going back and working through that first!

## Loading the data

We are going to look at school performance data in England once again.

The data is sourced from []() and is downloadable [here]().

We have saved a copy of this dataset to the GitHub repo, in case the
dataset is removed from the website.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as sps
import numpy as np 

# Read CSV file, handling common missing value entries
na_vals = ["", "NA", "SUPP", "NP", "NE", "SP", "SN", "SUPPMAT"]
df_ks4 = pd.read_csv(
    'L2_data/england_ks4final.csv',
    na_values = na_vals
)

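# Columns to keep: school information and the EBacc APS columns
info_cols = ['RECTYPE', 'LEA', 'SCHNAME', 'TOTPUPS', 'TOWN']
ebaccs_cols = ['EBACCAPS', 'EBACCAPS_LO', 'EBACCAPS_MID', 'EBACCAPS_HI']

df_ks4 = df_ks4[info_cols + ebaccs_cols]

# Convert pupil counts and scores to numbers; unparseable entries become NaN
df_ks4[['TOTPUPS']+ebaccs_cols] = df_ks4[['TOTPUPS']+ebaccs_cols].apply(pd.to_numeric, errors='coerce')

# Keep individual schools only (1 = mainstream school, 2 = special school)
df_ks4 = df_ks4[df_ks4['RECTYPE'].isin([1, 2])].copy()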

df_ks4.head()

Looking at the metadata (which you can see in ‘L2_data/ks4_meta.xlsx’)
we can see the full meaning of each column header:

-   ‘RECTYPE’ = Record type (1=mainstream school; 2=special school;
    4=local authority; 5=National (all schools); 7=National (maintained
    schools))
-   ‘LEA’ = Local authority
-   ‘SCHNAME’ = School name
-   ‘TOTPUPS’ = Number of pupils on roll (all ages)
-   ‘TOWN’ = School town
-   ‘EBACCAPS’ = Average EBacc APS score per pupil
-   ‘EBACCAPS_LO’ = Average EBacc APS score per pupil with low prior
    attainment
-   ‘EBACCAPS_MID’ = Average EBacc APS score per pupil with middle prior
    attainment
-   ‘EBACCAPS_HI’ = Average EBacc APS score per pupil with high prior
    attainment

## Research question

The Department for Education is worried about regional inequality in
school grades. With this in mind, they’ve come up with a research
question they’d like to address.

**Research question:** Is average pupil attainment on the EBacc
significantly different in London compared to the rest of England?

To answer this, we’re going to use a mean comparison test (the
two-sample t-test) to compare schools in London with those outside
London.

## Preparing the data

### Splitting the groups

In [2]:
df_London = df_ks4[df_ks4['TOWN'] == 'London']
df_London = df_London[df_London['EBACCAPS'].notna()]

df_notLondon = df_ks4[df_ks4['TOWN'] != 'London']
df_notLondon = df_notLondon[df_notLondon['EBACCAPS'].notna()]

Now let’s look at the summary statistics for each group.

In [3]:
df_London['EBACCAPS'].describe()

count    385.000000
mean       3.788260
std        1.851894
min        0.000000
25%        3.020000
50%        4.290000
75%        4.970000
max        8.700000
Name: EBACCAPS, dtype: float64

In [4]:
df_notLondon['EBACCAPS'].describe()

count    4246.000000
mean        3.395921
std         1.659090
min         0.000000
25%         2.820000
50%         3.670000
75%         4.380000
max         8.560000
Name: EBACCAPS, dtype: float64

The summary statistics show that the two groups differ considerably in
size (385 schools in London versus 4,246 elsewhere). The group means
also differ (3.79 versus 3.40), but we want to test whether this
difference is statistically significant.

## The hypothesis test

We’re now going to work through the steps of the hypothesis test
according to the five steps discussed in the lecture:

1.  Define the null and alternative hypotheses
2.  Set your significance level
3.  Identify the evidence
4.  Calculate the p-value
5.  Compare the p-value with the significance level

### Step 1

What are the null and alternative hypotheses?

#### Question

In [None]:
H_0 = '??'
H_1 = '??'

print(f'The null hypothesis is {H_0}')
print(f'The alternative hypothesis is {H_1}')

### Step 2

In [6]:
# Set the level of statistical significance 

alpha = 0.05

### Step 3

We already have the evidence - it’s our datasets `df_London['EBACCAPS']`
and `df_notLondon['EBACCAPS']`.

### Step 4

We can use a built-in function from `scipy.stats` called `ttest_ind` to
do step 4 for us. You can read more about this function
[here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html).

First we need to check whether we can assume that the samples are drawn
from populations with the same standard deviation. As a rule of thumb,
this is reasonable provided neither sample standard deviation is more
than double the other.

In [7]:
# Sample standard deviation of each group
London_std = df_London['EBACCAPS'].std()
notLondon_std = df_notLondon['EBACCAPS'].std()

# Calculate the ratio of standard deviations 
std_ratio = London_std/notLondon_std

print(f"std ratio = {std_ratio:.4f}")

if std_ratio > 0.5 and std_ratio < 2:
    print("Can assume equal population standard deviations.")
    equal_stds = True
else:
    print("Cannot assume equal population standard deviations.")
    equal_stds = False

std ratio = 1.1162
Can assume equal population standard deviations.
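
If the ratio had fallen outside this range, passing `equal_var=False` to `ttest_ind` would run Welch’s t-test, which does not assume equal population variances. The sketch below compares the two versions on synthetic, made-up samples (not the schools data); the names `group_a` and `group_b` are purely illustrative.

```python
import numpy as np
import scipy.stats as sps

# Synthetic, made-up samples (NOT the schools data): a small, widely spread
# group and a large, tightly spread group.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=4.0, scale=3.0, size=50)
group_b = rng.normal(loc=3.5, scale=1.0, size=500)

# Pooled (Student's) t-test: assumes equal population variances.
pooled = sps.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: does not assume equal population variances.
welch = sps.ttest_ind(group_a, group_b, equal_var=False)

print("pooled: t =", pooled.statistic, " p =", pooled.pvalue)
print("Welch:  t =", welch.statistic, " p =", welch.pvalue)
```

When the group sizes and spreads are both very different, as in this made-up example, the two tests can give noticeably different answers, which is why the check on the standard deviations matters.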

The function `scipy.stats.ttest_ind` returns two outputs: the
**test statistic** and the **p-value**.

In [8]:
test_stat, p_value = sps.ttest_ind(df_London['EBACCAPS'], df_notLondon['EBACCAPS'], equal_var = equal_stds)

print("test statistic = ", test_stat)
print("p-value =", p_value)

test statistic =  4.398340904538903
p-value = 1.1153058452679495e-05
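
To see what `ttest_ind` is doing, here is a minimal sketch that rebuilds the pooled (equal standard deviation) t-statistic from the summary statistics printed by `describe()` earlier. It assumes the equal-variance form of the test, and the variable names (`pooled_var`, `std_err`, and so on) are our own. It also cross-checks the result with `scipy.stats.ttest_ind_from_stats`.

```python
import math
import scipy.stats as sps

# Summary statistics copied from the describe() outputs earlier in this practical.
n1, mean1, std1 = 385, 3.788260, 1.851894    # London
n2, mean2, std2 = 4246, 3.395921, 1.659090   # rest of England

# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Test statistic: difference in means measured in standard errors.
t_by_hand = (mean1 - mean2) / std_err

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom.
p_by_hand = 2 * sps.t.sf(abs(t_by_hand), df=n1 + n2 - 2)

# The same calculation using scipy's helper for summary statistics.
t_scipy, p_scipy = sps.ttest_ind_from_stats(
    mean1, std1, n1, mean2, std2, n2, equal_var=True
)

print("by hand: t =", t_by_hand, " p =", p_by_hand)
print("scipy:   t =", t_scipy, " p =", p_scipy)
```

Both versions should agree with the output of `ttest_ind` above, up to the rounding of the summary statistics.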

### Step 5

For the final step we compare the p-value to the significance level in
order to reach a decision.

#### Question

In [None]:
if p_value ?? ?? :
    print(f"Reject the null hypothesis ({H_0}). Accept the alternative hypothesis ({H_1}).")
    print("Conclude that samples are drawn from populations with different means.")
elif p_value ?? ?? :
    print(f"No significant evidence to reject the null hypothesis ({H_0}).")
    print("Assume samples are drawn from populations with the same mean.")

Hence we can conclude that there is a statistically significant
difference between mean pupil attainment on the EBacc in London and in
the rest of England.

## A more complicated research question

Now I’d also like to know: are the EBacc scores in London drawn from a
normal distribution?
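
One way to approach this is the KS test mentioned at the start, via `scipy.stats.kstest`. The sketch below is only an illustration on synthetic, made-up samples (not the schools data), and the helper `ks_normality` is a hypothetical name of our own. It compares each sample against a normal distribution with the sample’s own mean and standard deviation. Because those parameters are estimated from the same data, the resulting p-value is only approximate.

```python
import numpy as np
import scipy.stats as sps

# Synthetic, made-up samples (NOT the schools data).
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=3.5, scale=1.7, size=1000)
skewed_sample = rng.exponential(scale=2.0, size=1000)


def ks_normality(sample):
    """KS test of a sample against a normal with the sample's own mean and std.

    Hypothetical helper for illustration; the parameters are estimated from
    the data, so the p-value is approximate.
    """
    mean = np.mean(sample)
    std = np.std(sample, ddof=1)
    return sps.kstest(sample, "norm", args=(mean, std))


for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    result = ks_normality(sample)
    print(name, "statistic =", result.statistic, " p-value =", result.pvalue)
```

The same five steps apply here: the null hypothesis is that the sample is drawn from the stated normal distribution, and a p-value below the significance level is evidence against it.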

## You’re Done!

Well done, you’ve completed this week’s practical on establishing and
testing hypotheses. If you are still working on it, take your time. And
if you have any questions, just ask!