# Tutorial exercises

We again use the wellbeing dataset, this time to practise running permutation tests.

### Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

In [1]:
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf



### Import and view the data

In [2]:
wb = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/WellbeingSample.csv')
wb

| | ID_code | College | Subject | Score_preVac | Score_postVac |
|---|---|---|---|---|---|
| 0 | 247610 | Lonsdale | PPE | 60 | 35 |
| 1 | 448590 | Lonsdale | PPE | 43 | 44 |
| 2 | 491100 | Lonsdale | engineering | 79 | 69 |
| 3 | 316150 | Lonsdale | PPE | 55 | 61 |
| 4 | 251870 | Lonsdale | engineering | 62 | 65 |
| ... | ... | ... | ... | ... | ... |
| 296 | 440570 | Beaufort | history | 75 | 70 |
| 297 | 826030 | Beaufort | maths | 52 | 49 |
| 298 | 856260 | Beaufort | Biology | 83 | 84 |
| 299 | 947060 | Beaufort | engineering | 62 | 65 |



### Questions

#### Test the following hypotheses:
    
1. Wellbeing scores pre- and post-vac are correlated in engineering students
2. There is a difference in the wellbeing scores of PPE students between Beaufort and Lonsdale (before the vacation)
3. Wellbeing over all students increases across the vacation

#### Slightly harder one:

4. Wellbeing increases more across the vacation for Beaufort students than Lonsdale students 

#### Detailed Instructions

In each case 1-4, you will need to decide what to do, carry it out and write it up:

**a. Hypotheses**
* what is our null hypothesis?
* what is our alternative hypothesis?

Is it a paired or unpaired test for difference of means, or a correlation test?
* therefore which `permutation_type` is needed, `samples`, `pairings` or `independent`?
        
Is it a one- or two-tailed test?
* therefore which `alternative` hypothesis type is needed, `two-sided`, `greater` or `less`?

What $\alpha$ value will you use?
* what value must $p$ be smaller than, to reject the null hypothesis?
* this is the experimenter's choice but usually 0.05 is used (sometimes 0.01 or 0.001)
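The choices above map onto arguments of `scipy.stats.permutation_test` roughly as follows. This is a minimal sketch on synthetic data, not a solution to any of the four questions: the arrays, the group sizes and the helper names `mean_diff`, `mean_paired_diff` and `corr` are all made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # synthetic data only, NOT the wellbeing dataset

# Hypothetical paired scores: the same 30 people measured twice
before = rng.normal(60, 10, size=30)
after = before + rng.normal(5, 3, size=30)

# Hypothetical unpaired scores: two unrelated groups of different sizes
group_a = rng.normal(60, 10, size=30)
group_b = rng.normal(75, 10, size=25)

def mean_diff(x, y):
    # unpaired design: difference between the two group means
    return np.mean(x) - np.mean(y)

def mean_paired_diff(x, y):
    # paired design: mean of the within-pair differences
    return np.mean(x - y)

def corr(x, y):
    # correlation: off-diagonal element of the 2x2 correlation matrix
    return np.corrcoef(x, y)[0, 1]

# Unpaired difference of means -> shuffle group labels
res_unpaired = stats.permutation_test((group_a, group_b), mean_diff,
    permutation_type='independent', alternative='two-sided', n_resamples=2000)

# Paired difference of means -> flip labels within each pair
res_paired = stats.permutation_test((after, before), mean_paired_diff,
    permutation_type='samples', alternative='greater', n_resamples=2000)

# Correlation -> shuffle which x is paired with which y
res_corr = stats.permutation_test((before, after), corr,
    permutation_type='pairings', alternative='greater', n_resamples=2000)

print(res_unpaired.pvalue, res_paired.pvalue, res_corr.pvalue)
```

With `alternative='greater'` the order of the two samples matters: in the paired example the statistic is `after - before`, so "greater" corresponds to an increase over time.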

**b. Test statistic and descriptive statistics**

What is your test statistic?

Report appropriate descriptive statistics and plot the data (you should choose an appropriate plot type).

**c. Carry out the permutation test**

Carry out the test. Plot the null distribution. Report the $p$-value.
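One possible way to plot the null distribution is to draw a histogram of the `null_distribution` attribute of the result object and mark the observed statistic on it. The sketch below assumes synthetic data and a correlation test purely for illustration; the variable names and the plot styling are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic, illustrative data (NOT the wellbeing dataset)
x = rng.normal(60, 10, size=40)
y = 0.5 * x + rng.normal(0, 10, size=40)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

res = stats.permutation_test((x, y), corr, permutation_type='pairings',
                             alternative='greater', n_resamples=5000)

# Histogram of the statistic across shuffles, with the observed value marked
fig, ax = plt.subplots()
ax.hist(res.null_distribution, bins=40)
ax.axvline(res.statistic, color='r', linestyle='--', label='observed statistic')
ax.set_xlabel("Pearson's r under the null")
ax.set_ylabel('count')
ax.legend()

print(f'p = {res.pvalue:.4f}')
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would add `plt.show()`.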

**d. Report your conclusion**

Will you reject the null hypothesis, or fail to reject it? What is your conclusion in plain English?

**e. Finally, write it up**

In each case, include a final cell in which you write the test up as if for a journal article





## 1. Wellbeing scores pre- and post-vac are correlated in engineering students

**a) Hypotheses etc**

$\mathcal{H_0}$: The correlation in wellbeing pre- and post- vacation is zero for engineering students

$\mathcal{H_a}$: The correlation in wellbeing pre- and post- vacation is greater than zero (students with high wellbeing before the vac also have high wellbeing after the vac)

We will test at the 5% ($\alpha=0.05$) level, one-tailed, as it only makes sense to look for a positive correlation.

As this is a correlation, we need `permutation_type = 'pairings'` (shuffle which datapoints are paired with which)

**b) Test statistic and descriptive statistics**

The test statistic is Pearson's $r$. The relevant descriptive statistics are the observed value of $r$, which is 0.78 (calculated below), and the sample size, which is 61.

In [6]:
# find the relevant data
prevac = wb.query('Subject=="engineering"').Score_preVac
postvac = wb.query('Subject=="engineering"').Score_postVac

prevac.corr(postvac)



0.7812255461336073

In [11]:
print('n pre = ' + str(prevac.count()))
print('n post = ' + str(postvac.count()))

n pre = 61
n post = 61
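Step (b) also asks for a plot of the data; for a correlation, a scatter plot of pre- against post-vacation scores is a natural choice. The helper `scatter_pre_post` below is a hypothetical name, and the `toy` dataframe contains made-up rows that only demonstrate the call. The column names are the ones in the dataframe shown earlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

def scatter_pre_post(df, subject):
    """Scatter pre- vs post-vacation scores for one subject; return (n, r)."""
    sub = df.query('Subject == @subject')
    r = sub.Score_preVac.corr(sub.Score_postVac)  # Pearson's r
    fig, ax = plt.subplots()
    ax.scatter(sub.Score_preVac, sub.Score_postVac)
    ax.set_xlabel('Wellbeing score (pre-vac)')
    ax.set_ylabel('Wellbeing score (post-vac)')
    ax.set_title(f'{subject}: n = {len(sub)}, r = {r:.2f}')
    return len(sub), r

# Made-up toy rows, only to show how the helper is called
toy = pd.DataFrame({
    'Subject': ['engineering'] * 5 + ['PPE'] * 2,
    'Score_preVac': [79, 62, 55, 70, 48, 60, 43],
    'Score_postVac': [69, 65, 58, 74, 50, 35, 44],
})
n, r = scatter_pre_post(toy, 'engineering')
print(n, round(r, 2))
```

With the real data loaded as `wb`, the equivalent call would be `scatter_pre_post(wb, 'engineering')`.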


**c) Carry out the test**

In [8]:
# define a function that gives the correlation for two series x and y
# note that np.corrcoef returns a 2x2 matrix so we need to get just the relevant element
def mycorr(x,y):
    c = np.corrcoef(x,y)
    return c[0,1]

# run the permutation test
stats.permutation_test((prevac,postvac), mycorr, alternative='greater', permutation_type='pairings', n_resamples=10000)



PermutationTestResult(statistic=0.7812255461336073, pvalue=9.999000099990002e-05, null_distribution=array([-0.21807336, -0.06231807, -0.07283783, ...,  0.02686439,
       -0.25803275,  0.12711615]))
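The reported p-value, 9.999000099990002e-05, is exactly 1/10001. For a randomised test, `scipy.stats.permutation_test` adds 1 to both the count of shuffles at least as extreme as the observed statistic and to the number of resamples, so this is the smallest p-value obtainable with 10000 resamples. The hand-rolled sketch below illustrates the idea with plain NumPy on synthetic stand-in data (not the engineering students' scores); it mimics the logic of a `pairings` test rather than reproducing scipy's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for 61 paired scores (NOT the real wellbeing data)
x = rng.normal(65, 10, size=61)
y = 0.8 * x + rng.normal(0, 6, size=61)

observed = np.corrcoef(x, y)[0, 1]

n_resamples = 10000
null = np.empty(n_resamples)
for i in range(n_resamples):
    # shuffle y so that each x is paired with a randomly chosen y
    null[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

# one-tailed ('greater'): count shuffles with r at least as large as observed
n_extreme = np.sum(null >= observed)
p_value = (n_extreme + 1) / (n_resamples + 1)

print(f'observed r = {observed:.2f}, extreme shuffles = {n_extreme}, p = {p_value}')
```

When no shuffle reaches the observed correlation, the p-value is 1/(n_resamples + 1), which is why a strong effect tested with 10000 resamples yields roughly 0.0001 rather than 0.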

**d) Report conclusion**

As the p value is much less than 0.05, we reject the null hypothesis and conclude that there is a significant positive correlation between wellbeing scores before and after the vacation in engineering students.

**e) Write up**

We hypothesised that wellbeing before and after the vacation would be positively correlated across individuals. We calculated Pearson's $r$ for a group of engineering students (n=61) before and after the vacation, and tested its significance using a permutation test. There was a highly significant positive correlation (r=0.78, p<0.0001, one tailed), meaning that students with higher wellbeing before the vacation also tended to have higher wellbeing after the vacation.