### Two test descriptive statistics and correlation
___

Using the exercise spreadsheets from *Designing and Analyzing Language Tests* (Oxford University Press).

The purpose of this notebook is to calculate descriptive statistics for two tests, compute the Pearson $r$ between them, and then compute the Spearman $\rho$ for the same pair of tests.

*NOTE: both sets of test scores are continuous, interval-level variables and approximately normally distributed, so it is appropriate to use Pearson's $r$.*
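
As a reminder of what $r$ measures, the sketch below computes it as the mean product of z-scores and compares the result with NumPy's built-in. The function name `pearson_r` and the toy scores are hypothetical, chosen only for illustration; they are not part of the exercise data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as the mean product of z-scores (population SD, ddof=0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()  # np.std uses ddof=0 by default
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# hypothetical toy scores, for illustration only
x_demo = [4, 7, 9, 10, 13, 15]
y_demo = [5, 6, 10, 9, 14, 13]

r_manual = pearson_r(x_demo, y_demo)
r_numpy = float(np.corrcoef(x_demo, y_demo)[0, 1])
print(round(r_manual, 6), round(r_numpy, 6))
```

Both values should agree to floating-point precision, because the choice of `ddof` cancels out in the ratio that defines $r$.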

<br>

#### General Setup
___

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as ss

<br>

#### Load the data
___

In [2]:
# loading the data
two_tests = pd.read_excel('Data/two_test_descr_stats_and_corr.xlsx')
two_tests.head()

,Student,Test X,Test Y
0,Student01,9,12
1,Student02,12,12
2,Student03,5,7
3,Student04,6,6
4,Student05,12,14


In [3]:
two_tests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Student  30 non-null     object
 1   Test X   30 non-null     int64 
 2   Test Y   30 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 848.0+ bytes


<br>

#### Descriptive stats
___

In [4]:
# compute the pandas summary statistics for both tests (describe() already returns a dataframe)
stats = two_tests[['Test X', 'Test Y']].describe()
stats

,Test X,Test Y
count,30.0,30.0
mean,10.566667,10.5
std,3.793673,3.471559
min,5.0,3.0
25%,7.25,8.0
50%,10.5,11.0
75%,13.0,13.75
max,17.0,16.0
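
The `std` row reported by `describe()` is the sample standard deviation (`ddof=1`). The following minimal sketch shows how it relates to the population standard deviation (`ddof=0`); it uses only the five Test X values visible in the `head()` output, so the numbers are illustrative rather than the full-dataset results.

```python
import numpy as np
import pandas as pd

# first five Test X scores shown by head(); illustration only
scores = pd.Series([9, 12, 5, 6, 12])
n = len(scores)

sd_sample = scores.std()      # pandas default: ddof=1 (divide by n - 1)
sd_pop = scores.std(ddof=0)   # population formula (divide by n)

# the two differ only by the factor sqrt((n - 1) / n)
sd_pop_from_sample = sd_sample * np.sqrt((n - 1) / n)
print(sd_sample, sd_pop, sd_pop_from_sample)
```

With $n = 30$ the factor is $\sqrt{29/30} \approx 0.983$, so the two standard deviations are close but not identical.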


In [5]:
# keep the sample std under an explicit label and add the population std (ddof=0)
stats.loc['std(sample)'] = stats.loc['std']
stats.loc['std(pop)'] = [two_tests['Test X'].std(ddof=0), two_tests['Test Y'].std(ddof=0)]

# relabel min, max and count
stats.loc['high score'] = stats.loc['max']
stats.loc['low score'] = stats.loc['min']
stats.loc['n'] = stats.loc['count']

# add the remaining statistics
# Series.mode() returns every mode in ascending order; keep the smallest one
stats.loc['mode'] = [two_tests['Test X'].mode().iloc[0], two_tests['Test Y'].mode().iloc[0]]
stats.loc['var(sample)'] = [two_tests['Test X'].var(), two_tests['Test Y'].var()]
stats.loc['var(pop)'] = [two_tests['Test X'].var(ddof=0), two_tests['Test Y'].var(ddof=0)]
# range counted inclusively: high score - low score + 1
stats.loc['range'] = stats.loc['high score'] - stats.loc['low score'] + 1
# Q is the semi-interquartile range
stats.loc['Q'] = (stats.loc['75%'] - stats.loc['25%']) / 2

# skewness and kurtosis, each with its standard error (SES, SEK) and the statistic/SE ratio
stats.loc['skewness'] = [two_tests['Test X'].skew(), two_tests['Test Y'].skew()]
stats.loc['SES'] = np.sqrt((6*stats.loc['n'] * (stats.loc['n']-1)) / ((stats.loc['n']-2) * (stats.loc['n']+1) * (stats.loc['n']+3)))
stats.loc['skew/SES'] = stats.loc['skewness'] / stats.loc['SES']
stats.loc['kurtosis'] = [two_tests['Test X'].kurt(), two_tests['Test Y'].kurt()]
stats.loc['SEK'] = np.sqrt((4*(stats.loc['n']**2-1)*stats.loc['SES']**2) / ((stats.loc['n']-3)*(stats.loc['n']+5)))
stats.loc['kurt/SEK'] = stats.loc['kurtosis'] / stats.loc['SEK']

# drop the original rows that were relabelled above
stats.drop(['std', 'min', 'max', 'count'], axis=0, inplace=True)
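
The SES and SEK rows depend only on the sample size $n$, so they can be checked in isolation. The sketch below restates the two formulas used in the cell above as standalone functions; the names `se_skewness` and `se_kurtosis` are hypothetical helpers introduced here, not part of pandas or SciPy.

```python
import math

def se_skewness(n):
    """Standard error of skewness: sqrt(6n(n-1) / ((n-2)(n+1)(n+3)))."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of kurtosis: sqrt(4(n^2-1) * SES^2 / ((n-3)(n+5)))."""
    ses = se_skewness(n)
    return math.sqrt(4 * (n ** 2 - 1) * ses ** 2 / ((n - 3) * (n + 5)))

# the dataset in this notebook has n = 30 students
ses_30 = se_skewness(30)
sek_30 = se_kurtosis(30)
print(round(ses_30, 3), round(sek_30, 3))
```

For $n = 30$ this gives roughly 0.427 for SES and 0.833 for SEK, the same for both tests because both have 30 scores.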

In [6]:
# round all stats to three decimal places and change the row order
stats = stats.round(3)
stats = stats.reindex(index=['mean', 'mode', '25%', '50%', '75%', 'high score', 'low score',
                             'range', 'std(pop)', 'std(sample)', 'var(pop)', 'var(sample)', 'Q', 'n',
                             'skewness', 'SES', 'skew/SES', 'kurtosis', 'SEK', 'kurt/SEK'])
stats.index.name = 'Statistics'
stats

Statistics,Test X,Test Y
mean,10.567,10.5
mode,9.0,14.0
25%,7.25,8.0
50%,10.5,11.0
75%,13.0,13.75
high score,17.0,16.0
low score,5.0,3.0
range,13.0,14.0
std(pop),3.73,3.413
std(sample),3.794,3.472


<br>

#### Correlation
___

In [7]:
# Pearson correlation (select the score columns; the 'Student' column is not numeric)
pearson = two_tests[['Test X', 'Test Y']].corr(method='pearson')
pearson.index.name = 'Pearson'
pearson

Pearson,Test X,Test Y
Test X,1.0,0.757995
Test Y,0.757995,1.0


In [8]:
# coefficient of determination: the squared Pearson r between the two tests
r2 = pd.DataFrame({'r^2': [pearson.loc['Test X', 'Test Y'] ** 2]})
r2

,r^2
0,0.574556


In [9]:
# Spearman correlation (select the score columns; the 'Student' column is not numeric)
spearman = two_tests[['Test X', 'Test Y']].corr(method='spearman')
spearman.index.name = 'Spearman'
spearman

Spearman,Test X,Test Y
Test X,1.0,0.775093
Test Y,0.775093,1.0
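
Spearman's $\rho$ can be understood as Pearson's $r$ applied to ranks, with tied scores receiving the average of their ranks. The minimal sketch below illustrates this on the five rows visible in the `head()` output only, so the resulting value is not the full-dataset coefficient; the frame name `demo` is hypothetical.

```python
import pandas as pd

# first five rows shown by head(); illustration only
demo = pd.DataFrame({'Test X': [9, 12, 5, 6, 12],
                     'Test Y': [12, 12, 7, 6, 14]})

# rank each test separately; ties get the average rank (pandas default)
ranks = demo.rank()

# Pearson correlation of the ranks
rho_from_ranks = ranks['Test X'].corr(ranks['Test Y'])

# pandas' built-in Spearman correlation for comparison
rho_builtin = demo['Test X'].corr(demo['Test Y'], method='spearman')
print(rho_from_ranks, rho_builtin)
```

The two values should match, which is why $\rho$ is less sensitive than $r$ to outliers and to departures from normality: only the ordering of the scores matters.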


In [10]:
# write all five dataframes side by side into one sheet of the results workbook
out_path = 'Data/two_test_descr_stats_and_corr_results.xlsx'
right_col = len(two_tests.columns) + len(stats.columns) + 4  # first column of the correlation block
# the context manager saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
    two_tests.to_excel(writer, index=False)
    stats.to_excel(writer, startcol=len(two_tests.columns) + 2, index=True)
    pearson.to_excel(writer, startcol=right_col, index=True)
    r2.to_excel(writer, startrow=len(pearson) + 2, startcol=right_col, index=False)
    spearman.to_excel(writer, startrow=len(pearson) + 5, startcol=right_col, index=True)

<br>

___
#### End.