# Paired Samples t-Test
*By P. Stikker*<br>
https://PeterStatistics.com<br>
https://www.youtube.com/stikpet<br>

## Introduction

A paired samples t-test can be used to check whether the mean of the paired differences in the population is different from zero (i.e. whether the two population means are not the same).
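
Stated formally, with $\mu_d$ denoting the population mean of the paired differences (a symbol introduced here for illustration) and assuming the usual two-sided test, the hypotheses are:

\begin{equation*}
H_0: \mu_d = 0 \qquad H_a: \mu_d \neq 0
\end{equation*}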

Note that unlike a two samples t-test we do not need to check if the variances are equal, since the data is paired (McDonald, 2014, p. 182). For the interested reader, there is a nice discussion on this on <a href="https://www.researchgate.net/post/Do_you_agree_that_homogeneity_of_variance_is_an_assumption_for_paired_samples_t-test">researchgate</a>.

## Example

To show an example, I'll load some data as a pandas dataframe. So I'll need the '<a href="https://pandas.pydata.org">pandas</a>' library:

In [21]:
#!pip install pandas
import pandas as pd

And then load the example data using the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">'read_csv'</a> function.

In [5]:
myDf = pd.read_csv('../../pairedTtest.csv')
myDf.head()

One simple method to perform a paired samples t-test is to use the '<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html">ttest_rel</a>' function of the scipy.stats package. We can load this function using:

In [23]:
# !pip install scipy
from scipy.stats import ttest_rel

To use the function we select the two fields of interest, and also decide what to do if there are missing values (nan). There are three options to choose from for the *nan_policy* parameter. The default is 'propagate', which means the test will return 'nan'; another option is 'raise', which will throw an error; and the last is 'omit', which ignores pairs with missing values and is the one I would use.
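
To see the three *nan_policy* options side by side, here is a minimal sketch. The `before` and `after` lists are made-up numbers (not the example data), with one missing value placed in `before`:

```python
import math
from scipy.stats import ttest_rel

# Hypothetical scores, made up for illustration; one 'before' value is missing
before = [12.0, 15.0, float('nan'), 20.0, 18.0, 25.0]
after = [14.0, 19.0, 17.0, 21.0, 24.0, 24.0]

# 'propagate' (the default): the nan flows through to the result
res_propagate = ttest_rel(before, after, nan_policy='propagate')

# 'omit': pairs with a missing value are ignored
res_omit = ttest_rel(before, after, nan_policy='omit')

# 'raise': an error is thrown as soon as a nan is found
try:
    ttest_rel(before, after, nan_policy='raise')
    raised = False
except ValueError:
    raised = True

# For comparison: drop the incomplete pairs by hand, then run the test
complete = [(b, a) for b, a in zip(before, after)
            if not (math.isnan(b) or math.isnan(a))]
res_manual = ttest_rel([b for b, _ in complete], [a for _, a in complete])

print(res_propagate.statistic, res_omit.statistic, res_manual.statistic, raised)
```

The 'omit' result should match the result obtained after removing the incomplete pair by hand.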

In [24]:
ttest_rel(myDf['before'], myDf['after'], nan_policy='omit')

Ttest_relResult(statistic=-2.398927226696023, pvalue=0.02494298862638759)

In the example we find a t-value of -2.399 with a significance (p-value) of 0.025. This means that there is a 0.025 (2.5%) chance of a t-value of less than -2.399 or a t-value of more than 2.399, if the mean difference would be 0 in the population (i.e. no difference = equal means).

Since 0.025 is below the usual 0.05 significance level, we consider this chance to be so low that most likely there is actually a difference in the population as well (and not only in the sample).
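
The two-sided interpretation can be checked with a small sketch that adds up the two tail areas of the t distribution. It uses the t-value reported by ttest_rel and assumes 23 degrees of freedom (24 pairs minus 1, as worked out in the appendix); the variable names are my own:

```python
from scipy.stats import t as t_dist

t_value = -2.398927226696023  # statistic reported by ttest_rel
dof = 23                      # number of pairs (24) minus 1

# Chance of a t-value below -|t| plus the chance of one above +|t|
lower_tail = t_dist.cdf(-abs(t_value), dof)
upper_tail = t_dist.sf(abs(t_value), dof)
p_two_sided = lower_tail + upper_tail

print(p_two_sided)
```

Because the t distribution is symmetric, both tails have the same area, which is why the p-value can also be found by doubling one tail.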

We can also use the '<a href="https://researchpy.readthedocs.io/en/latest/ttest_documentation.html">ttest</a>' function from researchpy to perform the test:

In [25]:
# !pip install researchpy
from researchpy import ttest as rpTtest

In [26]:
rpRes = rpTtest(myDf['before'], myDf['after'], equal_variances=True, paired=True)
rpRes

(  Variable     N       Mean         SD        SE  95% Conf.   Interval
 0   before  24.0  32.250000  22.552837  4.603578  22.726772  41.773228
 1    after  24.0  44.083333  30.144459  6.153212  31.354445  56.812222
 2     diff  24.0 -11.833333  24.165492  4.932760 -22.037526  -1.629141,
             Paired samples t-test  results
 0  Difference (before - after) =  -11.8333
 1           Degrees of freedom =   23.0000
 2                            t =   -2.3989
 3        Two side test p value =    0.0249
 4       Difference < 0 p value =    0.0125
 5       Difference > 0 p value =    0.9875
 6                    Cohen's d =   -0.4897
 7                    Hedge's g =   -0.4817
 8                Glass's delta =   -0.5247
 9                            r =    0.4474)

More results, but the t-value and the p-value are the same. A nice addition is the degrees of freedom, which is usually needed to report the results.
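
As an illustration of such a report, a small sketch that puts the degrees of freedom, t-value and p-value from the output above into one line. The helper `format_t_result` is hypothetical (not part of researchpy), and the exact reporting style depends on the guidelines you have to follow:

```python
def format_t_result(t_value, dof, p_value):
    """Hypothetical helper: format a t-test result as 't(df) = ..., p = ...'."""
    return f"t({dof:.0f}) = {t_value:.2f}, p = {p_value:.3f}"

# Values copied from the researchpy output
report = format_t_result(-2.3989, 23.0, 0.0249)
print(report)
```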

Pingouin can also perform the paired t-test, with a function named '<a href="https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest">ttest</a>':

In [27]:
# !pip install pingouin
from pingouin import ttest as pgTtest

In [28]:
pgTtest(myDf['before'], myDf['after'], paired=True)

| | T | dof | tail | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -2.398927 | 23 | two-sided | 0.024943 | [-22.04, -1.63] | 0.444517 | 2.272 | 0.55029 |
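
The 95% confidence interval of the mean difference, shown by both researchpy and Pingouin, can be reproduced with a short sketch. It assumes the interval is the mean difference plus or minus the critical t-value times the standard error, and uses the (rounded) mean and SE of 'diff' from the researchpy output; the variable names are my own:

```python
from scipy.stats import t as t_dist

mean_diff = -11.833333  # mean of the differences, from the researchpy output
se_diff = 4.932760      # standard error of the differences, same output
dof = 23                # degrees of freedom

# Critical t-value that leaves 2.5% in each tail
t_crit = t_dist.ppf(0.975, dof)

ci_lower = mean_diff - t_crit * se_diff
ci_upper = mean_diff + t_crit * se_diff

print(round(ci_lower, 2), round(ci_upper, 2))
```

Since the interval does not include 0, it agrees with the significant result of the test.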


For those interested, in the appendix I'll go over the formulas involved and avoid using libraries as much as possible (one is only needed for the t distribution to get the p-value).

## References

McDonald, J. H. (2014). *Handbook of Biological Statistics* (3rd ed.). Baltimore: Sparky House Publishing.

## Appendix: The Hard Way

First we convert our pandas series to a Python native format, a list:

In [29]:
X = list(myDf['before'])
Y = list(myDf['after'])

The number of pairs (n) we have is useful to have available:

In [30]:
n = len(X)
n

24

Calculate the difference for each pair.

\begin{equation*}
d_i = X_{i} - Y_{i}
\end{equation*}


In [31]:
# Difference for each pair (before - after)
d = []
for i in range(n):
    d.append(X[i] - Y[i])

print(d)

[-8.0, -4.0, -18.0, -20.0, -48.0, -8.0, -12.0, 4.0, -12.0, -14.0, -14.0, 4.0, 22.0, -64.0, -28.0, -18.0, 10.0, -48.0, -30.0, 42.0, -16.0, -16.0, -22.0, 34.0]


Determine the mean of all these differences:

\begin{equation*}
\bar{d} = \frac{\sum_{i=1}^n d_i} {n}
\end{equation*}

In [32]:
meanD = sum(d) / n
meanD

-11.833333333333334

Determine the sum of squares of the differences:

\begin{equation*}
SS_d = \sum_{i=1}^n \left( d_i - \bar{d} \right)^2
\end{equation*}


In [33]:
SSd = 0

for i in range(n):
    SSd = SSd + (d[i] - meanD)**2

SSd

13431.333333333334

The standard deviation of the differences can then be determined using:
\begin{equation*}
s_d = \sqrt{\frac{SS_d} {n-1}}
\end{equation*}

In [34]:
sd = (SSd / (n - 1))**0.5
sd

24.165492225335566

The standard error is then:

\begin{equation*}
SE_d = \frac{s_d} {\sqrt{n}}
\end{equation*}

In [35]:
SEd = sd / n**0.5
SEd

4.932760444605509

Finally the t value can be determined using:
\begin{equation*}
t = \frac{\bar{d}} {SE_d}
\end{equation*}

Note that if you are testing against a specific difference (e.g. you assume the difference is 3) then the numerator becomes $\bar{d} - \mu_{h_0}$, where $\mu_{h_0}$ is your hypothesized difference. If you are just checking for any difference, this term can be ignored, since then $\mu_{h_0} = 0$.

In [36]:
tVal = meanD / SEd
tVal

-2.398927226696023
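
To illustrate the note on a hypothesized difference, here is a minimal sketch that wraps the steps so far in one function. The function name `paired_t_value` and the hypothesized value of -5 are my own, chosen only for illustration; the differences are the ones printed earlier in this appendix:

```python
from statistics import mean, stdev

# Differences (before - after) as printed earlier in this appendix
d = [-8.0, -4.0, -18.0, -20.0, -48.0, -8.0, -12.0, 4.0, -12.0, -14.0, -14.0, 4.0,
     22.0, -64.0, -28.0, -18.0, 10.0, -48.0, -30.0, 42.0, -16.0, -16.0, -22.0, 34.0]

def paired_t_value(diffs, mu_h0=0.0):
    """t-value for the mean of the differences against a hypothesized difference."""
    n = len(diffs)
    se = stdev(diffs) / n**0.5  # stdev divides the sum of squares by n - 1
    return (mean(diffs) - mu_h0) / se

t_default = paired_t_value(d)              # hypothesized difference of 0
t_shifted = paired_t_value(d, mu_h0=-5.0)  # arbitrary hypothesized difference

print(t_default, t_shifted)
```

With the default of 0 this should give the same t-value as above; with a hypothesized difference closer to the sample mean, the t-value moves towards 0.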

To find the corresponding p-value we would need the degrees of freedom, which is defined for this test as:

\begin{equation*}
df = n - 1
\end{equation*}

In [37]:
df = n - 1
df

23

To find the p-value of a t-value with a specific df, we will need a package. Scipy.stats has a t distribution, so let's use that:

In [38]:
from scipy.stats import t

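Finally, to get the two-sided p-value we double the upper tail probability (the survival function 'sf') of the absolute t-value: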

In [39]:
t.sf(abs(tVal),df)*2

0.02494298862638759
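
As a cross-check on the manual steps: a paired samples t-test is the same as a one-sample t-test on the differences against 0. A minimal sketch using the 'ttest_1samp' function from scipy.stats on the differences printed earlier in this appendix:

```python
from scipy.stats import ttest_1samp

# Differences (before - after) as printed earlier in this appendix
d = [-8.0, -4.0, -18.0, -20.0, -48.0, -8.0, -12.0, 4.0, -12.0, -14.0, -14.0, 4.0,
     22.0, -64.0, -28.0, -18.0, 10.0, -48.0, -30.0, 42.0, -16.0, -16.0, -22.0, 34.0]

# One-sample t-test of the differences against a population mean of 0
res = ttest_1samp(d, popmean=0)

print(res.statistic, res.pvalue)
```

Both the t-value and the p-value should agree with the results found by hand.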