## ANCOVA

### 1. Importing the libraries
First, let us set up the environment by importing the libraries.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin as pg
import statsmodels.api as sm
from patsy.contrasts import ContrastMatrix
from scipy.stats import pearsonr
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

### 2. Data exploration
The "depression" data frame contains information from 100 patients who agreed to participate in a study investigating the effects of a new intensive depression therapy. The relevant variables are "group" (0 = control, 1 = treatment), "pre" and "post" (depression scores before and after therapy), "EM" (an emotional intelligence score, which is suspected to influence the treatment outcome), and "change" (computed by subtracting "pre" from "post", so negative values indicate a drop in the depression score). Our task is to investigate whether the intensive therapy is effective at reducing self-reported depression.

In [2]:
# Read the CSV file using a relative path
depression = pd.read_csv("../ANOVA_and_ANCOVA/Datasets/depression.csv")

# Display the first few rows of the dataframe
print(depression.head())

# Describe the dataframe
print("\nStatistical description of the data:")
print(depression.describe())

   id  group  pre  post  EM  change
0   1      0   34    76  25      42
1   2      0   87    51  34     -36
2   3      0   68    71  15       3
3   4      0   38    72  41      34
4   5      0   82    47  14     -35

Statistical description of the data:
               id       group         pre        post          EM      change
count  100.000000  100.000000  100.000000  100.000000  100.000000  100.000000
mean    50.500000    0.500000   60.830000   46.640000   25.760000  -14.190000
std     29.011492    0.502519   16.736855   19.779348    9.721495   25.343178
min      1.000000    0.000000   31.000000   14.000000   10.000000  -69.000000
25%     25.750000    0.000000   48.000000   31.500000   16.000000  -32.000000
50%     50.500000    0.500000   60.000000   47.500000   25.500000  -15.000000
75%     75.250000    1.000000   75.250000   59.000000   34.000000    7.250000
max    100.000000    1.000000   90.000000   87.000000   45.000000   42.000000


### 3. Background

As you learned in the theoretical session of the seminar, researchers used to follow a two-step approach to investigate whether a treatment effect differs between individuals, for example whether some individuals change more than others during a drug treatment. They first calculated the change score for every observation and then correlated that change score with the covariate of interest. We provide an example below for illustration purposes, but keep in mind the disadvantages of this method **(remember: range restriction of the change score)**. This two-step approach is no longer considered acceptable!

In [3]:
# Perform a correlation test
correlation, p_value = pearsonr(depression['EM'], depression['change'])
print(f"Pearson correlation: {correlation}")
print(f"P-value: {p_value}")

Pearson correlation: 0.07270882951614406
P-value: 0.4722029695800074


### 4. Concept of ANCOVA
ANCOVA (analysis of covariance) is frequently described as a combination of ANOVA and a simple or multiple regression with non-categorical predictors. That is because the ANCOVA framework allows us to investigate differences between groups (levels of an independent variable) while controlling for (partialling out, adjusting for) one or more covariates that are relevant to the investigation. In other words, ANCOVA evaluates differences in a dependent variable after the effect of one or more covariates has been removed. This is especially important when a covariate is suspected to influence the effect of the independent variable. Its main goal is therefore to eliminate confounds and to reduce within-group error variance, thus increasing **internal validity**. Keep in mind, however, that this holds when the groups to be compared do not differ significantly in their levels of the covariate. In fact, this is one of the assumptions of ANCOVA.
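For a single covariate and two groups, the idea can be written as one regression equation. The notation below is an illustrative sketch and not part of the seminar material: $y_i$ is the dependent variable, $x_i$ the covariate, $g_i$ a numeric code for group membership, and $\varepsilon_i$ the residual.

$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 g_i + \varepsilon_i
$$

Under this sketch, $\beta_1$ is the slope of the covariate shared by both groups, and $\beta_2$ captures the group effect at any fixed value of the covariate.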

In the context of analyzing pre-post design data including a treatment and a control group, ANCOVA is generally used to explore adjusted group effects. That is, ANCOVA will determine whether there is a group effect on the outcome variable regardless of the starting values (where the participants started).
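To make "adjusted group effects" concrete, here is a minimal sketch that compares raw group means with adjusted means, i.e. the predicted outcome for each group at the overall mean of the covariate. It uses synthetic stand-in data (the data frame `toy`, its values, and the assumed effect sizes are hypothetical and not the seminar dataset), and it keeps `group` as a plain 0/1 number for simplicity.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (hypothetical values, not the seminar dataset)
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "group": np.repeat([0, 1], n // 2),   # 0 = control, 1 = treatment
    "pre": rng.normal(60, 15, n),         # baseline scores
})
# Assumed data-generating process: common slope of 0.5, treatment lowers scores by 20
toy["post"] = 20 + 0.5 * toy["pre"] - 20 * toy["group"] + rng.normal(0, 10, n)

# ANCOVA as a linear model: outcome ~ covariate + group
fit = ols("post ~ pre + group", data=toy).fit()

# Adjusted means: predicted outcome per group at the overall mean of the covariate
grid = pd.DataFrame({"group": [0, 1], "pre": toy["pre"].mean()})
adjusted_means = fit.predict(grid)

print("Raw group means:")
print(toy.groupby("group")["post"].mean())
print("\nAdjusted means at the mean of 'pre':")
print(adjusted_means)
print("\nAdjusted group difference:", adjusted_means[1] - adjusted_means[0])
```

With 0/1 coding, the adjusted group difference equals the `group` coefficient of the fitted model, which is why the coefficient can be read as the group effect "regardless of where the participants started".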

### 5. Assumptions
We will not check the following assumptions here, but please keep them in mind if you intend to use ANCOVA in your later research.

1. As outlined above, the treatment (independent variable) and the covariate to be controlled for must be independent. 

2. Homogeneity of regression slopes: ANCOVA fits a regression with the covariate as a predictor of the dependent variable for the entire data set. The slopes within each group are expected not to differ significantly, so that the overall model is a good representation of these individual slopes.
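Should you need to check these two assumptions in your own work, one possible approach is sketched below. It is only an illustration on synthetic stand-in data (the data frame `toy` and its values are hypothetical), and it codes `group` as a plain 0/1 number so that the parameter names stay simple. Assumption 1 is probed by regressing the covariate on group; assumption 2 by adding a covariate-by-group interaction to the model.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (hypothetical values, not the seminar dataset)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "group": np.repeat([0, 1], n // 2),   # numeric 0/1 coding keeps parameter names simple
    "pre": rng.normal(60, 15, n),
})
toy["post"] = 20 + 0.5 * toy["pre"] - 20 * toy["group"] + rng.normal(0, 10, n)

# Assumption 1: the covariate should not differ between groups.
# Regressing the covariate on group is equivalent to a two-sample t-test.
balance = ols("pre ~ group", data=toy).fit()
p_balance = balance.pvalues["group"]

# Assumption 2: homogeneity of regression slopes.
# Add the covariate-by-group interaction; a significant term suggests unequal slopes.
slopes = ols("post ~ pre * group", data=toy).fit()
p_interaction = slopes.pvalues["pre:group"]

print(f"Covariate differs between groups? p = {p_balance:.3f}")
print(f"Slopes differ between groups?     p = {p_interaction:.3f}")
```

A non-significant result in either check does not prove that the assumption holds; it only means the data give no clear evidence against it.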

### 6. Computation (Exercise 3)
Given that ANCOVA is an extension of ANOVA and both are special cases of the general linear model, it is not surprising that we can use the `anova_lm()` function. Please note that ANCOVA does not help us to investigate individual differences in change.

We can run the ANCOVA model to investigate whether the outcomes of the therapy differ between groups while accounting for the starting scores ("pre" variable).

1. Set proper contrasts for your factor (group).
2. Specify and fit the model with `ols()`.
3. Run the ANCOVA with `anova_lm()`.
4. Use the `summary()` method to inspect the results.

In [4]:
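# Define a -1/+1 contrast for the two levels of `group` (0 = control -> -1, 1 = treatment -> +1)
contrast_group = ContrastMatrix(np.array([[-1], [1]]), ["contrast_group"])

# Convert `group` to a categorical variable; the contrast itself is applied in the formula via C()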
depression['group'] = depression['group'].astype('category')

# Fit the ANCOVA model
model = ols('post ~ pre + C(group, contrast_group)', data=depression).fit()

# Perform Type III ANOVA
anova_results = anova_lm(model, typ=3)

# Display the Type III ANOVA results
print("Type III ANOVA Results:")
print(anova_results)

# Display the summary of the ANCOVA model
print("\nANCOVA Model Summary:")
print(model.summary())

# Display the contrasts for the `group` variable
print("\nContrasts for 'group':")
print(contrast_group.matrix)

Type III ANOVA Results:
                                sum_sq    df          F        PR(>F)
Intercept                 14312.910843   1.0  63.937373  2.735472e-12
C(group, contrast_group)  16942.129734   1.0  75.682387  8.570490e-14
pre                          12.625995   1.0   0.056402  8.127772e-01
Residual                  21714.254005  97.0        NaN           NaN

ANCOVA Model Summary:
                            OLS Regression Results                            
Dep. Variable:                   post   R-squared:                       0.439
Model:                            OLS   Adj. R-squared:                  0.428
Method:                 Least Squares   F-statistic:                     38.01
Date:                Tue, 15 Apr 2025   Prob (F-statistic):           6.48e-13
Time:                        08:47:40   Log-Likelihood:                -410.92
No. Observations:                 100   AIC:                             827.8
Df Residuals:                      97   BIC:      

### 7. Interpretation
The output tells us that the covariate "pre" does not have a significant effect (t = 0.237, p = 0.813) on the outcome variable (scores after the intervention). In addition, there is a significant group effect (t = -8.7, p < .001) on the dependent variable. Taken together, these results are important for longitudinal studies, as they indicate that the differences between groups are not due to the participants' initial scores. In other words, we can be more confident that the significant differences between groups are related to the different treatments they received and not to the baseline values.
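The t values quoted here can be cross-checked against the Type III ANOVA table: for a term with one degree of freedom, the squared t statistic equals the F statistic. The short sketch below recomputes |t| and the p-values from the F values and residual degrees of freedom printed in that table. It is an illustrative check and not part of the original exercise, and it recovers only the magnitude of t, not its sign.

```python
import math
from scipy.stats import f

# F values and residual df copied from the Type III ANOVA table above
f_pre, f_group, df_resid = 0.056402, 75.682387, 97

# For a 1-df term, |t| is the square root of F
t_pre = math.sqrt(f_pre)
t_group = math.sqrt(f_group)

# The upper-tail probability of the F distribution gives the p-value
p_pre = f.sf(f_pre, 1, df_resid)
p_group = f.sf(f_group, 1, df_resid)

print(f"pre:   |t| = {t_pre:.3f}, p = {p_pre:.3f}")
print(f"group: |t| = {t_group:.2f}, p = {p_group:.2e}")
```

The check also shows that the p-value for the group effect is extremely small rather than exactly zero, which is why such values are usually reported as p < .001.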