This tutorial aims to demonstrate that many of the statistical tests we've covered in this course, including t-tests, correlation, ANOVA, and chi-square, are essentially special versions of regression. This is why, when introducing each statistical test, we always discussed the variables involved, such as one variable, two variables of specific types, and so on. By understanding this relationship between regression and these other tests, hopefully, everything we've covered so far will start to make more sense.

The relation between the common statistical tests and linear models can be summarized in the following chart:

![Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.png)

**N.B.** Most of this tutorial is based on https://lindeloev.github.io/tests-as-linear. You are more than welcome to visit this page for more details.

# Week 6: One continuous variable (one sample t-test)

The dataset that we used for that tutorial was the following:

In [1]:
set.seed(123)

N<-35
X.1<-rnorm(N, mean = 115, sd=15)
mu.0<-110

dat<-data.frame(y=X.1-mu.0) # center the sample at mu.0, so the null hypothesis becomes mean(y) = 0

In [2]:
t.test(dat$y, mu=0, alternative = "two.sided")


	One Sample t-test

data:  dat$y
t = 2.3186, df = 34, p-value = 0.02656
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  0.686995 10.438537
sample estimates:
mean of x 
 5.562766 


Looking at the above chart, the equivalent linear model would be running `lm(y~1)`, where y is our continuous variable. That is, we are going to try to fit a regression model with just the intercept. And this makes sense, because the intercept is defined as the expected value of the dependent variable when the other variables are equal to zero. This expected value corresponds to the mean, which is basically what we are testing in the one-sample t-test.

Let's run the model as explained:

In [3]:
summary(lm(y~1, data = dat))


Call:
lm(formula = y ~ 1, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.062  -9.454   1.097  10.859  26.241 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    5.563      2.399   2.319   0.0266 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.19 on 34 degrees of freedom


As we can see, the estimate, t-statistic and p-value for the intercept are the same as those in the output of `t.test`.

In fact, as we saw earlier, we can also run this function using formulas. The formula we use for this case is the same as the one we used for `lm`. This suggests that we are indeed calculating the same things.

In [4]:
t.test(y~1, data = dat, mu=0, alternative = "two.sided")


	One Sample t-test

data:  y
t = 2.3186, df = 34, p-value = 0.02656
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  0.686995 10.438537
sample estimates:
mean of x 
 5.562766 


Therefore, from here we can conclude that a linear regression model with only the intercept corresponds to running a one-sample t-test on the dependent variable.
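
As a quick cross-check of this equivalence, the sketch below regenerates the same simulated sample and compares the intercept's estimate, t-statistic and 95% confidence interval with those returned by `t.test`. The names `demo.y`, `demo.fit` and `demo.tt` are only illustrative and are not part of the original tutorial.

```r
# Regenerate the centered sample used above (same seed and parameters)
set.seed(123)
demo.y <- rnorm(35, mean = 115, sd = 15) - 110

demo.fit <- lm(demo.y ~ 1)          # intercept-only regression
demo.tt  <- t.test(demo.y, mu = 0)  # one-sample t-test

# Each comparison should return TRUE
all.equal(unname(coef(demo.fit)), unname(demo.tt$estimate))
all.equal(summary(demo.fit)$coefficients[1, "t value"], unname(demo.tt$statistic))
all.equal(as.vector(confint(demo.fit)), as.vector(demo.tt$conf.int))
```

The confidence interval from `confint` matches as well, because both procedures use the same standard error and the same t distribution with 34 degrees of freedom.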

# Week 6: Test for a categorical variable (Pearson's $\chi^2$-test)

Here we will talk about proportions instead of the actual observed values in our categorical variable.

This is the dataset that we created for that part of the tutorial:

In [5]:
set.seed(123)

X<-sapply(c(1:35), function(x) sample(c("happy","neutral","sad"), size=1))
X<-c(rep("happy", 15), X)
X.table<-table(X)

And we used this table to run a Pearson's $\chi^2$-test to assess whether the observed number of occurrences in each category differs from the number expected under the null hypothesis (by default, equal probabilities for all categories):

In [6]:
chisq.test(X.table)


	Chi-squared test for given probabilities

data:  X.table
X-squared = 6.52, df = 2, p-value = 0.03839


As we saw in the chart above, this should correspond to running a regression model where the dependent variable is the number of occurrences per category and the independent variable is the category itself. Let's create a data frame that contains these two variables:

In [7]:
dat<-data.frame(mood=names(X.table), counts=as.vector(X.table))
dat

mood,counts
<chr>,<int>
happy,25
neutral,11
sad,14


Now, the regression model cannot be the usual linear regression, because that assumes that the dependent variable is continuous and follows a Gaussian distribution. However, here the dependent variable is a number of occurrences. Counts like these can be modelled with the so-called Poisson distribution, which is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space at a known constant mean rate. As a result, let's use this distribution as the *family* of `glm` (its default link function is the logarithm, so the coefficients are on the log scale):

In [8]:
reg.full = glm(counts ~ 1 + mood, data = dat, family = poisson())
summary(reg.full)


Call:
glm(formula = counts ~ 1 + mood, family = poisson(), data = dat)

Deviance Residuals: 
[1]  0  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.2189     0.2000  16.094   <2e-16 ***
moodneutral  -0.8210     0.3618  -2.269   0.0233 *  
moodsad      -0.5798     0.3338  -1.737   0.0824 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  6.2500e+00  on 2  degrees of freedom
Residual deviance: -2.2204e-16  on 0  degrees of freedom
AIC: 19.803

Number of Fisher Scoring iterations: 3


Hmm, running this yields p-values that compare the count in each category with the count in the baseline category (here, *happy*), rather than a single test across all categories.
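
To make the log scale of these coefficients more concrete, here is a minimal sketch that refits the same kind of model on the counts from the table above and back-transforms the coefficients with `exp()`. The object names `demo.counts` and `demo.pois` are only illustrative.

```r
# Counts taken from the table above
demo.counts <- data.frame(mood = c("happy", "neutral", "sad"),
                          counts = c(25, 11, 14))
demo.pois <- glm(counts ~ mood, data = demo.counts, family = poisson(link = "log"))

# Back-transforming with exp() recovers the counts and their ratios
exp(coef(demo.pois)["(Intercept)"])  # count in the baseline category (happy)
exp(coef(demo.pois)["moodneutral"])  # ratio of neutral to happy counts
fitted(demo.pois)                    # fitted counts per category
```

Because this model has one parameter per category, its fitted counts reproduce the observed counts, which is why the residual deviance above is essentially zero.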

What we can do is to compare the above model with a model that does not have our categorical variable, that is, with a model (our null) with only the intercept:

In [9]:
reg.null = glm(counts ~ 1, data = dat, family = poisson())
summary(reg.null)


Call:
glm(formula = counts ~ 1, family = poisson(), data = dat)

Deviance Residuals: 
      1        2        3  
 1.8991  -1.4805  -0.6719  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.8134     0.1414   19.89   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6.25  on 2  degrees of freedom
Residual deviance: 6.25  on 2  degrees of freedom
AIC: 22.053

Number of Fisher Scoring iterations: 4


This is beyond the scope of the course, but we can use the `anova` function to compare these regression models.

In [10]:
?anova

In [11]:
anova(reg.null, reg.full, test ="Rao") # Do not worry about the last argument

,Resid. Df,Resid. Dev,Df,Deviance,Rao,Pr(>Chi)
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,2,6.250021,,,,
2,0,-2.220446e-16,2.0,6.250021,6.520011,0.03838818


As we see, we get the same p-value as when applying `chisq.test` to the table of occurrences. The Rao score statistic (6.52) also matches the `X-squared` value reported there.
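
For reference, both statistics in that comparison can be computed by hand from the observed counts. The sketch below assumes the default null hypothesis of `chisq.test` (equal probabilities for all categories); the names `demo.obs`, `demo.exp`, `demo.x2` and `demo.g2` are only illustrative.

```r
# Observed counts from the table above; the null hypothesis assumes equal probabilities
demo.obs <- c(happy = 25, neutral = 11, sad = 14)
demo.exp <- rep(sum(demo.obs) / length(demo.obs), length(demo.obs))

# Pearson statistic: this is the value in the Rao column
demo.x2 <- sum((demo.obs - demo.exp)^2 / demo.exp)
# Likelihood-ratio statistic: this is the value in the Deviance column
demo.g2 <- 2 * sum(demo.obs * log(demo.obs / demo.exp))

# p-value of the Pearson statistic with (number of categories - 1) degrees of freedom
pchisq(demo.x2, df = length(demo.obs) - 1, lower.tail = FALSE)
```

This is why `test = "Rao"` is the option that reproduces `chisq.test`: the default likelihood-ratio comparison would use the deviance (6.25) instead of the Pearson statistic (6.52).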

# Week 6: Statistical test for two variables (continuous vs categorical)

For this scenario we generated the following datasets:

In [12]:
set.seed(1234)

# Two groups (means 10 and 11, standard deviation 2) for the Student's t-test
students.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),
                  data.frame(value=rnorm(25, mean = 11, sd = 2), group='b'))

# Three groups (means 10, 12 and 10, standard deviation 2) for the ANOVA
anova.data<-rbind(data.frame(value=rnorm(25, mean = 10, sd = 2), group='a'),
                  data.frame(value=rnorm(25, mean = 12, sd = 2), group='b'), 
                  data.frame(value=rnorm(25, mean = 10, sd = 2), group='c'))

## Student's t-test

In [13]:
t.test(x=students.data[students.data$group=="a", "value"],
       y=students.data[students.data$group=="b", "value"], 
       var.equal = TRUE)


	Two Sample t-test

data:  students.data[students.data$group == "a", "value"] and students.data[students.data$group == "b", "value"]
t = -0.31557, df = 48, p-value = 0.7537
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1419734  0.8321391
sample estimates:
mean of x mean of y 
 9.516435  9.671353 


Looking at the above chart, the equivalent linear model would be running `lm(y~x)`, where *y* is our continuous variable and *x* our categorical variable. As a result, we are going to fit a linear regression model where the $\beta$ coefficient will encode the effect of *x* on *y*. 

In the case of *x* being binary, the $\beta$ coefficient is just the difference in means between both categories. And this is just what the Student's t-test addresses. 

In [14]:
summary(lm(value~group, data = students.data))


Call:
lm(formula = value ~ group, data = students.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2078 -0.9809 -0.4843  0.7488  5.3152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.5164     0.3471  27.415   <2e-16 ***
groupb        0.1549     0.4909   0.316    0.754    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.736 on 48 degrees of freedom
Multiple R-squared:  0.00207,	Adjusted R-squared:  -0.01872 
F-statistic: 0.09958 on 1 and 48 DF,  p-value: 0.7537


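Here we have to look at the p-value for the entire model (the last one in the displayed summary information), which is the same as the one obtained from the `t.test` function. With a single binary predictor, it also coincides with the p-value of the `groupb` coefficient.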

How about the statistic? Well, in this case the statistic that `lm` reports for the entire model is the F-statistic. But as we saw in the lectures, for a two-sample t-test, $F=t^2$. Let's check that this is what is happening here (the square root gives $|t|$, so the sign may differ):

In [15]:
sqrt(summary(lm(value~group, data = students.data))$fstatistic[1])

t.test(x=students.data[students.data$group=="a", "value"],
       y=students.data[students.data$group=="b", "value"], 
       var.equal = TRUE)$statistic
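
The claim that the $\beta$ coefficient is the difference in means can also be checked directly. This is a small sketch on data simulated in the same way as above; the names `demo.two`, `demo.lm`, `demo.means` and `demo.t` are only illustrative.

```r
# Simulate two groups as above
set.seed(1234)
demo.two <- rbind(data.frame(value = rnorm(25, mean = 10, sd = 2), group = "a"),
                  data.frame(value = rnorm(25, mean = 11, sd = 2), group = "b"))

demo.lm    <- lm(value ~ group, data = demo.two)
demo.means <- tapply(demo.two$value, demo.two$group, mean)

# The groupb coefficient equals mean(b) - mean(a); should return TRUE
all.equal(unname(coef(demo.lm)["groupb"]),
          unname(demo.means["b"] - demo.means["a"]))

# t.test compares a - b, so its t-statistic has the opposite sign; should return TRUE
demo.t <- t.test(value ~ group, data = demo.two, var.equal = TRUE)$statistic
all.equal(unname(demo.t), -summary(demo.lm)$coefficients["groupb", "t value"])
```

The intercept is the mean of the baseline group *a*, so adding the `groupb` coefficient to it gives the mean of group *b*.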

## ANOVA

Remember that we use ANOVA when the categorical variable has two or more categories.

In [16]:
summary(aov(value~group, data=anova.data))

            Df Sum Sq Mean Sq F value   Pr(>F)    
group        2  59.39  29.697   7.791 0.000865 ***
Residuals   72 274.45   3.812                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The modelling as a regression is the same as with the Student's t-test:

In [17]:
summary(lm(value~group, data = anova.data))


Call:
lm(formula = value ~ group, data = anova.data)

Residuals:
   Min     1Q Median     3Q    Max 
-4.035 -1.372 -0.242  1.247  4.676 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.4225     0.3905  26.692  < 2e-16 ***
groupb        1.7132     0.5522   3.102  0.00274 ** 
groupc       -0.3106     0.5522  -0.562  0.57556    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.952 on 72 degrees of freedom
Multiple R-squared:  0.1779,	Adjusted R-squared:  0.1551 
F-statistic: 7.791 on 2 and 72 DF,  p-value: 0.0008652


Here we have to look at the p-value for the full model (the last one in the displayed summary information), which, as we see, is the same as the one obtained using `aov`. We have other p-values as well, corresponding to the dummy variables that were created from our original categorical variable. Each of these coefficients gives the difference in the mean of the dependent variable between one category and the baseline category (here, *a*), so its test is similar to a Student's t-test of that category against the baseline. Such pairwise comparisons were the usual post hoc step after running the ANOVA test, but here we obtain them at once. Note that the comparison between *b* and *c* is not shown, and that these p-values are not corrected for multiple comparisons.
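
To illustrate the link between the dummy-variable p-values and pairwise t-tests, the sketch below compares them with `pairwise.t.test`, using its default pooled standard deviation and no p-value adjustment. The data are simulated in the same spirit as above, and the names `demo.three`, `demo.aov.lm`, `demo.pw` and `demo.coef.p` are only illustrative.

```r
# Simulate three groups with means 10, 12 and 10
set.seed(1234)
demo.three <- data.frame(value = c(rnorm(25, mean = 10, sd = 2),
                                   rnorm(25, mean = 12, sd = 2),
                                   rnorm(25, mean = 10, sd = 2)),
                         group = rep(c("a", "b", "c"), each = 25))

demo.aov.lm <- lm(value ~ group, data = demo.three)

# Pairwise t-tests with a standard deviation pooled over all groups, unadjusted
demo.pw <- pairwise.t.test(demo.three$value, demo.three$group,
                           p.adjust.method = "none")

# p-values of the dummy variables versus the comparisons against the baseline "a"
demo.coef.p <- summary(demo.aov.lm)$coefficients[c("groupb", "groupc"), "Pr(>|t|)"]
all.equal(unname(demo.coef.p), unname(demo.pw$p.value[c("b", "c"), "a"]))
```

The match relies on the pooled standard deviation: a separate two-sample t-test on just two of the groups would estimate the variance from those two groups only and give slightly different p-values.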

**BONUS TRACK**: What happens if we add an independent continuous variable to the regression model above?

In [18]:
library(tidyverse)
anova.data.bonus<- anova.data %>% mutate(covariate=rnorm(nrow(anova.data)))
summary(lm(value~group + covariate, data = anova.data.bonus))


Call:
lm(formula = value ~ group + covariate, data = anova.data.bonus)

Residuals:
   Min     1Q Median     3Q    Max 
-4.037 -1.404 -0.254  1.253  4.682 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.42283    0.39320  26.508   <2e-16 ***
groupb       1.71052    0.55657   3.073    0.003 ** 
groupc      -0.31153    0.55611  -0.560    0.577    
covariate    0.02302    0.20822   0.111    0.912    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.966 on 71 degrees of freedom
Multiple R-squared:  0.1781,	Adjusted R-squared:  0.1433 
F-statistic: 5.127 on 3 and 71 DF,  p-value: 0.002882


Say we want to compute the p-value for the entire independent categorical variable. We can compare again the above regression model with a model that does not include it.

In [19]:
reg.full<-lm(value~group + covariate, data = anova.data.bonus)
reg.null<-lm(value~covariate, data = anova.data.bonus)

anova(reg.null, reg.full)

,Res.Df,RSS,Df,Sum of Sq,F,Pr(>F)
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,73,333.5758,,,,
2,71,274.4018,2.0,59.17398,7.655474,0.000975736
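
The F-statistic in this comparison can be reproduced by hand from the two residual sums of squares in the table. Writing $RSS_0$ and $df_0$ for the model without `group`, and $RSS_1$ and $df_1$ for the model that includes it (these symbols are introduced here only for illustration):

$$
F = \frac{(RSS_0 - RSS_1)/(df_0 - df_1)}{RSS_1/df_1} = \frac{(333.5758 - 274.4018)/(73 - 71)}{274.4018/71} \approx 7.655
$$

The numerator is the reduction in unexplained variation per parameter added by `group`, and the denominator is the residual variance of the larger model.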


This is the same as running the following code:

In [20]:
car::Anova(aov(value~group + covariate, data = anova.data.bonus))

,Sum Sq,Df,F value,Pr(>F)
,<dbl>,<dbl>,<dbl>,<dbl>
group,59.1739755,2,7.65547416,0.000975736
covariate,0.04723915,1,0.01222288,0.912279478
Residuals,274.40183148,71,,


This is an ANCOVA test. We have not covered this type of test in the course, but FYI, it is just a usual ANOVA test in which we have a continuous variable that acts as a covariate, that is, as an effect we want to control for. And as we have just seen, we can easily run this kind of test as a linear regression model using the `lm` function.

# Week 8: Statistical test for two variables (categorical vs categorical)

Let's import the data that we used for the tutorial that week:

In [21]:
dat<-read.csv("https://raw.githubusercontent.com/jrasero/cm-85309-2023/main/datasets/tutorial8chisquare.csv")
dat.table<-xtabs(~group + evolution, dat)
dat.table

         evolution
group     Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33

We can run a $\chi^2$-test of independence to address the association between both categorical variables.

In [22]:
chisq.test(dat.table)


	Pearson's Chi-squared test

data:  dat.table
X-squared = 8.976, df = 2, p-value = 0.01124


Similar to the case where we had just one categorical variable, this analysis involves running a regression model with the number of occurrences for all possible pairs of categories between both categorical variables as the dependent variable, and the actual categories for the two categorical variables as independent variables. To do so, we can create a data frame that contains all this necessary information:

In [23]:
dat.occurences<-as.data.frame(xtabs(~group + evolution, dat))
dat.occurences

group,evolution,Freq
<fct>,<fct>,<int>
Drug,Better,30
Placebo,Better,25
Drug,Same,28
Placebo,Same,44
Drug,Worse,11
Placebo,Worse,33


As we learned, the purpose of the $\chi^2$-test of independence is to examine the association between two categorical variables. To account for this type of effect in a regression model, we require a term that considers the interaction between the two categorical variables. This is because the $\beta$s for each individual categorical variable alone only describe how the counts are distributed across the categories of that variable, as in the one-variable Pearson's $\chi^2$-test above, and therefore cannot capture an association. The interaction term enables us to study the combined effect of both categorical variables on the observed occurrences.

In [24]:
reg.full<-glm(Freq ~ group*evolution, data=dat.occurences, family = poisson())
reg.null<-glm(Freq ~ group + evolution, data=dat.occurences, family = poisson())
anova(reg.null, reg.full, test = 'Rao')

,Resid. Df,Resid. Dev,Df,Deviance,Rao,Pr(>Chi)
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,2,9.143495,,,,
2,0,1.754152e-14,2.0,9.143495,8.975973,0.01124326


In this case, the p-value for the interaction term is the same as that from running the `chisq.test` function on the contingency table, and the Rao score statistic (8.976) matches the `X-squared` value.
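
One way to see why the model without the interaction plays the role of the null hypothesis is that its fitted counts are the expected counts under independence (row total times column total, divided by the grand total). The sketch below rebuilds the contingency table by hand from the counts shown above; the names `demo.tab`, `demo.expected`, `demo.x2.indep` and `demo.indep` are only illustrative.

```r
# Contingency table typed in from the output above (filled column by column)
demo.tab <- matrix(c(30, 25, 28, 44, 11, 33), nrow = 2,
                   dimnames = list(group = c("Drug", "Placebo"),
                                   evolution = c("Better", "Same", "Worse")))

# Expected counts under independence and the Pearson statistic computed by hand
demo.expected <- outer(rowSums(demo.tab), colSums(demo.tab)) / sum(demo.tab)
demo.x2.indep <- sum((demo.tab - demo.expected)^2 / demo.expected)
demo.x2.indep

# The Poisson model with main effects only has these expected counts as fitted values
demo.indep <- glm(Freq ~ group + evolution,
                  data = as.data.frame(as.table(demo.tab)), family = poisson())
all.equal(as.vector(fitted(demo.indep)), as.vector(demo.expected), tolerance = 1e-6)
```

The interaction term is then what allows the model to depart from these expected counts, which is exactly the departure that the $\chi^2$-test of independence measures.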

# Week 8: Statistical test for two variables (continuous vs continuous)

Let's recreate the data that we used back then for the tutorial on this scenario:

In [25]:
set.seed(1234)
x<-rnorm(50)
y<- 0.1*x + rnorm(50, sd = 1e-1)
z<-rnorm(50)

dat<-data.frame(x,y,z)
head(dat)

,x,y,z
,<dbl>,<dbl>,<dbl>
1,-1.2070657,-0.301309701,0.41452353
2,0.2774292,-0.030464668,-0.47471847
3,1.0844412,-0.002444845,0.06599349
4,-2.3456977,-0.336065971,-0.50247778
5,0.4291247,0.026681517,-0.82599859
6,0.5060559,0.106911171,0.16698928


In our study of assessing the statistical association between two continuous variables, we learned about Pearson's correlation test. This test can be conducted in R using the `cor.test` function.

In [26]:
cor.test(x, y)


	Pearson's product-moment correlation

data:  x and y
t = 5.4891, df = 48, p-value = 1.497e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4142598 0.7668035
sample estimates:
     cor 
0.621001 


In terms of a linear model, this is the same as the scenario where the independent variable was categorical, except that the independent variable is now continuous. That is, we should run `lm(y~x)`, where *y* is one continuous variable and *x* is the other continuous variable:

In [27]:
summary(lm(y~x, data = dat))


Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.20009 -0.06731 -0.01438  0.06123  0.23695 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01063    0.01665   0.638    0.526    
x            0.09267    0.01688   5.489  1.5e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1046 on 48 degrees of freedom
Multiple R-squared:  0.3856,	Adjusted R-squared:  0.3728 
F-statistic: 30.13 on 1 and 48 DF,  p-value: 1.497e-06


**Important**: The p-value to compare here is again the one from the entire model. With a single continuous predictor it coincides with the p-value of the slope, whose t-statistic (5.489) matches the one reported by `cor.test`. Similarly, keep in mind that for the linear regression case, $R^2$ was equal to the square of the Pearson correlation, so it makes sense again to look at what happens to the model as a whole.

In [28]:
sqrt(summary(lm(y~x, data = dat))$r.squared)

cor.test(x, y)$estimate
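
The slope and the correlation coefficient are also directly related: the slope equals the correlation multiplied by the ratio of the standard deviations of *y* and *x*. The sketch below checks this, together with the equality of the t-statistics, on data simulated as above; the names `demo.x`, `demo.y2`, `demo.reg` and `demo.r` are only illustrative.

```r
# Simulate two related continuous variables as above
set.seed(1234)
demo.x  <- rnorm(50)
demo.y2 <- 0.1 * demo.x + rnorm(50, sd = 1e-1)

demo.reg <- lm(demo.y2 ~ demo.x)
demo.r   <- cor(demo.x, demo.y2)

# Slope = r * sd(y) / sd(x); should return TRUE
all.equal(unname(coef(demo.reg)["demo.x"]), demo.r * sd(demo.y2) / sd(demo.x))

# The slope's t-statistic equals the one from the correlation test; should return TRUE
all.equal(summary(demo.reg)$coefficients["demo.x", "t value"],
          unname(cor.test(demo.x, demo.y2)$statistic))
```

A consequence is that if both variables are standardized first (for example with `scale`), the slope is the Pearson correlation itself.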