# 02 - Multivariate linear regression

## Data

Source of data: UCLA

Introduction to Regression in R. UCLA: Statistical Consulting Group.
From https://stats.oarc.ucla.edu/r/seminars/introduction-to-regression-in-r/.

Elementary school academic performance index:
https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv

Data set: elemapi2.csv

In [1]:
library(readr)
elemapi2 <- read_csv("data/elemapi2.csv",
                 show_col_types = FALSE)
head(elemapi2)


snum,dnum,api00,api99,growth,meals,ell,yr_rnd,mobility,acs_k3,⋯,hsg,some_col,col_grad,grad_sch,avg_ed,full,emer,enroll,mealcat,collcat
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
906,41,693,600,93,67,9,0,11,16,⋯,0,0,0,0,,76,24,247,2,1
889,41,570,501,69,92,21,0,33,15,⋯,0,0,0,0,,79,19,463,3,1
887,41,546,472,74,97,29,0,36,17,⋯,0,0,0,0,,68,29,395,3,1
876,41,571,487,84,90,27,0,27,20,⋯,45,9,9,0,1.91,87,11,418,3,1
888,41,478,425,53,89,30,0,44,18,⋯,50,0,0,0,1.5,87,13,520,3,1
4284,98,858,844,14,10,3,0,10,20,⋯,8,24,36,31,3.89,100,0,343,1,2


## SAS program snippet

The following SAS code will be executed.

The option /stb provides standardized estimates.

## R chunk

Packages will be loaded in the chunk where they are first needed.

A similar R program might look like this. It uses the lm() function.

The tidy() function from the broom package formats the output as a tibble for easier processing.

In [2]:
library(broom)
# Linear regression using lm()
my_lm <- lm(api00 ~ ell + meals + yr_rnd + mobility + acs_k3 + acs_46 + full + emer + enroll, data = elemapi2)
tidy(my_lm)


term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),758.9417932,62.2860073,12.184788,4.13489e-29
ell,-0.86007067,0.21063175,-4.0832907,5.402931e-05
meals,-2.94821634,0.17034524,-17.3073006,4.715844e-50
yr_rnd,-19.88874706,9.25844226,-2.1481742,0.03232329
mobility,-1.30135168,0.43620533,-2.9833466,0.003032785
acs_k3,1.31870017,2.25268291,0.5853909,0.5586278
acs_46,2.03245622,0.79832127,2.5459126,0.01128813
full,0.609715,0.47582046,1.2813972,0.2008254
emer,-0.70661916,0.60540863,-1.1671772,0.2438612
enroll,-0.01216405,0.01679211,-0.7243903,0.4692661


## Results

The output is divided into blocks to explain it and to reproduce it afterwards in the different languages.

### Block 1
![Block 1](img_screenshots/block_1.png)

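Number of observations read is the number of observations in the dataset.

Number of observations used is the number of complete cases with respect to the variables used in the SAS program snippet.

Number of observations with missing values is the number of observations that have a missing value in at least one of the model variables. It is the difference between the two numbers above.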

## R chunk for reproduction


In [3]:
library(tidyverse)
# Number of observations read
nrow(elemapi2)
# Number of observations used
sum(elemapi2 %>% 
    select(api00, ell, meals, yr_rnd, mobility, acs_k3, acs_46, full, emer, enroll) %>%
    complete.cases())

# Alternative for number of observations used.
nobs(my_lm)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ purrr     1.0.1
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


The number of observations read is the number of rows in the dataset.

The number of observations used can be obtained either as the number of complete cases with respect to the variables in the model or from the nobs() function applied to the fitted model.
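The third number in the SAS block, the observations with missing values, follows from the other two. The following is a minimal sketch in base R. It assumes the built-in `airquality` data set as a stand-in for elemapi2 so that it runs without the CSV file; the variables `Ozone`, `Solar.R`, `Wind` and `Temp` belong to that data set, not to this tutorial.

```r
# Stand-in data: built-in airquality (contains missing values); not the tutorial data
model_vars <- c("Ozone", "Solar.R", "Wind", "Temp")
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

n_read    <- nrow(airquality)                               # observations read
n_used    <- sum(complete.cases(airquality[, model_vars]))  # observations used
n_missing <- n_read - n_used                                # observations with missing values

c(read = n_read, used = n_used, missing = n_missing)

# lm() drops incomplete rows by default (na.action = na.omit), so nobs() agrees
nobs(fit) == n_used
```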

### Block 2
![Block 2](img_screenshots/block_2.png)

An analysis of variance was performed for the data.

#### Source
The column source in this table presents the sources of variance. They are divided into

-  Model,
-  Residual, and
-  Total.

Model stands for the variance which is explained by the independent variables.

Total stands for the total variance. It can be divided into the variance explained by the model and the variance not explained by the model, which is called residual or error.

The sum of squares of the model plus the sum of squares of the error is equal to the total sum of squares.

#### DF

The degrees of freedom are calculated as follows:

The df for the total is the number of used observations minus one.

The df for the model is the number of parameters in the model minus one. The intercept counts as one parameter if it is not explicitly omitted.

The df for the error is the difference $df_{total} - df_{model}$.
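As a worked example for the model in this tutorial, with 395 observations used and ten parameters (nine independent variables plus the intercept):

$df_{total} = 395 - 1 = 394$

$df_{model} = 10 - 1 = 9$

$df_{error} = 394 - 9 = 385$

These values agree with `nobs`, `df` and `df.residual` as reported by glance() for this model.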

#### Sum of squares

The calculation of the sums of squares is not derived in detail here.

It can be found in several other tutorials.
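As a rough illustration, the three sums of squares can be computed directly from the observed and fitted values. This is a minimal sketch in base R, again assuming the built-in `airquality` data as a stand-in for elemapi2; the object names (`ss_total` and so on) are made up for this example.

```r
# Stand-in data: built-in airquality; not the tutorial data
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

y     <- model.response(model.frame(fit))  # dependent variable, used rows only
y_hat <- fitted(fit)                       # values predicted by the model

ss_total <- sum((y - mean(y))^2)      # total variation around the mean
ss_model <- sum((y_hat - mean(y))^2)  # variation explained by the model
ss_error <- sum((y - y_hat)^2)        # unexplained (residual) variation

c(model = ss_model, error = ss_error, total = ss_total)

# Model and error sums of squares add up to the total
all.equal(ss_model + ss_error, ss_total)
```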

#### Mean square

The mean square is the sum of squares divided by the degrees of freedom.

#### F-Value

The F-value is the mean square model divided by the mean square error. The degrees of freedom are $df_{model}$ and $df_{error}$.

#### Pr > F

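The null hypothesis tested is that there is no linear relationship between the independent variables and the dependent variable, i.e. that all coefficients except the intercept are zero.

The alternative hypothesis states that there is a linear relationship, i.e. that at least one of these coefficients is not zero.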
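The mean squares, the F-value and Pr > F can be reproduced step by step from the sums of squares and the degrees of freedom. The sketch below uses base R and, as an assumption, the built-in `airquality` data instead of elemapi2; the result is compared with the F statistic stored by summary().

```r
# Stand-in data: built-in airquality; not the tutorial data
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

y        <- model.response(model.frame(fit))
ss_model <- sum((fitted(fit) - mean(y))^2)
ss_error <- sum(residuals(fit)^2)

df_model <- length(coef(fit)) - 1   # parameters minus one (the intercept)
df_error <- df.residual(fit)

ms_model <- ss_model / df_model     # mean square model
ms_error <- ss_error / df_error     # mean square error

f_value <- ms_model / ms_error
p_value <- pf(f_value, df_model, df_error, lower.tail = FALSE)  # Pr > F

c(F = f_value, p = p_value)

# For comparison: F value, numerator df and denominator df from summary()
summary(fit)$fstatistic
```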


## R chunk for reproduction


In [4]:
my_aov <- aov(api00 ~ ell + meals + yr_rnd + mobility + acs_k3 + acs_46 + full + emer + enroll, data = elemapi2)
tidy(my_aov)

term,df,sumsq,meansq,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ell,1,4677256.605,4677256.605,1451.384299,1.116668e-132
meals,1,1890980.592,1890980.592,586.7840428,2.049987e-79
yr_rnd,1,50911.559,50911.559,15.7982006,8.415214e-05
mobility,1,16615.912,16615.912,5.1560297,0.02371755
acs_k3,1,7182.54,7182.54,2.2287907,0.1362788
acs_46,1,20061.047,20061.047,6.2250782,0.01301343
full,1,71776.875,71776.875,22.2728489,3.316001e-06
emer,1,4225.835,4225.835,1.311305,0.2528699
enroll,1,1691.041,1691.041,0.5247414,0.4692661
Residuals,385,1240707.781,3222.618,,


The model is split here into its individual terms. Unlike in SAS, it is not shown as a complete model in one row. The sums of squares from aov() are sequential, so each term is adjusted only for the terms listed before it.

Error is called Residuals here.

The row with the total is missing. It can be appended easily as the sum of the df column and the sum of the sum of squares column.

The columns and their contents are similar.
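One possible way to collapse the per-term table into the three rows Model, Error and Total is sketched below. It uses base R only and assumes the built-in `airquality` data as a stand-in for elemapi2; the data frame `overall` and its column names are made up for this example.

```r
# Stand-in data: built-in airquality; not the tutorial data
fit_aov <- aov(Ozone ~ Solar.R + Wind + Temp, data = airquality)
tab <- summary(fit_aov)[[1]]   # one row per term plus a Residuals row

# Row names may be padded with blanks, so trim them before comparing
is_resid <- trimws(rownames(tab)) == "Residuals"

overall <- data.frame(
  source = c("Model", "Error", "Total"),
  df     = c(sum(tab$Df[!is_resid]), tab$Df[is_resid], sum(tab$Df)),
  sumsq  = c(sum(tab$`Sum Sq`[!is_resid]), tab$`Sum Sq`[is_resid], sum(tab$`Sum Sq`))
)

overall$meansq <- overall$sumsq / overall$df
overall$meansq[3] <- NA        # a mean square is not usually reported for the total
overall
```

Summing the term rows works because the sequential sums of squares always add up to the model sum of squares, whatever the order of the terms.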

### Block 3
![Block 3](img_screenshots/block_3.png)

#### Root MSE

Root MSE is the estimated standard deviation of the error term.

It is the square root of the mean square error (or residual).

#### Dependent mean

The dependent mean is the mean of the dependent variable over the observations that were used, i.e. not omitted because of missing values.

#### Coeff Var

The coefficient of variation is the root MSE divided by the dependent mean, expressed as a percentage. It is a unit-free measure of variation in the data.

#### R-square

R-square is the proportion of the total variance that is explained by the model: the sum of squares of the model divided by the total sum of squares.

#### Adj R-Sq

Adjusted R-square adjusts for the relation between the number of independent variables (k) in the model and the number of observations used (N).

$R^2_{adj} = 1 - (1 - R^2) \frac{N - 1}{N - k - 1}$
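As a worked example with the values R reports for this model (R-square 0.8445503, 395 observations used, nine independent variables):

$R^2_{adj} = 1 - (1 - 0.8445503) \cdot \frac{395 - 1}{395 - 9 - 1} = 1 - 0.1554497 \cdot \frac{394}{385} \approx 0.8409$

This agrees with the `adj.r.squared` value returned by glance().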



## R chunk for reproduction

In [5]:
glance(my_lm)

r.squared,adj.r.squared,sigma,statistic,p.value,df,logLik,AIC,BIC,deviance,df.residual,nobs
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>
0.8445503,0.8409164,56.7681,232.4095,1.183585e-149,9,-2150.811,4323.623,4367.39,1240708,385,395


The glance() function provides

-  Root MSE: sigma,
-  R-square: r.squared, and
-  Adjusted R-square: adj.r.squared.

Dependent mean and Coeff Var can be calculated.

In [6]:
df_complete <- elemapi2[complete.cases(elemapi2 %>% select(api00, ell, meals, yr_rnd, mobility, acs_k3, acs_46, full, emer, enroll)), ]

DependentMean <- mean(df_complete$api00)
DependentMean
CoeffVar <- glance(my_lm)$sigma / DependentMean * 100
CoeffVar

### Block 4
![Block 4](img_screenshots/block_4.png)

#### Variable

This column refers to the name of the variable in the model.

#### DF

The degrees of freedom are one for continuous and binary variables. For categorical variables they are equal to the number of levels minus one.

#### Parameter estimate

This column contains the values which are the $b_i$ in the model.

$\hat{y} = b_0 + b_1 * x_1 + b_2 * x_2 + ... + b_n * x_n$

$api00 = 758.94 - 0.86 * ell - 2.95 * meals ...$

#### Standard error

The standard errors are provided for each variable.

They can be used for calculating the t-value: Parameter estimate divided by standard error.   

#### t value

The t value is the parameter estimate divided by its standard error. The null hypothesis tested is that the coefficient is zero.

The alternative hypothesis is that the coefficient is not equal to zero.

#### Pr > |t|

The p-value is provided for a two-sided test. It can be divided by two for a one-sided test, provided the estimate has the sign stated in the alternative hypothesis.
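The t values and two-sided p-values can be recomputed from the estimates and their standard errors. This is a minimal sketch in base R, assuming the built-in `airquality` data as a stand-in for elemapi2.

```r
# Stand-in data: built-in airquality; not the tutorial data
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

est <- coef(fit)               # parameter estimates
se  <- sqrt(diag(vcov(fit)))   # standard errors

t_value <- est / se
# Two-sided p-value from the t distribution with the error degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

cbind(estimate = est, std.error = se, t = t_value, p = p_value)
```

The result should match the coefficient table printed by summary().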

#### Standardized estimate

These are the estimates that result when the dependent variable and the independent variables are standardized (mean 0, standard deviation 1) before fitting the model.
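The tidy() output does not contain standardized estimates. One way to obtain them is to rescale each slope by the ratio of the standard deviation of its variable to the standard deviation of the dependent variable. The sketch below uses base R and assumes the built-in `airquality` data as a stand-in for elemapi2. It also assumes a model with numeric predictors only, without factors or interactions.

```r
# Stand-in data: built-in airquality; not the tutorial data
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
mf  <- model.frame(fit)               # rows actually used in the fit

sd_y <- sd(mf[[1]])                   # dependent variable is the first column
sd_x <- sapply(mf[-1], sd)            # one standard deviation per predictor

b_std <- coef(fit)[-1] * sd_x / sd_y  # slopes only; the intercept is dropped
b_std

# Cross-check: refit the model on z-scored variables
fit_z <- lm(Ozone ~ Solar.R + Wind + Temp, data = as.data.frame(scale(mf)))
coef(fit_z)[-1]
```

Both approaches should give the same slopes. The rescaling variant avoids refitting the model.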


## R chunk for reproduction


In [7]:
my_lm <- lm(api00 ~ ell + meals + yr_rnd + mobility + acs_k3 + acs_46 + full + emer + enroll, data = elemapi2)
tidy(my_lm)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),758.9417932,62.2860073,12.184788,4.13489e-29
ell,-0.86007067,0.21063175,-4.0832907,5.402931e-05
meals,-2.94821634,0.17034524,-17.3073006,4.715844e-50
yr_rnd,-19.88874706,9.25844226,-2.1481742,0.03232329
mobility,-1.30135168,0.43620533,-2.9833466,0.003032785
acs_k3,1.31870017,2.25268291,0.5853909,0.5586278
acs_46,2.03245622,0.79832127,2.5459126,0.01128813
full,0.609715,0.47582046,1.2813972,0.2008254
emer,-0.70661916,0.60540863,-1.1671772,0.2438612
enroll,-0.01216405,0.01679211,-0.7243903,0.4692661


The columns with the degrees of freedom and the standardized estimates are missing.

The other columns are similar.

The summary() function gives an overview of the results.

In [8]:
summary(my_lm)


Call:
lm(formula = api00 ~ ell + meals + yr_rnd + mobility + acs_k3 + 
    acs_46 + full + emer + enroll, data = elemapi2)

Residuals:
     Min       1Q   Median       3Q      Max 
-171.934  -39.294   -2.973   36.096  158.440 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 758.94179   62.28601  12.185  < 2e-16 ***
ell          -0.86007    0.21063  -4.083  5.4e-05 ***
meals        -2.94822    0.17035 -17.307  < 2e-16 ***
yr_rnd      -19.88875    9.25844  -2.148  0.03232 *  
mobility     -1.30135    0.43621  -2.983  0.00303 ** 
acs_k3        1.31870    2.25268   0.585  0.55863    
acs_46        2.03246    0.79832   2.546  0.01129 *  
full          0.60972    0.47582   1.281  0.20083    
emer         -0.70662    0.60541  -1.167  0.24386    
enroll       -0.01216    0.01679  -0.724  0.46927    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.77 on 385 degrees of freedom
  (5 observations deleted due to missingness