# Regression by Hand

## Introduction

This notebook walks through a multiple regression computed by hand (i.e., without using the built-in `lm` function).

The analysis investigates predictors of the federal funds interest rate in the USA. The data were taken from the Federal Reserve Economic Data (FRED) database of the Federal Reserve Bank of St. Louis and contain interest rates from 1960 to 2014.

The model of interest is:

$$
\mathit{intrate}_t =
\beta_1 + \beta_2 \mathit{infl}_t + \beta_3 \mathit{commpri}_t + \beta_4 \mathit{pce}_t + \beta_5 \mathit{persinc}_t + \beta_6 \mathit{houst}_t + \varepsilon_t
$$

where $\mathit{intrate}_t$ is the federal funds interest rate at time $t$, $\mathit{infl}_t$ is inflation, $\mathit{commpri}_t$ is commodity prices, $\mathit{pce}_t$ is personal consumption expenditure, $\mathit{persinc}_t$ is personal income, $\mathit{houst}_t$ is housing starts, and $\varepsilon_t$ is the error term.

In [1]:
# loads packages

library(data.table)
library(magrittr)

In [2]:
# loads data set

dat_interest<-fread("Data/Data_Interest_Rate.csv")

## Estimation

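The coefficient vector $\mathbf{b}$ is estimated by ordinary least squares:

$$
\mathbf{b} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}
$$

where $\mathbf{X}$ is the design matrix (with a leading column of ones for the intercept) and $\mathbf{y}$ is the vector of observed interest rates.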

In [3]:
vec_y<-dat_interest$INTRATE

matr_x<-dat_interest[, .(intercept=1, INFL, COMMPRI, PCE, PERSINC, HOUST)] %>% as.matrix

vec_b<-(((matr_x %>% t) %*% matr_x) %>% solve) %*% (matr_x %>% t) %*% vec_y

In [4]:
vec_b

| Term | Estimate |
|---|---|
| intercept | -0.240118833 |
| INFL | 0.71752653 |
| COMMPRI | -0.007500665 |
| PCE | 0.340525448 |
| PERSINC | 0.240242001 |
| HOUST | -0.020529694 |
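Forming the explicit inverse follows the formula literally, but the same estimate can be obtained by solving the normal equations $\mathbf{X}^\intercal \mathbf{X} \mathbf{b} = \mathbf{X}^\intercal \mathbf{y}$ directly. The following is a minimal sketch on simulated data; the variables `x1`, `x2`, `y_sim` and the true coefficients are invented for illustration and are not part of the interest rate data set.

```r
# Simulated data, for illustration only (not the interest rate data set)
set.seed(1)
n_sim <- 200
x1 <- rnorm(n_sim)
x2 <- rnorm(n_sim)
y_sim <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n_sim)

X_sim <- cbind(intercept = 1, x1, x2)

# Textbook formula: explicit inverse of X'X
b_inverse <- solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim

# Same estimate by solving the normal equations X'X b = X'y directly;
# crossprod(A, B) computes t(A) %*% B without forming the transpose
b_solve <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))

# Compare both with each other and with lm()
b_lm <- coef(lm(y_sim ~ x1 + x2))
all.equal(as.vector(b_inverse), as.vector(b_solve))
all.equal(as.vector(b_solve), unname(b_lm))
```

Both routes should agree up to floating-point error; the `solve(A, b)` form is generally preferred because it skips the intermediate inverse.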


### Fitted Values

$$
\hat{\mathbf{y}} = \mathbf{X} \mathbf{b}
$$

In [5]:
vec_y_hat<-matr_x %*% vec_b

### Residual Variance

$$
s^2 = \frac{1}{n-k} \mathrm{SSE}
$$

where $\mathrm{SSE}$ is the sum of squared errors (residuals) of the full model, $n$ is the number of observations, and $k$ is the total number of parameters in the full model.

The residual standard error $s$ is the square root of $s^2$.

In [6]:
n<-nrow(dat_interest)

k<-length(vec_b)

SSE<-(vec_y-vec_y_hat)^2 %>% sum

s2<-(1/(n-k))*(SSE)

s<-sqrt(s2)

In [7]:
s

### Standard Errors of Coefficients

$$
\mathbf{C} = s^2 (\mathbf{X}^\intercal \mathbf{X})^{-1}
$$

The standard error of coefficient $b_j$ is the square root of $C_{jj}$, the $j$-th diagonal element of the estimated covariance matrix $\mathbf{C}$:

$$
\text{SE}(b_j) = \sqrt{C_{jj}}
$$

In [8]:
matr_C<-s2 * (((matr_x %>% t) %*% matr_x) %>% solve)

C_jj<-diag(matr_C)

vec_SE_b<-sqrt(C_jj)

In [9]:
vec_SE_b

## Evaluation

### R Squared

$$
R^2 = (\text{cor}(\mathbf{y}, \hat{\mathbf{y}}))^2
$$

In [10]:
R2<-cor(vec_y, vec_y_hat)[1]^2

In [11]:
R2
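For a model that includes an intercept, the squared correlation between observed and fitted values equals the more familiar $1 - \mathrm{SSE}/\mathrm{SST}$, where $\mathrm{SST}$ is the total sum of squares around the mean. The sketch below checks this on simulated data; all names and coefficients in it are made up for illustration.

```r
# Simulated data, for illustration only
set.seed(2)
n_sim <- 150
x1 <- rnorm(n_sim)
x2 <- runif(n_sim)
y_sim <- 0.5 + 1.5 * x1 + 3 * x2 + rnorm(n_sim)

# Least-squares fit by hand (design matrix includes an intercept column)
X_sim <- cbind(1, x1, x2)
b_sim <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))
y_hat_sim <- as.vector(X_sim %*% b_sim)

# R squared as a squared correlation
R2_cor <- cor(y_sim, y_hat_sim)^2

# R squared as the share of variation explained
SSE_sim <- sum((y_sim - y_hat_sim)^2)
SST_sim <- sum((y_sim - mean(y_sim))^2)
R2_sse <- 1 - SSE_sim / SST_sim

all.equal(R2_cor, R2_sse)
```

The equivalence relies on the intercept being in the model; without it the two definitions can differ.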

### Adjusted R Squared

$$
R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-g-1}
$$

where $g$ is the number of restrictions, i.e. the number of additional parameters compared to the intercept-only model, so $g = k - 1$.

In [12]:
g<-length(vec_b)-1

adjusted_R2<-1- ((1-R2) * (n-1))/(n-g-1)

In [13]:
adjusted_R2
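Because $k = g + 1$, the denominator $n - g - 1$ is the same as the residual degrees of freedom $n - k$. As a worked check, the rounded figures reported by `summary` in the Reference Model section can be plugged in: $R^2 = 0.6374$ and 654 residual degrees of freedom with $k = 6$ parameters, which implies $n = 660$ and $g = 5$.

$$
1 - \frac{(1 - 0.6374)(660 - 1)}{660 - 5 - 1}
= 1 - \frac{0.3626 \times 659}{654}
\approx 1 - 0.3654
= 0.6346
$$

This agrees with the adjusted R-squared in that output to the reported precision.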

### F-statistic

$$
F =
\frac{(\mathrm{SSE}_0 - \mathrm{SSE})/g}{\mathrm{SSE}/(n-k)}
$$

where $\mathrm{SSE}_0$ is the sum of squared errors of the intercept-only model.

In [14]:
SSE_intercept_only<-(vec_y-mean(vec_y))^2 %>% sum

F<-((SSE_intercept_only-SSE)/g) / (SSE/(n-k))

In [15]:
F

The p-value associated with this F-statistic is the upper-tail probability of the F distribution with $(g, n-k)$ degrees of freedom. Here the result is highly significant.

In [16]:
# upper tail computed directly; 1 - pf(...) can round to exactly 0 for tiny p-values
p<-pf(F, g, n-k, lower.tail=FALSE)

In [17]:
p
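Since $\mathrm{SSE}_0$ is the total sum of squares and $\mathrm{SSE} = (1 - R^2)\,\mathrm{SSE}_0$, the F-statistic can equivalently be written as $(R^2/g) / ((1-R^2)/(n-k))$. The sketch below plugs in the rounded values from the `lm` summary in the Reference Model section, so it can only match that output to the reported precision. The variable names are chosen for this example.

```r
# Rounded values taken from the summary(lm_reference) output
R2_reported <- 0.6374
df_resid <- 654                 # n - k
k_reported <- 6                 # intercept + 5 predictors
g_reported <- k_reported - 1    # number of restrictions

# F-statistic expressed through R squared
F_from_R2 <- (R2_reported / g_reported) / ((1 - R2_reported) / df_resid)
F_from_R2

# Upper-tail probability of the F distribution with (g, n - k) degrees of freedom
p_upper <- pf(F_from_R2, g_reported, df_resid, lower.tail = FALSE)
p_upper
```

Requesting the upper tail with `lower.tail = FALSE` avoids the subtraction from 1, which loses all precision once the p-value falls below machine epsilon.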

### t-statistics of Coefficients

$$
\mathbf{t} =
\frac{\mathbf{b}}{\text{SE}(\mathbf{b})}
$$

In [18]:
vec_t<-vec_b/vec_SE_b

In [19]:
vec_t

| Term | t value |
|---|---|
| intercept | -1.042336 |
| INFL | 12.55477 |
| COMMPRI | -2.8411 |
| PCE | 5.7564 |
| PERSINC | 4.048463 |
| HOUST | -4.6779 |


Two-sided p-values associated with the t-statistics are calculated from the t distribution with $n-k$ degrees of freedom.

In [20]:
# pt() is vectorised, so no lapply() is needed; the upper tail is doubled for a two-sided test
vec_p<-2*pt(abs(vec_t), n-k, lower.tail=FALSE)

In [21]:
vec_p
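As a self-contained illustration of the two-sided calculation, the sketch below starts from the t-statistics printed above and the 654 residual degrees of freedom reported in the Reference Model section. The values are rounded copies of that output, so the resulting p-values should only be expected to match the `lm` summary approximately; the variable names are chosen for this example.

```r
# t-statistics copied (rounded) from the output above
t_reported <- c(intercept = -1.042336, INFL = 12.55477, COMMPRI = -2.8411,
                PCE = 5.7564, PERSINC = 4.048463, HOUST = -4.6779)
df_resid <- 654   # n - k, as reported by summary(lm_reference)

# Two-sided p-value: twice the upper-tail probability of |t|
p_two_sided <- 2 * pt(abs(t_reported), df_resid, lower.tail = FALSE)

# Print with three significant digits for comparison with the lm summary
signif(p_two_sided, 3)
```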

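## Reference Model

Fitting the same model with the built-in `lm` function reproduces the by-hand coefficient estimates and t-statistics shown above; the remaining quantities can be compared against its summary output.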

In [22]:
lm_reference<-lm(INTRATE~INFL+COMMPRI+PCE+PERSINC+HOUST, data=dat_interest)

summary(lm_reference)


Call:
lm(formula = INTRATE ~ INFL + COMMPRI + PCE + PERSINC + HOUST, 
    data = dat_interest)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1631 -1.5244 -0.1125  1.3715  7.6725 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.240119   0.230366  -1.042  0.29764    
INFL         0.717527   0.057152  12.555  < 2e-16 ***
COMMPRI     -0.007501   0.002640  -2.841  0.00464 ** 
PCE          0.340525   0.059156   5.756 1.32e-08 ***
PERSINC      0.240242   0.059342   4.048 5.77e-05 ***
HOUST       -0.020530   0.004389  -4.678 3.52e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.188 on 654 degrees of freedom
Multiple R-squared:  0.6374,	Adjusted R-squared:  0.6346 
F-statistic: 229.9 on 5 and 654 DF,  p-value: < 2.2e-16
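The comparison with `lm` can also be done programmatically rather than by eye. The following sketch repeats the by-hand formulas on simulated data and checks each quantity against the corresponding component of `summary(lm(...))`. The data frame `d_sim`, its columns, and the true coefficients are invented for illustration; with the real data the same checks would be run against `lm_reference`.

```r
# Simulated data, for illustration only (names and coefficients are invented)
set.seed(3)
n_sim <- 120
d_sim <- data.frame(x1 = rnorm(n_sim), x2 = rnorm(n_sim))
d_sim$y <- 2 + 0.8 * d_sim$x1 - 1.2 * d_sim$x2 + rnorm(n_sim)

# By-hand quantities, following the formulas in this notebook
X <- cbind(intercept = 1, x1 = d_sim$x1, x2 = d_sim$x2)
y <- d_sim$y
XtX_inv <- solve(crossprod(X))
b <- as.vector(XtX_inv %*% crossprod(X, y))
y_hat <- as.vector(X %*% b)
k <- ncol(X)
g <- k - 1
SSE <- sum((y - y_hat)^2)
SSE_0 <- sum((y - mean(y))^2)
s2 <- SSE / (n_sim - k)
se_b <- sqrt(diag(s2 * XtX_inv))
t_b <- b / se_b
p_b <- 2 * pt(abs(t_b), n_sim - k, lower.tail = FALSE)
R2 <- cor(y, y_hat)^2
adj_R2 <- 1 - (1 - R2) * (n_sim - 1) / (n_sim - g - 1)
F_stat <- ((SSE_0 - SSE) / g) / (SSE / (n_sim - k))

# Reference values from lm()
fit_sum <- summary(lm(y ~ x1 + x2, data = d_sim))
coef_tab <- unname(coef(fit_sum))  # columns: estimate, SE, t value, p-value

# One TRUE/FALSE flag per quantity
checks <- c(
  coefficients = isTRUE(all.equal(b, coef_tab[, 1])),
  std_errors   = isTRUE(all.equal(unname(se_b), coef_tab[, 2])),
  t_values     = isTRUE(all.equal(unname(t_b), coef_tab[, 3])),
  p_values     = isTRUE(all.equal(unname(p_b), coef_tab[, 4])),
  sigma        = isTRUE(all.equal(sqrt(s2), fit_sum$sigma)),
  r_squared    = isTRUE(all.equal(R2, fit_sum$r.squared)),
  adj_r_sq     = isTRUE(all.equal(adj_R2, fit_sum$adj.r.squared)),
  f_statistic  = isTRUE(all.equal(F_stat, unname(fit_sum$fstatistic["value"])))
)
checks
```

`all.equal` compares with a small numerical tolerance, which is appropriate here because `lm` uses a QR decomposition internally and so agrees with the normal-equations approach only up to floating-point error.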
