# Ordinary Least Squares

## Theory

The linear regression model with $K$ regressors can be written as follows:

$$
y_t = \beta_1 x_{t,1} + \beta_2 x_{t,2} + \dots + \beta_K x_{t,K} + u_t
$$
Setting $x_{t,1} = 1$ for all $t$ makes $\beta_1$ the intercept. In vector form:
$$
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{u}
$$

where $\mathbf{y}$ and $\mathbf{u}$ are $T \times 1$ vectors containing the dependent variable and the errors for the $T$ observations, $\mathbf{X}$ is a $T \times K$ matrix of independent variables, and $\boldsymbol{\beta}$ is the $K \times 1$ vector of coefficients.

$$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1K} \\
x_{21} & x_{22} & \cdots & x_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
x_{T1} & x_{T2} & \cdots & x_{TK}
\end{bmatrix}, \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}.
$$


The OLS estimate minimizes the residual sum of squares (RSS), where $x_t'$ denotes the $t$-th row of $\mathbf{X}$:
$$
RSS = \sum_{t=1}^T (y_t - x_t'\beta)^2
$$

Hence, assuming that $(X'X)$ is non-singular (i.e. invertible), the OLS estimate is given by
$$
b = (X'X)^{-1}X'y
$$
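
The step from the RSS to this formula is the first-order condition of the minimization. A brief sketch in the notation above, assuming $X'X$ is invertible as stated:

$$
RSS(\beta) = (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta
$$

$$
\frac{\partial RSS}{\partial \beta} = -2X'y + 2X'X\beta = 0 \quad \Longrightarrow \quad X'Xb = X'y
$$

Premultiplying both sides of the last equation by $(X'X)^{-1}$ gives the estimate $b$.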

From here, we can obtain our predicted values for the dependent variable $\hat{y}$ as
$$
\hat{y} = Xb
$$

and the OLS sample residuals
$$
\hat{u} = y - \hat{y} = y - Xb 
$$

To get the standard error of the estimate $b_j$, we use the formula
$$
\hat{\sigma}_{b_j} = \hat{\sigma} \sqrt{[(X'X)^{-1}]_{jj}}
$$
where $jj$ indicates the $j^{th}$ diagonal element of the matrix and $\hat{\sigma}$ is the square root of the estimated error variance given below. The full matrix $\hat{\sigma}^2 (X'X)^{-1}$ is the estimated variance-covariance matrix of $b$.

The error variance is estimated from the residuals, with a degrees-of-freedom correction for the $K$ estimated coefficients:
$$
\hat{\sigma}^2 = \frac{1}{T-K} \hat{u}'\hat{u}
$$

Using those estimated parameters, we can perform a simple test for $H_0: \beta_j = 0$ using the t-statistic
$$
t = \frac{b_j}{\hat{\sigma}_{b_j}}
$$
and calculate the p-value using the t-distribution with $T-K$ degrees of freedom.
Remember that if the p-value is smaller than the significance level, we reject the null hypothesis.
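
For the two-sided alternative $H_1: \beta_j \neq 0$, the p-value can be written as follows, where $F_{T-K}$ denotes the CDF of the t-distribution with $T-K$ degrees of freedom (this symbol is introduced here for illustration):

$$
p = 2\left(1 - F_{T-K}(|t|)\right)
$$

The factor 2 reflects that both large positive and large negative values of $t$ count as evidence against $H_0$.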

## Coding OLS using Numpy

In [None]:
import numpy as np

When coding the OLS estimation, the key formula you want to implement is 
$$
b = (X'X)^{-1}X'y
$$

To do this, remember the following methods from Numpy:
- `np.linalg.inv()` to calculate the inverse of a matrix
- `np.dot()` or `@` to perform matrix multiplication
- `np.transpose()` or `.T` to transpose a matrix
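
A quick illustration of the three operations listed above, on a small matrix with made-up values (the names `A` and `v` are placeholders, not part of the exercise):

```python
import numpy as np

# Small illustrative matrix and vector (hypothetical values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

A_T = A.T                  # transpose, equivalent to np.transpose(A)
A_inv = np.linalg.inv(A)   # matrix inverse
Av = A @ v                 # matrix-vector product, equivalent to np.dot(A, v)

# Multiplying a matrix by its inverse recovers the identity (up to rounding)
print(np.allclose(A @ A_inv, np.eye(2)))
print(Av)
```

Note that `*` on NumPy arrays multiplies element by element, so `@` (or `np.dot`) is the operator needed for the OLS formula.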


In [None]:
# Let's first generate some data to run the regression on
x = np.arange(1000, dtype=float) # Generate 1000 data points from 0 to 999

# Specify the relationship between x and y
y = 2 * x + 3

# Add some noise to the data
np.random.seed(0) # Always use the same seed for reproducibility
noise = np.random.normal(0, 100, x.shape)
y += noise

# Add a column of ones to x to account for the intercept in the linear regression
X = np.column_stack((np.ones(x.shape[0]), x))

Code the OLS estimator using the generated data from the previous cell:

In [None]:
# Calculate the coefficients
b_hat = 
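
One possible way to fill in the cell above, as a minimal sketch. It recreates the simulated data so that it runs on its own. The `np.linalg.solve` variant is an addition of this sketch, not something the exercise asks for: it solves the normal equations $X'Xb = X'y$ without forming the inverse explicitly.

```python
import numpy as np

# Recreate the simulated data from the previous cell so the sketch is self-contained
x = np.arange(1000, dtype=float)
np.random.seed(0)
y = 2 * x + 3 + np.random.normal(0, 100, x.shape)
X = np.column_stack((np.ones(x.shape[0]), x))

# Textbook formula b = (X'X)^{-1} X'y
b_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Alternative: solve the normal equations X'X b = X'y directly
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

print("b_hat:", b_hat)
print("Same result from both approaches:", np.allclose(b_hat, b_solve))
```

The slope estimate should land close to the true value of 2, while the intercept is estimated much less precisely given the noise level.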

Now test the statistical significance of the coefficients using the t-statistic. 

In [None]:
# Calculate the residuals
u_hat = 

# Calculate the variance of the residuals

sigma2_hat = 

# Calculate the t-statistics for the coefficients
t_stat = 
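
A possible sketch for the three blanks above, following the formulas from the theory section. The names `cov_b` and `se_b` are hypothetical helpers introduced here, and the data are recreated so the block runs on its own.

```python
import numpy as np

# Recreate the simulated data and the coefficient estimates
x = np.arange(1000, dtype=float)
np.random.seed(0)
y = 2 * x + 3 + np.random.normal(0, 100, x.shape)
X = np.column_stack((np.ones(x.shape[0]), x))
T, K = X.shape  # number of observations and number of regressors
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals: u_hat = y - X b
u_hat = y - X @ b_hat

# Estimated error variance with T - K degrees of freedom
sigma2_hat = (u_hat @ u_hat) / (T - K)

# Variance-covariance matrix of b and the standard errors on its diagonal
cov_b = sigma2_hat * np.linalg.inv(X.T @ X)
se_b = np.sqrt(np.diag(cov_b))

# t-statistics for H0: beta_j = 0
t_stat = b_hat / se_b

print("sigma2_hat:", sigma2_hat)
print("t-statistics:", t_stat)
```

Since the noise was drawn with a standard deviation of 100, `sigma2_hat` should come out in the neighbourhood of $100^2$.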


To calculate the p-value, you can use the `scipy.stats` library to extract values from the CDF of the t-distribution.

In [None]:
import scipy.stats as stats

# Number of observations T and number of regressors K
T, K = X.shape

# Calculate the p-values for the t-statistics
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stat), df=T - K)) # Two-tailed test

print("p-values:", p_values)

## Writing a function to calculate OLS estimates


Write a function `OLS(X, y)` that takes the independent variables and the dependent variable and returns the OLS estimates, standard errors, t-statistics, and p-values for each coefficient. The function should compute the variance-covariance matrix of the estimates along the way, since the standard errors are the square roots of its diagonal elements.

In [None]:
def OLS(X,y):

    return coefficients, standard_errors, t_stats, p_values
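
A minimal sketch of what such a function could look like. It is deliberately named `ols_sketch` (a hypothetical name) so that it does not overwrite your own `OLS`, and unlike the template above it returns the variance-covariance matrix as a fifth value. The demo data are synthetic and only serve as a sanity check.

```python
import numpy as np
from scipy import stats


def ols_sketch(X, y):
    """Hypothetical helper: OLS estimates with classical standard errors."""
    T, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coefficients = XtX_inv @ X.T @ y

    # Residuals and estimated error variance (T - K degrees of freedom)
    residuals = y - X @ coefficients
    sigma2_hat = (residuals @ residuals) / (T - K)

    # Variance-covariance matrix; standard errors are the roots of its diagonal
    cov_matrix = sigma2_hat * XtX_inv
    standard_errors = np.sqrt(np.diag(cov_matrix))

    # Two-sided t-test of H0: beta_j = 0 (sf is the survival function, 1 - CDF)
    t_stats = coefficients / standard_errors
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=T - K)
    return coefficients, standard_errors, t_stats, p_values, cov_matrix


# Sanity check on synthetic data (made-up true coefficients 1.0 and 0.5)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
X_demo = np.column_stack((np.ones(200), x_demo))
y_demo = 1.0 + 0.5 * x_demo + rng.normal(size=200)

coef, se, t_vals, p_vals, cov = ols_sketch(X_demo, y_demo)
print("Coefficients:", coef)
print("Standard errors:", se)
```

Using `stats.t.sf` instead of `1 - stats.t.cdf` gives the same p-value in exact arithmetic but keeps more precision when the p-value is very small.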

In [None]:
coefficients, standard_errors, t_stats, p_values = OLS(X,y)
print("Coefficients from OLS function:", coefficients)
print("t-statistics from OLS function:", t_stats)
print("p-values from OLS function:", p_values)

Compare your results with the `statsmodels` library: fit the same regression with `statsmodels.api` and check the output against your own implementation.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit the model using statsmodels
model = sm.OLS(y, X).fit() # Fit the model
print(model.summary()) # Print the summary of the regression results

## Plot fitted values and residuals
Illustrate the results using a plot. You can use the `matplotlib` library to create a scatter plot of the data points and the fitted regression line. Add labels and a title to the plot.

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(X[:,1], y, 'bo')  # Plot the data points
plt.axline(xy1=(0, coefficients[0]), slope=coefficients[1], color='r') # Plot the regression line
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()

# Application
Look at the effect of a change in the US Federal Funds Rate on Output.

To run the code below, you need to install the `pandas` and `pandas_datareader` libraries.


In [None]:
import pandas as pd
import pandas_datareader # has to be installed as pandas-datareader (conda install -c conda-forge pandas-datareader)
import datetime

In [None]:
# download real GDP and the federal funds rate from FRED
indicator_list = ['GDPC1', 'FEDFUNDS']

# specify start and end dates
start = datetime.datetime(1950,1,1)
end = datetime.datetime(2023,1,1)

# download the data
df = pandas_datareader.data.DataReader(indicator_list, 'fred', start, end)
df = df.rename(columns={'GDPC1': 'GDP', 'FEDFUNDS': 'FFR'}) # rename columns
df = df.dropna() # drop missing values (GDPC1 is quarterly, FEDFUNDS is monthly, so only the quarter-start months remain)

# calculate the growth rates
df['GDP_growth'] = df['GDP'].pct_change(periods=1) * 100 # percentage change in GDP
df['FFR_growth'] = df['FFR'].diff() # change in the FFR in percentage points (the FFR is already quoted in percent, so we take the difference rather than a percentage change)

# add date column with float values
df['Date'] = df.index.year.astype(int) + (df.index.quarter.astype(int) - 1) / 4 # convert quarter to date
# add columns with ones
df['const'] = 1 # add a column of ones for the intercept

df = df.dropna() # drop missing values
df.head()
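
To see what the transformations in the cell above do, here is a small sketch on a toy quarterly frame. The numbers are made up and `toy` is a hypothetical name; no data are downloaded.

```python
import pandas as pd

# Toy quarterly data (hypothetical values, not FRED data)
idx = pd.date_range("2000-01-01", periods=3, freq="QS")
toy = pd.DataFrame({"GDP": [100.0, 102.0, 101.0],
                    "FFR": [5.0, 5.25, 5.0]}, index=idx)

# Percentage change for a level series, simple difference for a rate in percent
toy["GDP_growth"] = toy["GDP"].pct_change(periods=1) * 100
toy["FFR_growth"] = toy["FFR"].diff()

# Year plus quarter offset: Q1 -> .00, Q2 -> .25, Q3 -> .50, Q4 -> .75
toy["Date"] = toy.index.year + (toy.index.quarter - 1) / 4

# The first row has no previous quarter, so it is dropped
toy = toy.dropna()
print(toy)
```

Both transformations lose the first observation, which is why the cell above calls `dropna()` a second time.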

In [None]:
# extract numpy matrix
X = df[['const', 'FFR_growth']].to_numpy()
y = df['GDP_growth'].to_numpy()

Run the OLS and evaluate:

In [None]:
# Run OLS

In [None]:
plt.figure()
plt.plot(X[:,1], y, 'bo')  # Plot the data points
plt.axline(xy1=(0, coefficients[0]), slope=coefficients[1], color='r') # Plot the regression line
plt.xlabel('Change in FFR (percentage points)')
plt.ylabel('GDP Growth (%)')
plt.title('Effect of FFR on GDP growth')
plt.show()

In [None]:
# Potentially relevant for your thesis: The extreme developments during Covid can strongly influence the regression results.
selected_GDP = df[(df['GDP_growth'] > 5) | (df['GDP_growth'] < -5)] 
print("Extreme GDP growth values:\n", selected_GDP)

In [None]:
# Good practice: plot the series
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
axes[0].plot(df.index, df['GDP_growth'], label='GDP Growth', color='blue')
axes[0].set_title('GDP Growth Rate')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Growth Rate (%)')

axes[1].plot(df.index, df['FFR_growth'], label='FFR Change', color='red')
axes[1].set_title('Change in the Federal Funds Rate')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Change (percentage points)')

plt.tight_layout()
plt.show()

In [None]:
# Different sample: 1990 to 2018

dates = df.Date.to_numpy()
start = np.where(dates == 1990.00)[0][0]
end = np.where(dates == 2018.00)[0][0]

assert dates.shape[0] == X.shape[0], "Dates and X do not have the same number of rows"

X_short = 
y_short = 
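
A sketch of how the subsample could be cut out using the `start` and `end` indices. The arrays below are hypothetical stand-ins for the notebook's `dates`, `X`, and `y`, so that the block runs without the downloaded data. Whether 2018Q1 itself should be part of the sample is a choice the exercise leaves open.

```python
import numpy as np

# Hypothetical stand-ins: quarterly dates as floats plus random regressors
dates = np.arange(1985.0, 2023.0, 0.25)  # 1985.00, 1985.25, ..., 2022.75
rng = np.random.default_rng(0)
X = np.column_stack((np.ones(dates.shape[0]), rng.normal(size=dates.shape[0])))
y = rng.normal(size=dates.shape[0])

# Positions of the first and last quarter of the shorter sample
start = np.where(dates == 1990.00)[0][0]
end = np.where(dates == 2018.00)[0][0]

# Python slices exclude the end index, so this keeps 1990Q1 up to 2017Q4;
# use end + 1 instead to include 2018Q1
X_short = X[start:end]
y_short = y[start:end]

print(X_short.shape, dates[start], dates[end - 1])
```

The exact comparison `dates == 1990.00` works here because the quarter offsets .00, .25, .50, and .75 are represented exactly as floats.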


In [None]:
# Run OLS

In [None]:
plt.figure()
plt.plot(X_short[:,1], y_short, 'bo')  # Plot the data points
plt.axline(xy1=(0, coefficients[0]), slope=coefficients[1], color='r') # Plot the regression line
plt.xlabel('Change in FFR (percentage points)')
plt.ylabel('GDP Growth (%)')
plt.title('Effect of FFR on GDP growth, 1990 to 2018')
plt.show()

Does this effect make sense intuitively? Why or why not?