In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.set_printoptions(suppress=True)

## Exercise: OLS Regression with numpy

<p>In this exercise, we will compute the parameters (weights) of an OLS regression using only numpy (and our linear algebra knowledge).</p>
<p>The forward equation for OLS regression is given by:</p>

$$\hat{y} = \mathbf{X}\beta$$

where $\mathbf{X}$ is an $N{\times}M$ matrix of covariates, and $\beta$ is a vector of parameters estimated from the data. In OLS regression, we seek to minimize the MSE (mean squared error) criterion:

$$MSE(\beta) = \frac{1}N\sum_{i=1}^N(y_i - \hat{y_i})^2$$

<p>which can be written in matrix form as</p>

$$MSE(\beta) = \frac{1}N(y - \mathbf{X}\beta)^T(y - \mathbf{X}\beta)$$

Setting the gradient of the MSE criterion to zero and solving for $\beta$, we obtain the following solution, which minimizes the MSE:<br>

$$\beta^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty$$

See this excellent post for a full derivation: https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
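
For orientation, the key steps of that derivation can be sketched as follows. This is a condensed outline rather than a full proof; it assumes that $\mathbf{X}^T\mathbf{X}$ is invertible and ignores any constant factor in front of the criterion, since such a factor does not change the minimizer:

$$\nabla_\beta\,(y - \mathbf{X}\beta)^T(y - \mathbf{X}\beta) = -2\mathbf{X}^Ty + 2\mathbf{X}^T\mathbf{X}\beta = 0 \quad\Longrightarrow\quad \mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^Ty \quad\Longrightarrow\quad \beta^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty$$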

Thus, in order to complete this exercise, you need to follow these steps:
1. Generate X (covariates) and y (outcome) by calling the `simulate_ols_data()` function
2. Complete the `ordinary_least_squares(X, y)` function by computing the solution to the normal equation
3. Compare your best parameters with the parameters returned by the numpy function `numpy.linalg.lstsq(X, y)`
4. Compute RMSE on the training data by completing the function `rmse(y, y_hat)`. The RMSE is given by:
<br>
<br>

$$RMSE= \sqrt{\frac{1}N\sum_{i=1}^N(y_i - \hat{y_i})^2}$$
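
The steps above can be sketched on a small synthetic problem. This is only one possible approach, not the reference solution: the helper names `ols_normal_equation_sketch` and `rmse_sketch` and the `*_demo` variables are hypothetical and chosen so they do not clash with the exercise stubs. The sketch solves the linear system $\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^Ty$ with `np.linalg.solve` instead of forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np


def ols_normal_equation_sketch(X, y):
    """Hypothetical helper: solve the normal equation X^T X beta = X^T y."""
    # Solving the linear system avoids computing an explicit matrix inverse
    return np.linalg.solve(X.T @ X, X.T @ y)


def rmse_sketch(y, y_hat):
    """Hypothetical helper: root mean squared error between y and y_hat."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


# Small synthetic data set with an intercept column (assumed sizes)
rng = np.random.default_rng(0)
N, M = 200, 3
X_demo = np.c_[np.ones((N, 1)), rng.standard_normal((N, M))]
beta_demo = rng.standard_normal(M + 1)
y_demo = X_demo @ beta_demo + rng.standard_normal(N)

# Normal-equation solution versus numpy's least-squares solver
beta_ne = ols_normal_equation_sketch(X_demo, y_demo)
beta_ls = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print('Solutions agree:', np.allclose(beta_ne, beta_ls))
print('RMSE train:', rmse_sketch(y_demo, X_demo @ beta_ne))
```

Because the noise in the simulation has unit variance, the training RMSE should land close to 1 for reasonably large samples.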

In [4]:
def simulate_ols_data(N=100, M=3):
    """Function to simulate synthetic regression data."""
    
    # Generate design matrix
    X = np.random.randn(N, M)

    # Add intercept to model
    X = np.c_[np.ones(shape=(N, 1)), X]

    # Generate true beta ~ N(0, 1)
    beta_true = np.random.randn(M + 1)

    # Compute noisy linear function
    y = X @ beta_true + np.random.randn(N)

    return X, y, beta_true

In [1]:
def ordinary_least_squares(X, y):
    """Function to compute the parameters of an OLS regression."""
    
    # YOUR CODE HERE
    pass

In [2]:
def rmse(y, y_hat):
    """Function to compute the RMSE between predictions and outcomes."""
    
    # YOUR CODE HERE
    pass

In [5]:
# 1. Simulate regression data
X, y, beta_true = simulate_ols_data(N=10000)

# 2. Obtain best parameters
beta_hat = ordinary_least_squares(X, y)

# 3. Obtain best parameters using numpy.linalg.lstsq
beta_hat_np = np.linalg.lstsq(X, y, rcond=None)[0]

# 4. Compute RMSE on training data
y_hat = X @ beta_hat_np

rmse_train = rmse(y, y_hat)


In [None]:
print('True betas: ', beta_true)
print('OLS betas: ', beta_hat)
print('Numpy LSTSQ betas: ', beta_hat_np)
print('RMSE train: ', rmse_train)

# Final exercise: PCA

![PCA meme](https://media.makeameme.org/created/brace-yourself-pca.jpg)

<p>PCA learns a representation of multidimensional data which has lower dimensionality than the original data.</p>
Furthermore, it is an orthogonal linear projection of the data, meaning that it decorrelates the dimensions of the learned representation.

Consider a data matrix $\mathbf{X}$ with $m$ observations (rows). Suppose further that the data has been standardized (centered and scaled). The unbiased estimate of the covariance matrix is given by:

$$\widehat{Cov}[\mathbf{X}] = \frac{1}{m - 1}\mathbf{X}^T\mathbf{X}$$

A popular variant of PCA performs an eigendecomposition of $\widehat{Cov}[\mathbf{X}]$ defined by:

$$\widehat{Cov}[\mathbf{X}] = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$$

where $\mathbf{V}$ is the matrix of <em>eigenvectors</em> of $\widehat{Cov}[\mathbf{X}]$ and $\mathbf{\Lambda}$ is a diagonal matrix of <em>eigenvalues</em>.

(Note that the decomposition is sometimes performed on the scatter matrix $\mathbf{X}^T\mathbf{X}$ instead. This does not change the eigenvectors; it only scales the eigenvalues by a factor of $m - 1$.)
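
The relationship between the scatter matrix and the covariance matrix can be checked numerically. The following is a minimal sketch on synthetic data; the variable names and sizes are assumptions made for illustration only. It verifies that the eigenvalues of the scatter matrix equal $(m - 1)$ times those of the covariance matrix, and that the covariance eigenvectors are also eigenvectors of the scatter matrix.

```python
import numpy as np

# Synthetic correlated data (assumed sizes), then standardized
rng = np.random.default_rng(1)
m, d = 500, 4
Z = rng.standard_normal((m, d)) @ rng.standard_normal((d, d))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)

scatter = Z.T @ Z
cov_demo = scatter / (m - 1)

# eigh is intended for symmetric matrices and returns ascending eigenvalues
evals_cov, evecs_cov = np.linalg.eigh(cov_demo)
evals_sc, _ = np.linalg.eigh(scatter)

# Eigenvalues differ only by the factor (m - 1)
print(np.allclose(evals_sc, (m - 1) * evals_cov))

# Covariance eigenvectors satisfy the eigen-equation of the scatter matrix
print(np.allclose(scatter @ evecs_cov, evecs_cov * ((m - 1) * evals_cov)))
```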

In other words, we decompose the covariance matrix into a scale part (eigenvalues) and a direction part (eigenvectors). The eigenvectors are the principal components of $\widehat{Cov}[\mathbf{X}]$.

Eigenvectors determine orthogonal directions in feature space. Each entry of an eigenvector encodes how much the corresponding original dimension contributes to that direction. When the eigenvectors are ordered by decreasing eigenvalue, the first eigenvector (first PC) encodes the direction of greatest variance, and so on.
Eigenvalues represent how much of the original variance is captured (explained) in the direction of the corresponding eigenvector.
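
The procedure described above can be sketched end to end on synthetic data. This is an illustrative outline under stated assumptions, not the solution for the `extraversion_big5.csv` data: all names and sizes are hypothetical, and `np.linalg.eigh` is used because the covariance matrix is symmetric. Since `eigh` returns eigenvalues in ascending order, the sketch reorders them so that the first component captures the most variance.

```python
import numpy as np

# Synthetic correlated data (assumed sizes)
rng = np.random.default_rng(42)
m, d = 300, 5
raw = rng.standard_normal((m, d)) @ rng.standard_normal((d, d))

# Standardize: center each column and scale to unit sample variance
Xs = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Unbiased covariance estimate of the standardized data
cov_s = Xs.T @ Xs / (m - 1)

# Eigendecomposition, reordered by decreasing eigenvalue
evals, evecs = np.linalg.eigh(cov_s)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Proportion of variance explained by each principal component
explained = evals / evals.sum()

# Project the data; the projected dimensions are decorrelated
scores = Xs @ evecs
scores_cov = scores.T @ scores / (m - 1)

print('Reconstruction ok:', np.allclose(evecs @ np.diag(evals) @ evecs.T, cov_s))
print('Scores decorrelated:', np.allclose(scores_cov, np.diag(evals)))
print('Explained variance:', explained)
```

The covariance of the projected data is diagonal, with the eigenvalues on the diagonal, which is what "decorrelating the dimensions" means in practice.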

Read in the data `extraversion_big5.csv` and complete the code below.

In [None]:
import pandas as pd

df = pd.read_csv('./data/extraversion_big5.csv', sep=';', header=0)

In [None]:
# Standardize data
### Your code here

# Compute covariance matrix
cov = None

# Perform an eigendecomposition
### Your code here
lambd, V = None, None

# Check reconstruction of the covariance matrix
np.allclose(V @ np.diag(lambd) @ V.T, cov) 

In [None]:
# Check results
f, ax = plt.subplots(figsize=(10, 6))
ax.plot(np.arange(1, lambd.shape[0]+1), lambd / np.sum(lambd), '-o', color='#aa0000')
ax.set_xlim([1 - 0.1, 10 + 0.1])
ax.set_xlabel('Eigenvalue index')
ax.set_ylabel('Proportion of explained variance')
ax.set_title('Eigenvalues plot')
ax.grid(alpha=0.3)
sns.despine(ax=ax)