# OLS Regression via Matrix Algebra

**CMVP Capstone — Statistics Foundations**

This notebook replicates the *Least Squares Matrix Formula* spreadsheet.
You will:
1. Build the X′X and X′Y matrices from raw data
2. Solve β = (X′X)⁻¹ X′Y by hand (Cramer's rule for 2×2)
3. Compute R², standard errors, and t-statistics
4. Cross-check the result against NumPy

---

## 1. Setup

Run this cell first. It clones the repo so that the `least_squares_matrix` import works in Colab.

In [None]:
import os

if not os.path.exists('statssheets'):
    !git clone https://github.com/jskromer/statssheets.git

import sys
sys.path.insert(0, 'statssheets/scripts')

from least_squares_matrix import ols_matrix, print_matrix

## 2. Input data

The default dataset from the spreadsheet: 5 observations of (x, y).

In [None]:
x = [0.5, 4, 6, 8, 10]
y = [6, 7, 7, 8, 7]

print(f"{'Obs':<5} {'x':<10} {'y':<10} {'x·y':<12} {'x²':<10}")
print("-" * 47)
for i, (xi, yi) in enumerate(zip(x, y), 1):
    print(f"{i:<5} {xi:<10.2f} {yi:<10.2f} {xi*yi:<12.2f} {xi**2:<10.2f}")

## 3. Solve OLS via matrix algebra

In [None]:
result = ols_matrix(x, y)

### Matrix construction

$$X'X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \quad
X'Y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
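
As a minimal sketch of where these entries come from, the sums can be accumulated directly from the raw data in plain Python. This is independent of `ols_matrix` (whose internals are not shown in this notebook), and the variable names are illustrative only.

```python
# Raw data from section 2
x = [0.5, 4, 6, 8, 10]
y = [6, 7, 7, 8, 7]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)               # sum of x_i squared
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of x_i * y_i

# Assemble the 2x2 and 2x1 matrices exactly as in the formula above
xtx_sketch = [[n, sum_x],
              [sum_x, sum_xx]]
xty_sketch = [sum_y, sum_xy]

print("X'X =", xtx_sketch)
print("X'Y =", xty_sketch)
```

The first column of X is all ones (the intercept), which is why the top-left entry of X′X is simply n.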

In [None]:
print_matrix("X'X", result['xtx'])
print()
print_matrix("X'Y", result['xty'])
print(f"\ndet(X'X) = {result['det']:.4f}")
print()
print_matrix("(X'X)⁻¹", result['xtx_inv'])

### Solution

$$\hat{\beta} = (X'X)^{-1} X'Y$$
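
For the five observations in section 2, the formula can be worked through by hand. The inverse below uses the standard closed form for a 2×2 matrix (swap the diagonal, negate the off-diagonal, divide by the determinant):

$$X'X = \begin{bmatrix} 5 & 28.5 \\ 28.5 & 216.25 \end{bmatrix}, \quad
X'Y = \begin{bmatrix} 35 \\ 207 \end{bmatrix}, \quad
\det(X'X) = 5(216.25) - 28.5^2 = 269$$

$$\hat{\beta} = \frac{1}{269}\begin{bmatrix} 216.25 & -28.5 \\ -28.5 & 5 \end{bmatrix}
\begin{bmatrix} 35 \\ 207 \end{bmatrix}
= \frac{1}{269}\begin{bmatrix} 1669.25 \\ 37.5 \end{bmatrix}
\approx \begin{bmatrix} 6.2054 \\ 0.1394 \end{bmatrix}$$

The first entry is the intercept b₀ and the second is the slope b₁; the next cell should print the same values.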

In [None]:
print(f"b₀ (intercept) = {result['b0']:.4f}")
print(f"b₁ (slope)     = {result['b1']:.4f}")
print(f"\nEquation: ŷ = {result['b0']:.4f} + {result['b1']:.4f} · x")

## 4. Predictions and residuals

In [None]:
print(f"{'Obs':<5} {'x':<8} {'y':<8} {'ŷ':<10} {'residual':<10}")
print("-" * 41)
for i, (xi, yi, yh, r) in enumerate(zip(x, y, result['y_hat'], result['residuals']), 1):
    print(f"{i:<5} {xi:<8.2f} {yi:<8.2f} {yh:<10.4f} {r:<10.4f}")

## 5. Goodness of fit

In [None]:
print(f"SS_regression = {result['ss_reg']:.4f}")
print(f"SS_residual   = {result['ss_res']:.4f}")
print(f"SS_total      = {result['ss_tot']:.4f}")
print(f"R²            = {result['r_squared']:.4f}")
print(f"MSE           = {result['mse']:.4f}")
print()
print(f"SE(b₀) = {result['se_b0']:.4f}    t(b₀) = {result['t_b0']:.4f}")
print(f"SE(b₁) = {result['se_b1']:.4f}    t(b₁) = {result['t_b1']:.4f}")
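
The statistics above can be reproduced from first principles. The sketch below assumes the usual simple-regression definitions: MSE = SS_residual / (n − 2), and SE(bⱼ) as the square root of MSE times the j-th diagonal entry of (X′X)⁻¹. The notebook does not show how `ols_matrix` computes them, so treat that as an assumption; all names here are illustrative.

```python
import math

x = [0.5, 4, 6, 8, 10]
y = [6, 7, 7, 8, 7]
n = len(x)

# Centered sums give the same coefficients as the matrix solution
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sums of squares and R²
y_hat = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
ss_reg = ss_tot - ss_res
r_squared = ss_reg / ss_tot

# Assumed: two estimated coefficients, so n - 2 degrees of freedom
mse = ss_res / (n - 2)

# Diagonal of (X'X)^-1 for the 2x2 case: sum(x²)/(n·Sxx) and 1/Sxx
se_b0 = math.sqrt(mse * sum(xi ** 2 for xi in x) / (n * sxx))
se_b1 = math.sqrt(mse / sxx)
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

print(f"R² = {r_squared:.4f}, MSE = {mse:.4f}")
print(f"SE(b0) = {se_b0:.4f}, t(b0) = {t_b0:.4f}")
print(f"SE(b1) = {se_b1:.4f}, t(b1) = {t_b1:.4f}")
```

If the printed values differ from the cell above, the most likely cause is a different degrees-of-freedom convention for MSE.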

## 6. Cross-check with NumPy

In [None]:
import numpy as np

A = np.column_stack([np.ones(len(x)), x])
betas, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

print(f"numpy lstsq:  b₀ = {betas[0]:.4f},  b₁ = {betas[1]:.4f}")
print(f"Our solution: b₀ = {result['b0']:.4f},  b₁ = {result['b1']:.4f}")
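# Compare both coefficients, not just the intercept
match = abs(betas[0] - result['b0']) < 1e-8 and abs(betas[1] - result['b1']) < 1e-8
print(f"\nMatch: {'YES' if match else 'NO'}")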

## 7. Exercises

**Try these:**
1. Use energy consumption vs. outdoor air temperature from the capstone dataset. Does R² improve?
2. What does an R² of 0.52 tell you about this small dataset?
3. Add a second independent variable (e.g., occupancy). How would the matrix algebra change?
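
As a starting point for exercise 3, here is a rough sketch of how the same formula extends to two regressors: X gains a column, X′X becomes 3×3, and β has three entries. The `x2_demo` values are invented purely for illustration and are not from the capstone dataset. The sketch uses NumPy's `np.linalg.solve` on the normal equations rather than a hand-written inverse.

```python
import numpy as np

x1_demo = np.array([0.5, 4, 6, 8, 10])
x2_demo = np.array([3.0, 1.0, 4.0, 2.0, 5.0])   # hypothetical second variable (made up)
y_demo = np.array([6, 7, 7, 8, 7], dtype=float)

# Design matrix: intercept column plus one column per regressor
X_demo = np.column_stack([np.ones(len(y_demo)), x1_demo, x2_demo])

xtx_demo = X_demo.T @ X_demo      # now 3x3
xty_demo = X_demo.T @ y_demo      # now length 3

# Solve (X'X) beta = X'Y without forming the inverse explicitly
beta_demo = np.linalg.solve(xtx_demo, xty_demo)

# Sanity check against least squares on the same design matrix
beta_check, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print("beta:", beta_demo)
print("agrees with lstsq:", np.allclose(beta_demo, beta_check))
```

The formula itself is unchanged; only the dimensions grow, which is why a closed-form 2×2 inverse no longer suffices.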

---
*CMVP Capstone · Counterfactual Designs*