# Lesson 0A: Linear Regression Theory

<a name="introduction"></a>
## Introduction

Linear regression is the foundation of machine learning - the algorithm you should learn first, before logistic regression, decision trees, or neural networks.

Think about predicting house prices. You know intuitively that bigger houses cost more. If a 1,000 sq ft house costs $200,000, a 2,000 sq ft house probably costs around $400,000. You're drawing a mental straight line through the data points.

That's linear regression - finding the best straight line (or hyperplane in higher dimensions) that predicts an output from inputs. It's simple, interpretable, and forms the basis for understanding more complex algorithms.

In this lesson, we'll:

1. Understand what linear regression is and when to use it
2. Learn the mathematical foundations (least squares, gradients)
3. Implement simple and multiple linear regression from scratch
4. Explore the closed-form solution (Normal Equation)
5. Implement gradient descent optimization
6. Apply it to real housing price prediction

Then in Lesson 0B, we'll:

1. Use Scikit-learn and PyTorch for production implementations
2. Handle polynomial features and feature engineering
3. Add regularization (Ridge, Lasso) to prevent overfitting

## Table of Contents

1. [Introduction](#introduction)
2. [Required libraries](#required-libraries)
3. [What is linear regression?](#what-is-linear-regression)
4. [Simple linear regression](#simple-linear-regression)
   - [The equation](#the-equation)
   - [Finding the best line](#finding-the-best-line)
   - [Worked example](#worked-example)
5. [Multiple linear regression](#multiple-linear-regression)
6. [The cost function](#the-cost-function)
7. [Optimization methods](#optimization-methods)
   - [Normal Equation (closed-form)](#normal-equation)
   - [Gradient descent](#gradient-descent)
8. [Implementation from scratch](#implementation-from-scratch)
9. [California housing dataset](#california-housing-dataset)
10. [Model evaluation](#model-evaluation)
11. [Assumptions of linear regression](#assumptions)
12. [Conclusion](#conclusion)

<a name="required-libraries"></a>
## Required libraries

<table style="margin-left:0">
<tr><th align="left">Library</th><th align="left">Purpose</th></tr>
<tr><td>NumPy</td><td>Numerical computing</td></tr>
<tr><td>Pandas</td><td>Data manipulation</td></tr>
<tr><td>Matplotlib/Seaborn</td><td>Visualization</td></tr>
<tr><td>Scikit-learn</td><td>Datasets and metrics</td></tr>
</table>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from typing import Tuple
from numpy.typing import NDArray

np.random.seed(42)
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ Libraries loaded!")

<a name="what-is-linear-regression"></a>
## What is linear regression?

Linear regression models the relationship between:

- **Independent variables** (features, predictors): X
- **Dependent variable** (target, outcome): y

Using a **linear function**:

### Simple (1 feature): $y = mx + b$

### Multiple (n features): $y = w_1x_1 + w_2x_2 + ... + w_nx_n + b$

Or in matrix form: $y = Xw + b$

**Goal:** Find weights (w) and bias (b) that minimize prediction error.
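To make the matrix form concrete, here is a minimal sketch showing that $Xw + b$ gives the same predictions as writing out $w_1x_1 + w_2x_2 + b$ row by row. The toy data, weights, and bias are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

# Hypothetical toy data: 3 houses, 2 features (square feet, bedrooms)
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0]])
w = np.array([0.25, 10.0])  # illustrative weights, not fitted values
b = 5.0                     # illustrative bias

# Matrix form: a single matrix-vector product yields every prediction at once
y_matrix = X @ w + b

# Explicit form: w_1*x_1 + w_2*x_2 + b, computed one row at a time
y_loop = np.array([w[0] * row[0] + w[1] * row[1] + b for row in X])

print(y_matrix)                       # [275. 410. 545.]
print(np.allclose(y_matrix, y_loop))  # True
```

The matrix form is the one used in the implementations later in this lesson, since it avoids Python-level loops over samples.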

<a name="simple-linear-regression"></a>
## Simple linear regression

Let's start with one feature: predicting house price from square footage.

**Example data:**

- 600 sq ft → $150k
- 1000 sq ft → $250k
- 1400 sq ft → $350k
- 1800 sq ft → $450k

**Find:** Best line y = mx + b

In [None]:
# Example data
sqft = np.array([600, 1000, 1400, 1800])
price = np.array([150, 250, 350, 450])  # in thousands

# Calculate best fit line (using numpy for now)
m, b = np.polyfit(sqft, price, 1)

print(f"Best fit line: price = {m:.2f} * sqft + {b:.2f}")
print(f"Interpretation: Each sq ft adds ${m:.2f}k to price, base price is ${b:.2f}k")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(sqft, price, s=100, alpha=0.7, label='Actual data')
plt.plot(sqft, m * sqft + b, 'r-', linewidth=2, label=f'y = {m:.2f}x + {b:.2f}')
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('Simple Linear Regression: House Price vs Size', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

<a name="the-cost-function"></a>
## The cost function

How do we measure how "good" a line is?

**Mean Squared Error (MSE):**

### $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:

- $y_i$ is the actual value
- $\hat{y}_i$ is the predicted value
- We square the errors to penalize large mistakes more

**Goal:** Minimize MSE by finding optimal w and b

In [None]:
def compute_mse(y_true: NDArray, y_pred: NDArray) -> float:
    """Compute Mean Squared Error."""
    return np.mean((y_true - y_pred) ** 2)


def compute_rmse(y_true: NDArray, y_pred: NDArray) -> float:
    """Compute Root Mean Squared Error."""
    return np.sqrt(compute_mse(y_true, y_pred))


def compute_r2(y_true: NDArray, y_pred: NDArray) -> float:
    """Compute R² score (coefficient of determination)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)


# Test with our example
y_pred = m * sqft + b
print(f"MSE:  {compute_mse(price, y_pred):.2f}")
print(f"RMSE: {compute_rmse(price, y_pred):.2f} (in $1000s)")
print(f"R²:   {compute_r2(price, y_pred):.4f} (1.0 = perfect fit)")
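To see why MSE works as a measure of how "good" a line is, the sketch below scores three candidate lines on the example data. The candidate slopes and the local `mse` helper are hypothetical, chosen only for illustration. The line that passes through every point scores 0, and the lines that are too flat or too steep score higher.

```python
import numpy as np

sqft = np.array([600, 1000, 1400, 1800])
price = np.array([150, 250, 350, 450])  # in thousands


def mse(y_true, y_pred):
    """Mean squared error, same formula as in the lesson."""
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical candidate lines as (slope m, intercept b), for illustration only
candidates = {
    "too flat": (0.20, 0.0),
    "exact fit": (0.25, 0.0),
    "too steep": (0.30, 0.0),
}

# Score each candidate line by the MSE of its predictions
scores = {name: mse(price, m * sqft + b) for name, (m, b) in candidates.items()}

for name, score in scores.items():
    print(f"{name:>10}: MSE = {score:.1f}")
```

Both off-target slopes miss by the same amount in opposite directions, so they receive the same MSE (about 4100): squaring discards the sign of each error.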

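<a name="optimization-methods"></a>
## Optimization methods

Two ways to find optimal weights:

<a name="normal-equation"></a>
### 1. Normal Equation (closed-form solution)

For linear regression, there's a direct mathematical formula:

### $w = (X^TX)^{-1}X^Ty$

Here $X$ includes a leading column of ones, so the bias $b$ is the first entry of $w$.

**Pros:** Exact solution, no iterations

**Cons:** Slow when there are many features: inverting the $d \times d$ matrix $X^TX$ costs roughly O(d³) in the number of features $d$. It also requires $X^TX$ to be invertible.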
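A brief sketch of where the formula comes from, treating the bias as part of $w$ via a column of ones in $X$ and assuming $X^TX$ is invertible. Write the cost in vector form, take the gradient with respect to $w$, and set it to zero:

$$MSE(w) = \frac{1}{n}\lVert y - Xw \rVert^2$$

$$\nabla_w MSE = -\frac{2}{n}X^T(y - Xw) = 0 \;\Longrightarrow\; X^TXw = X^Ty \;\Longrightarrow\; w = (X^TX)^{-1}X^Ty$$

Because MSE is a convex quadratic in $w$, this stationary point is the global minimum.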

In [None]:
class LinearRegressionNormal:
    """Linear Regression using Normal Equation."""

    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X: NDArray, y: NDArray):
        """Fit using normal equation."""
        # Add bias term (column of 1s)
        X_b = np.c_[np.ones((X.shape[0], 1)), X]

        # Normal equation
        theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

        self.bias = theta[0]
        self.weights = theta[1:]

    def predict(self, X: NDArray) -> NDArray:
        """Make predictions."""
        return X @ self.weights + self.bias


# Test on simple example
model_normal = LinearRegressionNormal()
model_normal.fit(sqft.reshape(-1, 1), price)

print(f"Weights: {model_normal.weights[0]:.2f}")
print(f"Bias: {model_normal.bias:.2f}")
print("✅ Normal equation implementation complete!")
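The class above forms the inverse of $X^TX$ explicitly, which mirrors the formula but can be numerically fragile when features are strongly correlated. As an alternative (not the method used in this lesson), the sketch below solves the same least-squares problem with `np.linalg.lstsq`, which avoids the explicit inverse, and checks that both approaches agree on the example data.

```python
import numpy as np

sqft = np.array([600.0, 1000.0, 1400.0, 1800.0])
price = np.array([150.0, 250.0, 350.0, 450.0])  # in thousands

# Same design matrix as the class above: a column of ones, then the feature
X_b = np.c_[np.ones(len(sqft)), sqft]

# Explicit inverse, exactly as written in the normal equation
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ price

# Least-squares solver: same minimizer, without forming an explicit inverse
theta_lstsq, *_ = np.linalg.lstsq(X_b, price, rcond=None)

print("inv:  ", theta_inv)
print("lstsq:", theta_lstsq)
print("agree:", np.allclose(theta_inv, theta_lstsq, atol=1e-6))
```

On this small, well-conditioned problem the two give the same bias and slope up to rounding; the difference only starts to matter when $X^TX$ is close to singular.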

<a name="gradient-descent"></a>
### 2. Gradient Descent

Iteratively improve the weights and bias by stepping against the gradient:

### $w := w - \alpha \frac{\partial MSE}{\partial w}$

### $b := b - \alpha \frac{\partial MSE}{\partial b}$

Where α is the learning rate.

**Gradients:**

- $\frac{\partial MSE}{\partial w} = -\frac{2}{n}X^T(y - \hat{y})$
- $\frac{\partial MSE}{\partial b} = -\frac{2}{n}\sum(y - \hat{y})$
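One way to build confidence in the gradient formulas is to compare them against finite differences. The sketch below does this on random synthetic data; the data, the step size `eps`, and the local `mse` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # synthetic features
y = rng.normal(size=50)       # synthetic targets
w = rng.normal(size=3)        # arbitrary point at which to check the gradient
b = 0.5


def mse(w, b):
    """MSE of the linear model X @ w + b on the synthetic data."""
    return np.mean((y - (X @ w + b)) ** 2)


# Analytic gradients, using the formulas from this section
n = len(y)
residual = y - (X @ w + b)
dw = -(2 / n) * (X.T @ residual)
db = -(2 / n) * np.sum(residual)

# Numerical gradients via central differences
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    dw_num[j] = (mse(w + step, b) - mse(w - step, b)) / (2 * eps)
db_num = (mse(w, b + eps) - mse(w, b - eps)) / (2 * eps)

print("max |dw - dw_num|:", np.max(np.abs(dw - dw_num)))
print("    |db - db_num|:", abs(db - db_num))
```

If the analytic and numerical values disagree by more than rounding noise, the usual culprit is a sign error or a missing factor of $2/n$.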

In [None]:
class LinearRegressionGD:
    """Linear Regression using Gradient Descent."""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iters = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X: NDArray, y: NDArray):
        """Fit using gradient descent."""
        n_samples, n_features = X.shape

        # Initialize
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        # Gradient descent
        for i in range(self.n_iters):
            # Predictions
            y_pred = X @ self.weights + self.bias

            # Compute gradients
            dw = -(2 / n_samples) * (X.T @ (y - y_pred))
            db = -(2 / n_samples) * np.sum(y - y_pred)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Track loss every 100 iterations
            if i % 100 == 0:
                loss = compute_mse(y, y_pred)
                self.losses.append(loss)

    def predict(self, X: NDArray) -> NDArray:
        return X @ self.weights + self.bias


# Test gradient descent.
# Standardize the feature first: on raw square footage (values in the
# hundreds to thousands) a learning rate of 0.0001 overshoots and diverges.
sqft_scaled = ((sqft - sqft.mean()) / sqft.std()).reshape(-1, 1)

model_gd = LinearRegressionGD(learning_rate=0.01, n_iterations=1000)
model_gd.fit(sqft_scaled, price)

# Convert the parameters back to the original units (per sq ft)
w_orig = model_gd.weights[0] / sqft.std()
b_orig = model_gd.bias - w_orig * sqft.mean()
print(f"Weights: {w_orig:.2f}")
print(f"Bias: {b_orig:.2f}")

# Plot loss curve
plt.figure(figsize=(10, 5))
plt.plot(model_gd.losses, linewidth=2)
plt.xlabel('Iteration (×100)', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Gradient Descent Convergence', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print("✅ Gradient descent implementation complete!")
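Gradient descent is sensitive to how the learning rate interacts with the scale of the features. The sketch below uses a hypothetical single-feature helper, `run_gd`, which applies the same update rule as the class above. It compares a short run on raw square footage with a run on the standardized feature. The specific learning rates and iteration counts are illustrative assumptions.

```python
import numpy as np

sqft = np.array([600.0, 1000.0, 1400.0, 1800.0])
price = np.array([150.0, 250.0, 350.0, 450.0])  # in thousands


def run_gd(x, y, lr, n_iters):
    """Single-feature gradient descent on MSE; returns (w, b, final MSE)."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(n_iters):
        residual = y - (w * x + b)
        w -= lr * (-(2 / n) * np.sum(x * residual))
        b -= lr * (-(2 / n) * np.sum(residual))
    return w, b, float(np.mean((y - (w * x + b)) ** 2))


# Raw feature: the step is too large for values this big, so the loss grows
_, _, loss_raw = run_gd(sqft, price, lr=1e-4, n_iters=20)

# Standardized feature (mean 0, std 1): the same update rule converges
x_scaled = (sqft - sqft.mean()) / sqft.std()
w_s, b_s, loss_scaled = run_gd(x_scaled, price, lr=0.01, n_iters=1000)

print(f"Raw feature, lr=1e-4:     final MSE = {loss_raw:.3e}")
print(f"Scaled feature, lr=0.01:  final MSE = {loss_scaled:.3e}")
print(f"Slope in original units:  {w_s / sqft.std():.4f}")
```

On the raw feature each step multiplies the error instead of shrinking it, because the curvature of the loss scales with the squared magnitude of the feature. Standardizing brings that curvature close to 1, so ordinary learning rates work.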

<a name="california-housing-dataset"></a>
## California housing dataset

Now let's apply our implementation to real data with multiple features!

In [None]:
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

print(f"Dataset shape: {X.shape}")
print(f"Features: {housing.feature_names}")
print("\nTarget: Median house value in units of $100k")
# The target is stored in units of $100,000, so multiply by 100 to report $1000s
print(f"Target range: ${y.min() * 100:.0f}k - ${y.max() * 100:.0f}k")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features (important for gradient descent!)
# Use training-set statistics for both splits so no test information leaks in
X_mean = X_train.mean(axis=0)
X_std = X_train.std(axis=0)
X_train_norm = (X_train - X_mean) / X_std
X_test_norm = (X_test - X_mean) / X_std

print(f"\nTraining samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
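The normalization step above computes the mean and standard deviation on the training split only and reuses them for the test split. A minimal sketch of what that does, using synthetic features on very different scales (the data and the 800/200 split are hypothetical, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features on very different scales
X = np.column_stack([
    rng.normal(4.0, 2.0, 1000),        # small-scale feature
    rng.normal(1500.0, 1000.0, 1000),  # large-scale feature
])
X_train, X_test = X[:800], X[800:]

# Statistics come from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # reuse the training statistics

print("train mean:", X_train_norm.mean(axis=0).round(3))
print("train std: ", X_train_norm.std(axis=0).round(3))
print("test mean: ", X_test_norm.mean(axis=0).round(3))
print("test std:  ", X_test_norm.std(axis=0).round(3))
```

The training features end up with mean 0 and standard deviation 1 by construction. The test features are only approximately standardized, which is expected: the model should see test data transformed in the same way as the data it was trained on.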

In [None]:
# Train both models
print("Training models...\n")

# Normal equation
model_normal = LinearRegressionNormal()
model_normal.fit(X_train_norm, y_train)
y_pred_normal = model_normal.predict(X_test_norm)

# Gradient descent
model_gd = LinearRegressionGD(learning_rate=0.01, n_iterations=2000)
model_gd.fit(X_train_norm, y_train)
y_pred_gd = model_gd.predict(X_test_norm)

# Evaluate
print("Normal Equation:")
print(f"  MSE:  {compute_mse(y_test, y_pred_normal):.4f}")
print(f"  RMSE: {compute_rmse(y_test, y_pred_normal):.4f}")
print(f"  R²:   {compute_r2(y_test, y_pred_normal):.4f}")

print("\nGradient Descent:")
print(f"  MSE:  {compute_mse(y_test, y_pred_gd):.4f}")
print(f"  RMSE: {compute_rmse(y_test, y_pred_gd):.4f}")
print(f"  R²:   {compute_r2(y_test, y_pred_gd):.4f}")

print("\n✅ Both methods produce similar results!")

<a name="conclusion"></a>
## Conclusion

**What we learned:**

1. Linear regression finds the best linear relationship between features and target
2. MSE measures prediction quality
3. Normal Equation: Direct solution, fast for small datasets
4. Gradient Descent: Iterative solution, scales to large datasets
5. Feature normalization is crucial for gradient descent

**When to use linear regression:**

- ✅ Relationship is approximately linear
- ✅ You need interpretable coefficients
- ✅ Fast predictions required
- ❌ Complex non-linear patterns (use trees, neural networks)

**Next: Lesson 0B** - Production implementations with Scikit-learn, polynomial features, and regularization!