# 📊 Least Squares Regression with NumPy

**Prerequisites**: Read `linear_algebra_theory.md` for background on least squares and the pseudoinverse.

This notebook implements linear regression from scratch with NumPy's least squares solver and compares the result with scikit-learn's implementation.

## Theory Background

**Linear Regression Problem**: Given a dataset of $n$ observations $(x_i, y_i)$, each with $p$ features, find the parameter vector $\beta$ such that:
$$y \approx X\beta$$

**Design Matrix** (the leading column of ones carries the intercept term):
$$X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p}
\end{bmatrix}$$

**Least Squares Solution**: 
$$\hat{\beta} = (X^T X)^{-1} X^T y = X^+ y$$

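where $X^+$ is the Moore-Penrose pseudoinverse. The closed form $(X^T X)^{-1} X^T y$ requires $X$ to have full column rank, so that $X^T X$ is invertible. The expression $X^+ y$ is defined for any $X$ and gives the minimum-norm least squares solution when the rank is deficient.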
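
As a quick numerical check of the identity above, the following sketch compares the normal-equation solution with the pseudoinverse solution. The data is random and purely illustrative; the names `beta_normal` and `beta_pinv` are chosen here and are not part of the exercises.

```python
import numpy as np

# Illustrative random data (an assumption, not one of the exercise datasets).
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p features
y = rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse solution: beta = X^+ y.
beta_pinv = np.linalg.pinv(X) @ y

agree = np.allclose(beta_normal, beta_pinv)
print("solutions agree:", agree)
```

The two agree here because a random Gaussian design matrix has full column rank; with linearly dependent columns the `np.linalg.solve` call would fail or become numerically unreliable.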

## Setup and Imports

In [None]:
# TODO: Import necessary libraries

## Exercise 1: Load and Explore Dataset

Choose one of the suggested datasets and explore its structure.

In [None]:
# TODO: Load dataset and explore its structure

## Exercise 2: Data Preparation

Prepare the data for least squares regression by creating the design matrix and splitting the data.

In [None]:
# TODO: Split data and create design matrix with intercept column
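
One possible way to approach this step, sketched on synthetic data rather than a real dataset. The 80/20 split ratio and the helper name `add_intercept` are assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for a real dataset (assumption): 100 samples, 2 features.
rng = np.random.default_rng(42)
n_samples, n_features = 100, 2
features = rng.normal(size=(n_samples, n_features))
target = 1.0 + features @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=n_samples)

# Shuffle the row indices, then take the first 80% for training.
indices = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)
train_idx, test_idx = indices[:n_train], indices[n_train:]


def add_intercept(feature_matrix):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    return np.column_stack([np.ones(len(feature_matrix)), feature_matrix])


X_train = add_intercept(features[train_idx])
X_test = add_intercept(features[test_idx])
y_train, y_test = target[train_idx], target[test_idx]

print(X_train.shape, X_test.shape)
```

Splitting before building the design matrix keeps the test rows out of any later preprocessing that is fitted on the training data.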

## Exercise 3: NumPy Least Squares Implementation

Solve the least squares problem using NumPy's `np.linalg.lstsq` function.

In [None]:
# TODO: Solve least squares using np.linalg.lstsq
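
A minimal sketch of the `np.linalg.lstsq` call, assuming synthetic data with known coefficients so the estimate can be checked by eye. The coefficient values in `true_beta` are arbitrary.

```python
import numpy as np

# Synthetic data (an assumption for illustration): intercept plus 3 features.
rng = np.random.default_rng(0)
n_samples = 200
features = rng.normal(size=(n_samples, 3))
true_beta = np.array([2.0, 1.5, -0.5, 3.0])  # intercept first

X = np.column_stack([np.ones(n_samples), features])
y = X @ true_beta + rng.normal(scale=0.1, size=n_samples)

# lstsq returns the solution, the residual sum of squares, the rank of X,
# and the singular values of X. rcond=None uses a machine-precision cutoff.
beta_hat, residual_ss, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("estimated beta:", beta_hat)
print("rank of X:", rank)
```

The returned `rank` is worth inspecting: a value below the number of columns signals linearly dependent features.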

## Exercise 4: Alternative Implementation Using Pseudoinverse

Implement least squares using the pseudoinverse formula: $\hat{\beta} = X^+ y$

In [None]:
# TODO: Implement using pseudoinverse formula
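
The sketch below shows two routes to $X^+ y$: calling `np.linalg.pinv` directly, and building the pseudoinverse from the thin SVD. It assumes a full-column-rank design matrix built from random data; the SVD route as written inverts every singular value, which is only valid in that case.

```python
import numpy as np

# Illustrative full-rank data (assumption).
rng = np.random.default_rng(1)
n_samples = 100
X = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n_samples)

# Route 1: library pseudoinverse.
beta_pinv = np.linalg.pinv(X) @ y

# Route 2: thin SVD, X = U diag(s) V^T, so X^+ = V diag(1/s) U^T.
# Inverting all singular values assumes none of them is (near) zero.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv_svd = Vt.T @ np.diag(1.0 / s) @ U.T
beta_svd = X_pinv_svd @ y

# Reference solution from the least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_pinv, beta_svd), np.allclose(beta_pinv, beta_lstsq))
```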

## Exercise 5: Scikit-learn Comparison

Compare your NumPy implementation with scikit-learn's `LinearRegression`.

In [None]:
# TODO: Fit scikit-learn LinearRegression and compare results
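
A hedged sketch of the comparison, again on synthetic data. Note the difference in inputs: `LinearRegression` fits its own intercept by default, so it receives the raw features, while the NumPy solver receives the design matrix with the column of ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration).
rng = np.random.default_rng(2)
n_samples = 150
features = rng.normal(size=(n_samples, 3))
y = 4.0 + features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=n_samples)

# NumPy: the intercept is handled by an explicit column of ones.
X = np.column_stack([np.ones(n_samples), features])
beta_numpy, *_ = np.linalg.lstsq(X, y, rcond=None)

# scikit-learn: fit_intercept=True is the default, so pass the raw features.
model = LinearRegression()
model.fit(features, y)
beta_sklearn = np.concatenate([[model.intercept_], model.coef_])

max_abs_diff = np.max(np.abs(beta_numpy - beta_sklearn))
print("largest coefficient difference:", max_abs_diff)
```

Any difference should be at the level of floating-point round-off, since both approaches minimize the same objective.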

## Exercise 6: Make Predictions

Use the fitted models to make predictions on the test set.

In [None]:
# TODO: Make predictions on test set
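
Prediction is a single matrix-vector product, $\hat{y} = X_{\text{test}} \hat{\beta}$. The sketch below assumes a simple first-80/last-20 split of synthetic data; the variable names are illustrative.

```python
import numpy as np

# Synthetic data (assumption): intercept plus 2 features.
rng = np.random.default_rng(7)
n_samples = 100
features = rng.normal(size=(n_samples, 2))
X = np.column_stack([np.ones(n_samples), features])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n_samples)

# Rows are already in random order, so a contiguous split is sufficient here.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit on the training rows only, then predict the held-out rows.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
y_pred = X_test @ beta_hat

print("first five predictions:", y_pred[:5])
print("first five targets:    ", y_test[:5])
```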

## Exercise 7: Model Evaluation

Compute evaluation metrics to assess model performance.

In [None]:
# TODO: Compute evaluation metrics (RMSE, R² score, residuals)
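
The metrics named in the TODO can be written directly from their definitions: $\text{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and $R^2 = 1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$. The helper names `rmse` and `r_squared` and the four-point example are assumptions for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Tiny hand-checkable example (illustrative values).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred
print("residuals:", residuals)
print("RMSE:", rmse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```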

## Exercise 8: Visualization

Create visualizations to understand the model performance.

In [None]:
# TODO: Create visualization plots (predicted vs actual, residuals, distribution)
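
A possible layout for the three plots, assuming Matplotlib is available. The arrays `y_test` and `y_pred` below are synthetic stand-ins; in the notebook they would come from the earlier exercises.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in targets and predictions (assumption for illustration).
rng = np.random.default_rng(3)
y_test = rng.normal(loc=10.0, scale=2.0, size=100)
y_pred = y_test + rng.normal(scale=0.5, size=100)
residuals = y_test - y_pred

fig, (ax_fit, ax_res, ax_hist) = plt.subplots(1, 3, figsize=(15, 4))

# Predicted vs actual: points on the dashed diagonal are perfect predictions.
ax_fit.scatter(y_test, y_pred, alpha=0.6)
limits = [y_test.min(), y_test.max()]
ax_fit.plot(limits, limits, "r--")
ax_fit.set_xlabel("Actual")
ax_fit.set_ylabel("Predicted")
ax_fit.set_title("Predicted vs actual")

# Residuals vs predicted: visible structure here suggests a missing term.
ax_res.scatter(y_pred, residuals, alpha=0.6)
ax_res.axhline(0.0, color="r", linestyle="--")
ax_res.set_xlabel("Predicted")
ax_res.set_ylabel("Residual")
ax_res.set_title("Residuals")

# Residual distribution: roughly symmetric around zero for a well-specified model.
ax_hist.hist(residuals, bins=20)
ax_hist.set_xlabel("Residual")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Residual distribution")

fig.tight_layout()
```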

## Exercise 9: Feature Analysis (Optional)

If using multiple features, analyze the importance and impact of each feature.

In [None]:
# TODO: Analyze feature coefficients and importance (if multiple features)
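
Raw coefficients depend on the units of each feature, so they are not directly comparable. One common heuristic, sketched here under the assumption of synthetic features with very different scales, is to multiply each coefficient by its feature's standard deviation. The feature names are hypothetical.

```python
import numpy as np

# Synthetic features with deliberately different scales (assumption).
rng = np.random.default_rng(4)
n_samples = 500
feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
scales = np.array([1.0, 10.0, 0.1])
features = rng.normal(size=(n_samples, 3)) * scales
true_coefs = np.array([2.0, 0.5, 5.0])
y = 1.0 + features @ true_coefs + rng.normal(scale=0.5, size=n_samples)

X = np.column_stack([np.ones(n_samples), features])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

raw_coefs = beta_hat[1:]  # drop the intercept
# Change in y for a one-standard-deviation change in each feature.
standardized_coefs = raw_coefs * features.std(axis=0)

for name, raw, std_coef in zip(feature_names, raw_coefs, standardized_coefs):
    print(f"{name}: raw={raw:.3f}, per-std-dev={std_coef:.3f}")
```

In this example the feature with the largest raw coefficient is not the one with the largest per-standard-deviation effect, which is why scale matters when ranking features.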

## Exercise 10: Discussion and Analysis

Analyze and discuss the results.

### TODO: Answer the following questions:

1. **Method Comparison**: 
   - Do NumPy's `lstsq`, pseudoinverse, and scikit-learn give identical results? Why or why not?

2. **Model Performance**: 
   - What do the R² and RMSE values tell you about the model's performance?
   - Comparing training and test metrics, is the model overfitting or underfitting?

3. **Residual Analysis**: 
   - Are the residuals randomly distributed around zero?
   - Do you see any patterns that suggest model limitations?

4. **Linear Algebra Connection**: 
   - How does the matrix rank affect the solution?
   - What happens if features are linearly dependent?

5. **Deep Learning Relevance**: 
   - How does this least squares approach relate to training linear layers in neural networks?
   - What are the advantages/disadvantages compared to gradient descent?
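
Questions 4 and 5 can be explored numerically. The sketch below is one way to do so, under the assumption of a noise-free synthetic target: part A duplicates a feature column to make the design matrix rank deficient, and part B runs plain batch gradient descent on the mean squared error and compares it with the closed-form solution. The learning rate and step count are arbitrary choices that happen to converge for this data.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 50
x = rng.normal(size=n_samples)
y = 1.0 + 2.0 * x  # noise-free target, so exact solutions are known

# Part A: linearly dependent features (the same column appears twice).
X_dep = np.column_stack([np.ones(n_samples), x, x])
beta_dep, _, rank_dep, _ = np.linalg.lstsq(X_dep, y, rcond=None)
print("rank:", rank_dep, "of", X_dep.shape[1], "columns")
# Infinitely many solutions fit equally well; lstsq returns the minimum-norm
# one, which splits the slope evenly across the duplicated columns.
print("minimum-norm solution:", beta_dep)

# Part B: gradient descent on the full-rank problem.
X_full = np.column_stack([np.ones(n_samples), x])
beta_closed, *_ = np.linalg.lstsq(X_full, y, rcond=None)

beta_gd = np.zeros(X_full.shape[1])
learning_rate = 0.1  # assumption: small enough for this data
for _ in range(2000):
    # Gradient of the mean squared error (1/n) * ||X beta - y||^2.
    gradient = (2.0 / n_samples) * X_full.T @ (X_full @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print("closed form:     ", beta_closed)
print("gradient descent:", beta_gd)
```

The closed-form solver needs one factorization of the full design matrix, while gradient descent needs many cheap passes and a tuned step size; that trade-off is the core of the comparison asked for in question 5.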

In [None]:
# TODO: Write analysis and conclusions