# Linear Regression Tutorial: From Theory to Implementation

Welcome to this hands-on tutorial on Linear Regression! This notebook will guide you through implementing three different approaches to Linear Regression from scratch:

1. **Batch Gradient Descent (BGD)**
2. **Mini-batch Stochastic Gradient Descent (Mini-batch SGD)**
3. **Normal Equation (Closed-form solution)**

## Learning Objectives

By the end of this tutorial, you will:
- Understand how to vectorize data for machine learning
- Understand forward pass, loss computation, gradients, and weight updates
- Implement complete gradient descent algorithms
- Understand the differences between optimization methods

## How to Use This Notebook

Each section contains:
- **Questions** that challenge you to think and implement
- **Starter code** with TODO comments
- **Hints** (expandable sections) to guide you

Try to solve each question before looking at hints!

---
## Part 1: Data Preparation & Vectorization

Understanding how data is organized is crucial for implementing machine learning algorithms efficiently.

### Example Dataset

Here's a small dataset about house prices with 5 samples and 3 features:

In [None]:
import numpy as np
import pandas as pd

# Sample dataset: House prices
data = {
    'size_sqft': [1400, 1600, 1700, 1875, 1100],
    'bedrooms': [3, 3, 2, 4, 2],
    'age_years': [20, 15, 10, 5, 25],
    'price': [245000, 312000, 279000, 308000, 199000]
}

df = pd.DataFrame(data)
print(df)

### Question: How to extract features (X) and target (y) from the dataset?

**Challenge**: Extract the feature matrix X and target vector y from the given dataset.

**Think about**:
- What are we trying to predict? (This becomes y)
- What information do we use to make predictions? (This becomes X)
- What should be the shape of X? (rows vs columns)
- What should be the shape of y?
- What does each row represent?
- What does each column represent?

**Expected output**: 
- X with shape (5, 3) - 5 samples, 3 features
- y with shape (5,) - 5 target values

In [None]:
import numpy as np
import pandas as pd

# Sample dataset
data = {
    'size_sqft': [1400, 1600, 1700, 1875, 1100],
    'bedrooms': [3, 3, 2, 4, 2],
    'age_years': [20, 15, 10, 5, 25],
    'price': [245000, 312000, 279000, 308000, 199000]
}

df = pd.DataFrame(data)

# TODO: Extract features (X) - all columns except 'price'
# X = ?

# TODO: Extract target (y) - the 'price' column
# y = ?

<details>
<summary>💡 Hint 1</summary>

In machine learning:
- **Target (y)**: What we want to predict (in this case: 'price')
- **Features (X)**: Information we use to make predictions (in this case: 'size_sqft', 'bedrooms', 'age_years')

Use `df.drop(columns=['price'])` to get features and `df['price']` to get target.

</details>

<details>
<summary>💡 Hint 2</summary>

Remember to convert to NumPy arrays using `.values` (newer pandas code often uses `.to_numpy()` instead):
```python
X = df.drop(columns=['price']).values
y = df['price'].values
```

</details>
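
Once X and y are extracted, a quick shape check catches most layout mistakes early. The snippet below is a minimal sketch assuming the same house-price data as above; `feature_names` is a name introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'size_sqft': [1400, 1600, 1700, 1875, 1100],
    'bedrooms': [3, 3, 2, 4, 2],
    'age_years': [20, 15, 10, 5, 25],
    'price': [245000, 312000, 279000, 308000, 199000],
})

# Rows are samples (houses); columns are features.
feature_names = [c for c in df.columns if c != 'price']  # illustrative name
X = df[feature_names].to_numpy()
y = df['price'].to_numpy()

print(X.shape)  # (5, 3)
print(y.shape)  # (5,)
print(X[0])     # features of the first house: size, bedrooms, age
```

Each row of `X` lines up with the entry of `y` at the same index, which is what the later matrix operations rely on.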

---

## Understanding Linear Regression Components

Before implementing the complete algorithm, let's understand each component individually.

### Part 1.1: Forward Pass - Making Predictions

The forward pass calculates predictions using the linear equation: $\hat{y} = X \cdot W + b$

Where:
- $X$ = Feature matrix (m × n)
- $W$ = Weight vector (n,)
- $b$ = Bias (scalar)
- $\hat{y}$ = Predictions (m,)

**Question**: Given random weights and bias, compute predictions for our house prices.

**Think about**:
- What operation combines X and weights?
- How is bias added?
- What should be the shape of predictions?

In [None]:
# Assuming X and y from Part 1
# X shape: (5, 3), y shape: (5,)

# Initialize random weights and bias
np.random.seed(42)
weights = np.random.randn(3)  # 3 weights for 3 features
bias = 0.0

print(f"Weights: {weights}")
print(f"Bias: {bias}")

# TODO: Compute predictions using forward pass
# predictions = ?

# TODO: Print predictions and their shape
# print(f"\nPredictions: {predictions}")
# print(f"Predictions shape: {predictions.shape}")

<details>
<summary>💡 Hint</summary>

Use matrix multiplication:
```python
predictions = np.dot(X, weights) + bias
# or equivalently: predictions = X @ weights + bias
```

</details>
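
To see what the matrix product is doing, it can help to compare it with an explicit per-sample loop. The following is only a sketch: the seed, the bias value, and the name `predictions_loop` are assumptions made for the illustration.

```python
import numpy as np

X = np.array([[1400, 3, 20],
              [1600, 3, 15],
              [1700, 2, 10],
              [1875, 4, 5],
              [1100, 2, 25]], dtype=float)

rng = np.random.default_rng(0)  # assumption: any seed works here
weights = rng.standard_normal(3)
bias = 0.5

# Vectorized forward pass: one matrix-vector product covers all samples.
predictions = X @ weights + bias

# The same computation written one sample at a time.
predictions_loop = np.array([
    sum(x_j * w_j for x_j, w_j in zip(row, weights)) + bias
    for row in X
])

print(predictions.shape)                           # (5,)
print(np.allclose(predictions, predictions_loop))  # True
```

The scalar bias is broadcast across all five predictions, so no explicit loop or reshaping is needed.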

### Part 1.2: Loss Function - Measuring Error

The loss function measures how wrong our predictions are. We use Mean Squared Error (MSE):

$$\text{Loss} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Where:
- $\hat{y}$ = Predictions
- $y$ = Actual values
- $m$ = Number of samples

**Question**: Compute the MSE loss for our predictions.

**Think about**:
- What is the error (residual)?
- Why square the error?
- Why take the mean?

In [None]:
# Using predictions from previous step

# TODO: Compute error (predictions - actual)
# error = ?

# TODO: Compute MSE loss
# loss = ?

# TODO: Print error and loss
# print(f"Error for each sample: {error}")
# print(f"\nMSE Loss: {loss}")

<details>
<summary>💡 Hint</summary>

```python
error = predictions - y
loss = np.mean(error**2)
```

</details>
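
A tiny worked example makes the formula concrete. The numbers below are made up for illustration and are unrelated to the house-price data.

```python
import numpy as np

y_hat = np.array([2.0, 4.0, 6.0])   # made-up predictions
y_true = np.array([1.0, 4.0, 8.0])  # made-up targets

error = y_hat - y_true      # residuals: 1, 0, -2
loss = np.mean(error ** 2)  # (1 + 0 + 4) / 3

print(error)
print(loss)
```

Squaring keeps positive and negative residuals from cancelling, and taking the mean makes the loss comparable across datasets of different sizes.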

### Part 1.3: Computing Gradients

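Gradients tell us how to adjust the weights to reduce the loss. They are the partial derivatives of the loss with respect to each parameter, where $\text{error} = \hat{y} - y$:

$$\frac{\partial L}{\partial W} = \frac{1}{m} X^T \cdot \text{error}$$

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum \text{error}$$

Strictly, differentiating the MSE from Part 1.2 gives an extra factor of 2 in both expressions. This tutorial omits it, which is equivalent to using the exact gradient with half the learning rate $\alpha$.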

**Question**: Compute gradients for weights and bias.

**Think about**:
- Why do we need $X^T$ (transpose)?
- Why divide by m?
- What should be the shape of gradients?

In [None]:
# Using X, error from previous steps
m = X.shape[0]  # Number of samples

# TODO: Compute gradient for weights
# dw = ?

# TODO: Compute gradient for bias
# db = ?

# TODO: Print gradients and their shapes
# print(f"Weight gradients: {dw}")
# print(f"Weight gradients shape: {dw.shape}")
# print(f"\nBias gradient: {db}")

<details>
<summary>💡 Hint</summary>

```python
dw = np.dot(X.T, error) / m
db = np.sum(error) / m
```

</details>
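
Gradient formulas are easy to get subtly wrong, so a numerical check is a useful habit. The sketch below compares the expressions from the hint with central finite differences of the MSE on random data. The helper `mse`, the step size `eps`, and the random inputs are assumptions for this illustration. Note that the exact derivative of the MSE from Part 1.2 is twice the hint's expression, so the comparison is against `2 * dw` and `2 * db`.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error of the linear model X @ w + b."""
    error = X @ w + b - y
    return np.mean(error ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)
b = 0.3
m = X.shape[0]

error = X @ w + b - y
dw = X.T @ error / m    # tutorial convention (factor of 2 dropped)
db = np.sum(error) / m

# Central finite differences of the MSE itself.
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (mse(X, y, w + step, b) - mse(X, y, w - step, b)) / (2 * eps)
db_numeric = (mse(X, y, w, b + eps) - mse(X, y, w, b - eps)) / (2 * eps)

print(np.allclose(dw_numeric, 2 * dw, atol=1e-6))  # True
print(np.isclose(db_numeric, 2 * db, atol=1e-6))   # True
```

The shapes also answer one of the questions above: `dw` has one entry per feature, matching `w`, and `db` is a scalar.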

### Part 1.4: Updating Weights

Gradient descent updates parameters by moving in the opposite direction of the gradient:

$$W_{\text{new}} = W_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial W}$$

$$b_{\text{new}} = b_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial b}$$

Where $\alpha$ is the learning rate.

**Question**: Update weights and bias, then check if loss decreased.

**Think about**:
- Why subtract (not add) the gradient?
- What does learning rate control?
- Should loss go down after update?

In [None]:
# Using weights, bias, dw, db from previous steps
alpha = 0.01  # Learning rate

print(f"Before update:")
print(f"Weights: {weights}")
print(f"Bias: {bias}")
print(f"Loss: {loss}")

# TODO: Update weights
# weights_new = ?

# TODO: Update bias
# bias_new = ?

# TODO: Compute new predictions and loss
# predictions_new = ?
# error_new = ?
# loss_new = ?

# TODO: Print results
# print(f"\nAfter update:")
# print(f"Weights: {weights_new}")
# print(f"Bias: {bias_new}")
# print(f"Loss: {loss_new}")
# print(f"\nLoss decreased by: {loss - loss_new}")

<details>
<summary>💡 Hint</summary>

```python
weights_new = weights - alpha * dw
bias_new = bias - alpha * db
```

</details>
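
Whether the loss actually goes down depends on the learning rate relative to the scale of the features. The sketch below takes a single step with `alpha = 0.01` from zero weights, once on the raw house features and once on standardized features. The helper `one_step` and the zero initialization are assumptions made for this illustration, not part of the exercise.

```python
import numpy as np

X = np.array([[1400, 3, 20],
              [1600, 3, 15],
              [1700, 2, 10],
              [1875, 4, 5],
              [1100, 2, 25]], dtype=float)
y = np.array([245000, 312000, 279000, 308000, 199000], dtype=float)

def one_step(X, y, alpha=0.01):
    """Return (loss_before, loss_after) for one gradient step from zero weights."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    error = X @ w + b - y
    loss_before = np.mean(error ** 2)
    w = w - alpha * (X.T @ error) / m
    b = b - alpha * np.sum(error) / m
    loss_after = np.mean((X @ w + b - y) ** 2)
    return loss_before, loss_after

raw_before, raw_after = one_step(X, y)

# Standardize each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_before, scaled_after = one_step(X_scaled, y)

print(raw_after > raw_before)        # True: the step overshoots on raw features
print(scaled_after < scaled_before)  # True: the same alpha now reduces the loss
```

On the raw data, `size_sqft` values in the thousands make the gradient so large that the step overshoots the minimum. If your loss increases in this exercise, that is the likely cause; a much smaller learning rate or scaled features should help.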

---

## Part 2: Batch Gradient Descent Implementation

Now that you understand each component, let's implement the complete Batch Gradient Descent algorithm!

**The Algorithm**:
1. Initialize weights and bias
2. For each iteration:
   - Forward pass: compute predictions
   - Compute error and loss
   - Compute gradients
   - Update weights
3. Repeat until convergence
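
The steps above can be sketched as a plain function before wrapping them in a class. This is one possible shape of the loop, not a reference solution: `batch_gradient_descent` is a hypothetical helper, and the data is synthetic, noise-free, and already standardized so that `alpha=0.1` converges.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Hypothetical helper: full-batch gradient descent for linear regression."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    losses = []
    for _ in range(iterations):
        error = X @ w + b - y            # forward pass and residuals
        losses.append(np.mean(error ** 2))
        w -= alpha * (X.T @ error) / m   # factor of 2 folded into alpha
        b -= alpha * np.sum(error) / m
    return w, b, losses

# Synthetic data with known parameters (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b

w, b, losses = batch_gradient_descent(X, y)
print(np.round(w, 3), round(b, 3))
print(losses[0] > losses[-1])
```

Recording the loss at every iteration is optional, but it makes divergence or a stalled fit easy to spot.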

### Complete Implementation

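Implement the `LinearRegressionSGD` class with all components. Despite the name, this class should perform full-batch updates: every iteration uses all samples.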

In [None]:
class LinearRegressionSGD:
    def __init__(self, alpha=0.1, iterations=1000):
        # TODO: Store hyperparameters and initialize weights/bias
        pass
    
    def fit(self, X, y):
        # TODO: Get dimensions and initialize weights
        # TODO: Loop through iterations
        # TODO: Forward pass, compute error, compute gradients, update weights
        pass
    
    def predict(self, X):
        # TODO: Return predictions
        pass

---

## Part 3: Mini-batch Stochastic Gradient Descent

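Now let's implement a variant that updates the weights on small random subsets (mini-batches) of the data. Each update is cheaper than a full-batch step, at the cost of a noisier gradient estimate.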

In [None]:
class LinearRegressionMiniSGD:
    def __init__(self, alpha=0.1, iterations=1000, batch_size=64):
        # TODO: Initialize hyperparameters
        pass
    
    def fit(self, X, y):
        # TODO: Initialize weights and bias
        # TODO: Loop for iterations (epochs)
        # TODO: Shuffle data
        # TODO: Loop through batches
        # TODO: Forward pass on batch, compute gradients, update weights
        pass
    
    def predict(self, X):
        # TODO: Return predictions
        pass
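
The shuffling and batching steps are where most mistakes happen, so here is a minimal sketch of one way to do them. The generator `iterate_minibatches` is a hypothetical helper, and the synthetic data, `alpha`, epoch count, and batch size are assumptions chosen so the example converges.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Hypothetical helper: yield shuffled (X_batch, y_batch) pairs for one epoch."""
    m = X.shape[0]
    order = rng.permutation(m)  # fresh random order on every call
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.standard_normal((70, 3))
true_w, true_b = np.array([1.0, 2.0, 3.0]), -1.0
y = X @ true_w + true_b  # noise-free synthetic targets

w, b, alpha = np.zeros(3), 0.0, 0.05
for epoch in range(200):
    for X_b, y_b in iterate_minibatches(X, y, batch_size=16, rng=rng):
        error = X_b @ w + b - y_b
        w -= alpha * (X_b.T @ error) / len(y_b)  # divide by the batch size
        b -= alpha * np.sum(error) / len(y_b)

batch_sizes = [len(y_b) for _, y_b in iterate_minibatches(X, y, 16, rng)]
print(batch_sizes)  # [16, 16, 16, 16, 6]
```

Indexing `X` and `y` with the same shuffled indices keeps samples and targets aligned. The last batch may be smaller than `batch_size`, which is why the gradient is divided by `len(y_b)` rather than a fixed number.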

---

## Part 4: Normal Equation (Closed-form Solution)

Instead of iterative optimization, we can solve directly using linear algebra!

In [None]:
class LinearRegressionNormalEqn:
    def __init__(self, fit_intercept=True):
        # TODO: Store parameters
        pass
    
    def fit(self, X, y):
        # TODO: Center data if fit_intercept=True
        # TODO: Use np.linalg.lstsq to solve
        # TODO: Calculate intercept if needed
        pass
    
    def predict(self, X):
        # TODO: Return predictions
        pass
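
The centering approach named in the TODOs can be sketched in a few lines. This is an illustration under stated assumptions rather than the intended solution: the data is synthetic and noise-free, and the feature scales are picked to loosely resemble the house-price columns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features with deliberately different scales (an assumption).
X = rng.standard_normal((50, 3)) * [500.0, 1.0, 8.0] + [1500.0, 3.0, 15.0]
true_w, true_b = np.array([120.0, 5000.0, -800.0]), 30000.0
y = X @ true_w + true_b  # noise-free synthetic targets

# Center features and target so the intercept drops out of the solve.
X_mean, y_mean = X.mean(axis=0), y.mean()
w, *_ = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)

# Recover the intercept from the means.
b = y_mean - X_mean @ w

print(np.allclose(w, true_w), np.isclose(b, true_b))  # True True
```

Unlike gradient descent, this needs no learning rate and works on unscaled features. `np.linalg.lstsq` solves the least-squares problem without explicitly inverting $X^T X$.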

---

## Congratulations! 🎉

You've successfully learned and implemented Linear Regression from scratch!

### What You've Learned

1. ✅ **Data Vectorization**: How to organize data as matrices
2. ✅ **Forward Pass**: Computing predictions using $\hat{y} = XW + b$
3. ✅ **Loss Function**: Measuring prediction error with MSE
4. ✅ **Gradients**: Computing derivatives for optimization
5. ✅ **Weight Updates**: Using gradient descent to minimize loss
6. ✅ **Batch Gradient Descent**: Full implementation
7. ✅ **Mini-batch SGD**: More efficient variant
8. ✅ **Normal Equation**: Direct analytical solution

### Key Takeaways

- **Vectorization** is crucial for efficient implementation
- **Gradient descent** iteratively moves towards the minimum loss
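- **Mini-batch SGD** trades per-update cost against gradient noise: smaller batches are cheaper per step but give noisier updates
- **Normal Equation** gives the exact least-squares solution in one step, but its cost grows quickly with the number of features (roughly cubically), so it scales poorly to high-dimensional data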

### Next Steps

- Test your implementations on real datasets (e.g., sklearn's diabetes dataset)
- Experiment with different learning rates and batch sizes
- Add regularization (Ridge, Lasso)
- Implement feature scaling/normalization
- Visualize the loss curves and the fitted predictions (e.g., predicted vs. actual values)