# Linear Regression Tutorial: From Theory to Implementation

Welcome to this hands-on tutorial on Linear Regression! This notebook will guide you through implementing three different approaches to Linear Regression from scratch:

1. **Batch Gradient Descent (BGD)**
2. **Mini-batch Stochastic Gradient Descent (Mini-batch SGD)**
3. **Normal Equation (Closed-form solution)**

## Learning Objectives

By the end of this tutorial, you will:
- Understand how to vectorize data for machine learning
- Implement forward pass, loss computation, and gradient descent
- Understand the differences between optimization methods
- Compare your implementations with scikit-learn

## How to Use This Notebook

Each section contains:
- **Questions** that challenge you to think and implement
- **Starter code** with TODO comments
- **Hints** (expandable sections) to guide you
- **Solutions** (expandable sections) for reference

Try to solve each question before looking at hints or solutions!

---

## Part 1: Data Preparation & Vectorization

Understanding how data is organized is crucial for implementing machine learning algorithms efficiently.

### Question 1: How to load and inspect the diabetes dataset?

**Challenge**: Load the diabetes dataset from sklearn and understand its structure.

**Think about**:
- What information does the dataset contain?
- How many features and samples are there?
- What are we trying to predict?

**Expected output**: Display basic information about the dataset

In [None]:
import numpy as np
from sklearn.datasets import load_diabetes
import pandas as pd

# TODO: Load the diabetes dataset
# Hint: Use load_diabetes with as_frame=True to get a DataFrame

# TODO: Extract the features (X) and target (y)

# TODO: Print the shape of X and y

# TODO: Display the first few rows of the dataset

<details>
<summary>💡 Hint 1</summary>

Use `load_diabetes(as_frame=True)` to get the data as a pandas DataFrame. The result will have a `.frame` attribute.

</details>

<details>
<summary>💡 Hint 2</summary>

```python
data = load_diabetes(as_frame=True)
df = data.frame
X = df.drop(columns=['target'])
y = df['target']
```

</details>

<details>
<summary>✅ Solution</summary>

```python
import numpy as np
from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
data = load_diabetes(as_frame=True)
df = data.frame

# Extract features and target
X = df.drop(columns=['target']).values
y = df['target'].values

# Print shapes
print(f"X shape: {X.shape}")  # (442, 10)
print(f"y shape: {y.shape}")  # (442,)

# Display first few rows
print("\nFirst 5 rows of the dataset (features and target):")
print(df.head())
```

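**Explanation**: The diabetes dataset has 442 samples with 10 features each. The target is a quantitative measure of disease progression one year after baseline. Calling `.values` converts the DataFrame columns to NumPy arrays, which the rest of the notebook assumes.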

</details>

### Question 2: How to organize data as matrices? (Rows vs Columns)

**Challenge**: Verify that rows represent data points and columns represent features.

**Think about**:
- What does each row represent?
- What does each column represent?
- Why is this convention important for vectorization?

**Expected output**: Print the shape and verify the convention

In [None]:
# TODO: Print X.shape and explain what each dimension means

# TODO: Access the first data point (first row)

# TODO: Access the first feature across all data points (first column)

# TODO: Explain why this convention (rows=samples, columns=features) is standard

<details>
<summary>💡 Hint 1</summary>

In machine learning:
- Shape: (n_samples, n_features) or (m, n)
- Each row: One complete data point with all its features
- Each column: One feature across all data points

</details>

<details>
<summary>✅ Solution</summary>

```python
print(f"X shape: {X.shape}")  # (442, 10)
print(f"Number of samples (m): {X.shape[0]}")
print(f"Number of features (n): {X.shape[1]}")

# First data point (all features for sample 0)
print(f"\nFirst data point: {X[0]}")

# First feature (across all samples)
print(f"\nFirst feature across all samples: {X[:, 0]}")

# Convention explanation
print("""
Convention: Rows = Datapoints, Columns = Features
- This allows matrix multiplication: X @ weights
- Shape (442, 10) @ (10, 1) = (442, 1) predictions
- Each row represents a complete observation
- Each column represents a variable/feature
""")
```
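
To see why the (n_samples, n_features) layout matters, here is a small sketch on a made-up 4×3 array. The values and the names `X_demo` and `w_demo` are illustrative assumptions, not taken from the diabetes data.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 3 features (columns)
X_demo = np.arange(12, dtype=float).reshape(4, 3)
w_demo = np.array([0.5, -1.0, 2.0])   # one weight per feature

first_sample = X_demo[0]       # shape (3,): all features of sample 0
first_feature = X_demo[:, 0]   # shape (4,): feature 0 across all samples

# (4, 3) @ (3,) -> (4,): one prediction per sample
preds = X_demo @ w_demo
print(first_sample.shape, first_feature.shape, preds.shape)

# With the transposed layout the inner dimensions no longer match
try:
    X_demo.T @ w_demo          # (3, 4) @ (3,) is not defined
except ValueError as exc:
    print("Shape mismatch:", exc)
```

Keeping samples in rows means the same `X @ weights` expression works for any number of samples, including a single test point reshaped to `(1, n)`.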

</details>

### Question 3: How to split data into training and testing sets?

**Challenge**: Split the data into training (80%) and testing (20%) sets.

**Think about**:
- Why do we need separate train and test sets?
- What does `random_state` do?

**Expected output**: X_train, X_test, y_train, y_test with appropriate shapes

In [None]:
from sklearn.model_selection import train_test_split

# TODO: Split data into train (80%) and test (20%) sets
# Use random_state=42 for reproducibility

# TODO: Print shapes of all four arrays

<details>
<summary>✅ Solution</summary>

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"X_train shape: {X_train.shape}")  # (353, 10)
print(f"X_test shape: {X_test.shape}")    # (89, 10)
print(f"y_train shape: {y_train.shape}")  # (353,)
print(f"y_test shape: {y_test.shape}")    # (89,)
```

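**Why split?** To evaluate model performance on unseen data. A model that scores well on the training set but poorly on the test set is overfitting; the split does not prevent this, but it lets us detect it.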
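
To illustrate what a seed such as `random_state` buys you, here is a rough NumPy sketch of a shuffle-based split. The helper `shuffle_split` is hypothetical and is not scikit-learn's implementation, so it will not reproduce the exact indices that `train_test_split` picks. It only shows that a fixed seed gives a repeatable split.

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_size))
    return order[n_test:], order[:n_test]

train_a, test_a = shuffle_split(442, seed=42)
train_b, test_b = shuffle_split(442, seed=42)

print(len(train_a), len(test_a))          # sizes of the two index sets
print(np.array_equal(test_a, test_b))     # same seed -> same split
```

Because the indices come from one permutation, no sample can land in both sets.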

</details>

---

## Part 2: Batch Gradient Descent Implementation

Now let's implement Linear Regression using Batch Gradient Descent from scratch!

**Key Equations**:
- Forward pass: $\hat{y} = X \cdot W + b$
- Loss: $J = \frac{1}{2m} \sum (\hat{y} - y)^2$
- Gradients: $dW = \frac{1}{m} X^T \cdot error$, $db = \frac{1}{m} \sum error$
- Update: $W = W - \alpha \cdot dW$, $b = b - \alpha \cdot db$
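
The gradient formulas follow from the loss by the chain rule. A short derivation, writing $x_i$ for the $i$-th row of $X$ (a symbol introduced here only for the derivation) and $error = \hat{y} - y$:

$$
\begin{aligned}
J &= \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2, \qquad \hat{y}_i = x_i \cdot W + b \\
dW = \frac{\partial J}{\partial W} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i^T = \frac{1}{m} X^T \cdot error \\
db = \frac{\partial J}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) = \frac{1}{m} \sum error
\end{aligned}
$$

The factor $\frac{1}{2}$ in $J$ cancels the 2 produced by differentiating the square, which is why the gradients carry $\frac{1}{m}$ rather than $\frac{2}{m}$.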

### Question 4: How to initialize weights and bias?

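**Challenge**: Create a `LinearRegressionSGD` class and initialize its weights and bias. (Despite the name, the class built in this part performs full-batch gradient descent: every update uses all training samples.)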

**Think about**:
- What should be the shape of weights?
- What value should we initialize them to?
- Why is bias a scalar?

**Expected output**: Initialized weights and bias

In [None]:
class LinearRegressionSGD:
    def __init__(self, alpha=0.1, iterations=1000):
        # TODO: Store hyperparameters
        # TODO: Initialize weights and bias to None (will set in fit method)
        pass
    
    def fit(self, X, y):
        # TODO: Get number of instances and features from X.shape
        
        # TODO: Initialize weights as zeros with shape (n_features,)
        
        # TODO: Initialize bias as 0
        
        pass

<details>
<summary>💡 Hint 1</summary>

If X has shape (m, n), weights should have shape (n,) to allow matrix multiplication X @ weights.

</details>

<details>
<summary>✅ Solution</summary>

```python
class LinearRegressionSGD:
    def __init__(self, alpha=0.1, iterations=1000):
        self.alpha = alpha
        self.iterations = iterations
        self.weights = None
        self.bias = None
    
    def fit(self, X, y):
        n_instances, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
```

**Explanation**: 
- Weights have shape (n_features,) - one weight per feature
- Bias is a single scalar value
- Initialize to zeros (safe for linear regression, though random init is used in deep learning)

</details>

### Question 5: How to implement the forward pass?

**Challenge**: Implement the prediction formula: $\hat{y} = X \cdot W + b$

**Think about**:
- What operation combines X and weights?
- How is bias added to all predictions?
- What should be the shape of pred_y?

**Expected output**: Predictions for all training samples

In [None]:
# Add this inside the fit method, after initialization

# for i in range(self.iterations):
#     # TODO: Forward pass - compute predictions
#     # pred_y = ?
#     pass

<details>
<summary>💡 Hint 1</summary>

Use `np.dot(X, self.weights)` or `X @ self.weights` for matrix multiplication.

</details>

<details>
<summary>💡 Hint 2</summary>

```python
pred_y = np.dot(X, self.weights) + self.bias
```

Shape: (m, n) @ (n,) + scalar = (m,)

</details>

<details>
<summary>✅ Solution</summary>

```python
for i in range(self.iterations):
    # Forward pass
    pred_y = np.dot(X, self.weights) + self.bias
```

**Explanation**: This computes predictions for ALL samples simultaneously (vectorized operation).
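
As a sanity check on what "vectorized" means here, the sketch below computes the same predictions with an explicit loop and with a single matrix product. The data and the names `X_demo`, `w_demo`, and `b_demo` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))      # hypothetical data: 5 samples, 3 features
w_demo = np.array([1.0, -2.0, 0.5])
b_demo = 4.0

# Loop version: one dot product per sample
pred_loop = np.empty(X_demo.shape[0])
for i in range(X_demo.shape[0]):
    pred_loop[i] = sum(X_demo[i, j] * w_demo[j] for j in range(3)) + b_demo

# Vectorized version: the scalar bias is broadcast to every row
pred_vec = X_demo @ w_demo + b_demo

print(np.allclose(pred_loop, pred_vec))
```

Both versions give the same numbers; the vectorized one pushes the loop into NumPy's compiled code, which is what makes it fast on larger arrays.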

</details>

### Question 6: How to compute the error vector?

**Challenge**: Calculate the difference between predictions and actual values.

**Think about**:
- What is error?
- What should be the shape of error?
- Why is this called "residual"?

**Expected output**: Error vector with shape (m,)

In [None]:
# Add after forward pass
# TODO: Compute error
# error = ?

<details>
<summary>✅ Solution</summary>

```python
error = pred_y - y
```

**Explanation**: Error (residual) is how much our prediction differs from the true value. Positive error means we over-predicted, negative means under-predicted.

</details>

### Question 7: How to compute the loss (cost function)?

**Challenge**: Calculate Mean Squared Error (MSE).

**Think about**:
- Why square the error?
- Why take the mean?
- How to do this efficiently with numpy?

**Expected output**: A single scalar loss value

In [None]:
# TODO: Compute loss (MSE)
# loss = ?

# TODO: Print loss every 1000 iterations
# if i % 1000 == 0:
#     print(f"loss at {i} iteration is {loss}")

<details>
<summary>💡 Hint 1</summary>

MSE = mean of squared errors. Use `np.mean(error**2)`.

</details>

<details>
<summary>✅ Solution</summary>

```python
loss = np.mean(error**2)

if i % 1000 == 0:
    print(f"loss at {i} iteration is {loss}")
```

**Why MSE?**
- Squaring makes all errors positive
- Larger errors get penalized more (quadratic penalty)
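- Differentiable for gradient computation

**Note**: `np.mean(error**2)` equals $\frac{1}{m} \sum (\hat{y} - y)^2$, which is twice the $J$ in the key equations. The factor $\frac{1}{2}$ only simplifies the derivative; both versions have the same minimizer.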
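
A tiny worked example may help make the quadratic penalty concrete. The residuals below are invented for illustration; the sketch compares plain MSE, the halved loss $J$ from the key equations, and mean absolute error.

```python
import numpy as np

error_demo = np.array([1.0, -2.0, 4.0])    # hypothetical residuals

squared = error_demo ** 2                   # [1, 4, 16]
mse = np.mean(squared)                      # (1 + 4 + 16) / 3 = 7.0
half_mse = np.sum(squared) / (2 * len(error_demo))   # J from the key equations = 3.5
mae = np.mean(np.abs(error_demo))           # (1 + 2 + 4) / 3, for comparison

print(mse, half_mse, mae)
```

Doubling a residual from 2 to 4 quadruples its squared contribution (4 to 16), so the largest error dominates the MSE far more than it dominates the mean absolute error.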

</details>

### Question 8: How to compute gradients?

**Challenge**: Implement gradient computation for weights and bias.

**Think about**:
- Why do we need $X^T$ (transpose)?
- Why divide by m (number of samples)?
- What's the relationship between error and gradients?

**Expected output**: dw and db with correct shapes

In [None]:
# TODO: Compute gradient for weights
# dw = ?

# TODO: Compute gradient for bias
# db = ?

<details>
<summary>💡 Hint 1</summary>

Formulas:
- $dW = \frac{1}{m} X^T \cdot error$
- $db = \frac{1}{m} \sum error$

</details>

<details>
<summary>💡 Hint 2</summary>

```python
dw = np.dot(X.T, error) / n_instances
db = np.sum(error) / n_instances
```

</details>

<details>
<summary>✅ Solution</summary>

```python
# Gradients computation
dw = np.dot(X.T, error) / n_instances
db = np.sum(error) / n_instances
```

**Explanation**:
- `X.T @ error` sums, for each feature, that feature's values weighted by the errors; the result tells each weight which way to move
- Shape: (n, m) @ (m,) = (n,) - one gradient per weight
- Division by m gives average gradient across all samples
- db is simpler: just average of all errors
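
One way to convince yourself that the formulas are right is a finite-difference gradient check. The sketch below is an illustration on random data, with a hypothetical helper `loss_half_mse` that implements $J$ from the key equations; it compares the analytic gradients with numerical ones.

```python
import numpy as np

def loss_half_mse(X, y, w, b):
    """J = 1/(2m) * sum((X @ w + b - y)^2), as in the key equations."""
    error = X @ w + b - y
    return np.sum(error ** 2) / (2 * len(y))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))   # hypothetical data
y_demo = rng.normal(size=20)
w_demo = rng.normal(size=3)
b_demo = 0.3

# Analytic gradients from the formulas above
error = X_demo @ w_demo + b_demo - y_demo
dw = X_demo.T @ error / len(y_demo)
db = np.sum(error) / len(y_demo)

# Central finite differences, one parameter at a time
eps = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = eps
    dw_num[j] = (loss_half_mse(X_demo, y_demo, w_demo + step, b_demo)
                 - loss_half_mse(X_demo, y_demo, w_demo - step, b_demo)) / (2 * eps)
db_num = (loss_half_mse(X_demo, y_demo, w_demo, b_demo + eps)
          - loss_half_mse(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```

If you check against `np.mean(error**2)` instead of the halved loss, the numerical gradients come out twice as large, which matches the missing factor of 2.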

</details>

### Question 9: How to update weights using learning rate?

**Challenge**: Implement the gradient descent update rule.

**Think about**:
- Why subtract (not add) the gradient?
- What does learning rate control?
- What happens if alpha is too large or too small?

**Expected output**: Updated weights and bias

In [None]:
# TODO: Update weights
# self.weights = ?

# TODO: Update bias
# self.bias = ?

<details>
<summary>💡 Hint 1</summary>

Gradient descent moves in the OPPOSITE direction of the gradient (steepest descent).

Update rule: $\theta = \theta - \alpha \cdot d\theta$

</details>

<details>
<summary>✅ Solution</summary>

```python
# Weights update
self.weights = self.weights - self.alpha * dw
self.bias = self.bias - self.alpha * db
```

**Key Points**:
- Negative gradient direction leads to minimum
- Learning rate (alpha) controls step size
- Too large alpha → overshooting, divergence
- Too small alpha → slow convergence
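
The effect of the learning rate is easy to see on a one-weight toy problem. The data and the helper `run_gd` below are assumptions made for illustration. For this particular problem the update is stable only when alpha is below `2 / mean(x**2)`, roughly 0.43.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # hypothetical single feature
y = 3.0 * x                     # true weight is 3, no bias

def run_gd(alpha, steps=50):
    """Gradient descent on J(w) = 1/(2m) * sum((w*x - y)^2), starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        dw = np.mean((w * x - y) * x)
        w = w - alpha * dw
    return w

w_small = run_gd(alpha=0.001)   # still far from 3 after 50 steps
w_good = run_gd(alpha=0.1)      # converges to 3
w_large = run_gd(alpha=0.5)     # overshoots more on every step and diverges

print(w_small, w_good, w_large)
```

The stability threshold depends on the scale of the features, which is one reason feature scaling and learning-rate tuning go together.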

</details>

### Question 10: Complete the predict method

**Challenge**: Implement prediction using learned weights.

**Think about**:
- Is this the same as forward pass?
- Can we use this for test data?

**Expected output**: Predictions for any input X

In [None]:
class LinearRegressionSGD:
    # ... (previous methods)
    
    def predict(self, X):
        # TODO: Return predictions using learned weights and bias
        pass

<details>
<summary>✅ Solution</summary>

```python
def predict(self, X):
    return np.dot(X, self.weights) + self.bias
```

**Complete Class:**

```python
class LinearRegressionSGD:
    def __init__(self, alpha=0.1, iterations=1000):
        self.alpha = alpha
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_instances, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.iterations):
            # Forward pass
            pred_y = np.dot(X, self.weights) + self.bias
            
            # Error
            error = pred_y - y
            loss = np.mean(error**2)
            
            if i % 1000 == 0:
                print(f"loss at {i} iteration is {loss}")
            
            # Gradients computation
            dw = np.dot(X.T, error) / n_instances
            db = np.sum(error) / n_instances
            
            # Weights update
            self.weights = self.weights - self.alpha * dw
            self.bias = self.bias - self.alpha * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
```

</details>

### Question 11: Test your Batch Gradient Descent implementation

**Challenge**: Train the model and make predictions.

**Expected output**: Loss decreasing over iterations, predictions on test set

In [None]:
# TODO: Create an instance of LinearRegressionSGD with alpha=0.1 and iterations=10000

# TODO: Fit the model on training data

# TODO: Make predictions on test data

# TODO: Print first 10 predictions

<details>
<summary>✅ Solution</summary>

```python
model = LinearRegressionSGD(alpha=0.1, iterations=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"\nFirst 10 predictions: {y_pred[:10]}")
```

</details>

---

## Part 3: Mini-batch Stochastic Gradient Descent

Now let's implement a more efficient version using mini-batches!

### Question 12: Why shuffle data? Implement shuffling

**Challenge**: Shuffle the training data before creating batches.

**Think about**:
- Why shuffle before each epoch?
- How to shuffle X and y together (keep pairs intact)?
- What does `np.random.permutation` do?

**Expected output**: Shuffled indices

In [None]:
# TODO: Generate shuffled indices using np.random.permutation
# indices = ?

# TODO: Use indices to shuffle X and y
# X_shuffled = ?
# y_shuffled = ?

<details>
<summary>💡 Hint 1</summary>

```python
indices = np.random.permutation(n_instances)
```

This returns shuffled indices from 0 to n_instances-1.

</details>

<details>
<summary>✅ Solution</summary>

```python
# Shuffle the data
indices = np.random.permutation(n_instances)
X_shuffled = X[indices]
y_shuffled = y[indices]
```

**Why shuffle?**
- Prevents model from learning order-dependent patterns
- Ensures each batch is representative of full dataset
- Improves convergence in SGD/mini-batch

Example: If n_instances=100, indices might be [47, 2, 89, 15, ...] (random order)
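
The key detail is that X and y must be reordered with the same indices. The sketch below uses made-up arrays in which each label is the sum of its row, so a broken pairing is easy to spot. It uses `np.random.default_rng`, a seeded alternative to the global `np.random.permutation` call used in the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = np.arange(10, dtype=float).reshape(5, 2)   # row i is [2i, 2i+1]
y_demo = X_demo.sum(axis=1)                          # label tied to its row

# Correct: one permutation applied to both arrays keeps pairs intact
idx = rng.permutation(len(X_demo))
X_ok, y_ok = X_demo[idx], y_demo[idx]
print(np.array_equal(X_ok.sum(axis=1), y_ok))        # True

# Incorrect: independent permutations will usually break the pairing
X_bad = X_demo[rng.permutation(len(X_demo))]
y_bad = y_demo[rng.permutation(len(y_demo))]
print(np.array_equal(X_bad.sum(axis=1), y_bad))
```

Indexing with a permutation returns copies, so the original `X` and `y` stay in their original order for the next epoch's shuffle.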

</details>

### Question 13: How to create batches?

**Challenge**: Implement batch creation logic.

**Think about**:
- How to split data into chunks of batch_size?
- What happens to the last batch if it's smaller?
- How to loop through batches?

**Expected output**: Batches of size 64 (or smaller for last batch)

In [None]:
# TODO: Loop through batches using range with step=batch_size
# for i in range(?, ?, ?):

# TODO: Extract batch from shuffled data
# X_batch = ?
# y_batch = ?

<details>
<summary>💡 Hint 1</summary>

Use `range(0, n_instances, batch_size)` to get starting indices: 0, 64, 128, 192, ...

</details>

<details>
<summary>💡 Hint 2</summary>

```python
for i in range(0, n_instances, self.batch_size):
    X_batch = X_shuffled[i:i+self.batch_size]
    y_batch = y_shuffled[i:i+self.batch_size]
```

</details>

<details>
<summary>✅ Solution</summary>

```python
for i in range(0, n_instances, self.batch_size):
    X_batch = X_shuffled[i:i+self.batch_size]
    y_batch = y_shuffled[i:i+self.batch_size]
```

**Explanation**:
- If n_instances=353 and batch_size=64:
  - Batch 1: indices 0:64 (64 samples)
  - Batch 2: indices 64:128 (64 samples)
  - Batch 3: indices 128:192 (64 samples)
  - ...
  - Last batch: indices 320:353 (33 samples) ← smaller!
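
The batch boundaries can be listed directly, without touching any data. In this sketch the `min(...)` is only there to show the real end index; when slicing an array, NumPy clips an out-of-range stop automatically, which is why `X_shuffled[320:384]` simply returns the remaining 33 rows.

```python
n_instances, batch_size = 353, 64

# (start, stop) pairs for every batch in one epoch
batches = [(start, min(start + batch_size, n_instances))
           for start in range(0, n_instances, batch_size)]
sizes = [stop - start for start, stop in batches]

print(batches)
print(sizes)   # [64, 64, 64, 64, 64, 33]
```

Six batches per epoch means six parameter updates per epoch, compared with one for Batch GD.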

</details>

### Question 14: Implement Mini-batch SGD

**Challenge**: Complete the LinearRegressionMiniSGD class.

**Think about**:
- How does this differ from Batch GD?
- Why use `len(X_batch)` instead of `n_instances`?
- What's the tradeoff with batch_size?

**Expected output**: Complete working Mini-batch SGD implementation

In [None]:
class LinearRegressionMiniSGD:
    def __init__(self, alpha=0.1, iterations=1000, batch_size=64):
        # TODO: Initialize hyperparameters
        pass
    
    def fit(self, X, y):
        # TODO: Initialize weights and bias
        
        # TODO: Loop for iterations (epochs)
        # for _ in range(self.iterations):
        
            # TODO: Shuffle data
            
            # TODO: Loop through batches
            # for i in range(0, n_instances, self.batch_size):
            
                # TODO: Get batch
                
                # TODO: Forward pass on batch
                
                # TODO: Compute error on batch
                
                # TODO: Compute gradients using len(X_batch)
                
                # TODO: Update weights
        pass
    
    def predict(self, X):
        # TODO: Return predictions
        pass

<details>
<summary>✅ Solution</summary>

```python
class LinearRegressionMiniSGD:
    def __init__(self, alpha=0.1, iterations=1000, batch_size=64):
        self.alpha = alpha
        self.iterations = iterations
        self.weights = None
        self.bias = None
        self.batch_size = batch_size

    def fit(self, X, y):
        n_instances, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Mini batch SGD
        for _ in range(self.iterations):
            # Shuffle the data
            indices = np.random.permutation(n_instances)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            
            for i in range(0, n_instances, self.batch_size):
                X_batch = X_shuffled[i:i+self.batch_size]
                y_batch = y_shuffled[i:i+self.batch_size]
                
                # Forward pass
                pred_y = np.dot(X_batch, self.weights) + self.bias
                error = pred_y - y_batch
                
                # Gradients (note: using len(X_batch), not n_instances)
                dw = np.dot(X_batch.T, error) / len(X_batch)
                db = np.sum(error) / len(X_batch)
                
                # Gradient update
                self.weights = self.weights - self.alpha * dw
                self.bias = self.bias - self.alpha * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
```

**Key Differences from Batch GD**:
- Updates happen multiple times per epoch (once per batch)
- Uses `len(X_batch)` for averaging (handles variable-size last batch)
- More updates per epoch, so it often needs fewer epochs to get close to the solution; each update only touches one batch of data
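
To make the "more updates per epoch" point concrete, the sketch below runs both variants for the same number of epochs on noiseless synthetic data. The data, the hyperparameters, and the helper `train` are assumptions chosen for illustration; on noisy or badly scaled data the gap can look different.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 353, 3
X_demo = rng.normal(size=(m, n))            # hypothetical, well-scaled features
w_true, b_true = np.array([2.0, -1.0, 0.5]), 4.0
y_demo = X_demo @ w_true + b_true           # noiseless targets

def train(batch_size, epochs=20, alpha=0.1):
    """Run gradient descent and return (weight error, number of updates)."""
    w, b, updates = np.zeros(n), 0.0, 0
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            rows = idx[start:start + batch_size]
            error = X_demo[rows] @ w + b - y_demo[rows]
            w = w - alpha * X_demo[rows].T @ error / len(rows)
            b = b - alpha * np.sum(error) / len(rows)
            updates += 1
    return np.linalg.norm(w - w_true), updates

err_batch, updates_batch = train(batch_size=m)    # 1 update per epoch
err_mini, updates_mini = train(batch_size=64)     # 6 updates per epoch
print(updates_batch, updates_mini)
print(err_batch, err_mini)
```

Setting `batch_size` equal to the number of samples recovers Batch GD, so one function covers both cases.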

</details>

### Question 15: Test Mini-batch SGD

**Challenge**: Train and compare with Batch GD.

**Expected output**: Predictions from Mini-batch SGD

In [None]:
# TODO: Create and train LinearRegressionMiniSGD

# TODO: Make predictions

# TODO: Print first 10 predictions

<details>
<summary>✅ Solution</summary>

```python
model_mini = LinearRegressionMiniSGD(alpha=0.1, iterations=10000, batch_size=64)
model_mini.fit(X_train, y_train)
y_pred_mini = model_mini.predict(X_test)
print(f"First 10 predictions: {y_pred_mini[:10]}")
```

</details>

---

## Part 4: Normal Equation (Closed-form Solution)

Instead of iterative optimization, we can solve directly using linear algebra!

### Question 16: Understand the Normal Equation

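**Challenge**: Implement the closed-form least-squares solution, $\theta = (X^T X)^{-1} X^T y$. In code, solve the same problem with `np.linalg.lstsq` rather than computing the inverse explicitly.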

**Think about**:
- Why does this work?
- What if we want to include an intercept?
- What's the difference between centering and adding a bias column?

**Expected output**: Optimal weights computed directly

In [None]:
class LinearRegressionNormalEqn:
    def __init__(self, fit_intercept=True):
        # TODO: Store parameters
        pass
    
    def fit(self, X, y):
        if self.fit_intercept:
            # TODO: Center X (subtract mean)
            # Xc = ?
            
            # TODO: Center y
            # yc = ?
            
            # TODO: Use np.linalg.lstsq to solve for parameters
            # parameters, *_ = np.linalg.lstsq(?, ?, rcond=None)
            
            # TODO: Store coefficients
            # self.coef = ?
            
            # TODO: Calculate intercept from means
            # self.intercept = ?
            pass
        else:
            # TODO: Solve without intercept
            pass
    
    def predict(self, X):
        # TODO: Return predictions
        pass

<details>
<summary>💡 Hint 1</summary>

Centering data means subtracting the mean:
```python
Xc = X - X.mean(axis=0)
yc = y - y.mean()
```

</details>

<details>
<summary>💡 Hint 2</summary>

`np.linalg.lstsq` solves the least squares problem. After solving with centered data, the intercept can be recovered:
```python
intercept = y.mean() - X.mean(axis=0) @ coef
```

</details>

<details>
<summary>✅ Solution</summary>

```python
class LinearRegressionNormalEqn:
    def __init__(self, fit_intercept=True):
        self.coef = None
        self.fit_intercept = fit_intercept
        self.intercept = 0.0

    def fit(self, X, y):
        if self.fit_intercept:
            # Center the data
            Xc = X - X.mean(axis=0)
            yc = y - y.mean()
            
            # Solve using least squares
            parameters, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
            self.coef = parameters
            
            # Calculate intercept
            self.intercept = y.mean() - X.mean(axis=0) @ self.coef
        else:
            parameters, *_ = np.linalg.lstsq(X, y, rcond=None)
            self.coef = parameters
    
    def predict(self, X):
        return X @ self.coef + self.intercept
```

**Why center data?**
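- Centering allows fitting the intercept without adding a column of 1's
- Mathematically equivalent to solving the Normal Equation on X augmented with a column of 1's
- `np.linalg.lstsq` solves the least-squares problem without explicitly forming and inverting $X^T X$, which is generally more numerically stable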
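
The equivalence between centering and adding a bias column can be checked numerically. The sketch below uses synthetic data invented for illustration and compares the two approaches.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, size=(50, 3))                 # hypothetical data
y_demo = (X_demo @ np.array([1.5, -2.0, 0.7]) + 10.0
          + rng.normal(scale=0.1, size=50))

# Approach 1: center X and y, then recover the intercept from the means
coef_c, *_ = np.linalg.lstsq(X_demo - X_demo.mean(axis=0),
                             y_demo - y_demo.mean(), rcond=None)
intercept_c = y_demo.mean() - X_demo.mean(axis=0) @ coef_c

# Approach 2: append a column of ones; the last parameter is the intercept
X_aug = np.column_stack([X_demo, np.ones(len(X_demo))])
theta, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
coef_a, intercept_a = theta[:-1], theta[-1]

print(np.allclose(coef_c, coef_a), np.isclose(intercept_c, intercept_a))
```

Both routes give the same coefficients and intercept up to floating-point error; centering just keeps the intercept out of the linear solve.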

</details>

### Question 17: Test Normal Equation

**Challenge**: Train using Normal Equation and compare with GD methods.

**Expected output**: Predictions from Normal Equation

In [None]:
# TODO: Create and train LinearRegressionNormalEqn

# TODO: Make predictions

# TODO: Print first 10 predictions

<details>
<summary>✅ Solution</summary>

```python
model_normal = LinearRegressionNormalEqn()
model_normal.fit(X_train, y_train)
y_pred_normal = model_normal.predict(X_test)
print(f"First 10 predictions: {y_pred_normal[:10]}")
```

</details>

---

## Part 5: Comparison & Analysis

### Question 18: Compare with sklearn

**Challenge**: Train sklearn's LinearRegression and compare predictions.

**Think about**:
- How close are your implementations to sklearn?
- Which method does sklearn use internally?

**Expected output**: sklearn predictions

In [None]:
from sklearn.linear_model import LinearRegression

# TODO: Create and train sklearn model

# TODO: Make predictions

# TODO: Print first 10 predictions

<details>
<summary>✅ Solution</summary>

```python
from sklearn.linear_model import LinearRegression

model_sklearn = LinearRegression()
model_sklearn.fit(X_train, y_train)
y_pred_sklearn = model_sklearn.predict(X_test)
print(f"First 10 predictions: {y_pred_sklearn[:10]}")
```

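**Note**: sklearn's `LinearRegression` does not invert $X^T X$ explicitly. For dense inputs it solves the least-squares problem with `scipy.linalg.lstsq`, an SVD-based solver, for numerical stability.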

</details>

### Question 19: Calculate and compare metrics

**Challenge**: Compute Mean Squared Error (MSE) for all methods.

**Think about**:
- Which method gives the best results?
- Are the results similar?
- Why might they differ slightly?

**Expected output**: MSE comparison table

In [None]:
from sklearn.metrics import mean_squared_error

# TODO: Compute MSE for all methods
# mse_bgd = ?
# mse_mini = ?
# mse_normal = ?
# mse_sklearn = ?

# TODO: Print comparison table

<details>
<summary>✅ Solution</summary>

```python
from sklearn.metrics import mean_squared_error

# Assuming you have predictions from all models
mse_bgd = mean_squared_error(y_test, y_pred)
mse_mini = mean_squared_error(y_test, y_pred_mini)
mse_normal = mean_squared_error(y_test, y_pred_normal)
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)

print("\n=== Model Comparison ===")
print(f"Batch Gradient Descent MSE: {mse_bgd:.2f}")
print(f"Mini-batch SGD MSE: {mse_mini:.2f}")
print(f"Normal Equation MSE: {mse_normal:.2f}")
print(f"sklearn LinearRegression MSE: {mse_sklearn:.2f}")
```

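**Expected Result**: The Normal Equation and sklearn results should match to within floating-point rounding. The gradient descent results should be close, but they depend on the learning rate and number of iterations and can differ noticeably if training has not fully converged.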

</details>

### Question 20: Final Analysis Questions

**Reflect on what you've learned**:

1. **When would you use each method?**
   - Batch GD: ?
   - Mini-batch SGD: ?
   - Normal Equation: ?

2. **What are the computational complexities?**
   - Gradient Descent: ?
   - Normal Equation: ?

3. **What happens if X^T X is not invertible?**

4. **How does batch size affect training?**

5. **What would you change if you had millions of samples?**

<details>
<summary>✅ Answers</summary>

**1. When to use each method:**
- **Batch GD**: Small to medium datasets, when you want smooth convergence
- **Mini-batch SGD**: Large datasets, online learning, when memory is limited
- **Normal Equation**: Small number of features (n < 10,000), need exact solution

**2. Computational complexity:**
- **Gradient Descent**: O(k·m·n) where k = iterations, m = samples, n = features
- **Normal Equation**: O(n³) due to matrix inversion + O(n²·m) for X^T X

**3. Non-invertible X^T X:**
- Happens when features are linearly dependent (multicollinearity)
- Solutions:
  - Remove redundant features
  - Use regularization (Ridge: adds λI to make it invertible)
  - Use pseudo-inverse
  - Use gradient descent instead
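
Here is a small sketch of the non-invertible case, using a made-up design matrix whose second feature is exactly twice the first. It shows two of the remedies listed above: `np.linalg.lstsq`, which returns the minimum-norm solution (the same one the pseudo-inverse gives), and a Ridge-style solve. The value of `lam` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
X_demo = np.column_stack([x1, 2.0 * x1])     # second feature duplicates the first
y_demo = 3.0 * x1

# lstsq reports the rank of X; rank 1 < 2 columns means X^T X is singular
theta_min_norm, _, rank, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(rank)               # 1
print(theta_min_norm)     # minimum-norm solution, approximately [0.6, 1.2]

# Ridge: adding lam * I makes the 2x2 system solvable
lam = 1e-3
gram = X_demo.T @ X_demo
theta_ridge = np.linalg.solve(gram + lam * np.eye(2), X_demo.T @ y_demo)
print(theta_ridge)
```

Any weights satisfying `w1 + 2 * w2 = 3` fit this data perfectly; the minimum-norm and Ridge solutions both pick the smallest such pair rather than an arbitrary one.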

**4. Batch size effects:**
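- **Small batches** (1-32): Noisy updates, cheap per update; the noise can help escape local minima in non-convex problems (the MSE loss of linear regression is convex, so it has none)
- **Large batches** (256+): Stable updates, better hardware utilization, but fewer parameter updates per epoch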
- **Sweet spot**: Usually 32-128 for good balance

**5. For millions of samples:**
- Use Mini-batch SGD (not Batch GD or Normal Equation)
- Consider smaller batch sizes for memory
- Use momentum-based or adaptive optimizers (e.g., SGD with momentum, RMSprop, Adam)
- Consider distributed training
- May need early stopping to avoid overfitting

</details>

---

## Congratulations! 🎉

You've successfully implemented Linear Regression from scratch using three different approaches:

1. ✅ Batch Gradient Descent
2. ✅ Mini-batch Stochastic Gradient Descent
3. ✅ Normal Equation (Closed-form)

### Key Takeaways

- **Vectorization** is crucial for efficient implementation
- **Gradient descent** works by iteratively moving towards the minimum
- **Mini-batch SGD** balances speed and stability
- **Normal Equation** gives an exact solution but doesn't scale to a large number of features
- All methods converge to similar solutions for linear regression

### Next Steps

- Try with different datasets
- Experiment with different learning rates and batch sizes
- Add regularization (Ridge, Lasso)
- Implement feature scaling/normalization
- Visualize the loss curves

For complete reference implementations, check out the **Solution Reference** notebook!

---