# **Linear Regression from Scratch**
Welcome to this tutorial, where we'll implement **Linear Regression** and two algorithms for fitting it: **Batch Gradient Descent** and **Stochastic Gradient Descent (SGD)**.

We'll go step-by-step through the theory and Python implementation. While this video focuses on the code, the provided Jupyter Notebook includes detailed explanations and equations for your reference.

---

## **1. Introduction to Linear Regression**
Linear Regression is a fundamental algorithm used for predicting a continuous target variable ($y$) based on one or more input features ($x$).

### **1.1 Hypothesis Function**
The hypothesis function models the relationship between inputs and outputs:
$$
h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_nx_n
$$
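In vectorized form, the predictions for all $m$ examples can be written at once:
$$
h_\theta(X) = X \theta
$$
Where:
- $X$: Matrix of input features of shape $m \times (n+1)$, where $m$ is the number of examples, $n$ is the number of features, and the first column is all ones ($x_0 = 1$) so that $\theta_0$ acts as the intercept.
- $\theta$: Vector of model parameters of shape $(n+1) \times 1$.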
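
The vectorized hypothesis can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's implementation: the helper names `add_intercept` and `predict` and the toy values are assumptions made for this example, and the sketch assumes a column of ones is prepended to the features so that `theta[0]` plays the role of $\theta_0$.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones so that theta[0] acts as the intercept theta_0."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        # Treat a flat array as m examples with a single feature.
        features = features.reshape(-1, 1)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def predict(X, theta):
    """Vectorized hypothesis: one prediction per row of X."""
    return X @ theta

# Toy data (made up for illustration): m = 3 examples, n = 2 features.
raw = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
X = add_intercept(raw)               # shape (3, 3): ones column + 2 features
theta = np.array([0.5, 1.0, -1.0])   # [theta_0, theta_1, theta_2]
predictions = predict(X, theta)
print(predictions)                   # [-0.5 -0.5 -0.5]
```

Each prediction is $0.5 + 1 \cdot x_1 - 1 \cdot x_2$, which equals $-0.5$ for every row of this toy dataset.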

### **1.2 Goal of Linear Regression**
The goal is to find parameters $\theta$ that minimize the error between predicted values ($h_\theta(x)$) and actual values ($y$).

---

## **2. Cost Function**
To measure how well the model predicts the target, we use the **Mean Squared Error (MSE)**, scaled by an extra factor of $\frac{1}{2}$, as the cost function:
$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

Where:
- $m$: Number of training examples.
- $h_\theta(x^{(i)})$: Predicted value for the $i$-th example.
- $y^{(i)}$: Actual value for the $i$-th example.
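
As a rough sketch, the cost function can be computed in a few lines of NumPy. The function name `compute_cost` and the toy data are assumptions made for this example, and `X` is assumed to already contain a leading column of ones for the intercept.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * m)

# Toy data (made up): y = x exactly, with a ones column for the intercept.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

cost_perfect = compute_cost(X, y, np.array([0.0, 1.0]))  # perfect fit
cost_zero = compute_cost(X, y, np.zeros(2))              # all predictions are 0
print(cost_perfect)  # 0.0
print(cost_zero)     # (1 + 4 + 9) / 6, roughly 2.3333
```

A perfect fit gives a cost of zero, while the all-zero parameter vector pays for every squared target value.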

### **Why Divide by $2m$?**
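Dividing by $m$ averages the squared error over the training examples, so the cost does not grow with the size of the dataset. The extra factor of $\frac{1}{2}$ cancels the 2 produced by differentiating the squared term, which keeps the gradient free of stray constants.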
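
As a short worked step, differentiating the cost with respect to a single parameter $\theta_j$ shows where the cancellation happens. This sketch assumes the convention $x_0^{(i)} = 1$, so the intercept $\theta_0$ is handled like any other parameter:

$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= \frac{1}{2m} \sum_{i=1}^m 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} \\
&= \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$

The 2 from the power rule cancels the $\frac{1}{2}$, and $\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = x_j^{(i)}$ because the hypothesis is linear in $\theta$.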

---

## **3. Optimization with Gradient Descent**

### **3.1 What is Gradient Descent?**
Gradient Descent is an iterative optimization algorithm used to minimize the cost function by adjusting the model parameters ($\theta$) step by step.

### **3.2 Algorithm Steps**
1. Initialize parameters $\theta$ (e.g., to zeros or random values).
2. Repeat until convergence:
   $$
   \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
   $$
   Where:
   - $\alpha$: Learning rate, controlling the size of the update step.
   - $\frac{\partial J(\theta)}{\partial \theta_j}$: Gradient of the cost function with respect to $\theta_j$.

### **3.3 Gradient Calculation**
The partial derivative of the cost function with respect to $\theta_j$ is (taking $x_0^{(i)} = 1$ for the intercept term):
$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$
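
To make the formula concrete, the sketch below computes each partial derivative with explicit loops that mirror the sum term by term, then compares the result with the matrix expression $\frac{1}{m} X^T (X\theta - y)$. The function name `gradient_loop` and the toy data are assumptions made for this example, and `X` is assumed to include a leading column of ones.

```python
import numpy as np

def gradient_loop(X, y, theta):
    """Partial derivatives of J computed with explicit sums over i, for each j."""
    m, num_params = X.shape
    grad = np.zeros(num_params)
    for j in range(num_params):
        total = 0.0
        for i in range(m):
            error = X[i] @ theta - y[i]   # h_theta(x^(i)) - y^(i)
            total += error * X[i, j]      # multiplied by x_j^(i)
        grad[j] = total / m
    return grad

# Toy data (made up): y = 2x, with a ones column for the intercept.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)

grad = gradient_loop(X, y, theta)
grad_vectorized = X.T @ (X @ theta - y) / len(y)
print(grad)             # [-4.0, -28/3]
print(grad_vectorized)  # same values
```

With $\theta = 0$ the errors are $-2, -4, -6$, so the two partial derivatives are $-12/3 = -4$ and $-(2 + 8 + 18)/3 = -28/3$.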

### **3.4 Vectorized Gradient Descent Update Rule**
For computational efficiency, we use the vectorized form:
$$
\theta := \theta - \alpha \cdot \nabla J(\theta)
$$
Where:
$$
\nabla J(\theta) = \frac{1}{m} X^T \cdot (h_\theta(X) - y)
$$
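
Putting the update rule and the vectorized gradient together, a minimal batch gradient descent loop might look like the sketch below. The function name `batch_gradient_descent`, the default hyperparameters, and the synthetic data (a noisy line $y \approx 4 + 3x$) are assumptions chosen for illustration. A fixed number of iterations stands in for "repeat until convergence".

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Minimize J(theta) using the full dataset for every update."""
    m, num_params = X.shape
    theta = np.zeros(num_params)          # initialize parameters to zeros
    cost_history = []
    for _ in range(num_iters):
        errors = X @ theta - y            # h_theta(X) - y
        cost_history.append((errors @ errors) / (2 * m))
        gradient = (X.T @ errors) / m     # (1/m) X^T (h_theta(X) - y)
        theta = theta - alpha * gradient  # theta := theta - alpha * grad J
    return theta, cost_history

# Synthetic data (assumed for this sketch): y = 4 + 3x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=100)
X = np.column_stack([np.ones_like(x), x])  # ones column for the intercept

theta, cost_history = batch_gradient_descent(X, y, alpha=0.1, num_iters=2000)
print(theta)  # close to [4, 3]
```

Tracking `cost_history` is a simple way to check that the learning rate is reasonable: the cost should fall steadily, and a rising cost usually means $\alpha$ is too large.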

---

## **4. Stochastic Gradient Descent (SGD)**

### **4.1 What is SGD?**
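Unlike Batch Gradient Descent, which updates the parameters only after evaluating all $m$ training examples, SGD updates them after processing each individual example. Each update is therefore much cheaper to compute, but it follows a noisy single-example estimate of the gradient, so the cost tends to fluctuate from step to step rather than decrease smoothly.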

### **4.2 Algorithm Steps**
1. Shuffle the dataset.
2. For each training example $(x^{(i)}, y^{(i)})$:
   $$
   \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}
   $$
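
The steps above can be sketched as follows. The function name `stochastic_gradient_descent`, the hyperparameters, and the synthetic data are assumptions made for this example. The sketch also assumes that the data is passed over several times (epochs) and reshuffled before each pass, which is a common choice but is not spelled out in the steps above.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=100, seed=0):
    """Update theta after every single training example."""
    rng = np.random.default_rng(seed)
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_epochs):
        # Visit the examples in a fresh random order each epoch (assumption).
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]  # updates every theta_j at once
    return theta

# Synthetic data (assumed for this sketch): y = 4 + 3x plus small Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=100)
X = np.column_stack([np.ones_like(x), x])  # ones column for the intercept

theta = stochastic_gradient_descent(X, y)
print(theta)  # roughly [4, 3], with some noise from the single-example updates
```

With a constant learning rate, the parameters keep jittering around the minimum instead of settling exactly on it, which is why the result is only approximately equal to the batch solution.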

### **4.3 Comparison with Batch Gradient Descent**
- **Batch Gradient Descent**: Processes all $m$ examples to compute a single update.
- **Stochastic Gradient Descent**: Processes one example at a time for each update.

---

## **5. Multi-Dimensional Data**
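The equations above are written for any number of features, so moving from a single feature to multi-dimensional data only changes the shapes involved:
- $X$ becomes a matrix of shape $m \times (n+1)$: one row per example, with a leading column of ones for the intercept followed by the $n$ features.
- $\theta$ becomes a vector of shape $(n+1) \times 1$.
- The equations for $h_\theta(x)$, $J(\theta)$, and the gradients stay exactly the same.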
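
As a quick check that nothing else needs to change, the sketch below runs the same vectorized update on data with three features. The dataset, the "true" parameter vector, and the hyperparameters are all made up for illustration, and the targets are noise-free so that the recovered parameters can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 200, 3                                  # 200 examples, 3 features
features = rng.normal(size=(m, n))
true_theta = np.array([1.0, 2.0, -3.0, 0.5])   # hypothetical [theta_0, ..., theta_3]

X = np.column_stack([np.ones(m), features])    # shape (m, n + 1)
y = X @ true_theta                             # noise-free targets

theta = np.zeros(n + 1)
alpha = 0.1
for _ in range(1000):
    gradient = X.T @ (X @ theta - y) / m       # same formula as the 1-D case
    theta -= alpha * gradient

print(X.shape)  # (200, 4)
print(theta)    # approximately [1, 2, -3, 0.5]
```

Only the shapes of `X` and `theta` differ from the single-feature case. The update line itself is unchanged.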

---

## **6. Visualization and Insights**
We'll visualize the data and model predictions to interpret the results. Stay tuned for the implementation in the next sections!

---

## **7. Additional Resources**
For a deeper understanding of the math, see the lecture notes for Andrew Ng's **CS229** course, which cover linear regression:
[CS229 Lecture Notes](https://cs229.stanford.edu/notes2023fall/cs229-notes1.pdf)

---

Thank you for following along! The complete code and detailed explanations are available in the notebook attached below.