# Simple Linear Regression


---

## 1. Problem Statement

**Simple Linear Regression** is used to solve **regression problems** in supervised machine learning.

- Example dataset:  
  Features: `Weight`  
  Target: `Height`  

| Weight (kg) | Height (cm) |
|-------------|-------------|
| 74          | 170         |
| 80          | 180         |
| 75          | 175.5       |

**Goal:**  
Train a model such that, for a **new weight**, it can predict the corresponding **height**.

- `Weight` → Independent Feature (Input)  
- `Height` → Dependent Feature (Output)

---

## 2. Why "Simple" Linear Regression?

- **Simple Linear Regression:** 1 input feature + 1 output feature  
- **Multiple Linear Regression:** Multiple input features + 1 output feature  

Learning simple linear regression first makes the **terminology and mathematics** easier to follow, and both carry over to multiple linear regression.

---

## 3. Geometric Interpretation

- Plot the data points on a graph: `Weight` (x-axis) vs `Height` (y-axis)
- Aim: Create a **best fit line** through the points
- Purpose: Predict the height for a **new weight**

**Prediction process:**

1. Draw the **best fit line** that minimizes the vertical distance between the **true points** (actual heights) and the **predicted points** (points on the line).  
   - These distances are the **errors**.
2. Minimize the **sum of squared errors** to find the optimal line (squaring stops positive and negative errors from cancelling out).
3. For a new weight:
   - Locate the weight on the x-axis
   - See where it intersects the best fit line
   - Read the predicted height on the y-axis
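
As a minimal sketch of this process, the snippet below fits a line to the three rows of the table and reads off a prediction. It uses the closed-form least-squares formulas (slope = covariance / variance) rather than an iterative search; that choice, the helper name `fit_line`, and the new weight of 78 kg are assumptions made for illustration only.

```python
# Weight (kg) and height (cm) from the example table.
weights = [74.0, 80.0, 75.0]
heights = [170.0, 180.0, 175.5]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(weights, heights)

new_weight = 78.0  # hypothetical new input
predicted_height = intercept + slope * new_weight
print(f"height = {intercept:.2f} + {slope:.3f} * weight")
print(f"predicted height for {new_weight} kg: {predicted_height:.1f} cm")
```

With only three points the fitted line is illustrative at best; a real dataset would need many more rows.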

---

## 4. Summary

- **Simple Linear Regression** predicts a dependent variable (height) using a single independent variable (weight).
- The **best fit line** is chosen by minimizing the **total squared prediction error**.
- Once trained, the model can predict the output for **new inputs**; how accurate those predictions are depends on how well a straight line describes the data.

---





# Simple Linear Regression: Notations and Concepts

---

## 1. Setting Up the Data

- **X-axis:** Weight (independent feature)  
- **Y-axis:** Height (dependent feature)  
- **Data points:** Represent our dataset

**Goal:** Create the **best fit line** to predict height for new weight values.

---

## 2. Equation of the Best Fit Line

The equation can be written in different forms:

1. Standard straight line:  
   $$
   y = mx + c
   $$

2. Regression notation:  
   $$
   y = \beta_0 + \beta_1 x
   $$

3. Andrew Ng’s notation (used here):  
   $$
   h_\theta(x) = \theta_0 + \theta_1 x
   $$

- $x$ → Independent feature (Weight)  
- $h_\theta(x)$ → Predicted value (Height)

---

## 3. Parameters of the Line

### 3.1 Intercept ($\theta_0$)
- Also called **bias term**  
- Represents the predicted value at $x = 0$, i.e. where the line crosses the **y-axis**  
- Determines the **baseline value** of the predicted output

### 3.2 Slope / Coefficient ($\theta_1$)
- Represents the **rate of change** in $y$ for a unit change in $x$  
- Determines how steep the line is  

> For multiple features:  
> $$
> h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
> $$  
> Each feature has its own slope ($\theta_i$)

---

## 4. Predictions

- For a **new input** $x$, evaluate the **best fit line** at that $x$  
- The **predicted output** is the corresponding y-value:  
  $$
  \hat{y} = h_\theta(x)
  $$  
- $\hat{y}$ represents the predicted point on the line

---

## 5. Error

- **Error** measures the difference between the **true output** and the **predicted output**:  
  $$
  \text{Error} = y - \hat{y}
  $$  
- Goal: Minimize the **sum of squared errors** over all data points  
- The **best fit line** is the one that **minimizes this total squared error**  

---

## 6. Cost Function (Mean Squared Error)

To formalize error minimization, we define the **cost function**:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)^2
$$

- $m$ → Number of data points  
- $h_\theta(x_i)$ → Predicted value  
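- $y_i$ → Actual value  
- The factor $\frac{1}{2}$ only simplifies the derivative; $J$ is half the mean squared error and has the same minimizer  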

**Goal:** Minimize $J(\theta_0, \theta_1)$
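
The cost function translates almost directly into code. The sketch below evaluates $J$ on the weight/height table for two candidate lines; both parameter pairs are hypothetical values picked by hand, not fitted results.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): sum of squared errors divided by 2m."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

weights = [74.0, 80.0, 75.0]
heights = [170.0, 180.0, 175.5]

# Two hypothetical candidate lines, chosen only for illustration.
cost_a = cost(0.0, 2.3, weights, heights)    # line through the origin
cost_b = cost(67.0, 1.42, weights, heights)  # line with a nonzero intercept
print(f"J(a) = {cost_a:.3f}, J(b) = {cost_b:.3f}")
```

The line with the lower cost is the better fit of the two, which is the comparison gradient descent automates.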

---

## 7. Gradient Descent: Optimization Concept

Instead of randomly trying lines:

1. Initialize $\theta_0$ and $\theta_1$  
2. Compute the cost function $J(\theta_0, \theta_1)$  
3. Compute **partial derivatives** w.r.t each parameter  
4. Update parameters iteratively:

### Update Rules

- **Intercept ($\theta_0$):**
$$
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)
$$

- **Slope ($\theta_1$):**
$$
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
$$

- $\alpha$ → **Learning rate**, controls step size  

Repeat until convergence, i.e. until the cost stops decreasing noticeably. Because this cost function is convex, that point is the global minimum.
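
A minimal sketch of the two update rules, assuming a fixed number of iterations instead of a convergence check. The data, the learning rate of 0.1, and the function name `gradient_descent` are assumptions for this example.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x using the update rules above."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initialize both parameters
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

Because the data has no noise, the parameters should approach the intercept 1 and slope 2 that generated it.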

---

## 8. Key Takeaways

- $\theta_0$ → Intercept / Bias term  
- $\theta_1$ → Slope / Coefficient  
- $h_\theta(x)$ → Predicted value ($\hat{y}$)  
- **Error** → Difference between actual and predicted values  
- **Cost function** → Measures the average squared error (halved for convenience)  
- **Gradient descent** → Iteratively adjusts parameters to minimize cost  
- **Learning rate ($\alpha$)** → Controls convergence speed


# Simple Linear Regression: Cost Function & Gradient Descent

---

## 1. Cost Function

To find the optimal line, we define a **cost function** $J(\theta_0, \theta_1)$ as:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \Big(h_\theta(x^{(i)}) - y^{(i)}\Big)^2
$$

Where:  
- $h_\theta(x^{(i)})$ → Predicted value for the $i^{th}$ point  
- $y^{(i)}$ → True value for the $i^{th}$ point  
- $m$ → Total number of points  
- Squared error is used to **penalize larger errors more heavily**  

This is the **Mean Squared Error (MSE)** scaled by $\frac{1}{2}$; the extra factor cancels when differentiating and does not change where the minimum lies.

> Other cost functions exist, such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), but MSE is widely used for linear regression.

---

## 2. Our Aim

- **Goal:** Minimize the cost function by adjusting $\theta_0$ (intercept) and $\theta_1$ (slope).  
- When $J(\theta_0, \theta_1)$ is minimal, we have found the **best fit line**.

---

## 3. Simplifying for Visualization

- Assume $\theta_0 = 0$ (line passes through origin)  
- Equation becomes:  
$$
h_\theta(x) = \theta_1 x
$$

- Example dataset:

| x | y |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |

- Initialize $\theta_1 = 1$, then predictions exactly match true points:  

$$
h_\theta(1) = 1, \quad h_\theta(2) = 2, \quad h_\theta(3) = 3
$$

- Cost function:  
$$
J(\theta_1) = \frac{1}{2 \cdot 3} \big((1-1)^2 + (2-2)^2 + (3-3)^2\big) = 0
$$

✅ Perfect fit, cost function = 0

---

## 4. Changing $\theta_1$

- Example: $\theta_1 = 0.5$  

Predictions:

| x | $h_\theta(x)$ | y | Error |
|---|---------------|---|-------|
| 1 | 0.5           | 1 | 0.5   |
| 2 | 1             | 2 | 1     |
| 3 | 1.5           | 3 | 1.5   |

- Cost function:

$$
J(0.5) = \frac{1}{2 \cdot 3} \big((0.5)^2 + (1)^2 + (1.5)^2\big) \approx 0.58
$$

- Another example: $\theta_1 = 0$  

Predictions:

| x | $h_\theta(x)$ | y | Error |
|---|---------------|---|-------|
| 1 | 0             | 1 | 1     |
| 2 | 0             | 2 | 2     |
| 3 | 0             | 3 | 3     |

- Cost function:

$$
J(0) = \frac{1}{2 \cdot 3} \big(1^2 + 2^2 + 3^2\big) \approx 2.33
$$
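
The hand calculations above can be reproduced with a short sweep over several slopes. This is a sketch for the three-point dataset with the intercept fixed at 0; the extra slopes 1.5 and 2 are added here only to show the other side of the minimum.

```python
xs = [1, 2, 3]
ys = [1, 2, 3]

def cost(theta1):
    """J(theta1) for h(x) = theta1 * x, with theta0 fixed at 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for theta1 in (0, 0.5, 1, 1.5, 2):
    print(f"theta1={theta1}: J={cost(theta1):.2f}")
# theta1=0: J=2.33
# theta1=0.5: J=0.58
# theta1=1: J=0.00
# theta1=1.5: J=0.58
# theta1=2: J=2.33
```

The cost falls as the slope approaches 1 and rises again on the other side.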

---

## 5. Gradient Descent & Global Minimum

- Plotting $J(\theta_1)$ vs $\theta_1$ gives a **U-shaped curve**  
- The **minimum point** of the curve corresponds to the **best fit line** (global minimum)  
- Goal: Adjust $\theta_1$ and $\theta_0$ iteratively to reach the **global minimum**  
- This iterative optimization process is called **gradient descent**  
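
For the three-point dataset used in this section (where $y_i = x_i$ and $\theta_0 = 0$), the U shape can be checked by expanding the cost by hand. The derivation is specific to that toy dataset:

$$
J(\theta_1) = \frac{1}{2 \cdot 3} \sum_{i=1}^{3} \big( \theta_1 x_i - x_i \big)^2 = \frac{(\theta_1 - 1)^2}{6} \big( 1^2 + 2^2 + 3^2 \big) = \frac{7}{3} (\theta_1 - 1)^2
$$

This is a parabola with its single minimum at $\theta_1 = 1$, and it reproduces the values computed earlier: $J(0.5) \approx 0.58$ and $J(0) \approx 2.33$.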

> Gradient descent is also widely used in **deep learning** to optimize weights of neural networks.

---

✅ **Key Takeaways:**

1. **Cost Function:** Measures error between predicted and true values (MSE).  
2. **Goal:** Minimize the cost function to find the best fit line.  
3. **Gradient Descent:** Iteratively updates $\theta_0$ and $\theta_1$ to reach global minimum.  
4. **Predicted Points:** $h_\theta(x) = \hat{y}$  
5. **Error:** $y - \hat{y}$  




# Simple Linear Regression: Convergence Algorithm (Gradient Descent)

This section covers the **convergence algorithm**: using **gradient descent** to reach the **global minimum** of the cost function systematically instead of by trial and error.

---

## 1. Cost Function

For $m$ training examples $(x_i, y_i)$, the cost function is:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)^2
$$

Where the hypothesis is:

$$
h_\theta(x_i) = \theta_0 + \theta_1 x_i
$$

- $\theta_0$ → Intercept  
- $\theta_1$ → Slope

---

## 2. Gradient Descent Update Rules

The **general update formula** for any parameter $\theta_j$ is:

$$
\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}
$$

Where:  
- $j = 0$ or $1$  
- $\alpha$ → **Learning rate**, controls step size  
- $\frac{\partial J}{\partial \theta_j}$ → Derivative of the cost function w.r.t $\theta_j$  

**Goal:** Repeat until convergence (global minimum is reached).

---

## 3. Partial Derivatives

### a) Derivative w.r.t $\theta_0$:

$$
\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)
$$

**Explanation:**  
- θ₀ is multiplied by 1 in the hypothesis.  
- The derivative sums the errors and averages them over $m$ examples.

### b) Derivative w.r.t $\theta_1$:

$$
\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
$$

**Explanation:**  
- θ₁ is multiplied by $x_i$ in the hypothesis.  
- The derivative sums the **error multiplied by $x_i$** and averages over $m$ examples.
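
Both results follow from the chain rule applied to the squared term. As a worked sketch for $\theta_1$ (the $\theta_0$ case is identical except that the inner derivative is $1$ instead of $x_i$):

$$
\frac{\partial J}{\partial \theta_1}
= \frac{1}{2m} \sum_{i=1}^{m} 2 \big( h_\theta(x_i) - y_i \big) \cdot \frac{\partial}{\partial \theta_1} \big( \theta_0 + \theta_1 x_i - y_i \big)
= \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
$$

The factor $2$ from the square cancels the $\frac{1}{2}$ in the cost function, which is why the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.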

---

## 4. Gradient Descent Updates (Explicit)

**Update θ₀:**

$$
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)
$$

**Update θ₁:**

$$
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
$$

- Repeat until the **cost function is minimized**.  
- Update both parameters **simultaneously**: compute both gradients from the current values before changing either parameter.
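
To make the update order concrete, the sketch below performs a single step on a three-point toy dataset, once with both gradients computed from the old parameters and once sequentially. The starting point $(0, 0)$, the learning rate of 0.1, and the helper `gradients` are assumptions for illustration.

```python
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
alpha = 0.1
m = len(xs)

def gradients(theta0, theta1):
    """Return (dJ/dtheta0, dJ/dtheta1) at the given parameters."""
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    return sum(errors) / m, sum(e * x for e, x in zip(errors, xs)) / m

# Simultaneous update: both gradients use the old (theta0, theta1).
grad0, grad1 = gradients(0.0, 0.0)
theta0, theta1 = 0.0 - alpha * grad0, 0.0 - alpha * grad1

# Sequential update, shown for contrast: theta1's gradient sees the new theta0.
seq_theta0 = 0.0 - alpha * gradients(0.0, 0.0)[0]
seq_theta1 = 0.0 - alpha * gradients(seq_theta0, 0.0)[1]

print(theta0, theta1)          # simultaneous step
print(seq_theta0, seq_theta1)  # sequential step gives a different theta1
```

The two variants already differ after one step; the update rules in this section describe the simultaneous one.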

---

## 5. Understanding the Slope (Derivative)

- **Slope indicates direction:**
  - **Negative slope:** increase θ  
  - **Positive slope:** decrease θ  
- **Tangent line:** Use the slope at the current point to decide the update direction.
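
The sign rule can be checked numerically. This sketch evaluates the derivative of the cost on a three-point toy dataset with the intercept fixed at 0; the function name `slope_of_cost` and the two test values of θ₁ are assumptions.

```python
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def slope_of_cost(theta1):
    """dJ/dtheta1 for h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m

for theta1 in (0.5, 1.5):
    slope = slope_of_cost(theta1)
    direction = "increase" if slope < 0 else "decrease"
    print(f"theta1={theta1}: slope={slope:+.3f} -> {direction} theta1")
```

Left of the minimum at θ₁ = 1 the slope is negative, so subtracting α times the slope increases θ₁; right of it the slope is positive and θ₁ decreases.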

---

## 6. Learning Rate (α)

- Controls **speed of convergence**.  
- **Too small:** slow convergence.  
- **Too large:** may overshoot or fail to converge.  
- A small value such as $\alpha = 0.001$ or $\alpha = 0.01$ is a common starting point; the best value depends on the data and the scale of the features.  
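
The effect of the learning rate is easy to see on a one-parameter problem. The sketch below runs 50 steps on a three-point toy dataset (intercept fixed at 0, true slope 1) with three learning rates; the specific rates and step count are assumptions chosen to make the contrast visible.

```python
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def run(alpha, steps=50):
    """Gradient descent on theta1 only (theta0 fixed at 0), starting from 0."""
    m = len(xs)
    theta1 = 0.0
    for _ in range(steps):
        grad = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        theta1 -= alpha * grad
    return theta1

too_small = run(alpha=0.001)   # still far from 1 after 50 steps
reasonable = run(alpha=0.1)    # very close to 1
too_large = run(alpha=0.5)     # overshoots further on every step
print(too_small, reasonable, too_large)
```

On this dataset the threshold between converging and diverging is about α = 0.43; that number is specific to the toy data and changes with the scale of the features.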

---

## 7. 3D Gradient Descent Perspective

- With θ₀ and θ₁ both varying, the **cost function surface is 3D**.  
- Gradient descent moves iteratively **down the slope** toward the **global minimum**.  
- Think of it as **descending an inverted mountain** to reach the **lowest point**.

---

## 8. Step-by-Step Algorithm

1. Initialize θ₀ and θ₁ (e.g., zero or random).  
2. Compute the cost function $J(\theta_0, \theta_1)$.  
3. Compute partial derivatives w.r.t each θ.  
4. Update θ₀ and θ₁ using the formulas above.  
5. Repeat until convergence (minimal cost).  
6. Final θ₀ and θ₁ define the **best fit line**.
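
The six steps can be sketched as a loop with an explicit stopping test. Here "convergence" is taken to mean that the cost improved by less than a tolerance `tol` in one iteration; that criterion, the default values, and the toy data are assumptions, since the notes do not fix a specific test.

```python
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def fit(xs, ys, alpha=0.1, tol=1e-12, max_iters=100_000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0                       # step 1: initialize
    previous = cost(theta0, theta1, xs, ys)         # step 2: compute cost
    for iteration in range(1, max_iters + 1):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                     # step 3: partial derivatives
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0                     # step 4: update
        theta1 -= alpha * grad1
        current = cost(theta0, theta1, xs, ys)
        if previous - current < tol:                # step 5: check convergence
            break
        previous = current
    return theta0, theta1, iteration                # step 6: final parameters

# Hypothetical data lying exactly on y = 0.5 + 2x.
theta0, theta1, used = fit([1.0, 2.0, 3.0], [2.5, 4.5, 6.5])
print(round(theta0, 3), round(theta1, 3), used)
```

A looser tolerance stops earlier but leaves the parameters further from the minimum.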

---

✅ **Key Takeaways**

- **Convergence algorithm** updates parameters systematically.  
- **Gradient descent** iteratively moves parameters toward the global minimum.  
- **Learning rate α** controls convergence speed.  
- Once converged, θ₀ and θ₁ define the **best fit line**.


# Simple Linear Regression: Convergence Algorithm with θ₀ and θ₁

We extend gradient descent to **both parameters**:

- θ₀ → Intercept  
- θ₁ → Slope  

The goal is to **converge to the global minimum** of the cost function $J(\theta_0, \theta_1)$.

---

## 1. Cost Function

For $m$ training examples $(x_i, y_i)$:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)^2
$$

Where the hypothesis is:

$$
h_\theta(x_i) = \theta_0 + \theta_1 x_i
$$

---

## 2. Gradient Descent Update Rules

The **general convergence algorithm**:

$$
\text{Repeat until convergence:} \quad 
\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}
$$

Where:  
- $\alpha$ → Learning rate  
- $j = 0, 1$ → Corresponds to θ₀ and θ₁

---

## 3. Partial Derivatives

### a) Derivative with respect to θ₀:

$$
\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)
$$

**Explanation:**  
- θ₀ is multiplied by 1 in the hypothesis, so the derivative is simply the sum of errors divided by $m$.

### b) Derivative with respect to θ₁:

$$
\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
$$

**Explanation:**  
- θ₁ is multiplied by $x_i$, so the derivative is the sum of **errors multiplied by $x_i$**, divided by $m$.

---

## 4. Final Gradient Descent Updates

$$
\boxed{
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big)
}
$$

$$
\boxed{
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x_i) - y_i \big) x_i
}
$$

- **Repeat** these updates until the **global minimum** is reached.  
- In each iteration, update θ₀ and θ₁ **simultaneously**: compute both gradients from the current values before changing either parameter.

---

## 5. 3D Gradient Descent Perspective

- When both θ₀ and θ₁ are variables, the cost function is a **surface in 3D** (θ₀, θ₁, and $J$ as the three axes), shaped like a bowl.  
- Gradient descent moves iteratively **down the slope** of this surface toward the **global minimum**.  
- Think of it like descending into a **valley** (an inverted mountain) to reach the **lowest point**.
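
One way to get a feel for this surface without plotting is to evaluate the cost on a small grid of (θ₀, θ₁) pairs and look for the lowest value. The sketch below does that for a three-point toy dataset; the grid range and spacing are assumptions, and a grid search is used here only to inspect the surface, not as a replacement for gradient descent.

```python
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def cost(theta0, theta1):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

theta0_values = [-1 + 0.25 * i for i in range(9)]  # -1.0 ... 1.0
theta1_values = [0.25 * j for j in range(9)]       #  0.0 ... 2.0

# One height J per (theta0, theta1) pair: these heights form the surface.
surface = [(cost(t0, t1), t0, t1) for t0 in theta0_values for t1 in theta1_values]
best_cost, best_theta0, best_theta1 = min(surface)
print(best_cost, best_theta0, best_theta1)  # 0.0 0.0 1.0
```

Every other grid point has a strictly higher cost, which is the bowl shape gradient descent walks down.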

---

### ✅ Summary

1. Start with initial values of θ₀ and θ₁ (often zero or random).  
2. Compute the cost function $J(\theta_0, \theta_1)$.  
3. Update θ₀ and θ₁ using gradient descent formulas.  
4. Repeat until **convergence** (minimal cost).  
5. Final θ₀ and θ₁ define the **best fit line**.


# Multiple Linear Regression: Introduction and Concepts

---

## 1. Recap: Simple Linear Regression

- **Single input feature example:** Predict height based on weight  
- **Equation:**  
  $$
  h_\theta(x) = \theta_0 + \theta_1 x
  $$
  - $\theta_0$ → Intercept  
  - $\theta_1$ → Slope / Coefficient  
- **Goal:** Update $\theta_0$ and $\theta_1$ using **gradient descent** to minimize cost function $J(\theta)$  

---

## 2. Multiple Input Features

- **Example dataset:** House Pricing Dataset  
  Features:  
  1. Number of rooms → $x_1$  
  2. Size of the house → $x_2$  
  3. Location → $x_3$ (must be encoded as a number before it can enter the equation)  
- **Output / Target feature:** Price of the house → $y$  

> **Independent features (inputs):** $x_1, x_2, x_3$  
> **Dependent feature (output):** $y$

---

## 3. Equation of Multiple Linear Regression

- **General form:**  
$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n
$$

- **Where:**  
  - $\theta_0$ → Intercept (there is always exactly one, regardless of the number of features)  
  - $\theta_1, \theta_2, \dots, \theta_n$ → Coefficients / slopes for each input feature  

- **Example (house pricing):**  
$$
h_\theta(x) = \theta_0 + \theta_1 (\text{rooms}) + \theta_2 (\text{size}) + \theta_3 (\text{location})
$$
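
As a small sketch of this equation in code: the parameter values below are hypothetical (not fitted to any data), and `location_score` assumes the location has already been turned into a number.

```python
def predict_price(theta, rooms, size, location_score):
    """h_theta(x) = theta0 + theta1*rooms + theta2*size + theta3*location."""
    theta0, theta1, theta2, theta3 = theta
    return theta0 + theta1 * rooms + theta2 * size + theta3 * location_score

# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
theta = [50.0, 10.0, 0.5, 20.0]
price = predict_price(theta, rooms=3, size=120, location_score=2)
print(price)  # 50 + 10*3 + 0.5*120 + 20*2 = 180.0
```

Each feature contributes its own slope times its value, and the intercept is added once.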

---

## 4. Key Differences: Simple vs Multiple Linear Regression

| Feature                  | Simple Linear Regression | Multiple Linear Regression |
|---------------------------|------------------------|---------------------------|
| Number of input features  | 1                      | 2 or more                |
| Parameters                | $\theta_0, \theta_1$   | $\theta_0, \theta_1, \dots, \theta_n$ |
| Equation                  | $h_\theta(x) = \theta_0 + \theta_1 x$ | $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$ |
| Visualization             | Line in 2D             | Plane (2 features) or hyperplane (3 or more) |

---

## 5. Gradient Descent in Multiple Linear Regression

- **Goal:** Minimize cost function $J(\theta_0, \theta_1, \dots, \theta_n)$  
- **Cost function:**  
$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \Big( h_\theta(x^{(i)}) - y^{(i)} \Big)^2
$$

- **Parameter updates (for each $\theta_j$):**  
$$
\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1, \dots, \theta_n)}{\partial \theta_j} \quad \text{for } j = 0, 1, \dots, n
$$

- **Gradient descent intuition:**  
  - Each $\theta_j$ moves toward reducing the cost  
  - All parameters are updated simultaneously until convergence (global minimum)
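
The same loop used for one feature extends to any number of features. The sketch below is a plain-Python version under stated assumptions: the data is hypothetical and noise-free, and the learning rate and iteration count were picked to suit it.

```python
def predict(theta, features):
    """h_theta(x) = theta[0] + theta[1]*x_1 + ... + theta[n]*x_n."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))

def gradient_descent(X, y, alpha=0.05, iterations=10_000):
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [predict(theta, row) - target for row, target in zip(X, y)]
        # Gradient for the intercept, then one gradient per feature.
        grads = [sum(errors) / m]
        grads += [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: every gradient above used the old theta.
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
theta = gradient_descent(X, y)
print([round(t, 4) for t in theta])
```

With real features on very different scales (for example room counts next to house sizes), the same learning rate may diverge, so features are usually rescaled before running gradient descent.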

---

## 6. Visualization (Conceptual)

- **2D:** Single feature → line  
- **3D:** Two features → plane  
- **Higher dimensions:** n features → hyperplane  

- **Gradient descent:**  
  - Imagine all $\theta_j$ starting from random points  
  - Iteratively adjust all $\theta_j$ simultaneously  
  - Converge to **global minimum** of $J(\theta)$

---

## 7. Summary

- **Simple Linear Regression:** 1 input → line  
- **Multiple Linear Regression:** 2+ inputs → plane/hyperplane  
- **Equation:** $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$  
- **Parameters:** 1 intercept + n coefficients  
- **Optimization:** Gradient descent updates all parameters together  

> Next, we will look at the **assumptions of linear regression**, which are crucial for correct modeling and interpretation.
