### **REGRESSION TREES**

- When the target values fall into clusters or steps rather than along a single straight line, a regression tree can fit the data better than linear regression.

---

# 🌳 Regression Trees: Complete Guide

A **Regression Tree** is a decision tree algorithm used for predicting **continuous values** (like house price, temperature, etc.). It splits the dataset into **regions** using decision rules based on input features and predicts the output using the **mean value** of samples in each region.

---

## 🔍 How Does a Regression Tree Work?

1. Start with the entire dataset.
2. For each feature, try **all possible split points** (midpoints between consecutive sorted unique values).
3. For each candidate split, compute the **error** of the two resulting groups (usually **Mean Squared Error**, MSE), weighted by group size.
4. Choose the split that gives the **lowest total error**.
5. Repeat the process recursively on resulting subsets (left and right).
6. Stop when a stopping criterion (like depth or min samples per leaf) is met.
7. At prediction time, a sample is routed down to a leaf and the **mean of the training targets in that leaf** is used as the output.
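
The steps above can be sketched in a few lines of plain Python. This is a minimal single-feature illustration, not a production implementation: the function names (`best_split`, `build`, `predict`), the dictionary-based node layout, and the toy data are all assumptions made up for this example.

```python
from statistics import mean


def mse(ys):
    """Mean squared error of ys around their own mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)


def best_split(xs, ys, min_leaf=1):
    """Return (weighted_mse, threshold) for the best split, or None if none is valid."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive unique values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        err = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or err < best[0]:
            best = (err, t)
    return best


def build(xs, ys, depth=0, max_depth=2, min_leaf=1):
    """Grow the tree recursively; a leaf is stored as the mean of its targets."""
    split = best_split(xs, ys, min_leaf) if depth < max_depth else None
    if split is None:  # stopping criterion reached
        return mean(ys)
    threshold = split[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    lx, ly = zip(*left)
    rx, ry = zip(*right)
    return {
        "threshold": threshold,
        "left": build(lx, ly, depth + 1, max_depth, min_leaf),
        "right": build(rx, ry, depth + 1, max_depth, min_leaf),
    }


def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


# Toy data made up for illustration: two flat groups
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = build(xs, ys, max_depth=1)
print(tree)               # {'threshold': 6.5, 'left': 1.0, 'right': 5.0}
print(predict(tree, 2))   # 1.0
print(predict(tree, 11))  # 5.0
```

Here `max_depth` and `min_leaf` play the role of the stopping criteria in step 6.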

---

## 📊 Example Dataset (Three Clusters)

Let’s say the target values (`y`) form 3 natural clusters:

| X  | y    |
|----|------|
| 1  | 2    |
| 2  | 1.5  |
| 3  | 2.5  |
| 6  | 6    |
| 7  | 6.5  |
| 8  | 5.5  |
| 11 | 10   |
| 12 | 11   |
| 13 | 10.5 |

📌 You can visualize this as 3 groups (clusters) centered at:
- Cluster 1: Around 2
- Cluster 2: Around 6
- Cluster 3: Around 10.5

---

## 📈 Step-by-Step: Finding the Best Split

### Step 1: Sort the data by X:

| X  | y    |
|----|------|
| 1  | 2    |
| 2  | 1.5  |
| 3  | 2.5  |
| 6  | 6    |
| 7  | 6.5  |
| 8  | 5.5  |
| 11 | 10   |
| 12 | 11   |
| 13 | 10.5 |

---

### Step 2: Try all split points (between X values)

Possible split points = midpoints:

- 1.5, 2.5, 4.5, 6.5, 7.5, 9.5, 11.5, 12.5

At each split:
- Split data into Left and Right based on `X <= split`
- Compute **MSE** of each part
- Total Error = (Left_MSE × n_L + Right_MSE × n_R) / n
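
As a rough sketch, the candidate thresholds and the total-error formula above could be written like this. The helper names `candidate_splits` and `total_error` are illustrative, not from any library.

```python
X = [1, 2, 3, 6, 7, 8, 11, 12, 13]
y = [2, 1.5, 2.5, 6, 6.5, 5.5, 10, 11, 10.5]


def candidate_splits(xs):
    """Midpoints between consecutive sorted unique feature values."""
    values = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def mse(ys):
    """Mean squared error of a group around its own mean."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys) / len(ys)


def total_error(xs, ys, split):
    """(Left_MSE * n_L + Right_MSE * n_R) / n for the rule `x <= split`."""
    left = [v for x, v in zip(xs, ys) if x <= split]
    right = [v for x, v in zip(xs, ys) if x > split]
    return (mse(left) * len(left) + mse(right) * len(right)) / len(ys)


splits = candidate_splits(X)
print(splits)  # [1.5, 2.5, 4.5, 6.5, 7.5, 9.5, 11.5, 12.5]
for s in splits:
    print(f"split={s:>4}  total error={total_error(X, y, s):.2f}")
```

Using `sorted(set(xs))` keeps the midpoints valid even when the feature column is unsorted or contains repeated values.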

---

## ⚙️ Splitting Point Intuition

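A **splitting point** is simply a **threshold** on a feature (X) that divides the dataset into two groups: samples with `X <= threshold` and samples with `X > threshold`.

- If you plot total error against the split point, the valleys in the graph mark the **ideal split points**.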
- Since our data has **3 natural clusters**, we expect **2 local minima** in the error plot → one between cluster 1 & 2, and another between cluster 2 & 3.

### Example: Split at X = 4.5

**Left (X ≤ 4.5):** [1, 2, 3] → y = [2, 1.5, 2.5], Mean = 2, MSE ≈ 0.167  
**Right (X > 4.5):** [6, 7, 8, 11, 12, 13] → y = [6, 6.5, 5.5, 10, 11, 10.5], Mean = 8.25, MSE ≈ 5.23

**Weighted MSE:**  
Total = (3/9)×0.167 + (6/9)×5.23 ≈ **3.54**

Try all splits, and the one with **lowest total error** is selected.
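
To see where the valleys fall for the three-cluster data, one option is to scan every midpoint and flag the splits whose error is lower than both neighbours. The sketch below uses `statistics.pvariance`, since the population variance of a group equals its MSE around the group mean; the "lower than both neighbours" rule is just one simple way to define a valley.

```python
from statistics import pvariance

X = [1, 2, 3, 6, 7, 8, 11, 12, 13]
y = [2, 1.5, 2.5, 6, 6.5, 5.5, 10, 11, 10.5]


def total_error(split):
    left = [v for x, v in zip(X, y) if x <= split]
    right = [v for x, v in zip(X, y) if x > split]
    # Size-weighted average of the two group variances (= group MSEs)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(y)


# X is already sorted with no repeats, so consecutive midpoints suffice
splits = [(a + b) / 2 for a, b in zip(X, X[1:])]
errors = [total_error(s) for s in splits]

# A valley is an interior split with lower error than both of its neighbours
valleys = [
    s for i, s in enumerate(splits)
    if 0 < i < len(splits) - 1
    and errors[i] < errors[i - 1]
    and errors[i] < errors[i + 1]
]
best = splits[errors.index(min(errors))]

print(valleys)  # [4.5, 9.5]
print(best)     # 9.5
```

The two valleys sit in the gaps between the clusters, and the deeper one (X = 9.5) is the split the root node would take.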

---

# 🌳 Regression Trees: Splitting Criteria and Error Calculation

This walkthrough explains how **Regression Trees** choose the best split by minimizing **Mean Squared Error (MSE)**. We’ll go through **manual calculations** and then visualize results with code.

---

## 🧠 Dataset

We want to predict `marks` based on `hours studied`.

| Hours (X) | Marks (y) |
|-----------|-----------|
| 1         | 22        |
| 2         | 24        |
| 2.5       | 21        |
| 3         | 23        |
| 4         | 50        |
| 5         | 48        |
| 6         | 47        |
| 7         | 80        |
| 8         | 78        |
| 9         | 82        |

---

## 📏 Step 1: Try Split at X = 3.5

Split the data:
- **Left group** (X ≤ 3.5): [22, 24, 21, 23]
- **Right group** (X > 3.5): [50, 48, 47, 80, 78, 82]

### ➤ Left Group (n = 4)

Mean:
$$
\bar{y}_L = \frac{22 + 24 + 21 + 23}{4} = 22.5
$$

MSE:
$$
\text{MSE}_L = \frac{(22 - 22.5)^2 + (24 - 22.5)^2 + (21 - 22.5)^2 + (23 - 22.5)^2}{4} \\
= \frac{0.25 + 2.25 + 2.25 + 0.25}{4} = \frac{5}{4} = 1.25
$$

### ➤ Right Group (n = 6)

Mean:
$$
\bar{y}_R = \frac{50 + 48 + 47 + 80 + 78 + 82}{6} = \frac{385}{6} \approx 64.17
$$

MSE (squared deviations computed with the unrounded mean 385/6):
$$
\text{MSE}_R = \frac{(50 - 64.17)^2 + (48 - 64.17)^2 + \ldots + (82 - 64.17)^2}{6} \\
\approx \frac{200.69 + 261.36 + 294.69 + 250.69 + 191.36 + 318.03}{6} \approx 252.81
$$

### ➤ Total MSE

$$
\text{Total MSE} = \frac{4}{10} \cdot 1.25 + \frac{6}{10} \cdot 252.81 \approx 0.5 + 151.68 = 152.18
$$

---

## 📏 Step 2: Try Split at X = 6.5

Split:
- **Left (X ≤ 6.5):** [22, 24, 21, 23, 50, 48, 47]
- **Right (X > 6.5):** [80, 78, 82]

### ➤ Left Group (n = 7)

Mean:
$$
\bar{y}_L = 33.57
$$

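MSE:
$$
\text{MSE}_L = \frac{(22 - 33.57)^2 + (24 - 33.57)^2 + \ldots + (47 - 33.57)^2}{7} \approx \frac{1153.71}{7} \approx 164.82
$$

### ➤ Right Group (n = 3)

Mean = 80

MSE:
$$
\text{MSE}_R = \frac{(80-80)^2 + (78-80)^2 + (82-80)^2}{3} = \frac{0 + 4 + 4}{3} \approx 2.67
$$

### ➤ Total MSE

$$
\text{Total MSE} = \frac{7}{10} \cdot 164.82 + \frac{3}{10} \cdot 2.67 \approx 115.37 + 0.80 = 116.17
$$

✅ **Lower than the total for the split at X = 3.5**, so this is a better split.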
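
Hand arithmetic like this is easy to get slightly wrong, so it can help to recompute each group's size, mean, and MSE in code. The sketch below does that for the two splits tried so far; `group_stats` and `evaluate` are hypothetical helper names chosen for this example.

```python
hours = [1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9]
marks = [22, 24, 21, 23, 50, 48, 47, 80, 78, 82]


def group_stats(values):
    """Return (n, mean, MSE) for one side of a split."""
    n = len(values)
    mean = sum(values) / n
    mse = sum((v - mean) ** 2 for v in values) / n
    return n, mean, mse


def evaluate(split):
    """Per-group statistics and the size-weighted total MSE for `hours <= split`."""
    left = [m for h, m in zip(hours, marks) if h <= split]
    right = [m for h, m in zip(hours, marks) if h > split]
    n_l, mean_l, mse_l = group_stats(left)
    n_r, mean_r, mse_r = group_stats(right)
    total = (n_l * mse_l + n_r * mse_r) / (n_l + n_r)
    return {
        "left": (n_l, mean_l, mse_l),
        "right": (n_r, mean_r, mse_r),
        "total": total,
    }


for split in (3.5, 6.5):
    result = evaluate(split)
    print(f"split at {split}")
    for side in ("left", "right"):
        n, mean, mse = result[side]
        print(f"  {side:<5} n={n}  mean={mean:.2f}  MSE={mse:.2f}")
    print(f"  total MSE={result['total']:.2f}")
```

Running it confirms the ordering: the split at 6.5 has the lower total MSE of the two.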

---

## 📉 Final Visualization: Error vs Split Point

Below is Python code to try **all split points**, compute **total MSE**, and plot the result.

---

## 🐍 Python Code

```python
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Dataset
X = np.array([1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9])
y = np.array([22, 24, 21, 23, 50, 48, 47, 80, 78, 82])

# Step 2: MSE calculation function
def mse(values):
    if len(values) == 0:
        return 0
    return np.mean((values - np.mean(values))**2)

# Step 3: Try all split points
split_points = (X[:-1] + X[1:]) / 2
errors = []

for split in split_points:
    left = y[X <= split]
    right = y[X > split]

    # Optional: apply min_samples_leaf condition
    if len(left) >= 2 and len(right) >= 2:
        total_error = (len(left)/len(y))*mse(left) + (len(right)/len(y))*mse(right)
    else:
        total_error = float('inf')  # Penalize bad splits

    errors.append(total_error)

# Step 4: Plot
plt.figure(figsize=(10,6))
plt.plot(split_points, errors, marker='o')
plt.title("Error vs Split Point")  # plain text: default Matplotlib fonts often lack emoji glyphs
plt.xlabel("Split Point (Hours Studied)")
plt.ylabel("Total MSE")
plt.grid(True)
plt.show()
```

---

# 🌳 Regression Tree: Choosing the Best Split Feature with Calculation

We are building a regression tree to predict `Marks` using two features:
- `Hours` (hours studied)
- `CGPA` (prior GPA)

---

## 📘 Dataset

| Index | Hours | CGPA | Marks |
|-------|-------|------|-------|
| 0     | 1     | 2.1  | 22    |
| 1     | 2     | 2.5  | 24    |
| 2     | 2.5   | 2.4  | 21    |
| 3     | 3     | 2.6  | 23    |
| 4     | 4     | 3.2  | 50    |
| 5     | 5     | 3.1  | 48    |
| 6     | 6     | 3.3  | 47    |
| 7     | 7     | 3.9  | 80    |
| 8     | 8     | 3.8  | 78    |
| 9     | 9     | 4.0  | 82    |

---

## 🧮 Step-by-step: Split on Hours

Let’s try a split at Hours = 4.5

Split groups:
- Left Group (≤ 4.5): Hours = [1, 2, 2.5, 3, 4] → Marks = [22, 24, 21, 23, 50]
- Right Group (> 4.5): Hours = [5, 6, 7, 8, 9] → Marks = [48, 47, 80, 78, 82]

Mean Left = (22+24+21+23+50)/5 = 28  
MSE Left = [(22-28)² + (24-28)² + (21-28)² + (23-28)² + (50-28)²]/5  
         = [36 + 16 + 49 + 25 + 484] / 5 = 610 / 5 = 122.0

Mean Right = (48+47+80+78+82)/5 = 67  
MSE Right = [(48-67)² + (47-67)² + (80-67)² + (78-67)² + (82-67)²]/5  
          = [361 + 400 + 169 + 121 + 225] / 5 = 1276 / 5 = 255.2

Weighted Total MSE = (5/10)×122 + (5/10)×255.2 = 61 + 127.6 = 188.6

---

## 🔁 Try Split on CGPA = 3.15

Left: CGPA ≤ 3.15 → [2.1, 2.5, 2.4, 2.6, 3.1] → Marks = [22, 24, 21, 23, 48]  
Right: CGPA > 3.15 → [3.2, 3.3, 3.9, 3.8, 4.0] → Marks = [50, 47, 80, 78, 82]

Mean Left = (22+24+21+23+48)/5 = 27.6  
MSE Left = [(22-27.6)² + (24-27.6)² + (21-27.6)² + (23-27.6)² + (48-27.6)²]/5  
         = [31.36 + 12.96 + 43.56 + 21.16 + 416.16] / 5 = 525.2 / 5 = 105.04

Mean Right = (50+47+80+78+82)/5 = 67.4  
MSE Right = [(50-67.4)² + (47-67.4)² + (80-67.4)² + (78-67.4)² + (82-67.4)²]/5  
          = [302.76 + 416.16 + 158.76 + 112.36 + 213.16] / 5 = 1203.2 / 5 = 240.64

Weighted Total MSE = (5/10)×105.04 + (5/10)×240.64 = 52.52 + 120.32 = 172.84

---

## ✅ Final Decision

- MSE for split on Hours at 4.5 = 188.6
- MSE for split on CGPA at 3.15 = 172.84

→ Of these two candidates, choose CGPA ≤ 3.15 because it gives the lower error. A full tree would evaluate every threshold on every feature in the same way.
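
The comparison above looks at one threshold per feature. A sketch of the fuller search, assuming the same size-weighted MSE criterion, scans every midpoint on both features and keeps the best threshold for each. The names `weighted_mse` and `best_threshold` are illustrative only.

```python
hours = [1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9]
cgpa = [2.1, 2.5, 2.4, 2.6, 3.2, 3.1, 3.3, 3.9, 3.8, 4.0]
marks = [22, 24, 21, 23, 50, 48, 47, 80, 78, 82]


def mse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def weighted_mse(feature, target, threshold):
    """Size-weighted MSE of the two groups produced by `feature <= threshold`."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(target)


def best_threshold(feature, target):
    """Return (error, threshold) for the lowest-error midpoint of one feature."""
    values = sorted(set(feature))  # CGPA is not sorted by row, so sort first
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min((weighted_mse(feature, target, t), t) for t in candidates)


features = {"Hours": hours, "CGPA": cgpa}
results = {name: best_threshold(col, marks) for name, col in features.items()}
for name, (err, t) in results.items():
    print(f"{name}: best threshold {t:.2f}, weighted MSE {err:.2f}")
# Hours: best threshold 6.50, weighted MSE 116.17
# CGPA: best threshold 3.55, weighted MSE 116.17
```

On this dataset the two features tie, because Hours ≤ 6.5 and CGPA ≤ 3.55 separate exactly the same rows. A real implementation needs some tie-break rule in that case, for example taking the first feature examined.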

---

## 📊 Code to Visualize MSE

```python
import numpy as np
import matplotlib.pyplot as plt

Hours = np.array([1, 2, 2.5, 3, 4, 5, 6, 7, 8, 9])
CGPA = np.array([2.1, 2.5, 2.4, 2.6, 3.2, 3.1, 3.3, 3.9, 3.8, 4.0])
Marks = np.array([22, 24, 21, 23, 50, 48, 47, 80, 78, 82])

def mse(y):
    return np.mean((y - np.mean(y))**2) if len(y) > 0 else 0

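def split_errors(X, y):
    # Candidate thresholds: midpoints between consecutive sorted unique values.
    # np.unique sorts, which matters because the CGPA column is not in ascending order.
    xs = np.unique(X)
    sps = (xs[:-1] + xs[1:]) / 2
    errors = []
    for sp in sps:
        left = y[X <= sp]
        right = y[X > sp]
        # Size-weighted average of the two group MSEs
        total_error = (len(left)/len(y))*mse(left) + (len(right)/len(y))*mse(right)
        errors.append(total_error)
    return sps, errors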

sps_h, err_h = split_errors(Hours, Marks)
sps_c, err_c = split_errors(CGPA, Marks)

plt.figure(figsize=(10, 5))
plt.plot(sps_h, err_h, label="Hours", marker='o')
plt.plot(sps_c, err_c, label="CGPA", marker='s')
plt.title("MSE vs Split Point")
plt.xlabel("Split Point")
plt.ylabel("Total MSE")
plt.legend()
plt.grid(True)
plt.show()
```
