# AdaBoost Machine Learning Algorithm


---

## Recap: Bagging and Boosting
Previously, we discussed **two ensemble techniques**:

1. **Bagging**
2. **Boosting**

### Bagging
- We covered the **Random Forest Classifier** and **Random Forest Regressor**.
- Bagging uses **base learners** (like decision trees) and aims to reduce **variance** by combining multiple learners trained on random samples.

### Boosting
- Boosting uses **weak learners** sequentially to form a strong learner.
- We will cover the following boosting algorithms:
  1. **AdaBoost**
  2. **Gradient Boosting**
  3. **XGBoost**
- Both bagging and boosting can solve **classification** and **regression** problems.
- **Decision trees** are commonly used as the base learner.

---

## Decision Trees and Overfitting

- A **Decision Tree** grown to full depth can lead to **overfitting**:
  - **High training accuracy** → Low bias
  - **Low test accuracy** → High variance

### Random Forest vs Single Decision Tree
- Random Forest uses **multiple decision trees** (base learners).
- Each tree is trained on a random sample of the dataset.
- Random Forest reduces **variance**, achieving **low bias and low variance**.

---

## Boosting

- Boosting is sequential:
  1. Train a **weak learner** (e.g., shallow decision tree).
  2. Focus on records wrongly predicted by the previous learner.
  3. Pass misclassified records to the next learner.
- The final model combines all weak learners into a **strong learner**.

### Weak Learner
- A weak learner is a model that **has learned little** from the training data, so on its own it performs only slightly better than random guessing.
- Commonly, **decision stumps** (trees of depth 1) are used as weak learners.

### AdaBoost
- Assigns **weights** to weak learners.
- Weighted combination forms the final prediction.

The AdaBoost function can be represented as:

$$
F(x) = \alpha_1 \cdot M_1(x) + \alpha_2 \cdot M_2(x) + \dots + \alpha_N \cdot M_N(x)
$$

Where:
- \(M_1, M_2, \dots, M_N\) are **decision stumps**.
- \(\alpha_1, \alpha_2, \dots, \alpha_N\) are **weights assigned** to each weak learner.

---

## Decision Tree Stumps

- A **decision tree stump** is a tree with **depth = 1**.
- Individually, stumps are weak learners → **underfitting**:
  - High bias
  - Low variance

- Combining multiple stumps sequentially with weights:
  - Reduces **bias**
  - Increases **variance**
  - Results in a **strong learner**

---

## Summary

- Bagging reduces **variance** by using multiple base learners.
- Boosting reduces **bias** by using multiple weak learners sequentially.
- AdaBoost:
  - Uses **decision stumps** as weak learners.
  - Assigns **weights** to learners based on performance.
  - Sequentially focuses on **misclassified records**.
  - Can solve both **classification** and **regression** problems.

---




# AdaBoost: Constructing Decision Tree Stumps

In the previous section, we discussed the **main aim of AdaBoost**:

- Combine multiple **weak learners** to create a **strong learner**.
- In AdaBoost, a **weak learner** is typically a **decision tree stump** (depth = 1).

---

## Step 1: Create Decision Tree Stumps

Consider the dataset with features:

| Salary (K) | Credit Score | Credit Approval |
|------------|--------------|----------------|
| ≤50        | Bad          | No             |
| ≤50        | Good         | Yes            |
| >50        | Good         | Yes            |
| >50        | Bad          | No             |
| ...        | ...          | ...            |

### Decision Tree Stump 1: Feature `Salary`

- Condition: `Salary ≤ 50K`
- Two outcomes: **Yes** or **No**
  
| Salary ≤50K | Count Yes | Count No |
|-------------|-----------|----------|
| True        | 2         | 2        |
| False       | 2         | 1        |

- **Misclassifications** exist because the split is not perfect.

### Decision Tree Stump 2: Feature `Credit Score`

- Condition: `Credit = Good`
- Outcomes: **Yes** or **No**

| Credit = Good | Count Yes | Count No |
|---------------|-----------|----------|
| True          | 3         | 0        |
| False         | 1         | 3        |

- This stump has a **more informative split** than the first.
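
One simple way to compare the two stumps is to count how many records each would misclassify if every leaf predicts its majority class. The sketch below does this from the two count tables above; `stump_errors` is a hypothetical helper name, not a library function.

```python
def stump_errors(leaves):
    """Count misclassified records for a stump.

    `leaves` is a list of (yes_count, no_count) pairs, one per leaf.
    Each leaf predicts its majority class, so the minority count is the error.
    """
    return sum(min(yes, no) for yes, no in leaves)

# Leaf counts copied from the tables above: (Count Yes, Count No).
salary_leaves = [(2, 2), (2, 1)]   # Salary <= 50K: True, False
credit_leaves = [(3, 0), (1, 3)]   # Credit = Good: True, False

salary_errors = stump_errors(salary_leaves)
credit_errors = stump_errors(credit_leaves)

print("Salary stump errors:", salary_errors)
print("Credit Score stump errors:", credit_errors)
```

With these counts the Salary stump misclassifies 3 of 7 records and the Credit Score stump only 1 of 7.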

---

## Selecting the Best Decision Tree Stump

- AdaBoost selects the **best stump** based on a measure of impurity:
  - **Entropy**
  - **Gini Index / Gini Impurity**

### Entropy Calculation

- For a node with 50% Yes and 50% No:

$$
Entropy = -\sum_{i} p_i \log_2(p_i) = 1
$$

- For a node with 2 Yes and 1 No:

$$
Entropy = -\left(\frac{2}{3}\log_2\frac{2}{3} + \frac{1}{3}\log_2\frac{1}{3}\right) \approx 0.918
$$

### Gini Index Calculation

- For the same 50% Yes / 50% No node:

$$
Gini = 1 - \sum_i p_i^2 = 0.5
$$

- For 2 Yes / 1 No node:

$$
Gini = 1 - \left(\left(\frac{2}{3}\right)^2 + \left(\frac{1}{3}\right)^2\right) \approx 0.444
$$

> **Observation:** The stump whose split has the **lower entropy or Gini index** (weighted across its two leaves by the number of records in each) is selected as the **first weak learner**.
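
As a rough check of this selection rule, the sketch below computes the size-weighted Gini index and entropy for both stumps from the leaf counts in Step 1. The helper names (`gini`, `entropy`, `weighted_impurity`) are illustrative assumptions.

```python
import math

def gini(counts):
    """Gini index of one leaf given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (base 2) of one leaf; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_impurity(leaves, impurity):
    """Average leaf impurity, weighted by the number of records in each leaf."""
    n = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n * impurity(leaf) for leaf in leaves)

# (Count Yes, Count No) per leaf, taken from the Step 1 tables.
salary_leaves = [(2, 2), (2, 1)]
credit_leaves = [(3, 0), (1, 3)]

salary_gini = weighted_impurity(salary_leaves, gini)
credit_gini = weighted_impurity(credit_leaves, gini)
salary_entropy = weighted_impurity(salary_leaves, entropy)
credit_entropy = weighted_impurity(credit_leaves, entropy)

print(round(salary_gini, 3), round(credit_gini, 3))
print(round(salary_entropy, 3), round(credit_entropy, 3))
```

Under these counts the Credit Score stump comes out lower on both measures (Gini roughly 0.214 versus 0.476), which is consistent with it being chosen as the first weak learner.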

---

## Summary of Step 1

1. Generate **decision tree stumps** for each feature.
2. Calculate **impurity** using **Entropy** or **Gini Index**.
3. Select the **best stump** as the **first weak learner**.

> Next step: We will see how AdaBoost updates weights and constructs the sequential model using weak learners.





# AdaBoost Step 2: Calculating Error and Performance of Stump

In Step 2 of AdaBoost, we perform two key operations:

1. **Calculate the total error of the selected stump**
2. **Determine the performance of the stump**

---

## Step 2.1: Assign Sample Weights

- Total records in dataset: 7
- Assign equal weights to all records:

$$
w_i = \frac{1}{7}, \quad i = 1, 2, \dots, 7
$$

> These sample weights will help calculate the total error of the stump.

---

## Step 2.2: Identify Errors

Consider the first selected decision tree stump based on **Credit Score = Good**:

| Credit Score | Predicted | Actual | Error? |
|--------------|-----------|--------|--------|
| Good         | Yes       | Yes    | No     |
| Not Good     | No        | Yes    | Yes    |

- **Error** occurs when the stump misclassifies a record.
- In this example, **1 record is misclassified**.

---

## Step 2.3: Calculate Total Error

- Total error is the **sum of sample weights** for all misclassified records:

$$
\text{Total Error} (\varepsilon) = \sum_{i \in \text{wrong}} w_i
$$

- Here, only **1 record is wrong**:

$$
\varepsilon = \frac{1}{7} \approx 0.1429
$$

---

## Step 2.4: Calculate Performance of Stump

- The **performance (weight) of the stump** is calculated using:

$$
\alpha = \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}
$$

- Substitute \(\varepsilon = \frac{1}{7}\):

$$
\alpha = \frac{1}{2} \ln \frac{1 - \frac{1}{7}}{\frac{1}{7}} 
= \frac{1}{2} \ln \frac{6/7}{1/7} 
= \frac{1}{2} \ln 6 
\approx 0.896
$$

- This value of \(\alpha\) is the **weight assigned** to the first weak learner in AdaBoost.
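
The two calculations in Steps 2.3 and 2.4 can be reproduced in a few lines. This is a minimal sketch that assumes the single misclassified record is the last of the 7; the variable names are illustrative.

```python
import math

n_records = 7
weights = [1.0 / n_records] * n_records      # equal initial sample weights
misclassified = [False] * 6 + [True]         # assumption: record 7 is the wrong one

# Total error: sum of the weights of the misclassified records.
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)

# Performance of the stump.
alpha = 0.5 * math.log((1 - total_error) / total_error)

print(round(total_error, 4), round(alpha, 3))
```

The smaller the total error, the larger \(\alpha\) becomes, so more accurate stumps get a larger say in the final model.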

---

## Step 2.5: Update AdaBoost Model

- The AdaBoost model function:

$$
F(x) = \alpha_1 M_1(x) + \alpha_2 M_2(x) + \dots + \alpha_N M_N(x)
$$

- For the first stump:

  - \(M_1(x)\) = Decision tree stump 1
  - \(\alpha_1 = 0.896\)

- Misclassified records from \(M_1\) will be **passed to the next stump**.
- Each subsequent stump is trained on **updated sample weights** emphasizing previous errors.

---

## Summary of Step 2

1. Assign **equal weights** to all records initially.
2. Identify **misclassified records**.
3. Compute **total error** as the sum of sample weights of wrong records.
4. Calculate **performance (α)** of the stump:

$$
\alpha = \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}
$$

5. Pass misclassified records to the **next weak learner**.
6. Repeat the process to build a **strong learner** sequentially.

> Next step: Construct the second stump, update weights, and calculate \(\alpha_2\).



# AdaBoost Step 3: Updating Weights for Correctly and Incorrectly Classified Points

In Step 3, we update the **sample weights** based on the performance of the first decision tree stump.  

- Previously, we computed:
  1. **Total errors** of the first stump
  2. **Performance (α₁)** of the first stump, which determines its weight in AdaBoost.

---

## Step 3.1: Update Weights

### Key Idea

- **Correctly classified points:** Decrease their weights  
  → Reduces their probability of being selected for the next stump.  
- **Incorrectly classified points:** Increase their weights  
  → Increases their probability of being selected for the next stump.

---

## Step 3.2: Formulas for Weight Update

Let:  
- \(w_i\) = current weight of record \(i\)  
- \(\alpha\) = performance of the stump

### Correctly Classified Points:

$$
w_i^{\text{new}} = w_i \cdot e^{-\alpha}
$$

- Example: \(w_i = \frac{1}{7}\), \(\alpha = 0.896\)

$$
w_i^{\text{new}} = \frac{1}{7} \cdot e^{-0.896} \approx 0.058
$$

### Incorrectly Classified Points:

$$
w_i^{\text{new}} = w_i \cdot e^{\alpha}
$$

- Example: \(w_i = \frac{1}{7}\), \(\alpha = 0.896\)

$$
w_i^{\text{new}} = \frac{1}{7} \cdot e^{0.896} \approx 0.349
$$

---

## Step 3.3: Updated Weights Table

| Record | Classified Correctly? | Old Weight | Updated Weight |
|--------|----------------------|------------|----------------|
| 1      | Yes                  | 1/7        | 0.058          |
| 2      | Yes                  | 1/7        | 0.058          |
| 3      | Yes                  | 1/7        | 0.058          |
| 4      | Yes                  | 1/7        | 0.058          |
| 5      | Yes                  | 1/7        | 0.058          |
| 6      | Yes                  | 1/7        | 0.058          |
| 7      | No (misclassified)   | 1/7        | 0.349          |

> ✅ Correctly classified weights decrease, incorrectly classified weights increase.
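
The table can be reproduced with a short sketch of the update rule. It assumes, as in the table, that records 1 to 6 are classified correctly and record 7 is not; the list names are illustrative.

```python
import math

alpha = 0.5 * math.log(6)                 # performance of the first stump (about 0.896)
old_weights = [1.0 / 7] * 7
correct = [True] * 6 + [False]            # record 7 is the misclassified one

# Multiply by e^(-alpha) when correct, by e^(+alpha) when wrong.
updated = [
    w * math.exp(-alpha if ok else alpha)
    for w, ok in zip(old_weights, correct)
]

for record, w in enumerate(updated, start=1):
    print(record, round(w, 3))
```

The updated weights no longer sum to 1, which is why a normalization step is needed before they can be used as selection probabilities.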

---

## Step 3.4: Summary

1. Update weights **to emphasize misclassified records**.
2. Formulas:
   - Correctly classified:  
     $$
     w_i^{\text{new}} = w_i \cdot e^{-\alpha}
     $$
   - Incorrectly classified:  
     $$
     w_i^{\text{new}} = w_i \cdot e^{\alpha}
     $$
3. This ensures that the **next decision stump focuses more on previously misclassified points**.

---

> Next Step (Step 4): Normalize the updated weights so that they sum to 1 and can be used for training the next weak learner.



# AdaBoost Step 4: Normalizing Weights and Assigning Bins

In Step 4, we normalize the updated weights from Step 3 and create bins to select records for the next decision tree stump.

---

## Step 4.1: Normalize Weights

- After updating weights, the sum of all weights may **not be 1**.
- Example:  

$$
\text{Sum of updated weights} = 6 \times 0.058 + 0.349 = 0.697
$$

- To normalize each weight:

$$
w_i^{\text{normalized}} = \frac{w_i^{\text{updated}}}{\sum_j w_j^{\text{updated}}}
$$

- Example calculations:
  - For \(w_i = 0.058\):

  $$
  w_i^{\text{normalized}} = \frac{0.058}{0.697} \approx 0.083
  $$

  - For \(w_i = 0.349\):

  $$
  w_i^{\text{normalized}} = \frac{0.349}{0.697} \approx 0.50
  $$

- After normalization, **all weights sum to 1**.
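
A minimal sketch of the normalization, starting from the rounded updated weights used above (variable names are illustrative):

```python
# Rounded updated weights from Step 3: six correct records, one misclassified.
updated = [0.058] * 6 + [0.349]

total = sum(updated)
normalized = [w / total for w in updated]

print(round(total, 3))
print([round(w, 3) for w in normalized])
print(round(sum(normalized), 6))
```

Dividing by the total keeps the ratio between any two weights unchanged; it only rescales them so they can be read as probabilities.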

---

## Step 4.2: Assign Bins

- **Purpose:** Ensure misclassified records are selected more frequently for the next stump.
- Steps:
  1. Assign ranges (bins) proportional to normalized weights.
  2. Example:

| Record | Normalized Weight | Bin Range     |
|--------|-------------------|---------------|
| 1      | 0.083             | 0.000 – 0.083 |
| 2      | 0.083             | 0.083 – 0.167 |
| 3      | 0.083             | 0.167 – 0.250 |
| 4      | 0.083             | 0.250 – 0.333 |
| 5      | 0.083             | 0.333 – 0.417 |
| 6      | 0.083             | 0.417 – 0.500 |
| 7      | 0.500             | 0.500 – 1.000 |

- **Observation:** The misclassified record (weight = 0.50) has a **larger bin**, increasing the probability of selection.
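
Bin edges are just running totals of the normalized weights. The sketch below builds them with `itertools.accumulate`, assuming the unrounded normalized weights for this example (1/12 for each correctly classified record and 1/2 for the misclassified one).

```python
from itertools import accumulate

# Unrounded normalized weights: about 0.083 for records 1-6, 0.50 for record 7.
normalized = [1 / 12] * 6 + [1 / 2]

upper_edges = list(accumulate(normalized))   # running totals
lower_edges = [0.0] + upper_edges[:-1]
bins = list(zip(lower_edges, upper_edges))

for record, (low, high) in enumerate(bins, start=1):
    print(record, round(low, 3), round(high, 3))
```

Because each bin's width equals the record's normalized weight, the bins tile the interval from 0 to 1 and the misclassified record covers half of it.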

---

## Step 4.3: Purpose of Bin Assignment

1. Records with **higher normalized weights** (misclassified points) are more likely to be selected for the next stump.
2. This ensures the **next weak learner focuses on previously misclassified points**.
3. After creating bins, the **next decision tree stump** can be trained effectively on these weighted records.

---

## Step 4.4: Summary

1. **Normalized Weights:** Divide each updated weight by the total sum of weights so that all weights sum to 1.
2. **Bin Assignment:** Create ranges proportional to normalized weights.
3. **Selection Mechanism:** Misclassified records are selected more frequently due to larger bin sizes.
4. This completes the preparation for training **Decision Tree Stump 2** in AdaBoost.

> Next Step: Selecting records for the next stump based on bins and repeating the AdaBoost steps.



# AdaBoost Step 5: Selecting New Data Points for the Next Decision Tree Stump

In Step 5, we focus on **selecting the new data points** that will be sent to the next decision tree stump using the normalized weights and bins from Step 4.

---

## Step 5.1: Random Value Generation

- We generate **random numbers between 0 and 1**, one for each record to be selected, and pick the record whose bin range contains the number.
- The table below lists the record selected for each random number.

| Salary      | Credit | Approval | Random Number |
|------------|--------|----------|---------------|
| >50K       | Normal | Yes      | 0.50          |
| <50K       | Good   | Yes      | 0.10          |
| >50K       | Normal | Yes      | 0.60          |
| >50K       | Normal | Yes      | 0.75          |
| <50K       | Good   | Yes      | 0.24          |
| >50K       | Bad    | No       | 0.32          |
| >50K       | Normal | Yes      | 0.87          |

---

## Step 5.2: Mapping Random Numbers to Bins

- Each random number is mapped to a **bin range**:

| Bin Range | Selected Record               |
|-----------|-------------------------------|
| 0.00 – 0.08 | <50K, Good, Yes              |
| 0.08 – 0.16 | <50K, Good, Yes              |
| 0.16 – 0.24 | <50K, Good, Yes              |
| 0.24 – 0.32 | <50K, Good, Yes              |
| 0.32 – 0.40 | >50K, Bad, No                |
| 0.40 – 0.90 | >50K, Normal, Yes            |

- Misclassified points have **larger bins**, increasing the probability of being selected.

---

## Step 5.3: Iterative Selection Process

1. Continue generating random numbers until **all new records** for the next stump are selected.
2. Misclassified points are likely to appear multiple times due to their larger bin ranges.
3. The selected dataset is now ready to **train the next decision tree stump**.
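
The selection loop can be sketched with the standard library's `bisect` and `random` modules. This is only an illustration: it assumes the unrounded normalized weights from Step 4 (1/12 for six records, 1/2 for the misclassified one), uses zero-based record indices, and `pick_record` is a hypothetical helper name.

```python
import bisect
import random
from itertools import accumulate

normalized = [1 / 12] * 6 + [1 / 2]
upper_edges = list(accumulate(normalized))   # upper edge of each record's bin

def pick_record(u):
    """Return the zero-based index of the record whose bin contains u."""
    index = bisect.bisect_right(upper_edges, u)
    # Guard against floating-point round-off in the last upper edge.
    return min(index, len(upper_edges) - 1)

rng = random.Random(0)                        # fixed seed so the run is repeatable
draws = [rng.random() for _ in range(len(normalized))]
selected = [pick_record(u) for u in draws]

print(selected)
```

Index 6 (record 7, the misclassified one) is expected to appear in about half of the draws on average. The same weighted sampling is also available directly as `random.choices(population, weights=..., k=...)`.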

---

## Step 5.4: Preparing Data for the Next Stump

- Steps for the next stump repeat the process:
  1. Assign equal **sample weights** to the newly selected records (e.g., \( \frac{1}{7} \) for each of the 7 records selected above).
  2. Compute **total error** and **stump performance**.
  3. Update weights, normalize, and assign bins.
  4. Select new records based on bins for the next iteration.

- Example:

| Model      | Alpha (Performance) |
|------------|-------------------|
| Decision Tree Stump 1 | 0.896            |
| Decision Tree Stump 2 | 0.65             |

- The final AdaBoost prediction will **combine all stumps** weighted by their alpha values.

---

## Step 5.5: Summary

- Step 5 ensures that **misclassified points from the previous stump** are more likely to be selected for the next stump.
- This iterative process continues until the desired number of weak learners (stumps) is created.
- The AdaBoost classifier **sequentially improves** by focusing on harder-to-classify points.
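
Putting the steps together, the whole training loop can be sketched as follows. Several things here are assumptions made for illustration: labels are encoded as +1/-1, each stump is a threshold on a single numeric feature, the toy data is hypothetical, and the sample weights are used directly in the error calculation instead of resampling records through bins (a common variant of the procedure described above).

```python
import math

def stump_predict(x, threshold, polarity):
    # Predict `polarity` at or below the threshold, the opposite class above it.
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                          # equal initial weights
    model = []                                       # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        threshold, polarity, error = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # performance of the stump
        model.append((alpha, threshold, polarity))
        # Shrink weights of correct records, grow weights of wrong ones.
        weights = [
            w * math.exp(-alpha if stump_predict(x, threshold, polarity) == y else alpha)
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)                         # normalize so weights sum to 1
        weights = [w / total for w in weights]
    return model

def adaboost_predict(model, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data: one numeric feature, labels in {+1, -1}.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
print(predictions == ys)
```

On this toy data no single threshold stump classifies every point correctly, but the weighted combination of three stumps does.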

> Next Step: Using all the stumps and their alpha values to **predict new test data**.



# AdaBoost Final Step: Making Predictions on New Test Data

In the final step, we look at how AdaBoost makes predictions for a **new test data point** in a **classification problem**.

---

## Step 6.1: Example Test Data

Suppose we have a new test data point:

- **Salary:** <50K  
- **Credit Score:** Good  

We have multiple decision tree stumps (weak learners):

- **Decision Tree Stump 1:** predicts `Yes`  
- **Decision Tree Stump 2:** predicts `No`  
- **Decision Tree Stump 3:** predicts `Yes`  
- **Decision Tree Stump 4:** predicts `No`  

---

## Step 6.2: Alpha Values (Weights)

Each stump has a **performance weight (\(\alpha\))**:

| Stump | Alpha (\(\alpha\)) | Prediction |
|-------|-----------------|------------|
| 1     | 0.896           | Yes        |
| 2     | 0.650           | No         |
| 3     | 0.244           | Yes        |
| 4     | -0.30           | No         |

- Alpha indicates the **importance** of each weak learner.
- A negative alpha occurs when a stump's total error is greater than 0.5; its prediction then effectively counts toward the opposite class.

---

## Step 6.3: Combining Stumps

The **final AdaBoost prediction function** is:

$$
F(x) = \alpha_1 \cdot \text{model}_1(x) + \alpha_2 \cdot \text{model}_2(x) + \alpha_3 \cdot \text{model}_3(x) + \alpha_4 \cdot \text{model}_4(x)
$$

- `model_i(x)` is the prediction of the i-th stump, represented as:
  - +1 for `Yes`  
  - -1 for `No`

---

## Step 6.4: Calculating Weighted Votes

- Sum the **weights of Yes predictions**:

$$
\text{Yes weight} = 0.896 + 0.244 = 1.140
$$

- Sum the **weights of No predictions**:

$$
\text{No weight} = 0.650 + (-0.30) = 0.350
$$
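
These two sums are another way of reading the signed score \(F(x)\). As a quick check, assuming the +1 for `Yes` and -1 for `No` encoding from Step 6.3:

$$
F(x) = 0.896(+1) + 0.650(-1) + 0.244(+1) + (-0.30)(-1) = 0.790
$$

$$
F(x) = \text{Yes weight} - \text{No weight} = 1.140 - 0.350 = 0.790 > 0
$$

A positive score corresponds to `Yes`. Note that the stump with negative \(\alpha\) votes `No`, yet its term pushes the score toward `Yes`.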

---

## Step 6.5: Determine Final Prediction

- Compare the total weighted votes:

$$
\text{Yes: } 1.140 \quad \text{vs} \quad \text{No: } 0.350
$$

- **Maximum weight wins:** Yes > No  
- **Final prediction:** `Yes`  

> Interpretation: For a test data point with salary <50K and a good credit score, credit **will be approved**.
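
The same decision can be sketched in code using the alpha values and stump predictions from Step 6.2. The `encode` mapping and variable names are illustrative assumptions.

```python
# (alpha, prediction) for each stump, copied from the table in Step 6.2.
stumps = [(0.896, "Yes"), (0.650, "No"), (0.244, "Yes"), (-0.30, "No")]

encode = {"Yes": 1, "No": -1}

# Signed score F(x): each stump adds alpha * (+1 or -1).
score = sum(alpha * encode[prediction] for alpha, prediction in stumps)

final_prediction = "Yes" if score > 0 else "No"

print(round(score, 3), final_prediction)
```

Taking the sign of the score gives the same answer as comparing the summed Yes and No weights, since the score is their difference.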

---

## Step 6.6: Regression Case (Optional)

For **regression problems**:

- We **do not use entropy**.  
- Use **mean squared error (MSE)** to select the best decision stump.  
- Predictions are **continuous values**, weighted by \(\alpha\).  
- All other steps remain the same.

---

## Summary

- AdaBoost combines multiple **weak learners** into a **strong learner**.  
- Each stump's prediction is weighted by its performance (\(\alpha\)).  
- Final prediction is based on the **weighted majority vote** for classification or **weighted sum** for regression.
