# AdaBoost Algorithm - Custom Implementation

## Introduction

**AdaBoost (Adaptive Boosting)** is one of the most successful ensemble techniques for improving the accuracy of machine learning models. It combines multiple "weak learners" to form a "strong learner" through a process of learning from the misclassified data points of the previous models. The core principle behind AdaBoost is to set the weights of classifiers and training data to ensure that subsequent classifiers focus more on the examples that previous classifiers misclassified.

In this project, we will implement the AdaBoost algorithm using Python. We'll build the AdaBoost classifier from scratch, utilizing simple models (such as Decision Trees and Naive Bayes) as our weak learners. This hands-on approach will help us understand the fundamental mechanisms of AdaBoost, including weight updating and the role of error rates in shaping the sequential learning process.

## Objectives

1. **Implement the AdaBoost Algorithm**: We will develop functions to create and manage weak learners, calculate their errors, update data weights, and combine their predictions into a final ensemble decision.
   
2. **Experiment with Different Weak Learners**: Our implementation will randomly select between using a Naive Bayes classifier and a Decision Tree (stump) for each iteration, exploring how different weak learners perform within the same AdaBoost framework.

3. **Evaluate Performance**: The effectiveness of our ensemble model will be assessed using accuracy metrics. We will also visualize the performance improvements as the number of models in the ensemble increases.

4. **Parameter Tuning and Validation**: Optional steps will include tuning parameters and validating the model using techniques like cross-validation to ensure robustness.

## Dataset

The dataset used in this project (`drugY.csv`) involves predicting a categorical target based on a mix of categorical and numerical features. We will preprocess the data by converting categorical variables into dummy/indicator variables, ensuring compatibility with our model training functions.

## Structure of the Notebook

The notebook is structured into several blocks:
- **Setup**: Import necessary libraries and load the dataset.
- **Data Preprocessing**: Prepare the data for modeling, including train-test splitting and dummy encoding.
- **AdaBoost Implementation**: Define and implement the AdaBoost functions.
- **Model Training**: Train the AdaBoost ensemble using our implementation.
- **Evaluation**: Assess the model's performance and visualize results.
- **Parameter Tuning and Cross-Validation**: Further refine the model.


In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
import math
import random

In [16]:
# Data Loading and Preprocessing
data = pd.read_csv('drugY.csv')

# Preprocess data: Convert categorical variables using dummy encoding and split data
X = pd.get_dummies(data.drop('Drug', axis=1))
y = data['Drug']*2 - 1  # Assuming 'Drug' is the target and needs transformation

# Split data into training and testing sets for validation purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

data.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na,K,Drug
0,23,F,HIGH,HIGH,0.792535,0.031258,1
1,47,M,LOW,HIGH,0.739309,0.056468,0
2,47,M,LOW,HIGH,0.697269,0.068944,0
3,28,F,NORMAL,HIGH,0.563682,0.072289,0
4,61,F,LOW,HIGH,0.559294,0.030998,1


## AdaBoost algorithm

In [11]:
def initialize_weights(n):
    """ Initialize the weights uniformly for all instances.
        This ensures that initially, every instance contributes equally to the learning of the model.
    """
    return np.ones(n) / n

def train_weak_learner(X, y, sample_weights):
    """ Train a weak learner. Randomly choose between Naive Bayes and Decision Tree.
        Weak learners are simple classifiers which perform slightly better than random guessing.
    """
    if random.choice(['NB', 'DT']) == 'NB':
        learner = GaussianNB()  # Naive Bayes model
    else:
        learner = DecisionTreeClassifier(max_depth=1)  # Decision tree with depth of 1 (stump)
    learner.fit(X, y, sample_weight=sample_weights)  # Fit the model with the sample weights
    return learner

def calculate_error(predictions, actual, weights):
    """ Calculate the error of the weak learner weighted by the instance weights.
        Error is calculated as the weighted average of incorrect predictions.
    """
    is_incorrect = predictions != actual
    weighted_errors = weights[is_incorrect]  # Weights corresponding to incorrect predictions
    return weighted_errors.sum()  # Sum of weighted errors gives the total error

def update_weights(weights, alpha, predictions, actual):
    """ Update the instance weights based on the predictions.
        Incorrectly classified instances are given higher weights.
    """
    is_incorrect = predictions != actual
    weights *= np.exp(alpha * is_incorrect)  # Increase weight for incorrectly classified instances
    return weights / weights.sum()  # Normalize weights to sum to 1

def ada_boost(X, y, M):
    """ AdaBoost algorithm to create and combine weak learners.
        It iteratively adds models to the ensemble, focusing on difficult instances by adjusting their weights.
    """
    n = len(y)
    weights = initialize_weights(n)  # Initialize uniform weights
    models = []
    alphas = []

    for _ in range(M):
        model = train_weak_learner(X, y, weights)
        predictions = model.predict(X)
        error = calculate_error(predictions, y, weights)
        alpha = 0.5 * np.log((1 - error) / error)  # Calculate the weight of the model based on its accuracy
        weights = update_weights(weights, alpha, predictions, y)  # Update instance weights for the next iteration

        models.append(model)  # Save the trained model
        alphas.append(alpha)  # Save the model weight

    return models, alphas

### Model Training

In [12]:
# Number of iterations
M = 10
models, alphas = ada_boost(X_train, y_train, M)

In [13]:
def predict_ensemble(models, alphas, X):
    """ Make predictions with the AdaBoost ensemble. """
    predictions = np.array([alpha * model.predict(X) for model, alpha in zip(models, alphas)])
    return np.sign(predictions.sum(axis=0))

# Evaluate model
ensemble_predictions = predict_ensemble(models, alphas, X_test)
print('Ensemble accuracy:', accuracy_score(y_test, ensemble_predictions))

Ensemble accuracy: 0.925


In [14]:
from sklearn.model_selection import GridSearchCV

# Parameter tuning using cross-validation could be performed here for the weak learners.
# Example for Decision Tree depth tuning:
param_grid = {'max_depth': [1, 2, 3, 4, 5]}
model = DecisionTreeClassifier()
cv_model = GridSearchCV(model, param_grid, cv=5)
cv_model.fit(X_train, y_train)
print('Best parameters:', cv_model.best_params_)

Best parameters: {'max_depth': 3}
