## Ensembles - wisdom of the crowd

### Introduction
Sometimes you might try a few different models and get decent performance, but a model can appear to do well while actually being biased toward a particular class, especially when accuracy is the only metric. This is where *ensemble methods* come in. By leveraging the strengths of different models, or reducing the weaknesses of a single one, ensembles are especially effective when individual models struggle with class imbalance or inconsistent performance across data segments.

More specifically, ensembles in machine learning refer to methods that combine several different models to produce a single, more powerful prediction. The idea is that by bringing together multiple perspectives (in this case, individual models), you reduce the chance of errors and often end up with a more accurate, robust result. Think of it like seeking several experts’ opinions rather than trusting just one: if each has some skill, the combined wisdom often outperforms any single expert.

Below are three popular ways of creating these ensembles:

- *Bagging* (short for "Bootstrap Aggregating"): Imagine you have a large dataset, and you repeatedly choose different random samples from it to train the same type of model multiple times.  Each of these models will see a slightly different chunk of the data, so they’ll learn something slightly different. Finally, their individual predictions are combined, typically by voting (for classification problems) or averaging (for regression problems).   This approach smooths out any oddities in your data or model, making the final prediction more stable and reliable.

- *Boosting*:  In boosting, you also build multiple models (usually of the same type), but in a sequence rather than in parallel.  Each model in the sequence aims to correct the errors made by the model before it.  In the end, the ensemble focuses more on the tricky, misclassified cases.  As a result, boosting can produce a very accurate final model, though it might also become more sensitive to noise if not carefully managed.

- *Voting*:  Bagging and boosting typically use models of the same type, whereas voting often involves different models altogether.  You might, for instance, combine the predictions of a decision tree, a neural network, and a logistic regression model all at once.  Each model "votes" on a prediction, and you take the final decision based on the majority vote (for classification) or an average (for regression).  When we blend different modelling approaches, we benefit from each model’s strengths.

In summary, ensembles are like having multiple teachers grade your paper and take the consensus. When we train several models (or teachers) and combine their answers, the final prediction (or grade) is usually more accurate than if you relied on just one model.
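
To make the voting idea concrete, here is a minimal sketch of majority voting in plain Python. The three sets of predictions are made-up values for illustration only, and `majority_vote` is a hypothetical helper, not a library function:

```python
from collections import Counter

# Hypothetical predictions from three models for the same five samples
model_predictions = [
    ["high", "low", "medium", "low", "high"],     # model A
    ["high", "low", "low", "low", "medium"],      # model B
    ["medium", "low", "medium", "high", "high"],  # model C
]


def majority_vote(predictions):
    """Return the most common label for each sample across all models."""
    votes_per_sample = zip(*predictions)  # regroup the labels sample by sample
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


ensemble_prediction = majority_vote(model_predictions)
print(ensemble_prediction)
```

Each individual model disagrees with the others on at least one sample, yet the combined answer follows whatever two out of three models agree on.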

### Install Python libraries

In [None]:
!pip install pandas numpy requests matplotlib seaborn scikit-learn

### Bike Sharing dataset
The Bike Sharing dataset was collected by Fanaee-T Hadi and Gama Joao and made publicly available via the <a href="https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset" target="_blank">UCI Machine Learning Repository</a>. It originates from the Capital Bikeshare system in Washington, D.C., and covers a two-year period from 2011 to 2012. The primary purpose was to support research in predictive modelling of bike rental demand.

The dataset captures both daily and hourly records of bike rentals, alongside relevant environmental and temporal variables. It enables exploration of how factors such as weather conditions, time of day, seasonality, and public holidays affect bike usage behaviour.

This real-world dataset has been widely used for forecasting and regression modelling, and for evaluating the performance of ensemble methods like Random Forests, which can effectively handle nonlinear patterns and mixed data types.

Here is a summary table detailing the features of the dataset. Our target variable is the column named `cnt`, which shows the total count of bike rentals on a given day. Note that the `hr` column appears only in the hourly version of the data:

| Column        | Type               | Description |
|---------------|--------------------|-------------|
| `instant`     | Integer            | Record index (row ID) |
| `dteday`      | Date (`yyyy-mm-dd`)| Calendar date |
| `season`      | Categorical (1-4)  | Season (1: spring, 2: summer, 3: autumn, 4: winter) |
| `yr`          | Binary (0/1)       | Year (0: 2011, 1: 2012) |
| `mnth`        | Integer (1-12)     | Month of the year |
| `hr`          | Integer (0-23)     | Hour of the day |
| `holiday`     | Binary (0/1)       | Indicates whether the day is a public holiday |
| `weekday`     | Integer (0-6)      | Day of the week (0: Sunday, 6: Saturday) |
| `workingday`  | Binary (0/1)       | Indicates whether the day is a working day (i.e., not a weekend or holiday) |
| `weathersit`  | Categorical (1-4)  | Weather condition:  1: Clear or partly cloudy, 2: Mist or cloudy, 3: Light rain or snow, 4: Heavy rain, snow, or fog (rare) |
| `temp`        | Float [0, 1]       | Normalised temperature (°C = `temp x 41`) |
| `atemp`       | Float [0, 1]       | Normalised "feels-like" temperature |
| `hum`         | Float [0, 1]       | Normalised humidity |
| `windspeed`   | Float [0, 1]       | Normalised wind speed |
| `casual`      | Integer            | Count of casual (non-registered) users |
| `registered`  | Integer            | Count of registered users |
| `cnt`         | Integer            | Total count of bike rentals (target variable) |


### Download the data
We will download and then unzip the dataset to a folder, and then load the daily version of the data set (there is also an hourly version):

In [None]:
import requests
import zipfile
import io
import pandas as pd

# Download the zip file
url = "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Unzip the content
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("bike_sharing_data")  # Extract to a folder


### Load the data

In [None]:
# Load the downloaded dataset
df = pd.read_csv("bike_sharing_data/day.csv")

### Exploratory Data Analysis
This is a complex dataset of features, and we can do a lot of different things with it, so our first step is to review the data:

In [None]:
# Display first few rows
df.head()

Our data has information about time, which makes this a time series dataset. Let's also look at the types of data we have and check for missing values.

In [None]:
# Explore the data types
df.info()

We create a *histogram* to show the distribution of daily bike rental counts in the dataset. It plots how often different rental amounts occurred, with the number of rentals on the x-axis and the number of days (frequency) on the y-axis. The histogram is combined with a smooth curve (kernel density estimate) that highlights the overall shape of the data distribution.

This step is important because it helps us understand the *range*, *typical values*, and *variability* in the data — for example, whether rentals were fairly consistent day to day or if there were large fluctuations. This kind of visual check is a key part of *exploratory data analysis (EDA)* and helps guide later steps like feature selection, transformation, or choosing the right modelling approach:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of daily rental counts
plt.figure(figsize=(10, 4))
sns.histplot(df['cnt'], bins=30, kde=True)
plt.title('Distribution of Daily Rental Counts')
plt.xlabel('Count')
plt.ylabel('Frequency')
plt.show()


The histogram shows the distribution of daily bike rental counts, allowing us to understand how frequently different rental volumes occurred throughout the recorded period. The shape of the distribution can reveal whether most days had similar rental activity or if there was significant variation. A symmetrical shape may suggest a normal distribution, while skewness to the left or right would indicate that lower or higher rental counts were more common. The height of the bars shows the most frequent rental ranges, helping identify what is typical in the dataset. The spread of the histogram reflects how consistent or varied the rental counts were from day to day. The smooth curve over the bars, known as the kernel density estimate (KDE), provides a clearer visual of the overall distribution pattern.

Looking at the data, a simple task to start with is to see if we can predict the volume of bike usage for a given day. This might be useful for a bike rental company, which might need to move the bikes to areas with high usage on certain days. It would make sense for us to consider using the date column (`dteday`) as a feature. We should therefore convert it to a `datetime` object to make it easier to work with:

In [None]:
# Convert 'dteday' to datetime
df['dteday'] = pd.to_datetime(df['dteday'], format='%Y-%m-%d')

# Set as index
df.set_index('dteday', inplace=True)

df.head()

Given we want to know whether there is more demand on certain days, we will use the `cnt` column as a target variable in our model. Let's plot the raw count (`cnt`) to see how it is distributed over the two years:

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

df['cnt'].plot()

plt.title("Daily total bike rentals (2011 to 2012)")

plt.ylabel("Total Rentals")
plt.xlabel("Date")

Time series data can be noisy, and so it is better to smooth the data to look for patterns in bike usage over the period captured. We will do this by exploring the trends in the data over time.

We use pandas’ `.rolling` method to create a *sliding window* of 30 days. This rolling window moves through the dataset, and at each step, it calculates the average of the current 30-day period, helping us smooth out short-term fluctuations and highlight longer-term trends:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 4))

# Plot the raw daily rental counts (semi-transparent line for clarity)
df['cnt'].plot(label='Daily Rentals', alpha=0.5)

# Calculate and plot the 30-day moving average to smooth the data and highlight trends
df['cnt'].rolling(window=30).mean().plot(label='30-Day moving avg', linewidth=2, color='red')

# Set the title and axis labels
plt.title("Daily Bike Rentals with 30-Day Moving Average")
plt.ylabel("Total Rentals")
plt.xlabel("Date")

# Add a legend to distinguish between the raw data and the moving average
plt.legend()

# Show grid lines for easier reading of values
plt.grid(True)

# Adjust layout to prevent label overlap
plt.tight_layout()

Let’s add a new column to our data to calculate the 7-day rolling average (we will call it `cnt_rolling_7`) to smooth daily variations to obtain a weekly trend:

In [None]:
# Calculate the 7-day rolling average of daily rentals to smooth short-term fluctuations
# This helps reveal weekly trends in bike rental activity
df['cnt_rolling_7'] = df['cnt'].rolling(window=7).mean()
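
One detail worth knowing about `.rolling` is that the first `window - 1` entries have no complete window, so their rolling mean is `NaN`. The sketch below illustrates this on a small made-up series (the values are not from the bike dataset):

```python
import pandas as pd

# A tiny made-up series standing in for daily rental counts
toy = pd.Series([10, 20, 30, 40, 50], name="cnt")

# With window=3, the first two entries have no complete window
rolling_3 = toy.rolling(window=3).mean()
print(rolling_3)

# min_periods=1 averages whatever values are available so far instead
rolling_3_partial = toy.rolling(window=3, min_periods=1).mean()
print(rolling_3_partial)
```

In our dataset this means the first six rows of `cnt_rolling_7` are missing, which matters later if rows with missing values are dropped.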

To explore the data in more detail, let's look at how the count of daily rentals is affected by the weather. The weather conditions on the day are represented by categorical labels (1-4) in the dataset. To make this more human-friendly, we will create a description for each label for the purpose of visualisation:

| Label | Description            |
|-------|------------------------|
| 1     | Clear / Few clouds     |
| 2     | Mist / Cloudy          |
| 3     | Light Rain / Snow      |
| 4     | Heavy Rain / Snow      |


In [None]:
# Create and store our mapping
weather_mapping = {
    1: "Clear/Few clouds",
    2: "Mist/Cloudy",
    3: "Light Rain/Snow",
    4: "Heavy Rain/Snow"
}

df['weather_label'] = df['weathersit'].map(weather_mapping)


We can now create a visualisation to better understand patterns in bike rental activity over time, while also examining how weather may play a role. The plot includes three key elements:

- the raw daily counts of bike rentals (shown as a thin line for context),
- a 7-day rolling average (plotted as a thicker line to smooth out short-term fluctuations and highlight underlying trends),
- a scatter plot overlay coloured by the weather conditions.

The weather overlay helps us see whether certain weather types are associated with higher or lower rental activity on specific days:

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))

# Plot the raw daily counts (thin line)
plt.plot(df.index, df['cnt'], alpha=0.3, label='Daily Count')

# Plot the 7-day rolling average (thicker line)
plt.plot(df.index, df['cnt_rolling_7'], linewidth=2, label='7-day Rolling Avg')

# Scatter overlay, coloured by weather
for weather_type in df['weather_label'].unique():

    subset = df[df['weather_label'] == weather_type]

    plt.scatter(subset.index, subset['cnt'], s=15, alpha=0.6, label=weather_type)

plt.title("Bike rentals (daily + 7-day Rolling Avg) with weather overlay")

plt.xlabel("Date")
plt.ylabel("Daily Rentals")

plt.legend()


When we layer these components with time series data, the plot offers a much richer, multi-dimensional view of the data. The rolling average makes long-term trends easier to spot, while the scatter points allow us to identify how different weather conditions might influence those trends on a day-to-day basis. This approach helps balance detail with clarity, offering both granular and smoothed perspectives in a single visual.

### Preprocessing the data
Now that we have explored the raw data itself, let's begin to clean and process the data so that we can construct a suitable model. We will encode the data, and also generate some additional features in the process.

#### Cyclical encodings
One of the main steps we take in preprocessing our data before training is to add *cyclical encodings*. These help machine learning models correctly interpret features that have a natural circular structure, such as months of the year or days of the week.

For example, in the `mnth` column (1 to 12), December (12) and January (1) are right next to each other in time, but a model might interpret them as being far apart numerically.

The same goes for weekdays — Sunday (0) and Monday (1) are neighbours, but so are Sunday (0) and Saturday (6).

To capture this cyclic relationship, we use *sine and cosine transformations* to map these values onto a circle. This way, the model sees that the start and end of the cycle are close together, preserving the true nature of the data.
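
As a quick check of this idea, the sketch below encodes three months as points on a circle and compares the distances between them. `encode_month` is a hypothetical helper written for this illustration:

```python
import numpy as np


def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])


jan, jun, dec = encode_month(1), encode_month(6), encode_month(12)

# Euclidean distance between the encoded points
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)

print(f"December to January: {dist_dec_jan:.3f}")
print(f"December to June:    {dist_dec_jun:.3f}")
```

Numerically, 12 and 1 are eleven units apart, but after encoding December sits much closer to January than to June, which matches the calendar.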

#### Interaction features
In addition to the cyclical encodings, we will also create *interaction features* such as `temp_x_hum`, `atemp_x_wind`, and `workingday_x_season`. These help the model learn more complex relationships between variables — for instance, the combined effect of high humidity and temperature on bike usage, or how working days behave differently depending on the season. These engineered features can significantly improve model performance by uncovering patterns that wouldn't be captured by individual variables/features alone:

In [None]:
import numpy as np

# Move 'dteday' back from the index to a regular column (this modifies df in place)
df.reset_index(inplace=True)

# Work on a copy so the engineered features are not added to df itself
data = df.copy()

# Encode the 'mnth' (month) column as cyclical features using sine and cosine:
# This helps the model understand the circular nature of months (e.g. December and January are close)
data['mnth_sin'] = np.sin(2 * np.pi * df['mnth'] / 12)
data['mnth_cos'] = np.cos(2 * np.pi * df['mnth'] / 12)

# Encode the 'weekday' column as cyclical features.
# Again, this accounts for the circular pattern in days of the week
data['weekday_sin'] = np.sin(2 * np.pi * df['weekday'] / 7)
data['weekday_cos'] = np.cos(2 * np.pi * df['weekday'] / 7)

# Create an interaction feature between temperature and humidity:
# Useful for capturing combined effects of heat and moisture on bike usage
data['temp_x_hum'] = df['temp'] * df['hum']

# Create an interaction feature between apparent temperature and windspeed
# Could reveal how perceived weather conditions affect bike rentals
# e.g. if it's too windy, it may not be a good day for biking
data['atemp_x_wind'] = df['atemp'] * df['windspeed']

# Create an interaction between working day and season:
# Helps the model detect how workdays behave differently across seasons
# e.g. maybe people only bike to work in the spring and summer, and take
# public transport on colder and wetter work days.
data['workingday_x_season'] = df['workingday'] * df['season']


In this next part of our code, we prepare the dataset for machine learning by cleaning (removing missing values), simplifying, and transforming features into a model-friendly format.

First, we remove any rows with missing values using `dropna()`, which ensures we’re only working with complete, reliable data. Next, we drop columns that are either not helpful or could introduce *data leakage* — such as `'casual'` and `'registered'`, which together make up the target variable `'cnt'`, and thus shouldn't be used as input features.

We then focus on processing *categorical variables*, such as `'season'`, `'weekday'`, and `'holiday'`, which represent fixed sets of categories rather than continuous numbers. These are first explicitly converted to the `'category'` data type.

Since we are using *ensemble models* like *Random Forest*, which can handle a large number of features well and don’t assume linearity, we can safely apply *one-hot encoding*. This transforms each category into its own binary feature, allowing the model to learn from categorical distinctions without introducing ordering or assumptions. Using `drop_first=True` helps avoid redundancy by removing one category from each set of dummy variables, keeping the data efficient and suitable for training:

In [None]:
# Remove any rows with missing values to ensure clean and complete data
data = data.dropna()

# Drop columns that are either identifiers or could cause data leakage:
# 'casual' and 'registered' add up to the target variable 'cnt' and should not be used as inputs
# 'cnt_rolling_7' is an average that includes the current day's 'cnt', so it also leaks the target
# 'instant' is just a row index, and 'weather_label' is redundant if we already have 'weathersit'
data = data.drop(columns=['instant', 'casual', 'registered', 'cnt_rolling_7', 'weather_label'])

# Define a list of categorical columns that represent categories rather than continuous values
categorical_cols = ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']

# Convert these columns to pandas 'category' data type
data[categorical_cols] = data[categorical_cols].astype('category')

# Apply one-hot encoding to the categorical variables:
# This creates binary columns for each category (excluding the first one to avoid redundancy)
# One-hot encoding works well with ensemble models like Random Forest and XGBoost
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Print the final set of column names after encoding
print(data.columns)
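
To see what `drop_first=True` does in practice, here is a small sketch on a made-up table with a single categorical column. The values are illustrative only:

```python
import pandas as pd

# Made-up example with one categorical and one numeric column
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "temp": [0.2, 0.5, 0.7, 0.3, 0.1],
})
toy["season"] = toy["season"].astype("category")

# One-hot encode 'season', dropping the first category (season 1)
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True)
print(encoded)
```

Season 1 has no column of its own: a row belongs to season 1 exactly when `season_2`, `season_3`, and `season_4` are all zero, so no information is lost.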


### Convert a regression problem to a multi-class categorical problem
Originally, the target variable `cnt` (bike rental count) is a continuous numeric value, which makes this a regression problem — the goal is to predict an exact number of bike rentals.

However, in some cases, it's useful to simplify the problem by turning it into a classification task instead. This is especially helpful for interpretability, certain model types, or when we care more about general usage levels (e.g., "low", "medium", "high") rather than precise counts.

To do this, we convert the `cnt` column into a categorical variable using `pd.qcut()`, which splits the data into quantile-based bins:

In [None]:
data['cnt_class'] = pd.qcut(data['cnt'], q=3, labels=["low", "medium", "high"])

The code divides the rental counts into three equal-sized groups (terciles), with labels:

- `low` for days with relatively fewer rentals,

- `medium` for mid-range days,

- `high` for peak rental days.

The result is a new column `cnt_class` that turns a complex regression target into a multi-class classification problem, making it easier to build models that predict general demand levels rather than specific numbers.
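
The sketch below shows how `pd.qcut` behaves on a small made-up series of counts. Unlike fixed-width bins, quantile-based bins aim for the same number of observations in each class, even when the values are unevenly spread:

```python
import pandas as pd

# Made-up rental counts, deliberately skewed toward small values
counts = pd.Series([100, 250, 400, 800, 1200, 1500, 3000, 4500, 6000])

# Split into three quantile-based classes
classes = pd.qcut(counts, q=3, labels=["low", "medium", "high"])

print(classes.tolist())
print(classes.value_counts())
```

Because the classes are roughly balanced by construction, accuracy is a more meaningful metric here than it would be with heavily imbalanced classes.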

### Prepare the train and test data (resampling)
We are almost ready to start looking at ensembles. First, we extract our training data and target labels.

In [None]:
X = data.drop(columns=['cnt', 'cnt_class', 'dteday'])
Y = data['cnt_class']

Now we split this into separate train and test sets as usual:

In [None]:
from sklearn.model_selection import train_test_split

seed = 7 # Fix the random_state to reproduce the results
test_size = 0.2

# Split into training (80%) and testing (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)


We are now ready to get started. Let's first discuss the approaches we will be covering in more detail:

### Bagging algorithms
Bootstrap Aggregation (or Bagging) is a straightforward yet powerful way of creating an ensemble of models. Implementing this approach involves the following steps:
- We take multiple samples from the training data, using a method called *sampling with replacement*. Imagine you have a deck of cards (your dataset), and you draw a card, note it down, then put it back in the deck before drawing the next one. Some cards will be chosen multiple times, while others might not appear at all in a particular sample.
- We train a model on each of these sampled datasets.
- Finally, we combine (or “aggregate”) their predictions, often by *averaging* for regression or *voting* for classification, to produce a single result.

This approach works best with models that are inherently *high variance*, meaning they can produce quite different outcomes if you nudge the training data around. Decision trees, especially when grown without pruning, are an excellent example of such high-variance models. When training an entire “forest” of these trees, each grown on a slightly different sample of the data, we smooth out any peculiarities of individual trees and end up with a more dependable prediction.
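
The sampling-with-replacement step can be sketched with NumPy as follows. This is an illustration of the idea on row indices only, not the internal implementation of any library:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n_samples = 1000
indices = np.arange(n_samples)  # stand-in for the rows of a training set

# Draw a bootstrap sample: same size as the original, with replacement
bootstrap = rng.choice(indices, size=n_samples, replace=True)

# Some rows appear several times, others not at all
unique_fraction = np.unique(bootstrap).size / n_samples
print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")
```

For large datasets, a bootstrap sample typically contains around 63% of the distinct original rows. The rows that are left out differ from sample to sample, which is what makes each model in the ensemble see a slightly different view of the data.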

Two popular methods that use Bagging are:

1. *Bagged Decision Trees*: This is simply applying the Bagging approach to the decision tree model. However, you can apply the Bagging approach to any models that usually have high variance. Models with low variance may benefit, but potentially not by much.
2. *Random Forest* – This goes a step further by adding randomness at the feature-selection stage, making the trees even more diverse and often leading to better performance.

If you would like more details on the underlying functions, have a look at the official [BaggingClassifier API Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html). It shows you how to create a bagging classifier in scikit-learn using any base model.



The code below creates a Bagging ensemble of 100 base models. We set the number of models with the parameter `n_estimators=100`. By default, `BaggingClassifier` uses a decision tree as its base model, so we do not need to define one explicitly. Each tree is trained on a “bootstrapped” sample of the data it is given, and the final prediction combines the outputs of all those trees.

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier

# Define the number of trees
num_trees = 100

# Setup the model
model = BaggingClassifier(n_estimators=num_trees, random_state=seed)

# Run k-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)

# Plot accuracy for each fold
plt.figure(figsize=(10, 4))

plt.plot(range(1, len(results) + 1), results, marker='o', linestyle='-')

plt.title("Bagged decision trees accuracy per Fold (10-Fold Cross-Validation)")

plt.xlabel("Fold Number")
plt.ylabel("Accuracy")

plt.ylim(0, 1)
plt.grid(True)


# Print average accuracy
print(f"Mean Accuracy: {results.mean():.4f}")

As we can see from the plot, some folds produced higher accuracy than others. However, by training multiple high-variance models and combining their predictions, Bagging reduces overall variance and often provides a solid boost in accuracy compared to a single decision tree trained on the entire dataset.
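
One way to check this kind of claim is to score a single decision tree and a bagged ensemble under the same cross-validation. The sketch below does so on a synthetic dataset from `make_classification` (an assumption made to keep the example self-contained; it is not the bike data), so the numbers it prints say nothing about our dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem, used purely for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=7
)

single_tree = DecisionTreeClassifier(random_state=7)
bagged_trees = BaggingClassifier(n_estimators=50, random_state=7)

# Same data and same 5-fold split strategy for both models
tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=5)
bagged_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=5)

print(f"Single tree:  mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Bagged trees: mean={bagged_scores.mean():.3f}, std={bagged_scores.std():.3f}")
```

The same comparison can be run on `X` and `Y` from this notebook by swapping in those variables.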

The `BaggingClassifier` is quite flexible: it can be used with any classification model, not just decision trees. That's one of its biggest strengths. You can apply the bagging technique (bootstrap sampling + aggregation) to a wide range of base estimators. For example, K-Nearest Neighbours (KNN) can be high variance when the number of neighbours is small, so it may also benefit from this approach:

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

base_model = KNeighborsClassifier()

model = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=seed)

# Run k-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)

# Plot accuracy for each fold
plt.figure(figsize=(10, 4))

plt.plot(range(1, len(results) + 1), results, marker='o', linestyle='-')

plt.title("KNN accuracy per Fold (10-Fold Cross-Validation)")

plt.xlabel("Fold Number")
plt.ylabel("Accuracy")

plt.ylim(0, 1)
plt.grid(True)

# Print average accuracy
print(f"Mean Accuracy: {results.mean():.4f}")

Bagging works best with unstable models — ones that change a lot if the training data is tweaked. These models benefit most from variance reduction.

### Random Forest classification
Random Forest takes the bagging concept and applies it specifically to decision trees, but it introduces an additional layer of randomness that improves performance.

Think of a Random Forest as a committee of decision trees, each trained on slightly different data and encouraged to consider different viewpoints (features) at each decision point. This diversity means that when they vote on a final answer, the group decision is usually better than relying on just one overly confident tree.

More specifically, at each split in a tree, it considers only a random subset of features rather than all features. This added randomness helps make each tree more unique, reduces the chance of them making the same errors, and leads to a more accurate and robust model overall. It also helps guard against overfitting.

The `RandomForestClassifier` has a `max_features` parameter, which controls the number of features considered when looking for the best split at each node. Setting it to 3 limits the model to only look at 3 features at a time, adding extra randomness (and potentially reducing overfitting):

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Define the number of trees
num_trees = 100

# Number of features considered at each split
max_features = 3

# Setup the Random Forest model
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)

# Run k-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)

# Plot accuracy for each fold
plt.figure(figsize=(10, 4))

plt.plot(range(1, len(results) + 1), results, marker='o', linestyle='-')

plt.title("Random Forest Accuracy per Fold (10-Fold Cross-Validation)")

plt.xlabel("Fold Number")
plt.ylabel("Accuracy")

plt.ylim(0, 1)
plt.grid(True)

# Print average accuracy
print(f"Mean Accuracy: {results.mean():.4f}")
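
To get a feel for the effect of `max_features`, the sketch below compares a few settings under the same cross-validation. It uses a synthetic dataset from `make_classification` (an assumption, to keep the example self-contained), so the best setting here will not necessarily be the best one for the bike data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem, used purely for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=7
)

mean_scores = {}

# An integer is a feature count, "sqrt" uses the square root of the number
# of features, and None considers every feature at each split
for max_features in (1, 3, "sqrt", None):
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=7
    )
    mean_scores[max_features] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()

for setting, score in mean_scores.items():
    print(f"max_features={setting}: mean accuracy={score:.3f}")
```

With `max_features=None` every tree sees all features at every split, so the only source of diversity left is the bootstrap sampling, which is essentially bagged decision trees.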

### Boosting algorithms

Boosting is a technique in machine learning that combines multiple simple models (often called "weak learners") to create a much stronger model. Instead of training all models independently like in bagging, boosting trains them one after another, where each model tries to fix the mistakes made by the previous one.

Think of it like a group of teachers marking a student's homework in turns. The first teacher marks it and misses a few things. The second teacher looks at the same homework and pays extra attention to the mistakes the first teacher missed. Then the third teacher does the same, and so on. By the end, the student gets a very well-reviewed paper!

In the following sections we cover AdaBoost, a boosting algorithm, and then the Voting Ensemble, which is a separate way of combining models rather than a boosting method.

### AdaBoost classification

AdaBoost (short for Adaptive Boosting) is one of the earliest and most popular boosting methods. It starts by training a simple model (like a small decision tree) on the dataset. It tracks which data points were classified incorrectly. Then, in the next round, it gives more importance (or "weight") to those harder-to-classify points.

In other words, imagine you're sorting socks by colour, and you keep messing up the red ones. Boosting notices this and says, "Hey, pay more attention to the red socks!" In technical terms, it increases the weight of those red socks in the next round of sorting, so you’re more likely to learn to handle them properly. The easy cases (like the green socks you got right every time) get less focus in later rounds. Each new model is therefore trained with more focus on these tricky examples. This process repeats for several rounds, each time focusing more on the errors made by the previous model.

Finally, all these models are combined into one strong model, where each contributes based on how well it performed. The result is a model that gets better and better at handling the tough cases in our data.
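
The reweighting step can be sketched by hand for a single round. The example below follows the classic two-class AdaBoost update, with made-up labels and predictions; scikit-learn's multi-class implementation differs in its details, so treat this as an illustration of the idea rather than of the library's internals:

```python
import numpy as np

# Made-up true labels and one weak learner's predictions (labels are +1 / -1)
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])  # the third sample is misclassified

# Start with equal weight on every sample
weights = np.full(len(y_true), 1 / len(y_true))

misclassified = y_true != y_pred

# Weighted error of this learner, and its say in the final vote
error = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weight of misclassified samples, decrease the rest, renormalise
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights = weights / weights.sum()

print(f"error={error:.2f}, alpha={alpha:.3f}")
print("updated weights:", weights)
```

After one round, the single misclassified sample carries half of the total weight, so the next weak learner is strongly encouraged to get it right.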


You can construct an AdaBoost model for classification using the `AdaBoostClassifier` class. The example below demonstrates the construction of 30 decision trees in sequence using the AdaBoost algorithm. By default, each of these is a very shallow tree with a single split (a "decision stump"). See the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html">API Documentation</a> to learn more about its use:

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30

kfold = KFold(n_splits=10, random_state=seed, shuffle=True)

model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

results = cross_val_score(model, X, Y, cv=kfold)

# Plot accuracy for each fold
plt.figure(figsize=(10, 4))

plt.plot(range(1, len(results) + 1), results, marker='o', linestyle='-')

plt.title("AdaBoost accuracy per Fold (10-Fold Cross-Validation)")

plt.xlabel("Fold Number")
plt.ylabel("Accuracy")

plt.ylim(0, 1)
plt.grid(True)

# Print average accuracy
print(f"Mean Accuracy: {results.mean():.4f}")

### Voting Ensemble for classification
Voting is not technically a boosting method, but it’s another way to combine multiple models. In a voting ensemble, you train different models (like a decision tree, a logistic regression, and a k-Nearest Neighbours classifier), and they each “vote” on what the correct answer is. The majority vote is chosen as the final prediction. It’s like asking several experts and going with the most common answer.

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A `VotingClassifier` then wraps the models and combines the predictions of the sub-models when asked to make predictions for new data, either by a majority vote on the predicted labels (hard voting, the default) or by averaging the predicted probabilities (soft voting). The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how best to weight the predictions from sub-models. This is called stacking (stacked generalisation) and is available in scikit-learn as `StackingClassifier`.

You can create a voting ensemble model for classification using the `VotingClassifier` class. The code below provides an example of combining the predictions of logistic regression, a decision tree, and a support vector machine for a classification problem. See the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html">API Documentation</a> for more details on this approach:

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

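# Create a list to store our sub models
estimators = []

# Logistic regression (max_iter raised from the default to help the solver converge)
lr = LogisticRegression(max_iter=1000)
estimators.append(('Logistic Regression', lr))

# Decision tree (random_state fixed for reproducibility)
dt = DecisionTreeClassifier(random_state=seed)
estimators.append(('Decision Tree', dt))

# Support vector machine
svm = SVC()
estimators.append(('SVM', svm))

# Create the ensemble model (hard voting by default)
ensemble = VotingClassifier(estimators)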

# Run k-fold cross validation
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
results = cross_val_score(ensemble, X, Y, cv=kfold)

# Plot accuracy for each fold
plt.figure(figsize=(10, 4))

plt.plot(range(1, len(results) + 1), results, marker='o', linestyle='-')

plt.title("Voting ensemble accuracy per Fold (10-Fold Cross-Validation)")

plt.xlabel("Fold Number")
plt.ylabel("Accuracy")

plt.ylim(0, 1)
plt.grid(True)

# Print average accuracy
print(f"Mean Accuracy: {results.mean():.4f}")
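
`VotingClassifier` has a `voting` parameter that switches between `'hard'` voting (majority of predicted labels) and `'soft'` voting (average of predicted class probabilities). The sketch below compares the two on a synthetic dataset from `make_classification`, which is an assumption made to keep the example self-contained. The choice of sub-models is also illustrative; all three support `predict_proba`, which soft voting requires:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem, used purely for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=7
)


def make_estimators():
    """Build a fresh list of (name, model) pairs for each ensemble."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=7)),
        ("knn", KNeighborsClassifier()),
    ]


hard_ensemble = VotingClassifier(estimators=make_estimators(), voting="hard")
soft_ensemble = VotingClassifier(estimators=make_estimators(), voting="soft")

hard_score = cross_val_score(hard_ensemble, X_demo, y_demo, cv=5).mean()
soft_score = cross_val_score(soft_ensemble, X_demo, y_demo, cv=5).mean()

print(f"Hard voting mean accuracy: {hard_score:.3f}")
print(f"Soft voting mean accuracy: {soft_score:.3f}")
```

Soft voting takes each model's confidence into account, which can help when the sub-models produce well-calibrated probabilities. Note that `SVC` only provides probabilities when created with `probability=True`.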

### What have we learnt?
We explored *ensemble learning* — an approach in machine learning where multiple models are combined to make better predictions. We began with *bagging* (Bootstrap Aggregating), which builds several models in parallel using random samples of the training data. The final prediction is made by combining the outputs (e.g. voting or averaging). Bagging helps reduce overfitting, especially with high-variance models like decision trees.

We also looked at the *Random Forest* algorithm, which builds on bagging by introducing extra randomness during the tree-building process — selecting only a random subset of features at each split. This helps to make the trees more diverse and further improves accuracy and robustness.

We then examined *boosting*, which takes a different approach by building models in sequence. Each new model focuses on correcting the mistakes of the previous one, gradually improving performance. *AdaBoost*, one of the earliest boosting methods, does this by giving more weight to harder-to-classify examples.

Finally, we looked at the *Voting Ensemble*, which combines the predictions of different types of models by majority vote, allowing us to benefit from their individual strengths.

In summary, ensemble methods like bagging, boosting, and voting often provide a more accurate and reliable way to solve classification problems than relying on a single model.