# Ensemble learning and random forests - democracy in Machine Learning

# 1. Data import

**In this notebook we use the Wisconsin Breast Cancer dataset, available in the `sklearn.datasets` module.**

**For simplicity and visualization purposes, we will focus on two features:**
- `mean texture`
- `mean symmetry`

**Our goal is to classify tumors as `malignant` or `benign` using ensemble methods.**

**This project aims to demonstrate how `ensemble learning`, by combining multiple weak or individual models, can produce stronger and more reliable predictions than any single model alone. We will compare several ensemble strategies to see how they affect performance and generalization.**

In [1]:
from sklearn import datasets

data_breast_cancer = datasets.load_breast_cancer(as_frame=True)
print(data_breast_cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

In [2]:
df_data_breast_cancer = data_breast_cancer.frame
df_data_breast_cancer.head(10)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,0


In [3]:
from sklearn.model_selection import train_test_split

X = df_data_breast_cancer.iloc[:,:-1]
y = df_data_breast_cancer['target']
X = X[["mean texture", "mean symmetry"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = True)

results_list = []
models_list = []

# 2. Base Classifiers and Voting Ensembles

## We start by training three basic classifiers:


**k-Nearest Neighbors (k-NN)**

In [4]:
import sklearn.neighbors
from sklearn.metrics import accuracy_score

knn_5_clf = sklearn.neighbors.KNeighborsClassifier() # n_neighbors = 5 by default
knn_5_clf.fit(X_train, y_train)

y_train_pred = knn_5_clf.predict(X_train)
y_test_pred = knn_5_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(knn_5_clf)

0.7824175824175824
0.6403508771929824


**Decision Tree**

In [5]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier() # max_depth=None by default: the tree grows until every leaf is pure (or too small to split)
tree_clf.fit(X_train, y_train) 

y_train_pred = tree_clf.predict(X_train)
y_test_pred = tree_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(tree_clf)

1.0
0.6578947368421053


**Logistic Regression**

In [6]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression() # default solver: 'lbfgs'
log_clf.fit(X_train, y_train) 

y_train_pred = log_clf.predict(X_train)
y_test_pred = log_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(log_clf)

0.7054945054945055
0.7368421052631579


## We then build two voting classifiers:

**Hard voting: predicts the class label that gets the most votes.**

In [7]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


log_clf = LogisticRegression()
tree_clf = DecisionTreeClassifier()
knn_5_clf = KNeighborsClassifier()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf),
                ('tc', tree_clf),
                ('knn', knn_5_clf)],
    voting='hard')

voting_clf.fit(X_train, y_train)


y_train_pred = voting_clf.predict(X_train)
y_test_pred = voting_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(voting_clf)

0.8483516483516483
0.6754385964912281


**Soft voting: averages the predicted class probabilities.**
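
The difference between the two voting rules can be sketched on a single sample. The probabilities below are made-up values for three hypothetical classifiers, and the helper names `hard_vote` and `soft_vote` are illustrative only, not part of scikit-learn:

```python
# Toy class-probability outputs from three hypothetical classifiers for one sample.
# Each row is [P(class 0), P(class 1)]; the numbers are illustrative, not from the notebook.
probas = [
    [0.55, 0.45],  # classifier A leans slightly towards class 0
    [0.60, 0.40],  # classifier B leans slightly towards class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
]

def hard_vote(probas):
    """Each classifier casts one vote for its most probable class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return max(set(votes), key=votes.count)

def soft_vote(probas):
    """Average the probabilities per class, then pick the largest mean."""
    n_classes = len(probas[0])
    means = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=means.__getitem__)

hard_prediction = hard_vote(probas)
soft_prediction = soft_vote(probas)
print(hard_prediction, soft_prediction)
```

Two lukewarm votes win under hard voting, while one confident classifier can tip the average under soft voting. This is also why poorly calibrated probabilities can hurt a soft-voting ensemble.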

In [8]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


log_clf = LogisticRegression()
tree_clf = DecisionTreeClassifier()
knn_5_clf = KNeighborsClassifier()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf),
                ('tc', tree_clf),
                ('knn', knn_5_clf)],
    voting='soft')

voting_clf.fit(X_train, y_train)


y_train_pred = voting_clf.predict(X_train)
y_test_pred = voting_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)
print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(voting_clf)

0.967032967032967
0.6578947368421053


## 🧾 Summary of Base Classifiers and Voting Ensemble Results

After training and evaluating the individual classifiers and their voting combinations, we observed the following in this run (the split is not seeded, so exact numbers vary between runs):

- Among the individual classifiers, Logistic Regression had the highest test accuracy (0.737). k-NN reached 0.640, and the unconstrained Decision Tree overfit heavily (1.000 on train, 0.658 on test).
- The **Hard Voting** ensemble (0.675 on test) beat k-NN and the Decision Tree by relying on majority consensus, but not Logistic Regression.
- The **Soft Voting** ensemble (0.658 on test) takes the prediction confidence (probabilities) of each model into account. Here it did not beat hard voting, likely because the fully grown tree outputs overconfident probabilities of 0 or 1 that dominate the average.

In general, **aggregating multiple models helps reduce variance and improve generalization**, especially when the models are diverse and make different kinds of errors. With only two features and one overfit base model, that benefit is limited in this experiment.
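
Why diversity matters can be illustrated with a short calculation. The sketch below assumes an idealized case that real models never fully satisfy: `n` classifiers that err independently, each correct with the same probability. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers are right (n odd)."""
    k_min = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

# Each classifier is right 60% of the time (an assumed value).
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three such classifiers the majority is right with probability 0.648, and the value keeps growing with `n`. Classifiers trained on the same data make correlated errors, so real ensembles gain less than this bound suggests.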


**Saving the results to a pickle file**

In [9]:
import pickle

with open('acc_vote.pkl', 'wb') as f:
    pickle.dump(results_list, f)

with open('vote.pkl', 'wb') as f:
    pickle.dump(models_list, f)  

# 3. Random Forests and Boosting algorithms

## Bagging, Pasting and Random Forest – Parallel Ensemble Methods

### Now we train several ensemble models based on **parallel aggregation of multiple estimators**, using Decision Trees as the base learners:

**Bagging (Bootstrap Aggregating): trains each tree on a random subset of the training data with replacement**

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

results_list = []
models_list = []


# Alternative configuration to try: n_estimators=500, max_samples=100, bootstrap=True, random_state=42
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30) # bootstrap=True by default
bag_clf.fit(X_train, y_train)

y_train_pred = bag_clf.predict(X_train)
y_test_pred = bag_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(bag_clf)

0.9978021978021978
0.6754385964912281


**Bagging (50%): similar to standard Bagging but uses only 50% of the data per tree**

In [11]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Alternative configuration to try: n_estimators=500, max_samples=100, bootstrap=True, random_state=42
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30, max_samples=0.5) # bootstrap=True by default; each tree draws half as many rows as the training set
bag_clf.fit(X_train, y_train)

y_train_pred = bag_clf.predict(X_train)
y_test_pred = bag_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(bag_clf)

0.9208791208791208
0.6929824561403509


**Pasting: same as Bagging but the data subsets are sampled without replacement**
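
The two sampling schemes can be compared with the standard library alone. This is a minimal sketch on row indices only; the training-set size of 1000 and the seed are arbitrary assumptions, not values from the notebook:

```python
import random

rng = random.Random(0)   # fixed seed, chosen only for repeatability
n_samples = 1000         # hypothetical training-set size
indices = list(range(n_samples))

# Bagging: draw n_samples indices WITH replacement, so duplicates are possible.
bagging_draw = rng.choices(indices, k=n_samples)

# Pasting: draw indices WITHOUT replacement; here half of the data, as with max_samples=0.5.
pasting_draw = rng.sample(indices, k=n_samples // 2)

unique_bagging = len(set(bagging_draw)) / n_samples
unique_pasting = len(set(pasting_draw)) / len(pasting_draw)
print(f"unique fraction, bagging: {unique_bagging:.3f}")  # theory: close to 1 - 1/e, about 0.632
print(f"unique fraction, pasting: {unique_pasting:.3f}")
```

A bootstrap sample of full size leaves roughly a third of the rows unseen by each tree, which is what makes the trees differ. Pasting with the full training-set size would hand every tree exactly the same rows.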

In [12]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Alternative configuration to try: n_estimators=500, max_samples=100, random_state=42
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30, bootstrap=False) # pasting; with the default max_samples=1.0 every tree sees the full training set
bag_clf.fit(X_train, y_train)

y_train_pred = bag_clf.predict(X_train)
y_test_pred = bag_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(bag_clf)

1.0
0.6578947368421053


**Pasting (50%): Pasting using only 50% of the data per tree**

In [13]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Alternative configuration to try: n_estimators=500, max_samples=100, random_state=42
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30, max_samples=0.5, bootstrap=False) # pasting: each tree gets half of the training rows, drawn without replacement
bag_clf.fit(X_train, y_train)

y_train_pred = bag_clf.predict(X_train)
y_test_pred = bag_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(bag_clf)

0.9626373626373627
0.6754385964912281


**Random Forest: an extension of Bagging where, in addition to bootstrapping samples, each split in the tree considers only a random subset of features**  
Each tree still picks the best feature and threshold for a split, but only among the features randomly drawn for that split.


In [14]:
from sklearn.ensemble import RandomForestClassifier

# Alternative configuration to try: n_estimators=500, max_leaf_nodes=16, random_state=42
rnd_clf = RandomForestClassifier(n_estimators=30) # defaults: bootstrap=True, max_samples=None (each tree draws n_samples rows), n_jobs=None
rnd_clf.fit(X_train, y_train)

y_train_pred = rnd_clf.predict(X_train)
y_test_pred = rnd_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(rnd_clf)

0.9978021978021978
0.6666666666666666


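These methods aim to **reduce model variance** by averaging the predictions of many trees trained on different data subsets, which makes the final model more robust than a single tree. In this run the effect is modest: test accuracy stays between 0.658 and 0.693 while train accuracy remains above 0.92, so the ensembles still overfit on these two features.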

## AdaBoost and Gradient Boosting – Sequential Ensemble Methods

Next, we apply **boosting** techniques, which combine multiple weak learners trained sequentially:

**AdaBoost: focuses on the samples that were misclassified by previous estimators. It adjusts the weights of training examples to prioritize harder cases.**
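
One round of the reweighting step can be sketched by hand. The example follows the discrete AdaBoost update as it is commonly described for two classes, with a learning rate of 1; the five samples and their outcomes are invented, and scikit-learn's implementation may differ in details:

```python
import math

# Five hypothetical training samples: 1 = the current weak learner got it right,
# 0 = it was misclassified. Values are illustrative only.
correct = [1, 1, 0, 1, 0]
weights = [1 / len(correct)] * len(correct)   # start from uniform weights

# Weighted error rate of the current learner.
error = sum(w for w, c in zip(weights, correct) if not c)

# Learner weight (its say in the final vote); a learning rate of 1 is assumed.
alpha = math.log((1 - error) / error)

# Increase the weight of misclassified samples, then renormalise to sum to 1.
boosted = [w * math.exp(alpha) if not c else w for w, c in zip(weights, correct)]
total = sum(boosted)
new_weights = [w / total for w in boosted]
print(round(error, 3), round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the misclassified samples carry half of the total weight, so the next learner cannot do well by repeating the same mistakes. If the learner makes no training errors, `error` is zero and there is nothing left to reweight.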

In [15]:
from sklearn.ensemble import AdaBoostClassifier

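ada_clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=30) # note: an unpruned tree fits the training set almost perfectly, leaving little for boosting to correct; the library default is a depth-1 stump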
ada_clf.fit(X_train, y_train)
            
y_train_pred = ada_clf.predict(X_train)
y_test_pred = ada_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(ada_clf)

1.0
0.6578947368421053




**Gradient Boosting: fits each new tree to the residual errors of the current ensemble (more precisely, to the negative gradient of the loss) and adds it to the ensemble scaled by a learning rate. Each tree thus attempts to correct the mistakes of the entire ensemble so far.**
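
The residual-fitting loop can be sketched in a few lines. For simplicity this example uses squared-error regression on invented 1-D data with hand-written stumps, whereas the classifier in this notebook optimizes a classification loss; `fit_stump` and `mse` are hypothetical helpers:

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor for 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy 1-D regression data (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)   # stage 0: predict the mean
history = [mse(ys, preds)]

for _ in range(5):
    residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of squared loss
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print([round(h, 3) for h in history])
```

Each stage shrinks the training error, which is also why too many stages or too large a learning rate can overfit.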

In [16]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt_clf = GradientBoostingClassifier(n_estimators= 30) # uses decision trees (only)
gbrt_clf.fit(X_train, y_train)
            
y_train_pred = gbrt_clf.predict(X_train)
y_test_pred = gbrt_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list.append((acc_train, acc_test))
models_list.append(gbrt_clf)

0.8263736263736263
0.7105263157894737


**Boosting** methods can reach **high accuracy** by focusing on hard-to-predict examples. In this run, Gradient Boosting had the best test accuracy of this section (0.711), while AdaBoost over unpruned trees only matched a single Decision Tree (0.658).
- Both **AdaBoost** and **Gradient Boosting** reduce **bias** but are more sensitive to **noise** than methods like Random Forest.
- They require careful **hyperparameter tuning**, especially **Gradient Boosting**, to avoid overfitting and to optimize performance.

**Saving the results to a pickle file**

In [17]:
import pickle

with open('acc_bag.pkl', 'wb') as f:
    pickle.dump(results_list, f)

with open('bag.pkl', 'wb') as f:
    pickle.dump(models_list, f)

## Feature Sampling in Bagging

The **`max_features`** parameter of `BaggingClassifier` increases the independence of the trees in the ensemble. Each tree is trained on its own randomly drawn subset of features, so the trees cannot all learn the same decision rules, which improves the robustness of the overall model. In this section we use all 30 features of the dataset instead of only two.
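
The feature draw itself is simple to sketch. The example below only mimics the idea with the standard library; the placeholder feature names and the seed are assumptions, and it does not reproduce scikit-learn's internal random number handling:

```python
import random

rng = random.Random(42)   # seed chosen only for repeatability
feature_names = [f"feature_{i}" for i in range(30)]   # placeholder names for 30 columns
n_estimators, max_features = 30, 2

# One feature subset per estimator, drawn without replacement within each subset
# (the analogue of bootstrap_features=False).
subsets = [rng.sample(feature_names, k=max_features) for _ in range(n_estimators)]

used = {name for subset in subsets for name in subset}
print(subsets[:3])
print(f"{len(used)} of {len(feature_names)} features used by at least one estimator")
```

With 30 estimators and two features each, some columns may never be drawn at all, so the ensemble's quality depends partly on which pairs happen to be selected.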

In [18]:
from sklearn.model_selection import train_test_split

X = df_data_breast_cancer.iloc[:,:-1]
y = df_data_breast_cancer['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = True)

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=30,
    max_samples=0.5,      # each tree is trained on a sample half the size of the training set
    bootstrap=True,       # rows are drawn with replacement (a row can appear more than once for the same tree)
    max_features=2,       # each tree sees only two features, drawn at random before the tree is built (default: all features)
    bootstrap_features=False,  # features are drawn without replacement (a tree never gets the same feature twice)
)
bag_clf.fit(X_train, y_train)  

y_train_pred = bag_clf.predict(X_train)
y_test_pred = bag_clf.predict(X_test)
acc_train = accuracy_score(y_train_pred, y_train)
acc_test = accuracy_score(y_test_pred, y_test)

print(acc_train)
print(acc_test)

results_list = []
models_list = []

results_list.append(acc_train)
results_list.append(acc_test)
models_list.append(bag_clf)

0.9934065934065934
0.9298245614035088


**Saving the results to a pickle file**

In [19]:
import pickle

with open('acc_fea.pkl', 'wb') as f:
    pickle.dump(results_list, f)

with open('fea.pkl', 'wb') as f:
    pickle.dump(models_list, f) 

In this part of the code, we are evaluating each individual estimator within the Bagging ensemble. Specifically, we look at:

1. **Selected Features**: For each estimator, we identify which features were selected for training the decision tree.
2. **Training and Testing Accuracy**: For each tree, we calculate the accuracy on both the training and test sets based on the selected features.

In [20]:
# list of names of columns
feature_names = X.columns

# we create a ranking list
ranking = []

rnd_clf = RandomForestClassifier(n_estimators=30) # defaults: bootstrap=True, max_samples=None, n_jobs=None; not used in the ranking below
rnd_clf.fit(X_train, y_train)

for est, feat_idxs in zip(bag_clf.estimators_, bag_clf.estimators_features_): # estimators_features_ is set by BaggingClassifier.fit and holds the feature indices each tree was given
    selected_features = feature_names[list(feat_idxs)]

    X_train_sub = X_train[selected_features]
    X_test_sub = X_test[selected_features]

    acc_train = accuracy_score(y_train, est.predict(X_train_sub))
    acc_test = accuracy_score(y_test, est.predict(X_test_sub))

    ranking.append({
        "train_acc": acc_train,
        "test_acc": acc_test,
        "features": list(selected_features)
    })

# build a DataFrame from the ranking and sort it by test accuracy, then train accuracy
import pandas as pd

df_rank = pd.DataFrame(ranking)
df_rank = df_rank.sort_values(by=["test_acc", "train_acc"], ascending=False)

# saving data to pickle file
with open("acc_fea_rank.pkl", "wb") as f:
    pickle.dump(df_rank, f)



**The output shows the ranking of different decision trees within the Bagging ensemble, sorted by their `test accuracy` and `training accuracy`**.

In [21]:
print(df_rank)

    train_acc  test_acc                                           features
12   0.947253  0.929825                  [mean concave points, worst area]
23   0.940659  0.929825                    [worst concavity, worst radius]
18   0.949451  0.921053                      [worst radius, texture error]
7    0.940659  0.894737                  [mean area, worst concave points]
6    0.927473  0.894737          [worst concave points, compactness error]
9    0.931868  0.885965                          [worst area, mean radius]
2    0.931868  0.868421                     [mean area, worst compactness]
29   0.925275  0.868421          [worst fractal dimension, mean perimeter]
26   0.920879  0.868421          [worst compactness, worst concave points]
3    0.912088  0.859649                [mean concave points, radius error]
10   0.896703  0.859649                        [worst symmetry, mean area]
8    0.896703  0.850877                     [mean perimeter, radius error]
1    0.890110  0.842105  

## 📌 Summary

In this exercise, we explored **ensemble methods** such as voting, **Bagging**, Random Forests and boosting, mostly built on **Decision Trees**. Training each tree on a different subset of rows and features reduces variance and can improve generalization.

- **Feature selection**: The per-tree ranking shows that trees built on features such as `worst area`, `worst radius` and `mean concave points` reach the highest test accuracy, which suggests these features are especially informative.
- **Performance**: With all 30 features available, the feature-sampled Bagging ensemble reached 0.930 test accuracy, matching its best two-feature trees and exceeding the others shown in the ranking. With only `mean texture` and `mean symmetry`, the ensembles improved on a single Decision Tree by about five percentage points at most.
  
This approach demonstrates how combining multiple models with diverse perspectives can lead to more robust and reliable predictions, especially in complex tasks like tumor classification.
