1. Implement simple bagging with n decision trees on a small dataset
2. Implement simple random forest with n decision trees on a small dataset
3. Compare the performance and results of single trees (voting predict), bagging, and a random forest

In [1]:
# assume n = 10
import numpy as np

n = 10
random_seed = 42

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)

print(data.DESCR)

X, y = data.data.to_numpy().astype('float'), data.target.to_numpy().astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=random_seed, stratify=y)

data.frame.head()

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [8]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from scipy import stats

In [9]:
class VotingOfTreesClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n):            
        self.n = n    
        self.trees = []
    def fit(self, X, y):        
        for i in range(self.n):            
            tree = DecisionTreeClassifier(
                criterion='gini',
                max_depth=None,
                max_features=None,
                min_samples_split=2,
                random_state=i,                
            )            
            tree.fit(X, y)            
            self.trees.append(tree)
        return self
    
    def predict(self,X):
        Y = []
        for tree in self.trees:                        
            Y.append(tree.predict(X))
        Y = np.array(Y)                       
        return stats.mode(Y, axis=0).mode.flatten()

In [10]:
class BaggingOfTreesClassifier(VotingOfTreesClassifier):    
    def __init__(self, n, random_state=None):
        super().__init__(n)
        self.rng = np.random.default_rng(seed=random_state)
        
    def fit(self, X, y):
        num_samples = X.shape[0]
        for i in range(self.n):            
            indices = np.arange(num_samples)            
            bootstrap_indices = self.rng.choice(indices, size=num_samples, replace=True)            
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]
            
            tree = DecisionTreeClassifier(
                criterion='gini',
                max_depth=None,
                max_features=None,
                min_samples_split=2,
                random_state=i,                
            )            
            tree.fit(X_bootstrap, y_bootstrap)            
            self.trees.append(tree)
        return self

In [11]:
class RandomForestClassifier(BaggingOfTreesClassifier):
    def fit(self, X, y):    
        num_samples = X.shape[0]            

        for i in range(self.n):                        
            bootstrap_indices = self.rng.choice(np.arange(num_samples), size=num_samples, replace=True)                                    
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]                    
            
            
            tree = DecisionTreeClassifier(
                criterion='gini',
                max_depth=None,
                max_features='sqrt',
                min_samples_split=2,
                random_state=i,                
            )            
            tree.fit(X_bootstrap, y_bootstrap)            
            self.trees.append(tree)


        return self

In [12]:
%%time
single_tree = DecisionTreeClassifier(
                criterion='gini',
                max_depth=None,
                max_features=None,
                min_samples_split=2,
                random_state=random_seed,                
            )            
single_tree.fit(X_train, y_train)
single_tree_pred = single_tree.predict(X_test)

CPU times: total: 15.6 ms
Wall time: 13.5 ms


In [13]:
%%time
simple_model = VotingOfTreesClassifier(n=n)
simple_model.fit(X_train, y_train)
simple_model_pred = simple_model.predict(X_test)

CPU times: total: 156 ms
Wall time: 141 ms


In [14]:
%%time 
bagging_model = BaggingOfTreesClassifier(n=n, random_state=random_seed)
bagging_model.fit(X_train, y_train)
bagging_model_pred = bagging_model.predict(X_test)

CPU times: total: 78.1 ms
Wall time: 115 ms


In [15]:
%%time 
rf_model = RandomForestClassifier(n=n)
rf_model.fit(X_train, y_train)
rf_model_pred = rf_model.predict(X_test)

CPU times: total: 62.5 ms
Wall time: 45 ms


In [18]:
from sklearn.metrics import classification_report

print('\nsingle tree\n', classification_report(y_true=y_test, y_pred=single_tree_pred))
print("\nvoting\n", classification_report(y_true=y_test, y_pred=simple_model_pred))
print("\nbagging\n", classification_report(y_true=y_test, y_pred=bagging_model_pred))
print("\nrandom forest\n", classification_report(y_true=y_test, y_pred=rf_model_pred))



single tree
               precision    recall  f1-score   support

           0       0.85      0.93      0.89        42
           1       0.96      0.90      0.93        72

    accuracy                           0.91       114
   macro avg       0.90      0.92      0.91       114
weighted avg       0.92      0.91      0.91       114


voting
               precision    recall  f1-score   support

           0       0.83      0.93      0.88        42
           1       0.96      0.89      0.92        72

    accuracy                           0.90       114
   macro avg       0.89      0.91      0.90       114
weighted avg       0.91      0.90      0.90       114


bagging
               precision    recall  f1-score   support

           0       0.93      0.90      0.92        42
           1       0.95      0.96      0.95        72

    accuracy                           0.94       114
   macro avg       0.94      0.93      0.93       114
weighted avg       0.94      0.94      0.

## Summary



An ensemble of identical trees offers no advantage over a single tree, as their perfect correlation prevents any reduction in variance, while needlessly increasing computational cost. Bagging successfully reduces variance by training each tree on a different bootstrap data sample, which decorrelates the models.
Random Forest enhances bagging by further decorrelating trees. By considering only a random subset of features at each split, it improves variance reduction. While this makes each split faster to compute, overall training is not guaranteed to be quicker, as the resulting "weaker" individual trees may need to be more numerous to achieve the same predictive power.

### Let's compare how classification results grow if n will be 100 (compare simple bagging and rf)

In [12]:
%%time 
bagging_model = BaggingOfTreesClassifier(n=100, random_state=random_seed)
bagging_model.fit(X_train, y_train)
bagging_model_pred = bagging_model.predict(X_test)

CPU times: total: 984 ms
Wall time: 990 ms


In [13]:
%%time 
rf_model = RandomForestClassifier(n=100)
rf_model.fit(X_train, y_train)
rf_model_pred = rf_model.predict(X_test)

CPU times: total: 328 ms
Wall time: 410 ms


In [14]:
from sklearn.metrics import classification_report

print("bagging", classification_report(y_true=y_test, y_pred=bagging_model_pred))
print("random forest", classification_report(y_true=y_test, y_pred=rf_model_pred))

bagging               precision    recall  f1-score   support

           0       0.93      0.93      0.93        42
           1       0.96      0.96      0.96        72

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

random forest               precision    recall  f1-score   support

           0       0.95      0.93      0.94        42
           1       0.96      0.97      0.97        72

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



We see that increasing count of trees in simple bagging will not improve the result as good as increasing in  random forest (because trees are not correlated to each other)