## Ensemble Learning

Ensemble learning combines several models trained on the same dataset and aggregates their predictions. The two main ensemble techniques are bagging and boosting. Bagging, short for bootstrap aggregation, trains each model on a random sample of the data drawn with replacement and then combines the predictions. Random Forest, which builds many decision trees, is one well-known bagging method. The following figure illustrates bagging.


![bagging_basic_explain](bagging_basic_explain.png)
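
The two bagging steps, bootstrap sampling and aggregation, can be sketched by hand. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the diabetes dataset), and the names `X_demo`, `trees`, and `bagged_pred` are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: every tree votes and the majority class wins
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()
print(f"Bagged accuracy: {bagged_acc:.3f}")
```

Library implementations add refinements such as out-of-bag scoring, so this loop is a teaching aid rather than a replacement for them.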




In [3]:
import pandas as pd

#### Loading the dataset

Download the dataset from: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

In [5]:
df = pd.read_csv("datasets/diabetes.csv")
df.head()

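|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |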


#### Checking the missing values

In [6]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

No column contains null values, so no imputation is needed.

#### Defining the X and y variables

In [7]:
X = df.drop("Outcome", axis="columns")
y = df.Outcome

#### Dataset Scaling

Dataset scaling transforms the features onto a common scale. One option is to map each feature to a fixed range such as 0 to 1, -1 to 1, or 0 to 100. The StandardScaler used below instead standardizes each feature to zero mean and unit variance.

Scaling keeps features with large numeric values from dominating features with small ones during model training.
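
The difference between range scaling and standardization can be seen on a toy matrix. The values in `X_toy` are invented for illustration and `MinMaxScaler` is shown only for contrast; it is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (illustrative values, not taken from the dataset)
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

X_minmax = MinMaxScaler().fit_transform(X_toy)  # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X_toy)   # each column: mean 0, std 1

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After either transform the two columns are directly comparable, even though the raw values differ by two orders of magnitude.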

In [10]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [11]:
# View the first three scaled rows
X_scaled[:3]

array([[ 0.63994726,  0.84832379,  0.14964075,  0.90726993, -0.69289057,
         0.20401277,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575,  0.53090156, -0.69289057,
        -0.68442195, -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, -1.28821221, -0.69289057,
        -1.10325546,  0.60439732, -0.10558415]])

#### Splitting the Dataset

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)
# With no test_size given, train_test_split holds out 25% for testing (75% for training)
# Number of samples in the training and testing sets
print(f"No of samples in training set {X_train.shape}\n")
print(f"No of samples in testing set {X_test.shape}")

No of samples in training set (576, 8)

No of samples in testing set (192, 8)
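
The effect of the default split ratio and of `stratify` can be checked on placeholder data. The sketch below assumes 768 rows, matching the shapes printed above; the 500/268 class balance and the names `X_fake` and `y_fake` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 768 rows; the class balance is an assumption
X_fake = np.arange(768 * 2).reshape(768, 2)
y_fake = np.array([0] * 500 + [1] * 268)

Xa, Xb, ya, yb = train_test_split(X_fake, y_fake, stratify=y_fake, random_state=10)

print(Xa.shape, Xb.shape)    # default test_size is 0.25
print(ya.mean(), yb.mean())  # share of class 1 in each part
```

With stratification, the share of positive labels is nearly identical in both parts, so the test score is not skewed by an unlucky class mix.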


#### Model building using Decision Tree Classifier

In [20]:
from sklearn.tree import DecisionTreeClassifier

We will evaluate the decision tree classifier with k-fold cross-validation, which splits the dataset into k subsets (folds) of roughly equal size.

In each of the k iterations, the model is trained on k-1 folds and scored on the remaining fold. The mean of the k accuracy scores is then reported.
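
What cross-validation does internally can be sketched as an explicit loop. The example below uses synthetic data and assumes the stratified, unshuffled folds that scikit-learn applies to classifiers when `cv` is an integer; the names `X_cv`, `manual_scores`, and `auto_scores` are made up for this sketch.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    # Train a fresh copy on four folds, score it on the held-out fold
    fold_model = clone(model).fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function should reproduce the same five scores
auto_scores = cross_val_score(model, X_cv, y_cv, cv=5)
print(manual_scores.mean(), auto_scores.mean())
```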

In [21]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.68831169, 0.68831169, 0.72077922, 0.78431373, 0.71895425])

In [23]:
# Getting mean accuracy score
scores.mean()

0.7201341142517613

The mean cross-validated accuracy is about 0.72. We can now build the same model with a bagging algorithm and compare the accuracy scores.

#### Implementing Bagging Algorithms

The BaggingClassifier fits each base (weak) learner on a randomly sampled subset of the training data.

It then aggregates the learners' predictions by voting to produce the final prediction. Here we use DecisionTreeClassifier as the base learner.

In [25]:
from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator before scikit-learn 1.2
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    oob_score=True,
    random_state=0
)

In [26]:
# fitting the model
bag_model.fit(X_train, y_train)



In [27]:
# Out-of-bag (OOB) accuracy: each training row is scored only by the estimators that did not sample it
bag_model.oob_score_

0.7534722222222222
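
The out-of-bag score works because every bootstrap sample leaves some rows out. The simulation below estimates how many, assuming 576 training rows and 460 draws per estimator (80% of 576, rounded down). It models only the sampling step, not scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 576                # size of the training split above
n_draw = int(0.8 * n_rows)  # max_samples=0.8 gives 460 draws per estimator

oob_fractions = []
for _ in range(1000):
    drawn = rng.integers(0, n_rows, size=n_draw)  # sampling with replacement
    # Rows that were never drawn are out-of-bag for this estimator
    oob_fractions.append(1 - len(np.unique(drawn)) / n_rows)

mean_oob = float(np.mean(oob_fractions))
expected = (1 - 1 / n_rows) ** n_draw  # chance that a given row is never drawn
print(f"simulated: {mean_oob:.3f}, expected: {expected:.3f}")
```

Each row is therefore unseen by a sizeable share of the estimators, which is what allows an accuracy estimate without a separate validation set.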

In [28]:
# Accuracy Score with test set
bag_model.score(X_test, y_test)

0.7760416666666666

#### Building the model using Random Forest Classifier

A Random Forest classifier trains many decision trees, each on a different random subset of the data, which makes it a typical example of a bagging algorithm.

Random Forest uses bagging underneath to sample the rows with replacement. It also samples columns: each split considers only a random subset of the features. The trees' predictions are then aggregated into the final prediction.
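
A minimal sketch of these settings on synthetic data follows. The data and the name `rf` are assumptions for illustration; `max_features="sqrt"` is written out only to make the per-split feature sampling visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_rf, y_rf = make_classification(n_samples=400, n_features=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # number of candidate features considered at each split
    oob_score=True,       # reuse the bootstrap leftovers for an accuracy estimate
    random_state=0,
).fit(X_rf, y_rf)

print(len(rf.estimators_))               # number of fitted trees
print(rf.oob_score_)                     # out-of-bag accuracy estimate
print(rf.feature_importances_.round(3))  # importances sum to 1
```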

In [32]:
from sklearn.ensemble import RandomForestClassifier

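# Assign to `scores` so the next cells read the random forest results,
# not the decision tree scores computed earlier
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
scores

In [33]:
scores.mean()

Because no random_state is set, the exact random forest scores vary from run to run.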
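
For a like-for-like comparison, all three models can be scored on identical folds. The sketch below uses synthetic data, so its numbers say nothing about the diabetes dataset; the names `models` and `results` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_cmp, y_cmp = make_classification(n_samples=400, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for all

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Base learner passed positionally, which avoids the keyword rename
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {name: cross_val_score(m, X_cmp, y_cmp, cv=cv).mean() for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Fixing the folds and the seeds means that any difference in the printed means comes from the models rather than from the split.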