**ENSEMBLE LEARNING - BOOTSTRAP AGGREGATION (BAGGING)**

**Bagging is used to improve the stability and accuracy of machine learning algorithms by reducing variance.**

**WORKING:**

**1. Bootstrap Sampling: Bagging creates multiple subsets of the original dataset through bootstrap sampling. This method randomly selects instances from the dataset with replacement, so in each subset some instances appear more than once while others are left out.**

**2. Model Training: Each subset is used to train a separate instance of the base learning algorithm. Because each subset is slightly different due to the randomness of bootstrapping, each model fits slightly different patterns in the data.**

**3. Prediction Aggregation: After training, each model makes a prediction for every new (unseen) instance, and the individual predictions are combined. For regression tasks, predictions are typically averaged, while for classification tasks, a majority voting approach is used.**
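
**The three steps above can be sketched by hand. This is a minimal illustration rather than how `BaggingClassifier` is implemented internally; the synthetic data from `make_classification` and the names `X_demo`, `X_fit`, `X_new`, and `n_models` are assumptions introduced only for this example.**

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset used below)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit = X_demo[:200], y_demo[:200]
X_new, y_new = X_demo[200:], y_demo[200:]

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []

for _ in range(n_models):
    # Step 1: bootstrap sample, drawing row indices with replacement
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    # Step 2: train a separate tree on each bootstrap sample
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_fit[idx], y_fit[idx])
    models.append(tree)

# Step 3: aggregate by majority vote (labels are 0/1, so a mean >= 0.5 is a majority)
all_preds = np.array([m.predict(X_new) for m in models])  # shape (n_models, n_new)
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

bagged_accuracy = (bagged_pred == y_new).mean()
print(bagged_accuracy)
```

**For a regression task, the last step would instead average the predictions, for example `all_preds.mean(axis=0)`.**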



**Bagging is particularly effective when the base learning algorithm exhibits high variance, meaning it is sensitive to small changes in the training data.**
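
**A standard result helps explain why averaging reduces variance. The symbols here are assumptions introduced for illustration: suppose the ensemble averages $B$ models whose predictions at a given point each have variance $\sigma^2$ and pairwise correlation $\rho$. The variance of the averaged prediction is then:**

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

**The second term shrinks as $B$ grows, so adding models helps most when the individual models are noisy but only weakly correlated. The first term does not depend on $B$, which is why the gain levels off once the models become strongly correlated.**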

**Bagging is typically applied to base learners with low bias and high variance, such as unpruned decision trees.**

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [6]:
patients = pd.read_csv('D:/Code/Python Projects/ML Projects/Diabetes Prediction KNN/diabetes.csv')
patients.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [7]:
patients.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [21]:
patients.Outcome.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [22]:
non_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

In [23]:
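for col in non_zero:
    # A zero in these columns is not a plausible measurement, so treat it as missing
    patients[col] = patients[col].replace(0, np.nan)
    # Impute missing values with the column mean (truncated to an integer)
    mean = int(patients[col].mean(skipna=True))
    patients[col] = patients[col].replace(np.nan, mean)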

In [24]:
X= patients.iloc[:,:-1]
y= patients.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [18]:
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

614
614
154
154


In [19]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [20]:
X_train.shape

(614, 8)

**TRAIN A SINGLE DECISION TREE CLASSIFIER AS A BASELINE FOR COMPARISON**

**Calculating the score using 5-fold cross-validation**

In [53]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores1 = cross_val_score(DecisionTreeClassifier(), X_train, y_train, cv=5)
scores1

array([0.66666667, 0.62601626, 0.68292683, 0.66666667, 0.68852459])

In [55]:
scores1.mean()

0.6661602025856324

**Fit the BaggingClassifier**

In [56]:
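from sklearn.ensemble import BaggingClassifier

# No base estimator is passed, so BaggingClassifier uses its default, a DecisionTreeClassifier.
# oob_score=True scores each training sample using only the trees that did not see it.
bag_model = BaggingClassifier(n_estimators=100, max_samples=0.8, oob_score=True, random_state=0)
bag_model.fit(X_train, y_train)
bag_model.oob_score_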

0.739413680781759
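
**The out-of-bag score relies on each tree leaving some training rows unseen. The sketch below estimates how large that unseen share is for a sample of roughly `0.8 * n` draws with replacement. The names `n_rows`, `n_draws`, and `expected_oob` are introduced here for illustration, and the exact rounding that scikit-learn applies to `max_samples` is treated as an assumption.**

```python
import numpy as np

n_rows = 614                 # training-set size from the split above
n_draws = int(0.8 * n_rows)  # approximate draws per tree when max_samples=0.8

# Probability that a given row is never picked in n_draws draws with replacement
expected_oob = (1 - 1 / n_rows) ** n_draws

# Empirical check: simulate many bootstrap draws and measure the unseen share
rng = np.random.default_rng(0)
oob_fractions = []
for _ in range(1000):
    idx = rng.integers(0, n_rows, size=n_draws)
    oob_fractions.append(1 - len(np.unique(idx)) / n_rows)

print(round(expected_oob, 3), round(float(np.mean(oob_fractions)), 3))
```

**Both values come out at roughly 45%, so each training row is left out by a large share of the 100 trees, which gives the out-of-bag estimate enough votes per row.**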

In [57]:
# Accuracy on the held-out test set
test_accuracy = bag_model.score(X_test, y_test)
test_accuracy

0.8181818181818182

In [58]:
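bag_model = BaggingClassifier(n_estimators=100, max_samples=0.8, oob_score=True, random_state=0)
# Note: this cross-validates on the full, unscaled X and y, whereas the
# decision tree baseline above was cross-validated on X_train and y_train
scores2 = cross_val_score(bag_model, X, y, cv=5)
scores2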

array([0.72727273, 0.74675325, 0.75324675, 0.83006536, 0.73202614])

In [59]:
scores2.mean()

0.7578728461081403