# Bagging Algorithms

Bootstrap aggregation, or bagging, involves drawing multiple samples from the training dataset (with replacement) and training a separate model on each sample.

The final prediction combines the predictions of all of the sub-models: an average for regression, and a majority vote (or averaged class probabilities) for classification.
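The idea can be sketched in plain Python before turning to scikit-learn. This is only an illustration under stated assumptions: the toy one-feature dataset, the `fit_stump` weak learner, and the helper names are all hypothetical and are not part of this notebook or of any library.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical weak learner: threshold halfway between the class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # Degenerate sample with a single class: always predict that class.
        only = 0 if xs0 else 1
        return lambda x: only
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    threshold = (m0 + m1) / 2
    high = 1 if m1 >= m0 else 0  # class found on the upper side of the threshold
    return lambda x: high if x > threshold else 1 - high

def bagged_predict(models, x):
    """Aggregate the sub-models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(7)
# Toy data (assumed for illustration): (feature value, class label)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

# Train one stump per bootstrap sample.
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagged_predict(models, 1.0))
print(bagged_predict(models, 7.5))
```

Each stump sees a slightly different resampling of the data, and the vote smooths out their individual quirks. Real bagging uses stronger base learners such as decision trees.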

In [3]:
#Importing required Libraries
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

In [4]:
#Locating the dataset and creating a variable for the URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names) #Reading the data from the URL into a dataframe

In [16]:
array = dataframe.values #Converting the dataframe to a NumPy array
X = array[:,0:8] #Feature matrix: the first 8 columns of every row (the independent variables)
Y = array[:,8] #Target vector: the last column, 'class' (the dependent variable)

print("X-->", X[0])

print("Y-->", Y[0])


X--> [  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
Y--> 1.0


In [17]:
seed = 7 #Seed for the random number generator, so that runs can be repeated
num_trees = 100 #Number of trees in the forest
max_features = 3 #Number of features considered at each split

# Random Forest

Random forest is an extension of bagged decision trees.

Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between the individual classifiers. Specifically, rather than greedily choosing the best split point across all features, each split considers only a random subset of the features.
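The per-split feature sampling can be sketched as follows. This is a simplified illustration, not scikit-learn's internal implementation; the `candidate_features` helper is a hypothetical name introduced here.

```python
import random

def candidate_features(n_features, max_features, rng):
    """Pick the feature indices that one split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(7)  # fixed seed so the draws can be repeated
n_features = 8          # the dataset in this notebook has 8 input columns
max_features = 3        # same value as used in this notebook

# Each split draws its own subset, so different splits see different features.
subsets = [candidate_features(n_features, max_features, rng) for _ in range(4)]
for i, subset in enumerate(subsets):
    print("split", i, "considers feature indices", subset)
```

Because a strong feature is not available at every split, the trees end up less alike than plain bagged trees, which is what reduces the correlation between them.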

You can construct a Random Forest model for classification using the RandomForestClassifier class.

The example below builds a Random Forest classifier with 100 trees, with split points chosen from a random selection of 3 features, and evaluates it with 100-fold cross-validation.

In [33]:
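#100-fold cross-validation. random_state is omitted: it has no effect without shuffle=True,
#and recent scikit-learn versions raise a ValueError if it is set while shuffle is False
kfold = model_selection.KFold(n_splits=100)
#Building the random forest model (the forest is not seeded, so the score varies slightly between runs)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold) #Accuracy on each fold
print(results.mean()) #Mean accuracy across all folds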

0.7667857142857143
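The score above comes from an unseeded forest, so it will differ slightly from run to run. One way to get a repeatable number is to shuffle the folds and pass the same seed to both `KFold` and `RandomForestClassifier`. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for the diabetes data (so that it runs without a download) and uses 10 folds rather than 100; both choices are illustrative assumptions, and the scores it produces say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

seed = 7

# Synthetic stand-in data (assumption): 200 rows, 8 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=seed)

def evaluate():
    """Run seeded, shuffled 10-fold cross-validation and return per-fold accuracy."""
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=kfold)

scores = evaluate()
scores_again = evaluate()

print(scores.mean())
print(np.array_equal(scores, scores_again))  # identical scores on a second run
```

Seeding both objects matters: the `KFold` seed fixes which rows land in each fold, and the forest seed fixes the bootstrap samples and feature subsets.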
