### Ensemble Learning
* Source blog: https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a
* Source data: https://www.kaggle.com/uciml/pima-indians-diabetes-database

We will use three different models to put into our Voting Classifier: k-Nearest Neighbors, Random Forest, and Logistic Regression. We will use the Scikit-learn library in Python to implement these methods and use the diabetes dataset in our example.

In [2]:
import pandas as pd
#read in the dataset
df = pd.read_csv('pima-india-diabetes.csv')
#take a look at the data
df.head()

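|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |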


In [3]:
#check dataset size
df.shape

(768, 9)

#### Split up the dataset into inputs and targets
Now let’s split up our dataset into inputs (X) and our target (y). Our input will be every column except ‘Outcome’, because ‘Outcome’ (whether or not the patient has diabetes) is what we will be attempting to predict. Therefore, ‘Outcome’ will be our target.
We will use the pandas ‘drop’ function to drop the column ‘Outcome’ from our dataframe and store the remaining columns in the variable ‘X’. The ‘Outcome’ column itself is stored in ‘y’.

In [6]:
#split data into inputs and targets
X = df.drop(columns = ['Outcome'])
y = df['Outcome']

#### Split the dataset into train and test data
Now we will split the dataset into training data and testing data. The training data is the data that the model will learn from. The testing data is the data we will use to see how well the model performs on unseen data.
Scikit-learn has a function we can use called ‘train_test_split’ that makes it easy for us to split our dataset into training and testing data.

In [7]:
from sklearn.model_selection import train_test_split
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In the call above, we pass four arguments to ‘train_test_split’. The first two are the input and target data we split up earlier.

Next, we will set ‘test_size’ to 0.3. This means that 30% of all the data will be used for testing, which leaves 70% of the data as training data for the model to learn from.
Setting ‘stratify’ to y makes both the training and testing splits preserve the proportion of each value in the y variable. For example, if 25% of patients in the dataset have diabetes and 75% don’t, setting ‘stratify’ to y will ensure that each split has roughly 25% of patients with diabetes and 75% of patients without diabetes.
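
To see the effect of ‘stratify’ in isolation, here is a minimal sketch on synthetic data. The arrays ‘X_demo’ and ‘y_demo’ and the ‘random_state’ value are assumptions made up for this illustration; they are not part of the diabetes dataset or the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diabetes dataset): 200 rows, 25% positive class.
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.array([1] * 50 + [0] * 150)

# Same test_size and stratify settings as in the tutorial, plus a fixed seed
# so the illustration is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Fraction of positives in each split; both should stay close to 0.25.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without ‘stratify’, the two rates could drift apart by chance, which matters most when one class is rare.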

### Building the models
Next, we have to build our models. Each model we build has a set of hyperparameters that we can tune. Tuning is the process of searching for the hyperparameter values that give your model the best accuracy. We will use grid search to find the optimal hyperparameters for each model.

Grid search works by training our model multiple times on a range of parameters that we specify. That way, we can test our model with each hyperparameter value and figure out the optimal values to get the best accuracy results.
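
Conceptually, grid search is a loop over candidate values that keeps the best-scoring one. The sketch below illustrates that idea by hand with ‘cross_val_score’ on synthetic data; the names ‘X_demo’, ‘y_demo’ and ‘candidate_ks’ are made up for this example, and it is only an approximation of what ‘GridSearchCV’ automates, not its actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid of values to try for one hyperparameter.
candidate_ks = [1, 3, 5, 7, 9]

# Score each candidate with 5-fold cross-validation and record the mean accuracy.
mean_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

# The "best" hyperparameter is the one with the highest mean score.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```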

#### Model 1 of Ensemble: k-Nearest Neighbors (k-NN)
The first model we will build is k-Nearest Neighbors (k-NN). k-NN models work by taking a data point and looking at the ‘k’ closest labeled data points. The data point is then assigned the label of the majority of the ‘k’ closest points.

For example, if k = 5, and 3 of the nearest points are ‘green’ and 2 are ‘red’, then the data point in question would be labeled ‘green’, since ‘green’ is the majority (as shown in the above graph).
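
The majority-vote rule can be sketched in a few lines of NumPy. This is a simplified illustration using Euclidean distance, not scikit-learn’s implementation; the function ‘knn_predict’ and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three 'green' points near the query, two 'red' points further away,
# and one distant 'red' point that falls outside the 5 nearest neighbors.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [3.0, 3.0], [3.2, 2.9], [8.0, 8.0]])
labels = np.array(['green', 'green', 'green', 'red', 'red', 'red'])

predicted = knn_predict(points, labels, np.array([1.5, 1.5]), k=5)
print(predicted)  # 3 green votes vs 2 red votes -> green
```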

In [9]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
#create a new knn model
knn = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)
#fit model to training data
knn_gs.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})

First, we will create a new k-NN classifier. Next, we need to create a dictionary to store all the values we will test for ‘n_neighbors’, which is the hyperparameter we need to tune. We will test 24 different values for ‘n_neighbors’. 

Then we will create our grid search, passing in our k-NN classifier, our set of hyperparameters and our cross-validation value.

Cross-validation is when the data (here, the training data) is split up into ‘k’ groups, also called folds. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.

In our case, we are using 5-fold cross-validation. The training data is split into 5 groups, and the model is trained and tested 5 separate times so that each group gets a chance to be the test set. This is how we will score our model with each hyperparameter value to see which value for ‘n_neighbors’ gives us the best score.
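
To make the fold rotation concrete, here is a small sketch of 5-fold cross-validation written out as an explicit loop. It uses ‘StratifiedKFold’ on synthetic data; the variable names and the fixed ‘n_neighbors=5’ are assumptions for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

skf = StratifiedKFold(n_splits=5)
fold_scores = []
# Track how often each sample lands in a test fold.
times_tested = np.zeros(len(y_demo), dtype=int)

for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Train on four folds, score on the held-out fold.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    times_tested[test_idx] += 1

print(fold_scores)           # one accuracy per fold
print(np.mean(fold_scores))  # the cross-validated score
```

Every sample ends up in the test fold exactly once, so the averaged score uses all of the data without ever scoring a model on points it was trained on.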

Then we will use the ‘fit’ function to run our grid search.

Now we will save our best k-NN model to ‘knn_best’ using the ‘best_estimator_’ attribute and check what the best value was for ‘n_neighbors’ using ‘best_params_’.

In [10]:
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neighbors value
print(knn_gs.best_params_)

{'n_neighbors': 11}


For the next two models, I will not go into as much detail since some parts are repeated from building the k-NN model.

#### Model 2 in Ensemble: Random Forest
The next model we will build is a random forest. A random forest is considered an ensemble model in itself, since it is a collection of decision trees combined to make a more accurate model.
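
You can see the "collection of decision trees" directly by inspecting a fitted forest. The following is a small sketch on synthetic data; ‘X_demo’, ‘y_demo’ and the ‘random_state’ values are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# The fitted forest exposes its individual trees through estimators_.
n_trees = len(forest.estimators_)
tree_type = type(forest.estimators_[0]).__name__
print(n_trees, tree_type)
```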

In [11]:
from sklearn.ensemble import RandomForestClassifier
#create a new random forest classifier
rf = RandomForestClassifier()

#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}

#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)

#fit model to training data
rf_gs.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'n_estimators': [50, 100, 200]})

We will create a new random forest classifier and set the hyperparameters we want to tune. ‘n_estimators’ is the number of trees in our random forest. Then we can run our grid search to find the optimal number of trees.
Just like before, we will save our best model and print the best ‘n_estimators’ value.

In [12]:
#save best model
rf_best = rf_gs.best_estimator_

#check best n_estimators value
print(rf_gs.best_params_)

{'n_estimators': 50}


#### Model 3 in Ensemble: Logistic Regression
Our last model is logistic regression. Even though it has ‘regression’ in its name, logistic regression is a classification method. This one is simpler since we won’t tune any hyperparameters. We just need to create and train the model.

In [18]:
from sklearn.linear_model import LogisticRegression
#create a new logistic regression model
log_reg = LogisticRegression(max_iter=1000)

#fit the model to the training data
log_reg.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

Now let’s check the accuracy scores of all three of our models on our test data.

In [20]:
#test the three models with the test data and print their accuracy scores
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))

knn: 0.7835497835497836
rf: 0.7792207792207793
log_reg: 0.8225108225108225


As you can see from the output, logistic regression is the most accurate out of the three.

### Now for the Ensemble Aggregation using Voting Classifier
Now that we’ve built our three individual models, it’s time we built our voting classifier.

In [22]:
from sklearn.ensemble import VotingClassifier
#create a list of (name, model) tuples for our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]

#create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')

First, let’s place our three models in a list called ‘estimators’, where each entry is a (name, model) tuple. Next, we will create our voting classifier. We give it two inputs. The first is our list of three models. The second is the voting parameter, which we set to ‘hard’; this tells our classifier to make predictions by majority vote.
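
Hard voting is easy to work through by hand. The sketch below uses made-up 0/1 predictions from three models (they are not outputs of the models trained above) and takes the majority label for each sample.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five samples.
preds = np.array([
    [1, 0, 1, 0, 1],  # knn
    [1, 1, 0, 0, 1],  # rf
    [0, 0, 1, 0, 1],  # log_reg
])

# With three binary voters, the majority label is 1 when at least two models vote 1.
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)  # [1 0 1 0 1]
```

Note that a single model can be outvoted even when it is right, which is one reason an ensemble does not always beat its strongest member.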

Now we can fit our ensemble model to our training data and score it on our testing data.

In [23]:
#fit model to training data
ensemble.fit(X_train, y_train)

#test our model on the test data
ensemble.score(X_test, y_test)

0.8095238095238095

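Interesting! Our ensemble model performed better than our individual k-NN and random forest models, but not better than our logistic regression model. A voting ensemble is not guaranteed to beat its best member: when the two weaker models agree on a wrong answer, they outvote the stronger one. Also keep in mind that we did not fix a random seed for the train/test split or the random forest, so the exact scores will vary from run to run.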

That’s it! You’ve now built an ensemble model to combine individual models!