Hi, in this lesson, you will discover the voting ensemble.

Voting ensembles use simple statistics to combine the predictions from multiple models.

Typically, this involves fitting several different model types on the same training dataset and then combining their predictions. For regression, the ensemble prediction is the average of the models' predictions. For classification, it is the class label that receives the most votes, which is called hard voting.
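To make the two combination rules concrete, here is a minimal NumPy sketch. The prediction arrays are made-up values for illustration only, and the helper name hard_vote is hypothetical rather than part of any library.

```python
import numpy as np

# Hypothetical class-label predictions: one row per model, one column per sample.
label_preds = np.array([
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 1, 0],  # model C
])

def hard_vote(preds):
    """Return, for each sample (column), the label predicted by the most models."""
    return np.array([np.bincount(col).argmax() for col in preds.T])

hard_labels = hard_vote(label_preds)
print(hard_labels)

# Hypothetical regression predictions from two models for two samples.
value_preds = np.array([
    [2.0, 3.0],  # model A
    [4.0, 5.0],  # model B
])

# Regression voting: average the models' predictions for each sample.
avg_values = value_preds.mean(axis=0)
print(avg_values)
```

Note that np.bincount assumes non-negative integer labels, and argmax breaks ties in favor of the smallest label.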

Voting can also be used when predicting the probability of class labels on classification problems by summing predicted probabilities and selecting the label with the largest summed probability. This is called soft voting and is preferred when the base-models used in the ensemble natively support predicting class probabilities as it can result in better performance.
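A small sketch of soft voting, again with made-up probabilities chosen only for illustration, shows how it can disagree with a simple majority of labels.

```python
import numpy as np

# Hypothetical predicted probabilities with shape (n_models, n_samples, n_classes).
probs = np.array([
    [[0.90, 0.10], [0.40, 0.60]],  # model A
    [[0.60, 0.40], [0.45, 0.55]],  # model B
    [[0.30, 0.70], [0.80, 0.20]],  # model C
])

# Soft voting: sum the probabilities across models, then take the largest per sample.
summed = probs.sum(axis=0)
soft_labels = summed.argmax(axis=1)
print(soft_labels)

# For comparison, each model's own label and the majority (hard) vote per sample.
per_model_labels = probs.argmax(axis=2)
majority_labels = np.array([np.bincount(col).argmax() for col in per_model_labels.T])
print(majority_labels)
```

For the second sample, two of the three models narrowly favor class 1, so the majority vote picks class 1, while the one confident model tips the summed probabilities toward class 0.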

Voting ensembles are available in scikit-learn via the VotingClassifier and VotingRegressor classes. A list of base-models can be provided as an argument to the model and each model in the list must be a tuple with a name and the model, e.g. ('lr', LogisticRegression()). The type of voting used for classification can be specified via the voting argument and set to either 'soft' or 'hard'.
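The regression class follows the same pattern. The sketch below is only an illustration: the choice of base-models, the dataset settings, and the variable names are assumptions, not part of the lesson.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression dataset (settings chosen arbitrarily for the sketch).
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)

# Each base-model is a (name, model) tuple, as with VotingClassifier.
reg_models = [('lin', LinearRegression()), ('tree', DecisionTreeRegressor(random_state=1))]
reg_ensemble = VotingRegressor(estimators=reg_models)
reg_ensemble.fit(X_reg, y_reg)

# The ensemble prediction is the average of the fitted base-models' predictions.
ensemble_preds = reg_ensemble.predict(X_reg[:3])
manual_avg = np.mean([est.predict(X_reg[:3]) for est in reg_ensemble.estimators_], axis=0)
print(ensemble_preds)
print(manual_avg)
```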

The complete example of evaluating a voting ensemble for classification is listed below.

In [2]:
from numpy import mean
from numpy import std

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression


In [3]:
# create the synthetic classification dataset
X, y = make_classification(random_state=1)

In [5]:
# configure the models to use in the ensemble
models = [('lr', LogisticRegression()), ('nb', GaussianNB())]

In [6]:
# configure the ensemble model to use soft voting
model = VotingClassifier(models, voting='soft')

In [7]:
# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.960 (0.061)
