An ensemble model combines the predictions of multiple models.

In a voting ensemble, each individual model is run on the classification problem, and the class that gets the most votes wins.

So if 4 models are run on a digit classification task, and 3 of them predict an `8` while one predicts a `3`, the vote gives us the `8`.
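
The digit example above can be sketched in a few lines of plain Python. This is only an illustration of majority voting, not how scikit-learn implements it; the helper name `hard_vote` is made up for this sketch, and its tie handling (first label seen wins) is an assumption of the sketch.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the most models (hypothetical helper)."""
    # most_common(1) returns [(label, count)] for the most frequent label;
    # on a tie, Counter keeps the label that was seen first.
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Three models say 8, one says 3
votes = [8, 8, 3, 8]
print(hard_vote(votes))  # prints 8
```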

Let's create a moons dataset to try this on.

In [16]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

In [17]:
X, y = make_moons(n_samples=1000, noise=0.5, random_state=15)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)

We will run a random forest, SVM, and logistic regression.

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [20]:
rand_clf = RandomForestClassifier()
log_clf = LogisticRegression()
svm_clf = SVC()

In [21]:
rand_clf.fit(X_train, y_train)
y_pred = rand_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7757575757575758

In [22]:
log_clf.fit(X_train, y_train)
y_pred = log_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8090909090909091

In [23]:
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8090909090909091

Now let's build the voting classifier.

In [24]:
from sklearn.ensemble import VotingClassifier

In [25]:
estimators = [('rn', rand_clf), ('lg', log_clf), ('sv', svm_clf)]

In [12]:
voting_clf = VotingClassifier(estimators=estimators, voting='hard')

In [13]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('rn',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
        

In [14]:
y_pred = voting_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8242424242424242

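The voting classifier outperformed each of the individual models: about 0.824, against 0.776 for the random forest and 0.809 for both the logistic regression and the SVM.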

Now we can try soft voting. In hard voting, the final class is chosen by frequency: it is the mode of the individual predictions. In soft voting, every prediction has a say. Each model outputs a probability for every class, these probabilities are averaged across the models, and the class with the highest average probability is chosen.
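
To make the averaging concrete, here is a minimal sketch in plain Python. The probabilities are invented for illustration and the helper name `soft_vote` is hypothetical. It shows a case where the two schemes disagree: two models lean weakly towards class 1, one model is very confident in class 0, and the average favours class 0.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models; return (winning class, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    # Mean probability of each class over all models
    averages = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    # The class with the highest mean probability wins
    winner = max(range(n_classes), key=lambda c: averages[c])
    return winner, averages

# Made-up probabilities for classes [0, 1] from three models
model_probs = [
    [0.45, 0.55],  # weakly prefers class 1
    [0.40, 0.60],  # weakly prefers class 1
    [0.90, 0.10],  # strongly prefers class 0
]

winner, averages = soft_vote(model_probs)
print(winner, averages)
```

A hard vote over the same three models would pick class 1 (two votes to one), while the soft vote picks class 0 because the confident model pulls the average up.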

In [31]:
voting_clf = VotingClassifier(estimators=estimators, voting='soft')
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
accuracy_score(y_test, y_pred)

AttributeError: predict_proba is not available when  probability=False
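
The error comes from the SVM. Soft voting needs `predict_proba` from every estimator, and `SVC` only provides it when it is created with `probability=True` (the default is `False`). A possible fix is sketched below, assuming a scikit-learn version where `SVC` accepts the `probability` argument. The `random_state` values and the name `soft_voting_clf` are additions for this sketch, so the score will not necessarily match the numbers above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and split as earlier in the notebook
X, y = make_moons(n_samples=1000, noise=0.5, random_state=15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)

estimators = [
    ('rn', RandomForestClassifier(random_state=15)),  # random_state added for repeatability
    ('lg', LogisticRegression()),
    # probability=True makes SVC expose predict_proba (training gets slower)
    ('sv', SVC(probability=True, random_state=15)),
]

soft_voting_clf = VotingClassifier(estimators=estimators, voting='soft')
soft_voting_clf.fit(X_train, y_train)
y_pred_soft = soft_voting_clf.predict(X_test)
soft_accuracy = accuracy_score(y_test, y_pred_soft)
print(soft_accuracy)
```

Setting `probability=True` makes `SVC` fit an extra internal calibration step, so it trains more slowly than the default.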