A voting classifier is an ensemble estimator that trains several base models and makes its prediction by aggregating the predictions of the individual base estimators.


We can implement the voting classifier in two ways:


**Majority Voting**


Every model makes a prediction (a vote) for each test instance, and the final output prediction is the one that receives **more than half of the votes**. If no prediction gets more than half of the votes, we may say that the ensemble could not make a stable prediction for this instance.
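
The rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `majority_vote` and the labels `"a"`, `"b"`, `"c"` are hypothetical and do not come from the dataset used later.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None

print(majority_vote(["a", "b", "a"]))  # "a" has two of the three votes
print(majority_vote(["a", "b", "c"]))  # no label exceeds half of the votes
```

Note that scikit-learn's `VotingClassifier` with `voting="hard"` picks the most frequent label instead of requiring a strict majority, so it always returns a prediction.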


**Weighted Voting**


Unlike majority voting, where each model has the same weight, weighted voting lets us increase the importance of one or more models. In weighted voting, we count the prediction of the better models **multiple times**.
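
A minimal sketch of this idea, assuming each model casts one vote and carries a numeric weight. The helper `weighted_vote`, the labels, and the weights are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Sum the weights behind each predicted label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# The first model is trusted three times as much as the other two,
# so its vote for "a" outweighs the two votes for "b".
print(weighted_vote(["a", "b", "b"], [3, 1, 1]))
```

With equal weights the same call reduces to ordinary voting, where `"b"` would win with two votes against one.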


So in general, in ensemble methods, instead of relying on a single classifier, we learn **many weak classifiers** that are each good at different parts of the input space.

First, I import the required libraries:

In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from statistics import mean

Before implementing the ensemble algorithm, I **preprocess** the data, since the Bare Nuclei column contains **missing data** (encoded as `?`).
I replace the missing values with the **mean of the column**, as implemented below:

In [42]:
data = pd.read_csv("/Users/Nika/Desktop/cancer.csv")
# Missing entries in "Bare Nuclei" are stored as "?"; turn them into NaN
bare_nuclei = pd.to_numeric(data["Bare Nuclei"], errors="coerce")
# Fill the gaps with the mean of the observed values, assigning the result
# back to the DataFrame instead of relying on chained inplace operations
data["Bare Nuclei"] = bare_nuclei.fillna(bare_nuclei.mean())
# Cast everything to int (the imputed mean is truncated, as before)
data = data.astype(int)
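
To make the imputation step concrete, here is a small standard-library sketch on a toy list. The helper `impute_with_mean` and the sample values are hypothetical; they only mirror the idea of replacing `?` with the mean of the observed entries.

```python
from statistics import mean

def impute_with_mean(values, missing="?"):
    """Replace entries equal to `missing` with the mean of the remaining values."""
    observed = [int(v) for v in values if v != missing]
    fill = mean(observed)
    return [fill if v == missing else int(v) for v in values]

# The observed values 1, 10 and 4 have mean 5, which fills the gap.
print(impute_with_mean(["1", "10", "?", "4"]))
```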

Then I separate the features from the class labels and set up **K-fold cross-validation** with k = 10:

In [43]:
feature_columns = [
    "Clump Thickness",
    "Uniformity of Cell Size",
    "Uniformity of Cell Shape",
    "Marginal Adhesion",
    "Single Epithelial Cell Size",
    "Bare Nuclei",
    "Bland Chromatin",
    "Normal Nucleoli",
    "Mitoses",
]
Xs = data[feature_columns]
y = data["Class"].to_numpy()
kf = KFold(n_splits=10)
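
As a rough illustration of what an unshuffled K-fold split does, the sketch below generates train and test indices in plain Python. The generator `k_fold_indices` is a hypothetical helper, not part of scikit-learn; it assumes consecutive folds whose sizes differ by at most one, with the larger folds first.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in k_fold_indices(n_samples=7, n_splits=3):
    print(train, test)
```

Each sample lands in the test set exactly once, and the model is trained on the remaining folds every time.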

Now that the dataset is ready, I implement the **Ensemble Learning** algorithm by combining **Random Forest**, **SVM**, and **Logistic Regression** in a **voting classifier** (scikit-learn uses hard voting by default), and report the average accuracy of the model over the 10 folds:

In [44]:
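# Base estimators; random_state is fixed so the run is reproducible
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)
# Hard voting (the default): each estimator casts one vote per instance
voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard",
)
# One accuracy score per fold, averaged over the 10 folds
result = cross_val_score(voting_clf, Xs, y, cv=kf)
print("Avg accuracy: {}".format(result.mean()))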

Avg accuracy: 0.9699792960662528
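
The weighted voting described earlier can also be tried in scikit-learn through the `weights` parameter of `VotingClassifier`. The sketch below is an assumption-laden illustration, not the experiment reported above: it uses synthetic data from `make_classification` instead of the cancer dataset, the weights `[1, 2, 1]` are arbitrary, and `max_iter=1000` and `cv=5` are my own choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the cancer dataset used above)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=42)

weighted_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(gamma="scale", random_state=42)),
    ],
    voting="hard",
    weights=[1, 2, 1],  # hypothetical weights: the random forest's vote counts twice
)
scores = cross_val_score(weighted_clf, X_demo, y_demo, cv=5)
print("Avg accuracy: {}".format(scores.mean()))
```

Whether weighting helps depends on the data, so any weights would need to be validated rather than assumed.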
