## __Applying Averaging and Max Voting__
Let's look at two simple ensemble techniques, averaging and max voting, applied to a breast cancer dataset.

## Step 1: Import the Required Libraries and Load the Dataset

- Import **pandas, NumPy, SimpleImputer**, and **MinMaxScaler**
- Load the breast cancer dataset and preprocess it


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses','Class']

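- Drop the `Sample code` column, which is an identifier and carries no predictive information
- Replace the `?` placeholders with 0 and convert `Bare Nuclei` to an integer type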

In [None]:
data.drop(['Sample code'], axis =1, inplace=True)
data.replace('?', 0, inplace=True)
data['Bare Nuclei'] = data['Bare Nuclei'].astype('int64')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Clump Thickness              699 non-null    int64
 1   Uniformity of Cell Size      699 non-null    int64
 2   Uniformity of Cell Shape     699 non-null    int64
 3   Marginal Adhesion            699 non-null    int64
 4   Single Epithelial Cell Size  699 non-null    int64
 5   Bare Nuclei                  699 non-null    int64
 6   Bland Chromatin              699 non-null    int64
 7   Normal Nucleoli              699 non-null    int64
 8   Mitoses                      699 non-null    int64
 9   Class                        699 non-null    int64
dtypes: int64(10)
memory usage: 54.7 KB


__Observation:__
- Now, you can see that all columns are defined as integers.
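
Replacing `?` with 0 treats a missing measurement as a real value of 0. An alternative is to treat it as missing and fill in the column mean, which is the default strategy of `SimpleImputer`. The sketch below contrasts the two in plain Python; the helper `impute_mean` and the sample values are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def impute_mean(raw_values, missing_marker="?"):
    """Replace the missing marker with the mean of the observed values."""
    observed = [float(v) for v in raw_values if v != missing_marker]
    fill = mean(observed)
    return [fill if v == missing_marker else float(v) for v in raw_values]


# Made-up column in the same string format as 'Bare Nuclei'.
bare_nuclei = ["1", "10", "?", "2", "3"]

mean_filled = impute_mean(bare_nuclei)
zero_filled = [0.0 if v == "?" else float(v) for v in bare_nuclei]

print(mean_filled)  # [1.0, 10.0, 4.0, 2.0, 3.0]
print(zero_filled)  # [1.0, 10.0, 0.0, 2.0, 3.0]
```

Which choice is better depends on the data; the notebook keeps the zero replacement, so its outputs correspond to that choice.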

## Step 2: Impute and Normalize the Data

- Run SimpleImputer over the data (after the `?` replacement in Step 1 no missing values remain, so this acts as a safeguard)
- Normalize the data using MinMaxScaler

In [None]:
values = data.values
imputer = SimpleImputer()
imputeData = imputer.fit_transform(values)

In [None]:
scaler = MinMaxScaler(feature_range=(0,1))
normalizedData = scaler.fit_transform(imputeData)

Let's split the data into the features X and the target y.

In [None]:
X = normalizedData[:, 0:9]
y = normalizedData[:, 9]
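
Because the whole array, including the `Class` column, passes through `MinMaxScaler`, the target is rescaled too. In this dataset the class labels are 2 (benign) and 4 (malignant), so they become 0.0 and 1.0. The sketch below reproduces the min-max formula, (x - min) / (max - min), in plain Python; `min_max_scale` is a hypothetical helper and the sample values are made up.

```python
def min_max_scale(column):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


class_labels = [2, 4, 4, 2]      # made-up sample of the 'Class' column
clump_thickness = [1, 5, 10]     # made-up sample of a feature column

scaled_labels = min_max_scale(class_labels)
scaled_feature = min_max_scale(clump_thickness)

print(scaled_labels)   # [0.0, 1.0, 1.0, 0.0]
print(scaled_feature)  # first value 0.0, last value 1.0
```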

## Step 3: Train the Classifiers and Calculate Average Predictions

- In this example, we train three different classifiers, namely **LogisticRegression**, **DecisionTreeClassifier**, and **SVC**, and average their predictions.
- Calculate the averaged predictions and their R2 score


In [None]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
LogRefg_clf  = LogisticRegression()
Dtree_clf = DecisionTreeClassifier()
svc_slf = SVC()

LogRefg_clf.fit(X, y)
Dtree_clf.fit(X,y)
svc_slf.fit(X,y)

- Let's predict on the dataset with all three models, combine the three sets of predictions, and score the combined prediction with R2. Note that the models are scored on the same data they were trained on.

In [None]:
LogRefg_pred = LogRefg_clf.predict(X)
Dtree_pred = Dtree_clf.predict(X)
svc_pred = svc_slf.predict(X)

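# Floor division: with 0/1 predictions, the result is 1 only when all three models predict 1
avg_preds = (LogRefg_pred + Dtree_pred + svc_pred)//3

# r2_score is a regression metric; it is not the same as classification accuracy
r2 = r2_score(y, avg_preds)

print(r2)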

0.9113410281034264
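
The value printed above comes from `r2_score`, which is a regression metric, so it should not be read as the fraction of correct predictions. The sketch below computes both R2 and accuracy by hand on a small made-up set of binary labels to show how far apart they can be; the helper names `r2` and `accuracy` are hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: ten samples, one of them misclassified.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

acc_value = accuracy(y_true, y_pred)
r2_value = r2(y_true, y_pred)

print(acc_value)           # 0.9
print(round(r2_value, 4))  # 0.5833
```

With one error in ten samples the accuracy is 0.9 while R2 is roughly 0.58, so the two metrics are not interchangeable for 0/1 labels.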


__Observations:__
- The R2 score of the combined predictions is about 0.91. This is an R2 score measured on the training data, not a classification accuracy.
- The combined prediction is taken as the final predicted value.
- Let's take a look at the values predicted by combining the three models.

In [None]:
avg_preds

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,
       0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 1.,
       0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 1., 0., 1.,
       1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.,
       0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0.,
       0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 1.,
       0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
       0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1.,
       1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
       0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0.,
       1., 0., 1., 1., 0.

__Observation:__
- These are the final predictions obtained by combining the three models; 0 and 1 are the min-max scaled class labels.
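
One detail of the combination step is worth spelling out: `//3` is floor division, so with 0/1 predictions the result is 1 only when all three models predict 1. The sketch below contrasts that rule with a majority vote and with a rounded mean on made-up predictions; the variable names are hypothetical.

```python
# Made-up 0/1 predictions from three models for four samples.
preds_a = [1, 1, 0, 1]
preds_b = [1, 0, 0, 1]
preds_c = [1, 1, 0, 0]

votes = [a + b + c for a, b, c in zip(preds_a, preds_b, preds_c)]

# Floor division of the sum by 3: 1 only when the vote is unanimous.
floor_avg = [v // 3 for v in votes]

# Majority vote: 1 when at least two of the three models predict 1.
majority = [1 if v >= 2 else 0 for v in votes]

# Mean rounded to the nearest label; same result as the majority vote here.
rounded_mean = [round(v / 3) for v in votes]

print(floor_avg)     # [1, 0, 0, 0]
print(majority)      # [1, 1, 0, 1]
print(rounded_mean)  # [1, 1, 0, 1]
```

If a majority decision is what is wanted, the rounded mean or an explicit vote count is the closer match; the floor-divided version is stricter.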

## Step 4: Implement the Voting Ensemble

- Import VotingClassifier and model_selection from sklearn
- Perform k-fold cross-validation and calculate the mean of the results
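
As a rough illustration of what k-fold cross-validation does with the 699 samples, the sketch below shuffles the sample indices and cuts them into 10 folds in plain Python. The helper `k_fold_indices` is hypothetical; it follows the same fold-size rule as `KFold` (the first `n_samples % n_splits` folds get one extra sample), but its shuffled order will not match scikit-learn's for the same seed.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Shuffle the indices 0..n_samples-1 and cut them into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


folds = k_fold_indices(n_samples=699, n_splits=10, seed=7)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # nine folds of 70 samples and one of 69

# Each fold serves once as the test set; the remaining folds form the training set.
test_fold = folds[0]
train_indices = [i for fold in folds[1:] for i in fold]
print(len(train_indices), len(test_fold))  # 629 70
```

The model is fitted once per fold and the ten test scores are averaged to give the final figure.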

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection

kfold = model_selection.KFold(n_splits =10, random_state=7, shuffle=True)
estimators= []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

ensemble = VotingClassifier(estimators)
results =  model_selection.cross_val_score(ensemble, X,y, cv=kfold)
print(results.mean())

0.9627950310559006
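
`VotingClassifier(estimators)` uses its default setting, `voting='hard'`, which picks the class predicted by most of the models. The other option, `voting='soft'`, averages the predicted class probabilities instead, and would require `SVC(probability=True)` so that the SVM can produce probabilities. The sketch below shows both rules in plain Python on made-up probabilities; `hard_vote` and `soft_vote` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter


def hard_vote(labels):
    """Return the label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities over models and return the top class."""
    n_classes = len(probabilities[0])
    mean_proba = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=mean_proba.__getitem__)


# Made-up [P(class 0), P(class 1)] from three models for a single sample.
model_probas = [[0.10, 0.90], [0.60, 0.40], [0.55, 0.45]]

# Each model's own label is its most probable class.
model_labels = [max(range(2), key=p.__getitem__) for p in model_probas]

hard_result = hard_vote(model_labels)
soft_result = soft_vote(model_probas)

print(model_labels)  # [1, 0, 0]
print(hard_result)   # 0
print(soft_result)   # 1
```

Here one confident model outweighs two lukewarm ones under soft voting, while hard voting counts each model equally.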


__Observations:__
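- Now we can compare the scores.
- In the earlier case, the combined predictions gave an R2 score of 0.9113 on the training data.
- In this case, the voting ensemble gives a mean 10-fold cross-validation accuracy of 96.28%.
- The two numbers are different metrics measured in different ways, so the comparison is only indicative.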