### Basic Aggregating of Models

This activity focuses on combining models in an ensemble to make predictions. You will first build an ensemble by hand and then use the `VotingClassifier` from `scikit-learn` to implement the same idea. The task is a classification problem, and the ensemble combines Logistic Regression, KNN, and Support Vector Machine models.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)



In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### The Data


The data was retrieved from [Kaggle]() and contains measurements from fetal cardiotocogram (CTG) exams, each classified into one of three categories:


- Normal
- Suspect
- Pathological


In [3]:
df = pd.read_csv('codio_21_5_solution/data/fetal.zip', compression = 'zip')

In [4]:
df.head()

| | baseline value | accelerations | fetal_movement | uterine_contractions | light_decelerations | severe_decelerations | prolongued_decelerations | abnormal_short_term_variability | mean_value_of_short_term_variability | percentage_of_time_with_abnormal_long_term_variability | ... | histogram_min | histogram_max | histogram_number_of_peaks | histogram_number_of_zeroes | histogram_mode | histogram_mean | histogram_median | histogram_variance | histogram_tendency | fetal_health |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 73.0 | 0.5 | 43.0 | ... | 62.0 | 126.0 | 2.0 | 0.0 | 120.0 | 137.0 | 121.0 | 73.0 | 1.0 | 2.0 |
| 1 | 132.0 | 0.006 | 0.0 | 0.006 | 0.003 | 0.0 | 0.0 | 17.0 | 2.1 | 0.0 | ... | 68.0 | 198.0 | 6.0 | 1.0 | 141.0 | 136.0 | 140.0 | 12.0 | 0.0 | 1.0 |
| 2 | 133.0 | 0.003 | 0.0 | 0.008 | 0.003 | 0.0 | 0.0 | 16.0 | 2.1 | 0.0 | ... | 68.0 | 198.0 | 5.0 | 1.0 | 141.0 | 135.0 | 138.0 | 13.0 | 0.0 | 1.0 |
| 3 | 134.0 | 0.003 | 0.0 | 0.008 | 0.003 | 0.0 | 0.0 | 16.0 | 2.4 | 0.0 | ... | 53.0 | 170.0 | 11.0 | 0.0 | 137.0 | 134.0 | 137.0 | 13.0 | 1.0 | 1.0 |
| 4 | 132.0 | 0.007 | 0.0 | 0.008 | 0.0 | 0.0 | 0.0 | 16.0 | 2.4 | 0.0 | ... | 53.0 | 170.0 | 9.0 | 0.0 | 137.0 | 136.0 | 138.0 | 11.0 | 1.0 | 1.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

In [6]:
df['fetal_health'].value_counts()

fetal_health
1.0    1655
2.0     295
3.0     176
Name: count, dtype: int64
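
The three classes are far from balanced, so the accuracy scores later in the activity are easier to judge against a majority-class baseline. The sketch below computes that baseline from the counts printed above; the names `counts`, `proportions`, and `baseline_accuracy` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1.0: 1655, 2.0: 295, 3.0: 176})

# Share of each class in the data set.
proportions = counts / counts.sum()

# Accuracy of a model that always predicts the most common class.
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 4))
```

Any classifier or ensemble worth keeping should score clearly above this baseline.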

In [7]:
X = df.drop('fetal_health', axis = 1)
y = df['fetal_health']

In [9]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
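
The cell above fits the scaler on every row, and the rest of the activity scores models on the same data they were trained on. `train_test_split` is imported but never used. If a held-out evaluation is wanted, one possible pattern is sketched below. It uses synthetic data from `make_classification` as a stand-in for the fetal health features, and all `*_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the real features (assumption, not the fetal data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratify so each split keeps roughly the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Learn the scaling parameters from the training rows only,
# then apply the same transformation to the test rows.
scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train)
X_test_scaled = scaler_demo.transform(X_test)
```

Fitting the scaler on the training split alone keeps information from the test rows out of the preprocessing step.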

### Problem 1

#### Model Predictions

Given the models below and the starter code, scale the data and train each model on it, assigning each model's predictions as an array to the matching key in the given dictionary.

In [10]:
models = [LogisticRegression(), KNeighborsClassifier(), SVC()]

In [11]:
results = {'logistic':[], 'knn':[], 'svc':[]}

In [12]:
# Fit each model and store its predictions under the matching dictionary key.
for name, model in zip(results, models):
    model.fit(X, y)
    results[name] = model.predict(X)
results

{'logistic': array([2., 1., 1., ..., 2., 2., 1.], shape=(2126,)),
 'knn': array([2., 1., 1., ..., 2., 2., 1.], shape=(2126,)),
 'svc': array([2., 1., 1., ..., 2., 2., 1.], shape=(2126,))}

### Problem 2

#### Majority Vote

Using your dictionary of predictions, create a DataFrame called `prediction_df` and add a column to the DataFrame named `ensemble_prediction` based on the majority vote of your predictions.

In [15]:
prediction_df = pd.DataFrame(results)
prediction_df['ensemble_prediction'] = prediction_df.mode(axis = 1).iloc[:, 0]
prediction_df.head()

| | logistic | knn | svc | ensemble_prediction |
|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 1.0 | 1.0 |
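
The first five rows all agree, so they do not show what `mode(axis = 1)` does when the models disagree. The toy example below uses made-up predictions (not taken from the fetal data) to illustrate a two-to-one vote and a three-way tie; `toy`, `modes`, and `toy_vote` are illustrative names.

```python
import pandas as pd

# Hypothetical predictions from three models for three samples.
toy = pd.DataFrame({
    'logistic': [1.0, 2.0, 1.0],
    'knn':      [1.0, 2.0, 2.0],
    'svc':      [2.0, 3.0, 3.0],
})

# Row-wise modes: one column per tied value, sorted ascending, padded with NaN.
modes = toy.mode(axis=1)

# Taking the first column keeps the majority label, or the smallest label in a tie.
toy_vote = modes.iloc[:, 0]
print(toy_vote.tolist())
```

In the last row every model predicts a different class, so all three labels are modes and the first column holds the smallest one.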


### Problem 3


#### Accuracy of Classifiers


Create a list of accuracy scores, one for each classifier and one for the ensemble. Use this list, together with the column names of `prediction_df`, to create a DataFrame named `results_df` that holds the accuracy scores. Where does your ensemble rank?


In [16]:
from sklearn.metrics import accuracy_score

In [17]:
accuracies = []
for col in prediction_df.columns:
    accuracies.append(accuracy_score(y, prediction_df[col]))
accuracies


[0.9045155221072436, 0.9374412041392286, 0.929444967074318, 0.9270931326434619]
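
The cell above stops at the list of accuracies and does not build `results_df`. One possible layout is sketched below, using the accuracy values copied from the output above. The `accuracy` and `rank` column names are assumptions, since the prompt does not specify a structure.

```python
import pandas as pd

# Accuracy values copied from the output above, in the column order of prediction_df.
names = ['logistic', 'knn', 'svc', 'ensemble_prediction']
accuracies = [0.9045155221072436, 0.9374412041392286,
              0.929444967074318, 0.9270931326434619]

# One row per classifier, with its accuracy.
results_df = pd.DataFrame({'accuracy': accuracies}, index=names)

# Rank 1 is the most accurate model.
results_df['rank'] = results_df['accuracy'].rank(ascending=False).astype(int)
results_df = results_df.sort_values('rank')
print(results_df)
```

With these numbers the ensemble ranks third, behind KNN and the SVC and ahead of Logistic Regression.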

### Problem 4

#### Using the Voting Classifier

Use the documentation and User Guide [here](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) to create a voting ensemble using the `VotingClassifier` based on the majority vote using the same three classifiers `svc`, `lgr`, and `knn`.  Assign the accuracy of the ensemble to `vote_accuracy` below.

In [18]:
voter = VotingClassifier([('svc', SVC()), ('lgr', LogisticRegression()), ('knn', KNeighborsClassifier())])
voter

In [19]:
voter.fit(X, y)
vote_accuracy = voter.score(X, y)
vote_accuracy

0.9270931326434619
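
This accuracy is identical to the hand-built ensemble's score in Problem 3, which suggests the default `VotingClassifier` performs the same majority vote. The sketch below checks that equivalence on synthetic data from `make_classification`, not on the fetal data. The helper `make_estimators` and the `*_demo` names are hypothetical, and `max_iter=1000` is set only to avoid convergence warnings on the unscaled toy features.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic three-class data (assumption: a stand-in for the real features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

def make_estimators():
    # Fresh, unfitted copies of the three classifiers.
    return [('svc', SVC()),
            ('lgr', LogisticRegression(max_iter=1000)),
            ('knn', KNeighborsClassifier())]

# Manual majority vote, as in Problem 2.
manual = pd.DataFrame({
    name: est.fit(X_demo, y_demo).predict(X_demo)
    for name, est in make_estimators()
})
manual_vote = manual.mode(axis=1).iloc[:, 0].to_numpy()

# The same vote through VotingClassifier with hard voting (the default).
hard_voter = VotingClassifier(make_estimators(), voting='hard').fit(X_demo, y_demo)
same = np.array_equal(manual_vote, hard_voter.predict(X_demo))
print(same)
```

Both approaches resolve ties toward the smallest class label, which is why the predictions are expected to line up row for row.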

### Problem 5

#### Voting based on probabilities

Consult the user guide and create a new ensemble that makes predictions based on the probabilities of the estimators.  **HINT**: This has to do with the `voting` parameter.  Assign the ensemble as `soft_voter` and the accuracy as `soft_accuracy`. 

In [24]:
soft_voter = VotingClassifier([('svc', SVC(probability = True)),
                               ('lgr', LogisticRegression()),
                               ('knn', KNeighborsClassifier())],
                              voting = 'soft')
soft_voter

In [25]:
soft_voter.fit(X, y)
soft_accuracy = soft_voter.score(X, y)
soft_accuracy

0.9379115710253998
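
Soft voting averages the class probabilities from each estimator and predicts the class with the highest average, so one very confident model can outweigh two lukewarm ones. The worked example below uses made-up probabilities for a single sample to show a case where hard and soft voting disagree. All numbers and names are illustrative, not outputs of the fitted models.

```python
import numpy as np

classes = np.array([1.0, 2.0, 3.0])

# Hypothetical predict_proba rows for one sample, one row per estimator.
probas = np.array([
    [0.90, 0.10, 0.00],  # svc: very confident in class 1
    [0.45, 0.55, 0.00],  # lgr: leans slightly toward class 2
    [0.40, 0.60, 0.00],  # knn: leans toward class 2
])

# Hard voting: each estimator casts one vote for its most likely class.
hard_votes = classes[probas.argmax(axis=1)]
values, counts = np.unique(hard_votes, return_counts=True)
hard_choice = values[counts.argmax()]

# Soft voting: average the probabilities, then take the most likely class.
soft_choice = classes[probas.mean(axis=0).argmax()]

print(hard_choice, soft_choice)
```

Two of the three estimators vote for class 2, but the averaged probability of class 1 is higher, so the two schemes give different answers for this sample.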

### Problem 6

#### Using different weights 

Finally, consider weighting the classifiers differently. Give the Logistic Regression estimator a weight of 0.5 in the majority vote, and the SVC and KNN a weight of 0.25 each. Assign the accuracy of these predictions on the data to `weighted_acc`.

In [26]:
# Weights follow the estimator order: svc, lgr, knn.
weighted_voter = VotingClassifier([('svc', SVC(probability = True)), ('lgr', LogisticRegression()), ('knn', KNeighborsClassifier())],
                                 weights=[0.25, .5, .25])
weighted_voter.fit(X, y)
weighted_acc = weighted_voter.score(X, y)

weighted_acc

0.9214487300094073
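
With weights of 0.25, 0.5, and 0.25, Logistic Regression can never be outvoted: when the SVC and KNN agree against it, the tally is tied at 0.5 each. The sketch below tallies weighted hard votes with `np.bincount` on made-up predictions to show these cases. It illustrates the idea under the assumption that ties go to the smallest class label, and is not scikit-learn's own implementation; `weighted_vote` is a hypothetical helper.

```python
import numpy as np

classes = np.array([1.0, 2.0, 3.0])
weights = np.array([0.25, 0.5, 0.25])  # order: svc, lgr, knn

def weighted_vote(encoded_preds):
    # encoded_preds holds class indices (0, 1, 2) from svc, lgr, knn.
    tally = np.bincount(encoded_preds, weights=weights, minlength=len(classes))
    # argmax returns the first maximum, so ties go to the smallest class label.
    return classes[tally.argmax()]

# svc and knn say class 2, lgr says class 3: tie at 0.5, smaller label wins.
tie_low = weighted_vote(np.array([1, 2, 1]))

# svc and knn say class 3, lgr says class 1: tie at 0.5, lgr's smaller label wins.
tie_lgr = weighted_vote(np.array([2, 0, 2]))

# All three disagree: lgr's weight of 0.5 decides the vote.
lgr_decides = weighted_vote(np.array([0, 1, 2]))

print(tie_low, tie_lgr, lgr_decides)
```

Under this tie rule the weighted ensemble leans heavily on Logistic Regression, the least accurate of the three models in Problem 3, which may explain why the weighted accuracy falls below the unweighted vote.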