<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Ensembles & Voting

_By: Jeff Hale - Penelope adapted from other materials_
___

### Learning Objectives
After this lesson students will be able to:
- Explain the difference between hard and soft voting
- Use a scikit-learn VotingClassifier and VotingRegressor 
- Describe calibration


### Prior Knowledge Required:
- Python basics
- Pandas basics
- Scikit-learn basics

## Ensemble Methods

Ensembling is building multiple models and then combining their results in some way to create predictions.

## Why would we build an "ensemble model"?

We can summarize this as the **wisdom of the crowd**.

## Wisdom of the Crowd: Guess the weight of Penelope

![](./images/penelope.jpg)

[Image source: https://www.npr.org](https://www.npr.org/sections/money/2015/07/17/422881071/how-much-does-this-cow-weigh)

In [None]:
first_guess = 1000

In [3]:
guesses = [1000, 1100, 1650, 1600, 1500, 1200, 1100, 1300, 1250]

In [4]:
import numpy as np
np.mean(guesses)
# The actual weight of the cow is 1350, so the average guess is off by only 50.
# Pooling many guesses generally gives a better estimate than relying on a single one.

1300.0
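
As a rough illustration of why averaging helps, the sketch below simulates a crowd of noisy guessers. The crowd size, the noise level, and the seed are arbitrary assumptions for the demo, not values from the cow example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the run is repeatable
true_weight = 1350              # the cow's actual weight from the cell above

# Hypothetical crowd: 500 unbiased guesses with a standard deviation of 250
crowd = true_weight + rng.normal(loc=0, scale=250, size=500)

# Error of the single averaged guess
crowd_error = abs(crowd.mean() - true_weight)

# Average error of an individual guesser
typical_error = np.mean(np.abs(crowd - true_weight))

print(f"crowd average is off by {crowd_error:.1f}")
print(f"a typical individual is off by {typical_error:.1f}")
```

The averaged guess can never be further off than the average individual error, and when the individual errors point in different directions it is usually much closer.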

#### Imports

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error

### Regression

Carvana car price prediction

In [6]:
df_cars = pd.DataFrame(dict(
    price=
    [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990, 
     30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=
    [11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659, 
    9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=
    [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019, 
    2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014]
))

In [None]:
df_cars

### Set up X & y, tts, standardize.

Get the RMSE for a LinearRegression model, a KNN model, and a baseline model

In [7]:
X = df_cars.drop('price', axis=1)

In [8]:
y = df_cars['price']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

In [10]:
X_train

| | miles | year |
|---:|---:|---:|
| 5 | 42904 | 2017 |
| 12 | 14833 | 2010 |
| 13 | 25848 | 2018 |
| 3 | 37654 | 2015 |
| 14 | 12505 | 2018 |
| 7 | 10659 | 2019 |
| 11 | 17428 | 2019 |
| 15 | 6877 | 2019 |
| 16 | 82197 | 2014 |
| 9 | 32743 | 2014 |


In [11]:
y_test

0     34990
10    31990
2     25990
1     32590
4     30990
Name: price, dtype: int64

### Baseline null model

In [14]:
# with this approach, we would guess the mean every time
np.mean(y_train)

30556.666666666668

In [17]:
# How do we get a prediction for each record from the null model?
# mean_squared_error expects one prediction per record, so one option is to
# create an array that repeats the same prediction for every record in y_test.


baseline_test_predictions = np.ones(y_test.shape) * np.mean(y_train)


In [18]:
# now we can use MSE to get a value as a baseline

mean_squared_error(y_test, baseline_test_predictions)



9377111.111111108

In [19]:
# MSE is in squared units; to undo the squaring, add the squared=False param to get the RMSE
# (newer scikit-learn releases deprecate this param in favor of root_mean_squared_error)
mean_squared_error(y_test, baseline_test_predictions, squared=False)
# So the baseline is typically off by about 3062.2.
# Once we cross-validate, we check whether the average RMSE across the folds is below this number
# in order to decide whether our model is better than the baseline.

3062.206902074239
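
scikit-learn also ships a `DummyRegressor` that plays the role of the null model, so the baseline can be scored like any other estimator. The sketch below uses small made-up targets rather than the car data, and takes the square root of the MSE by hand so it does not depend on the `squared` parameter.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Toy stand-ins for the training and test targets (not the car data)
y_train_demo = np.array([30000, 28000, 35000, 22000])
y_test_demo = np.array([31000, 27000])

# DummyRegressor ignores the features, so a placeholder column is enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

null_model = DummyRegressor(strategy="mean")
null_model.fit(X_train_demo, y_train_demo)

# Every prediction is the mean of the training targets
null_preds = null_model.predict(X_test_demo)

# The square root of the MSE is the RMSE
null_rmse = np.sqrt(mean_squared_error(y_test_demo, null_preds))
print(null_preds, null_rmse)
```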

### Standardize with 0 mean and unit variance

In [21]:
# Because we are using KNN, we need to scale our data with StandardScaler
# instantiate StandardScaler
scaler = StandardScaler()

# fit and transform the train data
X_train_scaled = scaler.fit_transform(X_train)

# only transform the test data
X_test_scaled = scaler.transform(X_test)

# Scaling doesn't change the predictions of an ordinary linear regression, but it does matter for distance-based models like KNN.
# Scaling the features also leaves RMSE in the original units, because the y values haven't changed.
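
One way to keep the "fit on train, only transform on test" rule from being broken by accident is to wrap the scaler and the model in a scikit-learn pipeline. This is a minimal sketch with made-up miles/year values and prices, and `n_neighbors=2` is an arbitrary choice for the tiny sample.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up miles/year features and prices, only for illustration
X_demo = np.array([[10000, 2019], [40000, 2015], [20000, 2018],
                   [80000, 2012], [5000, 2020], [30000, 2016]])
y_demo = np.array([32000, 21000, 28000, 12000, 36000, 24000])

# fit() learns the scaling statistics from the training rows only;
# predict() reuses those statistics on the new rows
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
knn_pipe.fit(X_demo[:4], y_demo[:4])

pipe_preds = knn_pipe.predict(X_demo[4:])
print(pipe_preds)
```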

### Linear Regression model RMSE

In [24]:
# linear regression model
lr = LinearRegression()

In [25]:
# fit on the scaled training data
lr.fit(X_train_scaled, y_train)

LinearRegression()

In [28]:
# make predictions - if you train on scaled data, predict on scaled data
lr_preds = lr.predict(X_test_scaled)

In [30]:
# use predictions to get the root mean squared error
mean_squared_error(y_test, lr_preds, squared=False)
# So this is actually worse than just using the mean as the baseline,
# though the very small data set could be the reason.

3937.6798105946023

### KNN model RMSE

In [31]:
# instantiate KNN
knn_reg = KNeighborsRegressor()

In [34]:
# fit model
knn_reg.fit(X_train_scaled, y_train)

KNeighborsRegressor()

In [37]:
# make predictions, using test data scaled
knn_preds = knn_reg.predict(X_test_scaled)

In [39]:
# use predictions to get the root mean squared error
mean_squared_error(y_test, knn_preds, squared=False)
# this is better than the lr, but still worse than the baseline. 

3132.794279872204

# Ensemble! 🎻🎺

**ensemble:** "a group of items viewed as a whole rather than individually." [Source](https://languages.oup.com/google-dictionary-en/)

In machine learning, you combine several models to form an _ensemble_ model.

![](./images/Ensemble.png)

Let's combine predictions from our KNN and Linear Regression models and weight them equally.

In [40]:
# Think of a table with knn_preds as one column and lr_preds as another;
# we average across each row to get one ensemble prediction per test record
np.mean([knn_preds, lr_preds], axis=0)

array([33312.42573053, 33273.32282743, 33298.78301726, 31852.52095791,
       31696.15590652])

In [43]:
pd.DataFrame([knn_preds, lr_preds]).T

| | 0 | 1 |
|---:|---:|---:|
| 0 | 32230.0 | 34394.851461 |
| 1 | 32230.0 | 34316.645655 |
| 2 | 32230.0 | 34367.566035 |
| 3 | 31030.0 | 32675.041916 |
| 4 | 31150.0 | 32242.311813 |


In [44]:
ensemble_preds = np.mean([knn_preds, lr_preds], axis = 0)

In [45]:
mean_squared_error(y_test, ensemble_preds, squared=False)
# Still performing worse than the baseline model.
# KNN is still better than LR, so perhaps we could give it more weight.

3432.8417849984708

In this case, we'd be better off just sticking with the KNN model - but some models perform better on some datapoints, so combining them can be superior to either. (caveat here: very small sample size).

## Weights

We can also give more weight to one algorithm.

![Weights](./images/weights.jpg)

Let's weight the model predictions 80% KNN and 20% Linear Regression.

In [47]:
weighted_preds = .8*knn_preds + .2*lr_preds

In [48]:
mean_squared_error(y_test, weighted_preds, squared=False)


3223.707658340071
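
The same weighted blend can be written with `np.average`, which normalizes the weights for you. A small sketch with toy predictions from two hypothetical models (the arrays are invented for the example):

```python
import numpy as np

# Toy predictions from two hypothetical models for three records
model_a_preds = np.array([100.0, 200.0, 300.0])
model_b_preds = np.array([200.0, 100.0, 300.0])

# Manual weighted blend: 80% model A, 20% model B
manual_blend = 0.8 * model_a_preds + 0.2 * model_b_preds

# np.average divides by the sum of the weights, so 4:1 is the same as 0.8:0.2
numpy_blend = np.average([model_a_preds, model_b_preds], axis=0, weights=[4, 1])

print(manual_blend)
print(numpy_blend)
```

Where the two models agree (the third record), the weights make no difference; they only matter where the models disagree.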

In [49]:
# do it with sklearn, instead of manually
from sklearn.ensemble import VotingRegressor

In [51]:
voter_1 = VotingRegressor([
    # this is a list of regressors as tuples 
    ('knn', KNeighborsRegressor()),
    ('lr', LinearRegression())
])

In [53]:
# by default, the regressors are weighted equally, but you can add weights if needed
voter_1.fit(X_train_scaled, y_train)

VotingRegressor(estimators=[('knn', KNeighborsRegressor()),
                            ('lr', LinearRegression())])

In [54]:
# to make predictions, use predict
voter_1_preds = voter_1.predict(X_test_scaled)

In [55]:
mean_squared_error(y_test, voter_1_preds, squared=False)
# still worse than baseline.

3432.8417849984708

#### Add a decision tree

In [56]:
from sklearn.tree import DecisionTreeRegressor

In [58]:
voter_2 = VotingRegressor([
    # this is a list of regressors as tuples 
    ('knn', KNeighborsRegressor()),
    ('lr', LinearRegression()),
    ('dtree', DecisionTreeRegressor(max_depth = 2))
    
])

In [60]:
voter_2.fit(X_train_scaled, y_train)

VotingRegressor(estimators=[('knn', KNeighborsRegressor()),
                            ('lr', LinearRegression()),
                            ('dtree', DecisionTreeRegressor(max_depth=2))])

In [61]:
voter_2_preds = voter_2.predict(X_test_scaled)

In [62]:
mean_squared_error(y_test, voter_2_preds, squared=False)
# Better, but still not better than the null model.
# It looks like the decision tree helped here, so perhaps we give it a bit more weight.

3139.404357709738

#### The voting regressor can take a list of weights for each model

In [63]:
voter_3 = VotingRegressor([
    # this is a list of regressors as tuples 
    ('knn', KNeighborsRegressor()),
    ('lr', LinearRegression()),
    ('dtree', DecisionTreeRegressor(max_depth = 2))
    
], weights=[.3, .1, .6])

In [64]:
voter_3.fit(X_train_scaled, y_train)

VotingRegressor(estimators=[('knn', KNeighborsRegressor()),
                            ('lr', LinearRegression()),
                            ('dtree', DecisionTreeRegressor(max_depth=2))],
                weights=[0.3, 0.1, 0.6])

In [67]:
voter_3_preds = voter_3.predict(X_test_scaled)

In [68]:
mean_squared_error(y_test, voter_3_preds, squared=False)
# Getting lower! Finally lower than the baseline.

3013.7999729803996

In [None]:
# a VotingRegressor works just like any other model, so you can run it through something like grid search
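
Here is a hedged sketch of what tuning a voting ensemble with grid search could look like. It uses synthetic data from `make_regression` instead of the car data, and the candidate weights and `n_neighbors` values are arbitrary picks for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real training set
X_demo, y_demo = make_regression(n_samples=120, n_features=2, noise=10, random_state=0)

voter = VotingRegressor([
    ("knn", KNeighborsRegressor()),
    ("lr", LinearRegression()),
    ("dtree", DecisionTreeRegressor(max_depth=2, random_state=0)),
])

# Candidate weightings, plus one nested hyperparameter using the name__param syntax
param_grid = {
    "weights": [None, [0.3, 0.1, 0.6], [0.1, 0.8, 0.1]],
    "knn__n_neighbors": [3, 5],
}

# Cross-validation picks the combination with the best (least negative) RMSE score
search = GridSearchCV(voter, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Choosing the weights by cross-validation on the training data, rather than by checking the test score after each change, keeps the test set as an honest final check.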

## Takeaways

- Ensembling can lead to better predictions
- You can weight model predictions to give more importance to one model

## Classification Ensemble
#### Let's use the penguins dataset 🐧

![penguin parent and child](./images/penguins.jpg)

In [None]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"

In [None]:
df_pens = pd.read_csv(url)
df_pens

In [None]:
df_pens.info()

### Quick drop

The problem is too easy with all the columns. Let's make it harder by just using bill length.

In [None]:
df_pens = df_pens.loc[:, ['species', 'bill_length_mm']]

Drop missing values

In [None]:
df_pens = df_pens.dropna()

In [None]:
df_pens.info()

#### Target

In [None]:
df_pens['species'].value_counts()

In [None]:
df_pens['species'].value_counts(normalize=True)

### Split into X and y, then training and test

In [None]:
X = df_pens.drop('species', axis=1)

In [None]:
X

In [None]:
y = df_pens['species']

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

### Null baseline

#### What is our null prediction for each observation?

#### How does that prediction perform?

If only looking at accuracy, you can shortcut to your answer:
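
A minimal sketch of the idea, using a handful of made-up species labels rather than the real split. The null model always predicts the most common training class, and its accuracy is the share of that class among the labels being scored, which is presumably the shortcut meant here.

```python
import pandas as pd

# Toy labels standing in for a training target and a test target
y_train_demo = pd.Series(["Adelie", "Adelie", "Gentoo", "Adelie", "Chinstrap"])
y_test_demo = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# The null model always predicts the most common training class
majority_class = y_train_demo.mode()[0]

# Long way: compare that constant prediction with every true label
null_accuracy = (y_test_demo == majority_class).mean()

# Shortcut: read the majority class's share straight from the normalized counts
shortcut_accuracy = y_test_demo.value_counts(normalize=True)[majority_class]

print(majority_class, null_accuracy, shortcut_accuracy)
```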

### KNN Model

#### Plot confusion matrix and score on accuracy

#### Make a function to show scores

In [None]:
def model_score(classifier, X, y):
    """fit and score a model - print and return accuracy and predict_proba
    
    Args:
        classifier: an instance of a scikit-learn classification estimator
        X (2d pd.DataFrame or np.ndarray): features 
        y (1d pd.Series or np.ndarray): outcome variable
    
    Returns: 
        accuracy score (float): accuracy on the X_test
        predict_proba (array of floats): predicted probabilities for each class for each sample
    """


#### Pass our new function a LogisticRegression algorithm and data
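
The function body is left open above, so what follows is only one possible sketch and not the lesson's own solution. It is named `model_score_sketch` to keep it separate, it assumes the train/test split is passed in explicitly, and it uses the iris data as a stand-in so it runs without downloading the penguins file.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def model_score_sketch(classifier, X_train, X_test, y_train, y_test):
    """Fit a classifier, print its test accuracy, and return accuracy and probabilities."""
    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test)       # mean accuracy on the test rows
    probabilities = classifier.predict_proba(X_test)  # one row per sample, one column per class
    print(f"{type(classifier).__name__} accuracy: {accuracy:.3f}")
    return accuracy, probabilities


# Iris stands in for the penguins data (an assumption made for this demo)
X_iris, y_iris = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, stratify=y_iris, random_state=111)

acc, proba = model_score_sketch(LogisticRegression(max_iter=1000), Xtr, Xte, ytr, yte)
```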

## Voting classifier ensemble

---
## Hard vs soft voting for classifiers

### Hard voting 
Each classifier predicts a class label (0, 1, or 2). The ensemble then predicts the label that gets the most votes.

### Soft voting
Each classifier predicts a probability for each class. The probabilities are summed for each class, and the class with the highest total is the prediction.
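
The two rules can disagree. The sketch below works through one sample by hand with NumPy; the probabilities are invented to show a case where two lukewarm votes lose to one confident one.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers
# (rows = classifiers, columns = classes 0, 1, 2)
probas = np.array([
    [0.45, 0.55, 0.00],   # classifier A narrowly prefers class 1
    [0.40, 0.60, 0.00],   # classifier B narrowly prefers class 1
    [0.95, 0.05, 0.00],   # classifier C strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its top class
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=3).argmax()

# Soft voting: add up the probabilities per class, then take the largest total
soft_prediction = probas.sum(axis=0).argmax()

print(hard_prediction, soft_prediction)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the class totals are 1.80 versus 1.20.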

### Ensemble classifier with soft voting
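
A minimal sketch of a soft-voting ensemble with scikit-learn's `VotingClassifier`. It assumes the iris data as an offline stand-in for the penguins, and the choice of KNN plus logistic regression is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris stands in for the penguins data (an assumption made for this demo)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, stratify=y_iris, random_state=111)

soft_voter = VotingClassifier(
    estimators=[
        # KNN is distance-based, so it gets its own scaler inside a pipeline
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",   # the default is "hard"
)
soft_voter.fit(X_tr, y_tr)

soft_accuracy = soft_voter.score(X_te, y_te)
soft_probas = soft_voter.predict_proba(X_te)   # averaged class probabilities
print(soft_accuracy)
```

Soft voting needs every estimator in the ensemble to support `predict_proba`, which both of these do.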

---
## Summary

You've seen how to put models you create into a voting regressor or voting classifier.

Ensembles give you the wisdom of the crowds.

You're about to see ensembles of decision trees that are among the most powerful algorithms available.

### Check for understanding

- What's the difference between hard voting and soft voting? 
- What type of machine learning problems do hard and soft voting apply to?
