<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Ensembles & Voting

_By: Jeff Hale - Penelope adapted from other materials_
___

### Learning Objectives
After this lesson students will be able to:
- Explain the difference between hard and soft voting
- Use a scikit-learn VotingClassifier and VotingRegressor 
- Describe calibration


### Prior Knowledge Required:
- Python basics
- Pandas basics
- Scikit-learn basics

## Ensemble Methods

Ensembling is building multiple models and then combining their results in some way to create predictions.

## Why would we build an "ensemble model"?

We can summarize this as the **wisdom of the crowd**.

## Wisdom of the Crowd: Guess the weight of Penelope

![](./images/penelope.jpg)

[Image source: https://www.npr.org](https://www.npr.org/sections/money/2015/07/17/422881071/how-much-does-this-cow-weigh)

In [None]:
guesses = 
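As a rough illustration of the wisdom of the crowd, here is a small simulation. The true weight, the noise level, and the number of guessers are all made-up values for this sketch (they are not the NPR data). It compares the error of the averaged guess with the average error of the individual guesses.

```python
import numpy as np

# Hypothetical setup: these numbers are invented for illustration only.
rng = np.random.default_rng(0)
true_weight = 1_000
simulated_guesses = true_weight + rng.normal(loc=0, scale=200, size=500)

# The "crowd" answer is simply the mean of all guesses.
crowd_estimate = simulated_guesses.mean()
crowd_error = abs(crowd_estimate - true_weight)

# How far off is a typical individual guess?
typical_individual_error = np.abs(simulated_guesses - true_weight).mean()

print(f"Crowd error: {crowd_error:.1f}")
print(f"Average individual error: {typical_individual_error:.1f}")
```

The crowd's error can never exceed the average individual error, because individual over- and under-estimates partly cancel when averaged.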

#### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error

### Regression

Carvana car price prediction

In [None]:
df_cars = pd.DataFrame(dict(
    price=
    [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990, 
     30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=
    [11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659, 
    9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=
    [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019, 
    2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014]
))

In [None]:
df_cars

### Set up X & y, train-test split, standardize.

Get the RMSE for a LinearRegression model, a KNN model, and a baseline model.

In [None]:
X = df_cars.drop('price', axis=1)

In [None]:
y = df_cars['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

In [None]:
X_train

In [None]:
y_test

### Baseline null model
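One way to fill in this step, as a minimal sketch: the null model predicts the mean training price for every test car. The data and split are copied from the cells above so the block runs on its own. The names `baseline_preds` and `baseline_rmse` are just illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

# Null model: predict the training-set mean price for every test car.
baseline_preds = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_preds))
print(f"Baseline RMSE: {baseline_rmse:,.0f}")
```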

### Standardize with 0 mean and unit variance
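A minimal sketch of the standardizing step, assuming we fit the scaler on the training features only and reuse it on the test features. The data and split are repeated so the block runs on its own. `sc`, `X_train_sc`, and `X_test_sc` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data.
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

print(X_train_sc.mean(axis=0).round(6), X_train_sc.std(axis=0).round(6))
```

Fitting on the training set alone keeps information about the test set from leaking into the model.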

### Linear Regression model RMSE
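A possible version of this step, as a sketch: fit `LinearRegression` on the standardized training features and compute RMSE on the test set. The data, split, and scaling are repeated so the block runs on its own. RMSE is computed as the square root of `mean_squared_error`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Fit on the training data, predict on the held-out test data.
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr_preds = lr.predict(X_test_sc)
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))
print(f"Linear regression RMSE: {lr_rmse:,.0f}")
```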

### KNN model RMSE
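A sketch of the KNN step, assuming the default `KNeighborsRegressor` settings (5 neighbors). The data, split, and scaling are repeated so the block runs on its own. Scaling matters here because KNN relies on distances between rows.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Each prediction is the mean price of the 5 nearest training cars.
knn = KNeighborsRegressor()
knn.fit(X_train_sc, y_train)
knn_preds = knn.predict(X_test_sc)
knn_rmse = np.sqrt(mean_squared_error(y_test, knn_preds))
print(f"KNN RMSE: {knn_rmse:,.0f}")
```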

# Ensemble! 🎻🎺

**ensemble:** "a group of items viewed as a whole rather than individually." [Source](https://languages.oup.com/google-dictionary-en/)

In machine learning, you combine several models to form an _ensemble_ model.

![](./images/Ensemble.png)

Let's combine predictions from our KNN and Linear Regression models and weight them equally.
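Here is a minimal sketch of an equally weighted ensemble: fit both models, then average their test predictions row by row. The data, split, and scaling are repeated so the block runs on its own. `ensemble_preds` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

lr_preds = LinearRegression().fit(X_train_sc, y_train).predict(X_test_sc)
knn_preds = KNeighborsRegressor().fit(X_train_sc, y_train).predict(X_test_sc)

# Equal weighting: a plain average of the two models' predictions.
ensemble_preds = (lr_preds + knn_preds) / 2

for name, preds in [("LR", lr_preds), ("KNN", knn_preds), ("Ensemble", ensemble_preds)]:
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name} RMSE: {rmse:,.0f}")
```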

In this case, we'd be better off just sticking with the KNN model. But different models perform better on different data points, so a combination can beat either model alone. Caveat: the sample size here is very small.

## Weights

We can also give more weight to one algorithm.

![Weights](./images/weights.jpg)

Let's weight the model predictions 80% KNN and 20% Linear Regression.
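A sketch of the 80/20 weighting using `np.average` with a `weights` argument. The data, split, and scaling are repeated so the block runs on its own. `weighted_preds` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

knn_preds = KNeighborsRegressor().fit(X_train_sc, y_train).predict(X_test_sc)
lr_preds = LinearRegression().fit(X_train_sc, y_train).predict(X_test_sc)

# Weighted average: 80% KNN, 20% linear regression.
weighted_preds = np.average([knn_preds, lr_preds], axis=0, weights=[0.8, 0.2])
weighted_rmse = np.sqrt(mean_squared_error(y_test, weighted_preds))
print(f"Weighted ensemble RMSE: {weighted_rmse:,.0f}")
```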

#### Add a decision tree
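One way to add a third model, sketched with scikit-learn's `VotingRegressor`, which averages the predictions of the estimators it is given. The data, split, and scaling are repeated so the block runs on its own. The estimator labels and `random_state=0` on the tree are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Three models, equally weighted by default.
voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('tree', DecisionTreeRegressor(random_state=0)),
])
voter.fit(X_train_sc, y_train)
voter_preds = voter.predict(X_test_sc)
voter_rmse = np.sqrt(mean_squared_error(y_test, voter_preds))
print(f"Voting regressor RMSE: {voter_rmse:,.0f}")
```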

#### The voting regressor can take a list of weights for each model
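A sketch of the `weights` parameter on `VotingRegressor`. The weights are listed in the same order as the estimators. The values `[0.2, 0.6, 0.2]` are arbitrary picks for illustration, not tuned. The data, split, and scaling are repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Same car data and split as earlier in the lesson.
df_cars = pd.DataFrame(dict(
    price=[34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
           30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990],
    miles=[11791, 14893, 13256, 37654, 38127, 42904, 1358, 10659,
           9255, 32743, 15990, 17428, 14833, 25848, 12505, 6877, 82197],
    year=[2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
          2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014],
))
X = df_cars.drop('price', axis=1)
y = df_cars['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Arbitrary example weights, in the same order as the estimators below.
model_weights = [0.2, 0.6, 0.2]
weighted_voter = VotingRegressor(
    estimators=[
        ('lr', LinearRegression()),
        ('knn', KNeighborsRegressor()),
        ('tree', DecisionTreeRegressor(random_state=0)),
    ],
    weights=model_weights,
)
weighted_voter.fit(X_train_sc, y_train)
weighted_voter_preds = weighted_voter.predict(X_test_sc)
weighted_voter_rmse = np.sqrt(mean_squared_error(y_test, weighted_voter_preds))
print(f"Weighted voting regressor RMSE: {weighted_voter_rmse:,.0f}")
```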

## Takeaways

- Ensembling can lead to better predictions
- You can weight model predictions to give more importance to one model

## Classification Ensemble
#### Let's use the penguins dataset 🐧

![penguin parent and child](./images/penguins.jpg)

In [None]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"

In [None]:
df_pens = pd.read_csv(url)
df_pens

In [None]:
df_pens.info()

### Quick drop

The problem is too easy with all the columns. Let's make it harder by just using bill length.

In [None]:
df_pens = df_pens.loc[:, ['species', 'bill_length_mm']]

Drop missing values

In [None]:
df_pens = df_pens.dropna()

In [None]:
df_pens.info()

#### Target

In [None]:
df_pens['species'].value_counts()

In [None]:
df_pens['species'].value_counts(normalize=True)

### Split into X and y, then training and test

In [None]:
X = df_pens.drop('species', axis=1)

In [None]:
X

In [None]:
y = df_pens['species']

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

### Null baseline

#### What is our null prediction for each observation?

#### How does that prediction perform?

If only looking at accuracy, you can shortcut to your answer:
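A sketch of the null baseline and the accuracy shortcut. The penguins CSV needs a network connection, so this block substitutes scikit-learn's bundled wine dataset (three classes, one feature kept) as a stand-in. That swap is an assumption of this example, not part of the lesson's data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): wine instead of penguins, single feature.
wine = load_wine(as_frame=True)
X = wine.data[['alcohol']]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

# Null prediction: the most common class in the training data, for every row.
majority_class = y_train.value_counts().idxmax()
null_preds = np.full(len(y_test), majority_class)
null_accuracy = accuracy_score(y_test, null_preds)

# Shortcut: the share of the majority class in the test set is the same number.
shortcut_accuracy = y_test.value_counts(normalize=True).loc[majority_class]

print(f"Null accuracy: {null_accuracy:.3f}")
print(f"Shortcut:      {shortcut_accuracy:.3f}")
```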

### KNN Model

#### Plot confusion matrix and score on accuracy
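A sketch of this step with a default `KNeighborsClassifier`. As a stand-in for the penguins data (an assumption so the block runs offline), it uses one feature of scikit-learn's bundled wine dataset. The matrix is computed with `confusion_matrix` and printed. The commented line shows how `ConfusionMatrixDisplay.from_estimator` would draw it when matplotlib is available.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): wine instead of penguins, single feature.
wine = load_wine(as_frame=True)
X = wine.data[['alcohol']]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)
knn_preds = knn.predict(X_test_sc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, knn_preds)
knn_accuracy = knn.score(X_test_sc, y_test)
print(cm)
print(f"KNN accuracy: {knn_accuracy:.3f}")

# To plot instead of print:
# ConfusionMatrixDisplay.from_estimator(knn, X_test_sc, y_test)
```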

#### Make a function to show scores

In [None]:
def model_score(classifier, X, y):
    """fit and score a model - print and return accuracy and predict_proba
    
    Args:
        classifier: an instance of a scikit-learn classification estimator
        X (2d pd.DataFrame or np.ndarray): features 
        y (1d pd.Series or np.ndarray): outcome variable
    
    Returns: 
        accuracy score (float): accuracy on the X_test
        predict_proba (array of floats): predicted probabilities for each class for each sample
    """


#### Pass our new function a LogisticRegression algorithm and data
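One possible implementation, offered as a sketch rather than the lesson's own code. The helper below is a hypothetical variant named `fit_and_score` that takes the training and test sets explicitly, which is an assumption about how the function should be wired. It runs on a single feature of scikit-learn's bundled wine dataset as an offline stand-in for the penguins data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_score(classifier, X_train, y_train, X_test, y_test):
    """Fit a classifier, then print and return test accuracy and predict_proba.

    Hypothetical helper for illustration; the signature is an assumption.
    """
    classifier.fit(X_train, y_train)
    accuracy = classifier.score(X_test, y_test)
    probabilities = classifier.predict_proba(X_test)
    print(f"{type(classifier).__name__} accuracy: {accuracy:.3f}")
    return accuracy, probabilities


# Stand-in data (assumption): wine instead of penguins, single feature.
wine = load_wine(as_frame=True)
X = wine.data[['alcohol']]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

logreg_accuracy, logreg_probas = fit_and_score(
    LogisticRegression(), X_train_sc, y_train, X_test_sc, y_test
)
```

Returning the probabilities as well as the accuracy makes it easy to combine several models' outputs by hand later.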

## Voting classifier ensemble

---
## Hard vs soft voting for classifiers

### Hard voting 
Each classifier predicts a class (0, 1, or 2). The ensemble predicts the class that gets the most votes.
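To make the counting concrete, here is a tiny sketch with hand-written votes. The vote values are invented for illustration. Each row is one classifier and each column is one sample.

```python
import numpy as np

# Hypothetical class votes: one row per classifier, one column per sample.
votes = np.array([
    [0, 1, 2, 1],   # classifier A
    [0, 1, 1, 2],   # classifier B
    [1, 1, 2, 2],   # classifier C
])

n_classes = 3

# For each sample, count the votes per class and keep the most common class.
hard_preds = np.array([
    np.bincount(sample_votes, minlength=n_classes).argmax()
    for sample_votes in votes.T
])
print(hard_preds)  # [0 1 2 2]
```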

### Soft voting
Each classifier predicts a probability for each class. Sum the probabilities for each class across classifiers (averaging them picks the same winner). The class with the highest total is the prediction.
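A small sketch with invented probabilities for a single sample shows how soft voting can disagree with hard voting. Here one very confident classifier outweighs two lukewarm ones.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample.
# Rows are classifiers, columns are classes 0, 1, and 2.
probas = np.array([
    [0.90, 0.10, 0.00],   # classifier A: very confident in class 0
    [0.40, 0.60, 0.00],   # classifier B: leans toward class 1
    [0.45, 0.55, 0.00],   # classifier C: leans toward class 1
])

# Soft voting: sum the probabilities per class and take the largest total.
class_totals = probas.sum(axis=0)
soft_pred = int(class_totals.argmax())

# Hard voting on the same classifiers: each one votes for its top class.
hard_votes = probas.argmax(axis=1)
hard_pred = int(np.bincount(hard_votes, minlength=3).argmax())

print("Class totals:", class_totals)
print("Soft vote:", soft_pred)  # 0
print("Hard vote:", hard_pred)  # 1
```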

### Ensemble classifier with soft voting
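A sketch of a soft-voting ensemble using scikit-learn's `VotingClassifier` with `voting='soft'`. As an offline stand-in for the penguins data (an assumption of this example), it uses one feature of the bundled wine dataset. The estimator labels are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): wine instead of penguins, single feature.
wine = load_wine(as_frame=True)
X = wine.data[['alcohol']]
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=111)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Soft voting averages each model's predicted class probabilities.
soft_voter = VotingClassifier(
    estimators=[
        ('logreg', LogisticRegression()),
        ('knn', KNeighborsClassifier()),
    ],
    voting='soft',
)
soft_voter.fit(X_train_sc, y_train)

soft_probas = soft_voter.predict_proba(X_test_sc)
soft_preds = soft_voter.predict(X_test_sc)
soft_accuracy = soft_voter.score(X_test_sc, y_test)
print(f"Soft voting accuracy: {soft_accuracy:.3f}")
```

Soft voting needs every estimator to support `predict_proba`. Switching to `voting='hard'` would instead count each model's predicted class as a single vote.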

---
## Summary

You've seen how to put models you create into a voting regressor or voting classifier.

Ensembles give you the wisdom of the crowds.

You're about to see ensembles of decision trees that are among the most powerful algorithms available.

### Check for understanding

- What's the difference between hard voting and soft voting? 
- What type of machine learning problems do hard and soft voting apply to?
