### Ensemble models

In this lesson, we aim to improve on the decision tree models from last class and explore alternative approaches to model building that overcome some of their drawbacks. We will also discuss saving and reusing models and explore a basic Streamlit application.

- [Download and Install VSCode](https://code.visualstudio.com/)
- [Install the Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python)

**OBJECTIVES**

- Identify shortcomings of Decision Tree models
- Understand and Implement Ensemble models
- Understand and Implement Boosted models



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay 
from sklearn.compose import make_column_transformer 
from sklearn.pipeline import Pipeline

### Decision Tree Review

In [None]:
#load data
heart = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/Heart.csv', index_col = 0)

In [None]:
#inspect
heart.head()

In [None]:
#inspect
heart.info()

In [None]:
#drop missing values
heart = heart.dropna()

In [None]:
#target count
sns.countplot(data = heart, x = 'AHD')

### Train/Test Split

In [None]:
#define X
X = heart.drop('AHD', axis = 1)

In [None]:
#baseline
y = heart['AHD']
y.value_counts(normalize = True)

In [None]:
#define y (make it numeric)
y = np.where(y == 'No', 0, 1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 22)

### Preprocessing

- Encode categorical features

In [None]:
X.head(1)

In [None]:
encoder = make_column_transformer((OneHotEncoder(), ['ChestPain', 'Thal']),
                                  remainder = 'passthrough',
                                  verbose_feature_names_out=False)

### Decision Tree

To begin, we use a Decision Tree to model the data.

In [None]:
pipe = Pipeline([('preprocess', encoder),
                 ('model', DecisionTreeClassifier())])

In [None]:
#fit 
pipe.fit(X_train, y_train)

In [None]:
#train accuracy
pipe.score(X_train, y_train)

In [None]:
#test accuracy
pipe.score(X_test, y_test)

**Reminder**

Use `.named_steps` to extract elements of a pipeline. Here we want the `preprocess` step, and we use its `get_feature_names_out` method to extract the feature names.

In [None]:
#get feature names after transformations
pipe.named_steps['preprocess'].get_feature_names_out()

In [None]:
#visualize the tree
plt.figure(figsize = (100, 100))
plot_tree(pipe.named_steps['model'], 
          feature_names=pipe.named_steps['preprocess'].get_feature_names_out(),
          fontsize = 50,
          filled = True,
          class_names = ['No', 'Yes']);

### Issues with Decision Trees

When grown without constraints, Decision Trees will overfit the training data.  One approach to dealing with this is to grid search different parameters and see if improved performance is possible.

**REMINDER**: When grid searching pipelines, name the step followed by two underscores followed by the parameter that you want to search.

In [None]:
#decision tree parameters
params = {'model__max_depth': [2, 3, 4, 5],
          'model__min_samples_split': [2,3,4,5,6]}

In [None]:
#grid for searching
grid = GridSearchCV(pipe, param_grid = params, cv = 2)
grid.fit(X_train, y_train)

In [None]:
#train score
grid.score(X_train, y_train)

In [None]:
#test score
grid.score(X_test, y_test)
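
After a grid search it is usually worth checking which parameters won. The sketch below assumes synthetic data from `make_classification` so that it runs on its own; the `demo_` names are illustrative, and `cv=5` is a choice made for this example rather than the setting used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data so the example is self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=22)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=22)

demo_pipe = Pipeline([('model', DecisionTreeClassifier(random_state=22))])
demo_params = {'model__max_depth': [2, 3, 4, 5],
               'model__min_samples_split': [2, 3, 4, 5, 6]}

demo_grid = GridSearchCV(demo_pipe, param_grid=demo_params, cv=5)
demo_grid.fit(Xd_train, yd_train)

# the winning combination and its mean cross-validated accuracy
print(demo_grid.best_params_)
print(demo_grid.best_score_)

# score() uses the best pipeline, refit on all of the training data
print(demo_grid.score(Xd_test, yd_test))
```

Comparing `best_score_` with the test score gives a rough sense of whether the tuned tree still overfits.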

### Ensemble methods

One approach to improving our tree model is to consider it alongside other models and form a voting bloc.  Here, each model is allowed a vote on the prediction.  Scikit-learn implements this idea with the `VotingClassifier`.  The example below combines three differently configured decision trees.



In [None]:
from sklearn.ensemble import VotingClassifier, BaggingClassifier

In [None]:
#voting approach
voter = VotingClassifier([('tree1', DecisionTreeClassifier(max_depth = 2)),
                          ('tree2',DecisionTreeClassifier(max_depth = 5)),
                          ('tree3',DecisionTreeClassifier(min_samples_split=5))])
vote_pipe = Pipeline([('preprocess', encoder),
                      ('model', voter)])

In [None]:
#fit it
vote_pipe.fit(X_train, y_train)

In [None]:
vote_pipe.score(X_train, y_train)

In [None]:
vote_pipe.score(X_test, y_test)
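
The voters do not have to be trees. As a hedged sketch, the block below mixes a logistic regression, a k-nearest-neighbors model, and a tree on synthetic data (an assumption made so the example runs standalone). The two distance- and scale-sensitive models get their own `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data so the example is self-contained
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=22)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=22)

mixed_voter = VotingClassifier([
    ('logreg', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ('tree', DecisionTreeClassifier(max_depth=3, random_state=22))],
    voting='hard')  # majority vote on predicted labels

mixed_voter.fit(Xd_train, yd_train)
vote_score = mixed_voter.score(Xd_test, yd_test)
print('ensemble', vote_score)

# each fitted member is available by name for comparison
for name, est in mixed_voter.named_estimators_.items():
    print(name, est.score(Xd_test, yd_test))
```

Voting tends to help most when the members make different kinds of mistakes, which is the motivation for mixing model families.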

### Bagging Classifier

Building on the earlier ideas and taking them one step further, we can build an ensemble of models, each fit on a different sample of the data.  One such approach is referred to as **BAGGING**. Here, the samples are drawn with replacement (the **BOOTSTRAP**) and the results are aggregated.  In classification, the aggregate is a vote based on either predicted labels or predicted probabilities.

- **BOOTSTRAP**:  "*Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods*."

- **BAGGING**: Aggregating bootstrapped models

- **HARD VOTING**: Using majority of predicted values when ensembling

- **SOFT VOTING**: Using probabilities to determine predictions from an ensemble
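
The four terms above can be sketched by hand. This is only an illustration on synthetic data with made-up settings (25 shallow trees), not how `BaggingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data so the example is self-contained
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=22)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=22)

rng = np.random.default_rng(22)
n_models = 25  # odd, so a hard vote cannot tie
trees = []
for i in range(n_models):
    # BOOTSTRAP: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=i)
    trees.append(tree.fit(Xd_train[idx], yd_train[idx]))

# HARD VOTING: every tree predicts a label and the majority label wins
label_votes = np.array([t.predict(Xd_test) for t in trees])  # shape (n_models, n_test)
hard_pred = (label_votes.mean(axis=0) > 0.5).astype(int)

# SOFT VOTING: average the predicted probability of class 1, then threshold
probs = np.array([t.predict_proba(Xd_test)[:, 1] for t in trees])
soft_pred = (probs.mean(axis=0) > 0.5).astype(int)

print('hard vote accuracy', (hard_pred == yd_test).mean())
print('soft vote accuracy', (soft_pred == yd_test).mean())
```

BAGGING is the combination of the two steps: fit one model per bootstrap sample, then aggregate their outputs.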

In [None]:
#bagging pipeline
bag_pipe = Pipeline([('preprocess', encoder),
                     ('model', BaggingClassifier())])

In [None]:
#fit
bag_pipe.fit(X_train, y_train)

In [None]:
#train score
bag_pipe.score(X_train, y_train)

In [None]:
#test score
bag_pipe.score(X_test, y_test)

In [None]:
#confusion matrix on test
ConfusionMatrixDisplay.from_estimator(bag_pipe, X_test, y_test);

### Random Forests

*Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.*

The main difference from bagging is that we are also sampling features: in scikit-learn's implementation, each split considers only a random subset of them.
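
A small sketch of that difference, assuming synthetic data with 16 features (chosen so the square root is a whole number). By default, a `BaggingClassifier` gives every tree all of the features, while a `RandomForestClassifier` with `max_features='sqrt'` lets each split look at only a few.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data so the example is self-contained
X_demo, y_demo = make_classification(n_samples=400, n_features=16, random_state=22)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=22)

# bagged trees: rows are bootstrapped, every tree sees every feature
bag = BaggingClassifier(n_estimators=50, random_state=22).fit(Xd_train, yd_train)

# random forest: rows are bootstrapped AND each split samples sqrt(16) features
forest = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                random_state=22).fit(Xd_train, yd_train)

print('features available to a bagged tree:', len(bag.estimators_features_[0]))
print('features considered per forest split:', forest.estimators_[0].max_features_)
print('bagging test accuracy', bag.score(Xd_test, yd_test))
print('forest test accuracy', forest.score(Xd_test, yd_test))
```

Restricting the features at each split makes the individual trees less alike, which is what the averaging step benefits from.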

In [None]:
# import
from sklearn.ensemble import RandomForestClassifier

In [None]:
# pipeline
forest_pipe = Pipeline([('preprocess', encoder), 
                       ('model', RandomForestClassifier(max_depth = 2))])

In [None]:
# fit
forest_pipe.fit(X_train, y_train)

In [None]:
# train score
forest_pipe.score(X_train, y_train)

In [None]:
# test score
forest_pipe.score(X_test, y_test)

In [None]:
#two shallow trees; each one will see a different subset of the features
models = [DecisionTreeClassifier(max_depth = 3), DecisionTreeClassifier(max_depth = 3)]

In [None]:
X.columns

In [None]:
#illustrate feature sampling: fit each tree on its own pair of features
X1 = X[['Age', 'Sex']]
X2 = X[['RestBP', 'Chol']]
models[0].fit(X1, y)
models[1].fit(X2, y)

In [None]:
models[0].predict(X1)[:5]

In [None]:
models[1].predict(X2)[:5]

In [None]:
forest_pipe.named_steps['preprocess'].get_feature_names_out()

In [None]:
forest_pipe.named_steps['model'].feature_importances_

In [None]:
pd.DataFrame({'features': forest_pipe.named_steps['preprocess'].get_feature_names_out(),
             'importance': forest_pipe.named_steps['model'].feature_importances_ }).sort_values(by = 'importance', ascending = False)

### Boosted Models

An alternative to aggregating across independent models is to iteratively update a model based on previous performance.  This is what boosting does.  We will gloss over most of the details, but the mechanism for updating the models is what gives each boosted model its name.

Scikit-learn implements an `AdaBoostClassifier` and a `GradientBoostingClassifier`; both iteratively update models based on prior performance.
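
To make "iteratively update" concrete, here is a simplified sketch of the gradient boosting idea for a regression problem with squared error: each new small tree is fit to the current residuals and added with a small learning rate. The data, the learning rate, and the 50 rounds are assumptions for illustration; the scikit-learn estimators are more elaborate, and AdaBoost instead reweights the training rows that were misclassified.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up 1-D regression data: a noisy sine curve
rng = np.random.default_rng(22)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
target = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(target, target.mean())  # start from a constant guess
errors = [np.mean((target - prediction) ** 2)]

for _ in range(50):
    residual = target - prediction                 # what the ensemble still gets wrong
    small_tree = DecisionTreeRegressor(max_depth=2).fit(x, residual)
    prediction = prediction + learning_rate * small_tree.predict(x)
    errors.append(np.mean((target - prediction) ** 2))

print('training MSE before boosting:', errors[0])
print('training MSE after 50 rounds:', errors[-1])
```

Each round only nudges the prediction, so many weak trees together do the work that one deep tree would otherwise overfit.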

In [None]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

In [None]:
#adaboost
ada_pipeline = Pipeline([('preprocess', encoder), ('model', AdaBoostClassifier())])

In [None]:
# fit and score
ada_pipeline.fit(X_train, y_train)
ada_pipeline.score(X_test, y_test)

In [None]:
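#gradient boosting classifier
grad_pipeline = Pipeline([('preprocess', encoder), ('model', GradientBoostingClassifier())])

In [None]:
#fit and score
grad_pipeline.fit(X_train, y_train)
grad_pipeline.score(X_test, y_test)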

### `xgboost`

In [None]:
# !pip install xgboost

In [None]:
import xgboost as xgb

In [None]:
# instantiate classifier
xboost = xgb.XGBClassifier(n_estimators = 10, max_depth = 2)

In [None]:
# build a pipeline around the small classifier defined above
boost_pipe = Pipeline([('encoder', encoder), 
                      ('model', xboost)])

In [None]:
# score it
boost_pipe.fit(X_train, y_train)
boost_pipe.score(X_train, y_train)

In [None]:
boost_pipe.score(X_test, y_test)

In [None]:
xgb.plot_importance(boost_pipe.named_steps['model'])

In [None]:
fig, ax = plt.subplots(figsize = (20, 20))
xgb.plot_tree(boost_pipe.named_steps['model'], ax = ax);

### Model Persistence

After building a model and identifying the optimal parameters, it's time to put it to use. The `pickle` module is one way to save and reuse Python objects, including sklearn models.  Below, we use the pickle module to save and load a list and an sklearn model.

In [None]:
import pickle

In [None]:
a = [1, 2, 3, 4]

In [None]:
#write out pickle file
with open('alist.pkl', 'wb') as f:
    pickle.dump(a, f)

In [None]:
#load in pickle file
with open('alist.pkl', 'rb') as f:
    thelist = pickle.load(f)

In [None]:
#here is the list again
thelist

In [None]:
#save the boosted model as boost.pkl
with open('boost.pkl', 'wb') as f:
    pickle.dump(boost_pipe, f)
heart.head()

In [None]:
X = heart[['Age', 'Sex', 'Slope']]
y = heart['AHD']
forest = RandomForestClassifier().fit(X, y)

In [None]:
import os
os.makedirs('streamlit_example', exist_ok = True)  #make sure the folder exists
with open('streamlit_example/forestmodel.pkl', 'wb') as f:
    pickle.dump(forest, f)
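
Loading a model works the same way as loading the list. The round trip below is a minimal sketch on synthetic data, written to a temporary folder so it leaves nothing behind; the `demo_` names are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data with three features
X_demo, y_demo = make_classification(n_samples=200, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=22)
demo_forest = RandomForestClassifier(n_estimators=20, random_state=22).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_forest.pkl')
    # write the fitted model out
    with open(path, 'wb') as f:
        pickle.dump(demo_forest, f)
    # read it back in as a new object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# the restored model makes the same predictions as the original
same = np.array_equal(demo_forest.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two cautions apply: only unpickle files you trust, since loading a pickle can execute arbitrary code, and load models with the same library versions that were used to save them.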

#### A Simple Application

Below is the code for a basic Streamlit application.  This is one way to deploy and share your models.  For more options, see the [Streamlit documentation](https://docs.streamlit.io/).

```python
import streamlit as st 
import numpy as np
import pickle

st.header('A Model for AHD')

st.write('Please enter the Age, Sex, and Slope information below.')

age = st.number_input('Age')
sex = st.number_input('Sex')
slope = st.number_input('Slope')

X = np.array([[age, sex, slope]])

with open('forestmodel.pkl', 'rb') as f:
    model = pickle.load(f)
    
pred = model.predict(X)

st.write(f'The model predicts {pred[0]}')
```

Once the app is saved as `app.py` in the same folder as `forestmodel.pkl`, you can run it from a terminal in that folder with:

```
streamlit run app.py
```