# Ensemble Methods Code Examples

Using scikit-learn to practice with ensemble methods

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Austin Pet Dataset

Trying to predict whether a given pet will be adopted

### Data Preparation

In [2]:
austin_df = pd.read_csv("austin.csv")

In [3]:
from shelter_preprocess import preprocess_df

In [4]:
# normally we would do preprocessing after the train-test split so that
# nothing learned from the test rows leaks into training; in this case
# the preprocessing is all just "hard-coded", so doing it first is fine
df = preprocess_df(austin_df)

In [5]:
df.head()

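|   | is_dog | age_in_days | is_female | adoption |
|---|--------|-------------|-----------|----------|
| 1 | 0 | 46 | 1 | 0 |
| 2 | 0 | 136 | 1 | 1 |
| 3 | 1 | 575 | 1 | 1 |
| 5 | 0 | 748 | 1 | 0 |
| 6 | 0 | 75 | 0 | 1 |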


In [23]:
df.shape

(908, 4)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
y = df["adoption"]
X = df.drop("adoption", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2020)

### Modeling

Stacking classifier composed of:

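1. Random forest classifier (ensemble of decision trees with 2 kinds of randomness: each tree sees a bootstrap sample of the rows, and each split considers a random subset of the features)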
2. kNN classifier
3. Logistic regression

The default final estimator, a logistic regression, is then trained on cross-validated predictions from these three models and aggregates their answers into the final prediction
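To make the aggregation step concrete, here is a rough hand-rolled version of the same idea. It is only a sketch under assumptions: it uses synthetic data from `make_classification` rather than the Austin dataset, all variable names are made up for illustration, and the real `StackingClassifier` differs in implementation details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the Austin dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=2020)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2020)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=2020),
    KNeighborsClassifier(),
    LogisticRegression(),
]

# Step 1: out-of-fold probability of class 1 from each base model, so the
# final estimator never sees predictions made on rows a model was trained on
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 2: the final estimator learns how to weight the base predictions
final_estimator = LogisticRegression().fit(meta_train, y_tr)

# Step 3: refit each base model on all training rows, then feed their
# test-set probabilities through the final estimator
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
manual_stack_accuracy = final_estimator.score(meta_test, y_te)
print(manual_stack_accuracy)
```

The final estimator only ever sees one column per base model, which is why stacking helps most when the base models make different mistakes.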

In [8]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [10]:
rfc = RandomForestClassifier(random_state=2020)

In [11]:
knn = KNeighborsClassifier()

In [12]:
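# penalty='none' turns off regularization; scikit-learn 1.2+ spells this
# penalty=None, and the string form was removed in 1.4
lr = LogisticRegression(penalty='none', random_state=2020)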

In [13]:
# similar to a pipeline, the stacking classifier wants you to give
# each model a human-readable name string in addition to just the
# model variable (initially I didn't notice this in the docs)
estimators = [
    ("random_forest", rfc),
    ("knn", knn),
    ("logistic_regression", lr)
]
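The name strings are more than labels: they are how you address each model inside the stack later. The sketch below uses synthetic data and made-up variable names (`demo_stack`, `X_demo`, `y_demo`), so treat it as an illustration rather than part of the original demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not one of the notebook's datasets)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=2020)

demo_stack = StackingClassifier(estimators=[
    ("random_forest", RandomForestClassifier(n_estimators=25, random_state=2020)),
    ("knn", KNeighborsClassifier()),
    ("logistic_regression", LogisticRegression()),
])

# Names address nested hyperparameters as <name>__<parameter>
demo_stack.set_params(knn__n_neighbors=7)
demo_stack.fit(X_demo, y_demo)

# Names also look up the fitted copy of each base model
fitted_knn = demo_stack.named_estimators_["knn"]
print(fitted_knn.n_neighbors)

# Coefficients the final estimator learned over the base models' predictions
print(demo_stack.final_estimator_.coef_)
```

The same `<name>__<parameter>` syntax is what a grid search over the stack would use.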

In [14]:
stack = StackingClassifier(estimators=estimators)

In [15]:
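# NOTE: `%time` on a line by itself times an empty statement, so the
# microsecond timings below do not measure the fit; put `%%time` on the
# first line of the cell to time the whole cell
%time
stack.fit(X_train, y_train)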

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs


StackingClassifier(cv=None,
                   estimators=[('random_forest',
                                RandomForestClassifier(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       max_samples=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                          

### Model Evaluation

In [19]:
stack.score(X_test, y_test)

0.7224669603524229

In [20]:
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.7004405286343612

In [21]:
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.7224669603524229

In [22]:
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.5198237885462555

### Summary for Austin Dataset

The Austin dataset has 908 rows and 4 columns (three features plus the adoption target).  The task is predicting whether an animal will be adopted based on:

1. Whether the animal is a dog
2. Whether the animal is female
3. The age of the animal in days

In this case, **the kNN model ties the stacking classifier for the best test accuracy (about 0.72), so there is no improvement when it is stacked with a random forest and logistic regression model**. The logistic regression on its own scores only about 0.52. The lack of improvement may be related to the fact that there are so few features.

Based on this simple analysis (no hyperparameter tuning, no feature engineering, no additional models attempted) we would most likely choose the kNN model as our final, best model, since it is simpler than the stack and scores the same on the test set.

## scikit-learn Breast Cancer Dataset

Trying to predict whether a breast tumor is malignant or benign, based on 30 features

### Data Preparation

In [24]:
from sklearn.datasets import load_breast_cancer

In [26]:
data = load_breast_cancer()

In [31]:
print(data.target_names)

['malignant' 'benign']


In [29]:
print(data.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [40]:
data.data.shape

(569, 30)

In [32]:
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=2020)

In [9]:
# I didn't do this for time reasons in the original demo, but in order to
# make sure the logistic regression can converge with all these features,
# we want to scale the data
from sklearn.preprocessing import StandardScaler

In [33]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
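Fitting the scaler on the training rows only, as above, keeps test-set statistics out of training. If you also want to cross-validate, one common way to keep that guarantee inside every fold is to wrap the scaler and the model in a pipeline. This is a sketch with made-up variable names (`X_bc`, `scaled_lr`, and so on), and it uses a default, regularized logistic regression rather than the unpenalized one from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, random_state=2020
)

# The pipeline refits the scaler on each training fold, so the held-out
# fold never influences the mean and standard deviation used for scaling
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(scaled_lr, X_bc_train, y_bc_train, cv=5)
print(cv_scores.mean())

# Fit on all training rows, then score once on the untouched test rows
scaled_lr.fit(X_bc_train, y_bc_train)
test_accuracy = scaled_lr.score(X_bc_test, y_bc_test)
print(test_accuracy)
```

A pipeline like this can also be passed as one of the named estimators in a stacking classifier, so that only the models that need scaling get it.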

### Modeling

We'll just go ahead and use the exact same stack from the previous example

(In practice you would want to do more EDA and choose the models more deliberately)

In [34]:
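# NOTE: as before, a bare `%time` line times an empty statement, so the
# timings below do not measure the fit; use `%%time` on the first line
# of the cell to time the whole cell
%time
stack.fit(X_train, y_train)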

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs


StackingClassifier(cv=None,
                   estimators=[('random_forest',
                                RandomForestClassifier(bootstrap=True,
                                                       ccp_alpha=0.0,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       max_samples=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                          

### Model Evaluation

In [35]:
stack.score(X_test, y_test)

0.9790209790209791

In [36]:
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.958041958041958

In [37]:
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.951048951048951

In [38]:
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.965034965034965

#### Looking at the Confusion Matrices

In [39]:
from sklearn.metrics import confusion_matrix

In [41]:
confusion_matrix(y_test, stack.predict(X_test))

array([[59,  2],
       [ 1, 81]])

In [42]:
confusion_matrix(y_test, rfc.predict(X_test))

array([[58,  3],
       [ 3, 79]])

In [43]:
confusion_matrix(y_test, knn.predict(X_test))

array([[57,  4],
       [ 3, 79]])

In [44]:
confusion_matrix(y_test, lr.predict(X_test))

array([[59,  2],
       [ 3, 79]])

### Summary for Breast Cancer Dataset

We are trying to predict whether a breast tumor is malignant or benign based on 30 different features, with a dataset of 569 records.

In this case, the **stacking classifier was more than the sum of its parts**.  Each of the estimators inside the stacking classifier got around 95-97% accuracy, and the stacking classifier got closer to 98% accuracy.

As also shown by the confusion matrices, the different models are getting things wrong in different ways.  With the usual convention that label 1 (benign) is the positive class, a false positive here is a malignant tumor predicted as benign. Of the three base models, the kNN model has the most false positives (4), and the logistic regression has the fewest (2).  This makes this combination of models + dataset a good candidate for stacking.
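To read the error counts off the confusion matrices without squinting, the small sketch below unpacks each one into named cells. The numbers are copied from the outputs above; the dictionary and variable names are made up for illustration, and it assumes the scikit-learn layout of rows as true labels and columns as predicted labels, in label order 0 = malignant, 1 = benign.

```python
# Confusion matrices copied from the notebook outputs:
# rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
matrices = {
    "stack": [[59, 2], [1, 81]],
    "random_forest": [[58, 3], [3, 79]],
    "knn": [[57, 4], [3, 79]],
    "logistic_regression": [[59, 2], [3, 79]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "false_positives": fp,  # malignant tumors predicted as benign
        "false_negatives": fn,  # benign tumors predicted as malignant
        "accuracy": (tn + tp) / total,
    }

for name, stats in summary.items():
    print(name, stats)
```

The accuracies computed this way match the `score` outputs above, which is a quick check that the matrices are being read in the right orientation.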

Therefore, based on this simple analysis (no hyperparameter tuning, no feature engineering, no additional models attempted) we would most likely choose the stacking model as our final, best model.

## Conclusion

Stacking classifiers can leverage multiple different models, which can sometimes result in better metrics than any individual model.

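In this case, we were able to run the analysis fairly quickly, since there was not a lot of data.  By default, a stacking classifier fits every base model on the full training set and again on each of 5 cross-validation folds, so if you are using significantly larger datasets, you may need to be more selective with which models you use, and a stacking classifier may end up being too slow.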
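If you want to check the cost on your own data, a plain wall-clock comparison is enough to get a feel for it. This is a sketch on synthetic data with made-up names (`fit_seconds`, `stack_demo`); the actual numbers depend entirely on your machine and dataset, so none are shown here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2020)


def fit_seconds(model):
    """Return the wall-clock seconds taken by a single call to model.fit."""
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return time.perf_counter() - start


forest = RandomForestClassifier(n_estimators=50, random_state=2020)
stack_demo = StackingClassifier(estimators=[
    ("random_forest", forest),
    ("knn", KNeighborsClassifier()),
    ("logistic_regression", LogisticRegression()),
])

forest_time = fit_seconds(forest)
stack_time = fit_seconds(stack_demo)
print(f"random forest alone: {forest_time:.3f} s")
print(f"stacking classifier: {stack_time:.3f} s")
```

In a notebook, putting `%%time` on the first line of a cell gives the same kind of measurement without the helper function.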