**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install lightgbm
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from IPython.display import display, Markdown

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("titanic.zip"), directory="data/titanic")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

In [None]:
#@title -- Loading and Preprocessing the Data -- { display-mode: "form" }
df = pd.read_csv("data/titanic/train.csv")
df_train, df_test = train_test_split(df, test_size=0.25,
                     stratify=df["Survived"], random_state=4)

categorical_inputs = ["Pclass", "Sex", "Embarked"]
numeric_inputs = ["Age", "SibSp", 'Parch', 'Fare']

output = "Survived"

input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

X_train = input_preproc.fit_transform(df_train[categorical_inputs+numeric_inputs])
Y_train = df_train[output].values.reshape(-1)

X_test = input_preproc.transform(df_test[categorical_inputs+numeric_inputs])
Y_test = df_test[output].values.reshape(-1)

## Heterogeneous Ensembles

In the previous notebook we explored methods for training homogeneous ensembles, i.e. ensembles made of several models of the same class (decision trees in our case). These methods were able to set up and train all the models automatically.

In this notebook we are going to explore heterogeneous ensembles, which need a bit more work: we will need to create each model separately and then join them into an ensemble using some kind of meta-classifier such as `sklearn.ensemble.VotingClassifier` (or `sklearn.ensemble.VotingRegressor` for regression). The reward for this extra work, however, should be better performance: models of different classes tend to make different errors, so a heterogeneous ensemble can often generalize better than a homogeneous one.
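
To see why differing errors matter, here is a minimal sketch of majority voting in plain Python. The labels and the three prediction lists are made up for illustration (they do not come from the Titanic data), and `accuracy` and `majority_vote` are hypothetical helpers rather than scikit-learn functions.

```python
from collections import Counter

# Toy ground truth and predictions from three hypothetical models.
# Each model is wrong on exactly two samples, but on different ones.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
preds_a = [0, 1, 1, 1, 0, 0, 1, 0]   # wrong on samples 0 and 1
preds_b = [1, 0, 0, 0, 0, 0, 1, 0]   # wrong on samples 2 and 3
preds_c = [1, 0, 1, 1, 1, 1, 1, 0]   # wrong on samples 4 and 5


def accuracy(y, p):
    """Fraction of positions where the prediction matches the label."""
    return sum(a == b for a, b in zip(y, p)) / len(y)


def majority_vote(*prediction_lists):
    """For each sample, return the label predicted by most models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]


individual_accuracies = [accuracy(y_true, p)
                         for p in (preds_a, preds_b, preds_c)]

# Models with different errors: every mistake is outvoted by the other two.
diverse_accuracy = accuracy(y_true, majority_vote(preds_a, preds_b, preds_c))

# Three copies of the same model: the errors coincide, so voting cannot help.
identical_accuracy = accuracy(y_true, majority_vote(preds_a, preds_a, preds_a))

print(individual_accuracies, diverse_accuracy, identical_accuracy)
```

In this constructed case the vote repairs every individual mistake, while three copies of the same model gain nothing. Real models rarely have errors that are this cleanly separated, so the improvement in practice is usually smaller.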

### An Ensemble Using `VotingClassifier`

We will now briefly illustrate how to use `VotingClassifier` on our task. We will start by creating a list of the models that we want to form the ensemble from. We can first create each of them separately and use cross-validation to do a little hyperparameter tuning on them, as we did in the previous notebooks.

We are going to use a small auxiliary function that displays the mean cross-validation score for both the train folds and the test folds, so that we can distinguish overfitting from underfitting.
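
As a rough sketch of how to read the two numbers, the example below compares an unrestricted decision tree with a depth-1 tree. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the cell runs on its own), and `mean_scores` is a hypothetical helper that returns the values instead of displaying them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    flip_y=0.1, random_state=0)


def mean_scores(model):
    """Return the mean train-fold and test-fold accuracy for a model."""
    scores = cross_validate(model, X_demo, y_demo, cv=5,
                            return_train_score=True)
    return scores["train_score"].mean(), scores["test_score"].mean()


# An unrestricted tree can memorize the training folds.
deep_train, deep_test = mean_scores(DecisionTreeClassifier(random_state=0))

# A depth-1 tree (a "stump") is too simple to memorize anything.
stump_train, stump_test = mean_scores(
    DecisionTreeClassifier(max_depth=1, random_state=0))

print("deep tree : train {:.3f}, test {:.3f}".format(deep_train, deep_test))
print("depth 1   : train {:.3f}, test {:.3f}".format(stump_train, stump_test))
```

A train score far above the test score, as for the unrestricted tree, points to overfitting. Train and test scores that are both low and close together point to underfitting instead.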



In [None]:
def crossval(model):
    scores = cross_validate(model, X_train, Y_train, cv=10, return_train_score=True)
    display(Markdown("train: {:.5f}; **test: {:.5f}**".format(
        scores['train_score'].mean(),
        scores['test_score'].mean()
    )))

---
#### Task 1: Tuning Hyperparams for Each Classifier Separately

**In the cells below, try experimenting with the classifiers' hyperparameters to find a setting which does reasonably well in cross-validation.** Hint: You can run `?NameOfTheClassifier` to display the classifier's docstring.

---


In [None]:
dtree_model = DecisionTreeClassifier(
    
    # ---
    
)
crossval(dtree_model)

In [None]:
lgbm_model = LGBMClassifier(
    
    # ---
    
)
crossval(lgbm_model)

In [None]:
knn_model = KNeighborsClassifier(
    
    # ---
    
)
crossval(knn_model)

In [None]:
svc_model = svm.SVC(
    
    # ---
    
)
crossval(svc_model)

In [None]:
logistic_model = LogisticRegression(
    
    # ---
    
)
crossval(logistic_model)

In [None]:
estimators = [
    ("dtree", dtree_model),
    ("lgbm", lgbm_model),
    ("knn", knn_model),
    ('svc', svc_model),
    ('logistic', logistic_model)
]

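We pass the list to `VotingClassifier`. We can also specify the voting mode (`voting="hard"`, the default, takes a majority vote over the predicted labels, while `voting="soft"` averages the predicted class probabilities) and other parameters, the meaning of which can be found in the documentation. Having constructed the classifier, we evaluate it using `crossval`, which fits it on the train folds of each split. Fitting the ensemble fits all the models it contains.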
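
The two voting modes can be compared with a short sketch like the one below. It is illustrative only: the data is synthetic (`make_classification`), the three models and their settings are arbitrary, and the `weights` values are made up to show the parameter rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

demo_estimators = [
    ("dtree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("logistic", LogisticRegression(max_iter=1000)),
]

# Hard voting: majority vote over the predicted class labels.
hard_vote = VotingClassifier(demo_estimators, voting="hard")

# Soft voting: weighted average of the predicted class probabilities.
# The weights are arbitrary illustrative values.
soft_vote = VotingClassifier(demo_estimators, voting="soft",
                             weights=[1, 1, 2])

hard_scores = cross_val_score(hard_vote, X_demo, y_demo, cv=5)
soft_scores = cross_val_score(soft_vote, X_demo, y_demo, cv=5)
print("hard: {:.3f}, soft: {:.3f}".format(hard_scores.mean(),
                                          soft_scores.mean()))

# Fitting the ensemble fits a clone of every contained model.
soft_vote.fit(X_demo, y_demo)
fitted_names = list(soft_vote.named_estimators_.keys())
demo_proba = soft_vote.predict_proba(X_demo)
```

Note that soft voting needs every model in the list to implement `predict_proba`, which `svm.SVC` does not do with its default settings. Which mode works better depends on the task, so it is worth cross-validating both.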



In [None]:
model = VotingClassifier(estimators)
crossval(model)

### An Ensemble Using `StackingClassifier`

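As a further alternative, you could use stacking instead of voting. There you would first train a bunch of models and then add their outputs to the dataset as further columns. Finally, you would stack another classifier on top, i.e. train it on the full dataset including the new columns. (In scikit-learn's `StackingClassifier`, the 2nd-level model is by default trained on the 1st-level predictions only; pass `passthrough=True` to also give it the original columns.)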

This 2nd-level model can make use of the 1st-level models' predictions, e.g. it could figure out which models might be best at predicting for a given kind of sample and weight the predictions accordingly, etc.

Here we are going to construct a `StackingClassifier` with our bunch of estimators at the 1st level and a logistic regression model at the 2nd level.
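
To make the idea more concrete, here is a hand-rolled sketch of stacking on synthetic data. It is a simplified illustration under our own assumptions (two arbitrary 1st-level models, `make_classification` data), not a reproduction of what `StackingClassifier` does internally. The 1st-level predictions are produced out-of-fold with `cross_val_predict`, so that no model predicts a sample it was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

level1_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# One new column per 1st-level model: the out-of-fold predicted
# probability of class 1 for every sample.
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5,
                      method="predict_proba")[:, 1]
    for model in level1_models
])

# Append the new columns to the original features.
X_stacked = np.hstack([X_demo, meta_features])

# The 2nd-level model is trained on the extended dataset.
final_model = LogisticRegression(max_iter=1000).fit(X_stacked, y_demo)
print(X_demo.shape, "->", X_stacked.shape)
```

To predict for new samples with this manual version, the 1st-level models would also have to be fitted on the full training set so that they can produce the extra columns. `StackingClassifier` handles that bookkeeping for us.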



In [None]:
model = StackingClassifier(
    estimators,
    final_estimator=LogisticRegression(C=10),
    cv=10
)
crossval(model)

### Testing

Now select the best ensemble and test it on our testing set. With any luck, its performance should be better than that of any of the component models.



In [None]:


# ---


# y_test (lowercase) must hold the chosen ensemble's predictions for X_test
accuracy_score(Y_test, y_test)