<img style="float: right;" src="../../assets/htwlogo.svg">

# Exercise: Using ensembles for classical tabular data

Ensembles are a general principle in machine learning and can be fitted to data in very different ways. Let's take a closer look at them from a practical perspective. We will analyze different ensembles and look into the performance of the individual classifiers as well as different techniques for fitting ensembles.

**Author**: _Erik Rodner_<br>

In [None]:
# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In the following, we use the breast cancer dataset from ``scikit-learn``. What is this dataset about? What are the features? Which study was performed to acquire the data?
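
A minimal sketch of how these questions could be explored directly from the dataset object. Calling `load_breast_cancer()` without `return_X_y=True` returns a container that also carries the feature names, the class names and a text description; the variable name `data` and the choice to print only the first 500 characters of the description are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer

# Without return_X_y=True we get a Bunch object that carries metadata
# in addition to the feature matrix and the labels.
data = load_breast_cancer()

print("feature matrix shape:", data.data.shape)   # (n_samples, n_features)
print("class names:", list(data.target_names))
print("first features:", list(data.feature_names[:5]))

# The description text summarizes the origin of the data and the features.
# Only the beginning is printed here to keep the output short.
print(data.DESCR[:500])
```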

In [None]:
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Exercise 1: Fit an ensemble of decision trees with bagging

1. Use decision trees as a base classifier (element of the ensemble)!
2. Fit an ensemble using bagging!
3. What is the performance on the test set?
4. Analyze the performance for relevant hyperparameters: maximum depth of the trees, random fraction of the dataset used for bagging, size of the ensemble, etc.
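
One possible way to approach the steps above, as a sketch rather than a reference solution. The helper `fit_bagging`, the grid values and the fixed seed are assumptions made for illustration. The base classifier is passed as the first positional argument because its keyword name differs between `scikit-learn` versions (`base_estimator` in older releases, `estimator` in newer ones).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data split as in the notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


def fit_bagging(max_depth=None, max_samples=1.0, n_estimators=50, seed=0):
    """Fit a bagging ensemble of decision trees (hypothetical helper)."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
    ensemble = BaggingClassifier(
        tree,                      # base classifier, passed positionally
        n_estimators=n_estimators,  # size of the ensemble
        max_samples=max_samples,    # fraction of the training set per tree
        random_state=seed,
    )
    return ensemble.fit(X_train, y_train)


# Steps 1-3: a single ensemble and its test accuracy
bagging = fit_bagging()
bagging_acc = accuracy_score(y_test, bagging.predict(X_test))
print(f"bagging test accuracy: {bagging_acc:.3f}")

# Step 4: a small, arbitrary grid over the hyperparameters named above
results = {}
for max_depth in (1, 3, None):
    for max_samples in (0.3, 1.0):
        for n_estimators in (5, 50):
            model = fit_bagging(max_depth, max_samples, n_estimators)
            acc = accuracy_score(y_test, model.predict(X_test))
            results[(max_depth, max_samples, n_estimators)] = acc
            print(
                f"max_depth={max_depth}, max_samples={max_samples}, "
                f"n_estimators={n_estimators}: {acc:.3f}"
            )
```

Note that this sketch evaluates every setting on the test set, as the exercise asks. For an actual model selection, a separate validation set or cross-validation on the training data would be the cleaner choice.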

In [None]:
# Your code comes here.

## Exercise 2: Boosting instead of bagging

Whereas bagging only uses random samples of the training data to increase the diversity of the ensemble,
boosting fits the individual base classifiers sequentially, with each new classifier concentrating
on the examples that were misclassified by the previous ones.

1. Use a decision stump (a tree with depth 1) as the base classifier!
2. Fit an ensemble using AdaBoost (learning rate 0.1)!
3. What is the performance on the test set?
4. Analyze the performance for relevant hyperparameters: learning rate, size of the ensemble, etc.
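
A possible sketch for these steps, not a reference solution. The ensemble size of 100, the seed and the learning-rate grid are assumptions chosen for illustration. The sketch uses `staged_score`, which reports the accuracy after each boosting round, so the effect of the ensemble size can be read off a single fitted model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data split as in the notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Steps 1-2: decision stumps boosted with AdaBoost, learning rate 0.1.
# The base classifier is passed positionally because its keyword name
# differs between scikit-learn versions.
stump = DecisionTreeClassifier(max_depth=1)
boosting = AdaBoostClassifier(
    stump, n_estimators=100, learning_rate=0.1, random_state=0
)
boosting.fit(X_train, y_train)

# Step 3: test accuracy of the full ensemble
boosting_acc = accuracy_score(y_test, boosting.predict(X_test))
print(f"AdaBoost test accuracy: {boosting_acc:.3f}")

# Step 4a: accuracy after each boosting round (effect of ensemble size)
staged_acc = list(boosting.staged_score(X_test, y_test))
for n in (1, 10, 50, len(staged_acc)):
    print(f"first {n:3d} stumps: {staged_acc[n - 1]:.3f}")

# Step 4b: an arbitrary grid of learning rates
lr_results = {}
for lr in (0.01, 0.1, 1.0):
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=100,
        learning_rate=lr,
        random_state=0,
    ).fit(X_train, y_train)
    lr_results[lr] = accuracy_score(y_test, model.predict(X_test))
    print(f"learning_rate={lr}: {lr_results[lr]:.3f}")
```

Depending on the installed `scikit-learn` version, `AdaBoostClassifier` may emit a deprecation warning about its `algorithm` parameter; the sketch leaves that parameter at its default.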

In [None]:
# Your code comes here.

## Exercise 3: In-depth analysis of ensembles

Write a function that takes a trained ensemble and a labeled test set (``X_test``, ``y_test``) and
performs the following analysis:
1. A bar plot showing the individual performance of each element of the ensemble (``estimators_``)
2. The overall performance of the ensemble on the test set
3. How do the two ensembles compare with each other?
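
The analysis described above could be sketched as follows. The function name `analyze_ensemble`, its `ax` and `title` parameters and the ensemble settings are assumptions made for illustration. The sketch relies on the class labels being 0 and 1, as in the breast cancer dataset, so that the predictions of the individual members can be compared with `y_test` directly.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def analyze_ensemble(ensemble, X_test, y_test, ax, title):
    """Plot per-member test accuracy and return it with the ensemble accuracy."""
    # Bagging ensembles may train members on feature subsets; AdaBoost
    # has no such attribute, so fall back to using all features.
    feature_sets = getattr(ensemble, "estimators_features_", None)
    member_accs = []
    for i, member in enumerate(ensemble.estimators_):
        X_member = X_test if feature_sets is None else X_test[:, feature_sets[i]]
        # Assumes labels are 0/1, so member predictions match y_test directly.
        member_accs.append(accuracy_score(y_test, member.predict(X_member)))
    overall = accuracy_score(y_test, ensemble.predict(X_test))

    ax.bar(np.arange(len(member_accs)), member_accs)
    ax.axhline(overall, color="black", linestyle="--",
               label=f"ensemble: {overall:.3f}")
    ax.set_xlabel("index in estimators_")
    ax.set_ylabel("test accuracy")
    ax.set_title(title)
    ax.legend(loc="lower right")
    return np.array(member_accs), overall


# Example usage with two illustrative ensembles
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X_train, y_train)
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=100, learning_rate=0.1, random_state=0,
).fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
bag_accs, bag_overall = analyze_ensemble(
    bagging, X_test, y_test, axes[0], "Bagging (full trees)")
boost_accs, boost_overall = analyze_ensemble(
    boosting, X_test, y_test, axes[1], "AdaBoost (stumps)")
fig.tight_layout()
plt.show()
```

When reading the AdaBoost plot, keep in mind that later stumps are fitted to reweighted training data, so their individual accuracy on the unweighted test set is not expected to match that of the first stumps.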

In [None]:
# Your code comes here.