# Introduction to Tree-based methods (with external dependencies)

Written by:
- Manuel Szewc (School of Physics, University of Cincinnati)
- Philip Ilten (School of Physics, University of Cincinnati)

This notebook implements decision trees, random forests, and gradient boosting. Much of it is based on [Aurélien Géron's lectures](https://github.com/ageron/handson-ml3).

In [None]:
import os
import sys

# To generate data and handle arrays
import numpy as np

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc("axes", labelsize=14)
mpl.rc("xtick", labelsize=12)
mpl.rc("ytick", labelsize=12)

In [None]:
import pandas as pd
import cv2  # pip install opencv-python

from scipy.stats import norm, multivariate_normal

# Useful classes for data manipulation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Useful classes for model evaluation and selection
from sklearn.model_selection import train_test_split, cross_val_predict, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    confusion_matrix,
    mean_squared_error,
)

# Baseline linear models
from sklearn.linear_model import LinearRegression, RidgeCV

# The necessary models
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

## Theory

Decision Trees, most commonly based on the **Classification and Regression Tree** (CART) framework, work by **recursively partitioning** the input space through a series of **binary decisions** in order to predict a target (either for classification or regression). Once an appropriate partitioning of the feature space is achieved, a new prediction is computed by following the set of binary splits down to a final node.

Under the CART framework, at a given decision step $m$ we split a **node** containing $N_{m}$ instances in two by scanning all available features $\vec{x}\in\mathbb{R}^{D}$. The tree looks for a feature $i$ and a cut $\theta_{m}$ on that **one** feature such that, when the data is divided into two new nodes according to $x_{i}\leq \theta_{m}$, the weighted sum of a given metric $H$ evaluated over the two candidate nodes is minimized:

$$G(\theta_{m},i)=\frac{N_{m,x_{i}\leq \theta_{m}}}{N_{m}}H(\text{instances with }x_{i}\leq \theta_{m})+\frac{N_{m,x_{i}> \theta_{m}}}{N_{m}}H(\text{instances with }x_{i}> \theta_{m})$$

That is,

$$\theta^{*}_{m},i^{*} = \arg \min_{\theta_{m},i}G(\theta_{m},i)$$

The resulting two nodes are called **children**. The initial node is called a **root** and the nodes which have no children, and are thus final, are called **leaves**.

For $K$-class classification problems, the usual metric $H$ is either the `Gini` impurity or the `entropy`, defined as

$$\mathrm{Gini} = \sum_{k=1}^{K}p_{m,k}(1-p_{m,k})$$

$$\mathrm{Entropy} = -\sum_{k=1}^{K}p_{m,k}\ln p_{m,k}$$

where $p_{m,k}$ is the fraction of instances in node $m$ that belong to class $k$.
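
As a quick numerical check of the two impurity definitions, the following minimal sketch evaluates them for a pure node and for a maximally mixed three-class node. The helper names `gini` and `entropy` are illustrative and not part of `sklearn`.

```python
import numpy as np


def gini(p):
    """Gini impurity for a vector of class fractions p (assumed to sum to 1)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))


def entropy(p):
    """Entropy with natural log; classes with p = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    # -p ln p written as p ln(1/p), which avoids printing a negative zero
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))


pure = [1.0, 0.0, 0.0]         # every instance belongs to one class
mixed = [1 / 3, 1 / 3, 1 / 3]  # maximally mixed node for K = 3

print(gini(pure), entropy(pure))    # both vanish for a pure node
print(gini(mixed), entropy(mixed))  # 2/3 and ln(3) for K = 3
```

Both metrics are zero for a pure node and largest when the classes are evenly mixed, which is why minimizing them pushes the tree towards pure leaves.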

For regression problems, $H$ is usually the mean squared error, defined as

$$\mathrm{MSE} = \frac{1}{N_{m}}\sum_{n=1}^{N_{m}}(y_{n}-\bar y_{m})^{2}$$

where $\bar y_{m}$ is the average target value in the node, $\bar y_{m} = \frac{1}{N_{m}}\sum_{n=1}^{N_{m}}y_{n}$.


Decision Trees are **greedy** algorithms: each binary partition is chosen by how well it performs at that step alone, without regard to any **global strategy**.
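
To make the greedy step concrete, here is a minimal brute-force sketch of a single CART split using the Gini impurity. It is an illustration under simplifying assumptions (candidate thresholds are midpoints between consecutive distinct feature values, and the names `gini_from_labels` and `best_split` are hypothetical), not the optimized routine that `sklearn` uses.

```python
import numpy as np


def gini_from_labels(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))


def best_split(X, y):
    """Return (G, feature index, threshold) minimizing the weighted impurity G."""
    n_samples, n_features = X.shape
    best = (np.inf, None, None)
    for i in range(n_features):
        values = np.unique(X[:, i])
        # Candidate thresholds: midpoints between consecutive distinct values
        for theta in (values[:-1] + values[1:]) / 2.0:
            left = X[:, i] <= theta
            g = (
                left.sum() * gini_from_labels(y[left])
                + (~left).sum() * gini_from_labels(y[~left])
            ) / n_samples
            if g < best[0]:
                best = (g, i, theta)
    return best


# Toy node: feature 0 separates the two classes perfectly, feature 1 does not
X_toy = np.array([[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
g_best, i_best, theta_best = best_split(X_toy, y_toy)
print(g_best, i_best, theta_best)
```

A full tree would apply the same search recursively to each child node until a stopping condition is met.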

The following lists are quoted from the `sklearn` documentation:

Some advantages of decision trees are:

* Simple to understand and to interpret. Trees can be visualized.

* Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Some tree and algorithm combinations support missing values.

* The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

* Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable.

* Able to handle multi-output problems.

* Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

* Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

* Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

* Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

* Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

* Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.

* The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

* There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

* Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
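
The XOR item in the list above can be illustrated with a small sketch on synthetic data (the data set and variable names are assumptions made for this example). The label depends only on whether the two coordinates have opposite signs, so no single axis-aligned cut is informative, and a greedy tree needs several levels before it can fit the pattern.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng_xor = np.random.RandomState(0)
X_xor = rng_xor.uniform(-1.0, 1.0, size=(200, 2))
# XOR-like target: 1 when the two coordinates have opposite signs
y_xor = (X_xor[:, 0] * X_xor[:, 1] < 0).astype(int)

scores = {}
for name, depth in [("depth 1", 1), ("depth 2", 2), ("unlimited", None)]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_xor, y_xor)
    scores[name] = clf.score(X_xor, y_xor)  # accuracy on the training points
    print(name, scores[name])
```

Comparing the training accuracies for the three depths shows how little a single cut can achieve on this kind of target.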

## Classification with Decision Trees

We'll use the `sklearn` implementation of decision trees.

Let's use a dataset as an example:

In [None]:
!wget -q -N https://gitlab.com/mcgen-ct/tutorials/-/raw/2025-cteq/.full/ml/datasets/season-1112.csv

In [None]:
# https://datahub.io/sports-data/english-premier-league and https://www.football-data.co.uk/notes.txt
df = pd.read_csv("season-1112.csv")

This file contains all matches of the 2011-2012 English Premier League season.
For each match, we have home and away goals both at half-time and at full-time. We also have the number of shots, shots on target, fouls, yellow cards, red cards, and betting odds from several well-known sites.

In [None]:
df.head()

In [None]:
len(df)

In [None]:
column_names = df.columns

In [None]:
column_names

We can play with this.

One possibility is trying to predict a winner based on all other features.

Let's make a copy before further processing:

In [None]:
# Optional column subset:
# [['HomeTeam','AwayTeam', 'FTHG','FTAG','FTR','HTHG', 'HTAG', 'HTR','HS','AS','HST', 'AST','HF','AF', 'HY', 'AY', 'HR', 'AR']]
df_copy = df.copy()

In [None]:
df_copy_train, df_copy_test = train_test_split(df_copy)

Let's explore the data to try and understand how a Decision Tree works.

To plot, let's look at only two features for now.

In [None]:
target_train = np.zeros(len(df_copy_train))
target_train[df_copy_train["FTR"] == "H"] = 1.0
target_train[df_copy_train["FTR"] == "D"] = 0.0
target_train[df_copy_train["FTR"] == "A"] = -1.0
features_train = np.asarray(df_copy_train[["FTHG", "FTAG"]])

target_test = np.zeros(len(df_copy_test))
target_test[df_copy_test["FTR"] == "H"] = 1.0
target_test[df_copy_test["FTR"] == "D"] = 0.0
target_test[df_copy_test["FTR"] == "A"] = -1.0
features_test = np.asarray(df_copy_test[["FTHG", "FTAG"]])

In [None]:
features_scatter = features_train + 0.1 * np.random.randn(len(features_train), 2)
plt.scatter(features_scatter[:, 0], features_scatter[:, 1], c=target_train, alpha=0.2)
plt.colorbar()
xvals = np.linspace(0.0, 8.0, 10)
plt.plot(xvals, xvals, linestyle="dotted", color="black", label="Draw")
plt.xlabel("Home team goals")
plt.ylabel("Away team goals")
plt.xlim(-0.2, 8.2)
plt.ylim(-0.2, 6.2)
plt.legend(loc="upper right")

Let's forget about all the DT hyperparameters for now and just train a naive classifier:

In [None]:
dt = DecisionTreeClassifier(max_depth=None)
dt.fit(features_train, target_train)

And let's see how it works:

In [None]:
dt.predict(np.asarray([1.0, 2.0]).reshape(1, -1))

In [None]:
xvals = np.linspace(0.0, 8.0, 100)
yvals = np.linspace(0.0, 6.0, 100)
X, Y = np.meshgrid(xvals, yvals)
Z = dt.predict(np.c_[X.ravel(), Y.ravel()]).reshape(X.shape)
plt.contourf(xvals, yvals, Z, levels=[-1.5, -0.5, 0.5, 1.5])  # contourf takes no label argument
plt.colorbar()
plt.scatter(features_train[:, 0], features_train[:, 1], c=target_train)
plt.plot(xvals, xvals, linestyle="dotted", color="black", label="Draw")
plt.xlabel("Home team goals")
plt.ylabel("Away team goals")
plt.xlim(-0.2, 8.2)
plt.ylim(-0.2, 6.2)
plt.legend(loc="upper right")

It's really overfitting! The decision boundaries look weird.

We wouldn't be able to tell from the confusion matrices, though:

In [None]:
print(confusion_matrix(target_train, dt.predict(features_train)))
print(confusion_matrix(target_test, dt.predict(features_test)))

But inspecting the fitted tree reveals it:

In [None]:
plt.figure(figsize=(20, 10))
tree.plot_tree(
    dt,
    filled=True,
    rounded=True,
    feature_names=["Home team goals", "Away team goals"],
    class_names=["Away team wins", "Draw", "Home team wins"],
)
plt.show()

The tree can also be exported as a `.dot` file and converted to `.png` with Graphviz.

In [None]:
tree.export_graphviz(
    dt,
    out_file="futbol.dot",
    feature_names=["Home team goals", "Away team goals"],
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# dot to png
if "google.colab" in sys.modules:
    !apt-get install graphviz

! dot -Tpng futbol.dot -o futbol.png

# Plot the image (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)
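
If Graphviz is not available, a plain-text rendering is one possible alternative. The sketch below uses `sklearn.tree.export_text` on a tiny made-up stand-in for the match data (the arrays `X_demo` and `y_demo` are assumptions for illustration, not the real dataset).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the match data: (home goals, away goals) -> result
X_demo = np.array([[2, 0], [0, 1], [1, 1], [3, 1], [0, 2], [2, 2]], dtype=float)
y_demo = np.array([1.0, -1.0, 0.0, 1.0, -1.0, 0.0])

clf_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# One line per node, indented by depth; leaves show the predicted class
report = export_text(clf_demo, feature_names=["Home team goals", "Away team goals"])
print(report)
```

The text output carries the same thresholds as the picture, which makes it convenient for logs or quick inspection in a terminal.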

A DT can only cut on individual features, so it looks at home and away goals separately. But we know that in football the only thing that matters is scoring more than the other team. Being restricted to single-feature cuts can be a problem for DTs (but it is also the reason why we do not need to preprocess the features to remove units).

We can do some feature engineering:

In [None]:
# Copy the selection so that adding a column does not trigger SettingWithCopyWarning
features_train = df_copy_train[["FTHG", "FTAG"]].copy()
features_train["Goal Difference"] = features_train["FTHG"] - features_train["FTAG"]
features_train = np.asarray(features_train)

features_test = df_copy_test[["FTHG", "FTAG"]].copy()
features_test["Goal Difference"] = features_test["FTHG"] - features_test["FTAG"]
features_test = np.asarray(features_test)

In [None]:
dt = DecisionTreeClassifier()
dt.fit(features_train, target_train)

In [None]:
tree.export_graphviz(
    dt,
    out_file="futbol.dot",
    feature_names=["Home", "Away", "Goal Difference"],
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# Convert the dot file to png
! dot -Tpng futbol.dot -o futbol.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

Much better!

For such an easy example, DTs are not particularly useful. Let's now look at all the features whose relation to the result is less obvious, and let's also remove the betting odds.

In [None]:
names_train = df_copy_train[["HomeTeam", "AwayTeam"]]
names_test = df_copy_test[["HomeTeam", "AwayTeam"]]

# Identifiers, full-time and half-time outcomes, and betting odds
columns_to_drop = [
    "Div", "Date", "Referee", "HomeTeam", "AwayTeam",
    "FTHG", "FTAG", "FTR", "HTHG", "HTAG", "HTR",
    "B365H", "B365D", "B365A", "BWH", "BWD", "BWA",
    "GBH", "GBD", "GBA", "IWH", "IWD", "IWA",
    "LBH", "LBD", "LBA", "SBH", "SBD", "SBA",
    "WHH", "WHD", "WHA", "SJH", "SJD", "SJA",
    "VCH", "VCD", "VCA", "BSH", "BSD", "BSA",
    "Bb1X2", "BbMxH", "BbAvH", "BbMxD", "BbAvD", "BbMxA", "BbAvA",
    "BbOU", "BbMx>2.5", "BbAv>2.5", "BbMx<2.5", "BbAv<2.5",
    "BbAH", "BbAHh", "BbMxAHH", "BbAvAHH", "BbMxAHA", "BbAvAHA",
]

# The same columns are removed from the training and the test set
features_train = df_copy_train.drop(columns_to_drop, axis=1)
features_test = df_copy_test.drop(columns_to_drop, axis=1)

In [None]:
features_train

We dropped the team names since we do not want to use them for prediction. The DT could use them if we encoded them as categorical variables.

Let's now look at the DT hyperparameters used to regularize the algorithm. First, we can choose whether Gini or entropy measures the impurity of a split. In most cases the two lead to similar trees; Gini tends to isolate the most frequent class in its own branch, and it is slightly faster to compute because it needs no logarithm.

Looking at the other hyperparameters, the options we have in `sklearn` are:

* `max_depth`: Maximum depth of the tree. The default, `None`, lets the tree grow until all leaves are pure or hold fewer than `min_samples_split` samples.
* `min_samples_split`: Minimum number of samples a node must contain to be split further.
* `min_samples_leaf`: Minimum number of samples a leaf (i.e., a terminal node) must contain.
* `min_weight_fraction_leaf`: Minimum weighted fraction of the samples a leaf must contain.
* `max_leaf_nodes`: Maximum number of leaves.
* `max_features`: Maximum number of features evaluated when searching for a split.

If you raise the minimum values or lower the maximum values, you are restricting the tree and regularizing the model.
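
To see how these restrictions act on the fitted tree, the sketch below compares depth, number of leaves and training accuracy for a few settings. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption for illustration; the exact numbers will differ on the football data).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=6": {"max_leaf_nodes": 6},
}

summary = {}
for name, kwargs in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **kwargs).fit(X_syn, y_syn)
    # (depth, number of leaves, accuracy on the training set)
    summary[name] = (clf.get_depth(), clf.get_n_leaves(), clf.score(X_syn, y_syn))
    print(name, summary[name])
```

Each restriction yields a smaller tree than the unconstrained fit, at the price of a lower training accuracy; whether that trade pays off has to be judged on held-out data.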

There are other regularization methods, such as pruning, in which you train without restrictions and then remove unnecessary nodes.
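
In `sklearn`, one available form of pruning is minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter. The following sketch, again on synthetic stand-in data (an assumption for illustration), computes the pruning path of an unrestricted tree and refits with each candidate `ccp_alpha`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Effective alphas at which successive subtrees are pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_syn, y_syn)
# Guard against tiny negative values caused by floating-point round-off
alphas = np.clip(path.ccp_alphas, 0.0, None)

leaves = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_syn, y_syn)
    leaves.append(clf.get_n_leaves())

print("leaves at smallest alpha:", leaves[0])
print("leaves at largest alpha:", leaves[-1])
```

In practice `ccp_alpha` would be chosen like any other hyperparameter, for example by cross-validation over the values in `alphas`.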

In [None]:
dt = DecisionTreeClassifier()
# dt?

Let's play:

In [None]:
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(features_train, target_train)
tree.export_graphviz(
    dt,
    out_file="futbol.dot",
    feature_names=features_train.columns,
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# Convert the dot file to png
! dot -Tpng futbol.dot -o futbol.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

In [None]:
dt = DecisionTreeClassifier(min_samples_leaf=50, max_depth=100)
dt.fit(features_train, target_train)
tree.export_graphviz(
    dt,
    out_file="futbol.dot",
    feature_names=features_train.columns,
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# Convert the dot file to png
! dot -Tpng futbol.dot -o futbol.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

In [None]:
dt = DecisionTreeClassifier(max_leaf_nodes=6)
dt.fit(features_train, target_train)
tree.export_graphviz(
    dt,
    out_file="futbol.dot",
    feature_names=features_train.columns,
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# Convert the dot file to png
! dot -Tpng futbol.dot -o futbol.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

Now let's optimize the hyperparameters properly with a cross-validated grid search:

In [None]:
dt = DecisionTreeClassifier()
params = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [10, 50],
    "max_leaf_nodes": [3, 4, 5],
}
grid = GridSearchCV(dt, params, cv=10, scoring="accuracy")
grid.fit(features_train, target_train)

In [None]:
grid.best_params_

In [None]:
model = grid.best_estimator_

In [None]:
tree.export_graphviz(
    model,
    out_file="futbol.dot",
    feature_names=features_train.columns,
    class_names=["Away team wins", "Draw", "Home team wins"],
    rounded=True,
    filled=True,
)

# Convert the dot file to png
! dot -Tpng futbol.dot -o futbol.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("futbol.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

In [None]:
predicts = cross_val_predict(model, features_train, target_train, cv=5)
print(confusion_matrix(target_train, predicts))
print(
    recall_score(
        np.where(target_train == -1.0, 1.0, 0.0), np.where(predicts == -1.0, 1.0, 0.0)
    )
)
print(
    recall_score(
        np.where(target_train == 0.0, 1.0, 0.0), np.where(predicts == 0.0, 1.0, 0.0)
    )
)
print(
    recall_score(
        np.where(target_train == 1.0, 1.0, 0.0), np.where(predicts == 1.0, 1.0, 0.0)
    )
)

print(
    accuracy_score(
        np.where(target_train == -1.0, 1.0, 0.0), np.where(predicts == -1.0, 1.0, 0.0)
    )
)
print(
    accuracy_score(
        np.where(target_train == 0.0, 1.0, 0.0), np.where(predicts == 0.0, 1.0, 0.0)
    )
)
print(
    accuracy_score(
        np.where(target_train == 1.0, 1.0, 0.0), np.where(predicts == 1.0, 1.0, 0.0)
    )
)

print(confusion_matrix(target_test, model.predict(features_test)))

In [None]:
print(model.predict_proba(features_train[:3]))
print(np.argmax(model.predict_proba(features_train[:3]), axis=1) - 1)
print(model.predict(features_train[:3]))

In [None]:
print(np.where(model.predict(features_train[:3]) == -1.0, 1.0, 0.0))
print(np.where(model.predict(features_train[:3]) == 0.0, 1.0, 0.0))
print(np.where(model.predict(features_train[:3]) == 1.0, 1.0, 0.0))

In [None]:
thresholds = [0.2, 0.4, 0.6, 0.8]
for threshold in thresholds:
    print("Threshold " + str(threshold) + "\n")
    y_pred_away = np.where(
        model.predict_proba(features_train)[:, 0] >= threshold, 1.0, 0.0
    )
    y_pred_draw = np.where(
        model.predict_proba(features_train)[:, 1] >= threshold, 1.0, 0.0
    )
    y_pred_home = np.where(
        model.predict_proba(features_train)[:, 2] >= threshold, 1.0, 0.0
    )
    print(accuracy_score(np.where(target_train == -1.0, 1.0, 0.0), y_pred_away))
    print(accuracy_score(np.where(target_train == 0.0, 1.0, 0.0), y_pred_draw))
    print(accuracy_score(np.where(target_train == 1.0, 1.0, 0.0), y_pred_home))
    print("\n")

In [None]:
print(
    recall_score(
        np.where(target_train == -1.0, 1.0, 0.0),
        np.where(np.argmax(model.predict_proba(features_train), axis=1) == 0, 1.0, 0.0),
    )
)
print(
    recall_score(
        np.where(target_train == 0.0, 1.0, 0.0),
        np.where(np.argmax(model.predict_proba(features_train), axis=1) == 1, 1.0, 0.0),
    )
)
print(
    recall_score(
        np.where(target_train == 1.0, 1.0, 0.0),
        np.where(np.argmax(model.predict_proba(features_train), axis=1) == 2, 1.0, 0.0),
    )
)

In [None]:
from sklearn.metrics import precision_recall_curve

class_names = ["Away team wins", "Draw", "Home team wins"]
for nclass_label, class_label in enumerate([-1.0, 0.0, 1.0]):
    # Use the probability column of the class currently treated as positive
    precision, recall, thresholds = precision_recall_curve(
        target_train,
        model.predict_proba(features_train)[:, nclass_label],
        pos_label=class_label,
    )
    plt.plot(precision[:-1], recall[:-1])
    plt.title("Precision-Recall curve for " + str(class_names[nclass_label]))
    plt.xlabel("Precision")
    plt.ylabel("Recall")
    # plt.xlim(0.0,1.0)
    # plt.ylim(0.0,1.0)
    plt.show()

In [None]:
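# precision, recall and thresholds hold the values of the last loop iteration ("Home team wins")
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.title("Precision and Recall vs Threshold (Home team wins)")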
plt.xlabel("Threshold")
plt.ylabel("Metric")
plt.legend()
plt.xlim(0.0, 1.0)
plt.ylim(0.0, 1.0)
plt.show()

It's hard to predict draws!

## Exercise:

Add the betting odds as features and optimize the DT. What do you find? Can you assess feature importance? In particular, is it more important to look at "in-game" information or at "pre-game" bets?
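
As a hint for the feature-importance part, a fitted `sklearn` tree exposes the attribute `feature_importances_`. The sketch below shows how it might be read, using synthetic data with made-up column names (`info_*` and `noise_*` are assumptions for illustration) instead of the match features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# With shuffle=False the informative columns come first, followed by pure noise
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
names = ["info_0", "info_1", "noise_0", "noise_1", "noise_2", "noise_3"]

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_syn, y_syn)

# Normalized total impurity reduction contributed by each feature
importances = pd.Series(clf.feature_importances_, index=names).sort_values(
    ascending=False
)
print(importances)
```

For the exercise, the same pattern with `index=features_train.columns` would rank the in-game statistics against the betting odds.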

## Regression

Let's see how DTs can be used for regression with a synthetic dataset:

In [None]:
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]


# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.xlabel("data")
plt.ylabel("target")
# plt.title("Decision Tree Regression")
plt.legend()
plt.show()

We can see how DTs operate by exploring different depths:

In [None]:
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5)
regr_1.fit(X, y)
regr_2.fit(X, y)
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

In [None]:
# Plot the results
plt.figure(figsize=(20, 10))
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.axvline(3.133, linestyle="dashed", color="black")
plt.axhline(0.571, linestyle="dashed", color="black")
plt.axhline(-0.667, linestyle="dashed", color="black")
plt.xlabel("X")
plt.ylabel("t")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

The tree decides on a predicted value by making cuts in feature space. `max_depth` limits how many successive cuts the algorithm can make along any branch. Let's see how the target is assigned:

In [None]:
np.mean(y)

In [None]:
mean_squared_error(np.mean(y) * np.ones(len(y)), y)

In [None]:
plt.figure(figsize=(20, 10))
tree.plot_tree(regr_1)
plt.show()

In [None]:
y_first_cut = y[(X[:, 0] <= 3.133)]
print(np.mean(y_first_cut), np.mean(y[(X[:, 0] > 3.133)]))
print(mean_squared_error(np.mean(y_first_cut) * np.ones(len(y_first_cut)), y_first_cut))

In [None]:
plt.figure(figsize=(20, 10))
tree.plot_tree(regr_2)
plt.show()

In [None]:
tree.export_graphviz(regr_1, out_file="reg_tree.dot", rounded=True, filled=True)

# Convert the dot file to png
! dot -Tpng reg_tree.dot -o reg_tree.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("reg_tree.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 10))
plt.imshow(img)

In [None]:
tree.export_graphviz(regr_2, out_file="reg_tree.dot", rounded=True, filled=True)

# Convert the dot file to png
! dot -Tpng reg_tree.dot -o reg_tree.png

# Plot the png (OpenCV loads images as BGR, so convert to RGB for matplotlib)
img = cv2.cvtColor(cv2.imread("reg_tree.png"), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(20, 20))
plt.imshow(img)

To select the cut, the regressor considers neither Gini nor entropy; it uses the MSE! Additionally, the value predicted in each node is the mean target of the training instances that fall in that node.
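
The statement about predicted values can be checked directly. This minimal sketch refits a shallow regressor on a noise-free version of the synthetic data (the `*_demo` names are introduced here for illustration) and compares each leaf's prediction with the mean target of the training points that land in it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng_demo = np.random.RandomState(1)
X_demo = np.sort(5 * rng_demo.rand(80, 1), axis=0)
y_demo = np.sin(X_demo).ravel()

reg = DecisionTreeRegressor(max_depth=2).fit(X_demo, y_demo)
leaf_ids = reg.apply(X_demo)  # index of the leaf each training point ends up in
pred = reg.predict(X_demo)

leaf_mean = {}
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean[leaf] = y_demo[in_leaf].mean()
    # The two printed numbers should coincide for every leaf
    print(leaf, leaf_mean[leaf], pred[in_leaf][0])
```

This holds for the default squared-error criterion; other criteria available in `DecisionTreeRegressor` can assign a different leaf value.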

## Exercise:

Let's consider the California housing dataset. Train a DT to predict the median house value. Optimize the hyperparameters and report the RMSE and a plot of predicted vs. actual house value.

In [None]:
HOUSING_PATH = "datasets"
import pandas as pd


def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
### from Geron

if "google.colab" in sys.modules:
    import tarfile

    DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml2/raw/master/"
    HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

    !mkdir -p ./datasets/housing

    def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
        os.makedirs(housing_path, exist_ok=True)
        tgz_path = os.path.join(housing_path, "housing.tgz")
        # urllib.request.urlretrieve(housing_url, tgz_path)
        !wget {HOUSING_URL} -P {housing_path}
        housing_tgz = tarfile.open(tgz_path)
        housing_tgz.extractall(path=housing_path)
        housing_tgz.close()

    # Run the function
    fetch_housing_data()

else:
    print("Not running on Google Colab. This cell did not do anything.")

In [None]:
housing_pre = load_housing_data()
from sklearn.model_selection import StratifiedShuffleSplit

housing_pre["income_cat"] = pd.cut(
    housing_pre["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=445543)
for train_index, test_index in split.split(housing_pre, housing_pre["income_cat"]):
    california_housing_train = housing_pre.loc[train_index]
    california_housing_test = housing_pre.loc[test_index]

for set_ in (california_housing_train, california_housing_test):
    set_.drop("income_cat", axis=1, inplace=True)

In [None]:
housing = california_housing_train.copy()

problematic_columns = ["median_house_value", "housing_median_age", "median_income"]
max_values = []
for col in problematic_columns:
    max_value = housing[col].max()
    print(
        f"{col}: {sum(housing[col] == max_value)} districts with {col} = {max_value} ({round(sum(housing[col] == max_value)/len(housing)*100,2)}%)."
    )
    max_values.append(max_value)

housing_clean = housing.copy()
for col, max_value in zip(problematic_columns, max_values):
    housing_clean = housing_clean[housing_clean[col] != max_value]

housing_test = california_housing_test.copy()
housing_test_clean = housing_test.copy()
for col, max_value in zip(problematic_columns, max_values):
    housing_test_clean = housing_test_clean[housing_test_clean[col] != max_value]

In [None]:
housing_clean["rooms_per_household"] = (
    housing_clean["total_rooms"] / housing_clean["households"]
)
housing_clean["bedrooms_per_room"] = (
    housing_clean["total_bedrooms"] / housing_clean["total_rooms"]
)
housing_clean["population_per_household"] = (
    housing_clean["population"] / housing_clean["households"]
)

housing_test_clean["rooms_per_household"] = (
    housing_test_clean["total_rooms"] / housing_test_clean["households"]
)
housing_test_clean["bedrooms_per_room"] = (
    housing_test_clean["total_bedrooms"] / housing_test_clean["total_rooms"]
)
housing_test_clean["population_per_household"] = (
    housing_test_clean["population"] / housing_test_clean["households"]
)

In [None]:
housing_labels = housing_clean["median_house_value"].copy()
housing_clean = housing_clean.drop(
    "median_house_value", axis=1
)  # drop labels for training set
housing_num = housing_clean.drop("ocean_proximity", axis=1)

housing_test_labels = housing_test_clean["median_house_value"].copy()
housing_test_clean = housing_test_clean.drop(
    "median_house_value", axis=1
)  # drop labels for test set
housing_test_num = housing_test_clean.drop("ocean_proximity", axis=1)

Some useful preprocessing:

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),  # other strategies are available
        ("std_scaler", StandardScaler()),
    ]
)

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ]
)

housing_prepared = full_pipeline.fit_transform(housing_clean)
housing_test_prepared = full_pipeline.transform(housing_test_clean)
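
One possible way to approach the rest of the exercise is sketched below. To keep it self-contained it runs on a synthetic regression problem from `make_regression` (an assumption standing in for `housing_prepared` and `housing_labels`), and the parameter grid is only an example, not a recommended choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared housing features and labels
X_syn, y_syn = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out set
y_pred = search.best_estimator_.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print(search.best_params_, rmse)
```

A scatter plot of `y_te` against `y_pred` then gives the predicted vs. actual comparison asked for in the exercise.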

## Another nice example

This example is taken verbatim from the `sklearn` documentation:

In [None]:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state

# Load the faces datasets
data, targets = fetch_olivetti_faces(return_X_y=True)

We can try to predict the lower half of a face using the upper half:

In [None]:
train = data[targets < 30]
test = data[targets >= 30]  # Test on independent people

# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces,))
test = test[face_ids, :]

n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, : (n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2 :]
X_test = test[:, : (n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2 :]

# Fit estimators
ESTIMATORS = {
    "Decision Trees": DecisionTreeRegressor(),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)

# Plot the completed faces
image_shape = (64, 64)

n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2.0 * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)

for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1, title="true faces")

    sub.axis("off")
    sub.imshow(
        true_face.reshape(image_shape), cmap=plt.cm.gray, interpolation="nearest"
    )

    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)

        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j, title=est)

        sub.axis("off")
        sub.imshow(
            completed_face.reshape(image_shape),
            cmap=plt.cm.gray,
            interpolation="nearest",
        )

plt.show()

## Bagging and Random Forests

Bagging is a particular type of **ensemble** training. Ensemble methods combine several estimators to build a better one, usually reducing variance and overfitting. In bagging, short for **bootstrap aggregating**, we bootstrap the data (resample it with replacement) and train one model per bootstrapped dataset. The overall model is then the average of the trained predictors (or a majority vote for classification).
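As a minimal sketch of the procedure just described, the cell below bags decision trees by hand on a toy dataset. The dataset, the number of models (`n_models`), and all variable names are illustrative choices that do not appear elsewhere in this notebook; in practice `sklearn.ensemble.BaggingClassifier` takes care of these steps.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative choice, independent of the datasets used below)
rng_bag = np.random.RandomState(0)
X_bag, t_bag = make_moons(n_samples=300, noise=0.25, random_state=0)

n_models = 25  # hypothetical ensemble size
trees = []
for _ in range(n_models):
    # Bootstrap: draw len(X_bag) indices with replacement
    idx = rng_bag.randint(0, len(X_bag), size=len(X_bag))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_bag[idx], t_bag[idx]))

# Aggregate: average the predicted class probabilities of all trees
proba_bagged = np.mean([tree.predict_proba(X_bag) for tree in trees], axis=0)
t_bagged = proba_bagged.argmax(axis=1)
acc_bagged = np.mean(t_bagged == t_bag)
print("Training accuracy of the bagged trees: {:.3f}".format(acc_bagged))
```

Each tree sees a different resampled dataset, so the individual trees differ and their average is smoother than any single one.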

A **RandomForest** is a bagging model where the base estimator is a Decision Tree and where, additionally, **feature bagging** is performed. That is, at each split of each tree, only a randomly chosen subset of the features is considered when selecting the optimal cut. This further increases the variability among the ensembled models. The added stochasticity can make the decision boundary more irregular, but usually improves performance.
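In `sklearn`, feature bagging is controlled by the `max_features` argument of `RandomForestClassifier`. The sketch below, on a synthetic dataset with 16 features (an arbitrary choice for illustration, as are the variable names), shows how many features each tree considers per split when `max_features="sqrt"`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features (illustrative choice)
X_fb, t_fb = make_classification(
    n_samples=200, n_features=16, n_informative=4, random_state=0
)

rf_fb = RandomForestClassifier(
    n_estimators=10,
    max_features="sqrt",  # features considered at each split
    bootstrap=True,  # bagging of the training samples
    random_state=0,
)
rf_fb.fit(X_fb, t_fb)

# Each fitted tree stores the resolved number of candidate features per split;
# with 16 features, "sqrt" gives 4 candidates.
n_considered = rf_fb.estimators_[0].max_features_
print("Features considered per split:", n_considered)
```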

Let's see this using an example.

In [None]:
# Let us define a couple of useful functions (if in colab, otherwise, take from utils module)

### From Rodrigo Diaz


def plot_clasi(
    x,
    t,
    ws,
    labels=[],
    xp=[-1.0, 1.0],
    thr=[
        0,
    ],
    spines="zero",
    equal=True,
    join_centers=False,
    margin=None,
):
    """
    Figure with the result of the linear fit
    """
    assert len(labels) == len(ws) or len(labels) == 0
    assert len(ws) == len(thr)

    if margin is None:
        margin = [False] * len(ws)
    else:
        margin = np.atleast_1d(margin)
    assert len(margin) == len(ws)

    if len(labels) == 0:
        labels = np.arange(len(ws)).astype("str")

    # Add the vector to the plot
    fig = plt.figure(figsize=(9, 7))
    ax = fig.add_subplot(111)

    xc1 = x[t == np.unique(t).max()]
    xc2 = x[t == np.unique(t).min()]

    ax.plot(*xc1.T, "ob", mfc="None", label="C1")
    ax.plot(*xc2.T, "or", mfc="None", label="C2")

    for i, w in enumerate(ws):
        # Compute vector norm
        wnorm = np.sqrt(np.sum(w**2))

        # Plot the weight vector
        x0 = 0.5 * (xp[0] + xp[1])
        ax.quiver(
            0,
            thr[i] / w[1],
            w[0] / wnorm,
            w[1] / wnorm,
            color="C{}".format(i + 2),
            scale=10,
            label=labels[i],
            zorder=10,
        )

        # Plot the perpendicular plane (decision boundary)
        xp = np.array(xp)
        yp = (thr[i] - w[0] * xp) / w[1]

        plt.plot(xp, yp, "-", color="C{}".format(i + 2))

        # Plot margin
        if margin[i]:
            for marg in [-1, 1]:
                ym = yp + marg / w[1]
                plt.plot(xp, ym, ":", color="C{}".format(i + 2))

    if join_centers:
        # Plot the line joining the centers of the two classes
        # (average over samples, i.e. axis=0, to get one centroid per class)
        mu1 = xc1.mean(axis=0)
        mu2 = xc2.mean(axis=0)
        ax.plot([mu1[0], mu2[0]], [mu1[1], mu2[1]], "o:k", mfc="None", ms=10)

    ax.legend(loc=0, fontsize=12)
    if equal:
        ax.set_aspect("equal")

    if spines is not None:
        for a in ["left", "bottom"]:
            ax.spines[a].set_position("zero")
        for a in ["top", "right"]:
            ax.spines[a].set_visible(False)

    return


def makew(fitter):
    # Get the weights
    w = fitter.coef_.copy()

    # Include the intercept
    if fitter.fit_intercept:
        w = np.hstack([fitter.intercept_.reshape(1, 1), w])

    # Normalization (disabled)
    # w /= np.linalg.norm(w)
    return w.T


# Utility from Geron
def plot_decision_regions(
    clf,
    X,
    t,
    axes=None,
    npointsgrid=500,
    legend=False,
    plot_training=True,
    figkwargs={"figsize": [12, 8]},
    contourkwargs={"alpha": 0.3},
):
    """
    Plot decision regions produced by classifier.

    :param Classifier clf: fitted sklearn classifier exposing a ``predict`` method
    """

    fig = plt.figure(**figkwargs)
    ax = fig.add_subplot(111)

    if axes is None:
        dx = X[:, 0].max() - X[:, 0].min()
        dy = X[:, 1].max() - X[:, 1].min()
        axes = [
            X[:, 0].min() - 0.1 * dx,
            X[:, 0].max() + 0.1 * dx,
            X[:, 1].min() - 0.1 * dy,
            X[:, 1].max() + 0.1 * dy,
        ]

    # Define grid for regions
    x1s = np.linspace(axes[0], axes[1], npointsgrid)
    x2s = np.linspace(axes[2], axes[3], npointsgrid)
    x1, x2 = np.meshgrid(x1s, x2s)

    # Make predictions on points of grid; reshape to grid format
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)

    # custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    ax.contourf(x1, x2, y_pred, **contourkwargs)

    #     custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
    #         plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)

    if plot_training:
        for label in np.unique(t):
            ax.plot(
                X[:, 0][t == label], X[:, 1][t == label], "o", label="C{}".format(label)
            )

    # Axis
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

    if legend:
        plt.legend(loc="lower right", fontsize=14)

    plt.show()
    return fig

## Example with Moons dataset

Let's use a simple non-linearly separable dataset to exemplify this:

In [None]:
from sklearn.datasets import make_moons

X, t = make_moons(n_samples=400, noise=0.25, random_state=1234)

In [None]:
plot_clasi(X, t, [], [], [], [], spines=None)

In [None]:
# Split
X, X_test, t, t_test = train_test_split(X, t, test_size=0.2)

## Simple RF training

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=2, n_jobs=6)
rf.fit(X, t)

In [None]:
fig = plot_decision_regions(
    rf,
    X,
    t,
    legend=True,
    npointsgrid=500,
    figkwargs={"figsize": [12, 8]},
    contourkwargs={"alpha": 0.5, "levels": 5, "cmap": "viridis"},
)

This will not generalize well...

In [None]:
from sklearn.metrics import accuracy_score

y_train = rf.predict(X)
y_test = rf.predict(X_test)
print("Accuracy (train): {:.3f}".format(accuracy_score(t, y_train)))
print("Accuracy (test): {:.3f}".format(accuracy_score(t_test, y_test)))

Below you can solve this by optimizing the Random Forest using `GridSearchCV`.

Another feature of RFs is their interpretability. Since they are built from a white-box algorithm, Decision Trees, we can inspect them to study the learned properties. In particular, we can gauge feature importance by inspecting the fitted DTs. For a given DT, the most important features tend to sit closer to the root. We can then compute statistics of the feature importances by averaging over the fitted DTs.

`sklearn` exposes this through the `feature_importances_` attribute:

In [None]:
print(rf.feature_importances_)
for name, score in zip(["x_1", "x_2"], rf.feature_importances_):
    print(name, score)
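The averaging over trees mentioned above can be checked directly. The sketch below refits a small forest on a fresh moons sample (variable names are illustrative) and compares the per-tree importances with the forest-level attribute. As far as we are aware, current `sklearn` versions compute the forest value as the mean over the individual trees, which is what the comparison assumes.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X_fi, t_fi = make_moons(n_samples=400, noise=0.25, random_state=1234)
rf_fi = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
rf_fi.fit(X_fi, t_fi)

# One row per fitted tree, one column per feature
per_tree = np.array([tree.feature_importances_ for tree in rf_fi.estimators_])

print("mean over trees:", per_tree.mean(axis=0))
print("std over trees: ", per_tree.std(axis=0))
print("forest value:   ", rf_fi.feature_importances_)
```

The spread over trees gives a rough idea of how stable the importance ranking is.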

## Exercise

Train an optimized Random Forest by exploring the possible hyperparameters. Compare with a simpler classifier like an optimized polynomial Logistic Regressor or an optimized Decision Tree.

## Boosting and Boosted Decision Trees

Boosting methods are another example of **ensemble** methods. They also combine different instances of a base estimator. However, in boosting each successive instance learns both from the data and from the previous estimator. That is, it learns to "correct" the previous estimator.

* It usually **greatly improves** the performance of **weak predictors**.
* It's not easily parallelizable.
* It's greedy. Each step seeks to be as good as possible without thinking of global strategies.


We'll see two types of boosting: AdaBoosting and GradientBoosting.

In [None]:
from sklearn.ensemble import (
    AdaBoostClassifier,
    AdaBoostRegressor,
    GradientBoostingRegressor,
    GradientBoostingClassifier,
)

## AdaBoost

In AdaBoost, at each step the data points are reweighted according to the performance of the previous estimator (all weights start out equal).

That is, for steps $i=1,\dots,N$:
1. We train a predictor $h_i$ on the weighted data.
2. We update the per-sample weights $w_{n,i}=f(w_{n,i-1},h_{i})$, increasing the weights of the points that $h_i$ gets wrong.

The final predictor combines all $N$ predictors.
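The steps above can be sketched by hand for two classes. The update rule used below (the discrete AdaBoost rule with labels in $\{-1,+1\}$) is one common choice for $f$ and is an assumption of this sketch, as are the dataset, `n_steps`, `lr`, and the variable names; `sklearn`'s implementation differs in its details.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ab, y_ab = make_moons(n_samples=300, noise=0.30, random_state=0)
y_pm = 2 * y_ab - 1  # relabel the classes as -1 / +1

n_steps, lr = 20, 1.0  # illustrative values
w = np.full(len(X_ab), 1.0 / len(X_ab))  # equal initial weights
stumps, alphas = [], []
for _ in range(n_steps):
    # 1. Train a weak predictor on the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_ab, y_pm, sample_weight=w)
    miss = stump.predict(X_ab) != y_pm
    err = np.sum(w[miss]) / np.sum(w)  # weighted error of this stump
    if err <= 0 or err >= 0.5:
        break  # stump is perfect or no better than chance
    alpha = lr * np.log((1 - err) / err)  # weight of this stump in the vote
    # 2. Increase the weights of the misclassified points and renormalize
    w = w * np.exp(alpha * miss)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final predictor: sign of the weighted vote of all stumps
score = sum(a * s.predict(X_ab) for a, s in zip(alphas, stumps))
acc_ab = np.mean(np.sign(score) == y_pm)
print("Stumps used: {}, training accuracy: {:.3f}".format(len(stumps), acc_ab))
```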

The two `sklearn` classes are `AdaBoostClassifier` and `AdaBoostRegressor`. Their AdaBoost-specific hyperparameters are:

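* `estimator`: The weak predictor used. By default, it is a decision stump (a `DecisionTreeClassifier` with `max_depth=1`) for `AdaBoostClassifier` and a `DecisionTreeRegressor` with `max_depth=3` for `AdaBoostRegressor`. Older `sklearn` versions call this parameter `base_estimator`.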
* `n_estimators`: How many estimators to use.
* `learning_rate`: The learning rate when taking a new estimator. The lower the learning_rate, the more estimators are needed to fit the data. This is a regularizer for the algorithm.
* `loss`: Exclusively for regression. This is the loss function used by the algorithm. The options are `linear`, `square`, and `exponential`.

From the fitted class, you can obtain:

* `estimators_`: The list of estimators.
* `estimator_weights_`: The weights of each estimator. 1 for `SAMME.R` classification, not equal to 1 for regression and classification with `SAMME`.
* `estimator_errors_`: The error of each estimator when evaluated on the dataset. This is not the error when applying the ensemble.
* `feature_importances_`: The importance of the features.

In addition, AdaBoost provides `staged_*` methods (such as `staged_predict`) that allow the ensemble to be evaluated at each step as if it were complete.

## Example

In [None]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from matplotlib.colors import ListedColormap


def plot_decision_boundary(
    clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True
):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(["#fafab0", "#9898ff", "#a0faa0"])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(["#7d7d58", "#4c4c7f", "#507d50"])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

In [None]:
ridgeCV = RidgeCV()
ridgeCV.fit(X_train, y_train, sample_weight=np.where(X_train[:, 1] > -0.5, 100.0, 1.0))

plot_decision_boundary(ridgeCV, X, y)
# plt.axhline(-0.5)

In [None]:
n_estimators = 300
# AdaBoostClassifier(estimator=SVC/DT/Perceptron/LogisticRegression, n_estimators=how many to use, algorithm=which algorithm to use, learning_rate=..., ...)
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=n_estimators,
    algorithm="SAMME",
    learning_rate=0.5,
    random_state=42,
)

ada_clf.fit(X_train, y_train)
plot_decision_boundary(ada_clf, X, y)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, cross_val_predict

preds = cross_val_predict(ada_clf, X_train, y_train)
cm = confusion_matrix(y_train, preds)  # ,ada_clf.predict(X_train))
print(cm)
print(accuracy_score(y_train, preds))  # ada_clf.predict(X_test)))

Let's look at the individual estimators, their weights and loss function, computed as

$$ \text{Loss} = \sum_{i}w_{i}\text{Loss}_{i}$$

In [None]:
print(np.asarray(ada_clf.estimators_).shape)

In [None]:
from sklearn.tree import plot_tree

plot_tree(ada_clf.estimators_[1])

In [None]:
print(ada_clf.estimator_weights_.shape)
plt.plot(ada_clf.estimator_weights_, "r.")
plt.xlabel("Estimator")
plt.ylabel(r"Weight $\alpha$")

In [None]:
print(ada_clf.estimator_errors_.shape)
plt.plot(ada_clf.estimator_errors_, "ro")
plt.xlabel("Estimator")
plt.ylabel("Loss")

We can explore explicitly the evolution as we add estimators:

In [None]:
for nest, est_pred in enumerate(ada_clf.staged_predict(X_train[:2])):
    print(nest, est_pred[:2])

In [None]:
from sklearn.metrics import zero_one_loss  # counts the misclassified fraction

err_train = np.zeros((n_estimators, 2))
for i, y_pred in enumerate(ada_clf.staged_predict(X_train)):
    err_train[i, 0] = zero_one_loss(y_pred, y_train)
    err_train[i, 1] = accuracy_score(y_pred, y_train)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

ax[0].plot(np.arange(n_estimators) + 1, err_train[:, 0])
ax[1].plot(np.arange(n_estimators) + 1, err_train[:, 1])

ax[0].set_xlabel("# Estimators")
ax[1].set_xlabel("# Estimators")
ax[0].set_ylabel("Zero One Loss")
ax[1].set_ylabel("Accuracy Score")
plt.show()

## Learning rate effect in convergence

A nice example from Geron:

In [None]:
from sklearn.svm import SVC

m = len(X_train)

fix, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
for subplot, learning_rate in ((0, 1), (1, 0.5)):
    sample_weights = np.ones(m)
    plt.sca(axes[subplot])
    for i in range(5):
        svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)
        sample_weights[y_pred != y_train] *= 1 + learning_rate
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate = {}".format(learning_rate), fontsize=16)
    if subplot == 0:
        plt.text(-0.7, -0.65, "1", fontsize=14)
        plt.text(-0.6, -0.10, "2", fontsize=14)
        plt.text(-0.5, 0.10, "3", fontsize=14)
        plt.text(-0.4, 0.55, "4", fontsize=14)
        plt.text(-0.3, 0.90, "5", fontsize=14)
    else:
        plt.ylabel("")

plt.show()

We can do this for Decision Trees and see the overall evolution:

In [None]:
m = len(X_train)

learnings = [1.0, 0.5]
fix, axes = plt.subplots(
    nrows=5, ncols=len(learnings), figsize=(5 * len(learnings), 25), sharey=True
)
for subplot, learning_rate in enumerate(learnings):
    ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=5,
        algorithm="SAMME",
        learning_rate=learning_rate,
        random_state=42,
    )
    ada_clf.fit(X_train, y_train)
    y_pred_train = np.zeros((5, X_train.shape[0]))
    for nest_train, est_dec_train in enumerate(ada_clf.staged_predict(X_train)):
        y_pred_train[nest_train] = est_dec_train
    # axes=[-1.5, 2.45, -1, 1.5]
    alpha = 0.5
    x1s = np.linspace(-1.5, 2.45, 100)
    x2s = np.linspace(-1, 1.5, 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    for nest, est_dec in enumerate(ada_clf.staged_predict(X_new)):
        y_pred_estimator_only = (
            ada_clf.estimators_[nest].predict(X_new).reshape(x1.shape)
        )
        y_pred = est_dec.reshape(x1.shape)
        custom_cmap2 = ListedColormap(["#7d7d58", "#4c4c7f", "#507d50"])
        axes[nest, subplot].plot(
            X_train[:, 0][y_train == 0], X_train[:, 1][y_train == 0], "yo", alpha=alpha
        )
        axes[nest, subplot].plot(
            X_train[:, 0][y_train == 1], X_train[:, 1][y_train == 1], "bs", alpha=alpha
        )
        axes[nest, subplot].plot(
            X_train[:, 0][y_pred_train[nest] != y_train],
            X_train[:, 1][y_pred_train[nest] != y_train],
            "rx",
            alpha=1.0,
        )
        axes[nest, 0].set_ylabel(r"$x_2$", fontsize=18, rotation=0)
        axes[nest, subplot].contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
        # axes[nest,subplot].contour(x1, x2, y_pred_estimator_only, cmap='plasma', alpha=0.8)
        axes[nest, subplot].set_title(
            "learning_rate = {}, estimator ={}".format(learning_rate, nest + 1),
            fontsize=16,
        )
    #      plt.show()
    axes[-1, subplot].set_xlabel(r"$x_1$", fontsize=18)
plt.show()

And the individual cuts:

In [None]:
m = len(X_train)

learnings = [1.0, 0.5]
fix, axes = plt.subplots(
    nrows=5, ncols=len(learnings), figsize=(5 * len(learnings), 25), sharey=True
)
for subplot, learning_rate in enumerate(learnings):
    ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=5,
        algorithm="SAMME",
        learning_rate=learning_rate,
        random_state=42,
    )
    ada_clf.fit(X_train, y_train)
    y_pred_train = np.zeros((5, X_train.shape[0]))
    for nest_train, est_dec_train in enumerate(ada_clf.staged_predict(X_train)):
        y_pred_train[nest_train] = est_dec_train
    # axes=[-1.5, 2.45, -1, 1.5]
    alpha = 0.5
    x1s = np.linspace(-1.5, 2.45, 100)
    x2s = np.linspace(-1, 1.5, 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    for nest, est_dec in enumerate(ada_clf.estimators_):
        y_pred = est_dec.predict(X_new).reshape(x1.shape)
        custom_cmap2 = ListedColormap(["#7d7d58", "#4c4c7f", "#507d50"])
        axes[nest, subplot].plot(
            X_train[:, 0][y_train == 0], X_train[:, 1][y_train == 0], "yo", alpha=alpha
        )
        axes[nest, subplot].plot(
            X_train[:, 0][y_train == 1], X_train[:, 1][y_train == 1], "bs", alpha=alpha
        )
        axes[nest, subplot].plot(
            X_train[:, 0][y_pred_train[nest] != y_train],
            X_train[:, 1][y_pred_train[nest] != y_train],
            "rx",
            alpha=1.0,
        )
        axes[nest, 0].set_ylabel(r"$x_2$", fontsize=18, rotation=0)
        axes[nest, subplot].contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
        axes[nest, subplot].set_title(
            "learning_rate = {}, estimator ={}".format(learning_rate, nest + 1),
            fontsize=16,
        )
    #      plt.show()
    axes[-1, subplot].set_xlabel(r"$x_1$", fontsize=18)
plt.show()

## Regression example

This is an example of how to use `AdaBoostRegressor`. It's very similar; you only need to specify the `loss`.

In [None]:
# Create the dataset
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=4)

regr_2 = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4),
    loss="square",
    n_estimators=300,
    random_state=rng,
)

regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)

# Plot the results
plt.figure()
plt.scatter(X, y, c="k", label="training samples")
plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)
plt.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Boosted Decision Tree Regression")
plt.legend()
plt.show()

In [None]:
print(regr_2.estimator_weights_.shape)
plt.plot(regr_2.estimator_weights_, "ro")
plt.xlabel("Iteration")
plt.ylabel("Weight")

In [None]:
print(regr_2.estimator_errors_.shape)
plt.plot(regr_2.estimator_errors_, "ro")
plt.xlabel("Iteration")
plt.ylabel("Error")

## GradientBoosting

Gradient Boosting follows a different iterative procedure than AdaBoost.

Instead of correcting based on weights, Gradient Boosting improves by effectively training each estimator on the residuals of the previous estimators. For predictors $h_{m}$ with $m=1,\dots,M$, the prediction at step $m$ is

$$\hat{y}^{m}_{n} = F_{m}(x_{n})$$

with $F_{m}$ built iteratively from the first $m$ estimators

$$F_{m}(x)=F_{m-1}(x)+h_{m}(x)=\sum_{m'=1}^{m}h_{m'}(x)$$

Thus, $h_{m}$ is learned by optimizing

$$h_{m}=\text{arg }\min_{h} \mathcal{L}_{m}(h) = \text{arg }\min_{h} \sum_{n=1}^{N}\mathcal{l}(y_{n},F_{m-1}(x_{n})+h(x_{n}))$$

which can be efficiently approximated via a first-order Taylor expansion in $h$

$$\mathcal{l}(y_{n},F_{m-1}(x_{n})+h(x_{n})) \approx \mathcal{l}(y_{n},F_{m-1}(x_{n}))+h(x_{n})\frac{\partial \mathcal{l}(y_{n},F(x_{n}))}{\partial F}|_{F=F_{m-1}}$$

Since the first term does not depend on $h$, we have that

$$h_{m}=\text{arg }\min_{h}\sum_{n=1}^{N}h(x_{n})\frac{\partial \mathcal{l}(y_{n},F(x_{n}))}{\partial F}|_{F=F_{m-1}}$$
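As a worked special case (assuming the squared loss, which the derivation above does not fix), the link to the residuals mentioned at the start of this section can be made explicit:

$$\mathcal{l}(y_{n},F(x_{n}))=\frac{1}{2}\left(y_{n}-F(x_{n})\right)^{2}\quad\Rightarrow\quad -\frac{\partial \mathcal{l}(y_{n},F(x_{n}))}{\partial F}\Big|_{F=F_{m-1}}=y_{n}-F_{m-1}(x_{n})$$

so the negative gradient at step $m$ is exactly the residual of the current ensemble. In practice, $h_{m}$ is obtained by fitting a regressor to these negative gradients, which for this loss amounts to fitting the residuals.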

## An example

From Geron, we can get a nice qualitative picture of how GradientBoosting works (it's not exactly this, but it's similar)

In [None]:
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)  # first estimator residuals
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)  # second estimator residuals
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

And we can predict by aggregating the estimators

In [None]:
X_new = np.array([[0.8]])
y_pred = sum(
    tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3)
)  # sum of predictions from all trees
y_pred

Let's plot this:

In [None]:
def plot_predictions(
    regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None
):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)


plt.figure(figsize=(11, 11))

plt.subplot(321)
plot_predictions(
    [tree_reg1],
    X,
    y,
    axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h_1(x_1)$",
    style="g-",
    data_label="Training set",
)
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions(
    [tree_reg1],
    X,
    y,
    axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1)$",
    data_label="Training set",
)
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions(
    [tree_reg2],
    X,
    y2,
    axes=[-0.5, 0.5, -0.5, 0.5],
    label="$h_2(x_1)$",
    style="g-",
    data_style="k+",
    data_label="Residuals",
)
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions(
    [tree_reg1, tree_reg2],
    X,
    y,
    axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1) + h_2(x_1)$",
)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions(
    [tree_reg3],
    X,
    y3,
    axes=[-0.5, 0.5, -0.5, 0.5],
    label="$h_3(x_1)$",
    style="g-",
    data_style="k+",
)
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions(
    [tree_reg1, tree_reg2, tree_reg3],
    X,
    y,
    axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$",
)
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()
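The three hand-fitted trees above generalize naturally to a loop. The sketch below is a simplified version of the idea and not `sklearn`'s exact algorithm: it assumes the squared loss, starts from a constant prediction (the mean of the targets), and scales each tree by a shrinkage factor that plays the role of the learning rate. The names `shrinkage`, `n_trees`, and `predict_gb` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same kind of toy data as above, generated with a local random state
rng_gb = np.random.RandomState(42)
X_gb = rng_gb.rand(100, 1) - 0.5
y_gb = 3 * X_gb[:, 0] ** 2 + 0.05 * rng_gb.randn(100)

shrinkage = 0.5  # plays the role of the learning rate
n_trees = 10

F_pred = np.full_like(y_gb, y_gb.mean())  # constant initial prediction
mse_history = [np.mean((y_gb - F_pred) ** 2)]
trees_gb = []
for _ in range(n_trees):
    residuals = y_gb - F_pred  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_gb, residuals)
    F_pred += shrinkage * tree.predict(X_gb)  # add the damped correction
    trees_gb.append(tree)
    mse_history.append(np.mean((y_gb - F_pred) ** 2))


def predict_gb(X_query):
    """Ensemble prediction: initial constant plus the damped tree corrections."""
    return y_gb.mean() + shrinkage * sum(t.predict(X_query) for t in trees_gb)


print("Training MSE before / after:", mse_history[0], mse_history[-1])
```

With a shrinkage below 1, each tree corrects only part of the remaining residual, so more trees are needed but the fit evolves more smoothly.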

## `sklearn` implementation

The two classes are `GradientBoostingClassifier` and `GradientBoostingRegressor`.

In [None]:
# GradientBoostingClassifier?

In [None]:
# GradientBoostingRegressor?

In [None]:
gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=50, learning_rate=1.0, random_state=42
)
gbrt.fit(X, y)

gbrt_slow = GradientBoostingRegressor(
    max_depth=2, n_estimators=50, learning_rate=0.1, random_state=42
)
gbrt_slow.fit(X, y)

In [None]:
fix, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)

plt.sca(axes[0])
plot_predictions(
    [gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions"
)
plt.title(
    "learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators),
    fontsize=14,
)
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.sca(axes[1])
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title(
    "learning_rate={}, n_estimators={}".format(
        gbrt_slow.learning_rate, gbrt_slow.n_estimators
    ),
    fontsize=14,
)
plt.xlabel("$x_1$", fontsize=16)

plt.show()

## Optimal number of trees

The number of estimators needs to be optimized. Too few, we underfit. Too many, we overfit. A nice way to find the optimal number of trees is by implementing an **early stopping** algorithm, which evaluates the predictor on a validation dataset to assess performance. Once the validation metric worsens, we can stop and get back to the best estimator.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

errors = [
    np.sqrt(mean_squared_error(y_val, y_pred)) for y_pred in gbrt.staged_predict(X_val)
]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(
    max_depth=2, n_estimators=bst_n_estimators, random_state=42
)
gbrt_best.fit(X_train, y_train)

In [None]:
print(bst_n_estimators)

In [None]:
min_error = np.min(errors)
plt.figure(figsize=(10, 4))

plt.subplot(121)
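# errors[i] corresponds to an ensemble of i + 1 trees, so plot against 1..N
plt.plot(np.arange(1, len(errors) + 1), errors, "b.-")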
plt.plot([bst_n_estimators, bst_n_estimators], [0, min_error], "k--")
plt.plot([0, 120], [min_error, min_error], "k--")
plt.plot(bst_n_estimators, min_error, "ko")
plt.text(bst_n_estimators, min_error * 1.2, "Minimum", ha="center", fontsize=14)
plt.axis([40, 120, 0, 0.1])
plt.xlabel("Number of trees")
plt.ylabel("Error", fontsize=16)
plt.title("Validation error", fontsize=14)

plt.subplot(122)
plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("Best model (%d trees)" % bst_n_estimators, fontsize=14)
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.xlabel("$x_1$", fontsize=16)

plt.show()

As shown, **early stopping** avoids overfitting (to a certain degree). However, in the code above we still train all 120 estimators before picking the best one. A realistic implementation stops as soon as the validation metric stops improving, to avoid wasted compute. We can do this with the `warm_start` option, which keeps the trees already trained when `fit` is called again, so that only the newly requested trees are fitted:

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [None]:
print(gbrt.n_estimators)
print("Minimum MSE on the validation set:", min_val_error)

In [None]:
gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=120,
    warm_start=True,
    random_state=42,
    validation_fraction=0.2,
    n_iter_no_change=5,
)
gbrt.fit(X, y)

In [None]:
gbrt.n_estimators_  # number of trees actually fitted before early stopping

## Stochastic gradient boosting

There is an additional parameter called `subsample` which is very useful. It sets the fraction of the training data, drawn at random, that is used to fit each tree; the default `subsample=1.0` uses all the data. Values below 1 speed up training and usually lower the variance of the ensemble, at the cost of a somewhat higher bias.

In [None]:
gbrt_all = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, learning_rate=1.0, random_state=42
)
gbrt_all.fit(X, y)

gbrt_stochastic = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, learning_rate=1.0, subsample=0.5, random_state=42
)
gbrt_stochastic.fit(X, y)

fix, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)

plt.sca(axes[0])
plot_predictions(
    [gbrt_all], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions"
)
plt.title(
    "subsample={}, n_estimators={}".format(gbrt_all.subsample, gbrt_all.n_estimators),
    fontsize=14,
)
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.sca(axes[1])
plot_predictions([gbrt_stochastic], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title(
    "subsample={}, n_estimators={}".format(
        gbrt_stochastic.subsample, gbrt_stochastic.n_estimators
    ),
    fontsize=14,
)
plt.xlabel("$x_1$", fontsize=16)

plt.show()

## XGBoost

Although useful, the `sklearn` implementation is not the most powerful available.

One possible choice is to use **Extreme Gradient Boosting**, or `XGBoost`, which is an optimized implementation that prioritizes speed, scalability and portability. It is hugely popular (as can be seen in Kaggle) and can be used in a similar manner as `sklearn` (by design, they can be used together fairly easily). In particular, the `XGBRegressor` and `XGBClassifier` classes are built to be equivalent to `sklearn` models.

In [None]:
!pip install xgboost

In [None]:
from xgboost import XGBRegressor, XGBClassifier

You can find the relevant documentation [here](https://xgboost.readthedocs.io/en/latest/).

The relevant hyperparameters for us are:

- `learning_rate` / `eta` (0.3 by default)
- `gamma` / `min_split_loss` (0 by default): the minimum loss reduction for the tree to continue splitting a leaf
- `max_depth` (6 by default)
- `min_child_weight` (1 by default): the minimum sum of instance weights that must remain in a child when splitting a leaf node
- `subsample` (1 by default)
- `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` (1 by default for all three): the fraction of features considered per tree, per level, and per node.
- `reg_lambda` (1 by default): L2 penalty factor in the weights
- `reg_alpha` (0 by default): L1 penalty factor in the weights
- `objective`: specifies the task to be performed. `reg:squarederror` is the least squares loss. `binary:logistic` and `multi:softprob` are useful for classification with probabilistic outputs (`multi:softmax` returns the predicted class directly). There are several other options to play with.

In [None]:
# XGBRegressor?

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
print(X.head())
print(X.info())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train_2, X_val, y_train_2, y_val = train_test_split(X_train, y_train, test_size=0.2)

In [None]:
regressor = XGBRegressor(
    n_estimators=200,
    learning_rate=0.5,
    reg_lambda=0.0,
    reg_alpha=0.0,
    gamma=0.0,
    eval_metric="rmse",
    early_stopping_rounds=5,
    objective="reg:squarederror",
    max_depth=3,
)

We can train using `fit`, with some hyperparameters set

In [None]:
regressor.fit(
    X_train_2,
    y_train_2,
    eval_set=[(X_train_2, y_train_2), (X_val, y_val)],
    verbose=True,
)

In [None]:
np.sqrt(mean_squared_error(regressor.predict(X_val), y_val))

In [None]:
regressor.evals_result()

We can explore feature importance

In [None]:
for i in range(len(iris.feature_names)):
    print((iris.feature_names[i], regressor.feature_importances_[i]))

Since it's so fast, we can even do `cross_val_score`.

In [None]:
# from sklearn.model_selection import cross_val_score

# scores = cross_val_score(
#    regressor, X_train, y_train, scoring="neg_root_mean_squared_error"
# )
# print(-scores.mean(), scores.std())

We can store the model

In [None]:
regressor.save_model("xbg_modelo_1.json")

In [None]:
params = regressor.get_xgb_params()
regressor_2 = XGBRegressor(**params)
regressor_2.get_xgb_params()

And load it

In [None]:
regressor_2.load_model("xbg_modelo_1.json")
regressor_2.get_xgb_params()

In [None]:
regressor.predict(X_train[:2])

In [None]:
regressor_2.predict(X_train[:2])

## Exercise

At the LHC we can look for new particles. One possibility is a $W^\prime$ boson, which may decay to different final states. For example, a proton-proton collision may produce a very massive particle that decays to two jets, which we rank by transverse momentum $P_{T}$ and call the *leading* and *sub-leading* jets. Each of these jets is characterized by seven features: its invariant mass ($M_j$), its transverse momentum ($P_T$), its relativistic rapidity ($Y$), its azimuthal angle ($\phi$), and three variables ($\tau_{21}, \tau_{31}, \tau_{32}$) that measure the substructure of the jet.

We have a dataset of 10000 simulated collisions in which this new particle $W^\prime$ is actually produced, which we call *signal*, and another 10000 collisions in which it is not produced and which instead originate from the many Standard Model (SM) processes that constitute the irreducible *background* of the search.

The goal is to train a classifier based on the jet features to distinguish signal from background events. This classifier can be used as a tagger to select interesting events, or even to build an optimal observable for statistical inference (based on the Neyman-Pearson lemma).
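
The link to the Neyman-Pearson lemma can be made concrete with a short sketch. It assumes that the classifier output approximates the posterior probability $p(\mathrm{signal}|x)$, which only holds for a well-trained and calibrated classifier; the function name `likelihood_ratio` and its `signal_fraction` argument are illustrative.

```python
def likelihood_ratio(score, signal_fraction=0.5):
    """Estimate p(x | signal) / p(x | background) from a classifier output.

    `score` is assumed to approximate p(signal | x), and `signal_fraction` is
    the fraction of signal events in the training sample.
    """
    posterior_odds = score / (1.0 - score)
    prior_odds = signal_fraction / (1.0 - signal_fraction)
    return posterior_odds / prior_odds


for score in (0.1, 0.5, 0.9):
    print(score, likelihood_ratio(score))
```

For a balanced sample such as the one used here, the ratio reduces to $s/(1-s)$, which grows monotonically with the score $s$. Cutting on the classifier output is then equivalent to cutting on the estimated likelihood ratio.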

The following cells download the data and visualize them. Explore the dataset and train an optimized tagger using cross-validation. First train a classifier using either the leading or the sub-leading jet, then both. Get the feature importances and report all relevant metrics. Compare to a simpler classifier and decide whether the BDT was worth it.
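
As a starting point for the metrics, two quantities commonly quoted for a tagger are the signal efficiency and the background rejection at a given threshold on the classifier output. The sketch below is one possible pure-Python implementation; the helper name and the toy scores are invented for the example, and the labels follow the convention used in this exercise (0 for background, 1 for signal).

```python
def efficiency_and_rejection(scores, labels, threshold):
    """Signal efficiency and background rejection when tagging score > threshold as signal."""
    signal_scores = [s for s, y in zip(scores, labels) if y == 1]
    background_scores = [s for s, y in zip(scores, labels) if y == 0]
    # fraction of signal events that pass the cut
    efficiency = sum(s > threshold for s in signal_scores) / len(signal_scores)
    # fraction of background events that fail the cut
    rejection = sum(s <= threshold for s in background_scores) / len(background_scores)
    return efficiency, rejection


# toy classifier outputs: four signal events followed by four background events
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(efficiency_and_rejection(toy_scores, toy_labels, threshold=0.5))
```

Scanning the threshold from 0 to 1 traces out the ROC curve of the tagger.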

In [None]:
!wget -q -N https://gitlab.com/mcgen-ct/tutorials/-/raw/2025-cteq/.full/ml/datasets/np_background.dat
!wget -q -N https://gitlab.com/mcgen-ct/tutorials/-/raw/2025-cteq/.full/ml/datasets/np_signal.dat

In [None]:
import numpy as np

def read_events(filename, max_events=10000):
    """Read up to max_events events into an array of shape [n_events, 2, 7] (event, jet, feature)."""
    events = []
    with open(filename) as datafile:
        for nline, line in enumerate(datafile):
            if nline >= max_events:
                break
            # each line holds the leading and sub-leading jet data separated by ";",
            # and each jet holds comma-separated features that we convert to float
            jets = line.split(";")
            leading = [float(x) for x in jets[0].split(",")]
            subleading = [float(x) for x in jets[1].split(",")]
            events.append([leading, subleading])
    return np.asarray(events)


# reads background and signal events
background = read_events("np_background.dat")
signal = read_events("np_signal.dat")

print("Shape of background and signal:", background.shape, signal.shape)

# group both datasets and assign labels, 0 for background and 1 for signal
X = np.vstack((background, signal))
Y = np.hstack((np.zeros(len(background)), np.ones(len(signal))))

print("Shapes of data and labels:", X.shape, Y.shape)
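
Most classifiers expect a two-dimensional feature matrix, so the `(event, jet, feature)` array has to be sliced or flattened before training. The sketch below shows one way to do this on a small stand-in array; `X_demo` and the other names are made up for the example, and the same indexing should carry over to the real `X`.

```python
import numpy as np

# stand-in for the real data: 4 events, 2 jets, 7 features each
X_demo = np.arange(4 * 2 * 7, dtype=float).reshape(4, 2, 7)

X_leading = X_demo[:, 0, :]  # features of the leading jet only
X_subleading = X_demo[:, 1, :]  # features of the sub-leading jet only
# both jets: the 7 leading-jet features followed by the 7 sub-leading-jet features
X_both = X_demo.reshape(len(X_demo), -1)

print(X_leading.shape, X_subleading.shape, X_both.shape)
```

With the flattened layout, columns 0 to 6 belong to the leading jet and columns 7 to 13 to the sub-leading jet, which is useful when reading off feature importances.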

In [None]:
import matplotlib.pyplot as plt

# raw strings keep the LaTeX backslashes from being read as escape sequences
vars = [r"$M_j$", r"$P_T$", "Y", r"$\phi$", r"$\tau_{21}$", r"$\tau_{31}$", r"$\tau_{32}$"]

# Let's plot the distributions of the variables for both leading and sub-leading jets and for each process.
for i in range(7):
    fig, axs = plt.subplots(1, 2, figsize=(8, 3))
    axs[0].hist(background[:, 0, i], histtype="step", color="blue", label="Background")
    axs[0].hist(signal[:, 0, i], histtype="step", color="red", label="Signal")
    axs[0].legend(loc="upper right")
    axs[0].set_title(vars[i] + " Leading Jet")
    axs[1].hist(background[:, 1, i], histtype="step", color="blue", label="Background")
    axs[1].hist(signal[:, 1, i], histtype="step", color="red", label="Signal")
    axs[1].legend(loc="upper right")
    axs[1].set_title(vars[i] + " Sub-Leading Jet")
    plt.show()

In [None]:
# Let's study the correlations between all other variables and the jet mass for both leading and sub-leading jets.
for i in range(6):
    fig, axs = plt.subplots(1, 4, figsize=(20, 3))
    f1 = axs[0].hist2d(background[:, 0, 0], background[:, 0, 1 + i], cmap="gist_heat_r")
    fig.colorbar(f1[3], ax=axs[0])
    axs[0].set_xlabel(vars[0])
    axs[0].set_ylabel(vars[1 + i])
    axs[0].set_title("Background Leading Jet")
    f2 = axs[1].hist2d(signal[:, 0, 0], signal[:, 0, 1 + i], cmap="gist_heat_r")
    fig.colorbar(f2[3], ax=axs[1])
    axs[1].set_xlabel(vars[0])
    # axs[1].set_ylabel(vars[1+i])
    axs[1].set_title("Signal Leading Jet")
    f3 = axs[2].hist2d(background[:, 1, 0], background[:, 1, 1 + i], cmap="gist_heat_r")
    fig.colorbar(f3[3], ax=axs[2])
    axs[2].set_xlabel(vars[0])
    axs[2].set_title("Background Sub-Leading Jet")
    f4 = axs[3].hist2d(signal[:, 1, 0], signal[:, 1, 1 + i], cmap="gist_heat_r")
    axs[3].set_xlabel(vars[0])
    axs[3].set_title("Signal Sub-Leading Jet")
    fig.colorbar(f4[3], ax=axs[3])
    plt.show()