<a href="https://colab.research.google.com/github/martatolos/eae-dsaa-2025/blob/main/decision_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees

> Goal of the session:
>
> - At the end of this activity, you will understand the basics of decision trees, how they work internally and how they are used in random forests. We will also see how models are evaluated and how we can try to explain the model predictions.
>
> Scope of the session:
>
> - Prepare a dataset for training a decision tree model.
> - Analyze the dataset and see how to split it into training and test sets.
> - Train a decision tree model using the `sklearn` library and observe how the trained model inference works.
> - Train a random forest model.
> - See how model performance can be evaluated.

## 1. Setup

### Dependencies

- ``dtreeviz`` 2.2.2
- ``ipython``
- ``nbformat``
- ``numpy`` 2.0.2
- ``pathlib`` (part of the Python standard library, no installation needed)
- ``plotly`` 5.24.1
- ``pydotplus`` 2.0.2
- ``scikit-learn`` 1.6.1
- ``seaborn`` 0.13.2

In [None]:
%pip install dtreeviz==2.2.2 ipython nbformat numpy==2.0.2 plotly==5.24.1 pydotplus==2.0.2 scikit-learn==1.6.1 \
    seaborn==0.13.2

### Imports

In [None]:
import pickle
from io import StringIO
from pathlib import Path

import dtreeviz
import numpy as np
import plotly.express as px
import pydotplus
import seaborn as sns
from IPython.display import Image
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [None]:
iris = datasets.load_iris(as_frame=True)

## 2. Analysis

In [None]:
iris.frame

``iris`` is not a single table but a scikit-learn ``Bunch`` object that holds several of them: the features (``data``), the labels (``target``) and both merged together (``frame``).

In [None]:
X = iris.data
y = iris.target

If you look at the names of the variables above, you'll see `X` and `y`. Y is that?

In math notation, $X$ denotes the dataset as an array of feature vectors: each vector describes one flower in our dataset, and each feature represents one aspect of that flower. The variable `y` holds the labels, i.e. the target variable (species). From the values in `X` we want to learn a function $f$ that approximates the labels, using the labeled data in `y` to better **fit** the data in `X`:

$\hat{y} = f(X)$

where $\hat{y}$ is the model's prediction for `y`.

In [None]:
X.head()

In [None]:
y.to_numpy()

In [None]:
y.hist()

If all the values for the target occur in similar amounts, we can say that the dataset is balanced. If not, we can say that the dataset is unbalanced, which can be an issue.

Do you think this dataset is balanced?
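
The histogram can be complemented with exact numbers. The following is a minimal sketch that counts the samples per class with pandas `value_counts`; it reloads the Iris data so that it runs on its own, and the variable names `class_counts` and `class_shares` are only illustrative.

```python
from sklearn import datasets

# Load the same Iris data used in this notebook
iris = datasets.load_iris(as_frame=True)
y = iris.target

# Absolute number of flowers per class (0, 1, 2)
class_counts = y.value_counts().sort_index()
print(class_counts)

# Share of each class; similar shares indicate a balanced dataset
class_shares = y.value_counts(normalize=True).sort_index()
print(class_shares)
```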

In [None]:
iris.target_names

In [None]:
iris_df = X.copy()
iris_df.head()

In [None]:
species_dict = {0: "setosa", 1: "versicolor", 2: "virginica"}
iris_df["species"] = y.map(species_dict)
iris_df.head()

In [None]:
sns.pairplot(iris_df)

In [None]:
sns.pairplot(iris_df, hue="species", plot_kws={"alpha": 0.3})

## 3. Model training

### Dividing the dataset into train and test datasets

Before creating the decision tree, we split the data into a train partition and a test partition. Note the `random_state` parameter passed to the `train_test_split` function: changing its value changes how the samples are randomly distributed between train and test.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=70, stratify=y)

The random state or seed is a number that is used to initialize the random number generator. It is used to ensure that the random numbers generated are reproducible.
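
To see the effect of the seed, the sketch below splits the data twice with the same `random_state` and once with a different one, then compares which rows ended up in the test set. It is an illustration only; the names `X_test_a`, `X_test_b` and `X_test_c` are hypothetical, and the seeds are simply the ones used in this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# Two splits with the same seed select exactly the same rows
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.25, random_state=70, stratify=y)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.25, random_state=70, stratify=y)
same_seed_equal = X_test_a.index.equals(X_test_b.index)

# A different seed is expected to select different rows
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.25, random_state=60, stratify=y)
different_seed_equal = X_test_a.index.equals(X_test_c.index)

print(f"Same seed, same test rows: {same_seed_equal}")
print(f"Different seed, same test rows: {different_seed_equal}")
```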

In [None]:
len(X_train)

In [None]:
len(X_test)

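### Why is dividing the dataset important?

You want your model to generalize well to data it has not seen. Evaluating it on the same data it was trained on would overstate its performance, so sampling is important and you need to avoid data leakage, i.e. using the training dataset as the test dataset.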

Additional resources:
- [K Fold Cross Validation](https://www.datacamp.com/tutorial/k-fold-cross-validation)
- [K Fold Cross Validation in Scikit-Learn](https://www.cloudzilla.ai/dev-education/how-to-implement-k-fold-cross-validation)
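
The K-fold cross-validation described in the resources above can be tried with scikit-learn's `cross_val_score`. This is a minimal sketch, assuming 5 folds and a default decision tree; both choices are arbitrary and only meant to show the mechanics.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 parts and each part
# is used once as the test set while the other 4 are used for training
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```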

### Training a Decision Tree Classifier

In decision trees we're creating structures like:

![](https://miro.com/blog/wp-content/uploads/2021/12/decision_tree_business_analysis.png)

In [None]:
# Create the model
dtree = DecisionTreeClassifier()

# Train the model
# X_train contains the feature vectors for examples of flowers
# y_train contains the classes of these flowers
dtree.fit(X_train, y_train)

### Exporting model to file

The saved model can later be loaded again, for example to build a flower classifier application.

In [None]:
# Saves file (serialized dtree object)
with Path("dtree_iris.pkl").open("wb") as f:  # wb stands for Write Binary
    pickle.dump(dtree, f)

> [!Warning]
> Loading a pickle file can execute arbitrary code, so only unpickle files that come from a source you trust. Pickle can be useful for quickly testing and prototyping, but it is not recommended for production use, where other formats are recommended.

In [None]:
# Just showing the file was actually created
!ls | grep pkl
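
For completeness, this is roughly how a saved model could be loaded back and reused. The sketch is self-contained: it trains its own small tree and writes it to a temporary folder, and the file name `dtree_iris_example.pkl` is a hypothetical one chosen so that it does not clash with the file created above.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "dtree_iris_example.pkl"  # hypothetical file name

    with model_path.open("wb") as f:  # wb stands for Write Binary
        pickle.dump(model, f)

    # Only unpickle files you created yourself or fully trust
    with model_path.open("rb") as f:  # rb stands for Read Binary
        loaded_model = pickle.load(f)

# The restored model gives the same predictions as the original one
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print(same_predictions)
```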

## 4. Analyzing and evaluating the model

### Visualizing resulting decision tree

In [None]:
dot_data = StringIO()
export_graphviz(
    dtree,
    out_file=dot_data,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

**Decision Trees** or CART (Classification and Regression Trees):

This method is based on repeatedly splitting the data. It is an iterative method in which each step tries to reduce the *impurity* of the data with respect to the variable to be predicted. Digging a bit deeper:

1. We start with all our data in the same bag, so all the different categories are mixed together.
By categories we mean the class, the target variable, i.e. what we're trying to predict in a classification task.

    1. The impurity of this data bag is measured using the **Gini impurity index** (not to be confused with the Gini coefficient).
    1. Simplifying, it can be seen as: if we pick an observation from the bag at random and label it at random according to the distribution of the categories in the bag, what is the probability of labeling it wrongly?
    1. In other terms: if I have a sack with two pears and two oranges, the impurity is 50%, since a new observation is equally likely to be a pear or an orange.

2. Then, as we want to reduce this impurity, in each iteration we distribute the observations into different bags based on the values of our features, until we obtain groups where no error is possible: either everything is pears or everything is oranges.
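
The pears and oranges example above can be checked with a few lines of code. The sketch below uses the usual definition of the Gini impurity, $G = 1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in the bag. The function name `gini_impurity` is our own helper, not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2), where p_k is the share of class k in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


mixed_bag = ["pear", "pear", "orange", "orange"]
pure_bag = ["pear", "pear", "pear"]

print(gini_impurity(mixed_bag))  # 0.5, the 50% from the example
print(gini_impurity(pure_bag))   # 0.0, no impurity left
```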


We can see that there is a variable that strongly marks a first cut: already in the first split we obtain a partition without impurity (one of the two resulting groups contains only setosa flowers). It is a model capable of obtaining a very accurate division of the data in very few steps.

In [None]:
viz = dtreeviz.model(
    dtree,
    X_train,
    y_train,
    target_name="Species",
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
)

viz.view(fontname="DejaVu Sans")

If we now repeat this process with a different split between train and test (by changing `random_state` to another value), the model is trained on different data and the resulting tree may differ ...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=60, stratify=y)

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

viz = dtreeviz.model(
    dtree,
    X_train,
    y_train,
    target_name="Species",
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
)

viz.view(fontname="DejaVu Sans")

### Some words about overfitting and underfitting

Sometimes you train an ML algorithm on the labeled dataset you have, but when faced with new, unseen data the model performs badly and makes wrong classifications. This happens because the model has not generalized the patterns in the data correctly: it has either replicated the training data in a highly specialized way (overfitting) or learned rules that are too generic to capture the patterns (underfitting).

![](https://www.mathworks.com/discovery/overfitting/_jcr_content/mainParsys/image.adapt.full.medium.svg/1705396624275.svg)
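
One simple way to look for these problems is to compare the accuracy on the training data with the accuracy on held-out data. The sketch below does this for a tree limited to a single split (likely too generic for three classes) and for an unrestricted tree. The two depth settings are arbitrary illustrations, and on a dataset as easy as Iris the gap between training and held-out accuracy may be small.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

results = {}
for name, depth in [("single split", 1), ("unrestricted", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)  # accuracy on data seen in training
    test_acc = model.score(X_test, y_test)     # accuracy on held-out data
    results[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy {train_acc:.2f}, held-out accuracy {test_acc:.2f}")
```

A low accuracy on both sets points to underfitting, while a high training accuracy combined with a clearly lower held-out accuracy points to overfitting.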

### Train-validation-test split

One way to try to cope with overfitting is to use a more elaborate strategy that divides the dataset into:

1. Train dataset: used to train the model
1. Validation (or development) dataset: used to tune our model hyperparameters to achieve the best possible accuracy
1. Test dataset: unseen data held out to test our model, possibly with different patterns not found in the previous datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.10, random_state=42, stratify=y_train)

# F-strings https://realpython.com/python-f-strings/
print(f"Training size: {len(X_train)}, Validation size: {len(X_dev)}, Test size: {len(X_test)}")

In [None]:
# Create the model
dtree = DecisionTreeClassifier()

# Train the model
dtree.fit(X_train, y_train)

In [None]:
viz = dtreeviz.model(
    dtree,
    X_train,
    y_train,
    target_name="Species",
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
)

viz.view(fontname="DejaVu Sans")

### Checking accuracy with the validation dataset

In [None]:
y_dev_predicted = dtree.predict(X_dev)
y_dev_predicted

In [None]:
print(f"Accuracy with validation dataset {accuracy_score(y_dev, y_dev_predicted)}")

At this point we can do multiple runs of train/validation checks to improve the accuracy score...
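
Such a loop could look like the sketch below, which tries several values of `max_depth` and keeps the one with the best validation accuracy. The candidate values and the choice of `max_depth` as the hyperparameter to tune are assumptions made for illustration; the test set is deliberately left untouched.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.10, random_state=42, stratify=y_train)

dev_scores = {}
for depth in [1, 2, 3, 4, 5]:  # arbitrary candidate values
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    # Score each candidate on the validation set only
    dev_scores[depth] = accuracy_score(y_dev, candidate.predict(X_dev))

# Keep the depth with the highest validation accuracy (first one on ties)
best_depth = max(dev_scores, key=dev_scores.get)
print(dev_scores)
print(f"Best max_depth on validation data: {best_depth}")
```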

### Inferring over new unseen data (X_test)

In [None]:
# The same but with X_test, y_test
y_test_predicted = dtree.predict(X_test)
y_test_predicted

In [None]:
print(f"Accuracy with test dataset {accuracy_score(y_test, y_test_predicted)}")

## 5. Decision trees for regression problems

### Diabetes dataset

The target is a quantitative measure of disease progression one year after baseline.
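
A regression tree can be trained on this dataset in much the same way as the classifier. The following is a minimal sketch using `load_diabetes` and `DecisionTreeRegressor` from scikit-learn; `max_depth=3`, the split parameters and the mean absolute error as metric are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A shallow tree: each leaf predicts the mean target of its training samples
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print(f"Mean absolute error on test data: {mean_absolute_error(y_test, y_pred):.1f}")
```

Because a tree of depth 3 has at most 8 leaves, the model can output at most 8 distinct values.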

![](https://explained.ai/decision-tree-viz/images/samples/diabetes-TD-3-X.svg)

In the above example we can see that Decision Trees create **LINEAR** decision boundaries for their variables: each split is a threshold on a single feature.
This is great for generating (visual) explanations of your resulting model BUT...

What would happen if the feature vector distribution is like in the following image:

![](https://media.geeksforgeeks.org/wp-content/uploads/20200605170732/linearsep.png)

In this case we would probably need a different model, such as a Support Vector Machine (SVM) with a non-linear kernel, which is able to divide the feature space with curved decision boundaries.
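
As a rough illustration, the sketch below compares an SVM with a linear kernel against one with an RBF kernel. The data is an assumption: it is generated with `make_circles` (one class surrounding the other), which may differ from the distribution shown in the image, and all parameter values are arbitrary.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner circle of one class surrounded by a ring of the other
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# A straight line cannot separate the two circles...
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
# ...but the RBF kernel allows a curved boundary
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy: {rbf_acc:.2f}")
```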

### Additional reading

- [A visual introduction to machine learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
- [Understanding Decision Trees with Python](https://www.datacamp.com/tutorial/decision-tree-classification-python)
- [4 Ways to Visualize Individual Decision Trees in a Random Forest](https://towardsdatascience.com/4-ways-to-visualize-individual-decision-trees-in-a-random-forest-7a9beda1d1b7)
- [How to Visualize a Random Forest with Fitted Parameters?](https://analyticsindiamag.com/how-to-visualize-a-random-forest-with-fitted-parameters/)
- [Understanding Random forest better through visualizations](https://garg-mohit851.medium.com/random-forest-visualization-3f76cdf6456f)

## 6. Improving the Model

There are two very useful techniques that allow us to create new models:

* **Bootstrap**: It is based on drawing subsamples of our data uniformly at random and with replacement, thus creating multiple samples that approximately share the distributions of the original sample.

* **Bagging**: Generate a bootstrap sample of size $n$, train a model on that subsample, repeat the process $m$ times and combine the predictions of the $m$ models.
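
Both ideas fit in a few lines. The sketch below is a hand-written illustration, not how scikit-learn implements it internally: it draws $m$ bootstrap samples of size $n$, trains one decision tree per sample and combines the trees by majority vote. The values of `n` and `m`, and the majority vote itself, are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n = len(X)  # bootstrap sample size (here: as large as the dataset)
m = 10      # number of models to train

models = []
for _ in range(m):
    # Bootstrap: draw n row indices uniformly at random, with replacement
    idx = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Bagging: collect the predictions of all models, shape (m, n_samples)
all_preds = np.stack([model.predict(X) for model in models])

# Majority vote per sample (each column holds the m votes for one flower)
bagged_pred = np.array([np.bincount(votes).argmax() for votes in all_preds.T])
print(f"Accuracy of the bagged trees on the data: {(bagged_pred == y).mean():.2f}")
```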

## 7. Random Forests

Now let's apply an *ensemble* method such as *Random Forest*, based on the results of multiple decision trees.



There are other ways of combining classifiers (or ensemble methods), such as the mean rule or the product rule, which use a priori probability distributions and the confidence each classifier assigns to its classification.
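
To make the mean and product rules concrete, here is a small sketch. The probabilities are made-up numbers for a single flower, not the output of real models, and the product rule is shown in its simplest form, assuming equal prior probabilities for all classes.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one flower
# (rows: classifiers, columns: setosa, versicolor, virginica)
probas = np.array([
    [0.10, 0.60, 0.30],
    [0.05, 0.45, 0.50],
    [0.20, 0.50, 0.30],
])

# Mean rule: average the probabilities of the classifiers
mean_rule = probas.mean(axis=0)

# Product rule: multiply the probabilities, then renormalize to sum to 1
product = probas.prod(axis=0)
product_rule = product / product.sum()

print(mean_rule, mean_rule.argmax())
print(product_rule, product_rule.argmax())
```

In this example both rules pick class 1 (versicolor), even though the second classifier on its own would have picked virginica.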

Additional reading:

- [Scikit Learn ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html)

### Training a Random Forest

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100, stratify=y)

rf = RandomForestClassifier(n_jobs=-1)  # parallelize the execution
rf.fit(X_train, y_train)

Let's analyze the code ... ([RFs in Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html))

In [None]:
help(rf)

The ``help`` function is very useful to understand the parameters of a function. You can also use the ``?`` operator in Jupyter notebooks to get help on a function.

### Real species

In [None]:
y_test

### Predicting the class of test dataset

In [None]:
predicted = rf.predict(X_test)
predicted

### Accuracy score

In [None]:
accuracy_score(y_test, predicted)

In [None]:
help(accuracy_score)

### Feature importance

In [None]:
rf = RandomForestClassifier(n_jobs=4)
rf

In [None]:
rf.fit(X_train, y_train)

In [None]:
importances = rf.feature_importances_
importances

In [None]:
indexes = np.argsort(importances)[::-1]
indexes

In [None]:
X.columns

In [None]:
# Map the sorted importance indexes to feature names
names = [X.columns[i] for i in indexes]
names

In [None]:
# Prepare a barplot with plotly
fig = px.bar(x=names, y=importances[indexes], labels={"x": "Features", "y": "Importance"}, title="Feature Importance")
fig.show()

The cells above show not only how to fit a model, but also that the fitted model exposes the importance of the different *features* (predictor variables) through its `feature_importances_` attribute.

### Predict() function and confusion matrix

In [None]:
rf_preds = rf.predict(X_test)
rf_conf_mat = confusion_matrix(y_test, rf_preds)
rf_conf_mat

In [None]:
help(confusion_matrix)

You can also compute the accuracy score from the confusion matrix: divide the number of correct predictions (its diagonal) by the total number of samples.

In [None]:
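# confusion_matrix already returns a NumPy array, so asarray is only a safeguard
np_mat = np.asarray(rf_conf_mat)

# np.trace sums the diagonal, i.e. the correctly classified samples
acc = np.trace(np_mat) / np_mat.sum()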
print(f"My accuracy is: {acc}")

### Confusion matrix visualization

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=rf_conf_mat, display_labels=iris.target_names)
disp.plot()