# Discovering classification with decision trees

Lino Galiana  
2025-12-26

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/modelisation/2_classification.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«2_classification»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/modelisation%202_classification»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«2_classification»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/modelisation%202_classification»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/modelisation/2_classification.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

# 1. Introduction

This chapter aims to very briefly introduce the principle of training models in a classification context. The goal is to illustrate the process using an algorithm with an intuitive principle. It seeks to demonstrate some of the concepts discussed in previous chapters, particularly those related to model training. Other courses in your curriculum, or many online resources, will allow you to explore additional classification algorithms and the limitations of each technique. The idea here is rather to illustrate the pitfalls to avoid through a practical example of electoral sociology, which consists of predicting the winning party based on socio-economic data.

## 1.1 Data

The machine learning material in this course uses a single dataset, presented in the [introduction](index.qmd). All examples are based on US county-level presidential election results combined with sociodemographic variables. The source code for data ingestion is available on [`GitHub`](https://github.com/linogaliana/python-datascientist/blob/main/content/modelisation/get_data.py).


In [None]:
!pip install geopandas openpyxl plotnine plotly

In [None]:
import requests

# Download the helper script that builds the dataset
url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open('getdata.py', 'wb') as f:
    f.write(r.content)

import getdata
votes = getdata.create_votes_dataframes()

## 1.2 Methodological approach

### 1.2.1 Principle of decision trees

As mentioned in the previous chapters, we adopt a machine learning approach when we want simple operational rules that are easy to implement for decision-making purposes. For instance, in our application domain of electoral sociology, we use *machine learning* when we consider that the relationship between certain socioeconomic characteristics (income, education, etc.) and voting behaviour is complex to grasp, and that a highly sophisticated model, though permitted by theory, would bring only limited performance gains.

We will illustrate the traditional approach using classification methods based on decision trees. The idea is fairly intuitive: it consists in transforming a problem into a sequence of simple decision rules that lead to the desired outcome. For example,

-   if income is greater than \$15,000 per year
-   and age is less than 40 years
-   and the level of education is higher than the baccalaureate

then, statistically, we are more likely to observe a Democratic vote.
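As a rough sketch, the rule above can be written as a chain of binary conditions. The function name, the argument names and the encoding of education as a boolean are hypothetical and chosen only for illustration; the thresholds are those of the example, not estimated values.

```python
def likely_democratic(income: float, age: float, above_baccalaureate: bool) -> bool:
    """Hypothetical hand-written rule mirroring the example in the text."""
    if income > 15_000:              # first binary choice: yearly income
        if age < 40:                 # second binary choice: age
            if above_baccalaureate:  # third binary choice: education level
                return True
    return False


print(likely_democratic(income=30_000, age=35, above_baccalaureate=True))
print(likely_democratic(income=30_000, age=55, above_baccalaureate=True))
```

A decision tree learns this kind of variable and threshold from the data instead of having them written by hand.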

<a href="#fig-iris-classification-en" class="quarto-xref">Figure 1.1</a> illustrates graphically how a decision tree is built as a sequence of binary choices. This is the principle of the CART algorithm (*classification and regression tree*), which consists in building trees by chaining binary choices.

<figure id="fig-iris-classification-en">
<img src="https://scikit-learn.org/stable/_images/iris.svg" />
<figcaption>Figure 1.1: Example of a decision tree on the classic iris dataset. Source: <a href="https://scikit-learn.org/stable/modules/tree.html">scikit-learn documentation</a></figcaption>
</figure>

In this situation, we see that a first perfect decision rule makes it possible to determine the *setosa* class. Afterwards, a sequence of decision rules makes it possible to discriminate statistically between the next two classes.

### 1.2.2 Iterative procedure

This final structure is the result of an iterative algorithm. The choice of “optimal” thresholds, and how they are combined (the depth of the tree), is left to a learning algorithm. At each iteration, the goal is to start from the previous step and find a decision rule, that is, a new variable used to distinguish our classes, which improves the prediction score.

Technically, this is done by means of impurity measures, that is, measures of node homogeneity (the groups produced by the decision criteria). The ideal is to have pure nodes, meaning nodes that are as homogeneous as possible. The most commonly used measures are the Gini index and Shannon entropy.
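For readers who prefer code to equations, here is a minimal sketch of these two measures in plain Python. The function names and the toy labels are assumptions made for illustration; scikit-learn computes these quantities internally.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure_node = [1, 1, 1, 1]   # made-up node containing a single class
mixed_node = [0, 0, 1, 1]  # made-up node split evenly between two classes

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

A pure node scores 0 on both measures, while a perfectly balanced two-class node reaches the two-class maximum: 0.5 for the Gini index and 1 bit for the entropy.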


<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">

It would of course be possible to present these intuitions through mathematical formalization. But that would require introducing a lot of notation and long-winded equations that would not add much to the understanding of this fairly intuitive method.

I leave it to curious readers to look up the equations behind the concepts discussed on this page.

</div>
</div>

These impurity measures are used to guide the choice of the tree’s structure, from its root (the starting point) to its leaves (the nodes reached after following a path through the tree’s sequence of splits).
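To make this concrete, the sketch below searches for the threshold on a single variable that minimises the weighted Gini impurity of the two resulting nodes. It is a simplified illustration of one step of a CART-style algorithm, not scikit-learn's implementation: candidate thresholds are the observed values themselves, and the function names and toy data are made up.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node, given the labels it contains."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one variable."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped: splitting there would leave the right node empty
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data: one explanatory variable and a binary label
income = [12, 15, 18, 30, 35, 40]
label = [0, 0, 0, 1, 1, 1]
print(best_threshold(income, label))
```

Repeating this search over every variable, then again inside each resulting node, produces the sequence of splits that forms the tree.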

Rather than starting from a blank page and testing rules until finding a few that work well, one generally starts from an overly large set of rules and progressively prunes it. This makes it easier to limit overfitting, which consists in creating very specific rules that apply to a limited set of data and therefore generalise poorly.

For example, if we return to <a href="#fig-iris-classification-en" class="quarto-xref">Figure 1.1</a>, we can see that some nodes apply to a very small subset of the data (samples of three or four observations): the statistical power of these rules is probably limited.
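One way to experiment with pruning in scikit-learn is cost-complexity pruning, controlled by the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below grows a full tree on the iris data and shows how the number of leaves shrinks as the pruning strength increases. It is only one possible pruning strategy, used here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Overly large tree: grown without any constraint on depth or leaf size
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Candidate pruning strengths derived from the fully grown tree
path = full_tree.cost_complexity_pruning_path(X, y)

leaves = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values due to rounding
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    leaves.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  depth={pruned.get_depth()}")
```

In practice the pruning strength would be chosen on data not used for training, for instance by cross-validation.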

# 2. Application

To train a classification model, we need a dichotomous target variable. The natural choice is a party’s victory or defeat.

Even though the Republicans lost in 2020, they won in more counties (less populated ones). We will consider a Republican victory as our *label* `1` and a defeat as `0`.

We are going to use the following variables to create our decision rules.


In [None]:
xvars = [
  'Unemployment_rate_2019', 'Median_Household_Income_2021',
  'Percent of adults with less than a high school diploma, 2018-22',
  "Percent of adults with a bachelor's degree or higher, 2018-22"
]

We are going to use these packages:


In [None]:
!sudo apt-get update && sudo apt-get install graphviz -y

In [None]:
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Exercise 1: First classification algorithm
</div>
</div>
<div class="callout-body-container callout-body">

1.  Create a dummy variable called `y` whose value is 1 when the Republicans win.
2.  Using the ready-to-use function `train_test_split` from the `sklearn.model_selection` module, create a test sample (20% of observations) and a training sample (80%) with our `xvars` variables as features and the `y` variable as the label.
3.  Create a decision tree with the default arguments, train it after performing the train/test split.
4.  Visualise it with the `plot_tree` function. What is the problem?
5.  Evaluate its predictive performance. Compare it to that of a tree with a depth of 4. Conclude.
6.  Represent the depth 4 tree with the `export_graphviz` function.
7.  Look at the following performance metrics: `accuracy`, `f1`, `recall` and `precision`.
    Represent the confusion matrix. What is the problem? What solution do you see?
8.  \[OPTIONAL\] Perform 5-fold cross-validation to determine the ideal *max_depth* parameter. Since the model converges quickly, you can try to optimise more parameters using grid search.

</div>
</div>

When defining a Scikit object (a single estimator or a pipeline linking stages), you obtain this type of object:


<a href="#fig-decision-q4" class="quarto-xref">Figure 2.1</a> shows that the decision tree we initially obtained needs some pruning. We will test, arbitrarily, a depth-4 tree.


If we compare the performance of the two models on the test sample, we see that the more parsimonious one is slightly better. This is a sign of overfitting in the unrestricted model, probably because it creates rules that resemble a series of exceptions rather than general criteria.
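The comparison can be sketched as follows. Since the county data is not loaded here, the example uses a synthetic dataset generated with `make_classification` as a stand-in for `votes[xvars]` and `y`; the sample size, class weights and noise level are arbitrary assumptions, so the numbers will differ from those obtained on the election data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the county data: 4 features, imbalanced binary label
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.2, 0.8], flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=4": DecisionTreeClassifier(max_depth=4, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "depth": model.get_depth(),
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

A large gap between the training and test accuracy of the unrestricted tree is the usual symptom of overfitting.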


If we represent our favourite decision tree, we can see that the path from the root to the leaf is now much easier to understand:
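One way to produce such a representation is sketched below on the iris data, used here as a stand-in for the election data. `export_graphviz` returns a description of the tree in the DOT language, and `export_text` gives a plain-text view of the same rules that needs no extra software.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(iris.data, iris.target)

# DOT description of the tree; with out_file=None it is returned as a string
dot = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Plain-text view of the decision rules, from the root to each leaf
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

The `dot` string can then be rendered as an image with Graphviz, for instance through the `graphviz` Python package if it is installed.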


Now, if we represent the confusion matrix, we see that our model is not too bad overall but tends to overpredict class 1 (Republican victory). Why does it do this? Because on average it is a winning bet since we have a class imbalance. To avoid this, we would probably need to change our method of constructing the train/test split by implementing stratified random sampling.
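The sketch below shows how the confusion matrix and the usual metrics can be computed, together with a stratified train/test split obtained through the `stratify` argument of `train_test_split`. It runs on a synthetic, imbalanced dataset standing in for the county data, so the figures are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the county data: about 80% of observations in class 1
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.2, 0.8], flip_y=0.1, random_state=0,
)

# stratify=y keeps the same share of each class in the train and test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("share of class 1 (train, test):", y_train.mean(), y_test.mean())
```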


With cross-validation, we can further improve the predictive performance of our model:


This shows us that we were not so far off from the optimal parameter when we arbitrarily chose `max_depth=4`. It is already a little better when we look at the confusion matrix, but we are still working with a model that overpredicts the dominant class.
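The search described above can be sketched with `GridSearchCV`, here with 5-fold cross-validation over `max_depth` only. The data is again a synthetic stand-in for the county data and the grid of depths is an arbitrary choice, so the selected depth will not necessarily match the one found on the election data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the county data
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.2, 0.8], flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate depths, each evaluated by 5-fold cross-validation on the training sample
param_grid = {"max_depth": list(range(1, 11))}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("best parameters  :", search.best_params_)
print("mean CV accuracy :", search.best_score_)
print("test accuracy    :", search.score(X_test, y_test))
```

Other parameters, such as `min_samples_leaf`, can be added to `param_grid` to search over several dimensions at once.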


If we look at the decision tree that this ultimately gives us, we can conclude that:

1.  The variables of educational attainment and income allow us to better distinguish between classes
2.  The unemployment rate variable is only secondary

To go further, we would need to incorporate more variables into the model. But how can we do this without risking overfitting? This will be the subject of the chapter on variable selection.


# 3. Conclusion

We have just briefly looked at the general approach to adopt when doing machine learning. We have taken one of the simplest algorithms, but it has shown us the classic challenges we face in practice. To improve predictive performance, we could refine our approach by using a more powerful algorithm, such as random forest, which combines the predictions of many decision trees.

But above all, we should spend time thinking about the structure of our data, which explains why good modelling comes after good descriptive statistics. Without the latter, we are flying blind.
