# Discovering classification with the SVM technique

Lino Galiana  
2025-10-07

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/modelisation/2_classification.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«2_classification»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/modelisation%202_classification%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«2_classification»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/modelisation%202_classification%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/modelisation/2_classification.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

# 1. Introduction

This chapter aims to very briefly introduce the principle of training models in a classification context. The goal is to illustrate the process using an algorithm with an intuitive principle. It seeks to demonstrate some of the concepts discussed in previous chapters, particularly those related to model training. Other courses in your curriculum will allow you to explore additional classification algorithms and the limitations of each technique.

## 1.1 Data

This chapter uses the same dataset as the rest of this part, presented in the [introduction
to this part](index.qmd): US presidential election results
combined with sociodemographic variables.
The code
is available [on Github](https://github.com/linogaliana/python-datascientist/blob/main/content/modelisation/get_data.py).

``` python
!pip install --upgrade xlrd  # Colab bug with the xlrd version
!pip install geopandas
```

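``` python
import requests

# Download the helper script that builds the dataset
url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop early if the download failed
with open('getdata.py', 'wb') as f:
    f.write(r.content)

# Import the downloaded script and build the votes DataFrame
import getdata
votes = getdata.create_votes_dataframes()
```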

## 1.2 The SVM Method (*Support Vector Machines*)

SVM (*Support Vector Machines*) is part of the traditional toolkit for *data scientists*.
The principle of this technique is relatively intuitive thanks to its geometric interpretation.
The goal is to find a line, surrounded by margins defined by the closest points (the support vectors), that best separates the point cloud in our data.
Of course, in real life, it is rare to have well-organized point clouds that can be separated by a line. However, an appropriate projection (a kernel) can transform the data to enable separation.
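
To make the role of the kernel concrete, here is a minimal sketch on synthetic data (two concentric rings generated with `make_circles`, not the election data). The dataset, the random seeds and the choice of the RBF kernel are assumptions made for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: two concentric rings, which no straight line can separate
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same algorithm, two different kernels
linear_clf = SVC(kernel="linear").fit(X_train, y_train)
rbf_clf = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_clf.score(X_test, y_test)
rbf_acc = rbf_clf.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On this kind of data the linear separator should do little better than chance, while the kernelized version should recover the ring structure.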

> **Mathematical formalization**
>
> SVM is one of the most intuitive *machine learning* methods
> due to its simple geometric interpretation. It is also
> one of the least complex *machine learning* algorithms in terms of formalization
> for practitioners familiar with traditional statistics. This note provides an overview, though it is not essential for understanding this chapter.
> In *machine learning*, more than the mathematical details, the key is to build intuitions.
>
> The goal of SVM, let us recall, is to find a hyperplane that
> best separates the different classes. For example, in a two-dimensional space,
> it aims to find a line with margins that best divides the space into regions
> with homogeneous *labels*.
>
> Without loss of generality, we can assume the data are drawn from an unknown probability distribution $\mathbb{P}(x,y)$ with labels $y \in \{-1,1\}$. The goal of classification is to build an estimator of the ideal decision function, the one that minimizes the probability of error. In other words,

$$
\theta^* = \arg\min_{\theta \in \Theta} \mathbb{P}(h_\theta(x) \neq y \mid x)
$$

The simplest SVMs are linear SVMs. In this case, it is assumed that a linear separator exists that can assign each class based on its sign:

$$
h_\theta(x) = \text{sign}(f_\theta(x)) ; \text{ with } f_\theta(x) = \theta^T x + b
$$
with $\theta \in \mathbb{R}^p$ and $b \in \mathbb{R}$.

When observations are linearly separable, there is an infinite number of linear decision boundaries separating the two classes. The *“best”* choice is the hyperplane with the maximum margin, that is, the one farthest from the closest points of each class. The two margin boundaries are the hyperplanes $\theta^Tx + b = \pm 1$, and the distance between them is $\frac{2}{||\theta||}$. Maximizing this distance is therefore equivalent to minimizing $||\theta||^2$ under the constraints $y_i(\theta^Tx_i + b) \geq 1$ for every observation $i$.
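
The margin formula can be checked numerically. The sketch below fits a linear SVM on six hand-picked, linearly separable points (hypothetical values) and recovers the margin width from the fitted coefficients. Using a very large `C` to approximate the hard-margin problem is an assumption of this example, not something prescribed by the chapter.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)
theta = clf.coef_[0]
b = clf.intercept_[0]

# Distance between the two margin hyperplanes: 2 / ||theta||
margin_width = 2 / np.linalg.norm(theta)

# Every observation should satisfy y_i (theta^T x_i + b) >= 1 (up to solver tolerance)
constraints = y * (X @ theta + b)

print(f"margin width: {margin_width:.3f}")
print("constraint values:", np.round(constraints, 3))
```

Geometrically, the closest points of the two groups are $(0.5, 0.5)$ and $(3, 3)$, so the margin width should be close to $5/\sqrt{2}$.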

In the non-linearly separable case, these constraints cannot all be satisfied. The *hinge loss* $\max\big(0, 1 - y_i(\theta^Tx_i + b)\big)$ penalizes each observation in proportion to how far it violates the margin, which leads to the following optimization problem:

$$
\min_{\theta, b} \frac{1}{n} \sum_{i=1}^n \max\big(0, 1 - y_i(\theta^Tx_i + b)\big) + \lambda ||\theta||^2
$$
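
As an illustration, the objective can be evaluated by hand with `numpy`, using the standard definition of the hinge loss, $\max(0, 1 - y f_\theta(x))$. The function name `svm_objective` and the toy values are hypothetical.

```python
import numpy as np

def svm_objective(theta, b, X, y, lam):
    """Average hinge loss plus L2 penalty; labels y must be in {-1, 1}."""
    margins = y * (X @ theta + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero when the margin is respected
    return hinge.mean() + lam * np.dot(theta, theta)

# Toy data (hypothetical): only the first coordinate matters here
X = np.array([[2.0, 0.0], [0.5, 0.0], [1.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
theta = np.array([1.0, 0.0])
b = 0.0

# Individual hinge losses are 0, 0.5, 2 and 0, so the mean is 0.625;
# the penalty adds 0.1 * ||theta||^2 = 0.1
value = svm_objective(theta, b, X, y, lam=0.1)
print(value)
```

The second point is correctly classified but sits inside the margin, so it still contributes to the loss; the third point is misclassified and contributes the most.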

Generalization to the non-linear case involves introducing kernels that transform the coordinate space of the observations.
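
One way to build intuition about kernels is to check that a kernel is simply an inner product in a transformed space. The sketch below does so for the homogeneous polynomial kernel of degree 2 in two dimensions; the function names `phi` and `poly2_kernel` are hypothetical and only serve this illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Kernel computed directly in the original space: (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = poly2_kernel(x, z)          # computed without transforming the data
explicit_value = np.dot(phi(x), phi(z))    # computed after transforming the data
print(kernel_value, explicit_value)
```

Both quantities coincide, which is why the algorithm never needs to compute the transformed coordinates explicitly.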

# 2. Application

To apply a classification model, we need a binary target variable. The natural choice is whether a party won or lost in each county.

Even though the Republicans lost in 2020, they won in more counties (less populated ones). We will consider a Republican victory as our *label* 1 and a defeat as *0*.
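
As a minimal sketch of this encoding, the example below builds a 0/1 label on a toy `DataFrame`. The column names `votes_republican` and `votes_democrat` and the values are hypothetical and do not necessarily match those of the `votes` dataset.

```python
import pandas as pd

# Toy data with hypothetical column names and values
toy_votes = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "votes_republican": [1200, 300, 5000, 800],
    "votes_democrat": [900, 700, 5200, 100],
})

# Label 1 when the Republicans win the county, 0 otherwise
toy_votes["y"] = (
    toy_votes["votes_republican"] > toy_votes["votes_democrat"]
).astype(int)
print(toy_votes)
```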

``` python
from sklearn import svm
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
```

> **Exercise 1: First classification algorithm**
>
> 1.  Create a *dummy* variable called `y` with a value of 1 when the Republicans win.
> 2.  Using the ready-to-use function `train_test_split` from the `sklearn.model_selection` module,
>     create a test sample (20% of the observations) and a training sample (80%) with the following *features*:
>
> ``` python
> vars = [
>   "Unemployment_rate_2019", "Median_Household_Income_2021",
>   "Percent of adults with less than a high school diploma, 2018-22",
>   "Percent of adults with a bachelor's degree or higher, 2018-22"
> ]
> ```
>
> and use the variable `y` as the *label*.
>
> *Note: You may encounter the following warning:*
>
> > A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()
>
> *Note: To avoid this warning every time you train your model, you can use `DataFrame[['y']].values.ravel()` instead of `DataFrame[['y']]` when preparing your samples.*
>
> 3.  Train an SVM classifier with a regularization parameter `C = 1`. Examine the following performance metrics: `accuracy`, `f1`, `recall`, and `precision`.
>
> 4.  Check the confusion matrix: despite seemingly reasonable scores, you should notice a significant issue.
>
> 5.  Repeat the previous steps using normalized variables. Are the results different?
>
> 6.  \[OPTIONAL\] Perform 5-fold cross-validation to determine the ideal *C* parameter.
>
> 7.  Change the *x* variables. Use only the previous Republican vote share (2016) and income. The variables in question are `share_2016_republican` and `Median_Household_Income_2021`. Examine the results, particularly the confusion matrix.
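
The general workflow of this exercise (split, fit, score, confusion matrix) can be sketched on synthetic data. The example below is not the solution on the election data: the data come from `make_classification`, with an imbalance chosen arbitrarily, and the linear kernel is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced data standing in for the county-level features
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.2, 0.8], random_state=0
)

# 80% training sample, 20% test sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# SVM classifier with regularization parameter C = 1
clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
}
cm = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predictions

print(scores)
print(cm)
```

Reading the confusion matrix alongside the scores matters: with imbalanced classes, a classifier can reach a high accuracy while rarely predicting the minority class.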

We thus obtain a set of training *features* with the following structure:

And the associated *labels* are as follows:

Once the first classifier has been trained and its confusion matrix inspected, it turns out that it completely misses the 0 labels, which are in the minority. One possible reason is the scale of the variables. Income, in particular, takes much larger values than the other variables and can dominate them in a linear model. Therefore, at a minimum, it is necessary to standardize the variables, which is the purpose of the question on normalized variables.
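
A common way to standardize without leaking information from the test sample is to chain the scaler and the classifier in a pipeline, which also makes it easy to tune `C` by cross-validation. The sketch below uses synthetic data with one feature artificially put on an income-like scale; the data, the grid of `C` values and the `f1` scoring are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; the first feature is rescaled to mimic an income-like variable
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X[:, 0] *= 50_000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training folds only, then applied before the SVM
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# 5-fold cross-validation over a small, arbitrary grid of C values
search = GridSearchCV(
    pipe, param_grid={"svc__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

best_C = search.best_params_["svc__C"]
test_score = search.score(X_test, y_test)  # f1 of the refitted best model
print(best_C, round(test_score, 3))
```

The step name `svc` in `svc__C` is the one `make_pipeline` derives automatically from the class name.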

Standardizing the variables ultimately does not bring any improvement:

It is therefore necessary to go further: the problem does not lie in the scale but in the choice of variables. This is why the step of variable selection is crucial and why a chapter is dedicated to it.

With only the 2016 vote share and income as features (the last question of the exercise), the new classifier should have the following performance: