<img src="https://assets.ensam.eu/logo/fr/logo-trans-322x84.png" style="width:256px" > <img src="https://upload.wikimedia.org/wikipedia/commons/1/12/Cc-by-nc-sa_icon.svg" width=172 align=right>

**Author :** Jean-Christophe Loiseau

**Email :** [jean-christophe.loiseau@ensam.eu](mailto:jean-christophe.loiseau@ensam.eu)

**Date :** September 2025

---

In [20]:
import numpy as np
import matplotlib.pyplot as plt

# **Introduction to `scikit-learn`**

[Scikit-Learn](https://scikit-learn.org/stable/), also referred to as `sklearn`, is a free and open-source Python package covering a large variety of machine learning applications such as [regression](https://en.wikipedia.org/wiki/Regression_analysis), [classification](https://en.wikipedia.org/wiki/Statistical_classification), or [cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis). The first public version of the library was released in February 2010, and the project has since been maintained and supported (among others) by [INRIA](https://en.wikipedia.org/wiki/French_Institute_for_Research_in_Computer_Science_and_Automation). While packages such as [PyTorch](https://pytorch.org/) or [JAX](https://github.com/jax-ml/jax) have become the go-to packages for training neural networks, `sklearn` remains the most popular framework for classical machine learning tasks. Along with its large base of algorithms, one of the main reasons for the popularity of `sklearn` is its relatively simple and intuitive API. As an example, training a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) using synthetic data is a two-step procedure.

In [21]:
#> Import the RandomForest Classifier from sklearn.
from sklearn.ensemble import RandomForestClassifier

#> Synthetic data.
X = np.array([[1, 2, 3],      # Data matrix with 2 samples (rows), each having 3 features (columns).
              [11, 12, 13]])

y = np.array([0, 1]) # Classes of each sample.

#> RandomForest Classifier example.
clf = RandomForestClassifier(random_state=0) # Instantiate the un-trained classifier.
clf.fit(X, y)                                # Train the classifier on the provided data.

RandomForestClassifier(random_state=0)


Note that `clf` is a pretty standard way to name a classifier model in the `sklearn` community. Once it has been trained on data, we can then use it to make predictions for new unlabelled data as shown below.

In [22]:
#> New data points.
X_new = np.array([[4, 5, 6],
                  [14, 15, 16]])

#> Use the trained classifier to make a prediction.
clf.predict(X_new)

array([0, 1])

Almost all the algorithms provided by `sklearn` follow the same basic procedure:
1. create a `model`,
2. use `model.fit(X, y)` to train it,
3. make new predictions with `model.predict`.
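
To make the three steps above concrete, here is a minimal sketch that drives three unrelated classifiers through exactly the same calls. It reuses the toy data from the random forest example; the choice of estimators and their settings is only an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

#> Same toy data as in the random forest example.
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])
X_new = np.array([[4, 5, 6],
                  [14, 15, 16]])

#> 1. Create the models (illustrative choices, not a recommendation).
models = {
    "logistic regression": LogisticRegression(),
    "1-nearest neighbour": KNeighborsClassifier(n_neighbors=1),
    "linear SVM": SVC(kernel="linear"),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                            # 2. Train the model.
    predictions[name] = model.predict(X_new)   # 3. Predict on new data.
    print(f"{name}: {predictions[name]}")
```

Swapping one algorithm for another only changes the line where the model is created; the training and prediction code stays untouched.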

The main strength of this API is that you need to write very little code to train a model, which lets you focus instead on your data processing pipeline: loading the data, cleaning it, preparing it through standardization or dimensionality reduction, trying out many different models in an automated way with hyper-parameter tuning, evaluating the model, etc.
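
As a rough illustration of such a data processing pipeline, the sketch below chains standardization, dimensionality reduction and a classifier into a single object using `make_pipeline`. The synthetic dataset, the number of principal components and the classifier are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

#> Synthetic dataset (stand-in for real data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

#> Standardization -> dimensionality reduction -> classifier.
model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())

model.fit(X, y)             # Fits each step in sequence on the training data.
y_pred = model.predict(X)   # Applies the same preprocessing before predicting.
```

The pipeline itself exposes `fit` and `predict`, so it can be used anywhere a single estimator would be.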

## **What does this notebook cover?**

The `sklearn` package covers a lot of ground. Yet, learning how every single algorithm or method provided by it works would actually be very counter-productive. Instead, we will use this notebook as an example to illustrate what a typical `sklearn` workflow looks like and give hints and tips on how to adapt it to your particular problem. A schematic is represented below (picture by [Daniel Bourke](https://www.mrdbourke.com/)).

<img src="https://dev.mrdbourke.com/zero-to-mastery-ml/images/sklearn-workflow-title.png">

Each of these steps will be illustrated later on in this notebook. It needs to be emphasized however that, before any of these can be conducted, two preliminary albeit extremely important steps need to be undertaken. These are:

1. **Problem definition -** This is by far the most important one. Depending on your area, it may take different forms, but they all amount to properly posing the question you are trying to address, possibly justifying why it is an interesting question to ask in the first place and why machine learning techniques might be better suited to answer it than traditional methods.
    - *Business analytics example -*
    - *Science and engineering example -*
2. **Data collection -** Once you have a clear idea of the problem you are trying to address comes the question of what data to use, how much data to use, where to get it or how to collect it. Again, depending on your background, the data source may take different forms.
   - *Business analytics example -*
   - *Science and engineering example -*

Note that, when collecting/generating training data is an expensive process (e.g. running a handful of wind tunnel experiments or high-fidelity simulations), it is a good idea to incorporate some form of [design of experiments](https://en.wikipedia.org/wiki/Design_of_experiment) strategy in your data collection process to minimize the collection cost.
In any case, the type of model you'll use in your data analysis/modeling workflow will depend strongly on these two preliminary steps as well as on the hardware you have access to, the time span allotted to your project, etc. Note however that, whichever route you take, it is good practice to start with a relatively simple yet battle-tested model (e.g. a variant of linear regression or a support vector machine) to set a strong baseline. Only then will you get a good intuition of whether the computational cost of training, for instance, a deep neural network is worth it.
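
A minimal sketch of what setting a baseline can look like is given below. It compares a trivial mean predictor with ridge regression using 5-fold cross-validation; the synthetic data and the two candidate models are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

#> Synthetic regression data (stand-in for real data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

#> Two baselines: a trivial one and a simple linear one.
candidates = {
    "mean predictor": DummyRegressor(strategy="mean"),
    "ridge regression": Ridge(alpha=1.0),
}

scores = {}
for name, model in candidates.items():
    # Mean R^2 score over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Any more expensive model then has to beat these numbers by a margin that justifies its additional cost.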

**Back to a typical data analysis/modeling workflow -** Suppose for now that you have properly set up the problem and collected what you believe is a good enough dataset (whatever good means for your application). As shown in the picture above, a typical workflow consists of five steps:
1. **Get the data ready -** This step consists in loading the data onto the computer, formatting it so that it can be digested by `sklearn` (or whatever other framework you use), potentially cleaning it (e.g. imputing missing values), etc. Note that this step can be quite time-consuming.
2. **Choose a model -** Typically a classification or regression model, the particular choice depending additionally on personal preferences, computational efficiency, whether you are only interested in making a prediction or also need some form of interpretability of the model, etc.
3. **Fit the model -** This corresponds to the training part itself, i.e. running an optimizer which fits the values of the model's free parameters based on the data and the cost function you choose to minimize.
4. **Evaluate the model -** Once your model is trained, you typically evaluate how well it performs on a hold-out dataset (not used during training) to get some intuition of how well it may perform on unseen data once you deploy it into the wild.
5. **Improve through experimentation -** If you're happy with the performance of your model so far, you can stop right away. If not, you will enter a loop of adjustments and tweaks to try to improve how well your model performs. This step can be quite time-consuming and involves things such as hyper-parameter tuning, feature engineering, etc. Every time you make a change, you need to refit the model, re-test it and keep track of how its performance improves (or not) with each change.
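
The five steps listed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data is synthetic, the model is ridge regression and the values of `alpha` that are tried are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

#> 1. Get the data ready (synthetic stand-in data, with a hold-out set).
X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

#> 2. Choose a model.
model = Ridge(alpha=1.0)

#> 3. Fit the model using the training data only.
model.fit(X_train, y_train)

#> 4. Evaluate the model on the hold-out data (R^2 score).
baseline_score = model.score(X_test, y_test)

#> 5. Improve through experimentation: change something, refit, keep track.
results = {1.0: baseline_score}
for alpha in (0.01, 100.0):
    results[alpha] = Ridge(alpha=alpha).fit(X_train, y_train).score(X_test, y_test)

for alpha, score in sorted(results.items()):
    print(f"alpha = {alpha:>6}: R^2 = {score:.4f}")
```

In practice, hyper-parameters are better selected on a separate validation set or by cross-validation, so that the hold-out score remains an honest estimate of the performance on unseen data.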

One of the main features justifying the popularity of `scikit-learn` is that it offers many simple-to-use tools to conduct each of these steps. Moreover, it provides some more advanced utilities that can help automate many parts of this process using just a handful of lines of code! Let us now illustrate this.
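
As one possible illustration of such utilities, the sketch below uses `GridSearchCV` to tune a hyper-parameter of a `Pipeline` by cross-validation. The synthetic data, the step names `"scaler"` and `"ridge"`, and the grid of values are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#> Synthetic regression data (stand-in for real data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

#> Preprocessing and model wrapped into a single estimator.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

#> Hyper-parameters are addressed as <step name>__<parameter name>.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

#> Refit and score the pipeline for every candidate using 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R^2: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the pipeline refitted on the whole dataset with the selected value, and it can be used directly for prediction.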

---

## **A typical `sklearn` workflow**

### **Presentation of the problem**

The problem we will tackle in this notebook is a rather classical problem in fluid dynamics: inferring the velocity field of a flow given a limited set of wall-mounted pressure sensors. For the sake of simplicity, we will consider the canonical incompressible flow past a two-dimensional cylinder at a Reynolds number $Re = 100$. This Reynolds number, based on the free-stream velocity $U_{\infty}$, the cylinder diameter $D$ and the kinematic viscosity $\nu$, is well above the onset of vortex shedding ($Re \simeq 48$) and below the onset of three-dimensional instabilities ($Re \simeq 190$). The computational domain is given by $\Omega = \left[ -5, 15 \right] \times \left[-5, 5 \right]$ with the center of the cylinder located at the origin. The acquisition domain is uniformly discretized using 500 points in the streamwise direction and 250 in the cross-stream one. Only post-transient dynamics are considered. The dataset comprises 1000 snapshots of the full vorticity field sampled at a frequency $f_s = ?? $ Hz along with synchronous readings from 32 equispaced pressure sensors mounted on the cylinder's wall. The cell below loads the data and plots a representative vorticity field.

Before going any further, let us restate precisely what the task at hand is.

> Given pairs of sensor readings and vorticity fields, train a model that can predict the whole vorticity field given the pressure sensor measurements. If possible, determine which of the 32 pressure sensors are the most informative for the given task.

This is clearly a [regression](https://en.wikipedia.org/wiki/Regression_analysis) problem, i.e. predicting a continuous value (the vorticity at a given grid point) given a set of continuous features (the readings from the pressure sensors). In fact, a naïve approach would amount to solving $500 \times 250 = 125\ 000$ regression problems (i.e. one for each grid point). While certainly doable, such a naïve approach does not leverage known properties of the physical system generating this dataset, namely that the flow exhibits large-scale spatio-temporally coherent vortices (the von Kármán vortex street), which makes the snapshots strongly correlated both in space and in time. From a statistical point of view, this often implies that the data can be compressed into a *latent* representation having a much smaller dimensionality, thus potentially reducing drastically the computational cost of fitting such a model. Note that we could also leverage the temporally correlated nature of the data to construct a recursive filter (e.g. a Kalman filter), although this is outside the scope of `sklearn`, and we'll thus restrict ourselves to a classical regression estimator.
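
To give a flavour of the latent-representation idea, here is a minimal sketch on synthetic data. It is not the cylinder flow dataset: the travelling-wave field, the hypothetical sensor signals, the use of PCA for compression and the choice of two latent dimensions are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

#> Synthetic stand-in data: a travelling wave sampled on a small grid.
n_samples, nx, ny, n_sensors = 200, 50, 25, 8
t = np.linspace(0.0, 20.0, n_samples)
x, y = np.meshgrid(np.linspace(0.0, 10.0, nx), np.linspace(-2.0, 2.0, ny))

#> Each row of `fields` is one flattened snapshot (nx * ny grid points).
fields = np.array(
    [(np.sin(2.0 * x - 3.0 * ti) * np.exp(-y**2)).ravel() for ti in t]
)

#> Hypothetical sensors: phase-shifted harmonic signals with a little noise.
phases = np.linspace(0.0, np.pi, n_sensors, endpoint=False)
sensors = np.cos(3.0 * t[:, None] + phases[None, :])
sensors += 0.01 * rng.standard_normal(sensors.shape)

S_train, S_test, F_train, F_test = train_test_split(
    sensors, fields, test_size=0.25, random_state=0
)

#> Compress the snapshots into a low-dimensional latent representation.
pca = PCA(n_components=2).fit(F_train)
Z_train = pca.transform(F_train)

#> One small multi-output regression: sensor readings -> latent coordinates.
reg = LinearRegression().fit(S_train, Z_train)

#> Predict the latent coordinates, then map them back to the full field.
F_pred = pca.inverse_transform(reg.predict(S_test))
rel_error = np.linalg.norm(F_pred - F_test) / np.linalg.norm(F_test)
print(f"Relative reconstruction error: {rel_error:.2e}")
```

Instead of one regression per grid point, the model only has to predict two latent coordinates; the mapping back to the 1250 grid points is handled by `pca.inverse_transform`.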