# Part 2: communicating with data

Lino Galiana  
2025-03-19

# 1. Introduction

An essential part of the work of a *data scientist*
is to synthesize the information contained
in their datasets in order to distinguish
what constitutes the signal, which they
can focus on, and what constitutes
the noise inherent in any dataset.
In the work of a *data scientist*, during an exploratory phase,
there is a constant back-and-forth between synthesized information
and disaggregated datasets. It
is therefore essential to know how to synthesize the information
in a dataset before grasping its structure, which
can then guide further analyses,
whether for a modeling phase or for data correction
(detecting anomalies or faulty data collection).

We have already explored a key part of this work,
namely the construction of relevant
and reliable descriptive statistics. However, if we were content
to present information using raw outputs from the `groupby` and `agg`
combo on a `Pandas` *DataFrame*, our understanding of the data would be quite
limited. The implementation of stylized tables using
`great_tables` was already a step forward in this process but, in truth,
our brain processes information much more intuitively
through simple graphical visualizations than through a table.
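To make the contrast concrete, here is a minimal sketch on a hypothetical toy dataset (the `district` and `bikes` columns and their values are invented for illustration and do not come from the course data). It prints the raw output of the `groupby` and `agg` combo, then draws the same statistic as a sorted bar chart with the `plot` method of `Pandas`, which relies on `Matplotlib`.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Hypothetical toy data, not taken from the course datasets.
df = pd.DataFrame(
    {
        "district": ["North", "North", "South", "South", "East", "East"],
        "bikes": [120, 150, 80, 100, 200, 220],
    }
)

# Raw output of the groupby/agg combo: correct, but the ranking must be read cell by cell.
summary = df.groupby("district").agg(mean_bikes=("bikes", "mean"))
print(summary)

# Same statistic as a sorted horizontal bar chart: the ranking is visible at a glance.
ax = summary.sort_values("mean_bikes").plot(kind="barh", legend=False)
ax.set_xlabel("Average bike count")
ax.set_ylabel("")

# Export the figure as a PNG (kept in memory here to avoid writing a file).
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format="png", bbox_inches="tight")
png_bytes = buffer.getvalue()
```

Sorting before plotting is a deliberate choice in this sketch: with only three bars the gain is small, but ordering categories by value is usually what turns a table into a readable ranking.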

## 1.1 Data visualization, an essential part of communication work

As humans, our cognitive capacities are limited, and we can only grasp
a limited amount of information, whereas computers are capable of processing
large volumes of information. For a *data scientist*, this means
that using our computational and statistical skills to obtain
synthetic representations of our many datasets is
essential to meet operational or scientific needs.
The range of methods and tools that make up the toolbox
of *data scientists* aims to simplify the understanding and subsequent exploitation
of datasets whose volume exceeds our cognitive capacities.

This brings us to the question of data visualization,
a set of tools and principles for representing
stylized facts or contextualizing individual data in a synthetic manner.
Data visualization is the art and science of **representing complex and abstract information through visual elements**.
Its primary goal is to synthesize the information contained in a dataset to facilitate
the understanding of its key issues for further analysis.
Among other things, data visualization makes it possible to highlight trends, correlations, or
anomalies that might be difficult or even impossible to grasp just by looking at raw data, which requires
some context to make sense of.

Data visualization plays a crucial role in the
data analysis process by providing visual means to explore, interpret, and communicate information.
It facilitates communication between data experts, decision-makers, and the general public,
enabling the latter to benefit from the rigorous work of the former and make
sense of the data without needing the deep conceptual knowledge that underpins
the synthesized information.

## 1.2 The role of visualization in the data value creation process

Data visualization is not limited to the final phase of a project,
which is the communication of results to an audience that does not have access to the data
or the means to make use of it.
Visualization plays a role at every stage of the data value creation process.
It is, in fact, an essential part of the process of transitioning
from a record, a snapshot of a phenomenon, to data—
a record that has value because it carries information on its own
or when combined with other records.

The daily work of a *data scientist*
involves examining a dataset from every angle
to identify key value extraction opportunities.
Quickly knowing what statistics to represent, and how,
is crucial for saving time during this exploratory phase.
This is primarily a form of self-communication
that can afford to be rough around the edges, as the goal is to sketch
the work before refining certain aspects. The challenge at this stage of
the process is not to overlook any dimension that could potentially bring value.

The truly time-consuming communication work comes
when presenting to an audience with limited data access,
unfamiliar with sources,
with a limited attention span,
or without quantitative skills. These
audiences cannot be satisfied with raw outputs like
a *DataFrame* in a *notebook* or a graph created
in seconds with the `plot` method from `Pandas`.
It is important to adapt to their evolving expectations,
and the tools they are familiar with, which explains the growing importance of
websites dedicated to *data visualizations*.

# 2. Communicating, an opening to *data storytelling*

Data visualization thus holds a special place among
the various techniques of *data science*.
It is involved at all stages of the data production process,
from upstream (exploratory analysis) to downstream (presenting results to various audiences), and
when well-constructed, it allows us to intuitively grasp the structure of the data
or the key issues of its analysis.

As an art of synthesis, data visualization
is also the art of storytelling, and
when done well, it can even reach the level of artistic production.
*Data visualization* is a profession in its own right, with more and more practitioners found in media outlets
or specialized companies (`Datawrapper`, for example).

Without aiming to create
visualizations as sophisticated as those produced by specialists,
every *data scientist* should be able to
quickly generate visualizations that synthesize the information in the
datasets at hand.
A clear and readable visualization, while remaining simple,
can be more effective than a speech in conveying a message.

Just like a speech, a visualization is a form of communication
in which a speaker—the person constructing the visualization—
seeks to convey information to a recipient—potentially
the same person as the speaker since a visualization can be
created for oneself during exploratory analysis. It is
no surprise that during the period when semiology played a significant
role in intellectual debates, especially around
the figure of Roland Barthes, the concept of graphic semiology
emerged, centered around Jacques Bertin (Bertin 1967; Palsky 2017).
This approach allows reflection on the relevance of the
techniques used to convey a graphic message, and many visualizations, if they
followed some of these rules, could
be improved at little cost.

Eric Mauvière, a French statistician and a successor
to Bertin’s school of graphic semiology,
offers excellent content on the subject. Some
of his presentations, notably the one for [`SSPHub`](https://ssphub.netlify.app/),
presented in <a href="#nte-mauviere-en" class="quarto-xref">Note 2.1</a>,
should be viewed in all *data science* training programs as they
highlight the numerous pitfalls encountered by *data scientists*.

<figure>
<img src="https://raw.githubusercontent.com/InseeFrLab/ssphub/main/talk/2024-02-29-mauviere/mauviere.png" alt="An example of two visualizations made from the same dataset by Eric Mauvière, see Note 2.1" />
<figcaption aria-hidden="true">An example of two visualizations made from the same dataset by Eric Mauvière, see Note 2.1</figcaption>
</figure>

> **Note 2.1: A talk by Eric Mauvière on the subject**
>
> <https://minio.lab.sspcloud.fr/lgaliana/ssphub/replay/20240229-dataviz-mauviere/video1991622347.mp4>

# 3. Communicating, an opening to app development

The goal of this course is to introduce the main tools
and the approach that *data scientists* should adopt
when working with various datasets. However, it is becoming increasingly common for
*data scientists* to develop and provide
interactive applications offering a range of explorations
and automated data visualizations.
These are more advanced topics than this course covers, but they often
serve as an entry point to *data science* for
audiences close to *data scientists*, such as *data engineers*,
*data analysts*, or statisticians.

We will mention some of the preferred tools for doing this,
especially ecosystems related to *web* applications
and `Javascript` tools. This need, now fairly standard
for *data scientists*, bridges the gap with production deployment,
which is the main focus of a third-year ENSAE course
designed by Romain Avouac and myself ([course website](https://ensae-reproductibilite.github.io/website/)). This website, for example, is built
on this principle using tools that allow `Python` code to be reproducibly executed
on standardized servers and then made available through a website.

# 4. The `Python` ecosystem 

Returning to our course,
in this section we will present some basic libraries
and visualizations in `Python` that provide
a good starting point. There are plenty of resources
to deepen and advance in the art of visualization,
such as [this book](https://clauswilke.com/dataviz/) (Wilke 2019).

## 4.1 Data visualization packages

The `Python` ecosystem for data visualization is vast and
diverse.
Entire books could be dedicated to it (Dale 2022).
`Python` offers
numerous libraries to quickly and relatively
easily produce data visualizations[1].

The graphical libraries are mainly divided into two families:

-   Libraries for **static representations**. These are primarily intended for integration
    into fixed publications such as PDFs or text documents. We will mainly present
    `Matplotlib` and `Seaborn`, but there are others emerging,
    such as [`Plotnine`](https://plotnine.readthedocs.io/en/stable/), an adaptation of [`ggplot2`](https://juba.github.io/tidyverse/08-ggplot2.html) to the `Python` ecosystem.
-   Libraries for **interactive representations**. These are suited for *web* representations
    and allow readers to interact with the displayed graphical representation.
    Libraries offering these features usually rely on `JavaScript`, the
    web development ecosystem, with an entry point through `Python`.
    We will primarily discuss `Plotly` and `Folium` in this family, but many
    other frameworks exist in this field[2].

It is entirely possible
to create sophisticated visualizations with an end-to-end `Python` workflow since it is a versatile
language with a very
rich ecosystem. However, `Python` is not a cure-all, and sometimes
it can be useful to finalize a perfectly polished product with other languages, such as `JavaScript`
for interactive visualizations or `QGIS` for
cartographic work. This course will provide the basic tools
to quickly and enjoyably produce work, but as the saying goes, the devil is in the details, so one should not
insist on using `Python` for every task.

In the realm of visualization, this course takes the approach
of exploring a few
central libraries through a limited number of examples by replicating charts found on the open data
website of the city of Paris.
The best training for visualization remains
practicing on datasets, so it is recommended to explore the richness
of the open data ecosystem to experiment with visualizations.

## 4.2 Visualization applications

This part of the course focuses on simple synthetic representations.
It does not (*yet?*) cover the construction of data visualization applications
where a set of graphs update synchronously based on user interactions.

This indeed exceeds the scope of an introductory course, as building
these applications
requires mastering more complex concepts like the interaction between a
*web* page and a server, having some knowledge of `Linux`, etc.
The concepts necessary to understand these tools are at the heart
of the course [“Deploying Data Science Projects”](https://ensae-reproductibilite.github.io/website/)
that Romain Avouac and I teach in the third year at ENSAE.

Nevertheless, since data value creation in the form of applications is very
common, it is useful, at a minimum, to mention the distinction between static
sites and dynamic applications to provide the right approach and point to the
appropriate tools.
In the world of applications, it is important to distinguish between the *front* (the page
visible to the application’s users) and the *back office* (the engine
that performs actions based on parameters chosen by the user
on the page).

There are primarily two paradigms for making
these two elements interact: static sites and reactive applications. The key difference between them is the kind of server they rely on. A static site is delivered by a *web* server, whereas a reactive application, built for example with `Streamlit`, relies on a standard *backend* server. These two types of servers differ in function and usage:

-   A *web* server is specifically designed to store, process, and deliver web pages (the *front*) to clients. This includes HTML, CSS, JavaScript files, images, etc. Web servers listen for HTTP/HTTPS requests from user browsers and respond by sending the requested data. This does not preclude complex data processing or reactivity, which can be obtained by embedding `JavaScript` in the page, but any `Python` processing is done before the site is made available. For `Python` users, several tools can generate a static site ahead of deployment, for example on [`GitHub Pages`](https://pages.github.com/). The two most common ecosystems are [`Quarto Markdown`](https://quarto.org/) and [`Django`](https://www.djangoproject.com/), with the former being simpler to use and maintain than the latter (`Django` is, strictly speaking, a full web framework rather than a static site generator). This site, for example, is built using `Quarto`, which ensures reproducibility of the presented examples and ergonomic, customizable formatting of the results.
-   A standard *backend* server is designed to perform operations in response to a *front*, in this case, a *web* page. In the context of an application built with `Python`, this is a server with an appropriate `Python` environment to execute the code required to respond to any action taken by an application user. The code is executed on demand rather than once and for all, as in the previous approach. This paradigm allows for more application complexity but represents an additional challenge during the deployment phase. In the `Python` ecosystem, the two main tools for building such applications are [`Streamlit`](https://streamlit.io/) and [`Dash`](https://dash.plotly.com/), with the former being quicker to implement than the latter. More recently, [`Shiny`](https://shiny.posit.co/), the dominant equivalent in the `R` ecosystem, has been adapted for `Python` by `Posit`.
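The difference between the two paradigms can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `COUNTS` data, the `counter` query parameter, and the names `mean_count`, `build_static_page`, `OnDemandHandler`, and `start_backend` are all hypothetical, and `http.server` is a stand-in for what frameworks such as `Streamlit` or `Dash` handle in a real application.

```python
import json
import statistics
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# Hypothetical toy data: a few counts per counter (not from the course).
COUNTS = {"A": [120, 135, 150], "B": [80, 95, 65]}


def mean_count(name: str) -> float:
    """The 'Python processing step' shared by both paradigms."""
    return statistics.mean(COUNTS[name])


def build_static_page(output_dir: Path) -> Path:
    """Paradigm 1: run the computation once, at build time, and write HTML."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{mean_count(name):.1f}</td></tr>" for name in COUNTS
    )
    page = output_dir / "index.html"
    page.write_text(f"<html><body><table>{rows}</table></body></html>", encoding="utf-8")
    return page  # any plain web server can now deliver this file as-is


class OnDemandHandler(BaseHTTPRequestHandler):
    """Paradigm 2: run the computation at each request, based on user input."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        name = query.get("counter", ["A"])[0]
        if name not in COUNTS:
            self.send_error(404, "unknown counter")
            return
        body = json.dumps({"counter": name, "mean": mean_count(name)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_backend() -> ThreadingHTTPServer:
    """Start the on-demand server on a free local port, in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), OnDemandHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Build the static page once; no Python is needed afterwards to serve it.
with tempfile.TemporaryDirectory() as tmp:
    static_html = build_static_page(Path(tmp)).read_text(encoding="utf-8")
```

Calling `start_backend()` would expose the same statistic at a URL such as `/?counter=B`, computed again at every request. In the first case a `Python` environment is only needed at build time; in the second it must stay available for as long as the application runs, which is where the extra deployment cost comes from.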

> **Is `tkinter` still used?**
>
> The ecosystems presented above for reactive applications are *web frameworks*. They are distinct from thick clients like [`tkinter`](https://docs.python.org/3/library/tkinter.html),
> the historical tool for building graphical user interfaces. Besides the more rudimentary look of
> `tkinter` interfaces compared to those of `Streamlit`, `Dash`, or `Shiny`, there are
> strong reasons to prefer the latter over `tkinter`.
>
> `tkinter` is a thick client, meaning it is tied to an operating system
> and requires pre-installation of *packages* before the interface can run.
> While it is certainly possible to make it portable, as discussed in the
> [production course](https://ensae-reproductibilite.github.io/website/),
> there are many reasons why this approach may lead to errors
> or unexpected bugs. *Web frameworks* have the advantage of simplifying
> this deployment process by separating the *front* (HTML and CSS pages) from the *back* (the
> `Python` code). They have naturally become more popular, even though many
> dated online resources still exist for developing applications with `tkinter`.

When it comes to building applications, the first question to ask is: *“Do I need to build a reactive application, or will a static site suffice?”* The latter is much easier to implement and has minimal maintenance overhead, making it a rational choice in many cases. If building a static site becomes complex, for example because of sophisticated calculations that would be difficult to implement without `JavaScript` skills, you can consider separating the *front* from the *back* by delegating the calculations to an API built, for example, with [`FastAPI`](https://fastapi.tiangolo.com/). This can be a practical way to deploy a machine learning model, as will be discussed in the final chapter of the modeling section. If implementing an API seems too complicated or overkill for the task, you can turn to a reactive application framework like `Streamlit`.

Again, building an application involves concepts that go beyond an introductory level in `Python`. However, being aware of the right practices can save significant time by avoiding pitfalls due to poor initial choices.

## 4.3 Summary of this section

Returning to the content of this section after this aside, it
is divided into two parts, and each chapter is dual in nature, depending
on whether we are focused on static or dynamic representations:

-   First, we will discuss
    standard graphical representations (histograms, bar charts, etc.) to synthesize quantitative information;
    -   Static representations will rely on `Pandas`, `Matplotlib`, and `Seaborn`
    -   Reactive charts will be built using `Plotly`
-   Second, we will present cartographic representations:
    -   Static maps created with `Geopandas` or `plotnine`
    -   Reactive maps using `Folium` (a `Python` adaptation of the `Leaflet.js` library)

## 4.4 Useful references

Data visualization is an art that is learned primarily
through practice, especially at the beginning. However, it is not always easy to produce
readable and ergonomic visualizations,
so it is helpful to draw inspiration from examples by
specialists (major media outlets offer excellent visualizations).

Here are some useful resources on these topics:

-   [`Datawrapper`](https://blog.datawrapper.de/) offers an excellent blog on
    best practices for visualization, particularly
    with articles by [Lisa Charlotte Muth](https://lisacharlottemuth.com/). I especially recommend this article on
    [colors](https://blog.datawrapper.de/emphasize-with-color-in-data-visualizations/) and
    this one on [text](https://blog.datawrapper.de/text-in-data-visualizations/);
-   The [blog of Eric Mauvière](https://www.icem7.fr/);
-   *[“La Sémiologie graphique de Jacques Bertin a cinquante ans”](https://visionscarto.net/la-semiologie-graphique-a-50-ans)*;
-   The [trending visualizations](https://observablehq.com/explore) on `Observable`;
-   The *New York Times* (masters of *dataviz*) reviews the best visualizations
    of the year annually, often in the vein of [*data scrollytelling*](https://makina-corpus-blog-scrollytelling.netlify.app/). For example, see the [2022 retrospective](https://www.nytimes.com/interactive/2022/12/28/us/2022-year-in-graphics.html).

And a few additional references mentioned in this introduction:

Bertin, Jacques. 1967. *Sémiologie Graphique*. Paris: Mouton/Gauthier-Villars.

Dale, Kyran. 2022. *Data Visualization with Python and JavaScript*. O’Reilly Media.

Palsky, Gilles. 2017. “La sémiologie Graphique de Jacques Bertin a Cinquante Ans.” *Visions Carto (En Ligne)*.

Wilke, Claus O. 2019. *Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures*. O’Reilly Media.

[1] To be honest, for a long time, `Python` was a bit less enjoyable in this regard
compared to `R`, which benefits from the
indispensable library [`ggplot2`](https://juba.github.io/tidyverse/08-ggplot2.html).

Not built on the [grammar of graphics](http://r.qcbs.ca/workshop03/book-fr/la-grammaire-des-graphiques-gg.html),
the main graphical library in `Python`, `Matplotlib`, is more cumbersome
to use than `ggplot2`.

[`seaborn`](https://seaborn.pydata.org/), which we will present,
simplifies graphical representation somewhat, but again, it is difficult to find
something more flexible and universal than `ggplot2`.

The library [`plotnine`](https://plotnine.readthedocs.io/en/stable/) aims to provide a similar implementation
to `ggplot` for `Python` users. Its development is worth following.

[2] In this regard, I highly recommend keeping up with data visualization
news on the platform [`Observable`](https://observablehq.com/), which tends to
bring together the communities of *dataviz* specialists and data analysts. The library [`Plot`](https://observablehq.com/plot/) could become
a new standard in the coming years, a sort of middle ground
between `ggplot` and `d3`.