# Building graphics with Python

Lino Galiana  
2025-03-19

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/visualisation/matplotlib.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«matplotlib»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/visualisation%20matplotlib%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«matplotlib»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/visualisation%20matplotlib%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/visualisation/matplotlib.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-lightbulb"></i> Skills at the End of This Chapter</h3>

-   Discover the [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/) ecosystems for constructing charts through the successive enrichment of layers.
-   Explore the modern [`plotnine`](https://plotnine.readthedocs.io/en/stable/index.html) ecosystem,
    a `Python` implementation of the `R` package [`ggplot2`](https://ggplot2.tidyverse.org/)
    for this type of representation, which offers a powerful syntax for building data visualizations through its grammar of graphics.
-   Understand the concept of interactive HTML (web format) visualizations through the [`plotly`](https://plotly.com/python/) and [`altair`](https://altair-viz.github.io/) packages.
-   Learn the challenges of graphical representation, the trade-offs needed to convey a clear message, and the limitations of certain traditional representations.

</div>

The practice of *data visualization* in this course will involve replicating charts found on the *open data* page of the City of Paris [here](https://opendata.paris.fr/explore/dataset/comptage-velo-donnees-compteurs/dataviz/?disjunctive.id_compteur&disjunctive.nom_compteur&disjunctive.id&disjunctive.name) or proposing alternatives using the same data.

The goal of this chapter is not to provide a comprehensive inventory of charts that can be created with `Python`. That would be long, somewhat tedious, and unnecessary, as websites like [python-graph-gallery.com/](https://python-graph-gallery.com/) already excel at showcasing a wide variety of examples. Instead, the objective is to illustrate, through practice, some key challenges and opportunities related to using the main graphical libraries in `Python`.

We can distinguish several major families of visualizations: representations of distributions specific to a single variable, representations of relationships between multiple variables, and maps that allow spatial representation of one or more variables.

These families themselves branch into various types of figures. For instance, depending on the nature of the phenomenon, relationship representations may take the form of a time series (evolution of a variable over time), a scatter plot (correlation between two variables), or a bar chart (highlighting the relative values of one variable in relation to another), among others.

Rather than an exhaustive inventory of possible visualizations, this chapter and the next will present some visualizations that may inspire further analysis before implementing a form of modeling. This chapter focuses on traditional visualizations, while the [next chapter](../../content/visualisation/maps.qmd) is dedicated to cartography. Together, these two chapters aim to provide the initial tools for synthesizing the information present in a dataset.

The next step is to deepen the work of communication and synthesis through various forms of output, such as reports, scientific publications or articles, presentations, interactive applications, websites, or notebooks like those provided in this course. The general principle is the same regardless of the medium and is particularly relevant for data scientists working with intensive data analysis. This will be the subject of a future chapter in this course[1].

<div class="alert alert-danger" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-triangle-exclamation"></i> Important</h3>

Being able to create interesting data visualizations is a necessary skill for any *data scientist* or researcher. To improve the quality of these visualizations, it is recommended to follow certain advice from *dataviz* specialists on graphical semiology.

Good data visualizations, like those from the *New York Times*, rely not only on appropriate tools (such as `JavaScript` libraries) but also on certain design principles that allow the message of a visualization to be understood in just a few seconds.

This [blog post](https://blog.datawrapper.de/text-in-data-visualizations/) is a resource worth consulting regularly. This [blog post by Albert Rapp](https://albert-rapp.de/posts/ggplot2-tips/10_recreating_swd_look/10_recreating_swd_look) clearly demonstrates how to gradually build a good data visualization.

</div>

<div class="alert alert-info" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-comment"></i> Note</h3>

If you are interested in `R`, a very similar version of this practical work is available in [this introductory `R` course for ENS Ulm](https://rgeo.linogaliana.fr/exercises/ggplot.html).

</div>

# 1. Data

This chapter is based on the bicycle passage count data from Parisian measurement points, published on the open data website of the City of Paris.

The use of recent historical data has been greatly facilitated by the availability of data in the `Parquet` format, a modern format more practical than CSV. For more information about this format, you can refer to the resources mentioned in the section dedicated to it in the [advanced chapter](../../content/manipulation/02_pandas_suite.qmd).

[1] This chapter will be built around the [`Quarto`](https://quarto.org/) ecosystem. In the meantime, you can consult the excellent documentation of this ecosystem and practice, which is the best way to learn.

In [1]:
import os
import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/comptage-velo-donnees-compteurs.parquet"
# problem with https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis

filename = 'comptage_velo_donnees_compteurs.parquet'


# DOWNLOAD FILE --------------------------------

# Download the file only if it is not already present locally

if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)

    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))

        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")

# READ FILE AND CONVERT TO PANDAS --------------------------

query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

df.head(3)

# 2. Initial Graphical Productions with `Pandas`’ `Matplotlib` API

Trying to produce a perfect visualization on the first attempt is unrealistic. It is much more practical to gradually improve a graphical representation to progressively highlight structural effects in a dataset.

We will begin by visualizing the distribution of bicycle counts at the main measurement stations. To do this, we will quickly create a *barplot* and then improve it step by step.

In this section, we will reproduce the first two charts from the [data analysis page](https://opendata.paris.fr/explore/dataset/comptage-velo-donnees-compteurs/dataviz/?disjunctive.id_compteur&disjunctive.nom_compteur&disjunctive.id&disjunctive.name): *The 10 counters with the highest hourly average* and *The 10 counters that recorded the most bicycles*. The numerical values of the charts may differ from those on the webpage, which is expected, as we are not necessarily working with data as up-to-date as that online.

To import the graphical libraries we will use in this chapter, execute

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *

## 2.1 Understanding the Basics of `matplotlib`

`matplotlib` dates back to the early 2000s and emerged as a `Python` alternative for creating charts, similar to `Matlab`, a proprietary numerical computation software. Thus, `matplotlib` is quite an old library, predating the rise of `Python` in the data processing ecosystem. This is reflected in its design, which may not always feel intuitive to those familiar with the modern *data science* ecosystem. Fortunately, many libraries build upon `matplotlib` to provide syntax more familiar to *data scientists*.

`matplotlib` primarily offers two levels of abstraction: the figure and the axes. The figure is essentially the “canvas” that contains one or more axes, where the charts are placed. Depending on the situation, you might need to modify figure or axis parameters, which makes chart creation highly flexible but also potentially confusing, as it’s not always clear which abstraction level to modify[1]. As shown in <a href="#fig-matplotlib" class="quarto-xref">Figure 2.1</a>, every element of a figure is customizable.

<figure id="fig-matplotlib">
<img src="https://matplotlib.org/stable/_images/anatomy.png" />
<figcaption>Figure 2.1: Understanding the Anatomy of a <code>matplotlib</code> Figure (Source: <a href="https://matplotlib.org/stable/users/explain/quick_start.html">Official Documentation</a>)</figcaption>
</figure>
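To make the figure/axes distinction more concrete, here is a minimal sketch on made-up data (not the bicycle counts). It assumes a single figure holding two axes, and shows which calls act on one chart and which act on the whole canvas.

``` python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2, 50)  # toy data, for illustration only

# One Figure (the canvas) holding two Axes (the individual charts)
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Axes-level calls only affect the chart they are called on
ax_left.plot(x, x)
ax_left.set_title("Linear")
ax_right.plot(x, x**2)
ax_right.set_title("Quadratic")

# Figure-level calls affect the whole canvas
fig.suptitle("One figure, two axes")
fig.tight_layout()
```

When a customization does not seem to have any effect, a useful first check is whether it was applied at the right level (`fig` or one of the axes).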

In practice, there are two ways to create and update your figure, depending on your preference:

-   The explicit approach, inheriting an object-oriented programming logic, where `Figure` and `Axes` objects are created and updated directly.
-   The implicit approach, based on the `pyplot` interface, which uses a series of functions to update implicitly created objects.

## Explicit Approach (Object-Oriented Approach)

``` python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

# Note that even in the OO-style, we use `.pyplot.figure` to create the Figure.
fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')
ax.plot(x, x, label='linear')  # Plot some data on the Axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the Axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the Axes.
ax.set_ylabel('y label')  # Add a y-label to the Axes.
ax.set_title("Simple Plot")  # Add a title to the Axes.
ax.legend()  # Add a legend.
```

Source: [Official `matplotlib` Documentation](https://matplotlib.org/stable/users/explain/quick_start.html)

## Implicit Approach

``` python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

plt.figure(figsize=(5, 2.7), layout='constrained')
plt.plot(x, x, label='linear')  # Plot some data on the (implicit) Axes.
plt.plot(x, x**2, label='quadratic')  # etc.
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()
```

Source: [Official `matplotlib` Documentation](https://matplotlib.org/stable/users/explain/quick_start.html)

These elements are the minimum required to understand the logic of `matplotlib`. To become more comfortable with these concepts, repeated practice is essential.

## 2.2 Discovering `matplotlib` through `Pandas`

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 1: Create an Initial Plot</h3>

The data includes several dimensions that can be analyzed statistically. We’ll start by focusing on the volume of passage at various counting stations.

Since our goal is to summarize the information in our dataset, we first need to perform some *ad hoc* aggregations to create a readable plot.

1.  Retain the ten stations with the highest average. To get an ordered plot from largest to smallest using `Pandas` plot methods, the data must be sorted from smallest to largest (yes, it’s odd but that’s how it works…). Sort the data accordingly.

2.  Initially, without worrying about styling or aesthetics, create the structure of a *barplot* (bar chart) as seen on the
    [data analysis page](https://opendata.paris.fr/explore/dataset/comptage-velo-donnees-compteurs/dataviz/?disjunctive.id_compteur&disjunctive.nom_compteur&disjunctive.id&disjunctive.name).

3.  To prepare for the second figure, retain only the 10 stations that recorded the highest total number of bicycles.

4.  As in question 2, create a *barplot* to replicate figure 2 from the Paris open data portal.

</div>

The top 10 stations from question 1 are those with the highest average bicycle traffic. These reordered data allow for creating a clear visualization highlighting the busiest stations.
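As an illustration of the aggregation logic only (not the exercise solution), here is a sketch on a tiny made-up table. The column names `nom_compteur` and `sum_counts` come from the dataset; the values and the choice of keeping two counters instead of ten are assumptions made for readability.

``` python
import pandas as pd

# Toy data mimicking the structure of the bicycle counts (values are made up)
toy = pd.DataFrame({
    "nom_compteur": ["A", "A", "B", "B", "C", "C"],
    "sum_counts": [10, 30, 100, 140, 50, 70],
})

# Average count per counter, sorted from smallest to largest,
# then keep the last rows, i.e. the largest averages
top = (
    toy.groupby("nom_compteur")
    .agg({"sum_counts": "mean"})
    .sort_values("sum_counts")
    .tail(2)
)

# With horizontal bars, the last row is drawn at the top of the chart,
# which is why the ascending sort gives a largest-first reading order
ax = top.plot(kind="barh", legend=False)
```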

[1] Thankfully, with a vast amount of online code using `matplotlib`, code assistants like `ChatGPT` or `Github Copilot` are invaluable for creating charts based on instructions.

Figure 1, without any styling, displays the data in a basic *barplot*. While it conveys the essential information, it lacks aesthetic layout, harmonious colors, and clear annotations, which are necessary to improve readability and visual impact.

Figure 2 without styling:

We are starting to create something that conveys a synthetic message about the nature of the data. However, several issues remain (e.g., labels), as well as elements that are either incorrect (axis titles, etc.) or missing (graph title…).

Since the charts produced by `Pandas` follow the highly flexible logic of `matplotlib`, they can be customized. However, this often requires significant effort, and the `matplotlib` grammar is not as standardized as `ggplot` in `R`. If you wish to remain in the `matplotlib` ecosystem, it is better to use `seaborn` directly, which provides ready-to-use arguments. Alternatively, you can switch to the `plotnine` ecosystem, which offers a standardized syntax for modifying elements.

# 3. Using `seaborn` Directly

## 3.1 Understanding `seaborn` in a Few Lines

`seaborn` is a high-level interface built on top of `matplotlib`. This package provides a set of features to create `matplotlib` figures or axes directly from a function with numerous arguments. If further customization is needed, `matplotlib` functionalities can be used to update the figure, whether through the implicit or explicit approaches described earlier.

As with `matplotlib`, the same figure can be created in multiple ways in `seaborn`. `seaborn` inherits the figure-axes duality from `matplotlib`, requiring frequent adjustments at either level. The main characteristic of `seaborn` is its standardized entry points, such as `seaborn.relplot` or `seaborn.catplot`, and its *input* logic based on `DataFrame`, whereas `matplotlib` is structured around `Numpy` arrays.

The figure now conveys a message, but it is still not very readable. There are several ways to create a *barplot* in `seaborn`. The two main ones are:

-   `sns.catplot`
-   `sns.barplot`

For this exercise, we suggest using `sns.catplot`. It is a common entry point for plotting graphs of a discretized variable.
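As a rough sketch of what this entry point looks like, the example below calls `sns.catplot` on a small made-up `DataFrame` (the values and labels are assumptions, only the column names come from the dataset) and then customizes the result.

``` python
import pandas as pd
import seaborn as sns

# Toy data (made-up values); column names follow the bicycle dataset
toy = pd.DataFrame({
    "nom_compteur": ["Counter A", "Counter B", "Counter C"],
    "sum_counts": [120.0, 60.0, 20.0],
})

# Figure-level entry point: returns a FacetGrid wrapping a matplotlib figure
g = sns.catplot(
    data=toy, x="sum_counts", y="nom_compteur",
    kind="bar", height=3, aspect=2,
)

# Further customization goes through seaborn helpers or the matplotlib axes
g.set_axis_labels("Average hourly count", "Counter")
g.ax.set_title("Toy example")
```

The returned object exposes the underlying `matplotlib` axes through `g.ax`, so anything learned in the `matplotlib` section remains applicable.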

## 3.2 The bar chart (*barplot*)

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 2: Reproduce the First Figure with Seaborn</h3>

1.  Reset the index of the dataframes `df1` and `df2` to have a column *‘Nom du compteur’*. Reorder the data in descending order to obtain a correctly ordered graph with `seaborn`.

2.  Redo the previous graph using seaborn’s `catplot` function. To control the size of the graph, you can use the `height` and `aspect` arguments.

3.  Add axis titles and the graph title for the first graph.

4.  Try coloring the `x` axis in red. You can pre-define a style with `sns.set_style("ticks", {"xtick.color": "red"})`.

</div>

At the end of question 2, that is, by using `seaborn` to minimally reproduce a *barplot*, we get:

After some aesthetic adjustments, at the end of questions 3 and 4, we get a figure close to that of the Paris *open data* portal.

The additional parameters proposed in question 4 ultimately allow us to obtain the figure

This shows that Boulevard de Sébastopol is the most traveled, which won’t surprise you if you cycle in Paris. However, if you’re not familiar with Parisian geography, this will provide little information for you. You’ll need an additional graphical representation: a map! We will cover this in a future chapter.

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 2b: Reproducing the Figure “The 10 Counters That Recorded the Most Bicycles”</h3>

Following the gradual approach of Exercise 2, recreate the chart *The 10 Counters That Recorded the Most Bicycles* using `seaborn`.

</div>

## 3.3 An Alternative to the *Barplot*: the *Lollipop Chart*

Bar charts (*barplot*) are extremely common, likely due to the legacy of Excel, where these charts can be created with just a couple of clicks. However, in terms of conveying a message, they are far from perfect. For example, the bars take up a lot of visual space, which can obscure the intended message about relationships between observations.

From a semiological perspective, that is, in terms of the effectiveness of conveying a message, *lollipop charts* are preferable: they convey the same information but with fewer visual elements that might clutter understanding.

*Lollipop charts* are not perfect either but are slightly more effective at conveying the message. To learn more about alternatives to bar charts, Eric Mauvière’s talk for the public statistics data scientists network, whose main message is *“Unstack your figures”*, is worth exploring ([available on ssphub.netlify.app/](https://ssphub.netlify.app/talk/2024-02-29-mauviere/)).
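One possible way to draw a lollipop chart, sketched here with plain `matplotlib` on made-up values, is to combine a thin line (the stick) with a single marker (the candy) for each observation. This is only one construction among several.

``` python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    "nom_compteur": ["Counter A", "Counter B", "Counter C"],
    "sum_counts": [20, 60, 120],
})

positions = list(range(len(toy)))  # one vertical position per counter

fig, ax = plt.subplots(figsize=(6, 2.5))

# The stick: a thin horizontal line from 0 to the value
ax.hlines(y=positions, xmin=0, xmax=toy["sum_counts"], color="grey")
# The candy: a single marker at the value
ax.plot(toy["sum_counts"], positions, "o")

# Replace numeric positions with the counter names
ax.set_yticks(positions)
ax.set_yticklabels(toy["nom_compteur"])
ax.set_xlabel("Total count")
ax.set_xlim(left=0)
```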

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 3 (optional): Reproduce Figure 2 with a lollipop chart</h3>

Following the gradual approach of Exercise 2,
redo the graph *The 10 counters that have recorded the most bikes*.

💡 Don’t hesitate to consult [python-graph-gallery.com/](https://python-graph-gallery.com/) or ask `ChatGPT` for help.

</div>

# 4. The Same Figure with `Plotnine`

`plotnine` is the newcomer to the `Python` visualization ecosystem. This library is developed by `Posit`, the company behind the `RStudio` editor and the *tidyverse* ecosystem, which is central to the `R` language. This library aims to bring the logic of `ggplot` to `Python`, meaning a standardized, readable, and flexible grammar of graphics inspired by Wilkinson (2012).

<figure>
<img src="https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/elmo.jpg" alt="The mindset of ggplot2 users when they discover plotnine" />
<figcaption aria-hidden="true">The mindset of <code>ggplot2</code> users when they discover <code>plotnine</code></figcaption>
</figure>

In this approach, a chart is viewed as a succession of layers that, when combined, create the final figure. This principle is not inherently different from that of `matplotlib`. However, the grammar used by `plotnine` is far more intuitive and standardized, offering much more autonomy for modifying a chart.

<figure>
<img src="https://psyteachr.github.io/data-skills-v2/images/corsi/layers.png" alt="The logic of ggplot (and plotnine) by Lisa (2021), image itself borrowed from Field (2012)" />
<figcaption aria-hidden="true">The logic of <code>ggplot</code> (and <code>plotnine</code>) by <span class="citation" data-cites="Lisa_psyTeachR_Book_Template_2021">Lisa (2021)</span>, image itself borrowed from <span class="citation" data-cites="field2012discovering">Field (2012)</span></figcaption>
</figure>

With `plotnine`, there is no longer a dual figure-axis entry point. As illustrated in the slides below:

1.  A figure is initialized
2.  Layers are updated, a very general abstraction level that applies to the data represented, axis scales, colors, etc.
3.  Finally, aesthetics can be adjusted by modifying axis labels, legend labels, titles, etc.

Scroll through the slides below or [click here](../../slides/ggplot.qmd)
to display them in full screen.

<div class="sourceCode" id="cb1"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre><iframe class="sourceCode yaml code-with-copy" src="https://rgeo.linogaliana.fr/slides/ggplot.html#/ggplot2"></iframe></div>

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 4: Reproduce the First Figure with plotnine</h3>

This is the same exercise as Exercise 2. The objective is to create this figure with `plotnine`.

</div>

# 5. Initial Temporal Aggregations

We will now focus on the temporal dimension of our dataset using two approaches:

-   A bar chart summarizing the information in our dataset on a monthly basis;
-   Informative series on temporal dynamics, which will be the subject of the next section.

Before that, we will enhance this data to include a longer history, particularly encompassing the Covid period in our dataset. This is interesting due to the unique traffic dynamics during this time (sudden halt, strong recovery, etc.).

In [24]:
import requests
import zipfile
import io
import os
from pathlib import Path
import pandas as pd
import geopandas as gpd

list_useful_columns = [
        "Identifiant du compteur", "Nom du compteur",
        "Identifiant du site de comptage",
        "Nom du site de comptage",
        "Comptage horaire",
        "Date et heure de comptage"
    ]


# GENERIC FUNCTION TO RETRIEVE DATA -------------------------


def download_unzip_and_read(url, extract_to='.', list_useful_columns=list_useful_columns):
    """
    Downloads a zip file from the specified URL, extracts its contents, and reads the CSV file based on the filename pattern in the URL.

    Parameters:
    - url (str): The URL of the zip file to download.
    - extract_to (str): The directory where the contents of the zip file should be extracted.

    Returns:
    - df (DataFrame): The loaded pandas DataFrame from the extracted CSV file.
    """
    try:
        # Extract the file pattern from the URL (last path component, without the '_zip' suffix)
        file_pattern = url.rstrip('/').split('/')[-1].replace('_zip', '')


        # Send a GET request to the specified URL to download the file
        response = requests.get(url)
        response.raise_for_status()  # Ensure we get a successful response

        # Create a ZipFile object from the downloaded bytes
        with zipfile.ZipFile(io.BytesIO(response.content)) as z:
            # Extract all the contents to the specified directory
            z.extractall(path=extract_to)
            print(f"Extracted all files to {os.path.abspath(extract_to)}")

        dir_extract_to = Path(extract_to)
        #dir_extract_to = Path(f"./{file_pattern}/")

        # Look for the file matching the pattern
        csv_filename = [
            f.name for f in dir_extract_to.iterdir() if f.suffix == '.csv'
        ]

        if not csv_filename:
            print(f"No file matching pattern '{file_pattern}' found.")
            return None

        # Read the first matching CSV file into a pandas DataFrame
        csv_path = dir_extract_to / csv_filename[0]
        print(f"Reading file: {csv_path}")
        df = pd.read_csv(csv_path, sep=";")

        # CONVERT TO GEOPANDAS
        df[['latitude', 'longitude']] = df['Coordonnées géographiques'].str.split(',', expand=True)
        df['latitude'] = pd.to_numeric(df['latitude'])
        df['longitude'] = pd.to_numeric(df['longitude'])
        gdf = gpd.GeoDataFrame(
            df, geometry=gpd.points_from_xy(df.longitude, df.latitude)
        )

        # CONVERT TO TIMESTAMP
        df["Date et heure de comptage"] = (
            df["Date et heure de comptage"]
            .astype(str)
            .str.replace(r'\+.*', '', regex=True)
        )
        df["Date et heure de comptage"] = pd.to_datetime(
            df["Date et heure de comptage"],
            format="%Y-%m-%dT%H:%M:%S",
            errors="coerce"
        )
        gdf = df.loc[
            :, list_useful_columns
        ]
        return gdf

    except requests.exceptions.RequestException as e:
        print(f"Error: The file could not be downloaded: {e}")
        return None
    except zipfile.BadZipFile as e:
        print(f"Error: The downloaded file is not a valid zip file: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def read_historical_bike_data(year):
    dataset = "comptage_velo_donnees_compteurs"
    url_comptage = f"https://opendata.paris.fr/api/datasets/1.0/comptage-velo-historique-donnees-compteurs/attachments/{year}_{dataset}_csv_zip/"
    df_comptage = download_unzip_and_read(
        url_comptage, extract_to=f'./extracted_files_{year}'
    )
    if (df_comptage is None):
        url_comptage_alternative = url_comptage.replace("_csv_zip", "_zip")
        df_comptage = download_unzip_and_read(url_comptage_alternative, extract_to=f'./extracted_files_{year}')
    return df_comptage


# IMPORT HISTORICAL DATA -----------------------------

historical_bike_data = pd.concat(
    [read_historical_bike_data(year) for year in range(2018, 2024)]
)

rename_columns_dict = {
    "Identifiant du compteur": "id_compteur",
    "Nom du compteur": "nom_compteur",
    "Identifiant du site de comptage": "id",
    "Nom du site de comptage": "nom_site",
    "Comptage horaire": "sum_counts",
    "Date et heure de comptage": "date"
}


historical_bike_data = historical_bike_data.rename(
    columns=rename_columns_dict
)


# IMPORT LATEST MONTHS ----------------

import os
import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = 'https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis'
filename = 'comptage_velo_donnees_compteurs.parquet'


# DOWNLOAD FILE --------------------------------

# Download the file only if it is not already present locally

if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)

    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))

        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")


# READ FILE AND CONVERT TO PANDAS
query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

df.head(3)


# PUT THEM TOGETHER ----------------------------

historical_bike_data['date'] = (
    historical_bike_data['date']
    .dt.tz_localize(None)
)

df["date"] = df["date"].dt.tz_localize(None)

historical_bike_data = (
    historical_bike_data
    .loc[historical_bike_data["date"] < df["date"].min()]
)

df = pd.concat(
    [historical_bike_data, df]
)

To begin, let us reproduce the third figure, which is, once again, a *barplot*. Here, from a semiological perspective, it is not justified to use a *barplot*; a simple time series would suffice to provide similar information.

The first question in the next exercise involves an initial encounter with temporal data through a fairly common time series operation: changing the format of a date to allow aggregation at a broader time step.

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 5: Monthly Counts Barplot</h3>

1.  Create a `month` variable whose format follows, for example, the `2019-08` scheme using the correct option of the `dt.to_period` method.

2.  Apply the previous tips to gradually build and improve a graph until you obtain a figure similar to the third chart on the Parisian *open data* page. Create this figure first using data from early 2022 onward, and then over the entire period covered by our data.

3.  Optional Question: Represent the same information in the form of a *lollipop*.

</div>
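As a hint for the first question, here is a minimal sketch of monthly aggregation with `dt.to_period` on made-up timestamps. The `"M"` frequency and the use of the mean are assumptions for illustration; the exercise itself may call for a different aggregate.

``` python
import pandas as pd

# Toy hourly counts (made-up values)
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-08-01 10:00", "2019-08-15 18:00", "2019-09-02 08:00"]
    ),
    "sum_counts": [10, 20, 40],
})

# Convert each timestamp to the month it belongs to (displayed as 2019-08)
toy["month"] = toy["date"].dt.to_period("M")

# Aggregate at the monthly level
monthly = toy.groupby("month")["sum_counts"].mean()
```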

The figure with data from early 2022 will look like this if it was created with `plotnine`:

With `seaborn`, it will look more like this:

If you prefer to represent this as a *lollipop*:

Finally, over the entire period, the series will look more like this:

# 6. First Time Series

It is more common to represent data with a temporal dimension as a series rather than stacked bars.

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 6: Time Series of Daily Counts</h3>

1.  Create a `day` variable that converts the timestamp into a daily format like `2021-05-01` using `dt.day`.
2.  Reproduce the figure from the *open data* page.

</div>
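When building a daily variable, it helps to keep in mind that the `.dt` accessor exposes several similar-looking attributes. The sketch below uses three made-up timestamps to contrast the day of the month with the full calendar day. The variable names are hypothetical.

``` python
import pandas as pd

ts = pd.Series(
    pd.to_datetime(["2021-05-01 08:00", "2021-05-01 17:00", "2021-06-01 09:00"])
)

# dt.day keeps only the day of the month, so May 1st and June 1st collapse together
day_of_month = ts.dt.day

# dt.floor("D") truncates to midnight and keeps year, month and day
calendar_day = ts.dt.floor("D")

print(day_of_month.nunique(), calendar_day.nunique())
```

Grouping on `calendar_day` therefore yields one point per date, which is what a daily time series needs.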

# 7. Reactive Charts with `Javascript` Libraries

## 7.1 The Ecosystem Available from `Python`

Static figures created with `matplotlib` or `plotnine` are fixed and thus have the disadvantage of not allowing interaction with the viewer. All the information must be contained in the figure, which can make it difficult to read. If the figure is well-made with multiple levels of information, it can still work well.

However, thanks to *web* technologies, it is simpler to offer visualizations with multiple levels. A first level of information, the quick glance, may be enough to grasp the main messages of the visualization. Readers who then deliberately look for secondary information can dig deeper. Reactive visualizations, now the standard in the *dataviz* world, allow for this approach: the viewer can hover over the visualization to find additional information (e.g., exact values) or click to display complementary details.

These visualizations rely on the same triptych as the entire *web* ecosystem: `HTML`, `CSS`, and `JavaScript`. `Python` users will not directly manipulate these languages, which require a certain level of expertise. Instead, they use libraries that automatically generate all the necessary `HTML`, `CSS`, and `JavaScript` code to create the figure.

Several `Javascript` ecosystems are made available to developers through `Python`. The two main libraries are [`Plotly`](https://plotly.com/python/), associated with the `Javascript` ecosystem of the same name, and [`Altair`](https://altair-viz.github.io/), associated with the `Vega` and `Vega-Lite` ecosystems in `Javascript`[1]. To allow Python users to explore the emerging `Javascript` library [`Observable Plot`](https://observablehq.com/plot/), French research engineer Julien Barnier developed [`pyobsplot`](https://juba.github.io/pyobsplot/), a `Python` library that makes this ecosystem usable from `Python`.

Interactivity should not be a gimmick that adds nothing to readability, or even worsens it. A figure produced with default settings rarely works as is; it usually needs further work to be effective.

### 7.1.1 The `Plotly` Library

The `Plotly` package is a wrapper for the `Javascript` library `Plotly.js`. It lets you create and manipulate graphical objects flexibly to produce interactive figures without writing any `Javascript`.

The recommended entry point is the `plotly.express` module ([documentation here](https://plotly.com/python/plotly-express/)), which provides an intuitive approach for creating charts that can be modified *post hoc* if needed (e.g., to customize axes).

<div class="alert alert-info" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-comment"></i> Displaying Figures Created with Plotly</h3>

In a standard `Jupyter` notebook, the following lines of code allow the output of a `Plotly` command to be displayed under a code block:

For `JupyterLab`, the `jupyterlab-plotly` extension is required:

``` python
!jupyter labextension install jupyterlab-plotly
```

</div>

## 7.2 Replicating the Previous Example with `Plotly`

The following modules will be required to create charts with `plotly`:

``` python
import plotly
import plotly.express as px
```

[1] The names of these libraries are inspired by the Summer Triangle, an asterism of which the stars Vega and Altair are two vertices.

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-pencil"></i> Exercise 7: A Barplot with Plotly</h3>

The goal is to recreate the first red bar chart using `Plotly`.

1.  Create the chart using the appropriate function from `plotly.express` and…
    -   Do not use the default theme but one with a white background to achieve a result similar to that on the *open-data* site.
    -   Use the `color_discrete_sequence` argument for the red color.
    -   Remember to label the axes.
    -   Consider the text color for the lower axis.
2.  Try another theme with a dark background. For colors, group the three highest values together and separate the others.

</div>

## 7.3 The `altair` Library

For this example, we will recreate our previous figure.

Like `ggplot`/`plotnine`, `Vega` is a graphics ecosystem designed to implement the grammar of graphics from Wilkinson (2012). The syntax of `Vega` is therefore based on a declarative principle: a construction is declared through layers and progressive data transformations.

Originally, `Vega` was based on a JSON syntax, hence its strong connection to `Javascript`. However, there is a Python API, `altair`, that allows these interactive figures to be created natively in Python. To understand how an `altair` chart is built, here is how to replicate the previous figure:

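``` python
import altair as alt

# Map the boolean `top` flag to colors: green for True, red for False
color_scale = alt.Scale(domain=[True, False], range=['green', 'red'])

fig2 = (
    alt.Chart(df1)
    .mark_bar()  # geometry layer: bars
    .encode(
        # Bar length: average of sum_counts, computed by Vega-Lite
        x=alt.X('average(sum_counts):Q', title='Average count per hour over the selected period'),
        # One bar per counter, sorted by decreasing x value
        y=alt.Y('nom_compteur:N', sort='-x', title=''),
        color=alt.Color('top:N', scale=color_scale, legend=alt.Legend(title="Top")),
        # Information displayed when hovering over a bar
        tooltip=[
            alt.Tooltip('nom_compteur:N', title='Counter Name'),
            alt.Tooltip('sum_counts:Q', title='Hourly Average')
            ]
    ).properties(
        title='The 10 counters with the highest hourly average'
    ).configure_view(
        strokeOpacity=0  # hide the border around the plotting area
    )
)

# Enable zooming and panning
fig2.interactive()
```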

Field, A. 2012. *Discovering Statistics Using R*. Sage.

DeBruine, Lisa. 2021. “psyTeachR Book Template.” <https://github.com/psyteachr/template/>.

Wilkinson, Leland. 2012. *The Grammar of Graphics*. Springer.