<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/notebooks/a2r2-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

# Notebook __TWO__: The case for privacy

## Preamble

Yes, raw data is not immune to re-identification! 

You are now going to perform a re-identification attack on a small set of targets. To this end, we give you some auxiliary information (also called background knowledge) and programming tools to help you query the dataset.
1. You can display the bus validations dataset [here](#displayvalid). Feel free to play with the filter menu, although the number of rows shown is limited.
2. You can attack the dataset in [Step 1](#attack) (do not be afraid to try!).
3. To better understand your attacks and/or design other attacks, you can display informative measures about the _identifying power_ of the attributes of the dataset ([Step 2](#explain)).


### Download dataset


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet

### Import required modules

In [None]:
import copy
import importlib
import importlib.util
import os
from errno import ENOENT
from pathlib import Path
from typing import Optional, Sequence, Tuple, Union

import folium
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import pyarrow.parquet as pq
from folium.plugins import HeatMapWithTime
from IPython import display, get_ipython
from pandas import NA, DataFrame, Series, Timestamp
from plotly.graph_objs import Figure, Scatter

### Setup notebook constants and running environment

In [None]:
# project base directory
BASE_DIRECTORY = Path(".")

# detect running environment
COLAB_ON = "google.colab" in str(get_ipython())

In [None]:
# Set Plotly renderer
if COLAB_ON:
    pio.renderers.default = "colab"

### Load and display raw dataset

In [None]:
# load dataset from file system
def load_data(
    path: Path,
) -> DataFrame:
    if not path.exists():
        raise FileNotFoundError(ENOENT, os.strerror(ENOENT), path)

    table = pq.read_table(path)
    return table.to_pandas()


# show a dataframe as a table
def display_dataframe(
        dataframe: DataFrame,
) -> None:    
    if COLAB_ON:
        spec = importlib.util.find_spec("google.colab")
        if spec:            
            data_table = importlib.import_module("google.colab.data_table")            
            enable_dataframe_formatter = getattr(
                data_table, 
                "enable_dataframe_formatter",
            )            
            
            enable_dataframe_formatter()            
           
    display.display(dataframe[:20000] if COLAB_ON else dataframe) 

#### Show raw dataset

<a id="displayvalid"></a>

In [None]:
path = BASE_DIRECTORY.joinpath("buses.parquet")
buses_dataset = load_data(path)
display_dataframe(buses_dataset)

####################
# BEGIN : Observe

In [None]:
# show dataset on a map
def plot_heatmap(
    dataframe: DataFrame,
    group_column: str = "departure_time",
    # Rennes GPS coordinates
    location: Tuple[float, float] = (48.1147, -1.6794),
) -> None:
    _dataframe = dataframe.copy(deep=True)
    timestamps = []
    coordinates = []
    for timestamp, coordinate in _dataframe.groupby(group_column):
        timestamps.append(str(timestamp))
        coordinates.append(
            coordinate[
                [
                    "stop_lat",
                    "stop_lon",
                ]
            ].values.tolist()
        )

    base_map = folium.Map(
        location=location,
        zoom_start=11,
        tiles="https://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}{r}.png",
        # tiles="https://{s}.basemaps.cartocdn.com/dark_nolabels/{z}/{x}/{y}{r}.png",
        attr="CartoDB",
    )

    heat_map = HeatMapWithTime(
        data=coordinates,
        index=timestamps,
        auto_play=True,
        min_speed=1,
        radius=4,
        max_opacity=0.5,
    )

    heat_map.add_to(base_map)
    display.display(base_map)

In [None]:
# Showing the heat map of validations only works on a local server
if not COLAB_ON:
    plot_heatmap(buses_dataset)

In [None]:
# END : Observe
####################

## Step 1: Attack raw buses validations
<a id="attack"></a>

Re-identification attacks are conceptually simple. They consist in selecting the subset of individuals whose records match the auxiliary information that the attacker has about them. If a single individual matches the adversarial knowledge, the success of the attack is clear (assuming that the adversarial knowledge is reliable). But when more than one individual matches the adversarial knowledge, is it a failure?

By "looking" at the dataset (see above), could you imagine stronger auxiliary information?
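As a minimal sketch of this selection step, the toy table below is filtered with a boolean mask that encodes what an attacker knows. The rows are invented for illustration and do not come from the buses dataset; the names `toy`, `matches` and `candidates` are assumptions of this sketch.

```python
import pandas as pd

# toy validations table: invented rows, for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 3],
        "stop_name": ["A", "B", "A", "A", "C"],
        "direction_id": [0, 1, 1, 0, 0],
    }
)

# auxiliary information: "the target boards at stop A in direction 0"
matches = toy[(toy["stop_name"] == "A") & (toy["direction_id"] == 0)]

# distinct individuals compatible with the auxiliary information
candidates = sorted(matches["id"].unique().tolist())
print(candidates)
```

Here two individuals remain compatible with the auxiliary information, so the target is not singled out, yet the attacker has already ruled out one of the three users. One more piece of knowledge, such as another stop used by the target, may be enough to separate the remaining candidates.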

In [None]:
# drop geospatial attributes from dataset
def tidy_dataframe(
    dataframe: DataFrame,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    return dataframe_[
        [
            "departure_time",
            "id",
            "stop_name",
            "route_short_name",
            "stop_id",
            "direction_id",
        ]
    ]

# query the dataset by attribute and value
def query(
    dataframe: DataFrame,
    name: str,
    value: Union[str, int, float, Sequence[str]],
) -> DataFrame:
    return (
        dataframe.query(f"{name} == {value}")
        if isinstance(value, (int, float))
        else dataframe.query(f'''{name} == "{value}"''')
        if isinstance(value, str)
        else dataframe.query(f"{name} in {value}")
    )


# filter dataset between two timestamps
def between(
    dataframe: DataFrame,
    start: Union[str, Timestamp],
    end: Union[str, Timestamp],
    complement: bool = False,
) -> DataFrame:
    start_ = Timestamp(start) if not isinstance(start, Timestamp) else start
    end_ = Timestamp(end) if not isinstance(end, Timestamp) else end
    return (
        (dataframe.set_index("departure_time").loc[start_:end_].reset_index())
        if not complement
        else (
            dataframe.loc[
                (dataframe["departure_time"] < start_)
                | (dataframe["departure_time"] > end_)
            ]
        )
    )


# intersect two datasets with a common attribute ('on')
def intersect(
    right: DataFrame,
    left: DataFrame,
    on: Optional[Sequence[str]] = None,
    how: str = "inner",
) -> DataFrame:
    # join on all columns of the first dataset when 'on' is not provided
    on_ = on if on else right.columns.values.tolist()
    return pd.merge(
        right,
        left,
        how=how,
        on=on_,
    )


# get distinct rows from a dataset grouping by a 'subset'
def distinct(
    dataframe: DataFrame,
    subset: Union[str, Sequence[str]],
) -> DataFrame:
    return dataframe.drop_duplicates(subset=subset)


# count rows by name and value
def count_by(
    dataframe: DataFrame,
    name: str,
    value: Union[str, int, float],
    *,
    frequency: str = "15min",
) -> DataFrame:
    dataframe_ = (
        dataframe[dataframe[name] == value]
        .set_index("departure_time")
        .groupby(
            [
                pd.Grouper(level="departure_time", freq=frequency),
            ]
        )
        .count()
    )

    # #domain = pd.date_range(start=dataframe_.index.min(), end=dataframe_.index.max(), freq="15T")
    # #dataframe_ = dataframe_.reindex(domain, method=None, fill_value=NA)
    # #dataframe_.replace(0, np.NAN, inplace=True)
    # #display_dataframe(dataframe_)
    return dataframe_[dataframe_.columns[0]].to_frame(name="count")


# show a timeseries graph of a selected attribute
def plot_dataset(
    dataframe: DataFrame,
    column: str,
) -> None:
    figure = Figure()
    scatter = Scatter(
        x=dataframe.index,
        y=dataframe[column],
        mode="lines",
        name="values",
        connectgaps=False,
    )

    figure.add_trace(scatter)
    figure.update_layout(
        showlegend=False,
        title_text=column,
        template="simple_white",
    )

    figure.update_xaxes(showgrid=True)
    figure.show()

### Example of a re-identification attack

Somebody said:

> "*I often take the bus in the morning to go to Beaulieu from the 'Anne de Bretagne' in Cesson* "

Is this information enough to discover the mobility patterns of that person?

Below is a short summary of the helper functions implemented to perform the attack. Refer to the example below for their use (or, if you feel comfortable, use the **Pandas** API directly):

- `query`:  perform a query on the dataset by attribute name and value
- `between`:  filter dataset between two timestamps
- `intersect`: intersect two datasets with a common attribute (the 'on' attribute)
- `distinct`: get distinct rows from a dataset grouping by a 'subset'
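For readers who prefer the plain Pandas API, the following sketch shows one possible counterpart for each of the four helpers. It runs on an invented toy table (not the buses dataset), and the variable names are illustrative assumptions.

```python
import pandas as pd

# toy validations table: invented rows, for illustration only
toy = pd.DataFrame(
    {
        "departure_time": pd.to_datetime(
            [
                "2021-08-02 08:00",
                "2021-08-02 18:00",
                "2021-09-06 08:00",
                "2021-08-03 08:15",
            ]
        ),
        "id": [7, 7, 7, 8],
        "stop_name": ["A", "B", "A", "A"],
    }
)

# query: rows whose attribute equals a value
at_stop_a = toy.query('stop_name == "A"')

# between: rows whose timestamp lies within a closed interval
start, end = pd.Timestamp("2021-08-01"), pd.Timestamp("2021-09-01")
in_august = toy[toy["departure_time"].between(start, end)]

# intersect: join the two results on 'id'
# (one output row per matching pair of rows)
both = pd.merge(at_stop_a, in_august, on=["id"], how="inner")

# distinct: keep one row per individual
users = both.drop_duplicates(subset=["id"])
print(users["id"].tolist())
```

Because the join pairs every row of one result with every row of the other that shares the same `id`, the merged table can contain many rows per user; dropping duplicates on `id` afterwards reduces it to one row per individual.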

In [None]:
# remove geo-spatial information from the dataset
dataset = tidy_dataframe(buses_dataset)

# show the dataset
print("Initial dataset")
display_dataframe(dataset)

# query: "I take the bus from the bus stop 'Anne de Bretagne'"
q_1 = query(dataset, "stop_name", "Anne de Bretagne")

# query: "I take the bus going to Beaulieu (city center)"
q_2 = query(dataset, "direction_id", 0)

# intersect results of 'q_1' and 'q_2'
q_3 = intersect(q_1, q_2, on=["id"])

# show results of the intersection stored in 'q_3'
print("Result of the intersection of queries 1 & 2")
display_dataframe(q_3)

# check how many different users are in query 'q_3'
q_4 = distinct(q_3, ["id"])

# show results of query 'q_4'
# => since there is only one row, we found the user!
print("Result of checking different `id` in previous result")
display_dataframe(q_4)

# query: all travels of the user ('id') of query 'q_4'
q_5 = query(dataset, "id", 175)

# show results of query 'q_5'
print("Complete dataset of the user with `id` 175")
display_dataframe(q_5)

# get the travel counts of the user ('id') found in query 'q_4' on a timeline
q_6 = count_by(dataset, "id", 175)

# plot results of query 'q_6'
plot_dataset(q_6, "count")

# for the curious:
# an all-in-one 'plain vanilla' pandas version follows
# (results are not printed on screen)
target = dataset.query(
    "stop_name == 'Anne de Bretagne' & direction_id == 0"
).drop_duplicates(
    subset=[
        "id",
        "stop_name",
    ],
)

### Food for thought

Below is the auxiliary information that you have on different targets. Can you re-identify them based on the available dataset?

```
####################
# BEGIN : Answer
```

> - Target 1: *When I go to work using public transportation, I always take the bus going to the lycée Assomption, from the beginning of the line*.
> - Target 2: *I usually take the bus from 'Saint-Sulpice' but during the holidays I stayed at my parents' home and I took the bus '217' a couple of times to go to the campus*.
> - Target 3: *I take any bus from the RU Étoile to downtown because I live next to the 'Cimetière de l'Est' and I do not mind walking*.

```
# END : Answer
####################
```

Do not forget to visit the website of the [STAR](https://www.star.fr/accueil). In particular, check the [page](https://m.star.fr/) showing the buses serving a specific bus stop, and the [page](https://www.star.fr/accueil?tx_pnfstarod_searchdocument[action]=search&tx_pnfstarod_searchdocument[controller]=SearchLines) showing the map/schedule of the bus lines.


In [None]:
####################
# BEGIN : Code

In [None]:
# Target 1
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

In [None]:
# Target 2
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To use 'between' set the start and end dates as strings:
#       result = between(dataset, "2021-08-01", "2021-08-31")


In [None]:
# Target 3
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To test several values of an attribute at once, provide a list to query:
#       values = ["Tournebride", "Le Mail", "Maison d'Accueil"]
#       result = query(dataset, "stop_name", values)

In [None]:
# END : Code
####################

## Step 2: Explain the success of the attacks

<a id="explain"></a>

The success of a re-identification attack depends on the identifying power of the attributes used for the attack. You can display two measures below: the distribution of the [cardinalities of the anonymity sets](#aset), which indicates how distinguishable individuals are on a given set of attributes, and the [Shannon entropy](#shannon), which quantifies the amount of information carried by each attribute. See the examples below and then play with anonymity sets by changing the set of attributes on which they are computed.

### Anonymity Sets
<a id="aset"></a>

Displaying the cardinalities of the anonymity sets informs about the _re-identifiability_ of the individuals in the dataset. An anonymity set groups the individuals that share the same values on the selected attributes: anonymity sets with a cardinality equal to 1 contain a single individual, those with a cardinality equal to 2 contain two individuals, etc. Selecting the attributes on which you want to compute the anonymity sets and displaying the resulting cardinalities can thus help you explain the success of your attack. An attacker could also tune the attack by using the most identifying attributes.
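To make the notion concrete, here is a small sketch that computes the distribution of cardinalities on a toy table with one row per individual. The values (including the route names) are invented, and the approach shown, a `groupby` followed by `value_counts`, is only one possible way to obtain it.

```python
import pandas as pd

# toy table: one row per individual, invented values
people = pd.DataFrame(
    {
        "stop_name": ["A", "A", "A", "B", "B", "C"],
        "route_short_name": ["C4", "C4", "67", "C4", "C4", "67"],
    }
)

# size of each anonymity set: rows sharing the same attribute values
sizes = people.groupby(["stop_name", "route_short_name"]).size()

# distribution of cardinalities: how many sets have size 1, 2, ...
distribution = sizes.value_counts().sort_index()
print(distribution.to_dict())
```

In this toy table, two anonymity sets have a cardinality of 1, so two of the six individuals are unique on the pair (`stop_name`, `route_short_name`) and would be re-identified by an attacker who knows both values.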

#### Food for thought


```
####################
# BEGIN : Answer
```

> - Which set of attributes is the most identifying? Can you find it efficiently?
> - Would your attacks have been more successful with other/additional information?

```
# END : Answer
####################
```

Given the bus validations dataset, two kinds of anonymity sets can be computed:

1. Anonymity set of validations (rows of the dataset)
2. Anonymity set of different users (distinct user identifiers across rows)

Let's see some [examples](#aset_examples); you can then choose [below](#asetplay) the attributes on which you compute the anonymity sets.
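The difference between the two kinds of anonymity sets can be seen on a toy example before moving to the real data. The sketch below uses invented rows in which one user validates several times at the same stop; counting rows and counting distinct `id` values then give different set sizes.

```python
import pandas as pd

# toy validations: user 1 validates three times at stop A,
# user 2 once at stop A, user 3 once at stop B (invented rows)
toy = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 3],
        "stop_name": ["A", "A", "A", "A", "B"],
    }
)

# anonymity sets of validations: count rows per stop
validations_per_stop = toy.groupby("stop_name").size()

# anonymity sets of users: count distinct identifiers per stop
users_per_stop = toy.groupby("stop_name")["id"].nunique()

print(validations_per_stop.to_dict())
print(users_per_stop.to_dict())
```

Stop A gathers four validations but only two users. From a privacy point of view the second figure is the relevant one, since a frequent traveller can inflate the first without making anyone harder to single out.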

In [None]:
# compute the anonymity set of a 'formatted' dataframe
def get_anonymity_set(
    dataframe: DataFrame,
    *,
    subset: Optional[Sequence[str]] = None,
    distinct: Optional[str] = None,
    reindex: bool = False,
) -> DataFrame:
    
    # reset index by including zeroes values
    def reset_index(serie: Series) -> Series:
        domain = range(1, serie.index.max() + 1)
        return serie.reindex(domain, fill_value=0)

    # keep one row per distinct value of the given attribute (and subset)
    def get_distinct(
        dataframe: DataFrame,
        distinct: Optional[str] = None,
        subset: Optional[Sequence[str]] = None,
    ) -> DataFrame:
        dataframe_ = dataframe.copy()
        if distinct:
            subset_ = copy.deepcopy(subset)
            if subset_:
                if distinct not in subset_:
                    subset_.append(distinct)
            else:
                subset_ = [distinct]
            dataframe_.drop_duplicates(inplace=True, subset=subset_)

        return dataframe_

    subset = None if not subset else subset
    dataframe_ = get_distinct(dataframe, distinct, subset) if distinct else dataframe.copy()
    multiplicity = dataframe_.value_counts(subset=subset)
    aset = multiplicity.value_counts().sort_index()
    aset = reset_index(aset) if reindex else aset
    # name the index and the counts explicitly so the result does not
    # depend on the default labels, which differ between pandas versions
    return aset.rename_axis("cardinality").reset_index(name="occurrences")


# show the anonymity set of a dataframe as a barplot
def plot_anonymity_set(
    dataframe: DataFrame,
) -> None:
    figure = px.bar(
        dataframe,
        x="cardinality",
        y="occurrences",
        color="occurrences",
        color_continuous_scale="Bluered",
        # template="plotly_white",
        title="Anonymity Set",
    )

    figure.update_coloraxes(showscale=False)
    figure.show()

<a id="aset_examples"></a>

#### Examples of anonymity sets
We now look in detail at the anonymity sets of some individual attributes and of some groups (subsets) of them:

1. Anonymity set for all attributes [[link]](#aset_e1)
2. Anonymity set of the '`id`' attribute [[link]](#aset_e2)
3. Anonymity set of the '`stop_name`' attribute [[link]](#aset_e3)
4. Anonymity set of the '`route_short_name`' and '`direction_id`' attributes [[link]](#aset_e4)
5. Anonymity set of the '`departure_time`' attribute [[link]](#aset_e5)


<a id="aset_e1"></a>

1. Anonymity set for all attributes

- **Anonymity set of validations for all attributes of the dataset**

  This represents the number of different validations (count of rows) in the whole dataset.


In [None]:
# get a simplified view of the dataset
dataset = tidy_dataframe(buses_dataset)

# get anonymity set of validations for all attributes
anonymity_set = get_anonymity_set(dataset)

print(f"Anonymity set of validations for all attributes")
plot_anonymity_set(anonymity_set)

uniques = dataset.drop_duplicates()
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display_dataframe(uniques)


- **Anonymity set of different users for all attributes of the dataset**

  This represents the number of different users in the dataset (unique identifiers).

In [None]:
# get anonymity set of different users for all attributes
anonymity_set = get_anonymity_set(dataset, distinct="id")

print(f"Anonymity set of different users for all attributes")
plot_anonymity_set(anonymity_set)

uniques = dataset.drop_duplicates("id")
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display_dataframe(uniques)

<a id="aset_e2"></a>

2. Anonymity set of the '`id`' attribute

- **Anonymity set of validations for the subset `['id']`**

  This represents the number of validations (count of rows) for the same unique identifier.


In [None]:
dataset = tidy_dataframe(buses_dataset)

SUBSET = ["id"]

anonymity_set = get_anonymity_set(dataset, subset=SUBSET)
print(f"Anonymity set of validations for {SUBSET=}")
plot_anonymity_set(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"stop_id": "count"})
    .rename({"stop_id": "count"}, axis=1)
    .sort_values(by="count")
    .reset_index()
)

result = dataset[dataset["id"] == rows["id"][0]] 
print(f"Occurrences of the FIRST cardinality: {result.shape[0]}")
display_dataframe(result)

uniques = result.drop_duplicates(subset=SUBSET)
print(f"Cardinality of the previous occurrence (unique rows with the subset): {uniques.shape[0]}")
display_dataframe(uniques)

- **Anonymity set of different users for the subset `['id']`**

  This represents the number of different users in the dataset as well!

In [None]:
anonymity_set = get_anonymity_set(dataset, distinct="id", subset=SUBSET)
print(f"Anonymity set of different users for {SUBSET=}")
plot_anonymity_set(anonymity_set)

uniques = dataset.drop_duplicates(subset=SUBSET)
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display_dataframe(uniques)

<a id="aset_e3"></a>

3. Anonymity set of the '`stop_name`' attribute

- **Anonymity set of validations for the subset `['stop_name']`**

In [None]:
dataset = tidy_dataframe(buses_dataset)

SUBSET = ["stop_name"]

anonymity_set = get_anonymity_set(dataset, subset=SUBSET)
print(f"Anonymity set of validations for {SUBSET=}")
plot_anonymity_set(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"stop_id": "count"})
    .rename({"stop_id": "count"}, axis=1)
    .sort_values(by="count")
    .reset_index()
)

result = dataset[dataset["stop_name"] == rows["stop_name"][0]] 
print(f"Occurrences of the FIRST cardinality: {result.shape[0]}")
display_dataframe(result)

uniques = result.drop_duplicates(subset=SUBSET)
print(f"Cardinality of the previous occurrence (unique rows with the subset): {uniques.shape[0]}")
display_dataframe(uniques)


- **Anonymity set of different users for the subset `['stop_name']`**

In [None]:
anonymity_set = get_anonymity_set(dataset, distinct="id", subset=SUBSET)
print(f"Anonymity set of different users for {SUBSET=}")
plot_anonymity_set(anonymity_set)
rows = (
    dataset.drop_duplicates(subset=SUBSET + ["id"])
    .groupby(SUBSET + ["id"])    
    .agg({"stop_id": "count"})
    .rename({"stop_id": "count"}, axis=1)    
    .groupby(SUBSET)
    .count()  
    .sort_values(by="count")
    .reset_index() 
)

# def flat(lista):
#     return set(item for sublist in lista for item in sublist)

# groups = (
#     dataset.drop_duplicates(subset=SUBSET + ["id"])
#     .groupby(SUBSET + ["id"])
#     .aggregate(lambda x: list(x))
#     .groupby(SUBSET)
#     .aggregate(lambda x: flat(x))
# )

#display_dataframe(groups)
    
cardinality = rows[rows["count"] == rows["count"][0]]
print(f"Occurrences of the FIRST cardinality: {cardinality.shape[0]}")
display_dataframe(cardinality)

# get first element's data of the cardinality
result = dataset[dataset["stop_name"] == cardinality["stop_name"][0]] 
print(f"Dataset of the FIRST occurrence")
display_dataframe(result)

uniques = result.drop_duplicates(subset=SUBSET+ ["id"])
print(f"Cardinality of the previous dataset (unique rows with the subset): {uniques.shape[0]}")
display_dataframe(uniques)

<a id="aset_e4"></a>


4. Anonymity set of the '`route_short_name`' and '`direction_id`' attributes

In [None]:
dataset = tidy_dataframe(buses_dataset)
SUBSET = [
    "route_short_name",
    "direction_id",
]

### ANONYMITY SET OF VALIDATIONS
anonymity_set = get_anonymity_set(dataset, subset=SUBSET)
plot_anonymity_set(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"stop_id": "count"})
    .rename({"stop_id": "count"}, axis=1)
    .sort_values(by="count")
    .reset_index()
)

result = dataset[
    (dataset["route_short_name"] == rows["route_short_name"][0])
    & (dataset["direction_id"] == rows["direction_id"][0])
]

display_dataframe(result)

### ANONYMITY SET OF USERS
anonymity_set = get_anonymity_set(dataset, distinct="id", subset=SUBSET)
plot_anonymity_set(anonymity_set)
rows = (
    dataset.drop_duplicates(subset=SUBSET + ["id"])
    .groupby(SUBSET + ["id"])    
    .agg({"stop_id": "count"})
    .rename({"stop_id": "count"}, axis=1)    
    .groupby(SUBSET)
    .count()  
    .sort_values(by="count")
    .reset_index() 
)

# get first cardinality 
cardinality = rows[rows["count"] == rows["count"][0]]
display_dataframe(cardinality)

# get first element's data of the cardinality
result = dataset[
    (dataset["route_short_name"] == cardinality["route_short_name"][0])
    & (dataset["direction_id"] == cardinality["direction_id"][0])
]

# check that the result of the query corresponds to the cardinality
display_dataframe(result.drop_duplicates(subset=SUBSET + ["id"]))

<a id="aset_e5"></a>

5. Anonymity set of the '`departure_time`' attribute

In [None]:
dataset = tidy_dataframe(buses_dataset)
SUBSET = [            
    "departure_time",    
]
anonymity_set = get_anonymity_set(dataset, subset=SUBSET)
plot_anonymity_set(anonymity_set)

anonymity_set = get_anonymity_set(dataset, distinct="id", subset=SUBSET)
plot_anonymity_set(anonymity_set)

# Question: why are they equal? ;)

<a id="asetplay"></a>

In [None]:
# (un)comment lines starting with a hash ('#') to change the subset

####################
# BEGIN : Play
SUBSET = [
    #"departure_time",
    #"id",
    #"stop_name",
    #"route_short_name",
    #"stop_id",	
    #"direction_id",
]
# END : Play
####################

# get a simplified view of the dataset
dataset = tidy_dataframe(buses_dataset)

### ANONYMITY SET OF VALIDATIONS
anonymity_set = get_anonymity_set(dataset, subset=SUBSET)
plot_anonymity_set(anonymity_set)

### ANONYMITY SET OF USERS
anonymity_set = get_anonymity_set(dataset, distinct="id", subset=SUBSET)
plot_anonymity_set(anonymity_set)

### Shannon's entropy
<a id="shannon"></a>

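The Shannon entropy of an attribute measures how unpredictable its values are: $H(X) = -\sum_x p(x) \log_b p(x)$, where $p(x)$ is the fraction of rows that take the value $x$ and $b$ is the base of the logarithm (2 here, so the entropy is expressed in bits). An attribute with the same value on every row has an entropy of 0, while an attribute whose values are all different reaches the maximum, $\log_b n$ for $n$ rows. The higher the entropy, the more an attribute helps to tell individuals apart. With `normalize=True`, the code of this section divides the entropy by this maximum, which gives a value between 0 and 1.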

Food for thought:
- Which attributes give the most information?
- Would your attacks have been more successful with other/additional information?
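To build some intuition before looking at the real attributes, the sketch below computes the entropy of two extreme toy columns. It is a simplified stand-alone version written for illustration (the function name `shannon_entropy` and the toy series are assumptions), not the notebook's own `entropy` function.

```python
import numpy as np
import pandas as pd


def shannon_entropy(values: pd.Series, base: float = 2.0) -> float:
    # relative frequency of each distinct value
    p = values.value_counts(normalize=True).to_numpy()
    # H = -sum(p * log_base(p))
    return float(-(p * np.log(p) / np.log(base)).sum())


# every row carries the same value: knowing it tells nothing apart
constant = pd.Series(["A", "A", "A", "A"])

# every row carries a different value: knowing it singles out a row
balanced = pd.Series(["A", "B", "C", "D"])

h_constant = shannon_entropy(constant)
h_balanced = shannon_entropy(balanced)
print(h_constant, h_balanced)
```

The constant column has an entropy of zero, while the column with four equally likely values reaches two bits, the maximum for four rows. Real attributes fall in between, and those closest to the maximum are the most useful to an attacker.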

In [None]:
# compute the Shannon entropy of a series
def entropy(
    series: Series,
    base: int = 2,
    normalize: bool = False,
) -> float:
    # compute the expectation of log(p) in the given base (the negated entropy)
    def expectation(probability: Series) -> float:
        return (probability * np.log(probability) / np.log(base)).sum()

    # compute the efficiency: the entropy divided by its maximum, log(length)
    def efficiency(entropy: float, length: int) -> float:
        return entropy * np.log(base) / np.log(length)

    probability = series.value_counts(normalize=True, sort=False)
    h = -expectation(probability)
    return efficiency(h, series.size) if normalize else h


# compute the entropy of each column of a dataframe
def get_entropies(
    dataframe: DataFrame,
    base: int = 2,
    normalize: bool = False,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    entropies = dataframe_.apply(
        entropy,
        base=base,
        normalize=normalize,
    )

    return (
        entropies.to_frame()
        .reset_index()
        .rename(
            {
                "index": "attribute",
                0: "entropy",
            },
            axis=1,
        )
    )


# show the entropies of a dataframe as a bar plot
def plot_entropies(
    dataframe: DataFrame,
) -> None:
    figure = px.bar(
        dataframe,
        x="entropy",
        y="attribute",
        orientation="h",
        color="attribute",
    )

    figure.update_traces(
        texttemplate="%{x:.2f}",
        textposition="auto",
    )

    figure.update_layout(showlegend=False)
    figure.show()

In [None]:
# get a simplified view of the dataset
dataset = tidy_dataframe(buses_dataset)

# show the dataset
display_dataframe(dataset)

# compute the entropies of the dataset
entropies = get_entropies(dataset, normalize=True)

# show a barplot of the entropies 
plot_entropies(entropies)