<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/a2r2-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/)

# Notebook __TWO__: The case for privacy

## Preamble

Yes, raw data is not immune to re-identification! 

You are now going to perform a re-identification attack on a small set of targets. To this end, we give you some auxiliary information (also called background knowledge) and programming tools to help you query the dataset.
1. You can display the bus validations dataset [here](#displayvalid) (do not hesitate to play with the filter, although the number of rows available is very limited).
2. You can attack the dataset [here](#attack) (do not be afraid to try!).
3. To better understand your attacks and/or design other attacks, you can display informative measures about the _identifying power_ of the attributes of the dataset ([Step 2](#explain)).


### Download dataset


In [2]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet

### Import required modules

In [3]:
import importlib.util  # find_spec lives in the util submodule; this also binds importlib
import os
from errno import ENOENT
from pathlib import Path
from typing import Optional, Sequence, Tuple

import folium
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import pyarrow.parquet as pq
from folium.plugins import HeatMapWithTime
from IPython import display, get_ipython
from numpy import ndarray
from pandas import NA, DataFrame, DatetimeIndex, Series, Timedelta, Timestamp



### Setup notebook constants and running environment

In [None]:
# project base directory
BASE_DIRECTORY = Path(".")

# detect running environment
COLAB_ON = "google.colab" in str(get_ipython())

In [None]:
# set Plotly renderer
if COLAB_ON:
    pio.renderers.default = "colab"

### Load and display raw dataset

In [None]:
# load dataset from file system
def load_data(
    path: Path,
) -> DataFrame:
    if not path.exists():
        raise FileNotFoundError(ENOENT, os.strerror(ENOENT), path)

    table = pq.read_table(path)
    return table.to_pandas()


# show a dataframe as a table
def display_dataframe(
    dataframe: DataFrame,
) -> None:
    if COLAB_ON:
        spec = importlib.util.find_spec("google.colab")
        if spec:
            data_table = importlib.import_module("google.colab.data_table")
            enable_dataframe_formatter = getattr(
                data_table,
                "enable_dataframe_formatter",
            )

            enable_dataframe_formatter()

    display.display(dataframe[:20000] if COLAB_ON else dataframe)

In [None]:
path = BASE_DIRECTORY.joinpath("buses.parquet")
buses_dataset = load_data(path)

<a id='displayvalid'></a>

In [None]:
####################
# BEGIN : Observe

display_dataframe(buses_dataset)

# END : Observe
####################

In [None]:
# show dataset on a map
def plot_heatmap(
    dataframe: DataFrame,
    group_column: str = "departure_time",
    # Rennes GPS coordinates
    location: Tuple[float, float] = (48.1147, -1.6794),
) -> None:
    _dataframe = dataframe.copy(deep=True)
    timestamps = []
    coordinates = []
    for timestamp, coordinate in _dataframe.groupby(group_column):
        timestamps.append(str(timestamp))
        coordinates.append(
            coordinate[
                [
                    "stop_lat",
                    "stop_lon",
                ]
            ].values.tolist()
        )

    base_map = folium.Map(
        location=location,
        zoom_start=11,
        tiles="https://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}{r}.png",
        # tiles="https://{s}.basemaps.cartocdn.com/dark_nolabels/{z}/{x}/{y}{r}.png",
        attr="CartoDB",
    )

    heat_map = HeatMapWithTime(
        data=coordinates,
        index=timestamps,
        auto_play=True,
        min_speed=1,
        radius=4,
        max_opacity=0.5,
    )

    heat_map.add_to(base_map)
    display.display(base_map)

**Note**:

> Displaying the heat map of the bus validations only works on a local
> Jupyter server (possibly a *Colab* limitation).

In [None]:
####################
# BEGIN : Observe

plot_heatmap(buses_dataset)

# END : Observe
####################

## Step 1: Attack raw bus validations
<a id="attack"></a>

Re-identification attacks are conceptually simple. They consist in selecting the subset of individuals whose records match the auxiliary information that the attacker holds about a target. If a single individual matches the adversarial knowledge, the attack clearly succeeds (assuming that the adversarial knowledge is reliable). But when more than one individual matches the adversarial knowledge, is it a failure?

Food for thought:
- Below is the auxiliary information that you have on different targets. Can you re-identify them?
    - Target 1: 
    - Target 2: 
    - Target 3: 
- By "looking" at the dataset (see above), can you imagine stronger auxiliary information?
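
As an illustration of the matching step described above, here is a minimal sketch on a toy table. The values are invented, the helper `matching_ids` is hypothetical (it is not part of the notebook), and we assume that the `id` column identifies the card holder. The sketch selects the identifiers that have at least one record matching everything the attacker knows.

```python
import pandas as pd

# Toy validations table: all values are invented for illustration only.
validations = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 3, 4],
        "stop_name": [
            "Republique", "Gares", "Republique",
            "Republique", "Villejean", "Gares",
        ],
        "route_short_name": ["C1", "C2", "C1", "C4", "C4", "C2"],
    }
)


# Hypothetical helper: ids having at least one record that matches
# every (attribute, value) pair known to the attacker.
def matching_ids(dataframe: pd.DataFrame, **known: object) -> set:
    mask = pd.Series(True, index=dataframe.index)
    for attribute, value in known.items():
        mask &= dataframe[attribute] == value
    return set(dataframe.loc[mask, "id"].tolist())


# First observation: the target boarded line C1 at Republique.
first_guess = matching_ids(
    validations, stop_name="Republique", route_short_name="C1"
)

# Second observation: the target also validated at Gares.
second_guess = matching_ids(validations, stop_name="Gares")

print(sorted(first_guess))                 # several candidates remain
print(sorted(first_guess & second_guess))  # combining both narrows the set
```

Each observation alone leaves two candidates here, while their intersection leaves a single one: combining several weak pieces of auxiliary information can be enough to single out a target.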

In [None]:
####################
# BEGIN : Play

# TODO

# END : Play
####################

## Step 2: Explain the success of the attacks

<a id="explain"></a>

The success of a re-identification attack depends on the identifying power of the attributes used for the attack. You can display two measures below: the [Shannon entropy](#shannon) (which quantifies the amount of information carried by each attribute) and the distribution of the [cardinalities of the anonymity sets](#aset) (which indicates how distinguishable individuals are on a given set of attributes). Do not hesitate to play with anonymity sets by changing the set of attributes on which they are computed.

In [None]:
# drop geospatial attributes from dataset
def tidy_dataframe(
    dataframe: DataFrame,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    return dataframe_[
        [
            "departure_time",
            "id",
            "stop_name",
            "route_short_name",
            "stop_id",
            "direction_id",
        ]
    ]

### Shannon's entropy
<a id="shannon"></a>

TODO: text on Shannon's entropy

Food for thought:
- Which attributes give the most information?
- Would your attacks have been more successful with other/additional information?
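
To build some intuition before running the cell below, here is a small self-contained sketch of the standard definition $H(X) = -\sum_x p(x) \log_2 p(x)$ applied to three toy attributes. The function name `shannon_entropy` and the data are illustrative assumptions, independent of the notebook's own `entropy` function.

```python
import numpy as np
import pandas as pd


# Illustrative helper: Shannon entropy in bits of the empirical
# distribution of the values in a series.
def shannon_entropy(series: pd.Series) -> float:
    probability = series.value_counts(normalize=True)
    return float(-(probability * np.log2(probability)).sum())


# Toy attributes: invented values for illustration only.
uniform = pd.Series(["A", "B", "C", "D"])   # four equally likely values
constant = pd.Series(["A", "A", "A", "A"])  # a single value
skewed = pd.Series(["A", "A", "A", "B"])    # one dominant value

for name, series in [
    ("uniform", uniform),
    ("constant", constant),
    ("skewed", skewed),
]:
    print(name, round(shannon_entropy(series), 3))
```

An attribute that always takes the same value carries no information (0 bits), whereas an attribute with four equally likely values carries 2 bits. The more bits an attribute carries, the more it helps an attacker narrow down the set of candidates.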

In [None]:
# compute the Shannon entropy of a series
def entropy(
    series: Series,
    base: int = 2,
    normalize: bool = False,
) -> float:
    # compute sum(p * log_base(p)) over the observed values (the negated entropy)
    def expectation(probability: Series) -> float:
        return (probability * np.log(probability) / np.log(base)).sum()

    # normalize the entropy by its maximum, log_base(length),
    # reached when all `length` records hold distinct values
    def efficiency(entropy: float, length: int) -> float:
        return entropy * np.log(base) / np.log(length)

    probability = series.value_counts(normalize=True, sort=False)
    h = -expectation(probability)
    return efficiency(h, series.size) if normalize else h


# compute the entropy of each attribute of a dataframe
def get_entropies(
    dataframe: DataFrame,
    base: int = 2,
    normalize: bool = False,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    entropies = dataframe_.apply(
        entropy,
        base=base,
        normalize=normalize,
    )

    return (
        entropies.to_frame()
        .reset_index()
        .rename(
            {
                "index": "attribute",
                0: "entropy",
            },
            axis=1,
        )
    )


# show the entropies of a dataframe as a barplot
def plot_entropies(
    dataframe: DataFrame,
) -> None:
    figure = px.bar(
        dataframe,
        x="entropy",
        y="attribute",
        orientation="h",
        color="attribute",
    )

    figure.update_traces(
        texttemplate="%{x:.2f}",
        textposition="auto",
    )

    figure.update_layout(showlegend=False)
    figure.show()

In [None]:
# get a simplified view of the dataset
dataset = tidy_dataframe(buses_dataset)

# show the dataset
display_dataframe(dataset)

# compute the entropies of the dataset
entropies = get_entropies(dataset, normalize=True)

# show a barplot of the entropies 
plot_entropies(entropies)

### Anonymity Sets
<a id="aset"></a>

Displaying the cardinalities of the anonymity sets informs about the _re-identifiability_ of the individuals in the dataset: anonymity sets with a cardinality of 1 contain a single individual, those with a cardinality of 2 contain two individuals, _etc_. Selecting the attributes on which to compute the anonymity sets and displaying the resulting cardinalities can thus help you explain the success of your attack. An attacker could also tune the attack by using the most identifying attributes.

You can choose [below](#asetplay) the attributes on which you compute the anonymity sets.

Food for thought:
- Which set of attributes is the most identifying? Can you find it efficiently?
- Would your attacks have been more successful with other/additional information?
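
The following sketch shows, on invented data, how the choice of attributes changes the anonymity sets. The helper `anonymity_set_sizes` is a hypothetical simplification that groups records sharing the same values; it is not the notebook's `get_anonymity_set`.

```python
from typing import List

import pandas as pd

# Toy records: invented values for illustration only.
records = pd.DataFrame(
    {
        "stop_name": [
            "Republique", "Republique", "Gares",
            "Gares", "Gares", "Villejean",
        ],
        "route_short_name": ["C1", "C1", "C2", "C2", "C4", "C4"],
    }
)


# Illustrative helper: size of each group of records that share the
# same values on the chosen attributes.
def anonymity_set_sizes(dataframe: pd.DataFrame, subset: List[str]) -> pd.Series:
    return dataframe.groupby(subset).size()


by_stop = anonymity_set_sizes(records, ["stop_name"])
by_stop_and_route = anonymity_set_sizes(
    records, ["stop_name", "route_short_name"]
)

# Number of anonymity sets per cardinality, as plotted in the notebook.
print(by_stop.value_counts().sort_index().to_dict())
print(by_stop_and_route.value_counts().sort_index().to_dict())
```

With `stop_name` alone, a single record is unique. Adding `route_short_name` splits the groups further and two records become unique, which illustrates why more attributes generally make an attack more precise.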

In [None]:
# compute the anonymity sets of a 'formatted' dataframe
def get_anonymity_set(
    dataframe: DataFrame,
    *,
    subset: Optional[Sequence[str]] = None,
    reindex: bool = False,
) -> DataFrame:
    # reindex to include the cardinalities that have zero occurrences
    def reset_index(series: Series) -> Series:
        domain = range(1, series.index.max() + 1)
        return series.reindex(domain, fill_value=0)

    dataframe_ = dataframe.copy()
    # number of records sharing each combination of attribute values
    multiplicity = dataframe_.value_counts(subset=subset)
    # number of anonymity sets for each cardinality
    aset = multiplicity.value_counts().sort_index()
    aset = reset_index(aset) if reindex else aset
    # name the axis and the values explicitly so that the column labels
    # do not depend on the pandas version
    return aset.rename_axis("cardinality").reset_index(name="occurrences")


# show the anonymity set of a dataframe as a barplot
def plot_anonymity_set(
    dataframe: DataFrame,
) -> None:
    figure = px.bar(
        dataframe,
        x="cardinality",
        y="occurrences",
        color="occurrences",
        color_continuous_scale="Bluered",
        # template="plotly_white",
        title="Anonymity Set",
    )

    figure.update_coloraxes(showscale=False)
    figure.show()

<a id="asetplay"></a>

In [None]:
####################
# BEGIN : Play

# define a subset of specific attributes to take into account
SUBSET = [
    "id",
    "stop_name",
    # "route_short_name",
]

# END : Play
####################

In [None]:
# get a simplified view of the dataset
dataset = tidy_dataframe(buses_dataset)

# compute the anonymity sets of the dataset for the selected attributes
anonymity_set = get_anonymity_set(dataset, subset=SUBSET)

# show a barplot of the anonymity sets
plot_anonymity_set(anonymity_set)