<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/notebooks/a2r2-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

## Acknowledgments

We warmly thank François Bodin and Luc Lesoil for their support on the data and the definition of the use-case.

# Notebook __TWO__: The case for privacy

## Step 0 (STARTER)

Yes, raw data is not immune to re-identification! 

You are now going to perform a re-identification attack on a small set of targets. To this end, we give you some auxiliary information (also called background knowledge) and programming tools to help you query the dataset.
1. You can display the bus validations dataset [here](#displayvalid). Feel free to play with the filter menu, although the number of rows shown is limited.
2. You can attack the dataset in [Step 1](#attack) (do not be afraid to try!).
3. To better understand your attacks and/or design other attacks, you can display informative measures of the _identifying power_ of the attributes of the dataset ([Step 2](#explain)).

## Settings and data


### Download dataset


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet

### Import required modules

In [None]:
from pathlib import Path

from atelier import io, metrics, plot
from atelier.extensions.star import data
from atelier.plot import timeline
from IPython.display import display

### Setup notebook constants and running environment

In [None]:
from atelier.utils import colaboratory

# detect running environment
COLAB_ON = colaboratory.setup()

# `colab` supports only 20K rows to show filters so do not forget:
# display(dataframe[:20000])

### Load and display raw dataset

#### Show raw dataset

<a id="displayvalid"></a>

In [None]:
path = Path("buses.parquet")
buses_dataset = io.read_data(path)
display(buses_dataset)

####################
# BEGIN : Observe

In [None]:
# Showing the heat map of validations only works on a local server
if not COLAB_ON:
    plot.heatmap_plot(buses_dataset)

In [None]:
# END : Observe
####################

## Step 1: Attack raw buses validations
<a id="attack"></a>

Re-identification attacks are conceptually simple. They consist in selecting the subset of individuals whose records match the auxiliary information that the attacker has about them. If a single individual matches the adversarial knowledge, the success of the attack is clear (assuming that the adversarial knowledge is reliable). Otherwise the success is less clear. But when more than a single individual matches the adversarial knowledge, is it really a failure?
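
To make the idea concrete, here is a minimal sketch on a few made-up validation records. The records, the identifiers and the `candidates` helper are illustrative assumptions; they are neither part of the workshop dataset nor of the `atelier` tools. The sketch keeps the individuals who have at least one record matching all the auxiliary information.

```python
# Toy validation records: (user id, stop name, direction id).
# All values are invented for illustration.
records = [
    (1, "Anne de Bretagne", 0),
    (1, "Beaulieu", 1),
    (2, "Tournebride", 0),
    (3, "Anne de Bretagne", 1),
    (3, "Le Mail", 0),
]


def candidates(records, stop_name, direction_id):
    """Return the ids having at least one record matching the auxiliary information."""
    return {
        user
        for user, stop, direction in records
        if stop == stop_name and direction == direction_id
    }


# auxiliary information: boards at 'Anne de Bretagne', direction 0
found = candidates(records, "Anne de Bretagne", 0)
print(found)       # the set of individuals compatible with the knowledge
print(len(found))  # size of that candidate set
```

A candidate set of size one singles out the target. A larger candidate set does not name the target, but it still narrows the search down to a few individuals.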

Let's have a look at an [example](#attackexample).

### Example of a re-identification attack
<a id='attackexample'></a>

Somebody said:

> "*I often take the bus in the morning to go to Beaulieu from the 'Anne de Bretagne' in Cesson* "

Is this information enough to discover the mobility patterns of that person?

Here is a short summary of the helper methods implemented to perform the attack; refer to the example below for their use (or, if you feel comfortable, use the **Pandas** API directly):

- `query`:  perform a query on the dataset by attribute name and value
- `between`:  filter dataset between two timestamps
- `intersect`: intersect two datasets with a common attribute (the 'on' attribute)
- `distinct`: get distinct rows from a dataset grouping by a 'subset'
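
For intuition only, the four operations can be sketched on a plain list of dictionaries. These toy functions are stand-ins written for this illustration: they mimic the call shapes used in this notebook, but they are not the `atelier` implementation, and the semantics chosen for `intersect` (keep the rows of the first dataset whose `on` values appear in the second) is an assumption.

```python
from datetime import datetime

# Toy rows; every value is invented for illustration.
rows = [
    {"id": 1, "stop_name": "Anne de Bretagne", "direction_id": 0,
     "departure_time": "2021-08-02 08:05:00"},
    {"id": 1, "stop_name": "Beaulieu", "direction_id": 1,
     "departure_time": "2021-08-02 18:10:00"},
    {"id": 2, "stop_name": "Anne de Bretagne", "direction_id": 1,
     "departure_time": "2021-09-01 08:00:00"},
    {"id": 2, "stop_name": "Le Mail", "direction_id": 1,
     "departure_time": "2021-09-01 09:00:00"},
]


def query(rows, attribute, value):
    """Keep rows whose attribute equals value (or belongs to a list of values)."""
    values = value if isinstance(value, list) else [value]
    return [row for row in rows if row[attribute] in values]


def between(rows, start, end, attribute="departure_time"):
    """Keep rows whose timestamp lies in [start, end], given as ISO strings.

    A date-only bound such as "2021-08-31" is read as midnight of that day.
    """
    low, high = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [
        row for row in rows
        if low <= datetime.fromisoformat(row[attribute]) <= high
    ]


def intersect(left, right, on):
    """Keep rows of `left` whose `on` values also appear in `right`."""
    keys = {tuple(row[name] for name in on) for row in right}
    return [row for row in left if tuple(row[name] for name in on) in keys]


def distinct(rows, subset):
    """Keep the first row seen for each combination of the `subset` attributes."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row[name] for name in subset)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


q_1 = query(rows, "stop_name", "Anne de Bretagne")
q_2 = query(rows, "direction_id", 0)
q_3 = intersect(q_1, q_2, on=["id"])
q_4 = distinct(q_3, ["id"])
print([row["id"] for row in q_4])  # ids compatible with both queries
```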

Have a look at the notebook cell below (marked `BEGIN : Observe`), which implements this attack. You can also [go straight to your targets](#attacktargets).

In [None]:
####################
# BEGIN : Observe

# remove geo-spatial information from the dataset
dataset = data.tidy_dataframe(buses_dataset)

# show the dataset
print("Initial dataset")
display(dataset)

# query: "I take the bus from the bus stop 'Anne de Bretagne'"
q_1 = data.query(dataset, "stop_name", "Anne de Bretagne")

# query: "I take the bus going to Beaulieu (city center)"
q_2 = data.query(dataset, "direction_id", 0)

# intersect results of 'q_1' and 'q_2'
q_3 = data.intersect(q_1, q_2, on=["id"])

# show results of the intersection stored in 'q_3'
print("Result of the intersection of queries 1 & 2")
display(q_3)

# check how many different users are in query 'q_3'
q_4 = data.distinct(q_3, ["id"])

# show results of query 'q_4'
# => since there is only one row, we found the user!
print("Result of checking different `id` in previous result")
display(q_4)

# query: all travels of the user ('id') of query 'q_4'
q_5 = data.query(dataset, "id", 175)

# show results of query 'q_5'
print("Complete dataset of the user with `id` 175")
display(q_5)

# get the travel count of the user ('id') of query 'q_4' in a timeline
q_6 = data.count(dataset, "id", 175)

# plot results of query 'q_6'
timeline.plot(q_6, "count")

# for the curious:
# an all-in-one 'plain vanilla' Pandas equivalent follows
# (results are not printed on screen)
target = dataset.query(
    "stop_name == 'Anne de Bretagne' & direction_id == 0"
).drop_duplicates(
    subset=[
        "id",
        "stop_name",
    ],
)

# END : Observe
####################

### Food for thought
<a id='attacktargets'></a>

Below is the auxiliary information that you have on different targets. Can you re-identify them based on the available dataset?

```
####################
# BEGIN : Answer
```

> - Target 1: *When I go to work using public transportation, I always take the bus going to the lycée Assomption, from the beginning of the line*.
> - Target 2: *I usually take the bus from 'Saint-Sulpice' but during holidays I stayed at my parents' home and I took the bus '217' a couple of times to go to the campus*.
> - Target 3: *I take any bus from the RU Étoile to downtown because I live next to the 'Cimetière de l'Est' and I do not mind walking*.

```
# END : Answer
####################
```

Do not forget to visit the website of the [STAR](https://www.star.fr/accueil). In particular, check the [page](https://m.star.fr/) showing the buses serving a specific bus stop, and the [page](https://www.star.fr/accueil?tx_pnfstarod_searchdocument[action]=search&tx_pnfstarod_searchdocument[controller]=SearchLines) showing the map/schedule of the bus lines.

In [None]:
####################
# BEGIN : Code

In [None]:
# Target 1
dataset = data.tidy_dataframe(buses_dataset)

# TODO YOUR code here!

In [None]:
# Target 2
dataset = data.tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To use 'between' set the start and end dates as strings:
#       result = between(dataset, "2021-08-01", "2021-08-31")

In [None]:
# Target 3
dataset = data.tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To test several values of an attribute at once, provide a list to query:
#       values = ["Tournebride", "Le Mail", "Maison d'Accueil"]
#       result = query(dataset, "stop_name", values)

In [None]:
# END : Code
####################

Why was this auxiliary information sufficient to enable your attacks? Displaying the anonymity sets, as done in [Step 2](#explain), can provide some explanations.

## Step 2: Explain the success of your attacks

<a id="explain"></a>

The success of a re-identification attack depends on the identifying power of the attributes that have been used for the attack. You can display below the distribution of the [cardinalities of the anonymity sets](#aset), which indicates how distinguishable individuals are on a given set of attributes. See the examples below, then play with anonymity sets by changing the set of attributes on which they are computed.

Additionally, we provide in the Appendix the [Shannon entropy](#shannon) of single attributes. It quantifies the _amount of information_ carried by each attribute.

### Anonymity Sets
<a id="aset"></a>

Displaying the cardinalities of the anonymity sets informs you about the _re-identifiability_ of the individuals in the dataset: anonymity sets with a cardinality equal to 1 contain a single individual, those with a cardinality equal to 2 contain two individuals, etc. Selecting the attributes on which you want to compute the anonymity sets and displaying the resulting cardinalities can thus help you explain the success of your attack. An attacker could also tune the attack by using the most identifying attributes.
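
As a rough sketch of what is being counted, the toy code below groups a few invented rows by the values of a chosen set of attributes and reports the size of each group, either in rows (validations) or in distinct identifiers (users). The `anonymity_sets` and `cardinality_distribution` functions and the `COLUMNS` constant are hypothetical names written for this illustration; they only echo the `subset` and `distinct` arguments used in this notebook and are not the `atelier.metrics` implementation.

```python
from collections import Counter, defaultdict

# Toy rows: (id, stop_name, direction_id); all values are invented.
COLUMNS = ("id", "stop_name", "direction_id")
rows = [
    (1, "Anne de Bretagne", 0),
    (1, "Anne de Bretagne", 0),
    (2, "Beaulieu", 0),
    (3, "Beaulieu", 0),
    (3, "Beaulieu", 1),
]


def anonymity_sets(rows, subset, distinct=None):
    """Map each combination of `subset` values to the size of its anonymity set.

    Without `distinct`, the size counts rows (validations); with
    `distinct="id"`, it counts the different values of that column (users).
    """
    positions = [COLUMNS.index(name) for name in subset]
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[position] for position in positions)
        member = row if distinct is None else row[COLUMNS.index(distinct)]
        groups[key].append(member)
    if distinct is None:
        return {key: len(members) for key, members in groups.items()}
    return {key: len(set(members)) for key, members in groups.items()}


def cardinality_distribution(sets):
    """Count how many anonymity sets have each cardinality."""
    return Counter(sets.values())


by_stop_rows = anonymity_sets(rows, ["stop_name"])
by_stop_users = anonymity_sets(rows, ["stop_name"], distinct="id")
print(by_stop_rows)
print(by_stop_users)
print(cardinality_distribution(by_stop_users))
```

On this toy data, the stop used by a single user forms an anonymity set of cardinality 1: knowing that stop is enough to single that user out.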

Let's see some [examples](#aset_examples) first. 

<a id="aset_examples"></a>

#### Examples of anonymity sets
We now look at some anonymity sets in detail.

1. Anonymity sets for all attributes [[link]](#aset_e1)
2. Anonymity sets of the '`id`' attribute [[link]](#aset_e2)
3. Anonymity sets of the '`stop_name`' attribute [[link]](#aset_e3)
4. Anonymity sets of the '`route_short_name`' and '`direction_id`' attributes [[link]](#aset_e4)
5. Anonymity sets of the '`departure_time`' attribute [[link]](#aset_e5)<a id="aset_e2"></a>

Once done with the examples, [go to the questions](#asetquestions)!

Want to go back to the top of the anonymity sets section? Click [here](#aset).

<a id="aset_e1"></a>

**EXAMPLE 1.1: Anonymity set of validations for all the attributes of the dataset**

This represents the number of different validations (count of rows) on the whole dataset.

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
# get a simplified view of the dataset
dataset = data.tidy_dataframe(buses_dataset)

# get anonymity set of validations for all attributes
anonymity_set = metrics.get_anonymity_sets(dataset)

print(f"Anonymity set of validations for all attributes")
plot.anonymity_sets_plot(anonymity_set)

uniques = dataset.drop_duplicates()
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display(uniques)

**EXAMPLE 1.2: Anonymity set of different users for all attributes of the dataset**

This represents the number of different users in the dataset (unique identifiers).

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
# get anonymity set of different users for all attributes
anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id")

print(f"Anonymity set of different users for all attributes")
plot.anonymity_sets_plot(anonymity_set)

uniques = dataset.drop_duplicates("id")
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display(uniques)

<a id="aset_e2"></a>

**EXAMPLE 2.1: Anonymity set of validations for the subset `['id']`**

This represents the number of _validations_ (count of rows) for the same unique identifier. 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
dataset = data.tidy_dataframe(buses_dataset)

SUBSET = ["id"]

anonymity_set = metrics.get_anonymity_sets(dataset, subset=SUBSET)
print(f"Anonymity set of validations for {SUBSET}")
plot.anonymity_sets_plot(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"count": "count"})
    .sort_values(by="count")
    .reset_index()
)

result = dataset[dataset["id"] == rows["id"][0]]
print(f"Occurrences of the FIRST cardinality: {result.shape[0]}")
display(result)

uniques = result.drop_duplicates(subset=SUBSET)
print(
    f"Cardinality of the previous occurrence (unique rows with the subset): {uniques.shape[0]}"
)
display(uniques)

**EXAMPLE 2.2: Anonymity set of different users for the subset `['id']`**

This represents the number of different _users_ in the dataset. 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id", subset=SUBSET)
print(f"Anonymity set of different users for {SUBSET}")
plot.anonymity_sets_plot(anonymity_set)

uniques = dataset.drop_duplicates(subset=SUBSET)
print(f"Occurrences of the FIRST cardinality: {uniques.shape[0]}")
display(uniques)

<a id="aset_e3"></a>

**EXAMPLE 3.1: Anonymity sets of validations for the subset `['stop_name']`**

This represents the anonymity sets of the _validations_ on the name of the bus stop. 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
dataset = data.tidy_dataframe(buses_dataset)

SUBSET = ["stop_name"]

anonymity_set = metrics.get_anonymity_sets(dataset, subset=SUBSET)
print(f"Anonymity set of validations for {SUBSET}")
plot.anonymity_sets_plot(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"count": "count"})
    .sort_values(by="count")
    .reset_index()
)

result = dataset[dataset["stop_name"] == rows["stop_name"][0]]
print(f"Occurrences of the FIRST cardinality: {result.shape[0]}")
display(result)

uniques = result.drop_duplicates(subset=SUBSET)
print(
    f"Cardinality of the previous occurrence (unique rows with the subset): {uniques.shape[0]}"
)
display(uniques)

**EXAMPLE 3.2: Anonymity set of different users for the subset `['stop_name']`**

This represents the anonymity sets of the _users_ on the name of the bus stop. 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id", subset=SUBSET)
print(f"Anonymity set of different users for {SUBSET}")
plot.anonymity_sets_plot(anonymity_set)
rows = (
    dataset.drop_duplicates(subset=SUBSET + ["id"])
    .groupby(SUBSET + ["id"])
    .agg({"count": "count"})
    .groupby(SUBSET)
    .count()
    .sort_values(by="count")
    .reset_index()
)

# def flat(lista):
#     return set(item for sublist in lista for item in sublist)

# groups = (
#     dataset.drop_duplicates(subset=SUBSET + ["id"])
#     .groupby(SUBSET + ["id"])
#     .aggregate(lambda x: list(x))
#     .groupby(SUBSET)
#     .aggregate(lambda x: flat(x))
# )

# display_dataframe(groups)

cardinality = rows[rows["count"] == rows["count"][0]]
print(f"Occurrences of the FIRST cardinality: {cardinality.shape[0]}")
display(cardinality)

# get first element's data of the cardinality
result = dataset[dataset["stop_name"] == cardinality["stop_name"][0]]
print(f"Dataset of the FIRST occurrence")
display(result)

uniques = result.drop_duplicates(subset=SUBSET + ["id"])
print(
    f"Cardinality of the previous dataset (unique rows with the subset): {uniques.shape[0]}"
)
display(uniques)

<a id="aset_e4"></a>

**EXAMPLE 4: Anonymity set of the '`route_short_name`' and '`direction_id`' attributes**

This represents the anonymity sets of the _validations_ (first) and of the _users_ (second) on the couple of attributes (name of the route, direction). 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
dataset = data.tidy_dataframe(buses_dataset)
SUBSET = [
    "route_short_name",
    "direction_id",
]

### ANONYMITY SET OF VALIDATIONS
anonymity_set = metrics.get_anonymity_sets(dataset, subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)
rows = (
    dataset.groupby(SUBSET)
    .agg({"count": "count"})
    .sort_values(by="count")
    .reset_index()
)

result = dataset[
    (dataset["route_short_name"] == rows["route_short_name"][0])
    & (dataset["direction_id"] == rows["direction_id"][0])
]

display(result)

### ANONYMITY SET OF USERS
anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id", subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)
rows = (
    dataset.drop_duplicates(subset=SUBSET + ["id"])
    .groupby(SUBSET + ["id"])
    .agg({"count": "count"})
    .groupby(SUBSET)
    .count()
    .sort_values(by="count")
    .reset_index()
)

# get first cardinality
cardinality = rows[rows["count"] == rows["count"][0]]
display(cardinality)

# get first element's data of the cardinality
result = dataset[
    (dataset["route_short_name"] == cardinality["route_short_name"][0])
    & (dataset["direction_id"] == cardinality["direction_id"][0])
]

# check that the result of the query corresponds to the cardinality
display(result.drop_duplicates(subset=SUBSET + ["id"]))

<a id="aset_e5"></a>

**EXAMPLE 5: Anonymity sets of the '`departure_time`' attribute**

This represents the anonymity sets of the _validations_ (first) and of the _users_ (second) on the departure time of the bus. 

(You may want to [go back to the top of the examples](#aset_examples) or to [go straight to the questions](#asetquestions).)

In [None]:
dataset = data.tidy_dataframe(buses_dataset)
SUBSET = [
    "departure_time",
]
anonymity_set = metrics.get_anonymity_sets(dataset, subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)

anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id", subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)

# Question: Why are they equal? ;)

#### Food for thought
<a id='asetquestions'></a>

```
####################
# BEGIN : Answer
```

> - Which set of attributes is the most identifying? Can you find it efficiently?
> - Would your attacks have been more successful with other/additional information?

```
# END : Answer
####################
```

For the bus validations dataset, two kinds of anonymity sets can be computed:

1. Anonymity set of validations (rows of the dataset)
2. Anonymity set of different users (distinct user identifiers by rows)

You can choose [below](#asetplay) the attributes on which you compute the anonymity sets.

You may want to [go back to the examples](#aset_examples). 

<a id="asetplay"></a>

In [None]:
####################
# BEGIN : Play

# (un)comment lines starting with a hash ('#') to change the subset

SUBSET = [
    # "departure_time",
    # "id",
    # "stop_name",
    # "route_short_name",
    # "stop_id",
    # "direction_id",
]

# END : Play
####################


# get a simplified view of the dataset
dataset = data.tidy_dataframe(buses_dataset)

### ANONYMITY SET OF VALIDATIONS
anonymity_set = metrics.get_anonymity_sets(dataset, subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)

### ANONYMITY SET OF USERS
anonymity_set = metrics.get_anonymity_sets(dataset, distinct="id", subset=SUBSET)
plot.anonymity_sets_plot(anonymity_set)

## APPENDIX

### Shannon's entropy
<a id="shannon"></a>

__Food for thought__

- Which attributes give the most information?
- Would your attacks have been more successful with other/additional information?
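
As a reminder of what is measured, the Shannon entropy of an attribute $X$ is $H(X) = -\sum_x p(x) \log_2 p(x)$, where $p(x)$ is the fraction of rows taking the value $x$. The sketch below computes it on invented values with the standard library only. The `shannon_entropy` function is a hypothetical helper, and dividing by $\log_2$ of the number of distinct values is just one common way to normalize; whether `metrics.get_entropies(..., normalize=True)` does the same is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(values, normalize=False):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # sum of p * log2(1 / p), written with counts to avoid negative zeros
    entropy = sum(
        (count / total) * math.log2(total / count) for count in counts.values()
    )
    if normalize and len(counts) > 1:
        # assumed normalization: divide by the maximum entropy log2(k)
        entropy /= math.log2(len(counts))
    return entropy


# invented columns: one varied attribute, one constant attribute
stops = ["Beaulieu", "Beaulieu", "Le Mail", "Tournebride"]
directions = [0, 0, 0, 0]

print(shannon_entropy(stops))                  # varied attribute
print(shannon_entropy(stops, normalize=True))  # same, scaled to [0, 1]
print(shannon_entropy(directions))             # constant attribute
```

An attribute that always takes the same value carries no information and has zero entropy, while an attribute whose values are spread evenly over many possibilities has the highest entropy.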

In [None]:
# get a simplified view of the dataset
dataset = data.tidy_dataframe(buses_dataset)

# show the dataset
display(dataset)

# compute the entropies of the dataset
entropies = metrics.get_entropies(dataset, normalize=True)

# show a barplot of the entropies
plot.entropies_plot(entropies)