<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/notebooks/a2r2-01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

## Acknowledgments

We warmly thank François Bodin and Luc Lesoil for their support on the data and the definition of the use-case.


# Notebook __ONE__

## Step 0 (STARTER)

<a id='step_0'></a>

This hands-on tutorial introduces you to *privacy-preserving personal data publishing*. You are going to follow the implementation of a concrete use case built from open data from the Rennes Metropole area. The main question of the use case is whether a change in the students' schedules at the Beaulieu campus impacts the load of the buses that go through the campus. We will answer this question using two datasets: the validations recorded in the buses that stop close to the campus (with timestamps), and the number of students finishing a class (with timestamps). Our approach consists in training a predictor that outputs the expected number of validations throughout the day given the number of students finishing a class throughout the day.

However, using raw bus validations to answer this question may raise privacy issues, because validations can be highly identifying. After performing some re-identification attacks, you will use a perturbed version of the bus validations dataset and observe the resulting impact on our ability to answer the main question of the use case.

We designed this tutorial to be a step-by-step guided tour. You can follow the "Step i" tags in the section titles in sequence. It is up to you to follow the proposed sequence or to deviate from it, but be careful when leaving the track: it's wild out there ;)

We divided the full journey into three topics:

1. The naive version
2. Privacy issues
3. The protected version

For your convenience, there is a dedicated notebook for each topic; **you are currently in Notebook ONE**.

The notebooks also include questions. Please take some time to think about them. Trying to answer these questions can also help you to gain a deeper understanding. And we would love reading your answers!

Ready?

Really??

Please run the whole notebook (it does not take long) and **go directly to [Step 1](#step_1).**


## Step 2 (PREAMBLE): Settings and datasets

<a id='step_2'></a>

Not too disappointed? Let's now have a look at the data on which we trained the model.

1. The datasets are downloaded
2. The required libraries are imported and global variables are set up
3. The raw data are aggregated...
4. ... and the results are displayed.
5. The datasets are prepared for the training process.

> Observe the buses validations dataset (section [Display raw data](#sec_display_raw_data))... Can you imagine any issue?


### Download datasets


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet
!wget -nv -nc https://zenodo.org/record/5519319/files/classes.parquet -O classes_filiere.parquet

### Import required modules

In [None]:
from pathlib import Path

import pandas as pd
from atelier import data, io, plot
from atelier.extensions.star import preprocessing as sp
from atelier.learn import model, preprocessing, validation
from atelier.plot import timeline
from atelier.utils import time
from IPython.display import display
from pandas import Timedelta, Timestamp

### Setup notebook constants and running environment

In [None]:
from atelier.utils import colaboratory

# check running environment
COLAB_ON = colaboratory.setup()

RANDOM_STATE = 42

### Load and display raw datasets

#### Read raw data

In [None]:
# buses dataset
buses_path = Path("buses.parquet")
buses_dataset = io.read_data(buses_path)

# classes dataset
classes_path = Path("classes_filiere.parquet")
classes_dataset = io.read_data(classes_path)

#### Display raw data

<a id='sec_display_raw_data'></a>

**Dataset of buses validations - STAR/Keolis Rennes**

Attributes description:

- `departure_time`: Timestamp of the user's bus validation
- `stop_name`: Name of the bus stop
- `route_short_name`: Number of the bus line
- `direction_id`: Code of the bus direction (0 or 'aller': bus from the city center; 1 or 'retour': bus to the city center)
- `stop_id`: Code (unique) of the bus stop
- `count`: Number of validations
- `stop_lat`: GPS latitude coordinate of the bus stop 
- `stop_lon`: GPS longitude coordinate of the bus stop
- `id`: User identifier (unique)

```
####################
# BEGIN : Observe
```

In [None]:
display(buses_dataset)

**Dataset of students - Université de Rennes 1, Campus Beaulieu**

- `fin_cours`: Timestamp of the end of a course
- `filiere`: Students' background or speciality
- `nombre_etudiant`: Number of students, per background, finishing a course

In [None]:
display(classes_dataset)

```
# END : Observe
####################
```



#### Food for thought

```
####################
# BEGIN : Answer
```

> 1. Is there any directly identifying information in the raw data?
> 2. Can you describe auxiliary information that could lead to re-identifications?

```
# END : Answer
####################
```


### Pre-process raw data

In [None]:
# target bus stops
beaulieu_stops = [
    "Les Préales",
    "Tournebride",
    "Beaulieu Chimie",
    "Beaulieu INSA",
    "Beaulieu Restau U",
]


buses_dataset = data.aggregate_dataframe(
    buses_dataset,
    by="departure_time",
    agg={"count": "sum"},
    attribute="stop_name",
    value=beaulieu_stops,
    keep_index=False,
).rename(columns={"count": "validations"})

classes_dataset = classes_dataset.rename(
    columns={
        "nombre_etudiant": "students",
        "filiere": "background",
    }
)

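To make the pre-processing step less of a black box, here is a minimal sketch of what the aggregation above may amount to, written in plain pandas on a toy table. It assumes that `data.aggregate_dataframe` keeps the rows whose `stop_name` is in the given list, groups them by `departure_time` and sums `count`; the actual `atelier` implementation may differ, and all the values below are made up.

```python
import pandas as pd

# Toy validations reusing the column names of the buses dataset (made-up values).
raw = pd.DataFrame(
    {
        "departure_time": pd.to_datetime(
            [
                "2021-09-06 08:00",
                "2021-09-06 08:00",
                "2021-09-06 08:00",
                "2021-09-06 08:15",
            ]
        ),
        "stop_name": ["Tournebride", "Beaulieu INSA", "République", "Tournebride"],
        "count": [3, 2, 7, 1],
    }
)

campus_stops = ["Tournebride", "Beaulieu INSA"]

# 1. keep only the validations made at the campus stops
on_campus = raw[raw["stop_name"].isin(campus_stops)]

# 2. sum the validations per timestamp and rename the resulting column
# (assumption: this mirrors the `by`, `agg`, `attribute` and `value` arguments)
validations = (
    on_campus.groupby("departure_time")
    .agg({"count": "sum"})
    .rename(columns={"count": "validations"})
)

print(validations)
```

Note that the aggregated table no longer contains `id`, `stop_name` or coordinates: only a count per timestamp remains.
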
### Display aggregated data

#### Number of validations

In [None]:
display(buses_dataset)
timeline.plot(buses_dataset, "validations")

#### Number of students

In [None]:
display(classes_dataset)
timeline.plot(classes_dataset, "students")

### Merge the datasets together

In [None]:
dataset = data.merge_dataframes(classes_dataset, buses_dataset)
display(dataset)

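As a rough illustration of the merge above, the sketch below joins two time-indexed toy tables on their timestamps and fills the gaps. It is only an assumption about what `data.merge_dataframes` does: the real function may, for instance, resample both tables to a common frequency first. The tutorial later replaces a `background` equal to `0` by `"Empty"`; the sketch fills `"Empty"` directly to keep the column homogeneous.

```python
import pandas as pd

# Toy time-indexed tables (made-up values).
students = pd.DataFrame(
    {"students": [120, 80], "background": ["ISTIC", "Math"]},
    index=pd.to_datetime(["2021-09-06 10:00", "2021-09-06 12:00"]),
)
validations = pd.DataFrame(
    {"validations": [15, 40, 22]},
    index=pd.to_datetime(
        ["2021-09-06 10:00", "2021-09-06 11:00", "2021-09-06 12:00"]
    ),
)

# Outer join on the timestamps: a row is kept even if only one table has it.
merged = students.join(validations, how="outer")

# Timestamps without any class get 0 students and an "Empty" background
# (assumption: the tutorial's helper fills such gaps in a similar way).
merged = merged.fillna({"students": 0, "background": "Empty"})

print(merged)
```
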
### Display dataset subsets

In [None]:
la_rentree = Timestamp("2021-09-06")
la_toussaint = Timestamp("2021-11-01")
one_week = Timedelta(7, unit="day")

end_train = time.get_next_monday(dataset, weeks=9)
start_test = time.get_next_monday(dataset, weeks=10)

timeline.plot_with_annotations(
    dataset,
    attributes=["students", "validations"],
    delimiters=[("val", end_train), ("test", start_test)],
    timeframes=[("holidays", la_toussaint, one_week)],
    title="Counts of Students & Validations",
)

### Enhance data attributes to create a predictive model based on machine learning

#### Improve dataset by mining information from date and time 

In [None]:
holidays = pd.date_range(start=la_toussaint, periods=7, freq="D")
dataset_with_features = preprocessing.timeline_feature_extraction(dataset, holidays)
dataset_with_features.loc[
    dataset_with_features["background"] == 0, "background"
] = "Empty"
display(dataset_with_features)

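The kind of features that can be mined from a date and time index is sketched below with plain pandas. The column names (`hour`, `day_of_week`, `is_weekend`, `is_holiday`) are illustrative assumptions; the features actually produced by `preprocessing.timeline_feature_extraction` may be different.

```python
import pandas as pd

# Four consecutive days around the start of the holidays (made-up counts).
index = pd.date_range("2021-10-29 08:00", periods=4, freq="D")
frame = pd.DataFrame({"students": [100, 0, 0, 50]}, index=index)

# Same construction as the `holidays` variable of the tutorial.
holidays = pd.date_range(start="2021-11-01", periods=7, freq="D")

features = frame.assign(
    hour=frame.index.hour,
    day_of_week=frame.index.dayofweek,  # Monday is 0, Sunday is 6
    is_weekend=frame.index.dayofweek >= 5,
    # compare dates only, so the time of day is dropped with normalize()
    is_holiday=frame.index.normalize().isin(holidays),
)

print(features)
```
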
#### Split the dataset to train a machine learning tool

In [None]:
# train - test dataset split
train_test_datasets = preprocessing.timeline_train_test_split(
    dataset_with_features,
    start_test=start_test,
    end_train=end_train,
)

train_dataset, val_dataset, test_dataset = train_test_datasets
X_train, y_train = data.split_dataframe(train_dataset, target="validations")
X_val, y_val = data.split_dataframe(val_dataset, target="validations")
X_test, y_test = data.split_dataframe(test_dataset, target="validations")

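Because the data form a timeline, the split is chronological rather than random: the model is trained on the first weeks, validated on the following one, and tested on the rest. The sketch below illustrates the idea on a toy daily series. `next_monday` is a hypothetical helper written for this example, and the exact semantics of `time.get_next_monday` and `preprocessing.timeline_train_test_split` are assumptions.

```python
import pandas as pd


def next_monday(first: pd.Timestamp, weeks: int) -> pd.Timestamp:
    """Hypothetical helper: Monday (at midnight) `weeks` weeks after the week of `first`."""
    monday = first.normalize() - pd.Timedelta(days=first.dayofweek)
    return monday + pd.Timedelta(weeks=weeks)


# Toy daily series covering twelve weeks from the start of the term.
index = pd.date_range("2021-09-06", "2021-11-28", freq="D")
frame = pd.DataFrame({"validations": range(len(index))}, index=index)

end_train = next_monday(frame.index[0], weeks=9)
start_test = next_monday(frame.index[0], weeks=10)

# Chronological split: no shuffling, so the test period stays unseen.
train = frame[frame.index < end_train]
val = frame[(frame.index >= end_train) & (frame.index < start_test)]
test = frame[frame.index >= start_test]

print(len(train), len(val), len(test))
```

Splitting by time avoids evaluating the model on days that are interleaved with the days it was trained on.
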
## TOOL: A neural network as a regressor for prediction

### Define a neural network model


In [None]:
####################
# BEGIN : play

# sizes (number of neurons) of the hidden layers
HIDDEN_LAYER_SIZES = (64, 64, 64)

# number of rows processed at the same time
BATCH_SIZE = 64

# max number of iterations during training
MAX_ITERATIONS = 256

# END : play
####################


# dataset transformer used by learning model
transformer = sp.make_column_transformer()

# default regressor is interchangeable with others like LinearRegression
regressor = model.get_default_regressor(
    hidden_layer_sizes=HIDDEN_LAYER_SIZES,
    batch_size=BATCH_SIZE,
    max_iterations=MAX_ITERATIONS,
    random_state=RANDOM_STATE,
)

# k is the number of features kept after encoding (about 80 % of them)
regressor_pipeline = model.make_pipeline(
    transformer, regressor, k=42, random_state=RANDOM_STATE
)

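For readers who want to see what such a pipeline may look like without the `atelier` helpers, here is a minimal scikit-learn sketch on synthetic data. It is an assumption about the structure (encoding, selection of the `k` best features, then a multi-layer perceptron), not the actual code behind `sp.make_column_transformer`, `model.get_default_regressor` or `model.make_pipeline`; in particular, the choice of `f_regression` as scoring function is ours.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (made up): validations loosely follow the number of students.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame(
    {
        "students": rng.integers(0, 300, size=n),
        "hour": rng.integers(7, 20, size=n),
        "background": rng.choice(["ISTIC", "Math", "Empty"], size=n),
    }
)
y = 0.2 * X["students"] + rng.normal(0, 5, size=n)

# One-hot encode the categorical column, standardise the numerical ones.
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["background"]),
    (StandardScaler(), ["students", "hour"]),
)

# Keep 4 of the 5 encoded features (assumption: univariate F-test scoring).
selector = SelectKBest(score_func=f_regression, k=4)

regressor = MLPRegressor(
    hidden_layer_sizes=(64, 64, 64),
    batch_size=64,
    max_iter=256,
    random_state=42,
)

pipeline = make_pipeline(transformer, selector, regressor)
pipeline.fit(X, y.to_numpy().ravel())

predictions = pipeline.predict(X)
n_features = pipeline[:-1].transform(X).shape[1]
print(predictions[:5], n_features)
```

Slicing the pipeline with `[:-1]` applies every step except the regressor, which is a convenient way to count the features that actually reach the model.
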
### Train the neural network

In [None]:
# MLPRegressor expects a 1-D NumPy array for 'y', which not all regressors require
regressor_pipeline.fit(X_train, y_train.to_numpy().ravel())

# setup a dataframe with results of training
train_results = model.predict_and_compare(regressor_pipeline, X_train, y_train)

# show timeline of training
timeline.plot_with_annotations(
    train_results,
    attributes=["references", "predictions"],
    secondary_y=False,
    title="Training Dataset",
)

### Show the quality of the training

In [None]:
# setup a dataframe with results of validation
val_results = model.predict_and_compare(regressor_pipeline, X_val, y_val)

validation.print_metrics(val_results)
plot.residuals_plot(
    val_results,
    attributes=("predictions", "residuals"),
)

plot.losses_plot(regressor.loss_curve_)
print(f"\tModel Loss:              {regressor.loss_:.3f}")
print(f"\tNumber of Features:      {regressor_pipeline[:-1].transform(X_val).shape[1]}")

In [None]:
print("predictor model: NN")
nn_results = model.predict_and_compare(regressor_pipeline, X_test, y_test)

display(nn_results)
timeline.residuals_plot(
    nn_results,
    attributes=("references", "predictions"),
)

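To help read the metrics and residuals reported above, the sketch below computes a few common regression metrics by hand on a tiny table. Which metrics `validation.print_metrics` prints, and the sign convention of the `residuals` column (here references minus predictions), are assumptions; the numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy results table with the same column names as in the tutorial.
results = pd.DataFrame(
    {
        "references": [10.0, 20.0, 30.0, 40.0],
        "predictions": [12.0, 18.0, 33.0, 39.0],
    }
)

# Assumed convention: a positive residual means the model under-estimates.
results["residuals"] = results["references"] - results["predictions"]

mae = mean_absolute_error(results["references"], results["predictions"])
rmse = np.sqrt(mean_squared_error(results["references"], results["predictions"]))
r2 = r2_score(results["references"], results["predictions"])

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

MAE and RMSE are expressed in the unit of the target (here, a number of validations), whereas R2 is unitless and equals 1 for perfect predictions.
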
## Compare the neural network against a baseline method

### Train a linear regression model

In [None]:
from sklearn.linear_model import LinearRegression

linear_regressor_pipeline = model.make_pipeline(
    transformer,
    LinearRegression(),
)

linear_regressor_pipeline.fit(X_train, y_train);

### Show the predictions of the two models

In [None]:
print("Baseline model: linear regression")
lr_results = model.predict_and_compare(
    linear_regressor_pipeline,
    X_test,
    y_test,
)

validation.print_metrics(lr_results)
timeline.predictions_interval_plot(
    dataset, lr_results, nn_results, names=["GT", "LN", "NN"]
)

## Step 1 (RESULT): Impact of changing students schedules on buses validations

<a id='step_1'></a>

Let's start with the end. We are going to answer the question raised
by our use case:

> Could a change in the time at which students finish have a *significant*
> impact on the number of validations in buses?

To answer this question, we trained a machine learning model above,
which we are going to use as a predictor *(please wait a little for
information on the training process)*. Given a time (and possibly a
group of students), the model outputs an estimate of the number of
bus validations on the campus.

You can play with the time shift below and observe the impact on the
validations. Look for the following comment markers:

```py
####################
# BEGIN : ...
...
# END : ...
####################
```

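The time shift itself is conceptually simple: the end times of the selected classes are delayed, and the model is then asked to predict again. The sketch below shows one way to do this in plain pandas. `shift_end_times` is a hypothetical helper written for this example; it only approximates what `data.shift_datetime_index` is assumed to do, and the values are made up.

```python
import pandas as pd

# Toy classes table indexed by the end time of each class (made-up values).
classes = pd.DataFrame(
    {
        "students": [120, 30, 80],
        "background": ["ISTIC", "Philo", "Math"],
    },
    index=pd.DatetimeIndex(
        ["2021-11-15 10:00", "2021-11-15 10:00", "2021-11-15 12:00"],
        name="fin_cours",
    ),
)


def shift_end_times(frame, delta, backgrounds=None):
    """Hypothetical helper: delay the end of classes for the given backgrounds.

    `backgrounds=None` shifts every row, mirroring the 'do not filter' case.
    """
    out = frame.reset_index()
    time_column = out.columns[0]
    if backgrounds is None:
        mask = pd.Series(True, index=out.index)
    else:
        mask = out["background"].isin(backgrounds)
    out.loc[mask, time_column] = out.loc[mask, time_column] + delta
    return out.set_index(time_column).sort_index()


shifted = shift_end_times(classes, pd.Timedelta(15, unit="min"), ["Philo"])
print(shifted)
```

Only the rows of the selected background move; the other classes keep their original end time.
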

In [None]:
####################
# BEGIN : play

SHIFT_IN_MINUTES = 15

# available backgrounds (None means do not filter and take all of them):
#   ['ISTIC', 'DUT', 'ESIR', 'SVE', 'SPM', 'Math', 'Philo']
BACKGROUNDS = ["Philo"]

# END : play
####################

# TODO test https://stackoverflow.com/questions/29504252
# transformer_weights with union between background and other transformers

# reload original classes dataset for time shifting
classes_dataset_ = io.read_data(classes_path).rename(
    columns={
        "nombre_etudiant": "students",
        "filiere": "background",
    }
)

minutes = Timedelta(SHIFT_IN_MINUTES, unit="min")
classes_dataset_ = data.shift_datetime_index(
    classes_dataset_,
    minutes,
    attribute="background",
    value=BACKGROUNDS,
)

dataset_ = data.merge_dataframes(classes_dataset_, buses_dataset)
dataset_ = preprocessing.timeline_feature_extraction(dataset_, holidays)
dataset_.loc[dataset_["background"] == 0, "background"] = "Empty"
datasets_ = preprocessing.timeline_train_test_split(
    dataset_,
    start_test=start_test,
)

_, test_dataset_ = datasets_
X_test_, y_test_ = data.split_dataframe(test_dataset_, target="validations")

nn_results_ = model.predict_and_compare(
    regressor_pipeline,
    X_test_,
    y_test_,
)

timeline.predictions_interval_plot_with_staggings(
    nn_results,
    nn_results_,
    names=["predictions", "staggings"],
)

####################
# BEGIN : Observe

In [None]:
# END : Observe
####################

### Food for thought

```
####################
# BEGIN : Answer
```

> 1. How can you observe the impact of changing the schedules?
> 2. What is the expected impact of shifting the schedules by 15 min?
> 3. Is the expected impact of a 60 min shift bigger?
> 4. Is there a *small* shift (e.g., less than 60 min) that would result in a large impact?

```
# END : Answer
####################
```

**Now you can go to [Step 2](#step_2).**
