<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/notebooks/a2r2-03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

## Acknowledgments

We warmly thank François Bodin and Luc Lesoil for their support on the data and the definition of the use-case.

# Notebook __THREE__: Protection with differential privacy


## Step 0 (STARTER) 
This notebook introduces you to the fashionable protection measure called _differential privacy_! As in the first notebook, you will answer the question raised by the use case: does a change in the students' schedules at the Beaulieu campus impact the load of the buses that go through the campus? However, motivated by the reidentification attacks that you performed in the second notebook, you will now be much more cautious about privacy. Before performing your analysis, you will perturb the bus validation frequencies so that differential privacy is satisfied. You will then run your analysis on the perturbed dataset. This lets you observe and manipulate a differentially private perturbation algorithm. In real life, of course, the analyst only receives the already perturbed dataset: the perturbation is performed before the data is exported.

You can now go to [Step 1](#step_1). 

## Settings and data


 ### Download datasets


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet
!wget -nv -nc https://zenodo.org/record/5519319/files/classes.parquet -O classes_filiere.parquet

 ### Import required modules

In [None]:
from pathlib import Path

import pandas as pd
from atelier import data, io, plot
from atelier.extensions.star import preprocessing as spg
from atelier.extensions.star import privacy as spy
from atelier.extensions.star.preprocessing import FourierPerturbationTransformer
from atelier.learn import model, preprocessing
from atelier.plot import timeline
from atelier.utils import time
from IPython.display import display
from pandas import Timestamp

### Setup notebook constants and running environment

In [None]:
from atelier.utils import colaboratory

# check running environment
COLAB_ON = colaboratory.setup()

RANDOM_STATE = None

PERIOD = "week"

### Load and display raw datasets

In [None]:
# buses dataset
buses_path = Path("buses.parquet")
buses_dataset = io.read_data(buses_path)

# classes dataset
classes_path = Path("classes_filiere.parquet")
classes_dataset = io.read_data(classes_path).rename(
    columns={
        "nombre_etudiant": "students",
        "filiere": "background",
    }
)

## Step 1 (SOLUTION): Sound protection with differential privacy
<a id='step_1'></a>

A simple way to satisfy differential privacy when exporting the bus validation dataset would be to first aggregate the bus validations (i.e., count the number of validations at each timestamp) and then add to each count a random value sampled from a suitably parameterized Laplace distribution. However, the longer the time series, the larger the noise. In our use case, the perturbation would probably be larger than the counts themselves. We therefore use the _FPA algorithm_ (Rastogi and Nath, see the References), which has been specifically designed for perturbing long time series. It implements a simple idea: transform the time series to its Fourier representation, perturb the first _k_ coefficients and discard the others, then transform the perturbed coefficients back to the initial representation.
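
To see why the noise grows with the length of the series, here is a minimal sketch of the naive Laplace approach. It assumes that one individual can change each of the T counts by at most `sensitivity`, so that the Laplace scale is T × `sensitivity` / ε. The function name `laplace_perturb` and this calibration are illustrative assumptions, not the method used later in this notebook.

```python
import numpy as np


def laplace_perturb(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise to every count of a series (illustrative only)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # Assumption: one individual may affect all T counts, so the noise
    # scale grows linearly with the length T of the series.
    scale = counts.size * sensitivity / epsilon
    return counts + rng.laplace(0.0, scale, size=counts.size)


# One day versus one week of hourly counts, same privacy budget
day = laplace_perturb(np.full(24, 100.0), epsilon=0.1, seed=0)
week = laplace_perturb(np.full(168, 100.0), epsilon=0.1, seed=0)
print(np.abs(day - 100).mean(), np.abs(week - 100).mean())
```

With the same ε, the average perturbation on the week-long series is several times larger than on the day-long one, and both already exceed the counts themselves.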

You can have a look at our implementation of the FPA algorithm, but this is not mandatory. To progress, you only need to know that you can tune two parameters: _k_ and _ε_.
- The higher the value of _k_, the more precise the Fourier representation, but the higher the noise. A typical value of _k_ is on the order of 20.
- The higher the value of _ε_, the lower the noise (and thus the weaker the privacy guarantees). A typical value of _ε_ is on the order of 0.1.
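
The role of _k_ and _ε_ can be sketched in a few lines of NumPy. This is only an illustration, not the `atelier` implementation used in this notebook: the function name `fpa_sketch`, the `sensitivity` parameter and its default value, and the orthonormal DFT convention are assumptions made for the example. The noise scale follows the form λ = √k · Δ₂ / ε described by Rastogi and Nath (see the References), where Δ₂ is the L2 sensitivity of the series. Choosing Δ₂ correctly for real data is what makes the guarantee hold, and it is not addressed here.

```python
import numpy as np


def fpa_sketch(series, k, epsilon, sensitivity=1.0, seed=None):
    """Illustrative Fourier perturbation of a real-valued time series."""
    rng = np.random.default_rng(seed)
    x = np.asarray(series, dtype=float)

    # 1. Transform the series to its Fourier representation
    coeffs = np.fft.rfft(x, norm="ortho")
    k = min(k, coeffs.size)

    # 2. Perturb the first k coefficients (real and imaginary parts)
    #    Assumed noise scale: sqrt(k) * sensitivity / epsilon
    scale = np.sqrt(k) * sensitivity / epsilon
    noise = rng.laplace(0.0, scale, size=k) + 1j * rng.laplace(0.0, scale, size=k)

    # 3. Discard the remaining coefficients by leaving them at zero
    perturbed = np.zeros_like(coeffs)
    perturbed[:k] = coeffs[:k] + noise

    # 4. Transform back to the initial representation
    return np.fft.irfft(perturbed, n=x.size, norm="ortho")


# Synthetic week of hourly counts with a daily cycle (made-up data)
hours = np.arange(168)
counts = 100 + 50 * np.sin(2 * np.pi * hours / 24)
protected = fpa_sketch(counts, k=20, epsilon=0.1, seed=0)
print(np.abs(protected - counts).mean())
```

With a fixed seed, increasing `epsilon` shrinks the noise, while increasing `k` keeps more detail of the series but raises the noise scale.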

You can now observe the time series resulting from the use of different values for these parameters [here](#observeexampledp).

### Fourier Perturbation Algorithm (FPA)

In [None]:
# target bus stops
beaulieu_stops = [
    "Les Préales",
    "Tournebride",
    "Beaulieu Chimie",
    "Beaulieu INSA",
    "Beaulieu Restau U",
]

# aggregate size
Ν = [3500]

# Fourier coefficients
Κ = [30, 35, 40]  ## max len = 3

# perturbation budget
Ε = [0.01, 0.05, 0.1, 0.15, 0.20]  # week ## max len = 3
# Ε = [0.01, 0.05, 0.1, 0.25, 0.5]  # day ## max len = 3
fpas = spy.get_fourier_perturbations(
    buses_dataset,
    Ν,
    Κ,
    Ε,
    attribute="stop_name",
    value=beaulieu_stops,
    period=PERIOD,
    random_state=RANDOM_STATE,
)

<a id='observeexampledp'></a>

In [None]:
plot.facet_plot(fpas, 3500, row="ε", col="k")

####################
# BEGIN : Observe

In [None]:
# END : Observe
####################

__Food for thought__

```
####################
# BEGIN : Answer
```

- Which pair of parameters seems to perturb the original time series the least?

```
# END : Answer
####################
```

You can now [play with these parameters](#playwithdp) in order to observe the impact of the FPA algorithm on the time series representing the number of validations at each timestamp. 

### Training a *safe* neural network

<a id='playwithdp'></a>

In [None]:
####################
# BEGIN : Play

BUDGET = 0.1

COEFFICIENTS = 40

# END : Play
####################

You can [observe here](#observedp) the resulting time series and [compare here](#observedistributionsdp) the distribution of the original counts with the distribution of the perturbed counts.

Hmm... Do you want to observe the resulting predictions? Sure?
Well, you can go to [Step 2](#step_2) and come back later to try other parameters!

In [None]:
buses_dataset = data.aggregate_dataframe(
    buses_dataset,
    by="departure_time",
    agg={"count": "sum"},
    attribute="stop_name",
    value=beaulieu_stops,
).rename(columns={"count": "validations"})
display(buses_dataset)

protected_buses_dataset = buses_dataset.copy()
transformer = FourierPerturbationTransformer(
    BUDGET,
    COEFFICIENTS,
    PERIOD,
    RANDOM_STATE,
)

protected_buses_dataset["fpa"] = transformer.fit_transform(
    protected_buses_dataset["validations"]
)

<a id='observedp'></a>

In [None]:
####################
# BEGIN : Observe

In [None]:
timeline.plot(protected_buses_dataset, "fpa")

In [None]:
# END : Observe
####################

You can go back to [playing with the parameters](#playwithdp).

#### Display the merged dataset (students, buses, and perturbed buses)

In [None]:
dataset = data.merge_dataframes(classes_dataset, protected_buses_dataset)
display(dataset)

timeline.plot_with_annotations(
    dataset,
    attributes=["validations", "fpa"],
    title="Count of Students & Validations",
    secondary_y=False,
)

display(dataset)

#### Display distributions

<a id='observedistributionsdp'></a>

You can observe below the distributions of (1) the original counts of bus validations (_validations_), (2) the perturbed counts of bus validations (_fpa_), and (3) the differences between the perturbed counts and the original counts (_noise_).
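
The _noise_ series is simply the difference between the two columns. Here is a minimal sketch with pandas, using a small made-up frame: the values are invented for illustration, and only the column names `validations` and `fpa` come from this notebook.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame(
    {
        "validations": [120, 80, 45, 10],
        "fpa": [131.5, 72.0, 51.25, -3.0],
    }
)

# Noise = perturbed count minus original count
toy["noise"] = toy["fpa"] - toy["validations"]

# Summary statistics of the three distributions
print(toy[["validations", "fpa", "noise"]].describe())
```

Note that perturbed counts can be negative or fractional, since the noise is added without regard to the fact that the original values are counts.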

When done, you can go back to [playing with the parameters](#playwithdp).

In [None]:
####################
# BEGIN : Observe

print("Distributions of datasets")
plot.distributions_plot(
    dataset,
    attributes=["validations", "fpa"],
    curve_type="kde",
)

In [None]:
# END: Observe
####################

You can go back to [playing with the parameters](#playwithdp).

#### Data preparation for the neural network

In [None]:
# Add features (motifs) to the dataset
la_toussaint = Timestamp("2021-11-01")

start_test = time.get_next_monday(dataset, weeks=10)

In [None]:
# swap col. names to use the column_transformer of timeline_feat_extraction
dataset[["validations", "fpa"]] = dataset[["fpa", "validations"]]

holidays = pd.date_range(start=la_toussaint, periods=7, freq="D")
dataset_with_features = preprocessing.timeline_feature_extraction(dataset, holidays)
dataset_with_features.loc[
    dataset_with_features["background"] == 0, "background"
] = "Empty"

#### Split datasets to train a machine learning tool

In [None]:
# train - test dataset split
train_test_datasets = preprocessing.timeline_train_test_split(
    dataset_with_features,
    start_test=start_test,
)

train_dataset, test_dataset = train_test_datasets
X_train, y_train = data.split_dataframe(train_dataset, target="validations")
X_test, y_test = data.split_dataframe(test_dataset, target="validations")

#### Define the neural network


In [None]:
transformer = spg.make_column_transformer()

# default regressor is interchangeable with others like LinearRegression
regressor = model.get_default_regressor(random_state=RANDOM_STATE)

# k corresponds to the number of features to retain after encoding ~80 %
regressor_pipeline = model.make_pipeline(
    transformer,
    regressor,
    # k=48,
    # random_state=RANDOM_STATE
)

#### Train the neural network

In [None]:
# MLPRegressor requires a 1-D NumPy array for 'y', but not all regressors do
regressor_pipeline.fit(X_train, y_train.to_numpy().ravel())

# set up a dataframe with the results of training
protected_results = model.predict_and_compare(regressor_pipeline, X_test, y_test)

plot.residuals_plot(
    protected_results,
    attributes=("predictions", "residuals"),
)

### Comparing with a neural network trained on the original dataset

In [None]:
dataset = data.merge_dataframes(classes_dataset, buses_dataset)
dataset_with_features = preprocessing.timeline_feature_extraction(dataset, holidays)
dataset_with_features.loc[
    dataset_with_features["background"] == 0, "background"
] = "Empty"

train_test_datasets = preprocessing.timeline_train_test_split(
    dataset_with_features,
    start_test=start_test,
)

train_dataset, test_dataset = train_test_datasets
X_train, y_train = data.split_dataframe(train_dataset, target="validations")
X_test, y_test = data.split_dataframe(test_dataset, target="validations")

regressor_2 = model.get_default_regressor(random_state=RANDOM_STATE)

# use the fresh regressor so that this model shares no state with the protected one
regressor_pipeline_2 = model.make_pipeline(
    transformer,
    regressor_2,
    # k=48,
    # random_state=RANDOM_STATE
)

regressor_pipeline_2.fit(X_train, y_train.to_numpy().ravel())

# set up a dataframe with the results of training
results = model.predict_and_compare(regressor_pipeline_2, X_test, y_test)

## Step 2 (COMPARE): Visualize the predictions of the two models
<a id='step_2'></a>

In [None]:
timeline.predictions_interval_plot(
    dataset,
    protected_results,
    results,
    names=["GT", "FPA", "RAW"],
)

####################
# BEGIN : Observe

In [None]:
# END : Observe
####################

Want to try other values? Go back to [playing with the parameters](#playwithdp).

__Food for thought__
```
####################
# BEGIN : Answer
```

- How do the parameters affect the quality of the predictions? You need to try several values to build a reliable intuition.

```
# END : Answer
####################
```
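
To compare the two models with numbers rather than by eye, one option is to summarize their residuals. The sketch below assumes that the residuals are available as a sequence of numbers, for instance from a `residuals` column as suggested by the `residuals_plot` call above (the exact layout of `results` and `protected_results` is an assumption). The helper `error_summary` is hypothetical and not part of `atelier`.

```python
import numpy as np


def error_summary(residuals):
    """Return the mean absolute error and root mean squared error."""
    r = np.asarray(residuals, dtype=float)
    return {
        "mae": float(np.mean(np.abs(r))),
        "rmse": float(np.sqrt(np.mean(r**2))),
    }


# Made-up residuals, for illustration only
print(error_summary([3, -4]))
```

Computing this summary for both models each time you change _k_ or _ε_ gives a simple table of how the prediction error moves with the parameters.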

# References

- Vibhor Rastogi and Suman Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, Indianapolis, IN, USA. [[DOI]](https://doi.org/10.1145/1807167.1807247)