<a href="https://colab.research.google.com/github/jrbalderrama/a2r2/blob/main/notebooks/a2r2-01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RUDI Workshop: Introduction to Privacy-Preserving Data Publishing Techniques

Tristan ALLARD & Javier ROJAS BALDERRAMA

_Univ Rennes, CNRS, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

## Acknowledgments

We warmly thank François Bodin and Luc Lesoil for their support on the data and the definition of the use-case.


# Notebook __ONE__

## Step 0 (STARTER)

<a id='step_0'></a>

This hands-on tutorial introduces you to the issue of *privacy-preserving personal data publishing*. You are going to follow the implementation of a concrete use-case built from open data from the Rennes Metropole area. The main question of the use-case is whether a change in the students' schedules at the Beaulieu campus impacts the load of the buses that go through the campus. We will answer this question based on two datasets: the validations inside the buses that stop close to the campus (with timestamps), and the number of students who finish a class (with timestamps). Our approach consists in training a predictor that outputs the expected number of validations along the day given the number of students finishing a class along the day. However, using raw bus validations to answer this question may lead to privacy issues because validations can be highly identifying. After performing some re-identification attacks, you will use a perturbed version of the bus validations dataset and observe the resulting impact on our ability to answer the main question of the use-case.

We designed this tutorial to be a step-by-step guided tour. You can follow sequentially the "Step i" tag inside the titles of the sections. Up to you to follow the sequence proposed or to deviate from it, but be careful when leaving the track, it's wild out there ;)

We divided the full journey into three topics:

1. The naive version
2. Privacy issues
3. The protected version

For your convenience, there is a dedicated notebook for each topic; **you are currently in Notebook ONE**.

The notebooks also include questions. Please take some time to think about them. Trying to answer these questions can also help you to gain a deeper understanding. And we would love reading your answers!

Ready?

Really??

Please run the whole notebook (it does not take long) and **go directly to [Step 1](#step_1).**


## Step 2 (PREAMBLE): Settings and datasets

<a id='step_2'></a>

Not too disappointed? So let's now have a look at the data on which we trained the model. This step goes as follows:

1. The datasets are downloaded
2. The required libraries are imported and global variables are set up
3. The raw data are aggregated...
4. ... and the results are displayed.
5. The datasets are prepared for the training process.

> Observe the buses validations dataset (section [Display raw data](#sec_display_raw_data))... Can you imagine any issue?


### Download datasets


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet
!wget -nv -nc https://zenodo.org/record/5519319/files/classes.parquet -O classes_filiere.parquet

### Import required modules

In [None]:
import importlib
import os
from datetime import datetime
from errno import ENOENT
from pathlib import Path
from typing import Optional, Sequence, Tuple, Union

import numpy as np
import pandas as pd
import plotly.io as pio
import pyarrow.parquet as pq
import torch
from IPython import display, get_ipython
from numpy import ndarray
from pandas import NA, DataFrame, DatetimeIndex, Series, Timedelta, Timestamp
from plotly import subplots
from plotly.graph_objs import Bar, Candlestick, Figure, Scatter
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from torch import Tensor
from torch.nn import LSTM, Linear, Module, MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

### Setup notebook constants and running environment

In [None]:
# project base directory
BASE_DIRECTORY = Path(".")

# detect running environment
COLAB_ON = "google.colab" in str(get_ipython())

In [None]:
# Set Plotly renderer
if COLAB_ON:
    pio.renderers.default = "colab"

### Load and display raw datasets

#### Read raw data

In [None]:
# Read data from file system and plot data

# load dataset from file system
def load_data(
    path: Path,
) -> DataFrame:
    if not path.exists():
        raise FileNotFoundError(ENOENT, os.strerror(ENOENT), path)

    table = pq.read_table(path)
    return table.to_pandas()


# buses dataset
buses_filename = "buses.parquet"
buses_path = BASE_DIRECTORY.joinpath(buses_filename)
buses_dataset = load_data(buses_path)


# classes dataset
classes_filename = "classes_filiere.parquet"
classes_path = BASE_DIRECTORY.joinpath(classes_filename)
classes_dataset = load_data(classes_path)

#### Display raw data

<a id='sec_display_raw_data'></a>

In [None]:
# show a dataframe as a table
def display_dataframe(
    dataframe: DataFrame,
) -> None:
    if COLAB_ON:
        spec = importlib.util.find_spec("google.colab")
        if spec:
            data_table = importlib.import_module("google.colab.data_table")
            enable_dataframe_formatter = getattr(
                data_table,
                "enable_dataframe_formatter",
            )

            enable_dataframe_formatter()

    # style = dataframe.style.set_caption("Dataframe Caption")
    # display.display(style)

    display.display(dataframe[:20000] if COLAB_ON else dataframe)

**Dataset of buses validations - STAR/Keolis Rennes**

Attributes description:

- `departure_time`: Timestamp of the user's bus validation
- `stop_name`: Name of the bus stop
- `route_short_name`: Number of the bus line
- `direction_id`: Code of the bus direction (0 or 'aller': bus from the city center; 1 or 'retour': bus to the city center)
- `stop_id`: Code (unique) of the bus stop
- `count`: Number of validations
- `stop_lat`: GPS latitude coordinate of the bus stop 
- `stop_lon`: GPS longitude coordinate of the bus stop
- `id`: User identifier (unique)

```
####################
# BEGIN : Observe
```

In [None]:
display_dataframe(buses_dataset)

**Dataset of students - Université de Rennes 1, Campus Beaulieu**

- `fin_cours`: Timestamp of the end of a course
- `filiere`: Students' background or speciality
- `nombre_etudiant`: Number of students, per background, finishing a course

In [None]:
display_dataframe(classes_dataset)

```
# END : Observe
####################
```



#### Food for thought

```
####################
# BEGIN : Answer
```

> 1. Is there any directly identifying information in the raw data?
> 2. Can you describe auxiliary information that could lead to re-identifications?

```
# END : Answer
####################
```


### Pre-process raw data

In [None]:
# pre processing transportation data
def pre_process_by_aggregation_buses(
    dataframe: DataFrame,
    *,
    stops: Optional[Sequence[str]],
    ignore_weekend: bool = False,
) -> DataFrame:

    dataframe_ = dataframe.copy()
    # keep only the validations made at the requested bus stops
    if stops:
        dataframe_ = dataframe_[dataframe_["stop_name"].isin(stops)]

    # remove weekend information
    if ignore_weekend:
        dataframe_ = dataframe_.set_index("departure_time")
        dataframe_ = dataframe_[dataframe_.index.dayofweek < 5]

    # aggregate dataset by stop name and departure time
    dataframe_ = (
        dataframe_.groupby(
            [
                "stop_name",
                "departure_time",
            ]
        )
        .agg({"count": "sum"})
        .reset_index()
    )

    # sum the numeric column only, so that stop names are not aggregated
    return dataframe_.groupby("departure_time")[["count"]].sum()


def pre_process_by_aggregation_classes(
    dataframe: DataFrame,
) -> DataFrame:
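    # sum the students per end-of-course time ("filiere" is a label, not a quantity)
    return dataframe.groupby("fin_cours")[["nombre_etudiant"]].sum()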

In [None]:
# target bus stops
beaulieu = [
    "Les Préales",
    "Tournebride",
    "Beaulieu Chimie",
    "Beaulieu INSA",
    "Beaulieu Restau U",
]


buses_dataset = pre_process_by_aggregation_buses(
    buses_dataset,
    stops=beaulieu,
)

classes_dataset = pre_process_by_aggregation_classes(classes_dataset)
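
The aggregation above can be tried on a toy table. This is a minimal sketch with made-up rows (the stop "République", the timestamps and the counts are illustrative assumptions, not values from the STAR dataset): validations are first restricted to the campus stops, then summed per timestamp.

```python
import pandas as pd

# Hypothetical validations: one row per validation event (not real data)
toy = pd.DataFrame(
    {
        "departure_time": pd.to_datetime(
            [
                "2021-09-06 08:00",
                "2021-09-06 08:00",
                "2021-09-06 08:00",
                "2021-09-06 08:15",
                "2021-09-06 08:15",
            ]
        ),
        "stop_name": [
            "Tournebride",
            "Tournebride",
            "Beaulieu INSA",
            "République",  # outside the campus, filtered out below
            "Beaulieu INSA",
        ],
        "count": [1, 1, 1, 1, 2],
    }
)

campus_stops = ["Tournebride", "Beaulieu INSA"]

# keep the campus stops only, then sum the validations per timestamp
filtered = toy[toy["stop_name"].isin(campus_stops)]
per_timestamp = filtered.groupby("departure_time")[["count"]].sum()
print(per_timestamp)
```

After this step the stop names are gone: each row only says how many validations happened on the campus at a given time.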

### Display aggregated data

In [None]:
# show a timeseries graph of a selected attribute
def plot_dataset(
    dataframe: DataFrame,
    column: str,
) -> None:
    figure = Figure()
    scatter = Scatter(
        x=dataframe.index,
        y=dataframe[column],
        mode="lines",
        name="values",
    )

    figure.add_trace(scatter)
    figure.update_layout(
        showlegend=False,
        title_text=column,
        template="simple_white",
    )

    figure.update_xaxes(showgrid=True)
    figure.show()

#### Number of validations

In [None]:
display_dataframe(buses_dataset)
plot_dataset(buses_dataset, "count")

#### Number of students

In [None]:
display_dataframe(classes_dataset)
plot_dataset(classes_dataset, "nombre_etudiant")

### Merge datasets together

In [None]:
# Merge datasets
def merge_datasets(
    classes: DataFrame,
    buses: DataFrame,
) -> DataFrame:

    # ignore bus entries later than one day after the end of the classes timeline
    buses_ = buses[
        buses.index
        <= classes.index.max()
        + Timedelta(
            1,
            unit="day",
        )
    ]

    # merge datasets
    dataset = pd.merge(
        classes,
        buses_,
        how="outer",
        left_index=True,
        right_index=True,
    )

    # fill empty values
    dataset = dataset.fillna(0)

    return dataset

In [None]:
dataset = merge_datasets(classes_dataset, buses_dataset)
display_dataframe(dataset)
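
To see what the outer merge followed by `fillna(0)` does, here is a small sketch on two toy frames (all values are made up for illustration). Timestamps present in only one of the two sources are kept, and the missing side is filled with zero.

```python
import pandas as pd

# Hypothetical timestamps and values (illustrative only)
t = pd.to_datetime(["2021-09-06 10:00", "2021-09-06 10:15", "2021-09-06 10:30"])
toy_classes = pd.DataFrame({"nombre_etudiant": [120, 80]}, index=t[:2])
toy_buses = pd.DataFrame({"count": [15, 40]}, index=t[1:])

# outer merge on the time index: keep every timestamp from both frames
merged = (
    pd.merge(
        toy_classes,
        toy_buses,
        how="outer",
        left_index=True,
        right_index=True,
    )
    .fillna(0)
    .sort_index()
)
print(merged)
```

Note that a zero produced this way means "no row in the source", which the model cannot distinguish from a true count of zero.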

### Display dataset subsets

In [None]:
# compute a split bound: midnight of the day reached (7 * weeks - 1) days
# after the initial value (min) of the datetime index
def get_timestamp_bound(
    dataframe: DataFrame,
    weeks: int,
) -> Timestamp:
    timedelta = Timedelta(7 * weeks - 1, unit="day")
    timestamp = dataframe.index.min() + timedelta
    return timestamp.normalize()


# show timeline divided by delimiters and holidays
def plot_timeline(
    dataframe: DataFrame,
    columns: Sequence[str],
    delimiters: Sequence[Timestamp],
    holidays: Tuple[Timestamp, Timestamp],
) -> None:
    dmin = dataframe["nombre_etudiant"].values.min()
    dmax = dataframe["nombre_etudiant"].values.max()
    figure = subplots.make_subplots(specs=[[{"secondary_y": True}]])
    for counter, column in enumerate(columns):
        secondary_y = counter % 2 == 1
        scatter = Scatter(
            x=dataframe.index,
            y=dataframe[column],
            mode="lines",
            name=column,
        )

        figure.add_trace(
            scatter,
            secondary_y=secondary_y,
        )

    for delimiter in delimiters:
        figure.add_shape(
            type="line",
            x0=delimiter,
            x1=delimiter,
            y0=dmax,
            y1=0,
            line=dict(
                # color="Gray",
                width=1,
                dash="dashdot",
            ),
        )

    figure.add_shape(
        type="rect",
        xref="paper",
        yref="paper",
        layer="below",
        fillcolor="LightSeaGreen",
        x0=holidays[0],
        x1=holidays[1],
        y0=dmax,
        y1=0,
    )

    figure.add_annotation(
        x=holidays[0],
        y=dmax,
        align="right",
        text="holidays",
        showarrow=False,
        yshift=-25,
        textangle=90,
        xshift=10,
    )

    figure.add_annotation(
        x=delimiters[0],
        y=dmax,
        text="validation",
        showarrow=True,
        yshift=-15,
    )

    figure.add_annotation(
        x=delimiters[1],
        y=dmax,
        text="test",
        showarrow=True,
    )

    figure.update_shapes(dict(xref="x", yref="y"))
    figure.update_yaxes(
        rangemode="tozero",
        # type="log",
    )

    figure.update_xaxes(range=[dataframe.index.min(), dataframe.index.max()])
    figure.update_yaxes(title_text=columns[0], secondary_y=False)
    figure.update_yaxes(title_text=columns[1], secondary_y=True)
    figure.update_layout(
        title_text="Count of Buses & Classes",
        template="simple_white",
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
    )

    figure.show()

In [None]:
la_rentree = Timestamp("2021-09-06")
la_toussaint = Timestamp("2021-11-01")
one_week_timedelta = Timedelta(7, unit="day")

end_train = get_timestamp_bound(dataset, weeks=9)
end_val = get_timestamp_bound(dataset, weeks=10)

plot_timeline(
    dataset,
    ["nombre_etudiant", "count"],
    [end_train, end_val],
    (la_toussaint, la_toussaint + one_week_timedelta),
)

### Enhance data attributes to create a predictive model based on machine learning

#### Improve dataset by mining information from date and time 

In [None]:
# bucketize attribute
def onehot_encode(
    dataframe: DataFrame,
    column: str,
) -> DataFrame:
    dummies = pd.get_dummies(
        dataframe[column],
        prefix=column,
    )

    return pd.concat(
        [dataframe, dummies],
        axis=1,
    ).drop(columns=[column])


# encode (time) column as periodic wave
def periodic_encode(
    dataframe: DataFrame,
    column: str,
    period: int,
    start_num: int = 0,
) -> DataFrame:
    kwargs = {
        f"sin_{column}": lambda x: np.sin(
            2 * np.pi * (x[column] - start_num) / period
        ),
        f"cos_{column}": lambda x: np.cos(
            2 * np.pi * (x[column] - start_num) / period
        ),
    }

    return dataframe.assign(**kwargs).drop(columns=[column])


# mark dataset ranges as holidays
def label_holidays(
    dataframe: DataFrame,
    start: Timestamp,
    end: Timestamp,
    column="holiday",
) -> DataFrame:
    dataframe_ = dataframe.copy()
    dataframe_[column] = 0
    dataframe_.loc[
        (dataframe_.index >= start) & (dataframe_.index < end),
        column,
    ] = 1
    return dataframe_


# add features to the dataset
def add_features(
    dataframe: DataFrame,
    bucketize_date: bool = True,
    periodic_time: bool = True,
    holidays: bool = False,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    if bucketize_date:
        dataframe_ = dataframe_.assign(dayofweek=dataframe_.index.dayofweek)
        # .assign(day=dataframe.index.day)
        # .assign(month=dataset.index.month)
        dataframe_ = onehot_encode(dataframe_, "dayofweek")
        # dataset = onehot_encode(dataset, "month")

    if periodic_time:
        dataframe_ = dataframe_.assign(hour=dataframe_.index.hour)
        dataframe_ = dataframe_.assign(minute=dataframe_.index.minute)
        dataframe_ = periodic_encode(dataframe_, "hour", 24, 0)
        dataframe_ = periodic_encode(dataframe_, "minute", 60, 0)

    if holidays:
        dataframe_ = label_holidays(
            dataframe_,
            la_toussaint,
            la_toussaint + one_week_timedelta,
        )

    # dataframe.drop(["nombre_etudiant"], axis=1, inplace=True)
    return dataframe_

#### Display resulting dataset formatted for the machine learning process

In [None]:
dataset = add_features(dataset, holidays=True)
display_dataframe(dataset)
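
The point of the periodic encoding is that times that are close on the clock stay close as features. The sketch below uses a hypothetical helper `encode` (not part of the notebook) with the same sine/cosine formula as `periodic_encode` and compares distances between a few hours of the day.

```python
import numpy as np


# Hypothetical helper: same formula as periodic_encode, for a single value
def encode(hour, period=24):
    angle = 2 * np.pi * hour / period
    return np.array([np.sin(angle), np.cos(angle)])


# 23h and 0h are one hour apart on the clock, 12h and 0h are opposite
d_23_0 = np.linalg.norm(encode(23) - encode(0))
d_12_0 = np.linalg.norm(encode(12) - encode(0))
print(f"distance 23h-0h: {d_23_0:.3f}, distance 12h-0h: {d_12_0:.3f}")
```

With the raw hour value, 23 and 0 would be the two most distant points of the day; with the encoding they are neighbours.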

#### Split the dataset to train a machine learning tool

In [None]:
# Split a dataset into features (X) and target (y)
def features_split(
    dataframe: DataFrame,
    target: str,
) -> Tuple[DataFrame, DataFrame]:
    y = dataframe[[target]]
    X = dataframe.drop(columns=[target])
    return X, y

In [None]:
train_dataset = dataset[dataset.index < end_train]
val_dataset = dataset[(dataset.index >= end_train) & (dataset.index < end_val)]
test_dataset = dataset[dataset.index >= end_val]

X_train, y_train = features_split(train_dataset, target="count")
X_val, y_val = features_split(val_dataset, target="count")
X_test, y_test = features_split(test_dataset, target="count")
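
The split is chronological: the bounds cut the timeline into three consecutive blocks. Here is a minimal sketch on a made-up daily index (the start date and the 11-week length are assumptions chosen for illustration), with a hypothetical helper `bound_after` that repeats the arithmetic of `get_timestamp_bound`.

```python
import pandas as pd

# Hypothetical daily index covering 11 weeks (illustrative only)
index = pd.date_range("2021-09-06", periods=77, freq="D")
toy = pd.DataFrame({"count": range(77)}, index=index)


def bound_after(frame, weeks):
    # (7 * weeks - 1) days after the first timestamp, truncated to midnight
    delta = pd.Timedelta(7 * weeks - 1, unit="day")
    return (frame.index.min() + delta).normalize()


end_train = bound_after(toy, weeks=9)
end_val = bound_after(toy, weeks=10)

# consecutive, non-overlapping blocks: train, then validation, then test
train = toy[toy.index < end_train]
val = toy[(toy.index >= end_train) & (toy.index < end_val)]
test = toy[toy.index >= end_val]
print(end_train.date(), end_val.date(), len(train), len(val), len(test))
```

Keeping the blocks in time order means the model is always evaluated on dates that come after everything it was trained on.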

## TOOL: A neural network

### Define a neural network model


In [None]:
# Define RNN (LSTM) model
class LSTMModel(Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout):
        super().__init__()

        # Defining the number of layers and the nodes in each layer
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim

        # LSTM layers
        self.lstm = LSTM(
            input_dim,
            hidden_dim,
            layer_dim,
            batch_first=True,
            dropout=dropout,
        )

        # Fully connected layer
        self.fc = Linear(hidden_dim, output_dim)

    def forward(self, x):
        # initializing hidden state for first input with zeros
        h0 = torch.zeros(
            self.layer_dim,
            x.size(0),
            self.hidden_dim,
        ).requires_grad_()

        # initializing cell state for first input with zeros
        c0 = torch.zeros(
            self.layer_dim,
            x.size(0),
            self.hidden_dim,
        ).requires_grad_()

        # We need to detach as we are doing truncated backpropagation through time (BPTT)
        # If we don't, we'll backprop all the way to the start even after going through another batch
        # Forward propagation by passing in the input, hidden state, and cell state into the model
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))

        # Keep the output of the last time step only: the shape goes from
        # (batch_size, seq_length, hidden_dim) to (batch_size, hidden_dim).
        # Unlike `torch.squeeze`, indexing keeps the batch axis when batch_size == 1
        out = out[:, -1, :]

        # Convert the final state to our desired output shape (batch_size, output_dim)
        out = self.fc(out)

        return out
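
The LSTM returns one hidden vector per time step, and only one vector per sample is passed to the fully connected layer. The sketch below compares two ways of dropping the sequence axis. It uses NumPy arrays as a stand-in for tensors, on the assumption (true for the documented behaviour of both libraries) that `squeeze` without arguments removes every axis of size 1; the shapes are made up for illustration.

```python
import numpy as np

hidden_dim = 4

# stand-ins for LSTM outputs of shape (batch_size, seq_length, hidden_dim)
batch_of_three = np.zeros((3, 1, hidden_dim))
batch_of_one = np.zeros((1, 1, hidden_dim))

# squeezing removes *all* axes of size 1, including a batch axis of size 1
squeezed_three = np.squeeze(batch_of_three).shape
squeezed_one = np.squeeze(batch_of_one).shape

# selecting the last time step always keeps the batch axis
last_step_one = batch_of_one[:, -1, :].shape

print(squeezed_three, squeezed_one, last_step_one)
```

The two approaches agree as long as the batch holds more than one sample and the sequence length is 1; they differ for a batch of size 1, as used by the test loader.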

### Configure the neural network

In [None]:
# dimension (neurons) of a hidden layer
HIDDEN_DIM = 64

# number of hidden layers
LAYER_DIM = 3

# number of rows processed at the same time
BATCH_SIZE = 64

# number of iterations during training
EPOCHS = 100

In [None]:
input_dim = len(X_train.columns)  # same as X_train.shape[1]
model = LSTMModel(
    input_dim=input_dim,
    hidden_dim=HIDDEN_DIM,
    layer_dim=LAYER_DIM,
    output_dim=1,
    dropout=0.2,
)

### Train the neural network

In [None]:
# Helper to train the NN model
class RunnerHelper:
    def __init__(self, model, loss_fn, optimizer):
        self.model = model
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.train_losses = []
        self.val_losses = []

    def train_step(self, X, y):

        # set model to train mode
        self.model.train()

        # make predictions
        ŷ = self.model(X)

        # compute loss
        loss = self.loss_fn(ŷ, y)

        # compute gradients
        loss.backward()

        # update parameters
        self.optimizer.step()

        # reset to zero gradients
        self.optimizer.zero_grad()

        # returns loss
        return loss.item()

    def val_step(self, X, y):

        # set model to eval mode
        self.model.eval()

        # make prediction
        ŷ = self.model(X)

        # compute loss
        loss = self.loss_fn(ŷ, y)

        # return loss
        return loss.item()

    def train(self, train_loader, val_loader, n_epochs=50):
        model_path = f'{self.model}_{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}'
        for epoch in range(1, n_epochs + 1):
            batch_train_losses = []
            for x_train, y_train in train_loader:
                # x_train = x_train.view([batch_size, -1, n_features]).to(DEVICE)
                x_train = torch.unsqueeze(x_train, 1)
                train_loss = self.train_step(x_train, y_train)
                batch_train_losses.append(train_loss)

            training_loss = np.mean(batch_train_losses)
            self.train_losses.append(training_loss)
            with torch.no_grad():
                batch_val_losses = []
                for x_val, y_val in val_loader:
                    # x_val = x_val.view([batch_size, -1, n_features]).to(DEVICE)
                    x_val = torch.unsqueeze(x_val, 1)
                    val_loss = self.val_step(x_val, y_val)
                    batch_val_losses.append(val_loss)

                validation_loss = np.mean(batch_val_losses)
                self.val_losses.append(validation_loss)

            if epoch <= 10 or epoch % 20 == 0:
                print(
                    f"[{epoch:3d}/{n_epochs}] Training loss: {training_loss:.4f}"
                    f"\t Validation loss: {validation_loss:.4f}"
                )

        # torch.save(self.model.state_dict(), model_path)

    def evaluate(self, test_loader):
        with torch.no_grad():
            predictions = []
            values = []
            for x_test, y_test in test_loader:
                # x_test = x_test.view([batch_size, -1, n_features]).to(DEVICE)
                x_test = torch.unsqueeze(x_test, 1)
                self.model.eval()
                ŷ = self.model(x_test)
                predictions.append(ŷ.detach().numpy())
                values.append(y_test.detach().numpy())

        return predictions, values

    def plot_losses(self):
        figure = Figure()
        # epochs are numbered from 1, with one loss value per epoch
        tics = [*range(1, len(self.train_losses) + 1)]
        value = Scatter(
            x=tics,
            y=self.train_losses,
            mode="lines",
            name="Training",
            marker=dict(),
        )

        figure.add_trace(value)
        value = Scatter(
            x=tics,
            y=self.val_losses,
            mode="lines",
            name="Validation",
            marker=dict(),
        )

        figure.add_trace(value)
        figure.update_layout(title_text="Losses")
        figure.update_xaxes(title_text="epoch")
        figure.update_yaxes(title_text="loss")
        figure.show()


# rescale results and align it to original time index
def inverse_transform(
    values: Sequence[ndarray],
    predictions: Sequence[ndarray],
    index: DatetimeIndex,
    scaler: MinMaxScaler,
) -> DataFrame:
    vals = np.concatenate(values, axis=0).ravel()
    preds = np.concatenate(predictions, axis=0).ravel()
    dataframe = DataFrame(
        data={
            "value": vals,
            "prediction": preds,
        },
        index=index[: len(vals)],
    )

    dataframe = dataframe.sort_index()
    dataframe = DataFrame(
        scaler.inverse_transform(dataframe),
        columns=dataframe.columns,
        index=dataframe.index,
    )

    return dataframe.astype("int_")


# formatting data for NN
def to_dataloaders(
    dataframe_train: Tuple[DataFrame, DataFrame],
    dataframe_val: Tuple[DataFrame, DataFrame],
    dataframe_test: Tuple[DataFrame, DataFrame],
    scaler: MinMaxScaler,
    batch_size,
    shuffle=False,
    drop_last=True,
) -> Tuple[DataLoader, DataLoader, DataLoader]:

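    # scale features: fit on the training set only, then reuse on val/test
    X_train_arr = scaler.fit_transform(dataframe_train[0])
    X_val_arr = scaler.transform(dataframe_val[0])
    X_test_arr = scaler.transform(dataframe_test[0])

    # scale targets: the same scaler object is refit here, so after this call
    # it holds the target scaling that `inverse_transform` relies on later
    y_train_arr = scaler.fit_transform(dataframe_train[1])
    y_val_arr = scaler.transform(dataframe_val[1])
    y_test_arr = scaler.transform(dataframe_test[1])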

    # transform scaled data to tensors
    train_features = Tensor(X_train_arr)
    train_targets = Tensor(y_train_arr)
    val_features = Tensor(X_val_arr)
    val_targets = Tensor(y_val_arr)
    test_features = Tensor(X_test_arr)
    test_targets = Tensor(y_test_arr)

    # setup tensor datasets
    train = TensorDataset(train_features, train_targets)
    val = TensorDataset(val_features, val_targets)
    test = TensorDataset(test_features, test_targets)

    # setup (tensor) datasets loaders
    train_loader = DataLoader(
        train,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
    )

    val_loader = DataLoader(
        val,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
    )

    test_loader = DataLoader(
        test,
        batch_size=1,
        shuffle=shuffle,
        drop_last=drop_last,
    )

    return train_loader, val_loader, test_loader
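
The loaders above are built with `drop_last=True`, which discards the final incomplete batch. The arithmetic can be sketched in plain Python with a hypothetical helper `batch_sizes` (not a PyTorch function; the sample count of 150 is made up).

```python
# Hypothetical helper mirroring how batches are cut, with or without drop_last
def batch_sizes(n_samples, batch_size, drop_last):
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


kept = batch_sizes(150, 64, drop_last=True)
everything = batch_sizes(150, 64, drop_last=False)
print(kept, sum(kept))
print(everything, sum(everything))
```

With a batch size of 64, up to 63 trailing rows of the training and validation sets are therefore never seen in an epoch, while the test loader (batch size 1) keeps every row.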

In [None]:
scaler = MinMaxScaler()  # RobustScaler()  # StandardScaler()  # MinMaxScaler()
loss_fn = MSELoss()  # L1Loss()
optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
runner = RunnerHelper(model=model, loss_fn=loss_fn, optimizer=optimizer)
train_loader, val_loader, test_loader = to_dataloaders(
    (X_train, y_train),
    (X_val, y_val),
    (X_test, y_test),
    scaler,
    BATCH_SIZE,
)

runner.train(train_loader, val_loader, n_epochs=EPOCHS)
runner.plot_losses()
predictions, values = runner.evaluate(test_loader)
lstm_result = inverse_transform(values, predictions, X_test.index, scaler)
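
Scaling deserves a short illustration, since predictions come out of the network in scaled units and must be mapped back to validation counts. This is a minimal sketch with made-up targets; `target_scaler` is a hypothetical name, fitted on the training targets only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target values (illustrative only)
y_train = np.array([[0.0], [50.0], [100.0]])
y_test = np.array([[25.0], [150.0]])

# fit on the training targets only: min=0 and max=100 define the scale
target_scaler = MinMaxScaler()
target_scaler.fit(y_train)

# test values outside the training range fall outside [0, 1]
scaled = target_scaler.transform(y_test)

# inverse_transform maps scaled values back to the original units
restored = target_scaler.inverse_transform(scaled)
print(scaled.ravel(), restored.ravel())
```

Fitting on the training set alone keeps information from the validation and test periods out of the preprocessing.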

### Visualize the quality of the training

In [None]:
def print_metrics(
    dataframe: DataFrame,
    value: str,
    prediction: str = "prediction",
) -> None:
    result_metrics = {
        "mae": metrics.mean_absolute_error(
            dataframe[value],
            dataframe[prediction],
        ),
        "rmse": metrics.mean_squared_error(
            dataframe[value],
            dataframe[prediction],
        )
        ** 0.5,
        "r2": metrics.r2_score(
            dataframe[value],
            dataframe[prediction],
        ),
    }

    print("\tMean Absolute Error:       ", result_metrics["mae"])
    print("\tRoot Mean Squared Error:   ", result_metrics["rmse"])
    print("\tR^2 Score:                 ", result_metrics["r2"])
    # return result_metrics


# show residuals as kind of OHLC Charts
def plot_residuals(
    dataframe: DataFrame,
) -> None:
    hovertext = []
    for i in range(dataframe.shape[0]):
        hovertext.append(
            f"{dataframe.index[i]}<br>"
            f"Real: {dataframe['value'].iloc[i]}<br>"
            f"Prediction: {dataframe['prediction'].iloc[i]}"
        )

    figure = Figure(
        data=[
            Scatter(
                x=dataframe.index,
                y=dataframe["value"],
                mode="lines",
                name="reference",
                line=dict(color="lightgrey", width=0.6, dash="dot"),
                # opacity=0.6,
                showlegend=False,
            ),
            Scatter(
                x=dataframe.index,
                y=dataframe["prediction"],
                mode="lines",
                name="prediction",
                line=dict(color="lightblue", width=0.6, dash="dot"),
                showlegend=False,
                # opacity=0.6,
            ),
            Candlestick(
                x=dataframe.index,
                open=dataframe["value"],
                high=dataframe["prediction"],
                low=dataframe["prediction"],
                close=dataframe["value"],
                text=hovertext,
                hoverinfo="text",
                name="residuals",
                # line=dict(width=2),
                increasing_line_color="lightseagreen",
                decreasing_line_color="lightsalmon",
            ),
        ]
    )

    figure.update_layout(
        title="Prediction residuals",
        template="simple_white",
        xaxis_rangeslider_visible=True,
    )

    figure.show()

In [None]:
print("NN model: LSTM")
print_metrics(lstm_result, "value")
display_dataframe(lstm_result)
plot_residuals(lstm_result)
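
The three metrics printed above can be recomputed by hand. The sketch below does so with NumPy on three made-up value/prediction pairs, which may help when reading the scores.

```python
import numpy as np

# Hypothetical observed values and predictions (illustrative only)
value = np.array([10.0, 20.0, 30.0])
prediction = np.array([12.0, 18.0, 33.0])
residual = value - prediction

# mean absolute error: average size of the errors
mae = np.mean(np.abs(residual))

# root mean squared error: like MAE, but large errors weigh more
rmse = np.sqrt(np.mean(residual**2))

# R^2: 1 minus the ratio of residual variance to the variance of the values
r2 = 1.0 - np.sum(residual**2) / np.sum((value - value.mean()) ** 2)

print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

MAE and RMSE are expressed in numbers of validations; R² has no unit, and a value of 0 corresponds to always predicting the mean.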

## Compare the neural network against a baseline method

### Train a linear regression model

In [None]:
# Build a baseline model to compare against the RNN model
def baseline_evaluate(
    X_train: DataFrame,
    y_train: DataFrame,
    X_test: DataFrame,
    y_test: DataFrame,
) -> DataFrame:
    model = LinearRegression()
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    dataframe = DataFrame(y_test)
    dataframe = dataframe.assign(prediction=prediction)
    dataframe = dataframe.sort_index()
    return dataframe


def plot_models_prediction_interval(
    dataframe: DataFrame,
    rnn_dataframe: DataFrame,
    baseline_dataframe: DataFrame,
) -> None:
    figure = Figure()
    value = Scatter(
        x=dataframe.index,
        y=dataframe["count"],
        mode="lines",
        name="Reference",
        line=dict(color="rgba(0,0,0, 0.3)", width=1, dash="dot"),
    )

    figure.add_trace(value)
    baseline = Scatter(
        x=baseline_dataframe.index,
        y=baseline_dataframe.prediction,
        mode="lines",
        name="Linear Regression",
        opacity=0.8,
    )

    figure.add_trace(baseline)
    prediction = Scatter(
        x=rnn_dataframe.index,
        y=rnn_dataframe.prediction,
        mode="lines",
        name="LSTM NN",
        # marker=dict(),
        opacity=0.8,
        visible="legendonly",
    )

    figure.add_trace(prediction)
    figure.update_layout(
        showlegend=True,
        title_text="Predictions",
        template="simple_white",
        xaxis=dict(
            range=[
                rnn_dataframe.index.min(),
                rnn_dataframe.index.max(),
            ],
        ),
    )

    figure.update_xaxes(rangeslider_visible=True)
    figure.show()

### Visualize the predictions of the two models

In [None]:
print("Baseline model: linear regression")
baseline_result = baseline_evaluate(X_train, y_train, X_test, y_test)
print_metrics(baseline_result, "count")
plot_models_prediction_interval(dataset, lstm_result, baseline_result)
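
A baseline gives a reference point: if the neural network does not clearly beat a linear fit, its extra complexity is hard to justify. As a reminder of what the baseline does, here is a minimal sketch on made-up data that follows an exact linear rule (the data and the rule are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2 * x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# fit a straight line, then predict for an unseen input
baseline = LinearRegression().fit(X, y)
prediction = baseline.predict(np.array([[4.0]]))
print(baseline.coef_, baseline.intercept_, prediction)
```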

## Step 1 (RESULT): Impact of changing students' schedules on bus validations

<a id='step_1'></a>

Let's start with the end. We are going to answer the question raised
by our use case:

> Could a change in the time at which students finish their classes have a
> *significant* impact on the number of validations in buses?

To answer this question, we have trained above a machine
learning model that we are going to use as a predictor *(please wait
a little bit for information on the training process)*. Given a time
(and possibly a group of students), the model outputs an estimate of
the number of bus validations on the campus.

You can play with the time shift below and observe the impact on the
validations. Search for the following comments:

```py
####################
# BEGIN : ...
...
# END : ...
####################
```
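
Before running the cells, it may help to see what a time shift does to the classes data. This is a minimal sketch on a made-up table (the backgrounds "A" and "B", the times and the head counts are illustrative assumptions): the end of class is delayed by 15 minutes for one background only, and the students are then summed per end time.

```python
import pandas as pd

# Hypothetical classes ending at the same time (illustrative only)
toy = pd.DataFrame(
    {
        "fin_cours": pd.to_datetime(["2021-09-06 10:00", "2021-09-06 10:00"]),
        "filiere": ["A", "B"],
        "nombre_etudiant": [100, 60],
    }
)

# delay the end of class by 15 minutes for background "A" only
delta = pd.Timedelta(15, unit="min")
shifted = toy.copy()
mask = shifted["filiere"] == "A"
shifted.loc[mask, "fin_cours"] = shifted.loc[mask, "fin_cours"] + delta

# students leaving per timestamp, before and after the shift
before = toy.groupby("fin_cours")[["nombre_etudiant"]].sum()
after = shifted.groupby("fin_cours")[["nombre_etudiant"]].sum()
print(before)
print(after)
```

The peak of 160 students at 10:00 is split into two smaller groups, which is the kind of change whose effect on the bus load the predictor is asked to estimate.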


In [None]:
# Test predictions with classes time shift
def shift_time_all(
    dataframe: DataFrame,
    minutes: int,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    timedelta = Timedelta(minutes, unit="min")
    dataframe_.reset_index(inplace=True)
    dataframe_.iloc[:, [0]] += timedelta
    dataframe_.set_index(dataframe_.columns[0], inplace=True)
    display_dataframe(dataframe_)
    return dataframe_


# shift (delta) time for a specific academic background
def shift_time(
    dataframe: DataFrame,
    *,
    backgrounds: Optional[Union[str, Sequence[str]]],
    minutes: int,
) -> DataFrame:
    if not backgrounds:
        return shift_time_all(dataframe, minutes=minutes)

    dataframe_ = dataframe.copy()
    backgrounds_ = [backgrounds] if isinstance(backgrounds, str) else backgrounds
    delta = Timedelta(minutes, unit="min")
    dataframe_.reset_index(inplace=True)
    for background in backgrounds_:
        dataframe_.loc[dataframe_["filiere"] == background, "fin_cours"] = (
            dataframe_["fin_cours"] + delta
        )

    dataframe_.set_index(dataframe_.columns[0], inplace=True)
    display_dataframe(dataframe_)
    return dataframe_


def plot_prediction_interval_with_staggings(
    dataframe: DataFrame,
    staggered: DataFrame,
) -> None:
    figure = subplots.make_subplots(
        rows=4,
        cols=1,
        shared_xaxes=True,
        specs=[
            [{"rowspan": 3}],
            [None],
            [{}],
            [{}],
        ],
        vertical_spacing=0.1,
    )

    prediction_plot = Scatter(
        x=dataframe.index,
        y=dataframe.prediction,
        mode="lines",
        name="prediction",
        # opacity=0.1,
        fill=None,
        showlegend=False,
        # line_color="gray",
        line=dict(color="gray", width=0.1),
        # hoverinfo="x+y",
        # stackgroup='one'
    )

    figure.add_trace(prediction_plot, row=1, col=1)
    staggered_plot = Scatter(
        x=staggered.index,
        y=staggered.prediction,
        mode="lines",
        name="staggered",
        # opacity=0.8,
        fill="tonexty",
        fillcolor="red",
        line=dict(color="gray", width=0.1),
        # hoverinfo="x+y",
        # stackgroup='one'
    )

    figure.add_trace(staggered_plot, row=1, col=1)
    # merge the two results received as arguments rather than notebook globals
    residuals = (
        pd.merge(
            dataframe,
            staggered,
            how="outer",
            left_index=True,
            right_index=True,
        )
        .rename(
            {
                "prediction_x": "prediction",
                "prediction_y": "staggered",
            },
            axis=1,
        )
        .drop(["value_x", "value_y"], axis=1)
        .dropna()
        .astype(int)
    )

    residuals["difference"] = residuals["prediction"] - residuals["staggered"]
    colors = [
        "lightseagreen" if c > 0 else "lightsalmon" for c in residuals["difference"]
    ]
    bar_plot = Bar(
        x=residuals.index,
        y=residuals.difference,
        name="difference",
        showlegend=False,
        marker_color=colors,
    )

    figure.add_trace(bar_plot, row=4, col=1)
    figure.update_xaxes(showticklabels=True, row=1, col=1)
    figure.update_yaxes(
        title_text="difference", row=4, col=1, zeroline=True, zerolinecolor="gray"
    )
    figure.update_xaxes(
        showticklabels=False,
        visible=False,
        row=4,
        col=1,
    )

    figure.update_layout(
        showlegend=True,
        title_text="Predictions and Staggings",
        template="simple_white",
    )

    figure.show()


def evaluate_shift_time(
    buses: DataFrame,
    classes: DataFrame,
    runner: RunnerHelper,
    scaler: MinMaxScaler,
    test_bound: Timestamp,
    *,
    backgrounds: Optional[Union[str, Sequence[str]]],
    minutes: int,
) -> DataFrame:
    # staggered_classes = shift_time_all(classes, minutes=minutes)
    classes_dataset_ = shift_time(
        classes,
        backgrounds=backgrounds,
        minutes=minutes,
    )

    staggered_classes = pre_process_by_aggregation_classes(classes_dataset_)
    dataframe = merge_datasets(staggered_classes, buses)
    dataframe = add_features(dataframe, holidays=True)
    test_dataset = dataframe[dataframe.index >= test_bound]
    X_test, y_test = features_split(
        test_dataset,
        target="count",
    )

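    # X_train, y_train, X_val and y_val are not arguments: they are read from
    # the notebook's global scope, and only the test loader is kept here
    _, _, test_loader = to_dataloaders(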
        (X_train, y_train),
        (X_val, y_val),
        (X_test, y_test),
        scaler,
        BATCH_SIZE,
    )

    predictions, values = runner.evaluate(test_loader)
    staggered_lstm_result = inverse_transform(
        values,
        predictions,
        X_test.index,
        scaler,
    )

    return staggered_lstm_result

In [None]:
####################
# BEGIN : play

SHIFT_IN_MINUTES = 45

# available backgrounds (None means do not filter and take all of them):
#   ['ISTIC', 'DUT', 'ESIR', 'SVE', 'SPM', 'Math', 'Philo']
BACKGROUNDS = None

# END : play
####################


# reload original buses dataset for iterative modifications
buses_dataset = load_data(buses_path)
buses_dataset = pre_process_by_aggregation_buses(
    buses_dataset,
    stops=beaulieu,
)

classes_dataset = load_data(classes_path)
staggered_lstm_result = evaluate_shift_time(
    buses_dataset,
    classes_dataset,
    runner,
    scaler,
    end_val,
    minutes=SHIFT_IN_MINUTES,
    backgrounds=BACKGROUNDS,
)

plot_prediction_interval_with_staggings(
    lstm_result,
    staggered_lstm_result,
)

####################
# BEGIN : Observe

In [None]:
# END : Observe
####################

### Food for thought

```
####################
# BEGIN : Answer
```

> 1. How can you observe the impact of changing the schedules?
> 2. What is the expected impact of shifting the schedules by 15 minutes?
> 3. Is the expected impact of a 60-minute shift bigger?
> 4. Is there a *small* shift (e.g., less than 60 minutes) that would result in a large impact?

```
# END : Answer
####################
```
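
For the first question, one possible way to go beyond eyeballing the plot is to reduce the two prediction curves to a few numbers. The helper below is a hypothetical sketch (`summarize_impact` is not part of the notebook). It assumes two data frames indexed by time, each with a `prediction` column, like the ones passed to `plot_prediction_interval_with_staggings`. The toy values are invented.

```python
import pandas as pd


def summarize_impact(baseline: pd.DataFrame, staggered: pd.DataFrame) -> dict:
    """Compare two prediction curves on their shared timestamps."""
    # Keep only the timestamps present in both results.
    joined = baseline[["prediction"]].join(
        staggered[["prediction"]],
        how="inner",
        lsuffix="_baseline",
        rsuffix="_staggered",
    )
    difference = joined["prediction_staggered"] - joined["prediction_baseline"]
    return {
        "peak_baseline": float(joined["prediction_baseline"].max()),
        "peak_staggered": float(joined["prediction_staggered"].max()),
        "mean_abs_difference": float(difference.abs().mean()),
    }


# Invented predictions, one value every 15 minutes.
index = pd.date_range("2022-01-10 10:00", periods=4, freq="15min")
toy_baseline = pd.DataFrame({"prediction": [100, 300, 200, 100]}, index=index)
toy_staggered = pd.DataFrame({"prediction": [100, 200, 250, 150]}, index=index)

summary = summarize_impact(toy_baseline, toy_staggered)
print(summary)
```

With the notebook's objects, the call would look like `summarize_impact(lstm_result, staggered_lstm_result)`. Repeating it for several values of `SHIFT_IN_MINUTES` gives a rough way to compare shifts, though which summary numbers matter most is a judgment call.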

**Now you can go to [Step 2](#step_2).**

# References

 - https://colab.research.google.com/drive/1enI68fTdPI2w5KKv6jyL0Lcq9Zg3BbLx?usp=sharing