# Anonymeter Vehicle coupon recommendation notebook

This notebook demonstrates the use of `Anonymeter`, a software package that derives GDPR-aligned measures of the privacy of synthetic datasets in an empirical, attack-based fashion, on the vehicle coupon recommendation dataset.

`Anonymeter` contains privacy evaluators that measure the risks of singling out, linkability, and inference to which data donors might be exposed following the release of a synthetic dataset. These risks are the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). For more details, please refer to [M. Giomi et al. 2022](https://petsymposium.org/popets/2023/popets-2023-0055.php).

### Basic usage pattern

For each of these privacy risks, `Anonymeter` provides an `Evaluator` class. The high-level classes `SinglingOutEvaluator`, `LinkabilityEvaluator`, and `InferenceEvaluator` are the only things you need to import from `Anonymeter`.

Despite the different nature of the privacy risks they evaluate, these classes have the same interface and are used in the same way. To instantiate the evaluator you have to provide three dataframes: the original dataset `ori` which has been used to generate the synthetic data, the synthetic data `syn`, and a `control` dataset containing original records which have not been used to generate the synthetic data. 

Another parameter common to all evaluators is the number of target records to attack (`n_attacks`). A higher number will reduce the statistical uncertainties on the results, at the expense of a longer computation time.

```python
evaluator = *Evaluator(ori: pd.DataFrame, 
                       syn: pd.DataFrame, 
                       control: pd.DataFrame,
                       n_attacks: int)
```
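If only a single original table is available, one way to obtain `ori` and `control` is to hold out a random subset of rows before training the generative model. The following is a minimal sketch on toy data; the 80/20 ratio, the `random_state`, and the name `full` are illustrative assumptions, not something prescribed by `Anonymeter`.

```python
import pandas as pd

# Toy stand-in for the complete original table (hypothetical data).
full = pd.DataFrame({"age": range(20, 30), "income": range(10)})

# Hold out 20% of the rows as control; only the remaining rows
# should be used to train the synthetic data generator.
control = full.sample(frac=0.2, random_state=0)
ori = full.drop(index=control.index)

# The two sets must not share records, otherwise the control
# attack no longer targets records unseen by the generator.
assert set(ori.index).isdisjoint(control.index)
print(len(ori), len(control))
```

The important property is that the control rows never reach the generator, so any success against them reflects general patterns rather than leaked records.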

Once the evaluator is instantiated, calling the `evaluate` method runs the evaluation pipeline. The resulting estimate of the risk can then be accessed using the `risk()` method.

```python
evaluator.evaluate()
risk = evaluator.risk()
```

### A peek under the hood

In `Anonymeter` each privacy risk is derived from a privacy attacker whose task is to use the synthetic dataset to come up with a set of *guesses* of the form:
- "there is only one person with attributes X, Y, and Z" (singling out)
- "records A and B belong to the same person" (linkability)
- "a person with attributes X and Y also has Z" (inference)

Each evaluation consists of running three different attacks:
- the "main" privacy attack, in which the attacker uses the synthetic data to guess information on records in the original data.
- the "control" privacy attack, in which the attacker uses the synthetic data to guess information on records in the control dataset. 
- the "baseline" attack, which models a naive attacker who ignores the synthetic data and guesses randomly.

By checking how many of these guesses are correct, the success rates of the different attacks are measured and used to derive an estimate of the privacy risk. In particular, the "control attack" is used to separate what the attacker learns from the *utility* of the synthetic data from what is instead an indication of privacy leaks. The "baseline attack" functions as a sanity check: the "main attack" should outperform random guessing for the results to be trusted.
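To make the role of the control attack concrete, the sketch below computes the risk as the excess success of the main attack over the control attack, normalized by the maximum possible excess. This follows the formula in the accompanying paper as I read it; the function name `residual_risk` is hypothetical, and the sketch ignores the confidence interval that `Anonymeter` also reports.

```python
def residual_risk(attack_rate: float, control_rate: float) -> float:
    """Excess success of the main attack over the control attack, normalized."""
    if control_rate >= 1.0:
        raise ValueError("control_rate must be below 1 for the risk to be defined")
    return (attack_rate - control_rate) / (1.0 - control_rate)


# Success rates copied from the linkability results printed in this notebook.
attack_rate = 0.03198571729667676
control_rate = 0.025485518925797278
print(residual_risk(attack_rate, control_rate))
```

With these inputs the result agrees with the `PrivacyRisk` value reported for the linkability evaluation in this notebook (about 0.00667). When the main attack does no better than the control attack, the numerator is zero and so is the estimated risk.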

In [2]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from anonymeter.evaluators import SinglingOutEvaluator
from anonymeter.evaluators import LinkabilityEvaluator
from anonymeter.evaluators import InferenceEvaluator

## Downloading the data

For this example, we will use the `Vehicle coupon recommendation` dataset (more details [here](https://archive.ics.uci.edu/dataset/603/in+vehicle+coupon+recommendation)). For the purpose of demonstrating `Anonymeter`, we will treat each row as if it referred to a real individual.

The synthetic version has been generated by `CTGAN` from [SDV](https://sdv.dev/SDV/user_guides/single_table/ctgan.html), as explained in the paper accompanying this code release. For details on the generation process, e.g. regarding hyperparameters, see Section 6.2.1 of [the accompanying paper](https://petsymposium.org/popets/2023/popets-2023-0055.php).

In [3]:
# bucket_url = "https://storage.googleapis.com/statice-public/anonymeter-datasets/"

ori = pd.read_csv("../tests/datasets/vehicle_train.csv")
syn = pd.read_csv("../tests/datasets/vehicle_syn.csv")
control = pd.read_csv("../tests/datasets/vehicle_control.csv")

In [4]:
ori.head(10)

Unnamed: 0,destination,passanger,weather,time,coupon,expiration,gender,age,maritalStatus,has_children,education,occupation,income,car,Bar,CoffeeHouse,toCoupon_GEQ25min,direction_same_or_opp
0,No Urgent Place,Alone,Sunny,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
1,No Urgent Place,Friend(s),Sunny,10AM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
2,No Urgent Place,Friend(s),Sunny,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
3,No Urgent Place,Friend(s),Sunny,2PM,Coffee House,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
4,No Urgent Place,Friend(s),Sunny,2PM,Coffee House,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
5,No Urgent Place,Friend(s),Sunny,6PM,Restaurant(<20),2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
6,No Urgent Place,Friend(s),Sunny,2PM,Carry out & Take away,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
7,No Urgent Place,Kid(s),Sunny,10AM,Restaurant(<20),2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
8,No Urgent Place,Kid(s),Sunny,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0
9,No Urgent Place,Kid(s),Sunny,10AM,Bar,1d,Female,21,Unmarried partner,1,Some college - no degree,Unemployed,$37500 - $49999,,never,never,0,0


As shown above, the dataset contains demographic information as well as information regarding the education, financial situation, and personal life of some tens of thousands of "individuals".

### Measuring the singling out risk

The `SinglingOutEvaluator` tries to measure how much the synthetic data can help an attacker find combinations of attributes that single out records in the training data.

With the following code we evaluate the robustness of the synthetic data to "univariate" singling out attacks, which try to find unique values of a single attribute that single out an individual.
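The idea behind a univariate attack can be illustrated on toy data. The sketch below is a simplified illustration and not `Anonymeter`'s implementation: the attacker collects attribute values that occur exactly once in the synthetic data, then each guess is scored by checking whether it also matches exactly one record in the original data. The data and the helper names `univariate_guesses` and `singles_out` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): the attacker sees `syn`; `ori` is the training data.
syn = pd.DataFrame({"age": [21, 21, 36, 87], "income": [10, 20, 20, 20]})
ori = pd.DataFrame({"age": [21, 36, 36, 87], "income": [10, 10, 20, 30]})


def univariate_guesses(syn):
    """Return (column, value) pairs that occur exactly once in the synthetic data."""
    guesses = []
    for col in syn.columns:
        counts = syn[col].value_counts()
        guesses += [(col, val) for val in counts[counts == 1].index]
    return guesses


def singles_out(df, col, val):
    """True if exactly one record in `df` has `val` in column `col`."""
    return int((df[col] == val).sum()) == 1


guesses = univariate_guesses(syn)
hits = [(col, val) for col, val in guesses if singles_out(ori, col, val)]
print(f"{len(hits)} of {len(guesses)} guesses single out a record in ori")
```

Here only the guess on `age == 87` succeeds, because the other candidate values appear more than once in `ori`.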


##### NOTE:

The `SinglingOutEvaluator` can sometimes raise a `RuntimeError`. This happens when not enough singling out queries are found. Increasing `n_attacks` will make this condition less frequent and the evaluation more robust, although much slower.


In [112]:
for i in range(50,500,40):

    evaluator = SinglingOutEvaluator(ori=ori, 
                                    syn=syn, 
                                    control=control,
                                    n_attacks=i)

    try:
        evaluator.evaluate(mode='univariate')
        risk = evaluator.risk()
        print(risk)

    except RuntimeError as ex:
        # Note the trailing space after "cell." so the concatenated message reads correctly.
        print(f"Singling out evaluation failed with {ex}. Please re-run this cell. "
            "For more stable results increase `n_attacks`. Note that this will "
            "make the evaluation slower.")

Found 4 failed queries out of 50. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 50. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.05231670688394393))


Found 5 failed queries out of 90. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 90. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.02955069655328399))


Found 8 failed queries out of 130. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 130. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.02059055914792904))


Found 7 failed queries out of 170. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 170. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.01579984909951302))


Found 10 failed queries out of 210. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 210. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.012817630390946876))


Found 16 failed queries out of 250. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 250. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.010782445684877122))


Found 19 failed queries out of 290. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 290. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.0093049972410264))


Found 21 failed queries out of 330. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 330. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.008183645494474553))


Found 21 failed queries out of 370. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 370. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.007303496059679765))


Found 19 failed queries out of 410. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 410. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.006594282316527336))


Found 28 failed queries out of 450. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 450. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.006010615147718209))


Found 17 failed queries out of 490. Check DEBUG messages for more details.
Attack `univariate` could generate only 0 singling out queries out of the requested 490. This can probably lead to an underestimate of the singling out risk.
  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.005521868503117727))


In [63]:
evaluator.queries()[:3]

[]

### Inspecting the results in more detail

There are two methods to inspect the results. The high-level `risk()` method gives an estimate of the privacy risk together with its confidence interval.

In [64]:
evaluator.risk(confidence_level=0.95)

PrivacyRisk(value=0.0, ci=(0.0, 0.01787986137444578))

For more information, the `results()` method gives the success rates of the three attacks (the "main" one, the baseline one, and the one against control) that enter the `Anonymeter` risk calculation.

In [65]:
res = evaluator.results()

print("Success rate of main attack:", res.attack_rate)
print("Success rate of baseline attack:", res.baseline_rate)
print("Success rate of control attack:", res.control_rate)

Success rate of main attack: SuccessRate(value=0.012485122184038298, error=0.012485122184038298)
Success rate of baseline attack: SuccessRate(value=0.012485122184038298, error=0.012485122184038298)
Success rate of control attack: SuccessRate(value=0.012485122184038298, error=0.012485122184038298)


Note that you can obtain the `PrivacyRisk` from the attack results by:

In [39]:
res.risk()

PrivacyRisk(value=0.0, ci=(0.0, 0.026651316150077816))

### Checking singling out with multivariate predicates

The `SinglingOutEvaluator` can also attack the dataset using predicates that combine different attributes. These are the so-called `multivariate` predicates.

To run the analysis using the `multivariate` singling out attack, set the `mode` parameter of `evaluate` to `'multivariate'`. The number of attributes used in the attacker queries is controlled via the `n_cols` parameter, set to 4 in this example.

In [113]:
evaluator = SinglingOutEvaluator(ori=ori, 
                                 syn=syn, 
                                 control=control,
                                 n_attacks=100, # this attack takes longer
                                 n_cols=4)


try:
    evaluator.evaluate(mode='multivariate')
    risk = evaluator.risk()
    print(risk)

except RuntimeError as ex:
    # Note the trailing space after "cell." so the concatenated message reads correctly.
    print(f"Singling out evaluation failed with {ex}. Please re-run this cell. "
          "For more stable results increase `n_attacks`. Note that this will "
          "make the evaluation slower.")

Found 20 failed queries out of 100. Check DEBUG messages for more details.


PrivacyRisk(value=0.027566373942771533, ci=(0.0, 0.07049101497097979))


In [67]:
evaluator.queries()[:3]

["occupation== 'Healthcare Support' & time== '7AM' & coupon== 'Restaurant(20-50)' & has_children>= 1",
 "direction_same_or_opp<= 0 & passanger== 'Kid(s)' & toCoupon_GEQ25min>= 1 & income== '$62500 - $74999'",
 "Bar.isna() & coupon== 'Bar' & has_children>= 1 & income== '$62500 - $74999'"]
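The queries above have the form of pandas `DataFrame.query` expressions. Assuming they are evaluated that way, a predicate singles out an individual when it matches exactly one record. The sketch below checks this on a toy dataframe; the rows and the helper name `singles_out` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the original data (hypothetical rows).
df = pd.DataFrame({
    "time": ["7AM", "7AM", "2PM"],
    "coupon": ["Bar", "Coffee House", "Bar"],
    "has_children": [1, 0, 1],
})


def singles_out(df, query):
    """True if the predicate matches exactly one record."""
    return len(df.query(query, engine="python")) == 1


# Matches only the first row, so it singles out a record.
print(singles_out(df, "time == '7AM' & coupon == 'Bar' & has_children >= 1"))
# Matches the first and the third row, so it does not.
print(singles_out(df, "coupon == 'Bar' & has_children >= 1"))
```

Adding attributes to a predicate makes it more selective, which is why multivariate queries are more likely to isolate a single record than univariate ones.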

### Measuring the linkability risk

The `LinkabilityEvaluator` measures how much the synthetic data will help an adversary who tries to link two other datasets based on a subset of attributes.

For example, suppose that the adversary finds dataset A containing, among other fields, information about the profession and education of people, and dataset B containing some demographic and health related information. Can the attacker use the synthetic dataset to link these two datasets?

To run the `LinkabilityEvaluator` one needs to specify which columns of auxiliary information are available to the attacker, and how they are distributed between the two datasets A and B. This is done using the `aux_cols` parameter.

In [128]:
aux_cols = [
    ['destination', 'weather', 'time', 'Bar', 'coupon'],
    [ 'gender', 'maritalStatus', 'income', 'age', 'education']
]

evaluator = LinkabilityEvaluator(ori=ori, 
                                syn=syn, 
                                control=control,
                                n_attacks=150,
                                aux_cols=aux_cols,
                                n_neighbors=10)

evaluator.evaluate(n_jobs=-2)  # n_jobs follows the joblib convention: -1 = all cores, -2 = all except one
evaluator.risk()

PrivacyRisk(value=0.006670191666843522, ci=(0.0, 0.04074888576425517))

In [129]:
res = evaluator.results()

print("Success rate of main attack:", res.attack_rate)
print("Success rate of baseline attack:", res.baseline_rate)
print("Success rate of control attack:", res.control_rate)

Success rate of main attack: SuccessRate(value=0.03198571729667676, error=0.025160965973709315)
Success rate of baseline attack: SuccessRate(value=0.025485518925797278, error=0.02182138977457023)
Success rate of control attack: SuccessRate(value=0.025485518925797278, error=0.02182138977457023)


As the results show, the attack is not very successful and the linkability risk is low. The `n_neighbors` parameter can be used to allow weaker, indirect links to be scored as successes, so it has an impact on the risk estimate. To check the measured risk for different values of `n_neighbors` you don't have to re-run the evaluation. Instead, do:

In [73]:
print(evaluator.risk(n_neighbors=7))

PrivacyRisk(value=0.003985322385878429, ci=(0.0, 0.0212497131298653))
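The effect of `n_neighbors` can be pictured with a toy sketch. As I understand the attack, each half of a target record (the columns from dataset A and those from dataset B) is matched to its nearest synthetic records, and a link is scored when the two neighbor sets overlap. The code below is a simplified illustration under that assumption: it uses Euclidean distance on numeric toy data, whereas the real evaluator handles mixed data types, and the names `nearest` and `is_linked` are hypothetical.

```python
import numpy as np


def nearest(syn_part, target_part, k):
    """Indices of the k synthetic rows closest to the target (Euclidean distance)."""
    dist = np.linalg.norm(syn_part - target_part, axis=1)
    return set(np.argsort(dist)[:k].tolist())


def is_linked(syn, target, cols_a, cols_b, k):
    """Score a link when both halves share at least one synthetic neighbor."""
    near_a = nearest(syn[:, cols_a], target[cols_a], k)
    near_b = nearest(syn[:, cols_b], target[cols_b], k)
    return len(near_a & near_b) > 0


# Three synthetic records with four numeric attributes (hypothetical data).
syn = np.array([[0.0, 0.0, 5.0, 5.0],
                [1.0, 1.0, 0.0, 0.0],
                [9.0, 9.0, 9.0, 9.0]])
target = np.array([0.1, 0.1, 0.2, 0.2])
cols_a, cols_b = [0, 1], [2, 3]

print(is_linked(syn, target, cols_a, cols_b, k=1))  # strict: closest neighbor only
print(is_linked(syn, target, cols_a, cols_b, k=2))  # weaker: two nearest neighbors
```

With `k=1` the two halves point to different synthetic records and no link is found, while `k=2` lets the neighbor sets overlap. A larger neighborhood therefore counts weaker links as successes.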


