# Anonymeter example notebook

This example notebook demonstrates the usage of `Anonymeter`, a software package that derives GDPR-aligned measures of the privacy of synthetic datasets in an empirical, attack-based fashion.

`Anonymeter` contains privacy evaluators that measure the risks of singling out, linkability, and inference that data donors might incur following the release of a synthetic dataset. These risks are the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). For more details, please refer to [M. Giomi et al. 2022](https://petsymposium.org/popets/2023/popets-2023-0055.php).

### Basic usage pattern

For each of these privacy risks, `Anonymeter` provides an `Evaluator` class. The high-level classes `SinglingOutEvaluator`, `LinkabilityEvaluator`, and `InferenceEvaluator` are the only things that you need to import from `Anonymeter`.

Despite the different nature of the privacy risks they evaluate, these classes have the same interface and are used in the same way. To instantiate the evaluator you have to provide three dataframes: the original dataset `ori` which has been used to generate the synthetic data, the synthetic data `syn`, and a `control` dataset containing original records which have not been used to generate the synthetic data. 

Another parameter common to all evaluators is the number of target records to attack (`n_attacks`). A higher number will reduce the statistical uncertainties on the results, at the expense of a longer computation time.

```python
evaluator = *Evaluator(ori: pd.DataFrame, 
                       syn: pd.DataFrame, 
                       control: pd.DataFrame,
                       n_attacks: int)
```

Once the evaluator is instantiated, the evaluation pipeline is executed by calling the `evaluate()` method, and the resulting estimate of the risk can be accessed using the `risk()` method.

```python
evaluator.evaluate()
risk = evaluator.risk()
```

### A peek under the hood

In `Anonymeter` each privacy risk is derived from a privacy attacker whose task is to use the synthetic dataset to come up with a set of *guesses* of the form:
- "there is only one person with attributes X, Y, and Z" (singling out)
- "records A and B belong to the same person" (linkability)
- "a person with attributes X and Y also has Z" (inference)

Each evaluation consists of running three different attacks:
- the "main" privacy attack, in which the attacker uses the synthetic data to guess information on records in the original data.
- the "control" privacy attack, in which the attacker uses the synthetic data to guess information on records in the control dataset. 
- the "baseline" attack, which models a naive attacker who ignores the synthetic data and guesses randomly.

By checking how many of these guesses are correct, the success rates of the different attacks are measured and used to derive an estimate of the privacy risk. In particular, the "control" attack is used to separate what the attacker learns from the *utility* of the synthetic data from what is instead an indication of privacy leaks. The "baseline" attack instead functions as a sanity check: the "main" attack should outperform random guessing in order for the results to be trusted.
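As a rough illustration of how the success rates could be combined, the sketch below assumes the normalisation described in the accompanying paper: the risk is the excess of the main attack over the control attack, divided by one minus the control rate. The function names are hypothetical and are not part of the `Anonymeter` API, and clipping negative values to zero is an assumption of this sketch.

```python
def privacy_risk(attack_rate: float, control_rate: float) -> float:
    """Excess success of the main attack over the control attack, scaled to [0, 1].

    Assumed formula: (attack - control) / (1 - control), clipped at zero.
    """
    if not 0.0 <= attack_rate <= 1.0:
        raise ValueError("attack_rate must lie in [0, 1]")
    if not 0.0 <= control_rate < 1.0:
        raise ValueError("control_rate must lie in [0, 1)")
    return max(0.0, (attack_rate - control_rate) / (1.0 - control_rate))


def beats_baseline(attack_rate: float, baseline_rate: float) -> bool:
    """Sanity check: the main attack should do better than random guessing."""
    return attack_rate > baseline_rate


# Hypothetical rates, chosen only for illustration.
example_risk = privacy_risk(attack_rate=0.30, control_rate=0.20)
print(f"risk = {example_risk:.3f}")
print("results can be trusted:", beats_baseline(attack_rate=0.30, baseline_rate=0.05))
```

When the main attack is no more successful on the training records than on the control records, the risk under this definition is zero, regardless of how high both rates are.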

In [2]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from anonymeter.evaluators import SinglingOutEvaluator
from anonymeter.evaluators import LinkabilityEvaluator
from anonymeter.evaluators import InferenceEvaluator

## Downloading the data

For this example, we will use the well-known `Adults` dataset (more details [here](https://archive.ics.uci.edu/ml/datasets/adult)). This dataset contains aggregated census data, where every row represents a population segment. For the purpose of demonstrating `Anonymeter`, we will treat this data as if each row in fact referred to a real individual.

The synthetic version has been generated by `CTGAN` from [SDV](https://sdv.dev/SDV/user_guides/single_table/ctgan.html), as explained in the paper accompanying this code release. For details on the generation process, e.g. regarding hyperparameters, see Section 6.2.1 of [the accompanying paper](https://petsymposium.org/popets/2023/popets-2023-0055.php).

We pull these datasets from the [Statice](https://www.statice.ai/) public GC bucket:

In [20]:
# bucket_url = "https://storage.googleapis.com/statice-public/anonymeter-datasets/"

# ori = pd.read_csv('~/downloads/small_data/real.csv')
# syn = pd.read_csv('~/downloads/small_data/dpdffution_tabsyn.csv')
# control = pd.read_csv('~/downloads/small_data/test.csv')
# dif_syn = pd.read_csv('~/downloads/small_data/diffusion_tabsyn.csv')

ori = pd.read_csv("/Users/nijiayi/Stats_C161-261_Project/train_data_after_feature_selection.csv")
syn_dp = pd.read_csv("/Users/nijiayi/Stats_C161-261_Project/ddpm_balanced.csv")
control = pd.read_csv("/Users/nijiayi/Stats_C161-261_Project/test_data_after_feature_selection.csv")
#no dp
#syn = pd.read_csv('~/downloads/161/balanced_dataset_no_dp.csv')

In [22]:
ori = ori.sample(n=50000, random_state=0)
syn_dp = syn_dp.sample(n=50000, random_state=0)

In [23]:
syn_dp.shape

(50000, 23)

In [24]:
ori.shape

(50000, 23)

In [25]:
control.shape

(23784, 23)

In [26]:
ori.head()

Unnamed: 0,city,series_dev,emui_dev,device_name,device_size,net_type,creat_type_cd,slot_id,spread_app_id,app_second_class,...,task_id_count,adv_id_count,user_id_task_id_nunique,user_id_adv_prim_id_nunique,user_id_slot_id_nunique,user_id_spread_app_id_nunique,age_task_id_nunique,age_adv_id_nunique,gender_task_id_nunique,label
1855526,319,27,11,117,2482,7,10,16,162,23,...,10297,10297,67,34,9,13,7637,8530,8093,0
2461383,226,30,13,265,1033,6,8,17,162,14,...,7820,3095,152,45,13,24,5066,5546,10244,0
475675,424,31,20,346,2103,7,10,16,344,13,...,77251,77251,50,26,7,16,7845,8730,10244,0
671593,210,16,28,164,1710,7,2,12,213,23,...,12483,12483,81,35,14,16,8363,9358,10244,0
3441996,240,31,21,346,2032,7,10,16,152,17,...,34500,34500,19,14,6,13,7019,7834,10244,0


In [28]:
syn_dp.head()

Unnamed: 0,city,series_dev,emui_dev,device_name,device_size,net_type,creat_type_cd,slot_id,spread_app_id,app_second_class,...,task_id_count,adv_id_count,user_id_task_id_nunique,user_id_adv_prim_id_nunique,user_id_slot_id_nunique,user_id_spread_app_id_nunique,age_task_id_nunique,age_adv_id_nunique,gender_task_id_nunique,label
5257744,-0.299351,-1.149444,0.56855,0.432457,-0.840453,0.512655,-0.546555,0.325189,1.049406,0.252586,...,0.348562,0.348314,-0.24395,-0.015206,1.079766,0.819325,0.221353,0.213909,0.518708,1.0
3942453,0.302712,0.999785,0.806598,-0.574965,0.989001,-0.416092,-1.609351,1.174735,0.485181,0.541747,...,-0.099344,-0.098585,0.664468,1.136157,0.805311,-0.059736,0.273689,0.297199,0.139118,1.0
1566951,1.561996,0.849941,-0.089363,-1.077083,0.362609,0.525394,-1.50296,1.055449,-0.310147,1.257814,...,3.150606,3.150058,-0.633007,-0.252147,-0.686212,-0.269694,0.36367,0.381536,0.523381,0.0
3340714,1.334291,-0.957249,1.218475,-0.326862,-1.820428,0.525394,-2.53228,-0.86444,-1.839909,0.051891,...,0.053107,-0.298363,-0.020491,-0.559198,0.552666,-0.062554,0.35425,0.290023,0.523381,0.0
3304542,1.106585,0.979026,0.826124,-1.208701,-0.33146,0.525394,1.07034,-0.922618,-1.075028,-0.189293,...,0.488615,0.498456,-0.846056,-0.63596,-0.438437,-0.683973,0.36367,0.381536,0.523381,0.0


As visible, the dataset contains several pieces of demographic information, as well as information regarding the education, financial situation, and personal life of some tens of thousands of "individuals".

### Measuring the singling out risk

The `SinglingOutEvaluator` tries to measure how much the synthetic data can help an attacker find combinations of attributes that single out records in the training data.

With the following code we evaluate the robustness of the synthetic data to "univariate" singling out attacks, which try to find unique values of some attribute which single out an individual. 
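To make the idea of a "univariate" singling out guess concrete, here is a toy sketch in plain Python. It is not the algorithm used by `SinglingOutEvaluator`: the data, the helper names, and the restriction to equality predicates are all assumptions made for illustration. The attacker looks for values that are unique in the synthetic data and then checks whether each of them matches exactly one original record.

```python
from collections import Counter


def unique_values(records, column):
    """Values of `column` that appear exactly once in `records`."""
    counts = Counter(record[column] for record in records)
    return {value for value, count in counts.items() if count == 1}


def singles_out(records, column, value):
    """True if exactly one record has `column == value`."""
    return sum(1 for record in records if record[column] == value) == 1


# Hypothetical toy data, not the dataset used in this notebook.
syn_toy = [
    {"age": 31, "zip": "10115"},
    {"age": 31, "zip": "10117"},
    {"age": 58, "zip": "10115"},
]
ori_toy = [
    {"age": 58, "zip": "10119"},
    {"age": 31, "zip": "10115"},
    {"age": 31, "zip": "10115"},
]

# Guesses are derived from the synthetic data only...
guesses = [
    (column, value)
    for column in ("age", "zip")
    for value in sorted(unique_values(syn_toy, column))
]
# ...and scored against the original data.
successes = [guess for guess in guesses if singles_out(ori_toy, *guess)]

print("guesses:", guesses)
print("successful guesses:", successes)
```

The fraction of successful guesses plays the role of the attack success rate.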


##### NOTE:

The `SinglingOutEvaluator` can sometimes raise a `RuntimeError`. This happens when not enough singling out queries are found. Increasing `n_attacks` will make this condition less frequent and the evaluation more robust, although much slower.


In [29]:
#with dp
evaluator2 = SinglingOutEvaluator(ori=ori, 
                                 syn=syn_dp, 
                                 control=control,
                                 n_attacks=500)

try:
    evaluator2.evaluate(mode='univariate')
    risk = evaluator2.risk()
    print(risk)

except RuntimeError as ex: 
    print(f"Singling out evaluation failed with {ex}. Please re-run this cell. "
          "For more stable results increase `n_attacks`. Note that this will "
          "make the evaluation slower.")

PrivacyRisk(value=0.0, ci=(0.0, 0.005411853750198381))


  self._sanity_check()


In [30]:
res2 = evaluator2.results()

print("Success rate of main attack:", res2.attack_rate)
print("Success rate of baseline attack:", res2.baseline_rate)
print("Success rate of control attack:", res2.control_rate)

Success rate of main attack: SuccessRate(value=0.0038121702307761206, error=0.00381217023077612)
Success rate of baseline attack: SuccessRate(value=0.0038121702307761206, error=0.00381217023077612)
Success rate of control attack: SuccessRate(value=0.0038121702307761206, error=0.00381217023077612)


The risk estimate is accompanied by a confidence interval (at 95% level by default) which accounts for the finite number of attacks performed, 500 in this case. 
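The effect of the finite number of attacks can be illustrated with a Wilson score interval for a binomial proportion. This is a sketch under the assumption that the success rates are estimated this way; the helper name is hypothetical. Under that assumption, zero successes out of 500 attacks give a value and an error of about 0.0038 each, which agrees with the rates printed above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_success_rate(n_success: int, n_total: int, confidence_level: float = 0.95):
    """Centre and half-width of the Wilson score interval for n_success / n_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Two-sided normal quantile, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 * (1.0 + confidence_level))
    z2 = z * z
    denominator = n_total + z2
    value = (n_success + 0.5 * z2) / denominator
    error = (z / denominator) * sqrt(
        n_success * (n_total - n_success) / n_total + 0.25 * z2
    )
    return value, error


# The uncertainty shrinks as the number of attacks grows.
for n_attacks in (100, 500, 5000):
    value, error = wilson_success_rate(n_success=0, n_total=n_attacks)
    print(f"n_attacks={n_attacks:5d}  value={value:.6f}  error={error:.6f}")
```

Even when no attack succeeds, the estimated rate is not exactly zero: the interval reflects how little can be excluded with a limited number of attacks.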

Using the `queries()` method, we can see what kind of singling out queries (i.e. the *guesses*) the attacker has come up with:

As visible, the attacker was able to pick up that `fnlwgt` has many (~63%) unique integer values and that it can provide a powerful handle for singling out. This should result in a singling out risk that is *compatible*, within the confidence level, with a few percentage points. The actual results can vary depending on notebook execution.

### Checking singling out with multivariate predicates

The `SinglingOutEvaluator` can also attack the dataset using predicates that combine different attributes. These are the so-called `multivariate` predicates.

To run the analysis using the `multivariate` singling out attack, the `mode` parameter of `evaluate` needs to be set accordingly. The number of attributes used in the attacker queries is controlled via the `n_cols` parameter, set to 4 in this example.
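The following toy sketch shows what a multivariate predicate can look like and why combining attributes can single out a record that no attribute isolates on its own. It is an illustration only: the data and helper names are hypothetical, and using equality conditions over every pair of columns is a simplification, not the query generation strategy of `Anonymeter`.

```python
from itertools import combinations


def count_matches(records, predicate):
    """Number of records satisfying every `column == value` condition."""
    return sum(
        all(record[column] == value for column, value in predicate.items())
        for record in records
    )


# Hypothetical toy data, not the dataset used in this notebook.
syn_toy = [
    {"age": 31, "zip": "10115", "job": "nurse"},
    {"age": 58, "zip": "10117", "job": "clerk"},
]
ori_toy = [
    {"age": 31, "zip": "10115", "job": "nurse"},
    {"age": 31, "zip": "10115", "job": "clerk"},
    {"age": 58, "zip": "10115", "job": "nurse"},
]

n_cols = 2  # number of attributes combined in each predicate
columns = ["age", "zip", "job"]

singling_out = []
for syn_record in syn_toy:
    for chosen in combinations(columns, n_cols):
        # Build the predicate from a synthetic record...
        predicate = {column: syn_record[column] for column in chosen}
        # ...and keep it if it isolates exactly one original record.
        if count_matches(ori_toy, predicate) == 1:
            singling_out.append(predicate)

print(singling_out)
```

Here neither `age == 31` nor `job == "nurse"` singles out an original record, but their conjunction does.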

In [31]:
#with dp diffusion
evaluator3 = SinglingOutEvaluator(ori=ori, 
                                 syn=syn_dp, 
                                 control=control,
                                 n_attacks=100, # this attack takes longer
                                 n_cols=4)


try:
    evaluator3.evaluate(mode='multivariate')
    risk = evaluator3.risk()
    print(risk)

except RuntimeError as ex: 
    print(f"Singling out evaluation failed with {ex}. Please re-run this cell. "
          "For more stable results increase `n_attacks`. Note that this will "
          "make the evaluation slower.")

PrivacyRisk(value=0.0, ci=(0.0, 0.026651316150077816))


  self._sanity_check()


In [32]:
res3 = evaluator3.results()

print("Success rate of main attack:", res3.attack_rate)
print("Success rate of baseline attack:", res3.baseline_rate)
print("Success rate of control attack:", res3.control_rate)

Success rate of main attack: SuccessRate(value=0.01849674910349284, error=0.01849674910349284)
Success rate of baseline attack: SuccessRate(value=0.01849674910349284, error=0.01849674910349284)
Success rate of control attack: SuccessRate(value=0.01849674910349284, error=0.01849674910349284)


In [33]:
evaluator3.queries()[:3]

[]

In [None]:
# # without dp
# evaluator4 = SinglingOutEvaluator(ori=ori, 
#                                  syn=syn, 
#                                  control=control,
#                                  n_attacks=100, # this attack takes longer
#                                  n_cols=4)


# try:
#     evaluator4.evaluate(mode='multivariate')
#     risk = evaluator4.risk()
#     print(risk)
#     #print(evaluator4.queries()[:3])

# except RuntimeError as ex: 
#     print(f"Singling out evaluation failed with {ex}. Please re-run this cell. "
#           "For more stable results increase `n_attacks`. Note that this will "
#           "make the evaluation slower.")


In [None]:
# res4 = evaluator4.results()

# print("Success rate of main attack:", res4.attack_rate)
# print("Success rate of baseline attack:", res4.baseline_rate)
# print("Success rate of control attack:", res4.control_rate)

In [None]:
# evaluator4.queries()[:3]

### Measuring the linkability risk

The `LinkabilityEvaluator` measures how much the synthetic data will help an adversary who tries to link two other datasets based on a subset of attributes.

For example, suppose that the adversary finds dataset A containing, among other fields, information about the profession and education of people, and dataset B containing some demographic and health related information. Can the attacker use the synthetic dataset to link these two datasets?

To run the `LinkabilityEvaluator` one needs to specify which columns of auxiliary information are available to the attacker, and how they are distributed between the two datasets A and B. This is done using the `aux_cols` parameter.
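The linking step can be sketched as a nearest-neighbour search. In this toy version, a target record is considered linked if some synthetic record is among its closest neighbours both when only the columns of dataset A are used and when only the columns of dataset B are used. The helper names and data are hypothetical, and the squared Euclidean distance on numeric columns is an assumption of this sketch, not necessarily the distance used by the library.

```python
def nearest_indices(target, candidates, columns, k):
    """Indices of the k candidates closest to `target` on the given columns."""
    def squared_distance(candidate):
        return sum((target[c] - candidate[c]) ** 2 for c in columns)

    ranked = sorted(range(len(candidates)), key=lambda i: squared_distance(candidates[i]))
    return set(ranked[:k])


def is_linked(target, syn, cols_a, cols_b, n_neighbors):
    """True if the neighbour sets found via A and via B share a synthetic record."""
    via_a = nearest_indices(target, syn, cols_a, n_neighbors)
    via_b = nearest_indices(target, syn, cols_b, n_neighbors)
    return bool(via_a & via_b)


# Hypothetical toy data: dataset A knows (age, income), dataset B knows (height, weight).
cols_a, cols_b = ["age", "income"], ["height", "weight"]
syn_toy = [
    {"age": 30, "income": 40, "height": 170, "weight": 65},
    {"age": 50, "income": 90, "height": 185, "weight": 90},
    {"age": 41, "income": 60, "height": 160, "weight": 55},
]
target_1 = {"age": 31, "income": 42, "height": 171, "weight": 66}
target_2 = {"age": 49, "income": 88, "height": 161, "weight": 56}

print(is_linked(target_1, syn_toy, cols_a, cols_b, n_neighbors=1))
print(is_linked(target_2, syn_toy, cols_a, cols_b, n_neighbors=1))
print(is_linked(target_2, syn_toy, cols_a, cols_b, n_neighbors=2))
```

The second target is linked only when two neighbours are considered, which illustrates how a larger neighbourhood accepts weaker links.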

In [34]:
#aux_cols = [
#    ['type_employer', 'education', 'hr_per_week', 'capital_loss', 'capital_gain'],
#    [ 'race', 'sex', 'fnlwgt', 'age', 'country']
#]

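#with dp
# NOTE: the text above and the commented-out example describe `aux_cols` as two
# lists of columns, one per dataset (A and B); here a single flat list is passed.
aux_cols = ["city", "spread_app_id", "u_refreshTimes", "user_id_count"]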

evaluator5 = LinkabilityEvaluator(ori=ori, 
                                 syn=syn_dp, 
                                 control=control,
                                 n_attacks=min(2000, len(control)),
                                 aux_cols=aux_cols,
                                 n_neighbors=10)

evaluator5.evaluate(n_jobs=-2)  # n_jobs follows the joblib convention: -1 = all cores, -2 = all except one
#evaluator.evaluate(label = 1)
evaluator5.risk()

  self._sanity_check()


PrivacyRisk(value=0.0, ci=(0.0, 0.0013568577126237004))

In [35]:
res5 = evaluator5.results()

print("Success rate of main attack:", res5.attack_rate)
print("Success rate of baseline attack:", res5.baseline_rate)
print("Success rate of control attack:", res5.control_rate)

Success rate of main attack: SuccessRate(value=0.0009585236406264672, error=0.0009585236406264671)
Success rate of baseline attack: SuccessRate(value=0.002455648069704588, error=0.0019453844860661056)
Success rate of control attack: SuccessRate(value=0.0009585236406264672, error=0.0009585236406264671)


In [None]:
# #Diff no DP
# evaluator6 = LinkabilityEvaluator(ori=ori, 
#                                  syn=syn, 
#                                  control=control,
#                                  n_attacks=min(2000, len(control)),
#                                  aux_cols=aux_cols,
#                                  n_neighbors=10)

# evaluator6.evaluate(n_jobs=-2)  # n_jobs follows the joblib convention: -1 = all cores, -2 = all except one
# #evaluator.evaluate(label = 1)
# #evaluator7.evaluate()
# print(evaluator6.risk())

# res6 = evaluator6.results() 

# print("Success rate of main attack:", res6.attack_rate)
# print("Success rate of baseline attack:", res6.baseline_rate)
# print("Success rate of control attack:", res6.control_rate)


As visible, the attack is not very successful and the linkability risk is low. The `n_neighbors` parameter can be used to allow weaker, indirect links to be scored as successes, which has an impact on the risk estimate. To check the measured risk for different values of `n_neighbors` you don't have to re-run the evaluation. Rather, do:

In [None]:
#print(evaluator6.risk(n_neighbors=7))

### Measuring the inference risk

Finally, `Anonymeter` can measure the inference risk. It does so by measuring the success of an attacker who tries to discover the value of some secret attribute for a set of target records on which some auxiliary knowledge is available.

Similar to the case of the `LinkabilityEvaluator`, the main parameter here is `aux_cols`, which specifies what the attacker knows about its target, i.e. which columns are known to the attacker. By varying the `secret` column, one can identify which attributes, alone or in combination, exhibit the largest risks and thereby expose a lot of information on the original data.
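A toy version of such an inference attack is sketched below: for every target, the attacker finds the closest synthetic record on the auxiliary columns and uses that record's secret value as the guess. The helper names and data are hypothetical; the squared Euclidean distance and the exact-match scoring are assumptions of this sketch rather than a description of `InferenceEvaluator`.

```python
def guess_secret(target, syn, aux_cols, secret):
    """Secret value of the synthetic record closest to `target` on `aux_cols`."""
    nearest = min(
        syn,
        key=lambda record: sum((target[c] - record[c]) ** 2 for c in aux_cols),
    )
    return nearest[secret]


def attack_success_rate(targets, syn, aux_cols, secret):
    """Fraction of targets whose secret is guessed exactly."""
    hits = sum(
        guess_secret(target, syn, aux_cols, secret) == target[secret]
        for target in targets
    )
    return hits / len(targets)


# Hypothetical toy data. The secret column is never part of the auxiliary columns.
secret = "city"
aux_cols = ["age", "hours"]
syn_toy = [
    {"age": 30, "hours": 40, "city": "A"},
    {"age": 60, "hours": 20, "city": "B"},
]
targets_toy = [
    {"age": 29, "hours": 38, "city": "A"},
    {"age": 62, "hours": 22, "city": "B"},
    {"age": 58, "hours": 25, "city": "A"},
]

rate = attack_success_rate(targets_toy, syn_toy, aux_cols, secret)
print(f"attack success rate: {rate:.2f}")
```

Running the same procedure on control records instead of training records would give the corresponding control rate.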

In the following snippet we will measure the inference risk for each column individually, using all the other columns as auxiliary information to model a very knowledgeable attacker. 

In [None]:
# columns = ori.columns
# results = []

# for secret in columns:
    
#     aux_cols = [col for col in columns if col != secret]
    
#     evaluator = InferenceEvaluator(ori=ori, 
#                                    syn=syn, 
#                                    control=control,
#                                    aux_cols=aux_cols,
#                                    secret=secret,
#                                    n_attacks=1000)
#     #evaluator.evaluate(n_jobs=-2)
#     evaluator.evaluate()
#     results.append((secret, evaluator.results()))

In [37]:
ori

Unnamed: 0,city,series_dev,emui_dev,device_name,device_size,net_type,creat_type_cd,slot_id,spread_app_id,app_second_class,...,task_id_count,adv_id_count,user_id_task_id_nunique,user_id_adv_prim_id_nunique,user_id_slot_id_nunique,user_id_spread_app_id_nunique,age_task_id_nunique,age_adv_id_nunique,gender_task_id_nunique,label
1855526,319,27,11,117,2482,7,10,16,162,23,...,10297,10297,67,34,9,13,7637,8530,8093,0
2461383,226,30,13,265,1033,6,8,17,162,14,...,7820,3095,152,45,13,24,5066,5546,10244,0
475675,424,31,20,346,2103,7,10,16,344,13,...,77251,77251,50,26,7,16,7845,8730,10244,0
671593,210,16,28,164,1710,7,2,12,213,23,...,12483,12483,81,35,14,16,8363,9358,10244,0
3441996,240,31,21,346,2032,7,10,16,152,17,...,34500,34500,19,14,6,13,7019,7834,10244,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261380,226,31,13,210,2177,7,7,59,175,18,...,5104,5104,73,31,9,14,8363,9358,10244,0
3473106,439,16,13,292,1891,7,8,35,372,13,...,7783,7783,44,17,11,13,8058,8936,6898,0
598860,319,27,11,210,1100,6,3,17,162,14,...,2568,2568,90,46,11,19,7637,8530,8093,0
3184994,299,31,21,151,2032,6,8,13,101,18,...,5440,751,75,28,12,20,8065,9014,10244,0


In [38]:
#with dp
ori['city'] = ori['city'].astype(int)
syn_dp['city'] = syn_dp['city'].astype(int)
control['city'] = control['city'].astype(int)

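# NOTE: `aux_cols` as defined for the linkability attack still contains 'city';
# the per-column loop shown earlier excludes the secret from the auxiliary columns.
secret = 'city'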

evaluator4 = InferenceEvaluator(
    ori=ori,
    syn=syn_dp,
    control=control,
    aux_cols=aux_cols,
    secret=secret,
    n_attacks=1000
)

evaluator4.evaluate()
evaluator4.risk()
res4 = evaluator4.results()
print(res4.risk())
print("Success rate of main attack:", res4.attack_rate)
print("Success rate of baseline attack:", res4.baseline_rate)
print("Success rate of control attack:", res4.control_rate)

PrivacyRisk(value=0.0, ci=(0.0, 0.002711114264858369))
Success rate of main attack: SuccessRate(value=0.0019133792427775617, error=0.0019133792427775617)
Success rate of baseline attack: SuccessRate(value=0.22306383885898431, error=0.025730802002833567)
Success rate of control attack: SuccessRate(value=0.0019133792427775617, error=0.0019133792427775617)


  self._sanity_check()


In [None]:
# #without dp
# dif_syn['Revenue'] = dif_syn['Revenue'].astype(int)

# secret = 'Revenue'

# evaluator8 = InferenceEvaluator(
#     ori=ori,
#     syn=dif_syn,
#     control=control,
#     aux_cols=aux_cols,
#     secret=secret,
#     n_attacks=1000
# )

# evaluator8.evaluate()
# evaluator8.risk()
# res8 = evaluator8.results()
# print(res8.risk())
# print("Success rate of main attack:", res8.attack_rate)
# print("Success rate of baseline attack:", res8.baseline_rate)
# print("Success rate of control attack:", res8.control_rate)