## General Pipeline Tutorial

#### Introduction
Synth-MIA contains several features that make privacy auditing with membership inference attacks (MIAs) convenient. In this tutorial, we assume that the auditor has already split their data into train and test sets and generated the synthetic dataset to be audited.

In [18]:
import pandas as pd
import numpy as np
from synth_mia.attackers import DCR  # explicit import; only DCR is used in this tutorial
from synth_mia import utils, evaluation

#### Preprocessing
To run an attack, we need to split the test data into a holdout set and a reference set. The holdout set serves as the "negative class" for the attack evaluation, the training set is the "positive class", and the reference set models prior information an attacker might have about the population distribution. The reference set is only needed for certain attacks.

After splitting, we encode each dataset by scaling the continuous variables and applying either "one-hot" or "ordinal" encoding to the categorical ones. Note: density estimation strategies usually work better with ordinal encodings, as the density estimators can fail to converge in a one-hot setting.

In [16]:
# Load the data used in the training of the model, the test set, and the final synthetic data
train_set = pd.read_csv('../example_data/insurance/train_set.csv')
test_set = pd.read_csv('../example_data/insurance/holdout_set.csv')
synth_set = pd.read_csv('../example_data/insurance/bayesian_network_synth_250.csv')

# Split the test set into a non-member set and a reference set
non_member_set, reference_set = utils.create_random_equal_dfs(test_set, 250, num_dfs=2, seed=42)

# Preprocess dataframes into encoded numpy arrays
prep = utils.TabularPreprocessor(fit_target='synth', categorical_encoding='one-hot', numeric_encoding='standard')

# Fit on chosen target (ref is optional if fit_target='synth')
prep.fit(train_set, non_member_set, synth_set)

# Transform all datasets
mem, non_mem, synth, ref, transformer = prep.transform(train_set, non_member_set, synth_set)

print(mem.shape, synth.shape)
print(ref)  # None, because we didn't pass ref

(250, 12) (250, 12)
None
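
The split-and-encode step above can be sketched without the library. The following is a minimal illustration using NumPy only; the toy columns (`age`, `region`) are hypothetical, and it is an assumption that `TabularPreprocessor` behaves similarly. It shows a disjoint random split, standard scaling, and the difference in width between ordinal and one-hot encodings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy test set: one numeric and one categorical column.
age = np.array([23.0, 35.0, 47.0, 51.0, 62.0, 29.0])
region = np.array(["north", "south", "east", "south", "north", "east"])

# 1. Disjoint random split of the row indices into holdout and reference halves.
perm = rng.permutation(len(age))
holdout_idx, reference_idx = perm[:3], perm[3:]

# 2. Standard scaling: zero mean, unit variance.
age_scaled = (age - age.mean()) / age.std()

# 3a. Ordinal encoding: each category becomes one integer code (one column).
categories, ordinal = np.unique(region, return_inverse=True)

# 3b. One-hot encoding: each category becomes its own 0/1 column.
one_hot = np.eye(len(categories))[ordinal]

ordinal_encoded = np.column_stack([age_scaled, ordinal])
one_hot_encoded = np.column_stack([age_scaled, one_hot])

print(ordinal_encoded.shape, one_hot_encoded.shape)
```

The one-hot matrix grows by one column per category, which is one reason density estimators may struggle with it on small datasets.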


#### The Attack
To audit the privacy of synth, instantiate an attacker class. Hyperparameters for the attack can be set at initialization. We then pass the four datasets to the attacker.attack() method, which returns the true labels and the predicted scores.


In [17]:
# Create instances of your attackers
att1 = DCR()

# Run attacks and evaluate results
results = {}
attackers = [att1]

for attacker in attackers:
    # Run the attack
    true_labels, scores = attacker.attack(mem, non_mem, synth, ref)
    
    # Evaluate the attack
    eval_results = attacker.eval(true_labels, scores, metrics=['roc'])
    
    # Store results
    results[attacker.name] = eval_results

# Print results
pd.DataFrame(results).T

| | auc_roc | tpr_at_fpr_0 | tpr_at_fpr_0.001 | tpr_at_fpr_0.01 | tpr_at_fpr_0.1 |
|---|---|---|---|---|---|
| DCR | 0.74136 | 0.132 | 0.132 | 0.168 | 0.424 |
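
To give a feel for what a distance-based attack computes, here is a minimal sketch that assumes DCR stands for "distance to closest record" and scores each candidate by its negative Euclidean distance to the nearest synthetic row. The function names `dcr_scores` and `dcr_attack` are hypothetical, and the library's own DCR attacker may differ in metric and details.

```python
import numpy as np

def dcr_scores(candidates, synth):
    """Negative distance from each candidate row to its closest synthetic row."""
    # Pairwise differences, shape (n_candidates, n_synth, n_features).
    diffs = candidates[:, None, :] - synth[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Closer to the synthetic data means a higher (less negative) score.
    return -dists.min(axis=1)

def dcr_attack(mem, non_mem, synth):
    """Return (true_labels, scores), with members labelled 1 and non-members 0."""
    scores = np.concatenate([dcr_scores(mem, synth), dcr_scores(non_mem, synth)])
    true_labels = np.concatenate([
        np.ones(len(mem), dtype=int),
        np.zeros(len(non_mem), dtype=int),
    ])
    return true_labels, scores

# Toy data: the members sit close to the synthetic rows, the non-member far away.
toy_mem = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_non_mem = np.array([[5.0, 5.0]])
toy_synth = np.array([[0.0, 0.0], [1.0, 1.1]])

toy_labels, toy_scores = dcr_attack(toy_mem, toy_non_mem, toy_synth)
print(toy_labels, toy_scores)
```

The intuition is that a generator which memorizes training rows will place synthetic records unusually close to them, so members tend to receive higher scores than non-members.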


#### Other Evaluation Methods
If you already have labels and scores, you can evaluate them using the AttackEvaluator class, which supports common classification metrics and empirical epsilon bound estimation.

In [9]:
AE = evaluation.AttackEvaluator(true_labels, scores)
AE.epsilon_evaluator(confidence_level=0.9, threshold_method='ratio')

{'threshold': -0.24812695573405602,
 'confidence_level': 0.9,
 'epsilon_lower_bound': 1.1115228138265543,
 'epsilon_upper_bound': 2.613494154923251}
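
How the library derives these bounds is not shown here. As a rough illustration of the general idea, the sketch below uses the standard differential privacy inequality TPR ≤ e^ε · FPR (taking δ = 0), which gives ε ≥ ln(TPR / FPR) at any fixed threshold. It replaces the observed rates with approximate Wilson score bounds to stay conservative. The function names, the choice of Wilson intervals, and the fixed threshold are all assumptions and are not the `'ratio'` method used above.

```python
import math
from statistics import NormalDist

def wilson_bounds(successes, n, z):
    """Approximate Wilson score lower and upper bounds for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def epsilon_lower_bound(true_labels, scores, threshold, confidence_level=0.9):
    """Conservative estimate of ln(TPR / FPR) at a fixed threshold (delta = 0)."""
    alpha = 1 - confidence_level
    # One-sided bound at level 1 - alpha/2 for each rate (union bound over both).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    members = [s for y, s in zip(true_labels, scores) if y == 1]
    non_members = [s for y, s in zip(true_labels, scores) if y == 0]
    tp = sum(s >= threshold for s in members)
    fp = sum(s >= threshold for s in non_members)
    tpr_lo, _ = wilson_bounds(tp, len(members), z)
    _, fpr_hi = wilson_bounds(fp, len(non_members), z)
    if tpr_lo <= 0 or fpr_hi <= 0:
        return 0.0
    return max(0.0, math.log(tpr_lo / fpr_hi))

# Toy example: 80 of 100 members and 10 of 100 non-members exceed the threshold.
toy_labels = [1] * 100 + [0] * 100
toy_scores = [1.0] * 80 + [0.0] * 20 + [1.0] * 10 + [0.0] * 90
eps = epsilon_lower_bound(toy_labels, toy_scores, threshold=0.5)
print(eps)
```

Because the confidence bounds shrink the TPR and inflate the FPR, the result is always below the naive point estimate ln(0.8 / 0.1), and an attack at chance level yields 0.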

In [10]:
AE.classification_metrics(decision_threshold=np.median(scores))

{'accuracy': 0.68,
 'precision': 0.68,
 'recall': 0.68,
 'f1_score': 0.68,
 'true_positive_rate': 0.68,
 'false_positive_rate': 0.32}
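
All four headline metrics above are equal to 0.68, and this is not a coincidence. With 250 members and 250 non-members, thresholding at the median score labels half of the records as positive, so (ignoring ties) the number of false positives equals the number of false negatives. Accuracy, precision, recall and F1 then coincide. The sketch below reproduces this on toy data; the function name is hypothetical, and treating `score >= threshold` as a positive prediction is an assumption about the library's convention.

```python
import numpy as np

def classification_metrics(true_labels, scores, decision_threshold):
    """Confusion-matrix metrics for a score-based membership attack."""
    labels = np.asarray(true_labels)
    predicted = (np.asarray(scores) >= decision_threshold).astype(int)
    tp = int(np.sum((predicted == 1) & (labels == 1)))
    fp = int(np.sum((predicted == 1) & (labels == 0)))
    tn = int(np.sum((predicted == 0) & (labels == 0)))
    fn = int(np.sum((predicted == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Balanced toy data: four members followed by four non-members.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05]
toy_metrics = classification_metrics(toy_labels, toy_scores, np.median(toy_scores))
print(toy_metrics)
```

Choosing a different threshold breaks this symmetry, which is why the ROC-based metrics are usually more informative than a single operating point.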

In [11]:
AE.roc_metrics(target_fprs=[0,0.001,0.01,0.1,0.25])

{'auc_roc': 0.74136,
 'tpr_at_fpr_0': 0.132,
 'tpr_at_fpr_0.001': 0.132,
 'tpr_at_fpr_0.01': 0.168,
 'tpr_at_fpr_0.1': 0.424,
 'tpr_at_fpr_0.25': 0.656}
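
Note that `tpr_at_fpr_0` and `tpr_at_fpr_0.001` are identical. With 250 non-members the smallest nonzero false positive rate is 1/250 = 0.004, so, assuming the ROC curve is not interpolated, a target of 0.001 can only be met with zero false positives. The following sketch shows one way to compute the TPR at a fixed FPR budget; the function name is hypothetical and the library's implementation may differ.

```python
import numpy as np

def tpr_at_fpr(true_labels, scores, target_fpr):
    """Highest TPR over all thresholds whose empirical FPR is <= target_fpr."""
    labels = np.asarray(true_labels)
    scores = np.asarray(scores, dtype=float)
    member_scores = scores[labels == 1]
    non_member_scores = scores[labels == 0]
    best = 0.0
    # Every observed score is a candidate threshold (predict member if score >= t).
    for t in np.unique(scores):
        fpr = np.mean(non_member_scores >= t)
        if fpr <= target_fpr:
            best = max(best, float(np.mean(member_scores >= t)))
    return best

# Toy data: four members followed by four non-members.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]

print(tpr_at_fpr(toy_labels, toy_scores, 0.0))
print(tpr_at_fpr(toy_labels, toy_scores, 0.1))
print(tpr_at_fpr(toy_labels, toy_scores, 0.25))
```

In the toy data there are only four non-members, so the FPR moves in steps of 0.25 and the targets 0.0 and 0.1 give the same TPR, mirroring the effect seen in the output above.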

In [12]:
AE.privacy_metrics()

{'mia_advantage': 0.36000000000000004, 'privacy_gain': 0.6399999999999999}
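
These two values are consistent with the common definitions of membership advantage as TPR minus FPR and privacy gain as one minus the advantage, evaluated at the median-threshold operating point reported earlier (0.68 − 0.32 = 0.36 and 1 − 0.36 = 0.64). That the library computes them exactly this way is an assumption; the hypothetical helper below only restates the arithmetic.

```python
def privacy_metrics(tpr, fpr):
    """Membership advantage (TPR - FPR) and the complementary privacy gain."""
    advantage = tpr - fpr
    return {"mia_advantage": advantage, "privacy_gain": 1.0 - advantage}

# TPR and FPR taken from the median-threshold classification metrics.
toy_privacy = privacy_metrics(tpr=0.68, fpr=0.32)
print(toy_privacy)
```

An advantage of 0 corresponds to an attack no better than random guessing, while an advantage of 1 means members and non-members are perfectly separated.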