# Protecting against Reidentification Attacks with Differential Privacy

In this notebook, we show how differential privacy can protect sensitive personal information against reidentification attacks. An attacker can reveal the identities of individuals by linking anonymized records from a published dataset to information about the same people from other sources.
In this demo, the published anonymized dataset contains patient records. The attacker tries to identify individuals by leveraging basic demographic information such as age and zip code.
We show that reidentification attacks can succeed even when the sensitive data is published in an anonymized format. Then, we perform a second attack after protecting the sensitive data with a dataset synthesizer from the SmartNoise system.


This demo is based on the following steps:

1. Import of anonymized medical data set and the attacker's data collection
2. Reidentification Attack I: Revealing identities from the anonymized data set 
3. Protecting the medical dataset with differential privacy using the Multiplicative Weights Exponential Mechanism (MWEM)
4. Validating the utility of the synthesized data set for statistical analyses
5. Reidentification Attack II: Trying to reveal identities based on the differentially private version of the medical data set

In [None]:
# Install the required libraries (comment this line out if they are already installed)
!pip install faker zipcodes tqdm opendp-smartnoise z3-solver==4.8.9.0

In [None]:
import warnings
warnings.filterwarnings("ignore")

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
import pandas as pd
import numpy as np
import random
import string
import uuid
import time
from datetime import datetime
import matplotlib.pyplot as plt
from tqdm import tqdm
import reident_tools as reident
from opendp.smartnoise.synthesizers.mwem import MWEMSynthesizer

%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

## Import data sets
Below, we are going to import three data sets:
1. Public medical data set, containing k-anonymized demographic and sensitive medical information
2. Attacker's data collection with basic demographic information
3. Public medical data set preprocessed for the MWEM synthesizer

In [None]:
# Read files
df_medical = pd.read_csv('data/data_medical.csv', sep=",", encoding="utf-8").infer_objects()
df_medical['Zip'] = df_medical['Zip'].astype(str)
print('Anonymized dataset including sensitive medical information:')
display(df_medical.iloc[:,1:].sample(8))
df_demographic = pd.read_csv('data/data_demographic.csv', sep=",", encoding="utf-8").infer_objects()
print("Attacker's data collection with basic demographic information:")
df_demographic['Zip'] = df_demographic['Zip'].astype(str)
display(df_demographic.iloc[:,1:].sample(8))
df_medical_synth = pd.read_csv('data/data_medical_synthesizer.csv', sep=",", encoding="utf-8").infer_objects()
df_medical_synth['Zip'] = df_medical_synth['Zip'].astype(str)

The data sets above also include a unique id for each record. We use it only to count the number of correctly identified records after the attack; it is not used to perform the attack itself.

## Reidentification Attack I - Revealing identities from the anonymized data set
Below, we perform the first reidentification attack using the `try_reidentification` function. As input, we use the data sets loaded above (published medical and demographic data).

### Perform the attack
Now, we perform the reidentification attack with the demographic and the medical data set, using a combinatorial approach.

**TIP**

The following cell takes several minutes to complete. Start the execution of the cell first and then try to understand the details of the code being executed.
The `try_reidentification` method is implemented in the `reident_tools.py` file.
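
To make the idea of such an attack concrete, here is a minimal sketch of a linkage attack on toy data. It is not the implementation from `reident_tools.py`: the data frames, the `unique_rows` helper and all values are made up for illustration, and we assume the attacker simply joins both sources on the quasi-identifiers `Gender`, `Age` and `Zip`.

```python
import pandas as pd

# Toy stand-ins for the attacker's collection and the published records.
# All values are invented for illustration.
demographic = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Name": ["Ann", "Bob", "Cara", "Dan"],
    "Gender": ["F", "M", "F", "M"],
    "Age": [34, 51, 34, 51],
    "Zip": ["10115", "80331", "20095", "80331"],
})
medical = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [34, 51, 34, 51],
    "Zip": ["10115", "80331", "20095", "80331"],
    "Diagnosis": ["Asthma", "Diabetes", "Migraine", "Flu"],
})

quasi_identifiers = ["Gender", "Age", "Zip"]


def unique_rows(df, keys):
    """Hypothetical helper: keep rows whose key combination occurs exactly once."""
    return df[~df.duplicated(subset=keys, keep=False)]


# Join both sources on the quasi-identifiers, restricted to unique combinations.
matches = unique_rows(demographic, quasi_identifiers).merge(
    unique_rows(medical, quasi_identifiers),
    on=quasi_identifiers,
    suffixes=("_demo", "_med"),
)

# The ids only validate the result; they play no role in the join itself.
matches["ID_Match"] = matches["Id_demo"] == matches["Id_med"]
print(matches[["Name", "Gender", "Age", "Zip", "Diagnosis", "ID_Match"]])
```

In this toy example, Bob and Dan share the same combination of gender, age and zip code, so neither can be singled out, while Ann and Cara are each linked to a diagnosis.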

In [None]:
%%time
reident_attack = reident.try_reidentification(df_demographic, df_medical, logger)

### Results of the attack
Below, we show the number of actual (validated) matches and take a glance at the data.

In [None]:
print(f'Found: {len(reident_attack[reident_attack["ID_Match"]==True])} actual (validated) matches!')

In [None]:
# Write to file, if wanted
# reident_attack.to_csv('data/results_reident-attack-raw.csv', sep=",", encoding="utf-8", index=False)
# Or read files, if needed
# reident_attack = pd.read_csv('data/results_reident-attack-raw.csv', sep=",", encoding="utf-8")

In [None]:
# Get sample from the data set
print('Sample of re-identified patients:')
reident_attack[reident_attack["ID_Match"]==True][['Name', 'Gender', 'Age', 'Zip', 'Diagnosis', 'Treatment', 'Outcome', 'ID_Match']].sample(10)

## Protecting the medical dataset with differential privacy
In the next step, we synthesize the data set to increase the level of protection. We use the Multiplicative Weights Exponential Mechanism (MWEM) synthesizer for this purpose and encode the demographic data (gender, age, zip) and the diagnosis. The other variables (treatment, outcome) are not part of the analysis for now.

In [None]:
# Prepare the data set for reidentification: take the medical data set and copy the full zip code
# and the age from the demographic set (this assumes both data frames share the same row order)
df_reident_synth = df_medical[['Gender', 'Age', 'Zip', 'Diagnosis', 'Treatment', 'Outcome']].copy()
df_reident_synth['Zip'] = df_demographic['Zip'].copy()
df_reident_synth['Age'] = df_demographic['Age'].copy()
# Write to file, if wanted
# df_reident_synth.to_csv('data/data_reidentification_synthesizer.csv', sep=",", encoding="utf-8", index=False)

### Encoding of data
We encode the input data using the `do_encode` function to make it compatible with the MWEM synthesizer.
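
As a rough illustration of what such an encoding step can look like, the sketch below maps each categorical column to integer codes with `pd.factorize`. The helper `encode_columns` and the toy data are hypothetical; the actual `do_encode` function in `reident_tools.py` may work differently (for example, it also receives the list of diseases).

```python
import pandas as pd


def encode_columns(df, columns):
    """Hypothetical helper: create one integer-coded `<name>_encoded` column per input column."""
    encoded = pd.DataFrame(index=df.index)
    mappings = {}
    for col in columns:
        # sort=True makes the codes follow the sorted order of the categories
        codes, categories = pd.factorize(df[col], sort=True)
        encoded[f"{col}_encoded"] = codes
        mappings[col] = list(categories)  # keep the mapping to decode later
    return encoded, mappings


toy = pd.DataFrame({"Gender": ["F", "M", "F"], "Zip": ["80331", "10115", "80331"]})
encoded, mappings = encode_columns(toy, ["Gender", "Zip"])
print(encoded)
print(mappings)
```

Keeping the mapping around makes it possible to translate synthesized integer codes back into readable categories.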

In [None]:
# Have a quick glance at the data
df_reident_synth.head()

In [None]:
# Encode the data set and display it
df_reident_encoded = reident.do_encode(df_reident_synth, ['Gender', 'Age', 'Zip', 'Diagnosis'], reident.diseases)
df_reident_encoded.head()

### Synthesizing the demographic data
Finally, we synthesize the data with the MWEM synthesizer. Here are some considerations regarding the parameters:
- `q_count` The number of queries. Should be higher than the number of iterations (between 5 and 10 times the number of iterations). Default is 400.
- `epsilon` The privacy parameter. 3.0 is a good starting point. Lower values correspond to higher levels of privacy.
- `iterations` Comparable to epochs in deep learning. Between 30 and 60. Fewer iterations mean that the privacy budget is split across fewer measurements, so it is used more efficiently. Default is 30.
- `mult_weights_iterations` Should be less than the total number of iterations (usually between 5 and 50). Default is 20.
- `splits` MWEM automatically splits the features using `split_factor` if this field isn’t specified. This field overrides `split_factor` and creates custom, user-specified splits of features, e.g. for a set with 5 features, [[0,3],[1,2,4]] implies that features 0 and 3 are correlated, and features 1, 2 and 4 are correlated.
- `split_factor` Choose the highest split factor that does not affect performance. Start with the number of features, then repeatedly divide by 2 (rounding up).
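
To give an intuition for how `epsilon`, the number of queries and the number of iterations interact, here is a simplified NumPy sketch of the MWEM idea on a one-dimensional histogram. It is not the SmartNoise implementation: the function `mwem_histogram`, the toy histogram and the random counting queries are assumptions made for illustration, and features such as splits or repeated multiplicative-weights passes are left out.

```python
import numpy as np


def mwem_histogram(true_hist, queries, epsilon, iterations, seed=0):
    """Simplified MWEM sketch: returns a synthetic histogram with the same total count."""
    rng = np.random.default_rng(seed)
    n = true_hist.sum()
    # Start from a uniform histogram with the same total count
    synth = np.full(len(true_hist), n / len(true_hist))
    # Each round spends epsilon / iterations: half on selection, half on measurement
    eps_half = epsilon / (2 * iterations)
    for _ in range(iterations):
        # Exponential mechanism: prefer queries on which the synthetic data is most wrong
        errors = np.abs(queries @ (true_hist - synth))
        scores = eps_half * errors / 2
        probs = np.exp(scores - scores.max())  # subtract max for numerical stability
        probs /= probs.sum()
        idx = rng.choice(len(queries), p=probs)
        # Laplace mechanism: noisy measurement of the selected query
        measurement = queries[idx] @ true_hist + rng.laplace(scale=1 / eps_half)
        # Multiplicative weights update towards the noisy measurement
        error = measurement - queries[idx] @ synth
        synth = synth * np.exp(queries[idx] * error / (2 * n))
        synth *= n / synth.sum()  # renormalize to the original total
    return synth


true_hist = np.array([40.0, 25.0, 20.0, 10.0, 5.0])
query_rng = np.random.default_rng(1)
# 20 random 0/1 counting queries over the 5 histogram bins
queries = query_rng.integers(0, 2, size=(20, len(true_hist))).astype(float)
synth_hist = mwem_histogram(true_hist, queries, epsilon=3.0, iterations=10)
print(np.round(synth_hist, 1))
```

Because the budget is divided by the number of iterations, every additional round makes each individual selection and measurement noisier, which is the trade-off behind the recommendations above.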

In [None]:
%%time
# Fit the synthesizer to the data set (note: `synthetic_data` holds the synthesizer object, not the generated rows)
synthetic_data = MWEMSynthesizer(q_count = 400,
                        epsilon = 3.00,
                        iterations = 60,
                        mult_weights_iterations = 40,
                        splits = [],
                        split_factor = 1)
synthetic_data.fit(df_reident_encoded.to_numpy())

In [None]:
%%time
# Convert to dataframe
df_synthesized = pd.DataFrame(synthetic_data.sample(int(df_reident_encoded.shape[0])), columns=df_reident_encoded.columns)

In [None]:
# Write it to file, if wanted
# df_synthesized.to_csv('data/data_synthesized.csv', sep=",", encoding="utf-8", index=False)

### Compare original and synthetic data
Below, we use the `create_histogram` function to illustrate the distribution of __diagnoses__ in both data sets.
Ideally, the two bars for each diagnosis do not differ much from each other. The more similar the bars are for a given disease, the less information is lost during synthesis.
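
In addition to the visual comparison, the difference between two distributions can be summarized in a single number. The sketch below computes the total variation distance between the relative frequencies of two columns; the function name and the toy series are assumptions for illustration and are not part of `reident_tools.py`.

```python
import pandas as pd


def total_variation_distance(original, synthetic):
    """Half the sum of absolute differences between the relative frequencies."""
    p = original.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    # Align on the union of categories; a missing category has frequency 0
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * (p - q).abs().sum()


# Toy encoded diagnosis columns (invented values)
original = pd.Series([0, 0, 1, 1, 2, 2, 2, 3])
synthetic = pd.Series([0, 1, 1, 1, 2, 2, 3, 3])
tvd = total_variation_distance(original, synthetic)
print(f"Total variation distance: {tvd:.2f}")
```

A value of 0 means that both distributions are identical, while 1 means that they do not overlap at all.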

In [None]:
reident.create_histogram(df_reident_encoded, df_synthesized, 'Diagnosis_encoded', reident.diseases)

## Reidentification Attack II - Synthesized Demographic Data + Public Medical Data (non-grouped)
Finally, we try the reidentification attack on the synthesized data using the `try_reidentification_noise` function.
The synthesized data set contains newly generated combinations of demographic values, so we no longer deal with the _raw/real_ records. A potential match may still be detected, but it is unlikely to be an actual match.

In [None]:
print('Medical Dataset:')
display(df_medical_synth.sample(5))
print('\nSynthesized Demographic Dataset:')
display(df_synthesized.sample(5))

### Perform the attack
Now, we perform the reidentification attack with the synthetic data, again using a combinatorial approach.

**TIP**

The following cell takes several minutes to complete. Start the execution of the cell first and then try to understand the details of the code being executed.
The `try_reidentification_noise` method is implemented in the `reident_tools.py` file.

In [None]:
reident_attack_2 = reident.try_reidentification_noise(df_synthesized, df_medical_synth, logger)

### Results of the attack
Below, we show the number of potential matches and take a glance at the data.

In [None]:
print(f'Found {len(reident_attack_2)} potential matches!')
reident_attack_2.head()