Microsoft Differential Privacy Whitepaper Collateral Notebooks Part 2
# Data Generator for Reidentification Attack Simulations
<img src="images/code.png" width=1000 />

This notebook generates the fictional datasets used to simulate a reidentification attack:
- Set input data and define variables
- Generate public medical data set
- Generate demographic data set
- Overview of the datasets

In [7]:
# Install required libraries, uncomment if needed
# %pip install faker
# %pip install zipcodes
# %pip install opendp-whitenoise
# %pip install -U pandas

In [8]:
import pandas as pd
import numpy as np
import random
import string
import uuid
import time
from faker import Faker
from datetime import datetime
import matplotlib.pyplot as plt
import zipcodes as zc
from tqdm import tqdm
import reident_tools as reident
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

In [9]:
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'



## Dataset Generation
In the first step, we generate a random medical dataset. It consists of the following parts:
- Published, medical dataset
    - Demographic information
        - Gender
        - Age group
        - Zip code (shortened)
    - Sensitive Medical information
        - Diagnosis
        - Treatment
        - Outcome
- Data collection that a potential attacker could exploit
    - Demographic information
        - Gender
        - Age (full)
        - Zip code (full)
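The difference between the published fields (age group, shortened zip code) and the attacker's fields (full age, full zip code) can be illustrated with a minimal sketch. The helper names `age_group` and `mask_zip` are hypothetical and are not part of `reident_tools`; the ten-year buckets and three-digit zip prefix are assumptions read off the sample output in this notebook.

```python
def age_group(age: int, width: int = 10) -> str:
    """Map an exact age to a bucket label such as '10-19' (assumed bucket width: 10)."""
    lower = (int(age) // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_zip(zip_code, keep: int = 3) -> str:
    """Keep the first `keep` digits of a 5-digit zip code and mask the rest with '*'."""
    digits = str(zip_code).zfill(5)  # restore leading zeros if the zip was stored as int
    return digits[:keep] + "*" * (len(digits) - keep)


print(age_group(14))      # 10-19
print(mask_zip("65475"))  # 654**
```

Generalizing both fields this way is what lets several people share one combination of gender, age group, and zip prefix.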

### Set input data and define variables

In [10]:
# Set language for data set generation
lang = 'en-US'

### Generate public medical data set
- In this step, we generate the public medical data using the `get_medical_data` function.
- The parameters __n__ (number of base records) and __k__ (degree of k-anonymization) control the size of the result: the function returns n × k records.
- For example, n = 100 and k = 3 yields a set of 300 records, anonymized with k = 3.
- We also pass the language, a logger, and a dictionary of disease numbers (`reident.disease_numbers`) to make the later analysis more comprehensive.
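To make the meaning of k concrete: a dataset is k-anonymous when every combination of quasi-identifier values occurs at least k times. The following is a minimal sketch of such a check, assuming `Gender`, `Age`, and `Zip` are the quasi-identifiers; the function names and the toy records are hypothetical and are not taken from `reident_tools` or from the generated data.

```python
from collections import Counter

# Assumed quasi-identifier columns of the published medical dataset
QUASI_IDENTIFIERS = ("Gender", "Age", "Zip")


def smallest_group(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def is_k_anonymous(records, k, quasi_identifiers=QUASI_IDENTIFIERS):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k


# Toy records (illustrative only): two groups of three identical quasi-identifiers
records = (
    [{"Gender": "F", "Age": "10-19", "Zip": "654**"}] * 3
    + [{"Gender": "M", "Age": "30-39", "Zip": "277**"}] * 3
)

print(smallest_group(records))       # 3
print(is_k_anonymous(records, k=3))  # True
print(is_k_anonymous(records, k=4))  # False
```

The same check could be run on the rows of `df_medical` (for example via `df_medical.to_dict("records")`) to confirm that the generated data meets the requested k.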

In [11]:
%%time
df_medical = reident.get_medical_data(10000, lang, reident.disease_numbers, 3, logger)
df_medical[['Gender', 'Age', 'Zip', 'Diagnosis', 'Treatment', 'Outcome']].head()

INFO:root:Generating demographic examples
100%|██████████| 10000/10000 [2:42:35<00:00,  1.03it/s] 
INFO:root:Finished generating demographic examples, had to mitigate 5756 duplicate tests.


CPU times: user 2h 35min 12s, sys: 5min 8s, total: 2h 40min 20s
Wall time: 2h 42min 38s


Unnamed: 0,Gender,Age,Zip,Diagnosis,Treatment,Outcome
0,F,10-19,654**,High Blood Pressure,25,intensive care
1,F,10-19,654**,COPD,48,unchanged
2,F,10-19,654**,High Blood Pressure,38,intensive care
3,F,30-39,277**,Heart Disease,31,unchanged
4,F,30-39,277**,Diabetes,34,unchanged


### Generate demographic data set
- In this section, we complement the medical dataset created above with full demographic information.
- For this purpose, we use the `get_demographic_information` function, which takes the dataframe `df_medical` created above and the language.
- The result is stored as a separate dataframe, `df_demographic`.
- The full demographic data represents what an attacker could know. We use it later to attempt a reidentification attack on the public medical data.
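How full demographic data supports a reidentification attempt can be sketched as a simple linkage: generalize each known person's age and zip code the same way the published data was generalized, then look up matching medical rows. This is only an illustration under assumptions (ten-year age buckets, three-digit zip prefix); `quasi_key` and `link` are hypothetical names, and the rows are a small excerpt in the shape of the sample output shown in this notebook.

```python
from collections import defaultdict


def quasi_key(gender, age, zip_code):
    """Generalize exact attacker values to the published form (assumed scheme)."""
    lower = (int(age) // 10) * 10
    return (gender, f"{lower}-{lower + 9}", str(zip_code).zfill(5)[:3] + "**")


def link(attacker_rows, medical_rows):
    """For each named person, collect the diagnoses of all matching medical rows."""
    by_key = defaultdict(list)
    for gender, age_group, zip_masked, diagnosis in medical_rows:
        by_key[(gender, age_group, zip_masked)].append(diagnosis)
    return {
        name: by_key.get(quasi_key(gender, age, zip_code), [])
        for name, gender, age, zip_code in attacker_rows
    }


# Excerpt of published medical rows: (Gender, Age, Zip, Diagnosis)
medical = [
    ("F", "10-19", "654**", "High Blood Pressure"),
    ("F", "10-19", "654**", "COPD"),
    ("F", "10-19", "654**", "High Blood Pressure"),
    ("F", "30-39", "277**", "Heart Disease"),
    ("F", "30-39", "277**", "Diabetes"),
]

# Excerpt of attacker rows: (Name, Gender, Age, Zip)
attacker = [
    ("Lynn Romero", "F", 10, "65418"),
    ("Tara Williams", "F", 30, "27727"),
]

matches = link(attacker, medical)
for name, diagnoses in matches.items():
    print(name, len(diagnoses), sorted(set(diagnoses)))
```

On this excerpt, each person matches several medical rows rather than exactly one, which is the protection k-anonymization aims for. The attacker still learns the set of possible diagnoses for each person.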

In [13]:
%%time
df_demographic = reident.get_demographic_information(df_medical, lang, logger)
df_demographic[['Name', 'Gender', 'Age', 'Zip']].head()

INFO:root:Create demographic data set based on medical data.
100%|██████████| 30000/30000 [00:56<00:00, 533.79it/s]
INFO:root:finished k-anonymization
INFO:root:Returning dataset with length of 30000


CPU times: user 58.9 s, sys: 319 ms, total: 59.2 s
Wall time: 59 s


Unnamed: 0,Name,Gender,Age,Zip
0,Lynn Romero,F,10,65418
1,Shelly Green,F,14,65475
2,Emily Harris,F,10,65484
3,Tara Williams,F,30,27727
4,Tiffany Scott,F,36,27772


### Overview of the datasets
In this section, we display random samples of both datasets generated above to compare their structure.

In [14]:
# Display both data sets generated above
print(f'Published Medical Data ({len(df_medical)} rows):')
display(df_medical[['Gender', 'Age', 'Zip', 'Diagnosis', 'Treatment', 'Outcome']].sample(10))
print(f'\nAttacker Dataset with demographic data ({len(df_demographic)} rows):')
display(df_demographic[['Name', 'Gender', 'Age', 'Zip']].sample(10))

Published Medical Data (30000 rows):


Unnamed: 0,Gender,Age,Zip,Diagnosis,Treatment,Outcome
3060,M,60-69,206**,Depression,26,recovered
25028,M,80-89,705**,Diabetes,41,unchanged
17968,F,30-39,812**,Diabetes,29,intensive care
5297,F,60-69,317**,Heart Disease,35,recovered
20698,F,80-89,102**,Depression,29,recovered
11810,M,30-39,832**,Heart Disease,33,intensive care
9473,M,50-59,638**,Osteoporosis,42,unchanged
21931,F,80-89,688**,Depression,34,unchanged
23069,F,30-39,737**,Depression,46,unchanged
10989,F,40-49,449**,Arthritis,40,unchanged



Attacker Dataset with demographic data (30000 rows):


Unnamed: 0,Name,Gender,Age,Zip
2584,Daniel Vasquez,M,38,78343
22108,Jenny Bonilla,F,71,30521
1894,Karen Smith,F,74,50445
27787,Christopher Rodriguez,M,49,44088
5255,Alexander Fry,M,41,40749
6724,Kenneth Owen,M,26,13474
5440,Kathleen Kennedy,F,89,10162
22973,Mary Ruiz,F,47,68186
22259,Jeffrey Carey,M,57,41695
3573,William Barnes DDS,M,81,29238


In [15]:
# Create a copy of the medical dataset and replace the generalized Age and Zip
# values with the exact values from the demographic dataset.
# We need this copy for the reidentification attack based on the synthesized values later on.
# Because that data is protected by differential privacy, k-anonymization is not needed there.
# Note: the assignments align rows by index, so both dataframes must share the same index.
df_medical_synth = df_medical.copy()
df_medical_synth['Age'] = df_demographic['Age'].copy()
# Note: casting Zip to int drops any leading zeros (e.g. '02134' becomes 2134)
df_medical_synth['Zip'] = df_demographic['Zip'].astype(int).copy()
df_medical_synth[['Gender', 'Age', 'Zip', 'Diagnosis', 'Treatment', 'Outcome']].head()

Unnamed: 0,Gender,Age,Zip,Diagnosis,Treatment,Outcome
0,F,10,65418,High Blood Pressure,25,intensive care
1,F,14,65475,COPD,48,unchanged
2,F,10,65484,High Blood Pressure,38,intensive care
3,F,30,27727,Heart Disease,31,unchanged
4,F,36,27772,Diabetes,34,unchanged


In [16]:
# Write to files
df_medical.to_csv('data/data_medical.csv', sep=",", encoding="utf-8", index=False)
df_demographic.to_csv('data/data_demographic.csv', sep=",", encoding="utf-8", index=False)
df_medical_synth.to_csv('data/data_medical_synthesizer.csv', sep=",", encoding="utf-8", index=False)

In [None]:
# Read files
# df_medical = pd.read_csv('data/data_medical.csv', sep=",", encoding="utf-8").infer_objects()
# df_demographic = pd.read_csv('data/data_demographic.csv', sep=",", encoding="utf-8").infer_objects()
# df_medical_synth = pd.read_csv('data/data_medical_synthesizer.csv', sep=",", encoding="utf-8").infer_objects()