## DIFFERENTIAL PRIVACY USING PYDP

This experiment implements differential privacy with PyDP. It first performs
a membership inference attack on exact query results, then demonstrates the
basic feasibility of a differentially private algorithm against the same attack.


### Imports
Import all required libraries.

In [None]:
import pydp as dp # by convention our package is to be imported as dp (dp for Differential Privacy!)
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import pandas as pd


### Fetching required data
Load and merge the sample dataset from 5 different files.
This step simulates receiving data from different data sources (data owners).

In [None]:
base_dir = '../../'
paths = ['data/dp_data/01.csv', 'data/dp_data/02.csv', 'data/dp_data/03.csv', 'data/dp_data/04.csv', 'data/dp_data/05.csv']

# Read every CSV file and stack the resulting frames into one dataset
combined_df_temp = [pd.read_csv(base_dir + path, sep=',', engine='python') for path in paths]
original_dataset = pd.concat(combined_df_temp)
display(original_dataset.head())
print(original_dataset.shape)

### Creating a Replicated (Parallel) Database
Create a copy of the initial dataset that differs from it by exactly one record: the first record is removed.

In [None]:
redact_dataset = original_dataset.copy()
redact_dataset = redact_dataset[1:]
display(original_dataset.head())
display(redact_dataset.head())

#### Successful Membership inference
This step demonstrates that an attribute value of a single record can be recovered
from aggregate results, which could identify the individual and consequently disclose personal information.
The membership inference attack subtracts the sum of the `sales_amount` attribute over the replicated dataset from the sum over the original dataset.
The result is the exact `sales_amount` value of the first record.

In [None]:
sum_original_dataset = round(sum(original_dataset['sales_amount'].to_list()), 2)
sum_redact_dataset = round(sum(redact_dataset['sales_amount'].to_list()), 2)
sales_amount_Osbourne = round((sum_original_dataset - sum_redact_dataset), 2)
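# Select the column by name rather than by position, so the check does not depend on column order
assert sales_amount_Osbourne == original_dataset['sales_amount'].iloc[0]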
print(f"{sum_original_dataset} - {sum_redact_dataset} = {sales_amount_Osbourne}")
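
The attack does not depend on the CSV files. As a minimal sketch, the same subtraction can be reproduced on a handful of made-up values; the `sales` list below is hypothetical and stands in for the `sales_amount` column, with its first entry playing the role of the target record.

```python
# Hypothetical sales amounts; the first value is the record the attacker targets.
sales = [31.94, 120.50, 75.25, 210.00]

# Exact sums over the full data and over the data without the first record
sum_with_target = round(sum(sales), 2)
sum_without_target = round(sum(sales[1:]), 2)

# Subtracting the two exact sums isolates the removed record's value
recovered = round(sum_with_target - sum_without_target, 2)
print(f"{sum_with_target} - {sum_without_target} = {recovered}")
```

Any pair of exact aggregates that differ by one record leaks that record in this way, which is why the following steps replace the exact sum with a noisy one.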

### Differentially Private Sum for original dataset
This step computes a differentially private sum over the original dataset for later comparison.
PyDP implements the differentially private sum with the Laplace mechanism;
its behaviour is configured through the following parameters:
- `epsilon` - privacy budget (smaller values add more noise and give stronger privacy)
- `lower_bound` and `upper_bound` - bounds to which each input value is clamped before summation
- `dtype` - data type of the values

In [None]:
dp_sum_original_dataset = BoundedSum(epsilon=1.5, lower_bound=5, upper_bound=250, dtype='float')
dp_sum_og = dp_sum_original_dataset.quick_result(original_dataset['sales_amount'].to_list())
dp_sum_og = round(dp_sum_og, 2)
print(dp_sum_og)
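
To make the roles of `epsilon` and the bounds concrete, here is a simplified, standard-library sketch of a Laplace-noised bounded sum. It is an illustration only, not PyDP's implementation: the helper names `clamp`, `laplace_noise` and `noisy_bounded_sum` are hypothetical, the sensitivity is assumed to be the largest absolute bound, and the input values are made up.

```python
import random

def clamp(value, lower, upper):
    """Force a value into the interval [lower, upper]."""
    return max(lower, min(upper, value))

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_bounded_sum(values, epsilon, lower, upper, rng):
    """Clamp each value, sum the results, and add Laplace noise scaled by sensitivity / epsilon."""
    clamped_total = sum(clamp(v, lower, upper) for v in values)
    # Assumption: one record can change the clamped sum by at most max(|lower|, |upper|)
    sensitivity = max(abs(lower), abs(upper))
    return clamped_total + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
sales = [120.0, 80.5, 300.0, 42.25]  # hypothetical values; 300.0 is clamped to 250
print(noisy_bounded_sum(sales, epsilon=1.5, lower=5, upper=250, rng=rng))
```

Under these assumptions a smaller `epsilon` or a wider bound range increases the noise scale, which is the trade-off between accuracy and privacy that the parameters control.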

### Differentially Private Sum for replicated (parallel) dataset
This step computes the differentially private sum over the replicated dataset.

In [None]:
dp_redact_dataset = BoundedSum(epsilon=1.5, lower_bound=5, upper_bound=250, dtype='float')
dp_redact_dataset.add_entries(redact_dataset['sales_amount'].to_list())
dp_sum_redact = round(dp_redact_dataset.result(), 2)
print(dp_sum_redact)

### Summary
The outputs below show that the differentially private sums over the original
and replicated datasets are similar, while their difference no longer discloses
the `sales_amount` value of the first record.
Because each result carries its own random noise, the subtraction yields the true value
plus noise, so the membership inference attack used earlier no longer succeeds
and the privacy of the first record is preserved.
The experiment is therefore considered successful.

In [None]:
print(f"Sum of sales_amount in the original dataset: {sum_original_dataset}")
print(f"Sum of sales_amount in the original dataset with DP: {dp_sum_og}")
assert dp_sum_og != sum_original_dataset


print(f"Sum of sales_amount in the second dataset: {sum_redact_dataset}")
print(f"Sum of sales_amount in the second dataset with DP: {dp_sum_redact}")
assert dp_sum_redact != sum_redact_dataset


print(f"Difference in Sum with DP: {round(dp_sum_og - dp_sum_redact, 2)}")
print(f"Actual Difference in Sum: {sales_amount_Osbourne}")
assert round(dp_sum_og - dp_sum_redact, 2) != sales_amount_Osbourne
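
The assertions above check a single pair of noisy results. As a rough, standard-library sketch of why the subtraction is unreliable in general, the simulation below repeats the attack many times and counts how often the noisy difference lands within 1.0 of the true value. The Laplace scale `upper_bound / epsilon` is an assumed simplification rather than PyDP's internal behaviour, and `true_value` is a made-up amount.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

rng = random.Random(42)
epsilon, upper_bound = 1.5, 250  # same settings as the BoundedSum calls in this notebook
scale = upper_bound / epsilon    # assumed noise scale: sensitivity / epsilon
true_value = 31.94               # hypothetical sales_amount of the removed record

trials = 10_000
near_hits = 0
for _ in range(trials):
    # Each noisy sum carries independent noise, so the difference carries two noise terms
    noisy_difference = true_value + laplace_noise(scale, rng) - laplace_noise(scale, rng)
    if abs(noisy_difference - true_value) <= 1.0:
        near_hits += 1

hit_rate = near_hits / trials
print(f"Share of attacks within 1.0 of the true value: {hit_rate:.4f}")
```

With these settings only a small fraction of simulated attacks come close to the true value, which is consistent with the result observed in this experiment.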