# CS211: Data Privacy
## Homework 6

In [1]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

adult = pd.read_csv('https://github.com/jnear/cs211-data-privacy/raw/master/homework/adult_with_pii.csv')


## Question 1 (10 points)

(Reference [Chapter 7](https://uvm-plaid.github.io/programming-dp/notebooks/ch7.html) of the textbook)

Consider the following minimum query:

In [136]:
## Cache the sorted ages, because we will use them a lot.
age_lower = 0
age_upper = 100
sorted_ages = adult['Age'].clip(lower=age_lower, upper=age_upper).sort_values()

def min_age():
    clipped_ages = adult['Age'].clip(lower=0, upper=100)
    return clipped_ages.min()

def ls_min():
    return max(sorted_ages.iloc[0] - age_lower, sorted_ages.iloc[1] - sorted_ages.iloc[0])

print('Actual minimum age:', min_age())
print('Local sensitivity of the minimum:', ls_min())

Actual minimum age: 17
Local sensitivity of the minimum: 17


Implement `ls_min_at_distance`, an upper bound on the local sensitivity of the `min_age` query at distance $k$, and `dist_to_high_ls_min`, an upper bound on the distance from the true dataset to one with local sensitivity greater than or equal to $s_p$.

In [137]:
# How does adding or removing k people affect the min?
# Removing the k youngest people shifts the sorted list forward by k positions,
# so the minimum of the worst-case dataset at distance k is sorted_ages.iloc[k].
# First term: adding one person at age_lower then drops the min all the way to age_lower.
# Second term: removing the current min raises it to the next-smallest age.
def ls_min_at_distance(k):
    return max(sorted_ages.iloc[k] - age_lower, sorted_ages.iloc[k + 1] - sorted_ages.iloc[k])

def dist_to_high_ls_min(s_p):
    k = 0
    
    while ls_min_at_distance(k) < s_p:
        k += 1
    
    return k

In [138]:
# TEST CASE
assert dist_to_high_ls_min(18) == 395
assert dist_to_high_ls_min(20) == 1657
assert dist_to_high_ls_min(25) == 5570
assert dist_to_high_ls_min(30) == 9711

## Question 2 (10 points)

Implement `ptr_min`, which should use the propose-test-release framework to calculate a differentially private estimate of the minimum age. If the test fails, return `None`.

In [139]:
def ptr_min(s_p, epsilon, delta):
    k = dist_to_high_ls_min(s_p)
    noisy_distance = laplace_mech(k, 1, epsilon)
    threshold = np.log(2/delta)/(2*epsilon)

    if noisy_distance >= threshold:
        return laplace_mech(min_age(), s_p, epsilon)
    else:
        return None

# proposed sensitivity: 21
# epsilon, delta = (1.0, 1e-5)
ptr_min(21, 1.0, 1e-5)

76.30014320592039
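
As a rough check on when the test in `ptr_min` passes, the sketch below evaluates the threshold expression on its own. The helper name `ptr_threshold` is hypothetical; the formula is copied from `ptr_min` above.

```python
import numpy as np

def ptr_threshold(epsilon, delta):
    # Hypothetical helper: the cutoff that the noisy distance is compared against
    # in ptr_min, i.e. log(2/delta) / (2*epsilon).
    return np.log(2 / delta) / (2 * epsilon)

print(ptr_threshold(1.0, 1e-5))  # roughly 6.1
print(ptr_threshold(0.1, 1e-5))  # roughly 61.0
```

With $\epsilon = 0.1$ the cutoff is around 61, far below the distance of 1657 that the Question 1 test case reports for $s_p = 20$, so the test should almost always pass there. The threshold does not depend on the data; only the distance does.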

In [140]:
# TEST CASE
true_min = min_age()
trials = [ptr_min(20, 0.1, 1e-5) for _ in range(20)]
errors = [pct_error(true_min, t) for t in trials]
print(np.mean(errors))
assert np.mean(errors) < 2000
assert np.mean(errors) > 500

assert ptr_min(0.0001, 0.1, 1e-5) is None

948.474630997953


## Question 3 (5 points)

In 2-5 sentences, answer the following:

- Can `ptr_mean` give a useful answer for the minimum age?
- If so, what is a good proposed sensitivity $s_p$ for the analyst to use? If not, why not?

- `ptr_mean` as implemented in the textbook would probably not give a useful answer for the minimum age, because it is a completely different function: it calculates the mean age of the dataset, not the minimum. The two would only agree in special cases, such as when everyone in the dataset is the same age.
- I'm guessing that was a typo, so I'll assess `ptr_min` as well. Most of the time it gives an answer that is fairly close to the true minimum, but the results are still sporadic, jumping anywhere between -30 and 30.
- A proposed sensitivity of 20 seems solid based on the results from my implementation of `ptr_min`. Even 1 greater or less than 20 seems to pull the result much closer to 0. I think this is because of how the noisy distance is checked against the threshold, which does not depend on our data, so noisier answers get through every once in a while.
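
One way to sanity-check these observations is to look at the noise scale directly. The Laplace mechanism adds noise with scale $s_p / \epsilon$, and the expected absolute value of Laplace noise equals its scale. The sketch below (the helper name `expected_pct_error` is hypothetical) turns that into an expected percent error relative to the true minimum of 17.

```python
def expected_pct_error(true_value, s_p, epsilon):
    # Expected |Laplace noise| equals the scale s_p / epsilon,
    # expressed here as a percentage of the true value.
    scale = s_p / epsilon
    return scale / abs(true_value) * 100.0

print(expected_pct_error(17, 20, 0.1))  # settings used in the Question 2 test case
print(expected_pct_error(17, 20, 1.0))  # same s_p with a larger epsilon
```

For $s_p = 20$ and $\epsilon = 0.1$ this gives an expected error above 1000%, which is in line with the mean error of about 948% printed by the test cell and with the bounds of 500 and 2000 that the test asserts.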

## Question 4 (10 points)

Consider the `median_age` function, which calculates the *median* age (this version truncates if the length of the dataset is even), and the `ls_median` function, an upper bound on the local sensitivity of the median query:

In [141]:
## Cache the sorted ages, because we will use them a lot.
sorted_ages = adult['Age'].clip(lower=0, upper=100).sort_values()

def median_age():
    idx = int(len(adult)/2)
    return sorted_ages.iloc[idx]

print('Median age:', median_age())

def ls_median():
    idx = int(len(adult)/2)
    return max(sorted_ages.iloc[idx] - sorted_ages.iloc[idx-1],
               sorted_ages.iloc[idx+1] - sorted_ages.iloc[idx])

print('Local sensitivity of the median:', ls_median())

Median age: 37
Local sensitivity of the median: 0


Note that the local sensitivity of the median is 0. Implement the functions `ls_median_at_distance`, which calculates the local sensitivity at distance $k$ of the median query above, and the corresponding `dist_to_high_ls_median`.

*Hint*: note that the ages are clipped. Think about the worst-case scenario of adding or removing $k$ rows.

In [175]:
def ls_median_at_distance(k):
    true_med = int(len(adult)/2) # index of the true median in sorted_ages

    # Look k positions below and above the median to account for k added or removed rows
    # Same logic as the min function
    return max(sorted_ages.iloc[true_med] - sorted_ages.iloc[true_med-k],
               sorted_ages.iloc[true_med+k] - sorted_ages.iloc[true_med])

def dist_to_high_ls_median(s_p):
    # Same search loop as for the min
    k = 0
    
    while ls_median_at_distance(k) < s_p:
        k += 1
    
    return k

ls_median_at_distance(401)

1

In [143]:
assert ls_median_at_distance(500) == 1
assert ls_median_at_distance(5000) == 6
assert ls_median_at_distance(10000) == 14
assert ls_median_at_distance(15000) == 28

## Question 5 (10 points)

Use the propose-test-release framework, plus `dist_to_high_ls_median`, to answer the median query with differential privacy.

In [278]:
def ptr_median(s_p, epsilon, delta):
    # Again following the same logic for the ptr test as the min
    k = dist_to_high_ls_median(s_p)
    noisy_distance = laplace_mech(k, 1, epsilon)
    threshold = np.log(2/delta)/(2*epsilon)

    if noisy_distance >= threshold:
        return laplace_mech(median_age(), s_p, epsilon)
    else:
        return None

ptr_median(20, 1.0, 1e-5)

26.921395062519597

In [146]:
# TEST CASE
true_median = median_age()
trials = [ptr_median(0.05, 0.1, 1e-5) for _ in range(20)]
errors = [pct_error(true_median, t) for t in trials]
assert np.mean(errors) < 10

## Question 6 (10 points)

In 2-5 sentences, answer the following:

- At roughly what distance does the local sensitivity of the median become non-zero?
- For what proposed sensitivity does `ptr_median` fail the test (i.e. return `None`)?
- What does this mean for the amount of noise required to release the median with differential privacy?

- The local sensitivity of the median appears to become nonzero at a distance of about 401. This must be because the median does not change when only a few rows are added or removed, especially in a large dataset with so many repeated ages.
- Any proposed sensitivity less than 53 will work, and anything less than 1 eventually reaches a floor of 37 too. Anything greater than 53 does not return `None`; instead it raises an index-out-of-bounds error. This must be because reaching that proposed sensitivity requires a distance greater than what exists in the list.
- This means we should pick a proposed sensitivity between 0 and 1, but one that does not hit the floor of 37 too often.
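
The index-out-of-bounds error described above could be avoided by treating positions that fall off either end of the sorted list as the clipping bounds, which is one reading of the hint in Question 4. The sketch below shows this on made-up data; `toy_sorted`, `value_at`, and `toy_ls_median_at_distance` are hypothetical names, and this is an alternative under that assumption, not the implementation used above.

```python
import numpy as np

# Hypothetical toy data: sorted ages, assumed to be clipped to [0, 100].
toy_sorted = np.array([20, 30, 37, 37, 37, 41, 60])
toy_lower, toy_upper = 0, 100

def value_at(idx):
    # Positions past either end of the list fall back to the clipping bounds.
    if idx < 0:
        return toy_lower
    if idx >= len(toy_sorted):
        return toy_upper
    return toy_sorted[idx]

def toy_ls_median_at_distance(k):
    mid = len(toy_sorted) // 2
    return max(value_at(mid) - value_at(mid - k),
               value_at(mid + k) - value_at(mid))

print(toy_ls_median_at_distance(1))    # neighbors equal the median, so no change
print(toy_ls_median_at_distance(3))    # reaches both ends of the list
print(toy_ls_median_at_distance(100))  # far past the ends: bounded by the clipping range
```

With this variant the bound levels off at the larger of (median - lower) and (upper - median), so a search loop like `dist_to_high_ls_median` would need its own cap on $k$ for proposed sensitivities above that value.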

## Question 7 (20 points)

Use the sample-and-aggregate framework to release the minimum age in the adult dataset. Reference [Chapter 7](https://uvm-plaid.github.io/programming-dp/notebooks/ch7.html).

In [274]:
def f(chunk):
    return chunk.min()

def saa_min_age(k, epsilon):
    df = adult['Age']
    chunk_size = int(np.ceil(df.shape[0] / k))
    chunks = [df[i:(i + chunk_size)] for i in range(0,df.shape[0],chunk_size)]
    
    results = [f(chunk) for chunk in chunks]
    
    u = 18
    l = 0
    
    clipped_results = np.clip(results, l, u)
    dp_result = laplace_mech(np.mean(clipped_results), (u - l) / k, epsilon)
    return dp_result

saa_min_age(50, 1.0)

16.854058687372728

In [229]:
# TEST CASE
true_min = adult['Age'].min()
trials = [saa_min_age(500, 1.0) for _ in range(20)]
errors = [pct_error(true_min, t) for t in trials]
print(np.mean(errors))
assert np.mean(errors) > 0
assert np.mean(errors) < 10

2.5857512921349395


## Question 8 (10 points)

In 5-6 sentences, answer the following:

- What clipping values did you choose for clipping the query outputs on each chunk? How did you pick them?
- Is 500 a good value for the number of chunks $k$? How does making $k$ larger or smaller change the results?
- How does the sample-and-aggregate approach compare to propose-test-release or global sensitivity for the minimum?

- The clipping values I used were 0 for the lower bound and 18 for the upper bound. I looked at the data beforehand and knew these were valid clipping parameters that would get the job done. Had I not known this, I would still have chosen an upper bound of around 20, since it is probably safe to assume that a dataset this large contains people younger than 20.
- No, 500 would not be a good number of chunks, because not nearly enough noise is added to the output. In fact, making the number of chunks any larger than 50 adds very little noise at all. This means a smaller $k$ adds more noise to our answers.
- Sample-and-aggregate seems to be the best approach for the minimum: propose-test-release gave answers that were much too noisy, and global sensitivity ignores the fact that we want the minimum while still requiring a sensitivity of (upper - lower) for the dataset.
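
The effect of $k$ on the noise can be read off the sensitivity used in `saa_min_age`, which is `(u - l) / k`. The sketch below (the helper name `saa_noise_scale` is hypothetical) prints the resulting Laplace scale for a few chunk counts, using the clipping bounds 0 and 18 chosen above.

```python
def saa_noise_scale(k, epsilon, lower=0, upper=18):
    # Laplace scale used by sample-and-aggregate: sensitivity (upper - lower) / k,
    # divided by epsilon.
    return (upper - lower) / k / epsilon

for k in (5, 50, 500):
    print(k, saa_noise_scale(k, 1.0))
```

The scale shrinks by a factor of 10 each time $k$ grows by a factor of 10, which matches the observation that large $k$ adds very little noise. The trade-off, under the assumption that chunks are formed as in `saa_min_age`, is that smaller chunks are less likely to contain the youngest people, so each chunk's minimum may drift above the true minimum before clipping.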