# Differential Privacy
The point of this notebook is to explore how we can apply differential privacy to our dataset. This means adding statistical noise to sensitive data fields so that it is harder to reverse engineer the individuals to whom the data belongs. Local differential privacy is similar; it just specifies that we add this statistical noise right at the beginning, when the data is being recorded.

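Since the noise is drawn from a zero-mean distribution such as the Laplace or the normal (Gaussian) distribution, aggregate results stay approximately the same while individual records remain hidden. The amount of noise we add is controlled by epsilon: the smaller the epsilon, the more noise we add and the stronger the privacy.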
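The idea that zero-mean noise hides individuals while keeping aggregates roughly intact can be sketched as follows. This is a minimal illustration, not the method used later in this notebook: the ages are synthetic, the bounds (17 to 90) are assumed, and the Laplace scale of range / epsilon is the standard choice for a bounded numeric value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ages, NOT the Adult dataset; the bounds are assumed for illustration.
lower, upper = 17.0, 90.0
ages = rng.integers(int(lower), int(upper) + 1, size=50_000).astype(float)

epsilon = 1.0
# For a value bounded in [lower, upper], use a Laplace scale of range / epsilon.
scale = (upper - lower) / epsilon
noisy_ages = ages + rng.laplace(loc=0.0, scale=scale, size=ages.size)

print("true mean :", ages.mean())
print("noisy mean:", noisy_ages.mean())
print("mean |noise| per person:", np.abs(noisy_ages - ages).mean())
```

Each individual value is heavily perturbed, while the mean over many records moves only slightly because the noise averages out.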

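For a binary class like Sex, the epsilon goes into the probability with which we flip the class. With Age, it's a simpler addition of noise. In this notebook, however, Age is binarized as well, so both attributes are privatized together through randomized response.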
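As a rough sketch of how epsilon turns into a flip probability for a binary attribute, the snippet below uses the two-category form of randomized response: keep the true bit with probability e^epsilon / (e^epsilon + 1) and flip it otherwise. The bits are synthetic and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

epsilon = 1.0
# Binary randomized response: keep the true bit with probability
# e^eps / (e^eps + 1), flip it otherwise.
keep_prob = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
flip_prob = 1.0 - keep_prob

sex = rng.integers(0, 2, size=100_000)   # synthetic bits, not real data
flip = rng.random(sex.size) < flip_prob  # True where the bit gets flipped
sex_reported = np.where(flip, 1 - sex, sex)

print("flip probability  :", flip_prob)
print("observed flip rate:", (sex_reported != sex).mean())
```

A larger epsilon pushes `flip_prob` toward 0 (less noise), and epsilon = 0 gives a fair coin flip.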

## 1. Loading the dataset

In [1]:
import numpy as np
import pandas as pd
from aif360.datasets import AdultDataset

# load dataset
dataset = AdultDataset()
df = pd.DataFrame(dataset.features, columns=dataset.feature_names)
df['income'] = dataset.labels.ravel()

# binarize age
age_median = df['age'].median()
df['age_binary'] = (df['age'] > age_median).astype(int)

# clean binary sex col as well
df['sex_binary'] = df['sex'].astype(int)
df[['age', 'age_binary', 'sex', 'sex_binary']].head()




Unnamed: 0,age,age_binary,sex,sex_binary
0,25.0,0,1.0,1
1,38.0,1,1.0,1
2,28.0,0,1.0,1
3,44.0,1,1.0,1
4,34.0,0,1.0,1


Now that we have binarized Age and Sex, there are 4 possible (age, sex) combinations, so we can encode each record as a single category from 0 to 3.

In [2]:
# show sample of the dataset
df.sample(20)

Unnamed: 0,age,education-num,race,sex,capital-gain,capital-loss,hours-per-week,workclass=Federal-gov,workclass=Local-gov,workclass=Private,...,native-country=South,native-country=Taiwan,native-country=Thailand,native-country=Trinadad&Tobago,native-country=United-States,native-country=Vietnam,native-country=Yugoslavia,income,age_binary,sex_binary
23011,19.0,7.0,1.0,1.0,0.0,0.0,30.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,1
22523,29.0,9.0,1.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,1
36207,32.0,10.0,0.0,0.0,0.0,0.0,40.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0
18594,44.0,9.0,1.0,0.0,0.0,0.0,40.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1,0
28863,28.0,13.0,0.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,1
19196,37.0,13.0,1.0,0.0,0.0,0.0,50.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0
8092,66.0,9.0,1.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1
4200,30.0,13.0,1.0,1.0,0.0,2444.0,55.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0,1
39404,43.0,10.0,1.0,1.0,0.0,0.0,38.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1
17732,28.0,9.0,1.0,1.0,0.0,0.0,30.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,1


## 2. Binarizing Age and Sex columns

In [3]:
# buckets: 
# 0 = young & sex=0
# 1 = young & sex=1
# 2 = old & sex=0
# 3 = old & sex=1
df['age_sex_cat'] = df['age_binary'] * 2 + df['sex_binary']

df[['age', 'age_binary', 'sex', 'sex_binary', 'age_sex_cat']].sample(10)

Unnamed: 0,age,age_binary,sex,sex_binary,age_sex_cat
30501,21.0,0,1.0,1,1
35719,23.0,0,0.0,0,0
12654,33.0,0,0.0,0,0
2397,43.0,1,1.0,1,3
4351,20.0,0,1.0,1,1
7502,17.0,0,0.0,0,0
17512,34.0,0,1.0,1,1
27361,25.0,0,0.0,0,0
34818,26.0,0,1.0,1,1
14169,31.0,0,0.0,0,0
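The bucket encoding can be checked on its own with a small round trip. This sketch uses hand-written arrays rather than the dataframe, and shows that integer division and modulo recover the two original bits from the combined category.

```python
import numpy as np

# All four (age_binary, sex_binary) combinations.
age_binary = np.array([0, 0, 1, 1])
sex_binary = np.array([0, 1, 0, 1])

# Encode: age is the "twos" bit, sex is the "ones" bit.
cat = age_binary * 2 + sex_binary

# Decode with integer division and modulo.
age_back = cat // 2
sex_back = cat % 2

print(cat)  # prints [0 1 2 3]
```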


In [4]:
# true cross-tabulation to show the real counts before adding noise
true_ct = pd.crosstab(df['age_binary'], df['sex_binary'])
print("True Age x Sex cross-tab:")
print(true_ct)

# also print flat counts by category (0-3)
true_counts = np.bincount(df['age_sex_cat'])
print(true_counts)

True Age x Sex cross-tab:
sex_binary     0      1
age_binary             
0           8196  14831
1           6499  15696
[ 8196 14831  6499 15696]


## 3. Introducing a local differential privacy (DP) method using epsilon
The method will randomize responses based on the epsilon passed.
Reference: https://programming-dp.com/chapter13.html
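To get a feel for what epsilon does in k-ary randomized response, the sketch below tabulates the probability `p` of reporting the true category and the probability `q` of reporting each specific other category, for k = 4. The epsilon grid is an illustrative choice. The ratio p / q equals e^epsilon, which is exactly the bound that epsilon-differential privacy requires.

```python
import numpy as np

k = 4
epsilons = np.array([0.1, 0.3, 0.5, 1.0, 2.0, 5.0])

exp_eps = np.exp(epsilons)
p = exp_eps / (exp_eps + k - 1)  # probability of reporting the true category
q = (1.0 - p) / (k - 1)          # probability of reporting each specific other category

for eps, p_i, q_i in zip(epsilons, p, q):
    print(f"epsilon={eps:<4} p={p_i:.3f} q={q_i:.3f} p/q={p_i / q_i:.3f}")
```

At epsilon = 0.1 the truth is reported only slightly more often than any other category (uniform guessing would be 0.25), which is why the reports carry very little signal at small epsilon.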

In [5]:
def dp_randomized_response(categories, epsilon, k=4):
    # this function implements randomized response mechanism
    # which takes in categorical data in {0, .., k-1} and returns privatized "reports", which are the final outputs
    # that satisfy epsilon-differential privacy for each individual.
    # epsilon is the privacy parameter, so the higher it is, the less noise we add
    # The point is to "hide" the true category of each individual while still allowing overall statistics to be done without too much error

    # returns the following:
    # reports: privatized reports after applying DP randomized response
    # p: probability of telling the truth
    # q: probability of reporting some other value
    categories = np.asarray(categories, dtype=int)
    n = len(categories)
    
    # RR probabilities
    exp_eps = np.exp(epsilon)
    p = exp_eps / (exp_eps + k - 1) # prob of telling the truth
    q = (1.0 - p) / (k - 1) # prob of reporting some other value
    
    reports = np.empty_like(categories)
    u = np.random.rand(n) # pick uniform random numbers for each individual
    same = (u < p) # where we report the true category
    
    reports[same] = categories[same]
    
    # now see how many we need to flip, for those not reporting truth, sample from the other k-1 categories
    num_flip = np.sum(~same)
    # print(num_flip)
    if num_flip > 0:
        true_vals = categories[~same]
        # sample from other categories
        alt = np.random.randint(0, k-1, size=num_flip)
        # print((alt >= true_vals))
        alt += (alt >= true_vals).astype(int) # shift up by 1 at or above the true value, so alt is uniform over the other k-1 categories and never equals the true one
        reports[~same] = alt # set alt values in the reports where we are not telling the truth
    
    return reports, p, q
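The shift trick used for the flipped reports can be sanity-checked in isolation. This sketch reproduces only that step on synthetic categories (using NumPy's `default_rng` instead of the global `np.random` state) and confirms that the alternative value never equals the true one and is spread evenly over the remaining categories.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

true_vals = rng.integers(0, k, size=100_000)       # synthetic true categories
alt = rng.integers(0, k - 1, size=true_vals.size)  # draw from {0, ..., k-2}
alt = alt + (alt >= true_vals)                     # skip over the true value

print("any collision with the truth:", np.any(alt == true_vals))

# Distribution of the alternative when the true category is 0.
mask = true_vals == 0
shares = np.bincount(alt[mask], minlength=k) / mask.sum()
print("shares of categories 0..3 given truth=0:", np.round(shares, 3))
```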


In [6]:
def estimate_counts_from_dp(reports, p, q, k=4):
    # this function takes in the privatized reports from DP randomized response
    # and estimates the true counts for each category
    # we calculate the observed frequencies after noise, and then invert the process to get estimated "true" probabilities
    # this will tell us how many people are in each category approximately and we can compare with true counts to see the error introduced by DP method
    reports = np.asarray(reports, dtype=int)
    N = len(reports)
    counts_noisy = np.bincount(reports, minlength=k).astype(float) # minlength keeps k entries even if a category is never reported
    
    # calc p and q
    # exp_eps = np.exp(epsilon)
    # p = exp_eps / (exp_eps + k - 1)
    # q = (1.0 - p) / (k - 1)
    
    # freq_hat is the observed frequency after noise
    # probs_hat is the estimated true probabilities
    freq_hat = counts_noisy / N
    probs_hat = (freq_hat - q) / (p - q)
    
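    # clip to [0, 1]: sampling noise can push an estimate outside the valid range, especially at small epsilon.
    # Clipping is post-processing, so it does not weaken the privacy guarantee,
    # but it biases the estimates and the clipped counts may no longer sum to N.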
    probs_hat = np.clip(probs_hat, 0, 1)
    counts_est = probs_hat * N
    return counts_noisy, counts_est
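The inversion step can be understood through the expected noisy frequency. A report equals category j with probability p when j is the truth and q otherwise, so the expected frequency is q + (p - q) * pi_j, where pi_j is the true proportion. Solving for pi_j gives the estimator used above. The sketch below checks this on expected (noise-free) frequencies; the helper name `invert_rr` is hypothetical, and the counts are the true counts printed earlier in this notebook.

```python
import numpy as np

def invert_rr(freq_noisy, p, q):
    """Map noisy report frequencies back to estimated true proportions."""
    return (freq_noisy - q) / (p - q)

k = 4
epsilon = 1.0
p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
q = (1.0 - p) / (k - 1)

true_counts = np.array([8196, 14831, 6499, 15696], dtype=float)
true_probs = true_counts / true_counts.sum()

# Expected frequency of each reported category after randomized response.
expected_noisy = q + (p - q) * true_probs
recovered = invert_rr(expected_noisy, p, q)

print("expected noisy frequencies:", np.round(expected_noisy, 4))
print("recovered proportions     :", np.round(recovered, 4))
print("true proportions          :", np.round(true_probs, 4))
```

With real reports the observed frequencies fluctuate around `expected_noisy`, so the recovered proportions are only approximately correct, and dividing by the small quantity p - q amplifies that fluctuation at low epsilon.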


In [7]:
k = 4
N = len(df)

true_counts = np.bincount(df['age_sex_cat'], minlength=k)
print("True counts:", true_counts)

epsilons = [0.1, 0.3, 0.5, 1.0, 2.0, 5.0]
results = []

for eps in epsilons:
    reports, p, q = dp_randomized_response(df['age_sex_cat'], eps, k)
    counts_noisy, counts_est = estimate_counts_from_dp(reports, p, q, k)
    
    # calculate errors now with estimated and true counts
    abs_err = np.abs(counts_est - true_counts)
    rel_err = abs_err / np.maximum(true_counts, 1)  # avoid /0
    
    # compute L1 and L2 errors between estimated counts and true counts
    l1_err = abs_err.sum()
    l2_err = np.sqrt((abs_err**2).sum())
    
    results.append({
        "epsilon": eps,
        "noisy_counts": counts_noisy,
        "est_counts": counts_est,
        "abs_err": abs_err,
        "rel_err": rel_err,
        "L1_error": l1_err,
        "L2_error": l2_err
    })

# pretty print results
pd.DataFrame([
    {
        "epsilon": r["epsilon"],
        "L1_error": r["L1_error"],
        "L2_error": r["L2_error"]
    }
    for r in results
])


True counts: [ 8196 14831  6499 15696]


Unnamed: 0,epsilon,L1_error,L2_error
0,0.1,22066.130312,11179.720503
1,0.3,4523.529815,2720.298665
2,0.5,2061.122188,1230.272203
3,1.0,353.246381,229.556142
4,2.0,605.446385,312.284025
5,5.0,78.036094,48.970017
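To put these errors in context, the sketch below estimates the standard deviation we would expect for each estimated count. It assumes each noisy count behaves like a Binomial(N, f) draw, where f is the expected noisy frequency, and it ignores the clipping step; `rr_count_std` is a hypothetical helper, not part of the notebook's code.

```python
import numpy as np

k = 4
true_counts = np.array([8196, 14831, 6499, 15696], dtype=float)
N = true_counts.sum()
true_probs = true_counts / N

def rr_count_std(epsilon):
    """Approximate std of each estimated count under k-ary randomized response."""
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    q = (1.0 - p) / (k - 1)
    f = q + (p - q) * true_probs  # expected noisy frequency per category
    return np.sqrt(N * f * (1.0 - f)) / (p - q)

epsilons = [0.1, 0.3, 0.5, 1.0, 2.0, 5.0]
stds = np.array([rr_count_std(eps) for eps in epsilons])
for eps, s in zip(epsilons, stds):
    print(f"epsilon={eps}: approx. std per category = {np.round(s, 1)}")
```

Under these assumptions the expected error shrinks steadily as epsilon grows. Because the table above comes from a single random draw per epsilon, an individual result can plausibly land above or below this scale, which would explain the error at epsilon = 2.0 exceeding the one at epsilon = 1.0.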


In [8]:
labels = ["young, sex=0", "young, sex=1", "old, sex=0", "old, sex=1"] # labels for our cats

for r in results:
    print(f"\n==== epsilon = {r['epsilon']} ===")
    for i, label in enumerate(labels):
        print(
            f"{label} \t true={true_counts[i]} "
            f"\t est={r['est_counts'][i]:8.1f} "
            f"\t absolute err={r['abs_err'][i]:6.1f}"
        )



==== epsilon = 0.1 ===
young, sex=0 	 true=8196 	 est=  3245.1 	 absolute err=4950.9
young, sex=1 	 true=14831 	 est=  8748.8 	 absolute err=6082.2
old, sex=0 	 true=6499 	 est= 13159.6 	 absolute err=6660.6
old, sex=1 	 true=15696 	 est= 20068.5 	 absolute err=4372.5

==== epsilon = 0.3 ===
young, sex=0 	 true=8196 	 est=  9571.1 	 absolute err=1375.1
young, sex=1 	 true=14831 	 est= 15290.3 	 absolute err= 459.3
old, sex=0 	 true=6499 	 est=  4237.2 	 absolute err=2261.8
old, sex=1 	 true=15696 	 est= 16123.4 	 absolute err= 427.4

==== epsilon = 0.5 ===
young, sex=0 	 true=8196 	 est=  7969.7 	 absolute err= 226.3
young, sex=1 	 true=14831 	 est= 14232.8 	 absolute err= 598.2
old, sex=0 	 true=6499 	 est=  6292.9 	 absolute err= 206.1
old, sex=1 	 true=15696 	 est= 16726.6 	 absolute err=1030.6

==== epsilon = 1.0 ===
young, sex=0 	 true=8196 	 est=  8198.9 	 absolute err=   2.9
young, sex=1 	 true=14831 	 est= 14801.5 	 absolute err=  29.5
old, sex=0 	 true=6499 	 est=  6351.9 	 a

Epsilon=1 seems like a reasonable value, but we can explore other values as well and see how the classifier fares. If performance dips too much, consider increasing epsilon to 2 or 2.5; if it is unaffected, a smaller epsilon would give stronger privacy at no cost.

In [None]:
epsilon = 1.0  # TODO choose one for the dataset we'll pass to the classifier
reports, p, q = dp_randomized_response(df['age_sex_cat'], epsilon, k)

df_priv = df.copy()
df_priv['age_binary_ldp'] = (reports // 2).astype(int) # 0 or 1
df_priv['sex_binary_ldp'] = (reports % 2).astype(int)

# drop original sensitive columns if you want to be purist:
df_priv = df_priv.drop(columns=['age', 'age_binary', 'sex', 'sex_binary'])

In [19]:
# this shows how many young/old and male/female individuals are there in the privatized dataset now
private_ct = pd.crosstab(df_priv['age_binary_ldp'], df_priv['sex_binary_ldp'])
print("Private dataset counts:")
print(private_ct)

# compared to original true counts
true_ct = pd.crosstab(df['age_binary'], df['sex_binary'])
print("\nOriginal dataset counts:")
print(true_ct)

Private dataset counts:
sex_binary_ldp      0      1
age_binary_ldp              
0               10389  12224
1                9786  12823

Original dataset counts:
sex_binary     0      1
age_binary             
0           8196  14831
1           6499  15696
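The private counts are pulled toward a uniform split because every category receives flipped reports from the others. As a rough check, the sketch below applies the same inversion as `estimate_counts_from_dp` directly to the counts printed above, assuming they were produced with epsilon = 1.0 and k = 4.

```python
import numpy as np

k = 4
epsilon = 1.0  # assumed to match the value used to build df_priv
p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
q = (1.0 - p) / (k - 1)

# Counts copied from the cross-tabs above, flattened as category = age * 2 + sex.
private_counts = np.array([10389, 12224, 9786, 12823], dtype=float)
true_counts = np.array([8196, 14831, 6499, 15696], dtype=float)
N = private_counts.sum()

# Debias: subtract the baseline q and rescale by (p - q).
est_counts = N * (private_counts / N - q) / (p - q)

for cat in range(k):
    print(f"cat {cat}: private={private_counts[cat]:.0f} "
          f"debiased={est_counts[cat]:.0f} true={true_counts[cat]:.0f}")
```

The debiased counts land much closer to the true counts than the raw private counts do, which is worth keeping in mind when interpreting any statistic computed directly on `df_priv`.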


In [11]:
# show a sample of the privatized dataset
df_priv.sample(20)

Unnamed: 0,education-num,race,capital-gain,capital-loss,hours-per-week,workclass=Federal-gov,workclass=Local-gov,workclass=Private,workclass=Self-emp-inc,workclass=Self-emp-not-inc,...,native-country=Taiwan,native-country=Thailand,native-country=Trinadad&Tobago,native-country=United-States,native-country=Vietnam,native-country=Yugoslavia,income,age_sex_cat,age_binary_ldp,sex_binary_ldp
7943,13.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3,1,1
27345,13.0,1.0,0.0,0.0,15.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0
20229,3.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,1,1
3963,10.0,1.0,0.0,0.0,48.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1
7120,10.0,1.0,0.0,0.0,30.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,0,1
32915,14.0,1.0,0.0,0.0,40.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3,1,1
15045,9.0,1.0,0.0,0.0,32.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3,0,0
14299,13.0,1.0,7688.0,0.0,40.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3,0,0
35942,7.0,1.0,0.0,0.0,40.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3,0,1
37351,9.0,1.0,0.0,0.0,20.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3,1,0


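Now we can use `df_priv` to train our classifier the same way we trained it with the full dataset. Note that `df_priv` still contains `age_sex_cat`, which encodes the true age and sex bucket, so that column should be dropped as well before training.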
