<a href="https://colab.research.google.com/github/linyuehzzz/census_privacy/blob/main/differential_privacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


#### **Read synthetic population data**

In [2]:
%cd "/content/gdrive/My Drive/Colab Notebooks/census_privacy"
import pandas as pd

filename_people = 'franklin/microdata/franklin_peoplev0.csv'
data_people = pd.read_csv(filename_people)
data_people

/content/gdrive/My Drive/Colab Notebooks/census_privacy


Unnamed: 0,GEOID10,PUMAID,SEX,RAC1P,AGEP,DIS,MSP,MIG,MIL,SCHL,FOD1P,HICOV,PRIVCOV,PUBCOV,PINCP,POVPIP,COW,ESR,SOCP,JWMNP,JWRIP,JWTRNS,HouseholdID,lon,lat
0,390490001101001,3904102,1,1,26,2,6.0,3.0,4.0,19.0,,2,2,2,25000.0,495.0,6.0,1.0,1191XX,5.0,1.0,1.0,1,-82.999368,40.054253
1,390490001101001,3904102,2,9,26,2,6.0,3.0,4.0,19.0,,1,1,2,20000.0,162.0,1.0,1.0,37201X,5.0,2.0,1.0,1,-82.999368,40.054253
2,390490001101002,3904102,2,1,1,2,,3.0,,,,2,2,2,,13.0,,,,,,,4,-83.002695,40.060864
3,390490001101002,3904102,2,1,4,2,,1.0,,2.0,,1,1,2,,501.0,,,,,,,5,-83.001629,40.060899
4,390490001101002,3904102,2,1,8,2,,1.0,,4.0,,1,1,2,,491.0,,,,,,,6,-83.001937,40.060852
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1138714,390490107001020,3904102,2,1,77,1,3.0,1.0,4.0,19.0,,1,2,1,17200.0,144.0,,6.0,,,,,70,-83.017831,40.065591
1138715,390490107001020,3904102,1,1,94,2,1.0,1.0,2.0,22.0,5200.0,1,1,1,5200.0,266.0,,6.0,,,,,108,-83.017814,40.062582
1138716,390490107001020,3904102,2,1,94,1,6.0,1.0,4.0,16.0,,1,2,1,0.0,,,6.0,,,,,95,-83.018259,40.064492
1138717,390499800001027,3904106,1,1,74,2,2.0,1.0,4.0,15.0,,1,1,1,33300.0,221.0,,6.0,,,,,1,-82.879433,40.004085


#### **Create original histograms**
For demonstration, we use the block-by-race query (GEOID10 by RAC1P). This query exhibits cell uniqueness: some block-by-race cells contain exactly one person.

In [99]:
hist1 = pd.crosstab(data_people.GEOID10, data_people.RAC1P)
hist1

GEOID10 \ RAC1P,1,2,3,4,5,6,7,8,9
390490001101001,1,0,0,0,0,0,0,0,1
390490001101002,45,0,0,0,0,0,0,0,0
390490001101003,51,4,0,0,0,0,0,0,0
390490001101004,48,0,0,0,0,0,0,0,0
390490001101005,51,0,0,0,0,0,0,0,4
...,...,...,...,...,...,...,...,...,...
390490107001015,92,17,0,0,0,0,0,1,2
390490107001016,9,0,0,0,0,0,0,0,0
390490107001017,42,6,0,0,0,9,0,0,0
390490107001020,167,20,0,0,0,7,0,1,1
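
To make cell uniqueness concrete, here is a minimal sketch that counts cells holding exactly one person. It runs on a small hypothetical table (`toy_people` is illustrative, not the Franklin County file); the same expressions should apply to `hist1`.

```python
import pandas as pd

# Hypothetical toy microdata (one row per person); not the real Franklin file.
toy_people = pd.DataFrame({
    "GEOID10": ["A", "A", "B", "B", "B", "C"],
    "RAC1P":   [1,   9,   1,   1,   2,   1],
})

# Same construction as hist1: blocks in rows, race codes in columns.
toy_hist = pd.crosstab(toy_people.GEOID10, toy_people.RAC1P)

# A cell is "unique" when exactly one person falls into it.
unique_cells = toy_hist == 1
n_unique = int(unique_cells.sum().sum())

# Blocks that contain at least one unique cell.
blocks_with_unique = unique_cells.any(axis=1)
print(n_unique)
print(blocks_with_unique)
```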


#### **Add noise (zero-concentrated differential privacy)**
The 2020 Census Disclosure Avoidance System (DAS) has two core components: noise injection and post-processing. We focus only on noise injection here. The 2020 DAS applies discrete Gaussian noise; the code below approximates it by rounding draws from a continuous Gaussian to the nearest integer.

Bun, M., & Steinke, T. (2016, November). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference (pp. 635-658). Springer, Berlin, Heidelberg.

Parameters for generating Gaussian noise.

In [100]:
import math
import numpy as np

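# privacy budget allocation
delta = 10 ** (-10)
# zCDP parameter (usually written rho); named eps here
eps = 12.2 * 9 / 4097

# standard deviation of the Gaussian noise for a sensitivity-1 count query
sigma = math.sqrt(1 / (2 * eps))
# equivalent (eps0, delta)-differential privacy loss (Bun & Steinke, 2016)
eps0 = eps + 2 * math.sqrt(eps * math.log(1 / delta))
eps, eps0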

(0.02680009763241396, 1.59790805444737)
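
The parameter cell above relies on two standard zCDP relations from Bun and Steinke (2016), written here with $\rho$ for the value the code calls `eps` and $\varepsilon$ for `eps0`. For a counting query with sensitivity 1, Gaussian noise with variance $\sigma^2$ satisfies $\rho$-zCDP with

$$\rho = \frac{1}{2\sigma^2} \quad\Longleftrightarrow\quad \sigma = \sqrt{\frac{1}{2\rho}},$$

and $\rho$-zCDP implies $(\varepsilon, \delta)$-differential privacy for any $\delta > 0$ with

$$\varepsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}.$$

With $\rho \approx 0.0268$ and $\delta = 10^{-10}$, this gives $\sigma \approx 4.32$ and $\varepsilon \approx 1.598$, consistent with the output above.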

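Approximate probability of generating zero noise. The Gaussian density at 0 approximates the probability that the rounded noise equals 0.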

In [103]:
import scipy.stats
scipy.stats.norm(0, sigma).pdf(0)

0.09236198366800145
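
The value above is a density, not a probability. As a rough check, assuming the noise is a continuous Gaussian rounded to the nearest integer (as in this notebook), the probability of zero noise is the mass on the interval (-0.5, 0.5). The names `density_at_zero` and `prob_zero` are introduced here for illustration only.

```python
import math
from scipy.stats import norm

# Same parameters as the notebook's parameter cell.
eps = 12.2 * 9 / 4097
sigma = math.sqrt(1 / (2 * eps))

# Density at 0 (what the cell above reports).
density_at_zero = norm(0, sigma).pdf(0)

# Probability that a draw rounds to 0: mass between -0.5 and 0.5.
prob_zero = norm(0, sigma).cdf(0.5) - norm(0, sigma).cdf(-0.5)

print(density_at_zero, prob_zero)
```

Because the density peaks at 0, the interval probability is slightly smaller than the density value, so the density is a close upper approximation for this `sigma`.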

Add noise to the histogram. Because no post-processing is applied, noisy counts can be negative.

In [102]:
# round continuous Gaussian draws to integers; note this overwrites hist1 in place
hist1 += np.round(np.random.normal(0, sigma, size=hist1.shape), 0)
hist1

GEOID10 \ RAC1P,1,2,3,4,5,6,7,8,9
390490001101001,0.0,9.0,3.0,1.0,2.0,-2.0,-2.0,3.0,-9.0
390490001101002,46.0,2.0,-4.0,2.0,10.0,8.0,0.0,6.0,0.0
390490001101003,63.0,1.0,3.0,3.0,0.0,-4.0,-11.0,1.0,-1.0
390490001101004,49.0,3.0,2.0,2.0,3.0,1.0,0.0,3.0,-1.0
390490001101005,46.0,5.0,-6.0,6.0,2.0,1.0,2.0,0.0,7.0
...,...,...,...,...,...,...,...,...,...
390490107001015,100.0,13.0,2.0,-4.0,2.0,6.0,7.0,8.0,-8.0
390490107001016,12.0,9.0,12.0,2.0,1.0,1.0,-4.0,-1.0,4.0
390490107001017,37.0,5.0,-3.0,1.0,2.0,12.0,1.0,1.0,-3.0
390490107001020,170.0,16.0,3.0,4.0,-6.0,7.0,-4.0,-3.0,0.0
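
The text refers to discrete Gaussian noise, while the cell above rounds continuous Gaussian draws. As a sketch of the difference, the hypothetical helper `sample_discrete_gaussian` below draws integers with probability proportional to exp(-k^2 / (2 sigma^2)) on a truncated support. It is a simple approximation for illustration, not the exact sampler used in the 2020 DAS, and the `tail` cutoff is an assumption.

```python
import numpy as np

def sample_discrete_gaussian(sigma, size, rng, tail=12):
    """Draw integers k with P(k) proportional to exp(-k^2 / (2 sigma^2)).

    The support is truncated to |k| <= ceil(tail * sigma), so this is only
    an approximation of the (unbounded) discrete Gaussian.
    """
    bound = int(np.ceil(tail * sigma))
    support = np.arange(-bound, bound + 1)
    weights = np.exp(-support.astype(float) ** 2 / (2 * sigma ** 2))
    probs = weights / weights.sum()
    return rng.choice(support, size=size, p=probs)

rng = np.random.default_rng(0)
sigma = np.sqrt(1 / (2 * (12.2 * 9 / 4097)))

# Integer noise for a hypothetical 5-block by 9-category histogram.
noise = sample_discrete_gaussian(sigma, size=(5, 9), rng=rng)
print(noise)
```

The resulting array could be added to an integer histogram directly, which keeps the noisy counts integer-valued without a rounding step.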


#### **Check cell uniqueness**

In [None]:
scipy.stats.norm(100, 12).pdf(98)
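
One possible way to check cell uniqueness after noise injection is to compare the cells that held exactly one person before noise with the same cells afterwards. The sketch below is an assumption about how such a check could look, not the notebook's own method. It uses a small hypothetical table (`original`), and it keeps the original counts in a separate variable, since the notebook's `hist1` is overwritten when noise is added.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sigma = np.sqrt(1 / (2 * (12.2 * 9 / 4097)))

# Hypothetical original counts (3 blocks x 3 race categories).
original = pd.DataFrame([[1, 0, 1], [45, 0, 0], [51, 4, 1]],
                        index=["A", "B", "C"], columns=[1, 2, 9])

# Rounded Gaussian noise, as in the notebook, but stored separately.
noisy = original + np.round(rng.normal(0, sigma, size=original.shape))

# Cells that were unique before noise, and those still reported as 1 after.
was_unique = original == 1
still_unique = was_unique & (noisy == 1)
share_preserved = still_unique.sum().sum() / was_unique.sum().sum()
print(share_preserved)
```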