<a href="https://colab.research.google.com/github/linyuehzzz/dp_census/blob/main/dp_census.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Differential Privacy for Census**
This code simulates the TopDown algorithm for the U.S. census.  
Yue Lin (lin.3326 at osu.edu)  
Created: 7/28/2021

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

#### **Set up libraries**

In [2]:
# %cd "/content/gdrive/My Drive/Colab Notebooks/census_dp"
import json
import pandas as pd
import numpy as np
import math
from sympy import *

#### **Create synthetic microdata**
Young et al. (2009) presents a method that creates a realistic dataset for testing disclosure control methodologies.  

Young, C., Martin, D., & Skinner, C. (2009). Geographically intelligent disclosure control for flexible aggregation of census data. International Journal of Geographical Information Science, 23(4), 457-482.

In [3]:
block = ["1234", "1234", "1234", "1235", "1235", "1235", "1234"]
age = [66, 84, 30, 36, 8, 18, 24]
sex = ["F", "M", "M", "F", "F", "M", "F"]
race = ["B", "B", "W", "B", "B", "W", "W"]
relationship = ["M", "M", "M", "M", "S", "S", "S"]
N = 7

#### **Create original histograms**

block/sex

In [4]:
block_sex = {}
for i in ["1234", "1235"]:
    block_sex[i] = {}
    for j in ["F", "M"]:
        block_sex[i][j] = 0
        
for k in range(N):
    i = block[k]
    j = sex[k]
    block_sex[i][j] += 1

df = pd.DataFrame.from_dict(block_sex, orient='index')
print(df)

      F  M
1234  2  2
1235  2  1


block/sex/race/relationship

In [5]:
block_sex_race_rel = {}
for a in ["1234", "1235"]:
    block_sex_race_rel[a] = {}
    for b in ["F", "M"]:
        for c in ["W", "B"]:
            for d in ["M", "S"]:
                block_sex_race_rel[a][b + "/" + c + "/" + d] = 0

for k in range(N):
    a = block[k]
    b = sex[k]
    c = race[k]
    d = relationship[k]
    block_sex_race_rel[a][b + "/" + c + "/" + d] += 1

df = pd.DataFrame.from_dict(block_sex_race_rel, orient='index')
print(df)

      F/W/M  F/W/S  F/B/M  F/B/S  M/W/M  M/W/S  M/B/M  M/B/S
1234      0      1      1      0      1      0      1      0
1235      0      0      1      1      0      1      0      0


#### **Add noises (Zero-concentrated differential privacy)**
Discrete Gaussian noise is applied.

Bun, M., & Steinke, T. (2016, November). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference (pp. 635-658). Springer, Berlin, Heidelberg.

Generate Gaussian noise  

**How espilon is distributed?** The noise magnitude scales should relate with the number of queries. Are they identical within each type of queries and each geographical level?

In [6]:
delta = 0.05
eps = 12.2 * 0.02

# solve for rho
y = Symbol('y')
eqn = Eq(y + 2 * sqrt(y * log(1 / delta)), eps)
rho = solve(eqn)[0]
# calculate sigma (standard deviation)
sigma = math.sqrt(1 / (2 * rho))

print(sigma)

10.232020630940216


block/sex

In [11]:
block_sex_noise = {}

for i in ["1234", "1235"]:
    block_sex_noise[i] = {}
    for j in ["F", "M"]:
        noise = np.random.normal(0, sigma, 1).astype(int)[0]
        block_sex_noise[i][j] = block_sex[i][j] + noise

df = pd.DataFrame.from_dict(block_sex_noise, orient='index')
print(df)

       F  M
1234 -13  4
1235   8  9


block/sex/race/relationship

In [None]:
block_sex_race_rel_noise = {}

for a in ["1234", "1235"]:
    block_sex_race_rel_noise[a] = {}
    for b in ["F", "M"]:
        for c in ["W", "B"]:
            for d in ["M", "S"]:
                block_sex_race_rel_noise[a][b + "/" + c + "/" + d] = block_sex_race_rel[a][b + "/" + c + "/" + d] + noise

df = pd.DataFrame.from_dict(block_sex_race_rel_noise, orient='index')
print(df)

      F/W/M  F/W/S  F/B/M  F/B/S  M/W/M  M/W/S  M/B/M  M/B/S
1234      0      1      1      0      1      0      1      0
1235      0      0      1      1      0      1      0      0


#### **Post-processing**