# Implementing Laplace for $\epsilon$-DP

## Generating a Dummy Population

We'll create a dummy dataset with records of fake people that we can then summarize in some tables.

The characteristics we'll use are Sex, Age, Marital Status, Earnings, and Occupation.

In [2]:
# Some initial Python setup requirements for later code.

# Required to get the plots inline for Census implementation.
%matplotlib inline

# Load the libraries we need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

# Set the threshold for numpy output values that get printed to the screen.
np.set_printoptions(threshold=10)
pd.options.display.max_rows = 10

## Population Distributions

The options we will allow for each of those categories are:
* Sex: {Male, Female}, evenly distributed.
* Age: A normal distribution of integers centered around 45 and with a StDev of 10.
* Marital Status: {Married, Widowed, Separated, Divorced, Never Married}, based on the [ACS 1-year estimates for 2018](https://data.census.gov/cedsci/table?q=Marital%20Status%20and%20Marital%20History&hidePreview=false&t=Marital%20Status%20and%20Marital%20History&tid=ACSST1Y2018.S1201&vintage=2018)
* Earnings: A truncated log-normal distribution of integers.
* Occupation: {Statistician, Economist, Geographer, IT Specialist, Unicorn Wrangler}, with an arbitrary distribution.

In [8]:
# Specify how many dummy records we want.
numrecs=1000

# Valid Values
sexvalid=["M", "F"]
marrvalid=["M", "W", "S", "D", "N"]
occvalid=["Statistician", "Economist", "Geographer", "IT Specialist", "Unicorn Wrangler"]

# Generate arrays of random values.
sexarray = np.random.choice(a=sexvalid,size=numrecs,p=[0.50, 0.50])
# Note: np.int was removed in NumPy 1.24, so we use the built-in int instead.
agearray = np.random.normal(loc=45, scale=10, size=numrecs).round().astype(int)
marrarray = np.random.choice(a=marrvalid,size=numrecs,p=[0.478, 0.057,0.109,0.019,0.337])
earnarray=np.power(10,truncnorm.rvs(-3, 4, loc=5, scale=0.5, size=numrecs)).round().astype(int)
occarray=np.random.choice(a=occvalid,size=numrecs,p=[0.3,0.3,0.2,0.19,0.01])


# Construct the full data structure and print some records.
myrecs = pd.DataFrame({'Sex': sexarray, 
                       'Age': agearray,
                       'Married': marrarray,
                       'Earnings': earnarray,
                       'Occupation': occarray
                    }, columns=['Sex', 'Age', 'Married', 'Earnings', 'Occupation'])
print(myrecs)


    Sex  Age Married  Earnings     Occupation
0     M   49       N     24038  IT Specialist
1     M   42       S    169977     Geographer
2     F   48       M     43647      Economist
3     F   56       N    379372   Statistician
4     F   67       M    642698   Statistician
..   ..  ...     ...       ...            ...
995   M   48       S     42793     Geographer
996   M   36       M     12316      Economist
997   M   39       M      8489     Geographer
998   M   39       M     52186      Economist
999   F   42       M    243496  IT Specialist

[1000 rows x 5 columns]
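
To see what range of earnings the truncated draw can produce, it helps to work through the bounds. In `truncnorm.rvs(a, b, loc, scale)`, the clip points `a` and `b` are expressed in standard deviations from `loc`, so the exponent used above lies in $[5 - 3(0.5),\ 5 + 4(0.5)] = [3.5, 7]$. A minimal sketch that checks this (the seed, sample size, and variable names are illustrative and not part of the original notebook):

```python
import numpy as np
from scipy.stats import truncnorm

# Same parameters as the earnings draw above.
a, b, loc, scale = -3, 4, 5, 0.5

# The clip points are given in standard-deviation units around loc.
low_exp = loc + a * scale    # 3.5
high_exp = loc + b * scale   # 7.0
print("Exponent range:", low_exp, high_exp)
print("Earnings range: {:.0f} to {:.0f}".format(10 ** low_exp, 10 ** high_exp))

# Draw a sample (arbitrary seed, for reproducibility only) and confirm
# that every value falls inside those bounds.
exponents = truncnorm.rvs(a, b, loc=loc, scale=scale, size=10000, random_state=0)
earnings = np.power(10, exponents).round().astype(int)
print(earnings.min(), earnings.max())
```

Earnings therefore range from roughly 3,162 up to 10,000,000, with most of the mass near $10^5$.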


## Summarize Data
For this example, we want to publish a count of Occupation by Sex. For reference, we'll start by looking at the true version of the table.

In [9]:
# Note that we reindex using the full set of valid values so that categories with no records still appear as zero cells.
# Unless they are structural zeros, these real zeros must be protected (noised) as well.
truetab=pd.crosstab(index=myrecs['Occupation'], columns=myrecs['Sex'])
truetab=truetab.reindex(index=occvalid, columns=sexvalid, fill_value=0)
print(truetab)


Sex                 M    F
Occupation                
Statistician      144  150
Economist         143  143
Geographer        105  106
IT Specialist      97   99
Unicorn Wrangler    6    7
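
The reindexing step matters because `pd.crosstab` only reports categories that actually occur in the data. A small illustration, using a hypothetical six-record dataset (the `toy` frame is made up for this example) in which no Unicorn Wranglers were sampled:

```python
import pandas as pd

occvalid = ["Statistician", "Economist", "Unicorn Wrangler"]
sexvalid = ["M", "F"]

# Hypothetical toy data: nobody is a Unicorn Wrangler.
toy = pd.DataFrame({
    "Sex": ["M", "F", "F", "M", "F", "M"],
    "Occupation": ["Statistician", "Economist", "Statistician",
                   "Economist", "Economist", "Statistician"],
})

# Without reindexing, the unobserved occupation has no row at all.
raw = pd.crosstab(index=toy["Occupation"], columns=toy["Sex"])
print(raw.shape)

# Reindexing restores the missing row as explicit zero cells.
full = raw.reindex(index=occvalid, columns=sexvalid, fill_value=0)
print(full)
```

Without the reindex, the absence of a row would itself reveal that the true count is zero. With it, the zero cells receive noise like any other cell.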


## Generate Laplace Noise
Now we will generate a noise value for each cell in the table to be published.

In [10]:
# First let's define the scale of the Laplace distribution in terms of sensitivity and epsilon.
sensitivity=1.
epsilon=1.
scale=sensitivity/epsilon

print("Epsilon = {:4.2f}".format(epsilon))
print("Sensitivity = {:4.2f}".format(sensitivity) )
print("Scale = {:4.2f}".format(scale))
print()


noise = np.random.laplace(0,scale,truetab.shape)
print(noise)


Epsilon = 1.00
Sensitivity = 1.00
Scale = 1.00

[[-0.58286841 -0.32894437]
 [-1.56780072  1.07916289]
 [ 1.45645638  2.17068026]
 [ 2.71139609 -1.02783281]
 [ 0.30909519  1.73692829]]
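
To build intuition for how `epsilon` controls the noise, recall that a Laplace distribution with scale $b$ has mean absolute deviation $b$ and variance $2b^2$. The sketch below assumes a sensitivity of 1, as above, and compares the empirical mean absolute noise for a few illustrative values of `epsilon` (the seed, sample size, and chosen values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed
sensitivity = 1.0

mean_abs = {}
for epsilon in [0.1, 1.0, 10.0]:
    scale = sensitivity / epsilon
    draws = rng.laplace(0, scale, size=200000)
    # The average absolute noise should be close to the scale.
    mean_abs[epsilon] = np.abs(draws).mean()
    print("epsilon = {:5.2f}  scale = {:5.2f}  mean |noise| = {:6.3f}".format(
        epsilon, scale, mean_abs[epsilon]))
```

Smaller values of `epsilon` give stronger privacy and proportionally larger noise.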


## Create Noisy Table
Now we add the noise to the true data to get a noisy table. Depending on our publication strategy, we may want to post-process the table (e.g., rounding and setting negative values to zero).

In [11]:
# Print the noisy table.
noisytab=truetab+noise
print(noisytab)
print()

# Post-process the noisy table to round and set negative values to zero.
# Note: np.int was removed in NumPy 1.24, so we use the built-in int instead.
noisytabpost=noisytab.round().astype(int)
noisytabpost[noisytabpost < 0]=0
print("Noisy table to be published:")
print(noisytabpost)
print()

print("True table for comparison:")
print(truetab)

Sex                        M           F
Occupation                              
Statistician      143.417132  149.671056
Economist         141.432199  144.079163
Geographer        106.456456  108.170680
IT Specialist      99.711396   97.972167
Unicorn Wrangler    6.309095    8.736928

Noisy table to be published:
Sex                 M    F
Occupation                
Statistician      143  150
Economist         141  144
Geographer        106  108
IT Specialist     100   98
Unicorn Wrangler    6    9

True table for comparison:
Sex                 M    F
Occupation                
Statistician      144  150
Economist         143  143
Geographer        105  106
IT Specialist      97   99
Unicorn Wrangler    6    7
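
The noise and post-processing steps can be wrapped into a single function. The following is a minimal sketch and not part of the original notebook: `laplace_counts` and its parameters are hypothetical names, the seed is arbitrary, and it assumes every cell has the same sensitivity.

```python
import numpy as np
import pandas as pd

def laplace_counts(table, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to a table of counts, then round and clip at zero."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    noisy = table + rng.laplace(0, scale, table.shape)
    # Rounding and clipping use only the noisy values, never the true data.
    return noisy.round().astype(int).clip(lower=0)

# Example using the true table printed above.
truetab = pd.DataFrame(
    {"M": [144, 143, 105, 97, 6], "F": [150, 143, 106, 99, 7]},
    index=["Statistician", "Economist", "Geographer",
           "IT Specialist", "Unicorn Wrangler"],
)
published = laplace_counts(truetab, epsilon=1.0, rng=np.random.default_rng(0))
print(published)
```

Because the post-processing depends only on the noisy output, it does not weaken the differential privacy guarantee, although clipping at zero can introduce a small upward bias in cells with very low counts.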


## Measure the Error
Now we will calculate the $L_{1}$ error between the true table and the published (post-processed) noisy table, first cell by cell and then in total.

In [12]:
# First cell-by-cell errors.
errortab=np.abs(noisytabpost-truetab)
print("Cell-by-cell absolute error:")
print(errortab)
print()

# Then relative errors, as a fraction of each true count (undefined where the true count is zero).
pcterrortab=errortab/truetab
print("Cell-by-cell relative error:")
print(pcterrortab)
print()

# Now the total error and the relative error.
errorsum=errortab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorsum))
truesum=truetab.sum().sum()
errorrel=100*errorsum/truesum
print("Relative L1 Error: {:4.2f}%".format(errorrel))

Cell-by-cell absolute error:
Sex               M  F
Occupation            
Statistician      1  0
Economist         2  1
Geographer        1  2
IT Specialist     3  1
Unicorn Wrangler  0  2

Cell-by-cell relative error:
Sex                      M         F
Occupation                          
Statistician      0.006944  0.000000
Economist         0.013986  0.006993
Geographer        0.009524  0.018868
IT Specialist     0.030928  0.010101
Unicorn Wrangler  0.000000  0.285714

Total L1 Error: 13
Relative L1 Error: 1.30%
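
For context, the expected total error can be worked out directly: each of the 10 cells receives noise whose expected absolute value equals the scale, so before rounding the expected total $L_1$ error is $10 \times 1 = 10$. The observed value of 13, although computed after rounding, is in the expected range. A rough simulation that ignores the rounding and clipping step (the seed and repetition count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed
scale = 1.0
shape = (5, 2)   # same shape as the published table
reps = 20000

# Total absolute noise per simulated table, before any post-processing.
totals = np.abs(rng.laplace(0, scale, size=(reps,) + shape)).sum(axis=(1, 2))
print("Mean total L1 error: {:.2f}".format(totals.mean()))
print("Share of tables with total error >= 13: {:.2f}".format((totals >= 13).mean()))
```

Note that the absolute error does not depend on the size of the true counts, which is why the relative error is largest in the sparse Unicorn Wrangler row.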
