# Implementing Laplace for $\epsilon$-DP

## Generate Data

We'll create a dummy dataset with some records of fake people, which we can then summarize in some tables.

The characteristics we'll use will be:
* Sex
* Age
* Marital Status
* Earnings
* Occupation

In [None]:
# Some initial Python setup requirements for later code.

# Required to get the plots inline for Census implementation.
%matplotlib inline

# Load the libraries we need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

# Set the threshold for numpy output values that get printed to the screen.
np.set_printoptions(threshold=10)
pd.options.display.max_rows = 10

In [None]:
# Specify how many dummy records we want.
numrecs=100

# Valid Values
sexvalid=["M", "F"]
marrvalid=["M", "W", "S", "D", "N"]
occvalid=["Statistician", "Economist", "Geographer", "IT Specialist", "Unicorn Wrangler"]

# Generate arrays of random values.
sexarray = np.random.choice(a=sexvalid,size=numrecs,p=[0.50, 0.50])
# Use the builtin int: the np.int alias was removed in NumPy 1.24.
agearray = np.random.normal(loc=45, scale=10, size=numrecs).round().astype(int)
marrarray = np.random.choice(a=marrvalid,size=numrecs,p=[0.478, 0.057,0.109,0.019,0.337])
# Earnings are log-normal-like: 10 raised to a truncated normal exponent.
earnarray=np.power(10,truncnorm.rvs(-3, 4, loc=5, scale=0.5, size=numrecs)).round().astype(int)
occarray=np.random.choice(a=occvalid,size=numrecs,p=[0.3,0.3,0.2,0.19,0.01])


#myrecs=pd.DataFrame(np.array([sexarray,agearray]))
myrecs = pd.DataFrame({'Sex': sexarray, 
                       'Age': agearray,
                       'Married': marrarray,
                       'Earnings': earnarray,
                       'Occupation': occarray
                    }, columns=['Sex', 'Age', 'Married', 'Earnings', 'Occupation'])
print(myrecs)


## Summarize Data
For this example, we want to publish a count of Occupation by Sex. For reference, we'll start by looking at the true version of the table.

In [None]:
# Note that we reindex using the full set of valid values. Unless we have structural zeros, we must also protect the real zeros.
truetab=pd.crosstab(index=myrecs['Occupation'], columns=myrecs['Sex'])
truetab=truetab.reindex(index=occvalid, columns=sexvalid, fill_value=0)
print(truetab)


## Generate Laplace Noise
Now we will generate a noise value for each cell in the table to be published.
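
For reference, the standard Laplace mechanism can be written as below. The symbols $\Delta f$ (sensitivity), $b$ (scale), $c_i$ (true count in cell $i$), and $Y_i$ (noise added to cell $i$) are notation introduced here for illustration; they correspond to `sensitivity`, `scale`, the table entries, and `noise` in the code.

$$
p(y \mid b) = \frac{1}{2b}\exp\left(-\frac{|y|}{b}\right), \qquad b = \frac{\Delta f}{\epsilon}, \qquad \tilde{c}_i = c_i + Y_i, \quad Y_i \sim \mathrm{Lap}(b) \ \text{independently}
$$

The expected absolute noise per cell is $E|Y_i| = b$, so with a sensitivity of 1 and $\epsilon = 4$ each cell is perturbed by 0.25 on average. The choice $\Delta f = 1$ assumes that adding or removing one person changes exactly one cell of the table by 1.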

In [None]:
# First let's define the scale of the Laplace distribution in terms of sensitivity and epsilon.
sensitivity=1.
epsilon=4.
scale=sensitivity/epsilon

print("Epsilon = {:4.2f}".format(epsilon))
print("Sensitivity = {:4.2f}".format(sensitivity) )
print("Scale = {:4.2f}".format(scale))
print()


noise = np.random.laplace(0,scale,truetab.shape)
print(noise)
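
As a minimal sketch, the noise draw above could be wrapped in a small helper and sanity-checked empirically. The function name `laplace_noise` and the fixed seed are assumptions made for this illustration, not part of the notebook; it uses NumPy's `default_rng` generator rather than the global `np.random` state so the check is reproducible.

```python
import numpy as np


def laplace_noise(shape, sensitivity, epsilon, rng=None):
    """Draw Laplace(0, sensitivity / epsilon) noise with the given shape.

    Hypothetical helper for illustration only.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = np.random.default_rng()
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=shape)


# Seeded generator (assumption: the seed value is arbitrary).
rng = np.random.default_rng(12345)

# With sensitivity 1 and epsilon 4 the scale is 0.25, so the mean
# absolute noise over many draws should be close to 0.25.
sample = laplace_noise((100_000,), sensitivity=1.0, epsilon=4.0, rng=rng)
mean_abs = np.abs(sample).mean()
print("Mean absolute noise: {:5.3f}".format(mean_abs))
```

Passing `truetab.shape` as `shape` would give one noise value per table cell, as in the cell above.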


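## Create Noisy Table
Now we add the noise to the true data to get a noisy table. Depending on our publication strategy, we may want to post-process the table (e.g., rounding, setting negative values to zero). Post-processing of this kind uses only the noisy values, so it does not weaken the privacy guarantee.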

In [None]:
# Print the noisy table.
noisytab=truetab+noise
print(noisytab)
print()

# Post-process the noisy table to round and set negative values to zero.
noisytabpost=noisytab.round().astype(int)
noisytabpost[noisytabpost < 0]=0
print("Noisy table to be published:")
print(noisytabpost)
print()

print("True table for comparison:")
print(truetab)
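
Post-processing does have an accuracy cost. The sketch below, under the assumption of a single cell whose true count is zero, simulates many noisy draws and compares rounding alone with rounding followed by clipping at zero. The variable names and the seed are chosen for this illustration only.

```python
import numpy as np

# Seeded generator (assumption: the seed value is arbitrary).
rng = np.random.default_rng(2024)

# Same scale as the notebook: sensitivity 1 divided by epsilon 4.
scale = 1.0 / 4.0
n_draws = 200_000

# A single cell with a true count of zero, perturbed many times.
true_count = 0
noisy = true_count + rng.laplace(loc=0.0, scale=scale, size=n_draws)

# Rounding alone is symmetric around zero, so it stays unbiased.
rounded = np.round(noisy)

# Clipping negatives to zero removes the negative half of the errors,
# which pushes the average published value above the true count.
clipped = np.clip(rounded, 0, None)

mean_rounded = rounded.mean()
mean_clipped = clipped.mean()
print("Mean after rounding only:       {:6.4f}".format(mean_rounded))
print("Mean after rounding + clipping: {:6.4f}".format(mean_clipped))
```

The clipped mean sits above zero, which shows that excluding negative values introduces a small upward bias in cells with small true counts.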

## Measure the Error
Now we will calculate the $L_{1}$ error between the true table and the published (post-processed) noisy table, first cell by cell and then in total.

In [None]:
# First cell-by-cell errors.
errortab=np.abs(noisytabpost-truetab)
print("Cell-by-cell absolute error:")
print(errortab)
print()

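# Then relative errors. Cells with a true count of zero have no defined
# relative error, so mask them as NaN instead of dividing by zero.
pcterrortab=errortab/truetab.replace(0, np.nan)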
print("Cell-by-cell relative error:")
print(pcterrortab)
print()

# Now the total error and the relative error.
errorsum=errortab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorsum))
truesum=truetab.sum().sum()
errorrel=100*errorsum/truesum
print("Relative L1 Error: {:4.2f}%".format(errorrel))
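
To see how the total error depends on the privacy parameter, the following sketch averages the total $L_{1}$ error of the raw noise (before any rounding or clipping) over many simulated tables for several values of epsilon. The helper name `mean_total_l1_error`, the list of epsilon values, the trial count, and the seed are assumptions for this illustration; the table shape of 5 by 2 matches the Occupation by Sex table.

```python
import numpy as np


def mean_total_l1_error(shape, sensitivity, epsilon, n_trials, rng):
    """Average total absolute Laplace noise over n_trials simulated tables.

    Hypothetical helper for illustration only. It measures the error of
    the raw noisy table, without rounding or clipping.
    """
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=(n_trials,) + tuple(shape))
    # Sum absolute noise over all cells of each table, then average.
    per_trial = np.abs(noise).reshape(n_trials, -1).sum(axis=1)
    return per_trial.mean()


# Seeded generator (assumption: the seed value is arbitrary).
rng = np.random.default_rng(7)
table_shape = (5, 2)
n_cells = table_shape[0] * table_shape[1]
epsilons = [0.5, 1.0, 2.0, 4.0]

results = {}
for eps in epsilons:
    results[eps] = mean_total_l1_error(table_shape, 1.0, eps, 2000, rng)
    # Each cell has expected absolute noise sensitivity / epsilon.
    expected = n_cells * 1.0 / eps
    print("Epsilon = {:4.2f}  simulated = {:6.2f}  expected = {:6.2f}".format(
        eps, results[eps], expected))
```

The simulated totals track `n_cells * sensitivity / epsilon`, so halving epsilon roughly doubles the expected total error.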