# Implementing the Geometric Mechanism for $\epsilon$-DP

## The Geometric Mechanism
The geometric mechanism is an alternative to the Laplace mechanism that produces integer-valued noise directly. Noise that satisfies $\epsilon$-DP can be drawn by taking the difference between two independent draws from the geometric distribution:

$$f\left(k\middle|p\right)=\left(1-p\right)^{k-1}p$$ 

where the value of $p$ is set as:

$$p=1-e^{\left(-\frac{\epsilon}{S}\right)}$$

Here $S$ is the sensitivity and $k$ takes the values $1, 2, 3, \ldots$
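
As a rough illustration of the construction above, the sketch below draws the noise as a difference of two geometric variates and compares the empirical frequencies with the closed form that follows from the formula for $p$: writing $\alpha = e^{-\epsilon/S}$, the difference $Z$ has $\Pr(Z=z)=\frac{1-\alpha}{1+\alpha}\,\alpha^{|z|}$. The function names `two_sided_geometric` and `two_sided_geometric_pmf`, the seed, and the sample size are assumptions made for this example and do not appear elsewhere in the notebook.

```python
import numpy as np

def two_sided_geometric(epsilon, sensitivity, size, rng=None):
    """Hypothetical helper: difference of two independent geometric draws."""
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 - np.exp(-epsilon / sensitivity)
    # Each draw takes values 1, 2, 3, ...; the difference is any integer.
    return rng.geometric(p, size) - rng.geometric(p, size)

def two_sided_geometric_pmf(z, epsilon, sensitivity):
    """Hypothetical helper: closed-form probability of the difference equalling z."""
    alpha = np.exp(-epsilon / sensitivity)
    return (1.0 - alpha) / (1.0 + alpha) * alpha ** np.abs(z)

# Compare empirical frequencies with the closed form for epsilon = S = 1.
rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility only
draws = two_sided_geometric(1.0, 1.0, 200_000, rng)
for z in range(-2, 3):
    empirical = float(np.mean(draws == z))
    exact = float(two_sided_geometric_pmf(z, 1.0, 1.0))
    print("z = {:2d}  empirical = {:.4f}  closed form = {:.4f}".format(z, empirical, exact))
```

Neighbouring values of $z$ differ in probability by a factor of exactly $\alpha = e^{-\epsilon/S}$, which is where the $\epsilon$-DP guarantee for a query with sensitivity $S$ comes from.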


## Again, Generate and Summarize Data...

In [1]:
# Some initial Python setup requirements for later code.

# Required to get the plots inline for Census implementation.
%matplotlib inline

# Load the libraries we need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

# Set the threshold for numpy output values that get printed to the screen.
np.set_printoptions(threshold=10)
pd.options.display.max_rows = 10

# Specify how many dummy records we want.
numrecs=100

# Valid Values
sexvalid=["M", "F"]
marrvalid=["M", "W", "S", "D", "N"]
occvalid=["Statistician", "Economist", "Geographer", "IT Specialist", "Unicorn Wrangler"]

# Generate arrays of random values.
sexarray = np.random.choice(a=sexvalid,size=numrecs,p=[0.50, 0.50])
agearray = np.random.normal(loc=45, scale=10, size=numrecs).round().astype(int)  # np.int was removed from recent NumPy releases
marrarray = np.random.choice(a=marrvalid,size=numrecs,p=[0.478, 0.057,0.109,0.019,0.337])
earnarray=np.power(10,truncnorm.rvs(-3, 4, loc=5, scale=0.5, size=numrecs)).round().astype(int)
occarray=np.random.choice(a=occvalid,size=numrecs,p=[0.3,0.3,0.2,0.19,0.01])


#myrecs=pd.DataFrame(np.array([sexarray,agearray]))
myrecs = pd.DataFrame({'Sex': sexarray, 
                       'Age': agearray,
                       'Married': marrarray,
                       'Earnings': earnarray,
                       'Occupation': occarray
                    }, columns=['Sex', 'Age', 'Married', 'Earnings', 'Occupation'])
print(myrecs)


# Note that we reindex using the full set of valid values. Unless we have structural zeros, we must also protect the real zeros.
truetab=pd.crosstab(index=myrecs['Occupation'], columns=myrecs['Sex'])
truetab=truetab.reindex(index=occvalid, columns=sexvalid, fill_value=0)
print(truetab)

   Sex  Age Married  Earnings     Occupation
0    F   48       M     27000   Statistician
1    F   45       M    383672   Statistician
2    M   47       N     23428      Economist
3    M   61       M     22504  IT Specialist
4    F   28       M     66849      Economist
..  ..  ...     ...       ...            ...
95   F   33       M    231925      Economist
96   M   46       M    346033   Statistician
97   F   40       M     32616     Geographer
98   M   39       N     41567     Geographer
99   M   20       N    171503     Geographer

[100 rows x 5 columns]
Sex                M   F
Occupation              
Statistician      18  19
Economist         13  16
Geographer        10  10
IT Specialist      8   5
Unicorn Wrangler   0   1


## Generate Geometric Noise

In [2]:
# First let's define the success probability of the geometric distribution in terms of sensitivity and epsilon.
sensitivity=1.
epsilon=1.
probability=1-np.exp(-epsilon/sensitivity)

print("Epsilon = {:4.2f}".format(epsilon))
print("Sensitivity = {:4.2f}".format(sensitivity) )
print("Probability = {:4.4f}".format(probability))
print()


noise = np.random.geometric(probability,truetab.shape)-np.random.geometric(probability,truetab.shape)
print(noise)


Epsilon = 1.00
Sensitivity = 1.00
Probability = 0.6321

[[-1  0]
 [ 0  0]
 [ 0  0]
 [ 0 -1]
 [ 1  0]]
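
To get a feel for how the amount of noise depends on $\epsilon$, the sketch below compares the sample variance of the noise with the value implied by the geometric distribution. A single geometric draw has variance $(1-p)/p^2$, so the difference of two independent draws has variance $2(1-p)/p^2$. The helper name `noise_variance`, the seed, the sample size, and the list of $\epsilon$ values are assumptions chosen for illustration only.

```python
import numpy as np

def noise_variance(epsilon, sensitivity=1.0):
    """Hypothetical helper: variance of the difference of two geometric draws."""
    p = 1.0 - np.exp(-epsilon / sensitivity)
    return 2.0 * (1.0 - p) / p ** 2

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only
for eps in (0.1, 0.5, 1.0, 2.0):
    p = 1.0 - np.exp(-eps)  # sensitivity of 1, as in the cell above
    sample = rng.geometric(p, 100_000) - rng.geometric(p, 100_000)
    print("epsilon = {:4.2f}  sample variance = {:9.3f}  theoretical = {:9.3f}".format(
        eps, float(sample.var()), float(noise_variance(eps))))
```

Smaller values of $\epsilon$ give stronger privacy protection at the cost of a larger noise variance.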


## Create the Noisy Table and Measure the Error...
The geometric noise is already integer-valued, so no post-processing rounding is needed; only negative counts are set to zero.

In [4]:
# Make the noisy table.
noisytab=truetab+noise
print("Original Noisy Table:")
print(noisytab)
print()

# Post-process the noisy table to set negative values to zero.
# Copy first so that noisytab itself is not modified by the assignment below.
noisytabpost=noisytab.copy()
noisytabpost[noisytabpost < 0]=0
print("Noisy table to be published:")
print(noisytabpost)
print()

print("True table for comparison:")
print(truetab)
print()

# First cell-by-cell errors.
errortab=np.abs(noisytabpost-truetab)
print("Cell-by-cell absolute error:")
print(errortab)
print()

# Then relative errors. Cells whose true count is zero produce inf or NaN here.
pcterrortab=errortab/truetab
print("Cell-by-cell relative error:")
print(pcterrortab)
print()

# Now the total error and the relative error.
errorsum=errortab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorsum))
truesum=truetab.sum().sum()
errorrel=100*errorsum/truesum
print("Relative L1 Error: {:4.2f}%".format(errorrel))


Original Noisy Table:
Sex                M   F
Occupation              
Statistician      17  19
Economist         13  16
Geographer        10  10
IT Specialist      8   4
Unicorn Wrangler   1   1

Noisy table to be published:
Sex                M   F
Occupation              
Statistician      17  19
Economist         13  16
Geographer        10  10
IT Specialist      8   4
Unicorn Wrangler   1   1

True table for comparison:
Sex                M   F
Occupation              
Statistician      18  19
Economist         13  16
Geographer        10  10
IT Specialist      8   5
Unicorn Wrangler   0   1

Cell-by-cell absolute error:
Sex               M  F
Occupation            
Statistician      1  0
Economist         0  0
Geographer        0  0
IT Specialist     0  1
Unicorn Wrangler  1  0

Cell-by-cell relative error:
Sex                      M    F
Occupation                     
Statistician      0.055556  0.0
Economist         0.000000  0.0
Geographer        0.000000  0.0
IT Specialist 