# Cases Where Sensitivity $\gg$ 1
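Now consider a situation where the sensitivity is greater than, or even much greater than, 1. An example is a count of employees across businesses: adding or removing a single business can change the total by as many employees as that business has.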

Below we will generate some data and then consider the consequences of dealing with large businesses.
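To make the idea of sensitivity concrete, the sketch below compares how much a business count and an employment total change when a single business is removed. The toy employment values and the helper name `max_change_on_removal` are assumptions made for illustration only; they are not part of the data generated below.

```python
import numpy as np

# Hypothetical toy data: employee counts for five businesses.
toy_employment = np.array([3, 10, 435, 629, 8113])

def max_change_on_removal(values, query):
    """Largest change in query(values) when any single record is dropped."""
    full = query(values)
    return max(abs(full - query(np.delete(values, i))) for i in range(len(values)))

# A count changes by exactly one record; a sum changes by that record's value.
count_change = max_change_on_removal(toy_employment, len)
sum_change = max_change_on_removal(toy_employment, np.sum)

print("Count query changes by at most:", count_change)
print("Sum query changes by at most:", sum_change)
```

Note that this only measures the change observed in one particular dataset. The sensitivity used for differential privacy is the worst case over all possible datasets, which is why the employment sensitivity later in this notebook is set to the upper bound of the size range rather than to the largest value observed.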

In [1]:
# Some initial Python setup requirements for later code.

# Required to get the plots inline for Census implementation.
%matplotlib inline

# Load the libraries we need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

# Set the threshold for numpy output values that get printed to the screen.
np.set_printoptions(threshold=10)
pd.options.display.max_rows = 10

## Generate the data
This time we will generate some business data to consider. The characteristics will be:
* Geography: {DC, Maryland, Virginia}, with a relative distribution of {20%, 40%, 40%}.
* Industry: {Construction, Manufacturing, Information, Finance, Education}, with a relative distribution of {20%, 20%, 30%, 10%, 20%}.
* Size (count of employees): A truncated log-normal distribution ranging from 1 to 10,000.

In [6]:
# Specify how many dummy records we want.
numrecs=100

# Valid Values
geovalid=["DC","MD","VA"]
indvalid=["Construction","Manufacturing", "Information", "Finance", "Education"]

# Generate arrays of random values.
geoarray = np.random.choice(a=geovalid,size=numrecs,p=[0.2,0.4,0.4])
indarray=np.random.choice(a=indvalid,size=numrecs,p=[0.2,0.2,0.3,0.1,0.2])

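# Generate the truncated log-normal distribution.
# The exponent is drawn from a normal distribution truncated to [0, 4], so 10**exponent lies in [1, 10000].
norm_min, norm_max, norm_mean, norm_std = 0, 4, 0.5, 2
# truncnorm expects its bounds in standard-deviation units relative to loc.
a, b = (norm_min - norm_mean) / norm_std, (norm_max - norm_mean) / norm_std
emparray=np.power(10,truncnorm.rvs(a, b, loc=norm_mean, scale=norm_std, size=numrecs)).round().astype(int)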

# Create the full data structure.
myrecs = pd.DataFrame({'State': geoarray, 
                       'Industry': indarray,
                       'Employment': emparray
                    }, columns=['State', 'Industry', 'Employment'])
print(myrecs)
print(myrecs["Employment"].min())
print(myrecs["Employment"].max())


   State       Industry  Employment
0     DC   Construction           3
1     VA    Information          10
2     DC        Finance        8113
3     DC  Manufacturing         435
4     VA    Information         629
..   ...            ...         ...
95    VA    Information           2
96    MD  Manufacturing         161
97    DC      Education          24
98    MD   Construction         403
99    MD   Construction         397

[100 rows x 3 columns]
1
8699


## Summarize the Data
We'll summarize the counts and employment totals by state and industry to see the differences in outcomes.
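One detail worth illustrating: `pd.crosstab` only reports categories that actually occur in the data, so a cell that should be a true zero may be missing from the table entirely. The toy records and the `toy_` names below are hypothetical and only illustrate the effect of reindexing over the full set of valid values.

```python
import pandas as pd

# Hypothetical toy records: no business in VA and none in Finance.
toy = pd.DataFrame({"State": ["DC", "MD", "DC"],
                    "Industry": ["Construction", "Education", "Education"]})
toy_geovalid = ["DC", "MD", "VA"]
toy_indvalid = ["Construction", "Finance", "Education"]

# The raw crosstab contains only the observed categories.
raw = pd.crosstab(index=toy["Industry"], columns=toy["State"])

# Reindexing adds the unobserved rows and columns as explicit zeros.
full = raw.reindex(index=toy_indvalid, columns=toy_geovalid, fill_value=0)

print(raw.shape)
print(full.shape)
print(full)
```

Without the reindex step, the absence of a row or column in the published table would itself reveal that no such business exists.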

In [24]:
# Note that we reindex using the full set of valid values. Unless we have structural zeros, we must also protect the real zeros.
truecnttab=pd.crosstab(index=myrecs['Industry'], columns=myrecs['State'])
truecnttab=truecnttab.reindex(index=indvalid, columns=geovalid, fill_value=0)
print("Table of Business Counts by State*Industry")
print(truecnttab)
print()

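# Cells with no businesses come back as NaN from crosstab, so fill them with zero as well.
trueemptab=pd.crosstab(index=myrecs['Industry'], columns=myrecs['State'], values=myrecs['Employment'], aggfunc='sum')
trueemptab=trueemptab.reindex(index=indvalid, columns=geovalid, fill_value=0).fillna(0).astype(int)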
print("Table of Employment by State*Industry")
print(trueemptab)
print()


Table of Business Counts by State*Industry
State          DC  MD  VA
Industry                 
Construction    5  13   6
Manufacturing   4  13   9
Information     3  13  10
Finance         1   3   5
Education       3   3   9

Table of Employment by State*Industry
State            DC     MD     VA
Industry                         
Construction   9116   9122   9891
Manufacturing  2697  11704  11125
Information     410   9915   2032
Finance        8113    405    871
Education       907    214   3161



## Generate Laplace Noise
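Now we will generate a noise value for each cell in both tables to be published. Note that the privacy assumptions differ between the two tables: adding or removing one business changes a business count by at most 1, but it can change an employment total by up to 10,000.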
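As a rough check on what these scales mean in practice, the sketch below draws many Laplace samples and compares the average absolute noise with the scale $b = \text{sensitivity}/\epsilon$. For a Laplace distribution centred at zero, the expected absolute value equals $b$. The seed, the sample size, and the helper name `laplace_scale` are assumptions for illustration.

```python
import numpy as np

def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise for a given sensitivity and epsilon."""
    return sensitivity / epsilon

rng = np.random.default_rng(12345)  # fixed seed, chosen arbitrarily

results = {}
for label, sens in [("business counts", 1.0), ("employment", 10000.0)]:
    scale = laplace_scale(sens, epsilon=1.0)
    draws = rng.laplace(loc=0.0, scale=scale, size=200_000)
    mean_abs = np.abs(draws).mean()
    results[label] = (scale, mean_abs)
    print("{}: scale = {:.1f}, mean |noise| = {:.1f}".format(label, scale, mean_abs))
```

With the same epsilon, the typical noise added to an employment cell is therefore about 10,000 times larger than the noise added to a business count cell.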

In [25]:
# Privacy parameters for business counts:
cntsens=1.
cnteps=1.
cntscale=cntsens/cnteps

print("Business Count Protection Parameters:")
print("Epsilon = {:4.2f}".format(cnteps))
print("Sensitivity = {:4.2f}".format(cntsens) )
print("Scale = {:4.2f}".format(cntscale))
print()


cntnoise = np.random.laplace(0,cntscale,truecnttab.shape)
print(cntnoise)
print()


# Privacy parameters for employment:
empsens=10000.
empeps=1.
empscale=empsens/empeps

print("Employment Protection Parameters:")
print("Epsilon = {:4.2f}".format(empeps))
print("Sensitivity = {:4.2f}".format(empsens) )
print("Scale = {:4.2f}".format(empscale))
print()


empnoise = np.random.laplace(0,empscale,trueemptab.shape)
print(empnoise)
print()

Business Count Protection Parameters:
Epsilon = 1.00
Sensitivity = 1.00
Scale = 1.00

[[ 0.66516001 -1.80960276 -0.76246671]
 [-3.46764932 -0.61651588  0.34015434]
 [ 0.46173678 -0.53684523  0.29532402]
 [ 0.19287115 -6.23042125 -0.28998339]
 [-0.15150227 -0.78647283  1.1020326 ]]

Employment Protection Parameters:
Epsilon = 1.00
Sensitivity = 10000.00
Scale = 10000.00

[[ 17019.5809804   -3542.30317702   8574.98362968]
 [ -8219.32918854  17688.85506489  -8506.55149485]
 [  4166.0675727   25584.0200565     958.27051303]
 [-10744.12796765  -2409.51579655  -3024.83642356]
 [  2385.15292904  20982.69334456  -2185.89213263]]



## Create Noisy Table
Now we add the noise to the true data to get the noisy tables. Depending on our publication strategy, we may want to post-process the tables (e.g., by rounding or by setting negative values to zero).
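The post-processing described above can be expressed as a small helper. The function name `postprocess` and the toy values are assumptions; the sketch uses `DataFrame.clip` in place of a boolean-mask assignment, which gives the same result here.

```python
import pandas as pd

def postprocess(noisy_table):
    """Round a noisy table to integers and replace negative values with zero."""
    return noisy_table.round().clip(lower=0).astype(int)

# Hypothetical noisy values for illustration.
toy_noisy = pd.DataFrame({"DC": [5.67, 0.53], "MD": [11.19, -3.23]},
                         index=["Construction", "Finance"])
toy_post = postprocess(toy_noisy)
print(toy_post)
```

Because this step uses only the noisy values and never looks at the true data, it does not weaken the differential privacy guarantee.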

In [28]:
# Print the noisy business count table.
noisycnttab=truecnttab+cntnoise
print(noisycnttab)
print()

# Post-process the noisy table to round and set negative values to zero.
noisycnttabpost=noisycnttab.round().astype(int)
noisycnttabpost[noisycnttabpost < 0]=0
print("Noisy business count table to be published:")
print(noisycnttabpost)
print()

print("True business count table for comparison:")
print(truecnttab)
print()


# Print the noisy employment table.
noisyemptab=trueemptab+empnoise
print(noisyemptab)
print()

# Post-process the noisy table to round and set negative values to zero.
noisyemptabpost=noisyemptab.round().astype(int)
noisyemptabpost[noisyemptabpost < 0]=0
print("Noisy employment table to be published:")
print(noisyemptabpost)
print()

print("True employment table for comparison:")
print(trueemptab)
print()

State                DC         MD         VA
Industry                                     
Construction   5.665160  11.190397   5.237533
Manufacturing  0.532351  12.383484   9.340154
Information    3.461737  12.463155  10.295324
Finance        1.192871  -3.230421   4.710017
Education      2.848498   2.213527  10.102033

Noisy business count table to be published:
State          DC  MD  VA
Industry                 
Construction    6  11   5
Manufacturing   1  12   9
Information     3  12  10
Finance         1   0   5
Education       3   2  10

True business count table for comparison:
State          DC  MD  VA
Industry                 
Construction    5  13   6
Manufacturing   4  13   9
Information     3  13  10
Finance         1   3   5
Education       3   3   9

State                    DC            MD            VA
Industry                                               
Construction   26135.580980   5579.696823  18465.983630
Manufacturing  -5522.329189  29392.855065   2618.448505
I

## Measure the Error
Now we will calculate the $L_{1}$ error between the true tables and the published noisy tables, first cell by cell and then in total.
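For reference, the two summary quantities can be written as follows. The symbols are assumptions introduced here: $x_{ij}$ is the true value of the cell for industry $i$ and state $j$, and $\tilde{x}_{ij}$ is the corresponding post-processed noisy value.

```latex
\[
E_{L_1} = \sum_{i}\sum_{j} \left| \tilde{x}_{ij} - x_{ij} \right|,
\qquad
E_{\mathrm{rel}} = 100\% \times \frac{E_{L_1}}{\sum_{i}\sum_{j} x_{ij}}
\]
```

The relative error is taken against the grand total of the true table, so it summarizes the whole table rather than any single cell.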

In [31]:
# First cell-by-cell errors.
errorcnttab=np.abs(noisycnttabpost-truecnttab)
print("Cell-by-cell absolute error for business counts:")
print(errorcnttab)
print()

# Then relative errors, as fractions of the true value (a true value of zero gives inf or NaN).
pcterrorcnttab=errorcnttab/truecnttab
print("Cell-by-cell relative error for business counts:")
print(pcterrorcnttab)
print()

# Now the total error and the relative error.
print("Total business count error")
errorcntsum=errorcnttab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorcntsum))
truecntsum=truecnttab.sum().sum()
errorcntrel=100*errorcntsum/truecntsum
print("Relative L1 Error: {:4.2f}%".format(errorcntrel))
print()



# First cell-by-cell errors.
erroremptab=np.abs(noisyemptabpost-trueemptab)
print("Cell-by-cell absolute error for employment:")
print(erroremptab)
print()

# Then relative errors, as fractions of the true value (a true value of zero gives inf or NaN).
pcterroremptab=erroremptab/trueemptab
print("Cell-by-cell relative error for employment:")
print(pcterroremptab)
print()

# Now the total error and the relative error.
print("Total employment error")
errorempsum=erroremptab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorempsum))
trueempsum=trueemptab.sum().sum()
erroremprel=100*errorempsum/trueempsum
print("Relative L1 Error: {:4.2f}%".format(erroremprel))
print()

Cell-by-cell absolute error for business counts:
State          DC  MD  VA
Industry                 
Construction    1   2   1
Manufacturing   3   1   0
Information     0   1   0
Finance         0   3   0
Education       0   1   1

Cell-by-cell relative error for business counts:
State            DC        MD        VA
Industry                               
Construction   0.20  0.153846  0.166667
Manufacturing  0.75  0.076923  0.000000
Information    0.00  0.076923  0.000000
Finance        0.00  1.000000  0.000000
Education      0.00  0.333333  0.111111

Total business count error
Total L1 Error: 14
Relative L1 Error: 14.00%

Cell-by-cell absolute error for employment:
State             DC     MD    VA
Industry                         
Construction   17020   3542  8575
Manufacturing   2697  17689  8507
Information     4166  25584   958
Finance         8113    405   871
Education       2385  20983  2186

Cell-by-cell relative error for employment:
State                 DC         MD   