# Cases Where Sensitivity $\gg$ 1
Now consider a situation where the sensitivity is greater than 1, possibly much greater. An example is a tabulation of employee counts across businesses, where a single business can contribute thousands of employees to a total.
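
To make the contrast concrete, the sketch below measures how much a business count and an employment total change when a single business is removed. The list of business sizes and the helper `max_change_on_removal` are hypothetical, introduced only for illustration, and the sketch assumes the add/remove-one-record notion of neighboring datasets.

```python
# Hypothetical employment figures for five businesses (illustrative only).
employment = [4, 17, 106, 1126, 10000]

def max_change_on_removal(values, query):
    """Largest change in query(values) when any single record is removed."""
    full = query(values)
    return max(abs(full - query(values[:i] + values[i + 1:]))
               for i in range(len(values)))

count_change = max_change_on_removal(employment, len)
sum_change = max_change_on_removal(employment, sum)

print("Largest change in the business count:", count_change)
print("Largest change in total employment:", sum_change)
```

Removing any one business moves the count by exactly 1, while removing the largest business moves the employment total by that business's entire workforce. This only measures the change within one example dataset; the sensitivity used for privacy protection is the worst case over all datasets the data could plausibly be.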

Below we will generate some data and then consider the consequences of dealing with large businesses.

In [1]:
# Some initial Python setup requirements for later code.

# Required to get the plots inline for Census implementation.
%matplotlib inline

# Load the libraries we need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

# Set the threshold for numpy output values that get printed to the screen.
np.set_printoptions(threshold=10)
pd.options.display.max_rows = 10

## Generate the data
This time we will generate some business data to consider. The characteristics will be:
* Geography: {DC, Maryland, Virginia}, with a relative distribution of {20%, 40%, 40%}.
* Industry: {Construction, Manufacturing, Information, Finance, Education}, with a relative distribution of {20%, 20%, 30%, 10%, 20%}.
* Size (count of employees): With a truncated log-normal distribution ranging from 1-10,000.

In [3]:
# Specify how many dummy records we want.
numrecs=100

# Valid Values
geovalid=["DC","MD","VA"]
indvalid=["Construction","Manufacturing", "Information", "Finance", "Education"]

# Generate arrays of random values.
geoarray = np.random.choice(a=geovalid,size=numrecs,p=[0.2,0.4,0.4])
indarray=np.random.choice(a=indvalid,size=numrecs,p=[0.2,0.2,0.3,0.1,0.2])

# Generate the truncated log-normal distribution: log10(employment) is drawn
# from a normal distribution truncated to [0, 4], so employment lies in [1, 10000].
norm_min, norm_max, norm_mean, norm_std = 0, 4, 0.5, 2
a, b = (norm_min - norm_mean) / norm_std, (norm_max - norm_mean) / norm_std
emparray=np.power(10,truncnorm.rvs(a, b, loc=norm_mean, scale=norm_std, size=numrecs)).round().astype(int)

# Create the full data structure.
myrecs = pd.DataFrame({'State': geoarray, 
                       'Industry': indarray,
                       'Employment': emparray
                    }, columns=['State', 'Industry', 'Employment'])
print(myrecs)
print(myrecs["Employment"].min())
print(myrecs["Employment"].max())


   State       Industry  Employment
0     MD    Information        8991
1     VA    Information           4
2     VA  Manufacturing        1126
3     VA   Construction           2
4     DC      Education          24
..   ...            ...         ...
95    MD    Information          17
96    VA   Construction          87
97    MD        Finance         106
98    VA  Manufacturing           7
99    MD    Information           4

[100 rows x 3 columns]
1
9880


## Summarize the Data
We'll summarize the counts and employment totals by state and industry to see the differences in outcomes.
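
The sketch below shows why the summary tables need to be rebuilt over the full set of valid values. `pd.crosstab` only returns labels that occur in the data, so empty rows and columns disappear unless they are restored. The records and label lists here are hypothetical and much smaller than the generated data.

```python
import pandas as pd

# Hypothetical records: no business is in "DC" and none is in "Finance".
recs = pd.DataFrame({
    "State": ["MD", "VA", "MD"],
    "Industry": ["Education", "Education", "Information"],
})
states = ["DC", "MD", "VA"]
industries = ["Information", "Finance", "Education"]

# crosstab keeps only the labels it actually sees.
raw = pd.crosstab(index=recs["Industry"], columns=recs["State"])

# reindex restores every valid label and fills the new cells with zero.
full = raw.reindex(index=industries, columns=states, fill_value=0)

print(raw.shape)
print(full.shape)
print(full)
```

Without the `reindex` step, the published table would reveal which combinations are empty simply by omitting them.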

In [4]:
# Note that we reindex using the full set of valid values. Unless we have structural zeros, we must also protect the real zeros.
truecnttab=pd.crosstab(index=myrecs['Industry'], columns=myrecs['State'])
truecnttab=truecnttab.reindex(index=indvalid, columns=geovalid, fill_value=0)
print("Table of Business Counts by State*Industry")
print(truecnttab)
print()

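# Cells with no records come back as NaN from crosstab, so fill them with zero as well.
trueemptab=pd.crosstab(index=myrecs['Industry'], columns=myrecs['State'], values=myrecs['Employment'], aggfunc='sum')
trueemptab=trueemptab.reindex(index=indvalid, columns=geovalid, fill_value=0).fillna(0).astype(int)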
print("Table of Employment by State*Industry")
print(trueemptab)
print()


Table of Business Counts by State*Industry
State          DC  MD  VA
Industry                 
Construction    4   6  10
Manufacturing   5   7  11
Information     3  15  14
Finance         2   1   2
Education       4   9   7

Table of Employment by State*Industry
State            DC    MD     VA
Industry                        
Construction    151  6323    582
Manufacturing  5162  2815   2158
Information      41  9259  17562
Finance          68   106     11
Education       887  1114   1628



## Generate Laplace Noise
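Now we will generate a noise value for each cell in both tables to be published. The two tables use different sensitivities: adding or removing one business changes a business count cell by at most 1, but it can change an employment cell by up to 10,000, the largest business size in this data.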
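
As a rough guide to what these parameters mean in practice, the Laplace scale is the sensitivity divided by epsilon, and the expected absolute value of a Laplace draw equals its scale. The sketch below checks this by simulation for both tables. The helper `laplace_scale`, the seed, and the sample size are assumptions made for this illustration.

```python
import numpy as np

def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise: sensitivity divided by epsilon."""
    return sensitivity / epsilon

rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility only

mean_abs_noise = {}
for label, sensitivity in [("business counts", 1.0), ("employment", 10000.0)]:
    scale = laplace_scale(sensitivity, epsilon=1.0)
    draws = rng.laplace(loc=0.0, scale=scale, size=200_000)
    mean_abs_noise[label] = np.abs(draws).mean()
    print("{}: scale = {:.1f}, mean |noise| = {:.1f}".format(
        label, scale, mean_abs_noise[label]))
```

With epsilon fixed at 1, a typical count cell is perturbed by about 1, while a typical employment cell is perturbed by about 10,000, regardless of how small the true value in that cell is.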

In [5]:
# Privacy parameters for business counts:
cntsens=1.
cnteps=1.
cntscale=cntsens/cnteps

print("Business Count Protection Parameters:")
print("Epsilon = {:4.2f}".format(cnteps))
print("Sensitivity = {:4.2f}".format(cntsens) )
print("Scale = {:4.2f}".format(cntscale))
print()


cntnoise = np.random.laplace(0,cntscale,truecnttab.shape)
print(cntnoise)
print()


# Privacy parameters for employment:
empsens=10000.
empeps=1.
empscale=empsens/empeps

print("Employment Protection Parameters:")
print("Epsilon = {:4.2f}".format(empeps))
print("Sensitivity = {:4.2f}".format(empsens) )
print("Scale = {:4.2f}".format(empscale))
print()


empnoise = np.random.laplace(0,empscale,trueemptab.shape)
print(empnoise)
print()

Business Count Protection Parameters:
Epsilon = 1.00
Sensitivity = 1.00
Scale = 1.00

[[-0.58869895  1.13134133 -0.95139962]
 [ 0.49372829  0.20280237  0.23887726]
 [-0.05855074  1.3442675   0.17184785]
 [ 0.49093898 -0.51065564 -1.78459218]
 [-2.55918651  0.65861995 -0.64504133]]

Employment Protection Parameters:
Epsilon = 1.00
Sensitivity = 10000.00
Scale = 10000.00

[[  3354.72277655  -3965.47253813  -9169.76228245]
 [ 15007.55318285  -2320.65982861   4353.58676393]
 [-12886.38203198  -3769.59817546  -2580.28262714]
 [-23760.9329324  -19470.62855596 -10510.26707177]
 [ -1629.51699804 -30078.62071266  -2081.10400055]]



## Create Noisy Table
Now we add the noise to the true data to get the noisy tables. Depending on our publication strategy, we may want to post-process the tables (e.g., rounding and replacing negative values with zero).
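
The post-processing step can be sketched as a small function that looks only at the noisy values. The function name `postprocess` and the input values are hypothetical, and NumPy arrays stand in for the tables used in this notebook.

```python
import numpy as np

def postprocess(noisy):
    """Round to whole numbers and replace negative values with zero."""
    rounded = np.rint(noisy).astype(int)
    return np.clip(rounded, 0, None)

# Hypothetical noisy cells (illustrative values only).
noisy = np.array([[3.41, -8587.76], [0.49, 20169.55]])
published = postprocess(noisy)
print(published)
```

Because the function never touches the true data, it does not weaken the privacy guarantee. It does introduce bias, however: replacing negative values with zero pushes small cells upward on average.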

In [6]:
# Print the noisy business count table.
noisycnttab=truecnttab+cntnoise
print(noisycnttab)
print()

# Post-process the noisy table to round and set negative values to zero.
noisycnttabpost=noisycnttab.round().astype(int)
noisycnttabpost[noisycnttabpost < 0]=0
print("Noisy business count table to be published:")
print(noisycnttabpost)
print()

print("True business count table for comparison:")
print(truecnttab)
print()


# Print the noisy employment table.
noisyemptab=trueemptab+empnoise
print(noisyemptab)
print()

# Post-process the noisy table to round and set negative values to zero.
noisyemptabpost=noisyemptab.round().astype(int)
noisyemptabpost[noisyemptabpost < 0]=0
print("Noisy employment table to be published:")
print(noisyemptabpost)
print()

print("True employment table for comparison:")
print(trueemptab)
print()

State                DC         MD         VA
Industry                                     
Construction   3.411301   7.131341   9.048600
Manufacturing  5.493728   7.202802  11.238877
Information    2.941449  16.344267  14.171848
Finance        2.490939   0.489344   0.215408
Education      1.440813   9.658620   6.354959

Noisy business count table to be published:
State          DC  MD  VA
Industry                 
Construction    3   7   9
Manufacturing   5   7  11
Information     3  16  14
Finance         2   0   0
Education       1  10   6

True business count table for comparison:
State          DC  MD  VA
Industry                 
Construction    4   6  10
Manufacturing   5   7  11
Information     3  15  14
Finance         2   1   2
Education       4   9   7

State                    DC            MD            VA
Industry                                               
Construction    3505.722777   2357.527462  -8587.762282
Manufacturing  20169.553183    494.340171   6511.586764
I

## Measure the Error
Now we will calculate the $L_{1}$ error between the true tables and the published noisy tables, first cell by cell and then in total.
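
The error measures can be sketched as one reusable function. This version also guards against true cells equal to zero, where the relative error is undefined. The function name `l1_errors` and the input values are hypothetical, chosen only to illustrate the calculation.

```python
import numpy as np

def l1_errors(true, published):
    """Return cell-by-cell absolute error, relative error, and total L1 error."""
    true = np.asarray(true, dtype=float)
    published = np.asarray(published, dtype=float)
    abs_err = np.abs(published - true)
    # Relative error is undefined where the true value is zero; leave it as NaN.
    rel_err = np.full_like(abs_err, np.nan)
    np.divide(abs_err, true, out=rel_err, where=true != 0)
    return abs_err, rel_err, abs_err.sum()

# Hypothetical true and published cells (illustrative values only).
true = np.array([[4, 6], [0, 2]])
published = np.array([[3, 7], [1, 0]])
abs_err, rel_err, total = l1_errors(true, published)
print(abs_err)
print(rel_err)
print("Total L1 error:", total)
print("Relative L1 error: {:.2f}%".format(100 * total / true.sum()))
```

The total $L_{1}$ error is the sum of the cell-by-cell absolute errors, and the relative $L_{1}$ error divides that sum by the sum of the true cells.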

In [8]:
# First cell-by-cell errors.
errorcnttab=np.abs(noisycnttabpost-truecnttab)
print("Cell-by-cell absolute error for business counts:")
print(errorcnttab)
print()

# Then relative errors, as fractions of the true values (a true value of zero gives inf or NaN).
pcterrorcnttab=errorcnttab/truecnttab
print("Cell-by-cell relative error for business counts:")
print(pcterrorcnttab)
print()

# Now the total error and the relative error.
print("Total business count error")
errorcntsum=errorcnttab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorcntsum))
truecntsum=truecnttab.sum().sum()
errorcntrel=100*errorcntsum/truecntsum
print("Relative L1 Error: {:4.2f}%".format(errorcntrel))
print()



# First cell-by-cell errors.
erroremptab=np.abs(noisyemptabpost-trueemptab)
print("Cell-by-cell absolute error for employment:")
print(erroremptab)
print()

# Then relative errors, as fractions of the true values (a true value of zero gives inf or NaN).
pcterroremptab=erroremptab/trueemptab
print("Cell-by-cell relative error for employment:")
print(pcterroremptab)
print()

# Now the total error and the relative error.
print("Total employment error")
errorempsum=erroremptab.sum().sum()
print("Total L1 Error: {0:1d}".format(errorempsum))
trueempsum=trueemptab.sum().sum()
erroremprel=100*errorempsum/trueempsum
print("Relative L1 Error: {:4.2f}%".format(erroremprel))
print()

Cell-by-cell absolute error for business counts:
State          DC  MD  VA
Industry                 
Construction    1   1   1
Manufacturing   0   0   0
Information     0   1   0
Finance         0   1   2
Education       3   1   1

Cell-by-cell relative error for business counts:
State            DC        MD        VA
Industry                               
Construction   0.25  0.166667  0.100000
Manufacturing  0.00  0.000000  0.000000
Information    0.00  0.066667  0.000000
Finance        0.00  1.000000  1.000000
Education      0.75  0.111111  0.142857

Total business count error
Total L1 Error: 12
Relative L1 Error: 12.00%

Cell-by-cell absolute error for employment:
State             DC    MD    VA
Industry                        
Construction    3355  3965   582
Manufacturing  15008  2321  4354
Information       41  3770  2580
Finance           68   106    11
Education        887  1114  1628

Cell-by-cell relative error for employment:
State                 DC        MD        VA
