# Contingency Tables and Sensitivity, revisited
## From Lecture on 9/26/18

In [45]:
%matplotlib inline
import matplotlib.pyplot as plt
# 'seaborn-whitegrid' was renamed in matplotlib 3.6; on older versions use the original name
plt.style.use('seaborn-v0_8-whitegrid')
import pandas as pd
import numpy as np

In [46]:
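def gaussian_mech(v, epsilon, delta):
    # Gaussian mechanism; this version hard-codes an L2 sensitivity of 1
    sigma = np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return v + np.random.normal(loc=0, scale=sigma)

def laplace_mech(v, sensitivity, epsilon):
    # Laplace mechanism: noise scale is (L1 sensitivity) / epsilon
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def pct_error(orig, priv):
    # relative error of the private answer, in percent (undefined when orig == 0)
    return np.abs(orig - priv) / orig * 100.0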


In [59]:
df = pd.DataFrame(data={'col1': [1, 2, 1], 'col2': ['a', 'b', 'a']})
df_p = pd.DataFrame(data={'col1': [1, 2], 'col2': ['a', 'b']})

display(df)
display(df_p)

,col1,col2
0,1,a
1,2,b
2,1,a


,col1,col2
0,1,a
1,2,b


In [55]:
pd.crosstab(df['col2'], df['col1'])

col1,1,2
col2,,
a,2,0
b,0,1


In [58]:
v = pd.crosstab(df['col2'], df['col1'])
v2 = pd.crosstab(df_p['col2'], df_p['col1'])

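# L1 distance between the two flattened tables (only the last expression below is displayed)
np.linalg.norm(v.values.flatten()-v2.values.flatten(), ord=1)
# element-wise difference: exactly one cell changes when the last row is removed
v.values.flatten()-v2.values.flatten()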

array([1, 0, 0, 0])

## Parallel Composition

Two ways to think about it:

1. If you can rewrite your `n` queries as one query that returns a vector of length `n`, **and** the rewritten query has L1 sensitivity of 1, then you can answer it with the Laplace mechanism (noise of scale `1/epsilon` added to each element) for a total privacy cost of `epsilon`.

2. If your `n` queries all query **non-overlapping parts** of the database, then running all `n` queries separately, each with privacy parameter `epsilon`, has a total privacy cost of `epsilon`. This follows because any set of counting queries over non-overlapping parts of the database can always be rewritten as in point #1, and the L1 sensitivity of the result will be 1.

**Rule of thumb**: Generally, a single row can have only a single value for an attribute. So you can always group rows by the value of an attribute and apply parallel composition, because the groups will be non-overlapping.
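
As a minimal sketch of this rule of thumb, the cell below groups a small made-up table by one attribute, adds Laplace noise to each group count, and checks that removing one row moves the count vector by an L1 distance of 1. The data, the column name `edu`, and the helper `noisy_group_counts` are hypothetical; `laplace_mech` is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def noisy_group_counts(df, column, epsilon):
    # Each row falls in exactly one group, so the vector of group counts
    # has L1 sensitivity 1 and the whole release costs epsilon in total.
    counts = df[column].value_counts().sort_index()
    return counts.apply(lambda c: laplace_mech(c, sensitivity=1, epsilon=epsilon))

# hypothetical toy data
people = pd.DataFrame({'edu': ['HS', 'BS', 'BS', 'MS', 'HS', 'BS']})
true_counts = people['edu'].value_counts().sort_index()
noisy_counts = noisy_group_counts(people, 'edu', epsilon=1.0)

# neighboring database: drop the last row and compare the count vectors
neighbor_counts = (people.iloc[:-1]['edu'].value_counts()
                   .reindex(true_counts.index, fill_value=0))
l1_change = int((true_counts - neighbor_counts).abs().sum())
```

Here `l1_change` measures how far the histogram moves between the two neighboring tables; because the groups do not overlap, only one group count can change.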

## Contingency Tables & Parallel Composition

Why can we apply parallel composition to the release of a single contingency table? We can define a contingency table "algorithm" that appeals to point #1 above:

1. For the target columns, determine the domain of each column (i.e. the set of values that occur in that column)
2. Form the set of all combinations of attribute values for the target columns (i.e. the cartesian product of all target column domains)
3. For each value in the set from step 2, return the count of tuples in the table with the same values for the target columns

Here's the key insight: the set of attribute value combinations that we built in step 2 is a **partitioning of the database into non-overlapping parts**. This is true for two reasons: (1) a single row in the database has **exactly one** value for each target column, and (2) each combination of values in the set from step 2 is unique.

It follows that the counting queries in step 3 build a vector with L1 sensitivity of 1. If we add or remove one row, that row will have some values `v1, ..., vk` for the target columns `1, ..., k`. Those values correspond to exactly one combination of attribute values in step 2, so the added or removed row changes the count of exactly one query in step 3, and changes it by exactly 1.
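
The three steps above can be sketched directly, without `pd.crosstab`. This is an illustrative version only: the helper name `contingency_counts` is hypothetical, and the toy table repeats the small three-row example used earlier in the notebook.

```python
import itertools
import numpy as np
import pandas as pd

def contingency_counts(df, columns):
    # Step 1: domain of each target column (the values that occur in it)
    domains = [sorted(df[c].unique().tolist()) for c in columns]
    # Step 2: cartesian product of the column domains
    combos = list(itertools.product(*domains))
    # Step 3: one counting query per combination of values
    counts = {}
    for combo in combos:
        mask = np.ones(len(df), dtype=bool)
        for c, val in zip(columns, combo):
            mask &= (df[c] == val).values
        counts[combo] = int(mask.sum())
    return counts

table = pd.DataFrame({'col1': [1, 2, 1], 'col2': ['a', 'b', 'a']})
counts = contingency_counts(table, ['col1', 'col2'])

# neighboring table: the last row removed (the column domains stay the same here)
counts_p = contingency_counts(table.iloc[:-1], ['col1', 'col2'])
l1_distance = sum(abs(counts[k] - counts_p[k]) for k in counts)
```

Every row matches exactly one combination, so the counts sum to the number of rows and `l1_distance` comes out as 1 for this pair of neighbors.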

In [44]:
np.linalg.norm(pd.crosstab(df['col1'], df['col2']).values.flatten() - pd.crosstab(df_p['col1'], df_p['col2']).values.flatten(), ord=1)

1.0

## Sequential Composition

**Rule of thumb**: if your `n` queries happen to query **overlapping** parts of the database, then you cannot use parallel composition, because a single added or removed row could affect the results of more than one query. Instead, you will have to use sequential composition, for a privacy cost of `n*epsilon`.
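
A small sketch of the overlapping case, under assumed toy data: two counting queries select row sets that overlap, so one removed row can change both answers, and answering each query with `epsilon` costs `n*epsilon` in total. The queries and the data frame are hypothetical; `laplace_mech` is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

# hypothetical toy data
data = pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': ['a', 'b', 'b', 'b']})

# two counting queries whose row sets overlap (row 2 satisfies both)
queries = [
    lambda d: int((d['col1'] == 1).sum()),
    lambda d: int((d['col2'] == 'b').sum()),
]

epsilon = 0.5
noisy_answers = [laplace_mech(q(data), sensitivity=1, epsilon=epsilon) for q in queries]
total_cost = len(queries) * epsilon  # sequential composition: n * epsilon

# removing row 2 (col1 = 1, col2 = 'b') changes both true answers
neighbor = data.drop(index=2)
changes = [abs(q(data) - q(neighbor)) for q in queries]
```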

## Contingency Tables & Sequential Composition

What if we want to release **multiple** contingency tables?

We have to use sequential composition, because a single added or removed row could affect one cell in **each** released table (for a total of `n` cells affected). 
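
One way to act on this, sketched under assumptions: split a total budget evenly across the tables and add Laplace noise to each one. The helper `release_tables`, the even split, and the toy data are illustrative choices, not something prescribed by the lecture.

```python
import numpy as np
import pandas as pd

def release_tables(df, column_pairs, total_epsilon):
    # Split the budget evenly (one possible choice): by sequential composition,
    # n tables at total_epsilon / n each cost total_epsilon overall.
    eps_each = total_epsilon / len(column_pairs)
    released = []
    for row_col, col_col in column_pairs:
        table = pd.crosstab(df[row_col], df[col_col])
        # a single table has L1 sensitivity 1, so the noise scale is 1 / eps_each
        noise = np.random.laplace(loc=0, scale=1 / eps_each, size=table.shape)
        released.append(table + noise)
    return released

# hypothetical toy data
toy = pd.DataFrame({'col1': [1, 2, 1, 2],
                    'col2': ['a', 'b', 'b', 'b'],
                    'col3': ['A', 'B', 'C', 'B']})
noisy_tables = release_tables(toy, [('col1', 'col2'), ('col2', 'col3')],
                              total_epsilon=1.0)
```

With two tables and `total_epsilon=1.0`, each table is released with `epsilon = 0.5`, so each cell gets noise of scale 2.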

In [66]:
df2 = pd.DataFrame(data={'col1': [1, 2, 1, 2], 'col2': ['a', 'b', 'b', 'b'], 'col3': ['A', 'B', 'C', 'B']})
df2_p = pd.DataFrame(data={'col1': [1, 2, 1], 'col2': ['a', 'b', 'b'], 'col3': ['A', 'B', 'C']})
df2

,col1,col2,col3
0,1,a,A
1,2,b,B
2,1,b,C
3,2,b,B


In [69]:
display(pd.crosstab(df2['col1'], df2['col2']))
display(pd.crosstab(df2['col1'], [df2['col2'], df2['col3']]))

#display(pd.crosstab(df2['col2'], df2['col3']))

col2,a,b
col1,,
1,1,1
2,0,2


col2,a,b,b
col3,A,B,C
col1,,,
1,1,0,1
2,0,2,0


In [62]:
display(pd.crosstab(df2_p['col1'], df2_p['col2']))
display(pd.crosstab(df2_p['col2'], df2_p['col3']))

col2,a,b
col1,,
1,1,1
2,0,1


col3,A,B,C
col2,,,
a,1,0,0
b,0,1,1


In [65]:
a = np.concatenate(
    (pd.crosstab(df2['col1'], df2['col2']).values.flatten(),
     pd.crosstab(df2['col2'], df2['col3']).values.flatten()))
b = np.concatenate(
    (pd.crosstab(df2_p['col1'], df2_p['col2']).values.flatten(),
     pd.crosstab(df2_p['col2'], df2_p['col3']).values.flatten()))

# L1 distance between the concatenated tables (computed but not displayed here)
np.linalg.norm(a - b, ord=1)
#display(a)
display(a-b)

# note that a one-row difference between df2 and df2_p changes *two* elements of the vector

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

## What about "non-overlapping" contingency tables?

If a dataframe has four columns, and I construct two contingency tables with two columns each, and the two contingency tables don't share any columns, then can I use parallel composition? **No**. 

The rule of thumb is: each contingency table queries **the whole database**. For each row in the database, that row contributes to **some** count in **every** contingency table, no matter which columns are used.
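
A quick way to see this, as a sketch on made-up data: the cell counts of any contingency table sum to the number of rows, whichever columns it uses. The frame `frame` and the column pairs are hypothetical.

```python
import pandas as pd

# hypothetical four-column table
frame = pd.DataFrame({'col1': [1, 2, 1, 2],
                      'col2': ['a', 'b', 'b', 'b'],
                      'col3': ['A', 'B', 'C', 'B'],
                      'col4': [100, 120, 130, 130]})

# two contingency tables that share no columns
pairs = [('col1', 'col2'), ('col3', 'col4')]
totals = [int(pd.crosstab(frame[a], frame[b]).values.sum()) for a, b in pairs]

# every row lands in exactly one cell of *each* table,
# so each table's total equals the number of rows
every_table_covers_all_rows = all(t == len(frame) for t in totals)
```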

In [77]:
df4 = pd.DataFrame(data={'col1': [1, 2, 1, 2], 
                         'col2': ['a', 'b', 'b', 'b'], 
                         'col3': ['A', 'B', 'C', 'B'],
                         'col4': [100, 120, 130, 130]})
df4_p = pd.DataFrame(data={'col1': [1, 2, 1], 
                           'col2': ['a', 'b', 'b'], 
                           'col3': ['A', 'B', 'C'],
                           'col4': [100, 120, 130]})
display(df4)
display(df4_p)

# Note that df4 and df4_p are neighbors (the last row has been removed from df4_p, and they're the same otherwise)

,col1,col2,col3,col4
0,1,a,A,100
1,2,b,B,120
2,1,b,C,130
3,2,b,B,130


,col1,col2,col3,col4
0,1,a,A,100
1,2,b,B,120
2,1,b,C,130


In [78]:
display(pd.crosstab(df4['col1'], df4['col2']))
display(pd.crosstab(df4_p['col1'], df4_p['col2']))

# note that exactly one count differs between the two versions of the first contingency table

col2,a,b
col1,,
1,1,1
2,0,2


col2,a,b
col1,,
1,1,1
2,0,1


In [79]:
display(pd.crosstab(df4['col3'], df4['col4']))
display(pd.crosstab(df4_p['col3'], df4_p['col4']))

# and one count changes in the second contingency table - even though the columns don't overlap

col4,100,120,130
col3,,,
A,1,0,0
B,0,1,1
C,0,0,1


col4,100,120,130
col3,,,
A,1,0,0
B,0,1,0
C,0,0,1


In [80]:
# we can try the same experiment as before - squish both tables into a vector and see how the L1 norm
# of the vector changes when we remove the row

a = np.concatenate(
    (pd.crosstab(df4['col1'], df4['col2']).values.flatten(),
     pd.crosstab(df4['col3'], df4['col4']).values.flatten()))
b = np.concatenate(
    (pd.crosstab(df4_p['col1'], df4_p['col2']).values.flatten(),
     pd.crosstab(df4_p['col3'], df4_p['col4']).values.flatten()))

display(np.linalg.norm(a - b, ord=1))
display(a-b)

# the answer is: two elements change, even though the contingency tables are on non-overlapping columns
# so parallel composition *cannot* be used

2.0

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0])