## SQL Queries with Differentially Private Set Union (DPSU)

In the context of the WhiteNoise SQL scenario, consider a guiding example: a dataset of Reddit users and their posts, where a single user may have many posts. An analyst would like to release a report of bigram counts drawn from those users' posts while preserving privacy. We'd like this report to cover as many bigrams as possible without violating the privacy bounds.

An author's bigrams can be easy to identify if she has posted only a few times and therefore has very few bigrams. To mitigate this threat, we previously had to drop bigrams whose noisy counts fell below a certain threshold. However, this process was wasteful: privacy budget was allocated uniformly to all rows, even those whose counts already exceeded the threshold.
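
The thresholding approach described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function names, the toy data, and the threshold value are all hypothetical, and the threshold is not calibrated to any particular (epsilon, delta) guarantee. Each user contributes at most `max_contrib` distinct bigrams, Laplace noise is added to every count, and anything below the threshold is dropped.

```python
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def thresholded_counts(user_items, epsilon, max_contrib, threshold, seed=0):
    """Hypothetical helper: noisy per-item user counts, cut at `threshold`."""
    rng = random.Random(seed)
    counts = Counter()
    for items in user_items.values():
        # Bound each user's influence: at most `max_contrib` distinct items,
        # each adding 1 to its count, so one user changes the counts by at
        # most `max_contrib` in total.
        for item in sorted(set(items))[:max_contrib]:
            counts[item] += 1
    scale = max_contrib / epsilon
    noisy = {item: c + laplace(scale, rng) for item, c in counts.items()}
    # Drop items whose noisy count falls below the threshold.
    return {item: n for item, n in noisy.items() if n >= threshold}


# Toy data (illustrative only): user id -> bigrams used by that user.
users = {"u1": ["in the", "it s"], "u2": ["in the"], "u3": ["in the", "mom left"]}
released = thresholded_counts(users, epsilon=1.0, max_contrib=2, threshold=2.0)
print(sorted(released))
```

Note that every user spreads its contribution over all of its bigrams, including ones such as "in the" that other users have already pushed well past the threshold. That is the waste the paragraph above refers to.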

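With DPSU, we resolve this issue by making each user's contribution depend on what earlier users have already contributed, so budget is not spent on bigrams that are already past the threshold. This increases the number of distinct bigrams that survive into the final dataset, so we lose less data and get a richer representation of bigrams.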
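
To make the idea of dependent contributions concrete, here is a deliberately simplified sketch. It is not the algorithm used by the library and carries no privacy guarantee on its own: `allocate_weights`, the `cutoff` parameter, and the toy data are assumptions made for illustration. Each user has one unit of weight and splits it only among bigrams whose accumulated weight is still below the cutoff; noise and a threshold would then be applied to the resulting weights.

```python
def allocate_weights(user_items, cutoff):
    """Hypothetical helper: accumulate per-item weights, one user at a time."""
    weights = {}
    for items in user_items.values():
        distinct = sorted(set(items))
        # Only items still below the cutoff receive any of this user's weight.
        below = [i for i in distinct if weights.get(i, 0.0) < cutoff]
        if not below:
            continue
        share = 1.0 / len(below)  # split the unit budget equally
        for item in below:
            # Never push an item past the cutoff (leftover weight is simply
            # discarded in this simplified version).
            weights[item] = min(cutoff, weights.get(item, 0.0) + share)
    return weights


# Toy data (illustrative only): user id -> bigrams used by that user.
users = {"a": ["x", "y"], "b": ["x"], "c": ["x", "z"]}
weights = allocate_weights(users, cutoff=1.0)
print(weights)  # {'x': 1.0, 'y': 0.5, 'z': 1.0}
```

User `c` arrives after `x` has already reached the cutoff, so all of its weight goes to `z`. With a uniform split, `z` would have received only 0.5, which is why the dependent allocation lets more rare items clear the threshold.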

The code below showcases DPSU support with pandas.

In [1]:
import pandas as pd
from opendp.whitenoise.metadata import CollectionMetadata
from opendp.whitenoise.sql import PrivateReader, PandasReader

reddit = pd.read_csv("../../data/readers/reddit.csv", index_col=0)
metadata = CollectionMetadata.from_file("../../data/readers/reddit.yaml")

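# Count how often each bigram occurs, most frequent first.
query = "SELECT ngram, COUNT(*) as n FROM reddit.reddit GROUP BY ngram ORDER BY n desc"

# Run the exact (non-private) query first, for reference.
reader = PandasReader(metadata, reddit)
exact = reader.execute(query)
print(len(exact))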

205092


In [10]:
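# Wrap the exact reader in a differentially private one. The comparison cell
# further down turns DPSU off explicitly, which suggests it is on by default.
private_reader = PrivateReader(metadata, reader)
result = private_reader.execute_typed(query)
print(result)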

 ngram         |n      
 --------------|-------
  (n, t)       | 128   
  (i, m)       | 66    
  (it, s)      | 64    
  (that, s)    | 55    
  (in, the)    | 53    
  (i, was)     | 41    
  (do, n)      | 36    
  (mom, left)  | 35    


In [11]:
sum(result['n'])

478

In [6]:
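# Baseline for comparison: the same query with DPSU disabled.
private_reader_korolova = PrivateReader(metadata, reader)
private_reader_korolova.options.use_dpsu = False
# Cap on per-user contributions (see the library docs for the exact semantics).
private_reader_korolova.options.max_contrib = 5
korolova_result = private_reader_korolova.execute_typed(query)
print(korolova_result)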

 ngram       |n      
 ------------|-------
  (n, t)     | 128   
  (i, m)     | 84    
  (it, s)    | 78    
  (do, n)    | 59    
  (in, the)  | 50    


In [7]:
sum(korolova_result['n'])

399
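
The two runs above can be compared directly. The snippet below copies the printed values into plain dictionaries (rather than reusing `result` and `korolova_result`) so that it runs on its own; keep in mind that both outputs are noisy and come from a single run, so the exact numbers will differ on re-execution.

```python
# Values copied from the printed outputs above.
dpsu = {"(n, t)": 128, "(i, m)": 66, "(it, s)": 64, "(that, s)": 55,
        "(in, the)": 53, "(i, was)": 41, "(do, n)": 36, "(mom, left)": 35}
korolova = {"(n, t)": 128, "(i, m)": 84, "(it, s)": 78, "(do, n)": 59,
            "(in, the)": 50}

# Number of bigrams released by each approach.
print(len(dpsu), len(korolova))          # 8 5

# Bigrams that appear only in the DPSU output.
only_dpsu = sorted(set(dpsu) - set(korolova))
print(only_dpsu)                         # ['(i, was)', '(mom, left)', '(that, s)']

# Total reported counts, matching the sums computed above.
print(sum(dpsu.values()), sum(korolova.values()))  # 478 399
```

In this run, DPSU releases three bigrams that the baseline drops.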

### SQL Module with DPSU

In [None]:
from opendp.whitenoise.client import get_execution_client

execution_client = get_execution_client()

project = {"params": {"dataset_name": "reddit",
                      "budget": 0.5,
                      "query": "SELECT ngram, COUNT(*) as c FROM reddit.reddit GROUP BY ngram ORDER BY c desc"},
           "uri": "modules/sql-module"}

response = execution_client.submit(params=project["params"],
                                   uri=project["uri"])

In [None]:
import json

In [None]:
json_response = json.loads(response.result)
pd.DataFrame.from_dict(json_response)