## SQL Queries with Differentially Private Set Union (DPSU)

In the context of the WhiteNoise SQL scenario, consider a guiding example: a dataset of Reddit users and their posts, where a single user may have many posts. An analyst would like to release a report containing the count of each bigram that appears in those posts while preserving privacy. We'd like to increase the number of bigrams represented in this report without violating privacy bounds.

It can be easy to identify an author's bigrams if she has only posted a few times and therefore has very few bigrams. To mitigate this threat, we previously had to drop bigrams whose noisy counts fell below a certain threshold. However, this process was wasteful because we allocated privacy budget to all rows uniformly, even for bigrams whose counts already exceeded the threshold.
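The thresholding approach described above can be sketched as follows. This is a minimal illustration, not the WhiteNoise implementation: the function names, the `epsilon` and `threshold` values, and the toy rows are all assumptions, and the sketch assumes each author contributes at most one row so that a count changes by at most 1 per author.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_threshold_counts(rows, epsilon, threshold, rng):
    """rows: (author, bigram) pairs, assumed to hold at most one row per author."""
    counts = Counter(bigram for _, bigram in rows)
    released = {}
    for bigram, count in counts.items():
        noisy = count + laplace_noise(1.0 / epsilon, rng)
        # Bigrams whose noisy count falls below the threshold are dropped.
        if noisy >= threshold:
            released[bigram] = noisy
    return released

rng = random.Random(0)
# Hypothetical data: one common bigram and one bigram used by a single author.
rows = [(f"user{i}", ("in", "the")) for i in range(200)]
rows.append(("rare_user", ("hanging", "it")))
released = noisy_threshold_counts(rows, epsilon=1.0, threshold=20.0, rng=rng)
print(sorted(released))
```

Every bigram receives the same noise scale here, including the common one whose count is far above the threshold, which is the uniform allocation the text calls wasteful.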

With DPSU, we resolve this issue by letting each user contribute in a dependent fashion, guided by what has already been counted, before noise is added. This increases the number of distinct bigrams in the final result, so we lose less data and get a richer representation of bigrams.
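One way to read "in a dependent fashion" is that each user spends a fixed budget only on bigrams that still need support, instead of spreading it uniformly. The sketch below is loosely modeled on that idea and is not the library's algorithm: `policy_weights`, the `cutoff` parameter, and the toy users are hypothetical, and the noise addition and calibrated threshold that a private release would still require are deliberately left out.

```python
from collections import defaultdict

def policy_weights(user_items, cutoff):
    """Accumulate per-item weights; each user spends at most 1 unit in total."""
    weights = defaultdict(float)
    for items in user_items.values():
        budget = 1.0
        # Only items still below the cutoff receive weight, smallest gap first.
        open_items = sorted(
            (i for i in set(items) if weights[i] < cutoff),
            key=lambda i: cutoff - weights[i],
        )
        while open_items and budget > 1e-12:
            k = len(open_items)
            gap = cutoff - weights[open_items[0]]
            # Raise all open items equally until one hits the cutoff
            # or the user's budget runs out.
            step = min(gap, budget / k)
            for i in open_items:
                weights[i] += step
            budget -= step * k
            open_items = [i for i in open_items if cutoff - weights[i] > 1e-12]
    return dict(weights)

# Hypothetical users: "a" wrote two bigrams, "b" and "c" only the common one.
users = {"a": ["x", "y"], "b": ["x"], "c": ["x"]}
weights = policy_weights(users, cutoff=1.0)
print(weights)
```

In this toy run, user `c` spends nothing because `x` has already reached the cutoff, so no budget is wasted on an item that needs no further support.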

The code below showcases DPSU support with pandas.

In [1]:
import re
import os
import pandas as pd
from opendp.whitenoise.metadata import CollectionMetadata
from opendp.whitenoise.metadata.collection import Table, String
from opendp.whitenoise.sql import PrivateReader, PandasReader

In [2]:
def find_ngrams(input_list, n):
    return input_list if n == 1 else list(zip(*[input_list[i:] for i in range(n)]))

def _download_file(url, local_file):
    try:
        from urllib import urlretrieve
    except ImportError:
        from urllib.request import urlretrieve
    urlretrieve(url, local_file)

reddit_url = "https://github.com/heyyjudes/differentially-private-set-union/raw/master/data/clean_askreddit.csv.zip"


reddit_dataset_path = os.path.join("datasets", "reddit.csv")
if not os.path.exists(reddit_dataset_path):
    datasets = os.path.join("datasets")
    # Create the target directory first; the download fails if it does not exist.
    os.makedirs(datasets, exist_ok=True)
    reddit_zip_path = os.path.join(datasets, "askreddit.csv.zip")
    clean_reddit_path = os.path.join(datasets, "clean_askreddit.csv")
    _download_file(reddit_url, reddit_zip_path)
    from zipfile import ZipFile
    with ZipFile(reddit_zip_path) as zf:
        zf.extractall(datasets)
    reddit_df = pd.read_csv(clean_reddit_path, index_col=0)
    # Work on a 5% sample of the posts to keep the example fast.
    reddit_df = reddit_df.sample(frac=0.05)
    # Lowercase the text and keep only runs of word characters.
    reddit_df['clean_text'] = reddit_df['clean_text'].astype(str).str.lower()
    reddit_df['clean_text'] = reddit_df['clean_text'].apply(lambda x: " ".join(re.findall(r'\w+', x)))
    # Split each post into bigrams.
    reddit_df['ngram'] = reddit_df['clean_text'].map(lambda x: find_ngrams(x.split(" "), 2))
    rows = list()
    for row in reddit_df[['author', 'ngram']].iterrows():
        r = row[1]
        for ngram in r.ngram:
            rows.append((r.author, ngram))
    ngrams = pd.DataFrame(rows, columns=['author', 'ngram'])
    ngrams.to_csv(reddit_dataset_path)


reddit_schema_path = os.path.join("datasets", "reddit.yaml")
if not os.path.exists(reddit_schema_path):
    reddit = Table("reddit", "reddit", 500000, [
                String("author", card=10000, is_key=True),
                String("ngram", card=10000)
    ])
    schema = CollectionMetadata([reddit], "csv")
    schema.to_file(reddit_schema_path, "reddit")
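
To make the preprocessing concrete, here is a small standalone run of the same cleaning and bigram steps on a single made-up sentence. The sentence is hypothetical; it assumes contractions arrive already split as in `do n't`, which would explain bigrams such as `(do, n)` and `(n, t)` in the query results.

```python
import re

def find_ngrams(input_list, n):
    # Same helper as in the cell above: zip n shifted copies of the token list.
    return input_list if n == 1 else list(zip(*[input_list[i:] for i in range(n)]))

text = "I do n't know"  # hypothetical example sentence
clean = " ".join(re.findall(r"\w+", text.lower()))
bigrams = find_ngrams(clean.split(" "), 2)
print(clean)    # i do n t know
print(bigrams)  # [('i', 'do'), ('do', 'n'), ('n', 't'), ('t', 'know')]
```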

In [3]:
reddit = pd.read_csv("datasets/reddit.csv", index_col=0)
metadata = CollectionMetadata.from_file("datasets/reddit.yaml")

query = "SELECT ngram, COUNT(*) as n FROM reddit.reddit GROUP BY ngram ORDER BY n desc"

reader = PandasReader(metadata, reddit)
exact = reader.execute_typed(query)
print(sum(exact['n']))

508487


In [19]:
private_reader_korolova = PrivateReader(metadata, reader)
private_reader_korolova.options.use_dpsu = False
private_reader_korolova.options.max_contrib = 5
korolova_result = private_reader_korolova.execute_typed(query)
print(korolova_result)

 ngram        |n      
 -------------|-------
  (n, t)      | 115   
  (i, m)      | 70    
  (it, s)     | 60    
  (in, the)   | 57    
  (do, n)     | 45    
  (for, the)  | 43    
  (i, was)    | 37    


In [20]:
print(sum(korolova_result['n']))

427
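
The `max_contrib = 5` option in the run above bounds, as we understand it, how many rows a single key value (here `author`) can contribute to the result. The sketch below illustrates that kind of bounding with plain Python; `cap_contributions` and the toy rows are hypothetical and this is not the library's implementation.

```python
import random
from collections import defaultdict

def cap_contributions(rows, max_contrib, rng):
    """Keep at most max_contrib randomly chosen rows per author."""
    by_author = defaultdict(list)
    for author, bigram in rows:
        by_author[author].append(bigram)
    capped = []
    for author, bigrams in by_author.items():
        if len(bigrams) > max_contrib:
            # Sample without replacement so no author exceeds the cap.
            bigrams = rng.sample(bigrams, max_contrib)
        capped.extend((author, b) for b in bigrams)
    return capped

rng = random.Random(0)
# Hypothetical data: one prolific author and one occasional author.
rows = [("heavy", ("in", "the"))] * 50 + [("light", ("i", "was"))] * 2
capped = cap_contributions(rows, max_contrib=5, rng=rng)
print(len(capped))  # 7
```

Bounding contributions this way limits how much any one author can move a count, at the cost of discarding rows from prolific authors.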


Now we run PrivateReader with the DPSU optimization, which yields greater coverage of n-grams while keeping the same differential privacy guarantees (see the DPSU paper: https://arxiv.org/pdf/2002.09745.pdf). Note that the reader below is created without overriding `use_dpsu`.

In [23]:
private_reader = PrivateReader(metadata, reader)
result = private_reader.execute_typed(query)
print(result)

 ngram        |n      
 -------------|-------
  (n, t)      | 103   
  (i, m)      | 83    
  (i, was)    | 57    
  (it, s)     | 53    
  (that, s)   | 52    
  (there, s)  | 38    
  (this, is)  | 38    
  (do, n)     | 36    
  (i, do)     | 36    
  (in, the)   | 35    


In [24]:
sum(result['n'])

531
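
The totals printed above can be compared directly. The snippet below only redoes that arithmetic with the numbers shown in this notebook; the variable names are ours, and because each private run is randomized, a rerun would give somewhat different figures.

```python
exact_total = 508487  # sum of exact counts printed earlier

runs = {
    "without DPSU": {"bigrams": 7, "total": 427},
    "with DPSU": {"bigrams": 10, "total": 531},
}

for name, run in runs.items():
    # Share of the exact total count that survives in the private result.
    share = run["total"] / exact_total
    print(f"{name}: {run['bigrams']} bigrams, {share:.4%} of the exact total")

gain = runs["with DPSU"]["total"] / runs["without DPSU"]["total"]
print(f"released count ratio: {gain:.2f}")
```

In this single run, the DPSU result releases three more bigrams and a larger total count than the run with `use_dpsu = False`.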

### SQL Module with DPSU

In [25]:
from opendp.whitenoise.client import get_execution_client

execution_client = get_execution_client()

project = {"params": {"dataset_name": "reddit", 
                      "budget": 0.5,
                      "query": "SELECT ngram, COUNT(*) as c FROM reddit.reddit GROUP BY ngram ORDER BY c desc"},
           "uri": "modules/sql-module"}

response = execution_client.submit(params=project["params"],
                            uri=project["uri"])

In [26]:
import json

In [27]:
json_response = json.loads(response.result)
pd.DataFrame.from_dict(json_response)

    | ngram          |c      
 ---|----------------|-------
  0 | (n, t)         | 124   
  1 | (it, s)        | 80    
  2 | (hanging, it)  | 70    
