# Policy Laplace on Spark

A Spark implementation of the Policy Laplace algorithm from *Differentially Private Set Union* (https://arxiv.org/abs/2002.09745).

In [1]:
import pyspark
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()

In [2]:
filepath = "../../differentially-private-set-union/data/clean_askreddit.csv"
# Read the CSV (with a header row), infer column types, and drop rows containing nulls.
reddit = spark.read.load(filepath, format="csv", sep=",", inferSchema="true", header="true").dropna()

### Tokenize

Tokenize the posts loaded above. This step can be any caller-specific tokenization routine and is independent of differential privacy. The output RDD should contain one `(user, tokens)` pair per row, where `tokens` is a list. It may contain multiple rows per user and does not need to be ordered in any way.

In [3]:
import nltk

n_grams = 2
distinct = True

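def tokenize(user_post):
    """Turn one (user, post) row into (user, list of tokens)."""
    user, post = user_post
    tokens = post.split(" ")
    if n_grams > 1:
        # Build n-grams and join each one into a single string, e.g. "the_cat".
        tokens = list(nltk.ngrams(tokens, n_grams))
        tokens = ["_".join(g) for g in tokens]
    if distinct:
        # Count each token at most once per row.
        tokens = list(set(tokens))
    return (user, tokens)
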
tokenized = reddit.select("author", "clean_text").rdd.map(tokenize).persist()
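To make the expected row shape concrete, here is a minimal sketch of the same tokenization on a single in-memory row, without Spark or nltk. The function name `tokenize_row` and the sample row are hypothetical and used for illustration only; the `zip` over shifted slices is assumed to be an adequate stand-in for `nltk.ngrams` on a plain list.

```python
def tokenize_row(user_post, n_grams=2, distinct=True):
    """Hypothetical stand-alone version of tokenize() for one (user, post) row."""
    user, post = user_post
    tokens = post.split(" ")
    if n_grams > 1:
        # Zipping shifted slices yields consecutive n-grams as tuples.
        grams = zip(*(tokens[i:] for i in range(n_grams)))
        tokens = ["_".join(g) for g in grams]
    if distinct:
        # Drop repeated tokens within the row.
        tokens = list(set(tokens))
    return (user, tokens)

# Made-up sample row, not taken from the dataset.
user, tokens = tokenize_row(("alice", "the cat saw the cat"))
print(user, sorted(tokens))
```

The repeated bigram `the_cat` appears only once in the result because `distinct` is set.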


In [4]:
from policy_laplace import PolicyLaplace

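epsilon = 3.0            # privacy parameter epsilon
delta = np.exp(-10)      # privacy parameter delta (about 4.54e-05)
alpha = 5.0              # Policy Laplace parameter alpha
tokens_per_user = 10     # maximum tokens kept per user (reported as Delta_0 below)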

pl = PolicyLaplace(epsilon, delta, alpha, tokens_per_user)

Params Delta_0=10, delta=4.54e-05, l_param=0.3333333333333333, l_rho=4.102284273146641, Gamma=5.768950939813308
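The printed parameters can be reproduced by hand. The sketch below assumes the Policy Laplace formulas as we read them in the paper: `l_param = 1/epsilon`, `l_rho` as a maximum over the number of items `t` a user may contribute (from 1 to `Delta_0`), and `Gamma = l_rho + alpha * l_param`. The function `policy_laplace_params` is hypothetical and is not taken from the `policy_laplace` module, whose internals are not shown here.

```python
import math

def policy_laplace_params(epsilon, delta, alpha, delta_0):
    """Hypothetical helper: derive (l_param, l_rho, Gamma) from the inputs."""
    # Assumed: the Laplace scale is the inverse of epsilon.
    l_param = 1.0 / epsilon
    # Assumed: l_rho is the worst case over t = 1..delta_0 contributed items.
    l_rho = max(
        1.0 / t
        + l_param * math.log(1.0 / (2.0 * (1.0 - (1.0 - delta) ** (1.0 / t))))
        for t in range(1, delta_0 + 1)
    )
    # Assumed: Gamma sits alpha Laplace scales above l_rho.
    gamma = l_rho + alpha * l_param
    return l_param, l_rho, gamma

l_param, l_rho, gamma = policy_laplace_params(3.0, math.exp(-10), 5.0, 10)
print(l_param, l_rho, gamma)
```

With the settings from the cell above, these values should agree with the printed `l_param`, `l_rho`, and `Gamma` to several decimal places.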


In [5]:
sampled = pl.reservoir_sample(tokenized, distinct)

ngh = sampled.repartition(1).mapPartitions(pl.process_rows).take(1)
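The `reservoir_sample` step presumably caps each user's contribution at `tokens_per_user` tokens. The internals of `policy_laplace` are not shown here, so the following is only a generic sketch of reservoir sampling (Algorithm R) over one user's token stream; the name `reservoir_sample_tokens` and its signature are hypothetical.

```python
import random

def reservoir_sample_tokens(stream, k, rng=None):
    """Hypothetical helper: uniformly sample at most k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# A user with 100 tokens is cut down to 10; a user with 2 tokens keeps both.
sample = reservoir_sample_tokens(range(100), 10, random.Random(0))
short = reservoir_sample_tokens(["a_b", "b_c"], 10)
print(len(sample), short)
```

A single pass suffices, so the sample can be drawn without knowing in advance how many tokens a user has.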

In [6]:
output_vocab = {}
for ngdict in ngh:
    # Keep only the n-grams whose value passes the threshold check.
    for key, val in ngdict.items():
        if pl.exceeds_threshold(val):
            output_vocab[key] = val
    print("Retrieved {0} words from {1}".format(len(output_vocab), len(ngdict)))


Retrieved 8552 words from 554564
