Skip to content

Hadamard Response: Communication efficient, sample optimal, linear time locally private learning of distributions

Notifications You must be signed in to change notification settings

zitengsun/hadamard_response

Repository files navigation

Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication

This package implements Hadamard Response for locally private discrete distribution estimation proposed in Communication, Efficient Sample Optimal Linear Time Locally Private Discrete Distribution Estimation.

Also included are implementations of k-Randomized Response(RR), k-RAPPOR and Subset Selection(SS) schemes.

The implementation is in Python3, and uses Numpy and matplotlib.

Table of contents

Usage

We provide the packages as .py files and we also provide the .pynb files we wrote originally for testing in the Jupyter _Notebooks folder.

The comments in the code files are enough to understand how to use the functions. Here we provide some examples which hopefully will be helpful. The four schemes are implemented based on python classes. Before using, please first import the packages and then specialize the scheme with the alphabet size (k) and the required privacy level (eps).

    import k2k_hadamard
    import Subsetselection
    import RR_RAPPOR
    
    subset = Subsetselection.Subsetselection(k, eps) #class for subset selection algorithm
    rappor = RR_RAPPOR.RAPPOR(k, eps) #class for RAPPOR
    rr = RR_RAPPOR.Randomized_Response(k, eps) #class for Randomized Response
    hr = k2k_hadamard.Hadamard_Rand_general(k, eps) #initialize hadamard response

When you simulate Hadamard response on a single computer, we provide the option of encoding acceleration by storing Hadamard matrix with the expense of large memory cost. If you want to accelerate the encoding process, just set variable encode_acc to be one when intializing hadamard response.

    import k2k_hadamard
    hr = k2k_hadamard.Hadamard_Rand_general(k, eps, encode_acc = 1) #initialize hadamard response

Hadamard Response

We provide multiple classes for Hadamard response scheme. If you are focusing on the high privacy regime, i.e. the case when the privacy prarameter is small (small single digit), please use the class Hadamard_Rand_high_priv. If you want to test other cases, please use Hadamard_Rand_general, which is an improved version of Hadamard_Rand_general_original. The latter is the version we provide in the paper and it is easier to analyze, so we also provide it in the package.

The following script encodes the input string into its randomized version and then learns the distribution based on the output string:

    out_string = hr.encode_string(in_list)
    prob_est = hr.decode_string(out_string) # estimate the original underlying distribution

If you only want to encode a single symbol, please use:

    out_symbol = hr.encode_symbol(in_symbol)

When you decode the output string, you can also choose whether to use fast decoding and how to normalize the output. In general, fast decoding is preferred. For normalization, we provide two options, clip and normalize(default setting) and simplex projection (Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application).

Randomized Response

The following script first encodes the input string into its randomized version and then learn the distribution based on the output string:

    out_string = rr.encode_string(in_list)
    prob_est = rr.decode_string(out_string) # estimate the original underlying distribution

The same normalization options are also provided for randomized response.

Subset Selection and RAPPOR

The implementation of these two are quite similar, so here we give Subset Selection as an example.

We have three encoding modes for Subsetselection, standard, compress and light. For standard mode, the following encoding functions encode the input string into a 0/1 matrix of size (n,k) (n is the number of inputs and k is the alphabet size) where each row is the codeword corresponding to a certain input. To decode, we simply use decode_string function to decode this matrix:

    out_string = subset.encode_string_fast(in_list) 
    prob_est = subset.decode_string(outp_string,n) # estimate the original underlying distribution

The drawback of standard mode is that it will have a huge momory cost, so we provide the light mode, which encode the input string into a k-length string which represents the total number of 1's at each location. We will also return the time for the counting step as this should be added into the decoding time. For decoding, we use decode_counts:

    counts, time = subset.encode_string_light(in_list) 
    prob_est = subset.decode_counts(counts,n) # estimate the original underlying distribution

For compress mode, we encode each input symbol into a string of locations of 1's and then concatenate them together to get a bigger string. This may result in less communication cost when the privacy parameter is relatively large. When decoding, we need to first get the histogram and then use decode_counts:

    out_list = subset.encode_string_compress(in_list) #subset selection
    counts, temp = np.histogram(out_list,range(k+1)) # k is the alphabet size
    prob_est = subset.decode_counts(counts, n) # estimate the original underlying distribution

Simulation

For simulation, we provide functions to get geometric, uniform, two step, Zipf and Dirichlet distributions.

    import functions
    dist = generate_geometric_distribution(k,lbd)

In file test.ipynb (also available in test.py), we provide a function to compare the perfomance of the four schemes in terms of l_1 and l_2 error:

    data = test(k, eps, rep, point_num, step_sz, init, dist, encode_acc = 1, encode_mode = 0)
    #Args:
    # k : alphabet size,  eps: privacy level, rep: repetition times to compute a point
    # point_num: number of points to compute
    # step_sz: distance between two sample sizes
    # init: initial sample size = init* step_sz
    # dist: underlying data distribution: choose from 'Uniform', 'Two_steps', 'Zipf', 'Dirchlet', 'Geometric'
    # encode_acc : control whether to use fast encoding for hadamard responce
    # mode: control encoding method for rappor and subset selection

You can customize the testing process by setting these parameters. The returned data file will contain the errors, time stamps, parameter settings and other related information. Plots for comparing l_1, l_2 error and decoding time will be generated. A .mat file with time stamp will also be stored when the testing process is done.

Acknowledgement

We thank Peter Kairouz and Gautam Kamath for valuable suggestions on improving the code. This research was supported by NSF through the grant NSF CCF-1657471.