
Code for the Location Heatmaps paper. #47

Open · wants to merge 13 commits into master

Conversation

@ebagdasa commented Nov 4, 2021

This code demonstrates the ability to build location heatmaps using a
distributed differential privacy mechanism and the proposed adaptive
algorithm. The code accompanies this paper: https://arxiv.org/abs/2111.02356 .
It also includes a Google Colab example for the experiments.

@google-cla bot commented Nov 4, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@google-cla google-cla bot added the cla: no label Nov 4, 2021
@ebagdasa (Author) commented Nov 4, 2021

@googlebot I signed it!

@google-cla google-cla bot added cla: yes and removed cla: no labels Nov 4, 2021

To experiment with the code there is a working [notebook](dp_location_heatmaps.ipynb)
with all the examples from the paper, please don't hesitate to contact the
[author](mailto:eugene@cs.cornell.edu) or raise an issue.1


extraneous '1' at end of line

@ebagdasa (Author): fixed

image: Any
level_sample_size: int = 10000
secagg_round_size: int = 10000
threshold: float = 0


how about 'split_threshold' to contrast with 'collapse_threshold' below?

@ebagdasa (Author): fixed

@@ -127,6 +123,9 @@ def sample_inverse_prob(self):
def eps_local(self):
return np.log(2 * self.num_clients / self.lam - 1)

def get_noise_tensor(self, input_shape):
return


IIUC, this is unused for RapporNoise and should not be called by users. Shall we raise a NotImplementedError instead of silently returning? Should this method be _get_noise_tensor instead of get_noise_tensor to discourage direct usage, pointing users toward apply_noise instead?
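A minimal sketch of the suggested change (method name from the diff above; the surrounding RapporNoise class context is assumed):

def get_noise_tensor(self, input_shape):
    # Sampling a standalone noise tensor is not meaningful for RAPPOR;
    # noise is applied via apply_noise instead.
    raise NotImplementedError(
        'RapporNoise does not support get_noise_tensor; use apply_noise.')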

@ebagdasa (Author) commented Dec 10, 2021: fixed

@@ -36,13 +37,32 @@ class Metrics:
f1: f1 score on the discovered hot spots.
mutual_info: mutual information metric.
"""


add definition/description of new metrics (mape, smape, maape, nmse)


could also note how the zeros are handled (replaced with the next smallest true value from the image)
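For reference, a minimal sketch of common definitions of these metrics with the zero handling described above (illustrative only, not the PR's actual get_metrics()):

import numpy as np

def error_metrics(true_image, test_image):
    # Replace zeros with the smallest positive true value so that
    # relative errors are well defined.
    true = true_image.astype(float).copy()
    true[true == 0] = true[true > 0].min()
    test = test_image.astype(float)

    mape = np.mean(np.abs((true - test) / true))   # mean absolute percentage error
    smape = np.mean(2 * np.abs(true - test) /
                    (np.abs(true) + np.abs(test)))  # symmetric MAPE
    maape = np.mean(np.arctan(np.abs((true - test) / true)))  # arctan-bounded MAPE
    nmse = np.sum((true - test) ** 2) / np.sum(true ** 2)  # normalized MSE
    return mape, smape, maape, nmse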

@ebagdasa (Author): updated, added clarification to get_metrics()

f'MSE: {metric.mse:.2e}')
ax.imshow(test_image)
f'MSE: {metric.mse:.2e}', fontsize=30)
ax.imshow(test_image, interpolation='gaussian')


Why Gaussian interpolation for image display? The statistics calculated and displayed over the image would be different if calculated on the gaussian-interpolated image, wouldn't they?

@ebagdasa (Author): this is just for visualization, it doesn't impact metrics. Mostly it improves rendering of lines in the contour grid image.
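A toy illustration of the distinction (metrics are computed on the raw arrays; matplotlib's interpolation only changes how pixels are drawn):

import numpy as np
import matplotlib.pyplot as plt

true_image = np.random.rand(16, 16)
test_image = true_image + np.random.normal(0, 0.1, (16, 16))
mse = np.mean((true_image - test_image) ** 2)  # uses the raw arrays

fig, ax = plt.subplots()
ax.imshow(test_image, interpolation='gaussian')  # draw-time smoothing only
ax.set_title(f'MSE: {mse:.2e}')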

# print(f'Collapsed: {collapsed}, created when collapsing: {created},' + \
# f'new expanded: {fresh_expand},' + \
# f'unchanged: {unchanged}, total: {len(new_tree_prefix_list)}')
if fresh_expand == 0: # len(new_tree_prefix_list) <= len(tree_prefix_list):


remove or uncomment extraneous debugging code

@ebagdasa (Author): done

neg_image[x_bot:x_top + 1, y_bot:y_top + 1] = count
else:
raise ValueError(f'Not supported: {pos}')
return current_image, pos_image, neg_image


def split_regions(tree_prefix_list,


this function is very long with a lot of nested conditions, which makes it hard to read. can we extract some reasonable helpers to improve readability by making the overall structure of the threshold checks and tree traversal more apparent?

@ebagdasa (Author): done

@@ -276,3 +435,167 @@ def quantize_vector(vector, left_bound, right_bound):
scale = (vector - left_bound) // distance
vector -= distance * scale
return vector


def makeGaussian(image, total_size, fwhm=3, center=None,


make_gaussian for consistent style

@ebagdasa (Author): updated

elif level == 98:
z = 2.326
else:
raise ValueError(f'Incorrect confidence level {level}.')


It'd be nicer to just compute the z-score analytically, rather than having a sparse lookup table. I think the following should do the trick:

from scipy.stats import norm

z = norm.ppf(1 - (1 - level / 100) / 2)
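(For level=95 this gives z ≈ 1.960; for level=98, z ≈ 2.326, matching the table above.)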

@ebagdasa (Author): updated


def make_step(samples, eps, threshold, partial,
prefix_len, dropout_rate, tree, tree_prefix_list,
noiser, quantize, total_size, positivity, count_min):


please add docstring

@ebagdasa (Author): added

@ebagdasa (Author): @samellem please take a look, addressed your comments

@samellem left a comment:


Looking pretty good! Just a few additional ideas.

Also, it looks like you missed a few comments from the previous review that got auto-collapsed in the GitHub UI [screenshot omitted].

Most of those were small nits, and some may not be as relevant after your changes, but please do take a look at them if you missed them the first time.

Returns:
new_tree, new_tree_prefix_list, finished
new_tree, new_tree_prefix_list, fresh_expand


The meaning of fresh_expand is not obvious, especially since this function both collapses and expands. Maybe num_newly_expanded_nodes?

@ebagdasa (Author): great, thanks a lot!

image_bit_level: stopping criteria once the final resolution is reached.
collapse_threshold: threshold value used to collapse the nodes.
expand_all: expand all regions,


can we achieve this functionality by just passing split_threshold = -np.inf and eliminate the extra parameter & special-casing?
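A small sketch of the sentinel-threshold idea (helper name is illustrative, not from the PR):

import numpy as np

def should_split(count, split_threshold):
    # With split_threshold = -np.inf the check always passes, so
    # 'expand all regions' needs no separate flag or special case.
    return count > split_threshold

assert should_split(0, -np.inf)    # expand-all behavior
assert not should_split(5, 10.0)   # normal thresholding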

@ebagdasa (Author): yeah, great idea, fixed

@@ -260,12 +260,30 @@ def rebuild_from_vector(vector, tree, image_size, contour=False, threshold=0,
return current_image, pos_image, neg_image


def update_tree(prefix, tree, tree_prefix_list):


nit: maybe append_to_tree instead of update_tree?

@ebagdasa (Author): added

Returns:
new_tree, new_tree_prefix_list, finished
new_tree, new_tree_prefix_list, fresh_expand
"""
collapsed = 0
created = 0
fresh_expand = 0
unchanged = 0


collapsed, created, and unchanged do not appear to be used for anything anymore. let's delete them xor do something with them.

@ebagdasa (Author): done

@ebagdasa (Author): printing the results at the end of the function now

collapsed = 0
created = 0
fresh_expand = 0
unchanged = 0


likewise re. collapsed, created, and unchanged being unused

@ebagdasa (Author): done

return new_tree, new_tree_prefix_list, fresh_expand


def split_regions_aux(tree_prefix_list,


I suspect that even more of these two functions could be shared (in particular, the basic structure of looping over prefixes and adding nodes to the tree as appropriate for the splitting & collapsing criteria), but acknowledge that it may not actually improve readability much more to do further surgery. Please consider sharing that prefix-looping structure, but if you can't see a clean and easy way to do so, that's fine.
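One possible shape for the shared skeleton (purely illustrative; the real functions take many more parameters and also handle collapsing):

def traverse_prefixes(tree_prefix_list, decide, expand):
    # decide(prefix) -> True to split the node, False to keep it;
    # expand(prefix) -> list of child prefixes for a split node.
    new_tree_prefix_list, fresh_expand = [], 0
    for prefix in tree_prefix_list:
        if decide(prefix):
            children = expand(prefix)
            new_tree_prefix_list.extend(children)
            fresh_expand += len(children)
        else:
            new_tree_prefix_list.append(prefix)
    new_tree = {p: i for i, p in enumerate(new_tree_prefix_list)}
    return new_tree, new_tree_prefix_list, fresh_expand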

@ebagdasa (Author): yeah, I agree; it's just that I need to look at both bits in the data, which is hard to unify. Maybe once we go to multiple dimensions we can just unify everything.

@@ -499,18 +500,11 @@ def convert_to_dataset(image, total_size, value=None):


def compute_conf_intervals(sum_vector: np.ndarray, level=95):
from scipy.stats import norm


I'd prefer not to do imports inline like this.

@ebagdasa (Author): done

@@ -527,9 +521,46 @@ def compute_conf_intervals(sum_vector: np.ndarray, level=95):
return conf_intervals, conf_interval_weighted


def make_step(samples, eps, threshold, partial,
def create_confidence_interval_condition(last_result, prefix, count, split_threshold):


we're ultimately just returning a boolean here, so maybe evaluate_confidence_interval_condition instead? the current name makes me think that we're returning some kind of predicate function

@ebagdasa (Author): updated and added a docstring

@@ -122,35 +132,70 @@ def run_experiment(true_image,
quantize: apply quantization to the vectors.
noise_class: use specific noise, defaults to GeometricNoise.
save_gif: saves all images as a gif.
count_min: use count-min sketch.


resolved

@ebagdasa (Author): yeah, thanks a lot, and sorry for the missed comments. Should be all good now.

@samellem left a comment:


approved with some minor suggestions

@@ -65,10 +85,15 @@ def coordinates_to_binary_path(xy_tuple, depth=10):
Returns:
binary version of the coordinate.
"""
x_coord, y_coord = xy_tuple
if len(xy_tuple) == 2:

I think this comment applies to aux_data now that that is being pulled from xy_tuple.

"""Returns a quad tree first 4 nodes. If aux_data (boolean) provided expands
to 2 more bits or a specific pos/neg nodes.
Args:
aux_data: a boolean to use additional bit for data, e.g. pos/neg.

IMO aux_data sounds like an actual data object rather than a boolean parameter; I would prefer a name like "has_aux_data" or even "has_aux_bit" since a single bit is all that's supported here. This also goes for other usages of "aux_data" as a boolean in other functions, below.

Really it would be ideal to just generalize this to support an arbitrary number of extra bits with an automatic encoding from the value specified in "split", rather than a single extra bit with a predefined 'pos'-->1 and 'neg'-->0 encoding, but I understand that is probably out of scope at present.
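A sketch of what that generalization might look like (hypothetical helper, not part of the PR):

def encode_aux_bits(value, categories):
    # Map a categorical value to a fixed-width bit string.
    # categories is an ordered list of possible values, e.g. ['neg', 'pos'];
    # the current pos->'1' / neg->'0' scheme is the two-category case.
    width = max(1, (len(categories) - 1).bit_length())
    return format(categories.index(value), f'0{width}b')

assert encode_aux_bits('pos', ['neg', 'pos']) == '1'
assert encode_aux_bits('c', ['a', 'b', 'c', 'd']) == '10'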

@ebagdasa (Author): I agree, let me change it to has_aux_bit for now, and maybe we can expand it later

@@ -227,6 +225,7 @@ def run_experiment(true_image,
noiser = noise_class(dp_round_size, sensitivity, eps)
if ignore_start_eps and start_with_level <= i:
print_output('Ignoring eps spent', flag=output_flag)

Nit: this is a frightening message; it would be nice to have a bit of extra context here (e.g., "Ignoring epsilon spent expanding first {start_with_level} levels, including current level {i}.").

@ebagdasa (Author): done
