
Difference between anon function results and normal function results: anon function gives 0 whereas normal function result is a higher-magnitude value (nowhere close to 0) #17

Closed
AbhishekNalamothu opened this issue Feb 10, 2020 · 8 comments

Comments

@AbhishekNalamothu commented Feb 10, 2020

Anon function is giving 0, whereas the normal function result is a higher-magnitude value (for example: 20, 30, -25, -40).

Example: difference between anon function and normal function results
anon function: anon_F(D) -> 0
normal function: F(D) -> 20

Providing differentially private aggregated data with the above difference (as in the example) to an end user might mislead them during their analysis.

Is there a way to handle this?

@celiayz (Contributor) commented Feb 18, 2020

Hi Abhishek,

Differential privacy hides the contribution of any single user. If the original function and the anon function give dramatically different results, so different that analysis is misleading, then the original data did not contain enough contributing users to make anonymized analysis useful.

@AbhishekNalamothu (Author)

Thanks @celiayz.
How can I avoid this zero problem? Do you have any suggestions?

The modified number_of_carrots_eaten data and the following query on it reproduce this zero problem.

select d.animal_group,
       count(1),
       sum(case when count_carrots_eaten = 0 then 1 else 0 end) as zero_counts,
       ((1 - avg(d.count_carrots_eaten)) * 100) as Zero_percent,
       avg(d.count_carrots_eaten),
       sum(d.COUNT_CARROTS_EATEN) as carrots_eaten,
       anon_sum(d.COUNT_CARROTS_EATEN, 5) as anon_carrots_eaten
from animals_and_carrots_bin_new d
group by d.animal_group
order by carrots_eaten;

[image: results of the query above, grouped by animal_group]

In the animal data set, groups in which 70, 80, or 90 percent of the values are zero show the 0 problem.

From other experiments, I realized that the 0 problem depends on several factors: the number of contributing users and how many of those users have zeros in that group.

I would like to know how I can avoid this problem.

Once again thank you so much.

@celiayz (Contributor) commented Feb 20, 2020

The best way to avoid the problem is to add more data. You can also try increasing the value of epsilon and using manually specified bounds (e.g., ANON_SUM(column, lower, upper, epsilon)).
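
For example, on the carrots data above, the query could pass explicit bounds and an epsilon along these lines (a sketch only; the argument order follows the ANON_SUM(column, lower, upper, epsilon) form mentioned here, and the bounds 0 and 100 are placeholder values, not recommendations):

select d.animal_group,
       anon_sum(d.COUNT_CARROTS_EATEN, 0, 100, 5) as anon_carrots_eaten
from animals_and_carrots_bin_new d
group by d.animal_group;

With manually specified bounds, none of the privacy budget is spent on automatic bounds detection, which can help accuracy on small groups.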

@AbhishekNalamothu (Author)

Thanks @celiayz for your prompt response.

Suppose we have a larger dataset aggregated into 'n' groups, and 'm' of those groups have very few data points compared to the remaining (n-m) groups. Do we expect those 'm' groups to have a 0 value upon aggregation?

What if I do not have more data to add?
My use case requires not providing bounds; I want to use the approx_bounds provided by Google to detect the bounds automatically.
Also, I am afraid increasing epsilon may cause a security problem.

@celiayz (Contributor) commented Feb 20, 2020

Yes, for the 'm' groups that have fewer contributing users, we expect that the query could return null or 0 for those groups.

If there is no more data to add, then unfortunately the data set is too small to statistically hide the contribution of a single user, and differentially private analysis is probably not appropriate for that data set.

@AbhishekNalamothu (Author)

@celiayz, returning null would be fine for analysis, but returning 0 misleads the analysts. Is there a way to make it return null instead of 0?
Also, if there is an error, or not enough data to process, isn't it better to return an "error" rather than 0, since 0 is not an error value?

Thank you @celiayz

@celiayz (Contributor) commented Feb 20, 2020

I see. The fact that it returns 0 is likely an implementation detail of the noising + snapping mechanisms: when the value is close enough to 0, the answer gets snapped down to 0. Since the aggregation functions do not know that there isn't enough data, and there is no error, they return 0 instead of null. Therefore, I don't see any meaningful way to get the library to return null instead of 0.
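
If a purely cosmetic, query-side workaround is acceptable, the result could be wrapped so that an exact 0 is reported as null (a sketch only, not a library feature, and it cannot distinguish a genuine zero from a snapped one):

select d.animal_group,
       nullif(anon_sum(d.COUNT_CARROTS_EATEN, 5), 0) as anon_carrots_eaten
from animals_and_carrots_bin_new d
group by d.animal_group;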

@dibakch (Collaborator) commented Jul 14, 2021

Closing this for now. Feel free to re-open.

dibakch closed this as completed on Jul 14, 2021