Strawman: Target functionality for MVP #16
Comments
Thanks for this write-up @benjaminsavage. I hope it will help clarify the conversation tomorrow. The only thing I'd add is that along with counts and sums, we often also want the variance (of the value). This is particularly acute in the "conversion-lift" case, where we want to establish a statistically significant difference between the test and control groups (but more generally, all of these are often used for comparisons, which are more meaningful with a measure of variance). One small wrinkle here is that we may actually prefer a differentially private confidence interval, rather than a differentially private mean (e.g. sum) and variance, as it may be more efficient (in terms of the privacy budget). This latter point is probably beyond the scope of an MVP, but it is something else we may want to be able to support in the future.
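To make the count / sum / variance point concrete, here is one hedged sketch: release a noisy count, sum, and sum of squares, then derive mean and variance afterwards. The Laplace mechanism, the per-record clipping, and the naive three-way budget split are all illustrative choices of mine, not anything proposed in this thread:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_moments(values, value_cap, epsilon):
    """Release a noisy count, sum, and sum of squares.

    Mean and variance (e.g. for a test-vs-control lift comparison)
    can be derived from the three releases afterwards without
    spending any additional budget.
    """
    # Clamp each contribution so per-record sensitivity is bounded.
    clipped = [min(max(v, 0.0), value_cap) for v in values]
    # Split the budget across the three releases (basic composition).
    eps = epsilon / 3.0
    n = len(clipped) + laplace_noise(1.0 / eps)                # sensitivity 1
    s = sum(clipped) + laplace_noise(value_cap / eps)          # sensitivity value_cap
    s2 = sum(v * v for v in clipped) + laplace_noise(value_cap ** 2 / eps)
    mean = s / n
    variance = s2 / n - mean * mean
    return n, s, variance
```

As the comment notes, a directly released differentially private confidence interval may spend the budget more efficiently than deriving one from noisy moments like this.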
When you define "breakdown keys" you don't mention the site on which the source event occurred. Did you mean to make that one of the factors that might be used to determine a breakdown key? I have a few quibbles with your Mathy bit, but they are mostly presentational. Let's concentrate on what I think your intent is. In my slides for tomorrow, I had something like:
You have described concrete versions of both the per-conversion and aggregation functions that I think could (with appropriate values for …
Thanks @benjaminsavage for the strawman. I want to poke at one aspect: multiple breakdown keys per source event. This is listed as ideal functionality but isn't described in the detailed code / math. I would vote to add this. My goal is to achieve, in the MVP, some level of the sensitivity management that I discussed in the last meeting. This will improve the utility of the system as a whole and unlock new use-cases. With multiple breakdown keys per source, this is achievable with some changes to the underlying algorithm:
Obviously, the above can be optimized to avoid …

Note: there are other ways of achieving this goal, but I think this serves as a flexible basis, and is probably the simplest without making the core algorithm more complicated in how it does contribution capping. We can do more advanced things by varying …

I also want to make sure that privacy unit / grain discussions happen on the other issue and not here, despite it being embedded in the algorithm here.
@csharrison - this is an excellent suggestion. I completely agree with you. I think there is a very nice win here for API callers who would like to make N queries, on the same set of input data, and want to do N different breakdowns. I'd love to get that sweet, sweet advanced DP composition win =). I'll try to update the algorithm, if I can, to do what you've said.
Do you want different values for each breakdown? That is, if you are going for breakdown 1, you might be counting (low sensitivity), but breakdown 2 might be a sum over a different breakdown.
@martinthomson my suggestion was assuming a fixed …
Would we like to present these in an upcoming meeting?
I think in an upcoming meeting we should discuss the "multiple-breakdown" use-case. As @csharrison mentions in his comment above, "sensitivity management" is really important here to ensure good DP composition. But I'd like to file a separate issue about this topic. I'm happy to close this issue.
Target ads measurement use-cases for the MVP
Reporting use-cases for Ad Buyers
Counts, for all 3 types of source events
Sums, for all 3 types of source events
Multiple Breakdowns
Assuming source events are associated with multiple arbitrary breakdown keys
Cross-environment
For all of the above, support for cases where source events and trigger events originate from different environments:
Cross-device
For all of the above, support for cases where source events and trigger events originate from different devices. For example:
Cross-publisher
For all of the above, support for cases where the source events originate from multiple different apps / websites operated by different ad-sellers
Multi-touch attribution
For all of the above, support for multi-touch attribution. For MVP just support for "equal credit" on the "last N touches" where "N" is a query parameter.
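A minimal sketch of the "equal credit on the last N touches" rule, under the assumption that each attributed source event carries a timestamp and the trigger's value is split evenly across the last N of them (the function and field names here are mine, not from the document):

```python
def equal_credit_last_n(attributed_sources, trigger_value, n):
    """Split a trigger's value equally across the last n attributed sources.

    attributed_sources: list of (source_id, timestamp) tuples.
    n: the "last N touches" query parameter.
    Returns a map from source_id to its equal share of trigger_value.
    """
    # Keep only the n most recent touches (all of them if fewer than n).
    last_n = sorted(attributed_sources, key=lambda s: s[1])[-n:]
    share = trigger_value / len(last_n)
    return {source_id: share for source_id, _ in last_n}
```

For example, with three touches and N = 2, the two most recent touches each receive half of the trigger value and the oldest touch receives nothing.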
Aggregate use-cases for ad-sellers
NOT in the MVP
The following use-cases are not part of the MVP, but our solution should ideally be architecturally compatible with them.
Human language version
Inputs:
Outputs:
A sequence of Q histograms. Each histogram is a map from “breakdown keys” to noisy aggregate sums.
The "breakdown keys" appearing in the qth histogram (q = 0 to q = Q-1) are the unique values of "breakdown key" appearing in the qth place in the sequence of "breakdown keys" in the "source events".
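The shaping of the output described above can be sketched as follows, with noise omitted for clarity. Here each attributed contribution carries a length-Q sequence of breakdown keys (one per histogram), an assumption inferred from the text rather than spelled out in it:

```python
from collections import defaultdict

def breakdown_histograms(contributions):
    """Build Q histograms from attributed contributions.

    contributions: list of (breakdown_keys, value) pairs, where
    breakdown_keys is a length-Q sequence; the key in position q
    contributes value to the qth histogram.
    Returns a list of Q dicts mapping breakdown key -> aggregate sum
    (un-noised; the real output would add noise to each sum).
    """
    if not contributions:
        return []
    Q = len(contributions[0][0])
    histograms = [defaultdict(float) for _ in range(Q)]
    for keys, value in contributions:
        for q, key in enumerate(keys):
            histograms[q][key] += value
    return [dict(h) for h in histograms]
```

The keys of the qth returned dict are exactly the unique values appearing in the qth position across the inputs, matching the output description above.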
Ideal functionality:
Pseudocode
Mathy Version
Givens
Given n source events, i = 0 through i = n - 1, each comprised of three unsigned integer values:
Given m trigger events, j = 0 through j = m - 1, each comprised of three unsigned integer values:
Given an integer K, value between 1 and 5 (inclusive), indicating the maximum number of source events to which a particular trigger event should be attributed.
Given an integer M, indicating the maximum total value any particular user should be able to contribute to the output across all breakdown keys. (This value should be at least as large as the maximum trigger value.)
Given an integer P, value between 1 and 100 (inclusive), indicating the amount of privacy budget the query should consume.
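The givens above might be modeled as simple records. The field names below (match_key, timestamp, breakdown_key, value) are inferred from how the events are used in the Ideal Functionality section rather than stated explicitly in the givens:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceEvent:
    match_key: int      # ms_i: used to join with trigger events
    timestamp: int      # ts_i: used for "most recent" attribution
    breakdown_key: int  # bs_i: grouping key for the output histogram

@dataclass(frozen=True)
class TriggerEvent:
    match_key: int  # mt_j
    timestamp: int  # tt_j
    value: int      # v_j: the value to aggregate

@dataclass(frozen=True)
class QueryParams:
    K: int  # max source events attributed per trigger (1 to 5)
    M: int  # per-user total contribution cap across all breakdown keys
    P: int  # privacy budget to consume (1 to 100)
```

Making the records frozen is just a defensive choice for this sketch; nothing in the document requires immutability.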
Ideal Functionality
Let the set of unique breakdown keys be known as B
Let the set of attribution candidates C_j be equal to the set of source events i for which: ms_i == mt_j and ts_i < tt_j
Let the set of attributed source events As_j be the MIN(|C_j|, K) elements of C_j with the largest values of ts_i
Let the list of attributed breakdown keys B_j be the values of the breakdown keys from As_j
Let the set of attributed trigger events At be equal to the set of tuples (mt_j, v_j, B_j) for which As_j is non-empty
Let U be the set of unique match keys m in At
Let the set of capped events Q be a subset of the attributed trigger events such that each match key's total contribution is at most M. Put another way:
Given a match key m_k in U
Let the total contribution t_k of match key m_k be equal to the sum Σ v_j where m_j == m_k, across all the tuples (m_j, v_j, B_j) in Q
Then for all match keys m_k in U, it is true that t_k <= M
Let count_jb be the number of times that breakdown key b appears in the list B_j
Let the un-noised total total_b of breakdown key b be equal to the sum Σ (v_j / |B_j|) * count_jb across all the tuples (m_j, v_j, B_j) in Q
Let the noised total noised_total_b for each breakdown key b in B be equal to total_b + random_noise(M, P)
Return the set of tuples (b, noised_total_b) for each breakdown key b in B
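The ideal functionality above can be sketched end to end as follows. Two hedges: the document does not pin down the distribution behind random_noise(M, P), so Laplace noise with an M / (P / 100) scale is used purely as an illustrative stand-in; and since the spec only requires that *some* subset satisfying the cap be chosen, the capping step here is a simple greedy pass:

```python
import math
import random
from collections import defaultdict

def random_noise(M, P):
    # Stand-in for the document's random_noise(M, P): Laplace noise whose
    # scale grows with the contribution cap M and shrinks as more budget P
    # (read here as a percentage) is spent. This mapping is an assumption.
    scale = M / (P / 100.0)
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def ideal_functionality(sources, triggers, K, M, P):
    """sources: list of (match_key, timestamp, breakdown_key) tuples.
    triggers: list of (match_key, timestamp, value) tuples.
    Returns a dict mapping each breakdown key b to noised_total_b."""
    # Attribution: each trigger goes to up to K most recent earlier sources
    # that share its match key.
    attributed = []  # tuples (match_key, v_j, B_j)
    for (mt, tt, v) in triggers:
        candidates = [(ms, ts, b) for (ms, ts, b) in sources
                      if ms == mt and ts < tt]
        chosen = sorted(candidates, key=lambda c: c[1])[-K:]  # largest ts_i
        if chosen:
            attributed.append((mt, v, [b for (_, _, b) in chosen]))
    # Contribution capping (greedy): keep a trigger only while its match
    # key's running total stays within M.
    contributed = defaultdict(int)
    capped = []
    for (mk, v, Bj) in attributed:
        if contributed[mk] + v <= M:
            contributed[mk] += v
            capped.append((mk, v, Bj))
    # Per-breakdown totals with equal credit: each occurrence of b in B_j
    # contributes v_j / |B_j|, i.e. (v_j / |B_j|) * count_jb in total.
    totals = defaultdict(float)
    for (mk, v, Bj) in capped:
        for b in set(Bj):
            totals[b] += (v / len(Bj)) * Bj.count(b)
    # Noise every breakdown key in B, including those with zero total.
    B = {b for (_, _, b) in sources}
    return {b: totals[b] + random_noise(M, P) for b in B}
```

For instance, a single trigger of value 7 attributed to two sources with breakdown keys 10 and 20 contributes 3.5 to each key's total before noise is added.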