Add bucket_id to nondesktop tables #27
Conversation
-- We hash client_ids into 20 buckets to aid in computing
-- confidence intervals for mau/wau/dau sums; the particular hash
-- function and number of buckets is subject to change in the future.
MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
I'm using FARM_FINGERPRINT because it's natively available in BQ, produces an integer that can be readily fed to MOD, and appears to be well tested to produce evenly distributed output. CRC32 is not available in BQ without resorting to a JS UDF, but we will likely add support for CRC32-based sample_id as part of GCP pipeline metadata in coming months, so this will be an option down the line.
As discussed in the [Smoot Growth Dashboard Engineering Requirements](https://docs.google.com/document/d/1L8tWDUjccutGGAldhpypRtPCaw3kkXboPUTtTZb02OA/edit?usp=sharing):

> In order to support statistical inference (confidence intervals and hypothesis tests), we require the concept of an ID Bucket. We divide the space of user IDs (probably client_id in most cases) into 20 buckets with equal probability of assignment into each bucket. It is very important that this assignment is orthogonal to anything else that we care about; for example, it would be a problem if newer profiles were more likely to be assigned to particular buckets, or if profiles in certain experiment arms always ended up in particular buckets.
@tdsmith had some concerns around crc32 and suggested md5 as a substitute. In the short term, I don't really care, as md5 seems to work, but I'd like to avoid locking ourselves into it. Before we lock into anything, ideally we should do some validation to ensure that it is orthogonal to anything we care about.
to color that a little more -- in general it's a concern to use non-cryptographic hashes modulo some integer for bucketing without validation. Non-cryptographic hashes lack the avalanche property, which promises that any bit-flip in the input has an unpredictable effect on all bits of the output. That means that checksums and fingerprints like crc32 and fingerprint64 are allowed to have dependencies between input bits and output bits. Since there's a lot of fixed structure in the input (a fixed length of ASCII text with dashes in fixed locations), and since using the modulus operator means we only consider the bottom few bits of the result, there's potential for non-uniform sampling. We should at least simulate a few million random UUIDs and make sure that this produces buckets of uniform size. Page 11 of https://arxiv.org/pdf/1612.06257.pdf notes that there's significant bias in fingerprint64.
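The suggested simulation can be sketched directly in BigQuery standard SQL. This is only a sketch of the idea; the use of `GENERATE_UUID()` and the exact row count are my choices for illustration, not something prescribed in the thread:

```sql
-- Sketch: generate random UUIDs, bucket them with the proposed expression,
-- and summarize how uniform the resulting bucket sizes are.
WITH ids AS (
  SELECT GENERATE_UUID() AS client_id
  FROM UNNEST(GENERATE_ARRAY(1, 1000000))
),
buckets AS (
  SELECT
    MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
    COUNT(*) AS n
  FROM ids
  GROUP BY id_bucket
)
-- For 20 uniform buckets over 1M rows, expect a mean near 50,000
-- and a small standard deviation relative to that mean.
SELECT MIN(n) AS min_n, MAX(n) AS max_n, AVG(n) AS mean_n, STDDEV(n) AS stddev_n
FROM buckets;
```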
(crc32 % 100 is, empirically, fine; I suggested md5 just because it's a fast cryptographic hash. it's busted for real crypto but it's still avalanchey.)
@tdsmith thanks for the reference here. Here's a potential solution to adopt MD5: `MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)`

That uses MD5 for the avalanche property, then passes the result through Fingerprint64 to produce an integer that can finally be mod'ed.
I slapped together [a query in Redash](https://sql.telemetry.mozilla.org/queries/61959/source) to generate 1 million UUIDs and evaluate bucket size. I compared the following two hashing methods:

- `MOD(ABS(FARM_FINGERPRINT(client_id)), 20)`
- `MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)`

Both resulted in a stddev of ~200 for buckets of mean size 50,000. The version involving MD5 perhaps trended a bit lower, in the 150 to 200 range, but this is not conclusive.
I think a proper evaluation would include making sure that more recently generated (higher?) IDs are not biased towards certain buckets, and also that, if we do treatment assignment however Shield does, profiles in a particular arm aren't preferentially assigned to certain buckets.
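The recency concern above could be probed with something like the following sketch, assuming a hypothetical table `profiles` containing `client_id` and a `first_seen_date` column (both the table and column names are invented for illustration):

```sql
-- Sketch: per-day share of each bucket; for an assignment orthogonal to
-- profile age, these shares should hover near 1/20 regardless of date.
SELECT
  first_seen_date,
  MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
  COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY first_seen_date) AS bucket_share
FROM profiles
GROUP BY first_seen_date, id_bucket;
```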
nice! a plot of counts per bucket for 20 or 100 buckets also looks nice and flat either way; i think it would be pretty obvious if this was a real problem. i agree that fingerprint(md5(id)) addresses this concern completely, since the md5 function both avalanches and emits bytes (instead of an ascii string). incorporating the md5 would be the conservative choice if we were concerned about an unwise choice of divisors in the future, but tbh i don't see a reason to employ it here.
client_ids should always be random, so I think we're fine here.
Normandy recipes are written to use the Normandy user_id for assignment, which is a distinct random UUID created separately from the client_id: https://searchfox.org/mozilla-central/source/toolkit/components/normandy/lib/ClientEnvironment.jsm#67. So there should be no way for those to interact.
not useful, but we can also take the modulo of md5 directly, although it's ugly and a bit harder since we have to avoid overflowing a signed int64:

CREATE TEMP FUNCTION
  -- Interpret the 16 bytes of MD5(x) as a big-endian base-256 integer and
  -- reduce it mod y, applying MOD at each step so intermediate values stay small.
  udf_bucket(x ANY TYPE,
    -- y has to be < 256
    y INT64) AS ((
  SELECT
    MOD(SUM(MOD(_c * CAST(POW(_m, _o2) AS INT64), y)), y)
  FROM (
    SELECT
      * REPLACE(MOD(MOD(_c, y) * MOD(CAST(POW(_m, _o1) AS INT64), y), y) AS _c)
    FROM (
      SELECT
        MOD(256, y) AS _m,
        _c,
        15 - _o AS _o,
        -- Split the exponent 15 - _o into _o1 + _o2 so each POW stays representable.
        IF(_o < 8, 8, 15 - _o) AS _o1,
        IF(_o < 7, 7 - _o, 0) AS _o2
      FROM
        UNNEST(TO_CODE_POINTS(MD5(x))) AS _c
      WITH OFFSET _o
    )
  )
));
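For illustration only (the author calls this approach "not useful"), the temp function above could be invoked like so; the literal UUID here is a made-up example value:

```sql
-- Hypothetical usage: assign a single made-up client_id to one of 20 buckets.
SELECT udf_bucket('5a8e4c2f-0d7b-4a21-9c1e-3f6d8b2a7e90', 20) AS id_bucket;
```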
Cool. Thanks all! Tim, sounds like all my concerns are addressed. Generally, any reason not to just use fingerprint(md5(id))? Anyway, from here I'm happy to leave the decision-making up to the engineers.