Add bucket_id to nondesktop tables #27
Conversation
-- We hash client_ids into 20 buckets to aid in computing
-- confidence intervals for mau/wau/dau sums; the particular hash
-- function and number of buckets is subject to change in the future.
MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
I'm using FARM_FINGERPRINT because it's natively available in BQ, produces an integer that can be readily fed to MOD, and appears to be well tested to produce evenly distributed output. CRC32 is not available in BQ without resorting to a JS UDF, but we will likely add support for CRC32-based sample_id as part of GCP pipeline metadata in coming months, so this will be an option down the line.
As discussed in the [Smoot Growth Dashboard Engineering Requirements](https://docs.google.com/document/d/1L8tWDUjccutGGAldhpypRtPCaw3kkXboPUTtTZb02OA/edit?usp=sharing):

> In order to support statistical inference (confidence intervals and hypothesis tests), we require the concept of an ID Bucket. We divide the space of user IDs (probably client_id in most cases) into 20 buckets with equal probability of assignment into each bucket. It is very important that this assignment is orthogonal to anything else that we care about; for example, it would be a problem if newer profiles were more likely to be assigned to particular buckets, or if profiles in certain experiment arms always ended up in particular buckets.
@tdsmith had some concerns around crc32 and suggested md5 as a substitute. In the short term, I don't really care, as md5 seems to work, but I'd like to avoid locking ourselves into it. Before we lock into anything, ideally we should do some validation to ensure that it is orthogonal to anything we care about.
to color that a little more -- in general it's a concern to use non-cryptographic hashes modulo some integer for bucketing without validation. Non-cryptographic hashes lack the avalanche property, which promises that any bit-flip in the input has an unpredictable effect on all bits of the output. That means that checksums and fingerprints like crc32 and fingerprint64 are allowed to have dependencies between input bits and output bits. Since there's a lot of fixed structure in the input (a fixed length of ASCII text with dashes in fixed locations), and since using the modulus operator means we only consider the bottom few bits of the result, there's potential for non-uniform sampling. We should at least simulate a few million random UUIDs and make sure that this produces buckets of uniform size. Page 11 of https://arxiv.org/pdf/1612.06257.pdf notes that there's significant bias in fingerprint64.
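The suggested simulation can be sketched directly in BigQuery standard SQL. This is only a sketch of the idea; the use of `GENERATE_UUID()` and the exact row count are my choices for illustration, not something prescribed in the thread:

```sql
-- Sketch: generate random UUIDs, bucket them with the proposed expression,
-- and summarize how uniform the resulting bucket sizes are.
WITH ids AS (
  SELECT GENERATE_UUID() AS client_id
  FROM UNNEST(GENERATE_ARRAY(1, 1000000))
),
buckets AS (
  SELECT
    MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
    COUNT(*) AS n
  FROM ids
  GROUP BY id_bucket
)
-- For 20 uniform buckets over 1M rows, expect a mean near 50,000
-- and a small standard deviation relative to that mean.
SELECT MIN(n) AS min_n, MAX(n) AS max_n, AVG(n) AS mean_n, STDDEV(n) AS stddev_n
FROM buckets;
```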
(crc32 % 100 is, empirically, fine; I suggested md5 just because it's a fast cryptographic hash. it's busted for real crypto but it's still avalanchey.)
@tdsmith thanks for the reference here. Here's a potential solution to adopt MD5: `MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)`

That uses MD5 for the avalanche property, then passes the result through Fingerprint64 to produce an integer that can finally be mod'ed.
I slapped together [a query in Redash](https://sql.telemetry.mozilla.org/queries/61959/source) to generate 1 million UUIDs and evaluate bucket size. I compared the following two hashing methods:

- `MOD(ABS(FARM_FINGERPRINT(client_id)), 20)`
- `MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)`

Both resulted in a stddev of ~200 for buckets of mean size 50,000. The version involving MD5 perhaps trended a bit lower, in the 150 to 200 range, but this is not conclusive.
I think a proper evaluation would include making sure that more recently generated (higher?) IDs are not biased towards certain buckets, and also that, if we do treatment assignment however Shield does, profiles in a particular arm aren't preferentially assigned to certain buckets.
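The recency concern above could be probed with something like the following sketch, assuming a hypothetical table `profiles` containing `client_id` and a `first_seen_date` column (both the table and column names are invented for illustration):

```sql
-- Sketch: per-day share of each bucket; for an assignment orthogonal to
-- profile age, these shares should hover near 1/20 regardless of date.
SELECT
  first_seen_date,
  MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
  COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY first_seen_date) AS bucket_share
FROM profiles
GROUP BY first_seen_date, id_bucket;
```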
nice! a plot of counts per bucket for 20 or 100 buckets also looks nice and flat either way; i think it would be pretty obvious if this was a real problem. i agree that fingerprint(md5(id)) addresses this concern completely, since the md5 function both avalanches and emits bytes (instead of an ascii string). incorporating the md5 would be the conservative choice if we were concerned about an unwise choice of divisors in the future, but tbh i don't see a reason to employ it here.
client_ids should always be random, so I think we're fine here.
Normandy recipes are written to use the Normandy user_id for assignment, which is a distinct random UUID created separately from the client_id: https://searchfox.org/mozilla-central/source/toolkit/components/normandy/lib/ClientEnvironment.jsm#67. So there should be no way for those to interact.
not useful, but we can also take the modulo of md5 directly, although it's ugly and a bit harder since we have to avoid overflowing a signed int64:

CREATE TEMP FUNCTION
  -- Interpret the 16 bytes of MD5(x) as a big-endian base-256 integer and
  -- reduce it mod y, applying MOD at each step so intermediate values stay small.
  udf_bucket(x ANY TYPE,
    -- y has to be < 256
    y INT64) AS ((
  SELECT
    MOD(SUM(MOD(_c * CAST(POW(_m, _o2) AS INT64), y)), y)
  FROM (
    SELECT
      * REPLACE(MOD(MOD(_c, y) * MOD(CAST(POW(_m, _o1) AS INT64), y), y) AS _c)
    FROM (
      SELECT
        MOD(256, y) AS _m,
        _c,
        15 - _o AS _o,
        -- Split the exponent 15 - _o into _o1 + _o2 so each POW stays representable.
        IF(_o < 8, 8, 15 - _o) AS _o1,
        IF(_o < 7, 7 - _o, 0) AS _o2
      FROM
        UNNEST(TO_CODE_POINTS(MD5(x))) AS _c
      WITH OFFSET _o
    )
  )
));
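For illustration only (the author calls this approach "not useful"), the temp function above could be invoked like so; the literal UUID here is a made-up example value:

```sql
-- Hypothetical usage: assign a single made-up client_id to one of 20 buckets.
SELECT udf_bucket('5a8e4c2f-0d7b-4a21-9c1e-3f6d8b2a7e90', 20) AS id_bucket;
```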
Cool. Thanks all! Tim, sounds like all my concerns are addressed. Generally, any reason not to just use fingerprint(md5(id))? Anyway, from here I'm happy to leave the decision-making up to the engineers.