
Add bucket_id to nondesktop tables #27

Merged (2 commits into master, Mar 20, 2019)
Conversation

jklukas (Contributor) commented Mar 19, 2019

As discussed in the [Smoot Growth Dashboard Engineering Requirements](https://docs.google.com/document/d/1L8tWDUjccutGGAldhpypRtPCaw3kkXboPUTtTZb02OA/edit?usp=sharing):

> In order to support statistical inference (confidence intervals and hypothesis tests), we require the concept of an ID Bucket. We divide the space of user IDs (probably client_id in most cases) into 20 buckets with equal probability of assignment into each bucket. It is very important that this assignment is orthogonal to anything else that we care about - for example, it would be a problem if newer profiles were more likely to be assigned to particular buckets, or if profiles in certain experiment arms always ended up in particular buckets.

@jklukas jklukas requested a review from relud March 19, 2019 18:06
```sql
-- We hash client_ids into 20 buckets to aid in computing
-- confidence intervals for mau/wau/dau sums; the particular hash
-- function and number of buckets is subject to change in the future.
MOD(ABS(FARM_FINGERPRINT(client_id)), 20) AS id_bucket,
```
jklukas (Contributor, author) commented on this line:
I'm using FARM_FINGERPRINT because it's natively available in BQ, produces an integer that can be fed directly to MOD, and appears to be well tested to produce evenly distributed output. CRC32 is not available in BQ without resorting to a JS UDF, but we will likely add support for a CRC32-based sample_id as part of GCP pipeline metadata in the coming months, so that will be an option down the line.

@jklukas jklukas merged commit 3a93272 into master Mar 20, 2019
@jklukas jklukas deleted the id-bucket branch March 20, 2019 01:29
jmccrosky (Contributor) commented:
@tdsmith had some concerns around crc32 and suggested md5 as a substitute. In the short term, I don't really care, as md5 seems to work, but I'd like to avoid locking ourselves into it.

Before we lock into anything, ideally we should do some validation to ensure that it is orthogonal to anything we care about.

tdsmith (Contributor) commented Mar 20, 2019

To color that a little more: in general, it's a concern to use non-cryptographic hashes modulo some integer for bucketing without validation. Non-cryptographic hashes lack the avalanche property, which promises that any bit-flip in the input produces an unpredictable change in all bits of the output. That means that checksums and fingerprints like crc32 and fingerprint64 are allowed to have dependencies between input bits and output bits. Since there's a lot of fixed structure in the input (a fixed length of ASCII text with dashes in fixed locations), and using the modulus operator means that we only consider the bottom few bits of the result, there's potential for non-uniform sampling. We should at least simulate a few million random UUIDs and make sure that this produces buckets of uniform size.
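That simulation can be sketched in Python. This is only a sketch: BigQuery's FARM_FINGERPRINT has no Python stdlib equivalent, so it checks the MD5-based bucketing instead, and uses 100,000 IDs rather than millions to keep it fast.

```python
# Sketch: generate random UUIDs (same fixed-structure ASCII shape as a
# client_id), bucket them via MD5 mod 20, and measure how uneven the
# bucket sizes come out.
import hashlib
import uuid
from collections import Counter

N_IDS = 100_000
N_BUCKETS = 20

counts = Counter()
for _ in range(N_IDS):
    client_id = str(uuid.uuid4())  # e.g. "3f2504e0-4f89-11d3-9a0c-0305e82c3301"
    digest = hashlib.md5(client_id.encode("ascii")).digest()
    counts[int.from_bytes(digest, "big") % N_BUCKETS] += 1

mean = N_IDS / N_BUCKETS
worst = max(abs(c - mean) for c in counts.values())
print(f"{len(counts)} buckets, mean {mean:.0f}, worst deviation {worst:.0f}")
```

For a uniform 20-way split of 100,000 IDs, the binomial stddev per bucket is about 69, so worst-case deviations of a few hundred would already be suspicious.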

Page 11 of this paper notes that there's significant bias in fingerprint64: https://arxiv.org/pdf/1612.06257.pdf

tdsmith (Contributor) commented Mar 20, 2019

(crc32 % 100 is, empirically, fine; I suggested md5 just because it's a fast cryptographic hash. It's busted for real crypto, but it's still avalanchey.)

jklukas (Contributor, author) commented Mar 20, 2019

@tdsmith thanks for the reference here. Here's a potential solution to adopt MD5:

```sql
MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)
```

That uses MD5 for the avalanche property, then passes the result through FARM_FINGERPRINT to produce an integer that can finally be mod'ed.

jklukas (Contributor, author) commented Mar 20, 2019

I slapped together a query in Redash to generate 1 million UUIDs and evaluate bucket size.

I compared the following two hashing methods:

```sql
MOD(ABS(FARM_FINGERPRINT(client_id)), 20)
MOD(ABS(FARM_FINGERPRINT(MD5(client_id))), 20)
```

Both resulted in a stddev of ~200 for buckets of mean size 50,000. The version involving MD5 perhaps trended a bit lower, in the 150 to 200 range, but this is not conclusive.
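The observed spread is consistent with what a perfectly uniform hash would give: each bucket count for N IDs and k equal buckets is Binomial(N, 1/k), so the expected standard deviation can be computed directly.

```python
# Expected spread of bucket sizes under a perfectly uniform 20-way split
# of 1 million IDs: each bucket count is Binomial(N, 1/20), so the
# standard deviation is sqrt(N * p * (1 - p)).
import math

N = 1_000_000
K = 20
p = 1 / K

expected_mean = N * p
expected_std = math.sqrt(N * p * (1 - p))
print(f"mean: {expected_mean:.0f}, stddev: {expected_std:.1f}")
# stddev comes out around 218, in line with the observed ~200
```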

jmccrosky (Contributor) commented Mar 20, 2019 via email

tdsmith (Contributor) commented Mar 20, 2019

Nice! Looking at a plot of counts per bucket for 20 or 100 buckets also looks nice and flat either way; I think it would be pretty obvious if this were a real problem.

I agree that fingerprint(md5(id)) addresses this concern completely, since the md5 function both avalanches and emits bytes (instead of an ASCII string). Incorporating the md5 would be the conservative choice if we were concerned about an unwise choice of divisors in the future, but tbh I don't see a reason to employ it here.

> recently generated (higher?) IDs

client_ids should always be random, so I think we're fine here.

> if we do treatment assignment however shield does

Normandy recipes are written to use the Normandy user_id for assignment, which is a distinct random UUID created separately from the client_id: https://searchfox.org/mozilla-central/source/toolkit/components/normandy/lib/ClientEnvironment.jsm#67

So there should be no way for those to interact.

relud (Collaborator) commented Mar 21, 2019

Not useful, but we can also modulo md5 directly, although it's ugly and a bit harder because we need to avoid overflowing a signed INT64:

```sql
CREATE TEMP FUNCTION
  udf_bucket(x ANY TYPE,
    -- y has to be < 256
    y INT64) AS ((
    SELECT
        MOD(SUM(MOD(_c * CAST(POW(_m, _o2) AS INT64), y)), y)
    FROM (
      SELECT
        * REPLACE(MOD(MOD(_c, y) * MOD(CAST(POW(_m, _o1) AS INT64), y), y) AS _c)
      FROM (
        SELECT
          MOD(256, y) AS _m,
          _c,
          15 - _o AS _o,
          if(_o < 8, 8, 15 - _o) AS _o1,
          if(_o < 7, 7 - _o, 0) AS _o2
        FROM UNNEST(TO_CODE_POINTS(MD5(x))) AS _c
        WITH OFFSET _o
      )
    )
));
```
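The UDF above is performing a base-256 modular reduction of the MD5 digest. A Python sketch (hypothetical helper names, not part of the pipeline) shows that the same bucket falls out of interpreting the digest directly as a big integer, and that a Horner-style byte-by-byte reduction, which is what keeps the SQL version inside INT64 range, agrees with it. Python's arbitrary-precision integers mean neither version needs the y < 256 restriction.

```python
import hashlib

def md5_bucket(x: str, y: int) -> int:
    """Interpret the 16-byte MD5 digest as a big-endian integer, mod y."""
    digest = hashlib.md5(x.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % y

def md5_bucket_horner(x: str, y: int) -> int:
    """Byte-by-byte reduction mirroring the UDF's arithmetic: fold each
    digest byte in with Horner's rule, keeping the accumulator reduced
    mod y throughout so it never overflows a fixed-width integer."""
    acc = 0
    for byte in hashlib.md5(x.encode("utf-8")).digest():
        acc = (acc * 256 + byte) % y
    return acc

print(md5_bucket("some-client-id", 20), md5_bucket_horner("some-client-id", 20))
```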

jmccrosky (Contributor) commented Mar 21, 2019 via email
