# Deterministic Assignment

Depending on the use case, a deterministic function might make sense. Its advantage is that it needs no database or network call before the variant can start rendering. That makes it a good fit for an assignment that has to happen before any network request completes, such as the moment a mobile app starts up.

Just like before, there are four things we should keep an eye out for:

* ① Given a user_id, return the string of the color to show
* ② The same user_id is assigned to the same color
* ③ Different user_ids are randomly assigned
* ④ The proportion of user_ids that see red and blue is roughly 50-50

In [1]:
import hashlib

from utils.simulate import n_different_users
from utils.simulate import same_user_n_times
from utils.spoilers import spoiler_hash_func
from utils.pretty import pp

## Outline

Here's the function that I'll use. I'll define the hash function later on.

In [2]:
def half_spoiler_choose_color_assignment(user_id):
    # I'll use a string with the experimental unit, and an experiment salt, "color".
    key = "{}|color".format(user_id)
    
    # Then I'll use a hash function. This one takes a string and returns an int.
    # Then I'll % 2 it to get a 0 or a 1
    assignment_i = spoiler_hash_func(key) % 2
    # And use that to index into an array of assignments
    return ['red', 'blue'][assignment_i]

This function takes care of ① for us.

### ① Given a user_id, return the string of the color to show ✅

The other functions in this notebook use the same outline, so I'll just show this once:

In [3]:
half_spoiler_choose_color_assignment(user_id=1)

'blue'

## Picking a `hash_func`

`hash_func` needs to help with the other three goals:

* ② The same user_id is assigned to the same color
* ③ Different user_ids are randomly assigned
* ④ The proportion of user_ids that see red and blue is roughly 50-50


### Bad hash function

Here's a bad version:

In [4]:
def bad_hash_func(input_str):
    return len(input_str)

# This is the same as the code before, but with a different name and hash function.
def bad_choose_color_assignment(user_id):
    key = "{}|color".format(user_id)    
    assignment_i = bad_hash_func(key) % 2
    return ['red', 'blue'][assignment_i]

###  ② The same user_id is assigned to the same color ✅

In [5]:
pp(
    same_user_n_times(bad_choose_color_assignment, n=10)
)

Unnamed: 0,user_id,color
0,1,blue
1,1,blue
2,1,blue
3,1,blue
4,1,blue
5,1,blue
6,1,blue
7,1,blue
8,1,blue
9,1,blue


### ③ Different `user_id`s are randomly assigned ❌

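When we look at different `user_id`s, the assignment doesn't look random at all.

Specifically, `user_id`s with the same number of digits all get the same color, because this hash depends only on the length of the key. This could lead to bias: if `user_id`s are assigned incrementally, the color would track how early a user signed up!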

In [6]:
pp(
    n_different_users(bad_choose_color_assignment, n=30)
)

Unnamed: 0,user_id,color
0,0,blue
1,1,blue
2,2,blue
3,3,blue
4,4,blue
5,5,blue
6,6,blue
7,7,blue
8,8,blue
9,9,blue


### ④ The proportion of `user_id`s that see red and blue is roughly 50-50 ❌

At this point, I'm already concerned about this function. On top of that, the split is nowhere near 50-50.

In [7]:
n_different_users(bad_choose_color_assignment, n=1000).groupby('color').count()

color,user_id
blue,910
red,90
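The lopsided split follows directly from the key lengths. Here is a minimal sketch, assuming `n_different_users` draws the ids 0 through 999 (the helper's internals aren't shown here, so that part is a guess). The two functions are copies of the ones defined above so the snippet runs on its own:

```python
from collections import Counter


def bad_hash_func(input_str):
    return len(input_str)


def bad_choose_color_assignment(user_id):
    key = "{}|color".format(user_id)
    assignment_i = bad_hash_func(key) % 2
    return ['red', 'blue'][assignment_i]


# "|color" is 6 characters, so the key length is (number of digits) + 6.
# An odd number of digits gives an odd length -> index 1 -> 'blue'.
# An even number of digits gives an even length -> index 0 -> 'red'.
counts = Counter(bad_choose_color_assignment(user_id) for user_id in range(1000))
print(counts)
```

Under that assumption, the 10 one-digit ids and the 900 three-digit ids land on blue, and the 90 two-digit ids land on red, which matches the 910/90 split above.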


## `md5`

I'm going to use `md5` for the hash function.

If I give it inputs that differ only slightly, it returns completely different numbers. That property means it can be used to make statistically random assignments (Kohavi p165).

In [8]:
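def hash_func(input_str):
    # md5 works on bytes, so encode the key first.
    hash_hex = hashlib.md5(input_str.encode('utf-8')).hexdigest()
    # Keep the first 15 hex digits (60 bits) and parse them as an int.
    return int(hash_hex[:15], 16)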

def choose_color_assignment(user_id):
    key = "{}|color".format(user_id)    
    assignment_i = hash_func(key) % 2
    return ['red', 'blue'][assignment_i]
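Here is a quick way to sanity-check this `hash_func`, sketched with a copy of the definition so it runs on its own. It leans on two facts: `hashlib.md5` returns the same digest in every process (unlike Python's built-in `hash()`, which is salted per process for strings by default), and 15 hex digits carry 60 bits, so the result fits in a signed 64-bit integer. Whether that is the reason for the `[:15]` is an assumption on my part.

```python
import hashlib


def hash_func(input_str):
    hash_hex = hashlib.md5(input_str.encode('utf-8')).hexdigest()
    return int(hash_hex[:15], 16)


# Same input, same output: this is what makes the assignment repeatable.
first = hash_func("1|color")
second = hash_func("1|color")
print(first == second)

# Keys that differ by a single character give unrelated numbers.
neighbor = hash_func("2|color")
print(first, neighbor)

# 15 hex digits carry 60 bits, so the value is always below 2**60.
print(first < 2**60)
```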

###  ② The same user_id is assigned to the same color  ✅

In [9]:
same_user_n_times(choose_color_assignment, n=100).groupby('color').count()

color,user_id
blue,100


### ③ Different user_ids are randomly assigned ✅ 

In [10]:
pp(
    n_different_users(choose_color_assignment, n=150)
)

Unnamed: 0,user_id,color
0,0,blue
1,1,blue
2,2,red
3,3,blue
4,4,blue
5,5,blue
6,6,red
7,7,blue
8,8,red
9,9,red


### ④ The proportion of user_ids that see red and blue is roughly 50-50 ✅

In [11]:
n_different_users(choose_color_assignment, n=10000).groupby('color').count()

color,user_id
blue,5004
red,4996
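To judge whether a split like this is close enough to 50-50, one option is to compare it with the spread a fair coin would produce. This is a rough sketch rather than part of the original notebook: it assumes the ids are 0 through 9999, and it uses the normal approximation to the binomial, where the standard deviation of the blue count is sqrt(n)/2.

```python
import hashlib
import math


def hash_func(input_str):
    hash_hex = hashlib.md5(input_str.encode('utf-8')).hexdigest()
    return int(hash_hex[:15], 16)


def choose_color_assignment(user_id):
    key = "{}|color".format(user_id)
    assignment_i = hash_func(key) % 2
    return ['red', 'blue'][assignment_i]


n = 10000
blue = sum(choose_color_assignment(user_id) == 'blue' for user_id in range(n))

# Under a fair 50-50 split, the blue count has mean n/2 and
# standard deviation sqrt(n * 0.5 * 0.5) = sqrt(n) / 2.
expected = n / 2
std_dev = math.sqrt(n) / 2
z_score = (blue - expected) / std_dev
print(blue, n - blue, round(z_score, 2))
```

For n = 10000 the standard deviation is 50, so the 5004/4996 split above sits well within one standard deviation of an even split. A |z| above roughly 3 would be a reason to look closer.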


# Summary

This showed how to implement the assignment function deterministically by using a hash function.

# [Next : 3. Scaling](3.Scaling.ipynb)

This is the version of assignment I'll use. Now I'll extend it to support more than one experiment on the same users.

# TOC
- **[0. Introduction](0.Introduction.ipynb)**: What a good `choose_color_assignment` function looks like.
- **[1. Experimental Units](1.ExperimentalUnits.ipynb)**: What happens when I don't pay attention to experimental units.
- **[2. Deterministic Assignment](2.DeterministicAssignment.ipynb)**: What it looks like to assign deterministically.
- **[3. Scaling](3.Scaling.ipynb)**: How not to run two experiments at the same time.
- **[4. Rollout](4.Rollout.ipynb)**: How to gradually show users a new experiment.