the first approach to causal inference: the randomized experiment (RCT / AB test)

reminder:

correlation (more generally, *association*) is not causation

...unless...maybe.. it is.

we have a treatment/intervention $T \in \{0,1\}$

and we have an outcome variable $Y$

what we can measure directly is the association, the difference in average outcome between the two groups:

`avg(Y | T = 1) - avg(Y | T = 0)`

for now, think of this association as treatment effect + bias

treatment effect = "the difference between Y when the treatment is given vs when it is not given"

bias = "the difference in Y between the people who got the treatment and the people who didn't, in the hypothetical case where NO ONE got the treatment" - ARE THE TWO GROUPS COMPARABLE?

the nice thing about AB tests is that they are designed to eliminate bias

how?

bias can be eliminated through randomization.
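
A minimal simulation of why randomization removes bias. Everything here is made up for illustration: the `engagement` variable, the `true_effect` of 2.0, and the noise levels are assumptions, not values from the cookie_cats data. When more engaged players select themselves into the treatment, the difference in averages mixes the treatment effect with the pre-existing gap; when a coin flip decides, the gap averages out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical pre-experiment engagement; it drives the outcome on its own.
engagement = rng.normal(loc=50, scale=10, size=n)
true_effect = 2.0  # assumed treatment effect, chosen for illustration


def diff_in_means(treated):
    """avg(Y | T = 1) - avg(Y | T = 0) for a given 0/1 assignment."""
    y = engagement + true_effect * treated + rng.normal(0, 5, size=n)
    return y[treated == 1].mean() - y[treated == 0].mean()


# Self-selection: more engaged players are more likely to end up treated.
self_selected = (engagement + rng.normal(0, 10, size=n) > 50).astype(int)

# Randomization: a coin flip that ignores engagement entirely.
randomized = rng.integers(0, 2, size=n)

naive_estimate = diff_in_means(self_selected)
rct_estimate = diff_in_means(randomized)

print(f"self-selected groups: {naive_estimate:.2f}")
print(f"randomized groups:    {rct_estimate:.2f}")
```

In this sketch the randomized estimate should land close to the assumed effect, while the self-selected estimate is inflated by the engagement gap between the groups.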


In [1]:
import pandas as pd
cats = pd.read_csv('cookie_cats.csv')
cats.head()

Unnamed: 0,userid,version,sum_gamerounds,retention_1,retention_7
0,116,gate_30,3,False,False
1,337,gate_30,38,True,False
2,377,gate_40,165,True,False
3,483,gate_40,1,False,False
4,488,gate_40,179,True,True


In [3]:
# randomization - we have to take it for granted here.
# what data would we need to do the random assignment ourselves?
# we want to make sure that "on average" the players in group A and B 
# had the same level of engagement before the experiment started

# 3 outcome variables
# sum_gamerounds: numeric, total number of game rounds played in the first 14 days after the install
# retention_1: did the player come back one day after installing? (True/False)
# retention_7: did the player come back seven days after installing? (True/False)

# statistics / hypothesis testing

# comparing sum of game rounds between group A and B.

# the null hypothesis for most hypothesis tests is that there is no difference between the groups

# what hypothesis test should we use?
# we have a numeric variable and 2 groups/samples, and we are assuming the samples are independent
# -> t-test, specifically the two-sample unpaired (independent) t-test

# null hypothesis: mean(sum of game rounds for group A) - mean(sum of game rounds for group B) = 0
# alternative hypothesis: not the null hypothesis, i.e.
# mean(sum of game rounds for group A) - mean(sum of game rounds for group B) != 0

gate_30 = cats.loc[cats['version'] == "gate_30","sum_gamerounds"] # group A
gate_40 = cats.loc[cats['version'] == "gate_40","sum_gamerounds"] # group B

# "we put the gate at level 40 instead of level 30 - does it make a difference in total gamerounds played"


In [4]:

from scipy.stats import ttest_ind

ttest_ind(gate_30,gate_40)

TtestResult(statistic=0.8910426211362967, pvalue=0.37290868247405207, df=90187.0)
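
A side note on the test itself: `df=90187` is the row count minus 2, which is what `ttest_ind` reports when it pools the two group variances (its default). If we don't want to assume equal variances, Welch's t-test is the usual alternative, available through `equal_var=False`. A small sketch on synthetic, right-skewed data (stand-ins invented for this example, not the real `sum_gamerounds` values):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic groups with different sizes and spreads (assumed values).
group_a = rng.exponential(scale=60.0, size=3000)
group_b = rng.exponential(scale=50.0, size=6000)

pooled = ttest_ind(group_a, group_b)                  # default: equal variances assumed
welch = ttest_ind(group_a, group_b, equal_var=False)  # Welch: no equal-variance assumption

print(f"pooled: t={pooled.statistic:.3f}, p={pooled.pvalue:.4f}")
print(f"welch:  t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
```

When the two groups have nearly the same size, as in this experiment, the two versions tend to give very similar answers, so this is more of a habit worth knowing than a correction.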

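pvalue - what does it mean?

careful: the pvalue is NOT the probability of a type 1 error. the type 1 error rate is the significance level (alpha), the threshold we pick before running the test (commonly 0.05)

type 1 error - "rejecting the null hypothesis when you should NOT have"
type 1 error - false positive

pvalue = "in a world where there's no difference between group A and B, we would see a result at least as extreme as this one pvalue percentage of the time"

here pvalue = 0.37, so a difference this large would show up about 37% of the time even with no real effect. we fail to reject the null hypothesis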
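
One way to make the "world where there's no difference" concrete is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. This is a sketch on synthetic data (both groups are drawn from the same made-up distribution), and it is a different procedure from the t-test used in the notebook, shown only to illustrate the definition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic groups drawn from the same distribution, so the null is true here.
group_a = rng.exponential(scale=50.0, size=2000)
group_b = rng.exponential(scale=50.0, size=2000)
observed = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_shuffles = 2000
null_diffs = np.empty(n_shuffles)
for i in range(n_shuffles):
    # Relabel players at random: one draw from the "no difference" world.
    shuffled = rng.permutation(pooled)
    null_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Share of shuffled worlds with a difference at least as extreme as the observed one.
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(f"observed difference: {observed:.2f}, permutation p-value: {p_value:.3f}")
```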



we did this experiment and there are 3 possible explanations for the data

- all the differences we see are due to the systematic effect of the treatment

this is very unlikely AND it would be immediately obvious

`sum_gamerounds_A = (100, 100, 100, 100, 100, ...)`

`sum_gamerounds_B = (120, 120, 120, 120, 120, ...)`

- all the differences we see are due to randomness/noise <- we are testing this when we look at the pvalue
- some combination of systematic effect and randomness
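
To see the "pure noise" explanation in action, we can simulate many A/A experiments, where both groups come from the same distribution, and count how often the t-test still says "significant". The distribution, group size, and number of experiments below are arbitrary choices for illustration. With a threshold of `alpha = 0.05`, roughly 5% of these no-effect experiments should come out as false positives (type 1 errors).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
alpha = 0.05          # significance level chosen before testing
n_experiments = 2000  # number of simulated A/A experiments (assumed)

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution: any difference is noise.
    a = rng.normal(100, 20, size=200)
    b = rng.normal(100, 20, size=200)
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"share of A/A tests with p < {alpha}: {false_positive_rate:.3f}")
```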

In [5]:
retention_gate_30 = cats.loc[cats['version'] == "gate_30","retention_7"] # group A
retention_gate_40 = cats.loc[cats['version'] == "gate_40","retention_7"] # group B

# number of gate_30 players who came back on day 7
retention_gate_30.sum()

8502

In [8]:
# two-sample z-test for proportions
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# count: players retained at day 7 in each group; nobs: size of each group
count = np.array([retention_gate_30.sum(), retention_gate_40.sum()])
nobs = np.array([retention_gate_30.shape[0], retention_gate_40.shape[0]])
stat, pval = proportions_ztest(count, nobs)  # null hypothesis: equal retention rates
print('{0:0.3f}'.format(pval))


0.002


In [10]:
cats.groupby('version')['retention_7'].agg(['sum','mean'])

version,sum,mean
gate_30,8502,0.190201
gate_40,8279,0.182
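
The z-test for two proportions can be reproduced by hand from the table above. The retained counts (8502 and 8279) come straight from the output; the group sizes 44700 and 45489 are not printed anywhere in the notebook, so they are an assumption here, inferred from sum / mean and the 90189 total rows. The sketch uses the pooled-proportion standard error, the standard form of this test.

```python
import math

retained = (8502, 8279)  # gate_30, gate_40: from the groupby output
n_obs = (44700, 45489)   # assumed group sizes, inferred from sum / mean

p_a = retained[0] / n_obs[0]
p_b = retained[1] / n_obs[1]

# Under the null both groups share one retention rate, so pool them.
p_pool = sum(retained) / sum(n_obs)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_obs[0] + 1 / n_obs[1]))

z = (p_a - p_b) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

print(f"difference in retention: {p_a - p_b:.4f}")
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

Rounded to three decimals, the p-value agrees with the 0.002 from `proportions_ztest`: the drop in 7-day retention of about 0.8 percentage points for gate_40 is unlikely to be noise alone.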


In [27]:
# there's always more than one way to answer the same question

# sum_gamerounds (numeric) ~ version (categorical)
# we used a t-test for this
# are there other tools?
# yes, linear regression: sum_gamerounds = beta0 + beta_version * version_binary, with version_binary in {0, 1}

import statsmodels.api as sm

cats['version_binary'] = np.where(cats['version'] == 'gate_30',1,0)  # 1 = gate_30, 0 = gate_40
catsX = cats[['version_binary']]
x = sm.add_constant(catsX)
y = cats['sum_gamerounds']

mod = sm.OLS(y,x)

res = mod.fit()
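
Why regression answers the same question as the t-test: with a single 0/1 regressor, the OLS slope equals the difference in group means, and its classical t-statistic equals the pooled two-sample t-statistic. A sketch on synthetic data (the numbers are invented, and plain NumPy is used in place of statsmodels so that every step is visible):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)

# Synthetic stand-in for the notebook's data (not the real cookie_cats values).
version_binary = rng.integers(0, 2, size=5000)
y = 50 + 3 * version_binary + rng.normal(0, 20, size=5000)

# Design matrix with an intercept column, playing the role of sm.add_constant.
X = np.column_stack([np.ones_like(y), version_binary])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard error and t-statistic for the slope.
residuals = y - X @ beta
sigma2 = residuals @ residuals / (len(y) - 2)
cov = sigma2 * np.linalg.inv(X.T @ X)
t_ols = beta[1] / np.sqrt(cov[1, 1])

# The same comparison done as a two-sample t-test (pooled variance).
group_1 = y[version_binary == 1]
group_0 = y[version_binary == 0]
diff_in_means = group_1.mean() - group_0.mean()
t_test = ttest_ind(group_1, group_0).statistic

print(f"OLS slope: {beta[1]:.4f}, difference in means: {diff_in_means:.4f}")
print(f"OLS t: {t_ols:.4f}, t-test t: {t_test:.4f}")
```

In the notebook's own model, the matching numbers would be the `version_binary` entries of `res.params` and `res.tvalues`.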


In [28]:
y

0          3
1         38
2        165
3          1
4        179
        ... 
90184     97
90185     30
90186     28
90187     51
90188     16
Name: sum_gamerounds, Length: 90189, dtype: int64

In [24]:
from sklearn.linear_model import LinearRegression

mod = LinearRegression()
mod.fit(x,y)

ValueError: shapes (90189,2) and (90189,2) not aligned: 2 (dim 1) != 90189 (dim 0)

In [25]:
# Load modules and data
import numpy as np

import statsmodels.api as sm

spector_data = sm.datasets.spector.load()

spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)

# Check the shapes statsmodels expects: endog is 1-D, exog is 2-D
print(spector_data.endog.shape, spector_data.exog.shape)


(32,) (32, 4)


In [29]:
print(x.shape,y.shape)

(90189, 2) (90189,)
