## Cookie Cats A/B Testing
Business statement: We analyze the results of an A/B test in which the first gate in Cookie Cats was moved from level 30 to level 40. In particular, we look at the impact on player retention and game rounds.


In [5]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

In [6]:
df = pd.read_csv('cookie_cats.csv')

### Retention rate for day 1 and day 7
#### $\chi^2$ test of independence between groups

In [56]:
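def chi_test_indep(df, batch):
    """Chi-square test of independence between `version` and the retention column `batch`."""
    print(f'Testing result for {batch}')
    # retention rates per version (row-normalized), for display only
    display(pd.crosstab(index=df['version'], columns=df[batch], normalize='index'))
    # the test itself needs raw counts
    cross_tbl = pd.crosstab(index=df['version'], columns=df[batch])
    # note: for a 2x2 table, chi2_contingency applies Yates' continuity correction by default
    chi2, p, dof, ex = chi2_contingency(cross_tbl)
    print(f'Chi : {chi2}')
    print(f'P-value : {p}')
    return None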

In [57]:
chi_test_indep(df, 'retention_1')

Testing result for retention_1


version,retention_1=False,retention_1=True
gate_30,0.551812,0.448188
gate_40,0.557717,0.442283


Chi : 3.1591007878782262
P-value : 0.07550476210309086


In [58]:
chi_test_indep(df, 'retention_7')

Testing result for retention_7


version,retention_7=False,retention_7=True
gate_30,0.809799,0.190201
gate_40,0.818,0.182


Chi : 9.959086799559167
P-value : 0.0016005742679058301
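
As a cross-check that does not need the CSV, the same test can be rerun from the raw counts in the `margins=True` crosstabs shown later in this notebook. The sketch below also switches off Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables, to show how little it matters at this sample size. The `tables` dictionary and the variable names are illustrative and not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the margins=True crosstabs in this notebook.
# Rows: gate_30, gate_40. Columns: not retained, retained.
tables = {
    "retention_1": np.array([[24666, 20034], [25370, 20119]]),
    "retention_7": np.array([[36198, 8502], [37210, 8279]]),
}

results = {}
for name, table in tables.items():
    # Default call: Yates' continuity correction is applied (dof == 1)
    chi2_yates, p_yates, dof, _ = chi2_contingency(table)
    # Same test without the continuity correction
    chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)
    results[name] = (p_yates, p_raw)
    print(f"{name}: dof={dof}, p (Yates)={p_yates:.4f}, p (no correction)={p_raw:.4f}")
```

The corrected p-values should match the ones printed by `chi_test_indep` above, since the counts are the same.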


#### One-sample proportion z-test and confidence intervals with `statsmodels`

In [41]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint, test_proportions_2indep, confint_proportions_2indep

def ab_test_with_ci(df, features):
    # cross tab summary
    display(pd.crosstab(index=df['version'], columns=df[features], normalize='index'))

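    # per-group test and confidence interval
    for gr in ['gate_30', 'gate_40']:
        grdf = df[df['version'] == gr]
        # one-sample z-test for a proportion (proportions_ztest is a z-test, despite the 'T-test' label);
        # value=0 tests H0: p = 0, so the p-value is effectively 0 for samples this large
        print(f'\nGroup {gr}')
        print('T-test')
        propstat, p_value = proportions_ztest(count=grdf[features].sum(), nobs=grdf.shape[0],
                                              value=0, alternative='two-sided')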
        print(f'P value {p_value}')

        # confidence interval
        print('Confidence interval')
        lower_ci, upper_ci = proportion_confint(count=grdf[features].sum(), nobs=grdf.shape[0],
                                                alpha=0.05, method='normal')
        print(f'({lower_ci:.3f} , {upper_ci:.3f})')
    
    # difference proportion 
    success_a, success_b = (df[df['version']=='gate_30'][features].sum(), df[df['version']=='gate_40'][features].sum())
    n_a, n_b = (df[df['version']=='gate_30'].shape[0], df[df['version']=='gate_40'].shape[0])
    
    print('\nTest diff proportion, independence')
    stat, p_value = test_proportions_2indep(success_a, n_a, success_b, n_b, compare='diff',
                                            alternative='two-sided', return_results=False)
    print(f'Test diff p-value {p_value:.3f}')

    print('\nConfint diff proportion, independence')
    low_ci, upp_ci = confint_proportions_2indep(success_a, n_a, success_b, n_b, compare='diff', alpha=0.05)
    print(f'Confidence interval diff ({low_ci:.3f}, {upp_ci:.3f})')

    return None

In [42]:
ab_test_with_ci(df, 'retention_1')

version,retention_1=False,retention_1=True
gate_30,0.551812,0.448188
gate_40,0.557717,0.442283



Group gate_30
T-test
P value 0.0
Confidence interval
(0.444 , 0.453)

Group gate_40
T-test
P value 0.0
Confidence interval
(0.438 , 0.447)

Test diff proportion, independence
Test diff p-value 0.074

Confint diff proportion, independence
Confidence interval diff (-0.001, 0.012)


In [43]:
ab_test_with_ci(df, 'retention_7')

version,retention_7=False,retention_7=True
gate_30,0.809799,0.190201
gate_40,0.818,0.182



Group gate_30
T-test
P value 0.0
Confidence interval
(0.187 , 0.194)

Group gate_40
T-test
P value 0.0
Confidence interval
(0.178 , 0.186)

Test diff proportion, independence
Test diff p-value 0.002

Confint diff proportion, independence
Confidence interval diff (0.003, 0.013)
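
The per-group tests in `ab_test_with_ci` use `value=0`, so they only confirm that retention is not zero; the comparison of interest is between the two groups. Below is a minimal sketch of the pooled two-proportion z-test, written out by hand and checked against `proportions_ztest` called with both groups at once. It uses the retention_7 counts from the crosstabs in this notebook. This pooled z-test is not necessarily the same method that `test_proportions_2indep` uses by default, so its p-value may differ slightly from the one printed above.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

# retention_7 counts from the notebook's crosstab: [gate_30, gate_40]
retained = np.array([8502, 8279])
n = np.array([44700, 45489])

# Manual pooled z-test for H0: p_gate30 == p_gate40
p_hat = retained / n
p_pool = retained.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z_manual = (p_hat[0] - p_hat[1]) / se
p_manual = 2 * norm.sf(abs(z_manual))

# Same test via statsmodels (two-sample form, pooled variance by default)
z_sm, p_sm = proportions_ztest(count=retained, nobs=n, alternative='two-sided')

print(f"manual:      z = {z_manual:.3f}, p = {p_manual:.4f}")
print(f"statsmodels: z = {z_sm:.3f}, p = {p_sm:.4f}")
```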


# Power calculation
Reference: https://vinaysays.medium.com/a-b-testing-how-to-determine-the-sample-size-46e5419a2242

Key takeaways:
- `Significance Level` ($\alpha$): the probability of rejecting the null hypothesis when it is actually true, i.e. accepting the alternative when the null holds. This is a `False Positive`, or `Type I Error`.
- `Type II Error` ($\beta$): the probability of failing to reject the null hypothesis when the alternative is actually true. This is a `False Negative`.
- `Statistical Power`: the probability of avoiding a `Type II Error`, $1 - \beta$.
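
These definitions can be checked empirically. The following is a minimal simulation sketch: it draws many synthetic A/B experiments and counts how often a pooled two-sided z-test rejects the null hypothesis. The proportions (0.20 vs 0.22), the group size of 6500, and the helper name `rejection_rate` are assumptions chosen for illustration. With no true difference, the rejection rate estimates $\alpha$; with a true difference, it estimates the power $1 - \beta$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_rate(p_a, p_b, n, alpha=0.05, n_sims=5000):
    """Share of simulated experiments in which a pooled two-sided z-test rejects H0: p_a == p_b."""
    x_a = rng.binomial(n, p_a, size=n_sims)   # successes in group A, one value per experiment
    x_b = rng.binomial(n, p_b, size=n_sims)   # successes in group B
    p_pool = (x_a + x_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (x_a - x_b) / n / se
    p_values = 2 * norm.sf(np.abs(z))
    return np.mean(p_values < alpha)

# H0 true: every rejection is a Type I error, so the rate should be close to alpha
type_1 = rejection_rate(0.20, 0.20, n=6500)
# H1 true: the rejection rate is the power, and 1 - power is the Type II error rate
power = rejection_rate(0.20, 0.22, n=6500)

print(f"Estimated Type I error rate: {type_1:.3f}")
print(f"Estimated power: {power:.3f} (Type II error rate: {1 - power:.3f})")
```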

## Power calculation with `statsmodels`
reference : http://jpktd.blogspot.com/2013/03/statistical-power-in-statsmodels.html  
reference : https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/

In [85]:
import statsmodels.stats.api as sms
# calculate the effect size (Cohen's h) for detecting 0.20 vs 0.22
es = sms.proportion_effectsize(0.2, 0.22)
# A) sample size per group for a difference in proportions, normal (z-test) approximation
sms.NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')

6507.330263176526

In [81]:
# A.1) short-cut function of A)
sms.zt_ind_solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1)

6507.330263176526

In [86]:
# B) sample size per group using the t distribution instead of the normal
sms.TTestIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')

6508.290862720533

In [83]:
# B.1) short-cut function of B)
sms.tt_ind_solve_power(effect_size=es, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')

6508.290862720533
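
To see where these numbers come from: `proportion_effectsize` returns Cohen's h, and for equal group sizes the normal-approximation sample size per group is roughly $n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / h^2$. The sketch below recomputes both by hand. The closed form ignores the negligible chance of rejecting in the wrong tail, so treat it as an approximation of what the solver returns rather than its exact algorithm.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.stats.api as sms

p1, p2 = 0.2, 0.22
alpha, power = 0.05, 0.8

# Cohen's h: difference of arcsine-transformed proportions
h_manual = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
h_sm = sms.proportion_effectsize(p1, p2)

# Closed-form approximation for equal group sizes (ratio=1)
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_approx = 2 * (z_alpha + z_beta) ** 2 / h_manual ** 2

# Solver result, same call as in cell A) above
n_sm = sms.NormalIndPower().solve_power(effect_size=h_sm, alpha=alpha, power=power,
                                        ratio=1, alternative='two-sided')

print(f"h: manual = {h_manual:.5f}, statsmodels = {h_sm:.5f}")
print(f"n per group: closed form = {n_approx:.1f}, solver = {n_sm:.1f}")
```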

### Is the sample size for the retention_1 test large enough?

In [93]:
display(pd.crosstab(index=df['version'], columns=df['retention_1'], margins=True));
display(pd.crosstab(index=df['version'], columns=df['retention_1'], normalize='index'));

version,retention_1=False,retention_1=True,All
gate_30,24666,20034,44700
gate_40,25370,20119,45489
All,50036,40153,90189


version,retention_1=False,retention_1=True
gate_30,0.551812,0.448188
gate_40,0.557717,0.442283


### `Lift`: % increase in the target metric  
e.g. suppose the A/B test aims for a 10% lift in retention rate;  
then the target retention rate for the test group = (1 + 10%) * (control group rate)  
`control` = gate_40, p = 0.4422  
`test` = gate_30, target p = 1.1*0.4422 = 0.48642
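
The arithmetic above generalizes to any baseline and lift. A small sketch, using a hypothetical helper `required_n_for_lift` that is not part of the original notebook, shows how the required sample size per group shrinks as the target lift grows. The 5% and 20% lifts are illustrative values; only the 10% case comes from this analysis.

```python
import statsmodels.stats.api as sms

def required_n_for_lift(baseline, lift, alpha=0.05, power=0.8):
    """Observations per group needed to detect a relative `lift` over `baseline` (equal group sizes)."""
    target = (1 + lift) * baseline
    es = sms.proportion_effectsize(baseline, target)
    return sms.zt_ind_solve_power(effect_size=es, nobs1=None, alpha=alpha,
                                  power=power, ratio=1)

# Baseline: day-1 retention of the control group as defined above
n_by_lift = {lift: required_n_for_lift(0.4422, lift) for lift in (0.05, 0.10, 0.20)}
for lift, n in n_by_lift.items():
    print(f"lift = {lift:.0%}: about {n:.0f} observations per group")
```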

In [99]:
es = sms.proportion_effectsize(0.4422, 0.48642)
# number of obs for each test group
sms.zt_ind_solve_power(effect_size=es, nobs1=None, alpha=0.05, power=0.8, ratio=1)

1995.407473213585

The actual sample sizes, `test` (44700) and `control` (45489), are far larger than the roughly 1995 observations per group that the calculation requires,  
so this test has more than enough data to detect a 10% lift.  
  
We can also calculate the power this test achieves for that effect size.  

In [100]:
sms.zt_ind_solve_power(effect_size=es, nobs1=44700, alpha=0.05, power=None, ratio=1)

1.0

Power = $1 - \beta \approx 1$ (the printed value is rounded), so $\beta \approx 0$: the chance of a `Type II Error` when the true lift is 10% is negligible.
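
A power of 1.0 for a 10% lift says little about smaller effects. The sketch below estimates the minimum detectable effect at 80% power for this sample size, using the same normal approximation with Cohen's h. It assumes equal groups of 44700, as in the cell above, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.stats.api as sms

n_per_group = 44700
alpha, power = 0.05, 0.8
baseline = 0.4422  # day-1 retention of the control group

# Smallest Cohen's h detectable with the given alpha and power (equal group sizes)
h_min = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * np.sqrt(2 / n_per_group)

# Invert the arcsine transform to get the corresponding target proportion
target = np.sin(np.arcsin(np.sqrt(baseline)) + h_min / 2) ** 2
relative_lift = target / baseline - 1

# Sanity check: the solver should report roughly 80% power for this effect size
check_power = sms.zt_ind_solve_power(effect_size=h_min, nobs1=n_per_group,
                                     alpha=alpha, power=None, ratio=1)

print(f"Minimum detectable h: {h_min:.4f}")
print(f"Target retention: {target:.4f} (relative lift {relative_lift:.2%})")
print(f"Power at that effect size: {check_power:.3f}")
```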

### Is the sample size for the retention_7 test large enough?

In [101]:
display(pd.crosstab(index=df['version'], columns=df['retention_7'], margins=True));
display(pd.crosstab(index=df['version'], columns=df['retention_7'], normalize='index'));

version,retention_7=False,retention_7=True,All
gate_30,36198,8502,44700
gate_40,37210,8279,45489
All,73408,16781,90189


version,retention_7=False,retention_7=True
gate_30,0.809799,0.190201
gate_40,0.818,0.182


In [106]:
es = sms.proportion_effectsize(0.1822, 1.1*0.1822)
theoretical_obs = sms.zt_ind_solve_power(effect_size=es, nobs1=None, alpha=0.05, power=0.8, ratio=1)
print(f'Theoretical n_obs for each group : {theoretical_obs:.0f}')

Theoretical n_obs for each group : 7312


In [110]:
test_power = sms.zt_ind_solve_power(effect_size=es, nobs1=44700, alpha=0.05, power=None, ratio=1)
print(f'Real test power : {test_power:.5f}')

Real test power : 1.00000
