# Summary

This notebook:
- Loads the data
- Transforms the data into a format compatible with mean comparison tests
- Tests equality of variances (Levene's test) and compares means (ANOVA, Welch's test, and Tukey's and Games-Howell pairwise comparisons)

# Data Management

In [18]:
## Load the raw data
import pandas as pd
data = pd.read_csv('ab_testing_data.csv')
data.head()

Unnamed: 0,sku,soid,soname,clid,clname,mkcname,suid,treated,addedtocart,placedorder,trafficcount,after,videocount,grs1month,grs2month,grs3month,grs12month,weightedavgscore,percentilerank,expectedgrs
0,8201MAAA,3,UK,21,Cribs,Nursery,1,0,0,0,14,1,0,0.0,0.0,0.0,0.0,,,
1,7801KBAA,4,DE,18,Mattress Toppers and Pads,Mattresses - Utility Bedding,7080,0,45,5,1362,1,0,265.558272,857.66493,3759.433838,6547.57893,,,
2,0001ECAA,6,CA,15,Area Rugs,Rugs,1,0,0,0,0,1,0,0.0,0.0,0.0,0.0,,,
3,3001FCAA,3,UK,15,Area Rugs,Rugs,4312,0,0,0,25,1,0,0.0,0.0,0.0,342.437186,,,
4,5001FCAA,3,UK,15,Area Rugs,Rugs,1,0,1,0,7,1,0,0.0,0.0,0.0,0.0,,,


## Aggregate to SKU Level

In [19]:
aggr = data.groupby(['sku','treated','after']).agg({'trafficcount':'sum', 'addedtocart':'sum', 'placedorder':'sum'}).reset_index()
aggr

Unnamed: 0,sku,treated,after,trafficcount,addedtocart,placedorder
0,000001AP,0,0,67,3,0
1,000001AP,0,1,43,2,0
2,000001BC,1,0,57,2,0
3,000001BC,1,1,66,2,0
4,000001BQ,0,0,396,6,0
...,...,...,...,...,...,...
3806670,9999YCUE,0,1,0,0,0
3806671,9999YHV,0,0,8,0,0
3806672,9999YHV,0,1,15,2,2
3806673,9999ZHAD,0,0,0,0,0


In [21]:
aggr.to_csv('aggregated_ab_testing_data.csv', index = False)

## Differencing between before and after

In [36]:
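## Pivot the table so that each metric has a before (after=0) and an after (after=1) column.
## dropna() eliminates SKUs that lack either the before or the after observation.
## Keyword arguments are used because recent pandas versions no longer accept positional ones here.
pv = aggr.pivot(index='sku', columns='after').dropna()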
print('-----Pivot Table-----')
print(pv)

## Subtract before from after for each metric
difTraffic = pv['trafficcount',1] - pv['trafficcount',0]
difCart = pv['addedtocart',1] - pv['addedtocart',0]
difBuy = pv['placedorder',1] - pv['placedorder',0]

## Make dataframe from the differences 
df = pd.DataFrame({'treated':pv['treated',0], 'difTraffic':difTraffic, 'difCart':difCart, 'difBuy': difBuy}).reset_index()
## reset_index() turns the current index (sku) into a regular column.
## It is optional, but a plain column is usually easier to manipulate than an index.
print('')
print('-----Difference Table-----')
df

-----Pivot Table-----
         treated      trafficcount        addedtocart      placedorder     
after          0    1            0      1           0    1           0    1
sku                                                                        
000001AP     0.0  0.0         67.0   43.0         3.0  2.0         0.0  0.0
000001BC     1.0  1.0         57.0   66.0         2.0  2.0         0.0  0.0
000001BQ     0.0  0.0        396.0  317.0         6.0  2.0         0.0  0.0
000001CO     0.0  0.0         92.0   14.0         3.0  0.0         0.0  0.0
000001DG     0.0  0.0          3.0    3.0         0.0  0.0         0.0  0.0
...          ...  ...          ...    ...         ...  ...         ...  ...
9999XSGB     0.0  0.0          0.0    3.0         0.0  0.0         0.0  0.0
9999XWCN     0.0  0.0          9.0    7.0         1.0  1.0         0.0  0.0
9999YADV     0.0  0.0         14.0   41.0         0.0  0.0         0.0  0.0
9999YHV      0.0  0.0          8.0   15.0         0.0  2.0        

Unnamed: 0,sku,treated,difTraffic,difCart,difBuy
0,000001AP,0.0,-24.0,-1.0,0.0
1,000001BC,1.0,9.0,0.0,0.0
2,000001BQ,0.0,-79.0,-4.0,0.0
3,000001CO,0.0,-78.0,-3.0,0.0
4,000001DG,0.0,0.0,0.0,0.0
...,...,...,...,...,...
1864152,9999XSGB,0.0,3.0,0.0,0.0
1864153,9999XWCN,0.0,-2.0,0.0,0.0
1864154,9999YADV,0.0,27.0,0.0,0.0
1864155,9999YHV,0.0,7.0,2.0,2.0
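
To make the pivot-and-subtract step easier to follow, here is a minimal sketch on a toy table. The SKUs `A`, `B`, `C` and their values are made up for illustration; only the column names mirror the real data. SKU `C` has no after-period row, so `dropna()` removes it.

```python
import pandas as pd

# Toy data (hypothetical values): SKU "C" is missing its after=1 row.
toy = pd.DataFrame({
    "sku": ["A", "A", "B", "B", "C"],
    "treated": [0, 0, 1, 1, 0],
    "after": [0, 1, 0, 1, 0],
    "trafficcount": [10, 14, 20, 17, 5],
})

# One row per SKU, with a (metric, after) column pair for each remaining column.
wide = toy.pivot(index="sku", columns="after").dropna()

# After minus before, per SKU.
diff = wide[("trafficcount", 1)] - wide[("trafficcount", 0)]

result = pd.DataFrame({
    "treated": wide[("treated", 0)],
    "difTraffic": diff,
}).reset_index()
print(result)
```

Only `A` and `B` survive, with differences of 4 and -3 respectively.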


In [44]:
## Save the differenced data to CSV
df.to_csv('differenced_data.csv', index = False)

In [2]:
## Reload the differenced data so the analysis can start from a fresh kernel
import pandas as pd
df = pd.read_csv('differenced_data.csv')

# Statistical Analyses - Mean Comparison

## Levene's Test for Equal Variance

First, we must test whether the variances of the differenced values are equal between the two groups, because the result decides which mean comparison test to use: ANOVA or Welch's test. We can use Levene's test for equality of variances.

pingouin accepts a value column and a group (categorical) column directly from the dataframe. Other packages may require passing one series per group.
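
For comparison, the one-series-per-group style might look like the sketch below, which uses `scipy.stats.levene`. The two samples are synthetic and chosen only so that their spreads differ clearly; `center="median"` is SciPy's default and is spelled out here for clarity.

```python
from scipy import stats

# Synthetic samples (illustrative only): same center, very different spread.
control = [10 + 0.1 * i for i in range(-10, 11)]
treatment = [10 + 5.0 * i for i in range(-10, 11)]

# SciPy takes each group as a separate positional argument.
w_stat, p_value = stats.levene(control, treatment, center="median")

# Same decision rule as the equal_var column reported by pingouin.
equal_var = bool(p_value >= 0.05)
print(f"W = {w_stat:.2f}, p = {p_value:.3g}, equal_var = {equal_var}")
```

With a real dataframe, the groups would first have to be split by hand, for example with one boolean filter per value of `treated`.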

In [3]:
import pingouin as pg
def levene_test(data, value_field, category_field):
    return pg.homoscedasticity(data, dv=value_field, group=category_field, method='levene', alpha=0.05)

In [4]:
print('Variance Equality between control and treatment in Traffic difference')
print(levene_test(df, 'difTraffic', 'treated'))

print('----------')
print('Variance Equality in Added-to-Cart Clicks difference')
print(levene_test(df, 'difCart', 'treated'))

print('----------')
print('Variance Equality in Purchases difference')
print(levene_test(df, 'difBuy', 'treated'))

Variance Equality between control and treatment in Traffic difference
                  W  pval  equal_var
levene  1819.622559   0.0      False
----------
Variance Equality in Added-to-Cart Clicks difference
                  W  pval  equal_var
levene  1929.142058   0.0      False
----------
Variance Equality in Purchases difference
                  W  pval  equal_var
levene  1530.320935   0.0      False


Levene's test rejects equal variances between the control and treatment groups for all three differenced metrics (p < 0.05). We shall therefore use:
- Welch's test instead of ANOVA for mean differences 
- Games-Howell test for post-hoc pairwise comparison instead of Tukey's test.
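
The decision rule above can be written as a small helper. This is only a sketch: `choose_tests` is a hypothetical function, not part of pingouin, and the 0.05 threshold matches the `alpha` passed to Levene's test earlier.

```python
from typing import Tuple

def choose_tests(levene_pval: float, alpha: float = 0.05) -> Tuple[str, str]:
    """Return (omnibus test, post-hoc test) given Levene's p-value."""
    if levene_pval < alpha:
        # Equal variances rejected: use the variance-robust pair.
        return ("Welch's test", "Games-Howell")
    # No evidence against equal variances: the classical pair applies.
    return ("ANOVA", "Tukey")

print(choose_tests(0.0))   # p-value reported by Levene's test in this notebook
print(choose_tests(0.40))  # hypothetical case with no evidence of unequal variances
```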

## Test of Difference in Means: ANOVA vs. Welch's Test

Both ANOVA and Welch's test compare means across two or more groups. They differ in whether they assume equal variances.

### ANOVA

When the variances are equal between groups, we can use ANOVA. It is run here for comparison only, since Levene's test rejected equal variances.
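
With only two groups, one-way ANOVA is equivalent to the pooled-variance (Student's) t-test: the F statistic equals the squared t statistic and the p-values coincide. A minimal check with SciPy on made-up samples:

```python
import math
from scipy import stats

# Made-up samples, for illustration only.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

f_stat, f_p = stats.f_oneway(a, b)                   # one-way ANOVA
t_stat, t_p = stats.ttest_ind(a, b, equal_var=True)  # pooled-variance t-test

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"p (ANOVA) = {f_p:.4f}, p (t-test) = {t_p:.4f}")
print(math.isclose(f_stat, t_stat ** 2, rel_tol=1e-9))
```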

In [8]:
print('ANOVA on Traffic Difference')
print(pg.anova(dv='difTraffic', between='treated', data=df))

print('----------')
print('ANOVA on Added-to-Cart Difference')
print(pg.anova(dv='difCart', between='treated', data=df))

print('----------')
print('ANOVA on Purchase Difference')
print(pg.anova(dv='difBuy', between='treated', data=df))

ANOVA on Traffic Difference
    Source  ddof1    ddof2         F     p-unc       np2
0  treated      1  1864155  1.874216  0.170993  0.000001
----------
ANOVA on Added-to-Cart Difference
    Source  ddof1    ddof2           F         p-unc     np2
0  treated      1  1864155  186.352751  1.997701e-42  0.0001
----------
ANOVA on Purchase Difference
    Source  ddof1    ddof2           F          p-unc       np2
0  treated      1  1864155  722.199889  4.772576e-159  0.000387


### Welch's Test

When the variances are unequal between groups, we use Welch's Test. This is the appropriate test for our case.
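
With two groups, Welch's test reduces to Welch's t-test (F = t²), which replaces the pooled variance with per-group variances and adjusts the degrees of freedom using the Welch-Satterthwaite formula. That adjustment is why `ddof2` in the output is not an integer. The sketch below computes both quantities with the standard library only; `welch_t` is a hypothetical helper and the samples are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va_n = variance(a) / len(a)  # sample variance / n for group a
    vb_n = variance(b) / len(b)  # sample variance / n for group b
    t = (mean(a) - mean(b)) / sqrt(va_n + vb_n)
    df = (va_n + vb_n) ** 2 / (
        va_n ** 2 / (len(a) - 1) + vb_n ** 2 / (len(b) - 1)
    )
    return t, df

# Made-up samples with clearly different variances.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10, 12]

t_stat, df = welch_t(a, b)
print(f"t = {t_stat:.3f}, F = t^2 = {t_stat ** 2:.3f}, df = {df:.3f}")
```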

In [7]:
from pingouin import welch_anova

print("Welch's test on Traffic Difference")
print(welch_anova(dv='difTraffic', between='treated', data=df))

print('----------')
print("Welch's test on Added-to-Cart Difference")
print(welch_anova(dv='difCart', between='treated', data=df))

print('----------')
print("Welch's test on Purchase Difference")
print(welch_anova(dv='difBuy', between='treated', data=df))

Welch's test on Traffic Difference
    Source  ddof1          ddof2         F     p-unc       np2
0  treated      1  199886.508265  0.564182  0.452581  0.000001
----------
Welch's test on Added-to-Cart Difference
    Source  ddof1          ddof2          F         p-unc     np2
0  treated      1  196227.780707  41.469138  1.200217e-10  0.0001
----------
Welch's test on Purchase Difference
    Source  ddof1         ddof2           F         p-unc       np2
0  treated      1  194135.64732  127.607125  1.397569e-29  0.000387


## Post-hoc Pairwise Comparison: Tukey's vs Games-Howell

After ANOVA or Welch's test, we should compare the means between individual pairs of groups. This step is not necessary when comparing just two groups, because the omnibus test already tells us whether the difference between them is significant. However, when there are more than two groups, running a separate t-test, ANOVA, or Welch's test for each pair would inflate the family-wise Type I error rate: the chance of at least one false positive grows with the number of comparisons. Post-hoc pairwise tests examine each pair while controlling this error rate. Which pairwise test to use again depends on whether the variances are equal.
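
To illustrate how quickly uncorrected pairwise testing inflates the error rate, the sketch below computes the family-wise error rate 1 - (1 - α)^m for m = k(k-1)/2 pairs among k groups. It assumes the tests are independent, which is only an approximation for pairwise comparisons that share groups; `familywise_error_rate` is a hypothetical helper.

```python
from math import comb

def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all uncorrected pairwise tests."""
    n_pairs = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** n_pairs

for k in (2, 3, 5, 10):
    print(f"{k} groups, {comb(k, 2)} pairs: FWER = {familywise_error_rate(k):.3f}")
```

With two groups there is a single comparison, so the rate stays at α = 0.05. With five groups and ten pairs it already reaches about 0.40.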

### Tukey's Test

For equal variances among groups, we can use Tukey's test. 

In [9]:
def tukey(data, value_field, category_field, n_round):
    return pg.pairwise_tukey(data=data, dv=value_field,
                        between=category_field).round(n_round)

In [16]:
print("Tukey's Pairwise Comparison on Traffic Difference")
print(tukey(df, 'difTraffic', 'treated', 1))

print('----------')
print("Tukey's on Added-to-Cart Difference")
print(tukey(df, 'difCart', 'treated', 1))

print('----------')
print("Tukey's on Purchase Difference")
print(tukey(df, 'difBuy', 'treated', 1))

Tukey's Pairwise Comparison on Traffic Difference
     A    B  mean(A)  mean(B)  diff   se    T  p-tukey  hedges
0  0.0  1.0     21.7     18.1   3.6  2.6  1.4      0.2     0.0
----------
Tukey's on Added-to-Cart Difference
     A    B  mean(A)  mean(B)  diff   se     T  p-tukey  hedges
0  0.0  1.0      0.2     -2.8   3.0  0.2  13.7      0.0     0.0
----------
Tukey's on Purchase Difference
     A    B  mean(A)  mean(B)  diff   se     T  p-tukey  hedges
0  0.0  1.0     -0.3     -1.9   1.6  0.1  26.9      0.0     0.1


### Games-Howell Test

For unequal variances among groups, we can use Games-Howell post-hoc pairwise comparison. This is the appropriate test for our case. 

In [14]:
def games_howell(data, value_field, category_field, n_round):
    return pg.pairwise_gameshowell(data=data, dv=value_field,
                        between=category_field).round(n_round)

In [17]:
print("Games-Howell Pairwise Comparison on Traffic Difference")
print(games_howell(df, 'difTraffic', 'treated', 1))

print('----------')
print("Games-Howell on Added-to-Cart Difference")
print(games_howell(df, 'difCart', 'treated', 1))

print('----------')
print("Games-Howell on Purchase Difference")
print(games_howell(df, 'difBuy', 'treated', 1))

Games-Howell Pairwise Comparison on Traffic Difference
     A    B  mean(A)  mean(B)  diff   se    T        df  pval  hedges
0  0.0  1.0     21.7     18.1   3.6  4.8  0.8  199886.5   0.5     0.0
----------
Games-Howell on Added-to-Cart Difference
     A    B  mean(A)  mean(B)  diff   se    T        df  pval  hedges
0  0.0  1.0      0.2     -2.8   3.0  0.5  6.4  196227.8   0.0     0.0
----------
Games-Howell on Purchase Difference
     A    B  mean(A)  mean(B)  diff   se     T        df  pval  hedges
0  0.0  1.0     -0.3     -1.9   1.6  0.1  11.3  194135.6   0.0     0.0
