## A/B Testing: Chi-Square Test with the Montana State University Library Case Study

In this notebook we run a chi-square test on data from the Montana State University Library case study, then apply a post-hoc correction and run pairwise tests to find the true winner. 
All tests use SciPy.

We structure the steps by answering three questions: 
1.   What was the click-through rate for each version?
2.   Which version was the winner?
3.   Do the results seem conclusive?


In [3]:
import numpy as np
import pandas as pd
#np.set_printoptions(suppress=True)
pd.set_option("display.max_colwidth", 1000)
#pd.set_option("display.max_rows", 1000)

In [7]:
# Element list Homepage Version 1 - Interact, 5-29-2013.csv
v1 = pd.read_csv("../data_crazy_egg/HomepageVersion1.csv")

# Element list Homepage Version 2 - Connect, 5-29-2013.csv
v2 = pd.read_csv("../data_crazy_egg/HomepageVersion2.csv")

# Element list Homepage Version 3 - Learn, 5-29-2013.csv
v3 = pd.read_csv("../data_crazy_egg/HomepageVersion3.csv")

# Element list Homepage Version 4 - Help, 5-29-2013.csv
v4 = pd.read_csv("../data_crazy_egg/HomepageVersion4.csv")

# Element list Homepage Version 5 - Services, 5-29-2013.csv
v5 = pd.read_csv("../data_crazy_egg/HomepageVersion5.csv")

In [11]:
v5

Unnamed: 0,Element ID,Tag name,Name,No. clicks,Visible?,Snapshot information
0,69,a,FIND,397,True,Homepage Version 5 - Services • http://www.lib.montana.edu/index5.php
1,61,input,s.q,323,True,"created 5-29-2013 • 20 days 4 hours 59 mins • 2064 visits, 1348 clicks"
2,67,a,lib.montana.edu/find/,106,True,
3,62,button,Search,85,True,
4,98,a,Hours,81,True,
5,78,a,REQUEST,57,True,
6,129,area,Montana State University - Home,49,False,
7,87,a,SERVICES,45,True,
8,96,a,News,24,True,
9,76,a,lib.montana.edu/request/,22,True,


In [2]:
# Visits per page, read manually from the "Snapshot information" column (second row) of each file
v1_visits = 10283
v2_visits = 2742
v3_visits = 2747
v4_visits = 3180
v5_visits = 2064

In [3]:
visits_list = [v1_visits, v2_visits, v3_visits, v4_visits, v5_visits]
visits_list

[10283, 2742, 2747, 3180, 2064]
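
The visit counts above were typed in by hand. As a minimal sketch, they could instead be parsed from the "Snapshot information" text. The helper name `parse_visits` is hypothetical, and the pattern assumes the text always contains a phrase of the form "N visits", as in the Version 5 row shown earlier; the comma handling is an assumption about how larger counts might be written.

```python
import re

def parse_visits(snapshot_info):
    """Extract the visit count from a snapshot description string."""
    # Look for a number (optionally with thousands separators) followed by "visits"
    match = re.search(r"([\d,]+)\s+visits", snapshot_info)
    if match is None:
        raise ValueError("no visit count found in: " + snapshot_info)
    return int(match.group(1).replace(",", ""))

# Text copied from the second row of the Version 5 table
info = "created 5-29-2013 • 20 days 4 hours 59 mins • 2064 visits, 1348 clicks"
v5_visits_parsed = parse_visits(info)
print(v5_visits_parsed)
```

In the notebook, the input string would come from something like `v5["Snapshot information"][1]`, assuming every file keeps the counts in that cell.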

In [4]:
v1_clicks = list(v1[v1['Name'] == 'INTERACT']['No. clicks'])
v2_clicks = list(v2[v2['Name'] == 'CONNECT']['No. clicks'])
v3_clicks = list(v3[v3['Name'] == 'LEARN']['No. clicks'])
v4_clicks = list(v4[v4['Name'] == 'HELP']['No. clicks'])
v5_clicks = list(v5[v5['Name'] == 'SERVICES']['No. clicks'])
v3_clicks

[21]

In [5]:
clicks_list = [v1_clicks, v2_clicks, v3_clicks, v4_clicks, v5_clicks]
clicks_list

[[42], [53], [21], [38], [45]]

In [6]:
clicks_list = [element for sublist in clicks_list for element in sublist]

In [7]:
# No-clicks per version: visits minus clicks
noclick_list = [visits - clicks for visits, clicks in zip(visits_list, clicks_list)]

noclick_list

[10241, 2689, 2726, 3142, 2019]

In [8]:
v1_CTR = v1_clicks[0] / v1_visits
v2_CTR = v2_clicks[0] / v2_visits
v3_CTR = v3_clicks[0] / v3_visits
v4_CTR = v4_clicks[0] / v4_visits
v5_CTR = v5_clicks[0] / v5_visits
v5_CTR

0.02180232558139535

In [9]:
CTR_list = [v1_CTR, v2_CTR, v3_CTR, v4_CTR, v5_CTR]
CTR_list

[0.0040844111640571815,
 0.019328956965718454,
 0.007644703312704768,
 0.011949685534591196,
 0.02180232558139535]

In [10]:
rates = pd.Series(CTR_list)
names = pd.Series(["Interact", "Connect", "Learn", "Help", "Services"])
# Sort from loser to winner (lowest to highest click-through rate)
ctr_df = pd.DataFrame({"rates": rates, "names": names}).sort_values("rates")
ctr_df

Unnamed: 0,rates,names
0,0.004084,Interact
2,0.007645,Learn
3,0.01195,Help
1,0.019329,Connect
4,0.021802,Services
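
The same click-through rates can be computed in one step by keeping clicks and visits in a single table. This is a sketch rather than the notebook's own method: the names `counts` and `ranked` are illustrative, and the numbers are the ones printed in the cells above.

```python
import pandas as pd

# Clicks on the tested link and total visits per homepage version
counts = pd.DataFrame(
    {
        "clicks": [42, 53, 21, 38, 45],
        "visits": [10283, 2742, 2747, 3180, 2064],
    },
    index=["Interact", "Connect", "Learn", "Help", "Services"],
)

# Derived columns: visits without a click and the click-through rate
counts["no_clicks"] = counts["visits"] - counts["clicks"]
counts["ctr"] = counts["clicks"] / counts["visits"]

# Highest click-through rate first
ranked = counts.sort_values("ctr", ascending=False)
print(ranked)
```

Keeping everything in one table avoids the separate per-version variables and makes the ranking a single `sort_values` call.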


In [11]:
# no-clicks
v1_noclick = v1_visits - v1_clicks[0]
v2_noclick = v2_visits - v2_clicks[0]
v3_noclick = v3_visits - v3_clicks[0]
v4_noclick = v4_visits - v4_clicks[0]
v5_noclick = v5_visits - v5_clicks[0]

In [12]:
# Alternative way: build the contingency table from two pd.Series
#clicks = pd.Series([v1_clicks[0], v2_clicks[0], v3_clicks[0], v4_clicks[0], v5_clicks[0]])
#noclicks = pd.Series([v1_noclick, v2_noclick, v3_noclick, v4_noclick, v5_noclick])

#observed_4 = pd.DataFrame(data = [clicks, noclicks])
#observed_4.columns = ["Interact", "Connect", "Learn", "Help", "Services"]
#observed_4.index = ["Click", "No-click"]

#observed_4

In [13]:
# Observed results: contingency table of clicks and no-clicks per version
observed_3 = pd.DataFrame([clicks_list, noclick_list],
                          columns = ["INTERACT", "CONNECT", "LEARN", "HELP", "SERVICES"],
                          index = ["clicks", "no clicks"])

Null Hypothesis: **Interact, Connect, Learn, Help, Services** have the same proportion of clicks and no-clicks

Alternative Hypothesis: **Interact, Connect, Learn, Help, Services** do not all have the same proportion of clicks and no-clicks

Confidence level: **95%** or 0.95

Significance level (alpha): 1 - 0.95 = 0.05

To reject the Null Hypothesis, the p-value needs to be less than or equal to alpha (p-value <= 0.05)

In [14]:
from scipy import stats
chisq, pvalue, df, expected = stats.chi2_contingency(observed_3)
pvalue

4.852334301093838e-20

In [17]:
print(pvalue <= 0.05)

True
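
To see where the p-value above comes from, the statistic can be recomputed by hand from the observed counts. This is only a sketch with illustrative variable names. It relies on the closed-form tail probability of the chi-square distribution for exactly 4 degrees of freedom, exp(-x/2) * (1 + x/2), so it does not generalise to other table sizes. The result should agree closely with the SciPy output.

```python
import math

# Observed counts in the same order as the contingency table
clicks = [42, 53, 21, 38, 45]
no_clicks = [10241, 2689, 2726, 3142, 2019]
observed = [clicks, no_clicks]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under the null hypothesis: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(col_totals) - 1)
if dof != 4:
    raise ValueError("the closed-form p-value below only holds for 4 degrees of freedom")

# Tail probability of a chi-square distribution with 4 degrees of freedom
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)
print(chi_square, dof, p_value)
```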


In [15]:
# Not done yet: we still have to find a way to order the dataframe from loser to winner automatically
from scipy import stats

def a_b_algo(dataframe, no_columns, alpha=0.05): # dataframe has to be sorted from loser to winner
  # Work on a copy so the caller's dataframe is left untouched
  dataframe_new = dataframe.copy()
  for x in range(no_columns - 1):
    chisq, pvalue, df, expected = stats.chi2_contingency(dataframe_new)
    if pvalue <= alpha:
      print(pvalue)
      # Drop the first remaining column and test the rest again
      dataframe_new = dataframe_new.drop(columns=dataframe_new.columns[0])
    else:
      print("Null Hypothesis can't be rejected")
      # Nothing changes after this point, so stop testing
      break

In [16]:
a_b_algo(observed_3, 5)

4.852334301093838e-20
5.25509870228566e-05
8.044453904790285e-05
0.007370912499282061
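
The function above tests a shrinking table one step at a time. A common alternative for post-hoc analysis is to test every pair of versions and correct the threshold for the number of comparisons. The sketch below uses the Bonferroni correction, which is an assumption on our part and not something the cells above implement. The names `p_values` and `corrected_alpha` are illustrative, and the counts are copied from the earlier outputs.

```python
from itertools import combinations
from scipy import stats

clicks = {"INTERACT": 42, "CONNECT": 53, "LEARN": 21, "HELP": 38, "SERVICES": 45}
visits = {"INTERACT": 10283, "CONNECT": 2742, "LEARN": 2747, "HELP": 3180, "SERVICES": 2064}

alpha = 0.05
pairs = list(combinations(clicks, 2))

# Bonferroni: divide alpha by the number of pairwise comparisons
corrected_alpha = alpha / len(pairs)

p_values = {}
for a, b in pairs:
    # 2x2 table: clicks and no-clicks for the two versions being compared
    table = [
        [clicks[a], clicks[b]],
        [visits[a] - clicks[a], visits[b] - clicks[b]],
    ]
    chisq, pvalue, dof, expected = stats.chi2_contingency(table)
    p_values[(a, b)] = pvalue

for (a, b), pvalue in p_values.items():
    verdict = "different" if pvalue <= corrected_alpha else "not distinguishable"
    print(a, "vs", b, pvalue, verdict)
```

With five versions there are ten pairs, so the corrected threshold is 0.05 / 10 = 0.005. Bonferroni is conservative; other corrections exist and may suit the data better.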
