# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice your skills at conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The data you'll be investigating was collected from the homepage of Audacity, a music app.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies with the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [4]:
#Your code here
import pandas as pd
import numpy as np
import scipy.stats as stats
df = pd.read_csv('homepage_actions.csv')
df.head()

|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |


In [23]:
!cat flatiron_stats.py

#flatiron_stats
import numpy as np
import scipy.stats as stats

def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # "ddof = Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, 
    #  where N represents the number of elements. By default ddof is zero.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-val

In [6]:
import flatiron_stats as fs

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  8188 non-null   object
 1   id         8188 non-null   int64 
 2   group      8188 non-null   object
 3   action     8188 non-null   object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [8]:
df['timestamp'].min()

'2016-09-24 17:42:27.839496'

In [9]:
df['timestamp'].max()

'2017-01-18 10:24:08.629327'

In [10]:
df['group'].value_counts()

control       4264
experiment    3924
Name: group, dtype: int64

In [11]:
df['action'].value_counts()

view     6328
click    1860
Name: action, dtype: int64

My notes: I didn't really understand the hint questions above ("How many viewers also clicked?" and "Are there any anomalies with the data; did anyone click who didn't view?") because it seemed like my 'action' value counts already answered them. Value counts only count rows, though, not distinct ids, so in the spirit of learning more, let's check it this way as well:

In [12]:
#view_ids = df['id']['action' == 'view'].unique()

#No, you (1) make it a set and (2) put the condition first, I guess:
view_ids = set(df[df.action=='view']['id'].unique())
click_ids = set(df[df.action=='click']['id'].unique())
print("Number of viewers: {}\tNumber of clickers: {}".format(len(view_ids), len(click_ids)))

Number of viewers: 6328	Number of clickers: 1860


In [13]:
# Set difference compares the actual ids; subtracting the two lengths only compares counts
print("Number of viewers who didn't click: {}".format(len(view_ids - click_ids)))
print("Number of clickers who didn't view: {}".format(len(click_ids - view_ids)))

Number of viewers who didn't click: 4468
Number of clickers who didn't view: 0


One last thing is making sure our control & experiment groups don't overlap:

In [14]:
#control_ids = set(df.group == 'control')['id'].unique()
control_ids = set(df[df.group == 'control']['id'].unique())
experiment_ids = set(df[df.group == 'experiment']['id'].unique())
#overlap = experiment_ids.isin(control_ids)
print("Number of overlap between experiment & control id numbers: {}".format(len(control_ids&experiment_ids)))

Number of overlap between experiment & control id numbers: 0


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [15]:
#Your code here
#We'll set up a contingency table by first filtering the df for the group and action columns
# only. Then, we'll set up filter A and filter B for the control & experiment groups. 

#filtered_df = df[df['group', 'action']]

#Nope, don't need the second 'df' in there:
filtered_df = df[['group', 'action']]
filtered_df

|   | group | action |
|---|---|---|
| 0 | experiment | view |
| 1 | experiment | view |
| 2 | experiment | view |
| 3 | control | view |
| 4 | experiment | view |
| ... | ... | ... |
| 8183 | experiment | view |
| 8184 | experiment | view |
| 8185 | experiment | view |
| 8186 | control | view |


In [16]:
filtered_df_A = filtered_df[filtered_df['group'] == 'control']
filtered_df_B = filtered_df[filtered_df['group'] == 'experiment']

control_clicks = sum(filtered_df_A['action'] == 'click')
experiment_clicks = sum(filtered_df_B['action'] == 'click')
control_clicks, experiment_clicks

(932, 928)

In [17]:
control_no_clicks = sum(filtered_df_A['action'] == 'view')
experiment_no_clicks = sum(filtered_df_B['action'] == 'view')
control_no_clicks, experiment_no_clicks

(3332, 2996)

In [19]:
contingency_table = np.array([
                            (control_clicks, control_no_clicks),
                            (experiment_clicks, experiment_no_clicks)
])
contingency_table

array([[ 932, 3332],
       [ 928, 2996]])

In [20]:
stats.chi2_contingency(contingency_table)

(3.636160051233291,
 0.056537191086915774,
 1,
 array([[ 968.61748901, 3295.38251099],
        [ 891.38251099, 3032.61748901]]))

It looks like the p-value is above our alpha of 0.05, so we cannot reject the null hypothesis. In other words, we found no significant difference between the control and experiment groups.
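One thing worth checking in the table above: its second column counts view rows, not users who never clicked. Below is a minimal sketch of a per-user table instead. It assumes every user in the log has exactly one view (so non-clickers = viewers - clickers) and reuses the counts printed above; `per_user_table` is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Counts printed earlier in this notebook (view rows and click rows per group).
control_users, control_clicks = 3332, 932
experiment_users, experiment_clicks = 2996, 928

# Rows: control, experiment. Columns: clicked, did not click.
per_user_table = np.array([
    [control_clicks, control_users - control_clicks],
    [experiment_clicks, experiment_users - experiment_clicks],
])

chi2, p_value, dof, expected = stats.chi2_contingency(per_user_table)
print(per_user_table)
print("chi2 = {:.3f}, p = {:.4f}, dof = {}".format(chi2, p_value, dof))
```

With this layout each user lands in exactly one cell, which is what the chi-squared test of independence expects.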

WELP, after all that, it turns out I ran the wrong kind of test. ⊙﹏⊙∥ 

I suppose I'm really just comparing the click conversion rates of two groups. In other words, I'm comparing two sets of numerical values, so a two-sided Welch's t-test will do it. At least, I think that's the reasoning. I'm still confused because we just worked with similar data (conversion rates) in the previous A/B testing lecture with Greg and we used a chi-squared test there, so...╰（‵□′）╯
In any case, they make a 'count' column where each row gets the same value: 1. This can be used to calculate sums easily. I could make it here in my filtered_df and go from there, setting up and running a two-sided Welch's t-test. However, I don't need to do that; I already put together the subsets I need.
What I should've done is calculate the percentage of views that turn into clicks; that would give me a mean. Specifically, the "average click rate," which means I'm just comparing two means, which means I use Welch's t-test. I think Welch's is also used with control/experiment stuff more, but I'm not 100% sure on that.
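For reference, SciPy can run Welch's t-test directly through `scipy.stats.ttest_ind` with `equal_var=False`, which returns a two-sided p-value. Here is a rough sketch, assuming one 0/1 outcome per user and the click and view counts printed above; `binary_outcomes` is a hypothetical helper, not part of the lab.

```python
import numpy as np
from scipy import stats

def binary_outcomes(n_users, n_clicks):
    """Hypothetical helper: one 0/1 entry per user, 1 meaning the user clicked."""
    return np.concatenate([np.ones(n_clicks), np.zeros(n_users - n_clicks)])

# Assumption: each user viewed exactly once, so view rows equal users.
control = binary_outcomes(3332, 932)
experiment = binary_outcomes(2996, 928)

# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_two_sided = stats.ttest_ind(experiment, control, equal_var=False)
print("t = {:.4f}".format(t_stat))
print("two-sided p = {:.6f}, one-sided p = {:.6f}".format(p_two_sided, p_two_sided / 2))
```

The order of the two samples only changes the sign of the t statistic, not the two-sided p-value.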


In [21]:
control_click_rate = control_clicks/control_no_clicks
experiment_click_rate = experiment_clicks/experiment_no_clicks
control_click_rate, experiment_click_rate

(0.2797118847539016, 0.3097463284379172)

In [27]:
fs.p_value_welch_ttest(control_click_rate, experiment_click_rate)

AttributeError: 'float' object has no attribute 'mean'

In [28]:
def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # "ddof = Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, 
    #  where N represents the number of elements. By default ddof is zero.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-value is for a one-sided t-test. 
    Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.
    """
    t = welch_t(a, b)
    df = welch_df(a, b)
    
    p = 1-stats.t.cdf(np.abs(t), df)
    
    if two_sided:
        return 2*p
    else:
        return p

In [29]:
p_value_welch_ttest(control_click_rate, experiment_click_rate)

AttributeError: 'float' object has no attribute 'mean'

In [30]:
fs.p_value_welch_ttest(control_clicks, experiment_clicks)

AttributeError: 'int' object has no attribute 'mean'

Okay, this won't work because I already calculated a mean - the "average click rate" - and the flatiron_stats function needs an array (or at least some sort of group) of numbers. I do remember that the solution approached the Welch's t-test here differently, with a 'count' column they could sum up. 

I tried to do it my way, but I wind up with a precomputed mean the function isn't designed for. Wait, what if I just take the "sum" part out of my control_clicks/control_no_clicks and experiment_clicks/experiment_no_clicks and then run it?

In [31]:
control_clicks = filtered_df_A['action'] == 'click'
experiment_clicks = filtered_df_B['action'] == 'click'
control_no_clicks = filtered_df_A['action'] == 'view'
experiment_no_clicks = filtered_df_B['action'] == 'view'

In [32]:
fs.p_value_welch_ttest(control_clicks, experiment_clicks)

0.026743886922199422

So that finally won't throw an error, but why is my result different than the solution? It must be in how the solution sets up the test. At this point, I've spent hours trying to figure it out on my own, so let's investigate why the solution is producing a different result:

In [37]:
df['count'] = 1
df.head()

|   | timestamp | id | group | action | count |
|---|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view | 1 |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view | 1 |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view | 1 |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view | 1 |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view | 1 |


In [38]:
control = df[df.group=='control'].pivot(index='id', columns='action', values='count')
control = control.fillna(value=0)

experiment = df[df.group=='experiment'].pivot(index='id', columns='action', values='count')
experiment = experiment.fillna(value=0)



print("Sample sizes:\tControl: {}\tExperiment: {}".format(len(control), len(experiment)))
print("Total Clicks:\tControl: {}\tExperiment: {}".format(control.click.sum(), experiment.click.sum()))
print("Average click rate:\tControl: {}\tExperiment: {}".format(control.click.mean(), experiment.click.mean()))
control.head()

Sample sizes:	Control: 3332	Experiment: 2996
Total Clicks:	Control: 932.0	Experiment: 928.0
Average click rate:	Control: 0.2797118847539016	Experiment: 0.3097463284379172


| id | click | view |
|---|---|---|
| 182994 | 1.0 | 1.0 |
| 183089 | 0.0 | 1.0 |
| 183248 | 1.0 | 1.0 |
| 183515 | 0.0 | 1.0 |
| 183524 | 0.0 | 1.0 |


In [39]:
fs.p_value_welch_ttest(control.click, experiment.click)

0.004466402814337078

Ah, so the binary count column is set up such that all views get a 1, but clicks only get a count if it says "click." Subsequently, it will produce a sample you can average within the welch ttest function. I guess I'm still confused why the output is different because my version of a pivot table (as filtered_df) also produces such an input. Let's explore their pivot table some more:
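To see what the pivot does to the shape of the data, here is a tiny sketch on made-up rows (the `toy` frame and its ids are hypothetical, not taken from the lab's CSV). Each id ends up on a single row, and `fillna` turns a missing click into 0.

```python
import pandas as pd

# Hypothetical action log: users 1 and 3 view and click, user 2 only views.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "action": ["view", "click", "view", "view", "click"],
})
toy["count"] = 1

# One row per id, one column per action; a missing action becomes NaN, then 0.
per_user = toy.pivot(index="id", columns="action", values="count").fillna(0)

print(per_user)
print("rows in log:", len(toy), "| rows after pivot:", len(per_user))
print("click rate per user:", per_user["click"].mean())
```

The log has five rows but the pivoted table has three, one per user, so a mean over `per_user["click"]` is a per-user click rate.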

In [43]:
control.click

id
182994    1.0
183089    0.0
183248    1.0
183515    0.0
183524    0.0
         ... 
936786    0.0
937003    0.0
937073    0.0
937108    0.0
937217    1.0
Name: click, Length: 3332, dtype: float64

In [44]:
#Here's my attempt at the same thing from earlier:
control_clicks

3       False
9       False
23      False
24      False
25      False
        ...  
8178    False
8181    False
8182     True
8186    False
8187    False
Name: action, Length: 4264, dtype: bool

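So, I have a boolean Series rather than a numeric column, but the difference that matters is the length. My control_clicks has one entry per row of the log (4264 = 3332 views + 932 clicks), while the solution's control.click has one entry per user (3332). Booleans average just like 0s and 1s, so the dtype isn't the problem; the denominator is. My version effectively tests 932/4264 instead of 932/3332, which is why the p-value came out different. In the interest of time, the lesson to keep in mind: when we're compiling subsets and numerical inputs for stats tests, each observation should be one unit of the experiment (here, one user), which is what pivoting on id with the 'count' column gives us.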
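To see how much the boolean dtype matters compared with the length, here is a small sketch using the control-group counts from above (3332 view rows, 932 click rows). The array names are made up for illustration.

```python
import numpy as np

# Control-group counts reported earlier in this notebook.
n_view_rows, n_click_rows = 3332, 932

# Row-level input: one boolean per logged action (click rows and view rows mixed).
row_level = np.array([True] * n_click_rows + [False] * n_view_rows)

# User-level input: one float per user, 1.0 if that user clicked.
user_level = np.array([1.0] * n_click_rows + [0.0] * (n_view_rows - n_click_rows))

print("row-level:  n = {}, mean = {:.4f}".format(len(row_level), row_level.mean()))
print("user-level: n = {}, mean = {:.4f}".format(len(user_level), user_level.mean()))

# Casting booleans to floats leaves the mean unchanged.
print(row_level.mean() == row_level.astype(float).mean())
```

The two means differ because the denominators differ (4264 rows versus 3332 users), not because one array is boolean.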

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. 

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [46]:
#Your code here
p_experiment_clicks = sum(experiment.click)/len(experiment.click)
expected_clicks_experiment = len(control.click) * (p_experiment_clicks*(1-p_experiment_clicks))
print("expected number of clicks for the experiment group", expected_clicks_experiment)
print("click-through rate of control group:", (control.click.sum()/len(control.click)))

expected number of clicks for the experiment group 712.3933968032143
click-through rate of control group: 0.2797118847539016


I'm not sure if I'm answering the question above correctly; it feels like I'm doing what it asks, 
but I don't think I get what it's asking. What does the solution do here?

In [49]:
control_rate = control.click.mean()
expected_experiment_clicks_under_null = control_rate * len(experiment)
expected_experiment_clicks_under_null

838.0168067226891

Oh, it's asking for the click rate of the control group. We then multiply that rate by the size of the experiment group to get the number of clicks we'd expect if the experimental page performed no better than the control.

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [58]:
#Your code here
#ok, wow, this is going back to basics on s.d. and variance and I really had no idea what to do,
# so first you have to calculate the variance, like they showed above. Then, you have to 
# remember that the s.d. is the square root of the variance. 
n = len(experiment.click)
#The p for our purposes will be control_rate because that's what we used above
p = control_rate
variance = n*(p*(1-p))
std = np.sqrt(variance)
std

24.568547907005815

In [56]:
#Now we (should so obviously and easily) recall that we measure the number of 
# s.d.'s from a mean with a z-score. z-scores are (x-mu)/s.d.:

#z_score = (expected_experiment_clicks_under_null - control.click.mean())/std
#（＞人＜；）
z_score = (expected_experiment_clicks_under_null - sum(experiment.click))/std
z_score

-3.6625360854823588

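Hmm, the solution's numerator order is reversed (observed minus expected). Checking against the z-score
formula, (x - mu)/s.d., x is the observed count and mu is the expected count, so the solution's order is the conventional one and mine flips the sign.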

In [57]:
sum(experiment.click)

928.0

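The observed 928 clicks are more than the roughly 838 expected, so the conventional z-score is +3.66 and my -3.66 has the sign flipped. The magnitude is the same either way, which is all that matters if the question is simply
"how many s.d.'s away".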

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [60]:
#Your code here
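#If I'm reading my notes right, all I have to do is this. My z_score is negative, so cdf gives
# the lower-tail area, which by symmetry equals the upper-tail area for +3.66:
stats.norm.cdf(z_score)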

0.00012486528006951198
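The three verification steps can also be collected into one function. This is a sketch, not the lab's solution: `binomial_z_check` and its parameter names are made up here, it uses the conventional (observed - expected) sign, and `scipy.stats.norm.sf` returns the upper-tail area directly.

```python
import numpy as np
from scipy import stats

def binomial_z_check(control_clicks, control_users, experiment_clicks, experiment_users):
    """Return (expected clicks, z-score, one-sided p-value) under the null
    that the experiment group clicks at the control group's rate."""
    p_null = control_clicks / control_users
    # Step 1: expected clicks in the experiment group under the null.
    expected = p_null * experiment_users
    # Step 2: binomial standard deviation, then distance in standard deviations.
    std = np.sqrt(experiment_users * p_null * (1 - p_null))
    z = (experiment_clicks - expected) / std
    # Step 3: upper-tail probability of the standard normal.
    p_value = stats.norm.sf(z)
    return expected, z, p_value

expected, z, p_value = binomial_z_check(932, 3332, 928, 2996)
print("expected = {:.2f}, z = {:.4f}, p = {:.6f}".format(expected, z, p_value))
```

With the counts from this notebook, the function should reproduce the expected count and p-value shown in the cells above, with a positive z-score.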

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: At first I thought no, because the p-values are different (about 0.0001 here versus about 0.004 from the Welch's t-test). But even though this p-value is lower, both are well below an alpha of 0.05 and both lead to rejecting the null, so yes, it can be said to 'roughly match' the previous stats test.
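One possible reason for the gap between the two p-values, offered as an assumption rather than the lab's explanation: the check above treats the control rate as exactly known, while the t-test accounts for sampling noise in both groups. A pooled two-proportion z-test, sketched below with the counts from this notebook, also accounts for both groups. The variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

control_clicks, control_users = 932, 3332
experiment_clicks, experiment_users = 928, 2996

p_control = control_clicks / control_users
p_experiment = experiment_clicks / experiment_users

# Pooled click rate under the null that both groups share one rate.
p_pooled = (control_clicks + experiment_clicks) / (control_users + experiment_users)

# Standard error of the difference in rates, with noise from both samples.
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / control_users + 1 / experiment_users))

z_pooled = (p_experiment - p_control) / se
p_one_sided = stats.norm.sf(z_pooled)
print("z = {:.4f}, one-sided p = {:.6f}".format(z_pooled, p_one_sided))
```

If the assumption holds, the one-sided p-value from this sketch should land much closer to the Welch result than to the binomial check.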

## Summary

In this lab, you continued to get more practice designing and conducting AB tests. This required additional work preprocessing and formulating the initial problem in a suitable manner. Additionally, you also saw how to verify results, strengthening your knowledge of binomial variables, and reviewing initial statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.