# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice your skills at conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The scenario you'll be investigating uses data collected from the homepage of a music app page for Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
    * Start investigating the id column:
        * How many viewers also clicked?
        * Are there any anomalies with the data; did anyone click who didn't view?
        * Is there any overlap between the control and experiment groups? 
            * If so, how do you plan to account for this in your experimental design?

In [1]:
#Your code here
import pandas as pd
import numpy as np
import scipy.stats as stats
df = pd.read_csv('homepage_actions.csv')
df

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view
...,...,...,...,...
8183,2017-01-18 09:11:41.984113,192060,experiment,view
8184,2017-01-18 09:42:12.844575,755912,experiment,view
8185,2017-01-18 10:01:09.026482,458115,experiment,view
8186,2017-01-18 10:08:51.588469,505451,control,view


In [2]:
#I think this is here because it wasn't loading in right last year
!cat flatiron_stats.py

#flatiron_stats
import numpy as np
import scipy.stats as stats

def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # “ddof = Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, 
    #  where N represents the number of elements. By default ddof is zero.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-val

In [3]:
import flatiron_stats as fs

In [4]:
#Ok, back to the regularly scheduled programming:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   timestamp  8188 non-null   object
 1   id         8188 non-null   int64 
 2   group      8188 non-null   object
 3   action     8188 non-null   object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [5]:
df['timestamp'].min()

'2016-09-24 17:42:27.839496'

In [6]:
df['timestamp'].max()

'2017-01-18 10:24:08.629327'
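
A side note on the two cells above: `df.info()` reports `timestamp` as dtype `object`, so `min()` and `max()` are comparing strings. That happens to give the right answer for this year-first format, but parsing the column into datetimes is safer and also lets you compute how long the experiment ran. Here's a minimal sketch on a hypothetical three-row frame called `toy` (the timestamp strings are copied from outputs in this notebook; everything else is for illustration only):

```python
import pandas as pd

# Hypothetical mini-frame using the same timestamp format as homepage_actions.csv
toy = pd.DataFrame({
    "timestamp": [
        "2016-09-24 17:42:27.839496",
        "2017-01-18 10:24:08.629327",
        "2016-10-05 19:57:39.826306",
    ]
})

# Parse the strings into real datetimes so comparisons and arithmetic are well defined
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

first, last = toy["timestamp"].min(), toy["timestamp"].max()
duration = last - first  # a pandas Timedelta

print(first, last, duration.days)
```

On the real data, the same conversion would be `df['timestamp'] = pd.to_datetime(df['timestamp'])`.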

In [7]:
df['group'].value_counts()

control       4264
experiment    3924
Name: group, dtype: int64

In [8]:
df['action'].value_counts()

view     6328
click    1860
Name: action, dtype: int64

In [9]:
len(df['id'].unique())

6328

I still don't really understand the questions "How many viewers also clicked?" and "Are there any anomalies with the data; did anyone click who didn't view?" because it seems like the 'action' column kind of answers that already. However, in the spirit of learning more, let's do it this way as well:

In [10]:
view_ids = set(df[df['action']=='view']['id'].unique())
click_ids = set(df[df['action']=='click']['id'].unique())
print("Number of viewers: {}, Number of clickers: {}".format(len(view_ids), len(click_ids)))
#Ok, this is the same as the df['action'].value_counts() above

Number of viewers: 6328, Number of clickers: 1860


My notes: There are 6,328 unique people here. They are all 'viewers' in the sense that they *viewed* the homepage. However, 1,860 of them also clicked (something) on the homepage. Those 'clickers' get a second record whose 'action' column value is 'click'. As a result, there are 8,188 records in total: one per viewer plus one more for each of the ones who clicked.

At least, I *think* that's what's going on here. Let's verify that:

In [11]:
len(list(click_ids & view_ids))

1860

In [12]:
list(click_ids & view_ids)

[901121,
 311308,
 491548,
 532513,
 548902,
 507942,
 311339,
 221228,
 557099,
 647214,
 466992,
 540722,
 442421,
 540729,
 426042,
 630845,
 393280,
 737348,
 245830,
 925767,
 589899,
 385103,
 458832,
 213073,
 319578,
 204891,
 614501,
 737385,
 696425,
 639086,
 761973,
 335999,
 811142,
 696456,
 745611,
 860314,
 508060,
 327843,
 860324,
 188580,
 254118,
 786605,
 491695,
 647343,
 467124,
 590004,
 409783,
 352461,
 442574,
 770255,
 704717,
 491729,
 655572,
 565462,
 794838,
 205023,
 532710,
 770282,
 917744,
 221425,
 442609,
 352499,
 483570,
 532734,
 737539,
 581893,
 491810,
 762149,
 336180,
 450877,
 368962,
 442700,
 696656,
 434522,
 729442,
 647524,
 868723,
 328052,
 934268,
 631170,
 917891,
 450946,
 295304,
 450958,
 582033,
 737683,
 795030,
 926105,
 868765,
 262562,
 336304,
 737715,
 278964,
 188853,
 450999,
 451008,
 713154,
 205250,
 909764,
 508360,
 549328,
 385492,
 778714,
 426465,
 360932,
 532966,
 565738,
 795115,
 246250,
 745970,
 279027,
 

In [13]:
#Let's take one and see if it only shows up twice:
df.loc[df['id'] == 901121]

Unnamed: 0,timestamp,id,group,action
796,2016-10-05 19:57:39.826306,901121,experiment,view
797,2016-10-05 19:58:35.479290,901121,experiment,click


Ok great, that's definitely what went on here. One last thing is making sure our control & experiment groups don't overlap:

In [14]:
control_ids = set(df[df['group'] == 'control']['id'].unique())
experiment_ids = set(df[df['group'] == 'experiment']['id'].unique())
print("Number of overlap between experiment & control id numbers: {}".format(len(control_ids&experiment_ids)))

Number of overlap between experiment & control id numbers: 0


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

**My notes:** I'm coming back to this in 2023 after having done it in 2022. It looks like I went a really roundabout way to finally pin down the test last time, so for now, I'm going to ignore that and see if I can do so in a more concise manner.

First, let's think about what we're comparing and which statistical test works best. Our job is to show whether the experiment group clicked at a different rate than the control group. 

- **Null hypothesis:** There is no difference in the view-to-click conversion rates between the control & experiment groups.
- **Alternative hypothesis:** There *is* a difference in the view-to-click conversion rates between the control & experiment groups.

This sounds like a two-tailed, independent test, so I'll use scipy.stats.ttest_ind(*the experimental group as an array*, *the control group as an array*), which is a t-test for independent samples. It'll output the t-value and the p-value. Before I do that, I'll need to make sub-df's out of the main df for the experimental & control groups.
    
However, it can't *just* be a df['control'] and df['experiment'] sub-df, right? The ttest_ind needs two arrays from which it can derive an average and standard deviation. Ah, okay I see why we did what we did before with the 1's in an 'action == view' column and then 0 or 1 in the 'action == click' column. If I can construct a 'sum of actions' column within the experiment & control groups, I think that's what I'd input into scipy.stats.ttest_ind.

In [15]:
df.loc[df.action == 'view', 'clicked'] = 0
df.loc[df.action == 'click', 'clicked'] = 1
df.head()

Unnamed: 0,timestamp,id,group,action,clicked
0,2016-09-24 17:42:27.839496,804196,experiment,view,0.0
1,2016-09-24 19:19:03.542569,434745,experiment,view,0.0
2,2016-09-24 19:36:00.944135,507599,experiment,view,0.0
3,2016-09-24 19:59:02.646620,671993,control,view,0.0
4,2016-09-24 20:26:14.466886,536734,experiment,view,0.0


In [16]:
df.clicked.value_counts()

0.0    6328
1.0    1860
Name: clicked, dtype: int64

In [17]:
#Set up control & experiment groups:
control_group = df[df['group'] == 'control']
experiment_group = df[df['group'] == 'experiment']

In [18]:
#Let's check out their means:
control_group_mean = np.mean(control_group['clicked'])
experiment_group_mean = np.mean(experiment_group['clicked'])
control_group_mean, experiment_group_mean

(0.21857410881801126, 0.23649337410805302)

In [19]:
#Ok, not looking too significant, but what the fuck do I know? Anyway, 
# let's isolate these two groups' respective 'clicked' columns for the ttest:
control_group_clicks = control_group['clicked']
experiment_group_clicks = experiment_group['clicked']

In [20]:
stats.ttest_ind(control_group_clicks, experiment_group_clicks)

Ttest_indResult(statistic=-1.9334751824865355, pvalue=0.05321212418167477)
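
One thing worth knowing about the call above: by default `stats.ttest_ind` runs Student's t-test, which pools the two variances (`equal_var=True`). Passing `equal_var=False` makes it Welch's t-test instead. The sketch below compares the two on simulated 0/1 data; the group sizes, click rates, and seed are all assumptions picked for illustration, not the lab's values:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Simulated 0/1 outcomes; sizes and rates are illustrative assumptions
control = rng.binomial(1, 0.22, size=4000)
experiment = rng.binomial(1, 0.24, size=3800)

# Default: Student's t-test, which pools the two variances
student = stats.ttest_ind(control, experiment)

# equal_var=False switches to Welch's t-test (no equal-variance assumption)
welch = stats.ttest_ind(control, experiment, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

When the two groups have similar sizes and variances, as they do here, the two versions tend to agree closely, so the choice matters more for unbalanced groups.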

**Below is what I did back in 2022**:

#We'll set up a contingency table by first filtering the df for the group and action columns
# only. Then, we'll set up filter A and filter B for the control & experiment groups. 

filtered_df = df[['group', 'action']]
filtered_df

filtered_df_A = filtered_df[filtered_df['group'] == 'control']
filtered_df_B = filtered_df[filtered_df['group'] == 'experiment']

control_clicks = sum(filtered_df_A['action'] == 'click')
experiment_clicks = sum(filtered_df_B['action'] == 'click')
control_clicks, experiment_clicks

control_no_clicks = sum(filtered_df_A['action'] == 'view')
experiment_no_clicks = sum(filtered_df_B['action'] == 'view')
control_no_clicks, experiment_no_clicks

contingency_table = np.array([
                            (control_clicks, control_no_clicks),
                            (experiment_clicks, experiment_no_clicks)
])
contingency_table

stats.chi2_contingency(contingency_table)

It looks like the p-value is above our alpha of 0.05, so we cannot reject the null hypothesis. In other words, we found no significant difference between the control and experiment groups.
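
For reference, the contingency-table steps above can be condensed into a few lines. In this sketch the four counts are not read from the file; they are back-calculated from outputs earlier in this notebook (4,264 control rows with a mean `clicked` of about 0.2186, 3,924 experiment rows with a mean of about 0.2365), so treat them as an assumption. Like the code above, it counts rows, not unique visitors:

```python
import numpy as np
import scipy.stats as stats

# Row counts back-calculated from earlier outputs (an assumption of this sketch)
control_clicks, control_views = 932, 3332
experiment_clicks, experiment_views = 928, 2996

table = np.array([
    [control_clicks, control_views],
    [experiment_clicks, experiment_views],
])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected counts under independence
chi2, p, dof, expected = stats.chi2_contingency(table)

print(chi2, p, dof)
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default; `correction=False` turns it off.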

WELP, after all that, it turns out I ran the wrong kind of test. ⊙﹏⊙∥ 

I suppose I'm really just comparing the conversion rates of clicks between two groups. In other words, I'm comparing two sets of numerical values, so a two-sided Welch's t-test will do it. At least, I think that's the reasoning. I'm still confused because we just worked with similar data (conversion rates) in the previous A/B testing lecture with Greg and we used a chi-squared test there, so...╰（‵□′）╯

In any case, they make a 'count' column where each row gets the same value: 1. This can be used to calculate sums easily. I'll make it here in my filtered_df and then go from there setting up & running a two-sided Welch's t-test. However, I don't need to do that. I already put together the subsets that I need. 

What I should've done is calculate the percentage of views that turn into clicks; that would give me a mean. Specifically, the "average click rate," which means I'm just comparing two means, which means I use Welch's t-test. I think Welch's is also used with control/experiment stuff more, but not 100% sure on that.

control_click_rate = control_clicks/control_no_clicks
experiment_click_rate = experiment_clicks/experiment_no_clicks
control_click_rate, experiment_click_rate

fs.p_value_welch_ttest(control_click_rate, experiment_click_rate)

def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # “ddof = Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, 
    #  where N represents the number of elements. By default ddof is zero.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-value is for a one-sided t-test. 
    Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.
    """
    t = welch_t(a, b)
    df = welch_df(a, b)
    
    p = 1-stats.t.cdf(np.abs(t), df)
    
    if two_sided:
        return 2*p
    else:
        return p

p_value_welch_ttest(control_click_rate, experiment_click_rate)

fs.p_value_welch_ttest(control_clicks, experiment_clicks)

Okay, this won't work because I already calculated a mean - the "average click rate" - and the flatiron_stats function needs an array (or at least some sort of group) of numbers. I do remember that the solution approached the Welch's TTest here differently with a 'count' column they could sum up. 

I tried to do it my way, but I wind up with a preexisting mean the function isn't designed for. Wait, what if I just take the "sum" part out of my control_clicks/no_clicks & experiment_click/no_clicks and then run it?

control_clicks = filtered_df_A['action'] == 'click'
experiment_clicks = filtered_df_B['action'] == 'click'
control_no_clicks = filtered_df_A['action'] == 'view'
experiment_no_clicks = filtered_df_B['action'] == 'view'

fs.p_value_welch_ttest(control_clicks, experiment_clicks)

So that finally won't throw an error, but why is my result different than the solution? It must be in how the solution sets up the test. At this point, I've spent hours trying to figure it out on my own, so let's investigate why the solution is producing a different result:

df['count'] = 1
df.head()

control = df[df.group=='control'].pivot(index='id', columns='action', values='count')
control = control.fillna(value=0)

experiment = df[df.group=='experiment'].pivot(index='id', columns='action', values='count')
experiment = experiment.fillna(value=0)



print("Sample sizes:\tControl: {}\tExperiment: {}".format(len(control), len(experiment)))
print("Total Clicks:\tControl: {}\tExperiment: {}".format(control.click.sum(), experiment.click.sum()))
print("Average click rate:\tControl: {}\tExperiment: {}".format(control.click.mean(), experiment.click.mean()))
control.head()

fs.p_value_welch_ttest(control.click, experiment.click)

Ah, so the binary count column is set up such that all views get a 1, but clicks only get a count if it says "click." Subsequently, it will produce a sample you can average within the welch ttest function. I guess I'm still confused why the output is different because my version of a pivot table (as filtered_df) also produces such an input. Let's explore their pivot table some more:

control.click

#Here's my attempt at the same thing from earlier:
control_clicks

So, I have a boolean thing going as opposed to a numerical column and it's longer, so...in the interest of time, let's keep in mind that when we're compiling subsets and numerical inputs for stats tests, it may be better to convert our values ('view'/'click', etc.) into numerical values with something like this 'count' column and we should get better results. 
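
A likely reason for the mismatch is the unit of analysis and not the boolean dtype. The boolean series has one entry per log row, so everyone who clicked appears twice (once as a view, once as a click), while the pivot has exactly one row per `id`. The sketch below shows the difference on a made-up log of four visitors (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy log: 4 visitors, 2 of whom clicked
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 3, 4],
    "action": ["view", "click", "view", "view", "click", "view"],
})
toy["count"] = 1

# Per-row framing: one entry per log row, so clickers appear twice
per_row = (toy["action"] == "click").astype(int)

# Per-visitor framing: one row per id, 1 if that id has a click row and 0 otherwise
per_visitor = (
    toy.pivot(index="id", columns="action", values="count")
       .fillna(0)["click"]
)

print(len(per_row), per_row.mean())          # 6 rows, 2 of them clicks
print(len(per_visitor), per_visitor.mean())  # 4 visitors, 2 of them clickers
```

The two framings give different means and different sample sizes, so any test fed with them will give different p-values. The group means of about 0.219 and 0.236 computed earlier in this notebook are per-row rates in this sense; a per-visitor rate would divide the same click counts by the number of unique ids in each group.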

**Ok, back to Flatiron's prompts and doing this in 2023:**

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. 

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

## $n\bullet p (1-p)$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
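
The three steps map onto a short chain of formulas. As a sketch, with symbols introduced here for illustration ($n$ for the number of observations in the experiment group, $p_c$ for the control click-through rate, $x$ for the observed number of experiment clicks, and $\Phi$ for the standard normal CDF):

$$
\mu = n\,p_c, \qquad \sigma = \sqrt{n\,p_c\,(1 - p_c)}, \qquad z = \frac{x - \mu}{\sigma}, \qquad p\text{-value} = 1 - \Phi(z)
$$

The last expression is the one-sided (upper-tail) p-value, which fits the question of whether the experimental homepage did better than the control.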

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [21]:
#Your code here
#p_experiment_clicks = sum(experiment_group_clicks)/len(experiment_group_clicks)
control_click_rate = sum(control_group_clicks)/len(control_group_clicks)
expected_experimental_group_clicks = control_click_rate * len(experiment_group_clicks)
print("expected number of clicks for the experiment group:", int(expected_experimental_group_clicks))
print("click-through rate of control group:", round(control_click_rate, 3))

expected number of clicks for the experiment group: 857
click-through rate of control group: 0.219


### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [22]:
#Your code here
n = len(experiment_group_clicks)
#The p for our purposes will be control_click_rate because that's what we used above
p = control_click_rate
variance = n*(p*(1-p))
std = np.sqrt(variance)
round(std, 3)

25.889

In [23]:
#Now we (should so obviously and easily) recall that we measure the number of 
# s.d.'s from a mean with a z-score. z-scores are (x-mu)/s.d.:

#z_score = (expected_experiment_clicks_under_null - control.click.mean())/std
#（＞人＜；）
z_score = (sum(experiment_group_clicks) - expected_experimental_group_clicks)/std
round(z_score, 3)

2.716

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [24]:
#Your code here
#If I'm reading my notes right, all I have to do is this:
stats.norm.sf(z_score)

0.003303067275926571
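
Steps 1 through 3 can also be bundled into one small helper, which makes it easy to rerun the check with a different rate or sample size. This is only a sketch: the function name `binomial_z_check` is made up for this note, and the inputs in the example call are round numbers chosen so the arithmetic is easy to follow, not the lab's values.

```python
import numpy as np
import scipy.stats as stats

def binomial_z_check(control_rate, n_experiment, observed_clicks):
    """Return (expected clicks, z-score, one-sided p-value) under the null
    that the experiment group shares the control click-through rate."""
    expected = control_rate * n_experiment
    std = np.sqrt(n_experiment * control_rate * (1 - control_rate))
    z = (observed_clicks - expected) / std
    p_value = stats.norm.sf(z)  # upper-tail probability, P(Z >= z)
    return expected, z, p_value

# Made-up inputs: 20% control rate, 100 observations, 28 observed clicks
expected, z, p_value = binomial_z_check(control_rate=0.2, n_experiment=100, observed_clicks=28)
print(expected, z, p_value)
```

Here the expected count is 20, the standard deviation is 4, and the z-score is 2. `stats.norm.sf(z)` is the survival function, i.e. `1 - stats.norm.cdf(z)`.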

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: No, my old p-value was 0.05321, which led me to fail to reject the null, so something's wrong. (#｀-_ゝ-) Let's figure it out, I guess.

In [25]:
#Here's what the solution did:
p_value_welch_ttest(control_group_clicks, experiment_group_clicks)

NameError: name 'p_value_welch_ttest' is not defined

In [26]:
#LOL, still not working here in 2023, so i gotta do this:
def welch_t(a, b):
    
    """ Calculate Welch's t statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # “ddof = Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, 
    #  where N represents the number of elements. By default ddof is zero.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. This function returns the degrees of freedom """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator


def p_value_welch_ttest(a, b, two_sided=False):
    """Calculates the p-value for Welch's t-test given two samples.
    By default, the returned p-value is for a one-sided t-test. 
    Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.
    """
    t = welch_t(a, b)
    df = welch_df(a, b)
    
    p = 1-stats.t.cdf(np.abs(t), df)
    
    if two_sided:
        return 2*p
    else:
        return p

In [27]:
#Now this should work, so as I was saying...

#My earlier attempt was a two-sided one because I wanted to see whether the averages were different. 
# However, I probably should've formulated my hypotheses to see whether the experimental group resulted
# in an increase (and therefore a one-tailed test) because, after all, that's probably what the stakeholder
# actually cares about.
my_first_attempt_p_value = p_value_welch_ttest(control_group_clicks, experiment_group_clicks, two_sided=True)

In [28]:
#So, if I were to take my first-attempt p-value and divide it by two to get the one-tailed result, 
# it'd be this:
my_first_attempt_p_value/2

0.026743886922199422

In [29]:
#Subsequently, I would have rejected the null, so now I can say yes: the first attempt's p-value and 
# the verified one are similar enough because they both lead me to reject the null.
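
A caveat on halving: dividing a two-sided p-value by two only gives the one-sided p-value when the observed difference points in the hypothesized direction. Note also that `welch_t` above returns an absolute value, so the helper reports the smaller tail whichever group is ahead. As a cross-check, SciPy can run the one-sided Welch test directly through the `alternative` argument of `stats.ttest_ind` (available in SciPy 1.6 and newer). A minimal sketch on simulated data, where the rates, sizes, and seed are assumptions for illustration:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Simulated 0/1 outcomes with an assumed lift in the experiment group
control = rng.binomial(1, 0.20, size=3000)
experiment = rng.binomial(1, 0.25, size=3000)

# Two-sided Welch test: is there any difference in means?
two_sided = stats.ttest_ind(experiment, control, equal_var=False)

# One-sided Welch test: is mean(experiment) greater than mean(control)?
one_sided = stats.ttest_ind(experiment, control, equal_var=False, alternative="greater")

print(two_sided.pvalue, one_sided.pvalue)
```

With the experiment group ahead, the one-sided p-value comes out as half the two-sided one. If the control group were ahead, the one-sided p-value would be above 0.5 instead.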

## Summary

In this lab, you continued to get more practice designing and conducting A/B tests. This required additional work preprocessing and formulating the initial problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.