# Practice: Statistical Significance

Let's say that we've collected data for a web-based experiment. In the experiment, we're testing the change in layout of a product information page to see if this affects the proportion of people who click on a button to go to the download page. This experiment has been designed to have a cookie-based diversion, and we record two things from each user: which page version they received, and whether or not they accessed the download page during the data recording period. (We aren't keeping track of any other factors in this example, such as number of pageviews, or time between accessing the page and making the download, that might be of further interest.)

Your objective in this notebook is to perform a statistical test on both recorded metrics to see if there is a statistical difference between the two groups.

In [9]:
# import packages

import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats import proportion as proptests

from jupyterthemes import jtplot


import matplotlib.pyplot as plt
%matplotlib inline
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

In [10]:
# import data

data = pd.read_csv('homepage-experiment-data.csv')
data.head(10)

Unnamed: 0,Day,Control Cookies,Control Downloads,Control Licenses,Experiment Cookies,Experiment Downloads,Experiment Licenses
0,1,1764,246,1,1850,339,3
1,2,1541,234,2,1590,281,2
2,3,1457,240,1,1515,274,1
3,4,1587,224,1,1541,284,2
4,5,1606,253,2,1643,292,3
5,6,1681,287,3,1780,299,3
6,7,1534,262,5,1555,276,8
7,8,1798,331,12,1787,326,20
8,9,1478,223,30,1553,298,38
9,10,1461,236,32,1458,289,23


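In the dataset, each row covers one day of the experiment. The 'Control Cookies' and 'Experiment Cookies' columns count the visitors assigned to each group on that day, while the 'Downloads' and 'Licenses' columns count the downloads and license purchases recorded for each group.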

## Checking the Invariant Metric

First of all, we should check that the number of visitors assigned to each group is similar. It's important to check the invariant metrics as a prerequisite so that our inferences on the evaluation metrics are founded on solid ground. If we find that the two groups are imbalanced on the invariant metric, then this will require us to look carefully at how the visitors were split so that any sources of bias are accounted for. It's possible that a statistically significant difference in an invariant metric will require us to revise random assignment procedures and re-do data collection.

In this case, we want to do a two-sided hypothesis test on the proportion of visitors assigned to one of our conditions. Choosing the control or the experimental condition doesn't matter: you'll get the same result either way. Feel free to use whatever method you'd like: we'll highlight two main avenues below.

If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned to each group for the number of total observations, assuming that we have an expected 50/50 split. Do this many times (200 000 repetitions should provide a good speed-variability balance in this case) and then see in how many simulated cases we get as extreme or more extreme a deviation from 50/50 than we actually observed. Don't forget that, since we have a two-sided test, an extreme case also includes values on the opposite side of 50/50. (For example, if simulated outcomes of 0.48 and lower count as at least as extreme as an actual observation of 0.48, then so do simulated outcomes of 0.52 and higher.) The proportion of flagged simulation outcomes gives us a p-value on which to assess our observed proportion. We hope to see a large p-value, meaning there is insufficient evidence to reject the null hypothesis.
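The simulation described above can be sketched as follows. This is a minimal illustration, not the course's reference solution: the helper name `simulate_invariant_p_value` and the fixed seed are assumptions made here, and the counts are the cookie totals computed later in this notebook.

```python
import numpy as np

def simulate_invariant_p_value(n_control, n_obs, n_trials=200_000, seed=0):
    """Two-sided simulated p-value for a 50/50 split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Simulated number of visitors landing in the control group under the null.
    samples = rng.binomial(n_obs, 0.5, size=n_trials)
    # Distance of the observed count from a perfect 50/50 split.
    observed_dev = abs(n_control - n_obs / 2)
    # Flag simulated counts at least that far from 50/50, on either side.
    return np.mean(np.abs(samples - n_obs / 2) >= observed_dev)

# Cookie totals from this notebook: 46851 control out of 94197 overall.
p_value_sim = simulate_invariant_p_value(n_control=46851, n_obs=94197)
print(p_value_sim)
```

Taking the absolute deviation from `n_obs / 2` handles both tails at once, so it does not matter which group's count is passed in.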

If you want to take an analytic approach, you could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. Recall that this is possible thanks to our large sample size and the central limit theorem. To get a precise p-value, you should also perform a continuity correction, either adding or subtracting 0.5 to the observed count before computing the area underneath the curve. (For example, if we had 415 / 850 assigned to the control group, then the normal approximation would take the area to the left of $(415 + 0.5) / 850 = 0.489$ and to the right of $(435 - 0.5) / 850 = 0.511$.)
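Both analytic routes can be sketched side by side. The function name `analytic_invariant_p_values` is an assumption made for this illustration, and doubling the lower tail for the exact test relies on the binomial being symmetric when the null proportion is 0.5.

```python
import numpy as np
from scipy import stats

def analytic_invariant_p_values(n_control, n_obs):
    """Return (exact binomial, normal approximation) two-sided p-values
    for a 50/50 split (hypothetical helper)."""
    # Work with the smaller group so the same code handles either side.
    k = min(n_control, n_obs - n_control)
    # Exact: with p = 0.5 the distribution is symmetric, so double the lower tail.
    p_exact = min(1.0, 2 * stats.binom.cdf(k, n_obs, 0.5))
    # Normal approximation with continuity correction: move the count
    # 0.5 toward the expected value before standardizing.
    sd = np.sqrt(n_obs * 0.5 * 0.5)
    z = ((k + 0.5) - n_obs * 0.5) / sd
    p_normal = min(1.0, 2 * stats.norm.cdf(z))
    return p_exact, p_normal

# The 415 / 850 example from the text, then this notebook's cookie totals.
print(analytic_invariant_p_values(415, 850))
print(analytic_invariant_p_values(46851, 94197))
```

With samples this large the two values should agree closely, which is why the normal approximation is the usual choice.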

You can check your results by completing the workspace below and comparing them against the solution on the following page. You could also try using multiple approaches and seeing if they come up with similar outcomes!

In [11]:
# your work here: feel free to create additional code cells as needed!


In [12]:
totals = data.sum().drop('Day')
totals = totals.rename(index={
    'Control Cookies': 'control_cookies',
    'Control Downloads': 'control_downloads',
    'Control Licenses': 'control_licenses',
    'Experiment Cookies': 'experiment_cookies',
    'Experiment Downloads': 'experiment_downloads',
    'Experiment Licenses': 'experiment_licenses',
})
totals['both_cookies'] = totals['control_cookies'] + totals['experiment_cookies']
totals['both_downloads'] = totals['control_downloads'] + totals['experiment_downloads']
totals['both_licenses'] = totals['control_licenses'] + totals['experiment_licenses']
totals

control_cookies         46851
control_downloads        7554
control_licenses          710
experiment_cookies      47346
experiment_downloads     8548
experiment_licenses       732
both_cookies            94197
both_downloads          16102
both_licenses            1442
dtype: int64

### Analytically

### Number of cookies

Let's find the p-value: the probability, under a $N(0.5, \sigma)$ null distribution, of observing a proportion at least as extreme as the current one.

In [13]:
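n_1 = totals.experiment_cookies  # visitors assigned to the experimental group
n = totals.both_cookies          # all visitors
# Continuity correction of 0.5 on the count. Note: because n_1 is above n / 2,
# subtracting 0.5 (moving toward the expected count) would be the conservative
# choice; adding it makes z marginally larger but does not change the conclusion.
current_frac = (n_1 + 0.5) / n
p = 0.5                           # expected proportion under the null
sigma = np.sqrt(p * (1 - p) / n)  # standard error of the proportion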

In [14]:
z = (current_frac - p) / sigma
z

1.6160810608583913

In [15]:
p_value = 2 * (1 - stats.norm.cdf(z))
p_value

0.10607678842460189

In [16]:
print('The p-value is {}. It has to be less than {} to reject the null hypothesis'.format(
    p_value, 0.05))

The p-value is 0.10607678842460189. It has to be less than 0.05 to reject the null hypothesis


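Since the p-value is greater than 0.05, the null hypothesis is not rejected: there is no evidence that the split of visitors between the two groups deviates from 50/50.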

## Checking the Evaluation Metric

After performing our checks on the invariant metric, we can move on to performing a hypothesis test on the evaluation metric: the click-through rate, measured here as the proportion of cookies that led to a download. In this case, we want to see that the experimental group has a significantly larger click-through rate than the control group, which calls for a one-tailed test.

The simulation approach for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.
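A minimal sketch of this simulation, assuming a hypothetical helper `simulate_rate_p_value` and a fixed seed (neither comes from the course materials), draws both groups from the pooled rate and counts how often the simulated difference reaches the observed one:

```python
import numpy as np

def simulate_rate_p_value(x_0, n_0, x_1, n_1, n_trials=200_000, seed=0):
    """One-tailed simulated p-value that group 1's rate exceeds group 0's
    (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Under the null, both groups share the pooled success rate.
    p_pooled = (x_0 + x_1) / (n_0 + n_1)
    sim_rate_0 = rng.binomial(n_0, p_pooled, size=n_trials) / n_0
    sim_rate_1 = rng.binomial(n_1, p_pooled, size=n_trials) / n_1
    observed_diff = x_1 / n_1 - x_0 / n_0
    # One-tailed: only differences at least as large as the observed one count.
    return np.mean(sim_rate_1 - sim_rate_0 >= observed_diff)

# Download totals from this notebook: control first, then experiment.
p_value_sim = simulate_rate_p_value(7554, 46851, 8548, 47346)
print(p_value_sim)
```

A result of exactly 0 means that no simulated difference reached the observed one, so the p-value is below the simulation's resolution of `1 / n_trials`.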

There are a few analytic approaches possible here, but you'll probably make use of the normal approximation again in these cases. In addition to the pooled click-through rate, you'll need a pooled standard deviation in order to compute a z-score. While there is a continuity correction possible in this case as well, it's much more conservative than the p-value that a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.
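Written out, the pooled z-score without continuity correction takes the following form. The symbols $x_0, x_1$ (number of clicks in the control and experimental groups) and $n_0, n_1$ (number of visitors in each group) are notation introduced here for illustration:

$$
\hat{p} = \frac{x_0 + x_1}{n_0 + n_1}, \qquad
\sigma_{\text{pooled}} = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}, \qquad
z = \frac{x_1 / n_1 - x_0 / n_0}{\sigma_{\text{pooled}}}
$$

For the one-tailed test, the p-value is the area under the standard normal curve to the right of $z$.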

As with the previous question, you'll find a quiz and solution following the workspace for you to check your results.

In [134]:
# your work here: feel free to create additional code cells as needed!


### Analytically

### Download rate

In [17]:
totals

control_cookies         46851
control_downloads        7554
control_licenses          710
experiment_cookies      47346
experiment_downloads     8548
experiment_licenses       732
both_cookies            94197
both_downloads          16102
both_licenses            1442
dtype: int64

In [18]:
download_rate_0 = totals.control_downloads / totals.control_cookies
download_rate_1 = totals.experiment_downloads / totals.experiment_cookies

In [20]:
n_0 = totals.control_cookies
n_1 = totals.experiment_cookies
p_0 = download_rate_0
p_1 = download_rate_1
p_pooled = (totals.control_downloads + totals.experiment_downloads) / (totals.control_cookies + totals.experiment_cookies)
sigma_pooled = np.sqrt(p_pooled * (1 - p_pooled) * ((1 / n_0) + (1 / n_1)))

In [21]:
p_pooled

0.1709396265273841

In [22]:
sigma_pooled

0.0024531940948456393

In [26]:
d = p_1 - p_0
d

0.01930868281829759

In [27]:
z = d / sigma_pooled
z

7.870833726066236

In [25]:
p_value = 1 - stats.norm.cdf(z)
p_value

1.7763568394002505e-15

Since the p-value is far below 0.05, the null hypothesis can be rejected: the experimental group's download rate is significantly higher than the control group's.
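For a z-score this large, `1 - stats.norm.cdf(z)` sits at the edge of floating-point resolution, because `cdf(z)` is almost exactly 1. As a hedged aside, SciPy's survival function `stats.norm.sf` computes the upper tail directly and avoids that subtraction. The comparison at `z = 10` is an illustrative value chosen here, not part of the experiment:

```python
from scipy import stats

z_observed = 7.870833726066236  # z-score computed above for the download rate

# Subtracting from 1 loses precision when cdf(z) is very close to 1.
p_naive = 1 - stats.norm.cdf(z_observed)
# The survival function evaluates the upper tail directly.
p_sf = stats.norm.sf(z_observed)
print(p_naive, p_sf)

# For a larger (illustrative) z, the subtraction collapses to exactly zero
# while the survival function still returns a tiny positive probability.
p_naive_large = 1 - stats.norm.cdf(10.0)
p_sf_large = stats.norm.sf(10.0)
print(p_naive_large, p_sf_large)
```

Either way the conclusion for this test is the same; the difference only matters when very small p-values need to be reported or compared.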

### License purchase rate

In [30]:
data

Unnamed: 0,Day,Control Cookies,Control Downloads,Control Licenses,Experiment Cookies,Experiment Downloads,Experiment Licenses
0,1,1764,246,1,1850,339,3
1,2,1541,234,2,1590,281,2
2,3,1457,240,1,1515,274,1
3,4,1587,224,1,1541,284,2
4,5,1606,253,2,1643,292,3
5,6,1681,287,3,1780,299,3
6,7,1534,262,5,1555,276,8
7,8,1798,331,12,1787,326,20
8,9,1478,223,30,1553,298,38
9,10,1461,236,32,1458,289,23


In [33]:
data[data.Day < 22].shape

(21, 7)

In [34]:
data[data.Day > 8].shape

(21, 7)

In [35]:
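# Cookies come from days 1-21 and purchases from days 9-29: both windows span
# 21 days, with purchases shifted 8 days later (presumably to allow for the
# delay between a visit and a license purchase).
cookies_0 = data[data.Day < 22]['Control Cookies'].sum()
cookies_1 = data[data.Day < 22]['Experiment Cookies'].sum()

purchases_0 = data[data.Day > 8]['Control Licenses'].sum()
purchases_1 = data[data.Day > 8]['Experiment Licenses'].sum()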

In [36]:
n_0 = cookies_0
n_1 = cookies_1
p_0 = purchases_0 / cookies_0
p_1 = purchases_1 / cookies_1
p_pooled = (purchases_0 + purchases_1) / (n_0 + n_1)
sigma_pooled = np.sqrt(p_pooled * (1 - p_pooled) * ((1 / n_0) + (1 / n_1)))

In [37]:
p_0

0.020232241246519345

In [40]:
p_1

0.020094356106936922

In [41]:
p_pooled

0.020162711466165415

In [39]:
sigma_pooled

0.0010772993492029904

In [42]:
d = p_1 - p_0
d

-0.0001378851395824228

In [43]:
z = d / sigma_pooled
z

-0.12799148136906724

In [44]:
p_value = 1 - stats.norm.cdf(z)
p_value

0.550922142761739

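The p-value is well above 0.05, so the null hypothesis cannot be rejected: there is no evidence that the experimental layout increased the license purchase rate.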
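Since the download and license tests repeat the same steps, they could be wrapped in one small function. The sketch below is one possible refactoring, not part of the original solution; the name `pooled_z_test` is hypothetical, and it is checked here against the download totals shown earlier.

```python
import numpy as np
from scipy import stats

def pooled_z_test(x_0, n_0, x_1, n_1):
    """One-tailed pooled z-test that group 1's rate exceeds group 0's
    (hypothetical helper). Returns (z, p_value)."""
    p_0, p_1 = x_0 / n_0, x_1 / n_1
    # Pooled rate and pooled standard deviation under the null hypothesis.
    p_pooled = (x_0 + x_1) / (n_0 + n_1)
    sigma_pooled = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_0 + 1 / n_1))
    z = (p_1 - p_0) / sigma_pooled
    # Upper-tail area of the standard normal distribution.
    return z, stats.norm.sf(z)

# Download totals: control (7554 / 46851) vs. experiment (8548 / 47346).
z_downloads, p_downloads = pooled_z_test(7554, 46851, 8548, 47346)
print(z_downloads, p_downloads)
```

The same call with `purchases_0`, `cookies_0`, `purchases_1`, and `cookies_1` would reproduce the license purchase test.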