# Practice: Statistical Significance

Let's say that we've collected data for a web-based experiment. In the experiment, we're testing the change in layout of a product information page to see if this affects the proportion of people who click on a button to go to the download page. This experiment has been designed to have a cookie-based diversion, and we record two things from each user: which page version they received, and whether or not they accessed the download page during the data recording period. (We aren't keeping track of any other factors in this example, such as number of pageviews, or time between accessing the page and making the download, that might be of further interest.)

Your objective in this notebook is to perform a statistical test on both recorded metrics to see if there is a statistically significant difference between the two groups.

In [1]:
# import packages

import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats import proportion as proptests

import matplotlib.pyplot as plt
%matplotlib inline
%config Completer.use_jedi = False

In [2]:
# import data

data = pd.read_csv('data/statistical_significance_data.csv')
data.head(5)

|   | condition | click |
|---|-----------|-------|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   condition  999 non-null    int64
 1   click      999 non-null    int64
dtypes: int64(2)
memory usage: 15.7 KB


In [4]:
data.describe() # condition looks quite balanced <-- mean of 0.508509

|       | condition | click |
|-------|-----------|-------|
| count | 999.0 | 999.0 |
| mean  | 0.508509 | 0.096096 |
| std   | 0.500178 | 0.294871 |
| min   | 0.0 | 0.0 |
| 25%   | 0.0 | 0.0 |
| 50%   | 1.0 | 0.0 |
| 75%   | 1.0 | 0.0 |
| max   | 1.0 | 1.0 |


In the dataset, the 'condition' column takes a value of 0 for the control group and 1 for the experimental group. The 'click' column takes a value of 0 for no click and 1 for a click.

## Checking the Invariant Metric

First of all, we should check that the number of visitors assigned to each group is similar. It's important to check the invariant metrics as a prerequisite so that our inferences on the evaluation metrics are founded on solid ground. If we find that the two groups are imbalanced on the invariant metric, then this will require us to look carefully at how the visitors were split so that any sources of bias are accounted for. It's possible that a statistically significant difference in an invariant metric will require us to revise random assignment procedures and re-do data collection.

In this case, we want to do a two-sided hypothesis test on the proportion of visitors assigned to one of our conditions. Choosing the control or the experimental condition doesn't matter: you'll get the same result either way. Feel free to use whatever method you'd like: we'll highlight two main avenues below.

If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned to each group for the number of total observations, assuming that we have an expected 50/50 split. Do this many times (200 000 repetitions should provide a good speed-variability balance in this case) and then see in how many simulated cases we get a deviation from 50/50 that is as extreme as or more extreme than the one we actually observed. Don't forget that, since we have a two-sided test, an extreme case also includes values on the opposite side of 50/50. (For example, if simulated outcomes of 0.48 and lower count as at least as extreme as an actual observation of 0.48, then so do simulated outcomes of 0.52 and higher.) The proportion of flagged simulation outcomes gives us a p-value with which to assess our observed proportion. For an invariant metric, we hope to see a large p-value, meaning insufficient evidence to reject the null hypothesis of an even split.

If you want to take an analytic approach, you could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. Recall that this is possible thanks to our large sample size and the central limit theorem. To get a more accurate p-value, you should also perform a continuity correction, adding or subtracting 0.5 from the observed count before computing the area underneath the curve. (For example, if we had 415 / 850 assigned to the control group, then the normal approximation would take the area to the left of $(415 + 0.5) / 850 = 0.489$ and to the right of $(435 - 0.5) / 850 = 0.511$.)
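To make the comparison between the two analytic routes concrete, here is a minimal sketch that computes both p-values for the 415 / 850 example, using only the Python standard library. The function names `exact_two_sided_p` and `normal_two_sided_p` are illustrative helpers, not part of the course material, and both assume a null hypothesis of an exact 50/50 split (which is what makes doubling one tail valid).

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value under a 50/50 null (hypothetical helper)."""
    k_low = min(k, n - k)  # fold onto the lower tail; the p = 0.5 null is symmetric
    lower_tail = sum(comb(n, i) for i in range(k_low + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


def normal_two_sided_p(k: int, n: int) -> float:
    """Normal approximation with a 0.5 continuity correction (hypothetical helper)."""
    k_low = min(k, n - k)
    # mean n/2 and standard deviation sqrt(n * 0.5 * 0.5) under the null
    z = (k_low + 0.5 - n / 2) / sqrt(n * 0.25)
    return min(1.0, 2 * NormalDist().cdf(z))


# the 415 / 850 example from the text
p_exact = exact_two_sided_p(415, 850)
p_normal = normal_two_sided_p(415, 850)
print(f"exact: {p_exact:.4f}, normal approximation: {p_normal:.4f}")
```

With a sample this large, the two values should agree closely, which is why the normal approximation is usually considered good enough here.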

You can check your results by completing the workspace below and comparing against the solution on the following page. You could also try using multiple approaches and seeing if they come up with similar outcomes!

### Analytic approach
For background on the normal distribution approximation, see [this overview](https://revisionmaths.com/advanced-level-maths-revision/statistics/normal-approximations) or [this chapter](https://courses.lumenlearning.com/boundless-statistics/chapter/normal-approximation/).

**Binomial approximation :** if X~B(n,p) and n is large and/or p is close to 1/2, then X is approximately N(np, npq) where q=1-p.

**Continuity correction :** The binomial distribution is discrete, whereas the normal distribution is continuous. Therefore, when the normal distribution is used to approximate a binomial distribution, we need to take this into account using a continuity correction: add or subtract 0.5 from the observed count.

In [22]:
# get number of trials and number of 'successes'
# based on observed outcomes
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]

In [23]:
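# computing a z-score and p-value
p = 0.5
stdev = np.sqrt(n_obs * p * (1-p))

# continuity correction: +0.5 because n_control is below the expected count
z = ((n_control+0.5) - p * n_obs) / stdev
print(z)
# two-sided p-value; doubling the lower tail is valid here because z < 0
print(2 * stats.norm.cdf(z))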

-0.5062175977346661
0.6127039025537114


### Simulation approach

In [20]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]

In [22]:
# simulate outcomes under null
p = 0.5
n_trials = 200_000 # underscores are only digit separators (same as 200000)

# number of 'successes' in each of n_trials simulated experiments
samples = np.random.binomial(n_obs, p, n_trials)

In [27]:
# and compare to the observed outcomes
print(np.logical_or(samples <= n_control, 
                    samples >= (n_obs - n_control)).mean())

0.61188


## Checking the Evaluation Metric

After performing our checks on the invariant metric, we can move on to performing a hypothesis test on the evaluation metric: the click-through rate. In this case, we want to see whether the experimental group has a significantly larger click-through rate than the control group, which calls for a one-tailed test.

The simulation approach for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.

There are a few analytic approaches possible here, but you'll probably make use of the normal approximation again. In addition to the pooled click-through rate, you'll need a pooled standard deviation in order to compute a z-score. While a continuity correction is possible in this case as well, it tends to produce a p-value that is much more conservative than what a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.
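Written out, the pooled quantities described above take the following form. The symbols are our own notation rather than the course's: $x_c$ and $x_e$ are the click counts in the control and experimental groups, $n_c$ and $n_e$ are the group sizes, and $\hat{p}_c = x_c / n_c$ and $\hat{p}_e = x_e / n_e$ are the observed click-through rates. No continuity correction is applied.

$$
\hat{p}_{pool} = \frac{x_c + x_e}{n_c + n_e}, \qquad
SE_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})\left(\frac{1}{n_c} + \frac{1}{n_e}\right)}, \qquad
z = \frac{\hat{p}_e - \hat{p}_c}{SE_{pool}}
$$

For the one-tailed test, the p-value is the area under the standard normal curve to the right of $z$.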

### Analytic approach

In [52]:
clicks = data.groupby('condition').mean()['click']
clicks

condition
0    0.079430
1    0.112205
Name: click, dtype: float64

In [53]:
# get number of trials and overall 'success' rate under null
n_control = data.groupby('condition').size()[0]
n_experiment = data.groupby('condition').size()[1]
p_null = data['click'].mean()

In [56]:
# Compute pooled standard error
se_p = np.sqrt(p_null * (1-p_null) * (1/n_control + 1/n_experiment))

In [60]:
# Z-score and p-value
z = (clicks[1]-clicks[0]) / se_p

print(z)
print(1-stats.norm.cdf(z))

1.7571887396196666
0.039442821974613684


### Simulation approach

In [62]:
# get number of trials and overall 'success' rate under null
n_control = data.groupby('condition').size()[0]
n_experiment = data.groupby('condition').size()[1]
p_null = data['click'].mean()

In [63]:
n_trials = 200_000

ctrl_clicks = np.random.binomial(n_control, p_null, n_trials)
exp_clicks = np.random.binomial(n_experiment, p_null, n_trials)
samples = exp_clicks/n_experiment - ctrl_clicks/n_control

In [64]:
# P-value
print((samples >= (clicks[1]-clicks[0])).mean())

0.04038


---
# Practical significance

**The below notes are derived directly from the course material.**

Even if an experiment result shows a statistically significant difference in an evaluation metric between control and experimental groups, that does not necessarily mean that the experiment was a success. 

### Confidence interval is fully in practical significance region
(Below, $m_0$ indicates the null statistic value, $d_{min}$ the practical significance bound, and the blue line the confidence interval for the observed statistic. We assume that we're looking for a positive change, ignoring the negative equivalent for $d_{min}$)

<img src="https://video.udacity-data.com/topher/2018/September/5bad451e_c03-practicalsignificance-01/c03-practicalsignificance-01.png">

### Confidence interval completely excludes any part of practical significance region

<img src="https://video.udacity-data.com/topher/2018/September/5bad45b9_c03-practicalsignificance-02/c03-practicalsignificance-02.png">

### Confidence interval includes points both inside and outside practical significance bounds

<img src="https://video.udacity-data.com/topher/2018/September/5bad45c7_c03-practicalsignificance-03/c03-practicalsignificance-03.png">

In each of these cases, there is an uncertain possibility of practical significance being achieved. In an ideal world, you would be able to collect more data to reduce your uncertainty, reducing the scenario to one of the previous cases. Outside of this, you'll need to consider the risks carefully in order to make a recommendation on whether or not to follow through with a tested change. Your analysis might also reveal subsets of the population or aspects of the manipulation that do work, which can help refine further studies or experiments.
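As a rough illustration of how the three scenarios could be told apart in code, the sketch below builds a normal-approximation confidence interval for the difference in click-through rates and compares it with a practical significance bound. Several things here are assumptions rather than course material: the helper names, the use of an unpooled standard error for the interval, the 95% level, and especially `D_MIN = 0.02`, which is a made-up bound chosen only for demonstration. The counts (39 clicks from 491 control visitors, 57 from 508 experimental visitors) are those implied by the group sizes and click rates reported in the notebook.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(x_ctrl, n_ctrl, x_exp, n_exp, level=0.95):
    """Normal-approximation CI for p_exp - p_ctrl, using an unpooled standard error."""
    p_ctrl, p_exp = x_ctrl / n_ctrl, x_exp / n_exp
    se = sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_exp * (1 - p_exp) / n_exp)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    diff = p_exp - p_ctrl
    return diff - z_crit * se, diff + z_crit * se


def classify(ci_low, ci_high, d_min):
    """Place an interval relative to d_min, assuming we look for a positive change."""
    if ci_low >= d_min:
        return "practically significant"      # interval fully inside the region
    if ci_high < d_min:
        return "not practically significant"  # interval fully outside the region
    return "uncertain"                        # interval straddles the bound


D_MIN = 0.02  # hypothetical practical significance bound, for illustration only
low, high = diff_confidence_interval(39, 491, 57, 508)
print(f"CI: ({low:.4f}, {high:.4f}) -> {classify(low, high, D_MIN)}")
```

An "uncertain" result corresponds to the third scenario above, where collecting more data or weighing the risks is the next step.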