# Hypothesis tests covered in the course

&nbsp; | Comparing a sample statistic to a hypothesized population value for a… | Testing an association between a binary categorical variable (2 categories) and a… | Testing an association between a categorical variable with 3 or more categories and a…
-- | -- | -- | --
Quantitative Variable | One-sample t-test | Two-sample t-test | ANOVA with Tukey's Range Test
Categorical Variable | Binomial Test | Chi-Square Test | Chi-Square Test


## A/B Test

Baseline Conversion Rate:
Our estimate of the percentage of people who will buy a widget under the current website design, e.g. 16%. Historical data is the usual source for this estimate.
The number can be written as a proportion (0.16) or a percentage (16%).

Minimum Detectable Effect (aka Desired Lift):
The smallest difference that we actually care to measure, expressed as a percentage of the baseline conversion rate.
e.g. Current conversion rate: 6%
Desired conversion rate: 8%
Minimum Detectable Effect: (8-6)/6 = 33%
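
The arithmetic above can be written as a small helper. This is only a sketch; the function name `minimum_detectable_effect` is made up for illustration and is not part of any library.

```python
def minimum_detectable_effect(baseline_rate, desired_rate):
    """Relative lift of desired_rate over baseline_rate, as a fraction."""
    return (desired_rate - baseline_rate) / baseline_rate


# Current conversion rate 6%, desired conversion rate 8%
mde = minimum_detectable_effect(0.06, 0.08)
print(f"Minimum Detectable Effect: {mde:.0%}")
```

Note that the result is a relative change (about a third of the baseline), not the 2 percentage point absolute difference.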

Significance Threshold:
The p-value cutoff below which a result is considered significant. 0.05 (5%) is commonly used: if the p-value is less than 0.05, the null hypothesis (no difference between "A" and "B") is rejected and the difference between the two versions is declared significant. A threshold of 0.05 keeps the chance of a false positive relatively low.
The significance threshold is also the false positive rate (i.e. the probability of finding a significant difference when there isn't one).
There's a tradeoff between the false positive and false negative rates. Most A/B test sample size calculators estimate the sample size needed for a 20% false negative rate, while the data scientist chooses the false positive rate they are comfortable with. The lower the false positive rate, the larger the sample size needed.



## Run a Chi-Square calculation on an A/B Test

In [113]:
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv("a-b-test.csv")
print(data.head())

# Calculate contingency table
ab_contingency = pd.crosstab(data.Web_Version, data.Purchased)
print(ab_contingency)

# Run chi square test
chi2, pval, dof, expected = chi2_contingency(ab_contingency)
print(pval)

# Based on this p-value, we would make a decision about which website design to use. A small p-value would provide evidence that the purchase rates are significantly different for the 2 groups, while a large p-value would suggest no significant difference.

   Index Web_Version Purchased
0      0           A        no
1      1           A        no
2      2           A       yes
3      3           A       yes
4      4           A       yes
Purchased    no  yes
Web_Version         
A            24   26
B            15   35
0.10096676200907678
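
To see where that p-value comes from, the 2x2 case can be reproduced by hand with only the standard library. This is a sketch, not SciPy's implementation: the helper name `chi2_2x2` is hypothetical, it assumes every expected count is positive, and it relies on the fact that with 1 degree of freedom the chi-square tail probability equals `erfc(sqrt(stat / 2))`. The `correction` flag mimics the Yates continuity correction that `chi2_contingency` applies by default to 2x2 tables.

```python
import math


def chi2_2x2(table, correction=True):
    """Chi-square statistic and p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were independent
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if correction:
                # Yates: shrink each |observed - expected| by up to 0.5
                diff -= min(0.5, diff)
            stat += diff ** 2 / expected

    # Survival function of a chi-square distribution with 1 degree of freedom
    pval = math.erfc(math.sqrt(stat / 2))
    return stat, pval


# Counts from the contingency table printed above (rows A, B; columns no, yes)
stat, pval = chi2_2x2([[24, 26], [15, 35]])
print(stat, pval)
```

The p-value printed here should agree with the one from `chi2_contingency` above up to floating-point rounding.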


## Simulating Data for a Chi-Square test

In [114]:
import numpy as np

# Simulate 50% yes and 50% no for our control group (i.e. A test)
sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
# Simulate a lift of 30% for the new design: 0.5 * 1.3 = 0.65
sample_new_design = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])

# Assemble into a data frame

# Create 100 rows, first 50 rows being 'control' and the next 50 being 'new_design'
group = ['control']*50 + ['new_design']*50
# Set outcome, with first 50 rows listing sample_control and next 50 sample_new_design
outcome = list(sample_control) + list(sample_new_design)
# Construct dictionary that looks like this: {'Website': ['control', 'control'], 'Purchased': ['no', 'yes']}
sim_data = {"Website": group, "Purchased": outcome}
# Convert to data frame
sim_data = pd.DataFrame(sim_data)
print(sim_data.head())

   Website Purchased
0  control       yes
1  control        no
2  control       yes
3  control        no
4  control       yes


### Generalize the previous script

In [115]:
# Run this a few times. Results will vary depending on random values selected by np.random.choice

significance_threshold = 0.05
sample_size = 100
lift = .3
control_rate = .5
new_design_rate = (1 + lift) * control_rate

# Simulate data

sample_control = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[control_rate, 1-control_rate])
sample_new_design = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[new_design_rate, 1-new_design_rate])

group = ['control']*int(sample_size/2) + ['new_design']*int(sample_size/2)
outcome = list(sample_control) + list(sample_new_design)
sim_data = {"Website": group, "Purchased": outcome}
sim_data = pd.DataFrame(sim_data)

# run a chi-square test

ab_contingency = pd.crosstab(np.array(sim_data.Website), np.array(sim_data.Purchased))
print('ab_contingency')
print(ab_contingency)
chi2, pval, dof, expected = chi2_contingency(ab_contingency, correction=False)
print("P Value:", pval)

result = ('significant' if pval < significance_threshold else 'not significant')
print(result)

ab_contingency
col_0       no  yes
row_0              
control     23   27
new_design  23   27
P Value: 1.0
not significant


### Run the previous script many times to simulate multiple outcomes

In [131]:
# Here we're going to estimate the proportion of simulated datasets that lead to a 'significant' result...

def proportion_of_significant_results(significance_threshold, sample_size, lift):
    control_rate = .5
    new_design_rate = (1 + lift) * control_rate
    simulations = 100

    results = []

    for i in range(simulations):

        # Simulate data

        sample_control = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[control_rate, 1-control_rate])
        sample_new_design = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[new_design_rate, 1-new_design_rate])

        group = ['control']*int(sample_size/2) + ['new_design']*int(sample_size/2)
        outcome = list(sample_control) + list(sample_new_design)
        sim_data = {"Website": group, "Purchased": outcome}
        sim_data = pd.DataFrame(sim_data)

        # run a chi-square test

        ab_contingency = pd.crosstab(np.array(sim_data.Website), np.array(sim_data.Purchased))

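        # Note: unlike the previous cell, correction=False is not passed to chi2_contingency() here,
        # so SciPy's default (correction=True) applies Yates' continuity correction to the 2x2 table.
        # This makes the test more conservative and does change the result; it's unclear why Codecademy disabled it.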
        chi2, pval, dof, expected = chi2_contingency(ab_contingency)

        result = ('significant' if pval < significance_threshold else 'not significant')
        results.append(result)

    # proportion of significant results (aka power of the test):
    results = np.array(results)
    return np.sum(results == 'significant') / simulations

print(proportion_of_significant_results(0.05, 100, .3))

# The result is usually between 0.2 and 0.36, so chi-square reports a significant difference only 20% to 36% of the time.
# The other 64% to 80% of the time, we conclude that there's no significant difference, even though we're simulating an average lift of ~30%.
# The low significance_threshold keeps the false positive rate low, but the false negative rate is 64% to 80%.
# We can increase the proportion of significant results (aka the power of the test) by increasing the sample size and/or the significance threshold. We'll do both of those later.


0.23


### False positive

Now let's determine the false positive rate by setting lift to 0. Every result should be 'not significant', but let's see what proportion of simulations report 'significant' anyway.

In [136]:
print(proportion_of_significant_results(0.05, 100, 0))

# Result is about 0.05 (i.e. similar to significance_threshold). So ~5% of the time, we'll get a false positive.

0.03
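
A single run of 100 simulations is noisy, so a value like 0.03 is not surprising. One further likely contributor is the Yates continuity correction that `chi2_contingency` applies by default to 2x2 tables, which makes the test more conservative. The sketch below compares false positive rates with and without the correction. It is a standard-library re-implementation rather than the notebook's code: the names `chi2_pvalue_2x2` and `false_positive_rates` are hypothetical, and the seed and simulation count are arbitrary choices.

```python
import math
import random


def chi2_pvalue_2x2(table, correction):
    """P-value of a chi-square test on a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # degenerate table: treat as no evidence of a difference
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if correction:
                diff -= min(0.5, diff)  # Yates continuity correction
            stat += diff ** 2 / expected
    return math.erfc(math.sqrt(stat / 2))


def false_positive_rates(threshold=0.05, per_group=50, rate=0.5,
                         simulations=2000, seed=0):
    """Share of no-lift simulations flagged significant, with and without Yates."""
    rng = random.Random(seed)
    hits_corrected = 0
    hits_uncorrected = 0
    for _ in range(simulations):
        # Both groups share the same purchase rate, so any 'significant' is false
        yes_a = sum(rng.random() < rate for _ in range(per_group))
        yes_b = sum(rng.random() < rate for _ in range(per_group))
        table = [[per_group - yes_a, yes_a], [per_group - yes_b, yes_b]]
        hits_corrected += chi2_pvalue_2x2(table, True) < threshold
        hits_uncorrected += chi2_pvalue_2x2(table, False) < threshold
    return hits_corrected / simulations, hits_uncorrected / simulations


corrected, uncorrected = false_positive_rates()
print("with Yates correction:", corrected)
print("without correction:   ", uncorrected)
```

Because the correction can only shrink the test statistic, the corrected rate can never exceed the uncorrected one on the same simulated data.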


### Power of the test

The power of a test is the probability of detecting a significant result when there really is a difference (i.e. when lift is > 0). It's also called the true positive rate, and it equals 1 minus the false negative rate. Most sample size calculators aim for a power of 80%.

Increasing the power of the test can be done in 2 ways:

* Increasing the sample size increases the power of the test (the probability of detecting a difference if there is one); however, larger sample sizes require more time and resources.

* Increasing the significance threshold also increases the power of the test; however, it simultaneously increases the false positive rate (the probability of detecting a difference when there isn’t one).

We can also choose a larger minimum detectable effect/lift, which decreases the required sample size without decreasing power. However, with a minimum lift of 30% (for example), we may not be able to detect smaller differences that are still meaningful.
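
These relationships can also be checked without simulation. The sketch below uses the usual normal approximation for a two-sided comparison of two proportions; it is an approximation under stated assumptions, not what `chi2_contingency` computes. The function name `approx_power` is hypothetical, groups are assumed to be equal in size, and the formula ignores the tiny chance of a significant result in the wrong direction, so it is only meaningful for a nonzero lift.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(control_rate, lift, per_group, alpha=0.05):
    """Approximate power of a two-sided, two-proportion test (equal group sizes)."""
    p1 = control_rate
    p2 = (1 + lift) * control_rate
    pooled = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference under the null and under the alternative
    se_null = sqrt(2 * pooled * (1 - pooled) / per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / per_group)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


small = approx_power(0.5, 0.3, per_group=50)                 # sample_size = 100
large = approx_power(0.5, 0.3, per_group=250)                # sample_size = 500
looser = approx_power(0.5, 0.3, per_group=250, alpha=0.10)   # higher threshold
print(small, large, looser)
```

Simulated values from `proportion_of_significant_results` will tend to come out somewhat lower than these, since that function keeps the default Yates correction.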

### Smaller lift

Larger sample sizes are needed to detect smaller effect sizes (e.g. a smaller lift).
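
As a rough illustration of how quickly the requirement grows, the sketch below applies the common normal-approximation sample size formula for comparing two proportions. It is not the formula of any particular A/B test calculator; the function name `sample_size_per_group` is hypothetical, and the defaults (5% threshold, 80% power, control rate of 0.5) are assumptions chosen to match the examples in these notes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(control_rate, lift, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-proportion test."""
    p1 = control_rate
    p2 = (1 + lift) * control_rate
    pooled = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Required size per group as the lift we want to detect shrinks
sizes = {lift: sample_size_per_group(0.5, lift) for lift in (0.4, 0.3, 0.2, 0.1)}
for lift, n in sizes.items():
    print(f"lift {lift:.0%}: about {n} per group")
```

The required size scales roughly with the inverse square of the difference between the two rates, so halving the lift roughly quadruples the sample.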

### Increase sample size and significance threshold

In [153]:
# Increase the sample size to 500

print(proportion_of_significant_results(0.05, 500, .3))

# Result is about 0.9 now (i.e. we're reporting a significant result ~90% of the time with a simulated lift of 30%). We increased the power of the test and got over the desired threshold of 80%.


# Also increase the significance threshold to 0.10

print(proportion_of_significant_results(0.10, 500, .3))

# Result is even higher

# Note that we also increased the false positive rate by increasing the significance threshold to 0.10

print(proportion_of_significant_results(0.10, 500, 0))

# Now increase the lift to 40%

print(proportion_of_significant_results(0.10, 500, .4))

# Wowza, now getting between 99% and 100%

0.87
0.99
0.11
1.0
