In [17]:
from scipy import stats
import numpy as np
np.random.seed(12345678)
import statsmodels.api as sm

In [3]:
rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)
stats.ttest_ind(rvs1,rvs2)
stats.ttest_ind(rvs1,rvs2, equal_var = False)


Ttest_indResult(statistic=0.26833823296238857, pvalue=0.7884945274950106)

In [None]:
## Question Three: Hypothesis Testing
This question will test your statistical and reasoning abilities.  You have been asked to analyze the results of a randomized, controlled experiment on a fictitious website and provide a recommendation.  For this experiment, each visitor to the site is randomly exposed to one of four different product images; visitors are then tracked to see which ones make a purchase or not.  Based on the data provided, which version of the image should be selected to be presented to all visitors and why?

| image    | visitors  | purchases |
|----------|-----------|----------|
| A        | 21        | 3        | 
| B        | 180       | 30       | 
| C        | 250       | 50       | 
| D        | 100       | 15       | 

*Bonus Question:* How would your analysis change if the visitors and purchase counts numbered in the millions? 

In [None]:
"""Answering this question requires some assumptions.
1) Define a KPI to measure. Here it is the conversion rate: purchases divided by visitors.
2) It is not clear whether the provided numbers cover the whole population or are a sample.
If the former, there is definitely not enough data to draw repeated samples and compare sample means. A bootstrap could
probably handle that, but it's a different story. If the latter, we can work with the data as given.
3) We assume normality of the distribution, homoscedasticity, and independence of observations.
4) That said, comparing more than two proportions at a time is less direct in Python than in R, where prop.test does it in one call.

"""

#### Here comes the R code
purchases <- c(3,30,50,15)

visitors <- c(21, 180, 250, 100)

pp <- prop.test(purchases, visitors, conf.level = 0.95)

The output is:

data:  purchases out of visitors
X-squared = 1.6991, df = 3, p-value = 0.6371
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.1428571 0.1666667 0.2000000 0.1500000

In [None]:
"""
As you can see, with a p-value of 0.6371 we fail to reject the null hypothesis that all four images convert at the
same rate, so this dataset gives no statistical basis for preferring one image over another.
If we had more data ("millions of visitors and purchases"), we could draw repeated samples, take advantage of the central
limit theorem and use an ANOVA test. Alternatively, one can compare the images pairwise as a cross-check:
A-B, A-C, A-D, B-C, B-D, C-D
Using A-B as an example:
H0: Pa-Pb = 0
H1: Pa-Pb != 0
The estimate is Pa-Pb
Its variance (sigma squared) is [Pa*(1-Pa)]/Na + [Pb*(1-Pb)]/Nb, because the variances of independent estimates add
Build a confidence interval at the level you want; reject H0 if it excludes 0, otherwise fail to reject H0
"""
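
In [None]:
The A-B comparison described above can be sketched by hand. This is a minimal illustration, not a definitive implementation: it uses the unpooled (Wald) interval, in which the variance of the difference is the sum of the two binomial variances. The helper name `diff_ci` and the 95% level are assumptions made for this example.

```python
import math
from scipy import stats

def diff_ci(x_a, n_a, x_b, n_b, level=0.95):
    """Wald confidence interval for the difference of two proportions.

    diff_ci is a hypothetical helper written for this example.
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b
    # Variances of independent estimates add, so the two terms are summed.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value of the standard normal distribution.
    z = stats.norm.ppf(1 - (1 - level) / 2)
    return diff, diff - z * se, diff + z * se

# Image A (3 purchases out of 21) against image B (30 out of 180).
diff, lo, hi = diff_ci(3, 21, 30, 180)
reject_h0 = not (lo <= 0 <= hi)
print(f"Pa-Pb = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}], reject H0: {reject_h0}")
```

For A-B the interval spans zero, so H0 is not rejected. With only 21 visitors for image A the normal approximation is rough, so the interval should be read as indicative only.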

In [2]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

In [3]:
tableA = [('A', 21,3),
         ('B', 180, 30),
         ('C', 250, 50),
         ('D',100,15)]
labels = ['image', 'visitors','purchases']
dfA = pd.DataFrame.from_records(tableA, columns=labels)

In [5]:
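dfA['conversion'] = dfA['purchases']/dfA['visitors']
# Chain both calls on one Styler so the percent format and the hidden index are applied together.
# Styler.hide(axis="index") replaces hide_index(), which was removed in pandas 2.0.
dfA.style.format({'conversion': "{:.2%}"}).hide(axis="index")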
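
In [None]:
The assumptions earlier mention that a bootstrap could handle a dataset this small. A minimal sketch of that idea follows, assuming a percentile bootstrap of each image's conversion rate. The helper name `bootstrap_ci`, the number of resamples, and the seed are choices made for this example, not part of the original analysis.

```python
import numpy as np

# Counts copied from the table in the question.
visitors = {"A": 21, "B": 180, "C": 250, "D": 100}
purchases = {"A": 3, "B": 30, "C": 50, "D": 15}

def bootstrap_ci(n_visitors, n_purchases, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for one conversion rate (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Resampling the 0/1 outcomes with replacement is equivalent to drawing
    # the purchase count from Binomial(n, observed rate).
    counts = rng.binomial(n_visitors, n_purchases / n_visitors, size=n_boot)
    rates = counts / n_visitors
    alpha = (1 - level) / 2
    return np.quantile(rates, [alpha, 1 - alpha])

intervals = {img: bootstrap_ci(visitors[img], purchases[img]) for img in visitors}
for img, (lo, hi) in intervals.items():
    rate = purchases[img] / visitors[img]
    print(f"{img}: {rate:.1%}  [{lo:.1%}, {hi:.1%}]")
```

The interval for image A is much wider than the others because it rests on only 21 visitors, which shows how little the data say about that image.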

In [18]:
from statsmodels.stats.proportion import proportions_ztest

In [None]:
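# Contingency table: one row per image, columns are purchases and non-purchases
stats.chi2_contingency(np.column_stack([dfA['purchases'], dfA['visitors'] - dfA['purchases']]))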

In [None]:
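observed = np.column_stack([dfA['purchases'], dfA['visitors'] - dfA['purchases']])  # purchases vs. non-purchases
chi2_stat, p_val, dof, ex = stats.chi2_contingency(observed)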
print("===Chi2 Stat===")
print(chi2_stat)
print("\n")
print("===Degrees of Freedom===")
print(dof)
print("\n")
print("===P-Value===")
print(p_val)
print("\n")
print("===Expected Frequencies===")
print(ex)
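
In [None]:
For the bonus question, one way to see what sample size does to the chi-squared test is to rerun it on scaled-up counts. The sketch below assumes, purely for illustration, that the conversion rates stay exactly as observed while every count is multiplied by a hypothetical factor of 10,000. The scaled figures are not real data.

```python
import numpy as np
from scipy import stats

# Counts copied from the table in the question.
purchases = np.array([3, 30, 50, 15])
visitors = np.array([21, 180, 250, 100])

def chi2_for_scale(scale):
    """Chi-squared test on the purchases/non-purchases table with all counts multiplied by scale."""
    observed = np.column_stack([purchases, visitors - purchases]) * scale
    chi2_stat, p_val, dof, expected = stats.chi2_contingency(observed)
    return chi2_stat, p_val

scale = 10_000  # hypothetical factor; puts the visitor counts in the millions
chi2_small, p_small = chi2_for_scale(1)
chi2_big, p_big = chi2_for_scale(scale)
print(f"observed counts: chi2 = {chi2_small:.4f}, p = {p_small:.4f}")
print(f"scaled counts:   chi2 = {chi2_big:.1f}, p = {p_big:.3g}")
```

Under this assumption the statistic grows in proportion to the scale factor, so differences that are not significant with 551 visitors would be significant with millions. At that scale the size of the difference in conversion rate is likely to matter more than the p-value.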

In [None]:
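from itertools import combinations
# proportions_ztest handles at most two samples, so compare the images pairwise
for i, j in combinations(range(len(dfA)), 2):
    pair = dfA.iloc[[i, j]]
    stat, pval = proportions_ztest(pair['purchases'], pair['visitors'])
    print('{}-{}: {:0.3f}'.format(*pair['image'], pval))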

In [None]:
"""
https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html
https://www.khanacademy.org/math/statistics-probability/significance-tests-confidence-intervals-two-samples/comparing-two-proportions/v/comparing-population-proportions-1
"""