# Udacity_A_B_testing 

#### Context

At the time of this experiment, Udacity courses currently have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.
In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.

#### Hypothesis
The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

#### Unit of diversion 
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.


In [2]:
#Loading experiment data
import pandas as pd
import numpy as np
from scipy.stats import norm  # norm.ppf is used below to obtain z-scores

In [3]:
control_pd=pd.read_csv('control_group.csv')
exp_pd=pd.read_csv('experiment_group.csv')

In [4]:
control_pd.shape,exp_pd.shape

((37, 5), (37, 5))

In [5]:
control_pd.count()

Date           37
Pageviews      37
Clicks         37
Enrollments    23
Payments       23
dtype: int64

In [6]:
control_pd.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [7]:
exp_pd.count()

Date           37
Pageviews      37
Clicks         37
Enrollments    23
Payments       23
dtype: int64

In [8]:
exp_pd.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 9793 | 832 | 140.0 | 94.0 |


In [10]:
#check sample size
sample_size_control = control_pd.Pageviews.sum()
sample_size_experiment = exp_pd.Pageviews.sum()
sample_size=sample_size_control+sample_size_experiment
print('control group size is',sample_size_control)
print('experiment group size is',sample_size_experiment)
print('total size of 2 groups',sample_size)
#sample_size_control,sample_size_experiment,sample_size

(345543, 344660, 690203)

### Metric Choice

Which of the following metrics would you choose to measure for this experiment and why? For each metric you choose, indicate whether you would use it as an invariant metric or an evaluation metric. The practical significance boundary for each metric, that is, the difference that would have to be observed before that was a meaningful change for the business, is given in parentheses. All practical significance boundaries are given as absolute changes.
Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.

* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
* Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)

You should also decide now what results you will be looking for in order to launch the experiment. Would a change in any one of your evaluation metrics be sufficient? Would you want to see multiple metrics all move or not move at the same time in order to launch? This decision will inform your choices while designing the experiment.

### Invariant metric  

Invariant metrics are metrics that the treatment should not affect, so they should be equivalent in the control and experiment groups throughout the experiment.

The experiment only changes what users see after they click the "Start free trial" button: they are shown a new pop-up window asking how many hours they can devote to the course before proceeding to checkout as before. Therefore, all metrics measured at or before the button click should not be affected by the experiment.

Based on this reasoning, we choose the following metrics as invariant metrics, which we will use for the sanity checks later on.

* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)



### Evaluation metric

In contrast to the invariant metrics, we expect the following metrics to be affected by the treatment and vary between control and treatment group.

* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)

Lastly, we would also expect the number of user-ids (i.e. the number of users who enroll in the free trial; dmin=-50) to decrease. However, the metric is not normalized and would not provide any information we are not already capturing with gross conversion (as the number of clicks will be controlled for). Thus, we will not use it as an evaluation metric.
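
To make the relationship between the three evaluation metrics concrete, here is a minimal sketch that derives them from daily baseline values. The numbers (3,200 clicks, 660 enrollments, retention of 0.53) are the baseline values used later in this notebook; the variable names are chosen only for illustration.

```python
# Baseline daily values, as used in the baseline table of this notebook.
clicks = 3200        # unique cookies clicking "Start free trial"
enrollments = 660    # user-ids enrolling in the free trial
retention = 0.53     # P(payment | enrollment)

# Gross conversion: P(enrollment | click)
gross_conversion = enrollments / clicks

# Net conversion: P(payment | click) = P(enrollment | click) * P(payment | enrollment)
net_conversion = gross_conversion * retention

print(f"gross conversion: {gross_conversion:.5f}")
print(f"net conversion:   {net_conversion:.6f}")
```

Because net conversion is the product of gross conversion and retention, the three metrics are not independent of each other.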

### Hypotheses 
Given the available and selected metrics, we can now specify our hypotheses (GC = gross conversion, R = retention, CN = net conversion). While it could be argued that a one-sided test is appropriate in some cases, we stick with the more conservative two-sided test.


*Hypothesis 1*
* H0 : GC treatment = GC control
* H1 : GC treatment != GC control

*Hypothesis 2*
* H0 : R treatment = R control
* H1 : R treatment != R control

*Hypothesis 3*
* H0 : CN treatment = CN control
* H1 : CN treatment != CN control

Note: later on, we will drop the second hypothesis, as testing it would demand a sample size that requires the test to run unreasonably long.

### Measuring variability in metrics

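For each selected evaluation metric, we make an analytic estimate of its standard deviation, given a sample size of 5,000 cookies visiting the course overview page and the baseline values: https://docs.google.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/edit#gid=0

Note that the code below uses the full daily baseline counts (3,200 clicks and 660 enrollments per 40,000 pageviews) as n, rather than counts scaled down to 5,000 pageviews.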

In [62]:
#Storing baseline data
d = {"Metric Name": ["Cookies", "Clicks", "User-ids", "Click-through-probability", "Gross conversion", "Retention", "Net conversion"], 
     "Estimator": [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
     "dmin": [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]}
md = pd.DataFrame(data=d, index=["C", "CL", "ID", "CTP", "CG", "R", "CN"])
md

| | Metric Name | Estimator | dmin |
|---|---|---|---|
| C | Cookies | 40000.0 | 3000.0 |
| CL | Clicks | 3200.0 | 240.0 |
| ID | User-ids | 660.0 | -50.0 |
| CTP | Click-through-probability | 0.08 | 0.01 |
| CG | Gross conversion | 0.20625 | -0.01 |
| R | Retention | 0.53 | 0.01 |
| CN | Net conversion | 0.109313 | 0.0075 |


### Calculating standard errors
Next, we need to calculate the standard deviation of the sampling distribution of the sample mean (standard error, in short) for each of the evaluation metrics. To be more precise, in this case we calculate the estimated standard errors of the sample proportions as our evaluation metrics are probabilities. The standard error is an estimate of how far the sample proportion is likely to be from the population proportion.

### Assumptions
For Gross Conversion and Net Conversion, the unit of diversion (cookie) is the same as the unit of analysis (the denominator of the metric formula), and we can assume that the metrics follow a binomial distribution, so we can calculate the standard errors analytically (instead of empirically). For Retention, the unit of analysis is the user-id, which differs from the unit of diversion, so the analytic estimate may understate the true variability and an empirical estimate would be preferable if the metric were kept.

Further, as n is relatively large in each case, we can assume that the sampling distribution of a sample proportion approaches a normal distribution (due to the Central Limit Theorem)

### Computing standard errors
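Given the assumptions above, we can approximate the SE as sqrt(p_hat * (1 - p_hat) / n), where p_hat is the baseline probability and n is the number of units in the denominator of the metric.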
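
As a worked example of this formula, the following sketch scales the baseline counts down to a sample of 5,000 pageviews, which is the sample size mentioned in the text above. The helper name `standard_error` and the scaling approach are assumptions made for illustration; the notebook's own cells below use the unscaled daily counts instead.

```python
import math

def standard_error(n, p):
    """Analytic standard error of a sample proportion p with n units."""
    return math.sqrt(p * (1 - p) / n)

pageviews_sample = 5000

# Scale the daily baseline counts (40,000 pageviews, 3,200 clicks, 660 enrollments)
# down to the sample of 5,000 pageviews.
clicks_sample = pageviews_sample * 3200 / 40000       # clicks in the sample
enrollments_sample = pageviews_sample * 660 / 40000   # enrollments in the sample

se_gc = standard_error(clicks_sample, 0.20625)        # gross conversion
se_nc = standard_error(clicks_sample, 0.109313)       # net conversion
se_r = standard_error(enrollments_sample, 0.53)       # retention

print(round(se_gc, 4), round(se_nc, 4), round(se_r, 4))
```

The smaller the denominator, the larger the standard error, which is why retention (based on enrollments only) is much noisier than the two conversion metrics.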

In [64]:
#Create new column to store standard errors
md['SE']=np.nan
# formula to calculate SE
def StandardError (n,p):
    return (p*(1-p)/n)**0.5

In [68]:
#calculating standard errors for evaluation metrics and store them in md
for i in ["CG", "CN"]:
    md.loc[i,'SE']=StandardError(md.loc['CL','Estimator'],md.loc[i,'Estimator'])    
md.loc['R','SE']=StandardError(md.loc['ID','Estimator'],md.loc['R','Estimator'])

In [69]:
md

| | Metric Name | Estimator | dmin | SE |
|---|---|---|---|---|
| C | Cookies | 40000.0 | 3000.0 | NaN |
| CL | Clicks | 3200.0 | 240.0 | NaN |
| ID | User-ids | 660.0 | -50.0 | NaN |
| CTP | Click-through-probability | 0.08 | 0.01 | NaN |
| CG | Gross conversion | 0.20625 | -0.01 | 0.007153 |
| R | Retention | 0.53 | 0.01 | 0.019427 |
| CN | Net conversion | 0.109313 | 0.0075 | 0.005516 |


In [70]:
#storing alpha and beta in a dictionary
error_prob = {"alpha": 0.05, "beta": 0.20}
error_prob

{'alpha': 0.05, 'beta': 0.2}

### Determining experiment sample size

In [73]:
#create new column n_c to store sample sizes
md["n_C"] = np.nan

#define function for approach B
def get_sampleSize (alpha, beta, p, dmin):
    '''Return sample size given alpha, beta, p and dmin'''
    return (pow((norm.ppf(1-alpha/2)*(2*p*(1-p))**0.5+norm.ppf(1-beta)*(p*(1-p)+(p+dmin)*(1-(p+dmin)))**0.5),2))/(pow(dmin,2))

#calculate sample sizes for evaluation metrics with defined adjustments and store results in md
for i in ["CG", "CN"]:
    md.at[i, "n_C"] = round((get_sampleSize(error_prob["alpha"], error_prob["beta"], md.loc[i]["Estimator"], md.loc[i]["dmin"])/md.loc["CTP"]["Estimator"])*2)

md.at["R", "n_C"] = round(((get_sampleSize(error_prob["alpha"], error_prob["beta"], md.loc["R"]["Estimator"], md.loc["R"]["dmin"])/md.loc["CTP"]["Estimator"])/md.loc["CG"]["Estimator"])*2)
md

| | Metric Name | Estimator | dmin | SE | n_C |
|---|---|---|---|---|---|
| C | Cookies | 40000.0 | 3000.0 | NaN | NaN |
| CL | Clicks | 3200.0 | 240.0 | NaN | NaN |
| ID | User-ids | 660.0 | -50.0 | NaN | NaN |
| CTP | Click-through-probability | 0.08 | 0.01 | NaN | NaN |
| CG | Gross conversion | 0.20625 | -0.01 | 0.007153 | 638940.0 |
| R | Retention | 0.53 | 0.01 | 0.019427 | 4737771.0 |
| CN | Net conversion | 0.109313 | 0.0075 | 0.005516 | 685336.0 |


Given our calculations, we would need around 638,940 pageviews (cookies) to test the first hypothesis (given our assumptions on alpha, beta, baseline conversions and dmin). To additionally test the third hypothesis, we would need a total of 685,336 pageviews. And, in case we would like to also test the second hypothesis, we would need a total of around 4,737,771 pageviews.
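
The gross conversion figure can be reproduced step by step. The sketch below is a standard-library restatement of the sample size formula used above, with `statistics.NormalDist` swapped in for `scipy.stats.norm`; the function and variable names are illustrative assumptions, not part of the original notebook.

```python
from statistics import NormalDist

def sample_size_per_group(alpha, beta, p, dmin):
    """Units (here: clicks) needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = NormalDist().inv_cdf(1 - beta)         # quantile for the desired power
    sd_null = (2 * p * (1 - p)) ** 0.5              # spread under H0 (both groups at p)
    sd_alt = (p * (1 - p) + (p + dmin) * (1 - (p + dmin))) ** 0.5  # spread under H1
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / dmin ** 2

# Gross conversion: baseline 0.20625, dmin -0.01, alpha 0.05, beta 0.20
clicks_per_group = sample_size_per_group(0.05, 0.20, 0.20625, -0.01)

# Convert clicks to pageviews via the click-through probability (0.08),
# then double the result because there are two groups.
pageviews_total = clicks_per_group / 0.08 * 2

print(round(clicks_per_group), round(pageviews_total))
```

The division by the click-through probability is what turns a requirement of roughly 25,600 clicks per group into several hundred thousand pageviews in total.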

### Experiment exposure and duration
Now, for each case, we can calculate approximately how many days we would need to run the experiment in order to reach n_C. Following the challenge description, we assume that there are no other experiments we want to run simultaneously. So, theoretically, we could divert 100% of the traffic to our experiment (i.e. about 50% of all visitors would then be in the treatment condition). Given our estimate of about 40,000 unique pageviews per day, this would result in:

In [74]:
#traffic diverted to experiment [0:1]
traffic_diverted = 1

#Days it would take to run experiment for each case
for i, j in zip(["CG", "CN", "R"],["CG", "CG+CN", "CG+CN+R"]):
   print("Days required for",j,":", round(md.loc[i]["n_C"]/(md.loc["C"]["Estimator"]*traffic_diverted),2))

Days required for CG : 15.97
Days required for CG+CN : 17.13
Days required for CG+CN+R : 118.44


We see that we would need to run the experiment for about 119 days in order to test all three hypotheses (and this does not even take into account the 14 additional days of the free trial period that we have to wait before we can evaluate the experiment). Such a duration (especially with 100% of traffic diverted to it) appears to be very risky. First, we cannot perform any other experiment during this period (opportunity costs). Second, if the treatment harms the user experience (frustrated students, inefficient coaching resources) and decreases conversion rates, we will not notice it (or cannot really say so) for more than four months (business risk). Consequently, it seems more reasonable to test only the first and third hypotheses and to discard retention as an evaluation metric, especially since net conversion is the product of retention and gross conversion, so that we might be able to draw inferences about the retention rate from the two remaining evaluation metrics.

So, how much traffic should we divert to the experiment? Given the considerations above, we want the experiment to run relatively fast and for no more than a few weeks. Also, as the nature of the experiment itself does not seem to be very risky (e.g. the treatment doesn't involve a feature that is critical with regard to potential media coverage), we can be confident in diverting a high percentage of traffic to the experiment. Still, since there is always the potential that something goes wrong during implementation, we may not want to divert all of our traffic to it. Hence, 80% (22 days) would seem to be quite reasonable. However, when we look at the data provided by Udacity (loaded at the top of this notebook), we see that it took 37 days to collect 690,203 pageviews, meaning that they most likely diverted somewhere between 45% and 50% of their traffic to the experiment.
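
The two figures in this paragraph follow from simple arithmetic, sketched below. All inputs come from this notebook (40,000 daily pageviews, 685,336 required pageviews, 690,203 observed pageviews over 37 days); the variable names are illustrative.

```python
import math

daily_pageviews = 40000        # baseline unique pageviews per day
required_pageviews = 685336    # n_C for net conversion (covers gross conversion too)
observed_pageviews = 690203    # total pageviews in the provided data
observed_days = 37             # number of days in the provided data

# Days needed when 80% of traffic is diverted (rounded up to whole days)
days_at_80_percent = math.ceil(required_pageviews / (daily_pageviews * 0.80))

# Fraction of traffic implied by the data actually collected
implied_fraction = observed_pageviews / (observed_days * daily_pageviews)

print(days_at_80_percent, round(implied_fraction, 3))
```

The implied fraction assumes that daily traffic during the experiment matched the 40,000 pageview baseline, which the provided data only approximately supports.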

In [75]:
#traffic diverted to experiment
traffic_diverted = 0.47

#Days it would take to run experiment if we use net conversion and gross conversion as evaluation metrics
print("Experiment duration in days, CN+CG: ",round(md.loc["CN"]["n_C"]/(md.loc["C"]["Estimator"]*traffic_diverted),2))

Experiment duration in days, CN+CG:  36.45


### Checking Invariants

Start by checking whether your invariant metrics are equivalent between the two groups. If the invariant metric is a simple count that should be randomly split between the two groups, you can use a binomial test as demonstrated in Lesson 5. Otherwise, you will need to construct a confidence interval for a difference in proportions using a similar strategy as in Lesson 1, then check whether the difference between group values falls within that confidence interval.
If your sanity checks fail, look at the day by day data and see if you can offer any insight into what is causing the problem.

In [98]:
#create empty dataframe to store sanity check results
sanity_check = pd.DataFrame(columns=["CI_left", "CI_right", "obs","passed?"], index=["C", "CL", "CTP"])

#set alpha and p_hat
p = 0.5
alpha = 0.05

#fill dataframe with results from binomial test
#for cookies and clicks do the following
for i,j in zip(["C", "CL"], ["Pageviews", "Clicks"]):
    #calculate the number of successes (n_control) and number of observations (n)
    n = control_pd[j].sum()+exp_pd[j].sum()
    n_control = control_pd[j].sum()
    
    #compute confidence interval
    sanity_check.at[i, "CI_left"] = p-(norm.ppf(1-alpha/2)*StandardError(n,p))
    sanity_check.at[i, "CI_right"] = p+(norm.ppf(1-alpha/2)*StandardError(n,p))
    
    #compute observed fraction of successes
    sanity_check.at[i, "obs"] = round(n_control/(n),4)
    
    #check if the observed fraction of successes lies within the 95% confidence interval
    if sanity_check.at[i, "CI_left"] <= sanity_check.at[i, "obs"] <= sanity_check.at[i, "CI_right"]:
        sanity_check.at[i, "passed?"] = "yes"
    else:
        sanity_check.at[i, "passed?"] = "no"

#return results
sanity_check

| | CI_left | CI_right | obs | passed? |
|---|---|---|---|---|
| C | 0.49882 | 0.50118 | 0.5006 | yes |
| CL | 0.495885 | 0.504115 | 0.5005 | yes |
| CTP | NaN | NaN | NaN | NaN |


In [99]:
#compute CTP for both groups
CTP_control = control_pd["Clicks"].sum()/control_pd["Pageviews"].sum()
CTP_experiment = exp_pd["Clicks"].sum()/exp_pd["Pageviews"].sum()

#compute sample standard deviations for both groups
S_control = (CTP_control*(1-CTP_control))**0.5
S_experiment = (CTP_experiment*(1-CTP_experiment))**0.5

#compute SE_pooled
SE_pooled = (S_control**2/control_pd["Pageviews"].sum()+S_experiment**2/exp_pd["Pageviews"].sum())**0.5

#compute 95% confidence interval and store it in sanity check
alpha = 0.05

sanity_check.at["CTP", "CI_left"] = 0-(norm.ppf(1-alpha/2)*SE_pooled)
sanity_check.at["CTP", "CI_right"] = 0+(norm.ppf(1-alpha/2)*SE_pooled)

#compute observed difference d and store it in sanity check
sanity_check.at["CTP", "obs"] = round(CTP_experiment - CTP_control,4)

#check if sanity check is passed
if sanity_check.at["CTP", "CI_left"] <= sanity_check.at["CTP", "obs"] <= sanity_check.at["CTP", "CI_right"]:
    sanity_check.at["CTP", "passed?"] = "yes"
else:
    sanity_check.at["CTP", "passed?"] = "no"

#return results
sanity_check

| | CI_left | CI_right | obs | passed? |
|---|---|---|---|---|
| C | 0.49882 | 0.50118 | 0.5006 | yes |
| CL | 0.495885 | 0.504115 | 0.5005 | yes |
| CTP | -0.00129566 | 0.00129566 | 0.0001 | yes |


Since the observed difference in click-through probability lies within the confidence interval, this metric passes the sanity check as well. We therefore do not need to examine the day-by-day data any further.

In summary, if a sanity check fails, you should not continue analysing the experiment. There are commonly three things you can do. First, on the technical level, talk to your engineers to find out what went wrong and what makes the invariant metrics differ. Second, do a retrospective analysis of your data and try to debug the problem by slicing the data by different features. Finally, compare the metric in a pre-period with the experiment period. If you see the change in both periods, it may be a backend or infrastructure failure. If you see the change only in the experiment period, something may be wrong with the experiment itself.

There are many things you can investigate when the experiment and control groups differ unexpectedly. The two groups may have different data capture or different filtering, or users may have reset their cookies.

A learning effect can also produce differences between the experiment and control groups, but it evolves over time as users adapt to the change. If you see a sudden change, a learning effect is probably not the cause.

If all invariant metrics pass, you can finally analyse the experiment.
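
Had a sanity check failed, a day-by-day look at the split could help localise the problem. The following is a minimal sketch of such a check for pageviews, using only the first five days shown in the `head()` outputs above; a real check would loop over all 37 days, and the per-day confidence interval approach is an assumption made for illustration.

```python
from statistics import NormalDist

# Daily pageviews for the first five days, copied from the head() outputs above.
control_pageviews = [7723, 9102, 10511, 9871, 10014]
experiment_pageviews = [7716, 9288, 10480, 9867, 9793]

z = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value
daily_ok = []

for day, (c, e) in enumerate(zip(control_pageviews, experiment_pageviews)):
    n = c + e
    margin = z * (0.5 * 0.5 / n) ** 0.5   # margin of error around the expected 0.5
    fraction = c / n                      # observed share of the control group
    passed = abs(fraction - 0.5) <= margin
    daily_ok.append(passed)
    print(f"day {day}: control share = {fraction:.4f}, margin = {margin:.4f}, passed = {passed}")
```

With many days checked at the 95% level, a few failures are expected by chance alone, so isolated failing days should be interpreted with care.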

### Check for Practical and Statistical Significance

Next, for your evaluation metrics, calculate a confidence interval for the difference between the experiment and control groups, and check whether each metric is statistically and/or practically significant. A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business).
If you have chosen multiple evaluation metrics, you will need to decide whether to use the Bonferroni correction. When deciding, keep in mind the results you are looking for in order to launch the experiment. Will the fact that you have multiple metrics make those results more likely to occur by chance than the alpha level of 0.05?

In [100]:
#create dataframe test_results
test_results = pd.DataFrame(columns=["CI_left", "CI_right", "d","stat sig?", "dmin"], index=["CG", "CN"])

#set alpha
alpha = 0.05


#run two proportion z test for both metrics
for i,j in zip(["Enrollments", "Payments"],["CG", "CN"]):
    #compute sample conversion rates
    conv_control = control_pd.iloc[:23][i].sum()/control_pd.iloc[:23]["Clicks"].sum()
    conv_experiment = exp_pd.iloc[:23][i].sum()/exp_pd.iloc[:23]["Clicks"].sum()
    
    #compute observed difference between treatment and control conversion d
    test_results.at[j, "d"] = conv_experiment-conv_control
    
    #compute sample standard deviations
    S_control = (conv_control*(1-conv_control))**0.5
    S_experiment = (conv_experiment*(1-conv_experiment))**0.5
    
    #compute SE_pooled
    SE_pooled = (S_control**2/control_pd.iloc[:23]["Clicks"].sum()+S_experiment**2/exp_pd.iloc[:23]["Clicks"].sum())**0.5
    
    #compute 95% confidence interval around observed difference d
    test_results.at[j, "CI_left"] = test_results.at[j, "d"]-(norm.ppf(1-alpha/2)*SE_pooled)
    test_results.at[j, "CI_right"] = test_results.at[j, "d"]+(norm.ppf(1-alpha/2)*SE_pooled)
    #check statistical significance
    if test_results.at[j, "CI_left"] <= 0 <= test_results.at[j, "CI_right"]:
        test_results.at[j, "stat sig?"] = "no"
    else:
        test_results.at[j, "stat sig?"] = "yes"
    
    #import dmin
    test_results.at[j, "dmin"] = md.loc[j]["dmin"]
test_results

| | CI_left | CI_right | d | stat sig? | dmin |
|---|---|---|---|---|---|
| CG | -0.0291202 | -0.0119896 | -0.0205549 | yes | -0.01 |
| CN | -0.0116042 | 0.00185674 | -0.00487372 | no | 0.0075 |
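
The table reports statistical significance only. The sketch below applies the definitions given at the start of this section to the confidence intervals from the table in order to also classify practical significance. The helper `classify` is hypothetical, and treating both +dmin and -dmin as boundaries is an assumption.

```python
def classify(ci_left, ci_right, dmin):
    """Return (statistically significant, practically significant) for a CI."""
    boundary = abs(dmin)
    # Statistically significant: the interval excludes 0
    stat_sig = not (ci_left <= 0 <= ci_right)
    # Practically significant: also excludes both practical significance boundaries
    boundary_inside = (ci_left <= boundary <= ci_right) or (ci_left <= -boundary <= ci_right)
    prac_sig = stat_sig and not boundary_inside
    return stat_sig, prac_sig

# Confidence intervals and dmin values copied from the results table above
gross_conversion_result = classify(-0.0291202, -0.0119896, -0.01)
net_conversion_result = classify(-0.0116042, 0.00185674, 0.0075)

print("gross conversion:", gross_conversion_result)
print("net conversion:  ", net_conversion_result)
```

Note that the net conversion interval contains -0.0075, so a decrease large enough to matter to the business cannot be ruled out.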


### Run Sign Tests

For each evaluation metric, do a sign test using the day-by-day breakdown. If the sign test does not agree with the confidence interval for the difference, see if you can figure out why.
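
A sign test could be sketched as follows. This example uses only the first five days shown in the `head()` outputs above, so its p-values are purely illustrative; a full analysis would use all 23 days with enrollment and payment data. The helper names are hypothetical, and the two-sided p-value is computed as twice the smaller tail of a Binomial(n, 0.5) distribution.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Two-sided sign test p-value under H0: P(success) = 0.5."""
    lower = sum(comb(n, k) for k in range(0, successes + 1)) / 2 ** n
    upper = sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))

# First five days only, copied from the head() outputs above.
control = {"Clicks": [687, 779, 909, 836, 837],
           "Enrollments": [134, 147, 167, 156, 163],
           "Payments": [70, 70, 95, 105, 64]}
experiment = {"Clicks": [686, 785, 884, 827, 832],
              "Enrollments": [105, 116, 145, 138, 140],
              "Payments": [34, 91, 79, 92, 94]}

def days_experiment_higher(column):
    """Count days on which the experiment's daily rate exceeds the control's."""
    pairs = zip(control[column], control["Clicks"],
                experiment[column], experiment["Clicks"])
    return sum(e / e_clicks > c / c_clicks for c, c_clicks, e, e_clicks in pairs)

n_days = len(control["Clicks"])
gc_wins = days_experiment_higher("Enrollments")   # gross conversion
nc_wins = days_experiment_higher("Payments")      # net conversion
gc_p = sign_test_p_value(gc_wins, n_days)
nc_p = sign_test_p_value(nc_wins, n_days)

print(f"gross conversion: {gc_wins}/{n_days} days higher, p = {gc_p:.4f}")
print(f"net conversion:   {nc_wins}/{n_days} days higher, p = {nc_p:.4f}")
```

With only five days, even a perfectly consistent direction cannot reach significance at the 0.05 level, which is why the full day-by-day data is needed for a meaningful sign test.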


### Make a Recommendation

Finally, make a recommendation. Would you launch this experiment, not launch it, dig deeper, run a follow-up experiment, or is it a judgment call? If you would dig deeper, explain what area you would investigate. If you would run follow-up experiments, briefly describe that experiment. If it is a judgment call, explain what factors would be relevant to the decision.


Gross conversion: the observed gross conversion in the treatment group is around 2.06 percentage points lower than the gross conversion observed in the control group. Further, all values within the confidence interval are negative, so the data are most compatible with a negative effect. Lastly, this effect appears to be practically relevant, as the entire confidence interval lies below dmin = -0.01, the minimum effect size to be considered relevant for the business.

Net conversion: While we cannot reject the null hypothesis for this test, we see that the observed net conversion in the treatment group is around 0.49 percentage points lower than the net conversion observed in the control group. Further, the values that are most reasonably compatible with the data range from -1.16 to 0.19 percentage points.

Given these results, we can assume that the introduction of the "Free Trial Screener" may indeed help to set clearer expectations for students upfront. However, the results are less compatible with the assumption that the decrease in gross conversion is entirely absorbed by an improvement in the overall student experience, and still less compatible with an increase of at least dmin (0.0075) in net conversion, the minimum effect size to be considered relevant for the business. Consequently, assuming that Udacity has a fair interest in increasing revenues, we would recommend not rolling out the "Free Trial Screener" feature.