In [8]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 $('#toggleButton').val('Show Code')
 } else {
 $('div.input').show();
 $('#toggleButton').val('Hide Code')
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" id = "toggleButton" value="Hide Code"></form>''')



In [9]:
import pandas as pd


<h1><center>Lesson 5: Analyzing Results</center></h1>


### 1. Sanity Checks:
    
- Several things can go wrong and invalidate the results, e.g. a filter applied differently in control and experiment, or incorrect data capture.

- Two main types of checks:
    * Population sizing metrics: Check that the experiment and control populations are comparable. If the two group sizes differ, check whether the difference is statistically significant.
    
    * Invariant metrics: Check that the invariant metrics did not change when you ran your experiment.

Population Sizing Example: Consider an experiment run for 7 days
    
    

In [72]:
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun','Total']
cont = [2451,2475,2394,2482,2374,1704,1468,15348]
exp = [2404,2507,2376,2444,2504,1612,1465,15312]


pop_df = pd.DataFrame(
            {'Day': days,
            'Control': cont,
             'Experiment': exp
            })


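def highlight_total(col):
    # Bold every cell in the column that matches the last ('Total') row.
    total = col.iloc[-1]
    return ['font-weight: bold' if v == total else '' for v in col]

# Format the counts with thousands separators for display.
pop_df['Control'] = pop_df['Control'].map('{:,}'.format)
pop_df['Experiment'] = pop_df['Experiment'].map('{:,}'.format)

# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it.
pop_df.style.apply(highlight_total).hide(axis="index")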

Day,Control,Experiment
Mon,2451,2404
Tue,2475,2507
Wed,2394,2376
Thu,2482,2444
Fri,2374,2504
Sat,1704,1612
Sun,1468,1465
Total,15348,15312


The margin of error comes from a binomial distribution with success probability $p = 0.5$ (the probability of being assigned to the control group). Because N is large, the normal approximation applies. <br>

<b>Margin of Error </b>

m = Z x S.E.


&nbsp; = Z x $\sqrt{\frac{p(1-p)}{N}}$

&nbsp; = 1.96 x $\sqrt{\frac{0.5(1-0.5)}{30660}}$

&nbsp; = 0.00559680274

<b>Confidence Interval</b>

C.I. = 0.5 $\pm$ 0.00559680274 <br>

&nbsp; = 0.4944 to 0.5056

<b>Check whether the observed fraction falls within this interval:</b>

$\hat{p} = \frac{X}{N}$ <br>
$\hat{p} = \frac{15348}{15348 + 15312}$ = 0.5006

Since 0.5006 is within the C.I., the experiment passes this sanity check.
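
The population sizing check above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the totals from the table; the variable names are illustrative and not part of the original notebook.

```python
import math

# Totals from the population sizing table
n_control = 15348
n_experiment = 15312
n_total = n_control + n_experiment

p_expected = 0.5  # each unit is assumed equally likely to land in either group
z = 1.96          # z-score for a two-sided 95% confidence level

# Standard error and margin of error under the normal approximation
standard_error = math.sqrt(p_expected * (1 - p_expected) / n_total)
margin = z * standard_error
ci_low, ci_high = p_expected - margin, p_expected + margin

# Observed fraction of units assigned to the control group
p_observed = n_control / n_total
passes = ci_low <= p_observed <= ci_high

print(f"margin of error:   {margin:.4f}")
print(f"95% C.I.:          [{ci_low:.4f}, {ci_high:.4f}]")
print(f"observed fraction: {p_observed:.4f} (passes: {passes})")
```

The same check can be applied per day by swapping in the daily counts, which helps locate the day on which a sizing problem started.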




### 2. Single Evaluation Metric

* The goal is to make a business decision about whether the experiment has favorably impacted your metrics.
* Analytically, this means deciding whether the experiment produced a statistically significant result.

#### Simpson's Paradox

> <i>A trend appears in several different groups of data but disappears or reverses when these groups are combined.</i>


In [82]:
labels = ['New Users','Experienced Users','Total']
n_cont = ['150,000','100,000','250,000']
n_exp = ['75,000','175,000','250,000']
X_cont_ctr = ['30,000 (20%)','1,000 (1%)','31,000 (12.4%)']
X_exp_ctr = ['18,750 (25%)','3,500 (2%)','22,250 (8.9%)']


sp_df = pd.DataFrame(
            {'':labels,
            'N_Cont': n_cont,
            'X_Cont (CTR)': X_cont_ctr,
             'N_Exp': n_exp,
             'X_Exp (CTR)':X_exp_ctr
            })


In [84]:
# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it.
sp_df.style.hide(axis="index")

Unnamed: 0,N_Cont,X_Cont (CTR),N_Exp,X_Exp (CTR)
New Users,150000,"30,000 (20%)",75000,"18,750 (25%)"
Experienced Users,100000,"1,000 (1%)",175000,"3,500 (2%)"
Total,250000,"31,000 (12.4%)",250000,"22,250 (8.9%)"


<li> In this example, the overall CTR is higher in the control group, but within each user segment it is higher in the experiment group.

<li> This happens because the control group has more pageviews from new users, who click at a much higher rate. It indicates that something is wrong with the setup, or that the change affects new and experienced users differently.
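
The reversal can be verified directly from the counts in the table. The sketch below uses plain Python; the dictionary layout and helper names are illustrative assumptions, not part of the original notebook.

```python
# (pageviews, clicks) per user segment, copied from the table above
control = {"New Users": (150_000, 30_000), "Experienced Users": (100_000, 1_000)}
experiment = {"New Users": (75_000, 18_750), "Experienced Users": (175_000, 3_500)}


def ctr(pageviews, clicks):
    """Click-through rate for a single segment."""
    return clicks / pageviews


def pooled_ctr(group):
    """CTR after combining all segments of one group."""
    total_views = sum(views for views, _ in group.values())
    total_clicks = sum(clicks for _, clicks in group.values())
    return total_clicks / total_views


# Within every segment the experiment has the higher CTR ...
exp_wins_every_segment = all(
    ctr(*experiment[segment]) > ctr(*control[segment]) for segment in control
)

# ... but the pooled comparison points the other way.
exp_wins_pooled = pooled_ctr(experiment) > pooled_ctr(control)

for segment in control:
    print(f"{segment}: control {ctr(*control[segment]):.1%}, "
          f"experiment {ctr(*experiment[segment]):.1%}")
print(f"Total: control {pooled_ctr(control):.1%}, "
      f"experiment {pooled_ctr(experiment):.1%}")
```

The pooled CTR is a weighted average of the segment CTRs, so the different mix of new and experienced users in each group is what drives the reversal.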
    
   


### 3. Multiple Metrics

<li> The more things you test, the more likely you are to see significant differences by chance. See the <a href = 'https://en.wikipedia.org/wiki/Multiple_comparisons_problem'> Multiple comparisons problem </a>.


Experiment: Prompt students to contact their coach more frequently. <br>
Metrics:
<li> Probability that a student signs up for coaching.
<li> How early a student signs up for coaching.
<li> Average price paid per student.
    
If Audacity tracks all 3 metrics and runs 3 separate significance tests (alpha = 0.05), what is the probability that at least 1 metric will show a significant difference when there is no true difference? <br>
    
P(0 False Positives) = 0.95 x 0.95 x 0.95 = 0.857 <br>    
P(at least 1 FP) = 1 - P(0 False Positives) = 0.143  <br>
    
<li> This calculation assumes the metrics are independent. If they are positively correlated, as related metrics often are, P(at least 1 FP) is an overestimate.
    
<li> Using a higher confidence level for each individual test can offset this.
    
<li> The Bonferroni correction is used more frequently. It is a very simple method, but there are many others, including the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. The article on multiple comparisons contains more information, and the literature on the false discovery rate (FDR) covers methods for controlling that rate instead of the familywise error rate (FWER).
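
The false positive calculation above, together with a Bonferroni-adjusted version, can be sketched as follows. The notes do not spell out the Bonferroni formula; the code assumes its standard form, which divides the overall alpha by the number of tests. Variable names are illustrative.

```python
alpha_overall = 0.05  # significance level used in the notes
n_metrics = 3         # coaching sign-up probability, sign-up timing, average price

# Probability of at least one false positive across independent tests
p_no_false_positive = (1 - alpha_overall) ** n_metrics
p_at_least_one_fp = 1 - p_no_false_positive

# Bonferroni correction (standard form, assumed here): test each metric
# at alpha_overall / n_metrics so the familywise rate stays below alpha_overall
alpha_individual = alpha_overall / n_metrics
fwer_bonferroni = 1 - (1 - alpha_individual) ** n_metrics

print(f"P(at least 1 FP), uncorrected:   {p_at_least_one_fp:.3f}")
print(f"per-test alpha after Bonferroni: {alpha_individual:.4f}")
print(f"P(at least 1 FP), Bonferroni:    {fwer_bonferroni:.4f}")
```

Bonferroni is conservative: it keeps the familywise error rate at or below the target regardless of how the metrics are correlated, at the cost of reduced power for each individual test.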


Conclusion: What results do and don't tell you

- A statistically significant result means the change is unlikely to have had zero impact on user experience. But do you want to launch the change?

- If statistical significance is seen in some metrics, the decision depends on the size of the change.

- If statistical significance is seen in some slices of the data, ask:
    - Are these different users?
    - Is the effect seen elsewhere?
    - Is there a bug?

Lessons Learned
- Check for invariance
- Check that the experiment metrics look sane
- Look not just for statistical significance but also for business sense
- Consider engineering costs, product costs, and opportunity cost relative to the rewards from the change
- Now is a good time to test for incidental impact
- Try running a few different experiment variations before launch


