This notebook aims to replicate the materials in the article [Selection Bias in Online Experimentation from Airbnb](https://medium.com/airbnb-engineering/selection-bias-in-online-experimentation-c3d67795cceb) (retrieved April 15, 2018). 

The article is not complicated, though it takes some time to get used to. Overall, it is a nice introduction to the kind of statistics that a data scientist is expected to understand at an Internet company. 

The article deals with statistical significance and hypothesis testing, and touches on multiple comparisons. 

# The context
The team at Airbnb runs a series of A/B tests sequentially. Out of these tests, they select the experiments that have statistically significant results (there are 6 of them). They are interested in the aggregate impact of these 6 experiments. There are 2 ways to estimate it:
- A bottom-up approach: using the data from the sequential experiments, they estimate the impact of each experiment separately and add the estimates up
- An aggregate approach using a split / hold-out group: I assume that they combine these 6 experiments into one and run another A/B test. 

<img src="https://cdn-images-1.medium.com/max/1600/1*RHH0aobieNL3VA3LeMdJxA@2x.png" width="70%">

The result shows an apparent discrepancy between the two estimates: the bottom-up approach gives 7.2% while the split / hold-out approach gives 4%. While many confounding factors are at play, such as the variance of the experiment results and seasonal effects, the authors from Airbnb use this example to illustrate the selection bias in this experimental setting.

# The bias
By selecting these 6 experiments out of all the experiments performed, we have incurred a selection bias: these experiments were selected precisely because of their good results, even though those results may partly be due to chance. Therefore, our value of 7.2% (the sum of the values from these 6 experiments) is an over-estimate of the aggregate impact of these __6 experiments__. 

Whether these 6 experiments should have been selected is another issue altogether.

# Mathematical explanation
Suppose we run $n$ experiments independently, and the result $X_i$ of each follows a normal distribution $N(a_i, \sigma_i^2)$ for $i = 1, 2, \dots, n$ (for example, $n = 10$ and $\sigma_i = 1\%$). We then select only those that are statistically significantly greater than 0: $A = \{ i \mid \frac{X_i}{\sigma_i} > t_i \}$, where $t_i$ is the critical value of the individual experiment. If we use the same level of significance for all experiments, then all $t_i$ are the same, say the familiar 1.96 for a 2.5% significance level in a one-sided test. We take the sum of these values to get an estimate of the aggregate effect:
$$S_A = \sum_{i \in A} X_i$$

We are interested in knowing whether this sum is actually a good estimate of the aggregate impact of these features, ie: $$T_A = \sum_{i \in A} a_i$$ where $a_i$ is the real, unobserved value.
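
As a minimal sketch of the selection rule and the two sums, here is a single simulated round of experiments. The true effects `a`, the common standard error `sigma = 0.01` and `n = 10` are illustrative assumptions of mine, not numbers from the Airbnb article.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10         # number of experiments (illustrative)
sigma = 0.01   # standard error of each estimate (assumed to be 1% for all)
t = 1.96       # common critical value t_i

# Hypothetical true effects a_i; in reality these are unobserved.
a = rng.uniform(0.0, 0.02, size=n)

# One observed result X_i per experiment.
x = rng.normal(loc=a, scale=sigma)

# A: the experiments whose standardized result clears the critical value.
selected = x / sigma > t

s_a = x[selected].sum()   # S_A, what we can compute from the data
t_a = a[selected].sum()   # T_A, what we actually want to know

print("selected experiments:", np.flatnonzero(selected))
print("S_A =", s_a, "T_A =", t_a)
```

Re-running with a different seed changes which experiments land in `selected`, which is a reminder that $A$ depends on the data.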

Note that $A$ itself is a random variable and has its own distribution. Nevertheless, we can attempt to answer the question above by comparing $E_X[S_A]$ and $E_X[T_A]$. We can actually show that $E_X[S_A] \geq E_X[T_A]$ by showing $E_X[S_A-T_A]\geq0$.
Here I omit the subscript X for clarity:

$$E[S_A-T_A]=E[\sum_{i \in A}(X_i-a_i)] = E[\sum_{i=1}^n I(i\in A)\:(X_i-a_i)]$$
$$=\sum_{i=1}^n E[I(i\in A)\:(X_i-a_i)]=\sum_{i=1}^n E[I(\frac{X_i}{\sigma_i} > t_i)\:(X_i-a_i)]$$

We are making progress by breaking the sum into $n$ smaller chunks, ie: we can treat each experiment separately. I was intimidated by the indicator function $I(\frac{X_i}{\sigma_i} > t_i)$, but we can re-write this in a more intuitive form:
$$E[I(X_i> \sigma_i * t_i)\:(X_i-a_i)] = E[I(X_i-a_i> \sigma_i * t_i-a_i)\:(X_i-a_i)]$$

The explanation in the article is that each summand (above) is "the mean of lower truncated mean-zero distribution" and is positive, therefore the sum is positive, QED. Personally, I don't understand the jargon here, so what I did was to re-write $X_i-a_i$ as $Y_i$ and $\sigma_i * t_i-a_i$ as $something$, then the summand becomes:
$$E[I(Y_i > something)\:Y_i]$$

Recall that $E[Y_i]=E[X_i]-a_i=0$, so $Y_i$ has mean zero. Now the quote makes sense:
- If you take the average of $Y_i$ across all possible values, you get 0, ie: mean-zero distribution
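- If you only keep the values greater than a floor (in this case the value in $something$; this is the "lower truncated" part), then the average of these values is definitely greater than 0. The summand is this average multiplied by the probability $P(Y_i > something)$ of clearing the floor, so it is positive as well. 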
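
The two bullets above can be checked numerically. This is a rough Monte Carlo sketch, assuming $Y_i$ is normal with a standard deviation of 1%; the helper name `partial_mean` and the three floors are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 0.01
y = rng.normal(0.0, sigma, size=1_000_000)   # draws of the mean-zero Y_i


def partial_mean(y, floor):
    """Monte Carlo estimate of E[I(Y > floor) * Y]."""
    # Values at or below the floor contribute 0, the rest contribute Y.
    return np.mean(np.where(y > floor, y, 0.0))


# No floor at all: this is just the mean of Y, which should be close to 0.
print("no floor:", partial_mean(y, -np.inf))

# A floor below, at, and above the mean: each estimate should be positive.
for floor in (-0.02, 0.0, 0.02):
    print("floor", floor, "->", partial_mean(y, floor))
```

Whether the floor sits below or above zero, cutting off the lower part of a mean-zero distribution removes more negative mass than positive mass, which is why the summand cannot be negative.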

Now that we have reduced the expression into this form, it appears simple enough that there should be a formula to calculate this explicitly. Indeed, in the article, the Airbnb team calculate this term as an adjustment to their estimates.
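
As a sketch of what such a formula looks like under the normality assumption used here (this derivation is my own and is not copied from the article), write $c = \sigma_i t_i - a_i$ for the floor and let $\phi$ be the standard normal density. Since $Y_i \sim N(0, \sigma_i^2)$:

$$E[I(Y_i > c)\:Y_i] = \int_c^{\infty} y \: \frac{1}{\sigma_i}\phi\left(\frac{y}{\sigma_i}\right) dy = \sigma_i \: \phi\left(\frac{c}{\sigma_i}\right) = \sigma_i \: \phi\left(t_i - \frac{a_i}{\sigma_i}\right)$$

Summing over the experiments gives
$$E[S_A - T_A] = \sum_{i=1}^n \sigma_i \: \phi\left(t_i - \frac{a_i}{\sigma_i}\right) > 0$$

Note that this expression depends on the unobserved $a_i$, so using it as an adjustment in practice means plugging in an estimate of $a_i$, which is a further approximation.
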
# Illustration using Python
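
Below is a minimal simulation sketch of the selection bias, not a reproduction of Airbnb's analysis. The true effects, the 1% standard error and the number of simulated rounds are all assumptions of mine. For comparison it also evaluates $\sum_i \sigma_i \phi(t_i - a_i / \sigma_i)$, which is the standard closed form for $E[I(Y_i > c)\:Y_i]$ when $Y_i$ is a mean-zero normal variable and $c = \sigma_i t_i - a_i$.

```python
import numpy as np

rng = np.random.default_rng(42)

n_experiments = 10
n_sims = 200_000   # number of simulated rounds of experiments
sigma = 0.01       # assumed standard error, same for every experiment
t_crit = 1.96      # common critical value

# Hypothetical true effects a_i, spread between 0% and 2%.
true_effects = np.linspace(0.0, 0.02, n_experiments)

# Rows are simulated rounds, columns are experiments.
x = rng.normal(true_effects, sigma, size=(n_sims, n_experiments))
selected = x / sigma > t_crit

# S_A and T_A for each round: sum only over the selected experiments.
s_a = np.where(selected, x, 0.0).sum(axis=1)
t_a = np.where(selected, true_effects, 0.0).sum(axis=1)

simulated_bias = (s_a - t_a).mean()


def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)


closed_form_bias = (sigma * std_normal_pdf(t_crit - true_effects / sigma)).sum()

print("average S_A:", s_a.mean())
print("average T_A:", t_a.mean())
print("simulated bias E[S_A - T_A]:", simulated_bias)
print("closed-form bias:", closed_form_bias)
```

The average of `s_a` should come out above the average of `t_a`, and the simulated gap should sit close to the closed-form value. Changing `true_effects` or `t_crit` is an easy way to see how the size of the bias depends on how close the true effects are to the selection threshold.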

# Extra materials
If you are interested in understanding more about these concepts, I suggest looking up materials on multiple comparisons. I find this to be a comprehensive resource:
http://www.biostathandbook.com/multiplecomparisons.html

