# Design of Experiments

There are two main types of DOE:
- Quasi-experimental design: **no randomized assignment** of test and control groups
- Non-quasi (true) experimental design: **randomized assignment** of test and control groups

## Basics of A/B Testing

A/B testing falls under non-quasi experimental design.


* What is an A/B Test?
    - A/B testing is the process of randomly assigning people to two (or more) groups, each shown a different version, in order to observe and select the version that drives the most business impact for the organization.


* Though there are tools available, what are the disadvantages of using them?
    - Integration issue: If the webpage is tagged with one vendor but the A/B solution comes from a different vendor, integrating the two becomes complicated.
    - Latency issue: The page may slow down or flicker, because JavaScript is needed at both the client and server ends to track the data.
    - Limited ability to test changes: Stakeholders are restricted to the changes the tool supports and cannot easily request something specific.
    - Privacy issue: Due to GDPR and other laws, it's not easy to transfer all the data into the tool.

## Sampling

Because we won't be using the whole population, the conversion rates we get will inevitably be only estimates of the true rates.

The number of people (or user sessions) we decide to capture in each group will have an effect on the precision of our estimated conversion rates: **the larger the sample size**, the more precise our estimates (i.e. the smaller our confidence intervals), **the higher the chance to detect a difference in the two groups, if present**. On the other hand, the larger our sample gets, the more expensive (and impractical) our study becomes.

The sample size we need is estimated through power analysis. The **main purpose of power analysis is to determine the smallest sample size that is suitable to detect the effect of a given test at the desired level of significance**. It depends on a few factors:

- **Power of the test (1-β)** - The probability of finding a statistically significant difference between the groups in our test when a difference is actually present. By convention this is usually set to 0.8.
- **Significance level (α)** - The probability of a false positive that we are willing to accept, i.e. the threshold the p-value is compared against. Here we set it to 0.05.
- **Effect size** - How big a difference we expect there to be between the conversion rates.

Since we want to detect a difference of 2 percentage points, we can use conversion rates of 13% and 15% to calculate the effect size.
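
For two proportions, a common standardized effect size is Cohen's h, which applies an arcsine transformation to each rate. The sketch below computes it by hand with the standard library; the helper name `cohens_h` is hypothetical, and statsmodels' `proportion_effectsize` is understood to apply the same transformation.

```python
from math import asin, sqrt

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of the arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Expected treatment rate 15% versus baseline rate 13%
effect_size = cohens_h(0.15, 0.13)
print(f"Cohen's h for 15% vs 13%: {effect_size:.4f}")
```

A smaller effect size than the 0.1 used in the sample-size cell further down would lead to a larger required sample.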

In Python: https://www.statsmodels.org/dev/generated/statsmodels.stats.power.NormalIndPower.solve_power.html

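By hand (a rule of thumb that holds for α = 0.05 and power = 0.8):

**n = (16*σ²) / 𝛿²**

- n: number of samples needed in each group.
- σ²: variance of the metric, estimated from the sample.
- 𝛿: the difference between the treatment and control groups that we want to detect, in the same units as σ (for conversion rates, percentage points).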
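
As a rough worked example of the by-hand formula, the sketch below shows where the constant 16 comes from and plugs in numbers. It assumes a baseline conversion rate of 13%, the Bernoulli variance p(1 - p) for σ², and a difference of 0.02 expressed as a proportion; these inputs are illustrative assumptions, not values fixed by the formula.

```python
from statistics import NormalDist

z = NormalDist()

# The constant comes from 2 * (z_{1 - alpha/2} + z_{power})^2
# with alpha = 0.05 and power = 0.8; it is conventionally rounded up to 16.
coefficient = 2 * (z.inv_cdf(0.975) + z.inv_cdf(0.8)) ** 2

p = 0.13                  # assumed baseline conversion rate
variance = p * (1 - p)    # Bernoulli variance (assumption for a conversion metric)
delta = 0.02              # difference to detect, as a proportion

n_rule_of_thumb = 16 * variance / delta ** 2
print(f"coefficient = {coefficient:.2f}")
print(f"n per group = {round(n_rule_of_thumb)}")
```

The result is an approximation per group; a full power analysis with the same inputs can give a somewhat different number.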



***For Understanding:***

Here is the relationship between each of these three parameters and the required sample size:

- Significance Level decreases → Larger Sample Size
- Statistical Power increases → Larger Sample Size
- The Minimum Detectable Effect decreases → Larger Sample Size


- β = probability of a Type II error, known as a "false negative"
- 1 − β = probability of a "true positive", i.e., correctly rejecting the null hypothesis. "1 − β" is also known as the power of the test.
- α = probability of a Type I error, known as a "false positive"
- 1 − α = probability of a "true negative", i.e., correctly not rejecting the null hypothesis
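
The three relationships listed above can be checked numerically. This is a minimal sketch using the closed-form normal approximation for a two-sided, two-sample test, which ignores the negligible opposite-tail term; the function name `required_n_approx` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n_approx(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided, two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

baseline = required_n_approx(0.1)                    # alpha = 0.05, power = 0.8
stricter_alpha = required_n_approx(0.1, alpha=0.01)  # significance level decreases
higher_power = required_n_approx(0.1, power=0.9)     # power increases
smaller_effect = required_n_approx(0.05)             # detectable effect decreases

print(baseline, stricter_alpha, higher_power, smaller_effect)
```

Each of the three variations should print a larger number than the baseline.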

In [3]:
# Import Libraries
import statsmodels.stats.api as sms
from math import ceil

# Calculate the sample size needed per group
required_n = sms.NormalIndPower().solve_power(
    0.1,            # effect size
    power=0.8,
    alpha=0.05,
    nobs1=None,     # the unknown we are solving for
    ratio=1         # equal group sizes
    )

required_n = ceil(required_n)                          # Rounding up to next whole number

print(required_n)

1570


So, with an effect size of 0.1, we need at least 1,570 users in each group.

## Randomization in A/B Tests

* What is randomization in A/B tests?
    - It is the process by which we allocate users to either the treatment or the control group in an A/B test. Users must be assigned randomly, and the way we assign them should not convey any information about the groups (that is, remove nuisance factors that could differ systematically between groups). Any given user should see either control or treatment, but never both.
    
    
* What is **SUTVA (Stable Unit Treatment Value Assumption)**?
    - SUTVA states that the treatment and control units don’t interact with each other and are independent of each other; otherwise, the interference leads to biased estimates.
    

* When is SUTVA violated?
    - A user experiences both the control and treatment variations in a single session.
    - Resources are shared among experiment subjects. Example: two people use the same email ID to operate a YouTube Premium account while the unit of randomization for the experiment is the login ID.
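
One common way to make sure a user always lands in the same group is to derive the assignment deterministically from a hash of the user ID. The sketch below illustrates that idea; the function name `assign_variant`, the experiment label `"exp_001"`, and the 50/50 bucket threshold are all hypothetical choices, not part of any specific tool.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "exp_001") -> str:
    """Deterministically map a user to 'treatment' or 'control' (50/50 split)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100       # bucket in 0..99
    return "treatment" if bucket < 50 else "control"

# The same user always gets the same variant, however often they return.
print(assign_variant("user_12345"), assign_variant("user_12345"))

# Over many users the split should be close to 50/50.
counts = {"treatment": 0, "control": 0}
for i in range(10_000):
    counts[assign_variant(f"user_{i}")] += 1
print(counts)
```

Because the assignment depends only on the ID, it does not protect against the shared-account case above, where two people sit behind one login ID.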
    

## SRM: Sample Ratio Mismatch



* What is SRM?
    - SRM stands for Sample Ratio Mismatch. In simple terms: after splitting the population into groups (traffic allocation) and introducing a variation (treatment), we expect the traffic to flow into the groups in the designed ratio, for example equally into two groups. When one group receives many more visitors (page views or sessions) than the design predicts, the ratio of traffic does not match, and you have a Sample Ratio Mismatch issue.
    
* Why is SRM an issue?
    - When SRM occurs, the results can't be trusted, because the skewed traffic distorts the conversion numbers.
    
    To understand the concept better, let's look at an example. Suppose we run an A/B test with an equal split for 5 weeks and get a total of 579,286 visitors. The expectation would be 289,643 visitors each in the control and treatment groups. But let's say this is not the case: the control group has 247,563 visitors and the treatment group has 331,723 visitors. Suppose the goal is to increase conversions; the control achieved 4,735 conversions, while the treatment slightly outperformed it with 5,323 conversions.
    
    Now the question is: will this affect the final result? The answer is **YES**.
    
    Consider the conversion rate **with and without the SRM issue**:
    
    - Conversion Rate = Conversions / Visitors
    
    - Effect = (Conversion Rate of Treatment - Conversion Rate of Control) / (Conversion Rate of Control)
    
    |SRM Issue?|Metric|Control|Treatment|Effect|
    |-----|-----|-----|-----|-----|
    |N|Visitors|289,643|289,643||
    |N|Conversions|4,735|5,323||
    |N|Conversion Rate|1.63%|**1.84%**|**12.42%**|
    ||||||
    |Y|Visitors|247,563|331,723||
    |Y|Conversions|4,735|5,323||
    |Y|Conversion Rate|**1.91%**|1.60%|**-16.10%**|
    
    The table shows that SRM hides the real result: with the skewed split, the apparent effect flips from a gain to a loss, so the data cannot be trusted. Checking for SRM is therefore a standard safeguard for reliable test results; by verifying that your test has no SRM issue, you can be more confident that the results are trustworthy.
    
    Let's try to perform this in Python.
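
As a first step, the conversion rates and relative effects from the table above can be recomputed from the raw counts. This is a small sketch with hypothetical helper names; it works from unrounded rates, so the effects can differ slightly from figures derived from rounded percentages.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    return conversions / visitors

def relative_effect(treatment_rate: float, control_rate: float) -> float:
    return (treatment_rate - control_rate) / control_rate

control_conversions, treatment_conversions = 4_735, 5_323

# (control visitors, treatment visitors) for each scenario in the example
scenarios = {
    "expected split (no SRM)": (289_643, 289_643),
    "observed split (SRM)": (247_563, 331_723),
}

effects = {}
for label, (control_visitors, treatment_visitors) in scenarios.items():
    c_rate = conversion_rate(control_conversions, control_visitors)
    t_rate = conversion_rate(treatment_conversions, treatment_visitors)
    effects[label] = relative_effect(t_rate, c_rate)
    print(f"{label}: control {c_rate:.2%}, treatment {t_rate:.2%}, effect {effects[label]:+.2%}")
```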

In [1]:
# Sample Ratio Calculation
control_visitors = 247563
treatment_visitors = 331723

sample_ratio = treatment_visitors / control_visitors
print(sample_ratio)

1.339953870327957


The sample ratio is 1.34, while the design ratio we had planned for was 1. To determine whether this deviation is statistically significant or just due to chance, we need to calculate the p-value of the sample ratio.


* How do you identify SRM in an A/B test?
    To calculate the p-value of the sample ratio we perform either a **standard t-test or a chi-squared test**. The code below uses a chi-squared goodness-of-fit test of the observed counts against the designed split.

    **Null Hypothesis:** There is no difference between the treatment group and the control group, i.e. the observed split matches the design ratio.

    **Alternate Hypothesis:** There is a significant difference between the two groups, i.e. the observed split deviates from the design ratio.
    
    With a threshold of 0.01, we accept at most a 1% chance of a false positive, that is, of saying there is a difference between the two groups when in reality there isn't any. If p-value <= 0.01, we reject the null hypothesis and conclude that the two groups are different (which is what we are checking for when the observed sample ratio differs from the design ratio).

In [22]:
# Function to compute the p-value of the sample ratio for an A/B test.
from scipy.stats import chisquare
from typing import Union

def chi_stat_sig(treatment: Union[int, float], control: Union[int, float], alpha: float) -> str:
    """
    Chi-squared goodness-of-fit test of the observed split against a 50/50 design.

    Parameters:
    ---------------
    treatment : the count of the unit of randomization in the treatment group (can be a cookie / user id / device id)
    control : the count of the unit of randomization in the control group (can be a cookie / user id / device id)
    alpha : the statistical significance boundary
    """
    if not isinstance(alpha, float):
        raise TypeError("variable alpha is not of type float")

    observed_visitors = [treatment, control]
    total_visitors = treatment + control
    expected_visitors = [total_visitors / 2, total_visitors / 2]   # equal split by design

    p_value = chisquare(observed_visitors, f_exp=expected_visitors).pvalue
    if p_value <= alpha:
        return f'the p-value for the chi squared test is {p_value} and there is a difference between treatment group and control group'
    else:
        return f'the p-value for the chi squared test is {p_value} and there is "NO" difference between treatment group and control group'

In [24]:
chi_stat_sig(331723,247563,0.01)

'the p-value for the chi squared test is 0.0 and there is a difference between treatment group and control group'

The calculated p-value is less than 0.01 for our Sample Ratio and hence we will reject the Null hypothesis. It is very likely that our implementation of the A/B test is incorrect and that is what has caused a Sample Ratio Mismatch.
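
To see what the test computes, the chi-squared statistic can also be worked out by hand. The sketch below assumes the same 50/50 design and uses the fact that, with one degree of freedom, the p-value equals erfc(sqrt(χ²/2)). The statistic is so large here that the p-value underflows to zero in floating point, which is why it prints as 0.0.

```python
from math import erfc, sqrt

observed = [331_723, 247_563]            # treatment, control
expected = [sum(observed) / 2] * 2       # equal split by design

# Chi-squared statistic: sum over groups of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-squared distribution with 1 degree of freedom
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-squared statistic: {chi2:.2f}")
print(f"p-value: {p_value}")
```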

* Why does SRM happen?
    - The test is not set up properly or not randomized properly
    - A bug in the test or test set-up (for example, variants where users are redirected to a new experience via browser redirects)
    - Tracking or reporting issues
    - Traffic from Bots
    
    
* What to do about SRM issues?
    - Review test set-up
    - Look for broader trends in test data to understand where problem is present
    - Re-run the study
    - Run an A/A test
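
To illustrate the A/A idea, the following sketch simulates many A/A tests in which every user is assigned by a fair coin flip, then counts how often the SRM check raises a false alarm. The numbers of users and simulations, the seed, and the helper name `srm_p_value` are arbitrary assumptions chosen for illustration; with a correct set-up, the false-alarm rate should stay close to the chosen alpha.

```python
import random
from math import erfc, sqrt

def srm_p_value(n_a: int, n_b: int) -> float:
    """Chi-squared (1 d.o.f.) p-value for an observed split against a 50/50 design."""
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    return erfc(sqrt(chi2 / 2))

rng = random.Random(42)      # fixed seed so the simulation is repeatable
users_per_test = 10_000
n_tests = 2_000
alpha = 0.01

false_alarms = 0
for _ in range(n_tests):
    # Counting the 1-bits of k random bits simulates k fair coin flips.
    n_a = bin(rng.getrandbits(users_per_test)).count("1")
    n_b = users_per_test - n_a
    if srm_p_value(n_a, n_b) <= alpha:
        false_alarms += 1

false_alarm_rate = false_alarms / n_tests
print(f"SRM false-alarm rate over {n_tests} simulated A/A tests: {false_alarm_rate:.3%}")
```

If a real A/A test on the production set-up flags SRM far more often than this, the assignment or tracking pipeline is the likely culprit.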