# A/B Testing Course

## Q&A Session 1

### Q1. What are the recommended methods for working with dependent data?

Dependent data by themselves do not really ruin our lives, except that we cannot achieve a uniform distribution of p-values in A/A tests.  
Thus, you can apply absolutely any transformation, and if the p-values of your A/A tests are distributed uniformly when the null hypothesis is true, you can say that you have come up with a new method of working with dependent data.   

Most of these methods of working with dependent data come down to correcting the variance in one way or another. For a t-test with dependent variables, you need to carefully recalculate the variance in the denominator, taking into account the covariance that has appeared. That is, if you have a sequence of actions and understand how one action depends on another, you can calculate the covariance and correct the variance for these dependencies when calculating the t-test statistic.  
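As a minimal sketch of such a variance correction, consider the simplest dependent case: paired measurements of the same users. The data, sample size, and variable names below are hypothetical and only illustrate how the covariance term enters the denominator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical paired data: the same 500 users measured twice,
# so x and y are positively correlated (dependent).
base = rng.normal(100, 20, size=500)
x = base + rng.normal(0, 5, size=500)
y = base + rng.normal(1, 5, size=500)
n = len(x)

# Naive variance of the difference in means ignores the dependence.
var_naive = (x.var(ddof=1) + y.var(ddof=1)) / n

# Corrected variance subtracts twice the covariance between x and y.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_corrected = (x.var(ddof=1) + y.var(ddof=1) - 2 * cov_xy) / n

t_naive = (x.mean() - y.mean()) / np.sqrt(var_naive)
t_corrected = (x.mean() - y.mean()) / np.sqrt(var_corrected)

# For paired data the corrected statistic coincides with the paired t-test.
t_paired = stats.ttest_rel(x, y).statistic
print(t_naive, t_corrected, t_paired)
```

Other dependence structures (for example, several events per user) need their own covariance terms; the paired case is just the easiest one to verify.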

The next thing we can talk about is the case when you don’t have a randomized experiment. In a randomized experiment you first allocate the control and test groups at random, apply the treatment to one of them, and then compare them, i.e. do everything as usual. But sometimes we cannot allocate independent control and test groups ourselves. Vaccination is an example: we cannot tell one group to get vaccinated and another not to, and once people have decided for themselves, those who got vaccinated and those who did not differ from the start, i.e. we do not have true randomization. Here you can use methods such as *matching*: among group A (usually the larger group) you select a subgroup A’ that is as similar as possible to group B in its initial characteristics, and then compare A’ with B.
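A rough illustration of matching, assuming a single covariate (age) and nearest-neighbour matching with replacement. Both the data and the choice of covariate are hypothetical; real matching usually works with many characteristics or a propensity score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observational data: one covariate (age) per person.
# The treated group (B) is older on average than the untreated pool (A).
age_a = rng.normal(40, 12, size=5000)   # large untreated pool, group A
age_b = rng.normal(55, 8, size=300)     # treated people, group B

# For every person in B, pick the closest person in A by age
# (nearest-neighbour matching with replacement).
nearest = np.abs(age_a[None, :] - age_b[:, None]).argmin(axis=1)
age_a_matched = age_a[nearest]          # group A'

print("mean age A :", age_a.mean())
print("mean age A':", age_a_matched.mean())
print("mean age B :", age_b.mean())
```

After matching, A’ should look much more like B in the chosen covariate than the full group A does, which is the whole point of the procedure.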

### Q2. Where to find advanced knowledge about A/B testing?

- articles from top companies
- university articles on statistics   

For example, from a mathematical point of view CUPED is a projection in a Hilbert space of random variables, and this idea has been known for more than 50-60 years.

### Q3. What is the best way to master advanced A/B testing techniques?

The easiest and most reliable way is to get a job at a company that uses them, and most importantly, has enough data to apply them.

### Q4. Is a sufficiently large sample size needed when working with highly skewed metric distributions so that the CLT approximates the means well? Or should non-parametrics be used instead, such as Mann-Whitney?

1. Let's consider a typical example of right-skewed data (e.g. data on the average check). What can be done?  
For example, stratify based on historical data. Go back to a period in the past and split users into strata: one stratum is people who bought nothing or little, another stratum is everyone else. The variance within each stratum will then be much smaller than in the whole sample, so within each stratum your metric will no longer be so skewed. You can evaluate the strata independently and then combine their results with weights equal to their shares in the general population.
2. Next, if there is an additional skew (outlier) to the left.
- What can be done?  
Remove this outlier, estimate without it, but do not forget to mention it when reporting the results. Plus, think about how it happened that this outlier appeared and what can be done to minimize the probability of its appearance when conducting subsequent tests.
- What else can be done?   
Let’s decompose this metric and say that zero (who did not make a purchase) is 0, and everything from zero to plus infinity (those who made a purchase) is 1. Thus, from the average check (money), you can switch to such a metric as the share of people who performed some action (made a purchase). You can also try to split not into 0 and 1, but by the number of purchases (targeted actions), for example 0, 1, 2, 3, 4 etc. In this case there will be no big skewness, but unfortunately this will not be exactly the metric we need.  
- You can also take only those who made purchases in group A and group B and compare them with each other, and exclude those with zero average check from both samples. That is, now we will compare sales per person in group A and sales per person in group B.
- The second and third options separately here may not be very informative, because the number of people who made a purchase may decrease, and sales per person in the same group may increase, so you will need to multiply the results of the second and third options to get the final result. 
- Bucket method (look up the details of how to apply it). We take the users and randomly divide them into 100 buckets in each group, then compute the metric in each bucket. This gives 100 metric values in group A and 100 in group B, and in the end we compare these two sets of 100 values. Because each value is already an average over many users, the distribution should converge to normal better.
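The bucket method from the last item can be sketched as follows. The revenue simulation and the helper names `bucket_means` and `simulate_revenue` are assumptions made for illustration; only the idea of 100 random buckets per group comes from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def bucket_means(values, n_buckets, rng):
    """Randomly assign each user to a bucket and return the per-bucket means."""
    buckets = rng.integers(0, n_buckets, size=len(values))
    sums = np.bincount(buckets, weights=values, minlength=n_buckets)
    counts = np.bincount(buckets, minlength=n_buckets)
    return sums / counts

def simulate_revenue(n, rng):
    """Hypothetical right-skewed revenue: about 5% of users pay, the rest are zeros."""
    paid = rng.random(n) < 0.05
    return paid * rng.lognormal(mean=3.0, sigma=1.5, size=n)

group_a = simulate_revenue(100_000, rng)
group_b = simulate_revenue(100_000, rng)

# 100 bucket-level metrics per group instead of 100,000 skewed user-level values.
means_a = bucket_means(group_a, 100, rng)
means_b = bucket_means(group_b, 100, rng)

result = stats.ttest_ind(means_a, means_b, equal_var=False)
print(result.pvalue)
```

As with any method here, it is worth checking on A/A tests that the resulting p-values are uniform before relying on it.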

### Q5. What to do if users are not registered? How to approach A/B testing?

The main thing is to run A/A tests on historical data and check that the p-values are distributed uniformly. If they are, then everything is fine: you have found the right approach to forming groups of unregistered users.

### Q6. What are buckets, salt, etc.? How are they calculated?

In A/B testing, a bucket is a group of users, formed pseudo-randomly (usually by hashing the user ID), from which control and test groups are drawn.

Salt is a string, usually unique to each experiment, that is appended to the user ID before hashing. Changing the salt gives a new split of the same users that is reproducible and independent of the previous ones.

We have a large number of existing users and we have conditionally divided them into 100 buckets. Suppose we want to conduct an experiment. If we always take users from one bucket for the control group and users from another bucket for the test group, then after several years of experiments our buckets will differ significantly from each other. Therefore, double hashing is used: we take two buckets, combine them into one large bucket, and apply a hash function to its users:
```
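import hashlib

def assign_group(user_id: str, salt: str) -> str:
    # md5 is an assumed choice: it is stable across runs, unlike built-in hash() on strings
    digest = hashlib.md5((user_id + salt).encode("utf-8")).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "test"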
```

After that we get our control and test groups. If you want to conduct an experiment again on users from these same two buckets, all you need to do is take another salt. Thus, users will be divided by an orthogonal partition and sequentially conducted experiments will not affect each other.

We need other buckets (first level) to conduct many experiments in parallel. They allow us to control the intersection of experiments. The second level of salt and hash is needed so that users can be re-shuffled for subsequent experiments and so that buckets do not drift apart over time in their characteristics.

The 100 first-level buckets are obtained in the same way: `hash(id + salt) % 100`.

In large companies there are thousands and hundreds of thousands of first-level buckets.
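The two levels described above can be sketched together. This is a minimal illustration, not a production assignment service: the function names, the salts, the bucket numbers, and the choice of SHA-256 are all assumptions.

```python
import hashlib

def stable_hash(text: str) -> int:
    """Deterministic integer hash (unlike built-in hash(), stable across runs)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

def first_level_bucket(user_id: str, layer_salt: str, n_buckets: int = 100) -> int:
    """First level: which of the n_buckets the user belongs to."""
    return stable_hash(user_id + layer_salt) % n_buckets

def assign(user_id, layer_salt, experiment_salt, experiment_buckets):
    """Return 'control'/'test' for users in the experiment's buckets, else None."""
    if first_level_bucket(user_id, layer_salt) not in experiment_buckets:
        return None
    # Second level: a fresh salt re-shuffles users inside the chosen buckets.
    return "control" if stable_hash(user_id + experiment_salt) % 2 == 0 else "test"

users = [f"user_{i}" for i in range(100_000)]
groups = [assign(u, "layer-2024", "exp-button-color", {17, 42}) for u in users]
print(groups.count("control"), groups.count("test"), groups.count(None))
```

Here the experiment occupies two of the 100 first-level buckets (about 2% of users), and the experiment salt splits those users roughly in half.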

### Q7. Should data be tested for normality before applying parametric or non-parametric tests?

For the t-test to work formally, the following three conditions must be met:
- The numerator is normally distributed.
- The squared denominator (suitably scaled) has a chi-square distribution.
- The numerator and denominator are independent random variables.  

If Xi and Yi are independent identically distributed random variables from a normal distribution, these conditions are met. But in real life, the normality requirement is often ignored.  

The distribution of the original sample matters much less than it seems, because according to the Central Limit Theorem (CLT) the distribution of the sample mean is approximately normal for a sufficiently large sample, whatever the distribution of the original data (as long as its variance is finite). In other words, mathematical statistics guarantees that the mean of a large number of independent random variables tends towards a normal distribution, so the t-test requirement is fulfilled approximately, and automatically.  

Therefore, there is no need to check data for normality, and the choice between the Student's t-test or the Mann-Whitney test is not based on the results of the normality test.

### Q8. How do you choose between Student's t-test or Mann-Whitney test then? If not based on normality testing.

On historical data, you should repeatedly draw two random groups from your sample (say, 10,000 times) and run the test on each pair, as in an A/A test. Each time, record the p-value. Then build the distribution of these p-values:
- if it is uniform, then you can use the t-test.
- if it is not uniform, then you either have dependent data or outliers. In this case, you should think about how to get rid of these two problems.  

Your ultimate goal is to obtain a uniform distribution on A/A tests. This will guarantee that your design and allocation are correct, and you control the occurrence of type I errors.
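A minimal sketch of this check, assuming a user-level metric with one value per user. The exponential data stand in for real historical data, and the number of repetitions is reduced to keep the example fast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-in for a historical metric (hypothetical, skewed, one value per user).
historical = rng.exponential(scale=10.0, size=20_000)

p_values = []
for _ in range(2_000):  # the text suggests ~10,000; fewer here to keep it fast
    shuffled = rng.permutation(historical)
    a, b = shuffled[:10_000], shuffled[10_000:]
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)
p_values = np.array(p_values)

# Under a valid design about 5% of A/A tests are "significant" at alpha = 0.05,
# and the p-values should be indistinguishable from Uniform(0, 1).
false_positive_rate = (p_values < 0.05).mean()
ks = stats.kstest(p_values, "uniform")
print(false_positive_rate, ks.pvalue)
```

A histogram of `p_values` is usually the first thing to look at; the Kolmogorov-Smirnov check against the uniform distribution is just one possible formal summary.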

### Q9. There are two models, and one of them is 1% better than the other. There are no users. How can you tell if this is a real difference and not just statistical error?

You bootstrap the pairs (target_i, predict_i) of the first model and compute the loss function on each bootstrap sample, then do the same for the second model. Now you have a large number of metric values for each model. All that remains is to construct a confidence interval for the difference and see whether zero falls within it. If it does, there is no statistically significant difference; if zero lies outside the confidence interval, we have detected a statistically significant difference.
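A sketch of this procedure under a few assumptions: the targets and predictions are simulated, the loss is squared error, and both models are resampled with the same object indices (a paired bootstrap), which is one reasonable reading of the recipe above rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical test set: targets and predictions of two regression models.
n = 2_000
target = rng.normal(size=n)
pred_1 = target + rng.normal(scale=1.0, size=n)
pred_2 = target + rng.normal(scale=0.8, size=n)   # the better model

# Per-object losses (squared error here; any per-object loss works).
loss_1 = (target - pred_1) ** 2
loss_2 = (target - pred_2) ** 2

# Resample the same object indices for both models, so each bootstrap
# replicate compares the models on an identical set of objects.
n_boot = 2_000
idx = rng.integers(0, n, size=(n_boot, n))
diff = loss_1[idx].mean(axis=1) - loss_2[idx].mean(axis=1)

# 95% percentile confidence interval for the difference in mean loss.
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])
significant = not (ci_low <= 0.0 <= ci_high)
print(ci_low, ci_high, significant)
```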

### Q10. Is it permissible to neglect the correction for multiple testing when conducting multiple tests?

It is not permissible! Especially if you are making the final decision.

What can be done: run 100 experiments for 100 different variations, for example, the color of the button. Look at the uplift across them. From the 100, select the top 2 results with the highest uplift. And after that, conduct a classic A/B/C with only two treatment options.

That is, multiple testing is never for decision making. As a tool for filtering hypotheses, it can be applied to identify potentially good changes.

### Q11. Which method should be used for prioritizing hypotheses?

For this, it is better to use product managers and product owners, rather than analysts.

### Q12. What is a random variable and what is the realization of a random variable?

A random variable is an abstract mathematical object. The sample mean, viewed as a random variable, has an approximately normal distribution under certain conditions, such as independence of the observations in the sample.

A realization of a random variable is what we observe in the real world. In an A/B test the realization is two numbers (the means of our control and test groups). All you can do is take the difference between these means, divide it by its standard error (as in the t-test formula), and see how much it deviates from what is expected. Under the null hypothesis it is expected to be around zero. If the difference in sample means falls within 1.96*σ of zero, where σ is the standard error of the difference, then you say that you have no grounds to reject the null hypothesis.

In mathematical statistics, you observe only one sample mean (a realization of a random variable) and this is where all the complexity lies. But knowing the theory, we can say from which distribution it came given that the null hypothesis is true, and see if our observed value corresponds to the expected one. If it does not correspond, then we have grounds to reject the null hypothesis of equality of two means.

The sample mean is normal according to the Central Limit Theorem.
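To make this concrete, here is a small sketch that computes the statistic by hand for one realization. The two samples are simulated from the same distribution, so the null hypothesis is true by construction; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical realizations: one sample per group, no true difference.
control = rng.exponential(scale=10.0, size=10_000)
test = rng.exponential(scale=10.0, size=10_000)

# Difference of sample means divided by its standard error.
diff = test.mean() - control.mean()
std_error = np.sqrt(control.var(ddof=1) / len(control) + test.var(ddof=1) / len(test))
z = diff / std_error

# |z| <= 1.96 means the observation is consistent with H0 at the 5% level.
reject_h0 = abs(z) > 1.96
print(z, reject_h0)
```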

### Q13. What is better to use for CTR?

If you are calculating CTR, you need linearization.

Bootstrap - to build a confidence interval.

t-test - if the data is independent.

Linearization - if the data is dependent.
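A minimal sketch of linearization for CTR, assuming the commonly used form in which each user gets the value `clicks - CTR_control * views`. The simulated users and the helper `simulate_users` are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def simulate_users(n_users, ctr, rng):
    """Hypothetical per-user views and clicks (each user has their own CTR)."""
    views = rng.poisson(lam=20, size=n_users) + 1
    user_ctr = np.clip(rng.normal(ctr, 0.03, size=n_users), 0.001, 0.999)
    clicks = rng.binomial(views, user_ctr)
    return clicks, views

clicks_a, views_a = simulate_users(5_000, 0.10, rng)
clicks_b, views_b = simulate_users(5_000, 0.10, rng)

# Global CTR of the control group is used as the linearization coefficient.
ctr_control = clicks_a.sum() / views_a.sum()

# One linearized value per user: clicks minus the clicks "expected" at control CTR.
lin_a = clicks_a - ctr_control * views_a
lin_b = clicks_b - ctr_control * views_b

# Users are independent, so an ordinary t-test applies to the linearized metric.
result = stats.ttest_ind(lin_a, lin_b, equal_var=False)
print(ctr_control, result.pvalue)
```

The events of one user are dependent, but the linearized metric has exactly one value per user, which is why the t-test becomes applicable again.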

### Q14. What is the hypothesis of the Kolmogorov-Smirnov test?

The null hypothesis of the Kolmogorov-Smirnov test is that the two distributions being compared are identical. In A/B testing, however, we typically want to know whether there is a significant shift in the mean between the two groups, so this test is not particularly useful for most tasks.

### Q15. How to deal with the fact that t-test requires a normal distribution?

The t-test does **not** require the data to be normally distributed. And there is no need to try to make the data normal either.

### Q16. How to evaluate A/B/C/D/... tests?

Simply comparing every variant with every other one is not an option, as it leads to the multiple comparisons problem.

The proper approach is to compare each variant with group A one at a time, select the best hypotheses, and then conduct a classical A/B test.

### Q17. How to properly evaluate conversions: in users or in events?

It is correct to evaluate both in users and in events. These are simply two different metrics. Based on the speaker's experience, it is more appropriate to evaluate conversions in events and apply linearization.

### Q18. Is it necessary to do an A/A test before every test?

If you have the opportunity, it is better to do so. Especially if you do not have a reliable platform for online experiments.

The best way to do this is to take a piece of historical data, conduct 1000 A/A tests on it, conduct 1000 simulations of A/B tests on it (artificially adding uplift). Take another piece of historical data, do the same thing on it and repeat the process 10 times.

This way you will ensure that you control the type I error (via the A/A test) and the power (via the A/B simulations) of your experiment.
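One round of this procedure might look like the sketch below. The gamma-distributed `history`, the 5% relative uplift, and the helper `rejection_rate` are assumptions; in practice you would repeat it on several slices of real historical data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(history, uplift, n_sims, alpha, rng):
    """Share of simulated experiments in which the t-test rejects H0."""
    half = len(history) // 2
    rejections = 0
    for _ in range(n_sims):
        shuffled = rng.permutation(history)
        a = shuffled[:half]
        b = shuffled[half:2 * half] * (1.0 + uplift)  # artificial relative uplift
        if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# Stand-in for one slice of historical data (hypothetical values).
history = rng.gamma(shape=2.0, scale=50.0, size=20_000)

# uplift = 0 gives A/A tests (type I error); uplift > 0 gives simulated A/B tests (power).
type_1_error = rejection_rate(history, uplift=0.00, n_sims=1_000, alpha=0.05, rng=rng)
power = rejection_rate(history, uplift=0.05, n_sims=1_000, alpha=0.05, rng=rng)
print(type_1_error, power)
```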

### Q19. Should A/A/B tests be conducted?

In other words, should you take a sample of customers, divide them into two control groups and one test group, and then compare the control groups with each other and the test group with the controls? This should never be done, because such an experiment does not give you any additional knowledge. One control group and the standard A/B setup are enough.

And controlling the type I error occurs on historical data when performing the classic A/A test.

### Q20. Tell me about one-sided and two-sided criteria?

One-tailed and two-tailed tests are used to determine statistical significance in hypothesis testing.

If you are unsure which one to use, it's best to use a two-tailed test.

For example, if your alternative hypothesis is that B>A and the metric in the experimental group has decreased, then you won't reject the null hypothesis. You'll only reject it if the metric in the experimental group has increased.

Using a one-tailed test means that you'll miss the chance to detect unsuccessful experiments (losing half the information). Therefore, it's recommended to always use a two-tailed test.

### Q21. What examples of unsuccessful experiments can you give?


Unsuccessful experiments happen often, sometimes due to various technical and business limitations. For example, you cannot apply a treatment in only one part of a store and not in the other. Or a suitable control simply does not exist: there is no second city like Moscow to compare with.

During crises, sometimes everything was good in historical data, but everything has changed in reality. In such a situation, if possible, it is better to stop, wait it out, and when the situation normalizes again, run a repeat A/B test.

### Q22. What is orthogonal user segmentation?

During parallel testing of different changes (for example, website design and promotional code email), it is important to segment users into groups orthogonally. This means that the set of users for the website design change is segmented vertically, while the set of users for the email promotion is segmented horizontally. This way, we can conduct our two different tests honestly at the same time.

When you intersect your experiments, there is always a risk. But if you don't intersect, you can greatly slow down your business by constantly waiting for one experiment to end before starting a new one. Therefore, if you do intersect, do it intelligently, for example, as shown above.
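One way to check that two parallel splits are orthogonal is to cross-tabulate them: each of the four combinations should contain about a quarter of the users. The sketch below assumes hash-based splitting with a different salt per experiment; the salts and the helper `split` are hypothetical.

```python
import hashlib
from collections import Counter

def split(user_id: str, salt: str) -> str:
    """Deterministic 50/50 split driven by a stable hash of user_id + salt."""
    digest = hashlib.sha256((user_id + salt).encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

users = [f"user_{i}" for i in range(40_000)]

# Two parallel experiments with different (hypothetical) salts.
cells = Counter((split(u, "site-design"), split(u, "promo-email")) for u in users)

# Share of users in each (design group, email group) combination.
for cell, count in sorted(cells.items()):
    print(cell, count / len(users))
```

If some cell were noticeably larger or smaller than 25%, the effect of one experiment would leak unevenly into the groups of the other.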

### Q23. How to calculate sample size in multiple comparisons?

Calculating sample size for multiple comparisons is similar to calculating it for a single comparison, except now we use alpha prime (adjusted for Bonferroni correction for multiple comparisons) instead of alpha. Beta is not affected by multiple comparisons.
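As a sketch, assuming a two-sided test on means with equal group sizes and the usual normal-approximation formula, the adjustment only changes the alpha that goes into the calculation. The function name and the example numbers are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.8, n_comparisons=1):
    """Per-group n for a two-sided z-test on means with a Bonferroni-adjusted alpha."""
    alpha_adj = alpha / n_comparisons           # alpha prime
    z_alpha = NormalDist().inv_cdf(1 - alpha_adj / 2)
    z_beta = NormalDist().inv_cdf(power)        # beta stays unchanged
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# sigma is the standard deviation of the metric, mde the minimal detectable effect.
n_single = sample_size_per_group(sigma=10.0, mde=1.0)
n_three = sample_size_per_group(sigma=10.0, mde=1.0, n_comparisons=3)
print(n_single, n_three)
```

With three comparisons the required sample per group grows, because the stricter alpha prime pushes the critical value further out.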

### Q24. Can 20 days be removed from the middle of a 100-day experiment?


The answer, as always, is simple:
- It should have been built into the design.
- The design should be extensively tested on historical A/A and A/B tests. If the p-value distribution is uniform on A/A tests, and the required power is achieved on A/B tests, then it can be done.
- It is a purely practical question from there. It may not have been accounted for in the design, and something may have broken in the middle of the experiment. In this case, as mathematicians, we should restart the experiment, but the business may not allow us to do so. Therefore, we often have no choice but to continue the experiment. In any case, we need to estimate the probability of type I and type II errors for such a design based on historical data, and make adjustments if necessary.