# Statistical Experiments and Significance Testing

Design of experiments is a cornerstone of the practice of statistics, with
applications in virtually all areas of research. The goal is to design an experiment
in order to confirm or reject a hypothesis.

This process starts with a hypothesis. An experiment is then designed to test
that hypothesis, in such a way that it will, hopefully, deliver conclusive
results. The data is collected and analyzed, and then a conclusion is drawn. The
term inference reflects the intention to apply the experiment results, which
involve a limited set of data, to a larger process or population.

Formulate Hypothesis → Design Experiment → Collect Data → Inference / Conclusions

## A/B Testing

An A/B test is an experiment with two groups to establish which of two
treatments, products, procedures, or the like is superior. Often one of the two
treatments is the standard existing treatment, or no treatment. If a standard (or
no) treatment is used, it is called the control. A typical hypothesis is that
treatment is better than control.

- **Treatment**: Something (drug, price, web headline) to which a subject is exposed.
- **Treatment group**: A group of subjects exposed to a specific treatment.
- **Control group**: A group of subjects exposed to no (or standard) treatment.
- **Randomization**: The process of randomly assigning subjects to treatments.
- **Subjects**: The items (web visitors, patients, etc.) that are exposed to treatments.
- **Test statistic**: The metric used to measure the effect of the treatment.

Examples of A/B tests:
- Testing two prices to determine which yields more net profit
- Testing whether a new iteration of a model has a higher click-through rate than the current model

A proper A/B test has subjects that can be assigned to one treatment or another.
The subject might be a person, a plant seed, a web visitor; the key is that the
subject is exposed to the treatment. Ideally, subjects are randomized (assigned
randomly) to treatments. In this way, you know that any difference between the
treatment groups is due to one of two things:
- The effect of the different treatments
- Luck of the draw in which subjects are assigned to which treatments (i.e., the random assignment may have resulted in the naturally better-performing subjects being concentrated in A or B)
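The random assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not a production assignment scheme: the function name `assign_groups`, the integer subject IDs, and the fixed seed are all assumptions made for the example.

```python
import random


def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into groups A and B of (nearly) equal size.

    `subject_ids` is any iterable of identifiers (hypothetical here).
    `seed` makes the shuffle reproducible; leave it as None in practice
    unless you need to replay the assignment.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering is equally likely
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


# Ten hypothetical subjects, identified by the integers 0..9
groups = assign_groups(range(10), seed=42)
print(groups)
```

Because the split depends only on the shuffle, no characteristic of a subject can influence which group it lands in. What remains is the luck of the draw.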

You also need to pay attention to the test statistic or metric you use to compare
group A to group B. Perhaps the most common metric in data science is a binary
variable: click or no-click, buy or don’t buy, fraud or no fraud, and so on. Those
results can be summarized in a 2×2 table, for example:

| Outcome       | Model A | Model B |
|---------------|---------|---------|
| Conversion    | 450     | 350     |
| No conversion | 53489   | 49278   |
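As a rough sketch, the conversion rate for each model can be computed directly from the counts in the table. The dictionary layout and the helper name `conversion_rate` are assumptions made for illustration; only the four counts come from the table.

```python
# Counts copied from the 2x2 table above
counts = {
    "Model A": {"conversion": 450, "no_conversion": 53489},
    "Model B": {"conversion": 350, "no_conversion": 49278},
}


def conversion_rate(cell):
    """Share of subjects in one group that converted."""
    total = cell["conversion"] + cell["no_conversion"]
    return cell["conversion"] / total


rates = {name: conversion_rate(cell) for name, cell in counts.items()}
diff = rates["Model A"] - rates["Model B"]

for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
print(f"Difference (A - B): {diff:.4%}")
```

The observed difference is only a starting point. On its own it does not say whether the gap reflects the treatment or the luck of the draw.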

If the metric is a continuous variable (purchase amount, profit, etc.) or a count
(e.g., days in hospital, pages visited), the result might be displayed differently.
If one were interested not in conversion but in revenue per page view, the table
above could instead show the mean revenue per page view for each group in the top
row and the corresponding standard deviation in the bottom row.
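For a continuous metric, the per-group summary might be computed as in the sketch below. The revenue figures are made up purely for illustration and do not come from any real experiment. The variable names are likewise assumptions.

```python
from statistics import mean, stdev

# Hypothetical revenue per page view for a handful of visitors in each group
revenue = {
    "Model A": [0.0, 0.0, 12.5, 0.0, 3.0, 0.0],
    "Model B": [0.0, 1.5, 0.0, 0.0, 8.0, 0.0],
}

# Mean and sample standard deviation for each group
summary = {
    name: {"mean": mean(values), "std": stdev(values)}
    for name, values in revenue.items()
}

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
```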

### Why Have a Control Group?

Without a control group, there is no assurance that “other things are equal” and
that any difference is really due to the treatment (or to chance). When you have a
control group, it is subject to the same conditions (except for the treatment of
interest) as the treatment group. If you simply make a comparison to “baseline”
or prior experience, other factors, besides the treatment, might differ.

The use of A/B testing in data science is typically in a web context. Treatments
might be the design of a web page, the price of a product, the wording of a
headline, or some other item. Some thought is required to preserve the principles
of randomization. Typically the subject in the experiment is the web visitor, and
the outcomes we are interested in measuring are clicks, purchases, visit duration,
number of pages visited, whether a particular page is visited, and the like. In a
standard A/B experiment, you need to decide on one metric ahead of time.
Multiple behavior metrics might be collected and be of interest, but if the
experiment is expected to lead to a decision between treatment A and treatment
B, a single metric, or test statistic, needs to be established beforehand. **Selecting
a test statistic after the experiment is conducted opens the door to researcher
bias.**

### Why Just A/B? Why Not C, D…?

A/B tests are popular in the marketing and ecommerce worlds, but are far from
the only type of statistical experiment. Additional treatments can be included.
Subjects might have repeated measurements taken. Pharmaceutical trials where
subjects are scarce, expensive, and acquired over time are sometimes designed
with multiple opportunities to stop the experiment and reach a conclusion.