# Statistical Experiments and Significance Testing

Chapter 3 of *Practical Statistics for Data Scientists* covers statistical experiments and significance testing, the methods data scientists use to judge whether an observed result reflects a real effect or only chance variation.

## Introduction to Experimental Design and Inference

- The chapter highlights the goal of experimental design, which is to confirm or reject a hypothesis.
- It distinguishes between classical statistics focused on inference, which involves generalizing results from a sample to a larger population, and the field of data science, which often focuses on prediction.
- The classical statistical inference process typically includes these steps: formulating a hypothesis, designing an experiment to test it, collecting and analyzing data, and drawing a conclusion.

## A/B Testing: A Practical Approach

- A/B testing is a common method in data science that compares two versions (A and B) of a product or feature to determine which one performs better.

- Key aspects of A/B testing include:
  - **Randomization**: Subjects should be randomly assigned to different treatment groups.
  - **Control group**: The group that receives no treatment, or the existing standard treatment, and serves as the baseline for comparison with the treatment group.

- The aim is to determine if the difference in performance between the two versions is statistically significant.
- Results of A/B tests can be presented as a 2x2 table or by using descriptive statistics like means and standard deviations.
- A/B tests help answer questions like "Is the difference between price A and price B statistically significant?"
- It's crucial to obtain permission when experimenting with human subjects to avoid ethical issues.
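
The randomization step and the 2x2 summary can be sketched in a few lines of Python. This is a minimal illustration rather than the book's code: the helper names `assign_groups` and `two_by_two` and all counts are hypothetical.

```python
import random


def assign_groups(subject_ids, seed=0):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # fixed seed only to make the sketch repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


def two_by_two(conversions_a, n_a, conversions_b, n_b):
    """Build the 2x2 table: rows are outcomes, columns are versions A and B."""
    return [
        [conversions_a, conversions_b],              # conversion
        [n_a - conversions_a, n_b - conversions_b],  # no conversion
    ]


# Hypothetical counts, for illustration only.
groups = assign_groups(range(200))
table = two_by_two(conversions_a=20, n_a=100, conversions_b=30, n_b=100)
rate_diff = table[0][1] / 100 - table[0][0] / 100   # B rate minus A rate
print(table, round(rate_diff, 2))
```

Whether a difference like `rate_diff` is statistically significant still has to be checked with a test such as the permutation test described later.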

## Multi-Arm Bandit Algorithm: A Dynamic Strategy

- The multi-arm bandit algorithm is useful for scenarios with multiple options and dynamically allocates samples to treatments based on observed outcomes. This method balances exploration (trying new options) and exploitation (using known better options).
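
One simple way to balance exploration and exploitation is an epsilon-greedy rule: with a small probability, pick an arm at random; otherwise pick the arm with the best observed success rate so far. The sketch below simulates that rule under stated assumptions. The function name, the `true_rates` used to simulate outcomes, and the default `epsilon` are all hypothetical choices, and other bandit strategies exist.

```python
import random


def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return pulls and wins per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: choose any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: choose the arm with the best observed rate so far
            # (arms that have never been pulled count as rate 0).
            rates = [w / p if p else 0.0 for w, p in zip(wins, pulls)]
            arm = rates.index(max(rates))
        pulls[arm] += 1
        # Simulated outcome; in a real experiment this is the observed response.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins


pulls, wins = epsilon_greedy([0.05, 0.10, 0.30])
print(pulls, wins)
```

Over time most of the traffic should flow to the best-performing arm, while the random share keeps testing the others.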

## The Null and Alternative Hypotheses

- A hypothesis test always involves a null hypothesis (a statement of no effect or no difference) and an alternative hypothesis (a statement that contradicts the null hypothesis).

- Examples:
  - Null = "There is no difference between the means of group A and group B"; alternative = "The means of group A and B are different."
  - Null = "A ≤ B"; alternative = "A > B."

## Resampling: Permutation Tests

- Resampling involves repeatedly sampling values from observed data to assess the random variability of a statistic.
- Permutation tests use resampling to check whether an observed difference between groups could be due to random chance.
- Permutation involves combining data from different groups, shuffling, and reallocating to create resampled groups, and then comparing the observed results to the distribution of results from this resampled data.
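
The combine, shuffle, and reallocate procedure can be sketched as follows. This is a minimal one-sided version that uses the difference in means as the test statistic; the function name, the default of 1000 permutations, and the sample values are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=1000, seed=0):
    """Return the observed mean difference (B - A) and a one-sided p-value."""
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)   # combine the groups
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                  # shuffle the combined data
        # Reallocate: first n_a values play the role of A, the rest of B.
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Hypothetical session times (seconds), for illustration only.
page_a = [21.0, 35.0, 67.0, 85.0, 118.0, 150.0]
page_b = [45.0, 75.0, 102.0, 120.0, 149.0, 171.0]
print(permutation_test(page_a, page_b))
```

If the observed difference falls well inside the permutation distribution, chance alone is a plausible explanation for it.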

## Statistical Significance and p-Values

- Statistical significance is a measure of whether an experiment's result is more extreme than what might be expected due to chance.
- The p-value is the probability of observing a result at least as extreme as the one seen, assuming the null hypothesis is true.
- A low p-value means that a result this extreme would be unusual if only chance were at work.

- The American Statistical Association (ASA) notes:
  - P-values do not measure the probability that the studied hypothesis is true.
  - Conclusions should not be based solely on whether a p-value is below a threshold.
  - Statistical significance does not measure the size or importance of an effect.

- Practical significance should be considered in addition to statistical significance, as a statistically significant result may not have any meaningful practical implications.

## Example: Web Stickiness

- The chapter provides an example that compares the "stickiness" (time spent on a page) of two web pages.
- A permutation test is used to compare the mean session times of page A and page B.
- **Figure 1** shows boxplots for session times on the two pages.

| ![session_AB](figure/c3/fig3-3.png) | 
|:--:| 
| *Figure 1.  Session times for web pages A and B* |

- **Figure 2** displays a histogram of permuted differences in session times, with a vertical line indicating the observed difference.

| ![hist_AB](figure/c3/fig3-4.png) | 
|:--:| 
| *Figure 2.  Frequency distribution for session time differences between pages A and B; the vertical line shows the observed difference* |

## Degrees of Freedom

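- Degrees of freedom (d.f.) refers to the number of values in a calculation that are free to vary. For example, once the mean of a sample of n values is known, only n - 1 of the values are free to vary, which is why the sample variance divides by n - 1. The concept matters when calculating and standardizing test statistics.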
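
A small numeric check can make the idea concrete. In the sketch below (hypothetical data), the deviations from the sample mean always sum to zero, so the last one is fixed by the others; dividing the sum of squares by n - 1 reproduces `statistics.variance`, while dividing by n reproduces `statistics.pvariance`.

```python
from statistics import mean, pvariance, variance

data = [2.0, 4.0, 4.0, 6.0]   # hypothetical sample
n = len(data)
m = mean(data)

# Deviations from the mean sum to zero, so only n - 1 of them are free to vary.
deviations = [x - m for x in data]
sum_of_squares = sum(d * d for d in deviations)

sample_var = sum_of_squares / (n - 1)   # n - 1 degrees of freedom
population_var = sum_of_squares / n     # treats the data as the whole population

print(sum(deviations), sample_var, population_var)
```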

## Analysis of Variance (ANOVA)

- ANOVA is used to compare the means of more than two groups, checking for significant differences among group means relative to the variability within each group.
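- Like the two-group comparison, ANOVA can be carried out with a resampling (permutation) procedure; the classical alternative is a test based on the F-statistic.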
- **Figure 3** shows boxplots of four groups being compared using ANOVA.

| ![4_groups](figure/c3/fig3-6.png) | 
|:--:| 
| *Figure 3.  Boxplots of the four groups show considerable differences among them* |

- A key element of ANOVA is the decomposition of variance, which separates observed data into the grand average, treatment effect, and residual error.
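
As a rough illustration of this decomposition, the sketch below splits every observation into grand average + treatment effect + residual. The function name `decompose` and the data are hypothetical.

```python
from statistics import mean


def decompose(groups):
    """Split each value into (grand average, treatment effect, residual).

    groups maps a group label to its list of observed values.
    """
    all_values = [x for values in groups.values() for x in values]
    grand = mean(all_values)
    rows = []
    for label, values in groups.items():
        effect = mean(values) - grand          # treatment effect for this group
        for x in values:
            residual = x - grand - effect      # what is left unexplained
            rows.append((label, x, grand, effect, residual))
    return rows


# Hypothetical data, for illustration only.
for row in decompose({"page_1": [1.0, 2.0, 3.0], "page_2": [4.0, 5.0, 6.0]}):
    print(row)
```

By construction, the three components of each row add back up to the observed value.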

## Chi-Square Test

- The chi-square test, commonly used with categorical (count) data, checks how well observed counts match the counts expected under a null model.
- The test can be performed using a resampling procedure.

### Formula:

- Pearson residual (R) = (Observed - Expected) / sqrt(Expected)
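
The formula can be applied cell by cell to a contingency table. The sketch below assumes the null model is independence of rows and columns, so each expected count is row total × column total / grand total; the function names and the counts are hypothetical.

```python
from math import sqrt


def pearson_residuals(observed):
    """Pearson residuals for a table given as a list of rows of counts.

    Assumes a null model of row/column independence for the expected counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            res_row.append((obs - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals


def chi_square_statistic(observed):
    """Sum of squared Pearson residuals over all cells."""
    return sum(r * r for row in pearson_residuals(observed) for r in row)


# Hypothetical counts, for illustration only.
table = [[10, 20], [20, 10]]
print(pearson_residuals(table))
print(chi_square_statistic(table))
```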

## Power and Sample Size

- **Power** is the probability of detecting an effect of a given size with a given sample size and significance level.
- **Sample size** is the number of observations required to detect an effect with a given power.

- Four components involved:
  1. Sample size
  2. Effect size
  3. Significance level (alpha)
  4. Power

- Typically you want to calculate the sample size, so you must specify the other three components. More generally, fixing any three of the four determines the fourth.
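
To show how the four components interact, here is a sketch that solves for sample size when comparing two proportions. It uses a common normal-approximation formula for a two-sided test, which is an assumption of this example and may differ slightly from what dedicated power-analysis software reports; the function name and the example rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Approximate sample size per group to detect p1 vs p2 (two-sided test)."""
    if p1 == p2:
        raise ValueError("effect size is zero; no finite sample size suffices")
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for the significance level
    z_power = z.inv_cdf(power)             # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical conversion rates: baseline 10%, hoped-for 12%.
print(sample_size_two_proportions(0.10, 0.12))
```

Smaller effects, stricter significance levels, and higher power all push the required sample size up.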

## Key Takeaways

- Experimental design and hypothesis testing are important for drawing valid conclusions from data.
- Resampling techniques, such as permutation tests, show how much of an observed result random variation alone could produce.
- While formal statistical inference is not always needed in data science, it's important to understand its principles.
- P-values should be used as one of several inputs to decision-making rather than the single determining factor.

## Appendix: Summary of Methods

### A/B Testing

1. Randomly assign subjects to either treatment or control groups.
2. Select a suitable metric to compare groups (e.g., conversion rate).
3. Collect data for the chosen metric in each group.
4. Compare the groups with a statistical test to determine whether the difference is statistically significant.

### Multi-Arm Bandit Algorithm

1. Start with a set of treatment options.
2. Use an algorithm to choose a treatment, balancing between exploration (trying new options) and exploitation (using known better options).
3. Update the allocation of treatment based on outcomes.
4. Continue until a desired level of optimization is achieved.

### Permutation Test

1. Combine all observed data into a single dataset.
2. Shuffle the combined dataset randomly.
3. Divide the shuffled data into groups that match the sample sizes of the original groups.
4. Calculate a test statistic on the resampled groups (e.g., difference of means).
5. Repeat steps 2-4 many times (e.g., 1000 times).
6. Calculate the p-value as the proportion of resampled test statistics that are at least as extreme as the observed test statistic.

### Analysis of Variance (ANOVA)

1. Combine all observed data into a single dataset.
2. Shuffle and draw random samples that match the original group sizes from the combined data.
3. Calculate group means and then the variance of the means.
4. Repeat steps 2-3 many times.
5. Calculate the p-value as the proportion of resamples in which the variance of the group means is at least as large as the variance of the group means observed in the actual experiment.
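
The steps above can be sketched as a permutation test on the variance of the group means. The function name, the default number of permutations, and the data are assumptions made for illustration.

```python
import random
from statistics import mean, variance


def anova_permutation(groups, n_permutations=1000, seed=0):
    """Return the observed variance of group means and a permutation p-value."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    observed = variance([mean(g) for g in groups])
    pooled = [x for g in groups for x in g]       # step 1: combine
    at_least_as_large = 0
    for _ in range(n_permutations):               # step 4: repeat
        rng.shuffle(pooled)                       # step 2: shuffle and redraw
        means, start = [], 0
        for size in sizes:
            means.append(mean(pooled[start:start + size]))
            start += size
        if variance(means) >= observed:           # step 3: variance of the means
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations   # step 5: p-value


# Hypothetical session times for three pages, for illustration only.
pages = [
    [164.0, 172.0, 177.0, 156.0, 195.0],
    [178.0, 191.0, 182.0, 185.0, 177.0],
    [175.0, 193.0, 171.0, 163.0, 176.0],
]
print(anova_permutation(pages))
```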

### Chi-Square Test

1. Create a contingency table of observed counts for different categories.
2. Compute the expected counts for each cell under the null hypothesis.
3. Calculate the Pearson residuals: (Observed - Expected) / sqrt(Expected).
4. Compute the chi-square statistic by summing the squared Pearson residuals.
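5. Determine the p-value by comparing the calculated chi-square statistic to a chi-square distribution with the appropriate degrees of freedom, (rows - 1) × (columns - 1) for a contingency table, or to a distribution obtained by resampling.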
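
For the special case of a 2x2 table, the reference distribution has one degree of freedom, and its upper tail can be computed with the standard library alone, because a chi-square variable with one degree of freedom is the square of a standard normal variable. The sketch below covers only that case; the function name is hypothetical, and tables with more degrees of freedom need a statistics library or resampling.

```python
from math import erfc, sqrt


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    Uses P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return erfc(sqrt(statistic / 2))


# Hypothetical statistic from a 2x2 table, for illustration only.
print(chi_square_p_value_df1(6.67))
```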

### Power and Sample Size Calculation

1. Specify the desired statistical power (the probability of detecting an effect).
2. Specify the significance level (alpha) for the test.
3. Specify the minimum effect size you want to detect.
4. Calculate the required sample size using statistical software or formulas.
