# Overview

This notebook provides five interactive demonstrations to explore the Central Limit Theorem and some strengths and weaknesses of nonparametric statistics.

They are:
1. The binomial convergence to normal
2. General sample mean convergence to normal
3. Sample mean convergence of the Cauchy distribution
4. Nonparametric vs parametric paired-sample tests for normal data
5. Nonparametric vs parametric unpaired tests for data with outliers

In [1]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

import demo_utils

sns.set_context('poster')
sns.set_style('ticks')

## Binomial convergence to normal

For $Y \sim \mathrm{Binom}(n, p)$, we can estimate
$$\hat{p} = Y/n$$
Since $Y = \sum_{i=1}^n X_i$, where the $X_i \sim \mathrm{Bernoulli}(p)$ are *iid*, the Central Limit Theorem implies that, for large $n$, approximately
$$ \hat{p} \sim N(\mu, \sigma^2)$$
where:
* $\mu = p$
* $\sigma^2 = pq/n$, with $q = 1 - p$
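
These two parameters follow from the Bernoulli moments $E[X_i] = p$ and $\mathrm{Var}(X_i) = p(1-p)$ together with the independence of the $X_i$. Writing $q = 1 - p$:
$$ E[\hat{p}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = p, \qquad \mathrm{Var}(\hat{p}) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{pq}{n}.$$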

This simulation allows you to explore this convergence as $n$ and $p$ vary. Does the rule of thumb that the normal approximation is adequate when $npq > 5$ seem right?

**Params**:
* *n*: number of samples to draw from the binomial
* *p*: probability of success under binomial

In [2]:
demo_utils.binomial_interactive();
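
The internals of `demo_utils.binomial_interactive` are not shown in this notebook. As a rough, non-interactive sketch of the same experiment, the code below simulates $\hat{p}$ with NumPy and compares its mean and standard deviation with the values predicted by $N(p, pq/n)$. The helper names `simulate_p_hat` and `normal_approx`, the seed, and the number of simulations are illustrative assumptions, not part of `demo_utils`.

```python
import numpy as np
from scipy import stats


def simulate_p_hat(n, p, n_sims=10_000, seed=0):
    """Draw n_sims binomial counts Y and return the estimates p_hat = Y / n."""
    rng = np.random.default_rng(seed)
    return rng.binomial(n, p, size=n_sims) / n


def normal_approx(n, p):
    """Return the CLT approximation N(p, pq/n) as a frozen SciPy distribution."""
    return stats.norm(loc=p, scale=np.sqrt(p * (1 - p) / n))


n, p = 50, 0.3
p_hat = simulate_p_hat(n, p)
approx = normal_approx(n, p)

print(f"npq = {n * p * (1 - p):.2f}")
print(f"simulated mean {p_hat.mean():.4f} vs p = {approx.mean():.4f}")
print(f"simulated sd   {p_hat.std(ddof=1):.4f} vs sqrt(pq/n) = {approx.std():.4f}")
```

Rerunning with $npq$ well below 5 (for example `n=10, p=0.05`) and plotting a histogram of `p_hat` against `approx.pdf` is one way to see where the approximation starts to break down.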

## The Central Limit Theorem for common distributions

The binomial is not the only distribution whose sample mean converges to a normal distribution. The Central Limit Theorem tells us that the suitably standardized sample mean of *iid* observations from any distribution with finite mean and variance converges in distribution to a normal distribution.

Suppose $\{X_1, X_2, \ldots, X_n\}$ are *iid* observations from a probability distribution with mean $\mu$ and finite variance $\sigma^2$.

Define
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.$$
Then, as $n \to \infty$, the standardized sample mean converges in distribution to a standard normal:
$$ \sqrt{n}\, \frac{\bar{X} - \mu}{\sigma} \xrightarrow{d} N(0, 1).$$

This simulation allows you to explore this convergence for a variety of common distributions, both continuous and discrete, by simulating $\bar{X}$ many times.

**Params**:
* *distribution*: your choice of distribution
* *outer_n*: number of times to simulate $\bar{X}$
* *inner_n*: number of observations to draw from the distribution to estimate each $\bar{X}$
* *density*: use kernel-density estimation (otherwise plot a histogram)

In [3]:
demo_utils.CLT_interactive();
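
As a hedged, non-interactive counterpart to the widget, the sketch below simulates the standardized statistic $\sqrt{n}(\bar{X} - \mu)/\sigma$ for a skewed distribution and reports its sample skewness, which is 0 for a standard normal. The function `standardized_means`, the choice of the Exponential(1) distribution, and the simulation sizes are assumptions made for illustration; the arguments `outer_n` and `inner_n` mirror the parameters described above but are not taken from `demo_utils`.

```python
import numpy as np
from scipy import stats


def standardized_means(draw, mu, sigma, inner_n, outer_n=5_000, seed=0):
    """Simulate outer_n sample means of inner_n draws each and standardize them.

    `draw(rng, size)` must return an array of iid observations with
    mean `mu` and standard deviation `sigma`.
    """
    rng = np.random.default_rng(seed)
    samples = draw(rng, (outer_n, inner_n))
    x_bar = samples.mean(axis=1)  # one sample mean per row
    return np.sqrt(inner_n) * (x_bar - mu) / sigma


def draw_exponential(rng, size):
    """Exponential(1): strongly right-skewed, mean 1, standard deviation 1."""
    return rng.exponential(scale=1.0, size=size)


for inner_n in (2, 10, 100):
    z = standardized_means(draw_exponential, mu=1.0, sigma=1.0, inner_n=inner_n)
    # The skewness should move toward 0 as inner_n grows.
    print(f"inner_n={inner_n:4d}  mean={z.mean():+.3f}  sd={z.std(ddof=1):.3f}  "
          f"skewness={stats.skew(z):.3f}")
```

Any other distribution with finite variance can be substituted by passing a different `draw` function together with its true `mu` and `sigma`.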

## Sample mean of the Cauchy distribution

The standard Cauchy distribution has the probability density function:
$$f(x) = \frac{1}{\pi(1+x^2)}.$$

Observations from this distribution have a peculiar property. Can you figure out what it is?

In [4]:
demo_utils.cauchy_CLT_interactive();
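
One way to investigate the question numerically, without the interactive widget, is to track how the spread of $\bar{X}$ changes as more observations are averaged. The sketch below uses the interquartile range as the measure of spread because it is robust to extreme values; the function name `iqr_of_sample_means` and the simulation sizes are illustrative assumptions. For a distribution that satisfies the Central Limit Theorem, this spread would shrink in proportion to $1/\sqrt{n}$.

```python
import numpy as np


def iqr_of_sample_means(inner_n, outer_n=10_000, seed=0):
    """Interquartile range of outer_n sample means of inner_n standard Cauchy draws."""
    rng = np.random.default_rng(seed)
    x_bar = rng.standard_cauchy(size=(outer_n, inner_n)).mean(axis=1)
    q25, q75 = np.percentile(x_bar, [25, 75])
    return q75 - q25


for inner_n in (1, 10, 100):
    print(f"inner_n={inner_n:4d}  IQR of sample means = {iqr_of_sample_means(inner_n):.3f}")
```

Compare the printed values across `inner_n`, then repeat the experiment with `rng.standard_normal` in place of `rng.standard_cauchy` and note the difference.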

## Behavior of the Signed-Rank test for paired observations

Nonparametric statistics are broadly applicable under very general assumptions. The cost of this generality is power. But how much power do we lose?

This simulation allows you to explore how the Wilcoxon Signed-Rank test performs compared to a paired t-test when the data truly are normally distributed.

**Params**:
* *mu1*: the mean of group 1
* *sigma1*: the standard deviation of group 1
* *mu2*: the mean of group 2
* *sigma2*: the standard deviation of group 2
* *n*: the number of paired observations

In [5]:
demo_utils.paired_sample_interactive();
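
The power comparison can also be sketched directly with SciPy. The code below estimates, by simulation, how often the paired t-test (`scipy.stats.ttest_rel`) and the Wilcoxon Signed-Rank test (`scipy.stats.wilcoxon`) reject at level `alpha`. It assumes the two groups are generated independently of each other, which may differ from what `demo_utils.paired_sample_interactive` does; the function name `paired_power`, `alpha`, and `n_sims` are likewise illustrative.

```python
import numpy as np
from scipy import stats


def paired_power(mu1, sigma1, mu2, sigma2, n, n_sims=1_000, alpha=0.05, seed=0):
    """Estimate rejection rates of the paired t-test and the signed-rank test.

    Assumption: the two groups are drawn independently, so the paired
    differences are N(mu1 - mu2, sigma1**2 + sigma2**2).
    """
    rng = np.random.default_rng(seed)
    reject_t = reject_w = 0
    for _ in range(n_sims):
        x = rng.normal(mu1, sigma1, size=n)
        y = rng.normal(mu2, sigma2, size=n)
        reject_t += int(stats.ttest_rel(x, y).pvalue < alpha)
        reject_w += int(stats.wilcoxon(x, y).pvalue < alpha)
    return reject_t / n_sims, reject_w / n_sims


power_t, power_w = paired_power(mu1=0.0, sigma1=1.0, mu2=1.0, sigma2=1.0, n=20)
print(f"paired t-test power:    {power_t:.3f}")
print(f"signed-rank test power: {power_w:.3f}")
```

Setting `mu1 == mu2` turns the same function into a check of the false-positive rate, which should stay near `alpha` for both tests.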

## Nonparametric vs parametric performance with outliers

Outliers, observations that arise from a different distribution than the other samples, are common in biomedical data. They can be due to technical artifacts or true biology.

Important questions to consider when analyzing data are: 1) do my data have outliers and 2) how robust are my statistical methods to outliers? 

In this simulation, you can compare the performance of the Wilcoxon Rank-Sum (Mann-Whitney U) test with that of a parametric test when outliers exist. Two sets of observations are drawn independently from the same distribution; a number of observations drawn from a different distribution are then added to one of the sets. Under which conditions should you use each technique?

**Params**:
* *n1*: number of observations in set 1
* *n2*: number of observations in set 2 drawn from the same distribution as set 1
* *n3*: number of outlier observations to add to set 2
* *mu*: the mean of the outlier distribution
* *sigma*: the standard deviation of the outlier distribution

In [6]:
demo_utils.outlier_interactive();
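
A minimal sketch of this experiment, under stated assumptions, is given below. The notebook does not say which shared distribution or which parametric test `demo_utils.outlier_interactive` uses; here the shared distribution is assumed to be standard normal, the outliers are assumed to be normal with mean `mu` and standard deviation `sigma`, and the parametric test is assumed to be the two-sample t-test (`scipy.stats.ttest_ind`). The function name `outlier_rejection_rates`, `alpha`, and `n_sims` are illustrative.

```python
import numpy as np
from scipy import stats


def outlier_rejection_rates(n1, n2, n3, mu, sigma, n_sims=1_000, alpha=0.05, seed=0):
    """Estimate how often the t-test and the rank-sum test reject at level alpha.

    Assumptions: both sets come from N(0, 1); n3 outliers drawn from
    N(mu, sigma**2) are appended to set 2.
    """
    rng = np.random.default_rng(seed)
    reject_t = reject_u = 0
    for _ in range(n_sims):
        set1 = rng.normal(0.0, 1.0, size=n1)
        set2 = np.concatenate([
            rng.normal(0.0, 1.0, size=n2),   # same distribution as set 1
            rng.normal(mu, sigma, size=n3),  # contaminating outliers
        ])
        reject_t += int(stats.ttest_ind(set1, set2).pvalue < alpha)
        reject_u += int(
            stats.mannwhitneyu(set1, set2, alternative="two-sided").pvalue < alpha
        )
    return reject_t / n_sims, reject_u / n_sims


rate_t, rate_u = outlier_rejection_rates(n1=30, n2=30, n3=3, mu=10.0, sigma=1.0)
print(f"t-test rejection rate:   {rate_t:.3f}")
print(f"rank-sum rejection rate: {rate_u:.3f}")
```

Because the non-outlier observations in both sets share one distribution, every rejection here is driven by the outliers alone; varying `n3`, `mu`, and `sigma` shows how sensitive each test is to that contamination.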