# Central Limit Theorem
I created this interactive guide to help you visualize the Central Limit Theorem. The importance of this theorem cannot be overstated. It will serve as the basis for our discussion of statistical significance for the final few class meetings. 

Don't worry - it is not necessary for you to understand the code below. Just follow the instructions for each step one at a time. I look forward to seeing you in class on Tuesday morning.


## Step 1
In this step, we’ll create two populations. The code in the cell below will do two things:

- Generate 25,000 random values for each population that represent some data. One population's distribution will be skewed and the other symmetrical. 
- Plot this data to visually confirm the shape of the distribution.

**To Do**: Run the code in the cell below (click on the cell and then select Run, Run Selected Cell from the menu above). Observe the histogram plots that appear. These plots show the distribution of the 25,000 values in each population.
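
If you are curious how populations like these can be generated, here is a minimal sketch. It assumes Population A is exponential (scale 2) and Population B is uniform on 0 to 10. Those parameters, the seed, and the names `rng` and `population_size` are illustrative assumptions, not necessarily the values used in class. The names `exp_population_data` and `uniform_population_data` match the Step 2 code. Instead of plotting, it prints the mean and median of each population as a quick numeric check on shape: in a right-skewed population the mean sits above the median, while in a symmetric one they are close.

```python
import numpy as np

# Assumed seed so the sketch gives the same values on every run
rng = np.random.default_rng(42)
population_size = 25_000

# Population A: skewed (exponential); the scale of 2.0 is an assumption
exp_population_data = rng.exponential(scale=2.0, size=population_size)

# Population B: symmetric (uniform); the range 0 to 10 is an assumption
uniform_population_data = rng.uniform(low=0.0, high=10.0, size=population_size)

# Compare the mean and median as a rough indicator of skew
for name, data in [("A (exponential)", exp_population_data),
                   ("B (uniform)", uniform_population_data)]:
    print(f"Population {name}: mean = {np.mean(data):.3f}, "
          f"median = {np.median(data):.3f}")
```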

## Step 2
We rarely have the ability to study an entire population. Instead, we rely on sample data to draw conclusions (inferences) about a population. In this step, the code below will select repeated samples from the two populations. Each sample consists of 30 values chosen at random from the 25,000 values in the population. The code computes the mean of each sample and repeats the process 5,000 times. The histogram plots that appear show the distribution of the sample means.


**To Do**: Run the code in the cell below. Observe the histogram plots that appear. Notice that both distributions of sample means appear approximately normal, regardless of the shape of the population distribution.

In [None]:
# Generate sample means for each population
exp_sample_means = [np.mean(np.random.choice(exp_population_data, sample_size)) for _ in range(num_samples)]
uniform_sample_means = [np.mean(np.random.choice(uniform_population_data, sample_size)) for _ in range(num_samples)]

# Set up side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle("Distributions of Sample Means", fontsize=16)

# Plot the histogram of sample means for the exponential distribution
axes[0].hist(exp_sample_means, bins=30, color='coral', edgecolor='black')
axes[0].set_title("Samples from Population A")
axes[0].set_xlabel("Sample Mean")
axes[0].set_ylabel("Frequency")

# Plot the histogram of sample means for the uniform distribution
axes[1].hist(uniform_sample_means, bins=30, color='skyblue', edgecolor='black')
axes[1].set_title("Samples from Population B")
axes[1].set_xlabel("Sample Mean")
axes[1].set_ylabel("Frequency")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## Step 3
What you observe is evidence of the Central Limit Theorem, which states:

**If the sample size is large enough, the distribution of sample means will be approximately normal regardless of the population distribution.** 

In this case, we used samples of size 30. That sample size is large enough in most cases, except for populations that are extremely skewed. With symmetric populations, a much smaller sample size would work. Since we do not always know the shape of the population distribution, a common rule of thumb is to choose sample sizes of 30 or more.
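
To see how sample size matters, the sketch below draws sample means from a skewed population at several sample sizes and measures how lopsided each distribution of sample means is. It is an illustration only: the exponential population, the seed, the chosen sample sizes, and the helper functions `skewness` and `draw_sample_means` are assumptions made for this example. A skewness near 0 indicates a symmetric, bell-like shape.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
# Assumed skewed population (exponential with scale 2.0)
population = rng.exponential(scale=2.0, size=25_000)

def skewness(values):
    """Return the skewness (mean cubed z-score) of a 1-D array."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

def draw_sample_means(population, sample_size, num_samples, rng):
    """Draw num_samples samples (with replacement) and return their means."""
    samples = rng.choice(population, size=(num_samples, sample_size))
    return samples.mean(axis=1)

skew_by_size = {}
for n in (2, 5, 30, 100):
    means = draw_sample_means(population, n, 5_000, rng)
    skew_by_size[n] = skewness(means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_size[n]:.2f}")
```

The printed skewness should shrink toward 0 as the sample size grows, which is the "large enough" condition in action.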

Interestingly, the shape of the distribution is just part of the story. The following equations are also true:

$$\mu_{\bar{\scriptscriptstyle X}} = \mu_{\scriptscriptstyle X}$$

and

$$\sigma_{\bar{\scriptscriptstyle X}} = \frac{\sigma_{\scriptscriptstyle X}}{\sqrt{n}}$$

Here's the English version:

1. The mean of the distribution of sample means is equal to the mean of the population.
2. The standard deviation of the distribution of sample means is equal to the standard deviation of the population divided by the square root of the sample size.

**To Do**: Run the code in the cell below. Notice how the means of the population and distribution of sample means are essentially equal. Now, divide the standard deviation of each population by the square root of the sample size (30) and compare your result to the standard deviation of the distribution of sample means.
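
For reference, the comparison described above can be sketched as follows. This is a self-contained illustration, not necessarily the class cell: the population parameters, the seed, and the names `populations` and `results` are assumptions chosen for this example, while the sample size of 30 and the 5,000 repetitions follow Step 2.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed
sample_size = 30
num_samples = 5_000

# Assumed populations: exponential (skewed) and uniform (symmetric)
populations = {
    "A (exponential)": rng.exponential(scale=2.0, size=25_000),
    "B (uniform)": rng.uniform(low=0.0, high=10.0, size=25_000),
}

results = {}
for name, population in populations.items():
    # Each row is one sample of 30 values drawn with replacement
    samples = rng.choice(population, size=(num_samples, sample_size))
    sample_means = samples.mean(axis=1)

    # Population standard deviation divided by the square root of n
    predicted_sd = population.std() / np.sqrt(sample_size)

    results[name] = (population.mean(), sample_means.mean(),
                     predicted_sd, sample_means.std())

    print(f"Population {name}")
    print(f"  population mean:           {population.mean():.3f}")
    print(f"  mean of sample means:      {sample_means.mean():.3f}")
    print(f"  population sd / sqrt(n):   {predicted_sd:.3f}")
    print(f"  sd of sample means:        {sample_means.std():.3f}")
```

In each pair, the two printed values should agree closely, though not exactly, because the 5,000 samples are themselves random.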