# Activity 5: Law of Large Numbers and Central Limit Theorem
In this activity, we will learn about:
1. Statistical simulation
2. Graphing
3. Understanding sampling bias


## Part 1: Simulated Survey Data

Imagine you are running a UX study for a new app feature. You survey users with the question:

> **"Were you satisfied with this product?"**  

This question has a **binary outcome** — users either say "Yes" (satisfied) or "No" (not satisfied).

We will start by generating a **simulated population dataset** of 1,000 users, where 70% of users are satisfied. This will represent the *true population* from which we will later draw samples.

Let's build up to this with a series of exercises on the `random.choice` function from the numpy library and the `DataFrame` constructor from the pandas library.


In [1]:
# Import numpy, pandas, matplotlib and seaborn

# Set a random seed 

# Let's learn about np.random.choice 
# Read documentation: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html 

# Generate a single random choice between 0 and 1 

# Generate a random choice between ['Satisfied','Not Satisfied']

# Generate 10 random choices between 0 and 1

# Generate 10 random choices between 0 and 1 where 1 is chosen 90% of the time 

# What kind of object does np.random.choice generate in python? 

# Let's learn about the DataFrame function from the pandas library
# Read documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

# Generate a pandas DataFrame composed of 10 random choices between 0 and 1 where 1 is chosen 60% of the time 

# What kind of object does DataFrame generate in python? 

# Generate a pandas DataFrame 
# * composed of 1000 random choices 
# * between 0 and 1 
# * where 1 is chosen 70% of the time 
# * with a single column called "satisfaction" 
# To do this, first define variables for 'population size' and 'satisfaction probability'
# Call the dataset "population"

# Inspect the first 10 rows of this dataframe 

# Inspect 10 rows in the middle of this dataframe

# Print the length of this dataframe using a formatted string 
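
One possible way to work through the steps above is sketched here. The seed value `42`, the variable names `population_size` and `satisfaction_probability`, and the choice of rows 500 to 509 as "the middle" are illustrative assumptions, not requirements of the activity.

```python
import numpy as np
import pandas as pd

# Seed NumPy's global random generator so the draws are reproducible.
np.random.seed(42)  # 42 is an arbitrary choice

# A single draw from {0, 1}; with no size argument the result is a scalar.
single = np.random.choice([0, 1])

# Drawing from a list of labels works the same way.
label = np.random.choice(["Satisfied", "Not Satisfied"])

# Ten draws where 1 is chosen 90% of the time; p lines up with the options.
ten_draws = np.random.choice([0, 1], size=10, p=[0.1, 0.9])
print(type(ten_draws))  # with a size argument the result is a numpy.ndarray

# Population: 1 = satisfied, 0 = not satisfied.
population_size = 1000
satisfaction_probability = 0.7
population = pd.DataFrame({
    "satisfaction": np.random.choice(
        [0, 1],
        size=population_size,
        p=[1 - satisfaction_probability, satisfaction_probability],
    )
})
print(type(population))

print(population.head(10))       # first 10 rows
print(population.iloc[500:510])  # 10 rows from the middle
print(f"The population has {len(population)} rows.")
```

Because the draws are random, the share of satisfied users in `population` will be close to 70% but usually not exactly 70%.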


## Part 2: Sampling from the Population

In real research, we usually **don’t have data from the entire population**.  
Instead, we take a **sample** — a smaller subset of people — and try to use it to learn about the population.

Let’s see what happens when we draw random samples from our `population` DataFrame.

We’ll start small (10 people), and then try larger samples (like 100 or 500 people) to see how the sample average compares to the true population average.


In [2]:
# Calculate and print the true population mean, rounding to 2 digits. 

# Let's use the .sample function from the pandas library to sample 10 observations
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html 

# Use .sample and .mean to compute the mean of a sample of 20 observations

# Let's write a function that:
# - takes two arguments: a one column dataset and sample size 
# - and returns: the mean

# Run the function to get the mean of a sample of 30 observations
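
A minimal sketch of the sampling steps follows. It rebuilds a population so that it runs on its own; the function name `sample_mean` and the seed are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for reproducibility
population = pd.DataFrame(
    {"satisfaction": np.random.choice([0, 1], size=1000, p=[0.3, 0.7])}
)

# True population mean, rounded to 2 digits in the printed string.
true_mean = population["satisfaction"].mean()
print(f"True population mean: {true_mean:.2f}")

# Mean of a sample of 20 observations.
print(population.sample(n=20)["satisfaction"].mean())


def sample_mean(data, sample_size):
    """Return the mean of a random sample drawn from a one-column DataFrame."""
    # DataFrame.sample draws rows without replacement by default.
    sample = data.sample(n=sample_size)
    # .mean() on a DataFrame returns a Series; take the single column's value.
    return sample.mean().iloc[0]


print(sample_mean(population, 30))
```

Running the last line several times should give a different value each time, which is the sampling variability the next parts explore.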


## Part 3: Visualizing the Law of Large Numbers

The **Law of Large Numbers** says that as we take larger and larger samples from a population,  
our sample mean should get closer to the *true population mean.*

Let’s see this visually.

We’ll:
1. Take samples of increasing size (10, 20, 30, … up to 1000).
2. Compute the mean for each sample.
3. Plot those means to see how they behave as the sample grows.


In [3]:
# Create a list of sample sizes from 10 to 1000, going by 10, using the range() function
### Documentation: https://docs.python.org/3/library/stdtypes.html#typesseq-range 

# Inspect the 9th item in the list 

### Aside: List Comprehension 
# Create a range from 1 to 10 

# Create a for loop to print 10 * each value in the range

# Create a list containing 10 times each value in the range using "list comprehension"

# Print the 7th element of the list 

### Aside Concluded ### 
# Using List Comprehension, apply the custom sampling function from Part 2 to each sample size 

# Create a plot with: 
# - x axis: sample size
# - y axis: sample mean
# - red dashed line for population mean (axhline)
# - labels on the x and y axes 
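
The steps in this part could look roughly like the sketch below. It rebuilds a population so that it runs on its own; the seed and variable names are assumptions, and it samples the column directly rather than calling a helper function.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for reproducibility
population = pd.DataFrame(
    {"satisfaction": np.random.choice([0, 1], size=1000, p=[0.3, 0.7])}
)
true_mean = population["satisfaction"].mean()

# range() excludes its stop value, so 1001 is needed to include 1000.
sample_sizes = list(range(10, 1001, 10))
print(sample_sizes[8])  # 9th item (index 8): 90

# Aside: a for loop and the equivalent list comprehension.
values = range(1, 11)
for v in values:
    print(10 * v)
times_ten = [10 * v for v in values]
print(times_ten[6])  # 7th element (index 6): 70

# One sample mean per sample size.
sample_means = [
    population["satisfaction"].sample(n=n).mean() for n in sample_sizes
]

# Sample size on the x axis, sample mean on the y axis.
fig, ax = plt.subplots()
ax.plot(sample_sizes, sample_means)
ax.axhline(true_mean, color="red", linestyle="--", label="Population mean")
ax.set_xlabel("Sample size")
ax.set_ylabel("Sample mean")
ax.legend()
```

Since `.sample` draws without replacement by default, the final sample of 1000 is the whole population, so the last point sits on the red line.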


## Part 4: The Central Limit Theorem (CLT)

The **Central Limit Theorem (CLT)** goes one step further.

It says that if we take **many random samples** from a population and compute the **mean** of each sample,  
those means will follow an approximately **normal (bell-shaped) distribution** when each sample is large enough, even if the original data are not normal.

Let’s see this in action by:
1. Repeatedly taking samples of size 50 from our population.
2. Storing the mean of each sample.
3. Plotting the distribution of those sample means.


In [4]:
# Using List Comprehension and range(), compute the means of 10 samples of size 20. Print these with a for loop. 

# Use the same method to create a dataset of 3000 sample means, each computed from a sample of 50 observations 

# Convert to pandas dataframe

# Create a basic histogram
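
A sketch of the simulation, under the same assumptions as before (a rebuilt population, an arbitrary seed, and hypothetical names such as `means` and `sample_mean`):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for reproducibility
population = pd.DataFrame(
    {"satisfaction": np.random.choice([0, 1], size=1000, p=[0.3, 0.7])}
)

# Means of 10 samples of size 20, printed one per line.
small_batch = [population["satisfaction"].sample(n=20).mean() for _ in range(10)]
for m in small_batch:
    print(round(m, 2))

# 3000 sample means, each from a sample of 50 observations.
n_samples = 3000
sample_size = 50
means = pd.DataFrame({
    "sample_mean": [
        population["satisfaction"].sample(n=sample_size).mean()
        for _ in range(n_samples)
    ]
})

# Basic histogram of the sample means (pandas plots through matplotlib).
ax = means["sample_mean"].plot.hist(bins=20)
ax.set_xlabel("Sample mean")
```

With samples of 50 binary answers, every mean is a multiple of 0.02, so the number of bins affects how smooth the histogram looks.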


## Part 5: A 95% Confidence Interval

According to the **Central Limit Theorem (CLT)**, the sample means are approximately normal with:

$$
\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
$$

where:

- $\mu_{\bar{X}}$ is the mean of sample means (should equal the population mean)  
- $\sigma_{\bar{X}}$ is the **standard error (SE)**  
- $\sigma$ is the population standard deviation  
- $n$ is the sample size used to compute each mean  

If the distribution of sample means is normal, then about **95% of sample means** should fall within:

$$
\mu_{\bar{X}} \pm 1.96 \times SE
$$

Let's apply this formula in practice.
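
As a quick hand calculation first, assume the population proportion is exactly $p = 0.7$ and each mean uses $n = 50$ observations. For a 0/1 variable the standard deviation is $\sigma = \sqrt{p(1-p)}$, so:

$$
\sigma = \sqrt{0.7 \times 0.3} \approx 0.458, \qquad SE = \frac{0.458}{\sqrt{50}} \approx 0.0648
$$

$$
0.70 \pm 1.96 \times 0.0648 \approx 0.70 \pm 0.127 \quad \Rightarrow \quad (0.573,\ 0.827)
$$

The simulated population will have a proportion slightly different from 0.7, so the values computed in code will differ a little from these.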

In [5]:
# Compute the true standard deviation in the population 

# Compute the standard error 

# Compute the upper/lower bounds of the 95% confidence interval 

# Add Confidence Intervals to the Plot using axvline 

# Create a subset of the means dataset containing the means that fall within the 95% confidence interval

# Numerically, what percentage of means fall within the 95% confidence interval? 
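
The sketch below shows one way these steps might be carried out. It rebuilds the population and the 3000 sample means so that it runs on its own; the seed and all variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for reproducibility
population = pd.DataFrame(
    {"satisfaction": np.random.choice([0, 1], size=1000, p=[0.3, 0.7])}
)
sample_size = 50
means = pd.DataFrame({
    "sample_mean": [
        population["satisfaction"].sample(n=sample_size).mean()
        for _ in range(3000)
    ]
})

mu = population["satisfaction"].mean()
# pandas .std() defaults to ddof=1; ddof=0 gives the population SD.
sigma = population["satisfaction"].std(ddof=0)
se = sigma / np.sqrt(sample_size)

lower = mu - 1.96 * se
upper = mu + 1.96 * se
print(f"95% interval: ({lower:.3f}, {upper:.3f})")

# Histogram of sample means with the interval bounds as vertical lines.
ax = means["sample_mean"].plot.hist(bins=20)
ax.axvline(lower, color="red", linestyle="--")
ax.axvline(upper, color="red", linestyle="--")

# Subset of means inside the interval, and its share of all means.
within = means[(means["sample_mean"] >= lower) & (means["sample_mean"] <= upper)]
share = len(within) / len(means)
print(f"{share:.1%} of sample means fall within the interval")
```

The share should land near 95% but rarely on it exactly: the sample means take only discrete values, and sampling without replacement from a finite population makes the true standard error slightly smaller than $\sigma / \sqrt{n}$.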
