Let's break down confidence intervals with opinion polls simply and then dive into some Python fun!

What's a Confidence Interval? 🤔
Imagine you want to know how many high schoolers prefer pizza over burgers. You can't ask every single high schooler in the world, right? That would take forever! So, you take a sample – maybe ask 100 high schoolers.

Let's say in your sample of 100, 60% say they prefer pizza. Now, you might think, "Great! 60% of all high schoolers prefer pizza." But is it exactly 60%? Probably not. Your sample is just a snapshot.

A confidence interval is like saying, "We're pretty sure the true percentage of all high schoolers who prefer pizza is somewhere between X% and Y%." It gives you a range, not just a single number.

Think of it like this: You throw a dart at a dartboard where the bullseye (the true value) is hidden. You probably won't hit it exactly, but you can draw a circle around wherever your dart lands and say, "If I throw like this many times, my circle will cover the bullseye on about 95% of throws." That circle around your dart is your confidence interval.

So, for our pizza example, we might say, "We're 95% confident that between 50% and 70% of all high schoolers prefer pizza." This means if we took many, many samples, 95% of the confidence intervals we calculated would contain the true percentage of pizza lovers.
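To see where a range like 50% to 70% comes from, here is a minimal sketch of the arithmetic for the 60-out-of-100 sample. It assumes the common normal-approximation formula for a proportion and a z-score of 1.96 for 95% confidence; the variable names are just for illustration.

```python
import math

n = 100          # students polled
p_hat = 0.60     # share of the sample who prefer pizza
z = 1.96         # z-score for 95% confidence (normal approximation)

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z * standard_error

lower = p_hat - margin_of_error
upper = p_hat + margin_of_error
print(f"95% CI: {lower:.1%} to {upper:.1%}")  # roughly 50% to 70%
```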

Key takeaway: A confidence interval gives us a range where we're pretty sure the true value lies, based on our sample.

In [None]:
! pip install statsmodels

In [2]:
# Python Example: Tracking Pizza Preference Over Time 🍕📈

import numpy as np
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# --- Simulate our data ---
# Imagine the true preference for pizza changes a little over time
true_preference = 0.60 + 0.05 * np.sin(np.linspace(0, 2 * np.pi, 12)) # A little wiggle
sample_size = 500 # Number of high schoolers we poll each month

# Simulate monthly poll results
# Number of high schoolers who prefer pizza
pizza_lovers = np.random.binomial(n=sample_size, p=true_preference, size=12)
# Percentage of high schoolers who prefer pizza in our sample
sample_proportion = pizza_lovers / sample_size

# --- Calculate Confidence Intervals ---
# We'll use a 95% confidence level (kept for reference; the interval below is driven by z_score)
confidence_level = 0.95

# Calculate the standard error for proportions
# For a proportion p and sample size n, SE = sqrt(p*(1-p)/n)
standard_error = np.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)

# We use a Z-score for a 95% confidence level (approx 1.96)
z_score = 1.96 # This value comes from the standard normal distribution

# Calculate lower and upper bounds of the confidence interval
margin_of_error = z_score * standard_error
lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

# --- Plotting the results ---
months = np.arange(1, 13)

plt.figure(figsize=(10, 6))
plt.plot(months, sample_proportion, marker='o', linestyle='-', color='blue', label='Sampled Pizza Preference')
plt.fill_between(months, lower_bound, upper_bound, color='lightblue', alpha=0.4, label='95% Confidence Interval')
plt.plot(months, true_preference, linestyle='--', color='red', label='True (Simulated) Pizza Preference') # The true value we're trying to estimate

plt.title('Monthly Pizza Preference Polls with 95% Confidence Interval', fontsize=14)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Proportion Preferring Pizza', fontsize=12)
plt.xticks(months)
plt.ylim(0.45, 0.75) # Set limits for better visualization
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(fontsize=10)
plt.tight_layout()
plt.show()

Explanation of the Code:

Simulating Data: We create true_preference, the "real" share of high schoolers who prefer pizza each month (in a real poll this would be unknown to us). Then we simulate polling sample_size students each month: pizza_lovers counts how many of them chose pizza, and dividing that by sample_size gives our sample_proportion.

Calculating Confidence Interval:

We use a confidence_level of 95%, which is very common.

The z_score of 1.96 is the standard value for 95% confidence: in a normal distribution, 95% of values fall within 1.96 standard deviations of the mean, so we go 1.96 standard errors on either side of our sample proportion.

We calculate the lower_bound and upper_bound as sample_proportion ± z_score × standard error, where the standard error is sqrt(p × (1 - p) / n), with p the sample_proportion and n the sample_size. The sample proportion is our estimate of the true proportion, and the standard error measures how much that estimate would vary from one sample to the next.
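If you're wondering where 1.96 comes from, here is a small sketch that looks it up instead of hard-coding it. It uses `NormalDist` from Python's standard `statistics` module; the helper name `z_for_confidence` is made up for this example and is not part of the notebook's code.

```python
from statistics import NormalDist

def z_for_confidence(confidence_level):
    """Two-sided z-score for a given confidence level (e.g. 0.95)."""
    tail = (1 - confidence_level) / 2      # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)  # standard normal quantile

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_for_confidence(level):.3f}")
```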

Plotting: The blue line shows what our poll actually found each month. The light blue shaded area is our 95% confidence interval for each month. The dashed red line is the "true" preference (which we only know because we simulated it!). Notice how most of the time, the true preference falls within our confidence interval. That's the interval doing its job: at a 95% level, we expect it to miss the true value in roughly 1 poll out of 20, so an occasional miss over 12 months is normal.
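Twelve months is too few polls to really see the "95%" in action, so here is a rough sketch that repeats the poll many times and counts how often the interval contains the true value. It assumes a fixed true preference of 0.60 and uses only the standard library; the number of simulated polls (2000) is an arbitrary choice.

```python
import math
import random

random.seed(0)       # for reproducibility

true_p = 0.60        # the "true" preference (known only because we simulate)
sample_size = 500    # students per poll
n_polls = 2000       # how many independent polls to simulate
z = 1.96             # z-score for 95% confidence

hits = 0
for _ in range(n_polls):
    # Simulate one poll: each student prefers pizza with probability true_p
    pizza_lovers = sum(random.random() < true_p for _ in range(sample_size))
    p_hat = pizza_lovers / sample_size
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    # Does this poll's interval contain the true value?
    if p_hat - margin <= true_p <= p_hat + margin:
        hits += 1

coverage = hits / n_polls
print(f"{coverage:.1%} of intervals contained the true value")
```

The printed share should land close to 95%, though it will wobble a little with the seed and the number of polls.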

In [5]:
# Hands-On Assignments 

# Confidence Interval Assignment 🚀

These assignments will help you get hands-on experience with confidence intervals using the provided Python code. Remember to run the Python code first to generate the plot before starting the assignments.

---

## Assignment 1: The Sample Size Effect 🤔

**Goal:** Understand how the number of people you poll affects your confidence interval.

1.  **Run the Python code above.** Observe the **width** of the light blue shaded area (the confidence interval).
2.  **Change `sample_size` to `100`.** Rerun the code. What happens to the width of the confidence interval? Is it wider or narrower?
3.  **Change `sample_size` to `1000`.** Rerun the code. What happens to the width of the confidence interval now?
4.  **Discuss:** Why do you think increasing the sample size makes the confidence interval narrower? (Hint: Think about how much more certain you'd be if you asked 1000 people versus 100 people about their pizza preference.)
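
One way to check what you saw in the plot is to compute the margin of error directly. This is a minimal sketch assuming a sample proportion of 0.60 and the same normal-approximation formula as the notebook; `margin_of_error` is a helper name invented for this example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.60  # illustrative sample proportion
for n in (100, 500, 1000):
    print(f"n = {n:4d}: +/- {margin_of_error(p_hat, n):.3f}")
```

Because `n` sits under a square root, quadrupling the sample size only halves the margin of error.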

---



## Assignment 2: The Confidence Level Effect 🎯

**Goal:** Understand what the "95%" in "95% confident" means.

1.  **Run the original Python code** with `sample_size = 500` and `confidence_level = 0.95`. Note how many of the red dashed line points fall *outside* the blue shaded area.
2.  **Change `z_score` to `2.58`** (which corresponds to a **99% confidence level**). Rerun the code. Observe the interval width.
3.  **Change `z_score` to `1.645`** (which corresponds to a **90% confidence level**). Rerun the code. Observe the interval width.
4.  **Discuss:**
    * When you **increased** the confidence level (e.g., to 99%), what happened to the width of the confidence interval? Why does being *more* confident require a *wider* range?
    * When you **decreased** the confidence level (e.g., to 90%), what happened to the width? What's the **trade-off** here? Would you rather be 99% confident with a very wide range, or 90% confident with a narrower range?
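
To put numbers on the trade-off, here is a short sketch that compares interval widths at the three confidence levels. It assumes an illustrative poll result of 0.60 from 500 students and the usual approximate z-scores for each level.

```python
import math

p_hat, n = 0.60, 500  # illustrative poll result
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

# Approximate z-scores for each confidence level
z_scores = {"90%": 1.645, "95%": 1.96, "99%": 2.58}

widths = {}
for level, z in z_scores.items():
    widths[level] = 2 * z * standard_error   # full width of the interval
    print(f"{level} interval width: {widths[level]:.3f}")
```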

---

## Assignment 3: Real-World Poll Analysis 📰

**Goal:** Apply your understanding to a real-world scenario (even if simulated).

1.  Imagine you are a news reporter and you just received the `sample_proportion` and the `lower_bound`/`upper_bound` for *one* month (let's say Month 7 from our simulated data).
2.  **Pick a specific month** (e.g., Month 7) from the output of your Python code.
    * What was the **`sample_proportion`** for that month?
    * What was the **95% confidence interval** for that month? (The plot only shows these values as a shaded band, so print `lower_bound` and `upper_bound` for that month to read the exact numbers. Arrays are zero-indexed, so Month 7 is index `6`.)
3.  **Write a short news headline and a one-sentence news report** based *only* on the **confidence interval** for that month. For example, if your interval was `[0.55, 0.65]`, you might write:

    * **Headline:** "New Poll Shows Majority of High Schoolers Favor Pizza!"
    * **Report:** "A recent survey indicates that between 55% and 65% of high school students prefer pizza, with 95% confidence."

4.  **Reflect:** How does the confidence interval give a more **complete picture** than just reporting the `sample_proportion` alone? Why is this important for **accurate reporting**?
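
If you'd like the numbers for one month in a reporter-friendly form, a small helper like the one below could do it. This is only a sketch: `report_month` is a hypothetical function, and the values passed to it here are made up for illustration rather than taken from the notebook's output.

```python
import math

Z_95 = 1.96  # z-score for 95% confidence

def report_month(month, p_hat, n):
    """Return a one-line summary of a poll result with its 95% interval."""
    margin = Z_95 * math.sqrt(p_hat * (1 - p_hat) / n)
    lower, upper = p_hat - margin, p_hat + margin
    return (f"Month {month}: sample proportion {p_hat:.1%}, "
            f"95% CI [{lower:.1%}, {upper:.1%}]")

# Hypothetical numbers for illustration, not the notebook's actual output
print(report_month(month=7, p_hat=0.58, n=500))
```

In the notebook, you could call it with your own data, for example `report_month(7, sample_proportion[6], sample_size)`.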