# Week 2 Lab: Probability & Statistical Testing

**Estimated Time:** 30-60 minutes  
**Objective:** Apply probability concepts and statistical testing to real-world Philippine data scenarios.

In this lab, you will:
- Simulate probability scenarios using Python
- Perform basic statistical tests
- Analyze Philippine demographic data with statistical methods

---

## Setup

Run this cell first to import required libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("✓ Libraries imported successfully!")

---
## Part 1: Probability Simulation

**Background:** Probability helps us quantify uncertainty. Let's simulate a real-world scenario involving Philippine elections.

**Scenario:** In a barangay with 1000 registered voters:
- 60% support Candidate A
- 40% support Candidate B

We want to understand how closely the share of Candidate A supporters in a random sample of 100 voters tracks the true population share of 60%.
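As a point of reference for the simulation, the expected spread of a sample proportion can also be worked out analytically. The sketch below is illustrative and not part of the exercises. It assumes sampling without replacement (hence the finite population correction) and a normal approximation, and the 5-percentage-point `margin` is an arbitrary value chosen for this example.

```python
import math
from scipy import stats

N = 1000   # registered voters in the barangay
n = 100    # sample size
p = 0.60   # true share supporting Candidate A

# Standard error of a sample proportion, ignoring the finite population
se = math.sqrt(p * (1 - p) / n)

# Finite population correction: relevant here because n is 10% of N
fpc = math.sqrt((N - n) / (N - 1))
se_fpc = se * fpc

# Normal-approximation probability that the sample proportion lands
# within +/- margin of p (margin is an assumed, illustrative value)
margin = 0.05
z = margin / se_fpc
prob_within = stats.norm.cdf(z) - stats.norm.cdf(-z)

print(f"SE without correction: {se:.4f}")
print(f"SE with correction:    {se_fpc:.4f}")
print(f"P(|p_hat - p| <= {margin:.2f}) is roughly {prob_within:.3f}")
```

The simulated standard deviation from Exercise 1.2 can be compared against these values as a sanity check.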

### Exercise 1.1: Simulate Single Sample

**Task:** Simulate drawing a random sample of 100 voters. Calculate the proportion that supports Candidate A.

In [None]:
# Population parameters
population_size = 1000
p_candidate_a = 0.60
sample_size = 100

# TODO: Create a population array where 60% support A (1) and 40% support B (0)
# Hint: Use np.random.choice() or create an array with the right proportions
population = None  # Your code here

# TODO: Draw a random sample of 100 voters from the population
sample = None  # Your code here

# TODO: Calculate the proportion supporting Candidate A in the sample
sample_proportion = None  # Your code here

print(f"Sample proportion for Candidate A: {sample_proportion:.2%}")
print(f"Difference from population: {abs(sample_proportion - p_candidate_a):.2%}")

### Exercise 1.2: Simulate Multiple Samples (Sampling Distribution)

**Task:** Repeat the sampling process 1000 times to understand the distribution of sample proportions.
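The general pattern (draw a sample, record a statistic, repeat) can be sketched on a different scenario, leaving the exercise itself open. The example below uses fair coin flips rather than the voter data. The names `rng`, `n_flips`, and `n_repeats` are illustrative choices, and it uses a local `np.random.default_rng` generator instead of the global seed set in Setup.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator; seed chosen arbitrarily
n_flips = 50                     # flips per sample (illustrative)
n_repeats = 2000                 # number of repeated samples (illustrative)

# Each row is one sample of coin flips: 1 = heads, 0 = tails
flips = rng.integers(0, 2, size=(n_repeats, n_flips))

# Proportion of heads in each sample (one value per row)
head_proportions = flips.mean(axis=1)

# Theoretical standard error of a proportion with p = 0.5
theoretical_se = np.sqrt(0.5 * 0.5 / n_flips)

print(f"Mean of proportions: {head_proportions.mean():.3f}")
print(f"Simulated SD:        {head_proportions.std():.3f}")
print(f"Theoretical SE:      {theoretical_se:.3f}")
```

Drawing all samples as one 2-D array avoids an explicit loop. A `for` loop that appends to a list, as the exercise scaffold suggests, is equally valid and often easier to read.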

In [None]:
# TODO: Simulate 1000 samples and store their proportions
n_simulations = 1000
sample_proportions = []

# Your code here (loop 1000 times, each time draw a sample and calculate proportion)

# TODO: Convert to numpy array and plot histogram
sample_proportions = np.array(sample_proportions)

plt.figure(figsize=(10, 6))
plt.hist(sample_proportions, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(p_candidate_a, color='red', linestyle='--', linewidth=2, label=f'True Population ({p_candidate_a:.0%})')
plt.axvline(sample_proportions.mean(), color='green', linestyle='--', linewidth=2, label=f'Sample Mean ({sample_proportions.mean():.2%})')
plt.xlabel('Sample Proportion for Candidate A')
plt.ylabel('Frequency')
plt.title('Sampling Distribution of Proportions (n=100, 1000 simulations)')
plt.legend()
plt.show()

print(f"Mean of sample proportions: {sample_proportions.mean():.4f}")
print(f"Standard deviation: {sample_proportions.std():.4f}")

---
## Part 2: Statistical Testing

**Background:** Let's test a hypothesis about Philippine household income using statistical inference.

**Scenario:** A researcher claims that the average monthly household income in Metro Manila is ₱35,000. We have sample data to test this claim.

### Exercise 2.1: Create Sample Data

In [None]:
# Simulated household income data (in thousands of pesos)
# Sample of 50 households
household_incomes = np.array([
    32, 28, 45, 38, 25, 42, 36, 29, 48, 31,
    35, 40, 33, 27, 44, 37, 30, 39, 34, 41,
    26, 43, 36, 32, 38, 29, 47, 35, 31, 40,
    33, 28, 42, 37, 30, 45, 34, 39, 32, 41,
    27, 44, 36, 31, 38, 29, 43, 35, 33, 40
])

# TODO: Calculate basic statistics
sample_mean = None  # Your code here
sample_std = None   # Your code here (hint: pass ddof=1 for the sample standard deviation)
sample_size = None  # Your code here

print(f"Sample Size: {sample_size}")
print(f"Sample Mean: ₱{sample_mean:.2f}k")
print(f"Sample Std Dev: ₱{sample_std:.2f}k")

### Exercise 2.2: Perform One-Sample t-test

**Task:** Test H₀: μ = ₱35,000 against H₁: μ ≠ ₱35,000 at significance level α = 0.05. The data are recorded in thousands of pesos, so the test value is 35.
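For reference, the one-sample t-statistic is t = (x̄ − μ₀) / (s / √n), where s is the sample standard deviation computed with n − 1 in the denominator. The sketch below checks that formula against `stats.ttest_1samp` on a small made-up array. `toy_data` and `mu_0` are hypothetical values, not the lab data. Note that `np.std` defaults to `ddof=0`, so `ddof=1` has to be passed explicitly.

```python
import numpy as np
from scipy import stats

toy_data = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])  # hypothetical values
mu_0 = 12.0                                                # hypothetical null mean

n = toy_data.size
x_bar = toy_data.mean()
s = toy_data.std(ddof=1)   # sample standard deviation (n - 1 denominator)

# t-statistic from the formula
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test using SciPy
t_scipy, p_scipy = stats.ttest_1samp(toy_data, popmean=mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two lines should agree, which confirms that `ttest_1samp` performs a two-sided test by default.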

In [None]:
# TODO: Perform one-sample t-test
# Hint: Use stats.ttest_1samp()
claimed_mean = 35
t_statistic, p_value = None, None  # Your code here

print(f"Claimed population mean: ₱{claimed_mean}k")
print(f"Sample mean: ₱{sample_mean:.2f}k")
print(f"\nt-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"\nSignificance level: α = 0.05")

# TODO: Interpret the results
if p_value < 0.05:
    print("Result: REJECT the null hypothesis")
    print(f"Conclusion: The average household income is significantly different from ₱{claimed_mean}k")
else:
    print("Result: FAIL TO REJECT the null hypothesis")
    print(f"Conclusion: Not enough evidence to say the average differs from ₱{claimed_mean}k")

---
## Part 3: Comparing Two Groups

**Background:** Let's compare household incomes between two cities in Metro Manila.
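When comparing two groups, `stats.ttest_ind` assumes equal variances by default. Passing `equal_var=False` runs Welch's t-test, which drops that assumption. The sketch below runs both variants on made-up arrays to show that they can give different results when the spreads differ. `group_a` and `group_b` are hypothetical, not the lab data.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with similar means but very different spreads
group_a = np.array([20.0, 22.0, 21.0, 23.0, 19.0, 22.0, 21.0, 20.0])
group_b = np.array([18.0, 30.0, 12.0, 27.0, 15.0])

# Sample variances (n - 1 denominator) to see how unequal the spreads are
print(f"Variance A: {group_a.var(ddof=1):.2f}")
print(f"Variance B: {group_b.var(ddof=1):.2f}")

# Student's t-test: pools the two variances (SciPy default)
pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Pooled: t = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
print(f"Welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

Comparing the two sample variances before the test is a quick way to decide which variant is more defensible for a given dataset.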

### Exercise 3.1: Two-Sample t-test

In [None]:
# Household income data (in thousands)
makati_incomes = np.array([45, 52, 48, 55, 43, 50, 47, 54, 46, 51, 49, 53, 44, 56, 48])
mandaluyong_incomes = np.array([38, 42, 36, 44, 35, 40, 39, 43, 37, 41, 38, 42, 36, 45, 39])

# TODO: Calculate means for both groups
makati_mean = None  # Your code here
mandaluyong_mean = None  # Your code here

print(f"Makati average income: ₱{makati_mean:.2f}k")
print(f"Mandaluyong average income: ₱{mandaluyong_mean:.2f}k")
print(f"Difference: ₱{makati_mean - mandaluyong_mean:.2f}k")

# TODO: Perform two-sample t-test
# Hint: Use stats.ttest_ind()
t_stat, p_val = None, None  # Your code here

print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_val:.4f}")

if p_val < 0.05:
    print("\n✓ The income difference is statistically significant")
else:
    print("\n✗ The income difference is NOT statistically significant")

### Exercise 3.2: Visualize the Comparison

In [None]:
# TODO: Create a box plot comparing the two cities
# Hint: Use plt.boxplot() or sns.boxplot()

plt.figure(figsize=(10, 6))
# Your code to create the box plot here
plt.ylabel('Monthly Income (₱ thousands)')
plt.title('Household Income Comparison: Makati vs Mandaluyong')
plt.show()

---
## Reflection Questions

Answer these questions in markdown cells below:

1. **Sampling Distribution:** Based on Exercise 1.2, does the mean of the sample proportions approach the population proportion as the number of simulations increases? What does this tell us about sampling?

2. **Statistical Significance:** In Exercise 2.2, what does a small p-value (< 0.05) really mean? How should we interpret "statistical significance" in real-world decision-making?

3. **Philippine Context:** How could these statistical methods help organizations like the Philippine Statistics Authority (PSA) or local government units make data-driven decisions?

### Your Answer to Question 1:

[Your answer here]

### Your Answer to Question 2:

[Your answer here]

### Your Answer to Question 3:

[Your answer here]

---

## 🎯 Congratulations!

You've completed Week 2 Lab on Probability & Statistical Testing.

**Key Takeaways:**
- Probability helps us quantify uncertainty in data
- Sampling distributions show us how sample statistics vary
- Statistical tests help us judge whether an observed difference could plausibly be due to chance
- Philippine data provides rich opportunities for statistical analysis

**Remember:** This lab is for practice. Check the **solution notebook** if you get stuck!

**Next Steps:**
- Review your lecture notes on Week 2
- Try modifying the exercises with different parameters
- Explore other Philippine datasets that interest you

---

*CMSC 178DA - Data Analytics | University of the Philippines Cebu*