
## Scenario

A local coffee shop advertises that its **average customer wait time** at the counter is **5 minutes**. You decide to measure wait times for a sample of customers to verify whether the coffee shop's claim holds true.

- **Null Hypothesis ($H_0$)**: The true mean wait time is 5 minutes.
- **Alternative Hypothesis ($H_1$)**: The true mean wait time is not 5 minutes (it could be higher or lower).

You collect data by timing **50** randomly selected customers from the moment they join the queue until they receive their order. We'll generate some mock data to simulate real measurements.

## Explanation

1. **Data Generation**  
   - In a real-world scenario, you would **collect actual wait time data** instead of generating it. Here, we simulate it with a normal distribution centered at 5.5 minutes with a standard deviation of 1 minute, so the true average is slightly above the coffee shop’s claim of 5 minutes.

2. **Formulating Hypotheses**  
   - We want to see if the coffee shop’s stated average (5 minutes) is accurate.  
   - $H_0$: $\mu = 5$ (the mean wait time is exactly 5 minutes)  
   - $H_1$: $\mu \neq 5$ (the mean wait time is different from 5 minutes)

3. **Performing a One-Sample T-Test**  
   - We use `scipy.stats.ttest_1samp(data, mu_0)` to compare our sample's mean to the hypothesized mean (5 minutes).  
   - This returns two values:
     - `t_statistic`: measures how far the sample mean is from the hypothesized mean in units of standard error.  
     - `p_value`: the probability of observing a result at least as extreme as ours, assuming $H_0$ is true.

4. **Constructing a 95% Confidence Interval (CI)**  
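   - We manually compute the 95% CI for the population mean, centered on the sample mean:
     - $\text{Sample Mean} \pm t_{\text{crit}} \times \text{Standard Error}$, where $t_{\text{crit}}$ is the critical value of the t distribution with $n - 1$ degrees of freedom.  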
     - If the hypothesized mean (5) is **outside** this interval, it suggests a significant difference at the 5% level.

5. **Interpreting the Results**  
   - **p-value**: If $p < 0.05$, we **reject** $H_0$. Otherwise, we **fail to reject** $H_0$.  
   - **Confidence Interval**: If 5 is not inside the computed interval, the data are inconsistent with a population mean of 5 at the 95% confidence level. For a two-sided test, this decision always matches the p-value rule at $\alpha = 0.05$.
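
The test statistic and p-value described in step 3 can also be computed by hand, which makes the role of the standard error explicit. The following is a minimal sketch, assuming simulated data with the same settings as this walkthrough (seed 42, mean 5.5, standard deviation 1). The names `t_manual` and `p_manual` are illustrative only.

```python
import numpy as np
from scipy import stats

# Simulated wait times (illustrative settings, assumed to mirror the walkthrough)
np.random.seed(42)
data = np.random.normal(loc=5.5, scale=1, size=50)
mu_0 = 5  # hypothesized mean under H0

n = len(data)
sample_mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# t-statistic: distance of the sample mean from mu_0, in standard errors
t_manual = (sample_mean - mu_0) / se

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(data, mu_0)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.6f}")
```

Both lines should print the same values, since `ttest_1samp` performs a two-sided test by default.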


In [None]:
import numpy as np
import scipy.stats as stats

# -----------------------
# 1. DATA GENERATION
# -----------------------
# Scenario: The coffee shop claims the average wait time is 5 minutes.
# We measure 50 wait times. Here, we simulate the data to have a mean of 5.5,
# suggesting in reality they might be waiting longer than advertised.
np.random.seed(42)  # For reproducibility
sample_size = 50
true_mean = 5.5  # Real mean (unknown in practice, only known here to simulate data)
data = np.random.normal(loc=true_mean, scale=1, size=sample_size)

# -----------------------
# 2. DEFINE HYPOTHESES
# -----------------------
# H0: mean = 5
# H1: mean != 5
mu_0 = 5  # The "advertised" mean wait time

# -----------------------
# 3. ONE-SAMPLE T-TEST
# -----------------------
t_statistic, p_value = stats.ttest_1samp(data, mu_0)

# -----------------------
# 4. CONFIDENCE INTERVAL (95%)
# -----------------------
# We calculate a 95% CI for the population mean manually:
#  - sample_mean: mean of the data
#  - sample_std: sample standard deviation (ddof=1)
#  - se (standard error): sample_std / sqrt(n)
#  - t_crit: critical t-value for 95% confidence and n-1 degrees of freedom
#  - margin_of_error = t_crit * se
#  - 95% CI = (sample_mean - margin_of_error, sample_mean + margin_of_error)

sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # Sample standard deviation
n = len(data)
se = sample_std / np.sqrt(n)

confidence_level = 0.95
alpha = 1 - confidence_level
df = n - 1  # degrees of freedom
t_crit = stats.t.ppf(1 - alpha/2, df)
margin_of_error = t_crit * se
conf_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# -----------------------
# 5. RESULTS & INTERPRETATION
# -----------------------
print(f"Sample size (n): {n}")
print(f"Sample mean: {sample_mean:.3f} minutes")
print(f"Sample standard deviation: {sample_std:.3f} minutes")
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"95% Confidence Interval: ({conf_interval[0]:.3f}, {conf_interval[1]:.3f})")

# Decision rule at α = 0.05
alpha_level = 0.05
if p_value < alpha_level:
    print(f"\nSince p-value < {alpha_level}, we REJECT the null hypothesis.")
    print("Conclusion: The data suggests the true mean wait time is different from 5 minutes.")
else:
    print(f"\nSince p-value >= {alpha_level}, we FAIL TO REJECT the null hypothesis.")
    print("Conclusion: We do not have enough evidence to say the true mean differs from 5 minutes.")


Sample size (n): 50
Sample mean: 5.275 minutes
Sample standard deviation: 0.934 minutes
T-statistic: 2.079
P-value: 0.042864
95% Confidence Interval: (5.009, 5.540)

Since p-value < 0.05, we REJECT the null hypothesis.
Conclusion: The data suggests the true mean wait time is different from 5 minutes.
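
As a sanity check, the printed values can be reproduced by hand from the rounded summary statistics above. The critical value $t_{\text{crit}} \approx 2.0096$ (97.5th percentile of the t distribution with 49 degrees of freedom) is the only number not shown in the output. Small differences in the last digit come from rounding the inputs.

$$
\text{SE} = \frac{s}{\sqrt{n}} = \frac{0.934}{\sqrt{50}} \approx 0.132
$$

$$
t = \frac{\bar{x} - \mu_0}{\text{SE}} \approx \frac{5.275 - 5}{0.132} \approx 2.08
$$

$$
\bar{x} \pm t_{\text{crit}} \times \text{SE} \approx 5.275 \pm 2.0096 \times 0.132 \approx 5.275 \pm 0.265 \;\Rightarrow\; (5.010,\ 5.540)
$$

These agree with the reported t-statistic of 2.079 and interval of (5.009, 5.540) up to rounding.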


## What This Means for Our Coffee Shop Scenario

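- If the test shows a **low p-value** (e.g., < 0.05) and a 95% CI that **does not include 5**, the data indicate that the **true average wait time differs** from the coffee shop’s advertised 5 minutes. That is the case here: $p \approx 0.043$ and the interval (5.009, 5.540) lies entirely above 5, so waits appear to be longer than advertised, although the lower bound sits just above 5.  
- If the p-value is high and 5 lies within the 95% CI, we don’t have enough evidence to conclude the wait times differ from 5 minutes. This does not prove that the mean is exactly 5 minutes.  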
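
The two criteria above are two views of the same decision: for a two-sided one-sample t-test, $p < \alpha$ exactly when the $(1 - \alpha)$ confidence interval excludes the hypothesized mean. The sketch below illustrates this on a few simulated samples. The helper `ttest_with_ci`, the seed, and the list of true means are hypothetical choices made for this illustration.

```python
import numpy as np
from scipy import stats

def ttest_with_ci(data, mu_0, confidence=0.95):
    """Return (p_value, (low, high)) for a two-sided one-sample t-test."""
    n = len(data)
    p_value = stats.ttest_1samp(data, mu_0).pvalue
    # Confidence level is passed positionally because its keyword name
    # differs between SciPy versions.
    ci = stats.t.interval(confidence, n - 1, loc=np.mean(data), scale=stats.sem(data))
    return p_value, ci

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
mu_0, alpha = 5, 0.05
agreements = []

for true_mean in (5.0, 5.2, 5.5, 6.0):  # hypothetical true means
    data = rng.normal(loc=true_mean, scale=1, size=50)
    p_value, (low, high) = ttest_with_ci(data, mu_0, 1 - alpha)
    reject_by_p = p_value < alpha
    reject_by_ci = not (low <= mu_0 <= high)
    agreements.append(reject_by_p == reject_by_ci)
    print(f"true mean {true_mean}: p = {p_value:.4f}, "
          f"CI = ({low:.3f}, {high:.3f}), reject H0: {reject_by_p}")

print("p-value and CI decisions agree:", all(agreements))
```

Reporting the interval alongside the p-value is still useful, because the interval also shows how far from 5 the mean could plausibly be.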

In practice, this kind of test helps both the coffee shop’s management and customers understand if the service claims align with reality—and whether any process improvements may be needed.