## Section 1: Import Required Libraries

In [15]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
print("Libraries imported successfully!")

Libraries imported successfully!


## Section 2: Define Problem Parameters

In [16]:
# Define the problem parameters
mu_0 = 200  # Claimed population mean latency (in ms)
sigma = 30  # Population standard deviation (in ms)
n = 64  # Sample size
x_bar = 212  # Sample mean (in ms)
alpha = 0.05  # Significance level

# Display the parameters in a nice table
parameters = {
    'Parameter': ['Claimed Mean Latency (μ₀)', 'Population Std Dev (σ)', 'Sample Size (n)', 
                  'Sample Mean (x̄)', 'Significance Level (α)'],
    'Symbol': ['μ₀', 'σ', 'n', 'x̄', 'α'],
    'Value': [mu_0, sigma, n, x_bar, alpha]
}

df_params = pd.DataFrame(parameters)
print("Problem Parameters:")
print("=" * 70)
print(df_params.to_string(index=False))
print("=" * 70)

Problem Parameters:
                Parameter Symbol  Value
Claimed Mean Latency (μ₀)     μ₀ 200.00
   Population Std Dev (σ)      σ  30.00
          Sample Size (n)      n  64.00
         Sample Mean (x̄)     x̄ 212.00
   Significance Level (α)      α   0.05


## Section 3: State Null and Alternative Hypotheses

**Null Hypothesis (H₀):**
$$H_0: \mu = 200 \text{ ms}$$

The AI team's claim is that the mean inference latency is 200 ms.

**Alternative Hypothesis (H₁):**
$$H_1: \mu \neq 200 \text{ ms}$$

We are testing whether the mean inference latency is different from 200 ms (two-tailed test).

**Test Type:** Two-tailed test (because we're testing if μ is not equal to 200)
- **Rejection Region:** Both tails of the distribution
- **Significance Level:** α = 0.05
- **Each tail:** α/2 = 0.025

In [17]:
print("HYPOTHESES SETUP")
print("=" * 70)
print(f"Null Hypothesis (H₀):          μ = {mu_0} ms")
print(f"Alternative Hypothesis (H₁):   μ ≠ {mu_0} ms")
print(f"\nTest Type: Two-tailed test")
print(f"Significance Level (α): {alpha}")
print(f"Each tail (α/2): {alpha/2}")
print("=" * 70)

HYPOTHESES SETUP
Null Hypothesis (H₀):          μ = 200 ms
Alternative Hypothesis (H₁):   μ ≠ 200 ms

Test Type: Two-tailed test
Significance Level (α): 0.05
Each tail (α/2): 0.025


## Section 4: Calculate the Test Statistic

**Formula for Z-test statistic:**
$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Where:
- $\bar{x}$ = sample mean = 212 ms
- $\mu_0$ = hypothesized population mean = 200 ms
- $\sigma$ = population standard deviation = 30 ms
- $n$ = sample size = 64

In [18]:
# Calculate the standard error
standard_error = sigma / np.sqrt(n)

# Calculate the Z-test statistic
z_statistic = (x_bar - mu_0) / standard_error

print("Z-TEST STATISTIC CALCULATION")
print("=" * 70)
print(f"Standard Error (SE) = σ / √n")
print(f"                   = {sigma} / √{n}")
print(f"                   = {sigma} / {np.sqrt(n):.4f}")
print(f"                   = {standard_error:.4f}")
print()
print(f"Z-test Statistic = (x̄ - μ₀) / SE")
print(f"                 = ({x_bar} - {mu_0}) / {standard_error:.4f}")
print(f"                 = {x_bar - mu_0} / {standard_error:.4f}")
print(f"                 = {z_statistic:.4f}")
print("=" * 70)
print(f"\n✓ Calculated Z-statistic: {z_statistic:.4f}")

Z-TEST STATISTIC CALCULATION
Standard Error (SE) = σ / √n
                   = 30 / √64
                   = 30 / 8.0000
                   = 3.7500

Z-test Statistic = (x̄ - μ₀) / SE
                 = (212 - 200) / 3.7500
                 = 12 / 3.7500
                 = 3.2000

✓ Calculated Z-statistic: 3.2000


## Section 5: Determine Critical Values and P-value

For a two-tailed test at α = 0.05:
- Each tail has an area of α/2 = 0.025
- We need to find the critical Z-values where the cumulative probability is 0.025 and 0.975

**Critical Z-values for α = 0.05 (two-tailed):**
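- Left critical value: $Z_{\alpha/2} = Z_{0.025} \approx -1.96$ (the 0.025 quantile of the standard normal)
- Right critical value: $Z_{1-\alpha/2} = Z_{0.975} \approx 1.96$ (the 0.975 quantile of the standard normal)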
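
Because the standard normal distribution is symmetric about zero, the two critical values differ only in sign, and the two tails together hold probability α. The short sketch below checks both facts numerically; the variable names `z_left`, `z_right`, and `tail_area` are illustrative and not part of the notebook's own cells.

```python
from scipy import stats

alpha = 0.05  # same significance level as in the problem setup

# Quantiles of the standard normal at alpha/2 and 1 - alpha/2
z_left = stats.norm.ppf(alpha / 2)
z_right = stats.norm.ppf(1 - alpha / 2)

# By symmetry, z_left should equal -z_right
print(f"z_left  = {z_left:.4f}")
print(f"z_right = {z_right:.4f}")

# Probability below z_left plus probability above z_right should equal alpha
tail_area = stats.norm.cdf(z_left) + stats.norm.sf(z_right)
print(f"total tail area = {tail_area:.4f}")
```

This is why the next cell computes only the right critical value and negates it to obtain the left one.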

In [19]:
# Find critical Z-values for two-tailed test
z_critical = stats.norm.ppf(1 - alpha/2)  # Right critical value
z_critical_left = -z_critical  # Left critical value

# Calculate the p-value (two-tailed)
p_value = 2 * stats.norm.sf(abs(z_statistic))  # sf = survival function (1 - CDF)

print("CRITICAL VALUES AND P-VALUE")
print("=" * 70)
print(f"Significance Level (α): {alpha}")
print(f"Each tail (α/2): {alpha/2}")
print()
print(f"Left Critical Value:  Z_{{α/2}} = {z_critical_left:.4f}")
print(f"Right Critical Value: Z_{{1-α/2}} = {z_critical:.4f}")
print()
print(f"Rejection Regions:")
print(f"  - Reject H₀ if Z < {z_critical_left:.4f}")
print(f"  - Reject H₀ if Z > {z_critical:.4f}")
print()
print(f"Two-Tailed P-value: {p_value:.6f}")
print("=" * 70)

CRITICAL VALUES AND P-VALUE
Significance Level (α): 0.05
Each tail (α/2): 0.025

Left Critical Value:  Z_{α/2} = -1.9600
Right Critical Value: Z_{1-α/2} = 1.9600

Rejection Regions:
  - Reject H₀ if Z < -1.9600
  - Reject H₀ if Z > 1.9600

Two-Tailed P-value: 0.001374


## Section 6: Make a Decision

**Decision Rules:**
1. If |Z| > Z_{critical} → **Reject H₀**
2. If p-value < α → **Reject H₀**
3. Otherwise → **Fail to reject H₀**
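
Rules 1 and 2 are two views of the same comparison, so they should always lead to the same decision. As a minimal sketch, the hypothetical helper `z_test_decision` below (not part of the original notebook) applies both rules to the problem's numbers and returns each result separately so the agreement can be inspected.

```python
from math import sqrt
from scipy import stats

def z_test_decision(x_bar, mu_0, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z-test with known sigma.

    Returns (z, p_value, reject_by_critical_value, reject_by_p_value).
    """
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    p_value = 2 * stats.norm.sf(abs(z))          # area in both tails beyond |z|
    z_crit = stats.norm.ppf(1 - alpha / 2)       # right critical value
    reject_critical = bool(abs(z) > z_crit)      # rule 1
    reject_p = bool(p_value < alpha)             # rule 2
    return z, p_value, reject_critical, reject_p

# Values from this problem: x_bar = 212, mu_0 = 200, sigma = 30, n = 64
z, p_value, reject_critical, reject_p = z_test_decision(212, 200, 30, 64)
print(f"Z = {z:.4f}, p-value = {p_value:.6f}")
print(f"Reject by critical value: {reject_critical}")
print(f"Reject by p-value:        {reject_p}")
```

Returning both booleans, rather than a single flag, makes it easy to confirm that the two criteria agree for any inputs you try.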

In [20]:
# Make decision
is_reject = abs(z_statistic) > z_critical

print("\nDECISION ANALYSIS")
print("=" * 70)
print(f"Test Statistic (Z): {z_statistic:.4f}")
print(f"Critical Value (|Z|): ±{z_critical:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significance Level (α): {alpha}")
print()
print(f"Decision Criteria 1 (Critical Value):")
print(f"  |Z| = |{z_statistic:.4f}| = {abs(z_statistic):.4f}")
print(f"  Critical Value = {z_critical:.4f}")
print(f"  {abs(z_statistic):.4f} > {z_critical:.4f}? {abs(z_statistic) > z_critical}")
print()
print(f"Decision Criteria 2 (P-value):")
print(f"  P-value ({p_value:.6f}) < α ({alpha})? {p_value < alpha}")
print()
print("=" * 70)
if is_reject:
    print("✓ DECISION: REJECT the null hypothesis (H₀)")
    print(f"\nConclusion: At the {alpha} level of significance,")
    print("we have sufficient evidence to conclude that the mean")
    print("inference latency is significantly different from 200 ms.")
else:
    print("✗ DECISION: FAIL TO REJECT the null hypothesis (H₀)")
    print(f"\nConclusion: At the {alpha} level of significance,")
    print("we do not have sufficient evidence to conclude that the")
    print("mean inference latency is different from 200 ms.")
print("=" * 70)


DECISION ANALYSIS
Test Statistic (Z): 3.2000
Critical Value (|Z|): ±1.9600
P-value: 0.001374
Significance Level (α): 0.05

Decision Criteria 1 (Critical Value):
  |Z| = |3.2000| = 3.2000
  Critical Value = 1.9600
  3.2000 > 1.9600? True

Decision Criteria 2 (P-value):
  P-value (0.001374) < α (0.05)? True

✓ DECISION: REJECT the null hypothesis (H₀)

Conclusion: At the 0.05 level of significance,
we have sufficient evidence to conclude that the mean
inference latency is significantly different from 200 ms.


## Section 7: Visualize the Hypothesis Test

Let's create a comprehensive visualization showing:
1. The standard normal distribution
2. The rejection regions
3. The critical values
4. The calculated test statistic
5. The p-value region

In [21]:
# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Standard Normal Distribution with Critical Regions
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x)

ax1.plot(x, y, 'b-', linewidth=2.5, label='Standard Normal Distribution')

# Fill rejection regions (tails)
x_left = x[x <= z_critical_left]
y_left = stats.norm.pdf(x_left)
ax1.fill_between(x_left, y_left, alpha=0.3, color='red', label=f'Rejection Region (α/2 = {alpha/2})')

x_right = x[x >= z_critical]
y_right = stats.norm.pdf(x_right)
ax1.fill_between(x_right, y_right, alpha=0.3, color='red')

# Plot critical values
ax1.axvline(z_critical_left, color='red', linestyle='--', linewidth=2, label=f'Critical Values: ±{z_critical:.4f}')
ax1.axvline(z_critical, color='red', linestyle='--', linewidth=2)

# Plot test statistic
ax1.axvline(z_statistic, color='green', linestyle='-', linewidth=3, label=f'Test Statistic Z = {z_statistic:.4f}')

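# Shade p-value region (both tails, because the p-value is two-tailed)
x_p_right = x[x >= abs(z_statistic)]
ax1.fill_between(x_p_right, stats.norm.pdf(x_p_right), alpha=0.2, color='orange', label=f'P-value area = {p_value:.6f}')
x_p_left = x[x <= -abs(z_statistic)]
ax1.fill_between(x_p_left, stats.norm.pdf(x_p_left), alpha=0.2, color='orange')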

# Add labels and formatting
ax1.set_xlabel('Z-Score', fontsize=12, fontweight='bold')
ax1.set_ylabel('Probability Density', fontsize=12, fontweight='bold')
ax1.set_title('Two-Tailed Z-Test: Critical Regions and Test Statistic', fontsize=14, fontweight='bold')
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-4, 4)

# Add annotations
ax1.annotate(f'Z = {z_critical_left:.4f}', xy=(z_critical_left, 0), xytext=(z_critical_left-0.5, 0.15),
            arrowprops=dict(arrowstyle='->', color='red'), fontsize=10, color='red', fontweight='bold')
ax1.annotate(f'Z = {z_critical:.4f}', xy=(z_critical, 0), xytext=(z_critical+0.3, 0.15),
            arrowprops=dict(arrowstyle='->', color='red'), fontsize=10, color='red', fontweight='bold')
ax1.annotate(f'Test Stat\nZ = {z_statistic:.4f}', xy=(z_statistic, 0.1), xytext=(z_statistic-0.8, 0.35),
            arrowprops=dict(arrowstyle='->', color='green'), fontsize=10, color='green', fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.7))

# Plot 2: Summary Information
ax2.axis('off')

# Create summary text
summary_text = f"""HYPOTHESIS TEST SUMMARY
{'='*50}

Problem: Model Inference Latency Testing
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sample Information:
  • Sample Size (n): {n}
  • Sample Mean (x̄): {x_bar} ms
  • Population Std Dev (σ): {sigma} ms
  • Standard Error (SE): {standard_error:.4f} ms

Hypotheses:
  • H₀: μ = {mu_0} ms
  • H₁: μ ≠ {mu_0} ms (Two-tailed)

Test Statistics:
  • Test Statistic (Z): {z_statistic:.4f}
  • Critical Value (±): {z_critical:.4f}
  • P-value: {p_value:.6f}
  • Significance Level (α): {alpha}

Decision Rule:
  • Reject H₀ if |Z| > {z_critical:.4f}
  • |{z_statistic:.4f}| > {z_critical:.4f}? {abs(z_statistic) > z_critical}
  • P-value < {alpha}? {p_value < alpha}

{'='*50}
DECISION: {'REJECT H₀' if is_reject else 'FAIL TO REJECT H₀'}
{'='*50}

Interpretation:
At the {alpha} level of significance, there is
{'sufficient' if is_reject else 'insufficient'} evidence to conclude that
the mean inference latency is {'SIGNIFICANTLY DIFFERENT' if is_reject else 'NOT significantly different'}
from 200 ms.

The sample provides {'strong' if is_reject else 'weak'} evidence against
the AI team's claim."""

ax2.text(0.05, 0.95, summary_text, transform=ax2.transAxes, fontsize=10.5,
         verticalalignment='top', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.savefig('model_latency_hypothesis_test.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✓ Visualization saved as 'model_latency_hypothesis_test.png'")

✓ Visualization saved as 'model_latency_hypothesis_test.png'

## Section 8: Final Summary and Conclusion

### Test Results Summary

| Metric | Value |
|--------|-------|
| **Null Hypothesis** | μ = 200 ms |
| **Alternative Hypothesis** | μ ≠ 200 ms |
| **Test Type** | Two-Tailed Z-Test |
| **Sample Size (n)** | 64 |
| **Sample Mean (x̄)** | 212 ms |
| **Population Std Dev (σ)** | 30 ms |
| **Standard Error** | 3.75 ms |
| **Calculated Z-Statistic** | 3.2 |
| **Critical Z-Value (±)** | ±1.96 |
| **P-Value** | 0.001374 |
| **Significance Level (α)** | 0.05 |
| **Decision** | **REJECT H₀** |

### Conclusion

**At the 0.05 level of significance, we REJECT the null hypothesis.**

This means there is **sufficient statistical evidence** to conclude that the mean inference latency of the deployed model is **significantly different from 200 milliseconds**.

The sample mean of 212 ms provides strong evidence (Z = 3.2, p-value = 0.00137) that the true mean inference latency is not 200 ms. The p-value (0.00137) is much less than the significance level (0.05), confirming our decision to reject H₀.

### Practical Interpretation

The AI team's claim that the model has an average inference latency of 200 ms is rejected at the 5% significance level. The sample evidence suggests the actual mean latency is higher than claimed (212 ms observed vs 200 ms claimed).
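
To gauge how far the true mean might sit from the claim, a confidence interval can complement the test. The sketch below is an addition to the original analysis: it assumes the same known-σ normal model and computes a 95% interval as x̄ ± Z₀.₉₇₅ · SE. The names `margin`, `ci_low`, and `ci_high` are illustrative.

```python
from math import sqrt
from scipy import stats

# Same problem parameters as in Section 2
mu_0, sigma, n, x_bar, alpha = 200, 30, 64, 212, 0.05

standard_error = sigma / sqrt(n)
margin = stats.norm.ppf(1 - alpha / 2) * standard_error  # half-width of the interval

ci_low = x_bar - margin
ci_high = x_bar + margin

print(f"{(1 - alpha) * 100:.0f}% CI for the mean latency: ({ci_low:.2f}, {ci_high:.2f}) ms")
print(f"Claimed mean {mu_0} ms inside interval? {ci_low <= mu_0 <= ci_high}")
```

A two-tailed Z-test at level α rejects H₀ exactly when μ₀ falls outside the (1 − α) confidence interval, so the claimed 200 ms lying below the interval's lower bound is consistent with the decision to reject.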

**Recommendation:** The model's inference latency should be investigated further, as it appears to exceed the claimed specification by approximately 12 milliseconds on average. This could indicate:
- Need for model optimization
- System performance issues
- Changes in operational conditions since the claim was made