<div style="background-image: url('https://www.dropbox.com/scl/fi/wdrnuojbnjx6lgfekrx85/mcnair.jpg?rlkey=wcbaw5au7vh5vt1g5d5x7fw8f&dl=1'); background-size: cover; background-position: center; height: 300px; display: flex; align-items: center; justify-content: center; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7); margin-bottom: 20px; position: relative;">
  <h1 style="text-align: center; font-size: 2.5em; margin: 0;">JGSB Python Workshop <br> Part 9: Statistics</h1>
  <div style="position: absolute; bottom: 10px; left: 15px; font-size: 0.9em; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7);">
    Authored by Kerry Back
  </div>
  <div style="position: absolute; bottom: 10px; right: 15px; text-align: right; font-size: 0.9em; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7);">
    Rice University, 9/6/2025
  </div>
</div>

# t Test for a Population Mean

A one-sample t-test determines whether a sample mean differs significantly from a hypothesized population mean. This is commonly used in business to test whether actual performance meets targets or expectations.

**Key Concepts:**
- **Null Hypothesis (H₀):** The population mean equals the hypothesized value
- **Alternative Hypothesis (H₁):** The population mean differs from the hypothesized value
- **t-statistic:** Measures how far the sample mean is from the hypothesized mean in standard error units
- **p-value:** Probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true
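
The t-statistic and p-value defined above can be computed by hand and checked against SciPy. This is a minimal sketch, assuming a small made-up sample; `sample` and `mu_0` are illustrative values, not workshop data.

```python
import numpy as np
from scipy import stats

# Hypothetical order values and hypothesized mean (illustrative only)
sample = np.array([72.0, 81.5, 77.0, 69.5, 84.0, 79.5, 75.0, 88.0])
mu_0 = 75.0

n = len(sample)
std_err = sample.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_manual = (sample.mean() - mu_0) / std_err    # distance from mu_0 in standard errors

# Two-sided p-value: probability of |t| at least this large if H0 is true
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's built-in test should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two lines of output should match, which confirms that `stats.ttest_1samp` runs a two-sided test with `n - 1` degrees of freedom by default.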

**When to Use:**
- Testing if average sales meet targets
- Checking if customer satisfaction scores differ from industry standards
- Validating if production times match specifications

## Example: Testing Average Order Value

Let's test whether our online store's average order value differs from the industry standard of $75.

# t Test for a Difference of Means with Paired Treatment

A paired t-test compares two related measurements on the same subjects, testing whether the mean difference between paired observations is significantly different from zero. This is powerful for before-and-after comparisons in business.

**Key Concepts:**
- **Paired Data:** Two measurements on the same experimental unit (person, store, product, etc.)
- **Null Hypothesis (H₀):** The mean difference between pairs equals zero
- **Alternative Hypothesis (H₁):** The mean difference between pairs does not equal zero
- **Advantages:** Controls for individual differences, increasing statistical power
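
A paired t-test is equivalent to a one-sample t-test on the pairwise differences. The sketch below illustrates this, assuming made-up before/after values for six units; `before` and `after` are hypothetical. It also runs an independent-samples test on the same numbers to show what is lost when the pairing is ignored.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements on the same six units (illustrative only)
before = np.array([50.0, 62.0, 47.0, 55.0, 71.0, 58.0])
after = np.array([54.0, 61.0, 52.0, 60.0, 75.0, 57.0])

diff = after - before

# Paired test and one-sample test on the differences give the same result
t_paired, p_paired = stats.ttest_rel(after, before)
t_onesample, p_onesample = stats.ttest_1samp(diff, 0.0)

# Treating the two columns as unrelated groups ignores unit-to-unit variation
t_ind, p_ind = stats.ttest_ind(after, before)

print(f"paired:      t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"one-sample:  t = {t_onesample:.4f}, p = {p_onesample:.4f}")
print(f"independent: t = {t_ind:.4f}, p = {p_ind:.4f}")
```

When units differ a lot from one another, as they do here, the paired test typically yields a smaller p-value than the independent-samples test, because the unit-level variation cancels in the differences.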

**Business Applications:**
- Before/after training program effectiveness
- Pre/post marketing campaign performance
- Website A/B testing when the same users see both versions
- Sales performance before/after system implementation

## Example: Email Marketing Campaign Effectiveness

Let's test whether a new email marketing strategy increases weekly sales compared to the previous strategy using the same 20 stores.

# t Test of a Difference of Means with Independent Samples

An independent samples t-test compares the means of two unrelated groups to determine if they differ significantly. This is essential for comparing different treatments, groups, or conditions in business research.

**Key Concepts:**
- **Independent Groups:** Separate, unrelated samples (different people, stores, regions, etc.)
- **Null Hypothesis (H₀):** The two population means are equal
- **Alternative Hypothesis (H₁):** The two population means are different
- **Assumptions:** Approximately normal data within each group, independent observations, and similar variances (Welch's t-test relaxes the equal-variance assumption)
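
The equal-variance assumption determines which version of the test SciPy runs. The sketch below compares the pooled test (`equal_var=True`) with Welch's test (`equal_var=False`) and reproduces the Welch statistic by hand. The two groups are made-up numbers chosen to have visibly different spreads.

```python
import numpy as np
from scipy import stats

# Hypothetical samples from two unrelated groups (illustrative only)
group_a = np.array([85.0, 92.0, 78.0, 101.0, 88.0, 95.0, 83.0])
group_b = np.array([80.0, 79.0, 82.0, 77.0, 81.0, 78.0])

# Pooled-variance test assumes both groups share one variance
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's test estimates a separate variance for each group
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Manual Welch statistic: difference in means over its standard error
se_welch = np.sqrt(group_a.var(ddof=1) / len(group_a)
                   + group_b.var(ddof=1) / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / se_welch

print(f"pooled: t = {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.4f}, p = {p_welch:.4f}")
print(f"manual Welch t = {t_manual:.4f}")
```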

**Business Applications:**
- Comparing performance between departments
- Testing difference in customer satisfaction across regions
- Evaluating effectiveness of different marketing strategies
- Analyzing sales differences between product versions

## Example: Regional Sales Performance Comparison

Let's compare average monthly sales between the East and West regions to determine if there's a significant difference in performance.

# Linear Regression: The Formula Method

Linear regression using the formula method (Patsy formulas) provides an intuitive, R-like syntax for specifying regression models. This approach is excellent for business analysts who want readable, flexible model specifications.

**Key Concepts:**
- **Formula Syntax:** Uses string formulas like `'y ~ x1 + x2'` to specify relationships
- **Automatic Handling:** Categorical variables, interactions, and transformations
- **Business Interpretation:** Each coefficient represents the expected change in Y for a one-unit change in X, holding the other predictors constant
- **R-squared:** Proportion of variance in the dependent variable explained by the model

**Formula Syntax Examples:**
- `'sales ~ advertising'` - Simple linear regression
- `'sales ~ advertising + price'` - Multiple regression
- `'sales ~ advertising * region'` - Main effects plus their interaction
- `'sales ~ np.log(advertising)'` - Transformations
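
To see how each formula expands into model terms, one option is to fit it on a tiny dataset and list the coefficient names. This is a rough sketch: the `toy` DataFrame and its values are invented for illustration, and the exact term labels depend on the formula engine that statsmodels uses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data (illustrative only): eight stores in two regions
toy = pd.DataFrame({
    "advertising": [10.0, 12.0, 15.0, 18.0, 11.0, 14.0, 16.0, 20.0],
    "region": ["East"] * 4 + ["West"] * 4,
})
rng = np.random.default_rng(0)
toy["sales"] = 50 + 2.0 * toy["advertising"] + rng.normal(0, 1, len(toy))

formulas = [
    "sales ~ advertising",           # one predictor
    "sales ~ advertising + region",  # adds a dummy for the second region
    "sales ~ advertising * region",  # main effects plus interaction
    "sales ~ np.log(advertising)",   # transformation applied inside the formula
]

# Fit each formula and show which terms ended up in the model
for formula in formulas:
    model = smf.ols(formula, data=toy).fit()
    print(f"{formula!r:35s} -> {list(model.params.index)}")
```

An intercept is added automatically, and the categorical `region` column is dummy-coded with one level dropped as the baseline.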

## Example: Predicting Sales from Advertising Spend

Let's build a model to predict monthly sales based on advertising expenditure and store size.

# Linear Regression: The Matrix Method

The matrix method provides direct access to the mathematical foundation of linear regression using matrix operations. This approach offers maximum flexibility and is essential for understanding custom model implementations and advanced techniques.

**Key Concepts:**
- **Matrix Form:** Y = Xβ + ε, where Y is the dependent variable, X is the design matrix, β is the coefficient vector, and ε is the error term
- **Normal Equation:** β̂ = (X'X)⁻¹X'Y gives the least-squares estimate of the coefficients
- **Design Matrix:** X includes intercept column and all independent variables
- **Direct Control:** Full access to intermediate calculations and custom modifications
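
The normal equation can be solved without forming the inverse explicitly, which is usually better numerically. The sketch below, on simulated data with made-up coefficients, compares `np.linalg.solve` applied to the normal equation with NumPy's least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

# Simulated data with known coefficients (illustrative values only)
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: intercept column plus the two predictors
X = np.column_stack([np.ones(n), x1, x2])

# Normal equation, solved as a linear system (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which avoids forming X'X at all
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(beta_normal, 4))
print("lstsq:          ", np.round(beta_lstsq, 4))
```

Both approaches give the same coefficients on well-conditioned data like this. They can diverge when predictors are nearly collinear, in which case the least-squares solver is generally the safer choice.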

**When to Use Matrix Method:**
- Custom model specifications not available in formula syntax
- Understanding mathematical foundations
- Building specialized regression techniques
- Performance-critical applications
- Research and advanced analytics

## Example: Portfolio Return Analysis

Let's predict portfolio returns using the matrix method with multiple risk factors.

## Exercise: Custom Regression Implementation

Build a custom weighted regression model where observations have different importance weights. This technique is valuable when some data points are more reliable or representative than others.

**Your Task:**
1. Generate sample data:
   ```python
   np.random.seed(202)
   x1 = np.random.uniform(10, 50, 60)
   x2 = np.random.uniform(5, 20, 60)
   weights = np.random.uniform(0.5, 2.0, 60)  # Observation weights
   y = 100 + 3*x1 + 2*x2 + np.random.normal(0, 8, 60)
   ```

2. Implement weighted least squares using matrix operations:
   - Create design matrix X with intercept, x1, and x2
   - Create weight matrix W = diag(weights)
   - Use weighted normal equation: β = (X'WX)⁻¹X'Wy

3. Compare results with unweighted regression using `sm.OLS()`
4. Use `sm.WLS()` to verify your manual calculation
5. Interpret the differences between weighted and unweighted models

In [ ]:
# Imports needed to run this cell on its own
import numpy as np
import statsmodels.api as sm

# Generate financial data for matrix regression
np.random.seed(42)
n_periods = 200

# Risk factors (independent variables)
market_return = np.random.normal(0.08, 0.15, n_periods)    # Market factor
size_factor = np.random.normal(0.02, 0.10, n_periods)      # Size factor  
value_factor = np.random.normal(0.01, 0.08, n_periods)     # Value factor

# Portfolio return (dependent variable) - based on factor model
portfolio_return = (0.03 +                                 # Alpha
                   0.8 * market_return +                   # Market beta
                   0.3 * size_factor +                     # Size loading
                   -0.1 * value_factor +                   # Value loading
                   np.random.normal(0, 0.05, n_periods))   # Idiosyncratic risk

# Method 1: Using statsmodels with matrix approach
# Create design matrix X (includes intercept)
X = np.column_stack([
    np.ones(n_periods),      # Intercept column
    market_return,           # Market factor
    size_factor,            # Size factor  
    value_factor            # Value factor
])

y = portfolio_return

# Fit model using matrix method
matrix_model = sm.OLS(y, X).fit()

print("MATRIX METHOD RESULTS:")
print("="*50)
print("Coefficients:")
factor_names = ['Intercept (Alpha)', 'Market Beta', 'Size Loading', 'Value Loading']
for name, coef in zip(factor_names, matrix_model.params):
    print(f"{name:20s}: {coef:8.4f}")

print("\nModel Statistics:")
print(f"R-squared: {matrix_model.rsquared:.4f}")
print(f"Adjusted R-squared: {matrix_model.rsquared_adj:.4f}")
print(f"F-statistic: {matrix_model.fvalue:.4f}")
print(f"Prob (F-statistic): {matrix_model.f_pvalue:.4f}")

# Method 2: Manual calculation using normal equation
print("\n" + "="*50)
print("MANUAL MATRIX CALCULATION:")
print("="*50)

# Calculate coefficients manually: β = (X'X)^(-1)X'Y
XtX = X.T @ X                    # X transpose times X
XtX_inv = np.linalg.inv(XtX)     # Inverse of X'X
Xty = X.T @ y                    # X transpose times y
beta_manual = XtX_inv @ Xty      # Final coefficients

print("Manual coefficients (should match statsmodels):")
for name, coef in zip(factor_names, beta_manual):
    print(f"{name:20s}: {coef:8.4f}")

# Calculate fitted values and residuals
y_fitted = X @ beta_manual
residuals = y - y_fitted
rss = np.sum(residuals**2)       # Residual sum of squares
tss = np.sum((y - np.mean(y))**2) # Total sum of squares
r_squared_manual = 1 - (rss / tss)

print(f"\nManual R-squared: {r_squared_manual:.4f}")

# Show the mathematical relationship
print(f"\nPortfolio Return Model:")
print(f"Return = {beta_manual[0]:.4f} + {beta_manual[1]:.4f}*Market + {beta_manual[2]:.4f}*Size + {beta_manual[3]:.4f}*Value")

In [ ]:
# Imports needed to run this cell on its own
import numpy as np
import pandas as pd

# Generate business data for regression analysis
np.random.seed(42)
n_stores = 100

# Independent variables
advertising = np.random.uniform(5000, 25000, n_stores)  # Monthly ad spend
store_size = np.random.uniform(1000, 5000, n_stores)    # Square footage
region = np.random.choice(['North', 'South', 'East', 'West'], n_stores)

# Dependent variable with realistic business relationship
# Sales = base + advertising effect + store size effect + region effect + noise
sales = (50000 + 
         1.2 * advertising +           # $1.20 return per $1 ad spend
         8 * store_size +              # $8 per sq ft
         np.where(region == 'North', 5000,
         np.where(region == 'South', -2000, 
         np.where(region == 'East', 3000, 0))) +  # Regional differences
         np.random.normal(0, 10000, n_stores))    # Random variation

# Create DataFrame
business_data = pd.DataFrame({
    'sales': sales,
    'advertising': advertising,
    'store_size': store_size,
    'region': region
})

print("Business Data Summary:")
print(business_data.describe().round(2))
print(f"\nRegion distribution:")
print(business_data['region'].value_counts())

# Fit regression model using formula method
import statsmodels.formula.api as smf

# Simple regression
simple_model = smf.ols('sales ~ advertising', data=business_data).fit()
print("\n" + "="*50)
print("SIMPLE REGRESSION: Sales ~ Advertising")
print("="*50)
print(simple_model.summary().tables[1])
print(f"R-squared: {simple_model.rsquared:.4f}")

# Multiple regression
multiple_model = smf.ols('sales ~ advertising + store_size', data=business_data).fit()
print("\n" + "="*50)
print("MULTIPLE REGRESSION: Sales ~ Advertising + Store Size")
print("="*50)
print(multiple_model.summary().tables[1])
print(f"R-squared: {multiple_model.rsquared:.4f}")

# Model with categorical variable
full_model = smf.ols('sales ~ advertising + store_size + region', data=business_data).fit()
print("\n" + "="*50)
print("FULL MODEL: Sales ~ Advertising + Store Size + Region")
print("="*50)
print(full_model.summary().tables[1])
print(f"R-squared: {full_model.rsquared:.4f}")

In [ ]:
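# Imports needed to run this cell on its own
import numpy as np
import pandas as pd
from scipy import stats

# Generate independent samples for two regions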
np.random.seed(42)

# East region sales (30 stores) - higher performing region
east_sales = np.random.normal(85000, 15000, 30)

# West region sales (25 stores) - lower performing region  
west_sales = np.random.normal(78000, 12000, 25)

# Create combined dataset for analysis
regional_data = pd.DataFrame({
    'Sales': np.concatenate([east_sales, west_sales]),
    'Region': ['East'] * 30 + ['West'] * 25
})

print("Regional Sales Summary:")
summary_stats = regional_data.groupby('Region')['Sales'].agg(['count', 'mean', 'std']).round(2)
print(summary_stats)

# Check for equal variances (Levene's test)
from scipy.stats import levene
levene_stat, levene_p = levene(east_sales, west_sales)
print(f"\nLevene's test for equal variances:")
print(f"Test statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

equal_var = levene_p > 0.05
print(f"Equal variances assumption: {'Met' if equal_var else 'Violated'}")

# Perform independent samples t-test
t_stat, p_value = stats.ttest_ind(east_sales, west_sales, equal_var=equal_var)

print(f"\nIndependent Samples t-Test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(east_sales) - 1) * np.var(east_sales, ddof=1) + 
                     (len(west_sales) - 1) * np.var(west_sales, ddof=1)) / 
                     (len(east_sales) + len(west_sales) - 2))
cohens_d = (np.mean(east_sales) - np.mean(west_sales)) / pooled_std
print(f"Cohen's d (effect size): {cohens_d:.4f}")

# Interpretation
alpha = 0.05
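if p_value < alpha:
    print(f"\nConclusion: Reject H₀ (p < {alpha})")
    print("Significant difference between East and West region sales")
    difference = np.mean(east_sales) - np.mean(west_sales)
    # Report the direction from the data rather than assuming East is higher
    leader, trailer = ("East", "West") if difference > 0 else ("West", "East")
    print(f"{leader} region outperforms {trailer} by ${abs(difference):.2f} on average")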
else:
    print(f"\nConclusion: Fail to reject H₀ (p ≥ {alpha})")
    print("No significant difference between regional sales")

In [ ]:
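# Imports needed to run this cell on its own
import numpy as np
import pandas as pd
from scipy import stats

# Generate paired data for 20 stores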
np.random.seed(42)
store_ids = range(1, 21)

# Sales before new email strategy (baseline)
sales_before = np.random.normal(5000, 800, 20)

# Sales after new email strategy (generally higher, but with store-specific variation)
# Each store has its own baseline + random improvement
store_effects = np.random.normal(500, 200, 20)  # Average improvement of $500
sales_after = sales_before + store_effects + np.random.normal(0, 300, 20)

# Create DataFrame for better visualization
sales_data = pd.DataFrame({
    'Store_ID': store_ids,
    'Sales_Before': sales_before,
    'Sales_After': sales_after,
    'Difference': sales_after - sales_before
})

print("Sales Data Summary:")
print(sales_data.describe().round(2))

print(f"\nMean difference: ${sales_data['Difference'].mean():.2f}")
print(f"Std of differences: ${sales_data['Difference'].std(ddof=1):.2f}")

# Perform paired t-test
t_stat, p_value = stats.ttest_rel(sales_after, sales_before)

print(f"\nPaired t-Test Results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(sales_before) - 1}")

# Interpretation
alpha = 0.05
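if p_value < alpha:
    print(f"\nConclusion: Reject H₀ (p < {alpha})")
    # The test is two-sided, so read the direction from the sign of t
    direction = "increased" if t_stat > 0 else "decreased"
    print(f"The new email strategy significantly {direction} sales")
    print(f"Average change: ${sales_data['Difference'].mean():.2f} per store per week")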
else:
    print(f"\nConclusion: Fail to reject H₀ (p ≥ {alpha})")
    print("No significant difference between email strategies")

In [ ]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

# Generate sample data: order values for 50 customers
np.random.seed(42)
order_values = np.random.normal(78.5, 12.3, 50)  # Mean slightly above $75

# Display basic statistics
print("Sample Statistics:")
print(f"Sample size: {len(order_values)}")
print(f"Sample mean: ${np.mean(order_values):.2f}")
print(f"Sample std: ${np.std(order_values, ddof=1):.2f}")
print(f"Industry standard: $75.00")

# Perform one-sample t-test
hypothesized_mean = 75
t_stat, p_value = stats.ttest_1samp(order_values, hypothesized_mean)

print(f"\nOne-Sample t-Test Results:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(order_values) - 1}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: Reject H₀ (p < {alpha})")
    print("The average order value significantly differs from $75")
else:
    print(f"\nConclusion: Fail to reject H₀ (p ≥ {alpha})")
    print("No significant difference from the industry standard of $75")
