# Chapter 10: Generalized Linear Models

**Core Goal:** Extend linear regression to handle non-normal response variables and non-linear relationships through a unified framework.

**Motivation:** Classical linear regression assumes the response variable is continuous and normally distributed. Many real-world problems violate these assumptions: binary outcomes (success/failure), count data (number of events), proportions (rates between 0 and 1). Applying ordinary least squares to such data leads to inappropriate predictions (probabilities outside [0,1], negative counts) and invalid inference. Generalized Linear Models provide a unified framework that extends linear models to exponential family distributions and uses link functions to ensure predictions remain in valid ranges. This single framework encompasses logistic regression, Poisson regression, and many other models, making it one of the most important tools in applied statistics.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 10.1 Limitations of Linear Regression

**Classical Linear Model:** $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$

**Assumptions:**
- Response $Y$ is continuous
- Errors are normally distributed
- Variance is constant (homoscedasticity)
- Relationship between $E[Y|X]$ and $X$ is linear

**Problems with Non-Normal Data:**

In [None]:
# Binary response: predict probability of success based on x
np.random.seed(42)
x = np.linspace(-2, 2, 50)
true_prob = 1 / (1 + np.exp(-2*x))  # True probability (logistic)
y = np.random.binomial(1, true_prob)  # Binary outcomes

In [None]:
# Fit linear regression (INAPPROPRIATE for binary data)
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(x.reshape(-1, 1), y)
linear_pred = linear_model.predict(x.reshape(-1, 1))

In [None]:
plt.scatter(x, y, alpha=0.5, label='Observed (0/1)')
plt.plot(x, linear_pred, 'r--', linewidth=2, label='Linear regression')
plt.plot(x, true_prob, 'g-', linewidth=2, label='True probability')
plt.xlabel('x'); plt.ylabel('y / P(Y=1)')
plt.title('Linear Regression Fails for Binary Data')
plt.legend(); plt.ylim(-0.2, 1.2)

**Problems Observed:**
- Linear model predicts probabilities < 0 and > 1 (nonsensical)
- Assumes constant variance (but variance of binary variable depends on probability)
- Cannot capture the S-shaped relationship between predictors and probability

**Solution:** Generalized Linear Models address these issues through link functions and exponential family distributions.

## 10.2 Exponential Family Distributions

**Exponential Family:** A distribution belongs to the exponential family if its probability density/mass function can be written as:

$$f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right)$$

where:
- $\theta$ is the natural parameter
- $\phi$ is the dispersion parameter
- $b(\theta)$ is the cumulant function
- $a(\phi)$ and $c(y, \phi)$ are known functions

**Key Properties:**
- $E[Y] = \mu = b'(\theta)$
- $\text{Var}(Y) = b''(\theta) \cdot a(\phi)$

**Common Members:**
- **Normal:** $\theta = \mu$, $b(\theta) = \theta^2/2$
- **Binomial:** $\theta = \log(p/(1-p))$, $b(\theta) = \log(1+e^\theta)$
- **Poisson:** $\theta = \log(\lambda)$, $b(\theta) = e^\theta$
- **Gamma, Exponential, Inverse Gaussian**
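The two key properties can be checked numerically for the Bernoulli and Poisson members listed above. The following is a minimal sketch, assuming $a(\phi) = 1$ for both distributions; the helper `numerical_derivatives` and the parameter values `p_check` and `lam_check` are illustrative choices, not part of the chapter's data.

```python
import numpy as np

def numerical_derivatives(b, theta, h=1e-4):
    """Central finite differences approximating b'(theta) and b''(theta)."""
    first = (b(theta + h) - b(theta - h)) / (2 * h)
    second = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2
    return first, second

# Bernoulli: theta = log(p / (1 - p)), b(theta) = log(1 + exp(theta))
p_check = 0.3
theta_bern = np.log(p_check / (1 - p_check))
mean_bern, var_bern = numerical_derivatives(lambda t: np.log1p(np.exp(t)), theta_bern)

# Poisson: theta = log(lambda), b(theta) = exp(theta)
lam_check = 4.0
theta_pois = np.log(lam_check)
mean_pois, var_pois = numerical_derivatives(np.exp, theta_pois)

print(f"Bernoulli: b' = {mean_bern:.4f} vs p = {p_check}; "
      f"b'' = {var_bern:.4f} vs p(1-p) = {p_check * (1 - p_check):.4f}")
print(f"Poisson:   b' = {mean_pois:.4f} vs lambda = {lam_check}; "
      f"b'' = {var_pois:.4f} vs lambda = {lam_check}")
```

In both cases the first derivative of the cumulant function recovers the mean and the second derivative recovers the variance, up to finite-difference error.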

**Motivation:** Exponential family distributions have convenient mathematical properties for likelihood-based inference. Maximum Likelihood Estimators either have closed forms or are easy to compute numerically, and large-sample theory provides asymptotic normality. Restricting Generalized Linear Models to the exponential family ensures these properties hold across all models in the framework.

## 10.3 Components of Generalized Linear Models

**A Generalized Linear Model has three components:**

1. **Random Component:** Response $Y_i$ follows an exponential family distribution with mean $\mu_i$

2. **Systematic Component:** Linear predictor $\eta_i = \beta_0 + \beta_1 X_{i1} + ... + \beta_p X_{ip} = \mathbf{x}_i^T \boldsymbol{\beta}$

3. **Link Function:** Connects mean to linear predictor: $g(\mu_i) = \eta_i$

**Key Insight:** The link function maps the response mean (which may be constrained, e.g., $\mu \in (0,1)$ for probabilities) onto the scale of the linear predictor (unconstrained, $\eta \in (-\infty, \infty)$).

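**Canonical Link:** The link function that makes $\eta = \theta$ (natural parameter). With a canonical link, $\mathbf{X}^T\mathbf{y}$ is a sufficient statistic for $\boldsymbol{\beta}$, and the observed and expected information coincide, so Newton-Raphson and Fisher scoring are the same algorithm.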

## 10.4 Logistic Regression

**Application:** Binary response variable (success/failure, yes/no, 0/1)

**Random Component:** $Y_i \sim \text{Bernoulli}(p_i)$

**Link Function:** Logit link (canonical): $\log\left(\frac{p_i}{1-p_i}\right) = \eta_i = \beta_0 + \beta_1 X_i$

**Mean Function (Inverse Link):** $p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}}$ (logistic function)

**Motivation:** The logit link maps probabilities $p \in (0,1)$ to log-odds $\log(p/(1-p)) \in (-\infty, \infty)$, allowing us to model the relationship linearly. The inverse link (the logistic function) ensures predicted probabilities always lie strictly between 0 and 1. This S-shaped curve captures how probabilities change gradually near extreme predictor values but rapidly in the middle range.

### Fitting Logistic Regression

In [None]:
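# Same binary data as before
# Note: scikit-learn's LogisticRegression applies an L2 penalty by default (C=1.0),
# so a very large C is used here to approximate the unpenalized maximum likelihood fit
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(C=1e6)
logistic_model.fit(x.reshape(-1, 1), y)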

In [None]:
# Predicted probabilities: p = 1/(1 + exp(-η))
logistic_pred = logistic_model.predict_proba(x.reshape(-1, 1))[:, 1]
print(f"Logistic regression coefficients: β₀ = {logistic_model.intercept_[0]:.3f}, β₁ = {logistic_model.coef_[0][0]:.3f}")

In [None]:
plt.scatter(x, y, alpha=0.5, label='Observed (0/1)')
plt.plot(x, logistic_pred, 'b-', linewidth=2, label='Logistic regression')
plt.plot(x, true_prob, 'g--', linewidth=2, label='True probability')
plt.xlabel('x'); plt.ylabel('P(Y=1)')
plt.title('Logistic Regression for Binary Data')
plt.legend(); plt.ylim(-0.1, 1.1)

**Result:** Logistic regression produces valid probabilities in [0,1] and captures the S-shaped relationship.

### Interpreting Logistic Regression Coefficients

**Log-Odds Interpretation:** $\beta_1$ is the change in log-odds for a one-unit increase in $X$.

**Odds Ratio:** $e^{\beta_1}$ is the multiplicative change in odds for a one-unit increase in $X$.

**Example:** If $\beta_1 = 0.693$, then $e^{0.693} = 2$, meaning each one-unit increase in $X$ doubles the odds of success.

In [None]:
# Odds ratio: exp(β₁)
beta1 = logistic_model.coef_[0][0]
odds_ratio = np.exp(beta1)
print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Interpretation: One-unit increase in x multiplies odds by {odds_ratio:.2f}")

### Maximum Likelihood Estimation for Logistic Regression

**Log-Likelihood:** $\ell(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \log(p_i) + (1-y_i)\log(1-p_i)]$

where $p_i = 1/(1 + e^{-\mathbf{x}_i^T\boldsymbol{\beta}})$

**Estimation:** No closed form; use iterative methods (Newton-Raphson, Fisher scoring).
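To make the log-likelihood concrete, the sketch below evaluates $\ell(\boldsymbol{\beta})$ and its gradient (the score, $\mathbf{X}^T(\mathbf{y} - \mathbf{p})$) and compares the analytic gradient with a finite-difference approximation. The simulated data (`X_demo`, `y_demo`) and the evaluation point `beta_test` are assumptions made for illustration only.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood for logistic regression."""
    eta = X @ beta
    # log(p) = -log(1 + e^{-eta}) and log(1 - p) = -log(1 + e^{eta});
    # logaddexp evaluates both without overflow
    return np.sum(-y * np.logaddexp(0, -eta) - (1 - y) * np.logaddexp(0, eta))

def score(beta, X, y):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

# Illustrative simulated data (not the chapter's dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=100)
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
y_demo = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x_demo))))

beta_test = np.array([0.2, 1.0])
h = 1e-6
numeric_grad = np.array([
    (log_likelihood(beta_test + h * e, X_demo, y_demo)
     - log_likelihood(beta_test - h * e, X_demo, y_demo)) / (2 * h)
    for e in np.eye(2)
])
analytic_grad = score(beta_test, X_demo, y_demo)
print("Finite-difference gradient:", numeric_grad)
print("Analytic score:            ", analytic_grad)
```

Setting the score to zero gives the likelihood equations, which are nonlinear in $\boldsymbol{\beta}$; this is why an iterative solver is needed.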

**Asymptotic Properties:** The Maximum Likelihood Estimator is asymptotically normal: for large $n$, $\hat{\boldsymbol{\beta}} \overset{\cdot}{\sim} N(\boldsymbol{\beta}, \mathcal{I}^{-1}(\boldsymbol{\beta}))$, where $\mathcal{I}$ is the Fisher information matrix.

## 10.5 Poisson Regression

**Application:** Count data (number of events: accidents, customer arrivals, disease cases)

**Random Component:** $Y_i \sim \text{Poisson}(\lambda_i)$

**Link Function:** Log link (canonical): $\log(\lambda_i) = \eta_i = \beta_0 + \beta_1 X_i$

**Mean Function (Inverse Link):** $\lambda_i = e^{\eta_i}$

**Motivation:** Count data are non-negative integers. The log link ensures predicted counts are always positive: $\lambda = e^\eta > 0$ for any $\eta$. The exponential relationship means effects are multiplicative on the count scale: increasing $X$ by one unit multiplies the expected count by $e^{\beta_1}$.

In [None]:
# Generate count data: number of events increases with x
np.random.seed(42)
x_count = np.linspace(0, 3, 40)
true_lambda = np.exp(1 + 0.5 * x_count)  # λ = exp(1 + 0.5x)
y_count = np.random.poisson(true_lambda)

In [None]:
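# Fit Poisson regression: log(λ) = β₀ + β₁x
# Note: PoissonRegressor applies an L2 penalty by default (alpha=1.0);
# alpha=0 gives the unpenalized maximum likelihood fit
from sklearn.linear_model import PoissonRegressor
poisson_model = PoissonRegressor(alpha=0)
poisson_model.fit(x_count.reshape(-1, 1), y_count)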

In [None]:
# Predicted counts: λ̂ = exp(β₀ + β₁x)
poisson_pred = poisson_model.predict(x_count.reshape(-1, 1))
print(f"Poisson regression: β₀ = {poisson_model.intercept_:.3f}, β₁ = {poisson_model.coef_[0]:.3f}")

In [None]:
plt.scatter(x_count, y_count, alpha=0.6, label='Observed counts')
plt.plot(x_count, poisson_pred, 'r-', linewidth=2, label='Poisson regression')
plt.plot(x_count, true_lambda, 'g--', linewidth=2, label='True λ')
plt.xlabel('x'); plt.ylabel('Count'); plt.title('Poisson Regression')
plt.legend()

### Interpreting Poisson Regression Coefficients

**Rate Ratio:** $e^{\beta_1}$ is the multiplicative change in expected count for a one-unit increase in $X$.

**Example:** If $\beta_1 = 0.5$, then $e^{0.5} = 1.65$, meaning each one-unit increase in $X$ increases the expected count by 65%.

In [None]:
# Rate ratio: exp(β₁)
beta1_poisson = poisson_model.coef_[0]
rate_ratio = np.exp(beta1_poisson)
print(f"Rate ratio: {rate_ratio:.3f}")
print(f"Interpretation: One-unit increase in x multiplies expected count by {rate_ratio:.2f}")

## 10.6 Common Link Functions

**Each distribution has a canonical link, although other links can be used:**

| **Distribution** | **Canonical Link** | **Link Function** | **Inverse Link** | **Application** |
|------------------|-------------------|-------------------|------------------|----------------|
| Normal | Identity | $g(\mu) = \mu$ | $\mu = \eta$ | Continuous response |
| Binomial | Logit | $g(p) = \log(p/(1-p))$ | $p = 1/(1+e^{-\eta})$ | Binary/proportion |
| Poisson | Log | $g(\lambda) = \log(\lambda)$ | $\lambda = e^\eta$ | Count data |
| Gamma | Inverse | $g(\mu) = 1/\mu$ | $\mu = 1/\eta$ | Positive continuous |

**Non-Canonical Links:**
- **Probit link** for binomial: $g(p) = \Phi^{-1}(p)$ (inverse normal cumulative distribution function)
- **Complementary log-log:** $g(p) = \log(-\log(1-p))$ for binomial
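The three binomial links above can be written as (link, inverse link) pairs and checked by a round trip $p \to \eta \to p$. This is a small sketch; the dictionary `binomial_links` and the probabilities in `p_grid` are illustrative names and values, and `scipy.stats.norm` supplies $\Phi$ and $\Phi^{-1}$ for the probit link.

```python
import numpy as np
from scipy.stats import norm

# Each entry maps a link name to (g, g_inverse)
binomial_links = {
    "logit": (lambda p: np.log(p / (1 - p)),
              lambda eta: 1 / (1 + np.exp(-eta))),
    "probit": (norm.ppf, norm.cdf),
    "cloglog": (lambda p: np.log(-np.log(1 - p)),
                lambda eta: 1 - np.exp(-np.exp(eta))),
}

p_grid = np.array([0.1, 0.5, 0.9])
for name, (g, g_inv) in binomial_links.items():
    eta_grid = g(p_grid)
    print(f"{name:8s} eta = {np.round(eta_grid, 3)}  back to p = {np.round(g_inv(eta_grid), 3)}")
```

The logit and probit links are symmetric around $p = 0.5$ (both give $\eta = 0$ there), while the complementary log-log link is asymmetric, which is one reason it may fit some data better.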

**Choosing Link:**
- Canonical link: mathematically convenient (the likelihood equations simplify and Fisher scoring coincides with Newton-Raphson), which simplifies computation
- Alternative links: may fit data better or have more natural interpretation

## 10.7 Model Fitting and Inference

**Estimation Method:** Maximum Likelihood via Iteratively Reweighted Least Squares

**Procedure:**
1. Start with initial parameter estimates
2. Compute working response and weights
3. Fit weighted least squares
4. Update parameters
5. Repeat until convergence
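The five steps above can be sketched for logistic regression with the canonical logit link, where the weights are $w_i = \mu_i(1-\mu_i)$ and the working response is $z_i = \eta_i + (y_i - \mu_i)/w_i$. This is a minimal illustration, not the routine any particular library uses; the function name `irls_logistic`, its defaults, and the simulated data are assumptions.

```python
import numpy as np

def irls_logistic(X, y, max_iter=25, tol=1e-8):
    """Iteratively Reweighted Least Squares for logistic regression (logit link)."""
    beta = np.zeros(X.shape[1])                        # 1. initial estimates
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        w = mu * (1 - mu)                              # 2. weights (Bernoulli variance)
        z = eta + (y - mu) / w                         # 2. working response
        XtW = X.T * w                                  # X^T W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # 3. weighted least squares
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new                                # 4. update parameters
        if converged:                                  # 5. stop at convergence
            break
    mu = 1 / (1 + np.exp(-(X @ beta)))
    information = (X.T * (mu * (1 - mu))) @ X          # Fisher information X^T W X
    return beta, np.linalg.inv(information)

# Illustrative simulated data (separate from the chapter's x and y)
rng = np.random.default_rng(42)
x_irls = np.linspace(-2, 2, 200)
X_irls = np.column_stack([np.ones_like(x_irls), x_irls])
y_irls = rng.binomial(1, 1 / (1 + np.exp(-2 * x_irls)))

beta_irls, cov_irls = irls_logistic(X_irls, y_irls)
print("Estimates:      ", beta_irls)
print("Standard errors:", np.sqrt(np.diag(cov_irls)))
```

The sketch has no safeguards for perfectly separated data, where weights shrink toward zero and the estimates diverge; production implementations add step control and convergence checks for that case.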

**Asymptotic Distribution:** For large samples, $\hat{\boldsymbol{\beta}} \overset{\cdot}{\sim} N(\boldsymbol{\beta}, \mathcal{I}^{-1}(\boldsymbol{\beta}))$

**Standard Errors:** Square roots of the diagonal elements of $\mathcal{I}^{-1}(\hat{\boldsymbol{\beta}})$

**Confidence Intervals:** $\hat{\beta}_j \pm z_{\alpha/2} \cdot \widehat{SE}(\hat{\beta}_j)$

**Hypothesis Tests:**
- **Wald test:** $z = \hat{\beta}_j / \widehat{SE}(\hat{\beta}_j) \sim N(0,1)$ under $H_0: \beta_j = 0$
- **Likelihood ratio test:** Compare nested models using deviance
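As a worked example of the Wald test and confidence interval formulas, the sketch below uses a hypothetical estimate and standard error (`beta_hat_example = 0.80`, `se_example = 0.25`); these numbers are made up for illustration and do not come from any model in this chapter.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical coefficient estimate and standard error (illustrative values)
beta_hat_example = 0.80
se_example = 0.25

# Wald statistic and two-sided p-value under H0: beta_j = 0
z_wald = beta_hat_example / se_example
p_value_wald = 2 * norm.sf(abs(z_wald))

# 95% confidence interval: beta_hat ± z_{alpha/2} * SE
z_crit = norm.ppf(0.975)
ci_wald = (beta_hat_example - z_crit * se_example,
           beta_hat_example + z_crit * se_example)

print(f"z = {z_wald:.2f}, p-value = {p_value_wald:.4f}")
print(f"95% CI for beta: ({ci_wald[0]:.3f}, {ci_wald[1]:.3f})")
# For a logistic model, exponentiating the endpoints gives a CI for the odds ratio
print(f"95% CI for exp(beta): ({np.exp(ci_wald[0]):.3f}, {np.exp(ci_wald[1]):.3f})")
```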

In [None]:
# Access coefficient statistics from statsmodels
import statsmodels.api as sm
# Add intercept to design matrix
X_with_intercept = sm.add_constant(x.reshape(-1, 1))

In [None]:
# Fit logistic regression with full inference
logit_model = sm.Logit(y, X_with_intercept)
logit_result = logit_model.fit(disp=False)

In [None]:
# Summary with Standard Errors, z-statistics, p-values, Confidence Intervals
print(logit_result.summary2())

## 10.8 Deviance and Model Comparison

**Deviance:** Measure of model fit (smaller is better).

**Definition:** $D = -2[\ell(\hat{\boldsymbol{\beta}}) - \ell(\hat{\boldsymbol{\beta}}_{saturated})]$

where $\ell(\hat{\boldsymbol{\beta}})$ is log-likelihood of fitted model and $\ell(\hat{\boldsymbol{\beta}}_{saturated})$ is log-likelihood of saturated model (one parameter per observation).

**Null Deviance:** Deviance for model with only intercept

**Residual Deviance:** Deviance for fitted model

**Interpretation:** Reduction in deviance measures improvement of model over null model.

**Motivation:** Deviance plays the role of the residual sum of squares in linear regression: it quantifies how far the fitted model is from a perfect (saturated) fit. In large samples, the difference in deviance between nested models approximately follows a chi-squared distribution with degrees of freedom equal to the number of extra parameters, enabling likelihood ratio tests.

In [None]:
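# Deviance statistics from fitted model
# For ungrouped binary data the saturated log-likelihood is 0, so deviance = -2 * log-likelihood
null_deviance = -2 * logit_result.llnull
residual_deviance = -2 * logit_result.llf
print(f"Null deviance: {null_deviance:.2f}")
print(f"Residual deviance: {residual_deviance:.2f}")
print(f"Reduction in deviance: {null_deviance - residual_deviance:.2f}")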

### Likelihood Ratio Test

**Test Statistic:** $LR = D_{reduced} - D_{full} \sim \chi^2_{\Delta df}$ under $H_0$

**Use:** Compare nested models (e.g., test if additional predictors improve fit)

In [None]:
# Add quadratic term: test if x² improves model
X_quadratic = sm.add_constant(np.column_stack([x, x**2]))
logit_quad = sm.Logit(y, X_quadratic)
logit_quad_result = logit_quad.fit(disp=False)

In [None]:
# Likelihood ratio test: H₀: quadratic term coefficient = 0
# LR = D_reduced - D_full = 2 * (llf_full - llf_reduced); the saturated term cancels
lr_statistic = 2 * (logit_quad_result.llf - logit_result.llf)
df_difference = 1  # One additional parameter
p_value_lr = stats.chi2.sf(lr_statistic, df_difference)  # upper-tail probability
print(f"Likelihood Ratio statistic: {lr_statistic:.3f}")
print(f"p-value: {p_value_lr:.4f}")

## 10.9 Model Diagnostics

**Residuals for Generalized Linear Models:**

**Deviance Residuals:** $r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}$ where $d_i$ is contribution to deviance.

**Pearson Residuals:** $r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{\widehat{\text{Var}}(Y_i)}}$

**Standardized Residuals:** Adjust for leverage: $r_i^{std} = \frac{r_i}{\sqrt{1-h_{ii}}}$
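For a Bernoulli response, both residual types can be computed by hand, which makes the definitions concrete. In this sketch `y_bin` and `mu_bin` are hypothetical outcomes and fitted probabilities; the variance estimate is $\hat{\mu}_i(1-\hat{\mu}_i)$, and because the saturated log-likelihood is zero for 0/1 data, $d_i$ is simply $-2$ times each observation's log-likelihood contribution.

```python
import numpy as np

y_bin = np.array([0, 0, 1, 1, 1])                 # hypothetical binary outcomes
mu_bin = np.array([0.1, 0.6, 0.4, 0.8, 0.95])     # hypothetical fitted probabilities

# Pearson residuals: (y - mu) / sqrt(Var(Y)), with Var(Y) = mu (1 - mu)
pearson_resid = (y_bin - mu_bin) / np.sqrt(mu_bin * (1 - mu_bin))

# Deviance contributions and signed deviance residuals
d_contrib = -2 * (y_bin * np.log(mu_bin) + (1 - y_bin) * np.log(1 - mu_bin))
deviance_resid = np.sign(y_bin - mu_bin) * np.sqrt(d_contrib)

print("Pearson residuals: ", np.round(pearson_resid, 3))
print("Deviance residuals:", np.round(deviance_resid, 3))
print(f"Sum of squared deviance residuals = {np.sum(deviance_resid**2):.4f}")
print(f"Total deviance                    = {np.sum(d_contrib):.4f}")
```

The squared deviance residuals sum to the model deviance, just as squared ordinary residuals sum to the residual sum of squares in linear regression.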

**Diagnostic Plots:**
- Residuals versus fitted values (check for patterns)
- Quantile-Quantile plot (check distributional assumptions)
- Cook's distance (identify influential observations)

In [None]:
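# Compute deviance residuals (statsmodels Logit results expose them as resid_dev)
deviance_residuals = logit_result.resid_dev
# predict() returns fitted probabilities; fittedvalues would give the linear predictor
fitted_values = logit_result.predict()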

In [None]:
# Residual plot: Check for patterns
plt.scatter(fitted_values, deviance_residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Fitted Values'); plt.ylabel('Deviance Residuals')
plt.title('Residual Plot for Logistic Regression')

### Overdispersion

**Problem:** For Poisson and binomial, variance is determined by mean. Real data often show greater variability.

**Overdispersion:** $\text{Var}(Y) > \text{Var}_{model}(Y)$

**Detection:** Residual deviance / degrees of freedom $\gg 1$ (for count or grouped binomial data; the ratio is not informative for ungrouped 0/1 responses)

**Solutions:**
- **Quasi-likelihood:** Estimate dispersion parameter from data
- **Negative binomial:** For count data with overdispersion
- **Beta-binomial:** For binomial data with overdispersion
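A quick simulation shows what overdispersion looks like relative to the Poisson assumption $\text{Var}(Y) = E[Y]$. This is a sketch under assumed settings: the mean `mean_count`, the negative binomial size parameter `nb_size`, and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mean_count = 5.0
n_draws = 10_000

# Poisson counts: variance equals the mean
poisson_draws = rng.poisson(mean_count, size=n_draws)

# Negative binomial with the same mean: variance = mean + mean^2 / nb_size
nb_size = 2.0
nb_prob = nb_size / (nb_size + mean_count)
nb_draws = rng.negative_binomial(nb_size, nb_prob, size=n_draws)

poisson_ratio = poisson_draws.var(ddof=1) / poisson_draws.mean()
nb_ratio = nb_draws.var(ddof=1) / nb_draws.mean()

print(f"Poisson:           mean = {poisson_draws.mean():.2f}, variance/mean = {poisson_ratio:.2f}")
print(f"Negative binomial: mean = {nb_draws.mean():.2f}, variance/mean = {nb_ratio:.2f}")
```

With these settings the theoretical variance-to-mean ratio is 1 for the Poisson draws and $1 + 5/2 = 3.5$ for the negative binomial draws, so a Poisson model fitted to the latter would understate standard errors.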

In [None]:
# Check for overdispersion in Poisson model
X_count_with_intercept = sm.add_constant(x_count.reshape(-1, 1))
poisson_sm = sm.GLM(y_count, X_count_with_intercept, family=sm.families.Poisson())
poisson_sm_result = poisson_sm.fit()

In [None]:
# Dispersion parameter estimate: deviance / df
dispersion = poisson_sm_result.deviance / poisson_sm_result.df_resid
print(f"Dispersion parameter estimate: {dispersion:.3f}")
if dispersion > 1.5:
    print("Evidence of overdispersion (should be ≈ 1 for Poisson)")
else:
    print("No strong evidence of overdispersion")

## 10.10 Extensions and Related Models

**Multinomial Logistic Regression:** Categorical response with > 2 levels

**Ordinal Logistic Regression:** Ordered categorical response (e.g., ratings: poor/fair/good/excellent)

**Negative Binomial Regression:** Count data with overdispersion

**Zero-Inflated Models:** Count data with excess zeros

**Generalized Additive Models:** Replace linear predictors with smooth functions: $g(\mu_i) = f_1(x_{i1}) + f_2(x_{i2}) + ...$

**Mixed Effects Models:** Include random effects for hierarchical/clustered data

## Summary: Generalized Linear Models Framework

**Generalized Linear Models unify diverse regression models:**

**Three Components:**
1. **Random component:** Response from exponential family (Normal, Binomial, Poisson, Gamma, etc.)
2. **Systematic component:** Linear predictor $\eta = \mathbf{X}\boldsymbol{\beta}$
3. **Link function:** Connects mean to linear predictor: $g(\mu) = \eta$

**Key Models:**
- **Linear regression:** Identity link, Normal distribution (continuous response)
- **Logistic regression:** Logit link, Binomial distribution (binary response)
- **Poisson regression:** Log link, Poisson distribution (count data)

**Advantages:**
- Unified framework for different response types
- Link functions ensure predictions in valid range
- Maximum Likelihood Estimation with good asymptotic properties
- Flexible: can accommodate various distributions and relationships

**Inference:**
- Wald tests for individual coefficients
- Likelihood ratio tests for nested models
- Deviance measures goodness-of-fit
- Residual analysis checks assumptions

**Practical Considerations:**
- Check for overdispersion (especially Poisson/binomial)
- Use residual plots to diagnose problems
- Compare nested models with likelihood ratio tests
- Consider alternative link functions if canonical link fits poorly

## Key Takeaways

- **Generalized Linear Models extend linear regression beyond normality:** By allowing exponential family distributions and using link functions, Generalized Linear Models handle binary, count, and other non-normal responses that ordinary linear regression cannot.

- **Link functions ensure valid predictions:** Logit link keeps probabilities in [0,1], log link keeps counts positive. This prevents nonsensical predictions that arise from applying linear regression to constrained responses.

- **Coefficients have multiplicative interpretations:** In logistic regression, $e^{\beta}$ is an odds ratio. In Poisson regression, $e^{\beta}$ is a rate ratio. This differs from linear regression's additive effects.

- **Maximum Likelihood Estimation provides asymptotic normality:** Standard Errors, confidence intervals, and hypothesis tests follow from asymptotic theory, enabling inference even without exact sampling distributions.

- **Deviance replaces residual sum of squares:** Deviance measures model fit for non-normal responses. Differences in deviance between nested models follow chi-squared distributions, enabling likelihood ratio tests.

- **Residual analysis is still essential:** Deviance residuals and Pearson residuals help diagnose model problems, identify outliers, and check for overdispersion. Good Generalized Linear Model practice requires examining residuals just as in linear regression.

- **Overdispersion is common in practice:** Real Poisson and binomial data often show more variability than the model assumes. Detecting and addressing overdispersion (via quasi-likelihood, negative binomial, etc.) is crucial for valid inference.