# Section 2: Bayesian Inference and PyMC

#### PyData London 2025 - Bayesian Time Series Analysis with PyMC

---

## Statistical Paradigms: Two Ways of Thinking

Before diving into Bayesian methods, it's crucial to understand how Bayesian thinking differs from the traditional frequentist approach.

### Frequentist Worldview

![Fisher](images/fisher.png)

The **frequentist approach** to statistics treats data and parameters in a fundamentally different way than Bayesian methods. In this framework, **data are considered random** because they vary each time we collect them from the same underlying process. Conversely, **parameters are treated as fixed** but unknown constants that we seek to estimate. This leads to the mathematical formulation where we condition on parameters: 

$$\Large P(\text{data} | \text{parameters}) $$

Frequentist inference operates through **point estimators** and **confidence intervals**, where the interpretation of uncertainty relates to the long-run frequency properties of our estimation procedures. A 95% confidence interval, for example, means that if we repeated our sampling procedure many times, 95% of the intervals we construct would contain the true parameter value.
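
The long-run interpretation of a confidence interval can be checked by simulation. The sketch below assumes a Normal process with a known standard deviation; the true mean, sample size, and number of repetitions are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" process and experiment design
true_mu, sigma, n, n_repeats = 5.0, 2.0, 30, 5000

# Each row is one repetition of the experiment: n draws from the same process
samples = rng.normal(true_mu, sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# 95% interval for the mean when sigma is known: mean +/- 1.96 * sigma / sqrt(n)
half_width = 1.96 * sigma / np.sqrt(n)
covered = (sample_means - half_width <= true_mu) & (true_mu <= sample_means + half_width)

coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95. Note that the statement is about the procedure across repetitions, not about any single interval.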

### Bayesian Worldview  

![Bayes](images/bayes.png)

The **Bayesian paradigm** flips this perspective. Here, **data are treated as fixed** once we have observed them, because they represent the concrete evidence we have gathered. In contrast, **parameters are treated as random variables** about which we express our uncertainty using probability distributions. This leads to conditioning on the observed data:

$$\Large P(\text{parameters} | \text{data}) $$

Bayesian inference proceeds through **probability distributions** that directly quantify our uncertainty about parameter values. A Bayesian 95% credible interval has the intuitive interpretation that there is a 95% probability that the parameter lies within that interval, given our data and model assumptions.
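
To make the credible-interval interpretation concrete, here is a minimal sketch using a Beta-Binomial model, where the posterior is available in closed form. The data (7 successes in 20 trials) and the flat Beta(1, 1) prior are assumptions made for this example, not part of the births analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

successes, trials = 7, 20      # hypothetical observed data
prior_a, prior_b = 1.0, 1.0    # Beta(1, 1) is a flat prior on [0, 1]

# Conjugacy: a Beta prior with a Binomial likelihood gives a Beta posterior
post_a = prior_a + successes
post_b = prior_b + (trials - successes)

# Draw from the posterior and read the interval off the draws
draws = rng.beta(post_a, post_b, size=100_000)
lower, upper = np.percentile(draws, [2.5, 97.5])

print(f"Posterior mean: {draws.mean():.3f}")
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Given this model and these data, the probability that the parameter lies between `lower` and `upper` is 95%, which is exactly the statement a credible interval makes.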

This fundamental philosophical difference leads to more **intuitive interpretations** and **natural uncertainty quantification**, making Bayesian methods particularly appealing for time series analysis where we often want to make probabilistic statements about future observations.

---

## Bayes' Theorem: The Foundation

**Bayes' Theorem** is our single tool for learning from data:

$$\huge
\underbrace{\text{Pr}(\theta | y)}_{\textcolor{yellow}{\small \text{Posterior Probability}}}
= 
\frac{
\overbrace{\text{Pr}(y | \theta)}^{\textcolor{yellow}{\small \text{Data Likelihood}}} \cdot
\overbrace{\text{Pr}(\theta)}^{\textcolor{yellow}{\small \text{Prior Probability}}}
}{
\underbrace{\text{Pr}(y)}_{\textcolor{yellow}{\small \text{Normalizing Constant}}}
}
$$

### Breaking Down the Components

- **$P(\theta | y)$** = **Posterior**: What we learn about parameters after seeing data
- **$P(y | \theta)$** = **Likelihood**: How well different parameter values explain our data  
- **$P(\theta)$** = **Prior**: Our initial beliefs about parameters before seeing data
- **$P(y)$** = **Evidence**: Normalizing constant (usually intractable)
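
For a single parameter, each component can be computed directly on a grid of candidate values. The sketch below assumes a coin-flip style dataset (7 successes in 20 trials) and a flat prior, both hypothetical; the sum over the grid plays the role of the evidence $P(y)$.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)    # grid of candidate parameter values
successes, trials = 7, 20              # hypothetical data

prior = np.ones_like(theta)            # flat prior, P(theta)
# Binomial likelihood P(y | theta); the binomial coefficient is omitted
# because it cancels when we normalize
likelihood = theta**successes * (1 - theta) ** (trials - successes)

unnormalized = likelihood * prior
evidence = unnormalized.sum()          # grid stand-in for P(y)
posterior = unnormalized / evidence    # P(theta | y) as probabilities on the grid

print(f"Posterior sums to {posterior.sum():.3f}")
print(f"Posterior mode: {theta[posterior.argmax()]:.3f}")
```

Grid approximation stops being practical beyond a handful of parameters, which is why the evidence is usually intractable and why we turn to MCMC.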

### The Learning Process

Bayesian inference is a formal process of **updating beliefs**:

1. **Start** with prior beliefs $P(\theta)$
2. **Observe** data and compute likelihood $P(y | \theta)$
3. **Update** to posterior beliefs $P(\theta | y)$
4. **Repeat** as more data arrives

This makes Bayesian methods naturally suited for time series, where we continuously update our understanding as new observations arrive.
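
The updating loop can be sketched with a conjugate Normal model, where each update has a closed form. The helper `update_normal` and the simulated observations are hypothetical; the noise variance is assumed known to keep the algebra simple.

```python
import numpy as np

def update_normal(prior_mean, prior_var, y, noise_var):
    """Posterior of a Normal mean with known noise variance, given one or more observations."""
    y = np.atleast_1d(y)
    post_precision = 1.0 / prior_var + y.size / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + y.sum() / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(2)
noise_var = 1.0
y = rng.normal(0.8, np.sqrt(noise_var), size=12)   # hypothetical observations

# Sequential: the posterior after each observation becomes the next prior
mean, var = 0.0, 1.0
for y_t in y:
    mean, var = update_normal(mean, var, y_t, noise_var)

# Batch: condition on all observations at once
batch_mean, batch_var = update_normal(0.0, 1.0, y, noise_var)

print(f"Sequential: mean={mean:.4f}, var={var:.4f}")
print(f"Batch:      mean={batch_mean:.4f}, var={batch_var:.4f}")
```

Both routes give the same posterior, so processing observations one at a time loses nothing relative to analyzing them together.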

## Prior Specification: Encoding Our Beliefs

One of the most important aspects of Bayesian modeling is **prior specification**. Priors encode our beliefs about parameters before seeing data.

### Types of Priors

The choice of prior influences both how efficiently the model can be sampled and how the results should be interpreted. The three broad types below differ in how much information they contribute relative to the data.

**Informative priors** incorporate substantial domain knowledge or previous research findings into the analysis. These priors express strong beliefs about parameter values based on external information. For example, if we know from demographic research that birth rates typically range between 10 and 20 per 1000 people, we might specify `pm.Normal('rate', mu=15, sigma=2)`, which places roughly 95% of its prior mass between 11 and 19. Informative priors are particularly valuable when working with limited data or when we want to incorporate expert knowledge into our analysis.

**Weakly informative priors** provide gentle regularization without imposing strong constraints on the parameter space. These priors help stabilize the estimation process while allowing the data to largely determine the posterior distribution. A common example is assuming that standardized coefficients usually fall between -3 and 3, leading to a specification like `pm.Normal('coef', mu=0, sigma=1)`. This approach prevents extreme parameter values while remaining relatively non-committal about the exact values.

**Non-informative priors** are designed to let the data dominate the analysis by expressing minimal prior knowledge. These priors attempt to be "objective" by spreading probability mass widely across the parameter space. An example might be `pm.Uniform('param', lower=-100, upper=100)` for a parameter where we have little prior knowledge. However, truly non-informative priors are often difficult to specify and can sometimes lead to computational problems.
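
One way to compare these three example priors is to look at the range each one considers plausible. The sketch below draws from each distribution with NumPy rather than PyMC, which is an assumption made to keep the example lightweight; the distributions match the specifications quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws = 100_000

# NumPy stand-ins for the three PyMC prior specifications in the text
priors = {
    "informative: Normal(15, 2)": rng.normal(15, 2, n_draws),
    "weakly informative: Normal(0, 1)": rng.normal(0, 1, n_draws),
    "non-informative: Uniform(-100, 100)": rng.uniform(-100, 100, n_draws),
}

# Central 95% range of each prior
intervals = {}
for name, draws in priors.items():
    lo, hi = np.percentile(draws, [2.5, 97.5])
    intervals[name] = (lo, hi)
    print(f"{name:<38} central 95%: [{lo:8.2f}, {hi:8.2f}]")
```

The wide uniform prior treats a value of 90 as no less plausible than a value of 1, which is rarely what we believe about a standardized quantity.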

### Prior Choice Guidelines

Effective prior specification requires balancing several considerations to ensure both computational stability and meaningful results. **Starting with weakly informative priors** provides an excellent default approach because they offer numerical stability without imposing strong constraints on the analysis. These priors help prevent the sampler from exploring extreme regions of the parameter space that might be computationally problematic.

**Using domain knowledge** when available can significantly improve model performance and interpretability. Subject matter expertise often provides valuable constraints on reasonable parameter ranges, and incorporating this knowledge through informative priors can lead to more realistic and interpretable results.

**Checking prior predictive distributions** serves as a crucial validation step in the modeling process. Before observing any data, we should simulate from our prior distributions to ensure they generate reasonable predictions. If our prior predictive distributions produce implausible results, this indicates that our prior specifications need adjustment.
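
A prior predictive check can be sketched without a sampler by drawing parameters from the priors and then simulating data from them. The example assumes a simple linear-trend model for standardized data; the helper `prior_predictive`, the time grid, and the threshold of 5 standard deviations are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n_sims, n_obs = 1000, 50
t = np.linspace(-1.7, 1.7, n_obs)    # hypothetical standardized time index

def prior_predictive(coef_sd):
    """Simulate datasets from the priors of a simple linear-trend model."""
    intercept = rng.normal(0, coef_sd, size=(n_sims, 1))
    slope = rng.normal(0, coef_sd, size=(n_sims, 1))
    sigma = np.abs(rng.normal(0, 1, size=(n_sims, 1)))   # HalfNormal(1)
    return rng.normal(intercept + slope * t, sigma)      # shape (n_sims, n_obs)

# Share of simulated points that would be implausible for standardized data
results = {}
for coef_sd in (1.0, 100.0):
    y_sim = prior_predictive(coef_sd)
    results[coef_sd] = np.mean(np.abs(y_sim) > 5)
    print(f"coef sd={coef_sd:>5}: share of points with |y| > 5 = {results[coef_sd]:.2%}")
```

With a coefficient standard deviation of 100, almost every simulated point is far outside the range standardized data can take, which signals that the prior needs tightening.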

Finally, **testing sensitivity** to prior choices helps assess the robustness of our conclusions. By fitting the model with different reasonable prior specifications and comparing the results, we can determine how much our conclusions depend on prior assumptions versus the observed data. Substantial sensitivity to prior choice may indicate that we need more data or more carefully considered priors.
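
A sensitivity analysis can be illustrated with a model whose posterior has a closed form, so no sampling is needed. The sketch assumes a Normal mean with known noise standard deviation; the helper `posterior_mean`, the datasets, and the candidate prior widths are hypothetical.

```python
import numpy as np

def posterior_mean(y, prior_mean, prior_sd, noise_sd=1.0):
    """Closed-form posterior mean of a Normal mean with known noise sd."""
    precision = 1 / prior_sd**2 + len(y) / noise_sd**2
    return (prior_mean / prior_sd**2 + np.sum(y) / noise_sd**2) / precision

rng = np.random.default_rng(5)
y_small = rng.normal(2.0, 1.0, size=5)      # hypothetical small dataset
y_large = rng.normal(2.0, 1.0, size=500)    # hypothetical large dataset

prior_sds = (0.5, 1.0, 10.0)                # three "reasonable" prior widths
small_means = [posterior_mean(y_small, 0.0, sd) for sd in prior_sds]
large_means = [posterior_mean(y_large, 0.0, sd) for sd in prior_sds]

spread_small = max(small_means) - min(small_means)
spread_large = max(large_means) - min(large_means)

print(f"n=5:   posterior means {np.round(small_means, 3)}, spread {spread_small:.3f}")
print(f"n=500: posterior means {np.round(large_means, 3)}, spread {spread_large:.3f}")
```

With five observations the conclusion moves noticeably with the prior; with five hundred the data dominate and the prior choice barely matters.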

## Likelihood Choice: Matching Model to Data

The **likelihood function** connects parameters to data by specifying the probability distribution of observations given parameter values.

### Common Likelihood Functions

**Normal** for continuous, symmetric data: `pm.Normal('obs', mu=mu, sigma=sigma)`  
**Poisson** for count data: `pm.Poisson('counts', mu=rate)`  
**Binomial** for the number of successes in a fixed number of binary trials: `pm.Binomial('successes', n=n_trials, p=prob)`

### Selection Guidelines

**Match data characteristics**: Consider whether your data are continuous, counts, or proportions. Check key assumptions like symmetry (Normal) or mean-variance relationship (Poisson).

**Use robust alternatives when needed**: Student-t handles outliers better than Normal. Negative Binomial accommodates overdispersed counts better than Poisson.

**Let the data-generating process guide you**: Time between events suggests Exponential; measurement errors often follow Normal; rare events suit Poisson.
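
The mean-variance check for count data can be done in a few lines. The sketch below simulates Poisson and Negative Binomial counts with the same mean and compares their variance-to-mean ratios; the mean, shape parameter, and helper `dispersion` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
mean_count = 10.0

poisson_counts = rng.poisson(mean_count, size=20_000)

# Negative Binomial with the same mean, using NumPy's (n, p) parameterization:
# mean = n * (1 - p) / p, so p = n / (n + mean)
n_shape = 2.0
nb_counts = rng.negative_binomial(n_shape, n_shape / (n_shape + mean_count), size=20_000)

def dispersion(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return counts.var() / counts.mean()

print(f"Poisson dispersion:           {dispersion(poisson_counts):.2f}")
print(f"Negative Binomial dispersion: {dispersion(nb_counts):.2f}")
```

A ratio well above 1 in real count data suggests that a Poisson likelihood will understate the variability and that a Negative Binomial is worth trying.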

In [None]:
import numpy as np
import polars as pl
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
import warnings

plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
warnings.filterwarnings('ignore')

RANDOM_SEED = 42
RNG = np.random.default_rng(RANDOM_SEED)

print("🔧 Libraries loaded successfully!")

As an example, let's build a very simple model using the births dataset from the last section.

In [None]:
births_data = pl.read_csv('../data/births.csv', null_values=['null', 'NA', '', 'NULL'])
births_data = births_data.filter(pl.col('day').is_not_null())

# Aggregate to monthly data
monthly_births = (births_data
    .group_by(['year', 'month'])
    .agg(pl.col('births').sum())
    .sort(['year', 'month'])
)

# Focus on 1970-1990 period
births_subset = (monthly_births
    .filter((pl.col('year') >= 1970) & (pl.col('year') <= 1990))
    .with_row_index('index')
)

# Standardize the data
original_data = births_subset['births'].to_numpy()
births_standardized = (original_data - original_data.mean()) / original_data.std()

print(f"📊 Data loaded: {len(births_standardized)} observations")

## PyMC API and Workflow

PyMC provides a high-level interface for building Bayesian models. The typical workflow involves:

1. **Model Definition**: Specify priors, likelihood, and relationships
2. **Model Fitting**: Use MCMC or another method to approximate the posterior
3. **Diagnostics**: Check convergence and model fit
4. **Analysis**: Summarize results and make predictions


## Polynomial Regression Model

Now let's implement a more sophisticated model that can capture trends in the data. Instead of assuming a constant mean, we'll use a **polynomial regression** model that can capture linear and quadratic trends over time.

### Model Specification

Our polynomial regression model is:

$$y_t = \mu + \beta_1 \cdot t + \beta_2 \cdot t^2 + \epsilon_t$$

where:
- $\mu$ is the intercept
- $\beta_1$ is the linear trend coefficient
- $\beta_2$ is the quadratic trend coefficient
- $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is the observation noise

This model allows us to capture both linear trends and curvature in the time series.
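
Before fitting, it can help to read the equation as a recipe for generating data. The sketch below simulates from the model with hypothetical parameter values and then recovers them with ordinary least squares, used here only as a quick non-Bayesian sanity check.

```python
import numpy as np

rng = np.random.default_rng(7)

t = np.linspace(-1.5, 1.5, 200)                    # hypothetical standardized time index
mu, beta_1, beta_2, sigma = 0.5, -0.8, 0.3, 0.4    # hypothetical "true" values

# Generate data exactly as the model equation states
y = mu + beta_1 * t + beta_2 * t**2 + rng.normal(0, sigma, size=t.size)

# Design matrix with columns [1, t, t^2]
X = np.column_stack([np.ones_like(t), t, t**2])
estimates, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, est, truth in zip(["mu", "beta_1", "beta_2"], estimates, [mu, beta_1, beta_2]):
    print(f"{name}: estimate={est:.3f}, true={truth}")
```

The Bayesian model that follows uses the same linear predictor but returns a full posterior distribution for each coefficient instead of a single estimate.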

In [None]:

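# Assumption: time_idx and time_normalized are not defined earlier in this section,
# so we derive them here from the row index of births_subset
time_idx = births_subset['index'].to_numpy()
time_normalized = (time_idx - time_idx.mean()) / time_idx.std()

with pm.Model(coords={'time': time_idx, 'coef': ['linear', 'quadratic']}) as poly_model: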

    mu = pm.Normal('mu', mu=0, sigma=1)
    # Vector-valued polynomial regression coefficients with weakly informative priors
    beta = pm.Normal('beta', mu=0, sigma=1, dims='coef')
    
    X = np.column_stack([
        time_normalized,
        time_normalized**2
    ])
    
    # Expected value (quadratic polynomial)
    mu_t = pm.Deterministic('mu_t', mu + pm.math.dot(X, beta), dims='time')
    
    # Observation noise
    sigma = pm.HalfNormal('sigma', sigma=1)
    
    # Likelihood
    y_obs = pm.Normal('y_obs', mu=mu_t, sigma=sigma, observed=births_standardized, dims='time')
    
    # Sample from the posterior
    trace_poly = pm.sample(1000, tune=1000, random_seed=RANDOM_SEED, chains=4)

print(az.summary(trace_poly, var_names=['beta', 'sigma']))

In [None]:
pm.model_to_graphviz(poly_model)

In [None]:
mu_post = trace_poly.posterior['mu'].values.flatten()
beta_1_post = trace_poly.posterior['beta'].sel(coef='linear').values.flatten()
beta_2_post = trace_poly.posterior['beta'].sel(coef='quadratic').values.flatten()
mu_t_post = trace_poly.posterior['mu_t'].values

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

mu_t_reshaped = mu_t_post.reshape(-1, mu_t_post.shape[-1])

mu_mean = np.mean(mu_t_reshaped, axis=0)
mu_lower = np.percentile(mu_t_reshaped, 2.5, axis=0)
mu_upper = np.percentile(mu_t_reshaped, 97.5, axis=0)

ax1.fill_between(time_idx, mu_lower, mu_upper, alpha=0.3, color='blue', label='95% CI')
ax1.plot(time_idx, mu_mean, color='blue', linewidth=2, label='Posterior Mean')
ax1.scatter(time_idx, births_standardized, alpha=0.6, color='red', s=20, label='Observed Data')
ax1.set_title('Polynomial Regression Fit')
ax1.set_xlabel('Time Index')
ax1.set_ylabel('Standardized Births')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.hist(mu_post, bins=50, alpha=0.7, density=True, label='μ (intercept)')
ax2.axvline(np.mean(mu_post), color='red', linestyle='--', label=f'Mean: {np.mean(mu_post):.3f}')
ax2.set_title('Posterior: Intercept (μ)')
ax2.set_xlabel('Value')
ax2.set_ylabel('Density')
ax2.legend()
ax2.grid(True, alpha=0.3)

ax3.hist(beta_1_post, bins=50, alpha=0.7, density=True, label='β₁ (linear)', color='green')
ax3.axvline(np.mean(beta_1_post), color='red', linestyle='--', label=f'Mean: {np.mean(beta_1_post):.3f}')
ax3.set_title('Posterior: Linear Term (β₁)')
ax3.set_xlabel('Value')
ax3.set_ylabel('Density')
ax3.legend()
ax3.grid(True, alpha=0.3)

ax4.hist(beta_2_post, bins=50, alpha=0.7, density=True, label='β₂ (quadratic)', color='orange')
ax4.axvline(np.mean(beta_2_post), color='red', linestyle='--', label=f'Mean: {np.mean(beta_2_post):.3f}')
ax4.set_title('Posterior: Quadratic Term (β₂)')
ax4.set_xlabel('Value')
ax4.set_ylabel('Density')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Model Diagnostics: Ensuring Reliable Inference

Before trusting our results, we must check that our MCMC sampling worked properly. Poor sampling can lead to incorrect conclusions.

### Key Diagnostic Metrics

1. **R-hat (Gelman-Rubin statistic)**: Measures convergence across chains
   - **Good**: R-hat ≤ 1.01
   - **Acceptable**: R-hat ≤ 1.1 (an older, more lenient threshold)
   - **Poor**: R-hat > 1.1

2. **Effective Sample Size (ESS)**: Estimated number of effectively independent draws, after accounting for autocorrelation
   - **Good**: ESS > 400 (for both bulk and tail estimates)
   - **Minimum**: ESS > 100

3. **Energy diagnostics**: Check for sampling pathologies
   - **E-BFMI**: Energy Bayesian Fraction of Missing Information
   - **Divergences**: Indicate problematic posterior geometry
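
To see what R-hat measures, here is a sketch of the classic (non-split) Gelman-Rubin statistic, which compares between-chain and within-chain variance. This is a simplified illustration: ArviZ reports a rank-normalized split version, so its numbers will differ slightly. The helper `basic_rhat` and the synthetic chains are hypothetical.

```python
import numpy as np

def basic_rhat(chains):
    """Classic (non-split) Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()             # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)    # B: variance of chain means, scaled
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(8)

# Four chains that all explore the same distribution
good_chains = rng.normal(0, 1, size=(4, 1000))

# Two of the four chains are stuck in a different region
bad_chains = good_chains + np.array([[0.0], [0.0], [1.5], [1.5]])

rhat_good = basic_rhat(good_chains)
rhat_bad = basic_rhat(bad_chains)
print(f"R-hat, chains agree:    {rhat_good:.3f}")
print(f"R-hat, chains disagree: {rhat_bad:.3f}")
```

When chains disagree about where the posterior mass is, the between-chain variance inflates the ratio well above 1.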

In [None]:
print("🔧 **Model Diagnostics and Convergence Checking**")
print("="*60)

# Check basic diagnostics
print("\n📊 **Summary Statistics with Diagnostics**:")
summary = az.summary(trace_poly, var_names=['beta', 'sigma'])
print(summary)

# Plot trace plots for visual inspection
print("\n📈 **Trace Plots** (should look like 'fuzzy caterpillars'):")
az.plot_trace(trace_poly, var_names=['beta', 'sigma'], figsize=(12, 6))
plt.tight_layout()
plt.show()

# Check for divergences and other warnings
print("\n⚠️  **Sampling Diagnostics**:")
print(f"   • Number of divergences: {trace_poly.sample_stats.diverging.sum().values}")
print(f"   • Max tree depth reached: {(trace_poly.sample_stats.tree_depth >= 10).sum().values} times")

# Energy plot for convergence checking
print("\n⚡ **Energy Diagnostics**:")
az.plot_energy(trace_poly, figsize=(10, 4))
plt.show()

print("\n✅ **Diagnostic Interpretation**:")
print("   • R-hat close to 1.0 → Good convergence")
print("   • High ESS → Many effective samples")
print("   • No divergences → Sampling was stable")
print("   • Energy plots overlap → No pathological behavior")

Having fit our model and checked that the MCMC sampler has converged, we can now use it as a **generative model** to simulate datasets. We can use these predictive samples to check the validity of the model.

In [None]:
# Posterior predictive checks
with poly_model:
    posterior_predictive = pm.sample_posterior_predictive(trace_poly, random_seed=RANDOM_SEED)

# Plot comparison using ArviZ
az.plot_ppc(posterior_predictive, num_pp_samples=50, figsize=(10, 6))
plt.show()

## The Complete Bayesian Workflow

Let's summarize the essential steps of Bayesian analysis that we've demonstrated:

### 1. Model Specification
```python
with pm.Model() as model:
    # Priors
    theta = pm.Normal('theta', mu=0, sigma=1)
    # Likelihood  
    obs = pm.Normal('obs', mu=theta, sigma=1, observed=data)
```

### 2. Prior Predictive Checking
```python
prior_pred = pm.sample_prior_predictive(model=model)
# Check: Do simulated data look reasonable? If not, go back and revise model.
```

### 3. Posterior Sampling
```python
trace = pm.sample(model=model)
```

### 4. Convergence Diagnostics
```python
az.summary(trace)  
az.plot_trace(trace)  
# If we are doing MCMC, ensure that the sampler has converged
```

### 5. Posterior Predictive Checking
```python
post_pred = pm.sample_posterior_predictive(trace, model=model)
az.plot_ppc(post_pred)  
# Check: Do the results make sense? Is there reasonable model fit? If not, revise model.
```

### 6. Inference and Decision Making
```python
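# Flatten the posterior draws of theta across chains
samples = trace.posterior['theta'].values.flatten()
# Probability statements
prob_positive = (samples > 0).mean()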
# Credible intervals
hdi = az.hdi(samples, hdi_prob=0.95)
```
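
These posterior summaries can be tried without fitting a model by using synthetic draws as a stand-in for posterior samples. The sketch below makes that assumption and uses an equal-tailed percentile interval from NumPy rather than the HDI, so the two intervals will differ slightly for skewed posteriors.

```python
import numpy as np

rng = np.random.default_rng(9)

# Stand-in for flattened posterior draws of a parameter (hypothetical values)
samples = rng.normal(0.3, 0.2, size=4000)

# Probability statement: share of posterior draws above zero
prob_positive = (samples > 0).mean()

# Equal-tailed 95% credible interval (not the highest-density interval)
lower, upper = np.percentile(samples, [2.5, 97.5])

print(f"P(parameter > 0 | data) = {prob_positive:.3f}")
print(f"95% equal-tailed interval: ({lower:.3f}, {upper:.3f})")
```

Because the posterior is represented by draws, any question about the parameter reduces to a simple computation over an array.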

---

## References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis, Third Edition. CRC Press.

Downey, Allen. 2021. Think Bayes: Bayesian Statistics in Python, Second Edition. O'Reilly Media.

Davidson-Pilon, Cameron. Probabilistic Programming and Bayesian Methods for Hackers.

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w