# Forecast Scoring and Calibration: A Beginner's Guide to Evaluating Probabilistic Predictions

Welcome to this interactive guide on forecast scoring and calibration! As an aspiring data scientist or machine learning practitioner, you've likely encountered models that produce point predictions: a single best guess for a future value. However, in many real-world scenarios, uncertainty is inherent, and a single number doesn't capture the full picture. This is where **probabilistic forecasting** comes into play.

Instead of just predicting "what will happen," probabilistic forecasting aims to predict "what are the chances of various things happening." For example, instead of predicting that tomorrow's temperature will be exactly 20°C, a probabilistic forecast might tell us there's a 60% chance it will be between 18-22°C, a 20% chance it will be below 18°C, and a 20% chance it will be above 22°C.

This notebook will transform theoretical lecture slides into practical, runnable code examples, helping you build a strong intuition for these concepts. We'll explore:

* **Probabilistic Forecasting Fundamentals**: How do we represent probabilistic forecasts?
* **Scoring Rules**: How do we objectively evaluate the quality of these forecasts?
* **Calibration**: Are our forecasts reliable? Do they accurately reflect the true uncertainty?

Let's dive in!

## 1. Probabilistic Forecasting: Beyond Point Predictions

In probabilistic forecasting, our goal is to predict an entire probability distribution for a target variable $Y$, rather than just a single value. This predicted distribution is often denoted as $P$. For simplicity, we'll mostly focus on real-valued targets $Y$.

Think of it like this: If you're forecasting the temperature, a point forecast might say "20°C." A probabilistic forecast would give you a distribution, perhaps a bell-shaped curve centered at 20°C, indicating that temperatures closer to 20°C are more likely, but values slightly higher or lower are also possible, with decreasing probability as you move further away.

### 1.1 Representing Probabilistic Forecasts: CDFs and Quantiles

How do we mathematically represent these predicted distributions? The two most common ways are:

* **Cumulative Distribution Function (CDF)**: The CDF, denoted as $F(x)$, tells us the probability that the random variable $Y$ will take a value less than or equal to $x$.
    $$F(x) = P(Y \le x)$$
    For a continuous variable, this is an integral of the probability density function (PDF). For example, $F(20)$ might be the probability that the temperature is 20°C or less.

* **Quantile Function**: The quantile function, denoted as $F^{-1}(\tau)$, is the inverse of the CDF. It tells us the value $x$ such that the probability of $Y$ being less than or equal to $x$ is $\tau$.
    $$F^{-1}(\tau) = \inf\{x: F(x) \ge \tau\}$$
    Common quantiles include the median ($F^{-1}(0.5)$), the 25th percentile ($F^{-1}(0.25)$), and the 75th percentile ($F^{-1}(0.75)$). For example, if $F^{-1}(0.9) = 25^\circ C$, it means there's a 90% chance the temperature will be 25°C or less.

**Key Considerations**:
While CDFs and quantile functions are mathematically interconvertible, in practice, especially with discretized data, converting between them isn't always straightforward. This choice can impact how we evaluate and combine forecasts.

Let's illustrate with a simple example using Python. We'll consider a forecast that predicts a normal distribution for the temperature.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Assume our forecast predicts a Normal distribution with mean 20 and std dev 2
forecast_mean = 20
forecast_std = 2
forecast_distribution = norm(loc=forecast_mean, scale=forecast_std)

# --- CDF Visualization ---
x_values = np.linspace(10, 30, 500)
cdf_values = forecast_distribution.cdf(x_values)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(x_values, cdf_values, label='CDF: P(Y <= x)')
plt.title('Cumulative Distribution Function (CDF)')
plt.xlabel('Temperature (°C)')
plt.ylabel('Probability')
plt.grid(True)
plt.legend()

# --- Quantile Function Example ---
# Let's find some quantiles
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
quantile_values = [forecast_distribution.ppf(q) for q in quantiles] # ppf is the inverse of cdf

print(f"Forecast Mean: {forecast_mean}°C, Standard Deviation: {forecast_std}°C\n")
print("Key Quantiles:")
for q, val in zip(quantiles, quantile_values):
    print(f"  {q*100:.0f}th Percentile: {val:.2f}°C")

# --- PDF Visualization (for intuition) ---
pdf_values = forecast_distribution.pdf(x_values)
plt.subplot(1, 2, 2)
plt.plot(x_values, pdf_values, label='PDF')
plt.title('Probability Density Function (PDF)')
plt.xlabel('Temperature (°C)')
plt.ylabel('Density')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

# Example: Probability of temperature <= 18°C
prob_le_18 = forecast_distribution.cdf(18)
print(f"\nProbability of temperature <= 18°C: {prob_le_18:.2f}")

# Example: 90th percentile temperature
temp_90th_percentile = forecast_distribution.ppf(0.90)
print(f"Temperature at 90th percentile: {temp_90th_percentile:.2f}°C")

**Practical Considerations for Representation**:
* **Quantile Representation**: This approach can simplify forecast construction, especially with techniques like quantile regression, and avoids the often tricky task of "binning" for probability forecasts.
* **Discrete Responses**: For discrete data (like counts), exact quantiles might only exist at a limited set of probability levels, which can make quantile representation less ideal.
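To make the point about discrete responses concrete, here is a minimal sketch using a Poisson count forecast. The mean of 3 and the chosen probability levels are assumptions picked purely for illustration. It shows that the CDF only takes a handful of values, so several probability levels map to the same quantile and $F(F^{-1}(\tau))$ generally overshoots $\tau$.

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical count forecast: Poisson with mean 3 (illustrative assumption)
count_forecast = poisson(mu=3)

# The CDF is a step function, so it only takes a discrete set of values
support = np.arange(0, 8)
cdf_steps = count_forecast.cdf(support)
for k, c in zip(support, cdf_steps):
    print(f"  P(Y <= {k}) = {c:.4f}")

# ppf implements inf{x : F(x) >= tau}, so nearby levels share one quantile
levels = np.array([0.45, 0.50, 0.60])
quantiles_at_levels = count_forecast.ppf(levels)
print("Quantiles at levels", levels, "->", quantiles_at_levels)

# Round trip: F(F^{-1}(tau)) overshoots tau unless tau is exactly a CDF step
round_trip = count_forecast.cdf(quantiles_at_levels)
print("F(F^{-1}(tau)) =", np.round(round_trip, 4))
```

All three levels fall between $P(Y \le 2)$ and $P(Y \le 3)$, so they return the same count. This is why a quantile representation loses resolution for count data.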

## 2. Scoring Rules: How Good is Our Forecast?

Once we have a probabilistic forecast, how do we know if it's any good? This is where **scoring rules** come in. A scoring rule $S(P, y)$ is a function that assigns a numerical score to a predicted distribution $P$ when the actual outcome is $y$. By convention, a *lower* score indicates a *better* forecast.

When comparing forecasts over time, we would typically average their scores:
$$\frac{1}{T}\sum_{t=1}^{T}S(P_{t},Y_{t})$$
A lower average score implies a better forecaster.

### 2.1 Proper Scores: Incentivizing Honesty

A crucial concept in scoring rules is "propriety." A scoring rule $S$ is said to be **proper** if a forecaster is incentivized to report their *true* belief (the actual distribution $Q$ of $Y$) rather than any other distribution $P$. Mathematically, this means:

$$S(P, Q) \ge S(Q, Q) \quad \text{for all } P, Q$$

where $S(P, Q) = \mathbb{E}_{Y \sim Q}[S(P, Y)]$ is the expected score when the true distribution is $Q$ and the forecast is $P$. If the inequality is *strict* for $P \ne Q$, the score is **strictly proper**, meaning there's a unique incentive to report the true distribution.

Let's look at some common proper scoring rules:

#### Log Score

The **Log Score** (or Logarithmic Score) is defined for a forecast with probability density function (or probability mass function) $p$ as:

$$LogS(p, y) = -\log p(y)$$

This is a **strictly proper score**. Intuitively, it heavily penalizes forecasts that assign very low probability to an event that actually occurs. If $p(y)$ is tiny, $-\log p(y)$ becomes very large (bad score).

**Why is it proper?** The difference between the expected log score of a forecast $p$ and the true distribution $q$ is the Kullback-Leibler (KL) divergence, $KL(q, p)$.
$$LogS(p, q) - LogS(q, q) = \int \log\frac{q(y)}{p(y)} q(y) dy = KL(q, p)$$
Since KL divergence is always non-negative and is zero only when $p=q$, the log score is strictly proper.
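The identity above can be checked numerically. The sketch below assumes a standard normal truth $q$ and a deliberately misspecified Gaussian forecast $p$ (both choices are illustrative, not from the slides). It estimates the two expected log scores by Monte Carlo and compares their gap with the closed-form KL divergence between two univariate Gaussians.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative assumption: truth q = N(0, 1), forecast p = N(0.5, sd=1.5)
mu_q, sd_q = 0.0, 1.0
mu_p, sd_p = 0.5, 1.5
q = norm(loc=mu_q, scale=sd_q)
p = norm(loc=mu_p, scale=sd_p)

# Average log scores over many outcomes drawn from the truth
y = q.rvs(size=200_000, random_state=rng)
avg_score_p = np.mean(-p.logpdf(y))
avg_score_q = np.mean(-q.logpdf(y))

# Closed-form KL(q, p) for two univariate Gaussians
kl_closed_form = (np.log(sd_p / sd_q)
                  + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2)
                  - 0.5)

print(f"Average log score, forecast p: {avg_score_p:.4f}")
print(f"Average log score, truth q:    {avg_score_q:.4f}")
print(f"Gap: {avg_score_p - avg_score_q:.4f}  vs  KL(q, p): {kl_closed_form:.4f}")
```

The truth earns the lower average score, and the gap matches the KL divergence up to Monte Carlo error.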

#### Quadratic Score (Brier Score)

The **Quadratic Score** (also known as Brier Score for binary outcomes, but generalized here) is defined for a forecast with density $p$ as:

$$QuadS(p, y) = -2p(y) + ||p||_2^2$$

where $||p||_2^2 = \int p(y)^2 dy$. This is a **strictly proper score**.

**Why is it proper?** The difference in expected quadratic scores relates to the $L^2$ distance between $p$ and $q$:
$$S(p, q) - S(q, q) = ||p-q||_2^2 = \int (p(y) - q(y))^2 dy$$
This difference is always non-negative and zero only when $p=q$.
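Here is a small numerical check of that identity, again assuming a standard normal truth and a misspecified Gaussian forecast chosen only for illustration. The helper names `squared_l2_norm` and `expected_quadratic_score` are hypothetical.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Illustrative assumption: truth q = N(0, 1), forecast p = N(0.5, sd=1.5)
q = norm(loc=0.0, scale=1.0)
p = norm(loc=0.5, scale=1.5)

def squared_l2_norm(density):
    """||p||_2^2 = integral of p(y)^2 dy."""
    value, _ = quad(lambda t: density(t) ** 2, -np.inf, np.inf)
    return value

def expected_quadratic_score(forecast_pdf, truth_pdf):
    """E_{Y ~ truth}[-2 p(Y)] + ||p||_2^2."""
    cross, _ = quad(lambda t: forecast_pdf(t) * truth_pdf(t), -np.inf, np.inf)
    return -2.0 * cross + squared_l2_norm(forecast_pdf)

score_gap = (expected_quadratic_score(p.pdf, q.pdf)
             - expected_quadratic_score(q.pdf, q.pdf))
l2_distance_sq, _ = quad(lambda t: (p.pdf(t) - q.pdf(t)) ** 2, -np.inf, np.inf)

print(f"S(p, q) - S(q, q) = {score_gap:.6f}")
print(f"||p - q||_2^2     = {l2_distance_sq:.6f}")
```

The two printed numbers agree up to integration error, which is exactly the statement that the expected penalty for misreporting is the squared $L^2$ distance.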

The quadratic score is generally considered more "robust" than the log score because it's less aggressive in penalizing small probabilities for materialized events.

#### Linear Score (An Improper Example)

The **Linear Score** is defined simply as:
$$LinS(p, y) = -p(y)$$
While seemingly intuitive ("penalize if you put low probability on the outcome"), this score is **not proper**, so a forecaster can "game" it by reporting a distribution different from their true belief. For that reason it is rarely used in practice.

**Example**: If the true distribution is a standard normal $q$, and a forecaster reports a uniform distribution $p$ over a very small interval around the mean, they can achieve a better score than if they reported the true normal distribution. This is why properness is so important!
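The example can be verified directly. The sketch below assumes a half-width of 0.1 for the narrow uniform forecast (an arbitrary illustrative choice) and computes the expected linear score of both forecasts under the standard normal truth.

```python
import numpy as np
from scipy.stats import norm, uniform
from scipy.integrate import quad

truth = norm(loc=0.0, scale=1.0)

# Gaming forecast: uniform on [-0.1, 0.1] (the half-width is an assumption)
half_width = 0.1
narrow = uniform(loc=-half_width, scale=2 * half_width)

def expected_linear_score(forecast_pdf, lower, upper):
    """E_{Y ~ truth}[-p(Y)], integrated over the forecast's support."""
    value, _ = quad(lambda t: -forecast_pdf(t) * truth.pdf(t), lower, upper)
    return value

honest_score = expected_linear_score(truth.pdf, -np.inf, np.inf)
gamed_score = expected_linear_score(narrow.pdf, -half_width, half_width)

print(f"Expected linear score, honest N(0, 1): {honest_score:.4f}")
print(f"Expected linear score, narrow uniform: {gamed_score:.4f}")
```

The narrow uniform forecast gets the lower (better) expected score even though it is badly wrong about the spread, which is the failure that propriety rules out.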

#### Continuous Ranked Probability Score (CRPS)

Not all probabilistic forecasts can be easily expressed with a simple density function (e.g., if there are point masses). The **Continuous Ranked Probability Score (CRPS)** is designed to handle any forecast expressed as a CDF $F$ and is defined as:

$$CRPS(F, y) = \int (F(x) - \mathbf{1}\{y \le x\})^2 dx$$

where $\mathbf{1}\{y \le x\}$ is an indicator function (1 if $y \le x$, 0 otherwise). This is a **strictly proper score**.

**Why is it proper?** The difference in expected CRPS values between forecast $F$ and true distribution $G$ is the Cramér distance between their CDFs (the Cramér-von Mises criterion is a related quantity that integrates against $dG(x)$ rather than $dx$):
$$CRPS(F, G) - CRPS(G, G) = \int (F(x) - G(x))^2 dx$$
This is non-negative and zero only if $F=G$.

CRPS is popular in many forecasting communities because it's robust and broadly applicable. However, evaluating its integral numerically can be computationally intensive when no closed form is available for the forecast distribution.
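For some forecast families the integral has a closed form. As a sketch, the code below uses the well-known expression for a Gaussian forecast, $\sigma\left[z(2\Phi(z) - 1) + 2\varphi(z) - 1/\sqrt{\pi}\right]$ with $z = (y - \mu)/\sigma$, and compares it with direct numerical integration of the definition. The function names and the example forecast $N(20, 2^2)$ are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for a Normal(mu, sigma^2) forecast."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def crps_numeric(cdf, y):
    """CRPS by integrating (F(x) - 1{y <= x})^2 on both sides of y."""
    left, _ = quad(lambda x: cdf(x) ** 2, -np.inf, y)
    right, _ = quad(lambda x: (cdf(x) - 1) ** 2, y, np.inf)
    return left + right

mu, sigma = 20.0, 2.0
forecast_cdf = norm(loc=mu, scale=sigma).cdf

for y in [19.0, 22.0, 25.0]:
    closed = crps_normal(mu, sigma, y)
    numeric = crps_numeric(forecast_cdf, y)
    print(f"y={y}: closed form = {closed:.5f}, numerical = {numeric:.5f}")
```

When a closed form like this exists, it replaces two numerical integrals per observation with a few function evaluations.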

#### Interval Score (IS) and Weighted Interval Score (WIS)

Sometimes, forecasts are expressed as prediction intervals. An **Interval Score** $IS_\alpha([l_\alpha, u_\alpha], y)$ evaluates a specific prediction interval $[l_\alpha, u_\alpha]$ (where $l_\alpha = F^{-1}(\alpha/2)$ and $u_\alpha = F^{-1}(1-\alpha/2)$ are the predicted $\alpha/2$ and $1-\alpha/2$ quantiles) for a given outcome $y$:

$$IS_\alpha([l_\alpha, u_\alpha], y) = (u_\alpha - l_\alpha) + \frac{2}{\alpha} \cdot \text{dist}(y, [l_\alpha, u_\alpha])$$

Here, $\text{dist}(y, S) = \inf_{x \in S} |x - y|$ is the shortest distance from $y$ to the interval $S$. This score balances **sharpness** (a narrower interval, $u_\alpha - l_\alpha$, is better) with **coverage** (a penalty if $y$ falls outside the interval).

The **Weighted Interval Score (WIS)** extends this to a collection of intervals at different $\alpha$ levels:

$$WIS_{\mathcal{A}}(\{[l_\alpha, u_\alpha]\}_{\alpha \in \mathcal{A}}, y) = \sum_{\alpha \in \mathcal{A}} \alpha \cdot IS_\alpha([l_\alpha, u_\alpha], y)$$
$$= \sum_{\alpha \in \mathcal{A}} \left(\alpha(u_\alpha - l_\alpha) + 2 \cdot \text{dist}(y, [l_\alpha, u_\alpha])\right)$$

WIS is a **proper score** for predicting quantiles.

#### Quantile Score (QS)

The **Quantile Score (QS)** is used when a forecast is expressed as a collection of predicted quantiles $q_\tau$ for various probability levels $\tau$:

$$QS_{\mathcal{T}}(\{q_\tau\}_{\tau \in \mathcal{T}}, y) = \sum_{\tau \in \mathcal{T}} \rho_\tau(y - q_\tau)$$

where $\rho_\tau(u)$ is the "tilted $L_1$ loss" (also known as pinball loss):
$$\rho_\tau(u) = \begin{cases} \tau u & \text{if } u \ge 0 \\ (1-\tau)(-u) & \text{if } u < 0 \end{cases}$$

This is the standard loss function used in quantile regression. The quantile score is also **proper**.
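One way to see why pinball loss targets a quantile is to minimize its average over simulated outcomes. The sketch below assumes outcomes from $N(20, 2^2)$ and the level $\tau = 0.9$ (illustrative choices), then searches a grid of candidate values for the one with the smallest average loss.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated outcomes (assumed distribution, for illustration only)
outcomes = rng.normal(loc=20.0, scale=2.0, size=50_000)
tau = 0.9

def mean_pinball_loss(candidate, y, tau):
    """Average tilted L1 loss of predicting `candidate` as the tau-quantile."""
    u = y - candidate
    return np.mean(np.where(u >= 0, tau * u, (1 - tau) * (-u)))

# Grid search over candidate predictions (step 0.05)
candidates = np.linspace(15.0, 27.0, 241)
losses = np.array([mean_pinball_loss(c, outcomes, tau) for c in candidates])
best_candidate = candidates[np.argmin(losses)]

true_quantile = norm.ppf(tau, loc=20.0, scale=2.0)
print(f"Minimizer of average pinball loss: {best_candidate:.2f}")
print(f"True {tau:.0%} quantile:                {true_quantile:.2f}")
```

The minimizer lands on the true 90% quantile up to grid and sampling error, which is the sense in which the quantile score rewards honest quantile reports.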

#### Connections Between Scoring Rules

Interestingly, these seemingly different scoring rules are deeply connected:

* **WIS and QS Equivalence**: WIS and QS are equivalent:
    $$WIS_{\mathcal{A}}(\{[l_\alpha, u_\alpha]\}_{\alpha \in \mathcal{A}}, y) = 2 \cdot QS_{\mathcal{T}}(\{q_\tau\}_{\tau \in \mathcal{T}}, y)$$
    where $\mathcal{T} = \bigcup_{\alpha \in \mathcal{A}} \{\alpha/2, 1-\alpha/2\}$. This connection provides a neat way to understand how WIS combines sharpness and coverage, and also proves its propriety.

* **CRPS and QS Equivalence**: CRPS can be expressed as an integral of the Quantile Score over all probability levels:
    $$\int (F(x) - \mathbf{1}\{y \le x\})^2 dx = 2 \int \rho_\tau(y - F^{-1}(\tau)) d\tau$$
    This means CRPS is essentially an "average" quantile score over all possible quantiles. This equivalence is particularly useful because it means we can approximate CRPS by discretizing and using WIS/QS, while maintaining properness.
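Both equivalences are easy to check numerically. The sketch below assumes a Gaussian forecast $N(20, 2^2)$, an outcome $y = 25$, and three interval levels (all illustrative). It computes WIS and QS from their definitions, then approximates CRPS by averaging pinball losses over a fine grid of probability levels and compares the result with the closed-form Gaussian CRPS.

```python
import numpy as np
from scipy.stats import norm

forecast = norm(loc=20.0, scale=2.0)
y = 25.0
alphas = np.array([0.1, 0.2, 0.5])

def pinball(u, tau):
    """Tilted L1 loss, vectorized over u and tau."""
    return np.where(u >= 0, tau * u, (1 - tau) * (-u))

# WIS from its definition: sum of alpha * width + 2 * distance to interval
lower = forecast.ppf(alphas / 2)
upper = forecast.ppf(1 - alphas / 2)
distance = np.maximum(lower - y, 0.0) + np.maximum(y - upper, 0.0)
wis = np.sum(alphas * (upper - lower) + 2 * distance)

# QS over the matching quantile levels {alpha/2, 1 - alpha/2}
taus = np.concatenate([alphas / 2, 1 - alphas / 2])
qs = np.sum(pinball(y - forecast.ppf(taus), taus))

# CRPS as 2 * integral of pinball loss over tau (midpoint rule on 2000 levels)
n_levels = 2000
tau_grid = (np.arange(n_levels) + 0.5) / n_levels
crps_from_quantiles = 2 * np.mean(pinball(y - forecast.ppf(tau_grid), tau_grid))

# Closed-form CRPS for a Gaussian forecast, used here as the reference value
z = (y - 20.0) / 2.0
crps_closed_form = 2.0 * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

print(f"WIS = {wis:.4f}, 2 * QS = {2 * qs:.4f}")
print(f"CRPS from quantile grid = {crps_from_quantiles:.4f}, closed form = {crps_closed_form:.4f}")
```

The first line matches to floating-point precision. The second matches up to discretization error, which shrinks as the grid of levels gets finer.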

### Code Examples for Scoring Rules

Let's implement these scoring rules and see them in action.

In [None]:
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assume our forecast is a Normal distribution (mean, std)
forecast_mean = 20
forecast_std = 2
forecast_pdf = norm(loc=forecast_mean, scale=forecast_std).pdf
forecast_cdf = norm(loc=forecast_mean, scale=forecast_std).cdf
forecast_ppf = norm(loc=forecast_mean, scale=forecast_std).ppf # Quantile function

# Some example observed outcomes
observed_outcomes = [19, 22, 15, 25]

print(f"--- Forecast Distribution: Normal(mean={forecast_mean}, std={forecast_std}) ---\n")

# --- Log Score ---
def log_score(p_y, y):
    """Calculates log score for a given observed outcome y."""
    prob_at_y = p_y(y)
    if prob_at_y <= 0: # Avoid log(0)
        return float('inf')
    return -np.log(prob_at_y)

print("--- Log Scores ---")
for y in observed_outcomes:
    score = log_score(forecast_pdf, y)
    print(f"  Observed y={y}: Log Score = {score:.4f}")
print("  (Lower is better, penalizes low probability for actual outcome)\n")


# --- Quadratic Score (Brier Score) ---
# Need to calculate ||p||_2^2 = integral of p(y)^2 dy
p_squared_integral, _ = quad(lambda x: forecast_pdf(x)**2, -np.inf, np.inf)

def quadratic_score(p_y, p_squared_int, y):
    """Calculates quadratic score."""
    prob_at_y = p_y(y)
    return -2 * prob_at_y + p_squared_int

print("--- Quadratic Scores ---")
for y in observed_outcomes:
    score = quadratic_score(forecast_pdf, p_squared_integral, y)
    print(f"  Observed y={y}: Quadratic Score = {score:.4f}")
print("  (Lower is better, more robust than log score)\n")


# --- Continuous Ranked Probability Score (CRPS) ---
def crps(F, y):
    """Calculates CRPS for a given observed outcome y."""
    # The integral definition of CRPS: integrate (F(x) - 1{y <= x})^2 dx
    # This integral can be split into two parts:
    # 1. From -inf to y: (F(x))^2 dx
    # 2. From y to +inf: (F(x) - 1)^2 dx
    
    # Numerical integration using quad
    integral_part1, _ = quad(lambda x: (F(x))**2, -np.inf, y)
    integral_part2, _ = quad(lambda x: (F(x) - 1)**2, y, np.inf)
    
    return integral_part1 + integral_part2

print("--- Continuous Ranked Probability Scores (CRPS) ---")
# Note: CRPS can be computationally intensive, so we'll just do a few
for y in observed_outcomes[:2]: # Only for first two for brevity
    score = crps(forecast_cdf, y)
    print(f"  Observed y={y}: CRPS = {score:.4f}")
print("  (Lower is better, robust and generally applicable)\n")


# --- Quantile Score (Pinball Loss) ---
def pinball_loss(y_true, y_pred_quantile, tau):
    """Calculates the tilted L1 loss (pinball loss) for a single quantile."""
    error = y_true - y_pred_quantile
    if error >= 0:
        return tau * error
    else:
        return (tau - 1) * error

def quantile_score(quantile_function, y_true, taus):
    """Calculates the total quantile score for multiple quantiles."""
    total_qs = 0
    for tau in taus:
        q_tau = quantile_function(tau)
        total_qs += pinball_loss(y_true, q_tau, tau)
    return total_qs

# Let's use a few common quantiles (e.g., for a 90% prediction interval)
taus_for_qs = [0.05, 0.50, 0.95] # 5th, 50th (median), 95th percentiles

print(f"--- Quantile Scores (for taus={taus_for_qs}) ---")
for y in observed_outcomes:
    score = quantile_score(forecast_ppf, y, taus_for_qs)
    print(f"  Observed y={y}: Quantile Score = {score:.4f}")
print("  (Lower is better, sums pinball loss across quantiles)\n")


# --- Weighted Interval Score (WIS) ---
def dist_to_interval(y, l, u):
    """Calculates distance from y to interval [l, u]."""
    if y < l:
        return l - y
    elif y > u:
        return y - u
    else:
        return 0.0

def interval_score(l_alpha, u_alpha, y, alpha):
    """Calculates Interval Score for a single interval."""
    term1 = (u_alpha - l_alpha) # Interval width (sharpness)
    term2 = (2 / alpha) * dist_to_interval(y, l_alpha, u_alpha) # Penalty for miscoverage
    return term1 + term2

def weighted_interval_score(quantile_function, y_true, alphas):
    """Calculates Weighted Interval Score for multiple alpha levels."""
    total_wis = 0
    for alpha in alphas:
        l_alpha = quantile_function(alpha / 2)
        u_alpha = quantile_function(1 - alpha / 2)
        total_wis += alpha * interval_score(l_alpha, u_alpha, y_true, alpha)
    return total_wis

# Let's use a few alpha levels for prediction intervals (50%, 80%, and 90% PIs)
alphas_for_wis = [0.1, 0.2, 0.5] # Corresponds to 90%, 80%, 50% prediction intervals

print(f"--- Weighted Interval Scores (for alphas={alphas_for_wis}) ---")
for y in observed_outcomes:
    score = weighted_interval_score(forecast_ppf, y, alphas_for_wis)
    print(f"  Observed y={y}: Weighted Interval Score = {score:.4f}")
print("  (Lower is better, weighted sum of interval scores)\n")

**Bregman Representation (Advanced Topic)**:
For those curious, there's a deeper mathematical connection between proper scores and **Bregman divergences**. Many common proper scores (like log score, quadratic score, and CRPS) can be expressed in terms of a Bregman divergence, which fundamentally relates them to convex functions. This theory provides a powerful framework for understanding why certain scores are proper. For beginners, the key takeaway is that "properness" isn't just a desirable property; it's deeply rooted in the mathematics of information theory and convex analysis.

## 3. Calibration: Are Our Forecasts Reliable?

Beyond just getting good scores, we want our probabilistic forecasts to be "reliable" or **calibrated**. Calibration assesses whether the predicted probabilities match the observed frequencies. For example, if your forecast says there's a 70% chance of rain, it should actually rain about 70% of the times you make such a forecast.

There are several "flavors" of calibration. We'll focus on two key types: Probabilistic Calibration and Marginal Calibration.

### 3.1 Preliminary Concept: Probability Integral Transform (PIT)

Before diving into calibration definitions, let's understand the **Probability Integral Transform (PIT)**. For any predicted CDF $F$ and an observed target $Y$, the PIT value is defined as:

$$F^*(Y) = V \cdot F(Y) + (1-V) \cdot F(Y^-)$$

where $F(y^-) = \lim_{x \to y^-} F(x)$ and $V \sim \text{Unif}(0,1)$ is a random variable independent of $F$ and $Y$. The $V$ term handles discontinuities in the CDF, ensuring that $F^*(Y)$ is always uniformly distributed if $F$ is the true CDF of $Y$.

If $F$ is continuous (which is often assumed for simplicity), then $F(Y^-) = F(Y)$ and the PIT simplifies to $F^*(Y) = F(Y)$. The fundamental property then reads: if $F$ is the true CDF of $Y$, then $F(Y) \sim \text{Unif}(0,1)$.
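To see what the randomization buys us, here is a minimal sketch with a discrete forecast. It assumes the true distribution is Poisson with mean 3 (an illustrative choice) and uses the fact that, for integer-valued $Y$, $F(Y^-) = F(Y - 1)$.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)

# Assumed setup: the forecast equals the true distribution, Poisson(3)
forecast = poisson(mu=3)
y = forecast.rvs(size=50_000, random_state=rng)

# Plain PIT: F(Y). For a step-function CDF this only takes a few values.
pit_plain = forecast.cdf(y)

# Randomized PIT: V * F(Y) + (1 - V) * F(Y^-), with F(Y^-) = F(Y - 1) for counts
v = rng.uniform(size=y.size)
pit_randomized = v * forecast.cdf(y) + (1 - v) * forecast.cdf(y - 1)

print(f"Plain PIT:      mean = {pit_plain.mean():.3f}, distinct values = {np.unique(pit_plain).size}")
print(f"Randomized PIT: mean = {pit_randomized.mean():.3f}, variance = {pit_randomized.var():.4f}")
print(f"Uniform target: mean = 0.500, variance = {1/12:.4f}")
```

Even though the forecast is exactly right, the plain PIT is biased upward and lumpy. The randomized version spreads each jump of the CDF uniformly and recovers the uniform distribution.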

### 3.2 Probabilistic Calibration (PIT Calibration)

A forecaster is **probabilistically calibrated** for a target $Y$ if its PIT values are uniformly distributed:

$$F^*(Y) \overset{d}{=} U$$

where $\overset{d}{=}$ means "equal in distribution" and $U \sim \text{Unif}(0,1)$. This is also known as **PIT calibration**. Both the forecast $F$ and the target $Y$ are random variables here.

**Intuition**: If your forecasts are probabilistically calibrated, then if you collect all the PIT values over many predictions, their histogram should look flat (like a uniform distribution). This means that for any probability level $\tau$, the observed outcome $Y$ falls below the $\tau$-th predicted quantile $F^{-1}(\tau)$ exactly $\tau \times 100\%$ of the time.
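The quantile-coverage reading of PIT calibration can be checked by counting. This sketch assumes outcomes from $N(0, 1)$ and compares a correct forecast with a hypothetical too-narrow one, $N(0, 0.5^2)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
outcomes = rng.normal(loc=0.0, scale=1.0, size=100_000)

taus = np.array([0.10, 0.25, 0.50, 0.75, 0.90])

# Predicted quantiles from a correct forecast and from a too-narrow one
calibrated_quantiles = norm.ppf(taus, loc=0.0, scale=1.0)
narrow_quantiles = norm.ppf(taus, loc=0.0, scale=0.5)

# Fraction of outcomes at or below each predicted quantile
freq_calibrated = (outcomes[:, None] <= calibrated_quantiles).mean(axis=0)
freq_narrow = (outcomes[:, None] <= narrow_quantiles).mean(axis=0)

for tau, f_cal, f_nar in zip(taus, freq_calibrated, freq_narrow):
    print(f"  tau={tau:.2f}: calibrated = {f_cal:.3f}, too narrow = {f_nar:.3f}")
```

For the calibrated forecast the observed frequencies track $\tau$. For the narrow forecast the upper quantiles are exceeded too often and the lower quantiles are undershot too often.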

**Dispersion: Over and Under**:
* **Overdispersed**: If $F$ places *too much mass in the tails* (forecast is too wide), outcomes land near the center of the forecast too often, so the PIT values $F^*(Y)$ will tend to be concentrated around 0.5. This gives a hump-shaped (upside-down U) histogram and a variance *smaller* than that of a uniform distribution ($Var[F^*(Y)] < Var[U] = 1/12$).
* **Underdispersed**: Conversely, if $F$ places *too little mass in the tails* (forecast is too narrow), outcomes land in the tails of the forecast too often, so the PIT values $F^*(Y)$ will tend to be concentrated at the extremes (0 and 1). This gives a U-shaped histogram and a variance *larger* than that of a uniform distribution ($Var[F^*(Y)] > Var[U] = 1/12$).

### 3.3 Marginal Calibration

A forecaster is **marginally calibrated** for a target $Y$ if:

$$F^{-1}(U) \overset{d}{=} Y$$

where $U \sim \text{Unif}(0,1)$ and is independent of $F$. Another way to express this is $\mathbb{E}[F(y)] = \mathbb{P}(Y \le y)$ for all $y$. This means that the average of your forecast CDFs should match the true CDF of the observed outcomes.

**Intuition**: For marginal calibration, we check whether the *average* forecast distribution matches the distribution of the observed target variable. If the 90% quantile of your average forecast CDF is 25°C, then about 90% of your observed temperatures should be less than or equal to 25°C.

**Dispersion: Over and Under (for Marginal Calibration)**:
* **Overdispersed**: If $F$ places *too much mass in the tails* (forecast is too wide), then the variance of $F^{-1}(U)$ will be comparably large ($Var[F^{-1}(U)] > Var[Y]$).
* **Underdispersed**: If $F$ places *insufficient mass in the tails* (forecast is too narrow), then the variance of $F^{-1}(U)$ will be comparably small ($Var[F^{-1}(U)] < Var[Y]$).

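In both definitions, "overdispersed" is meant to describe a forecast that is too wide, but the two diagnostics detect it differently. Under PIT calibration it shows up indirectly, as PIT values that are *too peaked* around 0.5 (variance below $1/12$). Under marginal calibration it shows up directly, as draws from the forecast that vary *more* than $Y$ does. Because the two criteria measure different things, these two calibration types are distinct.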

### PIT vs. Marginal Calibration: A Crucial Distinction

It's important to understand that **PIT calibration and marginal calibration are not the same and neither is strictly more general than the other**. A forecaster can be probabilistically calibrated but not marginally calibrated, and vice versa.

* **PIT Calibration**: A statement about the *joint distribution* of the forecaster $F$ and the target $Y$. It's about how well $F$ *aligns with* $Y$ *for each specific forecast*.
* **Marginal Calibration**: A statement about the *marginal distributions* of $F$ and $Y$. It's about whether the *average* forecast distribution matches the *average* observed distribution.

The intuition that "overdispersion means the forecast is too spread out" is often only justified for marginal calibration. For PIT calibration, what matters is the *dependence* between $F$ and $Y$. A PIT histogram that is too peaked around 0.5 (the overdispersed pattern) says that outcomes land near the center of the issued forecasts too often, which can happen even if the forecast distributions have the "right" spread on average.

Let's illustrate with a conceptual example from the slides, adapted for clarity:

Suppose the true mean $\mu$ is drawn from $N(0, 1)$ and, given $\mu$, the true outcome $Y$ is drawn from $N(\mu, 1)$. Throughout this example, $N(m, v)$ denotes a normal distribution with mean $m$ and variance $v$. We are evaluating different forecasters:

1.  **Ideal Forecaster**: $F = N(\mu, 1)$ (i.e., the forecaster knows the true $\mu$).
    * **Result**: Both probabilistically and marginally calibrated. This is our gold standard.

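2.  **Climatological Forecaster**: $F = N(0, 2)$ (a fixed distribution, ignoring $\mu$).
    * **Result**: Both probabilistically and marginally calibrated. Since $Y$ is $\mu$ plus independent standard normal noise, its marginal distribution is exactly $N(0, 2)$ (variance 2). The fixed forecast is therefore calibrated even though it ignores the information in $\mu$; it is simply less sharp than the ideal forecaster.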

3.  **Flipped Forecaster**: $F = N(-\mu, 1)$ (the forecaster uses the negative of the true mean).
    * **Result**: Marginally calibrated but *not* probabilistically calibrated. The average forecast distribution might match the average true distribution, but individual forecasts are systematically biased.

4.  **Unfocused Forecaster**: $F = \frac{1}{2}[N(\mu, 1) + N(\mu + \xi, 1)]$ where $\xi = \pm 1$ with equal probability, independent of $Y, \mu$. This forecast introduces additional, irrelevant uncertainty.
    * **Result**: Probabilistically calibrated but *not* marginally calibrated. The shifted mixture component widens the average forecast distribution (variance 2.5) beyond the marginal distribution of $Y$ (variance 2).

These examples highlight that different types of miscalibration can occur, and identifying them requires careful analysis using appropriate tools like PIT histograms and reliability diagrams (which we won't cover in detail here but are visual tools to assess calibration).
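The four forecasters above can be simulated in a few lines. The sketch below is one possible check, not the procedure from the slides. It summarizes probabilistic calibration by the variance of the PIT values (target $1/12$) and marginal calibration by the variance of draws from the forecasts compared with the variance of $Y$. Variances are only a summary, so matching them is necessary for calibration but not sufficient.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50_000

mu = rng.normal(0.0, 1.0, n)              # true means
y = mu + rng.normal(0.0, 1.0, n)          # outcomes, Y | mu ~ N(mu, 1)
xi = rng.choice([-1.0, 1.0], size=n)      # irrelevant shift used by the unfocused forecaster

# PIT values F_t(Y_t); every forecast CDF here is continuous
pit = {
    "ideal": norm.cdf(y, loc=mu, scale=1.0),
    "climatological": norm.cdf(y, loc=0.0, scale=np.sqrt(2.0)),  # variance 2
    "flipped": norm.cdf(y, loc=-mu, scale=1.0),
    "unfocused": 0.5 * (norm.cdf(y, loc=mu, scale=1.0)
                        + norm.cdf(y, loc=mu + xi, scale=1.0)),
}

# One draw from each issued forecast, i.e. a sample of F^{-1}(U)
use_shifted = rng.random(n) < 0.5
draws = {
    "ideal": mu + rng.normal(0.0, 1.0, n),
    "climatological": rng.normal(0.0, np.sqrt(2.0), n),
    "flipped": -mu + rng.normal(0.0, 1.0, n),
    "unfocused": mu + np.where(use_shifted, xi, 0.0) + rng.normal(0.0, 1.0, n),
}

print(f"Targets: Var[U] = {1/12:.4f}, Var[Y] = {np.var(y):.3f}")
for name in pit:
    print(f"{name:>15}: Var[PIT] = {np.var(pit[name]):.4f}, "
          f"Var[F^-1(U)] = {np.var(draws[name]):.3f}")
```

In this simulation the flipped forecaster fails the PIT check (its PIT variance is well above $1/12$) while matching the variance of $Y$, and the unfocused forecaster does the opposite, which mirrors the results listed above.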

### Code Examples for Calibration

Let's simulate some data and demonstrate PIT for different forecast scenarios.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, uniform

# For reproducibility
np.random.seed(42)

# --- Scenario 1: Ideal Forecast (Probabilistically and Marginally Calibrated) ---
print("--- Scenario 1: Ideal Forecast ---")
num_samples = 1000
true_mus = norm.rvs(loc=0, scale=1, size=num_samples) # Simulate true means
observed_ys = norm.rvs(loc=true_mus, scale=1, size=num_samples) # Simulate observed Y based on true_mus

# Ideal forecaster's predicted CDF and its values based on observed_ys
# The ideal forecaster knows the true mu for each observation
ideal_forecast_cdfs = [norm(loc=mu, scale=1).cdf for mu in true_mus]
pit_values_ideal = [ideal_forecast_cdfs[i](observed_ys[i]) for i in range(num_samples)]

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(pit_values_ideal, bins=30, density=True, alpha=0.7, color='skyblue', label='PIT Values')
plt.plot([0, 1], [1, 1], 'r--', label='Uniform Distribution (Ideal)')
plt.title('PIT Histogram: Ideal Forecaster')
plt.xlabel('PIT Value')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
print(f"  Mean PIT (Ideal): {np.mean(pit_values_ideal):.3f}")
print(f"  Variance PIT (Ideal): {np.var(pit_values_ideal):.3f} (Compare to Var[U]=1/12={1/12:.3f})\n")


# --- Scenario 2: Overdispersed Forecast (PIT too peaked) ---
print("--- Scenario 2: Overdispersed Forecast (Too Wide) ---")
# Forecast is too wide: it predicts N(0, sd=2) when the truth is N(0, sd=1) for a fixed mu=0
# Let's simplify this to a single fixed forecast distribution for intuition
forecast_std_wide = 2.0 # Forecast is wider than true
true_std = 1.0
fixed_forecast_cdf_wide = norm(loc=0, scale=forecast_std_wide).cdf
observed_ys_for_fixed_true = norm.rvs(loc=0, scale=true_std, size=num_samples)

pit_values_wide = [fixed_forecast_cdf_wide(y) for y in observed_ys_for_fixed_true]

plt.subplot(1, 2, 2)
plt.hist(pit_values_wide, bins=30, density=True, alpha=0.7, color='lightcoral', label='PIT Values')
plt.plot([0, 1], [1, 1], 'r--', label='Uniform Distribution (Ideal)')
plt.title('PIT Histogram: Overdispersed Forecast (Too Wide)')
plt.xlabel('PIT Value')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
print(f"  Mean PIT (Overdispersed): {np.mean(pit_values_wide):.3f}")
print(f"  Variance PIT (Overdispersed): {np.var(pit_values_wide):.3f} (Should be < 1/12, PIT is peaked around 0.5)\n")


# --- Scenario 3: Underdispersed Forecast (PIT piled up at 0 and 1) ---
print("--- Scenario 3: Underdispersed Forecast (Too Narrow) ---")
# Forecast is too narrow: it predicts N(0, sd=0.5) when the truth is N(0, sd=1) for a fixed mu=0
forecast_std_narrow = 0.5 # Forecast is narrower than true
fixed_forecast_cdf_narrow = norm(loc=0, scale=forecast_std_narrow).cdf
observed_ys_for_fixed_true = norm.rvs(loc=0, scale=true_std, size=num_samples)

pit_values_narrow = [fixed_forecast_cdf_narrow(y) for y in observed_ys_for_fixed_true]

plt.figure(figsize=(6, 5))
plt.hist(pit_values_narrow, bins=30, density=True, alpha=0.7, color='lightgreen', label='PIT Values')
plt.plot([0, 1], [1, 1], 'r--', label='Uniform Distribution (Ideal)')
plt.title('PIT Histogram: Underdispersed Forecast (Too Narrow)')
plt.xlabel('PIT Value')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
print(f"  Mean PIT (Underdispersed): {np.mean(pit_values_narrow):.3f}")
print(f"  Variance PIT (Underdispersed): {np.var(pit_values_narrow):.3f} (Should be > 1/12, PIT is U-shaped)\n")

# --- Marginal Calibration Check (Ideal Forecaster) ---
# Marginal calibration compares the average forecast CDF, E[F(y)], with P(Y <= y).
# Here we evaluate both on a grid of thresholds, using the ideal forecaster from Scenario 1.
print("--- Marginal Calibration: Ideal Forecaster ---")
y_grid = np.linspace(-4, 4, 9)

# Average of the forecast CDFs F_t(y) over all forecasts, for each threshold y
avg_forecast_cdf = norm.cdf(y_grid[:, None], loc=true_mus[None, :], scale=1).mean(axis=1)

# Empirical CDF of the observed outcomes at the same thresholds
empirical_cdf = (observed_ys[None, :] <= y_grid[:, None]).mean(axis=1)

for y0, f_bar, g_hat in zip(y_grid, avg_forecast_cdf, empirical_cdf):
    print(f"  y={y0:+.1f}: average forecast CDF = {f_bar:.3f}, empirical CDF = {g_hat:.3f}")
max_gap = np.max(np.abs(avg_forecast_cdf - empirical_cdf))
print(f"  Max absolute gap: {max_gap:.3f} (near 0 when marginally calibrated)\n")


## Conclusion

Understanding forecast scoring and calibration is fundamental for building reliable probabilistic models. We've explored how to represent probabilistic forecasts, delved into various proper scoring rules that incentivize honest predictions, and distinguished between different modes of calibration that assess the reliability of our forecasts.

By using tools like Log Score, Quadratic Score, CRPS, and Quantile Score, and by carefully analyzing Probability Integral Transforms, you can gain deep insights into the strengths and weaknesses of your probabilistic forecasting models. Remember, a good probabilistic forecast is not just about being "accurate" in terms of a single point, but about providing a well-calibrated distribution that truly reflects the underlying uncertainty.

Keep experimenting with these concepts and applying them to your own forecasting problems!

***

**References**

* **11. Forecast Scoring and Calibration.pdf**: The lecture slides used as the primary source for this notebook.
* Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 69(2), 243-268.
* Laio, F., & Tamea, S. (2007). Verification tools for probabilistic forecasts of continuous hydrological variables. *Hydrology and Earth System Sciences*, 11(3), 1267-1277.
* Rumack, A., Tibshirani, R. J., & Rosenfeld, R. (2022). Recalibrating probabilistic forecasts of epidemics. *PLoS Computational Biology*, 18(12), e1010771.
* Bracher, J., et al. (2021). The German Covid-19 Forecast Hub. *arXiv preprint arXiv:2108.03210*. (Referenced for Figures 1 and 2 in original slides).