# Partition Function

> The normalization constant can be hard to compute as it involves summing or integrating over all possible values, often infeasible in high dimensions or with complex parameters. 


## Introduction
In machine learning we often need to manipulate the partition function of a probability distribution. The partition function is the normalization constant (constant with respect to the values that the random variable can take) that ensures that the probability distribution sums to 1.

A specific kind of partition function comes from statistical physics: the one that appears in the Boltzmann distribution, which models how the particles of a system are distributed over its states. This is sometimes called the canonical partition function, and readers who know only this form may not recognize the more general one used in machine learning.

# Partition Function vs Normalization Constant

In statistics, the terms **partition function** and **normalization constant** are closely related, especially in the context of probability distributions and statistical physics. However, their usage depends on the specific context.

## Partition Function
The **partition function** is a term often used in **statistical mechanics** and machine learning (e.g., Boltzmann machines). It typically refers to a sum or integral over all possible states of a system, ensuring that probabilities are normalized.

Mathematically, for a probability distribution defined by an unnormalized probability $ \tilde{p}(x) $, the partition function $ Z $ is:

- For discrete states:
  $$
  Z = \sum_x \tilde{p}(x)
  $$
  
- For continuous states:
  $$
  Z = \int \tilde{p}(x) \, dx
  $$

The partition function is crucial because the normalized probability is:

$$
p(x) = \frac{\tilde{p}(x)}{Z}.
$$
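
As a small numerical illustration of the continuous case, the sketch below approximates $Z$ for the unnormalized density $\tilde{p}(x) = e^{-x^2/2}$ and compares it with the known value $\sqrt{2\pi}$. The choice of density, the truncation to $[-10, 10]$, and the helper names `unnormalized` and `partition_function` are assumptions made for this example only.

```python
import math

def unnormalized(x):
    """Unnormalized density p~(x) = exp(-x^2 / 2) (illustrative choice)."""
    return math.exp(-0.5 * x * x)

def partition_function(p_tilde, lo, hi, n=100_000):
    """Approximate Z = integral of p_tilde over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(p_tilde(lo + (i + 0.5) * h) for i in range(n))

# The tails beyond |x| = 10 are negligible for this density.
Z = partition_function(unnormalized, -10.0, 10.0)

def p(x):
    """Normalized density p(x) = p~(x) / Z."""
    return unnormalized(x) / Z

print("numerical Z :", Z)
print("sqrt(2*pi)  :", math.sqrt(2 * math.pi))
```

Dividing by `Z` is all that separates `unnormalized` from a valid density; integrating `p` over the same interval returns a value very close to 1.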

## Normalization Constant
The **normalization constant** serves the same purpose: it ensures that a probability distribution integrates or sums to 1. In this sense, the partition function *is* the normalization constant in probabilistic contexts.

This term is more commonly used in general probability theory and statistics, where we normalize a function (often a likelihood or posterior) to make it a valid probability density or mass function.

## Differences in Context
1. **Statistical Physics**:
   - The term *partition function* is more common because it connects to physical properties like energy, entropy, and free energy.
   - $ Z $ is defined in terms of state energy levels, e.g., 
     $$
     Z = \sum_x e^{-\beta E(x)},
     $$
     where $ E(x) $ is the energy of state $ x $ and $ \beta = 1/(kT) $ is the inverse temperature ($ k $ is the Boltzmann constant and $ T $ the temperature).

2. **Probability and Statistics**:
   - The term *normalization constant* is preferred when describing a generic probabilistic model, such as Bayesian posterior normalization.
   - The normalization constant ensures that the probability density/mass function integrates or sums to 1.
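
The statistical-physics form can be sketched in a few lines. The three energy levels and the values of $\beta$ below are hypothetical, chosen only to show how $Z$ and the resulting probabilities change with inverse temperature.

```python
import math

def boltzmann(energies, beta):
    """Return (Z, probabilities) for the Boltzmann distribution over the given energies."""
    weights = [math.exp(-beta * e) for e in energies]  # unnormalized e^{-beta E(x)}
    Z = sum(weights)                                   # partition function
    return Z, [w / Z for w in weights]

energies = [0.0, 1.0, 2.0]  # hypothetical energy levels, arbitrary units

for beta in (0.1, 1.0, 10.0):
    Z, probs = boltzmann(energies, beta)
    print(f"beta={beta:5.1f}  Z={Z:.4f}  p={[round(q, 4) for q in probs]}")
```

At small $\beta$ (high temperature) the probabilities are close to uniform; at large $\beta$ (low temperature) almost all the mass sits on the lowest-energy state.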

## Formal Definition

If $f(x)$ is a probability distribution over a set of values $x$, and $f_0(x)$ is the unnormalized probability distribution, then the partition function is defined as:

$$Z = \sum_{x} f_0(x)$$

so $f(x)$ is:

$$f(x) = \frac{f_0(x)}{Z}$$

where the sum is over all possible values of $x$.

Note that $Z$ is constant in the values that $x$ can take: it is the same number for every $x$, and dividing by it is what makes the probability distribution sum to 1.

### Parametric Models

For a pmf characterized by parameters $\theta$, the partition function is:

$$Z(\theta) = \sum_{x} f_0(x; \theta)$$

where the sum is over all possible values of $x$.

So $f(x;\theta)$ is:

$$f(x; \theta) = \frac{f_0(x; \theta)}{Z(\theta)}$$

where $f(x; \theta)$ is the pmf characterized by parameters $\theta$.

Note again that $Z(\theta)$ is constant in the values that $x$ can take, and dividing by it makes the probability distribution sum to 1. It is not constant in $\theta$, which is usually what we vary to fit the model. Here, however, we are not varying $\theta$, so we can treat $Z(\theta)$ as a constant.
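
To make the dependence on $\theta$ concrete, here is a minimal sketch with a hypothetical unnormalized pmf $f_0(x; \theta) = \theta^x$ on the support $\{0, 1, 2, 3, 4\}$. The model and the names `f0` and `Z` are assumptions for illustration, not part of the definitions above.

```python
def f0(x, theta):
    """Hypothetical unnormalized pmf: f0(x; theta) = theta ** x."""
    return theta ** x

def Z(theta, support):
    """Partition function: sum of f0 over every value in the support."""
    return sum(f0(x, theta) for x in support)

support = range(5)  # x in {0, 1, 2, 3, 4}

for theta in (0.25, 0.5, 0.9):
    z = Z(theta, support)
    pmf = [f0(x, theta) / z for x in support]
    print(f"theta={theta}  Z(theta)={z:.4f}  sum of pmf={sum(pmf):.4f}")
```

For each fixed $\theta$ a single number $Z(\theta)$ normalizes every value of $x$, but that number changes when $\theta$ changes.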


## Examples

### Discrete Values

Let's say we have a probability distribution $f(x)$ over the values $x \in \{1, 2, 3\}$, with unnormalized probabilities $f_0(1) = 2$, $f_0(2) = 3$, and $f_0(3) = 1$. The partition function is:

$$Z = 2 + 3 + 1 = 6$$
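
The same calculation in code, carried one step further to the normalized probabilities. This sketch uses exact fractions so that the results read as $1/3$, $1/2$, and $1/6$; the dictionary names are chosen for this example.

```python
from fractions import Fraction

# Unnormalized probabilities f0(x) for x in {1, 2, 3}
f0 = {1: 2, 2: 3, 3: 1}

# Partition function: sum over all possible values of x
Z = sum(f0.values())

# Normalized pmf f(x) = f0(x) / Z, kept as exact fractions
f = {x: Fraction(w, Z) for x, w in f0.items()}

print("Z =", Z)
for x, prob in f.items():
    print(f"f({x}) = {prob}")
```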

Two more examples of partition functions follow.

### Energy Based Models

In energy based models, the unnormalized probability distribution is defined as:

$$f_0(x) = \exp(-E(x;\theta))$$

where $E(x; \theta)$ is the energy function. The partition function is then:

$$Z = Z(\theta) = \sum_{x} \exp(-E(x; \theta))$$

So that the probability distribution is:

$$f(x; \theta) = \frac{\exp(-E(x; \theta))}{Z(\theta)}$$
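
The sum over $x$ is what makes $Z(\theta)$ expensive: with $n$ binary variables it has $2^n$ terms. The sketch below computes $\log Z(\theta)$ by brute-force enumeration for a hypothetical chain energy $E(x; \theta) = -\theta \sum_i x_i x_{i+1}$ with $x_i \in \{-1, +1\}$, a model chosen here because its partition function has the closed form $Z = 2\,(2\cosh\theta)^{n-1}$ to check against. The sum is accumulated with the log-sum-exp trick to avoid overflow.

```python
import itertools
import math

def energy(x, theta):
    """Hypothetical chain energy: E(x; theta) = -theta * sum_i x_i * x_{i+1}."""
    return -theta * sum(a * b for a, b in zip(x, x[1:]))

def log_partition(n, theta):
    """log Z(theta) by enumerating all 2**n states, using log-sum-exp for stability."""
    neg_energies = [-energy(x, theta) for x in itertools.product((-1, 1), repeat=n)]
    m = max(neg_energies)  # subtract the largest exponent before exponentiating
    return m + math.log(sum(math.exp(v - m) for v in neg_energies))

n, theta = 10, 0.5
log_Z = log_partition(n, theta)

# Closed form for this particular chain model, used only as a check
exact = math.log(2) + (n - 1) * math.log(2 * math.cosh(theta))

print("states enumerated:", 2 ** n)
print("log Z (brute force):", log_Z)
print("log Z (closed form):", exact)
```

Enumeration is fine for $n = 10$ (1024 states) but each extra variable doubles the number of terms, which is why exact computation of $Z(\theta)$ is usually out of reach for large models.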

### Bayesian Models

In Bayesian models, with a prior over the parameter $\theta$, and observed data D, the partition function is:

$$Z = Z(D) = \int_{\theta} f(D | \theta) \cdot p(\theta) d\theta$$

where $p(\theta)$ is the prior over the parameter $\theta$.

Note that $Z = Z(D)$ is constant in $\theta$: the parameter has been integrated out, so $Z$ depends only on the observed data $D$, and dividing by it makes the distribution over $\theta$ integrate to 1.

For Bayesian models, $Z(D)$ is also called the marginal likelihood, or evidence.


It will normalize the posterior distribution over the parameter $\theta$ given the data D:

$$f(\theta | D) = \frac{f(D | \theta) \cdot p(\theta)}{Z(D)}$$

where $f(\theta | D)$ is the posterior distribution over the parameter $\theta$ given the data D.
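
As a sketch of the Bayesian case, assume a uniform prior on $\theta \in [0, 1]$ and data $D$ consisting of $k$ successes in a specific sequence of $n$ Bernoulli trials, so that $f(D | \theta) = \theta^k (1 - \theta)^{n-k}$. For this assumed model the evidence has the closed form $Z(D) = \frac{k!\,(n-k)!}{(n+1)!}$, which the numerical integral below can be checked against. The values $k = 7$, $n = 10$ and the function names are illustrative.

```python
import math

def likelihood(theta, k, n):
    """f(D | theta) for k successes in a given sequence of n Bernoulli trials."""
    return theta ** k * (1 - theta) ** (n - k)

def prior(theta):
    """Uniform prior density on [0, 1] (an assumption of this example)."""
    return 1.0

def evidence(k, n, grid=100_000):
    """Z(D): integrate likelihood * prior over theta in [0, 1] with the midpoint rule."""
    h = 1.0 / grid
    return h * sum(
        likelihood((i + 0.5) * h, k, n) * prior((i + 0.5) * h) for i in range(grid)
    )

k, n = 7, 10
Z = evidence(k, n)
exact = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

def posterior(theta):
    """Normalized posterior density f(theta | D)."""
    return likelihood(theta, k, n) * prior(theta) / Z

print("Z(D) numerical  :", Z)
print("Z(D) closed form:", exact)
```

Here `Z` is a single number once the data are fixed; it does not vary with `theta`, which is why it can be ignored when only the shape of the posterior matters.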


## Example

Normalization constant of the Gamma distribution:

```python
from sympy import symbols, integrate, exp, oo

# Define variables (all assumed positive so the integral converges)
x, alpha, beta = symbols('x alpha beta', positive=True)

# Unnormalized part of the Gamma distribution
unnormalized = x**(alpha - 1) * exp(-beta * x)

# Partition function Z: integrate over the support (0, infinity)
Z = integrate(unnormalized, (x, 0, oo))

# Simplify; the result should match the known constant Gamma(alpha) / beta**alpha
Z_simplified = Z.simplify()
Z_simplified
```

Output:

```
gamma(alpha)/beta**alpha
```
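
The result $\Gamma(\alpha)/\beta^{\alpha}$ can also be checked by hand. With the substitution $u = \beta x$ (so $x = u/\beta$ and $dx = du/\beta$), and using the definition $\Gamma(\alpha) = \int_0^\infty u^{\alpha - 1} e^{-u} \, du$:

$$
Z = \int_0^\infty x^{\alpha - 1} e^{-\beta x} \, dx
  = \int_0^\infty \left(\frac{u}{\beta}\right)^{\alpha - 1} e^{-u} \, \frac{du}{\beta}
  = \frac{1}{\beta^{\alpha}} \int_0^\infty u^{\alpha - 1} e^{-u} \, du
  = \frac{\Gamma(\alpha)}{\beta^{\alpha}}
$$

The normalized Gamma density is therefore $f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$.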

## Exercise

Work out the normalization constant for each of the following probability distributions. You can use sympy or work by hand.

1. $f(x) = \frac{1}{Z} \exp(-x^2)$ for $x \in \mathbb{R}$, where $Z$ is the normalization constant.

2. beta distribution: $f(x) = \frac{1}{Z} x^{\alpha - 1} (1 - x)^{\beta - 1}$ for $x \in [0, 1]$, where $Z$ is the normalization constant.

3. $f(x) = \frac{1}{Z} \exp(-x^2)$ for $x \in [0, 1]$, where $Z$ is the normalization constant.

4. lognormal distribution: $f(x) = \frac{1}{Z} \frac{1}{x} \exp(-\frac{(\log(x) - \mu)^2}{2\sigma^2})$ for $x \in \mathbb{R}^+$, where $Z$ is the normalization constant.

Do the same for energy based models, where $f(x) = \frac{1}{Z} \exp(-E(x))$:

5. $E(x) = x^2$ for $x \in \mathbb{R}$.

6. $E(x) = x^2$ for $x \in [0, 1]$.

7. $E(x) = \frac{(\log(x) - \mu)^2}{2\sigma^2}$ for $x \in \mathbb{R}^+$.