# Information Engines:  Intelligence and Thermodynamics

## Neil D. Lawrence

## 9th January 2021

$$
\newcommand{\phaseVariables}{\boldsymbol{\Gamma}}
\newcommand{\stateVariables}{\mathbf{x}}
\newcommand{\nullVariables}{\mathbf{x}_0}
\newcommand{\domainVariables}{\mathbf{x}_1}
\newcommand{\dataVariables}{\mathbf{y}}
\newcommand{\parameterVector}{\mathbf{w}}
\renewcommand{\phaseVariables}{\Gamma}
\renewcommand{\stateVariables}{X}
\renewcommand{\nullVariables}{X_0}
\renewcommand{\domainVariables}{X_1}
\renewcommand{\dataVariables}{Y}
\renewcommand{\parameterVector}{W}
\newcommand{\expDist}[2]{\left\langle #1 \right\rangle_{#2}}
\newcommand{\trueProb}{\mathbb{P}}
\newcommand{\physicsProb}{p}
\newcommand{\approxProb}{q}
\newcommand{\statsProb}{\pi}
$$



Ashby's concept of "variety" and the law of requisite variety: https://en.wikipedia.org/wiki/Variety_(cybernetics)

## Introduction

The Helmholtz free energy was derived by Hermann von Helmholtz in the study of electrochemistry. 

In a thermodynamic system, the state space is referred to as the phase space.[^1]

[^1]: Here we are referring to the microstates; the macrostates would be the sufficient statistics of the microstates.

Boltzmann's distribution gives us the probability distribution associated with that state space. In a continuous system, we write down the *Hamiltonian*, the energy associated with the phase space.
$$
\mathbb{P}(\phaseVariables) = \frac{1}{Z_\phaseVariables} \exp(-\beta E(\phaseVariables))
$$
where $E(\phaseVariables)$ is the energy of the system in microstate $\phaseVariables$. 

The partition function is given by 
$$
Z_\phaseVariables = \sum_\phaseVariables \exp(-\beta E(\phaseVariables)),
$$
where the $Z$ comes from the German for "sum over states" or *Zustandssumme*, reflecting Boltzmann's Austrian heritage. In the case of continuous systems we have,
$$
Z_\phaseVariables = \frac{1}{h^3}\int \exp(-\beta E(\phaseVariables)) \text{d} \phaseVariables
$$
and $E(\phaseVariables)$ is the Hamiltonian of the system.

## Total Energy

The total energy, $U_\phaseVariables$, is defined as the expected energy,
$$
U_\phaseVariables = \expDist{E(\phaseVariables)}{\trueProb(\phaseVariables)}
$$
which can be decomposed using the definition of $\trueProb(\phaseVariables)$ as
$$
U_\phaseVariables = A_\phaseVariables + TS_\phaseVariables
$$
where
$$
A_\phaseVariables = - \frac{1}{\beta}\log Z_\phaseVariables
$$
is the *Helmholtz free energy* and 
$$
S_\phaseVariables = -k_B \expDist{\log \trueProb(\phaseVariables)}{\trueProb(\phaseVariables)}
$$
is the entropy of the system and $T$ is the temperature. 

This equation expresses a fundamental decomposition of the total energy into the available energy, $A_\phaseVariables$ and energy that is not available, $TS_\phaseVariables$. 
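The decomposition can be checked numerically on a small discrete system. The sketch below is illustrative only: the energy levels and the value of $\beta$ are hypothetical, and the function names `boltzmann` and `decompose` are not from the original notes.

```python
import math

def boltzmann(energies, beta):
    """Return (probabilities, Z) for a finite set of microstate energies."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights], Z

def decompose(energies, beta):
    """Return (U, A, TS), where TS = -(1/beta) * sum_i p_i log p_i."""
    probs, Z = boltzmann(energies, beta)
    U = sum(p * e for p, e in zip(probs, energies))           # expected energy
    A = -math.log(Z) / beta                                   # Helmholtz free energy
    TS = -sum(p * math.log(p) for p in probs) / beta          # entropy term, T k_B = 1/beta
    return U, A, TS

energies = [0.0, 1.0, 2.0, 5.0]  # hypothetical energy levels
beta = 0.7                       # hypothetical inverse temperature
U, A, TS = decompose(energies, beta)
print(f"U = {U:.6f}, A + TS = {A + TS:.6f}")
```

The two printed values should agree up to floating point error, since $U = A + TS$ follows directly from substituting the Boltzmann form into the entropy.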

By our definition of intelligence, the aim is to use information to achieve a goal with less resource. Here we'll interpret that as increasing available energy. First, we split the phase space into an unobserved part and an observed part, $\phaseVariables = \{\stateVariables, \dataVariables\}$, which allows us to write
$$
U_{\stateVariables,\dataVariables} = A_{\stateVariables,\dataVariables} + TS_{\stateVariables,\dataVariables}
$$
where
$$
A_{\stateVariables,\dataVariables} = - \frac{1}{\beta}\log Z_{\stateVariables,\dataVariables}
$$
and 
$$
S_{\stateVariables,\dataVariables} = -k_B \expDist{\log \trueProb({\stateVariables,\dataVariables})}{\trueProb({\stateVariables,\dataVariables})}.
$$

## Information to Available Energy

Making an observation in this system is equivalent to conditioning on $\dataVariables$. We write the resulting total energy as
$$
U_{\stateVariables|\dataVariables} = \expDist{E({\stateVariables|\dataVariables})}{\trueProb({\stateVariables|\dataVariables})}
$$

where the conditional distribution is
$$
\mathbb{P}({\stateVariables|\dataVariables}) = \frac{1}{Z_{\stateVariables|\dataVariables}} \exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)
$$
where we have decomposed $E(\stateVariables, \dataVariables)$ into two parts: $E(\dataVariables|\stateVariables)$, which represents the interaction between our measurements and the state, and $E(\stateVariables)$, which collects the energy terms where there is no interaction. The normaliser is
$$
Z_{\stateVariables|\dataVariables} = \int \exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right) \text{d}\stateVariables
$$
The new total energy, conditioned on the measurements, is
$$
U_{\stateVariables|\dataVariables} = A_{\stateVariables|\dataVariables} + TS_{\stateVariables|\dataVariables}
$$
where
$$
A_{\stateVariables|\dataVariables} = - \frac{1}{\beta}\log Z_{\stateVariables|\dataVariables}
$$
and 
$$
S_{\stateVariables|\dataVariables} = -k_B \expDist{\log \trueProb({\stateVariables|\dataVariables})}{\trueProb({\stateVariables|\dataVariables})}.
$$

We can examine how this changes the free energy. The energy gain through observation is
$$
\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} = & -\frac{1}{\beta} \log \frac{Z_{\stateVariables|\dataVariables}}{Z_{\stateVariables,\dataVariables}}\\
= & -T k_B \log \trueProb(\dataVariables),
\end{align*}
$$
where the second line uses $Z_{\stateVariables|\dataVariables} / Z_{\stateVariables,\dataVariables} = \trueProb(\dataVariables)$. This is the information gained through the observation, $\dataVariables$.

This shows the relationship between measurement and energy. By measuring the system we can gain free energy. The less likely the measurements, the more energy we gain. 
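As a sanity check, the relationship can be reproduced on a toy discrete system. This is a minimal sketch under stated assumptions: the energy tables `E_x` and `E_y_given_x` are hypothetical, and the integrals over $\stateVariables$ and $\dataVariables$ are replaced by finite sums.

```python
import math

beta = 1.0
# Hypothetical energies: E_x[i] is E(x=i); E_y_given_x[i][j] is E(y=j | x=i).
E_x = [0.0, 0.5, 1.5]
E_y_given_x = [[0.2, 1.0], [0.7, 0.1], [2.0, 0.3]]
n_x, n_y = len(E_x), len(E_y_given_x[0])

def weight(i, j):
    """Unnormalised Boltzmann weight exp(-beta E(y|x) - beta E(x))."""
    return math.exp(-beta * (E_y_given_x[i][j] + E_x[i]))

# Joint partition function: sum over both x and y.
Z_joint = sum(weight(i, j) for i in range(n_x) for j in range(n_y))

def free_energy_gain(j):
    """A_{x|y} - A_{x,y} for the observation y = j."""
    Z_cond = sum(weight(i, j) for i in range(n_x))
    return -math.log(Z_cond / Z_joint) / beta

def marginal(j):
    """P(y = j), found by summing the joint Boltzmann distribution over x."""
    return sum(weight(i, j) / Z_joint for i in range(n_x))

for j in range(n_y):
    print(j, free_energy_gain(j), -math.log(marginal(j)) / beta)
```

For each value of $\dataVariables$ the two printed columns should coincide, and the less probable observation should show the larger gain.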

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg">

<img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/Figchannel2017ab.svg">

## Approximate Available Energy

Unfortunately, we do not know the true distribution, and cannot compute how much energy we've gained. The full model, $\trueProb(\stateVariables, \dataVariables)$, is not available to us. Indeed, the number of states across the universe is so large that, even if we knew all the physics, we couldn't write down the model of everything.

The available energy is
$$\begin{align*}
A_{\stateVariables|\dataVariables} = & -\frac{1}{\beta} \log Z_{\stateVariables|\dataVariables} \\
= & -\frac{1}{\beta} \log \int \exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right) \text{d}\stateVariables
\end{align*}
$$
which we can rewrite as
$$
A_{\stateVariables|\dataVariables} =  -\frac{1}{\beta}\expDist{\log \frac{\exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)}{\physicsProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) } - \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) }
$$
or alternatively
$$
A_{\stateVariables|\dataVariables} = U^\prime_{\stateVariables|\dataVariables} - TS^\prime_{\stateVariables|\dataVariables} - TM_{\stateVariables|\dataVariables}
$$
where
$$
U^\prime_{\stateVariables|\dataVariables} = \expDist{E(\dataVariables|\stateVariables) + E(\stateVariables)}{\physicsProb(\stateVariables|\dataVariables) } 
$$
and
$$
S^\prime_{\stateVariables|\dataVariables}=-k_B\expDist{\log\physicsProb(\stateVariables|\dataVariables) }{\physicsProb(\stateVariables|\dataVariables) }
$$
and $M_{\stateVariables|\dataVariables}$ is the model misspecification,
$$
M_{\stateVariables|\dataVariables} = k_B \expDist{\log \frac{\physicsProb(\stateVariables| \dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) },
$$
which is recognised as the Kullback-Leibler divergence (the information theoretic term), or the *relative entropy*, between our approximation and the true model.

We also define
$$
A^\prime_{\stateVariables|\dataVariables} = U^\prime_{\stateVariables|\dataVariables} - TS^\prime_{\stateVariables|\dataVariables}
$$
so we have
$$
A^\prime_{\stateVariables|\dataVariables} = A_{\stateVariables|\dataVariables} + TM_{\stateVariables|\dataVariables},
$$
showing that, because $M_{\stateVariables|\dataVariables} \geq 0$, $A^\prime_{\stateVariables|\dataVariables}$ is an upper bound on the true free energy.
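The identity $A^\prime = A + TM$ can be verified numerically for a fixed observation. The sketch below assumes a small discrete state space; the energies and the approximating distribution `q` (standing in for $\physicsProb(\stateVariables|\dataVariables)$) are hypothetical values chosen for illustration.

```python
import math

beta = 1.0
energies = [0.0, 0.8, 1.7, 3.0]  # hypothetical E(y|x) + E(x) for a fixed y
q = [0.4, 0.3, 0.2, 0.1]         # hypothetical approximating distribution

# True conditional distribution and true free energy.
Z = sum(math.exp(-beta * e) for e in energies)
true_p = [math.exp(-beta * e) / Z for e in energies]
A_true = -math.log(Z) / beta

# Apparent quantities computed under the approximation q.
U_approx = sum(qi * e for qi, e in zip(q, energies))
TS_approx = -sum(qi * math.log(qi) for qi in q) / beta
A_approx = U_approx - TS_approx

# T times the misspecification: (1/beta) * KL(q || true_p).
TM = sum(qi * math.log(qi / pi) for qi, pi in zip(q, true_p)) / beta

print(A_approx, A_true + TM)
```

The gap between `A_approx` and `A_true` equals `TM`, which is non-negative because it is a Kullback-Leibler divergence scaled by $1/\beta$.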

## Free Energy Gain

Our interest is not the true free energy, but the energy we gain through making the observation. This is denoted
$$\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} = & -\frac{1}{\beta} \log \int \exp\left(-\beta \hat{E}(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right) \text{d}\stateVariables \\
& +\frac{1}{\beta} \log \int \exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right) \text{d}\stateVariables\text{d}\dataVariables
\end{align*}
$$
where 
$\hat{E}(\dataVariables|\stateVariables)\geq E(\dataVariables|\stateVariables)$ also includes the additional energy cost of conditioning on $\dataVariables$. 

$$\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} =  & -\frac{1}{\beta}\expDist{\log \frac{\exp\left(-\beta \hat{E}(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)}{\physicsProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) } - \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) }\\
& +\frac{1}{\beta}\expDist{\log \frac{\exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)}{\physicsProb(\stateVariables,\dataVariables) }}{\physicsProb(\stateVariables,\dataVariables) } + \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables,\dataVariables)}{\trueProb(\stateVariables,\dataVariables) }}{\physicsProb(\stateVariables,\dataVariables) }
\end{align*}$$

$$\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} =  & -\frac{1}{\beta}\expDist{\log \frac{\exp\left(-\beta \hat{E}(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)}{\physicsProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) } - \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) }\\
& +\frac{1}{\beta}\expDist{\log \frac{\exp\left(-\beta E(\dataVariables|\stateVariables) -\beta E(\stateVariables)\right)}{\physicsProb(\stateVariables,\dataVariables) }}{\physicsProb(\stateVariables,\dataVariables) } + \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables,\dataVariables) } + \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\dataVariables)}{\trueProb(\dataVariables) }}{\physicsProb(\dataVariables) }
\end{align*}$$

$$\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} =  & U^\prime_{\stateVariables|\dataVariables} - U^\prime_{\stateVariables,\dataVariables} + TS^\prime_\dataVariables \\
& - \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) } \\
 & + \expDist{\frac{1}{\beta}\expDist{\log \frac{\physicsProb(\stateVariables|\dataVariables)}{\trueProb(\stateVariables|\dataVariables) }}{\physicsProb(\stateVariables|\dataVariables) }}{\physicsProb(\dataVariables)}\\
&  + \frac{1}{\beta}\expDist{\log \frac{\physicsProb(\dataVariables)}{\trueProb(\dataVariables) }}{\physicsProb(\dataVariables) }
\end{align*}$$

$$\begin{align*}
A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables} =  & U^\prime_{\stateVariables|\dataVariables} - U^\prime_{\stateVariables,\dataVariables} + TS^\prime_\dataVariables - T\left(M_{\stateVariables|\dataVariables} - \expDist{M_{\stateVariables|\dataVariables}}{\physicsProb(\dataVariables)} - M_{\dataVariables}\right)
\end{align*}$$
where
$$
S^\prime_{\dataVariables}=-k_B\expDist{\log\physicsProb(\dataVariables) }{\physicsProb(\dataVariables) }
$$
and $M_{\dataVariables}$ is the model misspecification,
$$
M_{\dataVariables} = k_B \expDist{\log \frac{\physicsProb( \dataVariables)}{\trueProb(\dataVariables) }}{\physicsProb(\dataVariables) }.
$$

If we look at the expected change under all data sets we get
$$
\expDist{A_{\stateVariables|\dataVariables} - A_{\stateVariables,\dataVariables}}{\trueProb(\dataVariables)} = -\expDist{E(\dataVariables)}{\trueProb(\dataVariables)} + TS^\prime_\dataVariables + TM_{\dataVariables}  - T\expDist{\left(M_{\stateVariables|\dataVariables} - \expDist{M_{\stateVariables|\dataVariables}}{\physicsProb(\dataVariables)}\right)}{\trueProb(\dataVariables)}
$$
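where $E(\dataVariables)$ is the energy cost of measuring $\dataVariables$ and the final term,
$$
\expDist{\left(M_{\stateVariables|\dataVariables} - \expDist{M_{\stateVariables|\dataVariables}}{\physicsProb(\dataVariables)}\right)}{\trueProb(\dataVariables)},
$$
is the expected deviation of the conditional misspecification from its average under $\physicsProb(\dataVariables)$.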


So we have
$$
\expDist{\Delta A^\prime_{\stateVariables|\dataVariables}}{\trueProb(\dataVariables)} = -\expDist{E(\dataVariables)}{\trueProb(\dataVariables)} + TS^\prime_\dataVariables 
$$
where 
$$
E(\dataVariables) = \expDist{\hat{E}(\dataVariables|\stateVariables)}{\physicsProb(\stateVariables|\dataVariables)} - \expDist{E(\dataVariables|\stateVariables)}{\physicsProb(\stateVariables,\dataVariables)}
$$

## Null Space and Domain Space

One challenge is dealing with the number of variables in the world. Our first step will be to partition the state space into a *null space*, $\nullVariables$ and a domain space, $\domainVariables$, $\stateVariables = \{\nullVariables, \domainVariables\}$. The domain space contains variables of interest to our problem, and the null space contains variables that are not directly of interest. In particular the null space contains variables that we believe are only weakly influenced by our measurements. The next step is to assume that our approximating density factorises across these spaces,
$$
\physicsProb(\nullVariables, \domainVariables | \dataVariables) = \physicsProb(\nullVariables|\dataVariables) \physicsProb(\domainVariables | \dataVariables).
$$

If this is the only assumption we make about our approximation to the real world, then we can compute the optimum approximations by minimizing the model misspecification, giving
$$
\physicsProb(\domainVariables|\dataVariables) = \frac{\exp\left(-\beta E^\prime(\dataVariables |\domainVariables) - \beta E^\prime(\domainVariables)\right)}{Z^\prime_{\domainVariables | \dataVariables}} 
$$
where
$$\begin{align*}
E^\prime(\dataVariables |\domainVariables) = & \expDist{E(\dataVariables |\stateVariables)}{\physicsProb(\nullVariables|\dataVariables)} \\
E^\prime(\domainVariables) = & \expDist{E(\stateVariables)}{\physicsProb(\nullVariables|\dataVariables)} \\
Z^\prime_{\domainVariables | \dataVariables} = & \int \exp\left(-\beta E^\prime(\dataVariables |\domainVariables) - \beta E^\prime(\domainVariables)\right) \text{d}\domainVariables. 
\end{align*}
$$
Similarly,
$$
\physicsProb(\nullVariables|\dataVariables) = \frac{\exp\left(-\beta E^\prime(\dataVariables |\nullVariables) - \beta E^\prime(\nullVariables)\right)}{Z^\prime_{\nullVariables | \dataVariables}} 
$$
where
$$\begin{align*}
E^\prime(\dataVariables |\nullVariables) = & \expDist{E(\dataVariables |\stateVariables)}{\physicsProb(\domainVariables|\dataVariables)} \\
E^\prime(\nullVariables) = & \expDist{E(\stateVariables)}{\physicsProb(\domainVariables|\dataVariables)} \\
Z^\prime_{\nullVariables | \dataVariables} = & \int \exp\left(-\beta E^\prime(\dataVariables |\nullVariables) - \beta E^\prime(\nullVariables)\right) \text{d}\nullVariables. 
\end{align*}
$$
Note that these two distributions are interdependent. The optimal form of $\physicsProb(\nullVariables|\dataVariables)$ is dependent on the form of $\physicsProb(\domainVariables|\dataVariables)$ and vice versa.

The two distributions are symmetric. Our assumption is that the domain variables are more influenced by the measurements, $\dataVariables$, than the null space, so we will focus on the domain variables. But due to the symmetry, the derivation that follows would apply equally to the null space variables.
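Because each factor depends on the other, the two updates are usually alternated until they stop changing. The following is a minimal sketch of that alternation for a fixed $\dataVariables$, assuming small discrete null and domain spaces; the energy table `E` and the number of sweeps are hypothetical, and `q0`, `q1` stand in for $\physicsProb(\nullVariables|\dataVariables)$ and $\physicsProb(\domainVariables|\dataVariables)$.

```python
import math

beta = 1.0
# Hypothetical total energy E[i][j] for null state i and domain state j (y fixed).
E = [[0.0, 1.2, 2.0],
     [0.9, 0.1, 1.5]]
n0, n1 = len(E), len(E[0])

def normalise(log_w):
    """Exponentiate and normalise, subtracting the max for stability."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    s = sum(w)
    return [v / s for v in w]

def apparent_free_energy(q0, q1):
    """U' - TS' under the factorised approximation q0(x0) q1(x1)."""
    U = sum(q0[i] * q1[j] * E[i][j] for i in range(n0) for j in range(n1))
    neg_TS = sum(p * math.log(p) for p in q0 + q1 if p > 0) / beta
    return U + neg_TS

q0 = [1.0 / n0] * n0
q1 = [1.0 / n1] * n1
history = [apparent_free_energy(q0, q1)]
for _ in range(50):
    # Domain update: expected energy under the current null-space factor.
    q1 = normalise([-beta * sum(q0[i] * E[i][j] for i in range(n0)) for j in range(n1)])
    # Null-space update: expected energy under the current domain factor.
    q0 = normalise([-beta * sum(q1[j] * E[i][j] for j in range(n1)) for i in range(n0)])
    history.append(apparent_free_energy(q0, q1))

A_true = -math.log(sum(math.exp(-beta * e) for row in E for e in row)) / beta
print(history[0], history[-1], A_true)
```

Each sweep should leave the apparent free energy no larger than before, and it should never drop below the true free energy computed from the full table.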

## Apparent Free Energy

Dropping the model misspecification we now have an upper bound on the free energy,
$$
A_{\stateVariables|\dataVariables} \leq U^\prime_{\stateVariables|\dataVariables} - TS^\prime_{\stateVariables|\dataVariables} 
$$
or alternatively, we can define
$$
A^\prime_{\stateVariables|\dataVariables} = U^\prime_{\stateVariables|\dataVariables} - TS^\prime_{\stateVariables|\dataVariables},
$$
which we call the *apparent free energy*. The apparent free energy is an upper bound on the true free energy. Our factorisation into domain and null space reflects the fact that we view $\physicsProb(\domainVariables | \dataVariables)$ as a model emerging from a domain expert.

By our definitions above, the apparent total energy can be written as
$$
U^\prime_{\stateVariables|\dataVariables} = \expDist{E^\prime(\dataVariables |\domainVariables)}{\physicsProb(\domainVariables| \dataVariables)} + \expDist{E^\prime(\domainVariables)}{\physicsProb(\domainVariables| \dataVariables)} = U^\prime_{\domainVariables|\dataVariables} = U^\prime_{\nullVariables|\dataVariables}
$$
and our independence assumption means that
$$
S^\prime_{\stateVariables|\dataVariables} = S^\prime_{\domainVariables|\dataVariables} + S^\prime_{\nullVariables|\dataVariables}.
$$
We now can define 
$$
A^\prime_{\domainVariables|\dataVariables} = U^\prime_{\domainVariables|\dataVariables} - TS^\prime_{\domainVariables|\dataVariables} = -\frac{1}{\beta} \log Z^\prime_{\domainVariables|\dataVariables}
$$
and can see that the apparent free energy is
$$
A^\prime_{\stateVariables|\dataVariables} = A^\prime_{\domainVariables|\dataVariables} - TS^\prime_{\nullVariables|\dataVariables}.
$$ 



## Parameterised Model

The next step is to include *parameters* with the model. To do this we first separate a new set of auxiliary variables from the domain variables, $\domainVariables \rightarrow \{\domainVariables, \parameterVector\}$, and introduce a new distribution,
$$
\approxProb(\domainVariables, \parameterVector) = \physicsProb(\domainVariables | \parameterVector)\approxProb(\parameterVector)
$$
$$\begin{align*}
A^\prime_{\domainVariables|\parameterVector} = & -\frac{1}{\beta} \log \int \exp\left(-\beta E^\prime(\dataVariables |\domainVariables, \parameterVector) - \beta E^\prime(\domainVariables, \parameterVector)\right) \text{d}\domainVariables \text{d}\parameterVector\\
= & -\frac{1}{\beta} \expDist{\log\frac{\exp\left(-\beta E^\prime(\dataVariables |\domainVariables, \parameterVector) - \beta E^\prime(\domainVariables, \parameterVector)\right)}{\approxProb(\domainVariables, \parameterVector)}}{\approxProb(\domainVariables, \parameterVector)} - \frac{1}{\beta} \expDist{\log\frac{\approxProb(\domainVariables , \parameterVector)}{\physicsProb(\domainVariables, \parameterVector | \dataVariables)}}{\approxProb(\domainVariables, \parameterVector)} \\
= & -\frac{1}{\beta} \expDist{\log\frac{\exp\left(-\beta \expDist{E^\prime(\dataVariables |\domainVariables, \parameterVector) + E^\prime(\domainVariables, \parameterVector)}{\physicsProb(\domainVariables | \parameterVector)}\right)}{\approxProb(\parameterVector)}}{\approxProb(\parameterVector)} + \frac{1}{\beta} \expDist{\log \physicsProb(\domainVariables | \parameterVector)}{\approxProb(\domainVariables, \parameterVector)} - \frac{1}{\beta} \expDist{\log\frac{\approxProb(\domainVariables , \parameterVector)}{\physicsProb(\domainVariables, \parameterVector | \dataVariables)}}{\approxProb(\domainVariables, \parameterVector)} \\
= & -\frac{1}{\beta} \expDist{\log\frac{\exp\left(-\beta E^{\prime\prime}(\dataVariables |\parameterVector) - \beta E^{\prime\prime}(\parameterVector)\right)}{\approxProb(\parameterVector)}}{\approxProb(\parameterVector)} + \frac{1}{\beta} \expDist{\expDist{\log \physicsProb(\domainVariables | \parameterVector)}{\physicsProb(\domainVariables | \parameterVector)}}{\approxProb(\parameterVector)} - \frac{1}{\beta} \expDist{\expDist{\log\frac{\physicsProb(\domainVariables | \parameterVector)}{\physicsProb(\domainVariables| \parameterVector , \dataVariables)}}{\physicsProb(\domainVariables | \parameterVector)}}{\approxProb(\parameterVector)} - \frac{1}{\beta}\expDist{\log\frac{\approxProb(\parameterVector)}{\physicsProb(\parameterVector|\dataVariables)}}{\approxProb(\parameterVector)}
\end{align*}
$$

If we define 
$$
\statsProb(\dataVariables | \parameterVector) = \frac{\exp\left(-\beta E^{\prime\prime}(\dataVariables |\parameterVector)\right)}{Z^{\prime\prime}_{\dataVariables | \parameterVector}}
$$
and
$$
\statsProb(\parameterVector) = \frac{\exp\left(-\beta E^{\prime\prime}(\parameterVector)\right)}{Z^{\prime\prime}_{\parameterVector}}
$$
where
$$
Z^{\prime\prime}_{\dataVariables | \parameterVector} = \int \exp\left(-\beta E^{\prime\prime}(\dataVariables |\parameterVector)\right) \text{d}\dataVariables
$$
and
$$
Z^{\prime\prime}_{\parameterVector} = \int \exp\left(-\beta E^{\prime\prime}(\parameterVector)\right) \text{d}\parameterVector
$$
and
$$
Z^{\prime\prime}_{\parameterVector|\dataVariables} = \int \exp\left(-\beta E^{\prime\prime}(\dataVariables |\parameterVector) -\beta E^{\prime\prime}(\parameterVector)\right) \text{d}\parameterVector
$$
and we set
$$
\approxProb(\parameterVector) = \statsProb(\parameterVector |\dataVariables)
$$
then we have
$$
A^\prime_{\domainVariables|\parameterVector} = -\frac{1}{\beta} \log  Z^{\prime\prime}_{\parameterVector|\dataVariables} - T\expDist{S^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)} - T\expDist{B^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)}  - \frac{1}{\beta}\expDist{\log\frac{\statsProb(\parameterVector|\dataVariables)}{\physicsProb(\parameterVector|\dataVariables)}}{\statsProb(\parameterVector|\dataVariables)}
$$
where
$$
B^\prime_{\domainVariables | \parameterVector} = k_B \expDist{\log\frac{\physicsProb(\domainVariables | \parameterVector)}{\physicsProb(\domainVariables| \parameterVector , \dataVariables)}}{\physicsProb(\domainVariables | \parameterVector)}
$$
so overall we have
$$
A_{\stateVariables|\dataVariables} = -\frac{1}{\beta} \log  Z^{\prime\prime}_{\parameterVector|\dataVariables} - T\expDist{S^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)} - T\expDist{B^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)}  - TM^\prime_{\parameterVector|\dataVariables} - TS^\prime_{\nullVariables|\dataVariables}- TM_{\stateVariables|\dataVariables}
$$
where
$$
M^\prime_{\parameterVector|\dataVariables} =  k_B\expDist{\log\frac{\statsProb(\parameterVector|\dataVariables)}{\physicsProb(\parameterVector|\dataVariables)}}{\statsProb(\parameterVector|\dataVariables)}
$$
Collecting terms
$$
A_{\stateVariables|\dataVariables} = A^{\prime\prime}_{\parameterVector|\dataVariables} - T\left[\expDist{S^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)} + S^\prime_{\nullVariables|\dataVariables}\right]   - T\left[M^\prime_{\parameterVector|\dataVariables} + M_{\stateVariables|\dataVariables} + \expDist{B^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)}\right]
$$
Or
$$
A^{\prime\prime}_{\parameterVector|\dataVariables} = A_{\stateVariables|\dataVariables} + T\left[\expDist{S^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)} + S^\prime_{\nullVariables|\dataVariables}\right]  + T\left[M^\prime_{\parameterVector|\dataVariables} + M_{\stateVariables|\dataVariables} + \expDist{B^\prime_{\domainVariables | \parameterVector}}{\statsProb(\parameterVector | \dataVariables)}\right]
$$

## Speculation on Naming

Split the statistical model mismatch into two terms that represent how 'correct' and how 'consistent' the statistical model is. 

Correctness represents the ability of the model to reconstruct the information about domain variables, $\domainVariables$, that is provided by the data, $\dataVariables$, through the parameters, $\parameterVector$, alone. The value of $B^\prime_{\domainVariables|\parameterVector}$ increases with incorrectness.

Consistency reflects how likely different data sets are to give us different parameters. It is the relative entropy (KL divergence) between the parameters of the statistical model, $\statsProb(\parameterVector|\dataVariables)$, and the physical model, $\physicsProb(\parameterVector|\dataVariables)$.

## Absence of the Bayesian Controversy

While these ideas are normally considered for thermodynamics, they can also be seen as an attempt to capture the full state of a physical system through probability. In a classical system, each of these states is the result of some deterministic (physical) relationship. 

The fundamentals of statistical mechanics were derived by physicists and chemists such as Maxwell, Boltzmann, Gibbs and Helmholtz. In chemistry, particle theories of matter were uncontroversial. In physics, however, Boltzmann struggled throughout his life to have his fellow physicists accept these ideas. It wasn't until Einstein formulated Brownian motion with a particle model [@Einstein-brownian05], introducing stochasticity to *differential equations* and giving predictions that were verified by Perrin [@Perrin-brownian10], that the particle view gained wider acceptance.

## Thermodynamic Relations

The view that Boltzmann was competing with was the positivist view of physics as a universe of energy, pushed by Mach and others. 

The relationship between energy and particles is given through conservation of energy of the microstates. The total energy of a system is given by the expected value of those energies under the probability distribution over states,
$$
U = \expDist{E(\stateVariables)}{\mathbb{P}(\stateVariables)}
$$
Note that this is related to the entropy of $\mathbb{P}(\stateVariables)$ as follows
$$
\begin{align*}
U = & \expDist{E(\stateVariables)}{\mathbb{P}(\stateVariables)} \\
= & -\frac{1}{\beta}\expDist{\log \mathbb{P}(\stateVariables)}{\mathbb{P}(\stateVariables)} - \frac{1}{\beta} \log Z  
\end{align*}
$$
where in statistical mechanics we denote 
$$
TS = -\frac{1}{\beta}\expDist{\log\mathbb{P}(\stateVariables)}{\mathbb{P}(\stateVariables)}
$$
where $T$ is the temperature of the system and $S$ is the thermodynamic entropy, which is given by the probabilistic entropy times Boltzmann's constant, $k_B$. For notational convenience the statistical mechanics inverse temperature is defined as $\beta = \frac{1}{T k_B}$, so $Tk_B = \frac{1}{\beta}$. 

So from the probabilistic definition of expected energy, we can see that the total energy is related to the entropy as follows:
$$
U = TS + A
$$
where 
$$
A = -\frac{1}{\beta} \log Z.
$$
In statistical mechanics this is known as the Helmholtz free energy. It is the amount of energy available for *work*, or *Arbeit* in Helmholtz's original German, thus the letter $A$ to denote it. 

This decomposition of the total energy is a fundamental equation in statistical mechanics. 

In thermodynamics, the state variables, $\stateVariables$ are often split into extensive and intensive variables. Extensive variables depend on the quantity of matter (total mass, volume) and intensive do not (temperature, density).  

## Machine Learning

In the world of machine learning, we are more interested in the separation between states that are observable and states that are unobservable. The statistical mechanical model we've described doesn't include measurements. Let's modify the energy function by adding in a new energy, $E(\dataVariables | \stateVariables)$, which represents the interaction between our measurements, $\dataVariables$, and the state space $\stateVariables$. 

Now the total energy is given by,
$$
U_2 = \expDist{E(\dataVariables | \stateVariables)}{\mathbb{P}(\stateVariables| \dataVariables)} + \expDist{E(\stateVariables)}{\mathbb{P}(\stateVariables| \dataVariables)}
$$
and the Boltzmann distribution for $\mathbb{P}(\stateVariables | \dataVariables)$ is given by
$$
\mathbb{P}(\stateVariables | \dataVariables) \propto \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right)
$$
where the $\dataVariables$ are values we observe in the system. 

This can be related to the entropy of the system given the measurements as follows, 
$$\begin{align*}
U_2 & = \expDist{E(\dataVariables |  \stateVariables)}{\mathbb{P}(\stateVariables |  \dataVariables)} + \expDist{E(\stateVariables)}{\mathbb{P}(\stateVariables | \dataVariables)} \\
& = -\frac{1}{\beta} \expDist{\log \mathbb{P}(\stateVariables | \dataVariables)}{\mathbb{P}(\stateVariables | \dataVariables)} - \frac{1}{\beta}\log Z_2.
\end{align*}$$
If we define the Helmholtz free energy to be,
$$
A_2 = -\frac{1}{\beta} \log  Z_2
$$
where 
$$
Z_2 = \int \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right) \text{d}\stateVariables \text{d}\dataVariables
$$
then we see that our new entropy term is
$$
TS_2 = -\frac{1}{\beta} \expDist{\log \mathbb{P}(\stateVariables | \dataVariables)}{\mathbb{P}(\stateVariables | \dataVariables)}
$$

This can be seen to be related to the original entropy as follows,
$$
\begin{align*}
TS_1 = & -\frac{1}{\beta} \expDist{\log \mathbb{P}(\stateVariables)}{\mathbb{P}(\stateVariables )} \\
= & -\frac{1}{\beta} \int \mathbb{P}(\dataVariables) \expDist{\log \mathbb{P}(\stateVariables | \dataVariables)}{\mathbb{P}(\stateVariables | \dataVariables)} \text{d}\dataVariables - \frac{1}{\beta} \expDist{\log \frac{\mathbb{P}(\stateVariables)}{\mathbb{P}(\stateVariables | \dataVariables)}}{\mathbb{P}(\stateVariables, \dataVariables)}\\
= & -\frac{1}{\beta} \int \mathbb{P}(\dataVariables) \expDist{\log \mathbb{P}(\stateVariables | \dataVariables)}{\mathbb{P}(\stateVariables | \dataVariables)} \text{d}\dataVariables + \frac{1}{\beta} \expDist{\log \frac{\mathbb{P}(\stateVariables , \dataVariables)}{\mathbb{P}(\stateVariables)\mathbb{P}(\dataVariables)}}{\mathbb{P}(\stateVariables, \dataVariables)} \\
= & T\expDist{S_2}{\mathbb{P}(\dataVariables)}  + T I
\end{align*}
$$
where $I$ is the mutual information between $\dataVariables$ and $\stateVariables$,
$$
I = k_B \expDist{\log \frac{\mathbb{P}(\stateVariables , \dataVariables)}{\mathbb{P}(\stateVariables)\mathbb{P}(\dataVariables)}}{\mathbb{P}(\stateVariables, \dataVariables)}. 
$$
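The entropy decomposition can be checked on a small joint distribution. The sketch below works in units where $k_B = 1$ (entropies in nats) and uses a hypothetical table `P` for $\mathbb{P}(\stateVariables, \dataVariables)$; none of the numbers come from the original notes.

```python
import math

# Hypothetical joint distribution P(x, y); rows index x, columns index y.
P = [[0.30, 0.10],
     [0.05, 0.25],
     [0.15, 0.15]]
n_x, n_y = len(P), len(P[0])

P_x = [sum(P[i][j] for j in range(n_y)) for i in range(n_x)]
P_y = [sum(P[i][j] for i in range(n_x)) for j in range(n_y)]

def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# S_1: entropy of the marginal over states.
S1 = entropy(P_x)

# <S_2>: entropy of P(x | y), averaged over P(y).
expected_S2 = sum(
    P_y[j] * entropy([P[i][j] / P_y[j] for i in range(n_x)])
    for j in range(n_y)
)

# I: mutual information between x and y.
I = sum(
    P[i][j] * math.log(P[i][j] / (P_x[i] * P_y[j]))
    for i in range(n_x) for j in range(n_y) if P[i][j] > 0
)

print(S1, expected_S2 + I)
```

The two printed values should match, and `I` should be non-negative, so the expected conditional entropy never exceeds the marginal entropy.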

This also allows us to look at
$$
\begin{align*}
\expDist{U_2}{\mathbb{P}(\dataVariables)} = & \expDist{A_2}{\mathbb{P}(\dataVariables)} +  T\expDist{S_2}{\mathbb{P}(\dataVariables)} \\
= & \expDist{A_2}{\mathbb{P}(\dataVariables)} + TS_1 - TI.
\end{align*}
$$

So we have 
$$
\expDist{U_2}{\mathbb{P}(\dataVariables)}- U_1    = \expDist{A_2}{\mathbb{P}(\dataVariables)}  - A_1 - TI
$$

If the measurements in the ML system do not disturb the marginal distribution, so that $\mathbb{P}_2(\stateVariables) = \mathbb{P}_1(\stateVariables)$ then we have 
$$
\expDist{U_2}{\mathbb{P}(\dataVariables)} = U_1 + \expDist{E(\dataVariables|\stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)}
$$
and so
$$
\expDist{E(\dataVariables|\stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)}  = \expDist{A_2}{\mathbb{P}(\dataVariables)}  - A_1 - TI,
$$
which must be greater than or equal to zero, so
$$
\expDist{A_2}{\mathbb{P}(\dataVariables)} \geq A_1 + TI
$$
so the expected free energy in the machine learning system is at least as large as in the original system, with the gain bounded below by $TI$.

We also note that
$$
TS_1  \geq T\expDist{S_2}{\mathbb{P}(\dataVariables)}
$$
with equality occurring when the mutual information is zero, because $TS_1 = T\expDist{S_2}{\mathbb{P}(\dataVariables)} + TI$ and $I \geq 0$. So we are guaranteed to have reduced entropy in the machine learning system as long as the mutual information between $\dataVariables$ and $\stateVariables$ is greater than zero, in other words, as long as $\dataVariables$ and $\stateVariables$ are not *independent*. 

Another special case would be when mutual information is maximized, which would occur if we measure every state variable. In this case we (check this) **would expect all entropy to be removed and the total energy to equal the free energy**.

## Maximizing the Free Energy

The sensible strategy for the machine learning scientist is to maximize the free energy,
$$
A_2 = -\frac{1}{\beta} \log \int \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right) \text{d}\stateVariables \text{d}\dataVariables.
$$
We can introduce an approximating distribution, $p(\dataVariables, \stateVariables)$, that represents our best understanding of how things interact. This allows us to decompose the free energy as
$$
\begin{align*}
-\frac{1}{\beta} \log \int \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right) \text{d}\stateVariables \text{d}\dataVariables = & -\frac{1}{\beta} \int p(\dataVariables, \stateVariables) \log \frac{\exp\left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right)}{p(\dataVariables, \stateVariables)}\text{d}\stateVariables \text{d}\dataVariables \\
& +\frac{1}{\beta} \int p(\dataVariables, \stateVariables) \log \frac{\mathbb{P}(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}\text{d}\stateVariables \text{d}\dataVariables. \\
\end{align*}
$$
or for a given observation of $\dataVariables$ we have,
$$
\begin{align*}
-\frac{1}{\beta} \log \int \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right) \text{d}\stateVariables  = & -\frac{1}{\beta} \int p(\stateVariables| \dataVariables) \log \frac{\exp\left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right)}{p( \stateVariables| \dataVariables)}\text{d}\stateVariables  \\
& +\frac{1}{\beta} \int p(\stateVariables |\dataVariables) \log \frac{\mathbb{P}(\stateVariables| \dataVariables)}{p(\stateVariables | \dataVariables)}\text{d}\stateVariables. \\
\end{align*}
$$
Expanding the logarithm in the first term of the joint decomposition gives
$$
\begin{align*}
A_2 = & \expDist{E(\stateVariables) +E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)} \\ 
& + \frac{1}{\beta} \expDist{\log p(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)} \\
& +\frac{1}{\beta} \int p(\dataVariables, \stateVariables) \log \frac{\mathbb{P}(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}\text{d}\stateVariables \text{d}\dataVariables.
\end{align*}
$$
reordering as
$$
\begin{align*}
A_2 + \frac{1}{\beta} \expDist{\log \frac{p(\dataVariables, \stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)}}{p(\dataVariables, \stateVariables) } = & \expDist{E(\stateVariables) +E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)} 
+ \frac{1}{\beta} \expDist{\log p(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}.
\end{align*}
$$
We recognise that the left-hand side is the free energy plus a Kullback-Leibler (KL) divergence between our approximating distribution, $p(\dataVariables, \stateVariables)$, and the true distribution, $\mathbb{P}(\dataVariables, \stateVariables)$. We introduce an apparent free energy,
$$
\begin{align*}
\hat{A}_2 = & A_2 + \frac{1}{\beta} \expDist{\log \frac{p(\dataVariables, \stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)}}{p(\dataVariables, \stateVariables)} \\
= & A_2 + T \text{KL}\left(p(\dataVariables, \stateVariables) || \mathbb{P}(\dataVariables, \stateVariables)\right)
\end{align*}
$$
which, because the KL divergence is greater than or equal to zero, shows that the apparent free energy is an *upper bound* on the actual free energy. The tightness of the bound improves as our approximation, $p(\dataVariables, \stateVariables)$, approaches the truth, $\mathbb{P}(\dataVariables, \stateVariables)$.

The other two terms can be seen to be equivalent to the total energy, $U_2$, and the entropy, $S_2$, but again under the assumed distribution, $p(\dataVariables, \stateVariables)$, so we have
$$
\hat{U}_2 = \expDist{E(\stateVariables) +E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)}
$$
and
$$
T\hat{S}_2 = - \frac{1}{\beta} \expDist{\log p(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}
$$
This allows us to write down a new assumed free energy relationship.
$$
\hat{A}_2 = \hat{U}_2 - T\hat{S}_2.
$$
This is the world that, as modellers, we play with. The total energy term and the entropy term are now given by our *assumed* probability distribution for the world, $p(\dataVariables, \stateVariables)$, or in other words by the model we use. That model is an approximation to the truth, $\mathbb{P}(\dataVariables, \stateVariables)$. However, rather insidiously, the worse our approximation to the truth, the greater the apparent free energy appears to be. Because the apparent free energy is an upper bound on the true free energy, we have to be very careful about model fitting,
$$
A_2 = \hat{A}_2 - T \text{KL}\left(p(\dataVariables, \stateVariables) || \mathbb{P}(\dataVariables, \stateVariables)\right).
$$
Normally we would be looking to maximize the free energy, which minimizes the entropy. But here we have to be careful about direct maximization, because a high *apparent* free energy can be achieved with a poor model.
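The bound can be illustrated on a toy problem. The following is a minimal sketch, assuming a discrete system with four joint states, hypothetical energies and inverse temperature, and units where $k_B = 1$ so that $T = 1/\beta$. It compares the true free energy with the apparent free energy under three hypothetical models.

```python
import numpy as np

beta = 2.0                                   # hypothetical inverse temperature
energy = np.array([0.0, 0.5, 1.0, 2.0])      # hypothetical energy of each joint state

# True Boltzmann distribution and true free energy A_2 = -(1/beta) log Z.
weights = np.exp(-beta * energy)
Z = weights.sum()
P_true = weights / Z
A_true = -np.log(Z) / beta

def apparent_free_energy(p):
    """A_hat = <E>_p + (1/beta) <log p>_p, i.e. U_hat - T S_hat."""
    return np.sum(p * energy) + np.sum(p * np.log(p)) / beta

def kl(p, q):
    """KL(p || q) in nats for strictly positive distributions."""
    return np.sum(p * np.log(p / q))

# Three hypothetical models of increasing mismatch with the truth.
models = {
    "exact": P_true,
    "close": 0.9 * P_true + 0.1 * np.full(4, 0.25),
    "poor": np.array([0.05, 0.05, 0.10, 0.80]),
}

for name, p in models.items():
    A_hat = apparent_free_energy(p)
    gap = kl(p, P_true) / beta
    print(f"{name:>5}: A_hat = {A_hat:.4f}, A + KL/beta = {A_true + gap:.4f}")
```

In this sketch the apparent free energy matches the true value only for the exact model, and the poorest model reports the largest apparent free energy, which is the trap described above.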

Note also that traditional approaches to probabilistic model fitting are often justified by minimization of the KL divergence. But that KL divergence is the *other way around*. Classical maximum likelihood minimizes $T \text{KL}\left(\mathbb{P}(\dataVariables, \stateVariables) || p(\dataVariables, \stateVariables)\right)$. That is, expectations are taken under the true distribution $\mathbb{P}(\dataVariables, \stateVariables)$.
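The direction of the KL divergence matters in practice. The following is a minimal sketch, assuming a hypothetical bimodal "true" distribution over four states and two hypothetical models, showing that the two directions can rank the same models differently.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical "true" distribution with two well-separated modes.
P_true = np.array([0.49, 0.01, 0.01, 0.49])

# Two hypothetical models: one covers both modes, one commits to a single mode.
p_broad = np.array([0.25, 0.25, 0.25, 0.25])
p_single = np.array([0.94, 0.02, 0.02, 0.02])

for name, p in [("broad", p_broad), ("single-mode", p_single)]:
    print(f"{name:>11}: KL(p || P) = {kl(p, P_true):.3f}, "
          f"KL(P || p) = {kl(P_true, p):.3f}")
```

With these numbers the divergence that appears in the apparent free energy, $\text{KL}(p || \mathbb{P})$, prefers the single-mode model, while the maximum likelihood direction, $\text{KL}(\mathbb{P} || p)$, prefers the broad model.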

So far, everything we've written about concerns *equilibrium thermodynamics*. The above laws apply once we have stationary distributions. In theory, that might take infinite time to occur. If we only consider equilibrium thermodynamics, we lose the element of time. Time is just as important as free energy as a resource to preserve. When building an intelligent system we are very often faced with a time horizon. Algorithms that require us to compute for many billions of years to get their answers are not of much use.

Factorizing the approximating distribution as $p(\dataVariables, \stateVariables) = p(\dataVariables)p(\stateVariables | \dataVariables)$, the decomposition of the free energy can also be written as
$$
\begin{align*}
-\frac{1}{\beta} \log \int \exp \left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right) \text{d}\stateVariables \text{d}\dataVariables = & -\frac{1}{\beta} \int p(\dataVariables)p(\stateVariables | \dataVariables) \log \frac{\exp\left(-\beta E(\stateVariables) -\beta E(\dataVariables | \stateVariables)\right)}{p(\stateVariables|\dataVariables) }\text{d}\stateVariables \text{d}\dataVariables  \\
& +\frac{1}{\beta} \int p(\dataVariables) p(\stateVariables|\dataVariables) \log \frac{\mathbb{P}( \stateVariables| \dataVariables)}{p( \stateVariables | \dataVariables)}\text{d}\stateVariables \text{d}\dataVariables + \frac{1}{\beta} \int p(\dataVariables)  \log  \mathbb{P}(  \dataVariables) \text{d}\dataVariables \\
= & -\frac{1}{\beta} \int p(\dataVariables)p(\stateVariables | \dataVariables) \log \frac{\exp\left( -\beta E(\dataVariables | \stateVariables)\right)}{p(\stateVariables|\dataVariables) }\text{d}\stateVariables \text{d}\dataVariables + \int p(\dataVariables) p(\stateVariables|\dataVariables) E(\stateVariables) \text{d}\stateVariables \text{d}\dataVariables \\
& +\frac{1}{\beta} \int p(\dataVariables) p(\stateVariables|\dataVariables) \log \frac{\mathbb{P}( \stateVariables| \dataVariables)}{p( \stateVariables | \dataVariables)}\text{d}\stateVariables \text{d}\dataVariables + \frac{1}{\beta} \int p(\dataVariables)  \log  \mathbb{P}(  \dataVariables) \text{d}\dataVariables.
\end{align*}
$$

## Maximizing the Available Energy

The apparent available energy is given by
$$
\hat{A}_2 =  \expDist{E(\stateVariables) +E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)} + \frac{1}{\beta} \expDist{\log p(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}
$$
which differs from the true available energy by a KL divergence that represents the *physical plausibility* of the model, $T \text{KL}\left(p(\dataVariables, \stateVariables) || \mathbb{P}(\dataVariables, \stateVariables)\right)$. 

### Maximising the Apparent Total Energy

Our first idea could be to maximise the apparent total energy, $\hat{U}_2$. This would require us to have an estimate of what that total energy is so we would have, 
$$
\hat{E}(\dataVariables, \stateVariables)  \approx E(\stateVariables) + E(\dataVariables | \stateVariables).
$$
Or more precisely an estimate of the expectations under our joint distribution.
$$
\expDist{\hat{E}(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)} \approx \expDist{E(\stateVariables) + E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)}.
$$
Maximising the expected total energy could be done through proxies which relate those energies to actual costs in the real world, so we can see the approximation $\expDist{\hat{E}(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)}$ as an objective to be maximised, or its negative as a cost to be minimized.

Minimising expected costs is reminiscent of operational research, where the (often monetary) costs of a process are enumerated, and optimization algorithms are used to find configurations that minimize expected cost. Optimising the total energy in this way is optimal when there is no uncertainty, i.e. when 
$$
\expDist{\hat{E}(\dataVariables, \stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)} =  \hat{E}\left(\expDist{\dataVariables}{\mathbb{P}(\dataVariables)}, \expDist{\stateVariables}{\mathbb{P}(\stateVariables)}\right) .
$$
In this case the entropy term can be ignored and the apparent available energy is equal to the total energy. Our estimate of the apparent available energy would then differ from the true available energy by the following residual,
$$
E(\langle\stateVariables\rangle) + E(\langle\dataVariables\rangle | \langle\stateVariables\rangle) - \hat{E}(\langle\dataVariables\rangle, \langle\stateVariables\rangle).
$$
But for our focus, when there is uncertainty in the system, we need to consider the entropy term, $T\hat{S}_2$. 
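The no-uncertainty condition can be illustrated with a small numerical check. The following is a minimal sketch, assuming a hypothetical quadratic energy estimate and two hypothetical distributions over a scalar state. The expected energy equals the energy at the expected state only when all probability mass sits on one state.

```python
import numpy as np

def E_hat(x):
    """Hypothetical quadratic energy estimate, chosen only for illustration."""
    return x ** 2

values = np.array([-1.0, 0.0, 1.0, 2.0])     # hypothetical state values

certain = np.array([0.0, 0.0, 1.0, 0.0])      # all mass on a single state
uncertain = np.array([0.1, 0.2, 0.4, 0.3])    # mass spread across states

def compare(p):
    """Return (<E_hat(x)>, E_hat(<x>)) under the distribution p."""
    expected_energy = np.sum(p * E_hat(values))
    energy_at_mean = E_hat(np.sum(p * values))
    return expected_energy, energy_at_mean

for name, p in [("certain", certain), ("uncertain", uncertain)]:
    expected_energy, energy_at_mean = compare(p)
    print(f"{name:>9}: <E_hat> = {expected_energy:.3f}, "
          f"E_hat(<x>) = {energy_at_mean:.3f}")
```

For the spread-out distribution the two quantities disagree, which is why the entropy term cannot be dropped once there is uncertainty.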



### Maximum Entropy Distribution

Maximising the apparent total energy is the most optimistic we can be: we assume no uncertainty. The most pessimistic we can be is to consider the situation where the entropy, $\hat{S}_2$, is *maximized*. These two approaches are the extremes.

The maximum entropy principle [@Jaynes-] allows us to find a form for $p(\dataVariables, \stateVariables)$ that maximises $T\hat{S}_2$ while ensuring that the distribution's moments match some known values. Here the ideal constraint on moments would be,
$$
\expDist{E(\dataVariables | \stateVariables)}{p(\dataVariables, \stateVariables)} = \expDist{E(\dataVariables | \stateVariables)}{\mathbb{P}(\dataVariables, \stateVariables)}.
$$
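This constraint would recover $\mathbb{P}(\dataVariables, \stateVariables)$, reflecting a world where we had full knowledge of the physics. In practice, as we suggested for the maximisation of the apparent total energy, we may only have access to an estimate of the energy, $\hat{E}(\dataVariables, \stateVariables)$, in which case the maximum entropy distribution takes the form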
$$
p(\dataVariables, \stateVariables) \propto \exp(\lambda \hat{E}(\dataVariables, \stateVariables))
$$
with a partition function of the form
$$
\hat{Z}_2 = \int \exp(\lambda \hat{E}(\dataVariables , \stateVariables)) \text{d} \dataVariables \text{d} \stateVariables.
$$
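Substituting $\log p(\dataVariables, \stateVariables) = \lambda \hat{E}(\dataVariables, \stateVariables) - \log \hat{Z}_2$ into $\hat{A}_2 = \hat{U}_2 - T\hat{S}_2$ gives
$$
\hat{A}_2 = \expDist{E(\stateVariables) +E(\dataVariables | \stateVariables) + \frac{\lambda}{\beta}\hat{E}(\dataVariables, \stateVariables)}{p(\dataVariables, \stateVariables)} - \frac{1}{\beta} \log \hat{Z}_2,
$$
which reflects how close our model is to the physical reality. For instance, substituting $\lambda = -\beta$ and $\hat{E}(\dataVariables, \stateVariables) = E(\stateVariables) + E(\dataVariables | \stateVariables)$ makes the expectation vanish and recovers the true free energy, $\hat{A}_2 = A_2$.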
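To make the maximum entropy construction concrete, the following is a minimal sketch, assuming a discrete state space with hypothetical energy estimates `E_hat` and a hypothetical target value for the moment. It finds the multiplier $\lambda$ by bisection, which is one simple choice of solver rather than anything prescribed by the text, and compares the entropy of the result with another distribution that satisfies the same constraint.

```python
import numpy as np

E_hat = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical energy estimate per state
target = 1.2                              # hypothetical value of the moment <E_hat>

def maxent(lam):
    """Distribution proportional to exp(lam * E_hat)."""
    logits = lam * E_hat
    w = np.exp(logits - logits.max())     # subtract the max for numerical stability
    return w / w.sum()

def moment(lam):
    """Expected value of E_hat under the exponential-family distribution."""
    return np.sum(maxent(lam) * E_hat)

# <E_hat> increases monotonically with lam, so bisection finds the multiplier.
lo, hi = -20.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if moment(mid) < target:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
p = maxent(lam)

def entropy(q):
    """Shannon entropy in nats for a strictly positive distribution."""
    return -np.sum(q * np.log(q))

# Another hypothetical distribution with the same moment, for comparison.
other = np.array([0.4, 0.2, 0.2, 0.2])

print(f"lambda = {lam:.4f}, <E_hat> = {moment(lam):.4f}")
print(f"entropy maxent = {entropy(p):.4f}, entropy other = {entropy(other):.4f}")
```

Both distributions satisfy the moment constraint, but the exponential-family solution has the larger entropy, as the maximum entropy principle requires.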

## Recap

With the different modelling steps we have taken so far, we can relate the original total energy of the system to our new system as follows,
$$
\hat{A}_2 = U_1 + T I(\dataVariables, \stateVariables) + T\text{KL}\left(p(\dataVariables, \stateVariables) || \mathbb{P}(\dataVariables, \stateVariables)\right) - TS_1 - TS_2 - T\hat{S}_2.
$$
where we are gaining energy from two sources. The first is the information gain from $\dataVariables$, which is a genuine energy gain. More problematic is the apparent gain from the KL divergence. This KL divergence represents the physical plausibility of our model, $p(\dataVariables, \stateVariables)$. It is an illusory gain, because the less physical our model, the more information we seem to gain. We can immediately see how easy it is to fool ourselves by building models that don't correspond to the physical reality of our world.

We are losing energy due to uncertainty. First of all, we lose energy that corresponds to our ignorance about the phase space, $\stateVariables$. Then we lose energy that conforms to our ignorance 

We can also see the difference between $\hat{A}_2$
Laplace's demon can be seen as relating the total energy to the available energy; these two terms relate to 

## Computational Approximations

Firstly, let's deal with computational approximations. 

Introduce an auxiliary variable to the system, $\parameterVector$. Assume we can decompose,
$$
p(\stateVariables, \dataVariables) = \int p(\dataVariables | \stateVariables) p(\stateVariables | \parameterVector) p(\parameterVector) \text{d} \parameterVector
$$


$$
\hat{A}_2 = \expDist{E(\dataVariables , \stateVariables) + \frac{\lambda}{\beta}\hat{E}(\dataVariables, \stateVariables)}{p(\stateVariables | \dataVariables)} - \frac{1}{\beta} \log \hat{Z}_2.
$$
$$
\begin{align*}
\log \hat{Z}_2 = & \log \int \exp\left(\lambda\hat{E}(\dataVariables, \stateVariables)\right) \text{d}\stateVariables\\
= & \int q(\stateVariables, \parameterVector) \log \frac{\exp\left(\lambda\hat{E}(\dataVariables, \stateVariables)\right)p(\stateVariables|\parameterVector) p(\parameterVector)}{q(\stateVariables, \parameterVector)} \text{d}\stateVariables \text{d}\parameterVector+ \int q(\stateVariables, \parameterVector) \log \frac{q(\stateVariables, \parameterVector)p(\dataVariables)}{p(\dataVariables|\stateVariables)p(\stateVariables | \parameterVector)p(\parameterVector)} \text{d}\stateVariables \text{d}\parameterVector
\end{align*}
$$

$$
\begin{align*}
\log \hat{Z}_2 = & \log \int \exp\left(\lambda\hat{E}(\dataVariables, \stateVariables)\right) \text{d}\stateVariables\\
= & \int q(\parameterVector) \log \frac{\exp\left(\lambda\hat{E}(\dataVariables, \expDist{\stateVariables}{p(\stateVariables|\parameterVector)})\right)p(\parameterVector)}{q(\parameterVector)} \text{d}\parameterVector+ \int q(\stateVariables, \parameterVector) \log \frac{q(\stateVariables, \parameterVector)p(\dataVariables)}{p(\dataVariables|\stateVariables)p(\stateVariables | \parameterVector)p(\parameterVector)} \text{d}\stateVariables \text{d}\parameterVector
\end{align*}
$$

## Non Equilibrium Thermodynamics


### Crooks Fluctuation Theorem

The Crooks fluctuation theorem (CFT) [@Crooks-fluctuation99] is an equation in statistical mechanics that relates the work done on a system during a non-equilibrium transformation to the free energy difference between the final and the initial state of the transformation, even when the system has not reached equilibrium. During the non-equilibrium transformation the system is at constant volume and in contact with a heat reservoir. The CFT is named after the chemist Gavin E. Crooks (then at University of California).
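For reference, the theorem is usually stated in the following form, written here with the symbol $A$ that these notes use for the free energy. The symbols $P_F$, $P_R$, $W$ and $\Delta A$ are notation assumed for this sketch and do not appear in the text above.
$$
\frac{P_F(W)}{P_R(-W)} = \exp\left(\beta (W - \Delta A)\right),
$$
where $P_F(W)$ is the probability that work $W$ is done on the system during the forward transformation, $P_R(-W)$ is the probability of work $-W$ during the time-reversed transformation, and $\Delta A$ is the free energy of the final state minus that of the initial state. Multiplying both sides by $P_R(-W)\exp(-\beta W)$ and integrating over $W$ gives the Jarzynski equality,
$$
\expDist{\exp(-\beta W)}{P_F(W)} = \exp(-\beta \Delta A).
$$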