# Lecture 3: Fitting functions

* Noisy measurements: free energy derivatives
* Kernel fitting
* Gaussian process
* Free energy surface reconstruction

In [None]:
include("mdlecturesrc.jl")

## Hydrogen bonding in the water dimer

The free energy tells us the probability of occupying a given macroscopic or "mesoscopic" state. For example, as a function of the O-O distance $r$ (up to an additive constant, since the integral below is not normalised),

$$ 
A(r) = -kT \ln P(r) = -kT \ln \int dq \,\delta(r-r_\mathrm{OO}(q)) \, e^{-V(q)/kT}
$$

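Fix one O atom at the origin and let $r$ be the $x$ coordinate of the other O atom. For a simple coordinate like this, the derivative of the free energy is the average of the potential gradient along $r$, taken with $r_\mathrm{OO}$ held fixed at $r$,

$$
\nabla A(r) = \langle \nabla V \rangle_{r=r_\mathrm{OO}}
$$

The mean force on the constrained atom is therefore $-\nabla A(r)$, so integrating minus the mean force over $r$ recovers $A(r)$.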

In [None]:
imolecule_draw(make_h4o2(optim=true))

In [None]:
r, e = water_dimer_dissoc()
    
figure()
plot(r, e, "k-");
xlabel(L"r_\mathrm{OO}"); ylabel("Energy"); title("Water dimer TIP3P")
for rc in [3,3.5,4,4.5,5,5.5,6]
    plot([rc,rc], [-0.29,0.29], "k--")
end
close(gcf())

In [None]:
f = water_dimer_dynamics(;Temp=100.0, Nsteps=1000, Nsave=0,
        fix_axes=true,constrainO2=true, OOdistance=3.0)
;

In [None]:
figure()
plot(f)
xlabel("iteration"); ylabel("force on constrained O"); title("constrained water dimer T=100 K")
close(gcf())


figure()
plt[:hist](f,20)
xlabel("force value"); ylabel("count")
close(gcf())

In [None]:
mean(f)

In [None]:

mean_err_corr(f; acorr_limit=500)
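
Successive force samples from a trajectory are correlated, so the naive standard error $\sigma/\sqrt{n}$ is too small. The implementation of `mean_err_corr` lives in `mdlecturesrc.jl` and is not shown here; the sketch below is one common way to do this kind of correction (via the integrated autocorrelation time), written for current Julia. The name `mean_err_autocorr` and the truncation rule are assumptions made for illustration, not the lecture's code.

```julia
using Statistics

# Hypothetical helper (not the lecture's mean_err_corr): mean and standard error
# of a correlated time series, using the integrated autocorrelation time.
function mean_err_autocorr(x::AbstractVector{<:Real}; acorr_limit::Int=100)
    n = length(x)
    m = mean(x)
    dx = x .- m
    c0 = sum(abs2, dx) / n                 # lag-0 autocovariance (variance)
    c0 == 0 && return m, 0.0
    tau = 0.5                              # integrated autocorrelation time, in steps
    for t in 1:min(acorr_limit, n - 1)
        ct = sum(dx[1:n-t] .* dx[1+t:n]) / (n - t)   # lag-t autocovariance
        ct <= 0 && break                   # truncate the sum at the first non-positive lag
        tau += ct / c0
    end
    neff = n / (2tau)                      # effective number of independent samples
    return m, sqrt(c0 / neff)
end

# Toy check: a series that changes sign only every 10 steps is strongly correlated
blocky = repeat([1.0, -1.0], inner=10, outer=10)
m_blocky, e_blocky = mean_err_autocorr(blocky)
println("mean = ", m_blocky, ", corrected error = ", e_blocky,
        ", naive error = ", std(blocky) / sqrt(length(blocky)))
```

For uncorrelated data the loop stops immediately, `tau` stays at 0.5 and the estimate reduces to the usual $\sigma/\sqrt{n}$.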

In [None]:
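# Constrained dynamics on a grid of O-O distances; collect the mean force and its error.
# Typed arrays keep the later linear algebra on Float64 rather than Any.
rOO = Float64[]
fmean = Float64[]
ferr = Float64[]
for r in (2.6:0.2:6.0)
    f = water_dimer_dynamics(;Temp=200.0, Nsteps=1000, Nsave=0, fix_axes=true,
                constrainO2=true, OOdistance=r)
    # discard the first 499 steps as equilibration before averaging
    fm,fe = mean_err_corr(f[500:end]; acorr_limit=20)
    push!(fmean, fm)
    push!(ferr, fe)
    push!(rOO, r)
    println("r=$r done")
end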

In [None]:
figure()
errorbar(rOO, fmean, ferr)
xlabel(L"r_\mathrm{OO}"); ylabel("f")
close(gcf())

In [None]:
figure()
# A(r) = -∫ f dr: rectangle-rule integral of minus the mean force (grid spacing 0.2)
a = cumsum(-fmean)*0.2
plot(rOO, a .- minimum(a), "bo-")
xlabel(L"r_\mathrm{OO}"); ylabel("A(r)")
close(gcf())
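
The `cumsum` used to integrate the mean force is a rectangle rule, whose error is first order in the grid spacing. As a hedged alternative, here is a minimal cumulative trapezoid rule in current Julia; `cumtrapz` is a hypothetical helper name, and the quadratic toy free energy stands in for the simulation data so that the exact answer is known.

```julia
# Hypothetical helper: cumulative trapezoidal integral of samples y on the grid x,
# with the integral set to zero at x[1].
function cumtrapz(x::AbstractVector, y::AbstractVector)
    @assert length(x) == length(y)
    out = zeros(length(x))
    for i in 2:length(x)
        out[i] = out[i-1] + 0.5 * (y[i] + y[i-1]) * (x[i] - x[i-1])
    end
    return out
end

# Toy free energy A(r) = (r - 3)^2, whose "force" is f(r) = -dA/dr = -2(r - 3)
rgrid = collect(2.6:0.2:6.0)
ftoy = -2 .* (rgrid .- 3)

Atrap = cumtrapz(rgrid, -ftoy)        # trapezoid rule: A(r) - A(r[1])
Arect = cumsum(-ftoy) .* 0.2          # rectangle rule, as in the cell above
Aexact = (rgrid .- 3) .^ 2 .- (rgrid[1] - 3)^2

println("max trapezoid error: ", maximum(abs.(Atrap .- Aexact)))
println("max rectangle error: ", maximum(abs.((Arect .- Arect[1]) .- Aexact)))
```

The trapezoid rule is exact for this linear integrand; for the noisy simulation forces the statistical error will usually dominate over the choice of quadrature.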

## Function fitting with kernels in 1 dimension

Suppose we want to fit the function $f(x)$, and we have $N$ observations $y_i$ at locations $x_i$.

We write the ansatz 

$$
f(x) = \sum_{i=1}^N \alpha_i k(x_i, x)
$$

where $k$ is a _kernel_, which we use to construct basis functions centred on the data points, e.g. the squared exponential

$$
k(x, x') = \sigma^2_w e^{-|x-x'|^2/2\sigma^2}
$$

Because the ansatz is linear in its parameters $\alpha_i$, finding them is easy: we substitute the data,

$$
y_j = \sum_{i=1}^N \alpha_i k(x_i, x_j)
$$

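This is a linear system for the $\alpha_i$, but the kernel matrix is typically very ill-conditioned, and solving it exactly would force the fit through every noisy data point. We therefore _regularise_ it by adding a constant to the diagonal,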

$$
y_j = \sum_{i=1}^N \alpha_i \left(k(x_i, x_j) + \sigma^2_\nu \delta_{ij}\right)
$$
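
To see why the diagonal term matters, one can compare the condition number of the bare kernel matrix with that of the regularised one. This is a small sketch in current Julia under stated assumptions: the grid mimics the spacing of the force data, `sqexp` is a stand-in for the squared-exponential kernel, and the parameter values are illustrative.

```julia
using LinearAlgebra

# Squared-exponential kernel; sw2 and sigma are illustrative values
sqexp(x1, x2; sw2=0.3, sigma=0.5) = sw2 * exp(-(x1 - x2)^2 / (2 * sigma^2))

xs = collect(2.6:0.2:6.0)                      # grid with the spacing of the force data
Kmat = [sqexp(xi, xj) for xi in xs, xj in xs]  # bare kernel matrix

nu2 = 0.01                                     # regulariser sigma_nu^2
cond_K = cond(Kmat)                            # very large: nearly singular
cond_C = cond(Kmat + nu2 * I)                  # at most (lambda_max + nu2) / nu2

println("cond(K) = ", cond_K)
println("cond(K + nu2*I) = ", cond_C)
```

Adding $\sigma^2_\nu$ to the diagonal lifts every eigenvalue by the same amount, which bounds the condition number regardless of how close together the data points are.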

In matrix form, with $\mathbf{a}$ the vector of coefficients $\alpha_i$, this reads

$$
\begin{eqnarray}
\mathbf{y} &=& (\mathbf{K} + \sigma^2_\nu \mathbf{I})\mathbf{a}\qquad \textrm{with }[\mathbf{K}]_{ij} = k(x_i,x_j)\\
\Rightarrow \mathbf{a}&=& \mathbf{C}^{-1} \mathbf{y}\qquad\qquad\textrm{with }\mathbf{C} = \mathbf{K}+\sigma^2_\nu \mathbf{I} 
\end{eqnarray}
$$

So the final form of the fitted function is

$$
f(x) = \mathbf{k}(x)^T \mathbf{C}^{-1} \mathbf{y}\qquad \textrm{with } [\mathbf{k}(x)]_i = k(x_i,x)
$$

### Let's try it

In [None]:
const sw2=0.3   # kernel prefactor sigma_w^2
# squared-exponential kernel with length scale sigma = 0.5
se = ( (x1,x2) -> sw2*exp(-(x1-x2)^2/(2.0*0.5^2)) )

In [None]:
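N = length(rOO)
# kernel matrix K[i,j] = k(rOO[i], rOO[j]) between the data locations
K = zeros(N,N)
for i=1:N, j=1:N
    K[i,j] = se(rOO[i], rOO[j])
end

# evaluate the kernel fit at the points x, with regulariser nu2 = sigma_nu^2
function fit(x, nu2)
    CinvY = inv(K + diagm(nu2*ones(N))) * fmean   # coefficients C^{-1} y
    k = zeros(N)
    f = zeros(length(x))
    for i=1:length(x)
        # k[j] = k(rOO[j], x[i]): kernel between each data point and the target
        for j=1:length(k)
            k[j] = se(rOO[j], x[i])
        end
        f[i] = k ⋅ CinvY
    end
    f
end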

In [None]:
figure()

errorbar(rOO, fmean, ferr)
r = 2.2:0.01:8.0
plot(r, fit(r, 0.001), label=L"\sigma^2_\nu = 0.001")
plot(r, fit(r, 0.01), label=L"\sigma^2_\nu = 0.01")
plot(r, fit(r, 0.1), label=L"\sigma^2_\nu = 0.1")
legend(); xlabel(L"r_\mathrm{OO}");ylabel("f")
close(gcf())

In [None]:
CinvY = inv(K + diagm(0.01*ones(N))) * fmean # list of coefficients of the basis functions

figure()
errorbar(rOO, fmean, ferr)
r = 2.4:0.01:8
for i = 1:length(rOO)
    plot(r, 0.1*CinvY[i]*map(x->se(x, rOO[i]), r), "k-")
end
plot(r, fit(r, 0.01), "r-")
axis([2, 8, -0.5, 0.5])
xlabel(L"r_\mathrm{OO}");ylabel("f")
close(gcf())

In [None]:
figure()
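# integrate minus the mean force directly on the coarse grid (spacing 0.2) ...
f1 = cumsum(-fmean)*0.2
plot(rOO, f1 .- minimum(f1), "bo-", label=L"\mathrm{cumsum}(\bar f)")
r = 2.2:0.01:8
# ... and integrate the kernel fit on a fine grid (spacing 0.01)
f2 = cumsum(-fit(r, 0.01))*0.01
plot(r, f2 .- minimum(f2[1:100]), "r-", label=L"\mathrm{cumsum}(\mathrm{GP})")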
legend(loc=4)
xlabel(L"r_\mathrm{OO}"); ylabel("A(r)")
close(gcf())

* How do we set the correct $\sigma_\nu$?
* How do we get error bars on the interpolation?
* When we integrate the interpolated force, how do we propagate error bars?
* How do we generalise to multiple dimensions (particularly the integration)?

## Connection to Gaussian processes

We write $f$ again as a linear combination of $H$ basis functions $\{\phi_h(x)\}$, but now with weights $w_h$ which are random variables,

$$
f(x) = \sum_{h=1}^H w_h \phi_h(x)
$$

Substituting in the data points,

$$
f(x_i) = \sum_h R_{ih} w_h\qquad \textrm{with } R_{ih} = \phi_h(x_i)
$$

Let us take as our prior a multivariate Gaussian distribution for the coefficients,

$$
P(\mathbf{w}) = N(0, \sigma^2_w \mathbf{I})
$$

Since the $f(x_i)$ values are linear combinations of the $w$ values, their distribution is also normal,

$$
P(\mathbf{f}) = N(0, \sigma^2_w \mathbf{RR}^T)
$$

The observations $\mathbf{y}$ differ from $\mathbf{f}$ by noise $\varepsilon$ which we also take to be Gaussian,

$$
\mathbf{y} = \mathbf{f} + \varepsilon\qquad P(\varepsilon) = N(0, \sigma^2_\nu \mathbf{I})
$$

$$
P(\mathbf{y}) = N(0, \sigma^2_w \mathbf{RR}^T + \sigma^2_\nu\mathbf{I})
$$

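Note that the basis functions $\phi$ enter only through $\mathbf{RR}^T$, i.e. through the covariance of the function values at the data locations. Taking the $\phi$ to be Gaussians, letting $H\rightarrow \infty$ and absorbing constant factors into $\sigma_w$ and $\sigma$, we get

$$
\textrm{Cov}[f(x_i), f(x_j)] = \sigma^2_w [\mathbf{RR}^T]_{ij} \rightarrow \sigma^2_w e^{-|x_i-x_j|^2/2\sigma^2}
$$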

And similarly, 

$$
\textrm{Cov}[y_i, y_j] =  \sigma^2_w e^{-|x_i-x_j|^2/2\sigma^2} + \sigma^2_\nu \delta_{ij} \equiv C(x_i, x_j) \equiv [\mathbf{C}_N]_{ij}
$$


To predict the next observation, $y_{N+1}$ at $x_{N+1}$, after $N$ observations we consider the conditional

$$
P(y_{N+1} | \mathbf{y}_N) = \frac{P(y_{N+1}, \mathbf{y}_N)}{P(\mathbf{y}_N)}
$$

The denominator does not depend on $y_{N+1}$, so it only affects the normalisation. Dropping it, all we need is the joint probability,

$$
P(y_{N+1} | \mathbf{y}_N) \propto P(y_{N+1}, \mathbf{y}_N)
$$

and we would like it as a probability distribution for $y_{N+1}$ in terms of $\mathbf{y}_N$ and $\mathbf{x}_N$ as parameters. 

We have

$$
\begin{eqnarray}
P(\mathbf{y}_N) &=& N(0,\mathbf{C}_N) \propto e^{-\frac12 \mathbf{y}_N^T \mathbf{C}^{-1}_N \mathbf{y}_N}\\
P(y_{N+1},\mathbf{y}_N) &=& N(0,\mathbf{C}_{N+1}) \propto e^{-\frac12 [\mathbf{y}_N^T\; y_{N+1}] \mathbf{C}^{-1}_{N+1} [\mathbf{y}_N^T\; y_{N+1}]^T}
\end{eqnarray}
$$

The covariance matrices are related as follows

$$
\mathbf{C}_{N+1} = \left[\begin{matrix}
\mathbf{C}_N & \mathbf{k}\\
\mathbf{k}^T & \kappa\\
\end{matrix}\right] \qquad [\mathbf{k}]_i = C(x_i,x_{N+1}), \quad \kappa = C(x_{N+1},x_{N+1})
$$

There is a neat trick for expressing $\mathbf{C}^{-1}_{N+1}$ in terms of $\mathbf{C}^{-1}_{N}$ (the block matrix inverse),

$$
\begin{eqnarray}
\mathbf{C}^{-1}_{N+1} &=& \left[\begin{matrix}
\mathbf{M} & \mathbf{m}\\
\mathbf{m}^T & \mu\\
\end{matrix}\right]\\
\mathbf{M} &=& \mathbf{C}^{-1}_{N} + \frac1\mu \mathbf{m}\mathbf{m}^T\\
\mathbf{m} &=& -\mu \mathbf{C}^{-1}_{N} \mathbf{k}\\
\mu &=& (\kappa - \mathbf{k}^T \mathbf{C}^{-1}_{N} \mathbf{k})^{-1}
\end{eqnarray}
$$

Using this, we can write the quadratic form in the exponent as

$$
\begin{eqnarray}
[\mathbf{y}_N^T\; y_{N+1}] \mathbf{C}^{-1}_{N+1} [\mathbf{y}_N^T\; y_{N+1}]^T &=& 
\mathbf{y}^T_N \mathbf{M}\mathbf{y}_N + 2\mathbf{y}^T_N \mathbf{m}\, y_{N+1}+\mu y^2_{N+1}\\
&=& \mu(y_{N+1} + \mathbf{y}^T_N\mathbf{m}/\mu)^2 + \ldots \qquad\textrm{(completing the square)}
\end{eqnarray}
$$

where the omitted terms do not depend on $y_{N+1}$. Substituting $\mathbf{m} = -\mu \mathbf{C}^{-1}_N \mathbf{k}$ and keeping the factor $\frac12$ from the exponent, the conditional probability for $y_{N+1}$ is then

$$
\begin{eqnarray}
P(y_{N+1} | \mathbf{y}_N) &\propto& e^{-(y_{N+1} - \bar y)^2/2\hat\sigma^2}\\
\bar y &=& \mathbf{k}^T \mathbf{C}^{-1}_N \mathbf{y}_N\\
\hat\sigma^2 &=& 1/\mu = \kappa - \mathbf{k}^T \mathbf{C}^{-1}_N \mathbf{k}
\end{eqnarray}
$$

The highest probability for $y_{N+1}$ occurs at the mean, $\bar y$, which has exactly the form of the regularised kernel fit, $f(x) = \mathbf{k}(x)^T \mathbf{C}^{-1} \mathbf{y}$. 
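
The block-inverse expressions can be checked numerically. The sketch below (current Julia, made-up data, illustrative kernel parameters) computes $\bar y = \mathbf{k}^T \mathbf{C}^{-1}_N \mathbf{y}_N$ and the conditional Gaussian variance $\kappa - \mathbf{k}^T \mathbf{C}^{-1}_N \mathbf{k}$, then reads the same quantities off the explicit inverse of $\mathbf{C}_{N+1}$ as $-\mathbf{y}^T_N\mathbf{m}/\mu$ and $1/\mu$. All names here are assumptions chosen for the example.

```julia
using LinearAlgebra

# Squared-exponential kernel; sw2 and sigma are illustrative values
sqexp(x1, x2; sw2=0.3, sigma=0.5) = sw2 * exp(-(x1 - x2)^2 / (2 * sigma^2))

xobs = [0.0, 0.7, 1.5, 2.1]        # made-up observation locations
yobs = [0.2, -0.1, 0.4, 0.0]       # made-up observations
xnew = 1.0                         # location of the prediction
nu2 = 0.01                         # noise variance sigma_nu^2

CN = [sqexp(a, b) for a in xobs, b in xobs] + nu2 * I
kvec = [sqexp(a, xnew) for a in xobs]
kappa = sqexp(xnew, xnew) + nu2

# Conditional mean and variance from the closed-form expressions
ybar = dot(kvec, CN \ yobs)
s2 = kappa - dot(kvec, CN \ kvec)

# The same quantities from the inverse of the full (N+1)x(N+1) covariance
Cfull = [CN kvec; kvec' kappa]
Pfull = inv(Cfull)
mu = Pfull[end, end]               # bottom-right element
mvec = Pfull[1:end-1, end]         # last column without its final entry
ybar_direct = -dot(mvec, yobs) / mu
s2_direct = 1 / mu

println("mean: ", ybar, " vs ", ybar_direct)
println("variance: ", s2, " vs ", s2_direct)
```

Using `CN \ yobs` instead of forming `inv(CN)` is a design choice for numerical stability; it does not change the result.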

* We now have an error bar $\hat\sigma$ on the prediction
* The parameters have physical meaning
  * $\sigma_w$ is the expected scale of the function
  * $\sigma_\nu$ is the noise of the observations
  * $\sigma$ is the correlation length scale in input space 
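
Because $P(\mathbf{y}) = N(0, \mathbf{C}_N)$ is an explicit function of $\sigma_w$, $\sigma$ and $\sigma_\nu$, one possible way to compare parameter choices is to evaluate $\ln P(\mathbf{y})$ for the observed data. The following is a minimal sketch of that idea in current Julia, not a procedure prescribed in this lecture; `log_evidence` is a hypothetical name and the data are made up.

```julia
using LinearAlgebra

# Squared-exponential kernel; sw2 and sigma are illustrative defaults
sqexp(x1, x2; sw2=0.3, sigma=0.5) = sw2 * exp(-(x1 - x2)^2 / (2 * sigma^2))

# ln P(y) for y ~ N(0, C) with C = K + nu2*I (hypothetical helper)
function log_evidence(x, y; sw2=0.3, sigma=0.5, nu2=0.01)
    C = [sqexp(a, b; sw2=sw2, sigma=sigma) for a in x, b in x] + nu2 * I
    F = cholesky(Symmetric(C))
    return -0.5 * dot(y, F \ y) - 0.5 * logdet(F) - 0.5 * length(y) * log(2pi)
end

# Made-up data: a smooth function plus a small wiggle standing in for noise
xd = collect(0:0.5:5)
yd = sin.(xd) .+ 0.05 .* cos.(7 .* xd)

for nu2 in (1e-4, 1e-3, 1e-2, 1e-1)
    println("nu2 = ", nu2, "   ln P(y) = ", log_evidence(xd, yd; nu2=nu2))
end
```

The Cholesky factorisation gives both $\mathbf{C}^{-1}\mathbf{y}$ and $\ln\det\mathbf{C}$ without forming the inverse explicitly.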

#### Samples from the Gaussian prior

In [None]:
x = (0:0.1:10)
Nx=length(x)
# R[i,h] = phi_h(x_i): Gaussian basis functions centred on the grid points
R = zeros(Nx, Nx)
for i=1:Nx, j=1:Nx
    R[i,j] = se(x[i],x[j])
end

In [None]:
figure()
# each curve is f = R*w with weights w ~ N(0, I), so its covariance is R*R'
plot(x, R*randn(Nx))
plot(x, R*randn(Nx))
plot(x, R*randn(Nx))
plot(x, R*randn(Nx))
close(gcf())
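
Samples of the form `R*randn(Nx)` have covariance $\mathbf{RR}^T$, which is the weight-space picture with Gaussian basis functions on the grid. To draw samples whose covariance is the kernel matrix itself, one can instead multiply white noise by a factor $\mathbf{L}$ with $\mathbf{LL}^T = \mathbf{K}$. A minimal sketch in current Julia, assuming a small diagonal "jitter" is acceptable for numerical stability (the names and parameter values are illustrative):

```julia
using LinearAlgebra, Random

# Squared-exponential kernel; sw2 and sigma are illustrative values
sqexp(x1, x2; sw2=0.3, sigma=0.5) = sw2 * exp(-(x1 - x2)^2 / (2 * sigma^2))

xg = collect(0:0.1:10)
Kg = [sqexp(a, b) for a in xg, b in xg]

# A small jitter on the diagonal keeps the Cholesky factorisation stable
jitter = 1e-6
Lfac = cholesky(Symmetric(Kg + jitter * I)).L

Random.seed!(1)                                      # reproducible draws
samples = [Lfac * randn(length(xg)) for _ in 1:4]    # covariance Lfac*Lfac' ≈ Kg
```

Each element of `samples` can be plotted against `xg` in the same way as the curves above.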

In [None]:
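# GP prediction at the points x: returns the mean and the standard deviation
function fit_err(x, nu2)
    Cinv = inv(K + diagm(nu2*ones(N)))
    CinvY =  Cinv * fmean
    k = zeros(N)
    f = zeros(length(x))
    err = zeros(length(x))
    for i=1:length(x)
        for j=1:length(k)
            k[j] = se(rOO[j], x[i])
        end
        f[i] = k ⋅ CinvY                     # predictive mean k' C^{-1} y
        # predictive variance kappa - k' C^{-1} k, with kappa = sw2 + nu2
        err[i] = sw2+nu2 - k ⋅ (Cinv * k)
    end
    # clamp tiny negative values caused by round-off before taking the square root
    f,sqrt.(max.(err, 0.0))
end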

In [None]:
figure()
errorbar(rOO, fmean, ferr, fmt="b-")
r = 2:0.01:8
f,err = fit_err(r, 0.01)
plot(r, f, "r-")
errorbar(r[1:50:end], f[1:50:end], err[1:50:end], fmt="r.")
axis([2, 8, -1, 1])
xlabel(L"r_\mathrm{OO}"); ylabel("f")
close(gcf())

## Learning the potential directly from derivative observations

What we would really like is to compute $P(y_{N+1} | L(\mathbf{y}))$, where the linear operator $L$ could be e.g.  $\partial/\partial x$. 

In the Gaussian process framework, all we need for this is the covariance structure of $L(\mathbf{y})$, and $\textrm{Cov}[y_{N+1}, L(\mathbf{y})]$. 

$$
\begin{eqnarray}
\textrm{Cov}[y_i, \partial y_j/\partial x_j] &=& \partial \textrm{Cov}[y_i,y_j]/\partial x_j\\
&=& \frac{\sigma^2_w}{\sigma^2}(x_i-x_j) e^{-|x_i-x_j|^2/2\sigma^2}
\end{eqnarray}
$$

and

$$
\textrm{Cov}[\partial y_i/\partial x_i, \partial y_j/\partial x_j] =
-\frac{\sigma^2_w}{\sigma^4}(x_i-x_j)^2 e^{-|x_i-x_j|^2/2\sigma^2}+\frac{\sigma^2_w}{\sigma^2}e^{-|x_i-x_j|^2/2\sigma^2}
$$
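
Derivative kernels are easy to get wrong by a sign or a factor, so a finite-difference check is worthwhile. The sketch below (current Julia) differentiates the squared-exponential kernel $\sigma^2_w e^{-|x_i-x_j|^2/2\sigma^2}$ analytically, keeping the $\sigma^2_w$ prefactor, and compares with central differences. The names `amp2`, `ell`, `kff`, `kfd`, `kdd` and the parameter values are assumptions for illustration.

```julia
# Kernel parameters (illustrative): amp2 = sigma_w^2, ell = sigma
amp2, ell = 0.3, 0.5

kff(xi, xj) = amp2 * exp(-(xi - xj)^2 / (2 * ell^2))                # Cov[f_i, f_j]
kfd(xi, xj) = (xi - xj) / ell^2 * kff(xi, xj)                       # d k / d x_j
kdd(xi, xj) = (1 / ell^2 - (xi - xj)^2 / ell^4) * kff(xi, xj)       # d^2 k / d x_i d x_j

# Central finite differences as an independent check
h = 1e-4
xi, xj = 1.3, 0.9
fd_kfd = (kff(xi, xj + h) - kff(xi, xj - h)) / (2h)    # differentiate kff in x_j
fd_kdd = (kfd(xi + h, xj) - kfd(xi - h, xj)) / (2h)    # differentiate kfd in x_i

println("kfd: analytic ", kfd(xi, xj), "  finite difference ", fd_kfd)
println("kdd: analytic ", kdd(xi, xj), "  finite difference ", fd_kdd)
```

Note that `kfd` is antisymmetric in its arguments, so the order of the function value and the derivative observation matters when filling in the covariance vector.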

In [None]:
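# kernel prefactor for the free energy; a new name avoids redefining the constant sw2 used by se
const sw2_d=0.1^2

# Cov[A(x1), dA/dx at x2] for length scale sigma = 1
se_fd = ( (x1,x2) -> sw2_d*(x1-x2)*exp(-(x1-x2)^2/(2.0)) )

# Cov[dA/dx at x1, dA/dx at x2] for length scale sigma = 1
se_dd = ( (x1,x2) -> sw2_d*(1-(x1-x2)^2)*exp(-(x1-x2)^2/(2.0)))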

In [None]:
Kdd = zeros(N,N)
for i=1:N, j=1:N
    Kdd[i,j] = se_dd(rOO[i], rOO[j])
end

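# predict A(x) from observations of its derivative dA/dr = -(mean force)
function fit_deriv_err(x, nu2)
    Cinv = inv(Kdd + diagm(nu2*ones(N)))
    CinvY =  Cinv * (-fmean)
    k = zeros(N)
    f = zeros(length(x))
    err = zeros(length(x))
    for i=1:length(x)
        for j=1:length(k)
            k[j] = se_fd(x[i], rOO[j])   # Cov[A(x[i]), dA/dr at rOO[j]]
        end
        f[i] = k ⋅ CinvY
        # predictive variance without its constant prior term, which cancels below
        err[i] = - k ⋅ (Cinv * k)
    end
    # A(r) is only defined up to a constant, so shift the variance to be zero at its minimum
    f,sqrt.(err .- minimum(err))
end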

In [None]:
figure()
# both curves are referenced to r ≈ 4.2 (rOO[9] = 4.2, r[200] = 4.19)
f1 = cumsum(-fmean)*0.2
plot(rOO, f1 .- f1[9], "bo-", label=L"\mathrm{cumsum}(\bar f)")


r = 2.2:0.01:8
f,err = fit_deriv_err(r, 0.1^2)
plot(r, f .- f[200], "r-", label="GP deriv")
errorbar(r[1:50:end], f[1:50:end] .- f[200], err[1:50:end], fmt="r.")
#axis([2, 8, -1, 1])
xlabel(L"r_\mathrm{OO}"); ylabel("A(r)")
legend(loc=4)
close(gcf())