# Stochastic Gradient Descent for a Maximum a Posteriori Estimate

This notebook contains a simple implementation of the stochastic gradient descent (SGD) method. SGD is a very popular optimization method for training neural networks (see, for instance, the *Adam* optimizer, which builds on it). Compared to standard gradient descent methods, it has several potential benefits, particularly for large parameter vectors and data sets:
- We do not need to compute or store the entire gradient of the cost functional in each step
- Data can be incorporated in a streaming fashion
- The randomness of the iterates can help them escape saddle points and poor local minima in non-convex problems

We will use the SGD method to find the maximum a posteriori (MAP) estimate for a simplistic Bayesian inverse problem. To this end, we define a scalar unknown $u\in\mathbb{R}$ and a vector of data samples $\mathbf{y}=(y_1,y_2,\ldots,y_N)^T\in\mathbb{R}^N$. We assign a Gaussian prior and component-wise Gaussian additive noise to the problem,
$$
\begin{gather*}
    \rho(u) \propto \mathrm{exp}\Bigl( -\frac{c_{prior}}{2} (u-u_{prior})^2  \Bigr), \\
    l(\mathbf{y}|u) \propto \mathrm{exp}\Bigl( -\frac{c_{noise}}{2} \sum_{i=1}^N (y_i-u)^2 \Bigr),
\end{gather*}
$$
where $c_{prior}$ and $c_{noise}$ denote the prior and noise precision, respectively. We already know that the posterior for this problem will be Gaussian as well.

**Exercise:** Show that the posterior mean and the MAP coincide at
$$
    \hat{u}_{post} = \frac{c_{prior}u_{prior} + c_{noise}\sum_i y_i}{c_{prior} + N c_{noise}}.
$$
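
The closed-form expression can be sanity-checked numerically before attempting the proof. The following is a minimal sketch, not a derivation: it minimizes the negative log-posterior by brute force on a grid and compares the result to the formula. The parameter values, the seed, and the helper name `neg_log_posterior` are assumptions made only for this check.

```python
import numpy as np

# Hypothetical example values, chosen only for this check
u_prior, c_prior, c_noise = 0.0, 1.0, 10.0
rng = np.random.default_rng(0)
y = 1.0 + rng.standard_normal(10) / np.sqrt(c_noise)

def neg_log_posterior(u):
    """Negative log-posterior, up to an additive constant."""
    return 0.5 * c_prior * (u - u_prior) ** 2 + 0.5 * c_noise * np.sum((y - u) ** 2)

# Closed-form candidate from the exercise
u_closed = (c_prior * u_prior + c_noise * y.sum()) / (c_prior + y.size * c_noise)

# Brute-force minimizer on a fine grid (spacing 1e-4)
grid = np.linspace(-2.0, 4.0, 60001)
u_grid = grid[np.argmin([neg_log_posterior(u) for u in grid])]

print(f"closed form: {u_closed:.4f}, grid search: {u_grid:.4f}")
```

The two values should agree up to the grid spacing. Agreement on one data set is of course no substitute for the analytical argument.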

Next, we turn to the problem of finding the optimizer with the SGD algorithm. The cost functional is simply given as the negative log-posterior (up to an additive constant),
$$
    J(u) = \frac{c_{prior}}{2} (u-u_{prior})^2 + \frac{c_{noise}}{2} \sum_{i=1}^N (y_i-u)^2.
$$
Moving on, we define a discrete random variable $z=(z_1, z_2,\ldots,z_N)^T\in\mathbb{R}^N$. We further describe its pdf as a sum of Dirac masses,
$$
    \zeta(z) = \frac{1}{N}\sum_{i=1}^N\delta(z-e_i),
$$
where $e_i$ denotes the $i$-th unit vector.

**Exercise:** Find a representation for $J(u)$ as
$$
    J(u) = \int_{\mathbb{R}^N} F(u,z)\zeta(z)dz.
$$

*Tip: The prior term clearly does not depend on $z$, but it can be distributed equally among the addends in the sum of the likelihood term.*

*Note: This will lead to a so-called data subsampling technique.*
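
A candidate for $F$ can be tested numerically: since $\zeta$ puts mass $1/N$ on each unit vector, the integral reduces to the average of $F(u, e_i)$ over $i$. The sketch below checks one possible choice that follows the tip (full prior term plus $N$ times a single likelihood addend). It is an assumption for illustration, not necessarily the intended solution, and the data values are made up.

```python
import numpy as np

# Hypothetical example values
u_prior, c_prior, c_noise = 0.0, 1.0, 10.0
y = np.array([0.8, 1.1, 0.95, 1.3])
N = y.size

def J(u):
    """Full cost functional (negative log-posterior)."""
    return 0.5 * c_prior * (u - u_prior) ** 2 + 0.5 * c_noise * np.sum((y - u) ** 2)

def F(u, z):
    """Candidate integrand: prior term plus N times the z-weighted likelihood term."""
    return 0.5 * c_prior * (u - u_prior) ** 2 + 0.5 * N * c_noise * np.dot(z, (y - u) ** 2)

u = 0.3
unit_vectors = np.eye(N)  # row i is the i-th unit vector
expectation = np.mean([F(u, e) for e in unit_vectors])

print(np.isclose(expectation, J(u)))
```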

Now recall the standard SGD algorithm. For an initial state $u^{(0)}$, a number of steps $L$ and a sequence of learning rates $\{\alpha_l\}_{l=0}^{L-1}$, we can compute iterates as
$$
    u^{(l+1)} = u^{(l)} - \alpha_l D_u F(u^{(l)}, z^{(l)}),\quad z^{(l)} \sim \zeta\ \mathrm{i.i.d.}
$$
In the context of data subsampling, drawing samples from $\zeta$ corresponds to choosing a random index $i$ and computing the partial gradient from the $i$-th addend of $J(u)$ (with data sample $y_i$). Writing $\tilde{F}(u,i) := F(u,e_i)$, we might therefore reformulate the SGD iteration as

$$
\begin{align*}
    i^{(l)} &\sim \mathrm{uni}(\{ 1,2,\ldots,N \})\ \mathrm{i.i.d.}, \\
    u^{(l+1)} &= u^{(l)} - \alpha_l D_u \tilde{F}(u^{(l)}, i^{(l)}).
\end{align*}
$$
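
The index-based iteration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference solution of the exercise below: the function names `partial_gradient` and `sgd_map`, the synthetic data, and the seeds are hypothetical, and the partial gradient assumes the subsampling form from the tip above (full prior term plus $N$ times one likelihood addend).

```python
import numpy as np

def partial_gradient(u, y_i, num_samples, u_prior, c_prior, c_noise):
    """Gradient of the i-th addend: prior term plus N times one likelihood term."""
    return c_prior * (u - u_prior) - num_samples * c_noise * (y_i - u)

def sgd_map(y, u_prior, c_prior, c_noise, learning_rate, num_steps, seed=0):
    """Run subsampled SGD with a constant learning rate, starting at the prior mean."""
    rng = np.random.default_rng(seed)
    u = u_prior
    trajectory = [u]
    for _ in range(num_steps):
        i = rng.integers(y.size)  # uniform index in {0, ..., N-1}
        u = u - learning_rate * partial_gradient(u, y[i], y.size, u_prior, c_prior, c_noise)
        trajectory.append(u)
    return np.array(trajectory)

# Synthetic data: true parameter 1, noise precision 10
y = 1.0 + np.random.default_rng(1).standard_normal(10) / np.sqrt(10.0)
trajectory = sgd_map(y, u_prior=0.0, c_prior=1.0, c_noise=10.0,
                     learning_rate=1e-3, num_steps=2000)
u_exact = (1.0 * 0.0 + 10.0 * y.sum()) / (1.0 + y.size * 10.0)

print(f"last iterate: {trajectory[-1]:.3f}, closed form: {u_exact:.3f}")
```

With a constant learning rate, the iterates are not expected to settle exactly at the MAP estimate. They fluctuate around it, with an amplitude that shrinks as the learning rate decreases.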

Finally, we can turn to the implementation:

In [None]:
# Necessary libraries
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Interactive plotting and style
%matplotlib widget
plt.close('all')
plt.style.use('bmh')

**Exercise:** Consider the Bayesian inverse problem with the parameters given below. Implement the SGD algorithm to compute the MAP estimate. Compare the iterates to the exact solution. You can use the code cells below as guidance for your implementation.

In [None]:
# Parameters for the simulation
TRUE_PARAMETER = 1.
NUM_SAMPLES = 10
NOISE_PRECISION = 10
PRIOR_MEAN = 0.
PRIOR_PRECISION = 1

# Generate data
data = norm.rvs(TRUE_PARAMETER, np.power(NOISE_PRECISION, -0.5), NUM_SAMPLES, random_state=42)

# Analytical solution of the MAP estimate
map_exact = # -> Enter solution

# Function to evaluate the partial gradient for a given state, data set and sample index
def evaluate_gradient(state, data, index):
    # -> Implement function for the partial gradient
    
    return grad

In general, choosing a proper learning rate is non-trivial. A learning rate that is too low leads to slow convergence, while a rate that is too high can cause strong noise and oscillations, or even divergence. For the plain SGD method, we employ a constant learning rate, since the optimization problem is benign (one-dimensional and convex).
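
For this quadratic problem, the trade-off can be made concrete. Assuming the subsampling form from the tip above, every partial gradient has the same curvature $H = c_{prior} + N c_{noise}$, so one SGD step scales the distance to the current sample's minimizer by $1 - \alpha H$. A constant rate therefore only contracts for $\alpha < 2/H$, which is $2/101 \approx 0.0198$ for the parameters above. The sketch below evaluates this factor for a few hypothetical learning rates.

```python
# Parameter values as in the simulation cell above
prior_precision, noise_precision, num_samples = 1.0, 10.0, 10

# Curvature shared by all partial cost functions (assumes the subsampling form from the tip)
curvature = prior_precision + num_samples * noise_precision

# Hypothetical learning rates to compare
factors = {}
for learning_rate in (1e-4, 1e-3, 1e-2, 3e-2):
    factors[learning_rate] = abs(1.0 - learning_rate * curvature)
    status = "contracts" if factors[learning_rate] < 1.0 else "does not contract"
    print(f"alpha = {learning_rate:.0e}: |1 - alpha*H| = {factors[learning_rate]:.3f} ({status})")
```

A factor close to one means slow convergence, while a factor close to zero means each step jumps almost entirely to the minimizer of a single sample, which gives fast but very noisy iterates.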

In [None]:
# Parameters for the optimization algorithm
NUM_ITERATIONS = 1000
LEARNING_RATE = 1e-3

# Initialize RNG and solution vectors
rng = np.random.default_rng(seed=123456)
iterates = [PRIOR_MEAN] # Use prior as initial guess

# Perform optimization
for i in range(NUM_ITERATIONS):
    # -> Implement SGD iteration

**Exercise:** Perform the optimization for different hyperparameters `NUM_ITERATIONS` and `LEARNING_RATE`. What do you observe? Can you think of modifications to increase the robustness and/or convergence rate of the algorithm? 
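
One common modification is a decaying learning rate, which damps the fluctuations of late iterates. Iterate averaging and mini-batches are other options. The sketch below compares a constant rate with the schedule $\alpha_l = 1/(H(l+1))$, where $H = c_{prior} + N c_{noise}$. For this particular quadratic problem, that schedule turns the iterate into a running average of the per-sample minimizers. The helper `run_sgd`, the synthetic data, and the seeds are assumptions for illustration, and the partial gradient again assumes the subsampling form from the tip.

```python
import numpy as np

# Hypothetical setup mirroring the notebook parameters
u_prior, c_prior, c_noise = 0.0, 1.0, 10.0
y = 1.0 + np.random.default_rng(2).standard_normal(10) / np.sqrt(c_noise)
N = y.size
curvature = c_prior + N * c_noise
u_exact = (c_prior * u_prior + c_noise * y.sum()) / curvature

def run_sgd(step_size, num_steps=2000, seed=3):
    """Run subsampled SGD; step_size maps the step counter to the learning rate."""
    rng = np.random.default_rng(seed)
    u, errors = u_prior, []
    for step in range(num_steps):
        i = rng.integers(N)  # uniform index in {0, ..., N-1}
        grad = c_prior * (u - u_prior) - N * c_noise * (y[i] - u)
        u = u - step_size(step) * grad
        errors.append(abs(u - u_exact))
    return np.array(errors)

errors_constant = run_sgd(lambda step: 1e-3)
errors_decaying = run_sgd(lambda step: 1.0 / (curvature * (step + 1)))

print(f"mean error, last 500 steps (constant): {errors_constant[-500:].mean():.4f}")
print(f"mean error, last 500 steps (decaying): {errors_decaying[-500:].mean():.4f}")
```

The decaying schedule trades fast initial progress for smaller fluctuations later on, so which variant is preferable depends on the iteration budget.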

In [None]:
# Simple Visualization
iterations = np.arange(0, NUM_ITERATIONS+1, 1)
exact_values = map_exact * np.ones((NUM_ITERATIONS+1,))

plt.close('all')
fig, ax = plt.subplots()
ax.set_title('Stochastic Gradient Descent')
ax.plot(iterations, iterates)
ax.plot(iterations, exact_values)
ax.set_xlim(0, NUM_ITERATIONS+1)
ax.set_ylim(PRIOR_MEAN, map_exact + 0.2)  # y-range from the initial guess to just above the exact MAP
ax.set_xlabel(r'$l$')
ax.set_ylabel(r'$u^{(l)}$')
ax.text(NUM_ITERATIONS/20, 1.02*map_exact, 'Exact MAP estimate')
plt.show()