diff --git a/docs/gallery.rst b/docs/gallery.rst
index e14bf99c..58a6d508 100644
--- a/docs/gallery.rst
+++ b/docs/gallery.rst
@@ -167,9 +167,6 @@
       <div class="sphx-glr-thumbnail-title">Character-level Transformer on Tiny Shakespeare.</div>
     </div>
 
-.. raw:: html
-
-    </div>
 
 .. raw:: html
 
diff --git a/examples/lbfgs.ipynb b/examples/lbfgs.ipynb
index ccb4ee2d..76073de7 100644
--- a/examples/lbfgs.ipynb
+++ b/examples/lbfgs.ipynb
@@ -11,8 +11,7 @@
         "L-BFGS is a classical optimization method that uses past gradient and parameter information to iteratively refine a solution to a minimization problem. In this notebook, we illustrate\n",
         "1. how to use L-BFGS as a simple gradient transformation,\n",
         "2. how to wrap L-BFGS in a solver, and how linesearches are incorporated,\n",
-        "3. how to debug the solver if needed,\n",
-        "3. how to use L-BFGS to train a medium scale network on CIFAR10."
+        "3. how to debug the solver if needed."
       ]
     },
     {
@@ -52,13 +51,17 @@
         "### What is L-BFGS?\n",
         "\n",
         "To solve a problem of the form\n",
+        "\n",
         "$$\n",
         "\\min_w f(w),\n",
         "$$\n",
+        "\n",
         "L-BFGS ([Limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm](https://en.wikipedia.org/wiki/Limited-memory_BFGS)) makes steps of the form\n",
+        "\n",
         "$$\n",
         "w_{k+1} = w_k - \\eta_k P_k g_k,\n",
         "$$\n",
+        "\n",
         "where, at iteration $k$, $w_k$ are the parameters, $g_k = \\nabla f(w_k)$ are the gradients, $\\eta_k$ is the stepsize, and $P_k$ is a *preconditioning* matrix, that is, a matrix that transforms the gradients to ease the optimization process.\n",
         "\n",
         "L-BFGS builds the preconditioning matrix $P_k$ as an approximation of the inverse Hessian, $P_k \\approx \\nabla^2 f(w_k)^{-1}$, using past gradient and parameter information. Briefly, at iteration $k$, the previous preconditioning matrix $P_{k-1}$ is updated such that $P_k$ satisfies the secant condition $P_k(g_k - g_{k-1}) = w_k - w_{k-1}$. The original BFGS algorithm updates $P_k$ using all past information, whereas the limited-memory variant only uses a fixed number of past parameters and gradients to build $P_k$. See [Nocedal and Wright, Numerical Optimization, 1999](https://www.math.uci.edu/~qnie/Publications/NumericalOptimization.pdf) or the [documentation](https://optax.readthedocs.io/en/latest/api/transformations.html#optax.scale_by_lbfgs) for more details on the implementation.\n"