# Optimization Visualization: Learnings

## Algorithms

### General
* When comparing methods such as Gradient Descent and BFGS, give each algorithm a similar convergence tolerance so that the iteration counts are comparable.
* Among gradient-based solvers, first-order methods such as GD require more iterations to converge to a minimum than second-order (quasi-Newton) methods such as BFGS.
* Stochastic solvers such as Simulated Annealing can escape local minima and still reach the global minimum, whereas gradient-based solvers such as GD generally cannot. In exchange for this advantage, stochastic solvers require more iterations to find the global minimum.
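
A like-for-like comparison can be sketched as follows. This is a minimal illustration, not the code behind these notes: the ill-conditioned quadratic `f`, the step size `alpha=0.1`, and the helper `gradient_descent` are assumptions chosen for brevity. Both solvers stop on the same infinity-norm gradient tolerance, which is what scipy's BFGS checks against `gtol` by default.

```python
import numpy as np
from scipy.optimize import minimize


def f(x):
    """Hypothetical test problem: an ill-conditioned quadratic bowl."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    """Analytic gradient of f."""
    return np.array([x[0], 10.0 * x[1]])


def gradient_descent(x0, alpha, tol, max_iter=10_000):
    """Fixed-step gradient descent; stops when max |gradient| < tol."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g, ord=np.inf) < tol:
            return x, k
        x = x - alpha * g
    return x, max_iter


tol = 1e-6          # shared convergence tolerance
x0 = [1.0, 1.0]     # shared starting point

x_gd, gd_iters = gradient_descent(x0, alpha=0.1, tol=tol)
bfgs = minimize(f, x0, jac=grad, method="BFGS", options={"gtol": tol})

print(f"GD:   {gd_iters} iterations")
print(f"BFGS: {bfgs.nit} iterations")
```

Using the same starting point and tolerance for both solvers keeps the iteration counts meaningful; changing either one for only one solver would skew the comparison.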

### Gradient Descent (GD)
* The choice of learning rate, $\alpha$, strongly affects convergence.
    * Values of $\alpha$ that are too large overshoot the minimum and fail to converge to a solution.
    * Values of $\alpha$ that are too small require more iterations to find a solution.
* Gradient descent spends more iterations in flat valleys than along steep canyons, because with a fixed $\alpha$ the step length scales with the gradient magnitude.
    * An adaptive learning rate could help address this shortcoming.
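
The effect of the learning rate can be reproduced with a small sketch. The quadratic, the three values of `alpha`, and the blow-up threshold of `1e12` are assumptions for illustration only; they are not taken from the experiments behind these notes.

```python
def grad(x):
    """Gradient of the hypothetical quadratic f(x) = 0.5 * (x0**2 + 10 * x1**2)."""
    return (x[0], 10.0 * x[1])


def gradient_descent(x0, alpha, tol=1e-6, max_iter=10_000):
    """Fixed-step gradient descent; returns (x, iterations, converged)."""
    x = tuple(x0)
    for k in range(max_iter):
        g = grad(x)
        gnorm = max(abs(g[0]), abs(g[1]))
        if gnorm < tol:
            return x, k, True
        if gnorm > 1e12:
            # The iterates are blowing up: alpha is too large for this problem.
            return x, k, False
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
    return x, max_iter, False


results = {}
for alpha in (0.25, 0.1, 0.01):
    results[alpha] = gradient_descent((1.0, 1.0), alpha)
    _, iters, converged = results[alpha]
    print(f"alpha={alpha}: converged={converged}, iterations={iters}")
```

For this quadratic the largest curvature is 10, so fixed-step gradient descent is only stable for `alpha` below 0.2. The largest value diverges, and the smallest one converges but needs many more iterations than the middle one.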

### BFGS
* My own implementation of BFGS worked without issues on the Rosenbrock function, but would periodically fail on the Goldstein-Price function.
    * The line search step fails to compute a step length when the approximate Hessian is not invertible.
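
One standard safeguard is sketched below. The implementation discussed above is not shown in these notes, so this is an assumption about what might help rather than a confirmed fix: `bfgs_inverse_update` and `eps` are hypothetical names. The BFGS update only preserves positive definiteness (and therefore invertibility) when the curvature condition $s^T y > 0$ holds, so the sketch skips the update whenever that condition fails.

```python
import numpy as np


def bfgs_inverse_update(H, s, y, eps=1e-10):
    """Return the BFGS update of the inverse-Hessian approximation H.

    s = x_new - x_old, y = grad_new - grad_old.
    The update is skipped when the curvature condition s.y > 0 is not
    safely satisfied, so H stays symmetric positive definite.
    """
    sy = float(s @ y)
    if sy <= eps * np.linalg.norm(s) * np.linalg.norm(y):
        return H  # keep the previous approximation
    rho = 1.0 / sy
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)


H0 = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([2.0, 1.0])

H1 = bfgs_inverse_update(H0, s, y)          # curvature condition holds
H_skipped = bfgs_inverse_update(H0, s, -y)  # curvature condition fails

print("secant equation satisfied:", np.allclose(H1 @ y, s))
print("update skipped:", np.array_equal(H_skipped, H0))
```

Skipping the update is the simplest option; damped updates or resetting the approximation to the identity are alternatives.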

### Simulated Annealing
* The more local minima the test function has, the more iterations are required, since some time is spent in each local minimum.
* The transition distribution and annealing schedule hyperparameters of the algorithm can be tuned to each particular test function.
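
The two tunable pieces named above can be seen in a minimal sketch. Everything here is an assumption for illustration: a Gaussian transition distribution with width `sigma`, a geometric annealing schedule controlled by `t0` and `cooling`, and a hypothetical one-dimensional test function `bumpy`. The actual solver used for these notes may differ.

```python
import math
import random


def simulated_annealing(f, x0, sigma=1.0, t0=10.0, cooling=0.995,
                        n_iter=5000, seed=0):
    """Minimal simulated annealing; returns (best_x, best_f)."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    best_x, best_f = list(x), fx
    t = t0
    for _ in range(n_iter):
        # Transition distribution: Gaussian perturbation of the current point.
        cand = [xi + rng.gauss(0.0, sigma) for xi in x]
        fc = f(cand)
        # Metropolis criterion: always accept improvements, and accept
        # uphill moves with a probability that shrinks as t decreases.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = list(x), fx
        # Annealing schedule: geometric cooling.
        t *= cooling
    return best_x, best_f


def bumpy(x):
    """Hypothetical multimodal function with its global minimum f = 0 at x = 0."""
    return x[0] ** 2 + 10.0 * (1.0 - math.cos(3.0 * x[0]))


start = [4.0]
best_x, best_f = simulated_annealing(bumpy, start)
print(f"start f = {bumpy(start):.3f}, best f = {best_f:.3f} at x = {best_x[0]:.3f}")
```

A wider `sigma` or a slower `cooling` gives the solver more chances to leave a local minimum, at the cost of more iterations.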

## Test Functions

### Goldstein-Price
* Even a sophisticated gradient-based solver such as the BFGS implementation in scipy.optimize does not always find the global minimum of this test function.
    * There is a local minimum at $(x_1=0.75, x_2=0.25)$ that gradient-based solvers converge to for some choices of initial position $x_0$.
* Gradient descent requires more iterations but is more robust at finding the global minimum, perhaps because the step size is smaller.
    * Need to validate the hypothesis by comparing results from different initial positions.
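
A comparison across initial positions could be set up along these lines. The function below is the standard textbook form of Goldstein-Price, whose global minimum is $f(0, -1) = 3$; it is an assumption that these notes use the same form, since a rescaled variant would move the minima. The `starts` list is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def goldstein_price(p):
    """Standard Goldstein-Price function; global minimum f(0, -1) = 3."""
    x, y = p
    a = 1 + (x + y + 1) ** 2 * (
        19 - 14 * x + 3 * x ** 2 - 14 * y + 6 * x * y + 3 * y ** 2
    )
    b = 30 + (2 * x - 3 * y) ** 2 * (
        18 - 32 * x + 12 * x ** 2 + 48 * y - 36 * x * y + 27 * y ** 2
    )
    return a * b


# Hypothetical initial positions; replace with the ones of interest.
starts = [(-1.5, -1.5), (0.5, 0.5), (1.0, 1.0), (1.5, -0.5)]

results = []
for x0 in starts:
    res = minimize(goldstein_price, x0, method="BFGS")
    results.append((x0, res.x, res.fun))
    print(f"x0={x0} -> x={np.round(res.x, 3)}, f={res.fun:.3f}")
```

Any run that ends with `f` noticeably above 3 has stopped at a local minimum. Running a gradient descent implementation over the same `starts` would give the side-by-side comparison needed to test the step-size hypothesis.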

### Egg Crate
* This is the most challenging test function of the group due to the large number of local minima.
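
The difficulty is easy to reproduce with a gradient-based solver. The sketch below assumes the commonly cited form of the egg crate function, $f(x, y) = x^2 + y^2 + 25(\sin^2 x + \sin^2 y)$, with its global minimum $f(0, 0) = 0$; these notes do not state the exact form used, and the starting point is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def egg_crate(p):
    """Commonly cited egg crate function; global minimum f(0, 0) = 0."""
    x, y = p
    return x ** 2 + y ** 2 + 25.0 * (np.sin(x) ** 2 + np.sin(y) ** 2)


# Hypothetical start away from the origin, inside a neighbouring basin.
res = minimize(egg_crate, x0=[3.0, 3.0], method="BFGS")
print(f"x = {np.round(res.x, 3)}, f = {res.fun:.3f}")
```

A final value of `f` well above 0 means the solver settled in a nearby local minimum instead of reaching the origin.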

## Visualization
* Filled contour plots are superior for visualizing 2d surfaces.
* Use a log scale for surfaces whose values span several orders of magnitude.
* Less effort is required to make 3d surface plots look good (for simple surfaces) with matplotlib in comparison to pyvista.
* pyvista plots surfaces with equal axis scaling, so z values should be normalized to a range comparable to x and y.
* pyvista has a nice feature to add semi-transparent contour lines to a surface plot. 
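
The first two points can be combined in a short matplotlib sketch. The Rosenbrock surface, the grid bounds, and the choice of 20 levels are assumptions for illustration; the `Agg` backend line is only there so the script runs without a display and can be removed for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm

# Rosenbrock function on a grid; its values span several orders of magnitude.
x = np.linspace(-2.0, 2.0, 200)
y = np.linspace(-1.0, 3.0, 200)
X, Y = np.meshgrid(x, y)
Z = (1.0 - X) ** 2 + 100.0 * (Y - X ** 2) ** 2

# Logarithmically spaced contour levels covering the full range of Z.
# This requires Z > 0 everywhere on the grid.
levels = np.logspace(np.floor(np.log10(Z.min())),
                     np.ceil(np.log10(Z.max())), 20)

fig, ax = plt.subplots()
cs = ax.contourf(X, Y, Z, levels=levels, norm=LogNorm(), cmap="viridis")
fig.colorbar(cs, ax=ax, label="f(x, y)")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

With linearly spaced levels the curved valley floor would collapse into a single color band; the log-spaced levels keep it visible.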