# 2.1  Introduction

- The problem of determining the smallest (or largest) value a function can take, referred to as its *global minimum* (or *global maximum*), is a centuries-old pursuit with numerous applications throughout the sciences and engineering.


- In this chapter we begin our investigation of mathematical optimization by describing the *zero order optimization* techniques (also called *derivative free optimization*).

- While not always the most powerful optimization tools at our disposal, these techniques are quite simple and often quite effective.

- Discussing zero order methods first also allows us to lay bare, in a simple setting, a range of crucial concepts that we will see again, in more complex settings, throughout the chapters that follow.

- These concepts include the notions of *optimality*, *local optimization*, *descent directions*, *steplengths*, and more.

## Visualizing minima and maxima

- When a function takes in only one or two inputs we can visually identify its minima or maxima by plotting it over a large swath of its input space.

- But what if a function takes in more than two inputs?  

- We begin our discussion by first examining a number of low dimensional examples to gain an intuitive feel for how we might effectively identify these desired minima or maxima in general.
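
- The computational analog of plotting a function and reading off its lowest and highest points is to evaluate it over a fine grid of inputs and keep the smallest and largest evaluations. The sketch below does this for a single-input function; the function `g`, the input range $[-5, 5]$, and the grid resolution are all hypothetical choices made only for illustration.

```python
import math

def g(w):
    # hypothetical test function, chosen only because it has several dips
    return math.sin(3 * w) + 0.1 * w ** 2

# evenly spaced grid of inputs over [-5, 5] (range and resolution are arbitrary)
num_points = 2001
w_grid = [-5.0 + 10.0 * i / (num_points - 1) for i in range(num_points)]

# evaluate the function at every grid point, as a plot would
g_vals = [g(w) for w in w_grid]

# lowest and highest evaluations found on the grid, with their inputs
g_min, w_min = min(zip(g_vals, w_grid))
g_max, w_max = max(zip(g_vals, w_grid))

print(f"approximate minimum: g({w_min:.3f}) = {g_min:.3f}")
print(f"approximate maximum: g({w_max:.3f}) = {g_max:.3f}")
```

- Note that this only approximates the minimum and maximum over the chosen range, and that the number of grid points needed grows exponentially with the number of inputs, which is one reason this approach does not extend well beyond a few dimensions.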

#### <span style="color:#a50e3e;">Example. </span> Visual inspection of simple functions for minima and maxima

- Every machine learning problem has parameters that must be tuned properly to ensure optimal learning.

- For example, simple linear regression (that is, fitting a line to a scatter of data) has two parameters that must be properly tuned: the slope and intercept of the linear model.

- These two parameters are tuned by forming what is called a *cost function* or *loss function*. 


- This is a continuous function in both parameters that measures how well the linear model fits a dataset given a value for its slope and intercept. 

- The proper tuning of these parameters via the cost function corresponds geometrically to finding the parameter values that make the cost function *as small as possible* or, in other words, the parameters that *minimize* the cost function.

- The image below illustrates how choosing a set of parameters higher on the cost function results in a corresponding line fit that is poorer than the one corresponding to parameters at the lowest point on the cost surface.

<img src="../../mlrefined_images/math_optimization_images/bigpicture_regression_optimization.png" width="80%"/>
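
- To make the idea of a cost function concrete, the sketch below evaluates one common choice, the average squared error (a *least squares* cost), at two settings of the slope and intercept. The text above does not specify a particular cost, so the least squares form, the function name `least_squares_cost`, and the toy dataset are all assumptions made for illustration.

```python
def least_squares_cost(slope, intercept, xs, ys):
    # average squared vertical distance between the line and the data points
    errors = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# hypothetical toy dataset scattered roughly around the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

# a setting that fits the data well and one that fits it poorly
good = least_squares_cost(2.0, 1.0, xs, ys)
poor = least_squares_cost(-1.0, 4.0, xs, ys)

print(f"cost of well-fitting line:   {good:.3f}")
print(f"cost of poorly-fitting line: {poor:.3f}")
```

- The well-fitting line sits lower on the cost surface than the poorly-fitting one, which is the relationship between cost value and quality of fit described above.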

- This same idea holds true for regression with higher dimensional input, as well as for classification, where we must properly tune parameters to *separate* classes of data.

- Again, the parameters minimizing an associated cost function provide the best classification result. This is illustrated for classification below.

<img src="../../mlrefined_images/math_optimization_images/bigpicture_classification_optimization.png" width="80%"/>

- The tuning of these parameters requires the *minimization of a cost function*, which can be formally written as follows.


- For a generic function $g(\mathbf{w})$ taking in a general $N$-dimensional input $\mathbf{w}$, the problem of finding the particular point $\mathbf{v}$ where $g$ attains its smallest value is written formally as

\begin{equation}
\underset{\mathbf{w}}{\mbox{minimize}}\,\,\,\,g\left(\mathbf{w}\right)
\end{equation}
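
- As a small illustration (the function here is a hypothetical example chosen for simplicity), take the single-input quadratic $g(w) = \left(w-1\right)^2 + 2$. Since $\left(w-1\right)^2 \geq 0$ with equality only at $w = 1$, the minimization problem can be solved by inspection:

\begin{equation}
\underset{w}{\mbox{minimize}}\,\,\,\,\left(w-1\right)^2 + 2 \quad \Longrightarrow \quad v = 1, \,\,\,\, g\left(v\right) = 2
\end{equation}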


- This formal problem can very rarely be solved 'by hand'; instead we must rely on algorithmic techniques for finding function minima (or at the very least finding points close to them).


- In this part of the text we examine many algorithmic methods of *mathematical optimization*, which aim to do just this.