### Definition

An **optimization problem** consists of minimizing (or maximizing) a real-valued function by finding the *best available* values of its inputs from within an allowed set.

### Maximum and minimum


A value of x at which the first derivative f’(x) is 0 is a stationary point of f(x): it can be a maximum, a minimum, or an inflexion point. The second derivative helps to tell these cases apart:

• For a **maximum** the second derivative f’’(x) is **negative**.

• For a **minimum** the second derivative f’’(x) is **positive**.

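• If the second derivative is 0, the test is inconclusive: the point may be an **inflexion point**, but it can also be a maximum or a minimum (for example, f(x) = x⁴ at x = 0).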
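
The rules above can be sketched in a few lines of Python. This is only an illustration: the helper name `classify_stationary_point`, the tolerance `tol`, and the example function f(x) = x³ − 3x are assumptions chosen for the example, not part of the original notes.

```python
def classify_stationary_point(second_derivative, x, tol=1e-12):
    """Classify a stationary point x from the sign of f''(x)."""
    curvature = second_derivative(x)
    if curvature > tol:
        return "minimum"
    if curvature < -tol:
        return "maximum"
    # f''(x) is (numerically) zero: the second derivative alone cannot decide.
    return "inconclusive"


# Illustrative function: f(x) = x**3 - 3*x
# f'(x) = 3*x**2 - 3 is zero at x = -1 and x = 1, and f''(x) = 6*x.
def f_second(x):
    return 6 * x


labels = {x: classify_stationary_point(f_second, x) for x in (-1.0, 1.0)}
print(labels)
```

Here x = -1 is classified as a maximum and x = 1 as a minimum, matching the sign rules for the second derivative.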


### Newton's method

**Basic formulation**

Newton's method finds the roots (zeros) of a function using **linear approximations** of the function.

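It starts from an initial guess $x_0$ for a solution of the equation f(x) = 0.
It then computes the linear approximation (tangent line) of f(x) at $x_0$ and finds its x-intercept (where y = 0), which becomes the next guess $x_1 = x_0 - f(x_0)/f'(x_0)$. This step is repeated until the guesses stop changing.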

<img src="img/newton.png">
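
As a minimal sketch of the tangent-line iteration, the following Python function repeats the update until two successive guesses agree. The function name `newton_root`, the stopping tolerance `tol`, the iteration cap `max_iter`, and the test equation x² − 2 = 0 are all assumptions made for illustration.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Approximate a root of f by following tangent lines from the guess x0."""
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0:
            # A horizontal tangent never crosses the x-axis.
            raise ZeroDivisionError("tangent line is horizontal; no x-intercept")
        x_next = x - f(x) / slope  # x-intercept of the tangent line at x
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x  # best guess after max_iter steps (may not have converged)


# Illustrative equation: x**2 - 2 = 0, whose positive root is sqrt(2).
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)
```

The result depends on the initial guess: a different `x0` can lead to a different root, or to no convergence at all.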


**Newton's method for Optimization**

In **optimization**, our goal is to find optimal points (minima, maxima). At these points the derivative is zero. <br>
So we look for the **roots of the derivative**, i.e., we apply Newton's method to the equation f'(x) = 0.

The roots are known as stationary points of f. <br>
These solutions may be minima, maxima, or saddle (inflection) points.
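
Applying the root-finding update to f'(x) gives the iteration $x_{n+1} = x_n - f'(x_n)/f''(x_n)$. The sketch below assumes both derivatives are available as Python functions; the name `newton_stationary_point` and the example f(x) = x⁴ − 3x² + 2 are illustrative choices, not taken from the notes.

```python
def newton_stationary_point(f_prime, f_second, x0, tol=1e-10, max_iter=50):
    """Look for a root of f' (a stationary point of f) starting from x0."""
    x = x0
    for _ in range(max_iter):
        curvature = f_second(x)
        if curvature == 0:
            raise ZeroDivisionError("f''(x) is zero; the Newton step is undefined")
        x_next = x - f_prime(x) / curvature  # Newton step applied to f'
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x  # best guess after max_iter steps (may not have converged)


# Illustrative function: f(x) = x**4 - 3*x**2 + 2
def f_prime(x):
    return 4 * x**3 - 6 * x


def f_second(x):
    return 12 * x**2 - 6


x_a = newton_stationary_point(f_prime, f_second, x0=2.0)
x_b = newton_stationary_point(f_prime, f_second, x0=0.1)
print(x_a, f_second(x_a))  # positive second derivative: a minimum
print(x_b, f_second(x_b))  # negative second derivative: a maximum
```

The two starting points end at different stationary points, one a minimum and one a maximum, so the sign of f'' still has to be checked after the iteration finishes.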



### Gradient descent

Gradient descent (GD) is a first-order iterative algorithm for finding a local minimum of a differentiable function f(x).

To find a local minimum, it starts from a point $x_0$ and repeatedly takes steps proportional to the negative of the gradient of the function at the current point.


The length of the steps is controlled by a parameter, the **learning rate**, which multiplies the gradient.
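
A minimal one-dimensional sketch of this update rule follows. The function name `gradient_descent`, the fixed number of steps `n_steps`, the learning rate of 0.1, and the example f(x) = (x − 3)² are assumptions made for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Run a fixed number of gradient descent steps starting from x0."""
    x = x0
    for _ in range(n_steps):
        # Move against the gradient; the learning rate scales the step.
        x = x - learning_rate * grad(x)
    return x


# Illustrative function: f(x) = (x - 3)**2, with gradient 2*(x - 3)
# and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

For this particular function a learning rate above 1.0 would make the iterates diverge, while a very small one would need many more steps, which is why the parameter usually has to be tuned.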

### Stochastic gradient descent

In Stochastic Gradient Descent (SGD), the gradient is estimated at each iteration from only a few randomly selected samples (a “batch”), because computing it over the full dataset, as GD does, can be computationally very expensive.


Pure Stochastic Gradient Descent uses only a single sample, i.e., a batch size of one, to perform each iteration: it performs many iterations, each of them very cheap.


**Here is why we are saving time**

Suppose we have 1 billion data points.

In GD, in order to update the parameters once, we need the (exact) gradient. Computing it requires summing over all 1 billion data points to perform 1 update.

In SGD, we can think of it as **trying to get an approximate gradient** instead of the exact gradient. The approximation comes from one data point (or from several data points, called a mini-batch). Therefore, in SGD, we can update the parameters very quickly. In addition, if we "loop" over all the data once (called one epoch) with one data point per update, we actually perform 1 billion updates.

The trick is that in SGD you **do not need 1 billion iterations/updates, but far fewer**, say 1 million, and you will often have a "good enough" model to use.
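
To make the "one update per data point" idea concrete, here is a small sketch of pure SGD (batch size of one) fitting a line y = w·x + b. Everything in it is an assumption for illustration: the function name `sgd_linear_fit`, the squared-error loss, the learning rate, the per-epoch shuffling, and the tiny noise-free dataset of 200 points.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, n_epochs=20, seed=0):
    """Fit y = w*x + b with pure SGD: one sample per parameter update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    n_updates = 0
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the samples in a random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the single-sample squared error, error**2.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
            n_updates += 1
    return w, b, n_updates


# Illustrative data: 200 points lying exactly on the line y = 2*x + 1.
xs = [i / 199 for i in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

w, b, n_updates = sgd_linear_fit(xs, ys)
print(w, b, n_updates)
```

Each epoch here performs 200 cheap updates, whereas full-batch GD would perform a single update after summing over all 200 points.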

### SVM

Support Vector Machines are machine learning classifiers which, given labeled training data (supervised learning), compute an optimal hyperplane that categorizes new examples. <br>
The optimal hyperplane is the one that maximizes its margin, i.e., the distance between itself and the nearest training point.

The support vectors are the training points closest to the separating hyperplane; if all other points were removed and learning were re-run, the result would be exactly the same.
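
The sketch below illustrates the margin and the support vectors for a hyperplane that is given by hand. It does not train an SVM: the 2D points, the hyperplane x₁ + x₂ − 3 = 0, and the helper name `distance_to_hyperplane` are hypothetical values chosen so that the geometry is easy to check.

```python
import math


def distance_to_hyperplane(w, b, point):
    """Distance from a point to the hyperplane w . x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical training points: the first three lie on one side of the
# hyperplane (one class), the last two on the other side.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
w, b = (1.0, 1.0), -3.0  # hyperplane x1 + x2 - 3 = 0, chosen by hand

distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)  # distance to the nearest training point
closest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
print(margin, closest)
```

The points (1, 1) and (2, 2) are the closest ones and play the role of support vectors; removing any of the other three points leaves the margin of this hyperplane unchanged.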