# Continuous Optimization

## Unconstrained Optimization

- **Unconstrained Optimization**: Unconstrained optimization is the process of finding the maximum or minimum of a function, typically a loss or cost function, without any restrictions on the possible values of the variables. This is in contrast to constrained optimization, where the variables are subject to certain constraints.

- **Methods**: There are several methods for unconstrained optimization, including:
  - **Gradient Descent**: This is an iterative method that involves taking steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
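  - **Newton's Method**: This is an iterative method that approximates the function near the current point by a quadratic model built from its gradient and Hessian, then solves the resulting linear system to jump to that model's stationary point (a minimum when the Hessian is positive definite).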
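  - **Conjugate Gradient Method**: This is an iterative method that searches along directions that are mutually conjugate with respect to the Hessian, so progress made along one direction is not undone by later steps. It typically converges faster than steepest descent without forming or inverting the Hessian as Newton's method does.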
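
To make the Newton step concrete, here is a minimal one-dimensional sketch. In one dimension the linear system reduces to a single division, `f'(x) / f''(x)`. The function name `newton_minimize`, the test function `f(x) = x**2 + exp(x)`, and the tolerance are illustrative assumptions, not part of the notes above.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-12, max_iter=50):
    """Minimize a smooth 1-D function from its first and second derivatives."""
    x = x0
    for _ in range(max_iter):
        # Stationary point of the local quadratic model around x.
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical test function f(x) = x**2 + exp(x); it is strictly convex,
# so its only stationary point is the global minimum.
df = lambda x: 2 * x + math.exp(x)   # f'(x)
d2f = lambda x: 2 + math.exp(x)      # f''(x), always positive

x_star = newton_minimize(df, d2f, x0=0.0)
print(x_star, df(x_star))
```

Because the second derivative is positive everywhere here, each step is well defined. On functions where the Hessian is not positive definite, a plain Newton step can move toward a maximum or saddle point, so practical implementations usually add safeguards.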

### Gradient Descent

- **Gradient Descent**: Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, which is given by the negative of the gradient. It exploits the fact that the gradient of a function points in the direction of its greatest rate of increase. In machine learning, we use gradient descent to update the parameters of our model.

- **Math**: In the context of unconstrained optimization, gradient descent can be mathematically formulated as follows:  

x = x - η * ∇f(x)

Here, `x` represents the parameters we are trying to optimize, `η` is the learning rate (a hyperparameter that determines the step size during each iteration), and `∇f(x)` is the gradient of the function at the current point `x`. The gradient points in the direction of steepest ascent, so subtracting a multiple of it from `x` moves us in the direction of steepest descent.
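
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `gradient_descent`, the gradient-norm stopping test, and the quadratic test function are assumptions made for the example.

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x = x - lr * grad(x) until the gradient is nearly zero."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        # Assumed stopping rule: stop once the gradient norm is below tol.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical test function f(x, y) = (x - 3)^2 + 2 * (y + 1)^2,
# whose unique minimum is at (3, -1).
def grad_f(p):
    x, y = p
    return [2 * (x - 3), 4 * (y + 1)]

x_min = gradient_descent(grad_f, x0=[0.0, 0.0], lr=0.1)
print(x_min)
```

For this particular function, each step multiplies the error in `y` by `1 - 4 * lr`, so a learning rate above 0.5 makes the iterates diverge. That is one concrete way to see why the step size has to be tuned.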

- **Intuition**: Imagine you're in a mountainous landscape and it's completely dark. Your task is to find the lowest point in the area (valley). A good strategy could be to check the steepness of the ground around you and take a step downhill. If you repeat this process, you're likely to eventually reach the valley, which is the point of minimum altitude. This is essentially what gradient descent does.

- **Steps**: In each step of gradient descent, you calculate the gradient (slope) of your function at your current position, then take a step in the direction opposite to the gradient (downhill rather than uphill). The size of each step is determined by the learning rate, a hyperparameter that you can tune.

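- **Convergence**: The process is repeated until a stopping criterion is met, for example when the gradient norm falls below a tolerance or an iteration limit is reached. For a convex function and a suitably small learning rate, the point reached is a global minimum. For a non-convex function, the algorithm may instead settle at a local minimum or another stationary point such as a saddle point.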

- **Momentum**: The idea of momentum in gradient descent is to add a fraction of the previous step to the current step. This serves two primary purposes: it accelerates convergence in directions where the gradient is persistent, and it damps oscillations in directions of high curvature. The momentum coefficient is a hyperparameter that you can tune.

- **Intuition**: Imagine a ball rolling down a hill: its velocity (speed in a specific direction) increases as it continues to roll. Similarly, when we add momentum to gradient descent, the gradient acts as a force that accelerates the ball in directions where the gradient is persistent. The ball tends to keep moving in the same direction, which helps it traverse flat regions and avoid getting stuck in shallow, non-optimal minima.

- **Math**: In the context of gradient descent, momentum can be mathematically formulated as follows:

v = γ * v_prev + η * ∇f(x)

x = x - v

Here, `v` is the velocity (direction and amount to change the weights), `v_prev` is the velocity of the previous step, `γ` is the momentum term (usually set to a value like 0.9), `η` is the learning rate, `∇f(x)` is the gradient of the function at the current location, and `x` is the current location (weights).
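
The two momentum equations above translate directly into code. The sketch below is illustrative only: the function name `momentum_descent`, the fixed step count, and the ill-conditioned quadratic test function are assumptions. Setting `gamma=0.0` recovers plain gradient descent, which gives a simple baseline for comparison.

```python
def momentum_descent(grad, x0, lr=0.01, gamma=0.9, steps=500):
    """Gradient descent with momentum: v = gamma * v_prev + lr * grad(x)."""
    x = list(x0)
    v = [0.0] * len(x)  # velocity starts at zero
    for _ in range(steps):
        g = grad(x)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        x = [xi - vi for xi, vi in zip(x, v)]
    return x

# Hypothetical test function f(x, y) = 0.5 * (x^2 + 50 * y^2): curvature is
# low along x and high along y, and the minimum is at the origin.
def grad_f(p):
    x, y = p
    return [x, 50.0 * y]

with_momentum = momentum_descent(grad_f, [10.0, 1.0], gamma=0.9)
plain = momentum_descent(grad_f, [10.0, 1.0], gamma=0.0)
print(with_momentum, plain)
```

With the same learning rate and number of steps, the run with momentum ends far closer to the origin along the low-curvature `x` direction, where plain gradient descent makes only slow progress.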

### Stochastic Gradient Descent

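- **Stochastic Gradient Descent (SGD)**: Stochastic gradient descent is a variant of gradient descent in which the gradient is estimated from a randomly chosen subset of the data (a single example in the strict sense, or a mini-batch in common usage) rather than computed on the entire dataset. Each update is noisier but much cheaper, which often makes training faster overall, especially on large datasets.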

- **Intuition**: Imagine you're trying to find the lowest point in a large, hilly area. Instead of checking the steepness of the ground across the entire area, you randomly pick a few spots and check the steepness there. You then step downhill according to the average slope at those spots. The estimate is rough, but repeated over many steps it still leads you toward lower ground. This is essentially what stochastic gradient descent does.
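
As a rough sketch of the idea, the example below fits a line `y = w * x + b` with mini-batch SGD, computing each gradient from a small random batch instead of the full dataset. The function name `sgd_linear_fit`, the synthetic noise-free data, the mean-squared-error loss, and all hyperparameter values are assumptions chosen for illustration.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over this batch only.
            gw = gb = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                gw += 2 * err * xs[i]
                gb += 2 * err
            w -= lr * gw / len(batch)
            b -= lr * gb / len(batch)
    return w, b

# Synthetic data generated from the (assumed) true line y = 2x + 1.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update here touches only 8 of the 200 points, so a single step costs a fraction of a full-gradient step. With noisy real data the iterates would fluctuate around the minimum rather than settle exactly, which is why the learning rate is often decreased over time in practice.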
