# Gradient Descent and Its Variants

Gradient descent is a fundamental optimization algorithm used to minimize an objective function, typically the loss function of a machine learning model. It works by repeatedly stepping in the direction in which the function decreases fastest, that is, along the negative gradient.

## Mathematical Background

### Basic Gradient Descent

The basic idea of gradient descent is to update the parameters of the model in the opposite direction of the gradient of the objective function with respect to the parameters. The learning rate $\eta$ determines the size of the steps taken towards the minimum.

The update rule for gradient descent is given by:

$$
\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t)
$$

where:
- $\theta_t$ are the parameters at iteration $t$.
- $\eta$ is the learning rate.
- $\nabla_{\theta} L(\theta_t)$ is the gradient of the loss function $L$ with respect to the parameters $\theta$, evaluated at $\theta_t$.
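
As a minimal sketch of this update rule, the snippet below minimizes a hypothetical one-dimensional loss $L(\theta) = (\theta - 3)^2$. The loss, the function names, the starting point, and the learning rate are all illustrative assumptions rather than part of any particular library.

```python
def grad(theta):
    """Gradient of the illustrative loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


theta_min = gradient_descent(theta0=0.0)
print(theta_min)  # close to the minimizer theta = 3
```

With this learning rate the distance to the minimizer shrinks by a constant factor of 0.8 at every step, so the iterate approaches 3 geometrically.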

**Advantages**:
- Simple and easy to implement.
- For smooth convex problems, converges to a global minimum when the learning rate is small enough.

**Disadvantages**:
- Each update requires the gradient over the entire dataset, which can be slow for large datasets.
- Requires careful tuning of the learning rate.
- May get stuck in local minima for non-convex problems.

### Stochastic Gradient Descent (SGD)

In stochastic gradient descent, the gradient is computed using a single sample (or a small batch) instead of the entire dataset. Each update is therefore much cheaper, which often means faster progress per unit of computation, at the cost of noisier updates.

The update rule for SGD is:

$$
\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t; x_i, y_i)
$$

where $(x_i, y_i)$ is a single training example, typically drawn at random at each iteration.
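
The single-example update can be sketched as follows for a hypothetical one-parameter model $y \approx w x$ with per-example loss $\tfrac{1}{2}(w x - y)^2$. The data, the function name `sgd_linear`, and the hyperparameter values are assumptions chosen only for illustration.

```python
import random


def sgd_linear(data, eta=0.05, epochs=50, seed=0):
    """Fit y ~ w * x using one training example per update."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)      # visit the examples in a fresh random order
        for x, y in samples:
            g = (w * x - y) * x   # gradient of 0.5 * (w*x - y)^2 w.r.t. w
            w -= eta * g          # SGD step on this single example
    return w


# Hypothetical noise-free data generated from y = 2x.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_hat = sgd_linear(data)
print(w_hat)  # close to the true slope 2
```

Because this toy data is noise-free, every example agrees on the same optimum and the estimate settles on it. With noisy data and a constant learning rate, the iterates would instead keep fluctuating around the optimum.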

**Advantages**:
- Faster iterations, suitable for large datasets.
- Introduces noise which can help escape local minima.

**Disadvantages**:
- The updates are noisy, which can lead to convergence issues.
- Requires careful tuning of the learning rate and batch size.

### Mini-Batch Gradient Descent

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent. It computes the gradient using a small batch of training examples.

The update rule for mini-batch gradient descent is:

$$
\theta_{t+1} = \theta_t - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(\theta_t; x_i, y_i)
$$

where $m$ is the batch size.

**Advantages**:
- Reduces the variance of the updates, leading to more stable convergence.
- Can leverage the benefits of vectorized operations on modern hardware.

**Disadvantages**:
- Still requires careful tuning of the learning rate and batch size.
- Each iteration costs more than a single-example SGD update.

## Variants of Gradient Descent

### Momentum

Momentum accelerates progress along directions in which successive gradients agree and dampens it along directions in which they oscillate, which leads to faster convergence. It introduces a momentum term that accumulates past gradients to smooth out the updates.

The update rule with momentum is:

$$
v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} L(\theta_t)
$$

$$
\theta_{t+1} = \theta_t - v_t
$$

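where $v_t$ is the accumulated update (the velocity), with $v_0 = 0$, and $\gamma \in [0, 1)$ is the momentum term, often set to around 0.9.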
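
The two-line momentum update can be sketched on a hypothetical ill-conditioned quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 25 y^2)$, where the curvature differs by a factor of 25 between the two coordinates. The loss, the function names, and the hyperparameter values are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the illustrative loss L = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def momentum_descent(theta0, eta=0.02, gamma=0.9, steps=300):
    """Gradient descent with a momentum (velocity) term."""
    theta = list(theta0)
    v = [0.0] * len(theta)  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        # v_t = gamma * v_{t-1} + eta * gradient
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        # theta_{t+1} = theta_t - v_t
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


theta_final = momentum_descent([5.0, 5.0])
print(theta_final)  # both coordinates close to the minimizer (0, 0)
```

Setting `gamma=0.0` reduces the loop to plain gradient descent, which makes it easy to compare the two on the same problem.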

**Advantages**:
- Speeds up convergence, especially in the presence of high curvature, small but consistent gradients, or noisy gradients.
- Reduces oscillations.

**Disadvantages**:
- Requires tuning of an additional hyperparameter (momentum term $\gamma$).

### Nesterov Accelerated Gradient (NAG)

NAG is a variant of momentum that looks ahead by calculating the gradient at the approximate future position of the parameters.

The update rule for NAG is:

$$
v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} L(\theta_t - \gamma v_{t-1})
$$

$$
\theta_{t+1} = \theta_t - v_t
$$
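
The only change relative to standard momentum is where the gradient is evaluated. A minimal sketch on a hypothetical quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 25 y^2)$ follows; the loss, the function names, and the hyperparameter values are illustrative assumptions.

```python
def grad(theta):
    """Gradient of the illustrative loss L = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def nag_descent(theta0, eta=0.02, gamma=0.9, steps=300):
    """Nesterov accelerated gradient: gradient taken at the look-ahead point."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Approximate future position: theta_t - gamma * v_{t-1}
        lookahead = [ti - gamma * vi for ti, vi in zip(theta, v)]
        g = grad(lookahead)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


theta_final = nag_descent([5.0, 5.0])
print(theta_final)  # both coordinates close to the minimizer (0, 0)
```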

**Advantages**:
- Uses the gradient at the look-ahead position to correct the step, which can reduce overshooting.
- Can lead to faster convergence than standard momentum.

**Disadvantages**:
- Requires tuning of an additional hyperparameter (momentum term $\gamma$).

### Adagrad

Adagrad adapts the learning rate for each parameter individually based on the historical gradients, which can be useful for sparse data.

The update rule for Adagrad is:

$$
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}
$$

where $g_{t,i}$ is the $i$-th component of the gradient $\nabla_{\theta} L(\theta_t)$, $G_t$ is a diagonal matrix whose entry $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, and $\epsilon$ is a small constant to avoid division by zero.
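
A per-coordinate sketch of the Adagrad update on a hypothetical quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 25 y^2)$. The loss, the function names, and the hyperparameter values are assumptions; the list `G` holds only the diagonal entries of $G_t$.

```python
import math


def grad(theta):
    """Gradient of the illustrative loss L = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def adagrad(theta0, eta=1.0, eps=1e-8, steps=500):
    """Adagrad with one accumulator of squared gradients per coordinate."""
    theta = list(theta0)
    G = [0.0] * len(theta)  # diagonal of G_t
    for _ in range(steps):
        g = grad(theta)
        # Accumulate squared gradients; the sums never decrease.
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        # Each coordinate gets its own effective step size eta / sqrt(G_ii + eps).
        theta = [ti - eta / math.sqrt(Gi + eps) * gi
                 for ti, Gi, gi in zip(theta, G, g)]
    return theta


theta_final = adagrad([5.0, 5.0])
print(theta_final)  # both coordinates close to the minimizer (0, 0)
```

Because each coordinate is normalized by its own gradient history, the steeper `y` direction does not force a smaller step on the flatter `x` direction.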

**Advantages**:
- Automatically adjusts the learning rate for each parameter.
- Effective for sparse data.

**Disadvantages**:
- The accumulated squared gradients never decrease, so the effective learning rate shrinks monotonically over training.
- On long runs the effective learning rate can become too small, slowing or stopping learning prematurely.

### RMSprop

RMSprop modifies Adagrad to perform better in online and non-stationary settings by using a moving average of the squared gradients.

The update rule for RMSprop is:

$$
E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) (\nabla_{\theta} L(\theta_t))^2
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_{\theta} L(\theta_t)
$$

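where $E[g^2]_t$ is the moving average of the squared gradients, $\rho$ is the decay rate, and the square, square root, and division are applied elementwise.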
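
A minimal elementwise sketch of RMSprop on a hypothetical quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 25 y^2)$. The loss, the function names, and the hyperparameter values are illustrative assumptions.

```python
import math


def grad(theta):
    """Gradient of the illustrative loss L = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def rmsprop(theta0, eta=0.05, rho=0.9, eps=1e-8, steps=1000):
    """RMSprop with an exponential moving average of squared gradients."""
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)  # E[g^2], one entry per coordinate
    for _ in range(steps):
        g = grad(theta)
        avg_sq = [rho * a + (1.0 - rho) * gi * gi
                  for a, gi in zip(avg_sq, g)]
        theta = [ti - eta / math.sqrt(a + eps) * gi
                 for ti, a, gi in zip(theta, avg_sq, g)]
    return theta


theta_final = rmsprop([5.0, 5.0])
print(theta_final)  # both coordinates near the minimizer (0, 0)
```

With a constant learning rate, the normalized step does not vanish as the gradient shrinks, so in this sketch the iterates end up hovering within a small multiple of `eta` around the minimizer rather than landing exactly on it. Decaying the learning rate is one common way to tighten that.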

**Advantages**:
- Adapts learning rate like Adagrad but without diminishing it too much.
- Effective for non-stationary objectives.

**Disadvantages**:
- Requires tuning of additional hyperparameters (decay rate $\rho$).

### Adam

Adam combines the ideas of momentum and RMSprop. It computes adaptive learning rates for each parameter and keeps an exponentially decaying average of past gradients.

The update rules for Adam are:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} L(\theta_t)
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_{\theta} L(\theta_t))^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
$$

$$
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_{t+1} = \theta_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

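where $m_t$ and $v_t$ are estimates of the first moment (mean) and the second moment (uncentered variance) of the gradients, respectively, and $\beta_1$ and $\beta_2$ are decay rates for these moment estimates. Because $m_0$ and $v_0$ are initialized to zero, the raw estimates are biased towards zero early in training; $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected versions. All squares, square roots, and divisions are applied elementwise.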
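
The five update equations map directly onto a short loop. The sketch below runs them on a hypothetical quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 25 y^2)$; the loss, the function names, and the learning rate are assumptions, while the values used for $\beta_1$, $\beta_2$, and $\epsilon$ are commonly used defaults.

```python
import math


def grad(theta):
    """Gradient of the illustrative loss L = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def adam(theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam with bias-corrected first and second moment estimates."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate
    v = [0.0] * len(theta)  # second moment estimate
    for t in range(1, steps + 1):  # t starts at 1 so 1 - beta^t is nonzero
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


theta_final = adam([5.0, 5.0])
print(theta_final)  # both coordinates close to the minimizer (0, 0)
```

Note that $\epsilon$ is added outside the square root here, matching the update rule above.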

**Advantages**:
- Combines the benefits of RMSprop and momentum.
- Well-suited for problems with sparse gradients and non-stationary objectives.

**Disadvantages**:
- Requires tuning of additional hyperparameters ($\beta_1$, $\beta_2$).
- Stores two moving averages for every parameter, which adds memory and some compute overhead compared with plain SGD.

## Practical Considerations

### Learning Rate Scheduling

Adjusting the learning rate during training can lead to better performance. Common schedules include:
- **Step Decay**: Reducing the learning rate by a factor every few epochs.
- **Exponential Decay**: Reducing the learning rate exponentially over time.
- **Warm Restarts**: Periodically resetting the learning rate to a higher value.
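
The three schedules can be sketched as functions of the epoch index. The function names, parameter names, and default values below are illustrative assumptions, and the warm-restart variant shown uses cosine annealing within each period, which is one common choice rather than the only one.

```python
import math


def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)


def exponential_decay(eta0, epoch, k=0.05):
    """Shrink the learning rate as eta0 * exp(-k * epoch)."""
    return eta0 * math.exp(-k * epoch)


def cosine_warm_restarts(eta_max, epoch, period=10, eta_min=0.0):
    """Anneal from eta_max towards eta_min, then reset every `period` epochs."""
    t = epoch % period  # position within the current cycle
    cosine = 1.0 + math.cos(math.pi * t / period)
    return eta_min + 0.5 * (eta_max - eta_min) * cosine


for epoch in (0, 5, 10, 15, 20):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_warm_restarts(0.1, epoch))
```

In a training loop, the returned value would replace the fixed $\eta$ at the start of each epoch.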

### Choosing the Right Optimizer

Different optimizers have their own strengths and weaknesses. The choice of optimizer can depend on the specific problem, the architecture of the model, and the dataset. Experimentation and empirical testing are often required to determine the best optimizer for a given task.

