# Optimizers in Deep Learning

Optimizers are algorithms that minimize the loss function by iteratively updating the model's weights during training. The goal is to find a set of weights that allows the model to make accurate predictions. Below are some commonly used optimizers:

---

## 1. Gradient Descent (GD)
- **Description:** The most basic optimization method. It updates weights by moving in the direction of the negative gradient of the loss function, with the step size set by a learning rate.
- **Types:**
  - **Batch Gradient Descent:** Uses the entire dataset to calculate gradients.
  - **Stochastic Gradient Descent (SGD):** Uses one sample at a time.
  - **Mini-Batch Gradient Descent:** Uses small batches of data.
- **Drawback:** Full-batch updates are slow on large datasets, and on non-convex losses the method may get stuck in local minima or stall at saddle points.
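
The three variants differ only in how many samples feed each gradient estimate. The following is a minimal sketch, assuming a toy one-dimensional linear regression with mean squared error; the data and the `gradient` and `train` helpers are illustrative names, not part of any library.

```python
import random

def gradient(w, b, batch):
    """Gradient of the mean squared error for y ≈ w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(data, batch_size, lr=0.3, epochs=500, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    data = list(data)          # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[i:i + batch_size])
            w -= lr * gw       # step against the gradient
            b -= lr * gb
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (an assumption for the demo)
data = [(x / 10, 2 * (x / 10) + 1) for x in range(10)]

w_batch, b_batch = train(data, batch_size=len(data))  # batch gradient descent
w_sgd, b_sgd = train(data, batch_size=1)              # stochastic gradient descent
w_mini, b_mini = train(data, batch_size=4)            # mini-batch gradient descent
```

Setting `batch_size` to the dataset size gives batch gradient descent, `1` gives SGD, and anything in between gives mini-batch gradient descent.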

---

## 2. Stochastic Gradient Descent (SGD)
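- **Description:** Updates weights after every single training sample instead of after a full pass over the dataset.
- **Advantage:** Each update is cheap, so training makes progress quickly on large datasets.
- **Drawback:** Single-sample gradients are noisy, so the loss fluctuates and convergence near a minimum can be slow.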
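
To make the noise concrete, the sketch below compares per-sample gradients with their average at a fixed weight. It assumes a toy model `y ≈ w * x` with squared error; the data points and the `sample_grad` helper are made up for illustration.

```python
def sample_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

# Hypothetical data, roughly following y = 2x
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.3), (4.0, 7.8)]
w = 0.5
lr = 0.01

per_sample = [sample_grad(w, x, y) for x, y in data]
full_batch = sum(per_sample) / len(per_sample)   # what batch GD would use
spread = max(per_sample) - min(per_sample)       # how much single samples disagree

# One SGD step using only the first sample's gradient
w_after_sgd_step = w - lr * per_sample[0]
```

Each single-sample gradient points in a useful direction on average, but its size varies widely from sample to sample, which is the source of the noisy updates.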
  
---

## 3. Momentum
- **Description:** Improves SGD by adding a fraction of the previous update to the current update. This helps in accelerating convergence and reducing oscillations.
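- **Update:** $v_t = \gamma v_{t-1} + \eta \nabla L(w_t)$, then $w_{t+1} = w_t - v_t$, where $\eta$ is the learning rate and $\gamma$ (commonly around 0.9) is the fraction of the previous update that is kept.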
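- **Advantage:** The accumulated velocity can carry the weights through shallow local minima and flat regions, and it speeds up convergence along directions where successive gradients agree.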
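
A minimal sketch of the idea of adding a fraction of the previous update to the current one, assuming the toy objective `f(w) = w**2`. The names `momentum_step` and `minimize`, and the values `lr=0.1` and `gamma=0.9`, are illustrative choices.

```python
def momentum_step(w, v, grad, lr=0.1, gamma=0.9):
    """One momentum update: keep a fraction gamma of the previous update."""
    v = gamma * v + lr * grad   # decayed previous update plus the new gradient step
    w = w - v
    return w, v

def minimize(steps=300, w0=5.0):
    """Minimize f(w) = w**2 with momentum, starting from w0."""
    w, v = w0, 0.0
    for _ in range(steps):
        grad = 2.0 * w          # gradient of w**2
        w, v = momentum_step(w, v, grad)
    return w

w_final = minimize()
w_one, v_one = momentum_step(1.0, 0.5, 2.0)   # a single step from a known state
```

The velocity `v` is the only extra state that needs to be stored per weight.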

---

## 4. RMSProp (Root Mean Square Propagation)
- **Description:** Adapts the learning rate for each weight by dividing it by the square root of an exponentially decaying average of that weight's squared gradients.
- **Advantage:** Works well for non-stationary problems, because the decaying average forgets old gradients and keeps the effective step size from shrinking toward zero.
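
The per-weight scaling can be sketched as follows. This is an illustrative implementation on plain Python lists; `rmsprop_step` and the hyperparameter values (`lr=0.01`, `rho=0.9`, `eps=1e-8`) are assumptions for the demo, not a library API.

```python
import math

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update over lists of weights, averages, and gradients."""
    # Exponentially decaying average of squared gradients, per weight
    new_s = [rho * si + (1 - rho) * g * g for si, g in zip(s, grad)]
    # Divide each step by the root of that average
    new_w = [wi - lr * g / (math.sqrt(si) + eps)
             for wi, si, g in zip(w, new_s, grad)]
    return new_w, new_s

# Two weights whose gradients differ by six orders of magnitude
w = [0.0, 0.0]
s = [0.0, 0.0]
grad = [1000.0, 0.001]

w, s = rmsprop_step(w, s, grad)
steps = [abs(wi) for wi in w]   # size of the first step for each weight
```

Despite the very different gradient magnitudes, both weights move by nearly the same amount, because each gradient is normalized by its own running scale.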

---

## 5. Adam (Adaptive Moment Estimation)
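- **Description:** Combines **Momentum** and **RMSProp** by tracking exponentially decaying averages of both past gradients (Momentum) and squared gradients (RMSProp), with a bias correction that compensates for initializing both averages at zero.
- **Advantage:** Usually converges quickly with little learning-rate tuning, which makes it a common default for large datasets and models.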
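
A minimal scalar sketch of the Adam update, assuming the toy objective `f(w) = (w - 3)**2`. The `adam_step` and `minimize` names are illustrative, and `lr=0.1` is chosen for this small demo rather than being a recommended default.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # average of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * grad * grad   # average of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def minimize(steps=500, w0=0.0):
    """Minimize f(w) = (w - 3)**2 with Adam."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (w - 3.0)
        w, m, v = adam_step(w, m, v, grad, t)
    return w

w_final = minimize()

# The first step has size close to lr regardless of gradient scale
first_large, _, _ = adam_step(0.0, 0.0, 0.0, 1000.0, t=1)
first_small, _, _ = adam_step(0.0, 0.0, 0.0, 0.001, t=1)
```

The two first-step values show the effect of the normalization: a gradient of 1000 and a gradient of 0.001 both move the weight by roughly `lr`.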

---

## 6. AdaGrad (Adaptive Gradient Algorithm)
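- **Description:** Adjusts the learning rate for each weight using the accumulated sum of its squared gradients, so weights that receive large or frequent gradients take smaller steps and rarely updated weights take larger ones.
- **Drawback:** Because the sum only grows, the effective learning rate keeps shrinking and can become too small for learning to continue.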
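
The shrinking learning rate is easy to see in a sketch. The example below assumes a constant gradient of 1.0 so that the effect is isolated; `adagrad_step` and its defaults are illustrative.

```python
import math

def adagrad_step(w, G, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update for a single weight; G is the running sum of squared gradients."""
    G = G + grad * grad                       # the sum never decreases
    effective_lr = lr / (math.sqrt(G) + eps)  # so this only shrinks
    w = w - effective_lr * grad
    return w, G, effective_lr

w, G = 0.0, 0.0
rates = []
for _ in range(100):
    w, G, rate = adagrad_step(w, G, grad=1.0)
    rates.append(rate)
```

With a constant unit gradient, the effective learning rate after `t` steps is about `lr / sqrt(t)`, so after 100 steps it has fallen to a tenth of its initial value.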

---

## 7. AdaDelta
- **Description:** An improvement on AdaGrad that replaces the ever-growing sum of squared gradients with an exponentially decaying average, which prevents the learning rate from shrinking toward zero.
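
A minimal sketch of an AdaDelta step, contrasted with the AdaGrad accumulator under the same constant gradient. The function name and the values `rho=0.95` and `eps=1e-6` are assumptions for illustration; the update also scales each step by a running average of past updates, which is part of the usual formulation of the method.

```python
import math

def adadelta_step(w, avg_sq_grad, avg_sq_delta, grad, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single weight."""
    # Decaying average of squared gradients (bounded, unlike AdaGrad's sum)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Step scaled by the ratio of past update size to gradient size
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return w + delta, avg_sq_grad, avg_sq_delta

w, avg_sq_grad, avg_sq_delta = 0.0, 0.0, 0.0
adagrad_sum = 0.0   # what AdaGrad would accumulate for the same gradients
for _ in range(200):
    grad = 1.0
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, avg_sq_grad, avg_sq_delta, grad)
    adagrad_sum += grad * grad
```

After 200 steps the decaying average has leveled off near the squared gradient, while the AdaGrad-style sum has grown in proportion to the step count.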
  
---

## 8. Nadam (Nesterov-accelerated Adaptive Moment Estimation)
- **Description:** An extension of Adam that includes **Nesterov momentum**: the momentum term is applied as a look-ahead before the step is taken, which allows for more informed weight updates.
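
The look-ahead can be sketched as a small change to the Adam step. The version below follows a simplified form of the update that is often given in optimizer overviews; library implementations may differ in details such as momentum scheduling. The `nadam_step` name, the toy objective `f(w) = (w - 3)**2`, and `lr=0.1` are assumptions for the demo.

```python
import math

def nadam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Nadam update for a single weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov look-ahead: blend the corrected momentum with the current gradient
    lookahead = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    w = w - lr * lookahead / (math.sqrt(v_hat) + eps)
    return w, m, v

def minimize(steps=500, w0=0.0):
    """Minimize f(w) = (w - 3)**2 with the simplified Nadam step."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (w - 3.0)
        w, m, v = nadam_step(w, m, v, grad, t)
    return w

w_final = minimize()
```

The only difference from a plain Adam step in this sketch is the `lookahead` term, which gives the current gradient extra weight relative to the accumulated momentum.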

---

This list of optimizers highlights the key methods used in deep learning and their respective advantages and drawbacks. Each optimizer has its use cases depending on the problem being solved.

