***
# ***Optimizers in Deep Learning***
***

In deep learning, an optimizer is an algorithm that adjusts a neural network's parameters to minimize the loss function. This process is crucial for the model to learn effectively. It involves making small, incremental changes to the parameters, but finding the right balance is tricky. An optimizer guides the model toward a good solution; without one, the model might struggle to converge or learn effectively. Choosing the right weights by hand is not feasible, as a deep learning model generally consists of millions of parameters.

These specialized algorithms facilitate the learning process of neural networks by iteratively refining the **weights** and **biases** based on the feedback received from the data. Well-known optimizers in deep learning include **Stochastic Gradient Descent (SGD)**, **Adam**, and **RMSprop**, each with **distinct update rules**, **learning rates**, and **momentum strategies**, all aimed at converging on model parameters that improve overall performance.

***
### ***Important Deep Learning Terms***
***

**Before proceeding, there are a few terms that you should be familiar with.**

- `Epoch` – One complete pass of the algorithm over the whole training dataset; the number of epochs is the number of such passes.
- `Sample` – A single row of a dataset.
- `Batch` – A group of samples used together to compute one update of the model parameters; the batch size is the number of samples in the group.
- `Learning rate` – A hyperparameter that scales how much the model weights are updated at each step.
- `Cost Function/Loss Function` – A function that calculates the cost, which measures the difference between the predicted value and the actual value.
- `Weights/Bias` – The learnable parameters in a model that control the signal between two neurons.
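
To see how these terms fit together, here is a small arithmetic sketch. The dataset size, batch size, and epoch count are illustrative assumptions, not values fixed by the definitions above.

```python
import math

num_samples = 54_000   # assumed training-set size (illustrative)
batch_size = 32        # samples used for one parameter update
epochs = 5             # full passes over the training data

# One update per batch; the last batch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 1688
print(total_updates)    # 8440
```

So with these numbers the parameters are updated 1688 times per epoch, and the learning rate scales every one of those updates.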

***

***
### ***Gradient Descent Deep Learning Optimizer***
***

This optimization algorithm uses calculus, specifically the gradient of the loss, to repeatedly adjust the parameter values and move toward a local minimum. Before moving ahead, you might question what a gradient is.

In simple terms, imagine you are holding a ball at the rim of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient points in the direction of steepest ascent, so stepping against it moves the ball along the steepest way down toward the local minimum, which is the bottom of the bowl.

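Gradient descent works well for many purposes. However, it has some downsides too. It is expensive to calculate the gradients over the whole dataset if the size of the data is huge. Gradient descent also works well for convex functions, but on nonconvex functions it can settle in a poor local minimum or stall at a saddle point, and the result depends heavily on how far it travels along the gradient at each step (the learning rate).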
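
The idea can be sketched in a few lines of plain Python. This is a minimal illustration on an assumed one-parameter convex loss, `(w - 3)^2`; the function names, learning rate, and step count are hypothetical choices for the example, not part of any library.

```python
def loss(w):
    """Toy convex loss with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


w_final = gradient_descent(w0=0.0)
print(f"w = {w_final:.6f}, loss = {loss(w_final):.2e}")
```

On this convex toy problem the iterate settles at the bottom of the "bowl" near `w = 3`. With a real model, `grad` would be computed over the entire dataset, which is the expensive part.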

***

***
### ***Stochastic Gradient Descent Deep Learning Optimizer***
***

To tackle the challenges large datasets pose, we have stochastic gradient descent, a popular approach among optimizers in deep learning. The term stochastic denotes the element of randomness upon which the algorithm relies. In stochastic gradient descent, instead of processing the entire dataset during each iteration, we randomly select a single sample or a small batch of data. This means that only a few samples from the dataset are considered at a time, which makes optimization in deep learning models more efficient and computationally feasible.

$$
W = W - \eta \cdot \nabla L(W)
$$

The procedure is to first select the initial parameters W and the learning rate η. Then, at each epoch, randomly shuffle the data and apply the update above batch by batch until an approximate minimum is reached.

Since each iteration uses a batch of the data rather than the whole dataset, the path taken by the algorithm is noisier than that of the gradient descent algorithm. SGD therefore needs a higher number of iterations to reach a local minimum. Each iteration is much cheaper, however, so even with more iterations the overall computation cost is usually still less than that of the gradient descent optimizer. The conclusion is that if the data is enormous and computational time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.
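
Here is a minimal sketch of that procedure, assuming a toy linear-regression problem with data generated from `y = 2x + 1` and a squared-error loss. The dataset, the `sgd` function, and its hyperparameters are illustrative assumptions; each update uses one randomly ordered sample.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy dataset generated from y = 2x + 1 (an assumption for illustration)
data = [(x, 2.0 * x + 1.0) for x in (i / 25.0 - 1.0 for i in range(51))]


def sgd(samples, learning_rate=0.1, epochs=200):
    """Fit y = w*x + b with one-sample-at-a-time SGD on 0.5 * error**2."""
    samples = list(samples)  # copy so shuffling does not touch the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(samples)  # new random order every epoch
        for x, y in samples:
            error = (w * x + b) - y         # prediction minus target
            w -= learning_rate * error * x  # gradient of the loss w.r.t. w
            b -= learning_rate * error      # gradient of the loss w.r.t. b
    return w, b


w, b = sgd(data)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Each update touches a single sample, so it is cheap, but the direction it takes is only a noisy estimate of the full-dataset gradient.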

***

***
### ***Stochastic Gradient Descent With Momentum Deep Learning Optimizer***
***

*Stochastic Gradient Descent (SGD) with Momentum is an advanced version of the basic SGD optimizer. It aims to speed up the learning process by considering the previous updates while computing the current update. Here’s how it works:*

**Stochastic Gradient Descent (SGD):**

Imagine you're trying to roll a ball to the lowest point in a hilly terrain. With basic SGD, you take small steps in the direction of the steepest descent. However, this process can be slow and can get stuck in local minima (small dips that are not the lowest point).

**Adding Momentum:**

Momentum helps overcome these issues by adding a fraction of the previous update to the current update. This means the optimizer not only relies on the current gradient but also takes into account the direction of the previous step.

Think of it as adding inertia to the ball, so it keeps rolling in the same direction unless a significant force (gradient) changes its path.

**Benefits:**

Faster Convergence: By carrying forward the inertia, it helps the optimizer move faster towards a minimum.

Smoother Updates: It reduces oscillations and smooths out the updates, leading to more stable training.

Here's a simplified formula for SGD with Momentum:

$$ v_{t+1} = \beta v_t - \alpha \nabla f(\theta_t) $$

$$ \theta_{t+1} = \theta_t + v_{t+1} $$

Where:

- $v_t$ is the velocity (momentum) at time step t

- $\beta$ is the momentum factor (typically between 0.9 and 0.99)

- $\alpha$ is the learning rate

- $\nabla f(\theta_t)$ is the gradient of the loss function with respect to the parameters at time step t

- $\theta_t$ are the parameters at time step t

In essence, SGD with Momentum helps navigate the optimization landscape more efficiently, ensuring that the model learns quickly and effectively.
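
The two update equations translate almost line for line into code. The sketch below is an illustration on an assumed toy loss `f(theta) = (theta - 3)^2` with a deliberately small learning rate; the function name and hyperparameter values are hypothetical. Setting `beta=0` turns the same loop into plain gradient descent, which gives a simple comparison.

```python
def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2 (assumed for illustration)."""
    return 2.0 * (theta - 3.0)


def sgd_momentum(theta0, alpha=0.01, beta=0.9, steps=100):
    """Apply v <- beta*v - alpha*grad and theta <- theta + v for a number of steps."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(theta)  # velocity keeps a fraction of the last update
        theta = theta + v
    return theta


theta_plain = sgd_momentum(0.0, beta=0.0)      # beta = 0 reduces to plain gradient descent
theta_momentum = sgd_momentum(0.0, beta=0.9)

print(f"distance to minimum without momentum: {abs(theta_plain - 3.0):.4f}")
print(f"distance to minimum with momentum:    {abs(theta_momentum - 3.0):.4f}")
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimum on this toy problem, because consecutive gradients point the same way and their contributions add up.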

***

***
### ***Mini Batch Gradient Descent Deep Learning Optimizer***
***

*Mini Batch Gradient Descent is a popular optimization technique in deep learning that strikes a balance between two other methods: Stochastic Gradient Descent (SGD) and Batch Gradient Descent.*

**Gradient Descent Methods:**

`Batch Gradient Descent:` This method computes the gradient of the loss function with respect to the parameters for the entire dataset. While it's accurate, it can be very slow and computationally expensive, especially with large datasets.

`Stochastic Gradient Descent (SGD):` This method updates the parameters using the gradient computed from a single data point at each iteration. It's faster but can be noisy and less stable.

`Mini Batch Gradient Descent:` This method takes the best of both worlds by updating the parameters using the gradient computed from a small, random subset of the dataset called a mini-batch. It offers a compromise between the stability of batch gradient descent and the speed of stochastic gradient descent.

**Advantages of Mini Batch Gradient Descent:**

`Efficiency:` It uses mini-batches, which fit well with modern hardware like GPUs, making the training process faster.

`Stability:` By averaging the gradients over a mini-batch, it reduces the noise compared to SGD, leading to more stable updates.

`Scalability:` It works well with large datasets, as it does not require loading the entire dataset into memory at once.

**Here's a simplified outline of how Mini Batch Gradient Descent works:**

- Randomly shuffle the dataset.

- Divide the dataset into mini-batches of a fixed size.

- For each mini-batch, compute the gradient of the loss function with respect to the parameters.

- Update the parameters using the average gradient of the mini-batch.

- Repeat the process for a set number of iterations or until the model converges.

In essence, Mini Batch Gradient Descent helps neural networks learn more efficiently by making smart compromises between speed and accuracy.
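
The outline above can be sketched as follows, again on an assumed toy regression problem with data from `y = 2x + 1`. The `minibatch_gd` function, the batch size of 8, and the other hyperparameters are illustrative choices only.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy dataset generated from y = 2x + 1 (an assumption for illustration)
data = [(x, 2.0 * x + 1.0) for x in (i / 25.0 - 1.0 for i in range(51))]


def minibatch_gd(samples, batch_size=8, learning_rate=0.1, epochs=300):
    """Fit y = w*x + b using the average gradient of each mini-batch."""
    samples = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(samples)                           # step 1: shuffle
        for start in range(0, len(samples), batch_size):  # step 2: split into batches
            batch = samples[start:start + batch_size]
            # step 3: average gradient of 0.5 * error**2 over the batch
            grad_w = sum(((w * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(((w * x + b) - y) for x, y in batch) / len(batch)
            # step 4: update the parameters with the averaged gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


w, b = minibatch_gd(data)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Averaging over the batch is what smooths the updates; a batch size of 1 would recover single-sample SGD, and a batch size equal to the dataset size would recover batch gradient descent.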

***

***
### ***Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer***
***

*Adagrad (Adaptive Gradient Descent) is a popular optimization algorithm in deep learning that adapts the learning rate for each parameter based on its historical gradients. This means that parameters that frequently receive large gradients get smaller learning rates, and parameters that receive large gradients only infrequently keep larger learning rates. Here's a simplified explanation:*

**Key Concept:**

Adagrad updates the learning rate for each parameter dynamically, which helps the model converge more quickly and efficiently.

**How Adagrad Works:**

`Initialization:` Like other optimizers, Adagrad starts by initializing the parameters and setting a learning rate.

`Gradient Accumulation:` It keeps a running total of the squared gradients for each parameter.

`Adaptive Learning Rate:` The learning rate for each parameter is adjusted based on the accumulated squared gradients. Parameters that have received large updates in the past will have smaller learning rates in the future.

`Formula:` The update rule for Adagrad can be represented as:

$$ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \nabla f(\theta_t) $$

Where:

- $\theta_t$ are the parameters at time step t

- $\alpha$ is the initial learning rate

- $G_{t,ii}$ is the sum of the squares of the gradients up to time step t for parameter $i$

- $\epsilon$ is a small smoothing term to avoid division by zero

- $\nabla f(\theta_t)$ is the gradient of the loss function with respect to the parameters at time step t

**Advantages:**

`Automatic Learning Rate Adjustment:` Adagrad adjusts the learning rate for each parameter, making it more efficient and reducing the need to manually tune the learning rate.

`Effective for Sparse Data:` It's particularly useful for models with sparse data or features, as it ensures that rare but important features are given appropriate attention during training.

**Disadvantages:**

`Decaying Learning Rate:` Over time, the learning rate can become very small, causing the model to stop learning. This can sometimes be mitigated by using variants like Adadelta or RMSprop.

In essence, Adagrad helps optimize the training process by dynamically adjusting the learning rates of parameters, ensuring efficient and effective convergence.
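
As a rough illustration of the Adagrad update rule, here is a plain-Python sketch on an assumed two-parameter quadratic whose second parameter has much larger gradients than the first. The loss, the `adagrad` function, and the hyperparameters are hypothetical choices for the example.

```python
import math


def grad(theta):
    """Gradient of f(theta) = (theta[0] - 3)^2 + 10 * (theta[1] + 1)^2 (toy loss)."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


def adagrad(theta0, alpha=0.5, eps=1e-8, steps=500):
    """Per-parameter Adagrad: divide the step by the root of the summed squared gradients."""
    theta = list(theta0)
    G = [0.0] * len(theta)  # running sum of squared gradients, one entry per parameter
    for _ in range(steps):
        g = grad(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            theta[i] -= alpha / math.sqrt(G[i] + eps) * g[i]
    return theta, G


theta, G = adagrad([0.0, 0.0])
effective_lr = [0.5 / math.sqrt(G_i + 1e-8) for G_i in G]  # 0.5 is the alpha used above

print("parameters:", [round(p, 4) for p in theta])
print("effective learning rates:", [round(lr, 4) for lr in effective_lr])
```

The parameter with the larger gradients accumulates a larger `G` and therefore ends up with the smaller effective learning rate. Because `G` only ever grows, the effective rates can only shrink, which is the decaying learning rate drawback mentioned above.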

***

***
### ***RMSprop (Root Mean Square Propagation) Deep Learning Optimizer***
***

*RMSprop (Root Mean Square Propagation) is another popular optimization algorithm used in deep learning that improves upon Adagrad by addressing its main drawback: the decaying learning rate. RMSprop adapts the learning rate for each parameter by dividing the gradient by an exponentially decaying average of past squared gradients.*

**Key Concept:**

RMSprop maintains a moving average of the squared gradients for each parameter, allowing the learning rate to adapt over time without decaying too quickly.

**How RMSprop Works:**

`Initialization:` Like other optimizers, RMSprop starts by initializing the parameters and setting a learning rate.

`Gradient Squaring:` It computes the square of the gradients for each parameter.

`Moving Average:` It maintains an exponentially decaying average of the squared gradients.

`Adaptive Learning Rate:` The learning rate for each parameter is adjusted based on the moving average, preventing the learning rate from becoming too small.

`Formula:` The update rule for RMSprop can be represented as:

$$ E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 $$

$$ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} g_t $$

Where:

- $E[g^2]_t$ is the exponentially decaying average of past squared gradients at time step t

- $\rho$ is the decay rate (typically around 0.9)

- $g_t$ is the gradient of the loss function with respect to the parameters at time step t

- $\alpha$ is the learning rate

- $\epsilon$ is a small smoothing term to avoid division by zero

- $\theta_t$ are the parameters at time step t

**Advantages:**

`Prevents Decay:` By using an exponentially decaying average, RMSprop prevents the learning rate from becoming too small, allowing the model to continue learning effectively.

`Efficient:` It works well with mini-batches, making it suitable for large datasets and modern hardware like GPUs.

**Disadvantages:**

`Hyperparameter Tuning:` The decay rate ($\rho$) and learning rate ($\alpha$) may require tuning for optimal performance.

In summary, RMSprop is a powerful optimizer that adapts the learning rate for each parameter based on the moving average of past squared gradients, ensuring efficient and stable training in deep learning models.
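
The RMSprop update can be sketched in the same style. This is a minimal illustration on an assumed one-parameter loss `(theta - 3)^2`; the `rmsprop` function and its default values are hypothetical, chosen to mirror the symbols in the formula.

```python
import math


def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2 (assumed for illustration)."""
    return 2.0 * (theta - 3.0)


def rmsprop(theta0, alpha=0.01, rho=0.9, eps=1e-8, steps=2000):
    """Scale each step by the root of a decaying average of squared gradients."""
    theta = theta0
    avg_sq = 0.0  # E[g^2]: exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        theta -= alpha / math.sqrt(avg_sq + eps) * g
    return theta


theta_final = rmsprop(0.0)
print(f"theta = {theta_final:.4f}")
```

Because old squared gradients are forgotten at rate `rho`, the denominator can shrink again when gradients become small, so the step size does not decay to zero the way it does in Adagrad. With a constant `alpha`, the iterate ends up hovering near the minimum rather than landing on it exactly.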

***

***
### ***AdaDelta Deep Learning Optimizer***
***

*AdaDelta is an advanced optimization algorithm that improves upon the limitations of Adagrad. Its main goal is to reduce the aggressive, monotonically decreasing learning rate problem of Adagrad, which can sometimes lead to very small updates and slow learning. AdaDelta achieves this by incorporating only a window of accumulated past gradients instead of all past gradients, and it dynamically adapts the learning rate for each parameter.*

**Key Concepts:**

`Adaptive Learning Rate:` AdaDelta adjusts the learning rate based on a moving window of gradient updates.

`No Manual Learning Rate:` Unlike many other optimizers, AdaDelta does not require a manually set learning rate ($\alpha$).

**How AdaDelta Works:**

`Accumulated Gradient:` It keeps track of an exponentially decaying average of squared gradients.

`Update Rule:` Instead of multiplying by a fixed learning rate, AdaDelta scales each gradient by the ratio of the root mean square of recent parameter updates to the root mean square of recent gradients, and updates the parameters accordingly.

`Dynamic Learning Rate:` The learning rate is adjusted dynamically for each parameter without needing to set an initial learning rate.

`Formula:` The update rule for AdaDelta can be represented as:

$$ E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 $$

$$ \Delta \theta_t = -\frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t $$

$$ \theta_{t+1} = \theta_t + \Delta \theta_t $$

$$ E[\Delta \theta^2]_t = \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) \Delta \theta_t^2 $$

Where:

- $E[g^2]_t$ is the exponentially decaying average of past squared gradients at time step t

- $\Delta \theta_t$ is the parameter update at time step t

- $E[\Delta \theta^2]_t$ is the exponentially decaying average of past squared updates

- $\rho$ is the decay rate (typically around 0.95)

- $g_t$ is the gradient of the loss function with respect to the parameters at time step t

- $\epsilon$ is a small smoothing term to avoid division by zero

- $\theta_t$ are the parameters at time step t

**Advantages:**

`No Manual Learning Rate:` It eliminates the need to manually set a learning rate, simplifying the training process.

`Adaptability:` The dynamic adjustment of learning rates allows for efficient training and quick convergence.

**Disadvantages:**

`Hyperparameter Tuning:` The decay rate ($\rho$) and smoothing term ($\epsilon$) may still require some tuning for optimal performance.

In summary, AdaDelta is a robust optimizer that adapts the learning rate dynamically, ensuring efficient and stable training in deep learning models without the need for a manually set learning rate.
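
The four AdaDelta equations can be followed step by step in code. The sketch below assumes the same toy loss `(theta - 3)^2` used for illustration, with both running averages starting at zero; the `adadelta` function and its default values are hypothetical.

```python
import math


def loss(theta):
    """Toy loss with its minimum at theta = 3 (assumed for illustration)."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Gradient of the toy loss."""
    return 2.0 * (theta - 3.0)


def adadelta(theta0, rho=0.95, eps=1e-6, steps=300):
    """AdaDelta: no learning rate, only a decay rate and a smoothing term."""
    theta = theta0
    avg_sq_grad = 0.0    # E[g^2]
    avg_sq_update = 0.0  # E[delta_theta^2]
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # ratio of RMS of past updates to RMS of gradients replaces the learning rate
        delta = -math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        theta += delta
        avg_sq_update = rho * avg_sq_update + (1.0 - rho) * delta * delta
    return theta


theta_final = adadelta(0.0)
print(f"loss before: {loss(0.0):.4f}, loss after: {loss(theta_final):.4f}")
```

Note that no `alpha` appears anywhere. Since the average of squared updates starts at zero, the first steps are on the order of `sqrt(eps)`, so in this sketch the loss goes down steadily but the minimum is not reached within the 300 steps shown.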

***

***
### ***Adam Optimizer in Deep Learning***
***

*Adam (short for Adaptive Moment Estimation) is a widely used and highly effective optimization algorithm in deep learning. It combines the best features of two other optimizers, AdaGrad and RMSProp, and is well-suited for a variety of deep learning tasks. Here's a breakdown of how Adam works and why it's so popular:*

**Key Concepts:**

`Adaptive Learning Rates:` Adam adjusts the learning rate for each parameter individually based on estimates of first and second moments of the gradients.

`Momentum:` It incorporates the concept of momentum to improve convergence speed and stability.

**How Adam Works:**

`Initialization:` Adam starts by initializing parameters and setting up learning rates, along with two additional parameters for controlling the moving averages of the gradients.

`First Moment Estimate (Mean):` Adam maintains an exponentially decaying average of past gradients (mean of the gradients).

`Second Moment Estimate (Variance):` Adam also keeps an exponentially decaying average of past squared gradients (uncentered variance).

`Bias Correction:` To correct the bias introduced during the initialization, Adam applies bias correction to the first and second moment estimates.

`Parameter Update:` Finally, it updates the parameters using the corrected moment estimates.

`Formula:` The update rule for Adam can be represented as:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$

$$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$

$$ \theta_{t+1} = \theta_t - \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

Where:

- $m_t$ is the first moment (mean) estimate at time step t

- $v_t$ is the second moment (uncentered variance) estimate at time step t

- $\beta_1$ and $\beta_2$ are decay rates for the moment estimates (typically around 0.9 and 0.999, respectively)

- $g_t$ is the gradient of the loss function with respect to the parameters at time step t

- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates

- $\alpha$ is the learning rate

- $\epsilon$ is a small smoothing term to avoid division by zero

- $\theta_t$ are the parameters at time step t

**Advantages:**

`Efficient:` Adam combines the advantages of both AdaGrad and RMSProp, making it suitable for a wide range of problems.

`Adaptive:` It adapts the learning rate for each parameter, leading to faster convergence and improved performance.

`Robust:` Adam is known for its robustness and ability to handle sparse gradients and noisy data.

**Disadvantages:**

`Hyperparameter Sensitivity:` While Adam typically works well out-of-the-box, fine-tuning the hyperparameters (learning rate, decay rates) can sometimes be necessary for optimal performance.

In summary, Adam is a powerful and versatile optimizer that adapts the learning rates for each parameter based on past gradient estimates, making it an excellent choice for training deep learning models.
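
To make the steps concrete, here is a plain-Python sketch of the Adam update on an assumed one-parameter loss `(theta - 3)^2`. The `adam` function is a hypothetical illustration, not a library API; the learning rate of 0.1 is chosen for this toy problem, while the decay rates follow the typical values listed above.

```python
import math


def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2 (assumed for illustration)."""
    return 2.0 * (theta - 3.0)


def adam(theta0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: momentum-style first moment plus RMSprop-style second moment."""
    theta = theta0
    m, v = 0.0, 0.0  # first and second moment estimates start at zero
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_final = adam(0.0)
print(f"theta = {theta_final:.4f}")
```

The bias correction matters mostly in the first few steps, when `m` and `v` are still close to their zero initialization. Without it, the early moment estimates would be too small.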

***

***
### ***Hands-on Optimizers***
***

```python
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

# Load the MNIST dataset and scale pixel values to the range [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create a simple neural network model: flatten each 28x28 image,
# one hidden layer, and a 10-way softmax output
model = Sequential([
    Input(shape=(28, 28)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Optimizers to compare
optimizers = {
    "SGD": SGD(learning_rate=0.01),
    "SGD with Momentum": SGD(learning_rate=0.01, momentum=0.9),
    "Adam": Adam(learning_rate=0.001),
    "RMSprop": RMSprop(learning_rate=0.001)
}

# Train and evaluate the model with each optimizer.
# Note: the same model instance is reused, so each optimizer continues from
# the weights left by the previous run instead of a fresh initialization.
for name, optimizer in optimizers.items():
    print(f"Training with {name} optimizer")
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1, verbose=2)
    loss, accuracy = model.evaluate(x_test, y_test, verbose=2)
    print(f"Test accuracy with {name}: {accuracy}\n")
```




Training with SGD optimizer
Epoch 1/5
1688/1688 - 10s - 6ms/step - accuracy: 0.8240 - loss: 0.6885 - val_accuracy: 0.9153 - val_loss: 0.3240
Epoch 2/5
1688/1688 - 8s - 5ms/step - accuracy: 0.9025 - loss: 0.3522 - val_accuracy: 0.9265 - val_loss: 0.2607
Epoch 3/5
1688/1688 - 8s - 5ms/step - accuracy: 0.9158 - loss: 0.3014 - val_accuracy: 0.9373 - val_loss: 0.2319
Epoch 4/5
1688/1688 - 7s - 4ms/step - accuracy: 0.9241 - loss: 0.2707 - val_accuracy: 0.9422 - val_loss: 0.2116
Epoch 5/5
1688/1688 - 8s - 5ms/step - accuracy: 0.9304 - loss: 0.2476 - val_accuracy: 0.9473 - val_loss: 0.1955
313/313 - 1s - 4ms/step - accuracy: 0.9339 - loss: 0.2284
Test accuracy with SGD: 0.933899998664856

Training with SGD with Momentum optimizer
Epoch 1/5
1688/1688 - 10s - 6ms/step - accuracy: 0.9401 - loss: 0.2047 - val_accuracy: 0.9630 - val_loss: 0.1340
Epoch 2/5
1688/1688 - 8s - 5ms/step - accuracy: 0.9611 - loss: 0.1336 - val_accuracy: 0.9712 - val_loss: 0.1016
Epoch 3/5
1688/1688 - 8s - 5ms/step - accur