# Gradient Descent & Optimization

Gradient descent is a fundamental optimization algorithm used in training neural networks. It updates the model's parameters (weights and biases) to minimize the loss function. This section covers gradient descent and the optimization techniques built on it, with code examples.

### 1. Gradient Descent:

Gradient descent is an iterative optimization algorithm used to minimize a differentiable loss function by adjusting the model's parameters in the direction of the steepest decrease in the loss. The primary steps of gradient descent are as follows:

* Initialize model parameters randomly or with predefined values.
* Compute the gradient of the loss function with respect to each parameter.
* Update the parameters by moving in the opposite direction of the gradient.
* Repeat the process until convergence or for a fixed number of iterations.


In [1]:
# Initialize parameters and hyperparameters
learning_rate = 0.1
iterations = 100
theta = 0  # parameter to be updated
loss_history = []

def loss_function(theta):
    return (theta - 3) ** 2

# Gradient descent loop
for i in range(iterations):
    # Compute the gradient of the loss with respect to theta
    gradient = 2 * (theta - 3)

    # Update theta using the gradient and learning rate
    theta -= learning_rate * gradient

    # Compute the loss and store it for analysis
    current_loss = loss_function(theta)
    loss_history.append(current_loss)

print("Optimal Theta: ", theta)

Optimal Theta:  2.9999999993888893


This is a simple illustration of gradient descent, where we update a single parameter `theta` to minimize a quadratic loss function.
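For this particular loss the result can be checked by hand. Each update multiplies the distance to the minimum by `1 - 2 * learning_rate`, so after `k` steps `theta` should equal `3 - 3 * (1 - 2 * learning_rate) ** k`. The sketch below compares the loop against that closed form; the name `closed_form` is introduced here for illustration only.

```python
learning_rate = 0.1
iterations = 100

# Same loop as above, starting from theta = 0
theta = 0.0
for _ in range(iterations):
    theta -= learning_rate * 2 * (theta - 3)

# Closed form: the error (theta - 3) shrinks by (1 - 2 * learning_rate) per step
closed_form = 3 - 3 * (1 - 2 * learning_rate) ** iterations

print(theta, closed_form)
```

The same factor suggests why the learning rate matters: for this loss, any `learning_rate` above 1.0 makes `abs(1 - 2 * learning_rate)` larger than 1, and the iterates move away from the minimum instead of toward it.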

### 2. Optimization Techniques:

Gradient descent has several variations and optimization techniques to improve convergence and speed up training. Here are some commonly used optimization algorithms:


**Stochastic Gradient Descent (SGD)**: Updates the parameters using a single randomly chosen data point at a time (in practice the name is also used for small batches). Each update is cheap to compute but noisy.

**Mini-Batch Gradient Descent**: A compromise between SGD and batch gradient descent (which uses the entire dataset for every update), where you update the parameters using a small batch of data points.
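The difference between these variants comes down to how many data points feed each update. Here is a minimal sketch in plain Python, assuming a made-up dataset drawn from `y = 2x + 1` plus noise and a hypothetical helper `minibatch_sgd`; setting `batch_size=1` gives SGD and `batch_size=len(data)` gives full-batch gradient descent.

```python
import random

random.seed(0)

# Hypothetical synthetic data: y = 2x + 1 plus a little Gaussian noise
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in xs]

def minibatch_sgd(data, batch_size, lr=0.1, epochs=100):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    data = list(data)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w_sgd, b_sgd = minibatch_sgd(data, batch_size=1)            # SGD
w_mb, b_mb = minibatch_sgd(data, batch_size=32)             # mini-batch
w_full, b_full = minibatch_sgd(data, batch_size=len(data))  # full batch

print(w_sgd, b_sgd)
print(w_mb, b_mb)
print(w_full, b_full)
```

All three runs should land close to `w = 2` and `b = 1`; the single-sample run typically shows the most jitter around those values because each update sees only one noisy point.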

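**Momentum**: Adds a velocity term that accumulates a decaying sum of past gradients. This dampens oscillations, accelerates convergence along directions where the gradient is consistent, and can help the optimizer roll through shallow local minima.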
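Momentum tends to pay off when the loss is much steeper in one direction than another. The sketch below assumes a hypothetical two-parameter loss `0.5 * (x**2 + 50 * y**2)` and a helper `run` introduced here for illustration; it compares plain gradient descent (`beta=0.0`) with momentum (`beta=0.9`) at the same learning rate.

```python
def grad(x, y):
    # Gradient of the hypothetical loss 0.5 * (x**2 + 50 * y**2)
    return x, 50.0 * y

def loss(x, y):
    return 0.5 * (x ** 2 + 50.0 * y ** 2)

def run(beta, lr=0.02, steps=200):
    x, y = 1.0, 1.0    # starting point
    vx, vy = 0.0, 0.0  # velocity, one entry per parameter
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity is a decaying sum of past gradients
        vx = beta * vx + gx
        vy = beta * vy + gy
        # Step along the velocity instead of the raw gradient
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)

loss_plain = run(beta=0.0)     # beta = 0 reduces to plain gradient descent
loss_momentum = run(beta=0.9)

print(loss_plain, loss_momentum)
```

The learning rate has to stay small because of the steep `y` direction, which leaves plain gradient descent crawling along the flat `x` direction; the velocity term builds up speed there. On a well-conditioned one-parameter loss such as `(theta - 3) ** 2`, momentum mostly adds overshoot, so the benefit depends on the problem.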

**Adagrad**: Adjusts the learning rate for each parameter based on its accumulated squared gradients, so parameters that have received large gradients take smaller steps.
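A minimal sketch of the Adagrad update on the loss `(theta - 3) ** 2` from the first example. The variable names (`grad_sq_sum`, `effective_lrs`) and the hyperparameter values are chosen here for illustration, not taken from any library.

```python
import math

def grad(theta):
    return 2 * (theta - 3)  # gradient of (theta - 3) ** 2

lr, eps = 1.0, 1e-8
theta = 0.0
grad_sq_sum = 0.0   # running sum of squared gradients
effective_lrs = []  # step-size multiplier actually applied at each step

for _ in range(100):
    g = grad(theta)
    grad_sq_sum += g * g
    scale = lr / (math.sqrt(grad_sq_sum) + eps)
    effective_lrs.append(scale)
    theta -= scale * g

print(theta)
print(effective_lrs[0], effective_lrs[-1])
```

Because `grad_sq_sum` never decreases, the effective learning rate can only shrink over time. That is helpful early on but can stall training on long runs.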

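**RMSprop**: Similar to Adagrad but uses an exponentially decaying moving average of squared gradients instead of their running sum, which keeps the effective learning rate from shrinking toward zero over long training runs.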
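The RMSprop update differs from Adagrad in one line: the running sum of squared gradients becomes an exponentially decaying average. A minimal sketch on the same `(theta - 3) ** 2` loss, with illustrative names (`decay`, `avg_sq`) and hyperparameter values that are assumptions rather than recommended settings:

```python
import math

def grad(theta):
    return 2 * (theta - 3)  # gradient of (theta - 3) ** 2

lr, decay, eps = 0.01, 0.9, 1e-8
theta = 0.0
avg_sq = 0.0         # decaying average of squared gradients
first_avg_sq = None  # value after the first step, kept for comparison

for step in range(500):
    g = grad(theta)
    # Old history fades out instead of accumulating forever
    avg_sq = decay * avg_sq + (1 - decay) * g * g
    if step == 0:
        first_avg_sq = avg_sq
    theta -= lr * g / (math.sqrt(avg_sq) + eps)

print(theta)
print(first_avg_sq, avg_sq)
```

Since `avg_sq` can fall again once gradients get small, the step size is not forced toward zero the way it is in Adagrad. With a fixed `lr`, `theta` usually ends up hovering within a few multiples of `lr` of the minimum rather than settling exactly on it.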

**Adam (Adaptive Moment Estimation)**: Combines the advantages of both momentum and RMSprop and is one of the most widely used optimization algorithms in deep learning.

In PyTorch, these optimization algorithms are available through the `torch.optim` module. Here's an example of using the Adam optimizer in PyTorch:

In [2]:
import torch
import torch.optim as optim

# Define model parameter and loss function
theta = torch.tensor(0.0, requires_grad=True)
loss_function = torch.nn.MSELoss()

# Create an Optimizer with a learning rate 
optimizer = optim.Adam([theta], lr=0.01)

# Training Loop
for i in range(iterations):  # iterations = 100, defined in the first cell
    optimizer.zero_grad()  # Reset gradients left over from the previous step
    loss = loss_function(theta, torch.tensor(3.0))  # Compute the loss
    loss.backward()  # Compute the gradient of the loss with respect to theta
    optimizer.step()  # Update theta

print("Optimal Theta: ", theta.item())

Optimal Theta:  0.9375530481338501


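In this example, we used the `optim.Adam` optimizer to update the parameter `theta` to minimize a mean squared error (MSE) loss function. Note that `theta` ends near 0.94 rather than at the minimum of 3: with `lr=0.01`, each Adam step moves `theta` by roughly the learning rate at most, so 100 iterations cover less than 1.0 of the distance. More iterations or a larger learning rate would be needed to converge.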
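To see what Adam does on each step, here is a hand-written sketch of the update rule in plain Python, applied to the equivalent loss `(theta - 3) ** 2`. The function name `adam` is hypothetical; the defaults for `beta1`, `beta2`, and `eps` follow the commonly used values, and the second call's settings are an assumption chosen only to show convergence.

```python
import math

def adam(lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (theta - 3)                  # gradient of (theta - 3) ** 2
        m = beta1 * m + (1 - beta1) * g      # decaying average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)         # correct the bias from zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

short_run = adam(lr=0.01, steps=100)  # same settings as the PyTorch cell
long_run = adam(lr=0.1, steps=1000)   # larger steps and more of them

print(short_run, long_run)
```

The ratio `m_hat / sqrt(v_hat)` stays close to 1 in magnitude while the gradient keeps the same sign, so each step is about `lr` long regardless of how large the gradient is. That is why the short run stops just under 1.0, while the second configuration has enough room to reach the minimum at 3.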