### Neural Networks
A neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer holds the features of the dataset, and the output layer produces the final predictions. Hidden layers lie in between and add complexity to the model. A network without hidden layers is equivalent to a linear model, which makes it a useful starting point for understanding the basics.

![Neural Networks](imgs/nn.png)

#### Linear Layer

In PyTorch, the `torch.nn` module (commonly imported as `nn`) provides tools to define neural network layers concisely. A basic linear layer is created using `nn.Linear(in_features, out_features)`, where `in_features` is the number of input features and `out_features` is the size of the desired output. For example, a tensor of shape (1, 3) represents a batch containing a single input with three features. A 1-D tensor of shape (3,) is also accepted and is treated as one unbatched sample.

When this input tensor is passed through a linear layer, the model performs a linear operation that includes learned weights and biases. Weights determine the importance of each input feature, while biases allow the model to make predictions even when input values are zero. Initially, weights and biases are assigned randomly and are optimized during training.

In practice, for a dataset with features like temperature, humidity, and wind, the model can learn to assign more weight to humidity if it's a strong predictor of rain. Additionally, a positive bias might be added if the data comes from a region with a high baseline probability of rain. This combination of learned weights and biases allows the model to make informed predictions.
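
To make the role of weights and biases concrete, the sketch below reproduces the output of a linear layer by hand. The feature values are made up for illustration, and the fixed seed is only there so the random initial weights are reproducible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are reproducible

# One sample with three features (temperature, humidity, wind); values are made up
x = torch.tensor([[20.0, 0.8, 5.0]])

layer = nn.Linear(in_features=3, out_features=2)

# nn.Linear stores its weight with shape (out_features, in_features)
# and its bias with shape (out_features,)
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(x), manual))
```

Each output value is a weighted sum of the three inputs plus one bias, which is exactly what the layer computes internally.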

![Linear Neural Network](imgs/linear.png)

In [6]:
import torch.nn as nn
import torch

input_tensor = torch.tensor([1.0, 2.0, 3.0])

linear_layer = nn.Linear(in_features=3, out_features=2)

output_tensor = linear_layer(input_tensor)

print(output_tensor)

tensor([-0.0955,  1.9813], grad_fn=<ViewBackward0>)


### Building Deeper Neural Networks

#### Adding Hidden Layers with `nn.Sequential`
To model more complex patterns in data, neural networks can include multiple hidden layers. These are stacked using PyTorch’s `nn.Sequential()` container, which allows layers to be executed in order. For example, a network might include three layers: the first transforms the input features, the second acts as a hidden layer, and the final one produces the output predictions.

#### Matching Dimensions Between Layers
Each layer in the sequence must be dimensionally compatible with the previous one. That means the output dimension of one layer must match the input dimension of the next. For instance, if the first layer outputs 18 values, the next must take 18 as input. A valid sequence could look like:
- Input: 10 features → Layer 1: output 18
- Layer 2: input 18 → output 20
- Layer 3: input 20 → output 5 (final predictions)
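
The dimension chain above can be sketched as follows. The batch of four random samples is an assumption made only to show the resulting shape.

```python
import torch
import torch.nn as nn

# Each layer's input size matches the previous layer's output size
model = nn.Sequential(
    nn.Linear(10, 18),
    nn.Linear(18, 20),
    nn.Linear(20, 5),
)

batch = torch.randn(4, 10)  # 4 samples with 10 features each (random data)
output = model(batch)
print(output.shape)  # torch.Size([4, 5])
```

If two adjacent sizes did not match, for example `nn.Linear(10, 18)` followed by `nn.Linear(20, 5)`, the forward pass would raise a shape error.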

#### Neurons and Parameters
Each linear layer is composed of neurons, and each neuron connects to all outputs from the previous layer—making it fully connected. A neuron in such a layer has `N + 1` parameters: `N` weights (one for each input feature) and one bias. The total number of parameters in a layer depends on the number of neurons and the size of the input.

#### Understanding Model Capacity
The number of hidden layers and neurons directly affects the model’s **capacity**—its ability to learn from complex data. For example:
- A layer with 4 neurons and 8 inputs has: 4 × (8 weights + 1 bias) = 36 parameters.
- A second layer with 2 neurons and 4 inputs has: 2 × (4 weights + 1 bias) = 10 parameters.
- Total model parameters = 36 + 10 = 46.

This parameter count can also be computed programmatically in PyTorch using the `.numel()` method, which returns the total number of elements in a tensor. Summing over all `.parameters()` in the model yields the total number of learnable parameters.
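
As a quick check of the arithmetic above, this sketch builds the two-layer model from the example and counts its parameters per layer and in total.

```python
import torch.nn as nn

# 8 inputs -> 4 neurons -> 2 neurons, as in the worked example
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))

# Parameters per layer: weights plus biases
per_layer = [sum(p.numel() for p in layer.parameters()) for layer in model]

# Total number of learnable parameters
total = sum(p.numel() for p in model.parameters())

print(per_layer, total)  # [36, 10] 46
```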

#### Balancing Complexity and Efficiency
While adding layers increases a model’s ability to capture intricate patterns, it also introduces risks such as overfitting and slower training. Therefore, understanding and controlling the number of parameters is crucial to achieving a good balance between model complexity and computational efficiency.

In [1]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[2, 3, 6, 7, 9, 3, 2, 1]])

# Create a container for stacking linear layers
model = nn.Sequential(nn.Linear(8, 4),
                      nn.Linear(4, 1)
                     )

output = model(input_tensor)
print(output)

tensor([[0.2978]], grad_fn=<AddmmBackward0>)


#### Counting the number of parameters
Deep learning models are famous for having a lot of parameters. With more parameters comes more computational complexity and longer training times, and a deep learning practitioner must know how many parameters their model has.

In this exercise, you'll first calculate the number of parameters manually. Then, you'll verify your result using the `.numel()` method.

In [2]:
import torch.nn as nn

model = nn.Sequential(nn.Linear(9, 4),
                      nn.Linear(4, 2),
                      nn.Linear(2, 1))

total = 0

# Calculate the number of parameters in the model
for p in model.parameters():
  total += p.numel()
  
print(f"The number of parameters in the model is {total}")

The number of parameters in the model is 53


### Introducing Activation Functions

#### Why Activation Functions Matter  
Neural networks made of only linear layers are limited in the complexity of patterns they can learn. Activation functions introduce non-linearity, enabling models to capture intricate relationships between inputs and outputs. Two commonly used activation functions are **sigmoid** (for binary classification) and **softmax** (for multi-class classification). The output of the final linear layer, often called the **pre-activation output**, is passed through one of these functions to produce a meaningful prediction.
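
One way to see why stacking linear layers alone adds no expressive power: two consecutive linear layers can always be folded into a single one. The sketch below checks this numerically. The layer sizes and the name `merged` are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
first, second = nn.Linear(3, 4), nn.Linear(4, 2)
stacked = nn.Sequential(first, second)

# Fold both layers into one: W = W2 @ W1 and b = W2 @ b1 + b2
merged = nn.Linear(3, 2)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.randn(5, 3)
print(torch.allclose(stacked(x), merged(x), atol=1e-5))
```

Placing a non-linear activation between the two layers breaks this equivalence, which is what lets deeper networks model non-linear relationships.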


### Sigmoid Activation for Binary Classification

The **sigmoid** function maps any real-valued input to a value between 0 and 1, making it ideal for binary classification. For instance, if we input data about an animal—like number of limbs, whether it lays eggs, and presence of hair—a network with two linear layers might output a raw value like 6. Passing this through the sigmoid function converts it to a probability. If the output is above 0.5, we classify it as class 1 (e.g., mammal); otherwise, class 0.

In PyTorch, this is done using `nn.Sigmoid()`, which can be added as the final layer in an `nn.Sequential()` model. A network with only linear layers followed by a sigmoid behaves like logistic regression. Adding more layers and activations extends this into a full deep learning model.
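
A minimal sketch of such a model is shown below. The three feature values and the layer sizes are assumptions for illustration, and the weights are untrained, so the predicted class is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up features: number of limbs, lays eggs (0/1), has hair (0/1)
animal = torch.tensor([[4.0, 0.0, 1.0]])

model = nn.Sequential(
    nn.Linear(3, 2),
    nn.Linear(2, 1),
    nn.Sigmoid(),  # squashes the pre-activation output into (0, 1)
)

probability = model(animal)
predicted_class = (probability > 0.5).long()  # 1 if above the 0.5 threshold, else 0
print(probability, predicted_class)
```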


### Softmax Activation for Multi-Class Classification

For problems involving more than two classes, the **softmax** function is used. It converts a vector of raw scores into a probability distribution across multiple classes. For example, if we have three classes—bird (0), mammal (1), and reptile (2)—a network might output three pre-activation values. Softmax transforms these into values between 0 and 1 that sum to one, indicating the model’s confidence for each class. The class with the highest value is the predicted label.

In PyTorch, this is implemented with `nn.Softmax(dim=-1)`, where `dim=-1` specifies that softmax should be applied across the last dimension of the input tensor. Like sigmoid, softmax is usually the final layer in classification models.
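
The sketch below applies softmax to three hand-picked pre-activation values (an assumption, not model output) and reads off the predicted class.

```python
import torch
import torch.nn as nn

# Hypothetical pre-activation scores for bird (0), mammal (1), reptile (2)
logits = torch.tensor([[2.0, 0.5, -1.0]])

softmax = nn.Softmax(dim=-1)  # normalize across the class dimension
probabilities = softmax(logits)
predicted = probabilities.argmax(dim=-1)  # index of the most likely class

print(probabilities, probabilities.sum(dim=-1), predicted)
```

The largest score always receives the largest probability, so the predicted label is the same whether `argmax` is taken before or after softmax.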


### Summary

Activation functions are crucial for transforming raw model outputs into interpretable predictions. Sigmoid is suited for binary outcomes, while softmax handles multiple classes. Including these in the final layer of a network enables it to produce probabilities that guide accurate classification decisions.

In [3]:
input_tensor = torch.tensor([[2.4]])

# Create a sigmoid function and apply it on input_tensor
sigmoid = nn.Sigmoid()
probability = sigmoid(input_tensor)
print(probability)

tensor([[0.9168]])


In [4]:
input_tensor = torch.tensor([[1.0, -6.0, 2.5, -0.3, 1.2, 0.8]])

# Create a softmax function and apply it on input_tensor
softmax = nn.Softmax(dim=-1)
probabilities = softmax(input_tensor)
print(probabilities)

tensor([[1.2828e-01, 1.1698e-04, 5.7492e-01, 3.4961e-02, 1.5669e-01, 1.0503e-01]])



#### From regression to multi-class classification
The models you have seen for binary classification, multi-class classification and regression have all been similar, barring a few tweaks to the model.

Start building a model for regression, and then tweak the model to perform a multi-class classification.

In [6]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[3, 4, 6, 7, 10, 12, 2, 3, 6, 8, 9]])

# Implement a neural network with exactly four linear layers
model = nn.Sequential(
  nn.Linear(11, 20),
  nn.Linear(20, 10),
  nn.Linear(10, 5),
  nn.Linear(5, 1)
)

output = model(input_tensor)
print(output)

tensor([[-0.0160]], grad_fn=<AddmmBackward0>)


In [5]:
import torch
import torch.nn as nn

input_tensor = torch.Tensor([[3, 4, 6, 7, 10, 12, 2, 3, 6, 8, 9]])

# Update network below to perform a multi-class classification with four labels
model = nn.Sequential(
  nn.Linear(11, 20),
  nn.Linear(20, 12),
  nn.Linear(12, 6),
  nn.Linear(6, 4),
  nn.Softmax(dim=-1)
)

output = model(input_tensor)
print(output)

tensor([[0.2748, 0.1368, 0.4420, 0.1465]], grad_fn=<SoftmaxBackward0>)


### Understanding Loss Functions in Neural Networks

#### Purpose of a Loss Function  
After generating predictions from a neural network, the next step is to assess how close those predictions are to the actual labels. This is where a **loss function** comes in—it quantifies the error between predicted outputs (`ŷ`) and true labels (`y`) by returning a single numerical value. A low loss means accurate predictions, while a high loss indicates poor performance. The training process aims to **minimize this loss**.


#### Example: Multi-Class Classification  
Consider a model that classifies animals into three classes: mammal (0), bird (1), or reptile (2). If the true label is 0 (e.g., a bear) and the model correctly predicts 0, the loss is low. If it predicts incorrectly, the loss is high. This feedback guides the model to improve over time.
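
To illustrate low versus high loss, the sketch below compares two hand-written score vectors (assumptions, not real model output) against the true label 0 using cross-entropy.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])  # true class: 0

# Scores that strongly favor the correct class vs. a wrong class
confident_right = torch.tensor([[5.0, 0.0, 0.0]])
confident_wrong = torch.tensor([[0.0, 0.0, 5.0]])

low = F.cross_entropy(confident_right, target)
high = F.cross_entropy(confident_wrong, target)
print(low.item(), high.item())
```

The first loss is close to zero, while the second is roughly 5, reflecting how strongly the scores favored the wrong class.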


#### One-Hot Encoding for Ground Truth  
The model’s predictions (`ŷ`) are typically raw scores (logits) from the last layer before softmax. To compare these with the true labels, we often convert the integer class labels into **one-hot encoded** vectors. For example:
- Label 0 → `[1, 0, 0]`
- Label 1 → `[0, 1, 0]`
- Label 2 → `[0, 0, 1]`

This ensures that predictions and ground truths have compatible shapes for loss computation.


#### Using `torch.nn.functional` for One-Hot Encoding  
Instead of manually creating one-hot vectors, we can use PyTorch’s `torch.nn.functional` module (imported as `F`) to transform integer labels automatically. This makes the code more efficient and readable.


#### Cross-Entropy Loss in PyTorch  
For multi-class classification, **cross-entropy loss** is the most commonly used loss function. In PyTorch:
- Define the loss using `F.cross_entropy()` or `nn.CrossEntropyLoss()`
- Pass the raw output scores (`yhat`, before softmax) and the true labels (`y`)
- The function returns the loss as a scalar tensor

Note: In recent PyTorch versions the target can be given either as integer class indices or as class probabilities with the same shape as the scores. A one-hot vector cast to a floating-point type is a valid probability target, so both forms give the same loss when there is a single correct class.
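
The sketch below computes the loss for the same scores twice, once with an integer class index and once with a one-hot target cast to float. It assumes a PyTorch version in which `nn.CrossEntropyLoss` accepts probability targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])  # raw scores for 4 classes
label = torch.tensor([2])                        # true class index

criterion = nn.CrossEntropyLoss()

# Target given as a class index
loss_from_index = criterion(scores, label)

# Target given as a one-hot vector, cast to float so it is read as probabilities
one_hot = F.one_hot(label, num_classes=4).float()
loss_from_one_hot = criterion(scores, one_hot)

print(loss_from_index, loss_from_one_hot)
```

Passing class indices is usually the simpler choice, since it avoids building the one-hot tensor at all.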


#### Summary  
A loss function is critical in training neural networks—it converts the model’s prediction error into a single number to be minimized. In classification, this often involves comparing logits against true class labels (converted to one-hot if needed), using cross-entropy loss. Reducing this value through training helps the model improve its predictions.

In [10]:
import numpy as np
import torch
import torch.nn.functional as F

y = 1
num_classes = 3

# Create the one-hot encoded vector using NumPy
one_hot_numpy = np.array([0, 1, 0])

# Create the one-hot encoded vector using PyTorch
one_hot_pytorch = F.one_hot(torch.tensor(y), num_classes=num_classes)

print("One-hot vector using NumPy:", one_hot_numpy)
print("One-hot vector using PyTorch:", one_hot_pytorch)

One-hot vector using NumPy: [0 1 0]
One-hot vector using PyTorch: tensor([0, 1, 0])


#### Calculating cross entropy loss
Cross-entropy loss is a widely used method to measure classification loss. In this exercise, you’ll calculate cross-entropy loss in PyTorch using:

- `y`: the ground truth label.
- `scores`: a vector of predictions before softmax.

Loss functions help neural networks learn by measuring prediction errors. Create a one-hot encoded vector for `y`, define the cross-entropy loss function, and compute the loss using `scores` and the encoded label. The result will be a scalar tensor representing the sample's loss.

In [11]:
import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

y = [2]
scores = torch.tensor([[0.1, 6.0, -2.0, 3.2]])

# Create a one-hot encoded vector of the label y
one_hot_label = F.one_hot(torch.tensor(y), num_classes=4)

# Create the cross entropy loss function
criterion = CrossEntropyLoss()

# Calculate the cross entropy loss
loss = criterion(scores.double(), one_hot_label.double())
print(loss)

tensor(8.0619, dtype=torch.float64)


#### Accessing the model parameters
A PyTorch model created with `nn.Sequential()` is a module that contains the different layers of your network. Recall that each layer's parameters can be accessed by indexing the created model directly. In this exercise, you will practice accessing the parameters of different linear layers of a neural network.

In [12]:
model = nn.Sequential(nn.Linear(16, 8),
                      nn.Linear(8, 2)
                     )

# Access the weight of the first linear layer
weight_0 = model[0].weight
print("Weight of the first layer:", weight_0)

# Access the bias of the second linear layer
bias_1 = model[1].bias
print("Bias of the second layer:", bias_1)

Weight of the first layer: Parameter containing:
tensor([[ 0.0556,  0.1620, -0.1934,  0.1200,  0.1007, -0.2213, -0.1016,  0.0295,
         -0.2418,  0.1048, -0.1359,  0.0166, -0.1304,  0.2054,  0.0322, -0.2187],
        [-0.2208, -0.1768,  0.2165,  0.1019, -0.0408,  0.2042,  0.0008,  0.2312,
         -0.0573,  0.0151, -0.0055, -0.0071, -0.1824,  0.0410, -0.1710,  0.2199],
        [-0.1586,  0.0711, -0.0112,  0.0511, -0.1203,  0.1919,  0.1554,  0.0764,
         -0.0466,  0.0250, -0.2487,  0.2069,  0.0811,  0.1786, -0.0388,  0.0159],
        [ 0.0548, -0.1581,  0.1943,  0.1269, -0.1036, -0.1508,  0.2448,  0.0042,
         -0.0769,  0.0795, -0.0021, -0.0463,  0.0521, -0.2284,  0.1319, -0.0254],
        [ 0.2497,  0.1132, -0.0689,  0.1682, -0.2266,  0.0428,  0.2372, -0.0420,
         -0.0439, -0.0914,  0.1540, -0.0801, -0.0773, -0.0703,  0.1743,  0.1884],
        [-0.2315,  0.2017,  0.0963,  0.0856,  0.0508, -0.2499, -0.0068,  0.2059,
         -0.0555, -0.0276, -0.1998,  0.0241,  0.1520,  

#### Updating the weights manually
Now that you know how to access weights and biases, you will manually perform the job of the PyTorch optimizer. While PyTorch automates this, practicing it manually helps you build intuition for how models learn and adjust. This understanding will be valuable when debugging or fine-tuning neural networks.

A neural network of three layers has been created and stored as the model variable. This network has been used for a forward pass and the loss and its derivatives have been calculated. A default learning rate, lr, has been chosen to scale the gradients when performing the update.

In [None]:
model = nn.Sequential(nn.Linear(16, 8),
                      nn.Linear(8, 2),
                      nn.Linear(2, 1)
                     )

input_tensor = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]])

output = model(input_tensor)

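# NOTE: the model has a single output unit, so softmax over that one class is
# always 1 and the cross-entropy loss is exactly 0 (all gradients are zero).
# A classification model needs at least two output units for a useful loss.
criterion = nn.CrossEntropyLoss()

loss = criterion(output, torch.tensor([0]))
print(loss)

loss.backward()

# Set the learning rate
lr = 0.001

weight0 = model[0].weight
weight1 = model[1].weight
weight2 = model[2].weight

# Access the gradients of the weight of each linear layer
grads0 = weight0.grad
grads1 = weight1.grad
grads2 = weight2.grad

# Update the weights in place so the model itself changes;
# no_grad() keeps the update out of the autograd graph
with torch.no_grad():
    weight0 -= grads0 * lr
    weight1 -= grads1 * lr
    weight2 -= grads2 * lr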

tensor(0., grad_fn=<NllLossBackward0>)


#### Using the PyTorch optimizer
Earlier, you manually updated the weight of a network, gaining insight into how training works behind the scenes. However, this method isn’t scalable for deep networks with many layers.

Thankfully, PyTorch provides the SGD optimizer, which automates this process efficiently in just a few lines of code. Now, you’ll complete the training loop by updating the weights using a PyTorch optimizer.
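
To connect the two approaches, the sketch below applies one manual update and one `torch.optim.SGD` step to two identical copies of a model and checks that the resulting parameters match. The random data, the MSE loss, and the names `model_manual` and `model_sgd` are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.001

model_manual = nn.Sequential(nn.Linear(16, 8), nn.Linear(8, 2), nn.Linear(2, 1))
model_sgd = copy.deepcopy(model_manual)  # identical starting weights

x = torch.randn(4, 16)  # random inputs
y = torch.randn(4, 1)   # random regression targets
criterion = nn.MSELoss()

# Manual update: parameter <- parameter - lr * gradient
criterion(model_manual(x), y).backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p -= lr * p.grad

# The same update performed by the optimizer
optimizer = torch.optim.SGD(model_sgd.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_sgd(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(a, b)
    for a, b in zip(model_manual.parameters(), model_sgd.parameters())
)
print(same)
```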

In [27]:
model = nn.Sequential(nn.Linear(16, 8),
                      nn.Linear(8, 2),
                      nn.Linear(2, 1)
                     )

# Create the optimizer after the model so it tracks this model's parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

input_tensor = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]])

output = model(input_tensor)

criterion = nn.CrossEntropyLoss()

loss = criterion(output, torch.tensor([0]))
loss.backward()

# Update the model's parameters using the optimizer
optimizer.step()

#### Using TensorDataset
Structuring your data into a dataset is one of the first steps in training a PyTorch neural network. `TensorDataset` simplifies this by wrapping feature and label tensors so that they can be indexed together, one sample at a time. NumPy arrays must first be converted to tensors with `torch.tensor()`.

In this exercise, you'll create a `TensorDataset` from a small `animals` dataset and inspect its structure.

In [28]:
import torch
from torch.utils.data import TensorDataset
import pandas as pd

data = {
    "animal_name": ["sparrow", "eagle", "cat", "dog", "lizard"],
    "hair": [0, 0, 1, 1, 0],
    "feathers": [1, 1, 0, 0, 0],
    "eggs": [1, 1, 0, 0, 1],
    "milk": [0, 0, 1, 1, 0],
    "predator": [0, 1, 1, 0, 1],
    "legs": [2, 2, 4, 4, 4],
    "tail": [1, 1, 1, 1, 1],
    "type": [0, 0, 1, 1, 2]
}

animals = pd.DataFrame(data)


X = animals.iloc[:, 1:-1].to_numpy()  
y = animals.iloc[:, -1].to_numpy()

# Create a dataset
dataset = TensorDataset(torch.tensor(X), torch.tensor(y))

# Print the first sample
input_sample, label_sample = dataset[0]
print('Input sample:', input_sample)
print('Label sample:', label_sample)

Input sample: tensor([0, 1, 1, 0, 0, 2, 1])
Label sample: tensor(0)


#### Using DataLoader
The `DataLoader` class is essential for efficiently handling large datasets. It splits a dataset into mini-batches and can shuffle the samples on every pass. Batching keeps memory usage bounded and gives smoother gradient estimates than updating on one sample at a time.

Now, you'll create a PyTorch `DataLoader` using the `dataset` from the previous exercise and see it in action.

In [29]:
from torch.utils.data import DataLoader

# Create a DataLoader
dataloader = DataLoader(dataset, shuffle=True, batch_size=2)

# Iterate over the dataloader
for batch_inputs, batch_labels in dataloader:
    print('batch_inputs:', batch_inputs)
    print('batch_labels:', batch_labels)

batch_inputs: tensor([[1, 0, 0, 1, 0, 4, 1],
        [1, 0, 0, 1, 1, 4, 1]])
batch_labels: tensor([1, 1])
batch_inputs: tensor([[0, 1, 1, 0, 0, 2, 1],
        [0, 1, 1, 0, 1, 2, 1]])
batch_labels: tensor([0, 0])
batch_inputs: tensor([[0, 0, 1, 0, 1, 4, 1]])
batch_labels: tensor([2])


### Building and Running a Training Loop in PyTorch

To train a deep learning model in PyTorch, we need four main components:  
1. A defined model  
2. A loss function  
3. A dataset  
4. An optimizer  

Together, these components enable a **training loop**, where the model learns by repeatedly adjusting its parameters to minimize prediction error.


### Dataset and Problem Type  
In this example, we work with a **regression task**: predicting normalized data scientist salaries using categorical features. Since the target is continuous, we use a **linear layer** as the model's output (not softmax or sigmoid) and **mean squared error (MSE)** as the loss function. MSE measures the average squared difference between predicted and actual values.


### Preparing Data for Training  
- Data (features and targets) is stored as NumPy arrays and converted to float tensors.
- We wrap them in `TensorDataset` and use `DataLoader` to create batches (e.g., batch size = 4).
- A simple model is defined with input features and one output neuron.
- We use `nn.MSELoss()` as the criterion and `torch.optim.SGD` (or another optimizer) with a learning rate of `0.001`.
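
The data preparation steps can be sketched as follows. The salary dataset itself is not shown here, so random NumPy arrays with 12 samples and 4 features stand in for it.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data (assumption): 12 samples, 4 features, 1 continuous target
rng = np.random.default_rng(0)
features = rng.random((12, 4))
target = rng.random((12, 1))

# NumPy arrays are float64 by default; cast to float32 to match nn.Linear weights
dataset = TensorDataset(
    torch.tensor(features).float(),
    torch.tensor(target).float(),
)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_features, batch_target = next(iter(dataloader))
print(batch_features.shape, batch_target.shape, batch_features.dtype)
```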


### Training Loop Structure  
The training loop iterates over the dataset for multiple **epochs**. In each epoch:
1. Loop over batches from the `DataLoader`.
2. Clear existing gradients with `optimizer.zero_grad()`.
3. Perform a forward pass to get predictions.
4. Compute the loss using predicted and true values.
5. Call `.backward()` to compute gradients.
6. Update model parameters using `optimizer.step()`.

This loop allows the model to learn from data by gradually reducing the loss value over successive epochs.
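
The steps above can be sketched as a complete loop. The synthetic data, the `true_weights` used to generate targets, the learning rate of `0.05`, and the 20 epochs are all assumptions chosen so the example runs quickly.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic regression data: targets are a fixed linear function of the features
features = torch.rand(32, 4)
true_weights = torch.tensor([[0.5], [-1.0], [2.0], [0.0]])
target = features @ true_weights

dataloader = DataLoader(TensorDataset(features, target), batch_size=4, shuffle=True)
model = nn.Sequential(nn.Linear(4, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_features, batch_target in dataloader:
        optimizer.zero_grad()                       # clear old gradients
        prediction = model(batch_features)          # forward pass
        loss = criterion(prediction, batch_target)  # compute the loss
        loss.backward()                             # compute gradients
        optimizer.step()                            # update parameters
        running += loss.item()
    epoch_losses.append(running / len(dataloader))  # mean batch loss this epoch

print(epoch_losses[0], epoch_losses[-1])
```

Tracking the mean loss per epoch, as done here, is a simple way to confirm that training is making progress.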

#### Using the MSELoss
For regression problems, you often use Mean Squared Error (MSE) as a loss function instead of cross-entropy. MSE calculates the mean of the squared differences between predicted values (`y_pred`) and actual values (`y`). Now, you'll compute MSE loss using both NumPy and PyTorch.

In [30]:
y_pred = np.array([3, 5.0, 2.5, 7.0])  
y = np.array([3.0, 4.5, 2.0, 8.0])     

# Calculate MSE using NumPy
mse_numpy = np.mean((y-y_pred)**2)

# Create the MSELoss function in PyTorch
criterion = nn.MSELoss()

# Calculate MSE using PyTorch (predictions first, then targets)
mse_pytorch = criterion(torch.tensor(y_pred), torch.tensor(y))

print("MSE (NumPy):", mse_numpy)
print("MSE (PyTorch):", mse_pytorch)

MSE (NumPy): 0.375
MSE (PyTorch): tensor(0.3750, dtype=torch.float64)


In [31]:
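num_epochs = 3

# Set the learning rate
lr = 0.001

# Assumption: `dataloader` is the animals DataLoader created above (7 integer
# features per sample), so the model takes 7 inputs and batches are cast to float
model = nn.Sequential(nn.Linear(7, 4), nn.Linear(4, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Loop over the number of epochs and the dataloader
for i in range(num_epochs):
  for data in dataloader:
    # Set the gradients to zero
    optimizer.zero_grad()
    # Run a forward pass; nn.Linear needs float inputs, not Long
    feature, target = data
    prediction = model(feature.float())
    # Compute the loss; reshape the target to match the (batch, 1) prediction
    loss = criterion(prediction, target.float().view(-1, 1))
    # Compute the gradients
    loss.backward()
    # Update the model's parameters
    optimizer.step()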

### Understanding ReLU and Its Importance in Neural Networks

#### Limitations of Sigmoid and Softmax  
While **sigmoid** and **softmax** are useful activation functions—commonly used in output layers—they have limitations when used in hidden layers. Both functions produce outputs between 0 and 1, and their gradients become very small for extreme input values. This **saturation** leads to the **vanishing gradients problem**, where gradients shrink during backpropagation, preventing effective weight updates. As a result, these functions are unsuitable for deep networks' internal layers.
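
Saturation can be observed directly with autograd. This sketch evaluates the sigmoid gradient at three hand-picked inputs.

```python
import torch

# Inputs at the center of the sigmoid and increasingly far into its flat region
x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)

# Summing lets a single backward() call produce the gradient for each input
torch.sigmoid(x).sum().backward()

print(x.grad)
```

The gradient peaks at 0.25 for an input of 0 and drops below 0.0001 at an input of 10. Multiplying several such small factors across layers is what makes gradients vanish.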

### ReLU: Rectified Linear Unit  
The **ReLU (Rectified Linear Unit)** activation function addresses the vanishing gradient issue by outputting the input directly if it's positive, and zero if it's negative. Because its gradient is exactly 1 for every positive input, it does not saturate for large values, so gradients remain large enough for effective learning. ReLU is widely used in hidden layers and can be applied in PyTorch using `torch.nn.ReLU()`. It's considered a robust default activation for many deep learning tasks.


### Leaky ReLU: Handling Negative Inputs  
To improve upon ReLU, **Leaky ReLU** allows a small, non-zero gradient for negative inputs. Instead of outputting zero for all negative values, it multiplies them by a small constant (e.g., 0.01). This prevents neurons from dying (i.e., becoming inactive) during training. In PyTorch, `torch.nn.LeakyReLU()` includes a `negative_slope` parameter to control this coefficient.
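
The difference between the two activations shows up in their gradients for negative inputs. In the sketch below, `gradient_at` is a hypothetical helper written for this illustration.

```python
import torch
import torch.nn as nn

def gradient_at(activation, value):
    """Return d(activation)/dx evaluated at a single scalar input."""
    x = torch.tensor(value, requires_grad=True)
    activation(x).backward()
    return x.grad.item()

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)

# ReLU: gradient 1 for positive inputs, 0 for negative inputs
print(gradient_at(relu, 2.0), gradient_at(relu, -3.0))

# Leaky ReLU: gradient 1 for positive inputs, negative_slope for negative inputs
print(gradient_at(leaky, 2.0), gradient_at(leaky, -3.0))
```

A neuron whose inputs are always negative receives no gradient under ReLU, while Leaky ReLU still passes a small signal that allows it to recover.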

### Summary  
While sigmoid and softmax are useful for classification outputs, **ReLU and its variants like Leaky ReLU** are better suited for hidden layers in deep networks. They maintain stronger gradients and support more effective learning across layers.

In [32]:
# Create a ReLU function with PyTorch
relu_pytorch = nn.ReLU()

x_pos = torch.tensor(2.0)
x_neg = torch.tensor(-3.0)

# Apply the ReLU function to the tensors
output_pos = relu_pytorch(x_pos)
output_neg = relu_pytorch(x_neg)

print("ReLU applied to positive value:", output_pos)
print("ReLU applied to negative value:", output_neg)

ReLU applied to positive value: tensor(2.)
ReLU applied to negative value: tensor(0.)


In [33]:
# Create a leaky relu function in PyTorch
leaky_relu_pytorch = nn.LeakyReLU(negative_slope=0.05)

x = torch.tensor(-2.0)
# Call the above function on the tensor x
output = leaky_relu_pytorch(x)
print(output)

tensor(-0.1000)


### Learning Rate and Momentum in Optimization

#### Role of SGD in Training  
Training a neural network involves minimizing a **loss function** by updating model parameters using an optimizer. The commonly used **Stochastic Gradient Descent (SGD)** algorithm updates weights based on the gradient of the loss and is influenced by two important hyperparameters: **learning rate** and **momentum**.


### Learning Rate  
The **learning rate** determines the size of each update step:
- An **optimal learning rate** allows the optimizer to converge steadily toward the minimum.
- A **small learning rate** slows convergence, requiring more steps to reach the minimum.
- A **high learning rate** can cause the optimizer to overshoot and oscillate, failing to find the minimum.

The step size naturally decreases as the gradient becomes smaller near the minimum, since updates are proportional to the gradient value.
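
The three regimes can be reproduced on a toy problem. The sketch below minimizes `f(x) = x**2` with `torch.optim.SGD`. The function, the starting point, and the three learning rates are assumptions chosen for illustration, and `minimize` is a hypothetical helper.

```python
import torch

def minimize(lr, steps=20, start=5.0):
    """Run SGD on f(x) = x**2 and return the final value of x."""
    x = torch.tensor(start, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = x ** 2
        loss.backward()
        optimizer.step()
    return x.item()

small = minimize(lr=0.01)    # slow: still far from the minimum at 0
good = minimize(lr=0.4)      # converges to (almost exactly) 0
too_high = minimize(lr=1.1)  # overshoots on every step and diverges

print(small, good, too_high)
```

For this function each step multiplies `x` by `1 - 2 * lr`, so any learning rate above 1 makes the magnitude grow instead of shrink.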


### Momentum  
Loss functions in deep learning are typically **non-convex**, meaning they can contain many local minima. Momentum helps the optimizer escape shallow local minima by carrying over part of the previous update, so it keeps moving in the direction of consistent gradients:
- **Without momentum**, the optimizer can get stuck in a shallow local dip.
- **With momentum** (e.g., 0.9), the optimizer can overcome these dips and reach better minima by building inertia.
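
The inertia effect can be seen on a toy loss with a constant gradient. This sketch does not model a local minimum; it only shows that, with momentum, consecutive steps in the same direction grow larger. The helper `run` and all values are assumptions for illustration.

```python
import torch

def run(momentum, steps=3, lr=0.1):
    """Take a few SGD steps on loss = x, whose gradient is always 1."""
    x = torch.tensor(0.0, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = 1.0 * x  # constant gradient of 1
        loss.backward()
        optimizer.step()
    return x.item()

without_momentum = run(momentum=0.0)  # three equal steps of size lr
with_momentum = run(momentum=0.9)     # each step adds 0.9 times the previous one

print(without_momentum, with_momentum)
```

After three steps the plain optimizer has moved 0.3 in total, while the one with momentum has moved about 0.56. This accumulated velocity is what can carry the optimizer through a shallow dip.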


### Summary  
- **Learning Rate** controls how fast the model learns; typical values range from **0.0001 to 0.01**.
- **Momentum** adds stability and helps escape local minima; typical values range from **0.85 to 0.99**.

Together, these hyperparameters play a crucial role in ensuring efficient and effective training.