# LAYERS

1. Linear layer
2. Convolution layer
3. Pooling layers (max pooling and average pooling)
4. Recurrent layers
5. LSTM
6. GRU
7. Normalization layers
8. Dropout layers
9. Activation layers
10. Embedding layers
11. Attention mechanisms

## Linear Layer

A linear layer performs a fully connected (dense) transformation. The equation for this layer is __output = xW^T + b__, where:

- x = input vector
- W^T = transposed weight matrix (W is what is learned); it maps input features to output features
- b = bias vector

In PyTorch, this layer is `nn.Linear`:

```python
import torch.nn as nn
linear = nn.Linear(in_features = 128, out_features = 64)
# or
linear = nn.Linear(128, 64)
```

The input tensor normally has shape _[batch_size, in_features]_, where the batch size is the number of samples processed in a single forward pass. The weight matrix has shape _[out_features, in_features]_ and is initialized randomly by default; the model learns appropriate values for these weights during training. The bias vector has shape _[out_features]_ and is also updated during training.
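
As a quick check of the shapes described above, here is a minimal sketch that runs a batch through `nn.Linear` and reproduces the result by hand with `xW^T + b`. The batch size of 32 and the random input are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

linear = nn.Linear(in_features=128, out_features=64)
x = torch.randn(32, 128)  # assumed batch of 32 samples with 128 features each

y = linear(x)                                # forward pass through the layer
manual = x @ linear.weight.T + linear.bias   # the same computation written out

print(linear.weight.shape)  # [out_features, in_features]
print(linear.bias.shape)    # [out_features]
print(y.shape)              # [batch_size, out_features]
print(torch.allclose(y, manual, atol=1e-5))
```

The weight is stored as _[out_features, in_features]_, which is why the formula uses the transpose.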

## Convolution Layer

A convolution layer is at the heart of convolutional neural networks (CNNs). Its goal is to extract meaningful features from input data by applying a mathematical operation known as a __convolution__.

This process uses filters, known as __kernels__, that slide across the input image to learn complex visual patterns.

### convolution operation

This is a linear operation between two functions, commonly applied to signals, images, or any structured data.

![2D Convolution Animation](https://upload.wikimedia.org/wikipedia/commons/1/19/2D_Convolution_Animation.gif)

Given an input matrix and a kernel, overlay the kernel on the input matrix, multiply each pair of overlapping elements (elementwise multiplication), and sum the results; that sum forms the first element of the output matrix. Then slide the kernel by the stride and repeat until the output matrix is complete.

This output matrix is known as a __feature map__.
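
The overlay, multiply, sum, and slide steps can be sketched directly in code. The helper name `conv2d_manual`, the 4 x 4 input, and the 2 x 2 kernel values below are assumptions chosen for illustration; the result is compared against PyTorch's `F.conv2d`, which performs the same sliding-window operation without flipping the kernel.

```python
import torch
import torch.nn.functional as F

# Assumed toy input and kernel, chosen only for illustration.
image = torch.arange(16, dtype=torch.float32).reshape(4, 4)
kernel = torch.tensor([[1.0, 0.0],
                       [0.0, -1.0]])

def conv2d_manual(image, kernel, stride=1):
    """Slide the kernel over the image; each output element is the sum of
    the elementwise product between the kernel and the patch under it."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            feature_map[i, j] = (patch * kernel).sum()
    return feature_map

feature_map = conv2d_manual(image, kernel)

# F.conv2d expects [batch, channels, H, W] inputs and [out, in, kH, kW] weights.
reference = F.conv2d(image[None, None], kernel[None, None]).squeeze()

print(feature_map)
print(torch.allclose(feature_map, reference))
```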

### channels

Each channel represents one color component. In an RGB image, each pixel consists of three channels (red, green, blue), so the image can be described as _W x H x C_ with _C = 3_. Grayscale images have only one channel.

### Parameters

1. Filter/Kernel size
2. Stride
3. Padding
4. Output channels
5. Input channels

```python
convLayer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, stride = 1, padding = 0)
```

#### Filter/kernel size

Filters (kernels) are the learnable weight matrices. Each kernel extracts specific features such as edges, textures, or patterns. For a filter of size _k x k_ and an input with `c_in` channels, each filter has _k x k x c_in_ weights.

#### input channels

The number of channels in the input data: three for RGB (red, green, blue) and one for grayscale.

#### Kernel size

The spatial size of the filter, typically 3 x 3, 5 x 5, or 7 x 7.

#### Stride

Determines how far the filter moves at each step. A stride of 1 means one pixel at a time. Larger strides reduce the size of the output feature map.

#### padding

Adds zeros around the input, typically to preserve its spatial size after the convolution.
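
To see how kernel size, stride, and padding interact, the sketch below compares the standard output-size formula, `floor((n + 2 * padding - kernel_size) / stride) + 1` (with no dilation), against what `nn.Conv2d` actually produces. The helper name `conv_output_size` and the 28 x 28 input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (no dilation)."""
    return (n + 2 * padding - kernel_size) // stride + 1

x = torch.randn(1, 1, 28, 28)  # assumed single grayscale 28x28 image

rows = []
for k, s, p in [(3, 1, 0), (3, 1, 1), (3, 2, 1), (5, 2, 0)]:
    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=k, stride=s, padding=p)
    actual = conv(x).shape[-1]                  # width of the feature map
    predicted = conv_output_size(28, k, s, p)   # width from the formula
    rows.append((k, s, p, actual, predicted))
    print(f"kernel={k} stride={s} padding={p} -> actual={actual}, formula={predicted}")
```

With a 3 x 3 kernel, a stride of 1, and a padding of 1, the output keeps the input's 28 x 28 size, which is the usual reason for adding padding.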

#### Hyperparameters affecting parameters

1. Increasing the number of output channels increases the number of filters and hence the number of learnable parameters.
2. Increasing the kernel size makes each filter larger.
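
The effect of these hyperparameters on the parameter count can be checked numerically. This is a small sketch; the helper `conv2d_param_count` and the layer sizes are hypothetical, and the count assumes one bias per output channel, which is the `nn.Conv2d` default.

```python
import torch.nn as nn

def conv2d_param_count(c_in, c_out, k, bias=True):
    """Each of the c_out filters has k * k * c_in weights, plus one bias each."""
    return c_out * (k * k * c_in) + (c_out if bias else 0)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
actual = sum(p.numel() for p in conv.parameters())
expected = conv2d_param_count(c_in=3, c_out=16, k=3)  # 16 * 27 + 16

print(actual, expected)

# Doubling the output channels doubles the count; a larger kernel grows each filter.
print(conv2d_param_count(3, 32, 3), conv2d_param_count(3, 16, 5))
```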

## Pooling layers

Pooling, also known as subsampling or downsampling, is a technique used in CNNs to reduce the spatial dimensions of feature maps while retaining essential information. Two common variants are:

1. Max pooling
2. Average pooling

### MaxPooling

Takes the maximum value from each region of the feature map. It captures the most prominent feature in each region.

### average pooling

Takes the average value from each region of the feature map.

### parameters

1. __kernel size__ - determines the size of the region to be pooled
2. __stride__ - determines the step size of the pooling window; it defaults to the kernel size, so regions do not overlap
3. __padding__ - pads the border of the input to control the output dimensions (in PyTorch, `MaxPool2d` pads with negative infinity and `AvgPool2d` with zeros)

```python
max_pool = nn.MaxPool2d(kernel_size = 2, stride = 2)
avg_pool = nn.AvgPool2d(kernel_size = 2, stride = 2)
```
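
To make the two pooling operations concrete, here is a small sketch on a hand-written 4 x 4 feature map. The values are made up for illustration; each 2 x 2 region collapses to a single number.

```python
import torch
import torch.nn as nn

# Assumed toy feature map, shaped [batch, channels, H, W].
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 7., 6., 8.],
                  [9., 2., 1., 0.],
                  [3., 4., 5., 6.]]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

max_out = max_pool(x)  # keeps the largest value of each 2x2 region
avg_out = avg_pool(x)  # keeps the mean of each 2x2 region

print(max_out.squeeze())  # [[7, 8], [9, 6]]
print(avg_out.squeeze())  # [[4, 5], [4.5, 3]]
print(max_out.shape)      # spatial size halves: [1, 1, 2, 2]
```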

Pooling:

1. Reduces the spatial dimensions of the feature maps
2. Adds a degree of translation invariance, helping the network recognize patterns despite small shifts in the input
3. Helps reduce overfitting

## Recurrent layers

Recurrent neural networks (RNNs) are a class of neural networks designed to process sequential data, e.g. audio, text, and video. They use an internal state (memory) to process sequences of inputs, making them ideal for tasks where context or temporal relationships matter.

RNNs are powerful because:

1. They can maintain context - retaining information about previous elements in the sequence
2. Temporal dependency modelling - understanding the order and dependencies between elements

### Sequential data

In sequential data, the arrangement of elements affects the outcome, so order matters. Elements in the sequence are related: a word's meaning depends on the preceding words (context). Sequences also have variable lengths: some sentences are longer, others shorter.

Imagine you are processing a sentence like "I love AI". The sequence here is "__I__", "__love__", "__AI__". At time step __t = 1__, the RNN takes in "__I__" together with the initial hidden state $h_0$, which is all zeros, and computes $h_1 = f(input, h_0)$. At time step __t = 2__, it takes the second input, "__love__", and the hidden state $h_1$, and computes $h_2 = f(input, h_1)$. At time step __t = 3__, it takes the input "__AI__" and the hidden state $h_2$ and computes $h_3 = f(input, h_2)$.

At each timestep, the RNN produces two key outputs:

1. The hidden state, as described in the paragraph above
2. The output: the processed result at the current timestep


#### Hidden states formula

The hidden state at time step $t$ is computed as:

$$
h_t = f(W_x x_t + W_h h_{t-1} + b_h)
$$

where:
- $h_t$: Hidden state at time $t$,
- $f$: Activation function (e.g., $\tanh$, ReLU),
- $W_x$: Weight matrix for the current input $x_t$,
- $W_h$: Weight matrix for the previous hidden state $h_{t-1}$,
- $b_h$: Bias term.


#### Output states formula

The output at time step $t$ is computed as:

$$
y_t = g(W_y h_t + b_y)
$$

where:
- $y_t$: Output at time $t$,
- $g$: Output activation function (e.g., softmax, linear),
- $W_y$: Weight matrix for mapping the hidden state to the output,
- $h_t$: Hidden state at time $t$,
- $b_y$: Bias term for the output.

These formulas are repeated at each timestep until the last timestep is reached.
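
The two formulas above can be unrolled by hand. The following is a minimal sketch, assuming $f = \tanh$, $g$ = softmax, made-up layer sizes, and random vectors standing in for the embedded tokens "I", "love", "AI"; none of these choices come from a real model.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

input_size, hidden_size, output_size = 4, 3, 2  # assumed sizes

# Randomly initialised parameters, named after the symbols in the formulas.
W_x = torch.randn(hidden_size, input_size) * 0.1
W_h = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
W_y = torch.randn(output_size, hidden_size) * 0.1
b_y = torch.zeros(output_size)

# Three time steps of input (stand-ins for "I", "love", "AI").
xs = torch.randn(3, input_size)

h = torch.zeros(hidden_size)  # h_0 is all zeros
hidden_states, outputs = [], []
for x_t in xs:
    h = torch.tanh(W_x @ x_t + W_h @ h + b_h)   # h_t = f(W_x x_t + W_h h_{t-1} + b_h)
    y = torch.softmax(W_y @ h + b_y, dim=0)     # y_t = g(W_y h_t + b_y)
    hidden_states.append(h)
    outputs.append(y)

for t, (h_t, y_t) in enumerate(zip(hidden_states, outputs), start=1):
    print(t, h_t.shape, y_t.shape)
```

The same weights are reused at every step; only the hidden state changes as it carries context forward.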




#### BPTT (Back Propagation Through Time)

To backpropagate through these equations, we need the gradients of the loss w.r.t. the parameters $W_x, W_h, W_y, b_h, b_y$, which we obtain using the chain rule of differentiation.

Here are the forward equations:

Hidden state equation: $h_t = f(W_xx_t + W_hh_{t-1} + b_h)$

output equation: $y_t = g(W_yh_t + b_y)$

The loss function is defined as:

$$
\mathcal{L} = \sum_{t=1}^T \ell(y_t, \hat{y}_t)
$$

where:
- $\mathcal{L}$: Total loss across all time steps,
- $\ell(y_t, \hat{y}_t)$: Loss at time $t$ (e.g., cross-entropy or mean squared error),
- $y_t$: Output at time $t$,
- $\hat{y}_t$: Ground truth at time $t$,
- $T$: Total sequence length.


For each timestep, we calculate the output $y_t$ and the hidden state $h_t$. We need to compute the loss at each timestep (that is why we produce an output $y_t$): compare the output with the ground truth $\hat{y}_t$ using a predefined loss function, e.g. cross-entropy or mean squared error. The total loss over the entire sequence is then the sum of these individual losses.

We sum the losses across all timesteps so that every timestep contributes to the optimization.

Then, during backpropagation, we take the derivative of this aggregated loss w.r.t. the output at each timestep. For each $t$, you compute:

$$
\frac{\partial \mathcal{L}}{\partial y_t}
$$

From here, you can do backpropagation through time. Start by calculating the gradients w.r.t. the hidden states.

Here is the forward equation for the hidden state:
$h_t = f(W_xx_t + W_hh_{t-1} + b_h)$

The parameters we need to adjust are $W_x$, $W_h$, and $b_h$.

The gradient calculated at each timestep was $\frac{\partial \mathcal{L}}{\partial y_t}$.

So first we find the gradient w.r.t. $h_T$, the hidden state at the final timestep $T$, using the output equation $y_T = g(W_yh_T + b_y)$:

$\frac{\partial \mathcal{L}}{\partial h_T} = \frac{\partial \mathcal{L}}{\partial y_T} \times \frac{\partial y_T}{\partial h_T}$

Then recursively compute the gradients w.r.t. the hidden states at previous timesteps (the forward equation for hidden states is always $h_t = f(W_xx_t + W_hh_{t-1} + b_h)$).

The loss at timestep $t$ flows back to earlier hidden states through a chain of factors, all the way to $h_0$: $\frac{\partial \mathcal{L}}{\partial y_t} \times \frac{\partial y_t}{\partial h_t} \times \frac{\partial h_t}{\partial h_{t-1}} \cdots \frac{\partial h_1}{\partial h_0}$. The full gradient w.r.t. a hidden state sums these contributions from its own timestep and every later one.

Once you have computed the gradients w.r.t. each hidden state, you compute the gradients w.r.t. the parameters $W_x, W_h, b_h$, then update them with the chosen optimizer, e.g. gradient descent:

$$
w \gets w - \eta \frac{\partial \mathcal{L}}{\partial w}
$$

$$
b \gets b - \eta \frac{\partial \mathcal{L}}{\partial b}
$$

#### updating the weights w.r.t. the outputs

Recall that above we calculated the gradients w.r.t. the individual outputs, $\frac{\partial \mathcal{L}}{\partial y_t}$. We use them to compute the gradients of the output equation's parameters, $W_y$ and $b_y$, and update those weights and biases in the same way.
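
In practice these gradients are rarely derived by hand. The sketch below lets PyTorch's autograd carry out backpropagation through time on a hand-unrolled RNN and then applies one plain gradient descent step. The sizes, random data, mean squared error loss, identity output activation, and learning rate are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
input_size, hidden_size, output_size, T = 4, 3, 2, 5  # assumed sizes

# Parameters named after the symbols in the equations above.
W_x = (torch.randn(hidden_size, input_size) * 0.1).requires_grad_()
W_h = (torch.randn(hidden_size, hidden_size) * 0.1).requires_grad_()
b_h = torch.zeros(hidden_size, requires_grad=True)
W_y = (torch.randn(output_size, hidden_size) * 0.1).requires_grad_()
b_y = torch.zeros(output_size, requires_grad=True)
params = [W_x, W_h, b_h, W_y, b_y]

xs = torch.randn(T, input_size)         # made-up input sequence
targets = torch.randn(T, output_size)   # made-up ground truth per timestep

def total_loss():
    """Unroll the RNN over T steps and sum the per-timestep losses."""
    h = torch.zeros(hidden_size)
    loss = torch.zeros(())
    for t in range(T):
        h = torch.tanh(W_x @ xs[t] + W_h @ h + b_h)
        y = W_y @ h + b_y                       # g is the identity here
        loss = loss + ((y - targets[t]) ** 2).mean()
    return loss

loss_before = total_loss()
loss_before.backward()  # autograd applies the chain rule back through every timestep

eta = 0.05  # assumed learning rate
with torch.no_grad():
    for p in params:
        p -= eta * p.grad   # w <- w - eta * dL/dw
        p.grad.zero_()

loss_after = total_loss()
print(loss_before.item(), loss_after.item())
```

Because the same `W_h` is used at every step, its gradient accumulates contributions from all timesteps, which is exactly what the recursive chain above describes.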

```python
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()

        # RNN layer followed by a linear layer that maps the hidden state to the output
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # out holds the hidden state at every time step; the second return value is
        # the final hidden state, which we discard since it is not used here.
        # Backpropagation is handled automatically by autograd.
        out, _ = self.rnn(x)

        # Extract the output at the last time step
        out = self.fc(out[:, -1, :])
        return out
```
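
To clarify what the two return values of `nn.RNN` contain, here is a short sketch with assumed sizes (a batch of 4 sequences, 10 time steps, 8 input features, 16 hidden units). For a single-layer, unidirectional RNN, the last time step of `out` is the same tensor as the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)  # assumed [batch_size, seq_length, input_size]

out, h_n = rnn(x)

print(out.shape)  # hidden state at every time step: [4, 10, 16]
print(h_n.shape)  # final hidden state per layer:    [1, 4, 16]

# With one layer and one direction, the last step of out equals h_n.
same = torch.allclose(out[:, -1, :], h_n[-1])
print(same)
```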
        

## LSTM



Long Short-Term Memory (LSTM) is a special type of recurrent neural network designed to address the key limitations of vanilla RNNs, particularly in handling long-term dependencies and mitigating issues like the vanishing gradient problem.

Here's a breakdown of how LSTMs relate to RNNs and why LSTMs are often preferred over basic RNNs in many tasks:

1. __Vanilla RNNs (Recurrent Neural Networks)__
Basic Structure: A basic RNN consists of a loop in the hidden layer, which allows the network to maintain a state or memory over time. The network processes input sequences one element at a time, updating its hidden state at each time step.

__Limitations:__

 - __Vanishing Gradient Problem:__ When training vanilla RNNs with backpropagation through time (BPTT), gradients can exponentially decay as they are propagated backward through many layers or time steps. This makes it hard to capture long-term dependencies in data, as the network forgets earlier information during training.

- __Difficulty with Long-Term Dependencies:__ RNNs struggle to maintain information over long sequences because the information from earlier time steps gets "forgotten" or "diluted" as it propagates through many time steps.

2. __Long Short-Term Memory (LSTM)__
LSTM is a type of RNN, but it improves on the basic RNN by introducing gates and a cell state that help regulate the flow of information and enable the network to "remember" useful information over longer sequences.
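
As a rough illustration of the extra cell state and gates, the sketch below runs PyTorch's `nn.LSTM` on random data with assumed sizes. Compared with `nn.RNN`, it returns a cell state alongside the hidden state, and its input weight matrix is four times as tall because PyTorch stacks the weights of the input, forget, cell, and output gates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)  # assumed [batch_size, seq_length, input_size]

out, (h_n, c_n) = lstm(x)

print(out.shape)  # hidden state at every time step: [4, 10, 16]
print(h_n.shape)  # final hidden state:              [1, 4, 16]
print(c_n.shape)  # final cell state:                [1, 4, 16]

# Four gate blocks of size hidden_size are stacked in one matrix: [4 * 16, 8].
print(lstm.weight_ih_l0.shape)
```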

### RNN ARCHITECTURE

There are different types of RNN Architectures:

1. One to many
2. Many to one
3. Many to Many

#### One to many

It represents an architecture where the network receives a __single input__ but generates a sequence of outputs, e.g. image captioning or music generation.

#### Many to one

A type of RNN where the network processes a sequence of inputs and produces a single output, e.g. sentiment analysis or spam detection.

#### Many to many

A type of RNN that processes a sequence of inputs and generates a corresponding sequence of outputs, e.g. machine translation, video captioning, or text summarization. The encoder-decoder model is a common example of this architecture.
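
The difference between many-to-one and many-to-many often comes down to which time steps are passed to the output layer. The sketch below assumes made-up sizes and five output classes, and it shows only the simple many-to-many case with one output per input step; an encoder-decoder model, where the output length can differ from the input length, needs more machinery than this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 5)    # assumed 5 output classes
x = torch.randn(4, 10, 8)  # assumed [batch_size, seq_length, input_size]

out, _ = rnn(x)  # hidden state at every time step: [4, 10, 16]

many_to_one = head(out[:, -1, :])  # use only the last step:  [4, 5]
many_to_many = head(out)           # one output per time step: [4, 10, 5]

print(many_to_one.shape)
print(many_to_many.shape)
```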





## Embedding layers

This layer is crucial for natural language processing tasks. It is typically used to convert input tokens into dense vector representations (embeddings).

### Features of the embedding layer

1. Input shape: `[batch_size, seq_length]`
2. Output shape: `[batch_size, seq_length, embed_size]`, where `embed_size` is the layer's `embedding_dim`
3. Weight matrix shape: `[num_embeddings, embedding_dim]`


### INPUT

`nn.Embedding` takes token indices as input: integers representing the words or items in a vocabulary. `batch_size` is the number of sentences sent through the layer at once, and `seq_length` is the length of each sentence.

### OUTPUT

The batch size and sequence length remain the same; there is just an extra dimension, `embed_size`, which is the size of the dense vector that each token is mapped to, so the output is a 3D tensor. The embedding dimension determines the capacity of the dense vectors to represent relationships and semantic information in the data.

Small dimensions may not capture enough features to represent tokens effectively, leading to underfitting. Large dimensions may lead to overfitting, especially with small datasets, and increase computation and memory costs.

### Parameters

`nn.Embedding` has two required parameters:

1. __num_embeddings__ - This is the size of the vocabulary - number of unique tokens in your dataset
2. __embedding_dim__ - size of each embedding vector. The number of dimensions to represent each token.



```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
```
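
To connect the shapes listed earlier with the layer itself, here is a minimal sketch that looks up a small batch of made-up token indices. The indices are assumptions chosen for illustration; any integers from 0 to `num_embeddings - 1` would work.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable

embed = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Assumed batch of 2 "sentences", each 3 tokens long: [batch_size, seq_length].
tokens = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])

vectors = embed(tokens)

print(embed.weight.shape)  # [num_embeddings, embedding_dim] = [10, 4]
print(vectors.shape)       # [batch_size, seq_length, embedding_dim] = [2, 3, 4]

# The layer is a lookup table: token index 1 returns row 1 of the weight matrix.
print(torch.equal(vectors[0, 0], embed.weight[1]))
```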


## Attention Mechanisms