# Encoder Decoder

<img src='e1.png' />

Sequence to Sequence Learning with Neural Networks

research paper: https://arxiv.org/abs/1409.3215

- how to efficiently map variable-length sequences in one domain
    to variable-length sequences in another domain
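
As a rough illustration of the variable-length problem, the sketch below pads three sequences of different lengths, packs them, and runs them through an LSTM so that each one ends up as a fixed-size final hidden state. The feature size (4), hidden size (16), and the random data are assumptions made for the example, not values from the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

# Three made-up sequences of different lengths; each step is a 4-dim feature vector.
sequences = [torch.randn(5, 4), torch.randn(3, 4), torch.randn(7, 4)]
lengths = torch.tensor([len(s) for s in sequences])

# Pad to the longest sequence, then pack so the LSTM skips the padded steps.
padded = pad_sequence(sequences, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=4, hidden_size=16, batch_first=True)
_, (final_hidden, _) = lstm(packed)

print(padded.shape)        # torch.Size([3, 7, 4])
print(final_hidden.shape)  # torch.Size([1, 3, 16])
```

Whatever the input length, every sequence is summarized by a vector of the same size, which is what the decoder later starts from.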
    
model architecture
- encoder
- decoder
- seq2seq framework

In [None]:
variants of encoder decoder architectures

- seq 2 seq with attention
- transformers
- conditional variational autoencoder

In [None]:
applications

- machine translation (e.g. Google Translate)
    - seq2seq (LSTM, GRU, or Transformer)

- text summarization (e.g. news articles)
    - encoder-decoder with attention
    - BART (Bidirectional and Auto-Regressive Transformers)

- image captioning (e.g. social media, photo management)
    - encoder: CNN such as ResNet
    - decoder: LSTM or Transformer

- speech recognition (e.g. Alexa, Siri)
    - encoder: processes audio frames (CNN or LSTM)
    - decoder: generates text

    
encoder and decoder stacks in transformer models
- BERT - encoder only
- GPT - decoder only
- T5 - full encoder-decoder

## encoder decoder architecture

<img src='e2.png' />

In [None]:
encoder
- processes the input word by word
- each word is represented as a vector (word embedding)
- based on an LSTM or GRU
- takes the word vectors sequentially and updates its hidden state at each step
- the final hidden state is called the context vector
- the context vector is the output of the encoder
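
The encoder steps above can be sketched in a few lines. This is a minimal illustration, assuming a GRU, a made-up vocabulary of 20 token ids, and arbitrary embedding and hidden sizes; none of these values come from the notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_dim = 20, 8, 16  # illustrative sizes (assumptions)

embedding = nn.Embedding(vocab_size, embed_dim)        # word id -> word vector
gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # updates the hidden state per word

tokens = torch.tensor([[4, 7, 1, 9, 3]])  # one sentence of 5 hypothetical token ids
vectors = embedding(tokens)               # shape (1, 5, embed_dim)
all_states, final_state = gru(vectors)    # all_states has shape (1, 5, hidden_dim)

# The final hidden state of the last layer is the context vector.
context_vector = final_state[-1]          # shape (1, hidden_dim)
print(all_states.shape, context_vector.shape)
```

For a single-layer, unidirectional GRU the context vector is the same as the last entry of `all_states`; the earlier entries are the per-word hidden states that attention-based models also make use of.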

In [None]:
decoder
- takes the context vector as its input
- generates the translated output sequence word by word
- built from LSTM, GRU, or Transformer layers
- at each step it uses the context vector provided by the encoder as well as its own hidden state from the previous step


In [None]:
1 - input seq encoding - encoder 
    - processes the input word by word (token by token)
    - updates its hidden state at each step
    - produces the context vector

2 - context vector
    - compressed representation of input seq
    - more advanced models use an attention mechanism, which lets the decoder access all hidden states
        of the encoder, not just a single context vector

3 - decoding the output seq
    - at each step the decoder produces 1 element of the output and updates its own hidden state
    - it continues until it produces a special end-of-sequence token
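
Step 3 can be sketched as a greedy decoding loop. The code below is only an illustration with untrained weights: the token ids `SOS` and `EOS`, the `greedy_decode` helper, the sizes, and the `max_len` cap are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_dim = 12, 16
SOS, EOS = 0, 1  # hypothetical ids for the start and end-of-sequence tokens

embedding = nn.Embedding(vocab_size, hidden_dim)
cell = nn.GRUCell(hidden_dim, hidden_dim)
to_vocab = nn.Linear(hidden_dim, vocab_size)

def greedy_decode(context_vector, max_len=20):
    """Generate token ids one at a time until EOS appears or max_len is reached."""
    hidden = context_vector          # the decoder starts from the encoder's context vector
    token = torch.tensor([SOS])
    generated = []
    with torch.no_grad():
        for _ in range(max_len):
            hidden = cell(embedding(token), hidden)  # update the decoder hidden state
            token = to_vocab(hidden).argmax(dim=-1)  # pick the most likely next token
            generated.append(token.item())
            if token.item() == EOS:                  # stop at the end-of-sequence token
                break
    return generated

context_vector = torch.randn(1, hidden_dim)  # stand-in for a real encoder output
print(greedy_decode(context_vector))
```

The `max_len` cap matters in practice because an untrained or poorly trained decoder may never emit the end-of-sequence token.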

## Example on a toy dataset

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

input_size = 10
hidden_size = 128
output_size = 10
num_layers = 2
learning_rate = 0.001
batch_size = 32
num_epochs = 20
seq_length = 10

# Dataset: Sequences of integers (e.g., [1, 2, 3] -> [3, 2, 1])
class ReverseSequenceDataset(Dataset):
    def __init__(self, num_samples, seq_length):
        self.num_samples = num_samples
        self.seq_length = seq_length
        self.data = torch.randint(1, 10, (num_samples, seq_length))
        self.targets = torch.flip(self.data, dims=[1])

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        return self.data[index], self.targets[index]

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, (hidden, cell) = self.lstm(x, (h0, c0))
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(output_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden, cell):
        out, (hidden, cell) = self.lstm(x, (hidden, cell))
        out = self.fc(out)
        return out, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, seq_length):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.seq_length = seq_length

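    def forward(self, src, target):
        batch_size = src.size(0)
        target_length = target.size(1)
        vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, target_length, vocab_size, device=src.device)

        # The encoder's final hidden and cell states act as the context vector.
        hidden, cell = self.encoder(src)

        # Start decoding from token 0, which never occurs in the data (values are 1-9).
        prev_tokens = torch.zeros(batch_size, 1, dtype=torch.long, device=src.device)

        for t in range(target_length):
            # The decoder LSTM expects float input of shape (batch, 1, vocab_size).
            decoder_input = nn.functional.one_hot(prev_tokens, num_classes=vocab_size).float()
            output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            outputs[:, t, :] = output.squeeze(1)
            if self.training:
                prev_tokens = target[:, t].unsqueeze(1)  # teacher forcing: feed the true token
            else:
                prev_tokens = output.argmax(dim=-1)  # greedy decoding: feed back the prediction

        return outputs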

dataset = ReverseSequenceDataset(num_samples=1000, seq_length=seq_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

encoder = Encoder(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
decoder = Decoder(hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)
model = Seq2Seq(encoder, decoder, seq_length=seq_length).to('cuda' if torch.cuda.is_available() else 'cpu')

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

device = next(model.parameters()).device

model.train()
for epoch in range(num_epochs):
    for src, target in dataloader:
        # One-hot encode the integer tokens so the last dimension matches input_size.
        src = nn.functional.one_hot(src, num_classes=input_size).float().to(device)
        target = target.long().to(device)

        outputs = model(src, target)
        outputs = outputs.view(-1, output_size)
        target = target.view(-1)

        loss = criterion(outputs, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Loss of the last batch in this epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Inference on one example; the target is passed only to set the output length.
model.eval()
example_src, example_target = dataset[0]
example_input = nn.functional.one_hot(example_src, num_classes=input_size).float().unsqueeze(0).to(device)
with torch.no_grad():
    predicted_output = model(example_input, example_target.unsqueeze(0).to(device))

print(f'Input Sequence: {example_src.numpy()}')
print(f'Reversed Target: {example_target.numpy()}')
print(f'Predicted Output: {torch.argmax(predicted_output.squeeze(0), dim=1).cpu().numpy()}')

# Attention Mechanism

<img src='a1.png' />

In [None]:
attention is a selective focus process: the model weights the most relevant parts of the input

attention addresses these 2 challenges -
- long sequences
- contextual understanding
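
A small sketch of why this helps with long sequences: instead of relying on one final hidden state, the decoder scores every encoder hidden state against its current state and takes a weighted sum. The dot-product scoring used here is one common choice, and the sizes and random tensors are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

encoder_states = torch.randn(6, 16)  # one hidden state per input token (6 tokens)
decoder_state = torch.randn(16)      # current decoder hidden state

scores = encoder_states @ decoder_state  # (6,) relevance of each input position
weights = torch.softmax(scores, dim=0)   # attention weights, non-negative, sum to 1
context = weights @ encoder_states       # (16,) weighted sum of encoder states

print(weights)
print(context.shape)
```

The context is recomputed at every decoding step, so different output words can focus on different input words.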

In [None]:
Himanshu is taking a session. He is going to explain transformers.

In [None]:
Himanshu -> He (attention links the pronoun "He" back to "Himanshu")

In [None]:
types of attention

- self attention - attention within a sentence
                 - considers relationships between different parts of the same sentence

- scaled dot product attention
    - dot product of query and key, divided by the square root of the key dimension
    - a softmax over the scores gives the weights applied to the values

- location based attention
    - used in image related tasks
    - assigns weights based on the location of the element in the input
    - object detection and image captioning
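
Scaled dot-product attention can be written as $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The sketch below applies it as self-attention, with queries, keys, and values all taken from the same sentence. For brevity it skips the learned projection matrices a real model would use; the function name, the sizes, and the random token vectors are assumptions for the example.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Return the attended values and the attention weights."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # query-key dot products, scaled
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ value, weights

torch.manual_seed(0)
x = torch.randn(5, 8)  # a sentence of 5 tokens, each an 8-dim vector

# Self-attention: every token attends to every token of the same sentence.
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```

Row `i` of `weights` shows how much token `i` attends to each token in the sentence, which is the mechanism that could link a pronoun to the name it refers to.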