### Explanation of Encoder-Decoder Architecture with LSTM

The **Encoder-Decoder architecture** is widely used in NLP tasks such as machine translation, where the input and output sequences can differ in length. The version described here uses **LSTM** networks for both the encoder and the decoder to model sequential data.

#### Encoder
The encoder processes the input sequence $ X = \{x_1, x_2, \ldots, x_T\} $, where $ T $ is the length of the input sequence. Each token $ x_t $ is passed through an embedding layer and then into the LSTM network:

$$
h_t, c_t = \text{LSTM}(x_t, (h_{t-1}, c_{t-1}))
$$

Here:
- $ h_t $: Hidden state at time step $ t $
- $ c_t $: Cell state at time step $ t $
- $ x_t $: Input at time step $ t $ (the embedding of the $ t $-th token)

The final hidden state and cell state $(h_T, c_T)$ summarize the input sequence and serve as the initial state for the decoder.
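
The recurrence above can be unrolled by hand to see how the state is carried from one token to the next. The following is a minimal sketch, assuming toy sizes and random token ids that are not part of the original notebook; it uses `torch.nn.LSTMCell` so that each time step is explicit.

```python
import torch

torch.manual_seed(0)

# Assumed toy sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_size, T = 10, 8, 16, 5

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
cell = torch.nn.LSTMCell(embedding_dim, hidden_size)

tokens = torch.randint(0, vocab_size, (1, T))  # one sequence of T token ids
h = torch.zeros(1, hidden_size)                # h_0
c = torch.zeros(1, hidden_size)                # c_0

for t in range(T):
    x_t = embedding(tokens[:, t])              # embedded token, shape (1, embedding_dim)
    h, c = cell(x_t, (h, c))                   # h_t, c_t = LSTM(x_t, (h_{t-1}, c_{t-1}))

# After the loop, (h, c) plays the role of (h_T, c_T).
print(h.shape, c.shape)
```

In practice `torch.nn.LSTM` runs this loop internally over the whole sequence; the explicit version is only meant to mirror the equation.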

#### Decoder
The decoder generates the output sequence $ Y = \{y_1, y_2, \ldots, y_{T'}\} $, where $ T' $ is the length of the output sequence. Its initial state is the encoder's final state $(h_T, c_T)$, and $ y_0 $ is a start-of-sequence token. At each time step $ t' $, the decoder takes the previous token $ y_{t'-1} $, the hidden state $ h_{t'-1} $, and the cell state $ c_{t'-1} $ as inputs:

$$
h_{t'}, c_{t'} = \text{LSTM}(y_{t'-1}, (h_{t'-1}, c_{t'-1}))
$$

The output $ o_{t'} $ at each time step is computed as:

$$
o_{t'} = \text{softmax}(W \cdot h_{t'} + b)
$$

Here:
- $ W $ and $ b $: Learnable weight matrix and bias
- $ o_{t'} $: Probability distribution over the vocabulary for the next word
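
As a small illustration of the output equation, the sketch below applies a linear layer and a softmax to a single hidden state. The sizes and the random `h_t` are assumptions made for the example; `torch.nn.Linear` holds the $ W $ and $ b $ of the formula.

```python
import torch

torch.manual_seed(0)

hidden_size, target_vocab_size = 16, 12  # assumed toy sizes

output_layer = torch.nn.Linear(hidden_size, target_vocab_size)  # holds W and b
h_t = torch.randn(1, hidden_size)        # stand-in for a decoder hidden state

logits = output_layer(h_t)               # W . h_t + b, shape (1, target_vocab_size)
o_t = torch.softmax(logits, dim=-1)      # probability distribution over the vocabulary
next_token = o_t.argmax(dim=-1)          # most likely next token id

print(o_t.sum().item(), next_token.item())
```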

#### Loss Function
The model is trained using the cross-entropy loss, defined as:

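$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^N \sum_{t=1}^{T'} \sum_{v=1}^{|V|} y_{t,v}^{(i)} \log \hat{y}_{t,v}^{(i)}
$$

Where:
- $ N $: Number of training examples
- $ |V| $: Size of the target vocabulary
- $ y_{t,v}^{(i)} $: 1 if the ground truth word at time $ t $ for example $ i $ is vocabulary entry $ v $, and 0 otherwise
- $ \hat{y}_{t,v}^{(i)} $: Predicted probability of vocabulary entry $ v $ at time $ t $ for example $ i $ (component $ v $ of $ o_t $)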

This equation measures the difference between the true and predicted distributions over the target vocabulary.
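
The sketch below computes this loss with `torch.nn.CrossEntropyLoss` and checks it against a manual calculation. Two things are assumptions of the example rather than part of the text above: padded positions are skipped via `ignore_index`, and the default reduction averages over the non-padded tokens instead of over the $ N $ examples.

```python
import torch

torch.manual_seed(0)

batch_size, target_len, vocab_size = 2, 4, 6  # assumed toy sizes
pad_idx = 0                                   # assumed padding index

# Unnormalized decoder scores; CrossEntropyLoss applies log-softmax internally.
logits = torch.randn(batch_size, target_len, vocab_size)
targets = torch.tensor([[3, 1, 5, 2],
                        [4, 2, pad_idx, pad_idx]])

criterion = torch.nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# The same quantity by hand: negative log-probability of each true token,
# averaged over the positions that are not padding.
log_probs = torch.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = targets != pad_idx
manual_loss = -(picked * mask).sum() / mask.sum()

print(loss.item(), manual_loss.item())
```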

#### Applications
The Encoder-Decoder framework is powerful for tasks like:
- **Machine Translation**: E.g., translating "I love programming" (English) to "J'aime programmer" (French).
- **Text Summarization**: Converting long articles into concise summaries.
- **Speech Recognition**: Transforming audio signals into textual transcriptions.



### Data Preprocessing and Dataset Creation
Before training an Encoder-Decoder model for tasks like machine translation, it is crucial to preprocess the data and prepare it in a format suitable for the model. This involves tokenizing, encoding, and splitting the data into training and validation sets.
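
The splitting step could look like the following sketch. The helper `split_pairs` and its parameters `val_fraction` and `seed` are hypothetical names introduced for this example; with only three sentence pairs the split is purely illustrative.

```python
import random

pairs = [
    ("I love programming.", "J'aime programmer."),
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
    ("Can you help me?", "Peux-tu m'aider ?"),
]

def split_pairs(pairs, val_fraction=0.2, seed=0):
    """Shuffle a copy of the pairs and hold out a fraction for validation."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation example, even for tiny datasets.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

train_pairs, val_pairs = split_pairs(pairs)
print(len(train_pairs), len(val_pairs))
```

When a split is used, the vocabulary would normally be built from the training pairs only, so that validation sentences can contain unseen tokens.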



In [1]:
# English-French sentence pairs
sentence_pairs = [
    ("I love programming.", "J'aime programmer."),
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
    ("Can you help me?", "Peux-tu m'aider ?")
]

In [2]:
from collections import Counter

In [3]:
# Lowercase and split on whitespace; punctuation stays attached to words (e.g. "programming.")
train_tokens_A = [token for sentence_pair in sentence_pairs for token in sentence_pair[0].lower().split()]
train_tokens_B = [token for sentence_pair in sentence_pairs for token in sentence_pair[1].lower().split()]

In [4]:
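# Count token frequencies for the source (A, English) and target (B, French) vocabularies
vocab_A = Counter(train_tokens_A)
vocab_B = Counter(train_tokens_B)

# Register a padding token in both vocabularies
vocab_A['<pad>'] = 1
vocab_B['<pad>'] = 1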

In [5]:
# Map each token to an integer index (insertion order, so '<pad>' gets the last index)
w2i_A = {k: i for i, k in enumerate(vocab_A)}
w2i_B = {k: i for i, k in enumerate(vocab_B)}

In [6]:
import torch

# Encode each sentence as a 1-D tensor of token indices
inputs_A = [torch.tensor([w2i_A[token] for token in sentence_pair[0].lower().split()]) for sentence_pair in sentence_pairs]
inputs_B = [torch.tensor([w2i_B[token] for token in sentence_pair[1].lower().split()]) for sentence_pair in sentence_pairs]



In [7]:
# Pad the source sequences to a common length: shape (batch, max_source_len)
padded_sequences_A = torch.nn.utils.rnn.pad_sequence(inputs_A, padding_value=w2i_A['<pad>'], batch_first=True)

# Pad the target sequences to a common length: shape (batch, max_target_len)
padded_sequences_B = torch.nn.utils.rnn.pad_sequence(inputs_B, padding_value=w2i_B['<pad>'], batch_first=True)
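
One consequence of padding is that an LSTM run over the padded batch also steps through the padding positions, so its final state for a short sentence is computed after the pad tokens. A common way around this is `pack_padded_sequence`. The sketch below is a minimal, self-contained illustration with made-up token ids and sizes (not the notebook's vocabulary), and it checks that the packed result for the shortest sequence matches running that sequence on its own.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

pad_idx = 0  # assumed padding index for this example
sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5, 6, 7, 8]),
    torch.tensor([9, 10, 11, 12]),
]
lengths = torch.tensor([len(s) for s in sequences])

padded = pad_sequence(sequences, batch_first=True, padding_value=pad_idx)  # (3, 5)

embedding = torch.nn.Embedding(13, 8, padding_idx=pad_idx)
lstm = torch.nn.LSTM(8, 16, batch_first=True)

# Packing tells the LSTM the true length of each sequence.
packed = pack_padded_sequence(embedding(padded), lengths,
                              batch_first=True, enforce_sorted=False)
_, (hn, cn) = lstm(packed)  # hn: (1, 3, 16), in the original batch order

# Reference: run the shortest sequence alone, without any padding.
_, (h_single, _) = lstm(embedding(sequences[0]).unsqueeze(0))
print(torch.allclose(hn[:, 0], h_single[:, 0], atol=1e-6))
```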

#### Encoder: Bi-LSTM with Multiple Layers
The encoder processes the input sequence $ X = \{x_1, x_2, \ldots, x_T\} $ and converts it into a context vector that summarizes the sequence.

1. **Bidirectional LSTM**:
   - In a **Bi-LSTM**, two LSTM networks are used: one processes the input sequence in the forward direction, and the other processes it in the reverse direction.
   - At each time step $ t $, the hidden states are concatenated:
     $$
     h_t = [h_t^{\text{forward}}; h_t^{\text{backward}}]
     $$

2. **Multiple Layers**:
   - By stacking multiple Bi-LSTM layers, the encoder can learn more complex hierarchical representations of the input data.
   - The output of one layer serves as the input for the next layer.

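The final outputs of the encoder are:
- A concatenated context vector $ h_T $, summarizing the input sequence.
- A hidden state and cell state passed to the decoder.

Note that the `encoder` built in cell `In [8]` uses `num_layers=1` and `bidirectional=False`, that is, a single-layer unidirectional LSTM, which keeps the state shapes directly compatible with the decoder.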
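
To make the bidirectional, multi-layer case concrete, the sketch below builds a 2-layer Bi-LSTM and shows one possible way to hand its final states to a unidirectional decoder: concatenating the forward and backward states of each layer, so the decoder uses a hidden size of `2 * hidden_size`. All sizes and the random inputs are assumptions for illustration, and concatenation is only one of several reasonable choices (summing or a learned projection are alternatives).

```python
import torch

torch.manual_seed(0)

# Assumed toy sizes.
batch_size, T, embedding_dim, hidden_size, num_layers = 3, 5, 8, 16, 2

encoder = torch.nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers,
                        batch_first=True, dropout=0.25, bidirectional=True)

x = torch.randn(batch_size, T, embedding_dim)  # stand-in for embedded source tokens
outputs, (hn, cn) = encoder(x)
# outputs: (batch, T, 2 * hidden_size), forward and backward states concatenated per step
# hn, cn:  (num_layers * 2, batch, hidden_size)

# Separate layers and directions: index 0 is forward, index 1 is backward.
hn = hn.reshape(num_layers, 2, batch_size, hidden_size)
cn = cn.reshape(num_layers, 2, batch_size, hidden_size)

# Concatenate both directions per layer: (num_layers, batch, 2 * hidden_size)
h0_dec = torch.cat([hn[:, 0], hn[:, 1]], dim=-1)
c0_dec = torch.cat([cn[:, 0], cn[:, 1]], dim=-1)

decoder = torch.nn.LSTM(embedding_dim, 2 * hidden_size,
                        num_layers=num_layers, batch_first=True)
y = torch.randn(batch_size, 4, embedding_dim)  # stand-in for embedded target tokens
dec_out, _ = decoder(y, (h0_dec, c0_dec))

print(outputs.shape, h0_dec.shape, dec_out.shape)
```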


In [8]:

num_embeddings_A=len(w2i_A)
embedding_dim = 300
padding_idx_A=w2i_A['<pad>']

hidden_size= 256
dropout_rate=0.25

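embd_A = torch.nn.Embedding(num_embeddings_A, embedding_dim, padding_idx=padding_idx_A)
# The LSTM consumes embedded tokens, so its input size is embedding_dim (not the vocabulary size).
# Note: with num_layers=1 the dropout argument has no effect and PyTorch emits a warning.
encoder = torch.nn.LSTM(embedding_dim, hidden_size, num_layers=1, bias=True, batch_first=True, dropout=dropout_rate, bidirectional=False)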



In [9]:
x=embd_A(padded_sequences_A)
hidden_states, (hn,cn)= encoder(x)

In [10]:
hn.shape,cn.shape

(torch.Size([1, 3, 256]), torch.Size([1, 3, 256]))

#### Decoder: LSTM
The decoder generates the output sequence $ Y = \{y_1, y_2, \ldots, y_{T'}\} $ one token at a time, based on the context vector from the encoder.

1. **Input to the Decoder**:
   - At the first time step, the decoder takes the **start-of-sequence token** (`<SOS>`) as input.
   - For subsequent steps, the decoder uses either the ground-truth token (during training, a strategy known as teacher forcing) or its own predicted token (during inference).

2. **LSTM Decoder**:
   - The decoder is a unidirectional LSTM that predicts the next token based on its current hidden state, cell state, and the previously generated token.
   - At each time step $ t' $:
     $$
     h_{t'}, c_{t'} = \text{LSTM}(y_{t'-1}, (h_{t'-1}, c_{t'-1}))
     $$

3. **Output Layer**:
   - A fully connected layer projects the hidden state $ h_{t'} $ to the size of the target vocabulary, followed by a softmax function to produce a probability distribution over possible next tokens:
     $$
     o_{t'} = \text{softmax}(W \cdot h_{t'} + b)
     $$
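
The three steps above can be sketched as a greedy decoding loop for inference. This is an illustration under stated assumptions: the token ids `sos_idx` and `eos_idx`, the `max_len` cap, the sizes, and the zero initial state (a stand-in for the encoder's final state) are all hypothetical, and the layers are untrained, so the generated ids are meaningless.

```python
import torch

torch.manual_seed(0)

vocab_size, embedding_dim, hidden_size = 12, 8, 16  # assumed toy sizes
sos_idx, eos_idx, max_len = 1, 2, 10                # hypothetical special tokens and length cap

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
decoder = torch.nn.LSTM(embedding_dim, hidden_size, batch_first=True)
output_layer = torch.nn.Linear(hidden_size, vocab_size)

# Stand-ins for the encoder's final state for a single sentence.
h = torch.zeros(1, 1, hidden_size)
c = torch.zeros(1, 1, hidden_size)

token = torch.tensor([[sos_idx]])  # step 1: start with the start-of-sequence token
generated = []
with torch.no_grad():
    for _ in range(max_len):
        out, (h, c) = decoder(embedding(token), (h, c))  # step 2: one LSTM step
        logits = output_layer(out[:, -1])                # step 3: project to vocabulary scores
        token = logits.argmax(dim=-1, keepdim=True)      # greedy choice, shape (1, 1)
        if token.item() == eos_idx:
            break
        generated.append(token.item())

print(generated)
```

Taking the argmax of the logits gives the same token as taking the argmax of the softmax output, so the softmax is omitted in the loop.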


In [11]:
num_embeddings_B=len(w2i_B)
padding_idx_B=w2i_B['<pad>']



embd_B= torch.nn.Embedding(num_embeddings_B,  embedding_dim, padding_idx=padding_idx_B)
decoder=torch.nn.LSTM(embedding_dim ,hidden_size, num_layers=1, bias=True, batch_first=True, dropout=dropout_rate, bidirectional=False)

In [12]:
hn.shape

torch.Size([1, 3, 256])

In [13]:
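# Teacher forcing: feed the embedded ground-truth target tokens to the decoder,
# initialized with the encoder's final hidden and cell states
y = embd_B(padded_sequences_B)

_, (h, c) = decoder(y, (hn, cn))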

In [40]:
import torch

class Seq2Seq(torch.nn.Module):
    """LSTM encoder-decoder that returns the decoder's hidden states."""

    def __init__(self, num_embeddings_A, num_embeddings_B, embedding_dim, hidden_size, padding_idx_A, padding_idx_B, dropout_rate=0):
        super().__init__()

        # Source-side embedding and encoder
        self.embd_A = torch.nn.Embedding(num_embeddings_A, embedding_dim, padding_idx=padding_idx_A)
        self.encoder = torch.nn.LSTM(embedding_dim, hidden_size, num_layers=1, bias=True, batch_first=True, dropout=dropout_rate, bidirectional=False)

        # Target-side embedding and decoder
        self.embd_B = torch.nn.Embedding(num_embeddings_B, embedding_dim, padding_idx=padding_idx_B)
        self.decoder = torch.nn.LSTM(embedding_dim, hidden_size, num_layers=1, bias=True, batch_first=True, dropout=dropout_rate, bidirectional=False)

    def forward(self, padded_sequences_A, padded_sequences_B):
        # Encode the source batch; keep only the final hidden and cell states
        x = self.embd_A(padded_sequences_A)
        _, (hn, cn) = self.encoder(x)

        # Decode the target batch (teacher forcing), starting from the encoder's final state
        y = self.embd_B(padded_sequences_B)
        hidden_states, _ = self.decoder(y, (hn, cn))

        # Shape (batch, target_len, hidden_size); no projection to the vocabulary is applied here
        return hidden_states

In [41]:
model= Seq2Seq(num_embeddings_A, num_embeddings_B, embedding_dim ,hidden_size,padding_idx_A, padding_idx_B,dropout_rate)
model(padded_sequences_A, padded_sequences_B)

tensor([[[ 0.1398,  0.0054, -0.0953,  ...,  0.1783, -0.1887,  0.0801],
         [ 0.0886,  0.0235, -0.1270,  ...,  0.1277,  0.0654, -0.0631],
         [ 0.0864,  0.0385, -0.0349,  ...,  0.0870,  0.0469, -0.0298],
         [ 0.0536,  0.0399, -0.0320,  ...,  0.0691,  0.0215, -0.0258]],

        [[-0.2406,  0.0935,  0.3156,  ...,  0.2072, -0.2333, -0.1745],
         [ 0.0865,  0.0797,  0.2805,  ...,  0.0488, -0.0515, -0.3543],
         [-0.1559, -0.2132,  0.1505,  ..., -0.1188, -0.0030,  0.1048],
         [-0.0455,  0.0320, -0.0599,  ..., -0.0428,  0.0447,  0.2450]],

        [[-0.0681,  0.1804,  0.1041,  ..., -0.2098, -0.1247, -0.0469],
         [-0.1781,  0.1230, -0.0583,  ..., -0.1554, -0.2372, -0.0904],
         [-0.0829, -0.0630,  0.0046,  ..., -0.2933,  0.0115,  0.1354],
         [-0.0460, -0.0196, -0.0365,  ..., -0.0743,  0.0064,  0.0865]]],
       grad_fn=<TransposeBackward0>)
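
The tensor above has shape `(batch, target_len, hidden_size)`: it holds decoder hidden states, not probabilities over the target vocabulary. A possible next step, sketched below, adds the output layer described earlier and runs one training step. `Seq2SeqWithHead` is a hypothetical variant of the `Seq2Seq` class above; the random token ids, the sizes, the shifted-target convention, and the choice of Adam are assumptions of this example.

```python
import torch

torch.manual_seed(0)

class Seq2SeqWithHead(torch.nn.Module):
    """Hypothetical variant of Seq2Seq that also projects to vocabulary logits."""

    def __init__(self, src_vocab, tgt_vocab, embedding_dim, hidden_size, src_pad, tgt_pad):
        super().__init__()
        self.embd_A = torch.nn.Embedding(src_vocab, embedding_dim, padding_idx=src_pad)
        self.encoder = torch.nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.embd_B = torch.nn.Embedding(tgt_vocab, embedding_dim, padding_idx=tgt_pad)
        self.decoder = torch.nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.out = torch.nn.Linear(hidden_size, tgt_vocab)  # W and b of the output layer

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.embd_A(src))
        hidden_states, _ = self.decoder(self.embd_B(tgt_in), state)
        return self.out(hidden_states)  # logits: (batch, target_len, tgt_vocab)

src_vocab, tgt_vocab, pad = 12, 14, 0        # assumed toy sizes and padding index
src = torch.randint(1, src_vocab, (3, 5))    # stand-in for padded source ids
tgt = torch.randint(1, tgt_vocab, (3, 4))    # stand-in for padded target ids

model = Seq2SeqWithHead(src_vocab, tgt_vocab, 8, 16, pad, pad)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss(ignore_index=pad)

# The decoder reads tokens 0..n-2 and is trained to predict tokens 1..n-1.
logits = model(src, tgt[:, :-1])
loss = criterion(logits.reshape(-1, tgt_vocab), tgt[:, 1:].reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

With real data, the target sequences would start with a start-of-sequence token and end with an end-of-sequence token, so that the shift lines up the first real word with the first prediction.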