# Seq2Seq
- Encoder: encodes the source sentence into a *single vector* -> **context vector** (*abstract representation of the input sentence*).
- Decoder: learns to *generate the output sentence* given the context vector, one word at a time

![image.png](attachment:image.png)
- The input sentence is passed through the **Embedding** layer before going through the **RNN**.
- At each time step, the RNN takes 2 inputs:
 1. The embedding of the current word
 2. The hidden state from the previous time step h<sub>t-1</sub>
- It outputs a new hidden state h<sub>t</sub>
- *h<sub>t</sub>* (hidden state) -> a vector representation of the sentence read so far
----------------------
- h<sub>0</sub> of **Encoder** -> initialized to 0 or a learned parameter
- h<sub>0</sub> of **Decoder** -> initialized to Z (the final encoder hidden state, i.e. the context vector)
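
The recurrence and the encoder-to-decoder hand-off described above can be sketched without any framework. This is a minimal illustration, not the model built later in the notebook: it assumes a plain Elman RNN step, toy dimensions, random weights, a hypothetical four-token vocabulary, and (for brevity) one embedding table shared by encoder and decoder.

```python
import math
import random

random.seed(0)

EMB_DIM, HID_DIM = 3, 4  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_step(x, h_prev, W_x, W_h):
    """One Elman RNN step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1})."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[i], x))
            + sum(w * hi for w, hi in zip(W_h[i], h_prev))
        )
        for i in range(len(h_prev))
    ]


# Hypothetical toy vocabulary and embedding table (one vector per token id).
vocab = {"<sos>": 0, "guten": 1, "morgen": 2, "<eos>": 3}
embedding = rand_matrix(len(vocab), EMB_DIM)

enc_W_x, enc_W_h = rand_matrix(HID_DIM, EMB_DIM), rand_matrix(HID_DIM, HID_DIM)
dec_W_x, dec_W_h = rand_matrix(HID_DIM, EMB_DIM), rand_matrix(HID_DIM, HID_DIM)

# Encoder: h_0 is all zeros; each step consumes one embedding and h_{t-1}.
h = [0.0] * HID_DIM
for token in ["<sos>", "guten", "morgen", "<eos>"]:
    h = rnn_step(embedding[vocab[token]], h, enc_W_x, enc_W_h)
z = h  # context vector Z = final encoder hidden state

# Decoder: its h_0 is the context vector Z, and its first input is <sos>.
s = rnn_step(embedding[vocab["<sos>"]], z, dec_W_x, dec_W_h)
print(len(z), len(s))
```

Both vectors have `HID_DIM` entries: the whole source sentence is squeezed into one fixed-size vector, no matter how long the sentence is.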
-------------------------
#### Teacher Forcing 
- While training: the next decoder input is sometimes the ground-truth next word in the sequence y<sub>t</sub> and sometimes the word predicted by our decoder y_pred<sub>t-1</sub>
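
The choice between the two inputs can be sketched as a coin flip at every decoding step. This is an illustrative sketch only: the helper `choose_next_input`, the `teacher_forcing_ratio` parameter name, and the stand-in `predicted` list are assumptions made for the example, not code from this notebook.

```python
import random


def choose_next_input(ground_truth_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the token fed to the decoder at the next time step.

    With probability `teacher_forcing_ratio` the ground-truth token is used,
    otherwise the decoder's own prediction is used.
    """
    use_teacher = rng.random() < teacher_forcing_ratio
    return ground_truth_token if use_teacher else predicted_token


rng = random.Random(0)
target = ["<sos>", "good", "morning", "<eos>"]
predicted = ["<sos>", "nice", "evening", "<eos>"]  # stand-in for decoder outputs

decoder_input = target[0]  # decoding always starts from <sos>
inputs_used = [decoder_input]
for t in range(1, len(target) - 1):
    decoder_input = choose_next_input(target[t], predicted[t], 0.5, rng)
    inputs_used.append(decoder_input)
print(inputs_used)
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0.0 always feeds the model's own predictions.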

# Preparing Data 

In [8]:
# !pip install torchtext==0.5
# !python -m spacy download en
# !python -m spacy download de

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [10]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [11]:
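# loading spaCy models ('de' and 'en' are shortcut names; spaCy 3+ removed them
# and needs full model names such as 'de_core_news_sm' and 'en_core_web_sm')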
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [23]:
def tokenize_de(text):
    # Tokenize German text and reverse the token order; per the paper, reversing
    # the source introduces many short-term dependencies that ease optimization.
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [24]:
SRC = Field(tokenize=tokenize_de,
           init_token='<sos>',
           eos_token='<eos>',
           lower=True)

TRG = Field(tokenize=tokenize_en,
           init_token='<sos>',
           eos_token='<eos>',
           lower=True)

In [25]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                   fields=(SRC, TRG))

In [28]:
# check that the expected number of examples was loaded
print(f"# of training examples: {len(train_data.examples)}")
print(f"# of validation examples: {len(valid_data.examples)}")
print(f"# of testing examples: {len(test_data.examples)}")

# of training examples: 29000
# of validation examples: 1014
# of testing examples: 1000


In [33]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [34]:
# build the vocab from the training set only (avoids data leakage);
# tokens seen fewer than min_freq times are mapped to <unk>
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
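
To make the effect of `min_freq=2` concrete, here is a simplified, framework-free sketch of frequency-thresholded vocabulary building. It is not torchtext's implementation: the helper names `build_vocab` and `numericalize`, the order of the special tokens, and the alphabetical index order are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    itos = list(specials) + kept
    return {tok: i for i, tok in enumerate(itos)}


def numericalize(sentence, stoi):
    """Convert tokens to indices; unseen or rare tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in sentence]


# Toy training corpus: "walk" and "dogs" appear only once.
train = [["two", "men", "walk"], ["two", "dogs", "run"], ["men", "run"]]
stoi = build_vocab(train, min_freq=2)

print(numericalize(["two", "dogs", "run"], stoi))  # -> [6, 0, 5]
```

"dogs" occurs only once in the toy corpus, so it is replaced by the `<unk>` index 0, just as any word unseen during training would be.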

In [35]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7854
Unique tokens in target (en) vocabulary: 5893


In [36]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Bucket iterators
- Pads the sequences in each batch to the same length
- Minimizes the amount of padding by batching sequences of similar length together
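
The padding saved by length-based batching can be shown with a small, framework-free sketch. This is a simplification and not how `BucketIterator` is implemented: it fully sorts the data by length and does no shuffling, and the helper names `pad_batches` and `count_padding` are hypothetical.

```python
def pad_batches(sequences, batch_size, pad_token="<pad>"):
    """Split sequences into consecutive batches and pad each batch to its longest member."""
    batches = []
    for i in range(0, len(sequences), batch_size):
        chunk = sequences[i:i + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_token] * (max_len - len(seq)) for seq in chunk])
    return batches


def count_padding(batches, pad_token="<pad>"):
    """Total number of padding tokens across all batches."""
    return sum(tok == pad_token for batch in batches for seq in batch for tok in seq)


# Toy "sentences" of lengths 2, 9, 3, 8, 2, 10.
sentences = [["a"] * n for n in (2, 9, 3, 8, 2, 10)]

naive = pad_batches(sentences, batch_size=3)                      # original order
bucketed = pad_batches(sorted(sentences, key=len), batch_size=3)  # similar lengths together

print(count_padding(naive), count_padding(bucketed))  # -> 23 5
```

Grouping sequences of similar length cuts the padding from 23 tokens to 5 in this toy case, so less computation is spent on `<pad>` positions.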

In [37]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

