# Machine Translation: From "Hello World" to "Hallo Welt"

In this project we use PyTorch to train a machine translation model that translates English into German.
We train the model on the [Multi30k dataset](https://github.com/multi30k/dataset/tree/master/data/task1/raw) and evaluate it on the same dataset's validation and test splits.

In this notebook we show how to:
- Train a Seq2Seq model from scratch
- Apply attention to our model
- Load pretrained embedding weights
- Design a transformer model to better parallelize computations

In [None]:
#!python -m spacy download en_core_web_sm
#!python -m spacy download de_core_news_sm

In [20]:
import io
from collections import Counter
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchtext
from torchtext.vocab import Vocab
from torchtext.utils import download_from_url, extract_archive
from torchtext.data.utils import get_tokenizer

## Dataset Loading & Preprocessing

In [7]:
url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')

In [8]:
train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

100%|██████████| 637k/637k [00:00<00:00, 15.6MB/s]
100%|██████████| 569k/569k [00:00<00:00, 9.15MB/s]
100%|██████████| 24.7k/24.7k [00:00<00:00, 8.30MB/s]
100%|██████████| 21.6k/21.6k [00:00<00:00, 7.25MB/s]
100%|██████████| 22.9k/22.9k [00:00<00:00, 7.62MB/s]
100%|██████████| 21.1k/21.1k [00:00<00:00, 6.90MB/s]


In [10]:
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

In [22]:
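def build_vocab(filepath, tokenizer):
    """Count every token in the file and build a vocabulary from the counts."""
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            # Strip the trailing newline so it is not counted as a token
            counter.update(tokenizer(string_.rstrip("\n")))
    return Vocab(counter)


def data_process(filepaths):
    """Convert parallel German/English files into (de_tensor, en_tensor) pairs.

    Relies on the module-level de_vocab and en_vocab defined below.
    """
    data = []
    # Context managers close both files once every line pair has been read
    with io.open(filepaths[0], encoding="utf8") as de_file, \
            io.open(filepaths[1], encoding="utf8") as en_file:
        for raw_de, raw_en in zip(de_file, en_file):
            de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de.rstrip("\n"))],
                                      dtype=torch.long)
            en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en.rstrip("\n"))],
                                      dtype=torch.long)
            data.append((de_tensor_, en_tensor_))
    return data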


de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)
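
To make the token-to-index mapping concrete, here is a minimal, standard-library-only sketch of what the vocabulary and `data_process` steps do conceptually. It is not torchtext's implementation: `build_toy_vocab` and `numericalize` are hypothetical helpers, whitespace splitting stands in for the spaCy tokenizers, and the ordering rule (special tokens first, then tokens by descending frequency) is an assumption made for illustration.

```python
from collections import Counter


def build_toy_vocab(sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer index (hypothetical helper)."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.split())  # whitespace split stands in for spaCy
    # Special tokens come first, then tokens by descending count (ties alphabetical)
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [token for token, _ in ordered]
    return {token: index for index, token in enumerate(itos)}


def numericalize(sentence, stoi):
    """Replace every token with its index; unseen tokens fall back to <unk>."""
    unk_index = stoi["<unk>"]
    return [stoi.get(token, unk_index) for token in sentence.split()]


toy_stoi = build_toy_vocab(["ein Hund läuft", "ein Mann läuft"])
ids = numericalize("ein Hund schläft", toy_stoi)
print(ids)  # "schläft" was never seen, so it maps to the <unk> index
```

The fallback to `<unk>` matters because the vocabularies are built from the training split only, so the validation and test files can contain tokens the model has never seen.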

### DataLoader Creation
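
The sentence pairs produced above have different lengths, so they must be padded before they can be stacked into a batch. The following is one possible sketch, not a definitive implementation: `collate_batch` and `toy_data` are hypothetical names, and `PAD_IDX = 1` is an assumption about where `<pad>` sits in the vocabularies (in the notebook it could be looked up with `de_vocab['<pad>']`). Start-of-sentence and end-of-sentence markers are deliberately left out.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 1  # assumption: index of '<pad>' in both vocabularies


def collate_batch(batch):
    """Pad a list of (de_tensor, en_tensor) pairs to a common length."""
    de_batch, en_batch = zip(*batch)
    # pad_sequence returns shape (max_len, batch_size) by default
    de_padded = pad_sequence(list(de_batch), padding_value=PAD_IDX)
    en_padded = pad_sequence(list(en_batch), padding_value=PAD_IDX)
    return de_padded, en_padded


# Toy stand-in for train_data: two sentence pairs of unequal length
toy_data = [
    (torch.tensor([4, 5, 6]), torch.tensor([7, 8])),
    (torch.tensor([9]), torch.tensor([10, 11, 12, 13])),
]
toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False,
                        collate_fn=collate_batch)
src_batch, tgt_batch = next(iter(toy_loader))
print(src_batch.shape, tgt_batch.shape)
```

Passing `train_data` instead of `toy_data` (with `shuffle=True` for training) would yield batches in the same layout, one column per sentence.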

## Seq2Seq Model

### Model Training

### Model Evaluation

### Inference on Custom Input

## Seq2Seq with Attention

### Model Training

### Model Evaluation

### Inference on Custom Input

## Transformer Model

### Model Training

### Model Evaluation

### Inference on Custom Input