<a href="https://colab.research.google.com/github/kristina-albrecht/colab/blob/master/NMT_Experimental.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Document Translation System from Scratch

\* inspired by [Adam Geitgey's Post](https://medium.com/@ageitgey/build-your-own-google-translate-quality-machine-translation-system-d7dc274bd476), [Google Colab Notebook - NMT With Attention](https://colab.research.google.com/github/tensorflow/tensorflow/blob/r1.9/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=AOpGoE2T-YXS)

> This is an ongoing project with the purpose of of demonstrating how to design and implement a machine translation system powered by machine learning model from scratch. This will include
*   building a machine translation model (including finding and preparing data to train the model)
*   defining a translation pipeline (including cleaning the data, normalizing text sentences, tokenizing the input etc.)
*   building an application where you can upload a document in one language (let's say German) and download this document translated into another languaget (e.g English)


## Machine Translation Model
The following notebook will be the first step and will train a sequence to sequence model for machine translation from German into English. The base model is taken from the [Google Colab Notebook - NMT With Attention](https://colab.research.google.com/github/tensorflow/tensorflow/blob/r1.9/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=AOpGoE2T-YXS) and will be later modefied by using word embeddings instead of word vectors as an input and trying different model architectures (like LSTM with attention). 

### Traning Data

To train the model we will use German-English parallel corpus from [European Parliament Proceedings Parallel Corpus 1996-2011](http://www.statmt.org/europarl/).

In [50]:
from __future__ import absolute_import, division, print_function
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import urllib3
import shutil
import zipfile
import gzip
import unicodedata
import re
import numpy as np
import os
import time

import tensorflow as tf
print(tf.__version__)

2.2.0


#### Download and extract data



In [53]:
http = urllib3.PoolManager()

def download_extract(save_to, url):
  _, filename = os.path.split(url)
  path_to_zip = os.path.join(save_to, filename)

  with http.request('GET', url, preload_content=False) as r:
    with gzip.open(r, 'rb') as f_in:
      with open(path_to_zip, 'wb') as out_file:
        shutil.copyfileobj(r, out_file)

  with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    member= zip_ref.namelist().pop(0)
    zip_ref.extract(member, save_to)
  return os.path.join(save_to, member)

def extract(fileame, save_as):

  path_to_zip = os.path.join(save_to, filename)

  with http.request('GET', url, preload_content=False) as r:
    with gzip.open(r, 'rb') as f_in:
      with open(path_to_zip, 'wb') as out_file:
        shutil.copyfileobj(r, out_file)

  with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    member= zip_ref.namelist().pop(0)
    zip_ref.extract(member, save_to)
  return os.path.join(save_to, member)


In [56]:

# Download file from url https://www.statmt.org/europarl
url = 'de-en-short.zip'
path = os.getcwd()
path_to_file = extract(path, url)
path_to_file

MaxRetryError: ignored

In [None]:
http://opus.nlpl.eu/Wikipedia/v1.0/de-en_sample.html

In [17]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    
    w = w.rstrip().strip()
    
    # adding a start and an end token to the sentence
    # so that the model know when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w

In [35]:
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, GERMAN]
def create_dataset(path, num_examples):
    lines = open(path, encoding='UTF-8').read().strip().split('\n')
    
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')]  for l in lines[:num_examples]]
    
    return word_pairs

In [20]:
# This class creates a word -> index mapping (e.g,. "dad" -> 5) and vice-versa 
# (e.g., 5 -> "dad") for each language,
class LanguageIndex():
  def __init__(self, lang):
    self.lang = lang
    self.word2idx = {}
    self.idx2word = {}
    self.vocab = set()
    
    self.create_index()
    
  def create_index(self):
    for phrase in self.lang:
      self.vocab.update(phrase.split(' '))
    
    self.vocab = sorted(self.vocab)
    
    self.word2idx['<pad>'] = 0
    for index, word in enumerate(self.vocab):
      self.word2idx[word] = index + 1
    
    for word, index in self.word2idx.items():
      self.idx2word[index] = word

In [38]:
def max_length(tensor):
    return max(len(t) for t in tensor)


def load_dataset(path, num_examples):
    # creating cleaned input, output pairs
    pairs = create_dataset(path, num_examples)
    # index language using the class defined above    
    inp_lang = LanguageIndex(ger for en, ger in pairs)
    targ_lang = LanguageIndex(en for en, ger in pairs)
    
    # Vectorize the input and target languages
    
    # German sentences
    input_tensor = [[inp_lang.word2idx[s] for s in sp.split(' ')] for en, ger in pairs]
    
    # English sentences
    target_tensor = [[targ_lang.word2idx[s] for s in en.split(' ')] for en, ger in pairs]
    
    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)
    
    # Padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, 
                                                                 maxlen=max_length_inp,
                                                                 padding='post')
    
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, 
                                                                  maxlen=max_length_tar, 
                                                                  padding='post')
    
    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar

In [39]:
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



ValueError: ignored

In [None]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)