![Save2Drive](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/save2drive.png)

# Language Translation

In this project we will be teaching a model to translate from English to French. After you go through this notebook once, you can teach the model to translate from English to Spanish, Italian, or another language of your choice (just ask us in office hours!) or translate to English from any other language.

Before we get started, here is an overview of how language works.

![title](slides/slide1.png)

![title](img/slide2.png)

# Setup

In [1]:
# Setup - run
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  !rm -r Language_Translation
  !cp -r Language_Translation/data/ .
  !cp -r Language_Translation/slides/ .
  !echo "=== Files Copied ==="
from language_translation_help import *

# Loading Data Files

The data for this project is a set of many thousands of English to French translation pairs. The file is a tab separated list of translation pairs:

```
I am cold.    J'ai froid.
```
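
As a rough illustration of that layout, the sketch below splits tab-separated lines into pairs. The sample lines and the `split_pair` name are made up for this example; the notebook itself does this step inside the helper file.

```python
# Hypothetical sample lines in the same tab-separated layout as the data file.
sample_lines = [
    "I am cold.\tJ'ai froid.",
    "He is tall.\tIl est grand.",
]

def split_pair(line):
    """Split one tab-separated line into [english, french]."""
    return line.rstrip("\n").split("\t")

sample_pairs = [split_pair(line) for line in sample_lines]
print(sample_pairs[0])
```

Each pair is a two-item list, which is why later cells refer to `pair[0]` and `pair[1]`.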

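To make the translation easier, we perform several preprocessing steps, including:
* making all characters lowercase --> `.lower()`
* separating sentence punctuation from words by inserting a space before it --> `re.sub(r"([.!?])", r" \1", s)`
* replacing every other non-letter character (such as an apostrophe) with a space --> `re.sub(r"[^a-zA-Z.!?]+", r" ", s)`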

In [2]:
def normalize_string(s):
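    # Lowercase, trim whitespace, and convert accented letters to plain ASCII
    s = unicode_to_ascii(s.lower().strip())
    # Put a space before sentence punctuation so it becomes its own token
    s = re.sub(r"([.!?])", r" \1", s)
    # Collapse anything that is not a letter or . ! ? into a single space
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)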
    return s
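
To see what these steps do to a sentence, here is a self-contained sketch. The `unicode_to_ascii_sketch` function is an assumption about what the helper's `unicode_to_ascii` does (dropping accent marks); the two regular expressions are copied from the cell above.

```python
import re
import unicodedata

def unicode_to_ascii_sketch(s):
    # Assumed behaviour of the helper: decompose accented characters and
    # drop the combining marks, so an accented "e" becomes a plain "e".
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

def normalize_string_sketch(s):
    s = unicode_to_ascii_sketch(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes one space
    return s

print(normalize_string_sketch("I'm impressed."))          # i m impressed .
print(normalize_string_sketch("Je suis impressionnée."))  # je suis impressionnee .
```

Note how the apostrophe in "I'm" turns into a space. This is why the prefixes used for filtering later are written as "i m " rather than "i'm ".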

# Filtering sentences

Since there are a *lot* of example sentences and we want to train something relatively quickly, we'll trim the data set to only relatively short and simple sentences. We're filtering to sentences that translate to the form "I am" or "He is" etc. (accounting for apostrophes being removed). After you go through this notebook, feel free to change these prefixes or add to them and see how that affects your model.

In [3]:
good_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re "
)

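To prepare this data, we use the following functions in our helper file:
* `read_langs`, which reads the text file, splits it into lines, splits each line into a pair, and normalizes both sentences with `normalize_string`
* `filter_pairs`, which keeps only the pairs accepted by the `filter_pair` function defined below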


In [4]:
def filter_pair(p, good_prefixes):
    # Keep a pair only if both sentences are shorter than MAX_LENGTH words
    # and the English sentence starts with one of the good prefixes.
    english_to = True  # set to False when translating *to* English
    english = p[0] if english_to else p[1]
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        english.startswith(good_prefixes)
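
The filter can be tried on a few hand-written pairs. In the sketch below, `MAX_LENGTH_SKETCH = 10` is an assumed value (the real `MAX_LENGTH` is defined in the helper file), and `keeps_pair`, `prefixes_sketch`, and the example pairs are made up for illustration.

```python
MAX_LENGTH_SKETCH = 10  # assumed value; the real MAX_LENGTH lives in the helper file

prefixes_sketch = ("i am ", "i m ", "he is", "he s ")

def keeps_pair(pair, prefixes, max_length=MAX_LENGTH_SKETCH):
    # Both sentences must be short, and the English side (pair[0])
    # must start with one of the prefixes.
    short_enough = all(len(s.split(" ")) < max_length for s in pair)
    return short_enough and pair[0].startswith(prefixes)

examples = [
    # kept: short, and starts with "i m "
    ["i m impressed .", "je suis impressionnee ."],
    # dropped: "we are" is not in the prefix list
    ["we are late .", "nous sommes en retard ."],
    # dropped: the English side has 14 words
    ["i am " + "very " * 10 + "tired .", "je suis tres fatigue ."],
]

kept = [p for p in examples if keeps_pair(p, prefixes_sketch)]
print(kept)
```

`str.startswith` accepts a tuple of prefixes, so one call checks all of them at once.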

In [5]:
def prepare_data(lang1_name, lang2_name, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1_name, lang2_name, normalize_string, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filter_pairs(pairs, good_prefixes, filter_pair)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Indexing words...")
    for pair in pairs:
        input_lang.index_words(pair[0])
        output_lang.index_words(pair[1])

    return input_lang, output_lang, pairs

The full process for preparing the data is:

* Read text file and split into lines, split lines into pairs
* Normalize text, filter by length and content
* Make word lists from sentences in pairs
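
The last step, making word lists, can be sketched as follows. `VocabSketch` is a hypothetical stand-in for the language objects returned by the helper file. The real class may differ, for example by reserving index numbers for start and end markers.

```python
class VocabSketch:
    """Hypothetical stand-in for the helper's language object."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}  # word -> number
        self.index2word = {}  # number -> word
        self.word2count = {}  # word -> how often it was seen

    def index_words(self, sentence):
        for word in sentence.split(" "):
            if word not in self.word2index:
                index = len(self.word2index)
                self.word2index[word] = index
                self.index2word[index] = word
                self.word2count[word] = 0
            self.word2count[word] += 1

vocab = VocabSketch("eng")
for sentence in ["i m impressed .", "i m cold ."]:
    vocab.index_words(sentence)

print(vocab.word2index)
```

Giving every word a number is what lets the model work with sentences as lists of integers.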

In [6]:
input_lang, output_lang, pairs = prepare_data('eng', 'fra', False)

# Print an example pair
print(random.choice(pairs))

Reading lines...
Read 177210 sentence pairs
Trimmed to 11253 sentence pairs
Indexing words...
['i m impressed .', 'je suis impressionnee .']


# Testing the Encoder and Decoder

The exact inputs and outputs are not important here; this cell only checks that the encoder and decoder run without errors.

In [7]:
word_input = Variable(torch.LongTensor([1, 2, 3]))
encoder_test = create_encoder()
decoder_test = create_decoder()

all_encoder_outputs = run_encoder(encoder_test, word_input)
decoder_outputs = run_decoder(decoder_test, word_input, all_encoder_outputs)



Finally, the helper file provides functions to print the time elapsed and the estimated time remaining, given the start time and the fraction of training completed.
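
A minimal sketch of how such a time estimate could be computed is shown below. The names `as_minutes_sketch` and `time_since_sketch`, the `now` parameter, and the output format are assumptions for illustration; the helper file's `time_since` may differ.

```python
import math
import time

def as_minutes_sketch(seconds):
    minutes = math.floor(seconds / 60)
    return "%dm %ds" % (minutes, seconds - minutes * 60)

def time_since_sketch(since, fraction_done, now=None):
    # `now` can be passed in so the arithmetic can be checked without waiting.
    now = time.time() if now is None else now
    elapsed = now - since
    estimated_total = elapsed / fraction_done  # if 25% took 90s, 100% takes 360s
    remaining = estimated_total - elapsed
    return "%s (- %s)" % (as_minutes_sketch(elapsed), as_minutes_sketch(remaining))

# 90 seconds elapsed with a quarter of the work done
print(time_since_sketch(since=0.0, fraction_done=0.25, now=90.0))  # 1m 30s (- 4m 30s)
```

The estimate assumes every training step takes about the same time, so it is only a rough guide.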

# Training Our Model

With everything in place we can actually initialize a network and start training.

To start, we initialize models, optimizers, and a loss function (criterion).

In [8]:
# Initialize models
all_vars_training = init_vars(input_lang, output_lang)

Then set up variables for tracking progress:

In [9]:
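# Configuring training
n_epochs = 5000     # number of training steps; each step uses one random sentence pair
print_every = 1000  # print a progress summary every 1000 steps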

# Keep track of time elapsed and running averages
start = time.time()
print_loss_total = 0 # Reset every print_every

To actually train, we call the train function many times, printing a summary as we go.

*Note:* If you run this notebook you can train, interrupt the kernel, evaluate, and continue training later. You can comment out the lines above where the encoder and decoder are initialized (so they aren't reset) or simply run the notebook starting from the following cell.

In [10]:
# Begin!
for epoch in range(1, n_epochs + 1):
    # Get training data for this cycle
    training_pair = variables_from_pair(random.choice(pairs), input_lang, output_lang)
    input_variable = training_pair[0]
    target_variable = training_pair[1]

    # Run the train function
    loss = train(input_variable, target_variable, all_vars_training)

    # Keep track of loss
    print_loss_total += loss

    # Print a progress summary every print_every steps
    if epoch % print_every == 0:
        print_loss_avg = print_loss_total / print_every
        print_loss_total = 0
        print_summary = '%s (%d %d%%) %.4f' % (time_since(start, epoch / n_epochs), epoch, epoch / n_epochs * 100, print_loss_avg)
        print(print_summary)



We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements:

In [None]:
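# The trained encoder and decoder are the first two items returned by init_vars
encoder, decoder = all_vars_training[0], all_vars_training[1]
for_evaluations = (input_lang, output_lang, encoder, decoder)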

In [None]:
def evaluate_randomly():
    pair = random.choice(pairs)
    output_words, decoder_attn = evaluate(pair[0], for_evaluations)
    output_sentence = ' '.join(output_words)

    print('>', pair[0])
    print('=', pair[1])
    print('<', output_sentence)
    print('')

In [None]:
evaluate_randomly()

# Visualizing attention

A useful property of the attention mechanism is its highly interpretable outputs. Because it is used to weight specific encoder outputs of the input sequence, we can imagine looking where the network is focused most at each time step.

You could simply run `plt.matshow(attentions)` to see the attention output displayed as a matrix, with the columns being input steps and the rows being output steps.
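
To make the row and column layout concrete, the sketch below reads a small attention matrix by hand. The weights are made up for illustration and are not output from the trained model; `most_attended` is a hypothetical helper name.

```python
input_words = ["i", "m", "cold"]
output_words = ["j", "ai", "froid"]

# Made-up attention weights: one row per output word, one column per input word.
attentions = [
    [0.8, 0.1, 0.1],  # while writing "j", the model looks mostly at "i"
    [0.2, 0.7, 0.1],  # while writing "ai", mostly at "m"
    [0.1, 0.1, 0.8],  # while writing "froid", mostly at "cold"
]

def most_attended(attn, in_words, out_words):
    """For each output word, find the input word with the largest weight."""
    result = []
    for out_word, row in zip(out_words, attn):
        best_column = max(range(len(row)), key=lambda i: row[i])
        result.append((out_word, in_words[best_column]))
    return result

print(most_attended(attentions, input_words, output_words))
```

In a plot of this matrix, the brightest square in each row marks the input word the model focused on for that output word.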

For a better viewing experience we will do the extra work of adding axes and labels:

In [None]:
def evaluate_and_show_attention(input_sentence, for_evaluations):
    output_words, attentions = evaluate(input_sentence, for_evaluations)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    show_attention(input_sentence, output_words, attentions)

In [None]:
evaluate_and_show_attention("hi my name is meera .", for_evaluations)