![Save2Drive](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/save2drive.png)

# Language Translation

In this project, we will teach a model to translate from English to French. After you go through this notebook once, you can teach the model to translate from English to Spanish, German, or another language of your choice (just ask us in office hours!), or to translate into English from another language.

Before we get started, here is an overview of how language translation works.

<img src="slides/overview.png">

<img src="slides/overview2.png">

<img src="slides/encoder.png">

<img src="slides/decoder.png">

<img src="slides/detail_overview.png">

# Setup

In [1]:
# Setup - run
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  !rm -rf Language_Translation
  !git clone https://github.com/meera9397/LanguageTranslation.git Language_Translation
  !cp -r Language_Translation/data/ .
  !cp -r Language_Translation/slides/ .
  !echo "=== Files Copied ==="
from language_translation_help import *

# Loading Data Files

The data for this project is a set of many thousands of English to French translation pairs. The file is a tab-separated list of translation pairs, one pair per line:

```
I am cold.    J'ai froid.
```

In order to make the translation easier, we perform several preprocessing steps, including
* making all characters lowercase --> .lower()
* stripping leading and trailing white space --> .strip()
* putting a space before sentence-ending punctuation and replacing every other non-letter character with a space --> re.sub(r"([.!?])", r" \1", s), re.sub(r"[^a-zA-Z.!?]+", r" ", s)


After you run through this notebook, you can come back here and play around with this cell. Think about the following questions when you do that:
#### What would happen if you didn't lowercase all the characters?
#### What would happen if you didn't strip the white space?
#### What would happen if you removed things besides punctuation?

In [2]:
def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
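
To see what each preprocessing step does, here is a small self-contained sketch. The real `unicode_to_ascii` comes from the helper module; the version below is an assumption about what it does (removing accents with Python's `unicodedata`), so treat it as an illustration and not as the helper's actual code. The `_sketch` names are made up for this example.

```python
import re
import unicodedata

def unicode_to_ascii_sketch(s):
    # Assumed behavior: split accented letters into base letter + accent mark,
    # then drop the accent marks (Unicode category "Mn")
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_string_sketch(s):
    s = unicode_to_ascii_sketch(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # any other run of non-letters becomes one space
    return s

print(normalize_string_sketch("  I'm not blushing!  "))  # i m not blushing !
print(normalize_string_sketch("J'ai froid."))            # j ai froid .
```

Notice that the apostrophe in "I'm" turns into a space, which is why contractions show up as "i m" in the rest of the notebook.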

# Filtering sentences

Since there are a *lot* of example sentences and we want to train something relatively quickly, we'll trim the data set to only relatively short and simple sentences. We're filtering to sentences that start with forms like "I am" or "He is" (including contracted forms such as "i m", which is what "I'm" becomes once normalization replaces the apostrophe with a space).

After you go through this notebook, feel free to change these prefixes or add to them and see how that affects your model. You can look through the data files in the data folder and see which prefixes are used that are not included here for ideas on what to add in this section. Think about the following questions when you do this:

#### Why do you think we include contractions? (ex. "i am" as well as "i m"). Do you see a decrease or increase in the performance of the encoder and decoder when removing contractions?
#### What are some other prefixes you chose to add or remove here? Why?

In [3]:
good_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re "
)
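
The filtering relies on the fact that Python's `str.startswith` accepts a tuple and returns True if the string starts with any item in it. The short sketch below (with a made-up tuple named `prefixes` and made-up sentences) also shows why a trailing space in a prefix matters.

```python
prefixes = ("i am ", "she s")

print("i am cold .".startswith(prefixes))           # True
print("i ambled home .".startswith(prefixes))       # False: "i am " requires the space
print("she sings well .".startswith(prefixes))      # True: "she s" has no trailing space
print("she sings well .".startswith(("she s ",)))   # False once the space is added
```

A prefix without a trailing space matches any sentence whose next word merely begins with those letters, so it lets in more sentences than the contraction alone.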

Here we have some functions to help us filter our data into sentences that have "good prefixes." If you decide that you want to perform a translation from a language to English, you can change the variable english_to in the function below.


In [4]:
def filter_pair(p, good_prefixes):
    # Change the following variable from True to False if you want to translate a certain language TO English.
    # This variable being True indicates that we are translating English into another language
    english_to = True
    # The English sentence is p[0] when translating from English, and p[1] otherwise
    english_sentence = p[0] if english_to else p[1]
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH and \
        english_sentence.startswith(good_prefixes)

def prepare_data(lang1_name, lang2_name, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1_name, lang2_name, normalize_string, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filter_pairs(pairs, good_prefixes, filter_pair)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Indexing words...")
    for pair in pairs:
        input_lang.index_words(pair[0])
        output_lang.index_words(pair[1])

    return input_lang, output_lang, pairs
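
Here is a toy version of the filtering step that you can run on its own. `MAX_LENGTH_SKETCH = 10` is an assumed value (the real `MAX_LENGTH` is defined in the helper module), and `keep_pair` and `toy_pairs` are made-up names for this illustration.

```python
MAX_LENGTH_SKETCH = 10  # assumed value; the real MAX_LENGTH lives in the helper module

def keep_pair(pair, prefixes, max_length=MAX_LENGTH_SKETCH):
    english, french = pair
    # Both sentences must have fewer than max_length space-separated tokens
    short_enough = (len(english.split(' ')) < max_length
                    and len(french.split(' ')) < max_length)
    return short_enough and english.startswith(prefixes)

toy_pairs = [
    ["i am cold .", "j ai froid ."],                   # short, good prefix: kept
    ["go .", "va !"],                                  # short, but no good prefix
    ["he is a teacher who likes to read many long books every day .",
     "il est professeur ."],                           # good prefix, but too long
]

kept = [p for p in toy_pairs if keep_pair(p, ("i am ", "he is"))]
print(kept)  # [['i am cold .', 'j ai froid .']]
```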

Now, we prepare our final data to input into our encoder and decoder using the "prepare_data" function. It takes in 3 variables:
* initial_lang: the language that we want to translate from. We have set this to 'eng', which is short for 'English', by default.
* final_lang: the language that we want to translate to. We have set this to 'fra', which is short for 'French', by default.
* reverse: our data is set up in a certain way such that there is a natural translation order. For example, our text file on translating between English and French is called 'eng-fra.txt', meaning that the natural order would be to translate from English to French. Thus no reversing has to be done, and reverse is False by default. If we wanted to translate from French to English, we would set our initial_lang to 'fra', our final_lang to 'eng', and reverse to True.

If you want to translate **from** French **to** English, set:
* initial_lang = 'fra'
* final_lang = 'eng'
* reverse: True

If you want to translate **from** English **to** Spanish, set:
* initial_lang = 'eng'
* final_lang = 'spa'
* reverse: False

If you want to translate **from** Spanish **to** English, set:
* initial_lang = 'spa'
* final_lang = 'eng'
* reverse: True

If you want to translate **from** English **to** German
* initial_lang = 'eng'
* final_lang = 'deu'
* reverse: False

If you want to translate **from** German **to** English, set:
* initial_lang = 'deu'
* final_lang = 'eng'
* reverse: True
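
One way to picture what reversing does is the sketch below. It is an assumption about the idea behind `read_langs` (the helper's real code is not shown in this notebook), and `read_pairs_sketch` is a made-up name: each line is split on the tab, and the two halves are swapped when reverse is True.

```python
def read_pairs_sketch(lines, reverse=False):
    # Each line holds two sentences separated by a tab
    pairs = [line.strip().split('\t') for line in lines]
    if reverse:
        # Swap each pair so the second column becomes the input language
        pairs = [list(reversed(p)) for p in pairs]
    return pairs

lines = ["I am cold.\tJ'ai froid.\n"]
print(read_pairs_sketch(lines))                # [['I am cold.', "J'ai froid."]]
print(read_pairs_sketch(lines, reverse=True))  # [["J'ai froid.", 'I am cold.']]
```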

#### This function outputs pairs of phrases in "initial_lang" and "final_lang", that is, the languages you want to translate from and to. We print an example pair at the end of the cell.

In [5]:
initial_lang = 'eng'
final_lang = 'fra'
reverse = False

input_lang, output_lang, pairs = prepare_data(initial_lang, final_lang, reverse)

# Print an example pair
print(random.choice(pairs))

Reading lines...
Read 177210 sentence pairs
Trimmed to 11253 sentence pairs
Indexing words...
['i m not blushing !', 'je ne rougis pas !']
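
The "Indexing words..." step gives every word a number, because the model works with numbers and not with text. Below is a minimal sketch of that idea. `VocabSketch` is a made-up class for illustration; the real language objects come from the helper module and may differ (for example, they may reserve some numbers for special start and end markers).

```python
class VocabSketch:
    def __init__(self, name):
        self.name = name
        self.word2index = {}  # word -> number
        self.index2word = {}  # number -> word
        self.n_words = 0

    def index_words(self, sentence):
        # Give each new word the next free number
        for word in sentence.split(' '):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1

vocab = VocabSketch('eng')
vocab.index_words('i m not blushing !')
vocab.index_words('i m happy .')
print(vocab.word2index)
print(vocab.n_words)  # 7
```

Words that appear in both sentences ("i" and "m") are only numbered once, so the vocabulary has 7 entries and not 9.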


# Testing the Encoder and Decoder
The exact inputs and outputs are not important for this cell. I just want you to get a little bit of intuition on how the encoder and decoder work together. We start with a certain input, "word_input", initialize an encoder, "encoder_test", and run the encoder on that input. We then pass the output of the encoder, "all_encoder_outputs", to the initialized decoder, "decoder_test", along with the initial input, to produce our final outputs.

In [6]:
word_input = Variable(torch.LongTensor([1, 2, 3]))
encoder_test = create_encoder()
decoder_test = create_decoder()

all_encoder_outputs = run_encoder(encoder_test, word_input)
_ = run_decoder(decoder_test, word_input, all_encoder_outputs)
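
Why does the decoder need all of the encoder outputs? A common reason is "attention": for each word it produces, the decoder decides how much to look at each input word. The sketch below shows that idea in plain Python with made-up numbers. Scoring with a dot product is an assumption made for this illustration; the helper's decoder may compute its scores differently.

```python
import math

def softmax(scores):
    # Turn arbitrary scores into positive weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy encoder outputs: one 2-number vector per input word (made-up values)
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy decoder state for the word currently being produced (made-up values)
decoder_state = [1.0, 0.0]

# Score each input word by its dot product with the decoder state
scores = [sum(d * e for d, e in zip(decoder_state, out)) for out in encoder_outputs]
weights = softmax(scores)

# Context vector: weighted average of the encoder outputs
context = [sum(w * out[i] for w, out in zip(weights, encoder_outputs)) for i in range(2)]

print(weights)
print(context)
```

The first and third input words get the same, larger weight because they match the decoder state best, so they contribute most to the context vector.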



# Training Our Model

<img src="slides/training.png">

The first step to "training" is initializing our encoder and decoder. We do this in one step, and have it hidden in a helper function for ease.

In [7]:
# Initialize models
all_vars_training = init_vars(input_lang, output_lang)

In the following cell, **n_epochs** is the number of training steps that we want to run. In this notebook, each "epoch" is one step in which the model trains on a single randomly chosen sentence pair. After going through this file, you can play around with this number. Think about the following questions:
#### Would increasing or decreasing n_epochs improve performance? Why?
#### Do you notice a big difference in the translation ability of your encoder/decoder when you increase/decrease n_epochs?

In [8]:
# Configuring training
n_epochs = 500
plot_every = 200
print_every = 100

# Keep track of time elapsed and running averages
start = time.time()
plot_losses = []
print_loss_total = 0 # Reset every print_every
plot_loss_total = 0 # Reset every plot_every
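
The two running totals work the same way: losses are added up, and every so many steps the total is divided by the number of steps and reset to zero. Here is a small sketch of that bookkeeping with made-up loss values; `window_averages` is a hypothetical name used only for this example.

```python
def window_averages(losses, every):
    averages = []
    total = 0.0
    for step, loss in enumerate(losses, start=1):
        total += loss
        if step % every == 0:
            averages.append(total / every)
            total = 0.0  # reset, like print_loss_total and plot_loss_total
    return averages

print(window_averages([4.0, 2.0, 3.0, 1.0, 5.0], every=2))  # [3.0, 2.0]
```

Averaging over a window smooths out the noise from individual sentence pairs, which makes the overall trend easier to see.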

In the following cell, we train our encoder and decoder! At each step, we compute a value called "loss", which is an indication of how bad our model is at language translation at the time (the higher the loss, the worse our model is at language translation). The loss should decrease over time.
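
Before running the training cell, it can help to see where a loss number might come from. The exact loss used inside `train` is not shown in this notebook; the sketch below assumes a common choice, the average negative log-probability that the model gives to the correct words. The function name and the probabilities are made up for illustration.

```python
import math

def sentence_loss(probabilities_of_correct_words):
    # Average negative log-probability assigned to the correct words
    n = len(probabilities_of_correct_words)
    return -sum(math.log(p) for p in probabilities_of_correct_words) / n

confident = sentence_loss([0.9, 0.8, 0.95])  # model is fairly sure of each correct word
unsure = sentence_loss([0.2, 0.1, 0.3])      # model gives the correct words low probability

print(round(confident, 4), round(unsure, 4))
```

The more probability the model puts on the correct words, the closer the loss gets to zero.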

In [9]:
# Begin!
for epoch in range(1, n_epochs + 1):
    # Get phrase in language to translate from (input variable, default = English phrase) and
    # phrase in language to translate to (target variable, default = French phrase)
    training_pair = variables_from_pair(random.choice(pairs), input_lang, output_lang)
    input_variable = training_pair[0]
    target_variable = training_pair[1]

    # Run the train function
    loss = train(input_variable, target_variable, all_vars_training)

    # Keep track of loss
    print_loss_total += loss
    plot_loss_total += loss

    if epoch % print_every == 0:
        print_loss_avg = print_loss_total / print_every
        print_loss_total = 0
        print_summary = '%s (%d %d%%) %.4f' % (time_since(start, epoch / n_epochs), epoch, epoch / n_epochs * 100, print_loss_avg)
        print(print_summary)
        
    if epoch % plot_every == 0:
        plot_loss_avg = plot_loss_total / plot_every
        plot_losses.append(plot_loss_avg)
        plot_loss_total = 0




0m 4s (- 8m 8s) (100 1%) 5.2268
0m 10s (- 8m 25s) (200 2%) 3.6346
0m 15s (- 8m 27s) (300 3%) 3.6203
0m 20s (- 8m 11s) (400 4%) 3.3858
0m 24s (- 7m 52s) (500 5%) 3.5507
0m 30s (- 7m 57s) (600 6%) 3.5649
0m 35s (- 7m 54s) (700 7%) 3.5139
0m 41s (- 8m 0s) (800 8%) 3.8078
0m 47s (- 8m 1s) (900 9%) 3.6309
0m 52s (- 7m 55s) (1000 10%) 3.4949
0m 58s (- 7m 55s) (1100 11%) 3.2284
1m 4s (- 7m 55s) (1200 12%) 3.3863
1m 11s (- 7m 56s) (1300 13%) 3.3993
1m 17s (- 7m 56s) (1400 14%) 3.3985
1m 23s (- 7m 53s) (1500 15%) 3.4245
1m 29s (- 7m 49s) (1600 16%) 3.3926
1m 35s (- 7m 45s) (1700 17%) 3.3626
1m 41s (- 7m 40s) (1800 18%) 3.4642
1m 46s (- 7m 35s) (1900 19%) 3.5025
1m 53s (- 7m 32s) (2000 20%) 3.3451
1m 58s (- 7m 24s) (2100 21%) 3.2737
2m 3s (- 7m 17s) (2200 22%) 3.3376
2m 7s (- 7m 8s) (2300 23%) 3.2352
2m 14s (- 7m 5s) (2400 24%) 3.2364
2m 20s (- 7m 0s) (2500 25%) 3.2909
2m 26s (- 6m 56s) (2600 26%) 3.1601
2m 31s (- 6m 50s) (2700 27%) 2.9886
2m 37s (- 6m 45s) (2800 28%) 3.0917
2m 43s (- 6m 40s) (2

KeyboardInterrupt: 
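
Each line of the training output shows the time elapsed, an estimate of the time remaining (after the "-"), the step number with the percentage completed, and the average loss. The sketch below shows one way such a time estimate can be computed. It is an assumption about what the helper `time_since` does, written with made-up function names, and it takes the elapsed seconds directly so that it always gives the same answer.

```python
import math

def as_minutes(seconds):
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)

def progress_string(elapsed_seconds, fraction_done):
    # If 25% of the work took 60 seconds, all of it should take about 240 seconds
    estimated_total = elapsed_seconds / fraction_done
    remaining = estimated_total - elapsed_seconds
    return '%s (- %s)' % (as_minutes(elapsed_seconds), as_minutes(remaining))

print(progress_string(60.0, 0.25))  # 1m 0s (- 3m 0s)
```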

The following cell plots the average loss recorded every plot_every steps. You should see the loss decreasing over time, as our encoder and decoder get better at language translation.

In [None]:
%matplotlib inline

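import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def show_plot(points):
    # plt.subplots() creates the figure and its axes in one call
    fig, ax = plt.subplots()
    loc = ticker.MultipleLocator(base=0.2) # put ticks at regular intervals
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

show_plot(plot_losses)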

# Evaluation

Now that we have trained our encoder and decoder, we can use them to perform translations! Below, in the "evaluate_randomly" function, we randomly pick a pair of phrases that we have trained on, and see how well we can translate that phrase. 

In [None]:
for_evaluations = (input_lang, output_lang, all_vars_training[0], all_vars_training[1])

In [None]:
def evaluate_randomly():
    pair = random.choice(pairs)
    output_words, decoder_attn = evaluate(pair[0], for_evaluations)
    output_sentence = ' '.join(output_words)
    
    print('>', pair[0])
    print('=', pair[1])
    print('<', output_sentence)
    print('')

You can keep running the cell below over and over again to see how well the translator does on various phrases.

In [None]:
evaluate_randomly()

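You can also evaluate the encoder/decoder on phrases that you come up with! Here is an example of how to do that.
### Note
The phrases you test have to start with one of the "good prefixes" and contain only words that the model has seen before. They should also be written the way normalized sentences look: lowercase, with a space in place of each apostrophe and a space before the final punctuation. This is why you may get errors if you change the "phrase" below.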

In [None]:
phrase = 'i m happy .'
output_words, _ = evaluate(phrase, for_evaluations)
output_sentence = ' '.join(output_words)
print('>', phrase)
print('<', output_sentence)
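
If you want to check a phrase before trying to translate it, a small helper like the one below can tell you what might go wrong. This is a sketch with made-up names and a made-up word set; with the real notebook objects you would build the set of known words from the input language's vocabulary, whose exact attribute names are defined in the helper module.

```python
def check_phrase(phrase, known_words, prefixes):
    problems = []
    if not phrase.startswith(prefixes):
        problems.append('does not start with a good prefix')
    unknown = [w for w in phrase.split(' ') if w not in known_words]
    if unknown:
        problems.append('unknown words: ' + ', '.join(unknown))
    return problems

known_words = {'i', 'm', 'happy', 'not', 'blushing', '.', '!'}
prefixes = ('i am ', 'i m ')

print(check_phrase('i m happy .', known_words, prefixes))     # []
print(check_phrase('we are tired .', known_words, prefixes))
```

An empty list means the phrase passes both checks; otherwise the list describes each problem found.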