# Introduction
In this notebook I will attempt to use attention for machine translation.
Just like in previous notebooks I am very much inspired by Francois Chollet's article [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). 

In this notebook I am using a [forked version of Keras](https://github.com/gustavgransbo/keras/releases/tag/2.2.1) that merges Anders Huss's [Pull Request #8296](https://github.com/keras-team/keras/pull/8296), which implements support for recurrent attention.

# Data
I will be using data from the same source as Chollet, http://www.manythings.org/anki/. I'm using the 17303-sentence swe-eng data set, which contains English sentences and their Swedish translations. The French data set used by Chollet is much larger, but he limited his training set to 10 000 sentences and used 20% of it for validation during training.

In [4]:
data_path = 'data/swe-eng/swe.txt'
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

Read all sentences.

In [5]:
input_sentences, target_sentences = [], []
for line in lines:
    try:
        input_text, target_text, *_ = line.split('\t')
    except ValueError:
        # Lines without a tab (e.g. an empty line at the end of the file) cannot
        # be unpacked. Print and skip them instead of reusing stale values.
        print(line)
        continue

    # Chollet uses tab as start of sentence and line feed as end of sentence characters.
    # The start of sentence character will be used to seed new sentences and the end
    # of sentence character will be used to terminate sentences.
    target_text = '\t' + target_text + '\n'
    input_sentences.append(input_text)
    target_sentences.append(target_text)
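
A minimal sketch of the same parsing step as a standalone function, run on made-up sample lines. The sentences are illustrative and not taken from the data set, and `parse_pair` is a hypothetical helper. It also shows that splitting on `'\n'` leaves an empty last element when the text ends with a newline, which is one way a line without a tab can appear.

```python
def parse_pair(line):
    """Return (input_text, target_text) for one tab-separated line, or None if malformed."""
    parts = line.split('\t')
    if len(parts) < 2:
        return None
    input_text, target_text = parts[0], parts[1]
    # Wrap the target in the start ('\t') and end ('\n') of sentence characters.
    return input_text, '\t' + target_text + '\n'


# Made-up sample text; any columns after the second one are ignored.
text = "Hi.\tHej.\nRun!\tSpring!\textra column\n"
lines = text.split('\n')

pairs = [pair for pair in map(parse_pair, lines) if pair is not None]
print(lines[-1] == '')
print(pairs)
```

Returning `None` for malformed lines keeps the decision to skip them in one place, at the call site.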




## Investigate sentence lengths.

In [6]:
import numpy as np

In [7]:
input_seq_lens = np.array([len(sentence) for sentence in input_sentences])
target_seq_lens = np.array([len(sentence) for sentence in target_sentences])

In [9]:
max_seq_len = 50

In [10]:
input_idx = np.where(input_seq_lens <= max_seq_len)
target_idx = np.where(target_seq_lens <= max_seq_len)

In [11]:
print("{} input sentences with {} or fewer characters".format(len(input_idx[0]), max_seq_len))
print("{} target sentences with {} or fewer characters".format(len(target_idx[0]), max_seq_len))

16741 input sentences with 50 or fewer characters
16522 target sentences with 50 or fewer characters


In [12]:
keep_idx = np.intersect1d(input_idx, target_idx)

In [13]:
print("{} input sentence pairs with {} or fewer characters in both languages".format(len(keep_idx), max_seq_len))

16404 input sentence pairs with 50 or fewer characters in both languages


In [14]:
input_sentences = np.array(input_sentences)[keep_idx]
target_sentences = np.array(target_sentences)[keep_idx]
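
The length filter can also be expressed with a single boolean mask. This is a sketch on toy lengths (the numbers are made up), and it checks that the mask selects the same indices as intersecting the two `np.where` results.

```python
import numpy as np

# Toy character counts for four sentence pairs.
input_lens = np.array([10, 60, 20, 45])
target_lens = np.array([12, 40, 55, 50])
max_len = 50

# A pair is kept only if both sides are short enough.
keep_mask = (input_lens <= max_len) & (target_lens <= max_len)
keep = np.flatnonzero(keep_mask)

# Same result via the intersection of the two index sets.
via_intersection = np.intersect1d(
    np.where(input_lens <= max_len)[0],
    np.where(target_lens <= max_len)[0],
)
print(keep, via_intersection)
```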

## Build vocabularies

In [15]:
input_characters = set()
target_characters = set()
for input_text, target_text in zip(input_sentences, target_sentences):
    for char in input_text:
        input_characters.add(char)
    for char in target_text:
        target_characters.add(char)

Sort the vocabularies.

In [16]:
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))

input_vocab_size = len(input_characters)
target_vocab_size = len(target_characters)

In [17]:
print("Input vocab size: {}".format(input_vocab_size))
print("Target vocab size: {}".format(target_vocab_size))

Input vocab size: 75
Target vocab size: 79


In [18]:
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

## Build padded data

First create completely empty data.

In [19]:
encoder_input_data = np.zeros(
    (len(input_sentences), max_seq_len, input_vocab_size),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_sentences), max_seq_len, target_vocab_size),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_sentences), max_seq_len, target_vocab_size),
    dtype='float32')

In [20]:
encoder_input_data.shape

(16404, 50, 75)

In [21]:
decoder_input_data.shape

(16404, 50, 79)

In [22]:
decoder_target_data.shape

(16404, 50, 79)

Now fill in all the values that we have available, leaving the rest as padding. Encode all characters using one-hot encoding.

In [23]:
for i, (input_text, target_text) in enumerate(zip(input_sentences, target_sentences)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
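
To make the one-timestep shift concrete, here is a small sketch that encodes a single toy target string and decodes both arrays back to text. The string, the tiny vocabulary and the `decode` helper are made up for illustration.

```python
import numpy as np

target_text = '\tab\n'               # start character, two letters, end character
chars = sorted(set(target_text))     # ['\t', '\n', 'a', 'b']
index = {c: i for i, c in enumerate(chars)}
max_len = 6                          # longer than the sentence, so the tail is padding

dec_in = np.zeros((max_len, len(chars)), dtype='float32')
dec_target = np.zeros((max_len, len(chars)), dtype='float32')
for t, c in enumerate(target_text):
    dec_in[t, index[c]] = 1.
    if t > 0:
        # The target at step t - 1 is the input character at step t.
        dec_target[t - 1, index[c]] = 1.


def decode(one_hot):
    """Map each non-padding row back to its character."""
    return ''.join(chars[row.argmax()] for row in one_hot if row.any())


print(repr(decode(dec_in)))      # decoder input keeps the start character
print(repr(decode(dec_target)))  # decoder target drops it and ends with '\n'
```

At every timestep the decoder input holds the character that precedes the one it is asked to predict.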

## Divide data into a training and a validation set
I will use 8000 sentences as the training set and 2000 as the validation set.

In [24]:
training_size, validation_size = 8000, 2000

In [25]:
shuffle_idx = np.random.permutation(len(input_sentences))

In [26]:
train_idx, val_idx = shuffle_idx[:training_size], shuffle_idx[training_size:training_size+validation_size]

In [27]:
encoder_input_train, encoder_input_val = encoder_input_data[train_idx], encoder_input_data[val_idx]

decoder_input_train, decoder_input_val = decoder_input_data[train_idx], decoder_input_data[val_idx]

decoder_target_train, decoder_target_val = decoder_target_data[train_idx], decoder_target_data[val_idx]
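
A sketch of the same split on toy sizes, with a fixed seed so the split can be reproduced. The seed is an assumption for illustration, since the cells above do not set one. Slicing one permutation guarantees that the two index sets do not overlap.

```python
import numpy as np

n_sentences, training_size, validation_size = 20, 8, 2  # toy sizes

# Fixed seed (an assumption; the notebook uses the global, unseeded generator).
rng = np.random.RandomState(42)
shuffle_idx = rng.permutation(n_sentences)

train_idx = shuffle_idx[:training_size]
val_idx = shuffle_idx[training_size:training_size + validation_size]

# Both slices come from the same permutation, so they share no index.
overlap = np.intersect1d(train_idx, val_idx)
print(len(train_idx), len(val_idx), len(overlap))
```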

# Training Model

Let's start with a simple model. Anders Huss proposes this one in the attention source code.

In [34]:
from keras.layers import LSTMCell, RNN, TimeDistributed, Dense
from keras import Model, Input
from keras.layers.attention import MixtureOfGaussian1DAttention

In [43]:
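# The *_french names are kept from the example this model is taken from;
# the targets in this notebook are Swedish.
input_english = Input((None, input_vocab_size))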
target_french_tm1 = Input((None, target_vocab_size))
cell = MixtureOfGaussian1DAttention(LSTMCell(64), components=3, heads=3)
attention_lstm = RNN(cell, return_sequences=True)
h_sequence = attention_lstm(target_french_tm1, constants=input_english)
output_layer = TimeDistributed(Dense(target_vocab_size, activation='softmax'))
predicted_french = output_layer(h_sequence)

In [44]:
train_model = Model(
            inputs=[target_french_tm1, input_english],
            outputs=predicted_french
        )

In [47]:
decoder_target_train.any(2).shape

(8000, 50)

In [51]:
sample_weights_train = decoder_target_train.any(2).astype(int)
sample_weights_val = decoder_target_val.any(2).astype(int)
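
A small sketch of what the `.any(2)` mask does, on toy data. Padded timesteps are all-zero rows, so they get weight 0. The accuracy below is computed by hand in NumPy to illustrate why the mask matters; it is not meant to reproduce how Keras computes `weighted_metrics` internally, and the constant predictions are made up.

```python
import numpy as np

# Toy batch: 2 sentences, 4 timesteps, vocabulary of 3 characters.
targets = np.zeros((2, 4, 3), dtype='float32')
targets[0, 0, 1] = targets[0, 1, 2] = 1.                      # 2 real timesteps
targets[1, 0, 0] = targets[1, 1, 1] = targets[1, 2, 2] = 1.   # 3 real timesteps

# 1 where a timestep holds a character, 0 where it is padding.
weights = targets.any(axis=2).astype(int)

# Made-up predictions: class index 1 at every timestep.
predicted = np.full((2, 4), 1)
correct = (predicted == targets.argmax(axis=2)).astype(float)

unweighted_acc = correct.mean()                            # padding counts as wrong
weighted_acc = (correct * weights).sum() / weights.sum()   # padding is ignored
print(weights)
print(unweighted_acc, weighted_acc)
```

Without the mask, padded timesteps take part in the average and distort the reported accuracy.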

In [53]:
train_model.compile(optimizer='Adam', loss='categorical_crossentropy', sample_weight_mode='temporal', weighted_metrics=['accuracy'])
train_model.fit(
    x=[decoder_input_train, encoder_input_train],
    y=decoder_target_train,
    sample_weight = sample_weights_train,
    epochs=2,
    batch_size = 64,
    validation_data = ([decoder_input_val, encoder_input_val], decoder_target_val, sample_weights_val)
)

Train on 8000 samples, validate on 2000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x22405d01908>

# Translation Model

In [92]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Greedy decoding for a single input sequence (batch of size 1).

    # Start with a target sequence of length 1 that holds only the start character.
    target_seq = np.zeros((1, 1, target_vocab_size))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        # Re-run the model on the whole target sequence decoded so far.
        output_tokens = train_model.predict([target_seq, input_seq])

        # Pick the most likely token at the last timestep.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or len(decoded_sentence) > max_seq_len):
            stop_condition = True

        # Grow the target sequence by one timestep that holds the sampled token.
        next_target = np.zeros((1, target_seq.shape[1] + 1, target_vocab_size))
        next_target[0, :-1, :] = target_seq
        next_target[0, -1, sampled_token_index] = 1.
        target_seq = next_target

    return decoded_sentence
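
The decoding loop can be tried without a trained model. This sketch replaces `train_model.predict` with a hypothetical stand-in, `fake_predict`, that always spells out a fixed string, and uses a tiny made-up vocabulary. Only the loop structure mirrors the function above.

```python
import numpy as np

vocab = ['\t', '\n', 'a', 'b']
token_index = {c: i for i, c in enumerate(vocab)}


def fake_predict(target_seq):
    """Hypothetical stand-in for a model: predicts 'ab\\n' one step at a time."""
    script = 'ab\n'
    steps = target_seq.shape[1]
    probs = np.zeros((1, steps, len(vocab)))
    for t in range(steps):
        probs[0, t, token_index[script[min(t, len(script) - 1)]]] = 1.
    return probs


def greedy_decode(predict_fn, max_len=10):
    # Seed the target sequence with the start character.
    target_seq = np.zeros((1, 1, len(vocab)))
    target_seq[0, 0, token_index['\t']] = 1.
    decoded = ''
    while True:
        probs = predict_fn(target_seq)
        idx = int(np.argmax(probs[0, -1, :]))   # most likely next token
        decoded += vocab[idx]
        if vocab[idx] == '\n' or len(decoded) >= max_len:
            return decoded
        # Append the chosen token and feed the longer sequence back in.
        one_hot = np.zeros((1, 1, len(vocab)))
        one_hot[0, 0, idx] = 1.
        target_seq = np.concatenate([target_seq, one_hot], axis=1)


result = greedy_decode(fake_predict)
print(repr(result))
```

Because the whole prefix is passed to the model again at every step, the work per sentence grows with the square of its length. That is acceptable for sentences of at most 50 characters.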

In [62]:
input_sentences_val = input_sentences[val_idx]

In [64]:
for seq_index in range(30):
    # Take one sequence (part of the validation set)
    # for trying out decoding.
    input_seq = encoder_input_val[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_val[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: I'll go in a minute.
Decoded sentence: JJJJJJJJaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-
Input sentence: I said Tom is a friend.
Decoded sentence: JJJJJaaaaaaaaaaaaaaaaaaaa a a a  a a a a a  a a a  
-
Input sentence: I never saw you.
Decoded sentence: JJJJJJaaaaaaaaaaaaaaaaaaaaaa  a a a a a a a aa a a 
-
Input sentence: I feel like playing, too.
Decoded sentence: JJJJJJJaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    
-
Input sentence: Tom drives.
Decoded sentence: ooooooooooooooooooo o o o           o o     aa a   
-
Input sentence: Tom's lucky.
Decoded sentence: oooooooooooooooo o o  o       o  o o o      a  a   
-
Input sentence: How about a contest?
Decoded sentence: aaaaaaaaaaaaaaa a a a a a a a a a a  a  a a  a  a  
-
Input sentence: I can help you out.
Decoded sentence: JJJJJaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-
Input sentence: I can't remember the lyrics.
Decoded sentence: JJJJJaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    a  a 
-
Input sentence: Keep da

In [95]:
train_model.fit(
    x=[decoder_input_train, encoder_input_train],
    y=decoder_target_train,
    sample_weight = sample_weights_train,
    epochs=2,
    batch_size = 64,
    validation_data = ([decoder_input_val, encoder_input_val], decoder_target_val, sample_weights_val)
)

Train on 8000 samples, validate on 2000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x22406e6fb70>

In [96]:
train_model.save('keras_models/attention_simple.h5')


In [97]:
for seq_index in range(30):
    # Take one sequence (part of the validation set)
    # for trying out decoding.
    input_seq = encoder_input_val[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_val[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: I'll go in a minute.
Decoded sentence: Jag äar ha me mi mit de de.

-
Input sentence: I said Tom is a friend.
Decoded sentence: Jag hr dom Tom mr hr hr hr.

-
Input sentence: I never saw you.
Decoded sentence: Jag har var va da d.

-
Input sentence: I feel like playing, too.
Decoded sentence: Jag har ka ke de de de kig meg.

-
Input sentence: Tom drives.
Decoded sentence: Tom hrr hor.

-
Input sentence: Tom's lucky.
Decoded sentence: Tom är kar ka.

-
Input sentence: How about a contest?
Decoded sentence: Hur ka da da ka ket?

-
Input sentence: I can help you out.
Decoded sentence: Jag kan han pa du du de de de.

-
Input sentence: I can't remember the lyrics.
Decoded sentence: Jag kan iit mim mit mr mr kr kr.

-
Input sentence: Keep dancing.
Decoded sentence: Hin dennn.

-
Input sentence: I'll bring Tom.
Decoded sentence: Jag är kr hot momomom.

-
Input sentence: Be cool.
Decoded sentence: Van kom.

-
Input sentence: We can both do it.
Decoded sentence: Vi kan kat de 

In [98]:
from keras.callbacks import ModelCheckpoint

In [99]:
ModelCheckpoint("keras_models/attention_simple_best.h5", monitor='val_weighted_loss', verbose=0, save_best_only=True, save_weights_only=False, mode='auto')

<keras.callbacks.ModelCheckpoint at 0x22406e92e80>

In [109]:
train_model.fit(
    x=[decoder_input_train, encoder_input_train],
    y=decoder_target_train,
    sample_weight = sample_weights_train,
    epochs=50,
    batch_size = 64,
    validation_data = ([decoder_input_val, encoder_input_val], decoder_target_val, sample_weights_val),
    callbacks = [ModelCheckpoint("keras_models/attention_simple_best.h5", monitor='val_loss', verbose=0, save_best_only=True, save_weights_only=False, mode='auto')]
)

Train on 8000 samples, validate on 2000 samples
Epoch 1/50
Epoch 2/50


Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x22411482438>

In [110]:
for seq_index in range(30):
    # Take one sequence (part of the validation set)
    # for trying out decoding.
    input_seq = encoder_input_val[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_val[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: I'll go in a minute.
Decoded sentence: Jag ska gå mig mig mig.

-
Input sentence: I said Tom is a friend.
Decoded sentence: Jag sag Tom är en för dig.

-
Input sentence: I never saw you.
Decoded sentence: Jag båg var var dig.

-
Input sentence: I feel like playing, too.
Decoded sentence: Jag känner att att ppat i gå det.

-
Input sentence: Tom drives.
Decoded sentence: Tom brov.

-
Input sentence: Tom's lucky.
Decoded sentence: Tom är kock.

-
Input sentence: How about a contest?
Decoded sentence: Hur skulla kan inten?

-
Input sentence: I can help you out.
Decoded sentence: Jag kan hale dig ut.

-
Input sentence: I can't remember the lyrics.
Decoded sentence: Jag kan inte mig här slar.

-
Input sentence: Keep dancing.
Decoded sentence: Håll da kinn.

-
Input sentence: I'll bring Tom.
Decoded sentence: Jag kommer Tom.

-
Input sentence: Be cool.
Decoded sentence: Var tulligele.

-
Input sentence: We can both do it.
Decoded sentence: Vi kan bå det det det.

-
Input sen

Quite hilarious results.
The model often gets the start of a sentence right and then derails at some point.
The derailing is somewhat expected: during training the decoder always sees the correct previous characters, whereas here it is fed its own, possibly incorrect, predictions from previous timesteps.

The model is still improving with each epoch, but I would rather try a more suitable model than give this one more training time.