<a href="https://colab.research.google.com/github/pasumarthi/DeepLearning/blob/master/MT_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## German to English translation

Get the dataset from this link:


https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/


In [0]:
# !ls
# from google.colab import files
# files.upload()

In [0]:
# !unzip deu-eng.zip
# !ls

In [4]:
from pickle import load
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [0]:
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

In [0]:
# split a loaded document into sentences
def to_pairs(doc):
	lines = doc.strip().split('\n')
	pairs = [line.split('\t') for line in  lines]
	return pairs

In [0]:
# clean a list of lines
def clean_pairs(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for pair in lines:
		clean_pair = list()
		for line in pair:
			# normalize unicode characters
			line = normalize('NFD', line).encode('ascii', 'ignore')
			line = line.decode('UTF-8')
			# tokenize on white space
			line = line.split()
			# convert to lowercase
			line = [word.lower() for word in line]
			# remove punctuation from each token
			line = [word.translate(table) for word in line]
			# remove non-printable chars from each token
			line = [re_print.sub('', w) for w in line]
			# remove tokens with numbers in them
			line = [word for word in line if word.isalpha()]
			# store as string
			clean_pair.append(' '.join(line))
		cleaned.append(clean_pair)
	return array(cleaned)

In [8]:
import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array
 

def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)
 
# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences (use a new name so the clean_pairs() function is not overwritten)
cleaned = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(cleaned, 'english-german.pkl')
print(len(cleaned))
# spot check
for i in range(100):
	print('[%s] => [%s]' % (cleaned[i,0], cleaned[i,1]))

Saved: english-german.pkl
176692
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]
[go on] => [mach weiter]
[hello] => [hallo]
[i ran] => [ich rannte]
[i see] => [ich verstehe]
[i see] => [aha]
[i try] => [ich probiere es]
[i won] => [ich hab gewonnen]
[i won] => [ich habe gewonnen]
[smile] => [lacheln]
[cheers] => [zum wohl]
[freeze] => [keine bewegung]
[freeze] => [stehenbleiben]
[got it] => [kapiert]
[got it] => [verstanden]
[got it] => [einverstanden]
[he ran] => [er rannte]
[he ran] => [er lief]
[hop in] => [mach mit]
[hug me] => [druck mich]
[hug me] => [nimm mich in den arm]
[hug me] => [umarme mich]
[i fell] => [ich fiel]
[i fell] => [ich fiel hin]
[i fell] => [ich sturzte]
[i fell] => [ich bin hingefallen]
[i fell] => [ich bin gesturzt]
[i know] => [ich wei]
[i lied] => [ich habe gelogen]
[i lost] => [ich habe verloren]
[i paid] => [ich hab

The clean data contains 176,692 phrase pairs, and some of the pairs toward the end of the file are very long.

This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary.

Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.

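We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will then take the first 9,000 of those as examples for training and the remaining 1,000 examples to test the fit model. Note that the cell below, as written, does not apply this reduction: it keeps all 176,692 pairs (`n_sentences = len(raw_dataset)`) and splits them into 130,000 training and 46,692 test examples.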
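The 10,000-example reduction and 9,000/1,000 split described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `reduce_and_split` and the fixed `seed` are assumptions made here for reproducibility, not part of the notebook.

```python
import random

def reduce_and_split(pairs, n_sentences=10000, n_train=9000, seed=1):
    """Keep the first n_sentences pairs, shuffle them, then split train/test."""
    subset = list(pairs[:n_sentences])   # the shortest phrases come first in the file
    random.Random(seed).shuffle(subset)  # shuffle a copy; the input is left untouched
    return subset[:n_train], subset[n_train:]

# Hypothetical stand-in data: 12,000 (english, german) pairs
pairs = [("en%d" % i, "de%d" % i) for i in range(12000)]
train, test = reduce_and_split(pairs)
print(len(train), len(test))
```

Shuffling after the reduction keeps the test set drawn from the same short phrases as the training set.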

In [9]:
from pickle import load
from pickle import dump
from numpy.random import rand
from numpy.random import shuffle
 
# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))
 
# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)
 
# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')
print(len(raw_dataset),raw_dataset[0])
# keep every sentence; set n_sentences to e.g. 10000 to reduce the dataset size
n_sentences = len(raw_dataset)
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:130000], dataset[130000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

176692 ['hi' 'hallo']
Saved: english-german-both.pkl
Saved: english-german-train.pkl
Saved: english-german-test.pkl


In [0]:
# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))
 
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

In [0]:

# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

The function named max_length() below will find the length of the longest sequence in a list of phrases.

In [0]:
	
# max sentence length
def max_length(lines):
	return max(len(line.split()) for line in lines)

We can call these functions with the combined dataset to prepare **tokenizers**, **vocabulary sizes**, and **maximum lengths** for both the English and German phrases.

In [13]:

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

English Vocabulary Size: 15436
English Max Length: 47
German Vocabulary Size: 32323
German Max Length: 53


Each input and output sequence must be encoded to integers and padded to the maximum phrase length. This is because we will use a **word embedding** for the **input sequences** and **one hot encode** the **output sequences**. The function below named **encode_sequences()** will perform these operations and return the result.

In [0]:
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
	# integer encode sequences
	X = tokenizer.texts_to_sequences(lines)
	# pad sequences with 0 values
	X = pad_sequences(X, maxlen=length, padding='post')
	return X
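To make the integer encoding and post-padding concrete, here is a plain-Python approximation of what `Tokenizer` and `pad_sequences` do in `encode_sequences()`. The helper names `build_word_index` and `encode_and_pad` are hypothetical. The sketch skips the lowercasing and punctuation filtering that `Tokenizer` applies by default, and it assumes no line is longer than `length`, as in this notebook where `length` is the maximum.

```python
from collections import Counter

def build_word_index(lines):
    """Rank words by descending frequency; index 0 is reserved for padding."""
    counts = Counter(word for line in lines for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(word_index, length, lines):
    """Map words to integers and append zeros up to the given length."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.split() if w in word_index]
        ids += [0] * (length - len(ids))  # post-padding with 0
        encoded.append(ids)
    return encoded

word_index = build_word_index(["ich bin hier", "ich bin", "ich"])
X = encode_and_pad(word_index, 3, ["ich bin", "hier ich bin"])
print(word_index)
print(X)
```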

The **output sequence** needs to be **one-hot encoded**. This is because the model will predict the probability of each word in the vocabulary as output.

The function **encode_output()** below will one-hot encode English output sequences.

In [0]:
# one hot encode target sequence
def encode_output(sequences, vocab_size):
	ylist = list()
	for sequence in sequences:
		encoded = to_categorical(sequence, num_classes=vocab_size)
		ylist.append(encoded)
	y = array(ylist)
	y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
	return y
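The shape change performed by `encode_output()` can be checked on a toy example. The sketch below uses `np.eye` indexing as a NumPy stand-in for `to_categorical`; it is an equivalent illustration under that assumption, not the Keras call itself.

```python
import numpy as np

def one_hot(sequences, vocab_size):
    """(n_samples, length) integer array -> (n_samples, length, vocab_size) one-hot array."""
    sequences = np.asarray(sequences)
    return np.eye(vocab_size, dtype="float32")[sequences]

# Two padded sequences of length 3 over a vocabulary of 4 ids (0 is padding)
y = one_hot([[1, 2, 0], [3, 0, 0]], vocab_size=4)
print(y.shape)
print(y[0, 0])
```

Note that the padding id 0 is also one-hot encoded, as class 0, so every time step has exactly one active entry.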

In [0]:
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)
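It is worth estimating how large the one-hot targets become with the sizes printed earlier (English vocabulary 15436, maximum length 47) and the 130,000 training pairs used in the split. The sketch below is simple arithmetic, assuming 4-byte float32 values; the helper name `one_hot_bytes` is hypothetical.

```python
def one_hot_bytes(n_samples, seq_length, vocab_size, bytes_per_value=4):
    """Size in bytes of a dense (n_samples, seq_length, vocab_size) array."""
    return n_samples * seq_length * vocab_size * bytes_per_value

train_bytes = one_hot_bytes(130000, 47, 15436)
print("%.1f GB" % (train_bytes / 1e9))  # prints 377.3 GB
```

A dense array of that size will not fit in memory on typical hardware, which is one practical reason for the 10,000-example reduction. Another common option, not used in this notebook, is the `sparse_categorical_crossentropy` loss, which accepts integer targets directly.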

The function **define_model()** below defines the model and takes a number of arguments used to configure the model, such as the size of the input and output vocabularies, the maximum length of input and output phrases, and the number of memory units used to configure the model.

The model configuration was not optimized for this problem, meaning that there is plenty of opportunity for you to tune it and lift the skill of the translations. I would love to see what you can come up with.



The **RepeatVector** is used as an adapter to fit the fixed-sized 2D output of the encoder to the differing length and 3D input expected by the decoder. The **TimeDistributed** wrapper allows the same output layer to be reused for each element in the output sequence.

The RepeatVector layer can be used like an adapter to fit the encoder and decoder parts of the network together. We can configure the RepeatVector to repeat the fixed length vector one time for each time step in the output sequence.


https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/
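The shape change that RepeatVector performs can be imitated in NumPy. This is only a sketch of the effect on array shapes, with a hypothetical helper `repeat_vector`; it is not the Keras layer.

```python
import numpy as np

def repeat_vector(encoded, timesteps):
    """(batch, features) -> (batch, timesteps, features) by copying each vector."""
    return np.repeat(encoded[:, np.newaxis, :], timesteps, axis=1)

# Toy encoder output: a batch of 2 fixed-length vectors with 3 features each
encoded = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6]])
decoder_input = repeat_vector(encoded, timesteps=5)
print(decoder_input.shape)
```

Every time step of the decoder therefore sees the same encoded vector; only the decoder's own recurrent state differs from step to step.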

In [59]:
	
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
	model = Sequential()
	model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
	model.add(LSTM(n_units))
	model.add(RepeatVector(tar_timesteps))
	model.add(LSTM(n_units, return_sequences=True))
	model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
	return model
 
# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())
#plot_model(model, to_file='model.png', show_shapes=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 10, 256)           936192    
_________________________________________________________________
lstm_7 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_4 (RepeatVecto (None, 5, 256)            0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 5, 256)            525312    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 5, 2309)           593413    
Total params: 2,580,229
Trainable params: 2,580,229
Non-trainable params: 0
_________________________________________________________________
None
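The parameter counts in the summary above can be reproduced by hand. In the sketch below, the source vocabulary size of 3657 is inferred from the embedding's parameter count (936192 / 256) rather than printed anywhere in the notebook, and the helper names are hypothetical. The shapes in this summary (lengths 10 and 5, target vocabulary 2309) differ from the sizes printed earlier (53, 47, 15436), so the summary appears to come from a run on a smaller subset.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

n_units = 256
counts = [
    embedding_params(3657, n_units),  # embedding
    lstm_params(n_units, n_units),    # encoder LSTM
    0,                                # RepeatVector has no weights
    lstm_params(n_units, n_units),    # decoder LSTM
    dense_params(n_units, 2309),      # TimeDistributed(Dense)
]
print(counts, sum(counts))
```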


In [60]:
	
# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Train on 10000 samples, validate on 0 samples
Epoch 1/30
 - 14s - loss: 4.2473
Epoch 2/30




 - 11s - loss: 3.3835
Epoch 3/30
 - 11s - loss: 3.2393
Epoch 4/30
 - 11s - loss: 3.0945
Epoch 5/30
 - 12s - loss: 2.9359
Epoch 6/30
 - 11s - loss: 2.7455
Epoch 7/30
 - 11s - loss: 2.5719
Epoch 8/30
 - 11s - loss: 2.3994
Epoch 9/30
 - 11s - loss: 2.2306
Epoch 10/30
 - 11s - loss: 2.0790
Epoch 11/30
 - 11s - loss: 1.9351
Epoch 12/30
 - 11s - loss: 1.7974
Epoch 13/30
 - 11s - loss: 1.6741
Epoch 14/30
 - 11s - loss: 1.5561
Epoch 15/30
 - 11s - loss: 1.4426
Epoch 16/30
 - 11s - loss: 1.3360
Epoch 17/30
 - 11s - loss: 1.2309
Epoch 18/30
 - 11s - loss: 1.1348
Epoch 19/30
 - 11s - loss: 1.0407
Epoch 20/30
 - 11s - loss: 0.9553
Epoch 21/30
 - 11s - loss: 0.8703
Epoch 22/30
 - 11s - loss: 0.7959
Epoch 23/30
 - 11s - loss: 0.7268
Epoch 24/30
 - 11s - loss: 0.6628
Epoch 25/30
 - 11s - loss: 0.6034
Epoch 26/30
 - 11s - loss: 0.5503
Epoch 27/30
 - 11s - loss: 0.5003
Epoch 28/30
 - 11s - loss: 0.4553
Epoch 29/30
 - 11s - loss: 0.4174
Epoch 30/30
 - 11s - loss: 0.3802


<keras.callbacks.History at 0x7fa34b534f60>

In [28]:
!ls
from pickle import load
from numpy import array
from numpy import argmax
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

_about.txt   english-german-both.pkl  english-german-train.pkl
deu-eng.zip  english-german.pkl       model.h5
deu.txt      english-german-test.pkl  sample_data


In [0]:

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])

In [0]:
# load model
model = load_model('model.h5')

Evaluation involves two steps: first generating a translated output sequence, and then repeating this process for many input examples and summarizing the skill of the model across multiple cases.

Starting with inference, the model can predict the entire output sequence in a one-shot manner.

In [0]:
#translation = model.predict(source, verbose=0)

The prediction is a sequence of probability vectors, one per time step; taking the argmax of each gives a sequence of integers that we can enumerate and look up in the tokenizer to map back to words.

The function below, named **word_for_id()**, will perform this reverse mapping.

In [0]:
	
# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None
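`word_for_id()` scans the whole vocabulary on every call. A possible alternative, sketched here with a hypothetical toy `word_index`, is to invert the dictionary once and then do constant-time lookups. The names `build_index_word` and `word_for_id_fast` are illustrative and not part of the notebook.

```python
def build_index_word(word_index):
    """Invert a word -> id mapping into an id -> word mapping."""
    return {index: word for word, index in word_index.items()}

def word_for_id_fast(integer, index_word):
    """Return the word for an id, or None for unknown ids such as padding (0)."""
    return index_word.get(integer)

word_index = {"ich": 1, "bin": 2, "hier": 3}  # toy stand-in for tokenizer.word_index
index_word = build_index_word(word_index)
print(word_for_id_fast(2, index_word), word_for_id_fast(0, index_word))
```

Returning None for id 0 preserves the behaviour that `predict_sequence()` relies on to stop at the first padding position.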

We can perform this mapping for each integer in the translation and return the result as a string of words.

The function **predict_sequence()** below performs this operation for a single encoded source phrase.

In [0]:
# generate target given source sequence
def predict_sequence(model, tokenizer, source):
	prediction = model.predict(source, verbose=0)[0]
	integers = [argmax(vector) for vector in prediction]
	target = list()
	for i in integers:
		word = word_for_id(i, tokenizer)
		if word is None:
			break
		target.append(word)
	return ' '.join(target)

Next, we can repeat this for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.

We can print some of these comparisons to screen to get an idea of how the model performs in practice.

We will also calculate the **BLEU scores** to get a quantitative idea of how well the model has performed.

The **evaluate_model()** function below implements this, calling the above **predict_sequence()** function for each phrase in a provided dataset.

In [0]:
# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
	actual, predicted = list(), list()
	for i, source in enumerate(sources):
		# translate encoded source text
		source = source.reshape((1, source.shape[0]))
		# use the tokenizer passed in rather than the global eng_tokenizer
		translation = predict_sequence(model, tokenizer, source)
		raw_target, raw_src = raw_dataset[i]
		if i < 10:
			print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
		# corpus_bleu expects a list of reference translations for each sentence
		actual.append([raw_target.split()])
		predicted.append(translation.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
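To see what the BLEU-1 number measures, here is a minimal pure-Python sketch of corpus-level BLEU-1: clipped unigram precision multiplied by a brevity penalty. It assumes exactly one reference per sentence and applies no smoothing. The function name `corpus_bleu1` is hypothetical, and the sketch is meant to illustrate the idea rather than replace nltk's `corpus_bleu`.

```python
import math
from collections import Counter

def corpus_bleu1(references, hypotheses):
    """references and hypotheses are parallel lists of token lists."""
    clipped = hyp_len = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # a predicted word only counts as often as it occurs in the reference
        clipped += sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
        hyp_len += len(hyp)
        ref_len += len(ref)
    if hyp_len == 0:
        return 0.0
    precision = clipped / hyp_len
    # brevity penalty: penalise output that is shorter than the reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * precision

exact = corpus_bleu1([["i", "am", "here"]], [["i", "am", "here"]])
partial = corpus_bleu1([["i", "am", "here"]], [["i", "am", "there"]])
short = corpus_bleu1([["i", "am", "here"]], [["i"]])
print(exact, partial, short)
```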

## Machine Translation Encoder-Decoder

Resources used:

https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/


https://towardsdatascience.com/neural-machine-translation-using-seq2seq-with-keras-c23540453c74

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Dataset:
http://www.manythings.org/anki/

Source code:
https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py

In [1]:

from google.colab import files
files.upload()
!ls

Saving fra.txt to fra.txt
fra.txt  sample_data


In [16]:
!unzip fra-eng.zip
!ls

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 
_about.txt  fra-eng.zip  fra.txt  sample_data


In [2]:
from __future__ import print_function

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

Using TensorFlow backend.


In [0]:
batch_size = 64  # Batch size for training.
epochs = 10  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
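num_samples = 2  # Number of samples to train on (2 is a debugging value; the shapes and training log in this notebook come from a 10000-sample run).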
# Path to the data txt file on disk.
data_path = 'fra.txt'

- **Input Sequences:** Padded to a maximum length of 16 characters with a vocabulary of 71 different characters, giving shape (10000, 16, 71).
- **Output Sequences:** Padded to a maximum length of 59 characters with a vocabulary of 93 different characters, giving shape (10000, 59, 93).

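**Is padding done only for word- or sentence-level models, and not for character-based models?** No, character-based sequences are padded too. In this notebook the padding is implicit: the one-hot arrays are allocated with `np.zeros` at the maximum sequence length, so the positions after the end of a shorter sequence simply stay all-zero.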
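To see how padding arises in a character-level encoding, here is a minimal NumPy sketch on two toy inputs. The variable names are chosen for this illustration only.

```python
import numpy as np

texts = ["Go.", "Run!"]
chars = sorted(set("".join(texts)))               # character vocabulary
char_index = {c: i for i, c in enumerate(chars)}
max_len = max(len(t) for t in texts)

# Allocate zeros at the maximum length; shorter texts keep all-zero trailing rows
data = np.zeros((len(texts), max_len, len(chars)), dtype="float32")
for i, text in enumerate(texts):
    for t, c in enumerate(text):
        data[i, t, char_index[c]] = 1.0

print(data.shape)
print(data[0, 3])  # fourth position of "Go." is padding
```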





In [48]:
# If you need to loop over multiple lists at the same time, use zip.
# If you only need to loop over a single list, just use a for-in loop.
# If you need to loop over a list and you need item indexes, use enumerate.
presidents = ["Washington", "Adams", "Jefferson", "Madison", "Monroe", "Adams", "Jackson"]
colors = ["red", "green", "blue", "purple"]
ratios = [0.2, 0.3, 0.1, 0.4]
for num, name in enumerate(presidents, start=1):
    # print("President {}: {}".format(num, name))
    for color, ratio in zip(colors, ratios):
        print("{}% {}".format(ratio * 100, color))

20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple
20.0% red
30.0% green
10.0% blue
40.0% purple


In [46]:
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n'); #print(len(lines),lines[10000],lines[10000][:4]);# print("...",lines[1:100])
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text);print ("input_text", input_text)
    target_texts.append(target_text)
    
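    # Collect the distinct characters on each side; the sorted sets later
    # define the one-hot vocabulary (token index) for encoder and decoder.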
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
    print("input_characters",input_characters)
input_characters = sorted(list(input_characters)); # print("\n input_characters after sorting",input_characters);#print(len(target_texts));
    
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens in the corpus:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

input_text ﻿Go.
input_characters {'.', 'o', '\ufeff', 'G'}
input_text Run!
input_characters {'R', '.', '\ufeff', '!', 'o', 'n', 'u', 'G'}

 input_characters after sorting ['!', '.', 'G', 'R', 'n', 'o', 'u', '\ufeff']
Number of samples: 2
Number of unique input tokens in the corpus: 8
Number of unique output tokens: 12
Max sequence length for inputs: 4
Max sequence length for outputs: 9


In [0]:
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
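The one-timestep offset between decoder input and decoder target, often called teacher forcing, is easier to see on characters than on one-hot arrays. A minimal sketch with a hypothetical helper `teacher_forcing_pair`:

```python
def teacher_forcing_pair(target_text):
    """Return (decoder_input, decoder_target) character lists for one sentence."""
    wrapped = "\t" + target_text + "\n"  # tab = start marker, newline = end marker
    decoder_input = list(wrapped)        # what the decoder is fed
    decoder_target = list(wrapped[1:])   # what it must predict: the next character
    return decoder_input, decoder_target

decoder_input, decoder_target = teacher_forcing_pair("Va")
for t, expected in enumerate(decoder_target):
    print(repr(decoder_input[t]), "->", repr(expected))
```

At every position the target is the character that follows the input character, so the target never contains the start marker.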

The encoder and decoder LSTM layers must have the same number of cells, in this case 256, because the encoder's final states are used as the decoder's initial state.

A **Dense output layer** is used to predict each character. This Dense is used to produce each character in the output sequence in a one-shot manner, rather than recursively, at least during training. This is because the entire target sequence required for input to the model is known during training.

The Dense does not need to be wrapped in a **TimeDistributed** layer.


Each sample fed to an LSTM is 2D: (number of time steps, number of features). With the batch dimension, the full input is 3D.

- Number of time steps: the length of each sentence, i.e. the number of words in that sentence (assuming each word is converted to a vector) or the number of characters (assuming each character is converted to a vector).
- Number of features: the size of the vector at each time step; with one-hot encoding this is the number of distinct words, or distinct characters, in the corpus.

When the number of time steps is None, the LSTM dynamically unrolls over the time steps until it reaches the end of the sequence. This is typical of neural machine translation architectures built on encoder-decoder networks. Here the number of time steps (the number of characters per sentence) is variable, which is why None is passed to Input. Alternatively, the length can be fixed by padding shorter sequences and truncating longer ones to a set maximum.

In [0]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)



Note that the encoder LSTM does not directly pass its outputs as inputs to the decoder LSTM; as noted above, the decoder uses the final hidden and cell states as the initial state for the decoder.

Also note that the decoder LSTM only passes the sequence of hidden states to the Dense for output, not the final hidden and cell states as suggested by the output shape information.

The model is defined with inputs for the encoder and the decoder and the output target sequence.

In [21]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)


Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff14a32f2e8>

In [22]:
!ls

_about.txt  fra-eng.zip  fra.txt  sample_data


In [0]:
# Save model
#model.save('s2s.h5')


The encoder model is defined as taking the input layer from the encoder in the trained model (encoder_inputs) and outputting the hidden and cell state tensors (encoder_states).

The decoder is more elaborate.

The decoder requires the hidden and cell states produced by the newly defined encoder model as its initial state. Because the decoder is a separate standalone model, these states will be provided as input to the model, and therefore must first be defined as inputs.

The encoder is called once per source sequence; the decoder is then called recursively, once for each character that is to be generated in the translated sequence.

On the first call, the hidden and cell states from the encoder will be used to initialize the decoder LSTM layer, provided as input to the model directly.

On subsequent recursive calls to the decoder, the last hidden and cell state must be provided to the model. These state values are already within the decoder; nevertheless, we must re-initialize the state on each call given the way that the model was defined in order to take the final states from the encoder on the first call.

Therefore, the decoder must output the hidden and cell states along with the predicted character on each call, so that these states can be assigned to a variable and used on each subsequent recursive call for a given input sequence of English text to be translated.
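The state-threading described above can be sketched without Keras. In the toy code below, `toy_encode` and `toy_step` are hypothetical stand-ins for the encoder and decoder models: the "state" is simply the list of characters still to be emitted. Only the control flow is meant to mirror the real inference loop.

```python
def greedy_decode(encode, step, source, start_token, stop_token, max_len):
    """Run the encoder once, then call the decoder step by step, passing state along."""
    state = encode(source)  # encoder runs once per source sequence
    token = start_token
    output = []
    while len(output) < max_len:
        token, state = step(token, state)  # one decoder step returns token and new state
        if token == stop_token:
            break
        output.append(token)
    return output

def toy_encode(source):
    # stand-in "translation": upper-case the input, one character per step
    return list(source.upper())

def toy_step(token, state):
    if not state:
        return "\n", state     # nothing left to emit: produce the stop token
    return state[0], state[1:]

print(greedy_decode(toy_encode, toy_step, "go", "\t", "\n", max_len=10))
```

The `max_len` guard plays the same role as the maximum decoder sequence length in the real loop: it stops decoding if the stop token is never produced.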

In [0]:


# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))

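# Placeholder inputs for the decoder's hidden (h) and cell (c) states.
# At inference they are fed the encoder states first, then the decoder's own states.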
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

# decoder_inputs is wrapped in a list so it can be concatenated with the
# list of state inputs, giving Model one flat list of three inputs.
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)



In [0]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

We store the **encoder states** in the **states_value** variable. On the first pass through the while loop, these **hidden** and **cell states** from the encoder initialize the **decoder_model**, which receives them directly as inputs. Once the next character has been predicted from the **softmax** output, it is fed back in (as a **one-hot** vector in the 3D **target_seq** array) together with the updated **states_value** (taken from the states the decoder just returned) on the next iteration. Note that **target_seq** is reset to zeros on every iteration before the predicted character is one-hot encoded into it.

In [24]:

# initializing multidimensional arrays with zeros
import numpy as np
x = np.zeros((2, 1))
print("two dimension\n", x)
y = np.zeros(2)
print("one dimension\n", y)
z = np.zeros((2, 2, 2))
print("three dimension\n", z)
z1 = np.zeros((1, 1, 1))
print("three dimension\n", z1)


two dimension
 [[0.]
 [0.]]
one dimension
 [0. 0.]
three dimension
 [[[0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]]]
three dimension
 [[[0.]]]


In [63]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


for seq_index in range(5):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

[[[1.7661183e-06 8.5462496e-05 5.5760605e-04 1.7993744e-04 7.5886719e-06
   6.3187808e-06 3.7175896e-06 2.8410295e-04 5.5384680e-06 6.6276862e-06
   1.7076930e-04 1.0950430e-04 1.4295045e-04 1.8290380e-05 1.2511795e-05
   3.9827582e-06 5.2203995e-06 4.3284163e-06 2.1757143e-05 2.4412433e-05
   2.6280151e-04 2.1191858e-01 2.5068451e-02 9.1560304e-02 5.4678530e-02
   1.2930517e-02 2.8359955e-02 9.1071157e-03 2.3356697e-03 3.7199217e-03
   1.7179420e-02 3.7164715e-05 3.8090009e-02 2.0004679e-02 7.4061430e-03
   9.9281119e-03 8.8625774e-02 4.8017350e-04 1.3581531e-01 9.4182484e-02
   3.7003022e-02 1.0922400e-03 8.2208045e-02 3.3497115e-04 2.0751697e-03
   4.2183802e-04 3.8800397e-04 4.2062902e-04 3.7938857e-04 3.7813478e-04
   3.5675571e-04 5.3824857e-04 2.4890344e-04 3.2498129e-04 3.2254080e-05
   1.5862758e-04 2.1985422e-04 2.8960654e-04 4.4605846e-04 9.4237650e-04
   2.5779568e-04 3.8922386e-04 3.5987058e-04 1.5921548e-03 5.3095823e-04
   8.7872118e-04 2.3802829e-06 1.6180854e-04 6.6368

In [45]:
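# Decode the same sentence that is printed as the input
# (requires num_samples to be greater than 500).
seq_index = 500
input_seq = encoder_input_data[seq_index: seq_index + 1]
decoded_sentence = decode_sequence(input_seq)
print('-')
print('Input sentence:', input_texts[seq_index])
print('Decoded sentence:', decoded_sentence)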

-
Input sentence: Forget it!
Decoded sentence: Arrête de le chante.

