# Artificial Intelligence at XelpMoc
## Machine Translation Project (Transliteration between English and Bengali)

## Introduction
In this notebook, we will build a deep neural network that functions as part of an end-to-end machine translation pipeline. The completed pipeline will take Bengali words as input and return their English transliterations.

- **Preprocess** - Convert text to sequences of integers.
- **Models** - Create a model that accepts a sequence of integers as input and returns a probability distribution over possible transliterations. Building on the basic types of neural networks that are often used for machine translation, we design our own model.
- **Prediction** - Run the model on Bengali text.

In [1]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [2]:
import collections

import helper
import numpy as np
import project_tests as tests
import random

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


### Verify access to the GPU
The following test applies only if we expect to be using a GPU.

In [3]:
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

Default GPU Device: /device:GPU:0


In [4]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7237073927242562421
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 16762086471525907002
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 8785088582223683161
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14796708250
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16055110666083531576
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]


## Dataset
We begin by investigating the dataset that will be used to train and evaluate the pipeline. Large general-purpose translation corpora take a long time to train a neural network on, so we'll be using a dataset of English-Bengali word pairs that we created for this project. We'll be able to train the model in a reasonable time with this dataset.

### Load Data
The data is located in `data/neural_english.txt` and `data/neural_bengali.txt`. The `neural_english` file contains English words, and the `neural_bengali` file contains their Bengali transliterations. We load the English and Bengali data from these files by running the cell below.

In [5]:
# Load English data
english_words = helper.load_data('data/neural_english.txt')
# Load Bengali data
bengali_words = helper.load_data('data/neural_bengali.txt')

print('Dataset Loaded')

Dataset Loaded
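
The `helper.load_data` function is not shown in this notebook. As a rough illustration of what such a loader might do, here is a minimal sketch that assumes one word per line in a UTF-8 text file. The name `load_lines` and the temporary sample file are hypothetical and exist only for this example.

```python
import os
import tempfile


def load_lines(path):
    """Hypothetical stand-in for helper.load_data: return one string per line."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


# Write a tiny sample file and read it back (illustration only).
sample_lines = ['R O Y', 'S O M A']
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_english.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sample_lines) + '\n')
    loaded = load_lines(sample_path)

print(loaded)
```

Opening the file with an explicit `encoding='utf-8'` matters for the Bengali file, since the platform default encoding may not handle Bengali script.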


### Files
Each line in `neural_english.txt` contains an English word, and the same line of `neural_bengali.txt` contains its Bengali transliteration. View the first five lines from each file.


In [6]:
for sample_i in range(5):
    print('English sample {}:  {}'.format(sample_i + 1, english_words[sample_i]))
    print('Bengali sample {}:  {}\n'.format(sample_i + 1, bengali_words[sample_i]))

English sample 1:  R A M A N A T H
Bengali sample 1:  র া ম া ন া থ

English sample 2:  R O Y
Bengali sample 2:  র া য়

English sample 3:  S U S I L A
Bengali sample 3:  স ু শ ি ল া

English sample 4:  R O Y
Bengali sample 4:  র া য়

English sample 5:  S A K U N T A L A
Bengali sample 5:  শ ক ু ন ্ ত ল া



From the samples, we can see that the words have already been preprocessed: the English text has been converted to uppercase, and the characters of each word are separated by spaces. This should save us some time, but the text requires more preprocessing.

### Vocabulary
The complexity of the problem is determined by the complexity of the vocabulary. A more complex vocabulary is a more complex problem. Let's start by looking at the size of the dataset we'll be working with.

In [7]:
english_words_counter = len(english_words)
bengali_words_counter = len(bengali_words)

print('{} English words.'.format(english_words_counter))
print('{} Bengali words.'.format(bengali_words_counter))

2575637 English words.
2575637 Bengali words.
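
The cell above counts word pairs. To look at the vocabulary itself, one option is to count the distinct characters. The sketch below does this on three of the sample words printed earlier; `sample_words` is a stand-in for `english_words`, and the counts apply only to this tiny sample.

```python
from collections import Counter

# Three of the English samples shown earlier; characters are space-separated.
sample_words = ['R A M A N A T H', 'R O Y', 'S U S I L A']

# Count every character across all words.
char_counts = Counter(ch for word in sample_words for ch in word.split())

print('{} characters in total.'.format(sum(char_counts.values())))
print('{} unique characters.'.format(len(char_counts)))
print('Most common:', char_counts.most_common(3))
```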


## Preprocess
For this project, we won't use text data as input to the model. Instead, we'll convert the text into sequences of integers using the following preprocessing steps:
1. Tokenize the characters of each word into ids.
2. Add padding to make all the sequences the same length.

Time to start preprocessing the data...

### Tokenize
For a neural network to make predictions on text data, the text first has to be turned into data the network can work with. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be numeric.

We can turn each character into a number or each word into a number. These are called character and word ids, respectively. Character ids are used for character-level models that generate a prediction for each character. A word-level model uses word ids and generates a prediction for each word. Transliteration maps characters to characters, and each word in our data is already stored as space-separated characters, so the tokenizer assigns one id per character and the model works at the character level.

Turn each word into a sequence of character ids using Keras's [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer) class. We use this class to tokenize `english_words` and `bengali_words` through the `tokenize` function in the cell below.

Running the cell will run `tokenize` on sample data and show output for debugging.

In [8]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
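    # Fit a tokenizer on x; by default Keras lowercases the text and splits it on spaces
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    # Return the id sequences together with the fitted tokenizer
    return tokenizer.texts_to_sequences(x), tokenizer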

tests.test_tokenize(tokenize)
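
To make the id assignment concrete, here is a pure-Python sketch that roughly mirrors what the tokenizer does with space-separated characters: it lowercases the text, gives the most frequent character the id 1, and leaves 0 free for padding. The helper names `build_char_index` and `to_ids` are hypothetical, and the sketch leaves out other tokenizer features such as punctuation filtering.

```python
from collections import Counter


def build_char_index(words):
    """Map each character to an id, most frequent first, starting at 1."""
    counts = Counter()
    for word in words:
        counts.update(word.lower().split())
    # sorted() is stable, so ties keep their order of first appearance.
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return {ch: i for i, (ch, _) in enumerate(ordered, start=1)}


def to_ids(words, char_index):
    """Replace every character of every word with its id."""
    return [[char_index[ch] for ch in word.lower().split()] for word in words]


sample = ['R O Y', 'R A M A N A T H']
char_index = build_char_index(sample)
sample_ids = to_ids(sample, char_index)

print(char_index)
print(sample_ids)
```

Because the text is lowercased before ids are assigned, decoded predictions come back in lowercase even though the training data is uppercase.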

### Padding
When batching sequences of ids together, each sequence needs to be the same length. Since words vary in length, we add padding to the end of the sequences to make them the same length.

Make sure all the English sequences have the same length and all the Bengali sequences have the same length by adding padding to the **end** of each sequence using Keras's [`pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences) function.

In [9]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # padding='post' appends zeros to the end of each sequence
    return pad_sequences(x, maxlen=length, padding='post')

tests.test_pad(pad)
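
As a quick illustration of post-padding, the sketch below pads plain Python lists with zeros at the end. The name `pad_post` is hypothetical, and the sketch assumes no sequence is longer than the target length, so it does not reproduce the truncation behavior of `pad_sequences`.

```python
def pad_post(sequences, length=None, value=0):
    """Append `value` to each sequence until it reaches `length`."""
    if length is None:
        # Default to the longest sequence, as the `pad` docstring describes.
        length = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (length - len(seq)) for seq in sequences]


id_sequences = [[2, 3, 4], [2, 1, 5, 1, 6, 1, 7, 8]]
padded = pad_post(id_sequences)

for row in padded:
    print(row)
```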

### Preprocess Pipeline
Our focus for this project is on building the neural network architecture, so we keep the preprocessing pipeline simple: the `preprocess` function below tokenizes and pads both the input and the labels.

In [10]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_bengali_words, preproc_english_words, bengali_tokenizer, english_tokenizer =\
    preprocess(bengali_words, english_words)
    
max_english_sequence_length = preproc_english_words.shape[1]
max_bengali_sequence_length = preproc_bengali_words.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
bengali_vocab_size = len(bengali_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max Bengali sentence length:", max_bengali_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("Bengali vocabulary size:", bengali_vocab_size)

Data Preprocessed
Max English sentence length: 22
Max Bengali sentence length: 25
English vocabulary size: 27
Bengali vocabulary size: 79
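
The `reshape` call in `preprocess` adds a trailing axis of size 1 to the labels, turning a shape of (samples, timesteps) into (samples, timesteps, 1). The sketch below shows the same idea on nested Python lists; `add_label_axis` and `shape` are hypothetical helpers used only for illustration.

```python
def add_label_axis(padded):
    """Wrap every id in its own list: (samples, steps) -> (samples, steps, 1)."""
    return [[[token_id] for token_id in row] for row in padded]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


padded_labels = [[2, 3, 4, 0], [2, 1, 5, 1]]
labels_3d = add_label_axis(padded_labels)

print(shape(padded_labels), '->', shape(labels_3d))
```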


### Ids Back to Text
The neural network outputs a probability distribution over character ids at each position, which isn't the final form we want. We want the English transliteration. The function `logits_to_text` bridges the gap between the logits from the neural network and the English transliteration. We'll be using this function to better understand the output of the neural network.

In [11]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
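
The decoding step can be traced by hand on a small example. The sketch below uses plain lists in place of a NumPy array: each row holds made-up probabilities for one output position, and the index with the highest value is looked up in a hypothetical `index_to_token` table.

```python
def decode(prob_rows, index_to_token):
    """Pick the most likely id at each position and join the tokens."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return ' '.join(index_to_token[i] for i in ids)


# Hypothetical vocabulary; id 0 is reserved for padding.
index_to_token = {0: '<PAD>', 1: 'a', 2: 'r', 3: 'o', 4: 'y'}

# One row of made-up probabilities per output position.
prob_rows = [
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.80, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.80, 0.05, 0.05, 0.05, 0.05],
]

decoded = decode(prob_rows, index_to_token)
print(decoded)
```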


### Changes the sparse categorical accuracy to prevent error

In [12]:
from keras import backend as K
def custom_sparse_categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.max(y_true, axis=-1),
                          K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                  K.floatx())
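
The metric above compares the label id at each position with the id the model considers most likely. The Keras version returns one 0/1 value per position and leaves the averaging to Keras; the pure-Python sketch below, with the hypothetical name `per_position_accuracy`, folds the averaging in so the result is a single number.

```python
def per_position_accuracy(y_true, y_pred):
    """
    y_true: list of sequences, each position a one-element list [id]
    y_pred: list of sequences, each position a list of probabilities
    """
    matches = []
    for true_seq, pred_seq in zip(y_true, y_pred):
        for (true_id,), probs in zip(true_seq, pred_seq):
            predicted_id = max(range(len(probs)), key=probs.__getitem__)
            matches.append(1.0 if predicted_id == true_id else 0.0)
    return sum(matches) / len(matches)


# One sequence of three positions; the last label is padding (id 0).
y_true = [[[2], [3], [0]]]
y_pred = [[[0.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [1.0, 0.0, 0.0, 0.0]]]

accuracy = per_position_accuracy(y_true, y_pred)
print(accuracy)
```

Padding positions are scored like any other position, so correctly predicted padding raises the reported accuracy.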

### Defines a function that prints predictions

In [13]:
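def get_predictions(model, x, N = 100):
    """
    Print N predictions next to the correct transliteration and the original text
    :param model: Trained Keras model
    :param x: Preprocessed Bengali data
    :param N: Number of samples to print
    """
    # Row numbers are random multiples of 100 between 100 and 2000 (repeats are possible)
    r_nos = [random.randint(1, 20) * 100 for _ in range(N)]

    for i in r_nos:
        print("--------------------------------")
        print("Prediction:")
        # Predict on a batch containing only row i-1 and decode the result
        print(logits_to_text(model.predict(x[i-1:i])[0], english_tokenizer))

        print("\nCorrect Transliteration:")
        print(english_words[i-1:i])

        print("\nOriginal text:")
        print(bengali_words[i-1:i])

        print("--------------------------------")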
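
`get_predictions` draws its rows from multiples of 100 up to 2000, so it only ever looks at 20 distinct rows and may repeat some of them. If a wider spot check is wanted, one possible alternative is to sample distinct row indices from the whole dataset. The helper `sample_row_indices` below is a hypothetical sketch and not part of the notebook.

```python
import random


def sample_row_indices(n_rows, n_samples, seed=None):
    """Return distinct 0-based row indices drawn from the whole dataset."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), min(n_samples, n_rows))


# 2575637 is the number of word pairs reported earlier in the notebook.
indices = sample_row_indices(n_rows=2575637, n_samples=10, seed=0)
print(indices)
```

These indices are 0-based, so a single row would be selected with `x[i:i+1]` rather than `x[i-1:i]`.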

### Hybrid Model
A model that combines an embedding layer, a bidirectional RNN encoder, and a bidirectional RNN decoder.

In [14]:
def model_final(input_shape, output_sequence_length, bengali_vocab_size, english_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param bengali_vocab_size: Number of unique Bengali tokens in the dataset (input vocabulary)
    :param english_vocab_size: Number of unique English tokens in the dataset (output vocabulary)
    :return: Keras model built, but not trained
    """
    
    # Hyperparameters
    learning_rate = 0.003
    
    # Build the layers    
    model = Sequential()
    
    # Embedding
    model.add(Embedding(bengali_vocab_size, 128, input_length=input_shape[1],
                         input_shape=input_shape[1:]))
    # Encoder
    model.add(Bidirectional(GRU(128)))
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(english_vocab_size, activation='softmax')))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=[custom_sparse_categorical_accuracy])
    return model

tests.test_model_final(model_final)

print('Final Model Loaded')

Final Model Loaded




### Model Evaluation and Creation

In [15]:
def get_model(x, y, x_tk, y_tk):
    """
    Build the final model and train it on the preprocessed data
    :param x: Preprocessed Bengali data
    :param y: Preprocessed English data
    :param x_tk: Bengali tokenizer
    :param y_tk: English tokenizer
    :return: Trained Keras model
    """
    # Train neural network using model_final
    model = model_final(x.shape,y.shape[1],
                        len(x_tk.word_index)+1,
                        len(y_tk.word_index)+1)
    model.summary()
    model.fit(x, y, batch_size=1024, epochs=10, validation_split=0.2)
    return model
    

my_model = get_model(preproc_bengali_words, preproc_english_words, bengali_tokenizer, english_tokenizer)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 25, 128)           10240     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               197376    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 22, 256)           0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 22, 256)           295680    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 22, 512)           131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 22, 512)           0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 22, 28)            14364     
Total para
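
The `Param #` column in the summary can be checked by hand. The sketch below recomputes the per-layer counts from the layer sizes in `model_final`. It assumes the classic GRU formulation with one bias vector per gate, which is the variant that reproduces the numbers shown; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights, and a bias."""
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


layer_params = {
    'embedding': embedding_params(79 + 1, 128),     # Bengali vocabulary + padding id
    'encoder': 2 * gru_params(128, 128),            # bidirectional: two GRUs
    'decoder': 2 * gru_params(256, 128),            # input is the 256-wide encoder output
    'dense_hidden': dense_params(256, 512),
    'dense_output': dense_params(512, 27 + 1),      # English vocabulary + padding id
}

for name, count in layer_params.items():
    print('{:<14}{}'.format(name, count))
```

The `+ 1` on both vocabulary sizes matches the `len(word_index)+1` passed in `get_model`, which reserves id 0 for padding.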

In [16]:
get_predictions(my_model, x = preproc_bengali_words, N = 10)

--------------------------------
Prediction:
m u r m u <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Transliteration:
['M U R M U']

Original text:
['ম ু র ্ ম ু']
--------------------------------
--------------------------------
Prediction:
i n d r a n i <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Transliteration:
['I N D R A N I']

Original text:
['ই ন ্ দ ্ র া ন ী']
--------------------------------
--------------------------------
Prediction:
s o m a <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Transliteration:
['S O M A']

Original text:
['স ো ম া']
--------------------------------
--------------------------------
Prediction:
m a l a k a r <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Correct Transliteration:
['M A L A K A R']

Original text:
['ম া

### Saving the trained model

In [17]:
my_model.save('transliteration_model.h5')
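
Because the model was compiled with a custom metric, loading the saved file later will likely require telling Keras how to resolve that function by name. The sketch below shows one way to do this with the `custom_objects` argument of `load_model`; the wrapper `load_transliteration_model` is a hypothetical helper and not part of the notebook.

```python
def load_transliteration_model(path, metric_fn):
    """Load a saved model, mapping the custom metric's name to its function."""
    # Imported inside the function so that defining this helper does not require Keras.
    from keras.models import load_model
    return load_model(path, custom_objects={metric_fn.__name__: metric_fn})
```

With the names from this notebook, the call would be `load_transliteration_model('transliteration_model.h5', custom_sparse_categorical_accuracy)`. If the model is only needed for prediction, passing `compile=False` to `load_model` is another option that skips restoring the loss and metrics.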