<a href="https://colab.research.google.com/github/jorcisai/ARF/blob/master/src/KerasTutorial-TrainingAttentionBSLTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bilingual text classifier using self-trained word embeddings and an Attention + BLSTM layer

## Imports
Importing standard packages and tensorflow_datasets to ease data manipulation.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
  
import tensorflow as tf
import os

import tensorflow_datasets as tfds

Load the data from the local file system. Please use the "traveler" dataset available in the "dat" directory of the GitHub repository:

In [None]:
from google.colab import files

uploaded = files.upload()
# Keep the name of the uploaded file (the last one if several were uploaded)
for fn in uploaded.keys():
  datafn = fn

print("Data file: ",datafn)

## Load text data from local file

Parsing file line by line to extract class label, source and target sentences. All three are lists of strings.

In [None]:
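numsamples=0
labs=[]
src_sents=[]
trg_sents=[]
# Each line has the form: <label> <source words> # <target words>
with open(datafn) as datafile:
  for line in datafile:
    numsamples+=1
    # split() without arguments also strips the trailing newline
    words = line.split()
    labs.append(words[0])
    pos=words.index("#")
    # Source words lie between the label and "#"; target words follow "#"
    src_sents.append(" ".join(words[1:pos]))
    trg_sents.append(" ".join(words[pos+1:]))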
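
As a minimal sketch of the parsing step, the hypothetical helper `parse_line` below splits a single line into label, source and target. The sample line is made up for illustration and assumes that the source sentence is everything between the label and the `#` separator.

```python
def parse_line(line):
    """Split '<label> <source words> # <target words>' into its three parts."""
    words = line.split()          # also drops the trailing newline
    pos = words.index("#")        # position of the source/target separator
    label = words[0]
    source = " ".join(words[1:pos])
    target = " ".join(words[pos + 1:])
    return label, source, target

# Made-up sample line, not taken from the actual dataset
sample = "A buenos días # good morning\n"
label, source, target = parse_line(sample)
print(label, "|", source, "|", target)
```

Splitting with `split()` rather than `split(" ")` keeps the newline from sticking to the last target word.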

## Data preprocessing

Simple conversion from class text label into class integer label:

In [None]:
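# Sort the distinct labels so that the label-to-id mapping is the same on every run
labset = sorted(set(labs))
num_classes=len(labset)

lab2id = {}
for idx,lab in enumerate(labset):
  lab2id[lab]=idx

# Replace each text label by its integer id
for idx,lab in enumerate(labs):
  labs[idx]=lab2id[lab]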
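
The same conversion on a toy example, as a small sketch. The helper name `encode_labels` and the labels are hypothetical; sorting the distinct labels is an assumption made here to keep the mapping reproducible.

```python
def encode_labels(labels):
    """Map each distinct text label to an integer id and encode the list."""
    label_to_id = {lab: idx for idx, lab in enumerate(sorted(set(labels)))}
    return [label_to_id[lab] for lab in labels], label_to_id

# Made-up labels for illustration
encoded, mapping = encode_labels(["B", "A", "C", "A"])
print(encoded)   # [1, 0, 2, 0]
print(mapping)   # {'A': 0, 'B': 1, 'C': 2}
```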

Tokenization, conversion into sequence of integers and padding:

In [None]:
import tensorflow.keras.preprocessing as prepro

def tokenize(sents):
  tokenizer = prepro.text.Tokenizer(filters='')
  tokenizer.fit_on_texts(sents)
  tensors = tokenizer.texts_to_sequences(sents)
  tensors = prepro.sequence.pad_sequences(tensors,padding='post')

  return tensors, tokenizer

src_tensors, src_tokenizer = tokenize(src_sents)
trg_tensors, trg_tokenizer = tokenize(trg_sents)
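
To make this step concrete, here is a framework-free sketch of roughly what `Tokenizer` and `pad_sequences` do with these settings: lowercase the text, split on spaces, number the words from 1 by decreasing frequency, and pad on the right with 0. The helpers `fit_vocab` and `to_padded_ids` are hypothetical names and the sentences are made up; this approximates the Keras behaviour rather than reproducing it.

```python
from collections import Counter

def fit_vocab(sents):
    """Assign ids 1, 2, ... to words, most frequent first."""
    counts = Counter()
    for sent in sents:
        counts.update(sent.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_ids(sents, vocab):
    """Convert sentences to id lists and right-pad them with 0 to equal length."""
    seqs = [[vocab[w] for w in sent.lower().split()] for sent in sents]
    maxlen = max(len(seq) for seq in seqs)
    return [seq + [0] * (maxlen - len(seq)) for seq in seqs]

# Made-up sentences for illustration
sents = ["the room", "the room is small"]
vocab = fit_vocab(sents)
padded = to_padded_ids(sents, vocab)
print(vocab)    # {'the': 1, 'room': 2, 'is': 3, 'small': 4}
print(padded)   # [[1, 2, 0, 0], [1, 2, 3, 4]]
```

Note that id 0 never denotes a word: it is only used for padding.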

Loading class labels, source (Spanish) and target (English) sentences from lists into dataset objects.

In [None]:
lab_dataset = tf.data.Dataset.from_tensor_slices(labs)

In [None]:
src_dataset = tf.data.Dataset.from_tensor_slices(src_tensors)

In [None]:
trg_dataset = tf.data.Dataset.from_tensor_slices(trg_tensors)

Taking a look at the class labels, source sentences and target sentences after conversion into datasets. In this example, both source and target sentences are used to train the model, including its word embeddings.

In [None]:
for lab in lab_dataset.take(5):
  print(lab)

In [None]:
for src in src_dataset.take(5):
  print(src)

In [None]:
for trg in trg_dataset.take(5):
  print(trg)

Print out vocabulary sizes:

In [None]:
print(len(src_tokenizer.word_counts))
print(len(trg_tokenizer.word_counts))

First, zip the source and target datasets into a dataset of (source, target) sentence pairs. Then, zip this (source, target) dataset with the label dataset, so that we have a ((source, target), label) dataset.

In [None]:
dataset = tf.data.Dataset.zip((tf.data.Dataset.zip((src_dataset, trg_dataset)), lab_dataset)) 

In [None]:
for sample in dataset.take(5):
  print(sample)
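
The nesting produced by the two zips can be sketched with plain Python lists. The values below are made-up stand-ins for the padded sequences and labels.

```python
# Made-up stand-ins for padded source/target sequences and integer labels
src_ids = [[1, 2, 0], [3, 4, 5]]
trg_ids = [[7, 8], [9, 0]]
label_ids = [0, 1]

# Inner zip pairs source with target; outer zip attaches the label
samples = list(zip(zip(src_ids, trg_ids), label_ids))
for (src, trg), lab in samples:
    print(src, trg, lab)
```

This ((source, target), label) structure matches what Keras expects for a model with two inputs and one output.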

## Experimental design

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to split dataset into 50% for training, 20% for validation and 30% for test.

Before being passed into the model, the datasets need to be shuffled and batched. So, first, the complete dataset is shuffled with a fixed seed so that we can repeat the same shuffle of the dataset, then the dataset is split into training, validation and test, and each of these subsets is batched.

In [None]:
trainsz = int(numsamples*0.5)
valsz= int(numsamples*0.2)
testsz= int(numsamples*0.3)
batchsz = 100

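# reshuffle_each_iteration=False keeps the same order on every pass, so that
# the take/skip splits below stay disjoint across epochs
dataset = dataset.shuffle(numsamples,seed=13,reshuffle_each_iteration=False)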

train_data = dataset.take(trainsz)
train_data = train_data.batch(batchsz)

val_data = dataset.skip(trainsz).take(valsz)
val_data = val_data.batch(batchsz)

test_data = dataset.skip(trainsz+valsz)
test_data = test_data.batch(batchsz)
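
The split logic can be sketched on plain indices. This is an illustration with a hypothetical helper `split_indices` and a made-up sample count, not the `tf.data` implementation: the data is shuffled once, and the three subsets are consecutive, non-overlapping slices of that one ordering.

```python
import random

def split_indices(num_samples, train_frac=0.5, val_frac=0.2, seed=13):
    """Shuffle indices once, then cut them into train/val/test slices."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)   # one fixed shuffle
    train_size = int(num_samples * train_frac)
    val_size = int(num_samples * val_frac)
    train = indices[:train_size]                       # like take(trainsz)
    val = indices[train_size:train_size + val_size]    # like skip(trainsz).take(valsz)
    test = indices[train_size + val_size:]             # like skip(trainsz+valsz)
    return train, val, test

train, val, test = split_indices(1005)
print(len(train), len(val), len(test))   # 502 201 302
```

The test set takes whatever remains after training and validation, so no sample is lost to rounding.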

Now, `train_data`, `val_data` and `test_data` are not collections of (`(source, target), label`) pairs, but collections of batches. Each batch is a pair of (*set of (source, target) sentences*, *set of labels*) represented as arrays. To illustrate this idea we take one batch from the dataset:

In [None]:
sample_bitext, sample_label = next(iter(test_data))

`sample_bitext` is a tuple of (source sentences, target sentences):

In [None]:
sample_bitext

The source and target sentences at index 10 of the batch (the 11th pair)...

In [None]:
sample_bitext[0][10], sample_bitext[1][10]

... and the corresponding label

In [None]:
sample_label[10]

## Build the model



First, we define the input layer for the source sentence, a variable-length array of integers:

In [None]:
src_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='src_input')


The embedding layer converts integer representations to dense vector embeddings. See the [word embeddings tutorial](../text/word_embeddings.ipynb) for more details.

In [None]:
# Word indices start at 1 and index 0 is reserved for padding, hence the +1
src_vcb_size=len(src_tokenizer.word_counts)+1
src_embed = tf.keras.layers.Embedding(output_dim=16, input_dim=src_vcb_size)(src_input)
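
Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below uses hypothetical helpers and random values in place of trained weights. Since the tokenizer numbers words from 1 and padding uses 0, the table needs one more row than there are distinct words.

```python
import random

def make_embedding_table(num_words, dim, seed=0):
    """One row per token id: row 0 for padding, rows 1..num_words for words."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_words + 1)]

def embed(token_ids, table):
    """Replace each token id by its row of the table."""
    return [table[i] for i in token_ids]

table = make_embedding_table(num_words=4, dim=3)
vectors = embed([1, 2, 4, 0], table)   # id 4 is valid only because of the extra row
print(len(table), len(vectors), len(vectors[0]))   # 5 4 3
```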

Target sentences undergo the same process as source sentences:

In [None]:
trg_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='trg_input')
trg_vcb_size=len(trg_tokenizer.word_counts)+1  # +1 for the padding index 0
trg_embed = tf.keras.layers.Embedding(output_dim=16, input_dim=trg_vcb_size)(trg_input)

An attention layer in which the query is the sequence of source embeddings and the value is the sequence of target embeddings. Since no key is passed, the layer also uses the value tensor as the key.

In [None]:
query_value_attention_seq = tf.keras.layers.Attention()([src_embed, trg_embed])
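
A rough sketch of the dot-product attention that `tf.keras.layers.Attention` computes with its default settings, ignoring batching, masking and dropout. For every source position, the scores against all target positions are turned into weights with a softmax, and the output is the weighted sum of the target vectors. The function names and the toy vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, value, key=None):
    """query: Tq vectors; value/key: Tv vectors. Returns Tq outputs and the weights."""
    if key is None:
        key = value                           # the key defaults to the value
    outputs, all_weights = [], []
    for q in query:
        weights = softmax([dot(q, k) for k in key])   # one weight per target position
        out = [sum(w * v[d] for w, v in zip(weights, value))
               for d in range(len(value[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 source positions
value = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions
outputs, weights = dot_product_attention(query, value)
print(len(outputs), len(outputs[0]))           # 3 2
```

The output has one vector per source position, so the sequence passed on to the next layer has the length of the source sentence.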

The query-value attention sequence is passed to a bidirectional LSTM:

In [None]:
dense_input = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))(query_value_attention_seq)

The output of the bidirectional LSTM is fed into a dense feed-forward network:

In [None]:
for units in [16,16]:
  dense_input = tf.keras.layers.Dense(units, activation='relu')(dense_input)

The output layer produces a probability for each label. The label with the highest probability is the model's prediction for an example.

In [None]:
dense_output = tf.keras.layers.Dense(num_classes, activation='softmax')(dense_input)
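
How a prediction is read off the softmax output can be sketched as follows. The scores are made up and `predict_label` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(probs):
    """Return the index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

probs = softmax([1.0, 3.0, 0.5])   # made-up scores for 3 classes
print(predict_label(probs))        # 1
```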

Finally, the inputs and output of the model are defined.

In [None]:
model = tf.keras.models.Model(inputs=[src_input, trg_input], outputs=dense_output)

Print a summary of the model to see the number of parameters to be learnt:

In [None]:
model.summary()
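
The parameter counts of the layers that do not depend on the vocabulary can be checked by hand. This sketch assumes the standard Keras parameterization: four gates per LSTM, each with input weights, recurrent weights and a bias, and a weight matrix plus bias per dense layer. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

embed_dim, units = 16, 16
blstm = 2 * lstm_params(embed_dim, units)   # forward + backward LSTM
dense1 = dense_params(2 * units, 16)        # the BLSTM concatenates both directions
dense2 = dense_params(16, 16)
print(blstm, dense1, dense2)                # 4224 528 272
```

Each embedding layer adds 16 parameters per row of its table. The attention layer should add none with its default settings.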

Finally, compile the model. For a softmax categorization model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
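
The "sparse" variant of the loss takes integer class ids as targets, which is what the label dataset contains, rather than one-hot vectors. For a single example it is the negative log-probability assigned to the true class, as in this sketch. The helper name `sparse_ce` and the numbers are made up.

```python
import math

def sparse_ce(label, probs):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]              # made-up softmax output for 3 classes
loss_right = sparse_ce(1, probs)     # the true class received probability 0.7
loss_wrong = sparse_ce(0, probs)     # the true class received probability 0.1
print(round(loss_right, 3), round(loss_wrong, 3))   # 0.357 2.303
```

The loss is small when the model puts high probability on the correct label and grows quickly as that probability approaches zero.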

## Train the model

This model running on this data produces decent results (99% accuracy on the validation set).

In [None]:
history = model.fit(train_data, epochs=10, validation_data=val_data)

## Evaluate the model

Compute the accuracy on the test set (almost 99%):

In [None]:
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()
  
plot_graphs(history, 'accuracy')