<a href="https://colab.research.google.com/github/jorcisai/ARF/blob/master/src/KerasTutorial-TrainingBilingualModelSharedLayer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bilingual text classifier using self-trained word embeddings and shared bidirectional LSTMs

## Imports
Importing standard packages and tensorflow_datasets to ease data manipulation.

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
  
import tensorflow as tf
import os

import tensorflow_datasets as tfds

Load the data from the local file system. Please use the "traveler" dataset available in the "dat" directory of the GitHub repository:

In [0]:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
  datafn = fn

print("Data file:", datafn)

## Load text data from local file

Parsing file line by line to extract class label, source and target sentences. All three are lists of strings.

In [0]:
numsamples=0
labs=[]
srcs=[]
trgs=[]
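# The parsing below expects lines of the form: <label> <source words> # <target words>
with open(datafn, encoding="utf-8") as datafile:
  for line in datafile:
    numsamples+=1
    words = line.split(" ")
    labs.append(words[0])   # first token is the class label
    pos=words.index("#")    # "#" separates source from target
    # Note: the slice ends at pos-1, so the token right before "#" is dropped.
    srcs.append(" ".join(words[1:pos-1]))
    trgs.append(" ".join(words[pos+1:]))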
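
To make the parsing step concrete, here is a minimal sketch that applies the same slicing to a single line. The helper name `parse_line` and the sample line are hypothetical; the real "traveler" file may be laid out differently.

```python
def parse_line(line):
    """Split one data line into (label, source, target), mirroring the loop above."""
    words = line.split(" ")
    label = words[0]
    pos = words.index("#")  # position of the source/target separator
    # As in the notebook, the slice stops at pos-1, so the token right
    # before "#" is not included in the source sentence.
    source = " ".join(words[1:pos - 1])
    target = " ".join(words[pos + 1:])
    return label, source, target


# Hypothetical line; the real file contents may differ.
sample_line = "A hola amigo . # hello friend ."
label, source, target = parse_line(sample_line)
print(label, "|", source, "|", target)
```

With this sample, the final "." of the source side is left out while the target keeps its punctuation, which is worth keeping in mind when inspecting the lists.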

Loading class labels, source (Spanish) and target (English) sentences from lists into dataset objects.

In [0]:
labs_dataset = tf.data.Dataset.from_tensor_slices(labs)

In [0]:
srcs_dataset = tf.data.Dataset.from_tensor_slices(srcs)

In [0]:
trgs_dataset = tf.data.Dataset.from_tensor_slices(trgs)

Taking a look at the class labels and the source and target sentences after conversion into datasets. In this example, both source and target sentences are used to train the model, including its word embeddings.

In [0]:
for lab in labs_dataset.take(5):
  print(lab)

In [0]:
for src in srcs_dataset.take(5):
  print(src)

In [0]:
for trg in trgs_dataset.take(5):
  print(trg)

## Data preprocessing

Obtaining the set of class labels to map them into integers and computing the number of classes. This requires extracting the string from each tf.Tensor object and then mapping it to an integer.

In [0]:
label_set = set()
for lab_tensor in labs_dataset:
  label_set.add(lab_tensor.numpy().decode('utf-8'))
  
num_classes=len(label_set)

# Sort the labels so that the label-to-integer assignment is the same on every run.
lab2id = {}
for lab_id,lab in enumerate(sorted(label_set)):
  lab2id[lab]=lab_id

You can check the assignment of class labels to integer labels:

In [0]:
for key,value in lab2id.items():
  print (key,value)
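
An inverse mapping can be handy later for turning predicted integers back into class labels. The following is only a sketch: the label set is hard-coded here, `id2lab` is a hypothetical name, and sorting the labels is an assumption made so that the assignment is reproducible (the iteration order of a set of strings can change between Python runs).

```python
# Hypothetical label set; in the notebook it is collected from labs_dataset.
label_set = {"A", "F", "J", "P"}

# Sorting makes the label -> id assignment reproducible across runs.
lab2id = {lab: lab_id for lab_id, lab in enumerate(sorted(label_set))}

# Inverse mapping, useful for turning predicted ids back into labels.
id2lab = {lab_id: lab for lab, lab_id in lab2id.items()}

print(lab2id)
print(id2lab[2])
```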

Apply `Dataset.map` to each element of the dataset, using a Python function that maps each label to its integer. `Dataset.map` runs in graph mode.

* Graph tensors do not have a value.
* In graph mode you can only use TensorFlow Ops and functions.

So you can't `.map` this function directly: you need to wrap it in a `tf.py_function`. The `tf.py_function` will pass regular tensors (with a value and a `.numpy()` method to access it) to the wrapped Python function.

Converting class labels <b>A</b>, <b>F</b>, <b>J</b> and <b>P</b> into integers for the whole dataset.

In [0]:
def author_labeler(text_lab: tf.Tensor):
  int_lab = lab2id[text_lab.numpy().decode('utf-8')]
  return int_lab

def lab_map_fn(text_lab):
  # py_func doesn't set the shape of the returned tensors.
  int_lab = tf.py_function(func=author_labeler, inp=[text_lab], Tout=tf.int64)
  # tf.data.Datasets need to set the shapes manually
  int_lab.set_shape([])
  return int_lab

labs_encoded_dataset = labs_dataset.map(lab_map_fn)


You can check what a few class labels look like after being mapped:

In [0]:
for lab in labs_encoded_dataset.take(5):
  print(lab)

Build the vocabulary of a set of sentences by tokenizing each sentence and adding the resulting tokens to a set of unique words. For this tutorial:
<ol>
<li> Iterate over each sentence's numpy value.</li>
<li> Use tfds.deprecated.text.Tokenizer to split it into tokens.</li>
<li> Collect these tokens into a Python set, to remove duplicates.</li>
</ol>

In [0]:
tokenizer = tfds.deprecated.text.Tokenizer()

src_vocabulary_set = set()
for text_tensor in srcs_dataset:
  tokens = tokenizer.tokenize(text_tensor.numpy())
  src_vocabulary_set.update(tokens)
  
src_vocab_size=len(src_vocabulary_set)

In [0]:
print(src_vocabulary_set)

In [0]:
print(src_vocab_size)

In [0]:
tokenizer = tfds.deprecated.text.Tokenizer()

trg_vocabulary_set = set()
for text_tensor in trgs_dataset:
  tokens = tokenizer.tokenize(text_tensor.numpy())
  trg_vocabulary_set.update(tokens)
  
trg_vocab_size=len(trg_vocabulary_set)

In [0]:
print(trg_vocabulary_set)

In [0]:
print(trg_vocab_size)
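
The vocabulary-building loop can be sketched without TensorFlow. The tokenizer below is a stand-in and an assumption: it simply keeps runs of word characters, which approximates but does not reproduce the tfds tokenizer. The sentences are hypothetical.

```python
import re


def simple_tokenize(text):
    # Stand-in tokenizer (an assumption): keep runs of word characters
    # (letters, digits, underscore) and drop everything else.
    return re.findall(r"\w+", text)


# Hypothetical sentences; in the notebook they come from the datasets.
sentences = ["where is the station ?", "the station is closed ."]

vocabulary_set = set()
for sentence in sentences:
    vocabulary_set.update(simple_tokenize(sentence))  # duplicates are ignored

print(sorted(vocabulary_set))
print(len(vocabulary_set))
```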

Create an encoder by passing the vocabulary_set to tfds.deprecated.text.TokenTextEncoder. The encoder's encode method takes in a string of text and returns a list of integers.

In [0]:
src_encoder = tfds.deprecated.text.TokenTextEncoder(src_vocabulary_set)

In [0]:
trg_encoder = tfds.deprecated.text.TokenTextEncoder(trg_vocabulary_set)

In [0]:
for src in srcs_dataset.take(5):
  print(src_encoder.encode(src.numpy()))

In [0]:
for trg in trgs_dataset.take(5):
  print(trg_encoder.encode(trg.numpy()))
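
As a rough mental model of what such an encoder does, the sketch below maps tokens to integers with 0 reserved for padding. The vocabulary, the names `token2id` and `oov_id`, and the single out-of-vocabulary id are assumptions for illustration; the actual `TokenTextEncoder` may differ in its details.

```python
# Hypothetical vocabulary, kept in a fixed order for reproducibility.
vocabulary = ["closed", "is", "station", "the", "where"]

# Id 0 is reserved for padding, so real tokens start at 1.
token2id = {token: i + 1 for i, token in enumerate(vocabulary)}
# One extra id for tokens that are not in the vocabulary (assumption).
oov_id = len(vocabulary) + 1


def encode(sentence):
    """Map a whitespace-tokenized sentence to a list of integer ids."""
    return [token2id.get(token, oov_id) for token in sentence.split()]


print(encode("where is the station"))
print(encode("the airport is closed"))
```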

Mapping tokens of source sentences into lists of integers.

In [0]:
def src_encode(text_tensor):
  src_encoded_text = src_encoder.encode(text_tensor.numpy())
  return [src_encoded_text]

def src_encode_map_fn(text_tensor):
  # py_func doesn't set the shape of the returned tensors.
  src_encoded_text = tf.py_function(src_encode, inp=[text_tensor], Tout=tf.int64)
  #tf.data.Datasets need to set the shapes manually 
  src_encoded_text.set_shape([None])
  return src_encoded_text

srcs_encoded_dataset = srcs_dataset.map(src_encode_map_fn)

Mapping tokens of target sentences into list of integers.


In [0]:
def trg_encode(text_tensor):
  trg_encoded_text = trg_encoder.encode(text_tensor.numpy())
  return [trg_encoded_text]

def trg_encode_map_fn(text_tensor):
  # py_func doesn't set the shape of the returned tensors.
  trg_encoded_text = tf.py_function(trg_encode, inp=[text_tensor], Tout=tf.int64)
  #tf.data.Datasets need to set the shapes manually 
  trg_encoded_text.set_shape([None])
  return trg_encoded_text

trgs_encoded_dataset = trgs_dataset.map(trg_encode_map_fn)

Checking the result of applying the mapping to the source and target sentences

In [0]:
for src in srcs_encoded_dataset.take(5):
  print(src)

In [0]:
for trg in trgs_encoded_dataset.take(5):
  print(trg)

First, zipping source and target datasets into a dataset of (source, target) sentences. Then, zipping the (source, target) dataset with the label dataset, so that we have a ((source, target), label) dataset.

In [0]:
dataset = tf.data.Dataset.zip((tf.data.Dataset.zip((srcs_encoded_dataset, trgs_encoded_dataset)), labs_encoded_dataset)) 

In [0]:
for sample in dataset.take(5):
  print(sample)

## Experimental design

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to split dataset into 50% for training, 20% for validation and 30% for test.

Before being passed into the model, the datasets need to be shuffled and batched. So, first, the complete dataset is shuffled with a fixed seed so that we can repeat the same shuffle of the dataset, then the dataset is split into training, validation and test, and each of these subsets is batched.
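
The take/skip arithmetic can be illustrated with plain list slicing. This is only an analogy under stated assumptions: the dataset size is hypothetical and a list of integers stands in for the shuffled dataset.

```python
numsamples = 1003  # hypothetical dataset size
trainsz = int(numsamples * 0.5)
valsz = int(numsamples * 0.2)

samples = list(range(numsamples))  # stand-in for the shuffled dataset

train = samples[:trainsz]                 # like take(trainsz)
val = samples[trainsz:trainsz + valsz]    # like skip(trainsz).take(valsz)
test = samples[trainsz + valsz:]          # like skip(trainsz + valsz)

print(len(train), len(val), len(test))
```

Because the sizes are truncated to integers and the test set receives whatever remains, the test subset can be slightly larger than 30% of the data.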

Typically, the examples inside a batch need to be the same size and shape. However, the examples in these datasets are not all the same size, since each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.

In [0]:
trainsz = int(numsamples*0.5)
valsz= int(numsamples*0.2)
testsz= int(numsamples*0.3)
batchsz = 100

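# reshuffle_each_iteration=False keeps the shuffled order fixed across iterations;
# otherwise take/skip could select different samples each time the dataset is
# iterated, and the training, validation and test subsets could overlap.
dataset = dataset.shuffle(numsamples,seed=13,reshuffle_each_iteration=False)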

train_data = dataset.take(trainsz)
train_data = train_data.padded_batch(batchsz,padded_shapes=(([None],[None]),[]))

val_data = dataset.skip(trainsz).take(valsz)
val_data = val_data.padded_batch(batchsz,padded_shapes=(([None],[None]),[]))

test_data = dataset.skip(trainsz+valsz)
test_data = test_data.padded_batch(batchsz,padded_shapes=(([None],[None]),[]))
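
What padding does to a batch can be sketched in plain Python. The helper `pad_batch` and the encoded sentences are hypothetical; the sketch pads on the right with zeros up to the longest sequence in the batch.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical encoded sentences of different lengths.
batch = [[5, 2, 4, 3], [4, 1], [7, 8, 9]]
padded = pad_batch(batch)

for row in padded:
    print(row)
```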

Now, `train_data`, `val_data` and `test_data` are not collections of (`(source, target), label`) pairs, but collections of batches. Each batch is a pair of (*set of (source, target) sentences*, *set of labels*) represented as arrays. To illustrate this idea we take one batch from the dataset:

In [0]:
sample_bitext, sample_label = next(iter(test_data))

The sample_bitext is a tuple of (source sentences, target sentences) 

In [0]:
sample_bitext

The source and target sentences at index 10 of the batch...

In [0]:
sample_bitext[0][10], sample_bitext[1][10]

... and the corresponding label

In [0]:
sample_label[10]

Since we have introduced a new token encoding (the zero used for padding), the vocabulary size has increased by one.

In [0]:
src_vocab_size += 1
trg_vocab_size += 1

## Build the model



First, we define the input layer for the source sentence as an array of integers.

In [0]:
src_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='src_input')


The embedding layer converts integer representations to dense vector embeddings. See the [word embeddings tutorial](../text/word_embeddings.ipynb) for more details.

In [0]:
src_embed = tf.keras.layers.Embedding(output_dim=16, input_dim=src_vocab_size)(src_input)
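
Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below uses small hypothetical sizes and random numbers in place of trained weights; it illustrates the lookup only and is not how Keras implements the layer.

```python
import random

vocab_size, embed_dim = 6, 4  # small hypothetical sizes (the notebook uses 16 dimensions)
random.seed(0)

# One row of weights per token id; random values stand in for trained weights.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

encoded_sentence = [5, 2, 4, 3, 0, 0]  # token ids, padded with zeros
embedded = [embedding_table[token_id] for token_id in encoded_sentence]

# One vector per token: sentence length x embedding dimension.
print(len(embedded), len(embedded[0]))
```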

The next layer is a shared [Long Short-Term Memory](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) layer, which lets the model understand words in their context with other words. A bidirectional wrapper on the LSTM helps it to learn about the datapoints in relationship to the datapoints that came before it and after it.

In [0]:
blstm=tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))
src_blstm = blstm(src_embed)

Target sentences undergo the same process as source sentences

In [0]:
trg_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='trg_input')
trg_embed = tf.keras.layers.Embedding(output_dim=16, input_dim=trg_vocab_size)(trg_input)
trg_blstm = blstm(trg_embed)

The BLSTM outputs for the source and target sentences are concatenated.

In [0]:
concat_blstm = tf.keras.layers.concatenate([src_blstm, trg_blstm])

The concatenation of source and target BLSTM is input into a dense feed-forward network

In [0]:
units=32
concat_dense = tf.keras.layers.Dense(units, activation='relu')(concat_blstm)

More densely connected layers can be added

In [0]:
for units in [32]:
  concat_dense = tf.keras.layers.Dense(units, activation='relu')(concat_dense)

The output layer produces a probability for each label. The one with the highest probability is the model's prediction of an example's label.

In [0]:
concat_output = tf.keras.layers.Dense(num_classes, activation='softmax')(concat_dense)
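
How a softmax output turns into a prediction can be shown with a few lines of plain Python. The scores below are hypothetical, and the sketch handles a single example rather than a batch.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5, 2.0]  # hypothetical scores, one per class
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_id)
```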

Finally, the inputs and output of the model are defined.

In [0]:
model = tf.keras.models.Model(inputs=[src_input, trg_input], outputs=concat_output)

Print a summary of the model to see the number of parameters to be learnt.

In [0]:
model.summary()
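
The parameter counts reported by the summary can be checked by hand. The sketch below assumes hypothetical vocabulary sizes and four classes; the layer sizes follow the cells above, and the shared BLSTM is counted once because both inputs reuse the same weights.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


# Hypothetical sizes; the real ones depend on the data.
src_vocab_size, trg_vocab_size, num_classes = 700, 500, 4
embed_dim, lstm_units, dense_units = 16, 16, 32

blstm_output_dim = 2 * lstm_units  # forward and backward outputs
total = (
    src_vocab_size * embed_dim                        # source embedding
    + trg_vocab_size * embed_dim                      # target embedding
    + 2 * lstm_params(embed_dim, lstm_units)          # shared BLSTM, counted once
    + dense_params(2 * blstm_output_dim, dense_units) # first dense layer on the concatenation
    + dense_params(dense_units, dense_units)          # second dense layer
    + dense_params(dense_units, num_classes)          # softmax output layer
)
print(total)
```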

Finally, compile the model. For a softmax categorization model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.

In [0]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
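
For intuition, sparse categorical cross-entropy is the average of the negative log-probability assigned to the true class, with the true class given as an integer id. The following is a simplified sketch with hypothetical values; it omits details such as the clipping of probabilities that a library implementation would apply.

```python
import math


def sparse_categorical_crossentropy(true_ids, predicted_probs):
    """Mean of -log(probability of the true class) over the batch."""
    losses = [
        -math.log(probs[true_id])
        for true_id, probs in zip(true_ids, predicted_probs)
    ]
    return sum(losses) / len(losses)


# Hypothetical batch of two examples with four classes.
true_ids = [1, 3]
predicted_probs = [
    [0.1, 0.7, 0.1, 0.1],      # confident and correct: small loss
    [0.25, 0.25, 0.25, 0.25],  # uniform guess: loss of log(4)
]

loss = sparse_categorical_crossentropy(true_ids, predicted_probs)
print(round(loss, 4))
```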

## Train the model

This model running on this data produces decent results (almost 99% accuracy on the validation set).

In [0]:
model.fit(train_data, epochs=10, validation_data=val_data)

## Evaluate the model

Compute the accuracy on the test set (about 99%).

In [0]:
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))