<a href="https://colab.research.google.com/github/jorcisai/ARF/blob/master/src/KerasTutorial-PreTrainedWordEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classifier using pre-trained word embeddings

## Imports
Importing the standard packages, plus `tensorflow_datasets` to ease data manipulation and `tensorflow_hub` to provide access to pre-trained word embeddings.

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import os

import tensorflow_hub as hub
import tensorflow_datasets as tfds

Loading the data from the local file system. Please use the "traveler" dataset available in the "dat" directory of the GitHub repository:

In [0]:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
  # Keep the name of the uploaded file (the last one if several were uploaded).
  datafn = fn

print("Data file: ",datafn)

## Load text data from local file

Parsing the file line by line to extract the class label, the source sentence and the target sentence. All three are stored as lists of strings.

In [0]:
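numsamples=0
labs=[]
srcs=[]
trgs=[]
# Use a context manager so the file is closed after parsing.
with open(datafn) as datafile:
  for line in datafile:
    numsamples+=1
    # Drop the trailing newline so it does not end up in the last target word.
    words = line.rstrip("\n").split(" ")
    labs.append(words[0])
    pos=words.index("#")  # separator between source and target sentences
    srcs.append(" ".join(words[1:pos-1]))
    trgs.append(" ".join(words[pos+1:]))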
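
The parsing step can be tried on a single line. This is a minimal sketch that assumes each line has the layout `LABEL source words # target words`, which is inferred from the parsing loop rather than from a format specification. The helper name `parse_label_and_target` and the sample line are hypothetical, and only the label and the target sentence are extracted, since those are the two fields the classifier uses.

```python
def parse_label_and_target(line):
    """Return (label, target sentence) from one line of the data file.

    Assumes the layout "LABEL source words # target words"
    (an assumption, not a documented format).
    """
    words = line.rstrip("\n").split(" ")
    label = words[0]
    pos = words.index("#")  # position of the source/target separator
    target = " ".join(words[pos + 1:])
    return label, target


# Hypothetical sample line, made up for illustration.
sample = "A hola mundo # hello world\n"
print(parse_label_and_target(sample))
```

Note that `words.index("#")` raises a `ValueError` if a line has no separator, so a malformed line stops the parse instead of being silently skipped.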

Loading the class labels, source (Spanish) sentences and target (English) sentences from the lists into `tf.data.Dataset` objects.

In [0]:
labs_dataset = tf.data.Dataset.from_tensor_slices(labs)

In [0]:
srcs_dataset = tf.data.Dataset.from_tensor_slices(srcs)

In [0]:
trgs_dataset = tf.data.Dataset.from_tensor_slices(trgs)

Taking a look at the class labels and target sentences after they have been converted into datasets. In this example, as we are using a pre-trained English word embedding, only the target sentences are used to train the text classifier.

In [0]:
for lab in labs_dataset.take(5):
  print(lab)

In [0]:
for trg in trgs_dataset.take(5):
  print(trg)

## Data preprocessing

Obtaining the set of class labels to map them into integers and computing the number of classes. This requires extracting the string from each `tf.Tensor` object and then mapping it to an integer id.

In [0]:
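label_set = set()
for lab_tensor in labs_dataset:
  # add() inserts the whole label; update() would insert each character separately.
  label_set.add(lab_tensor.numpy().decode('utf-8'))

num_classes=len(label_set)

# Sort the labels so that the label-to-id assignment is the same on every run.
lab2id = {}
for lab_id,lab in enumerate(sorted(label_set)):
  lab2id[lab]=lab_id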
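
The label-to-integer mapping can be sketched in plain Python, without tensors. The label list below is made up for illustration, and sorting the labels is a design choice of this sketch so that the ids stay the same across runs. The last lines show why `set.add` is the safer call for string labels: `set.update` treats a string as an iterable of characters.

```python
# Hypothetical labels, made up for illustration.
labels = ["A", "F", "J", "P", "A", "J"]

label_set = set()
for lab in labels:
    label_set.add(lab)  # inserts the whole string as one element

# Sorting gives a deterministic label -> id assignment.
lab2id = {lab: lab_id for lab_id, lab in enumerate(sorted(label_set))}
print(lab2id)

# update() iterates over its argument, so a string is split into characters.
chars = set()
chars.update("AB")
print(sorted(chars))
```

With single-character labels both calls give the same set, which is why the difference only shows up once a label has more than one character.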

You can check the assignment of class labels to integer labels:

In [0]:
for key,value in lab2id.items():
  print (key,value)

Use `Dataset.map` to apply the `author_labeler` function to each element of the dataset. `Dataset.map` runs in graph mode:

* Graph tensors do not have a value. 
* In graph mode you can only use TensorFlow Ops and functions. 

So you can't `.map` this function directly: you need to wrap it in a `tf.py_function`. The `tf.py_function` passes regular tensors (with a value and a `.numpy()` method to access it) to the wrapped Python function.

Converting the class labels <b>A</b>, <b>F</b>, <b>J</b> and <b>P</b> into integers for the whole dataset:

In [0]:
def author_labeler(text_lab: tf.Tensor):
  int_lab = lab2id[text_lab.numpy().decode('utf-8')]
  return int_lab

def lab_map_fn(text_lab):
  # tf.py_function doesn't set the shape of the tensors it returns.
  int_lab = tf.py_function(func=author_labeler, inp=[text_lab], Tout=tf.int64)
  # Each label is a scalar, so set the (empty) shape manually for tf.data.
  int_lab.set_shape([])
  return int_lab

labs_encoded_dataset = labs_dataset.map(lab_map_fn)


You can check what a few class labels look like after being mapped:

In [0]:
for lab in labs_encoded_dataset.take(5):
  print(lab)

Combining each target sentence with its corresponding class label:

In [0]:
dataset = tf.data.Dataset.zip((trgs_dataset, labs_encoded_dataset)) 

In [0]:
for sample in dataset.take(5):
  print(sample)

## Pre-trained text embedding model

One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which has three advantages:

*   we don't have to worry about text preprocessing,
*   we can benefit from transfer learning,
*   the embedding has a fixed size, so it's simpler to process.

For this example we will use a **pre-trained text embedding model** from [TensorFlow Hub](https://www.tensorflow.org/hub) called [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1).

There are three other pre-trained models to test for the sake of this tutorial:

* [google/tf2-preview/gnews-swivel-20dim-with-oov/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1) - same as [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), but with 2.5% vocabulary converted to OOV buckets. This can help if vocabulary of the task and vocabulary of the model don't fully overlap.
* [google/tf2-preview/nnlm-en-dim50/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1) - A much larger model with ~1M vocabulary size and 50 dimensions.
* [google/tf2-preview/nnlm-en-dim128/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) - Even larger model with ~1M vocabulary size and 128 dimensions.

Note that after the embedding of each word in the sentence is computed, pooling is performed to represent each sentence as a single vector.
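
To make the pooling step concrete, here is a minimal sketch in plain Python. It is not the hub module's implementation: which pooling the module applies is not stated here, so the sketch shows two common choices, mean pooling and sqrt-N pooling (sum divided by the square root of the number of words). The 3-dimensional word vectors are made up for illustration.

```python
import math


def pool(word_vectors, mode="mean"):
    """Combine a list of equal-length word vectors into one sentence vector."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    # Sum the vectors component by component.
    sums = [sum(vec[d] for vec in word_vectors) for d in range(dim)]
    # Mean pooling divides by n, sqrt-N pooling by sqrt(n).
    denom = n if mode == "mean" else math.sqrt(n)
    return [s / denom for s in sums]


# Hypothetical embeddings for a four-word sentence.
word_vectors = [
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [2.0, 4.0, 2.0],
    [2.0, 2.0, 2.0],
]

print(pool(word_vectors, mode="mean"))
print(pool(word_vectors, mode="sqrtn"))
```

Either way, the result has the same length as a single word vector regardless of how many words the sentence has, which is what gives the first layer a fixed-size output.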

In [0]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

Take a look at the resulting sentence embeddings: one 20-dimensional vector per input sentence.

In [0]:
trg_batch,_ = next(iter(dataset.batch(5)))
hub_layer(trg_batch)

## Experimental design

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to split the dataset into 50% for training, 20% for validation and 30% for test.

Before being passed into the model, the datasets need to be shuffled and batched. So, first, the complete dataset is shuffled with a fixed seed so that we can repeat the same shuffle of the dataset, then the dataset is split into training, validation and test, and each of these subsets is batched. Note that `shuffle` reshuffles the data on every pass by default, so the order must be held fixed (`reshuffle_each_iteration=False`) for the `take`/`skip` subsets to stay disjoint.

In [0]:
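trainsz = int(numsamples*0.5)
valsz= int(numsamples*0.2)
testsz= int(numsamples*0.3)  # informational: the test set takes all remaining samples
batchsz = 100

# Shuffle once with a fixed seed. reshuffle_each_iteration=False keeps the order
# identical on every pass, so the take/skip subsets below never overlap.
dataset = dataset.shuffle(numsamples,seed=13,reshuffle_each_iteration=False)

train_data = dataset.take(trainsz)
train_data = train_data.batch(batchsz)

val_data = dataset.skip(trainsz).take(valsz)
val_data = val_data.batch(batchsz)

test_data = dataset.skip(trainsz+valsz)
test_data = test_data.batch(batchsz)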
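
The `take`/`skip` split can be mirrored with list slicing to check the subset sizes. This is a plain-Python analogue, not TensorFlow code: the helper name `split_indices` and the sample count of 11 are made up for illustration. It shows that the test subset receives every sample left over after rounding.

```python
def split_indices(num_samples, train_frac=0.5, val_frac=0.2):
    """Split sample indices the way take/skip split a dataset."""
    train_size = int(num_samples * train_frac)
    val_size = int(num_samples * val_frac)
    indices = list(range(num_samples))
    train = indices[:train_size]                      # like take(train_size)
    val = indices[train_size:train_size + val_size]   # like skip(train_size).take(val_size)
    test = indices[train_size + val_size:]            # like skip(train_size + val_size)
    return train, val, test


train, val, test = split_indices(11)
print(len(train), len(val), len(test))
```

The three subsets are disjoint here because the list order never changes. The same holds for a dataset only if its order is identical on every pass.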

Now, `train_data`, `val_data` and `test_data` are not collections of (`sentence, label`) pairs, but collections of batches. Each batch is a pair of (*set of sentences*, *set of labels*) represented as arrays. To illustrate:

In [0]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]
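
The change from pairs to batches can be illustrated without TensorFlow. The sketch below is a plain-Python analogue of batching, with a hypothetical helper `batch_pairs` and made-up sentences: it groups (`sentence, label`) pairs into batches, each holding one list of sentences and one list of labels.

```python
def batch_pairs(pairs, batch_size):
    """Group (sentence, label) pairs into (sentences, labels) batches."""
    batches = []
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        sentences = [sentence for sentence, _ in chunk]
        labels = [label for _, label in chunk]
        batches.append((sentences, labels))
    return batches


# Hypothetical (sentence, label) pairs.
pairs = [("good morning", 0), ("where is the station", 1), ("thank you", 0)]
batches = batch_pairs(pairs, batch_size=2)

first_sentences, first_labels = batches[0]
print(first_sentences[0], first_labels[0])
```

As in the sketch, the last batch can be smaller than the batch size when the number of samples is not a multiple of it.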

## Build the model



Create an empty model and add layers to it.

In [0]:
model = tf.keras.Sequential()

The first layer converts string representations to fixed-length sentence embeddings by pooling the word embeddings.

In [0]:
model.add(hub_layer)

Next, we add a series of one or more densely connected layers, with the last one being the output layer. The output layer produces a probability for each label. The label with the highest probability is the model's prediction for a sentence.

In [0]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [16,16]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
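
How the output layer turns scores into a prediction can be sketched in plain Python. The scores below are made up for illustration, and the `softmax` helper is a hand-written version for exposition, not the Keras implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-layer scores for four classes.
scores = [2.0, 0.5, -1.0, 0.1]
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because softmax preserves the ordering of the scores, the predicted class is also the index of the largest raw score.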

Finally, compile the model. For a softmax classification model with integer class labels, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.

In [0]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
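
For a single sample, sparse categorical cross-entropy is the negative log of the probability assigned to the true class. The sketch below computes it by hand for made-up probabilities. The helper name `sparse_ce` is hypothetical, and the sketch leaves out the averaging over a batch.

```python
import math


def sparse_ce(true_label, probs):
    """Cross-entropy for one sample whose label is an integer class id."""
    return -math.log(probs[true_label])


# Hypothetical predicted probabilities for four classes.
probs = [0.7, 0.1, 0.1, 0.1]

loss_if_right = sparse_ce(0, probs)  # true class got probability 0.7
loss_if_wrong = sparse_ce(1, probs)  # true class got probability 0.1
print(loss_if_right, loss_if_wrong)
```

The loss is small when the true class receives a high probability and grows as that probability approaches zero. The "sparse" variant takes the integer id directly, so the labels do not need to be one-hot encoded.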

## Train the model

This model running on this data produces decent results (validation accuracy is 98%).

In [0]:
model.fit(train_data, epochs=10, validation_data=val_data)

## Evaluate the model

Compute the accuracy on the test set (98%).

In [0]:
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))
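
The accuracy metric is the fraction of samples whose predicted label matches the true label. A minimal sketch with made-up labels, using a hypothetical helper named `accuracy`:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical true and predicted class ids for five sentences.
true_labels = [0, 1, 2, 3, 0]
predicted_labels = [0, 1, 2, 0, 0]
print(accuracy(true_labels, predicted_labels))
```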