<a href="https://colab.research.google.com/github/jorcisai/ARF/blob/master/src/KerasTutorial-TrainingWordEmbeddingConv1D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classifier using self-trained word embeddings and a 1D convolutional layer

## Imports
Importing standard packages and tensorflow_datasets to ease data manipulation. 

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import os

import tensorflow_datasets as tfds

Data loading from the local file system. Please use the "traveler" dataset available in the "dat" directory of the GitHub repository:

In [0]:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
  datafn = fn  # keep the name of the (last) uploaded file

print("Data file:", datafn)

## Load text data from local file

Parsing file line by line to extract class label, source and target sentences. All three are lists of strings.

In [0]:
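numsamples = 0
labs = []
srcs = []
trgs = []
# Each line holds a class label, the source sentence, "#", and the target sentence.
with open(datafn) as datafile:
  for line in datafile:
    numsamples += 1
    # Drop the trailing newline so it does not stick to the last target word.
    words = line.rstrip("\n").split(" ")
    labs.append(words[0])
    pos = words.index("#")
    # The slice ends at pos-1, so the token right before "#" is not included.
    srcs.append(" ".join(words[1:pos-1]))
    trgs.append(" ".join(words[pos+1:]))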
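
As a minimal sketch of the parsing step, the hypothetical helper `parse_line` below applies the same slicing to a single line. The sample line is invented for illustration and assumes the layout `LABEL source tokens # target tokens`; lines in the real file may look different.

```python
def parse_line(line):
    """Split one line into (label, source, target) using the notebook's slicing."""
    words = line.rstrip("\n").split(" ")
    pos = words.index("#")  # position of the source/target separator
    label = words[0]
    # Same slice as the notebook: the token right before "#" is left out.
    source = " ".join(words[1:pos - 1])
    target = " ".join(words[pos + 1:])
    return label, source, target

# Invented example line, not taken from the dataset.
example = "A quiero una habitación . # i want a room .\n"
label, source, target = parse_line(example)
print(label, "|", source, "|", target)
```

With this invented line, the slice removes the final punctuation token of the source sentence while the target sentence keeps it.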

Loading class labels, source (Spanish) and target (English) sentences from lists into dataset objects.

In [0]:
labs_dataset = tf.data.Dataset.from_tensor_slices(labs)

In [0]:
srcs_dataset = tf.data.Dataset.from_tensor_slices(srcs)

In [0]:
trgs_dataset = tf.data.Dataset.from_tensor_slices(trgs)

Taking a look at the class labels and source sentences after they have been converted into datasets. In this example, the source sentences are used to train the model, including its word embeddings.

In [0]:
for lab in labs_dataset.take(5):
  print(lab)

In [0]:
for src in srcs_dataset.take(5):
  print(src)

## Data preprocessing

Obtaining the set of class labels to map them into integers and computing the number of classes. This requires extracting the string from each `tf.Tensor` object before building the mapping.

In [0]:
label_set = set()
for lab_tensor in labs_dataset:
  label_set.add(lab_tensor.numpy().decode('utf-8'))
  
num_classes=len(label_set)

# Sort the labels so the label-to-id assignment is the same on every run.
lab2id = {}
for lab_id,lab in enumerate(sorted(label_set)):
  lab2id[lab]=lab_id

You can check which integer is assigned to each class label:

In [0]:
for key,value in lab2id.items():
  print(key, value)

Use `Dataset.map` to apply the label-to-integer function to each element of the dataset. `Dataset.map` runs in graph mode:

* Graph tensors do not have a value. 
* In graph mode you can only use TensorFlow Ops and functions. 

So you can't `.map` this function directly; you need to wrap it in a `tf.py_function`. The `tf.py_function` passes regular tensors (with a value and a `.numpy()` method to access it) to the wrapped Python function.

Converting the class labels <b>A</b>, <b>F</b>, <b>J</b> and <b>P</b> into integers for the whole dataset.

In [0]:
def author_labeler(text_lab: tf.Tensor):
  int_lab = lab2id[text_lab.numpy().decode('utf-8')]
  return int_lab

def lab_map_fn(text_lab):
  # py_func doesn't set the shape of the returned tensors.
  int_lab = tf.py_function(func=author_labeler, inp=[text_lab], Tout=tf.int64)
  # tf.data.Datasets need to set the shapes manually
  int_lab.set_shape([])
  return int_lab

labs_encoded_dataset = labs_dataset.map(lab_map_fn)


You can check what a few class labels look like after being mapped:

In [0]:
for lab in labs_encoded_dataset.take(5):
  print(lab)

Build the vocabulary of a set of sentences by tokenizing each sentence and adding the resulting tokens to a set of unique words. For this tutorial:
<ol>
<li> Iterate over each sentence's numpy value.</li>
<li> Use tfds.deprecated.text.Tokenizer to split it into tokens.</li>
<li> Collect these tokens into a Python set, to remove duplicates.</li>
</ol>

In [0]:
tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
for text_tensor in srcs_dataset:
  tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(tokens)
  
vocab_size=len(vocabulary_set)

In [0]:
print(vocabulary_set)

In [0]:
print(vocab_size)

Create an encoder by passing the vocabulary_set to tfds.deprecated.text.TokenTextEncoder. The encoder's encode method takes in a string of text and returns a list of integers.

In [0]:
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)

In [0]:
for src in srcs_dataset.take(5):
  print(encoder.encode(src.numpy()))
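
To make the two steps above concrete, here is a minimal pure-Python sketch of building a vocabulary and encoding a sentence. It is a simplified stand-in, not the `tfds` implementation: the regular-expression tokenizer, the names `tokenize`, `token2id` and `encode_sentence`, and the sample sentences are all assumptions made for illustration. The ids start at 1 because 0 is kept for padding.

```python
import re

def tokenize(text):
    """Split text into runs of word characters (a simplified tokenizer)."""
    return re.findall(r"\w+", text)

# Invented sentences, used only for illustration.
sentences = ["quiero una habitación doble", "una habitación con ducha"]

# Collect the unique tokens of all sentences.
vocabulary = set()
for sentence in sentences:
    vocabulary.update(tokenize(sentence))

# Ids start at 1 because 0 is reserved for padding.
token2id = {tok: i for i, tok in enumerate(sorted(vocabulary), start=1)}

def encode_sentence(text):
    """Map each token of the sentence to its integer id."""
    return [token2id[tok] for tok in tokenize(text)]

encoded = encode_sentence("quiero una habitación con ducha")
print(encoded)
```

The real encoder may assign different ids to the same words; only the idea of a one-to-one mapping from tokens to positive integers carries over.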

Mapping the tokens of each source sentence into a list of integers.

In [0]:
def encode(text_tensor):
  encoded_text = encoder.encode(text_tensor.numpy())
  return [encoded_text]

def encode_map_fn(text_tensor):
  # py_func doesn't set the shape of the returned tensors.
  encoded_text = tf.py_function(encode, inp=[text_tensor], Tout=tf.int64)
  #tf.data.Datasets need to set the shapes manually 
  encoded_text.set_shape([None])
  return encoded_text

srcs_encoded_dataset = srcs_dataset.map(encode_map_fn)

Checking the result of applying the mapping to the source sentences

In [0]:
for src in srcs_encoded_dataset.take(5):
  print(src)

Combining each source sentence with its corresponding class label

In [0]:
dataset = tf.data.Dataset.zip((srcs_encoded_dataset, labs_encoded_dataset)) 

In [0]:
for sample in dataset.take(5):
  print(sample)

## Experimental design

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to split dataset into 50% for training, 20% for validation and 30% for test.

Before being passed into the model, the datasets need to be shuffled and batched. First, the complete dataset is shuffled with a fixed seed so that the same shuffle can be reproduced. Then the dataset is split into training, validation and test subsets, and each subset is batched.

Typically, the examples in a batch must have the same size and shape. The examples in these datasets differ in size, because each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.

In [0]:
trainsz = int(numsamples*0.5)
valsz= int(numsamples*0.2)
testsz= int(numsamples*0.3)
batchsz = 100

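# reshuffle_each_iteration=False keeps the same order on every pass, so the
# take/skip splits below do not exchange samples between epochs.
dataset = dataset.shuffle(numsamples, seed=13, reshuffle_each_iteration=False)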

train_data = dataset.take(trainsz)
train_data = train_data.padded_batch(batchsz,padded_shapes=([None],[]))

val_data = dataset.skip(trainsz).take(valsz)
val_data = val_data.padded_batch(batchsz,padded_shapes=([None],[]))

test_data = dataset.skip(trainsz+valsz)
test_data = test_data.padded_batch(batchsz,padded_shapes=([None],[]))
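
The take/skip logic can be sketched on a plain Python list. This is only an analogy under stated assumptions: the dataset size of 10 is hypothetical, and list slicing stands in for `take` and `skip`. The three subsets are disjoint as long as the shuffled order does not change between passes over the data.

```python
import random

num_samples = 10  # hypothetical dataset size
samples = list(range(num_samples))

# Shuffle once with a fixed seed so the split is repeatable.
random.Random(13).shuffle(samples)

train_size = int(num_samples * 0.5)
val_size = int(num_samples * 0.2)

train = samples[:train_size]                     # like take(train_size)
val = samples[train_size:train_size + val_size]  # like skip(train_size).take(val_size)
test = samples[train_size + val_size:]           # like skip(train_size + val_size)

print(len(train), len(val), len(test))
```

The test subset simply receives whatever remains after the training and validation samples, which is why its size is not passed explicitly.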

Now, `train_data`, `val_data` and `test_data` are not collections of (`sentence, label`) pairs, but collections of batches. Each batch is a pair of (*set of sentences*, *set of labels*) represented as arrays. To illustrate:

In [0]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]
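
What padded batching does to sentences of different lengths can be sketched in a few lines of plain Python. The helper `pad_batch` and the encoded sentences are invented for illustration; the sketch assumes zero as the padding value and pads to the longest sentence in the batch.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three invented encoded sentences of different lengths.
batch = [[5, 6, 4], [1, 3], [2, 4, 6, 1, 3]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```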

Since we have introduced a new token encoding (the zero used for padding), the vocabulary size has increased by one.

In [0]:
vocab_size += 1

## Build the model



Create an empty model and add layers to it.

In [0]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings. See the [word embeddings tutorial](../text/word_embeddings.ipynb) for more details.

In [0]:
model.add(tf.keras.layers.Embedding(vocab_size, 16))

Next, a 1D convolutional layer slides a window covering 5 words over the sequence of embeddings and produces an output of dimension 16 at each window position.

In [0]:
model.add(tf.keras.layers.Conv1D(16, 5, activation='relu'))
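
The sliding window can be illustrated with a single filter in plain Python. This is a sketch under assumptions, not the Keras implementation: the function `conv1d_valid`, the toy 2-dimensional embeddings and the constant weights are invented. It uses no padding and a stride of 1, which are the Keras defaults for `Conv1D`, so a sequence of length n yields n - 5 + 1 outputs.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors (no padding, stride 1), then apply ReLU."""
    window = len(kernel)
    outputs = []
    for start in range(len(sequence) - window + 1):
        total = bias
        # Weighted sum over every vector component inside the current window.
        for offset in range(window):
            vec, weights = sequence[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(vec, weights))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs

# Toy input: 7 positions with 2-dimensional embeddings (invented values).
sequence = [[float(i), 1.0] for i in range(7)]
# One filter spanning 5 positions; all weights set to 0.1 for simplicity.
kernel = [[0.1, 0.1] for _ in range(5)]

features = conv1d_valid(sequence, kernel)
print(len(features))
```

A layer with 16 filters repeats this computation with 16 different kernels, giving a 16-dimensional vector per window position.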

Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

In [0]:
model.add(tf.keras.layers.GlobalAveragePooling1D())
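
A minimal sketch of averaging over the sequence dimension, written in plain Python with invented values. The helper name `global_average_pooling_1d` is hypothetical; the point is that sequences of different lengths produce output vectors of the same size.

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length vectors over the sequence axis."""
    steps = len(sequence)
    dims = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / steps for d in range(dims)]

# Two invented sequences with 2 and 4 time steps.
short_seq = [[1.0, 2.0], [3.0, 4.0]]
long_seq = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]

pooled_short = global_average_pooling_1d(short_seq)
pooled_long = global_average_pooling_1d(long_seq)
print(pooled_short, pooled_long)
```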

Finally, we'll have a series of one or more densely connected layers, with the last one being the output layer. The output layer produces a probability for each label. The label with the highest probability is the model's prediction for an example.

In [0]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [16, 16]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
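
How a softmax output turns into a prediction can be sketched as follows. The scores are invented and the helper `softmax` is written out by hand for illustration; it is not the Keras code path.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow without changing the result.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.1]  # invented scores for four classes
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```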

Check the number of parameters in each layer of the model:

In [0]:
model.summary()
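
The per-layer parameter counts can also be worked out by hand. The sketch below assumes a hypothetical `vocab_size` of 700 (substitute the value computed earlier) and four classes; the layer sizes are the ones used in this notebook.

```python
vocab_size = 700   # hypothetical; use the value computed in the notebook
num_classes = 4    # labels A, F, J and P
embed_dim = 16
filters, window = 16, 5

params = {
    # One embed_dim-sized vector per vocabulary entry.
    "embedding": vocab_size * embed_dim,
    # Each filter has window * embed_dim weights plus one bias.
    "conv1d": (window * embed_dim + 1) * filters,
    # Dense layers: (inputs + 1 bias) * units.
    "dense_1": (filters + 1) * 16,
    "dense_2": (16 + 1) * 16,
    "output": (16 + 1) * num_classes,
}
total = sum(params.values())

for name, count in params.items():
    print(name, count)
print("total", total)
```

The pooling layer does not appear because it has no trainable parameters.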

Finally, compile the model. For a softmax categorization model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.

In [0]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
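
As a rough illustration of what this loss measures, the sketch below computes the mean negative log-probability of the correct class. "Sparse" means the labels are integer class ids rather than one-hot vectors. The function is a hand-written approximation with invented inputs, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(true_ids, predicted_probs):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-math.log(probs[label])
              for label, probs in zip(true_ids, predicted_probs)]
    return sum(losses) / len(losses)

# Two invented examples with four classes each.
true_ids = [0, 2]
predicted_probs = [
    [0.7, 0.1, 0.1, 0.1],      # fairly confident and correct
    [0.25, 0.25, 0.25, 0.25],  # uniform guess
]
loss = sparse_categorical_crossentropy(true_ids, predicted_probs)
print(round(loss, 4))
```

The confident correct prediction contributes a small loss, while the uniform guess contributes a larger one.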

## Train the model

This model running on this data produces decent results (>98% accuracy).

In [0]:
history = model.fit(train_data, epochs=10, validation_data=val_data)

## Evaluate the model

Compute the accuracy on the test set (>98% accuracy).

In [0]:
test_loss, test_acc = model.evaluate(test_data)

print('\nTest loss: {:.3f}, Test accuracy: {:.3f}'.format(test_loss, test_acc))

In [0]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()
  
plot_graphs(history, 'accuracy')