<a href="https://colab.research.google.com/github/jorcisai/ARF/blob/master/src/KerasTutorial-TrainingWordEmbeddingPooling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classifier using self-trained word embeddings and pooling

## Imports
Importing standard packages and tensorflow_datasets to ease data manipulation. 

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import os

import tensorflow_datasets as tfds

Data loading from the local file system. Please use the "traveler" dataset available in the "dat" directory of the GitHub repository:

In [None]:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
  datafn = fn  # keep the name of the (last) uploaded file

print("Data file: ", datafn)

## Load text data from local file

Parsing the file line by line to extract the class label and the source and target sentences. All three are stored as lists of strings.

In [None]:
numsamples=0
labs=[]
src_sents=[]
trg_sents=[]
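# Each line: class label, source words, "#" separator, target words.
with open(datafn) as f:
  for line in f:
    numsamples+=1
    # Strip the trailing newline so it does not stick to the last target word.
    words = line.rstrip("\n").split(" ")
    labs.append(words[0])
    pos=words.index("#")
    # Note: the slice ends at pos-1, so the token right before "#" is left out.
    src_sents.append(" ".join(words[1:pos-1]))
    trg_sents.append(" ".join(words[pos+1:]))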
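
To make the expected line layout concrete, here is a minimal sketch that applies the same slicing to a single line. The sample line and the helper name `parse_line` are made up for illustration; the actual contents of the "traveler" file may differ.

```python
def parse_line(line):
    """Split one line into (label, source sentence, target sentence).

    Uses the same slicing as the loop above: the first token is the label,
    "#" separates source from target, and the token right before "#" is
    left out of the source sentence.
    """
    # The newline is stripped here so it does not end up in the target.
    words = line.rstrip("\n").split(" ")
    pos = words.index("#")
    label = words[0]
    src = " ".join(words[1:pos - 1])
    trg = " ".join(words[pos + 1:])
    return label, src, trg


# Hypothetical line, not taken from the dataset.
sample = "A una cama doble . # a double bed .\n"
label, src, trg = parse_line(sample)
print(label, "|", src, "|", trg)
```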

## Data preprocessing

Simple conversion from class text label into class integer label:

In [None]:
# Sort the labels so that the label-to-id mapping is the same on every run.
labset = sorted(set(labs))
num_classes=len(labset)

lab2id = {lab: i for i, lab in enumerate(labset)}

# Replace each text label by its integer id.
labs = [lab2id[lab] for lab in labs]
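
As a quick illustration of the label conversion, the sketch below maps a toy list of text labels to integers. The labels are invented, and the ids are assigned in sorted order, which is an assumption made here so that the result is reproducible.

```python
toy_labels = ["B", "A", "C", "A", "B"]  # hypothetical labels

# Assign ids in sorted order so the mapping does not depend on set ordering.
toy_lab2id = {lab: i for i, lab in enumerate(sorted(set(toy_labels)))}
toy_ids = [toy_lab2id[lab] for lab in toy_labels]

print(toy_lab2id)  # {'A': 0, 'B': 1, 'C': 2}
print(toy_ids)     # [1, 0, 2, 0, 1]
```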

Tokenization, conversion into sequence of integers and padding:

In [None]:
import tensorflow.keras.preprocessing as prepro

def tokenize(sents):
  tokenizer = prepro.text.Tokenizer(filters='')
  tokenizer.fit_on_texts(sents)
  tensors = tokenizer.texts_to_sequences(sents)
  tensors = prepro.sequence.pad_sequences(tensors,padding='post')

  return tensors, tokenizer

src_tensors, src_tokenizer = tokenize(src_sents)
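
The following pure-Python sketch approximates what the tokenizer and the padding step produce. It is not the Keras implementation: `toy_tokenize` is a hypothetical helper that assumes words are ranked by frequency, indices start at 1, and 0 is reserved for padding.

```python
from collections import Counter


def toy_tokenize(sents):
    """Map sentences to integer sequences padded with zeros at the end."""
    counts = Counter(w for s in sents for w in s.lower().split(" "))
    # Most frequent word gets index 1; index 0 is kept for padding.
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    seqs = [[word_index[w] for w in s.lower().split(" ")] for s in sents]
    maxlen = max(len(seq) for seq in seqs)
    padded = [seq + [0] * (maxlen - len(seq)) for seq in seqs]
    return padded, word_index


padded, word_index = toy_tokenize(["la habitacion doble", "la llave"])
print(padded)           # [[1, 2, 3], [1, 4, 0]]
print(len(word_index))  # 4
```

Because index 0 is taken by padding, the largest index equals the number of distinct words, so an embedding table indexed by these values needs one more row than the vocabulary size.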

Loading class labels and source (Spanish) sentences from lists into dataset objects. The target (English) sentences are not used in this example.

In [None]:
lab_dataset = tf.data.Dataset.from_tensor_slices(labs)

In [None]:
src_dataset = tf.data.Dataset.from_tensor_slices(src_tensors)

Taking a look at the class labels and source sentences after conversion into datasets. In this example, the source sentences are used to train the model, including its word embeddings.

In [None]:
for lab in lab_dataset.take(5):
  print(lab)

In [None]:
for src in src_dataset.take(5):
  print(src)

Print out vocabulary size:

In [None]:
print(len(src_tokenizer.word_counts))

Combining each source sentence with its corresponding class label:

In [None]:
dataset = tf.data.Dataset.zip((src_dataset, lab_dataset)) 

In [None]:
for sample in dataset.take(5):
  print(sample)

## Experimental design

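Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to split the dataset into 50% for training, 20% for validation and 30% for test.

Before being passed into the model, the datasets need to be shuffled and batched. First, the complete dataset is shuffled with a fixed seed so that the same shuffle can be repeated. The shuffle must also keep the same order on every pass over the data (`reshuffle_each_iteration=False`); otherwise each epoch would draw a different order and samples would leak between the three subsets. Then the dataset is split into training, validation and test, and each of these subsets is batched.

In [None]:
trainsz = int(numsamples*0.5)
valsz= int(numsamples*0.2)
testsz= int(numsamples*0.3)
batchsz = 100

# Shuffle once with a fixed order; without reshuffle_each_iteration=False the
# order changes on every iteration and take/skip would select different samples.
dataset = dataset.shuffle(numsamples,seed=13,reshuffle_each_iteration=False)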

train_data = dataset.take(trainsz)
train_data = train_data.batch(batchsz)

val_data = dataset.skip(trainsz).take(valsz)
val_data = val_data.batch(batchsz)

test_data = dataset.skip(trainsz+valsz)
test_data = test_data.batch(batchsz)
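
The split arithmetic can be checked without TensorFlow. In this sketch `split_sizes` is a hypothetical helper and the sample count is made up; it reflects the fact that the test set is whatever remains after skipping the training and validation samples.

```python
def split_sizes(numsamples, train_frac=0.5, val_frac=0.2):
    """Return (train, val, test) sizes as produced by take/skip."""
    trainsz = int(numsamples * train_frac)
    valsz = int(numsamples * val_frac)
    # skip(trainsz + valsz) keeps every remaining sample for the test set.
    testsz = numsamples - trainsz - valsz
    return trainsz, valsz, testsz


sizes = split_sizes(1003)
print(sizes)  # (501, 200, 302)
```

Note that the remainder (302) is larger than `int(1003*0.3)` (300), because the samples lost to truncation all end up in the test set.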

Now, `train_data`, `val_data` and `test_data` are not collections of (`sentence, label`) pairs, but collections of batches. Each batch is a pair of (*set of sentences*, *set of labels*) represented as arrays. To illustrate:

In [None]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

## Build the model



Create an empty model and add layers to it.

In [None]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings. See the [word embeddings tutorial](../text/word_embeddings.ipynb) for more details.

In [None]:
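# Tokenizer indices start at 1 and 0 is used for padding,
# so the embedding table needs one row more than the vocabulary size.
vcb_size=len(src_tokenizer.word_counts)+1
model.add(tf.keras.layers.Embedding(vcb_size, 16))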
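
Conceptually, an embedding layer is a trainable lookup table with one row per integer index. The sketch below uses a tiny hand-written table in plain Python; the values and the name `embed` are illustrative only, whereas the real layer starts from random weights and learns them during training.

```python
# Toy table: 4 indices, embedding dimension 2 (values are arbitrary).
table = [
    [0.7, 0.8],   # index 0 (padding); it has a row like any other index
    [0.1, 0.2],   # index 1
    [0.3, 0.4],   # index 2
    [0.5, 0.6],   # index 3
]


def embed(sequence):
    """Replace each integer index by its row of the table."""
    return [table[i] for i in sequence]


embedded = embed([2, 3, 0])
print(embedded)  # [[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
```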

Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

In [None]:
model.add(tf.keras.layers.GlobalAveragePooling1D())
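
Averaging over the sequence dimension can be sketched in a few lines of plain Python. `global_average_pool` is a hypothetical helper, not the Keras layer; like the layer used here without a mask, it averages over all positions, padded ones included.

```python
def global_average_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# Three positions with 2-dimensional embeddings -> one 2-dimensional vector.
pooled = global_average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(pooled)  # [3.0, 4.0]
```

Whatever the sequence length, the output has as many components as the embedding dimension, which is why the dense layers that follow can have a fixed input size.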

Finally, we add a series of one or more densely connected layers, with the last one being the output layer. The output layer produces a probability for each label. The label with the highest probability is the model's prediction for an example.

In [None]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [16, 16]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
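
The softmax output and the final prediction can be illustrated as follows. The scores are made-up numbers, and `softmax` is written out by hand here rather than taken from Keras.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.5, -1.0])    # hypothetical scores for 3 classes
predicted = probs.index(max(probs))  # class with the highest probability
print(predicted)  # 0
```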

Check the number of parameters of the model per layer:

In [None]:
model.summary()
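
The parameter counts reported by the summary can be reproduced by hand: an embedding layer has rows times dimension weights, and a dense layer has inputs times units weights plus one bias per unit. The sketch below assumes the layer sizes used above (embedding dimension 16, two hidden layers of 16 units); the `embedding_rows` and `num_classes` values passed in are placeholders, not the real ones for this dataset.

```python
def count_params(embedding_rows, num_classes, emb_dim=16, hidden=(16, 16)):
    """Return the per-layer parameter counts of the model sketched above."""
    counts = [embedding_rows * emb_dim]  # embedding table
    counts.append(0)                     # average pooling has no parameters
    inputs = emb_dim
    for units in hidden + (num_classes,):
        counts.append(inputs * units + units)  # weights + biases
        inputs = units
    return counts


per_layer = count_params(embedding_rows=1000, num_classes=3)
print(per_layer, sum(per_layer))  # [16000, 0, 272, 272, 51] 16595
```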

Finally, compile the model. For a softmax classification model with integer class labels, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
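
For reference, sparse categorical cross-entropy takes integer labels and averages the negative log-probability assigned to the correct class. The helper below is a hand-written sketch with invented probabilities, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(probability of the true class) over all samples."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


labels = [0, 1]                              # integer class ids
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # hypothetical softmax outputs
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 3))  # 0.29
```

The labels are used directly as indices into each probability vector, which is why no one-hot encoding of the labels is needed.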

## Train the model

Trained on this data, the model produces decent results (~98% accuracy).

In [None]:
history = model.fit(train_data, epochs=10, validation_data=val_data)

## Evaluate the model

Compute the accuracy on the test set (~98%):

In [None]:
test_loss, test_acc = model.evaluate(test_data)

print('\nTest loss: {:.3f}, Test accuracy: {:.3f}'.format(test_loss, test_acc))

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()
  
plot_graphs(history, 'accuracy')