This tutorial shows how to load text files with the `tf.data.TextLineDataset` API. `TextLineDataset` loads a text file so that each line becomes one example. In this example, you will build a model that classifies lines of text by translator: each line comes from one of three English translations of Homer's Iliad.

The datasets are the following (they have already been preprocessed):
* William Cowper — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt)
* Edward, Earl of Derby — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt)
* Samuel Butler — [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt)

Reference: 
  * Load Text: https://www.tensorflow.org/tutorials/load_data/text

In [0]:
!pip install -q tf-nightly

In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
import os
import numpy as np

print("Tensorflow Version: {}".format(tf.__version__))
print("Eager Mode: {}".format(tf.executing_eagerly()))
print("GPU {} available".format("is" if tf.config.experimental.list_physical_devices("GPU") else "not"))

Tensorflow Version: 2.1.0-dev20200107
Eager Mode: True
GPU is available


# Data Preprocessing

## Downloading Datasets

In [0]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

In [0]:
for name in FILE_NAMES:
  text_file = tf.keras.utils.get_file(name, DIRECTORY_URL + name)
dirname = os.path.dirname(text_file)

In [5]:
!ls {dirname}

butler.txt  cowper.txt	derby.txt


## Load the Datasets

In [0]:
def labeler(example, index):
  return example, tf.cast(index, tf.int32)

In [0]:
labeled_data_sets = []

for i, filename in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(dirname, filename))
  labeled_dataset = lines_dataset.map(lambda eg: labeler(eg, i))
  labeled_data_sets.append(labeled_dataset)

In [8]:
labeled_data_sets

[<MapDataset shapes: ((), ()), types: (tf.string, tf.int32)>,
 <MapDataset shapes: ((), ()), types: (tf.string, tf.int32)>,
 <MapDataset shapes: ((), ()), types: (tf.string, tf.int32)>]

Before you start training, you have to combine the three text datasets into a single, larger one.

In [0]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [0]:
all_labeled_data = labeled_data_sets[0]
for i in range(1, len(labeled_data_sets)):
  all_labeled_data = all_labeled_data.concatenate(labeled_data_sets[i])

In [0]:
all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, 
                                            reshuffle_each_iteration=False)

Let's take a look at the combined labeled data.

In [12]:
for d in all_labeled_data.take(5):
  print(d)

(<tf.Tensor: shape=(), dtype=string, numpy=b'Shouldst not to evil lead the sons of Greece.'>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'To whom thus Hector of the glancing helm,'>, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The warrior ranks, so long he bids thee pause'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'"Be men, my friends," he cried, "and respect one another\'s good'>, <tf.Tensor: shape=(), dtype=int32, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"In Antron, and in Pteleon's grass-clad meads;">, <tf.Tensor: shape=(), dtype=int32, numpy=1>)
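The way `shuffle(BUFFER_SIZE)` mixes the three concatenated files can be sketched in plain Python. This illustrates only the buffer idea and is not TensorFlow's implementation; `buffered_shuffle` and its parameters are hypothetical names, and `buffer_size` is assumed to be at least 1.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a buffer-limited random order (illustrative sketch)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# Three "files" concatenated in order, each line tagged with its label.
concatenated = [(f"line {n}", label) for label in range(3) for n in range(4)]

small = list(buffered_shuffle(concatenated, buffer_size=2, seed=0))
large = list(buffered_shuffle(concatenated, buffer_size=len(concatenated), seed=0))

print([label for _, label in small])  # a small buffer can only mix nearby lines
print([label for _, label in large])  # a buffer as large as the data mixes everything
```

A buffer that holds the whole dataset gives a full shuffle, while a small buffer leaves the lines roughly in file order, which would feed the model long runs of a single label.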


# Encode Text as Numbers

Before you go further, the text needs one more transformation. Machine learning models work on numeric data, but text is not numeric, so you have to convert it into numbers. Here the dataset consists of lines of text, each paired with a label that identifies the translator. In practice, you split each line into words and map every word to an integer.

## Building Vocabularies

Building a vocabulary means tokenizing the text and collecting the unique words. The steps are:
* Iterate over every line in the dataset.
* Use `tfds.features.text.Tokenizer` to split each line into tokens.
* Collect the tokens into a set, which removes the duplicates.
* Record the size of the vocabulary for later use.
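The steps above can be sketched without any library. The regular-expression `tokenize` helper below is a simplified, hypothetical stand-in for the tokenizer and may split some strings differently; the two lines are taken from the sample output shown earlier.

```python
import re

def tokenize(line):
    """Split a line into alphanumeric tokens (simplified stand-in tokenizer)."""
    return re.findall(r"\w+", line)

lines = [
    "Shouldst not to evil lead the sons of Greece.",
    "To whom thus Hector of the glancing helm,",
]

vocabulary = set()
for line in lines:
    # Adding to a set silently discards tokens that were already seen.
    vocabulary.update(tokenize(line))

vocab_count = len(vocabulary)
print(vocab_count)  # 15: "of" and "the" occur in both lines, "To" and "to" differ by case
```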

In [13]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:  # _ is the label
  tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178

## Encode Examples

Once you have the set of tokens, you need an encoder that maps each token to an integer.

Pass the vocabulary set to `tfds.features.text.TokenTextEncoder` to create an encoder. The encoder transforms a string of text into a list of integers.

In [0]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

You can try the encoder by passing it a line of text.

In [15]:
example_text = next(iter(all_labeled_data))[0].numpy()
example_text

b'Shouldst not to evil lead the sons of Greece.'

In [16]:
encoded_example = encoder.encode(example_text)
print(encoded_example)

[4180, 12498, 3435, 15171, 2424, 16678, 4564, 1538, 13556]


You can also decode the encoded list of integers.

In [17]:
encoder.decode(encoded_example)

'Shouldst not to evil lead the sons of Greece'
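The round trip above, including the loss of the final period, can be reproduced with a small dictionary-based encoder. This is a sketch under assumptions: `SimpleTokenEncoder` is a hypothetical class, its ids will not match the ones printed above, and unlike a full encoder it raises `KeyError` for words outside the vocabulary.

```python
import re

class SimpleTokenEncoder:
    """Hypothetical token-to-id encoder; id 0 is reserved for padding."""

    def __init__(self, vocabulary):
        # Sort for reproducible ids; index 0 is a placeholder for padding.
        self._id_to_token = [None] + sorted(vocabulary)
        self._token_to_id = {
            tok: i for i, tok in enumerate(self._id_to_token) if i > 0
        }

    def encode(self, text):
        # Punctuation is dropped here because only word characters are matched.
        return [self._token_to_id[tok] for tok in re.findall(r"\w+", text)]

    def decode(self, ids):
        # Padding zeros are skipped; tokens are joined with single spaces.
        return " ".join(self._id_to_token[i] for i in ids if i != 0)

text = "Shouldst not to evil lead the sons of Greece."
encoder_sketch = SimpleTokenEncoder(set(re.findall(r"\w+", text)))

ids = encoder_sketch.encode(text)
decoded = encoder_sketch.decode(ids)
print(ids)      # [2, 5, 9, 3, 4, 8, 7, 6, 1]
print(decoded)  # Shouldst not to evil lead the sons of Greece
```

Because tokenization discards punctuation, decoding returns the words but not the original string, which is why the decoded line above has no trailing period.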

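To make the encoding part of the input pipeline, wrap it with `tf.py_function` so that it runs as a TensorFlow op, and pass the wrapper to the dataset's `map` method. The wrapper is needed because `map` runs its function in graph mode, where `.numpy()` is not available on tensors.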

In [0]:
def encode(text_tensor, label):
  encoded_text = encoder.encode(text_tensor.numpy())
  return encoded_text, label

In [0]:
def encode_map_fn(text, label):
  return tf.py_function(encode, inp=[text, label], Tout=(tf.int32, tf.int32))

In [20]:
all_encoded_data = all_labeled_data.map(encode_map_fn)
all_encoded_data

<MapDataset shapes: (<unknown>, <unknown>), types: (tf.int32, tf.int32)>

# Split the Dataset into Train and Test Batches

Next, split the dataset into training and test sets. A simple way is to use `tf.data.Dataset.take()` and `tf.data.Dataset.skip()`.
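The split can be pictured with a plain Python list. In this sketch `examples` and `TAKE` are hypothetical stand-ins for the encoded dataset and `TAKE_SIZE`. The list is shuffled once and then left alone, which mirrors the `reshuffle_each_iteration=False` setting used earlier: if the order changed on every pass, the "first N" and "everything after N" would no longer be disjoint sets of examples.

```python
import random
from itertools import islice

examples = list(range(10))          # stand-in for the encoded examples
random.Random(0).shuffle(examples)  # shuffled once; the order is then fixed

TAKE = 3  # hypothetical, far smaller than TAKE_SIZE

test_split = list(islice(examples, TAKE))         # like dataset.take(TAKE)
train_split = list(islice(examples, TAKE, None))  # like dataset.skip(TAKE)

print(test_split, train_split)
```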

The examples in a batch normally have to be the same size, but the encoder produces integer lists of different lengths because the lines contain different numbers of words. You can solve this with `tf.data.Dataset.padded_batch()`, which pads the examples in each batch to the same length.

In [0]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([-1], []))

In [0]:
test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([-1], []))

In [0]:
sample_text, sample_labels = next(iter(test_data))

In [24]:
sample_text[0], sample_labels[0], sample_text.numpy().shape

(<tf.Tensor: shape=(16,), dtype=int32, numpy=
 array([ 4180, 12498,  3435, 15171,  2424, 16678,  4564,  1538, 13556,
            0,     0,     0,     0,     0,     0,     0], dtype=int32)>,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>,
 (64, 16))
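In the sample above, nine token ids are followed by seven zeros because the longest line in that batch has 16 tokens. A minimal sketch of that padding step in plain Python, where `pad_batch` is a hypothetical helper rather than a TensorFlow function:

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[4180, 12498, 3435], [15171], [2424, 16678]]
padded = pad_batch(batch)
print(padded)
# [[4180, 12498, 3435], [15171, 0, 0], [2424, 16678, 0]]
```

The padded length is decided per batch, so different batches can have different widths. That is why the model input later declares its sequence dimension as `None`.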

You have to add 1 to the vocabulary size because padding introduces a new token (zero).

In [0]:
vocab_size += 1

# Building the Model

In [0]:
def simple_model(inputs):
  embed = tf.keras.layers.Embedding(vocab_size, 64)(inputs)

  # return_state: True
  # [combined_hidden_state, fwd_h_state, fwd_c_state, bck_h_state, bck_c_state]
  fwd = tf.keras.layers.LSTM(units=32, return_state=True, name="fw")
  bck = tf.keras.layers.LSTM(units=32, go_backwards=True, return_state=True, name="bk")
  bd = tf.keras.layers.Bidirectional(fwd, backward_layer=bck)(embed)

  x = tf.keras.layers.Dense(32, activation='elu')(bd[0])
  x = tf.keras.layers.Dense(32, activation='elu')(x)
  y = tf.keras.layers.Dense(3, activation='softmax')(x)
  return y

In [0]:
def build_model(inputs):
  # inputs: [None, None] (batch_size, padded_size), e.g. (64, 16)

  # The first layer converts the integer token ids
  # into dense embedding vectors.
  # embed: [None, None, 64], (batch_size, padded_size, embedding_vector)
  embed = tf.keras.layers.Embedding(vocab_size, 64)(inputs)

  # The next layer is a bidirectional LSTM (Long Short-Term Memory) RNN, which lets
  # the model learn each word in the context of the words around it.
  # The forward pass sees the words that came before a position and
  # the backward pass sees the words that come after it.
  # [None, None, 64*2] (batch_size, padded_size, output_dim)
  fwd = tf.keras.layers.LSTM(units=64, return_sequences=True)
  bck = tf.keras.layers.LSTM(units=64, go_backwards=True, return_sequences=True)
  x = tf.keras.layers.Bidirectional(fwd, backward_layer=bck)(embed)

  # add a TimeDistributed layer
  # [None, None, 64*2] -> [None, None, 64]
  x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(64), input_shape=(None, 128)
      )(x)

  # automatically generate a backward LSTM layer
  # x: [None, 32*2]  
  x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=32, return_sequences=False))(x)

  x = tf.keras.layers.Dense(units=64, activation='elu')(x)
  x = tf.keras.layers.Dense(units=64, activation='elu')(x)

  # The final layer is the categorical result.
  y = tf.keras.layers.Dense(units=3, activation='softmax')(x)
  return y

When an input has a variable number of time steps, you can declare its shape as `(None,)` or `[None]`.

In [0]:
def build_compile_model():
  inputs = tf.keras.layers.Input(shape=(None,))  # variable-length sequence of token ids
  outputs = build_model(inputs)
  #outputs = simple_model(inputs)
  model = tf.keras.Model(inputs, outputs)

  model.compile(loss='sparse_categorical_crossentropy', 
                optimizer='adam', 
                metrics=["accuracy"])
  return model
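The model is compiled with `sparse_categorical_crossentropy` because the labels are plain integers (0, 1, 2) rather than one-hot vectors. For a single example, that loss is the negative log of the probability the softmax output assigns to the true class. A minimal sketch in plain Python, assuming illustrative probabilities and ignoring the numerical clipping Keras applies; `sparse_ce` is a hypothetical helper name:

```python
import math

def sparse_ce(label, probabilities):
    """Per-example loss: negative log-probability of the true class."""
    return -math.log(probabilities[label])

# Illustrative softmax output for one line of text (sums to roughly 1).
probs = [0.19283679, 0.80536532, 0.0017978031]

loss_if_label_1 = sparse_ce(1, probs)  # true class got high probability
loss_if_label_2 = sparse_ce(2, probs)  # true class got almost no probability
print(loss_if_label_1, loss_if_label_2)
```

The loss is small when the model puts most of its probability on the correct translator and grows quickly as that probability approaches zero.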

# Train the Model

In [0]:
model = build_compile_model()

In [30]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 64)          1099456   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         66048     
_________________________________________________________________
time_distributed (TimeDistri (None, None, 64)          8256      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160  

In [0]:
!rm -rf ./log ./log.zip
tfb = tf.keras.callbacks.TensorBoard('./log', write_graph=True)

In [32]:
model.fit(train_data, epochs=2, validation_data=test_data, callbacks=[tfb])

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f8870050860>

In [33]:
loss, acc = model.evaluate(test_data, verbose=2)
print("Loss: {}, Acc: {}".format(loss, acc))

Loss: 0.3610934236004383, Acc: 0.8378000259399414


In [34]:
model.inputs, model.input_shape

([<tf.Tensor 'input_1:0' shape=(None, None) dtype=float32>], (None, None))

In [35]:
for w in model.weights:
  print(w.name, w.numpy().shape)

embedding/embeddings:0 (17179, 64)
bidirectional/forward_lstm/kernel:0 (64, 256)
bidirectional/forward_lstm/recurrent_kernel:0 (64, 256)
bidirectional/forward_lstm/bias:0 (256,)
bidirectional/backward_lstm_1/kernel:0 (64, 256)
bidirectional/backward_lstm_1/recurrent_kernel:0 (64, 256)
bidirectional/backward_lstm_1/bias:0 (256,)
time_distributed/kernel:0 (128, 64)
time_distributed/bias:0 (64,)
bidirectional_1/forward_lstm_2/kernel:0 (64, 128)
bidirectional_1/forward_lstm_2/recurrent_kernel:0 (32, 128)
bidirectional_1/forward_lstm_2/bias:0 (128,)
bidirectional_1/backward_lstm_2/kernel:0 (64, 128)
bidirectional_1/backward_lstm_2/recurrent_kernel:0 (32, 128)
bidirectional_1/backward_lstm_2/bias:0 (128,)
dense_1/kernel:0 (64, 64)
dense_1/bias:0 (64,)
dense_2/kernel:0 (64, 64)
dense_2/bias:0 (64,)
dense_3/kernel:0 (64, 3)
dense_3/bias:0 (3,)
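The weight shapes above account for the parameter counts reported by `model.summary()`. The arithmetic can be checked with the standard formulas for LSTM and dense layers. The helper names are hypothetical, and the layer sizes are read from the printed shapes.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias:
    # kernel (input_dim, 4*units), recurrent_kernel (units, 4*units), bias (4*units,)
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # kernel (input_dim, units) plus bias (units,)
    return input_dim * units + units

vocab_rows = 17178 + 1  # vocabulary size plus the padding id 0

counts = {
    "embedding": vocab_rows * 64,
    "bidirectional": 2 * lstm_params(64, 64),    # forward + backward LSTM
    "time_distributed": dense_params(128, 64),
    "bidirectional_1": 2 * lstm_params(64, 32),  # forward + backward LSTM
    "dense_1": dense_params(64, 64),
    "dense_2": dense_params(64, 64),
    "dense_3": dense_params(64, 3),              # from the (64, 3) kernel and (3,) bias
}

for name, count in counts.items():
    print(name, count)
```

The first six values are 1099456, 66048, 8256, 24832, 4160, and 4160, the same as in the summary.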


# Prediction

In [0]:
pred_data = next(iter(test_data))

In [37]:
pred_data[1][:10]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 0, 2, 1, 0, 1, 1, 0, 0], dtype=int32)>

In [38]:
res = model.predict(pred_data[0][:10])
res, np.argmax(res, axis=-1)

(array([[1.9283679e-01, 8.0536532e-01, 1.7978031e-03],
        [4.2089685e-03, 9.9565923e-01, 1.3192077e-04],
        [6.7608112e-01, 3.2349932e-01, 4.1960593e-04],
        [1.5656850e-03, 6.1565783e-02, 9.3686849e-01],
        [6.2814735e-02, 9.3648571e-01, 6.9956860e-04],
        [6.7881286e-01, 3.2110366e-01, 8.3420586e-05],
        [9.2930514e-03, 9.8990750e-01, 7.9939817e-04],
        [1.7776463e-02, 9.8162490e-01, 5.9868390e-04],
        [9.7273493e-01, 2.7182616e-02, 8.2438797e-05],
        [9.1961706e-01, 7.8422263e-02, 1.9606927e-03]], dtype=float32),
 array([1, 1, 0, 2, 1, 0, 1, 1, 0, 0]))
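Each row of the prediction holds one probability per translator, and the predicted label is the index of the largest value. The sketch below does the same in plain Python and maps the index back to a name, assuming labels follow the order of `FILE_NAMES` (0 is Cowper, 1 is Derby, 2 is Butler). `TRANSLATORS` and `predict_translator` are hypothetical names, and the rows are copied from the output above.

```python
TRANSLATORS = ["Cowper", "Derby", "Butler"]  # label i corresponds to FILE_NAMES[i]

def predict_translator(probabilities):
    """Return (label, name) for the class with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, TRANSLATORS[best]

rows = [
    [1.9283679e-01, 8.0536532e-01, 1.7978031e-03],
    [6.7608112e-01, 3.2349932e-01, 4.1960593e-04],
    [1.5656850e-03, 6.1565783e-02, 9.3686849e-01],
]

results = [predict_translator(row) for row in rows]
print(results)  # [(1, 'Derby'), (0, 'Cowper'), (2, 'Butler')]
```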

In [39]:
for_pd = pred_data[0][0].numpy()
for_pd, for_pd.shape

(array([ 4180, 12498,  3435, 15171,  2424, 16678,  4564,  1538, 13556,
            0,     0,     0,     0,     0,     0,     0], dtype=int32), (16,))

In [40]:
model.predict(np.expand_dims(for_pd, axis=0))

array([[0.19283688, 0.8053653 , 0.0017978 ]], dtype=float32)