Learn how to use:
- `tf.data.TextLineDataset`
- `tfds.features.text.Tokenizer`
- `tfds.features.text.TokenTextEncoder` + `tf.py_function` + `map`

## Setup

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import os

In [2]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

In [3]:
for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)
    
parent_dir = os.path.dirname(text_dir)

parent_dir

'/home/nxhuy/.keras/datasets'

## Load text into datasets

Iterate through the files, loading each one into its own dataset.

In [4]:
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

Using `tf.data.TextLineDataset`

In [5]:
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)
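
A side note on the `lambda ex: labeler(ex, i)` pattern above: a plain Python lambda looks up `i` when it is called, not when it is defined. `Dataset.map` traces its function at the moment `map` is called, so each dataset here should pick up the index that was current in its loop iteration. The sketch below uses plain Python lists instead of `tf.data` (an assumption made only for illustration) to show what happens when the call is deferred, and how binding `i` as a default argument avoids the surprise.

```python
def labeler(example, index):
    # Pair an example with the index of the file it came from.
    return example, index

# Lambdas that are stored now and called later all see the final value of i.
late = []
for i in range(3):
    late.append(lambda ex: labeler(ex, i))
late_labels = [fn("line")[1] for fn in late]

# Binding i as a default argument freezes its value for each lambda.
bound = []
for i in range(3):
    bound.append(lambda ex, i=i: labeler(ex, i))
bound_labels = [fn("line")[1] for fn in bound]

print(late_labels)
print(bound_labels)
```

Writing `lambda ex, i=i: labeler(ex, i)` is a defensive habit when the same pattern is reused outside of `Dataset.map`.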

In [6]:
for text, label in labeled_data_sets[0].take(1):
    print(text.numpy())
    print(label.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
0


In [7]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [8]:
all_labeled_data = labeled_data_sets[0]
for labeled_data in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_data)
    
all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

In [9]:
for ex in all_labeled_data.take(5):
    print(ex)

(<tf.Tensor: shape=(), dtype=string, numpy=b'Disabled sank; he fell supine, and bore'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Shall want performance. But Olympian Jove!'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'One with hot current flows, and from beneath,'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'as of a great multitude.'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"The wall destroy'd, o'er all the shore he spread">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)


## Encode text lines as numbers

Machine learning models **work on numbers, not words**, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

### Build vocabulary

Steps:
- Iterate over each example's `numpy` value.
- Use `tfds.features.text.Tokenizer` to split it into tokens.
- Collect these tokens into a Python set, to remove duplicates.
- Get the size of the vocabulary for later use.
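
The steps above can be sketched without TensorFlow. This is only an illustration: the `tokenize` helper is a hypothetical stand-in that splits on non-word characters with a regular expression, and the three sample lines are made up from phrases in this notebook. It is not the `tfds` tokenizer itself.

```python
import re

def tokenize(line):
    # Hypothetical stand-in for the tfds tokenizer: keep runs of word characters.
    return re.findall(r"\w+", line)

lines = [
    "Disabled sank; he fell supine, and bore",
    "as of a great multitude.",
    "he fell and bore a great multitude",
]

# Collect tokens into a set so that duplicates are dropped automatically.
vocabulary = set()
for line in lines:
    vocabulary.update(tokenize(line))

vocab_size = len(vocabulary)
print(sorted(vocabulary))
print(vocab_size)
```

The third line adds no new words, so the set does not grow after the second line.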

In [10]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)
    
len(vocabulary_set)

17178

In [11]:
vocab_size = len(vocabulary_set)

### Encode examples

Using `tfds.features.text.TokenTextEncoder` + `tf.py_function` + `map`.

In [12]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [13]:
example_text = next(iter(all_labeled_data))[0].numpy()
example_text

b'Disabled sank; he fell supine, and bore'

In [14]:
encoded_text = encoder.encode(example_text)
encoded_text

[3237, 3350, 1726, 15680, 1310, 9692, 12743]
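
To make the idea of encoding concrete, here is a minimal pure-Python sketch of a token-to-id mapping. The helper names `build_encoder` and `encode` are hypothetical, the vocabulary is sorted only to make the ids reproducible, and ids are assumed to start at 1 so that 0 stays free for padding (which matches the padded output shown later in this notebook). The real encoder assigns different ids.

```python
import re

def build_encoder(vocabulary):
    # Map each token to an integer id, starting at 1 (0 is kept for padding).
    return {token: idx for idx, token in enumerate(sorted(vocabulary), start=1)}

def encode(line, token_to_id):
    # Split the line into word tokens and look up each token's id.
    # A token missing from the vocabulary raises KeyError in this sketch.
    return [token_to_id[tok] for tok in re.findall(r"\w+", line)]

vocabulary = {"Disabled", "sank", "he", "fell", "supine", "and", "bore"}
token_to_id = build_encoder(vocabulary)

encoded = encode("Disabled sank; he fell supine, and bore", token_to_id)
print(encoded)
```

Each of the seven words becomes one integer, just as the seven-word line above becomes a list of seven ids.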

<div class="alert alert-info">
    Now run the <b>encoder on the dataset</b> by wrapping it in <code>tf.py_function</code> and passing that to the dataset's <code>map</code> method.
</div>

In [18]:
encoder.encode(b'Disabled sank; he fell supine, and bore')

[3237, 3350, 1726, 15680, 1310, 9692, 12743]

In [24]:
def my_encode(txt, label):
    et = encoder.encode(txt.numpy())
    return et, label

In [25]:
# Test encode()
for t, l in all_labeled_data.take(5):
    e, _l = my_encode(t, l)
    print(e, _l)

[3237, 3350, 1726, 15680, 1310, 9692, 12743] tf.Tensor(0, shape=(), dtype=int64)
[17148, 4449, 2114, 6000, 10949, 5184] tf.Tensor(0, shape=(), dtype=int64)
[1070, 11464, 15133, 707, 1740, 9692, 10946, 1226] tf.Tensor(1, shape=(), dtype=int64)
[13985, 1143, 7681, 6488, 2804] tf.Tensor(2, shape=(), dtype=int64)
[2280, 16991, 7626, 6972, 8621, 4823, 13780, 15824, 6577, 1726, 846] tf.Tensor(1, shape=(), dtype=int64)


You want to use `Dataset.map` to apply this function to each element of the dataset. `Dataset.map` runs in graph mode.

- Graph tensors do not have a value.
- In graph mode you can only use TensorFlow Ops and functions.


So we **can't `.map`** this function **directly**; we need to wrap it in a `tf.py_function`. `tf.py_function` passes regular tensors, which have a value and a `.numpy()` method to access it, to the wrapped Python function.

In [28]:
def encode_map_fn(text, label):
    # tf.py_function does not set the shape of the returned tensors.
    encoded_text, label = tf.py_function(my_encode, 
                                         inp=[text, label], 
                                         Tout=(tf.int64, tf.int64)
                                        )
    # A `tf.data.Dataset` works best if all components have a shape set,
    # so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])
    
    return encoded_text, label

In [29]:
all_encoded_data = all_labeled_data.map(encode_map_fn)

In [30]:
for ex in all_encoded_data.take(5):
    print(ex)

(<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 3237,  3350,  1726, 15680,  1310,  9692, 12743])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(6,), dtype=int64, numpy=array([17148,  4449,  2114,  6000, 10949,  5184])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(8,), dtype=int64, numpy=array([ 1070, 11464, 15133,   707,  1740,  9692, 10946,  1226])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(5,), dtype=int64, numpy=array([13985,  1143,  7681,  6488,  2804])>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(11,), dtype=int64, numpy=
array([ 2280, 16991,  7626,  6972,  8621,  4823, 13780, 15824,  6577,
        1726,   846])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)


## Split the dataset into test and train batches

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.

Before being passed into the model, the datasets need to be batched. Typically, **the examples inside a batch need to be the same size and shape**. The examples in these datasets are not all the same size, however: each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.
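
What padding does to a batch can be sketched in plain Python. The `pad_batch` helper is hypothetical and only mimics the effect described above: every sequence is extended with zeros to the length of the longest sequence in the same batch. The ids are taken from the encoded examples printed earlier.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the longest length in this batch.
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [
    [3237, 3350, 1726, 15680, 1310, 9692, 12743],
    [17148, 4449, 2114, 6000, 10949, 5184],
    [13985, 1143, 7681, 6488, 2804],
]

padded = pad_batch(batch)
for row in padded:
    print(row)
```

Because the target length is computed per batch, different batches can end up with different widths.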

In [31]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)
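
The `take`/`skip` split in the cell above can be pictured with plain Python iterators. This is a rough analogy using `itertools.islice`, not the `tf.data` implementation, and the helper names and the tiny `TAKE_SIZE` are chosen only for illustration.

```python
from itertools import islice

def take(iterable, n):
    # Yield only the first n elements.
    return islice(iterable, n)

def skip(iterable, n):
    # Yield everything after the first n elements.
    return islice(iterable, n, None)

examples = list(range(10))  # stand-in for the encoded examples
TAKE_SIZE = 3               # illustrative value, much smaller than in the notebook

test_examples = list(take(examples, TAKE_SIZE))
train_examples = list(skip(examples, TAKE_SIZE))

print(test_examples)
print(train_examples)
```

The two parts stay disjoint only if the underlying order is the same every time it is iterated, which is presumably why the earlier shuffle was created with `reshuffle_each_iteration=False`.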

In [32]:
for ex in train_data:
    print(ex[0].shape, ex[1].shape)

(64, 14) (64,)
(64, 16) (64,)
(64, 15) (64,)
(64, 16) (64,)
(64, 16) (64,)
...

In [33]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

(<tf.Tensor: shape=(15,), dtype=int64, numpy=
 array([ 3237,  3350,  1726, 15680,  1310,  9692, 12743,     0,     0,
            0,     0,     0,     0,     0,     0])>,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

Since we have introduced a **new token encoding (the zero used for padding)**, the vocabulary size has increased by one.

In [39]:
vocab_size += 1
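
A small sketch of why the extra slot matters: an embedding table needs one row for every id that can appear in the input, including the padding id 0. The toy table below is a plain Python list standing in for an embedding layer, and the vocabulary size of 5 is a made-up value.

```python
vocab_tokens = 5   # hypothetical number of real tokens, with ids 1..5
PAD_ID = 0         # id introduced by padding

# Every id that can appear in a padded example.
valid_ids = [PAD_ID] + list(range(1, vocab_tokens + 1))

# One row per id, so the table needs vocab_tokens + 1 rows.
embedding_rows = vocab_tokens + 1
table = [[float(row)] * 2 for row in range(embedding_rows)]

def lookup(ids):
    # Fetch the row for each id; an id >= len(table) would raise IndexError.
    return [table[i] for i in ids]

padded_example = [3, 5, 1, 0, 0]
vectors = lookup(padded_example)
print(vectors)
```

With only `vocab_tokens` rows, the lookup for id 5 would fall outside the table.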

## Build the model

In [40]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3)
])

In [41]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 64)          1099456   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               66048     
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_4 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_5 (Dense)              (None, 3)                 195       
=================================================================
Total params: 1,178,115
Trainable params: 1,178,115
Non-trainable params: 0
_________________________________________________________________


In [42]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
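
The final `Dense(3)` layer has no activation, so the model outputs raw scores (logits) for the three labels, which is why the loss is created with `from_logits=True`. As a rough illustration of what that loss computes for a single example, here is a pure-Python sketch. The function name is hypothetical, and the library version also averages over the batch.

```python
import math

def sparse_crossentropy_from_logits(logits, label):
    # Cross-entropy = log-sum-exp of the logits minus the logit of the true label.
    # Subtracting the max logit first keeps exp() numerically stable.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

logits = [2.0, 0.5, -1.0]  # made-up scores for the three labels

loss_if_label_0 = sparse_crossentropy_from_logits(logits, 0)
loss_if_label_2 = sparse_crossentropy_from_logits(logits, 2)
print(loss_if_label_0, loss_if_label_2)
```

The loss is small when the true label has the highest score and grows as that label's score falls behind the others.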

## Train the model

In [43]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f163a149fd0>

In [44]:
eval_loss, eval_acc = model.evaluate(test_data)
print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))


Eval loss: 0.371, Eval accuracy: 0.839


## References
- https://www.tensorflow.org/tutorials/load_data/text