https://www.tensorflow.org/beta/tutorials/load_data/text

In this tutorial, we'll use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

The three translations are by:

* William Cowper

* Edward, Earl of Derby

* Samuel Butler

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import os


In [2]:
# Needed in TensorFlow 1.x; eager execution is on by default in 2.x.
tf.enable_eager_execution()

# Load text into datasets

In [3]:
parent_dir = 'data'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

def labeler(example, index):
    return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i)) # line => (line, label)
    labeled_data_sets.append(labeled_dataset)

Combine these labeled datasets into a single dataset, and shuffle it.

In [4]:
BUFFER_SIZE = 50000 # buffer size for shuffle() function
BATCH_SIZE = 64
TAKE_SIZE = 5000 # test set size

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
    
all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)
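To see why the buffer size matters, here is a plain-Python sketch of buffer-based shuffling. It is an approximation of how `shuffle()` is documented to behave, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper name. A buffer at least as large as the dataset gives a full shuffle; a smaller one only mixes elements locally.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and store the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer

# Hypothetical (line, label) pairs standing in for the labeled dataset.
labeled = [("line %d" % n, n % 3) for n in range(10)]
shuffled = list(buffered_shuffle(labeled, buffer_size=4, seed=0))
```

Every element comes out exactly once; only the order changes.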

In [5]:
for ex in all_labeled_data.take(5):
    print(ex)

(<tf.Tensor: id=74, shape=(), dtype=string, numpy=b"While others watch'd by turns, nor were the fires">, <tf.Tensor: id=75, shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: id=78, shape=(), dtype=string, numpy=b'As when the south wind spreads a curtain of mist upon the mountain'>, <tf.Tensor: id=79, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=82, shape=(), dtype=string, numpy=b"heaven's protection, although I had thought his boasting was idle. Let">, <tf.Tensor: id=83, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=86, shape=(), dtype=string, numpy=b'Thus did he speak and the others all of them applauded his saying, and'>, <tf.Tensor: id=87, shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: id=90, shape=(), dtype=string, numpy=b'Excused attendance on the King at Troy;'>, <tf.Tensor: id=91, shape=(), dtype=int64, numpy=0>)


# Encode text lines as numbers

* Build vocabulary

In [6]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy()) # split the line string into words (tokens)
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178
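The vocabulary-building loop can be sketched without TensorFlow. The regular expression below is an assumption that approximates the tokenizer by treating runs of letters and digits as tokens; it is not the `tfds` implementation. The two sample lines are taken from the output shown earlier.

```python
import re

def tokenize(line):
    # Assumption: a token is a run of ASCII letters or digits, so punctuation
    # and whitespace act as separators ("watch'd" becomes "watch" and "d").
    return re.findall(r"[A-Za-z0-9]+", line)

lines = [
    "While others watch'd by turns, nor were the fires",
    "Excused attendance on the King at Troy;",
]

vocabulary = set()
for line in lines:
    # A set keeps one copy of each token, however often it occurs.
    vocabulary.update(tokenize(line))

vocab_size = len(vocabulary)
```

The first line splits into ten tokens, which matches the ten ids the encoder produces for it in the next step.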

* Encode examples

In [7]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [8]:
# test encoder
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)
encoded_example = encoder.encode(example_text)
print(encoded_example)

b"While others watch'd by turns, nor were the fires"
[7810, 5131, 1158, 3539, 4816, 2584, 11485, 11640, 9994, 2448]
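Conceptually, encoding is a dictionary lookup from token to integer id. The sketch below is a minimal stand-in, not the `TokenTextEncoder` implementation: `build_encoder` is a hypothetical helper, ids start at 1 so that 0 stays free for padding, and unknown tokens map to a single extra id.

```python
def build_encoder(vocabulary):
    # Sort so the mapping is reproducible; start at 1 to keep 0 for padding.
    token_to_id = {tok: i for i, tok in enumerate(sorted(vocabulary), start=1)}
    # Assumption: one shared id for every out-of-vocabulary token.
    oov_id = len(token_to_id) + 1

    def encode(tokens):
        return [token_to_id.get(tok, oov_id) for tok in tokens]

    return encode

vocab = {"the", "fires", "were", "nor"}
encode = build_encoder(vocab)
ids = encode(["nor", "were", "the", "fires", "unknown"])
```

With this mapping, `"fires"`, `"nor"`, `"the"`, and `"were"` receive ids 1 through 4 in alphabetical order, and `"unknown"` falls into the out-of-vocabulary id 5.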


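`encoder.encode()` is ordinary Python code that needs the string value from `.numpy()`, so it cannot run directly inside `Dataset.map()`, which executes its function in graph mode. Wrap the encoder in `tf.py_function()` and pass that wrapper to the dataset's `map()` method.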

In [9]:
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    return tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))

all_encoded_data = all_labeled_data.map(encode_map_fn)

# Split the dataset into train and test

Use `tf.data.Dataset.take()` and `tf.data.Dataset.skip()` to create a small test dataset and a larger training set.

In [10]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE) # skip the first TAKE_SIZE elements in the dataset
test_data = all_encoded_data.take(TAKE_SIZE) # get the first TAKE_SIZE elements in the dataset
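The split can be pictured with `itertools.islice`. This is a plain-Python analogy, with hypothetical `take` and `skip` helpers and a toy list in place of the dataset. The two splits stay disjoint only if both passes see the elements in the same order, which is why the earlier shuffle uses `reshuffle_each_iteration=False`.

```python
from itertools import islice

def take(iterable, n):
    # First n elements.
    return islice(iterable, n)

def skip(iterable, n):
    # Everything after the first n elements.
    return islice(iterable, n, None)

examples = list(range(10))  # stand-in for the encoded examples
TAKE = 3                    # hypothetical test-set size for this toy data

test_split = list(take(examples, TAKE))
train_split = list(skip(examples, TAKE))
```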

# Use `tf.data.Dataset.padded_batch()` to batch the dataset and pad the sequences

The examples in these datasets are not all the same size, because each line of text has a different number of words. So use `tf.data.Dataset.padded_batch()` (instead of `tf.data.Dataset.batch()`) to pad the examples in each batch to the same size.

In [11]:
train_data = train_data.padded_batch(BATCH_SIZE,
                                     padded_shapes=([-1], []))
test_data = test_data.padded_batch(BATCH_SIZE,
                                   padded_shapes=([-1], []))

<b>Note</b>: a `None` or `-1` in `padded_shapes` means that dimension is padded to the length of the longest sequence in each batch, so different batches can have different widths.

To illustrate:

In [12]:
sample_text, sample_labels = next(iter(test_data))
sample_text[0:5]


<tf.Tensor: id=149240, shape=(5, 14), dtype=int64, numpy=
array([[ 7810,  5131,  1158,  3539,  4816,  2584, 11485, 11640,  9994,
         2448,     0,     0,     0,     0],
       [ 3472,  7287,  9994, 16698, 14405,  2458, 11968,  6400,  3098,
         3346, 12671,  9994,  6552,     0],
       [  159,  9477,  6879,  3982,  2820,  4529,  1942,  3032,  6750,
         3513, 11498, 17105,     0,     0],
       [15844,   388,  4639, 14775, 15987,  9994,  5131, 12635,  3098,
        10282,  7461,  3032,  5196, 15987],
       [ 4075,  6681,  8558,  9994, 11569, 10176,   206,     0,     0,
            0,     0,     0,     0,     0]], dtype=int64)>
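The per-batch padding in the note can be sketched in plain Python. `padded_batches` is a hypothetical helper that works on lists of ids; it illustrates the idea and is not how `padded_batch()` is implemented.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its own max length."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)  # longest sequence in this batch
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

seqs = [[5, 3], [7], [1, 2, 3, 4], [9, 9, 9]]
batches = list(padded_batches(seqs, batch_size=2))
# batches[0] is 2 wide, batches[1] is 4 wide
```

The first batch is padded only to width 2, while the second is padded to width 4.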

Padding introduces a new token id (the zero used as filler), so the vocabulary size increases by one.

In [13]:
vocab_size += 1
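The extra slot matters because an embedding layer is, in effect, a table with one row per id. The following sketch uses hypothetical sizes and a hypothetical `make_embedding_table` helper, and it ignores out-of-vocabulary ids. It shows that ids 0 through 4 need five rows.

```python
import random

def make_embedding_table(num_ids, dim, seed=0):
    # One row of small random values per id; a stand-in for trainable weights.
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_ids)]

num_tokens = 4  # hypothetical vocabulary whose tokens use ids 1..4
table = make_embedding_table(num_tokens + 1, dim=3)  # +1 row for padding id 0

padded_ids = [2, 4, 0, 0]  # a padded sequence
vectors = [table[i] for i in padded_ids]

# A table with only num_tokens rows has no row for id 4.
too_small = make_embedding_table(num_tokens, dim=3)
try:
    too_small[4]
    lookup_failed = False
except IndexError:
    lookup_failed = True
```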

# Build the model

In [14]:
model = tf.keras.Sequential()

model.add(tf.keras.layers.Embedding(vocab_size, 64))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model

In [15]:
model.fit(train_data, epochs=3, validation_data=test_data, verbose=2)

Epoch 1/3



697/697 - 60s - loss: 0.5235 - acc: 0.7383 - val_loss: 0.0000e+00 - val_acc: 0.0000e+00
Epoch 2/3
697/697 - 50s - loss: 0.2913 - acc: 0.8711 - val_loss: 0.3723 - val_acc: 0.8358
Epoch 3/3
697/697 - 53s - loss: 0.2123 - acc: 0.9086 - val_loss: 0.4115 - val_acc: 0.8320


<tensorflow.python.keras.callbacks.History at 0x1df2be60fd0>