# Loading text

At the time of writing (18/07/20), `tensorflow_core.keras.preprocessing.text_dataset_from_directory` from `tf-nightly` didn't seem to be in the module as advertised in [this tutorial](https://www.tensorflow.org/tutorials/keras/text_classification). I'm going through [this other tutorial here](https://www.tensorflow.org/tutorials/load_data/text) in the hopes that I will learn how to load text from disk.

`TextLineDataset` is designed to create a dataset from a text file, in which each example is a line of text from the original file.
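As a rough mental model, here is a plain-Python analogue of "one example per line". This is my own sketch, not how TensorFlow implements `TextLineDataset`; the helper `iter_lines` and the sample file are made up for illustration.

```python
import os
import tempfile


def iter_lines(path):
    """Yield each line of a text file without its trailing newline."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Hypothetical two-line file, written only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("first line of text\nsecond line of text\n")
    examples = list(iter_lines(sample_path))

print(examples)
```

Each element of `examples` plays the role of one dataset example.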

In this tutorial we will use three different English translations of Homer's Iliad and train a model to identify the translator given a single line of text.

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import os

We have three translations: Cowper, Derby, and Butler.

In [2]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

In [3]:
for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


In [5]:
parent_dir = os.path.dirname(text_dir)
parent_dir

'/home/juvid/.keras/datasets'

## Load text into datasets

Iterate through the files, loading each one into its own dataset.

Each example needs to be individually labeled, so use `tf.data.Dataset.map` to apply a labeler function to each one. This will iterate over every example in the dataset, returning (`example`, `label`) pairs.

In [6]:
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

In [8]:
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

Combine these datasets into a single dataset and shuffle it.

In [9]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [12]:
all_labelled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labelled_data = all_labelled_data.concatenate(labeled_dataset)

all_labelled_data = all_labelled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)
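To check my understanding, the label / concatenate / shuffle-once steps can be mimicked in plain Python. This is only an analogue under my own assumptions: the `toy_files` contents are made up, and a seeded `random.Random` stands in for a shuffle whose order stays fixed between passes.

```python
import random

# Made-up stand-ins for the lines of each translation file.
toy_files = {
    "cowper.txt": ["cowper line 1", "cowper line 2"],
    "derby.txt": ["derby line 1", "derby line 2"],
    "butler.txt": ["butler line 1", "butler line 2"],
}
file_names = ["cowper.txt", "derby.txt", "butler.txt"]

# Label every line with the index of the file it came from, then concatenate.
labelled = []
for index, name in enumerate(file_names):
    labelled.extend((line, index) for line in toy_files[name])

# Shuffle once with a fixed seed so the order is the same every time it is read.
random.Random(0).shuffle(labelled)

for pair in labelled:
    print(pair)
```

After the shuffle every line still carries the label of its source file, which is the property the model will rely on.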

Look at some (`example`, `label`) pairs

In [13]:
for ex in all_labelled_data.take(5):
    print(ex)

(<tf.Tensor: shape=(), dtype=string, numpy=b'speech.'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'His price, and, at great cost, E\xc3\xabtion'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'As Agamemnon in the van appears,'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'A nymph and swain soft parley mutual hold,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Drive hither from the city fatted sheep'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)


## Encode text lines as numbers

We need to convert the lines of the poem into lists of numbers by building a vocabulary. We will map each unique word to a unique integer.

### Build vocabulary

First, build a vocabulary by tokenizing the text into a collection of individual unique words.

1. Iterate over each example's `numpy` value
2. Use `tfds.features.text.Tokenizer` to split it into tokens
3. Collect these tokens into a Python set, to remove duplicates
4. Get the size of the vocabulary for later use

In [14]:
tokenizer = tfds.features.text.Tokenizer()
vocabulary_set = set()
for text_tensor, _ in all_labelled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

17178

In [17]:
'the' in vocabulary_set

True

In [18]:
'The' in vocabulary_set

True

Note that the vocabulary is case-sensitive, so "the" and "The" are counted as two separate words. This is either a feature or a bug, depending on how you want to look at it -- it's not clear to me which will work better.
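A small plain-Python sketch of the effect, assuming a tokenizer that splits on non-word characters (the regex here is my approximation, not the exact `tfds` tokenizer, and the sample lines are made up):

```python
import re


def tokenize(text):
    """Split text into runs of word characters (approximation of the real tokenizer)."""
    return re.findall(r"\w+", text)


sample_lines = ["The ships and the men", "the wrath of Achilles"]  # made-up lines

cased_vocab = set()
lowercased_vocab = set()
for line in sample_lines:
    tokens = tokenize(line)
    cased_vocab.update(tokens)
    lowercased_vocab.update(token.lower() for token in tokens)

print(len(cased_vocab), len(lowercased_vocab))
```

Lowercasing shrinks the vocabulary, at the cost of throwing away whatever signal capitalisation carries (for example, a word at the start of a verse line).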

### Encode examples

Create an encoder by passing the `vocabulary_set` into `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes in a string of text and returns a list of integers.

In [20]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

In [21]:
encoder

<TokenTextEncoder vocab_size=17180>

Try this on a single line to see what the output looks like:

In [26]:
example_text = next(iter(all_labelled_data))[0].numpy()
example_text

b'speech.'

In [36]:
encoder.encode(example_text)

[14629]
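For intuition, a token-to-integer encoder can be sketched in a few lines of plain Python. This is my own simplified stand-in, not the `TokenTextEncoder` implementation: `make_encoder`, the toy vocabulary, and the choice of one shared id for unknown words are all assumptions. Ids start at 1 so that 0 stays free for padding.

```python
import re


def make_encoder(vocabulary):
    """Return a function mapping text to a list of integer ids (0 is left free for padding)."""
    token_to_id = {token: index for index, token in enumerate(sorted(vocabulary), start=1)}
    unknown_id = len(token_to_id) + 1  # single id shared by all unknown tokens

    def encode_text(text):
        return [token_to_id.get(token, unknown_id) for token in re.findall(r"\w+", text)]

    return encode_text


toy_vocabulary = {"of", "sheep", "the", "wrath"}  # made-up vocabulary
encode_text = make_encoder(toy_vocabulary)
encoded = encode_text("the wrath of Zeus")
print(encoded)
```

Reserving one id for padding and one for unknown words would be consistent with the encoder above reporting a `vocab_size` two larger than the vocabulary set, though I haven't verified that this is the reason.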

Now run the encoder on the entire dataset by wrapping it in `tf.py_function` and passing that to the dataset's `map` method.

In [37]:
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

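This function can't be `.map`ed directly, because `Dataset.map` runs the function in graph mode, where tensors have no `.numpy()` method. It needs to be wrapped in `tf.py_function`, which passes regular (eager) tensors with a value and a `.numpy()` method to the wrapped Python function `encode`.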

In [38]:
def encode_map_fn(text, label):
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64)
                                        )
    # tf.py_function doesn't set the shape of the returned tensor automatically.
    # tf.data.Datasets work best if all components have a shape
    # so let's set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])
    
    return encoded_text, label

In [39]:
all_encoded_data = all_labelled_data.map(encode_map_fn)

In [41]:
all_encoded_data

<MapDataset shapes: ((None,), ()), types: (tf.int64, tf.int64)>

In [40]:
for ex in all_encoded_data.take(5):
    print(ex)

(<tf.Tensor: shape=(1,), dtype=int64, numpy=array([14629])>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(7,), dtype=int64, numpy=array([12502,  3338,    88,  6887, 14932,  2560,  5982])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(6,), dtype=int64, numpy=array([13022, 16702,  3021, 15794, 10886, 13756])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(8,), dtype=int64, numpy=array([11803, 12253,    88, 10439, 16876, 13194,  1342,  8952])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 5841,  4038,  1918, 15794,  4828,  3757,  3344])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)


So we've successfully built a vocabulary and encoded the text.

## Split the dataset into test and train batches

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.

Before being passed into the model, the datasets need to be batched. The examples in a batch typically need to have the same size and shape, but the examples in these datasets do not: each line has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples in each batch to the same length.
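What padding a batch means can be sketched in plain Python. This is an analogue I wrote for illustration (the helper `padded_batches` and the toy sequences are made up), assuming each batch is padded with zeros to the length of its longest member:

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest sequence."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(sequence) for sequence in batch)
        batches.append([sequence + [pad_value] * (width - len(sequence)) for sequence in batch])
    return batches


encoded_lines = [[5], [7, 2, 9], [4, 4], [1, 2, 3, 4]]  # made-up encoded lines
batches = padded_batches(encoded_lines, batch_size=2)
for batch in batches:
    print(batch)
```

Note that the padded width can differ from batch to batch, since it only depends on the longest line within each batch.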

In [55]:
TAKE_SIZE, BATCH_SIZE

(5000, 64)

So `test_data` will have 5000 examples. The rest will be in training.
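The take/skip split can be pictured with `itertools.islice`. This is a plain-Python analogue with made-up data, not the `tf.data` implementation. It only gives a clean partition if the underlying order is identical on every pass, which I assume is why the earlier shuffle was given `reshuffle_each_iteration=False`.

```python
from itertools import islice


def take(items, count):
    """Return the first `count` items."""
    return list(islice(items, count))


def skip(items, count):
    """Return everything after the first `count` items."""
    return list(islice(items, count, None))


toy_examples = list(range(10))  # stand-in for an already shuffled dataset
toy_take_size = 3

toy_test = take(toy_examples, toy_take_size)
toy_train = skip(toy_examples, toy_take_size)
print(toy_test, toy_train)
```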

In [42]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)

In [43]:
train_data = train_data.padded_batch(BATCH_SIZE)

In [48]:
test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)

Now `test_data` and `train_data` are not collections of (`example`, `label`) pairs but collections of batches. Each batch is a pair of (_many examples_, _many labels_) represented as arrays.

To illustrate:

In [50]:
sample_text, sample_labels = next(iter(test_data))

In [51]:
sample_text[1], sample_labels[1]

(<tf.Tensor: shape=(16,), dtype=int64, numpy=
 array([12502,  3338,    88,  6887, 14932,  2560,  5982,     0,     0,
            0,     0,     0,     0,     0,     0,     0])>,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

Since we've introduced a new token encoding (0 used for padding), the vocabulary size has increased by one.

In [52]:
vocab_size += 1

## Build the model

In [59]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings -- i.e. representing a word as a point in an abstract vector space.

In [60]:
model.add(tf.keras.layers.Embedding(vocab_size, 64))
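Conceptually, an embedding layer is a lookup table with one vector per integer id. Here is a toy plain-Python sketch of the lookup only; the sizes and random values are made up, and a real layer would learn the table during training.

```python
import random

rng = random.Random(0)  # fixed seed so the illustration is repeatable
toy_vocab_size = 6      # hypothetical; includes id 0 for padding
toy_embedding_dim = 4

# One row of floats per token id; a real layer would learn these values.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]


def embed(token_ids):
    """Replace each integer id with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 1, 0])
print(len(vectors), len(vectors[0]))
```

An encoded line of length `n` therefore becomes an `n x 64` array in the real model, which is the `(None, None, 64)` output shape once the batch dimension is included.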

The next layer is a Long Short-Term Memory (LSTM) layer, which lets the model understand words in the context of the other words around them. A bidirectional wrapper on the LSTM helps it learn about each datapoint in relation to the datapoints that came before and after it.

In [61]:
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

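Finally we'll have a series of one or more densely connected layers, with the last one being the output layer. The output layer has no activation, so it produces one raw score (logit) per label rather than a probability; the loss is configured with `from_logits=True` to account for this.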

In [62]:
# Add one or more dense layers
for units in [64, 64]:
    model.add(tf.keras.layers.Dense(units, activation='relu'))

In [64]:
model.add(tf.keras.layers.Dense(3))

In [65]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          1099456   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 195       
Total params: 1,178,115
Trainable params: 1,178,115
Non-trainable params: 0
_________________________________________________________________
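The parameter counts in the summary can be reproduced by hand. The arithmetic below uses the standard formulas for embedding, LSTM, and dense layers with biases; the variable names are my own.

```python
vocab_size = 17179   # 17178 tokens plus the padding id
embedding_dim = 64
lstm_units = 64

# Embedding: one vector of length embedding_dim per id.
embedding_params = vocab_size * embedding_dim

# One LSTM direction: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params  # forward and backward copies


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


dense_0 = dense_params(2 * lstm_units, 64)  # input is the concatenated 128-wide LSTM output
dense_1 = dense_params(64, 64)
dense_2 = dense_params(64, 3)

total_params = embedding_params + bidirectional_params + dense_0 + dense_1 + dense_2
print(embedding_params, bidirectional_params, dense_0, dense_1, dense_2, total_params)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.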


Now compile the model.

In [66]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
             )
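To see what `from_logits=True` implies, here is a plain-Python sketch of sparse categorical cross-entropy for a single example: apply a softmax to the logits, then take the negative log of the probability assigned to the true label. The function names and the sample logits are my own; this is the textbook formula, not the Keras source.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the integer label under softmax(logits)."""
    return -math.log(softmax(logits)[label])


sample_logits = [2.0, 0.5, -1.0]  # made-up scores for the three translators
probabilities = softmax(sample_logits)
loss_if_label_0 = sparse_categorical_crossentropy(sample_logits, label=0)
loss_if_label_2 = sparse_categorical_crossentropy(sample_logits, label=2)
print(probabilities, loss_if_label_0, loss_if_label_2)
```

The loss is small when the largest logit belongs to the true label and grows as the model puts its confidence elsewhere. "Sparse" refers to the labels being plain integers rather than one-hot vectors.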

## Train the model

In [67]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f30dc9fe390>

In [68]:
eval_loss, eval_acc = model.evaluate(test_data)



In [69]:
eval_acc

0.8370000123977661

The model achieves decent results -- about 84% accuracy on the test set.