We'll build a language model trained on the Art of War by Sun Tzu.

In [1]:
import requests

In [2]:
art_of_war = requests.get('https://raw.githubusercontent.com/jrreda/AI-projects/main/Language%20Modelling/art_of_war.txt')\
                     .text

art_of_war[:300]

'1. Sun Tzŭ said: The art of war is of vital importance to the State.\n\n2. It is a matter of life and death, a road either to safety or to\nruin. Hence it is a subject of inquiry which can on no account be\nneglected.\n\n3. The art of war, then, is governed by five constant factors, to be\ntaken into accou'

The language model we'll build will be **character**-based (as opposed to word-based). That is, given a sequence of one or more characters, the model will be asked to predict the next character.<br><br>
Character-level models have two main advantages:
- A smaller prediction space. English text uses only a few dozen distinct characters, compared to the tens of thousands of words in a typical corpus.
- Greater resilience to out-of-vocabulary (OOV) conditions, and a better ability to learn the low-level mechanics of language (including punctuation).<br><br>

On the other hand, character-level models need to learn a sequence of characters to "make sense" of a word (e.g. the sequence of "c", "a", "t" to identify "cat" as a pattern) which can be inefficient and result in lower performance.<br><br>
RNNs can process any kind of sequence so what's shown here can easily be applied at the word level.
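
To make the trade-off concrete, here is a minimal sketch in plain Python. It is not part of the original notebook, and the toy sentence and the word `estate` are made up for illustration. It shows that the character view of a text is a much longer sequence than the word view, and that a word missing from the word vocabulary can still be spelled entirely with known characters.

```python
# Toy comparison of the character-level and word-level views of one sentence.
text = "the art of war is of vital importance to the state"

char_tokens = list(text)    # every character, spaces included
word_tokens = text.split()  # whitespace-separated words

char_vocab = set(char_tokens)
word_vocab = set(word_tokens)

# The character sequence is several times longer than the word sequence,
# which is the inefficiency mentioned above.
print("sequence lengths:", len(char_tokens), "chars vs", len(word_tokens), "words")

# A word never seen in training is OOV for a word-level model, yet all of
# its characters may be known to a character-level model.
unseen = "estate"  # hypothetical unseen word
is_oov_word = unseen not in word_vocab
chars_all_known = all(ch in char_vocab for ch in unseen)
print("OOV as a word:", is_oov_word)
print("all characters known:", chars_all_known)
```

On a single sentence the two vocabularies are both tiny, so the "smaller prediction space" advantage only becomes visible on a full corpus.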

# Preprocessing

In [3]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [4]:
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts([art_of_war])

seq = tokenizer.texts_to_sequences([art_of_war])[0]

In [5]:
tokenizer.get_config()

{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': True,
 'oov_token': None,
 'document_count': 1,
 'word_counts': '{"1": 179, ".": 896, " ": 9794, "s": 3081, "u": 1467, "n": 3565, "t": 4398, "z": 20, "\\u016d": 13, "a": 3475, "i": 3573, "d": 1681, ":": 48, "h": 2558, "e": 5837, "r": 2776, "o": 3548, "f": 1238, "w": 981, "v": 478, "l": 1722, "m": 1201, "p": 769, "c": 1390, "\\n": 1443, "2": 127, ",": 634, "y": 1055, "b": 708, "j": 23, "q": 55, "g": 1007, "3": 87, "k": 345, "\\u2019": 57, "4": 66, "(": 59, ")": 59, ";": 168, "5": 58, "6": 51, "_": 62, "7": 39, "8": 36, "9": 34, "0": 38, "x": 49, "\\u2014": 16, "?": 8, "!": 8, "-": 57, "\\u201c": 3, "\\u201d": 3, "\\u0153": 7, "\\u00fc": 3, "\\u2018": 1}',
 'word_docs': '{"\\u2019": 1, "!": 1, "9": 1, ".": 1, "c": 1, "v": 1, "j": 1, ";": 1, "-": 1, "0": 1, ")": 1, "f": 1, "y": 1, "7": 1, "(": 1, "\\u201d": 1, "h": 1, "t": 1, "\\u016d": 1, "r": 1, "m": 1, "a": 1, "u": 1, "

In [6]:
print(f'Tokenizer "Vocabulary" size: {len(tokenizer.word_index)}')

Tokenizer "Vocabulary" size: 56


In [7]:
# Sanity check.
tokenizer.sequences_to_texts([seq[:10]])

['1 .   s u n   t z ŭ']
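
The ids in the sanity check come from a frequency-ranked vocabulary: the most common character (the space) gets id 1, the next most common gets id 2, and so on, with id 0 left unused. The sketch below imitates that behavior in plain Python. `build_char_index` is a hypothetical helper, not the Keras implementation, and its handling of ties between equally frequent characters is an assumption.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to a 1-based id, most frequent character first."""
    counts = Counter(text.lower())
    # most_common() sorts by descending count; ids start at 1, so 0 stays free.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

sample = "the art of war"
char_index = build_char_index(sample)
index_char = {i: ch for ch, i in char_index.items()}

# Encode to ids, then decode back to text as a round-trip check.
encoded = [char_index[ch] for ch in sample]
decoded = "".join(index_char[i] for i in encoded)

print(encoded)
print(decoded)
```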

Our training data is currently one long sequence, which we'll need to segment into training examples. To do this, we'll use the TensorFlow **tf.data** API, which makes it easy to build preprocessing pipelines by chaining operations together.<br>
https://www.tensorflow.org/guide/data<br>
https://www.tensorflow.org/api_docs/python/tf/data<br>

In [8]:
slices = tf.data.Dataset.from_tensor_slices(seq)
type(slices)

tensorflow.python.data.ops.from_tensor_slices_op.TensorSliceDataset

In [9]:
list(slices.take(10)), seq[:10]

([<tf.Tensor: shape=(), dtype=int32, numpy=27>,
  <tf.Tensor: shape=(), dtype=int32, numpy=21>,
  <tf.Tensor: shape=(), dtype=int32, numpy=1>,
  <tf.Tensor: shape=(), dtype=int32, numpy=8>,
  <tf.Tensor: shape=(), dtype=int32, numpy=13>,
  <tf.Tensor: shape=(), dtype=int32, numpy=5>,
  <tf.Tensor: shape=(), dtype=int32, numpy=1>,
  <tf.Tensor: shape=(), dtype=int32, numpy=3>,
  <tf.Tensor: shape=(), dtype=int32, numpy=47>,
  <tf.Tensor: shape=(), dtype=int32, numpy=49>],
 [27, 21, 1, 8, 13, 5, 1, 3, 47, 49])

Here, we're creating windows of length `input_timesteps + 1`. *input_timesteps* is the length of each training example. The *+1* gives us one extra character per window so that we can derive the target/label sequence for each training example. This will be clarified further below.<br><br>
We're also setting *shift=1*, so each window starts one character after the previous one, and *drop_remainder=True*, which ensures that ALL windows contain exactly `window_size` elements: the shorter windows at the very end of the sequence are discarded.
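
As a rough illustration of the windowing described above, here is a plain-Python sketch on a toy list. `sliding_windows` is a hypothetical helper written for this example; it only mimics the `size`, `shift`, and `drop_remainder` behavior and is not how `tf.data` is implemented.

```python
def sliding_windows(seq, size, shift=1, drop_remainder=True):
    """Yield windows of `size` items, starting a new window every `shift` items."""
    for start in range(0, len(seq), shift):
        window = seq[start:start + size]
        # Windows near the end are shorter than `size`; optionally drop them.
        if drop_remainder and len(window) < size:
            break
        yield window

toy = list(range(6))

full_only = list(sliding_windows(toy, size=4))
with_tail = list(sliding_windows(toy, size=4, drop_remainder=False))

print(full_only)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
print(with_tail)  # also includes [3, 4, 5], [4, 5] and [5]
```

With `shift=1`, consecutive windows overlap in all but one element, which is why a short text still yields a large number of training examples.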

In [10]:
# create the training examples
input_timesteps = 100
window_size = input_timesteps + 1
windows = slices.window(window_size, shift=1, drop_remainder=True)

In [11]:
# Sanity check.
for t in windows.take(3):
  arr = list(t.as_numpy_iterator())
  print(len(arr), arr)

101 [27, 21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12]
101 [21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2]
101 [1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2

The *window* method returns a nested dataset of datasets (i.e. each window is itself a dataset of scalar tensors, one per character).

In [12]:
print(windows, '\n')

for w in windows.take(2):
  print(w)

<WindowDataset element_spec=DatasetSpec(TensorSpec(shape=(), dtype=tf.int32, name=None), TensorShape([]))> 

<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>


But our model won't accept those. It accepts only tensors, so we need to extract the tensors from each window. To do that, we'll use *flat_map*, which flattens the dataset of datasets into a single dataset of elements. But because we want to retain our segmented sequences, we'll call *batch* on each window inside *flat_map*, so that every window comes back as one tensor of `window_size` elements (otherwise, we'd just get back a flat stream of individual characters with the segment boundaries lost).<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch
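
The difference between flattening with and without the inner *batch* call can be sketched with plain Python iterators. The helpers below (`make_windows`, `batch_items`, `flat_map_sketch`) are hypothetical stand-ins written for this example, not the `tf.data` implementation.

```python
from itertools import chain, islice

def make_windows(seq, size):
    """Nested structure: each window is its own iterator of scalars."""
    for start in range(len(seq) - size + 1):
        yield iter(seq[start:start + size])

def batch_items(items, n):
    """Group an iterator into lists of up to n items."""
    items = iter(items)
    while True:
        chunk = list(islice(items, n))
        if not chunk:
            return
        yield chunk

def flat_map_sketch(nested, fn):
    """Apply fn to each inner iterator and chain the results into one stream."""
    return chain.from_iterable(fn(inner) for inner in nested)

toy = [10, 11, 12, 13, 14]

# Without batching, the window boundaries are lost: one flat stream of scalars.
flat = list(flat_map_sketch(make_windows(toy, 3), lambda w: w))

# Batching each window to its full size keeps one element per window.
kept = list(flat_map_sketch(make_windows(toy, 3), lambda w: batch_items(w, 3)))

print(flat)  # [10, 11, 12, 11, 12, 13, 12, 13, 14]
print(kept)  # [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
```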

In [13]:
dataset = windows.flat_map(lambda window: window.batch(window_size))



In [14]:
# Sanity check.
for d in dataset.take(2):
  print(d)

tf.Tensor(
[27 21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3
  1  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6
  9  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21
  1  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1
  7  5 12  1 12], shape=(101,), dtype=int32)
tf.Tensor(
[21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3  1
  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6  9
  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21  1
  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1  7
  5 12  1 12  2], shape=(101,), dtype=int32)


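The next step is to group our examples into batches. We shuffle first, so that each batch mixes windows from different parts of the text instead of 32 nearly identical, overlapping neighbors, and then create the batches.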
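
Note that a buffered shuffle is only approximate when the buffer is smaller than the dataset: each element is drawn from a bounded buffer that is refilled as the data streams in. The plain-Python sketch below illustrates the idea on ten items with a buffer of four. `buffered_shuffle` and `batched` are hypothetical helpers for illustration and do not reproduce TensorFlow's internals.

```python
import random

def buffered_shuffle(items, buffer_size, rng):
    """Stream items through a fixed-size buffer, emitting a random slot each time."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    while buffer:                # input exhausted: drain the buffer randomly
        yield buffer.pop(rng.randrange(len(buffer)))

def batched(items, n):
    """Group items into lists of n; the last list may be shorter."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

rng = random.Random(0)  # fixed seed so the sketch is repeatable
shuffled = list(buffered_shuffle(range(10), buffer_size=4, rng=rng))
batches_sketch = list(batched(shuffled, 3))

print(shuffled)
print(batches_sketch)
```

Because the first emitted item can only come from the first `buffer_size` items, a small buffer keeps elements close to their original position; a larger buffer mixes more thoroughly at the cost of memory.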

In [15]:
batch_size = 32
batches = dataset.shuffle(10_000).batch(batch_size)

In [16]:
for b in batches.take(2):
  print(b)

tf.Tensor(
[[ 5 19 28 ... 19  3 10]
 [ 1 11  2 ... 22  4  9]
 [ 3  9  2 ... 10  6 20]
 ...
 [ 7  4  5 ... 35 34  1]
 [ 2  1  3 ...  4  5  1]
 [ 9  6 18 ... 15  2 11]], shape=(32, 101), dtype=int32)
tf.Tensor(
[[ 8  3  9 ...  1 23  9]
 [ 1  5  6 ... 26  2  5]
 [19  3 10 ...  1 16  7]
 ...
 [ 9  3 13 ...  3 10  2]
 [ 1  3  6 ... 14 27 30]
 [ 2  1 17 ...  9  2 21]], shape=(32, 101), dtype=int32)


We can now separate each example into an input sequence (x) and a corresponding label/target sequence (y).<br><br>
In the slides, we talked about **Teacher Forcing** where:<br>
1. At each timestep during training, the output is compared to a label.
2. At the next timestep, rather than feeding the model its own previous output, we feed it the next character of the input sequence (i.e. what the model should have output).
<br><br>

So if a sequence is "she swam in the lake", then:
- The input will be "she swam in the lak" (drop the last character)
- The target/label will be "he swam in the lake" (drop the first character)
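
The same shift can be checked on the example string in plain Python. This is only an illustration of the slicing; the notebook applies the equivalent slices to whole batches of integer ids.

```python
sequence = "she swam in the lake"

x = sequence[:-1]  # input: drop the last character
y = sequence[1:]   # target: drop the first character

# At every timestep t, the label is the character that follows x[t].
for t in range(3):
    print(repr(x[: t + 1]), "->", repr(y[t]))
# 's' -> 'h'
# 'sh' -> 'e'
# 'she' -> ' '
```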

In [17]:
xy_batches = batches.map(lambda batch: (batch[:, :-1], batch[:, 1:]))

In [18]:
for b in xy_batches.take(1):
  print(b)

(<tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[ 7,  3,  3, ...,  9,  6, 18],
       [ 3,  2, 12, ..., 14,  3, 10],
       [ 5, 12,  1, ...,  5, 19,  1],
       ...,
       [12,  1,  3, ...,  9,  7,  3],
       [ 3, 10,  2, ..., 12,  2, 16],
       [ 2, 12, 51, ...,  1,  6, 25]], dtype=int32)>, <tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[ 3,  3,  2, ...,  6, 18,  1],
       [ 2, 12,  1, ...,  3, 10,  7],
       [12,  1,  7, ..., 19,  1,  7],
       ...,
       [ 1,  3, 10, ...,  7,  3,  7],
       [10,  2,  1, ...,  2, 16,  2],
       [12, 51, 14, ...,  6, 25,  2]], dtype=int32)>)


In [19]:
# For greater clarity, this is the first input sequence from the first batch,
# and its corresponding label/target sequence.
for b in xy_batches.take(1):
  print("x1 length: ", len(b[0][0].numpy()))
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1 length: ", len(b[1][0].numpy()))
  print("y1: ", b[1][0].numpy())

x1 length:  100
x1:  [23  2  1 12  7 17 22  2 12 21  1  4 16 14 18  6 13  1 11  7 18  1  8  4
  2 19  2  1  3  6  1  7  1  3  6 20  5 24  1 18  6 13  1 20  4 11 11  1
  2 40 10  7 13  8  3  1 18  6 13  9  1  8  3  9  2  5 19  3 10 21 14 14
 30 21  1  7 19  7  4  5 24  1  4 16  1  3 10  2  1 15  7 17 22  7  4 19
  5  1  4  8]


y1 length:  100
y1:  [ 2  1 12  7 17 22  2 12 21  1  4 16 14 18  6 13  1 11  7 18  1  8  4  2
 19  2  1  3  6  1  7  1  3  6 20  5 24  1 18  6 13  1 20  4 11 11  1  2
 40 10  7 13  8  3  1 18  6 13  9  1  8  3  9  2  5 19  3 10 21 14 14 30
 21  1  7 19  7  4  5 24  1  4 16  1  3 10  2  1 15  7 17 22  7  4 19  5
  1  4  8  1]


The last step before we can build our model is to one-hot encode the **inputs**. We're doing this because:
1. We're not using embeddings for the input. We could, but since this is a character model with just a few dozen possible tokens, we can get away with one-hot encoding. There's also little reason to think one letter should sit closer to another in vector space, as we would want for words in a word-level model.

2. Without an embedding layer, the categorical character ids still have to be turned into vectors before the LSTM can consume them, and one-hot encoding is the simplest way to do that.
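
For readers who haven't seen one-hot encoding before, here is a minimal plain-Python sketch with a made-up vocabulary of four ids. `one_hot_sketch` is a hypothetical helper for illustration; the notebook itself relies on `tf.one_hot`.

```python
def one_hot_sketch(ids, depth):
    """Turn each integer id into a vector of length `depth` with a single 1.0."""
    return [[1.0 if position == i else 0.0 for position in range(depth)]
            for i in ids]

rows = one_hot_sketch([1, 3, 0], depth=4)
for row in rows:
    print(row)
# [0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 1.0]
# [1.0, 0.0, 0.0, 0.0]
```

Each sequence of `T` ids therefore becomes a `T x depth` matrix, which is why the input to the model has one more dimension than the labels.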

In [20]:
# Keras Tokenizer ids start at 1 (0 is reserved), hence the + 1.
num_tokens = len(tokenizer.word_index) + 1

# One-hot encode the input sequences; leave the label/target sequences as integer ids.
xy_batches = xy_batches.map(lambda inputs, labels: (tf.one_hot(inputs, depth=num_tokens), labels))

In [21]:
# Sanity check.
for b in xy_batches.take(1):
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1: ", b[1][0].numpy())

x1:  [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


y1:  [ 8  1  3  6  1 23  7 13 11 26  1  3 10  2  1  2  5  2 17 18 36  8  1 22
 11  7  5  8 28 14  3 10  2  1  5  2 40  3  1 23  2  8  3  1  4  8  1  3
  6  1 22  9  2 25  2  5  3  1  3 10  2  1 46 13  5 15  3  4  6  5  1  6
 16  1  3 10  2  1  2  5  2 17 18 36  8  1 16  6  9 15  2  8 28  1  3 10
  2 14  5  2]


In [22]:
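# Prefetching is an optimization step: the next batch is prepared while the
# current one is being consumed. Apply it to the dataset we actually train on.
xy_batches = xy_batches.prefetch(tf.data.AUTOTUNE)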

At this point, we've:
- Segmented our corpus into fixed-length sequences.
- Shuffled them and organized them into batches.
- Created input and label/target sequences.
- One-hot encoded the inputs.
- Set up prefetching so the next batch is prepared while the current one is being processed.

# Stacked LSTMs

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# recurrent_dropout=0 (the default) keeps the LSTM layers eligible for the fast cuDNN kernel.
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=[None, num_tokens], recurrent_dropout=0),
    Dropout(0.3),
    LSTM(128, return_sequences=True, recurrent_dropout=0),
    Dense(num_tokens, activation='softmax')
])

model.compile(loss="sparse_categorical_crossentropy", optimizer='adam')

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, None, 128)         95232     
                                                                 
 dropout (Dropout)           (None, None, 128)         0         
                                                                 
 lstm_1 (LSTM)               (None, None, 128)         131584    
                                                                 
 dense (Dense)               (None, None, 57)          7353      
                                                                 
Total params: 234,169
Trainable params: 234,169
Non-trainable params: 0
_________________________________________________________________
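
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias); `lstm_params` and `dense_params` are hypothetical helpers written for this check.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input/output pair plus one bias per output unit."""
    return input_dim * units + units

vocab_plus_one = 57  # num_tokens in the notebook
hidden = 128

first_lstm = lstm_params(vocab_plus_one, hidden)   # 95232
second_lstm = lstm_params(hidden, hidden)          # 131584
output_layer = dense_params(hidden, vocab_plus_one)  # 7353
total = first_lstm + second_lstm + output_layer    # 234169

print(first_lstm, second_lstm, output_layer, total)
```

The Dropout layer has no weights, so it contributes nothing to the total.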


In [24]:
history = model.fit(xy_batches, epochs=50)

Epoch 1/50
...
Epoch 50/50


In [25]:
# save model
model.save('art_of_war_char_level_lm_50epochs')



# Generate Text

Generation is controlled by a *temperature* parameter. The next character is sampled from the probability distribution that the model outputs. By dividing the log of these probabilities by *temperature* before sampling, we can influence the randomness of the output.<br><br>
When the temperature is low (< 1), the probability distribution sharpens and the model sticks more closely to patterns from the original text. As we raise the temperature, the distribution flattens and there's a higher chance the model picks something unexpected, resulting in greater surprise in the output. In practice, a high enough temperature will result in nonsense.
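
To see the sharpening and flattening effect in isolation, here is a small plain-Python sketch. The three-way distribution is made up for illustration, and `apply_temperature` is a hypothetical helper that renormalizes explicitly so the resulting probabilities can be printed.

```python
import math

def apply_temperature(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize (softmax)."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]  # made-up next-character distribution

cold = apply_temperature(probs, 0.2)  # sharper: the top choice dominates
same = apply_temperature(probs, 1.0)  # unchanged
hot = apply_temperature(probs, 2.0)   # flatter: unlikely choices gain mass

for name, dist in (("0.2", cold), ("1.0", same), ("2.0", hot)):
    print(name, [round(p, 3) for p in dist])
```

A temperature of exactly 1 leaves the distribution as the model produced it, which is why it serves as the neutral default.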

In [46]:
import numpy as np
import sys

def generate_text(seed_text, length=200, temperature=1.0):
  """Sample `length` characters from the model, starting from `seed_text`."""
  text = seed_text

  for _ in range(length):

    # Take the last *input_timesteps* characters of the text so far as input.
    # (Named `encoded` to avoid shadowing the built-in `input`.)
    encoded = np.array(tokenizer.texts_to_sequences([text[-input_timesteps:]]))
    encoded = tf.one_hot(encoded, num_tokens)

    # We want only the next character, so keep the softmax output for the
    # last timestep. Shape: (1, num_tokens).
    preds = model.predict(encoded, verbose=0)[0, -1:, :]
    # Scale the log-probabilities by temperature; tf.random.categorical
    # treats its input as unnormalized log-probabilities.
    logits = tf.math.log(preds) / temperature

    # Sample next character and add to running text.
    next_id = tf.random.categorical(logits, num_samples=1)
    next_char = tokenizer.sequences_to_texts(next_id.numpy())[0]

    # Print each character as it's generated (GPT-3 like prompt :D).
    sys.stdout.write(next_char)
    sys.stdout.flush()

    text += next_char
  print()

  return text

In [47]:
generate_text("Banana peels on the battlefield can ", length=30, temperature=0.2)

on hemments, and when the met 


'Banana peels on the battlefield can on hemments, and when the met '

In [48]:
generate_text("It's time to release the Kraken when ", length=30, temperature=0.5)

keves under comit. sollies wit


"It's time to release the Kraken when keves under comit. sollies wit"

In [49]:
generate_text("Crush your enemies, see them driven before you, and ", length=30, temperature=1)

for the road.

50. foo even th


'Crush your enemies, see them driven before you, and for the road.\n\n50. foo even th'

In [50]:
generate_text("What is best in life? ", length=30, temperature=2)

in
simply because or
 appearan


'What is best in life? in\nsimply because or\n appearan'

A few observations of the preceding outputs:
1. Despite working one character at a time, the model managed to "learn" a fair amount of spelling, cadence, punctuation, spacing, grammar, and even numbered bullet points just from trying to predict the next character.

2. It's pretty cool how the model manages to take our initial seed text and complete a sentence with it before moving on.

3. We can see the output getting increasingly nonsensical as the temperature rises. What temperature to use ultimately depends on the nature of your corpus and your goals with the language model.

Also, in contrast to our language model with its roughly 234 thousand parameters, GPT-3 has 175 billion parameters and was trained on a dataset filtered down from about 45 terabytes of raw text, but the high-level principle of learning through prediction remains the same.