Chapter 16: Natural Language Processing with RNNs and Attention

Chapter 16 exercises:

1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

-> Stateless RNNs do not preserve the hidden state across sequences or batches. Therefore, they can only capture patterns whose length is less than or equal to the sequence (window) size they were trained on. 
-> Stateful RNNs maintain their internal state across multiple sequences or batches of data. Thus, they can capture longer dependencies in the data. 
-> Since stateful RNNs require the training data to be fully sequential and non-overlapping, they are harder to set up and train. 
-> Stateless RNNs are used when the context of the previous sequence is not relevant or when the data is shuffled randomly (for example, sentiment analysis of a sentence). Stateful RNNs are used when the order and continuity of sequences are essential, such as when generating music or predicting stock prices.

page 584:
Stateless RNN
- at each training iteration the model starts with a hidden state full of zeros, then it updates the state at each time step, and after the last time step, it throws the state away as it is not needed anymore

Stateful RNN
- preserve the final state after processing a training batch and use it as the initial state for the next training batch
- this way the model can learn long-term patterns despite only backpropagating through short sequences

Building Stateful RNN
- each sequence in a batch starts exactly where the corresponding sequence in the previous batch left off
- use sequential and non-overlapping input sequences (rather than the shuffled and overlapping sequences previously used to train stateless RNNs)
- when creating the tf.data.Dataset, we must use 'shift=length' (instead of shift=1) when calling the 'window()' method, and we must not shuffle
- batching is much harder: the simplest solution is to use a batch size of 1
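
The difference between the two windowing schemes can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the tf.data pipeline from the book: `make_windows` is a hypothetical helper that mimics what `window(length, shift=...)` produces once incomplete windows are dropped, and it ignores the extra target character that the real pipeline includes in each window.

```python
def make_windows(sequence, length, shift):
    """Return all full windows of `length` items, starting every `shift` items."""
    return [sequence[start:start + length]
            for start in range(0, len(sequence) - length + 1, shift)]

text_ids = list(range(10))  # stand-in for an encoded text: 0, 1, ..., 9

# Stateless training: overlapping windows (shift=1), which can then be shuffled.
stateless_windows = make_windows(text_ids, length=4, shift=1)

# Stateful training: consecutive, non-overlapping windows (shift=length), kept
# in order so that each window starts exactly where the previous one ended.
stateful_windows = make_windows(text_ids, length=4, shift=4)

print(stateless_windows[:3])  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
print(stateful_windows)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With a batch size of 1, feeding `stateful_windows` in order means the final state after one window is a sensible initial state for the next one.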

Difference between Stateful and Stateless RNNs: https://medium.com/@iqra.bismi/difference-between-stateful-and-stateless-rnns-2b397184e759

Key Differences [between stateful and stateless RNNs]:

a. Memory Retention: Stateful RNNs retain memory across sequences, capturing long-term dependencies, while stateless RNNs treat each sequence independently without retaining memory.

b. Sequence Order: Stateful RNNs preserve the order and continuity of sequences, whereas stateless RNNs treat sequences as isolated entities.

c. Training and Inference: Stateful RNNs require careful management of sequence boundaries during training and inference, while stateless RNNs do not have this requirement.

__book answer:__

Stateless RNNs can only capture patterns whose length is less than, or equal to, the size of the windows the RNN is trained on. Conversely, stateful RNNs can capture longer-term patterns. However, implementing a stateful RNN is much harder⁠—especially preparing the dataset properly. Moreover, stateful RNNs do not always work better, in part because consecutive batches are not independent and identically distributed (IID). Gradient Descent is not fond of non-IID datasets.


2. Why do people use encoder-decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

-> Plain sequence-to-sequence RNNs essentially translate a sentence one word at a time, starting with the first word. Since words can have multiple meanings and the specific meaning often depends on how a word is used in the sentence, translating one word at a time results in poor translations. On the other hand, an encoder-decoder will first process the whole sentence, and then translate it. 

page 578:
RNN encoder-decoder architectures
- RNNs can be used to build encoder-decoder architectures capable of performing Neural Machine Translation (NMT) e.g. translating from English to Spanish

pages 595 - 596
Neural Machine Translation (NMT) (https://homl.info/103)
- will translate English sentences to Spanish
- architecture (see Figure 16-3. A simple Machine translation model. page 596): 
    - [via TextVectorization] initially each word is represented by its ID
    - next, an embedding layer returns the word embedding
    - the embeddings are then fed to the encoder and the decoder
    - English sentences are fed as inputs to the encoder
    - decoder outputs the Spanish translation
      - Note: Spanish translations are also used as inputs to the decoder during training (teacher forcing), but shifted back by one step
    - at each step, the decoder outputs a score for each word in the output vocabulary (i.e. Spanish), then the softmax activation function turns these scores into probabilities. The word with the highest probability is output
    - at inference time (after training), you will not have the target sentence to feed to the decoder. Instead, you need to feed it the word that it has just output at the previous step. This will require an embedding lookup (not shown in figure 16-4)
      - see figure 16-4. At inference time, the decoder is fed as input the word it just output at the previous time step. page 597 
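
The shift used for teacher forcing, and the feedback loop used at inference time, can be sketched in plain Python. This is a toy illustration under stated assumptions: the `<sos>` and `<eos>` token names, `make_decoder_pair`, and `greedy_decode` are hypothetical, and the "decoder" is a lookup table rather than a trained RNN (a real decoder also depends on its hidden state).

```python
SOS, EOS = "<sos>", "<eos>"  # hypothetical start/end-of-sequence tokens

def make_decoder_pair(target_words):
    """Build (decoder_inputs, decoder_targets) for teacher forcing."""
    decoder_inputs = [SOS] + target_words    # what the decoder is fed at each step
    decoder_targets = target_words + [EOS]   # what it should output at each step
    return decoder_inputs, decoder_targets

def greedy_decode(next_word, max_steps=10):
    """Inference: feed back the word that was output at the previous step."""
    words, previous = [], SOS
    for _ in range(max_steps):
        previous = next_word(previous)
        if previous == EOS:
            break
        words.append(previous)
    return words

dec_in, dec_out = make_decoder_pair(["me", "gusta", "el", "fútbol"])
print(dec_in)   # ['<sos>', 'me', 'gusta', 'el', 'fútbol']
print(dec_out)  # ['me', 'gusta', 'el', 'fútbol', '<eos>']

# Toy "decoder": a fixed chain mapping each input word to the next output word.
chain = dict(zip(dec_in, dec_out))
print(greedy_decode(chain.get))  # ['me', 'gusta', 'el', 'fútbol']
```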


__book answer:__

In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence "Je vous en prie" means "You are welcome," but if you translate it one word at a time, you get "I you in pray." Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an Encoder–Decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).

3. How can you deal with variable-length input sequences? What tool can you use to implement it?

One approach is to pad the shorter sequences so that all sequences are the same length, and then use masking to tell the RNN model to ignore the padding tokens. To enable masking, add the "mask_zero=True" argument to the Embedding layer. If your model does not start with an Embedding layer, you may use the tf.keras.layers.Masking layer instead. A TextVectorization layer pads each batch of sequences with zeros up to the length of the longest sequence in the batch (or up to output_sequence_length, if it is set). Another approach is to feed the model with ragged tensors and set 'ragged=True' when creating the TextVectorization layer.

page 590 - 591:
Masking the model:
- ignore padding tokens
- to use, simply add the "mask_zero=True" argument to the Embedding layer

Embedding model masking:
- creates a 'mask tensor' equal to 'tf.math.not_equal(inputs, 0)': a boolean tensor with the same shape as the inputs, equal to False anywhere the token IDs are 0, and True otherwise
- the 'mask tensor' is automatically propagated by the model to the next layer. If that layer's call() method has a mask argument, then it automatically receives the mask
- each layer may handle the mask differently, but in general, they simply ignore masked time steps. For example, when a recurrent layer encounters a masked time step, it simply copies the output from the previous time step
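
To make the mask concrete, here is a small plain-Python sketch of padding, mask computation, and a toy recurrent step that copies its previous output on masked time steps. It is an illustration only: `pad_sequences`, `compute_mask`, and `toy_recurrent_layer` are hypothetical helpers, and the "recurrent layer" is a simple arithmetic update rather than a GRU.

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def compute_mask(padded, pad_id=0):
    """Boolean mask: False where the token is padding, True elsewhere."""
    return [[token != pad_id for token in seq] for seq in padded]

def toy_recurrent_layer(seq, mask):
    """Toy update state = 2 * state + token; masked steps copy the previous output."""
    outputs, state = [], 0
    for token, keep in zip(seq, mask):
        if keep:
            state = 2 * state + token
        outputs.append(state)
    return outputs

batch = [[3, 5, 2], [7, 1], [4]]  # token IDs start at 1; 0 is reserved for padding
padded = pad_sequences(batch)
mask = compute_mask(padded)

print(padded)  # [[3, 5, 2], [7, 1, 0], [4, 0, 0]]
print(mask)    # [[True, True, True], [True, True, False], [True, False, False]]
print(toy_recurrent_layer(padded[1], mask[1]))  # [7, 15, 15]
```

Without the mask, the last step of the second sequence would have updated the state to 30, so the padding token would have changed the layer's final output.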

Recurrent layers support for masking attribute
- a recurrent layer's supports_masking attribute is True when 'return_sequences=True', but it is False when 'return_sequences=False' since there is no need for a mask in this case
- in the book's code example, the GRU layer does not have 'return_sequences=True', so it will receive and use the mask automatically, but it will not propagate it any further

Keras layers supporting masking:
- include GRU, LSTM, Bidirectional, Dense, TimeDistributed, Add
- convolutional layers including Conv1D do not support masking - it's not obvious how they would do so anyways


page 592:
Masking using Ragged Tensors
- one approach to masking is to feed the model with ragged tensors
- in practice, all you need to do is to set 'ragged=True' when creating the TextVectorization layer, so all the input sequences are represented as ragged tensors
- Keras's recurrent layers have built-in support for ragged tensors, so there's nothing else you need to do: just use this TextVectorization layer with ragged=True in your model

__book answer:__
Variable-length input sequences can be handled by padding the shorter sequences so that all sequences in a batch have the same length, and using masking to ensure the RNN ignores the padding token. For better performance, you may also want to create batches containing sequences of similar sizes. Ragged tensors can hold sequences of variable lengths, and Keras now supports them, which simplifies handling variable-length input sequences (at the time of this writing, it still does not handle ragged tensors as targets on the GPU, though). Regarding variable-length output sequences, if the length of the output sequence is known in advance (e.g., if you know that it is the same as the input sequence), then you just need to configure the loss function so that it ignores tokens that come after the end of the sequence. Similarly, the code that will use the model should ignore tokens beyond the end of the sequence. But generally the length of the output sequence is not known ahead of time, so the solution is to train the model so that it outputs an end-of-sequence token at the end of each sequence.

4. What is a beam search, and why would you use it? What tool can you use to implement it?

-> For a trained encoder-decoder model, it is a technique that gives the model a chance to go back and fix mistakes by keeping track of a short list of the 'k' most promising sentences (e.g. the top 3). At each decoder step it tries to extend them by one word, keeping only the 'k' most likely sentences. The parameter 'k' is called the beam width. Instead of greedily choosing the most likely next word at each step to extend a single sentence, this technique allows the system to explore several promising sentences simultaneously. Moreover, this technique lends itself well to parallelization. 
-> You can implement beam search by writing a custom memory cell. Alternatively, TensorFlow Addons's seq2seq API provides an implementation.

Beam Search (pages 603 - 604)
- for a trained encoder-decoder model, it is a technique that gives a model a chance to go back and fix mistakes
- keeps track of a short list of the 'k' most promising sentences (e.g. the top 3), and at each decoder step it tries to extend them by one word, keeping only the 'k' most likely sentences
- the parameter 'k' is called the beam width
- good translations for fairly short sentences
- due to the limited short-term memory of RNNs, beam models are bad at translating long sentences
TensorFlow Addons Library
- includes a full seq2seq API that lets you build encoder-decoder models with attention, including 'beam search'
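
The beam search procedure described in these notes can be sketched in plain Python. This is a minimal illustration, not the TensorFlow Addons implementation: `beam_search` is a hypothetical function, and `TOY_MODEL` is a made-up table of next-word probabilities standing in for a trained decoder.

```python
import math

def beam_search(next_probs, beam_width, max_len, eos="<eos>"):
    """Keep the `beam_width` most likely partial sentences at each step."""
    beams = [([], 0.0)]  # each beam: (words so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == eos:
                candidates.append((words, logp))  # finished sentence: keep as is
                continue
            for word, p in next_probs(tuple(words)).items():
                candidates.append((words + [word], logp + math.log(p)))
        # Keep only the k most likely sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(words[-1] == eos for words, _ in beams):
            break
    return beams

# Made-up conditional probabilities: "a" looks best at first, but "b z" wins overall.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_next_probs(prefix):
    # Any two-word prefix is followed by the end-of-sequence token.
    return TOY_MODEL.get(prefix, {"<eos>": 1.0})

greedy = beam_search(toy_next_probs, beam_width=1, max_len=3)
beam = beam_search(toy_next_probs, beam_width=2, max_len=3)
print(greedy[0][0], round(math.exp(greedy[0][1]), 2))  # ['a', 'x', '<eos>'] 0.33
print(beam[0][0], round(math.exp(beam[0][1]), 2))      # ['b', 'z', '<eos>'] 0.36
```

With a beam width of 1 the search is greedy and commits to "a" too early; with a beam width of 2 it keeps "b" alive long enough to find the more likely sentence.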

__book answer:__
Beam search is a technique used to improve the performance of a trained Encoder–Decoder model, for example in a neural machine translation system. The algorithm keeps track of a short list of the k most promising output sentences (say, the top three), and at each decoder step it tries to extend them by one word; then it keeps only the k most likely sentences. The parameter k is called the beam width: the larger it is, the more CPU and RAM will be used, but also the more accurate the system will be. Instead of greedily choosing the most likely next word at each step to extend a single sentence, this technique allows the system to explore several promising sentences simultaneously. Moreover, this technique lends itself well to parallelization. You can implement beam search by writing a custom memory cell. Alternatively, TensorFlow Addons's seq2seq API provides an implementation.


5. What is an attention mechanism? How does it help?

-> An attention mechanism is a technique initially used in Encoder–Decoder models to give the decoder more direct access to the input sequence, allowing it to deal with longer input sequences. At each decoder time step, the current decoder's state and the full output of the encoder are processed by an alignment model that outputs an alignment score for each input time step. This score indicates which part of the input is most relevant to the current decoder time step. The weighted sum of the encoder output (weighted by their alignment score) is then fed to the decoder, which produces the next decoder state and the output for this time step. 
-> The main benefit of using an attention mechanism is the fact that the Encoder–Decoder model can successfully process longer input sequences. Another benefit is that the alignment scores make the model easier to debug and interpret: for example, if the model makes a mistake, you can look at which part of the input it was paying attention to, and this can help diagnose the issue. An attention mechanism is also at the core of the Transformer architecture, in the Multi-Head Attention layers.

page 605:
Attention Mechanisms
- revolutionized Neural Machine Translation (NMT), especially for long sentences
- a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step
- figure 16-7. Neural machine translation using an encoder-decoder network with an attention model. page 606
   - encoder-decoder: instead of just sending the encoder's final hidden state to the decoder, as well as the previous target word at each step, it now sends all of the encoder's outputs to the decoder as well
    - since the decoder cannot deal with all those encoder's outputs at once, they are aggregated: at each time step, the decoder's memory cell computes a weighted sum of all the encoder outputs. This determines which words it will focus on at this step
    - the weight alpha_(t,i) is the weight of the i_th encoder output at the t_th decoder time step
    - For example, if the weight alpha_(3,2) is larger than alpha_(3,0) and alpha_(3,1), then the decoder will pay more attention to the encoder's output for the #2 word ('soccer' in 'I like soccer') than to the other 2 outputs, at least at this time step


pages 606 - 607
Alignment Model (or attention layer):
- generates the alpha_(t,i) weights
- neural network trained jointly with the encoder-decoder model used in an attention mechanism
- starts with a Dense layer composed of a single neuron that processes each of the encoder's outputs along with the decoder's previous hidden state (e.g. h_(2) )
   - this layer outputs a score (or energy) for each encoder output (e.g. e_(3,2) ): this score measures how well each output is aligned with the decoder's previous hidden state
- finally, all scores go through a softmax layer to get a final weight for each encoder output (e.g. alpha_(3,2) )
Bahdanau attention (or concatenative attention or additive attention)
- above attention technique
Luong attention (or multiplicative attention)
- the goal of the alignment model is to measure the similarity between one of the encoder's outputs and the decoder's previous hidden state
- simply compute the dot product of these two vectors, as this is often a fairly good similarity measure
- the dot product gives a score, and all the scores (at a given decoder time step) go through a softmax layer to give the final weights (just as in the Bahdanau attention)
- another simplification was to use the decoder's hidden state at the current time step rather than at the previous time step (e.g. h_(t) rather than h_(t-1)), then use the output of the attention mechanism (noted h~_(t)) directly to compute the decoder's predictions, rather than using it to compute the decoder's current hidden state
"general" dot product approach
- a variant of the dot product mechanism where the encoder outputs first go through a fully connected layer (without a bias term) before the dot products are computed
dot product variants vs concatenative attention
- dot product variants performed better than concatenative attention.
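
The dot-product scoring described in these notes can be sketched in a few lines of plain Python. This is a minimal illustration with made-up numbers, not the book's Keras implementation: `softmax` and `dot_product_attention` are hypothetical helpers, and the encoder outputs and decoder state are invented 2-dimensional vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder time step."""
    # Score each encoder output by its dot product with the decoder state.
    scores = [sum(h * y for h, y in zip(decoder_state, output))
              for output in encoder_outputs]
    weights = softmax(scores)  # the alpha_(t,i) weights
    # The context vector is the weighted sum of the encoder outputs.
    context = [sum(w * output[d] for w, output in zip(weights, encoder_outputs))
               for d in range(len(decoder_state))]
    return weights, context

# Made-up encoder outputs for the three words of "I like soccer".
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
decoder_state = [1.0, 1.0]

weights, context = dot_product_attention(decoder_state, encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Here the third encoder output has the largest dot product with the decoder state, so it receives the largest weight and dominates the context vector.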

__book answer:__
An attention mechanism is a technique initially used in Encoder–Decoder models to give the decoder more direct access to the input sequence, allowing it to deal with longer input sequences. At each decoder time step, the current decoder's state and the full output of the encoder are processed by an alignment model that outputs an alignment score for each input time step. This score indicates which part of the input is most relevant to the current decoder time step. The weighted sum of the encoder output (weighted by their alignment score) is then fed to the decoder, which produces the next decoder state and the output for this time step. The main benefit of using an attention mechanism is the fact that the Encoder–Decoder model can successfully process longer input sequences. Another benefit is that the alignment scores make the model easier to debug and interpret: for example, if the model makes a mistake, you can look at which part of the input it was paying attention to, and this can help diagnose the issue. An attention mechanism is also at the core of the Transformer architecture, in the Multi-Head Attention layers. See the next answer.


6. What is the most important layer in the transformer architecture? What is its purpose?

-> The most important layer in the transformer architecture is the multi-head attention layer. In the original transformer architecture there were 6 multi-head layers in the encoder, plus 6 masked multi-head layers and 6 [regular] multi-head layers in the decoder. The multi-head attention layer updates each word representation by attending (i.e. paying attention) to all other words in the same sentence. That is, it improves each word's representation by using contextual clues.

page 609 - 612
transformer:
  - by google researchers - significantly trans
  - improved the state-of-the-art NMT without using any recurrent or convolutional layers, just attention mechanisms (plus embedding layers, dense layers, normalization layers, etc.)
  - see figure 16-8. The original 2017 transformer architecture. page 610
    - left part is the encoder
    - right part is the decoder
    - each embedding layer (at the beginning of the encoder and decoder) outputs a 3D tensor of shape [batch size, sequence length, embedding size]
    - after embedding layers, the tensors are gradually transformed as they flow through the transformer, but their shape remains the same
    - after going through the decoder, each word representation goes through a final dense layer with a softmax activation
  encoder / decoder
    - contain N stacked modules (N=6 in the paper)
    - components include 2 embedding layers, several skip connections (each followed by a layer normalization), and several feedforward modules composed of 2 dense layers each (1st with ReLU activation & 2nd with no activation)
    - final outputs of whole encoder stack are fed to the decoder at each of the N levels
  output layer:
    - a dense layer using the softmax activation function
  encoder's multi-head attention layer
    - updates each word representation by attending (i.e. paying attention to) all other words in the same sentence
    - this is where vague words (e.g. 'like') become richer, more accurate representations (e.g. "to be fond of")
  decoder's masked multi-head attention layer
    - it is 'causal' (only looks at past and present inputs): when it processes a word, it doesn't attend to words after it 
  decoder's upper multi-head attention layer
    - cross-attention: where it pays attention to words in the English sentence (e.g. it pays attention to 'soccer' when it processes the 'el' in 'el futbol')
  positional encoding
    - after input (English) / output (Spanish) embedding layers, and before N-stacked attention modules 
    - dense vectors (much like word embeddings) that represent the position of each word in a sentence. The nth positional encoding is added to the word embedding of the nth word in the sentence
      - needed because all layers in the transformer architecture ignore word positions, and word positions do matter
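
The causal behaviour of the decoder's masked attention layer can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: `causal_attention_weights` is a hypothetical helper that takes precomputed attention scores and simply leaves future positions out of the softmax, whereas real implementations usually add a very large negative value to the masked scores before the softmax.

```python
import math

def causal_attention_weights(scores):
    """scores[t][i] is the score of position i for the query at position t.

    Returns one row of weights per position, with zero weight on all
    positions that come after the query position.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[:t + 1]  # past and present positions only
        max_score = max(visible)
        exps = [math.exp(s - max_score) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Uniform made-up scores for a 3-word sentence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```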

__book answer:__

The most important layer in the Transformer architecture is the Multi-Head Attention layer (the original Transformer architecture contains 18 of them, including 6 Masked Multi-Head Attention layers). It is at the core of language models such as BERT and GPT-2. Its purpose is to allow the model to identify which words are most aligned with each other, and then improve each word's representation using these contextual clues.

7. When would you need to use sampled softmax?

-> Sampled softmax would be used when the output/target vocabulary is large (e.g. a target vocabulary of 5,000 words), so outputting a probability for each and every possible word would be slow.
-> Sampled softmax looks only at the logits output by the model for the correct word and for a random sample of incorrect words, then computes an approximation of the cross-entropy loss based only on these logits. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.

page 600:
Optimizing the Output Layer
 - when the output vocabulary is large, outputting a probability for each and every possible word can be slow
 solution 1:
   - look only at the logits output by the model for the correct word and for a random sample of incorrect words, then compute the approximation of the loss based only on these logits (sampled softmax technique)
    - in TensorFlow, you can use the tf.nn.sampled_softmax_loss() function for this during training, and use the normal softmax function at inference time (sampled softmax cannot be used at inference time because it requires knowing the target)
  solution 2:
    - tie the weights of the output layer to the transpose of the decoder's embedding matrix (see chap 17)
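
The idea behind sampled softmax can be sketched in plain Python. This is a simplified illustration, not what `tf.nn.sampled_softmax_loss()` computes exactly: the helper names are hypothetical, the logits are random numbers, and the sketch omits the correction for sampling probabilities that real implementations typically apply.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    max_value = max(values)
    return max_value + math.log(sum(math.exp(v - max_value) for v in values))

def full_softmax_loss(logits, target):
    """Cross-entropy loss using the logits of every class."""
    return log_sum_exp(logits) - logits[target]

def sampled_softmax_loss(logits, target, num_sampled, rng):
    """Approximate loss using the target logit and a random sample of other logits."""
    negatives = [i for i in range(len(logits)) if i != target]
    sampled = rng.sample(negatives, num_sampled)
    kept = [logits[target]] + [logits[i] for i in sampled]
    return log_sum_exp(kept) - logits[target]

rng = random.Random(42)
logits = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # made-up logits, 5,000 classes
target = 123

full = full_softmax_loss(logits, target)
approx = sampled_softmax_loss(logits, target, num_sampled=50, rng=rng)
print(f"full loss: {full:.3f}, sampled loss: {approx:.3f}")
```

The sampled version only touches 51 logits instead of 5,000. Without the correction term, its value underestimates the full loss, which is one reason the regular softmax is used once training is done.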

__book answer:__
Sampled softmax is used when training a classification model when there are many classes (e.g., thousands). It computes an approximation of the cross-entropy loss based on the logit predicted by the model for the correct class, and the predicted logits for a sample of incorrect words. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.


In [1]:
import numpy as np
import sklearn
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from scipy.special import expit as sigmoid

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

In [2]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "deep"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

#### 8. 
_Exercise:_ Embedded Reber grammars _were used by Hochreiter and Schmidhuber in [their paper](https://homl.info/93) about LSTMs. They are artificial grammars that produce strings such as "BPBTSXXVPSEPE." Check out Jenny Orr's [nice introduction](https://homl.info/108) to this topic. Choose a particular embedded Reber grammar (such as the one represented on Jenny Orr's page), then train an RNN to identify whether a string respects that grammar or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar, and 50% that don't._

First we need to build a function that generates strings based on a grammar. The grammar will be represented as a list of possible transitions for each state. A transition specifies the string to output (or a grammar to generate it) and the next state.

In [4]:
default_reber_grammar = [
    [("B", 1)],           # (state 0) =B=>(state 1)
    [("T", 2), ("P", 3)], # (state 1) =T=>(state 2) or =P=>(state 3)
    [("S", 2), ("X", 4)], # (state 2) =S=>(state 2) or =X=>(state 4)
    [("T", 3), ("V", 5)], # and so on...
    [("X", 3), ("S", 6)],
    [("P", 4), ("V", 6)],
    [("E", None)]]        # (state 6) =E=>(terminal state)

embedded_reber_grammar = [
    [("B", 1)],
    [("T", 2), ("P", 3)],
    [(default_reber_grammar, 4)],
    [(default_reber_grammar, 5)],
    [("T", 6)],
    [("P", 6)],
    [("E", None)]]

def generate_string(grammar):
    state = 0
    output = []
    while state is not None:
        index = np.random.randint(len(grammar[state]))
        production, state = grammar[state][index]
        if isinstance(production, list):
            production = generate_string(grammar=production)
        output.append(production)
    return "".join(output)

Let's generate a few strings based on the default Reber grammar:

In [5]:
np.random.seed(42)

for _ in range(25):
    print(generate_string(default_reber_grammar), end=" ")

BTXXTTVPXTVPXTTVPSE BPVPSE BTXSE BPVVE BPVVE BTSXSE BPTVPXTTTVVE BPVVE BTXSE BTXXVPSE BPTTTTTTTTVVE BTXSE BPVPSE BTXSE BPTVPSE BTXXTVPSE BPVVE BPVVE BPVVE BPTTVVE BPVVE BPVVE BTXXVVE BTXXVVE BTXXVPXVVE 

Looks good. Now let's generate a few strings based on the embedded Reber grammar:

In [36]:
np.random.seed(42)

for _ in range(25):
    print(generate_string(embedded_reber_grammar), end=" ")

BTBPTTTVPXTVPXTTVPSETE BPBPTVPSEPE BPBPVVEPE BPBPVPXVVEPE BPBTXXTTTTVVEPE BPBPVPSEPE BPBTXXVPSEPE BPBTSSSSSSSXSEPE BTBPVVETE BPBTXXVVEPE BPBTXXVPSEPE BTBTXXVVETE BPBPVVEPE BPBPVVEPE BPBTSXSEPE BPBPVVEPE BPBPTVPSEPE BPBTXXVVEPE BTBPTVPXVVETE BTBPVVETE BTBTSSSSSSSXXVVETE BPBTSSSXXTTTTVPSEPE BTBPTTVVETE BPBTXXTVVEPE BTBTXSETE 

Okay, now we need a function to generate strings that do not respect the grammar. We could generate a random string, but the task would be a bit too easy, so instead we will generate a string that respects the grammar, and we will corrupt it by changing just one character:

In [38]:
POSSIBLE_CHARS = "BEPSTVX"

def generate_corrupted_string(grammar, chars=POSSIBLE_CHARS):
    good_string = generate_string(grammar)
    index = np.random.randint(len(good_string))
    good_char = good_string[index]
    bad_char = np.random.choice(sorted(set(chars) - set(good_char)))
    return good_string[:index] + bad_char + good_string[index + 1:]

In [39]:
np.random.seed(42)

for _ in range(25):
    print(generate_corrupted_string(embedded_reber_grammar), end=" ")

BTBPTTTPPXTVPXTTVPSETE BPBTXEEPE BPBPTVVVEPE BPBTSSSSXSETE BPTTXSEPE BTBPVPXTTTTTTEVETE BPBTXXSVEPE BSBPTTVPSETE BPBXVVEPE BEBTXSETE BPBPVPSXPE BTBPVVVETE BPBTSXSETE BPBPTTTPTTTTTVPSEPE BTBTXXTTSTVPSETE BBBTXSETE BPBTPXSEPE BPBPVPXTTTTVPXTVPXVPXTTTVVEVE BTBXXXTVPSETE BEBTSSSSSXXVPXTVVETE BTBXTTVVETE BPBTXSTPE BTBTXXTTTVPSBTE BTBTXSETX BTBTSXSSTE 
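
As a sanity check on the generated data, it can help to verify directly whether a string respects a grammar. The sketch below is an assumption-laden addition, not part of the original solution: `match` and `respects_grammar` are hypothetical helpers, the grammars are repeated so the block is self-contained, and the parser relies on the fact that each state's transitions start with distinct characters (true for these two grammars).

```python
default_reber_grammar = [
    [("B", 1)],
    [("T", 2), ("P", 3)],
    [("S", 2), ("X", 4)],
    [("T", 3), ("V", 5)],
    [("X", 3), ("S", 6)],
    [("P", 4), ("V", 6)],
    [("E", None)]]

embedded_reber_grammar = [
    [("B", 1)],
    [("T", 2), ("P", 3)],
    [(default_reber_grammar, 4)],
    [(default_reber_grammar, 5)],
    [("T", 6)],
    [("P", 6)],
    [("E", None)]]

def match(grammar, s, pos=0):
    """Parse `grammar` starting at s[pos]; return the end position, or None."""
    def step(state, pos):
        if state is None:          # terminal state reached
            return pos
        for production, next_state in grammar[state]:
            if isinstance(production, list):     # embedded sub-grammar
                end = match(production, s, pos)
            elif s.startswith(production, pos):  # plain character
                end = pos + len(production)
            else:
                end = None
            if end is not None:
                result = step(next_state, end)
                if result is not None:
                    return result
        return None
    return step(0, pos)

def respects_grammar(grammar, s):
    """True if the whole string can be generated by the grammar."""
    return match(grammar, s) == len(s)

print(respects_grammar(default_reber_grammar, "BTXSE"))       # True
print(respects_grammar(embedded_reber_grammar, "BPBPVVEPE"))  # True
print(respects_grammar(embedded_reber_grammar, "BTBPVVEPE"))  # False
```

The last string is rejected because its second character (T) does not match its second-to-last character (P).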

We cannot feed strings directly to an RNN, so we need to encode them somehow. One option would be to one-hot encode each character. Another option is to use embeddings. Let's go for the second option (but since there are just a handful of characters, one-hot encoding would probably be a good option as well). For embeddings to work, we need to convert each string into a sequence of character IDs. Let's write a function for that, using each character's index in the string of possible characters "BEPSTVX":

In [41]:
def string_to_ids(s, chars=POSSIBLE_CHARS):
    return [chars.index(c) for c in s]

In [42]:
string_to_ids("BTTTXXVVETE")

[0, 4, 4, 4, 6, 6, 5, 5, 1, 4, 1]

We can now generate the dataset, with 50% good strings, and 50% bad strings:

In [44]:
def generate_dataset(size):
    good_strings = [
        string_to_ids(generate_string(embedded_reber_grammar))
        for _ in range(size // 2)
    ]
    bad_strings = [
        string_to_ids(generate_corrupted_string(embedded_reber_grammar))
        for _ in range(size - size // 2)
    ]
    all_strings = good_strings + bad_strings
    X = tf.ragged.constant(all_strings, ragged_rank=1)
    y = np.array([[1.] for _ in range(len(good_strings))] +
                 [[0.] for _ in range(len(bad_strings))])
    return X, y

In [45]:
np.random.seed(42)

X_train, y_train = generate_dataset(10000)
X_valid, y_valid = generate_dataset(2000)

Let's take a look at the first training sequence:

In [46]:
X_train[0]

<tf.Tensor: shape=(22,), dtype=int32, numpy=array([0, 4, 0, 2, 4, 4, 4, 5, 2, 6, 4, 5, 2, 6, 4, 4, 5, 2, 3, 1, 4, 1])>

In [47]:
y_train[0]

array([1.])

Perfect! We are ready to create the RNN to identify good strings. We build a simple sequence binary classifier:

In [48]:
np.random.seed(42)
tf.random.set_seed(42)

embedding_size = 5

def to_padded(X_ragged):
    # Some Keras versions reject ragged inputs in the Embedding layer, so pad
    # each batch with zeros instead. IDs are shifted by 1 to reserve ID 0 for
    # the padding token.
    return (X_ragged + 1).to_tensor().numpy()

model = tf.keras.Sequential([
    tf.keras.Input(shape=[None], dtype="int32"),
    tf.keras.layers.Embedding(input_dim=len(POSSIBLE_CHARS) + 1,
                              output_dim=embedding_size,
                              mask_zero=True),  # the GRU ignores padded steps
    tf.keras.layers.GRU(30),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.95,
                                    nesterov=True)
model.compile(loss="binary_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(to_padded(X_train), y_train, epochs=20,
                    validation_data=(to_padded(X_valid), y_valid))

Now let's test our RNN on two tricky strings: the first one is bad while the second one is good. They only differ by the second to last character. If the RNN gets this right, it shows that it managed to notice the pattern that the second letter should always be equal to the second to last letter. That requires a fairly long short-term memory (which is the reason why we used a GRU cell).

In [None]:
test_strings = ["BPBTSSSSSSSXXTTVPXVPXTTTTTVVETE",
                "BPBTSSSSSSSXXTTVPXVPXTTTTTVVEPE"]
X_test = to_padded(tf.ragged.constant([string_to_ids(s) for s in test_strings],
                                      ragged_rank=1))

y_proba = model.predict(X_test)
print()
print("Estimated probability that these are Reber strings:")
for index, string in enumerate(test_strings):
    print("{}: {:.2f}%".format(string, 100 * y_proba[index][0]))

Ta-da! It worked fine. The RNN found the correct answers with very high confidence. :)

#### 9.   
_Exercise: Train an Encoder–Decoder model that can convert a date string from one format to another (e.g., from "April 22, 2019" to "2019-04-22")._

Let's start by creating the dataset. We will use random days between 1000-01-01 and 9999-12-31:

In [50]:
from datetime import date

# cannot use strftime()'s %B format since it depends on the locale
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y

Here are a few random dates, displayed in both the input format and the target format:

In [53]:
np.random.seed(42)

n_dates = 3
x_example, y_example = random_dates(n_dates)
print("{:25s}{:25s}".format("Input", "Target"))
print("-" * 50)
for idx in range(n_dates):
    print("{:25s}{:25s}".format(x_example[idx], y_example[idx]))

Input                    Target                   
--------------------------------------------------
September 20, 7075       7075-09-20               
May 15, 8579             8579-05-15               
January 11, 7103         7103-01-11               
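
To make the two formats explicit, here is a small standard-library sketch that formats a single date the same way `random_dates()` does. The helper name `format_pair()` is an assumption made for this illustration, and it builds the day with an f-string rather than `strftime()`, which should give the same zero-padded result:

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def format_pair(dt):
    """Return (input_format, target_format) for one date."""
    x = f"{MONTHS[dt.month - 1]} {dt.day:02d}, {dt.year}"
    y = dt.isoformat()
    return x, y

# Bounds of the sampling range, expressed as proleptic Gregorian ordinals.
min_ordinal = date(1000, 1, 1).toordinal()
max_ordinal = date(9999, 12, 31).toordinal()

print(format_pair(date(7075, 9, 20)))
print(format_pair(date.fromordinal(min_ordinal)))
print("days in range:", max_ordinal - min_ordinal + 1)
```

Note that `np.random.randint(n)` draws integers from 0 to n - 1, so `random_dates()` as written can never produce the very last day, 9999-12-31. That hardly matters for this exercise, but it is worth knowing.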


Let's get the list of all possible characters in the inputs:

In [54]:
INPUT_CHARS = "".join(sorted(set("".join(MONTHS) + "0123456789, ")))
INPUT_CHARS

' ,0123456789ADFJMNOSabceghilmnoprstuvy'

And here's the list of possible characters in the outputs:

In [56]:
OUTPUT_CHARS = "0123456789-"

Let's write a function to convert a string to a list of character IDs, as we did in the previous exercise:

In [58]:
def date_str_to_ids(date_str, chars=INPUT_CHARS):
    return [chars.index(c) for c in date_str]

In [59]:
date_str_to_ids(x_example[0], INPUT_CHARS)

[19, 23, 31, 34, 23, 28, 21, 23, 32, 0, 4, 2, 1, 0, 9, 2, 9, 7]

In [60]:
date_str_to_ids(y_example[0], OUTPUT_CHARS)

[7, 0, 7, 5, 10, 0, 9, 10, 2, 0]

In [61]:
def prepare_date_strs(date_strs, chars=INPUT_CHARS):
    X_ids = [date_str_to_ids(dt, chars) for dt in date_strs]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
    return (X + 1).to_tensor() # using 0 as the padding token ID

def create_dataset(n_dates):
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, INPUT_CHARS), prepare_date_strs(y, OUTPUT_CHARS)

In [62]:
np.random.seed(42)

X_train, Y_train = create_dataset(10000)
X_valid, Y_valid = create_dataset(2000)
X_test, Y_test = create_dataset(2000)

In [63]:
Y_train[0]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([ 8,  1,  8,  6, 11,  1, 10, 11,  3,  1])>
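
The encoding can be reproduced without TensorFlow, which makes the ID shift easier to see. This is only a sketch: `encode()` and `decode()` are hypothetical names, and `decode()` renders the padding ID 0 as `?`, which is an arbitrary choice for display:

```python
INPUT_CHARS = " ,0123456789ADFJMNOSabceghilmnoprstuvy"
OUTPUT_CHARS = "0123456789-"

def encode(date_str, chars):
    """Map each character to its index + 1, keeping ID 0 free for padding."""
    return [chars.index(c) + 1 for c in date_str]

def decode(ids, chars):
    """Inverse mapping; the padding ID 0 is shown as '?'."""
    return "".join(("?" + chars)[i] for i in ids)

target_ids = encode("7075-09-20", OUTPUT_CHARS)
print(target_ids)                        # same IDs as Y_train[0] above
print(decode(target_ids, OUTPUT_CHARS))  # back to the original string
```

Because every ID is shifted by one, the embedding and output layers need `len(chars) + 1` entries rather than `len(chars)`.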

First version: a very basic seq2seq model

Let's first try the simplest possible model. The input sequence first goes through the encoder (an embedding layer followed by a single LSTM layer), which outputs a vector. That vector then goes through the decoder (a single LSTM layer followed by a dense output layer), which outputs a sequence of vectors, each representing the estimated probabilities for all possible output characters.

Since the decoder expects a sequence as input, we repeat the vector (which is output by the encoder) as many times as the longest possible output sequence.

In [64]:
embedding_size = 32
max_output_length = Y_train.shape[1]

np.random.seed(42)
tf.random.set_seed(42)

encoder = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(INPUT_CHARS) + 1,
                           output_dim=embedding_size,
                           input_shape=[None]),
    tf.keras.layers.LSTM(128)
])

decoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(len(OUTPUT_CHARS) + 1, activation="softmax")
])

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.RepeatVector(max_output_length),
    decoder
])

optimizer = tf.keras.optimizers.Nadam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(X_train, Y_train, epochs=20,
                    validation_data=(X_valid, Y_valid))

Epoch 1/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 11s 19ms/step - accuracy: 0.2996 - loss: 1.9909 - val_accuracy: 0.4997 - val_loss: 1.4815
Epoch 2/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.5575 - loss: 1.2604 - val_accuracy: 0.6684 - val_loss: 0.9019
Epoch 3/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.6887 - loss: 0.8251 - val_accuracy: 0.7338 - val_loss: 0.6843
Epoch 4/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.7569 - loss: 0.6109 - val_accuracy: 0.8299 - val_loss: 0.4428
Epoch 5/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step - accuracy: 0.8531 - loss: 0.3901 - val_accuracy: 0.9063 - val_loss: 0.2748
Epoch 6/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step - accuracy: 0.7826 - loss: 0.6842 - val_accuracy: 0.8959 - val_loss: 0.3424
Epoch 7/20
...

Looks great: we reach 100% validation accuracy (actually 99.99%)! Let's use the model to make some predictions. We will need to be able to convert a sequence of character IDs to a readable string:

In [66]:
def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    return ["".join([("?" + chars)[index] for index in sequence])
            for sequence in ids]

Now we can use the model to convert some dates:

In [67]:
X_new = prepare_date_strs(["September 17, 2009", "July 14, 1789"])

In [68]:
ids = model.predict(X_new).argmax(axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 401ms/step
2009-09-17
1789-07-14


However, since the model was only trained on input strings of length 18 (which is the length of the longest date), it does not perform well if we try to use it to make predictions on shorter sequences:

In [69]:
X_new = prepare_date_strs(["May 02, 2020", "July 14, 1789"])

In [70]:
ids = model.predict(X_new).argmax(axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 393ms/step
2020-02-02
1789-09-14


Oops! We need to ensure that we always pass sequences of the same length as during training, using padding if necessary. Let's write a little helper function for that:

In [71]:
max_input_length = X_train.shape[1]

def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0], [0, max_input_length - X.shape[1]]])
    return X

def convert_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    ids = model.predict(X).argmax(axis=-1)
    return ids_to_date_strs(ids)

In [72]:
convert_date_strs(["May 02, 2020", "July 14, 1789"])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step


['2020-05-02', '1789-07-14']
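
To see what the padding does, here is a plain-Python sketch of the same idea. The names `encode()` and `pad_right()` are made up for this example, and the length 18 is hard-coded on the assumption that it matches `max_input_length` (the text above says the longest date has 18 characters):

```python
INPUT_CHARS = " ,0123456789ADFJMNOSabceghilmnoprstuvy"
MAX_INPUT_LENGTH = 18  # assumed: len("September 17, 2009")

def encode(date_str, chars=INPUT_CHARS):
    """Character IDs shifted by 1, so that 0 can serve as the padding ID."""
    return [chars.index(c) + 1 for c in date_str]

def pad_right(ids, length=MAX_INPUT_LENGTH, pad_id=0):
    """Append padding IDs until the sequence has the requested length."""
    if len(ids) > length:
        raise ValueError("sequence is longer than the padded length")
    return ids + [pad_id] * (length - len(ids))

short = pad_right(encode("May 02, 2020"))
longest = pad_right(encode("September 17, 2009"))
print(short)
print(len(short), len(longest))
```

The short date ends with six zeros, which is what the encoder saw for short dates during training, since `to_tensor()` pads every training sequence to the length of the longest one.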

Cool! Granted, there are certainly much easier ways to write a date conversion tool (e.g., using regular expressions or even basic string manipulation), but you have to admit that using neural networks is way cooler. ;-)
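
For comparison, such a non-neural baseline might look like the following sketch. It assumes the input always follows the "Month DD, YYYY" pattern used in this exercise, and `convert_date_str()` is a hypothetical helper, not something defined elsewhere in this notebook:

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Month name, one or two day digits, four year digits.
DATE_PATTERN = re.compile(r"([A-Z][a-z]+) (\d{1,2}), (\d{4})")

def convert_date_str(date_str):
    """Convert e.g. 'April 22, 2019' to '2019-04-22'."""
    match = DATE_PATTERN.fullmatch(date_str)
    if match is None:
        raise ValueError(f"unexpected date format: {date_str!r}")
    month_name, day, year = match.groups()
    month = MONTHS.index(month_name) + 1  # raises ValueError for unknown months
    return f"{year}-{month:02d}-{int(day):02d}"

for s in ["April 22, 2019", "May 02, 2020", "July 14, 1789"]:
    print(s, "->", convert_date_str(s))
```

Unlike the neural network, this version does not need padding and fails loudly on malformed input instead of guessing.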

However, real-life sequence-to-sequence problems will usually be harder, so for the sake of completeness, let's build a more powerful model.

Second version: feeding the shifted targets to the decoder (teacher forcing)

Instead of feeding the decoder a simple repetition of the encoder's output vector, we can feed it the target sequence, shifted by one time step to the right. This way, at each time step the decoder will know what the previous target character was. This should help us tackle more complex sequence-to-sequence problems.

Since the first output character of each target sequence has no previous character, we will need a new token to represent the start-of-sequence (sos).

During inference, we won't know the target, so what will we feed the decoder? We can just predict one character at a time, starting with an sos token, then feeding the decoder all the characters that were predicted so far (we will look at this in more detail later in this notebook).

But if the decoder's LSTM expects to get the previous target as input at each step, how shall we pass it the vector output by the encoder? Well, one option is to ignore the output vector, and instead use the encoder's LSTM state as the initial state of the decoder's LSTM (which requires the encoder's LSTM to have the same number of units as the decoder's LSTM).

Now let's create the decoder's inputs (for training, validation and testing). The sos token will be represented using the last possible output character's ID + 1.

In [74]:
sos_id = len(OUTPUT_CHARS) + 1

def shifted_output_sequences(Y):
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)

X_train_decoder = shifted_output_sequences(Y_train)
X_valid_decoder = shifted_output_sequences(Y_valid)
X_test_decoder = shifted_output_sequences(Y_test)

Let's take a look at the decoder's training inputs:

In [75]:
X_train_decoder

<tf.Tensor: shape=(10000, 10), dtype=int32, numpy=
array([[12,  8,  1, ..., 10, 11,  3],
       [12,  9,  6, ...,  6, 11,  2],
       [12,  8,  2, ...,  2, 11,  2],
       ...,
       [12, 10,  8, ...,  2, 11,  4],
       [12,  2,  2, ...,  3, 11,  3],
       [12,  8,  9, ...,  8, 11,  3]])>
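
To make the shift concrete, here is the same operation on a single sequence in plain Python. `shift_right()` is a hypothetical helper that mirrors `shifted_output_sequences()` for one list of IDs; the example target is the first training target shown earlier (`7075-09-20`):

```python
OUTPUT_CHARS = "0123456789-"
SOS_ID = len(OUTPUT_CHARS) + 1  # 12, one past the last character ID

def shift_right(target_ids, sos_id=SOS_ID):
    """Prepend the sos token and drop the last target ID."""
    return [sos_id] + target_ids[:-1]

target = [8, 1, 8, 6, 11, 1, 10, 11, 3, 1]  # IDs of "7075-09-20"
decoder_input = shift_right(target)

# At each step the decoder sees the previous target and must predict the current one.
for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder input {seen:2d} -> target {expected:2d}")
```

The decoder input has the same length as the target, and the final target ID is never fed in, since nothing needs to be predicted after it.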

Now let's build the model. It's not a simple sequential model anymore, so let's use the functional API:

In [76]:
encoder_embedding_size = 32
decoder_embedding_size = 32
lstm_units = 128

np.random.seed(42)
tf.random.set_seed(42)

encoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(INPUT_CHARS) + 1,
    output_dim=encoder_embedding_size)(encoder_input)
_, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(
    lstm_units, return_state=True)(encoder_embedding)
encoder_state = [encoder_state_h, encoder_state_c]

decoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(OUTPUT_CHARS) + 2,
    output_dim=decoder_embedding_size)(decoder_input)
decoder_lstm_output = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(
    decoder_embedding, initial_state=encoder_state)
decoder_output = tf.keras.layers.Dense(len(OUTPUT_CHARS) + 1,
                                    activation="softmax")(decoder_lstm_output)

model = tf.keras.Model(inputs=[encoder_input, decoder_input],
                           outputs=[decoder_output])

optimizer = tf.keras.optimizers.Nadam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit([X_train, X_train_decoder], Y_train, epochs=10,
                    validation_data=([X_valid, X_valid_decoder], Y_valid))

Epoch 1/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 11s 18ms/step - accuracy: 0.3378 - loss: 1.8357 - val_accuracy: 0.6078 - val_loss: 1.0792
Epoch 2/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.6881 - loss: 0.8767 - val_accuracy: 0.8545 - val_loss: 0.4369
Epoch 3/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.8977 - loss: 0.3308 - val_accuracy: 0.9287 - val_loss: 0.2425
Epoch 4/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.9809 - loss: 0.1116 - val_accuracy: 0.9956 - val_loss: 0.0517
Epoch 5/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.9978 - loss: 0.0402 - val_accuracy: 0.9989 - val_loss: 0.0249
Epoch 6/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 18ms/step - accuracy: 0.9958 - loss: 0.0351 - val_accuracy: 0.9980 - val_loss: 0.0286
Epoch 7/10
...

This model also reaches 100% validation accuracy, but it does so even faster.

Let's once again use the model to make some predictions. This time we need to predict characters one by one.

In [78]:
sos_id = len(OUTPUT_CHARS) + 1

def predict_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    Y_pred = tf.fill(dims=(len(X), 1), value=sos_id)
    for index in range(max_output_length):
        pad_size = max_output_length - Y_pred.shape[1]
        X_decoder = tf.pad(Y_pred, [[0, 0], [0, pad_size]])
        Y_probas_next = model.predict([X, X_decoder])[:, index:index+1]
        Y_pred_next = tf.argmax(Y_probas_next, axis=-1, output_type=tf.int32)
        Y_pred = tf.concat([Y_pred, Y_pred_next], axis=1)
    return ids_to_date_strs(Y_pred[:, 1:])

In [79]:
predict_date_strs(["July 14, 1789", "May 01, 2020"])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 359ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step


['1789-07-14', '2020-05-01']

Works fine! Next, feel free to write a Transformer version. :)
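
The decoding loop itself does not depend on Keras, so it can be sketched with a stand-in for the model. Everything about `toy_next_token_probs()` below is an assumption made for illustration: it ignores the encoder input entirely and always spells out the IDs of `1789-07-14`. The part worth looking at is `greedy_decode()`, which feeds back its own predictions one step at a time:

```python
OUTPUT_CHARS = "0123456789-"
SOS_ID = len(OUTPUT_CHARS) + 1   # 12
VOCAB_SIZE = len(OUTPUT_CHARS) + 1  # IDs 0..11, as in the dense output layer
MAX_OUTPUT_LENGTH = 10

def toy_next_token_probs(decoder_prefix):
    """Hypothetical stand-in for the trained model (not a real network)."""
    answer = [2, 8, 9, 10, 11, 1, 8, 11, 2, 5]  # IDs of "1789-07-14"
    step = len(decoder_prefix) - 1  # the prefix starts with the sos token
    probs = [0.01] * VOCAB_SIZE
    probs[answer[step]] = 1.0 - 0.01 * (VOCAB_SIZE - 1)
    return probs

def greedy_decode(next_token_probs, max_length=MAX_OUTPUT_LENGTH, sos_id=SOS_ID):
    """Repeatedly pick the most likely next ID and append it to the prefix."""
    prefix = [sos_id]
    for _ in range(max_length):
        probs = next_token_probs(prefix)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        prefix.append(next_id)
    return prefix[1:]  # drop the sos token

decoded_ids = greedy_decode(toy_next_token_probs)
decoded_text = "".join(("?" + OUTPUT_CHARS)[i] for i in decoded_ids)
print(decoded_ids)
print(decoded_text)
```

`predict_date_strs()` above re-runs the whole decoder on the padded prefix at every step, which is simple but does redundant work; a more efficient implementation would probably carry the LSTM state forward from one step to the next instead.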

#### 10.
_Exercise: Go through Keras's tutorial for [Natural language image search with a Dual Encoder](https://homl.info/dualtuto). You will learn how to build a model capable of representing both images and text within the same embedding space. This makes it possible to search for images using a text prompt, like in the [CLIP model](https://openai.com/blog/clip/) by OpenAI._

This example requires TensorFlow 2.4 or higher. In addition, TensorFlow Hub and TensorFlow Text are required for the BERT model, and TensorFlow Addons is required for the AdamW optimizer. These libraries can be installed using the following command:

In [81]:
!pip install -q -U tensorflow-hub tensorflow-text tensorflow-addons

ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)
ERROR: No matching distribution found for tensorflow-text


In [82]:
!pip install -q -U tensorflow-hub

In [83]:
!pip install -q -U tensorflow-text

ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)
ERROR: No matching distribution found for tensorflow-text


In [84]:
! python --version

Python 3.11.5


It looks like there is no option for installing tensorflow-text on Windows 10:
https://discuss.tensorflow.org/t/cant-install-tensorflow-text-2-11-0/13288/3

#### 11. 
_Exercise: Use the Hugging Face Transformers library to download a pretrained language model capable of generating text (e.g., GPT), and try generating more convincing Shakespearean text. You will need to use the model's `generate()` method; see Hugging Face's documentation for more details._

First, let's load a pretrained model. In this example, we will use OpenAI's GPT model, with an additional Language Model on top (just a linear layer with weights tied to the input embeddings). Let's import it and load the pretrained weights (this will download roughly 479MB of model weights, which recent versions of the library cache under `~/.cache/huggingface` by default):

In [86]:
from transformers import TFOpenAIGPTLMHeadModel

model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")




model.safetensors:   0%|          | 0.00/479M [00:00<?, ?B/s]




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFOpenAIGPTLMHeadModel: ['h.10.attn.bias', 'h.11.attn.bias', 'h.0.attn.bias', 'h.3.attn.bias', 'h.2.attn.bias', 'h.5.attn.bias', 'h.8.attn.bias', 'h.6.attn.bias', 'h.7.attn.bias', 'h.1.attn.bias', 'h.9.attn.bias', 'h.4.attn.bias']
- This IS expected if you are initializing TFOpenAIGPTLMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFOpenAIGPTLMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFOpenAIGPTLMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFOpenAIGPTLMHeadModel for predicti

Next we will need a specialized tokenizer for this model. This one will try to use the spaCy and ftfy libraries if they are installed, or else it will fall back to BERT's BasicTokenizer followed by Byte-Pair Encoding (which should be fine for most use cases).

In [87]:
from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


In [88]:
tokenizer("hello everyone")

{'input_ids': [3570, 1473], 'attention_mask': [1, 1]}

In [89]:
prompt_text = "This royal throne of kings, this sceptred isle"
encoded_prompt = tokenizer.encode(prompt_text,
                                  add_special_tokens=False,
                                  return_tensors="tf")
encoded_prompt

<tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187]])>

Easy! Next, let's use the model to generate text after the prompt. We will generate 5 different sentences, each starting with the prompt text, followed by 40 additional tokens. For an explanation of what all the hyperparameters do, make sure to check out this great [blog post](https://huggingface.co/blog/how-to-generate) by Patrick von Platen (from Hugging Face). You can play around with the hyperparameters to try to obtain better results.
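
To give some intuition for two of these hyperparameters, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling. It is not the code used by the transformers library, just one straightforward way to implement the idea, and the function names and toy logits are made up for this example:

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token ID from the filtered distribution."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

toy_logits = [4.0, 3.0, 1.0, 0.0, -2.0]  # made-up scores for a 5-token vocabulary
cold = top_p_filter(softmax_with_temperature(toy_logits, 1.0), 0.9)
hot = top_p_filter(softmax_with_temperature(toy_logits, 2.0), 0.9)
print("temperature 1.0 keeps tokens:", sorted(cold))
print("temperature 2.0 keeps tokens:", sorted(hot))
print("sampled token:", sample_token(toy_logits))
```

With the toy logits, a temperature of 1.0 and `top_p=0.9` keep only the two most likely tokens, while a temperature of 2.0 flattens the distribution enough for a third token to enter the nucleus.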

In [90]:
num_sequences = 5
length = 40

generated_sequences = model.generate(
    input_ids=encoded_prompt,
    do_sample=True,
    max_length=length + len(encoded_prompt[0]),
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    num_return_sequences=num_sequences,
)

generated_sequences

<tf.Tensor: shape=(5, 50), dtype=int32, numpy=
array([[  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   239, 40477,   556,   524,  1724,  4338,   504,   481,
          831,  1520,   546,   535, 12003,  4374,   240, 10341,   535,
         2236,  1404, 21760,   239,   998,   524,  2170,  1063,  1098,
          833,   604, 21617,   575,   240,   524,  9093,   626,   595,
         4203,  1129,   239,   487,  1787],
       [  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   636,  1300,   666,   246, 12065,  1276,   240,   488,
          557,   622, 14404,  1546,  1260,   481, 19995,   240, 14404,
        30599,   636,  3226,   239,   500,   616,   638,   507,   509,
         1816,   525,  1007,  1594,   636,  1443,   580,  1632,   240,
          481,  6391,  7876,   498,   481],
       [  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   239,   645,   512,   640,   246,  1250,   498,   547,
         3766

Now let's decode the generated sequences and print them:

In [91]:
for sequence in generated_sequences:
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces=True)
    print(text)
    print("-" * 80)

this royal throne of kings, this sceptred isle. 
 with his gaze fixed on the marquise's fearful features, domen's plan finally yielded. though his foot might once again have steeled him, his resolve did not dare work. he continued
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle would turn into a hostile world, and as our wretched ship walked the shores, wretched alliances would begin. in this way it was known that another land would soon be lost, the twin cities of the
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle. if you are a part of my court and not my friends, then consider your words.'
'i have no interest in governing a kingdom that craves nothing but to rule,'he
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle, is the fifth king of mallorea, t

You can try more recent (and larger) models, such as GPT-2, CTRL, Transformer-XL or XLNet, which are all available as pretrained models in the transformers library, including variants with Language Models on top. The preprocessing steps vary slightly between models, so make sure to check out this [generation example](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py) from the transformers documentation (this example uses PyTorch, but it will work with very few tweaks, such as adding `TF` at the beginning of the model class name, removing the `.to()` method calls, and using `return_tensors="tf"` instead of `"pt"`).