# Using Text Generation to Generate Hamlet-Like Text

### Setup & Imports

In [None]:
import sys
import tensorflow as tf
import matplotlib.pyplot as plt

# Reading Hamlet into Python

In [None]:
# Reading and opening the file
with open("/content/drive/MyDrive/6 Spring 2024/CSC402/Chapter16/Text_Generation/Hamlet_textfile.txt") as f:
  hamlet_text = f.read()

## Converting text to lowercase
- Shrinks the vocabulary, since upper- and lowercase letters share one ID
- A smaller vocabulary makes the model simpler and easier to train

In [None]:
hamlet_text = hamlet_text.lower()

In [None]:
# Preview of Hamlet text
print(hamlet_text[:80])

the tragedie of hamlet

actus primus. scoena prima.

enter barnardo and francisc


# Encoding the Characters
- Creating a mapping of unique characters to integers (starting at 2, since 0 and 1 are reserved for padding and unknown characters)
  - And vice versa
- Use the mapping to encode the text before training
- Use it again to decode the generated text after prediction
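As a rough illustration of the mapping described above, here is a hand-rolled sketch in plain Python. It is not what `TextVectorization` does internally (that layer builds its own vocabulary order); the names `sample_text`, `char_to_id`, `id_to_char`, `encode`, and `decode` are made up for this example.

```python
# Toy text standing in for the play; the notebook itself uses hamlet_text.
sample_text = "to be or not to be"

# Sorted unique characters give a stable character -> ID mapping.
# IDs start at 2 to mirror the reserved 0 (pad) and 1 (unknown) slots.
chars = sorted(set(sample_text))
char_to_id = {ch: i + 2 for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(text):
    """Turn a string into a list of integer character IDs."""
    return [char_to_id[ch] for ch in text]

def decode(ids):
    """Turn a list of integer character IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)

encoded_sample = encode(sample_text)
print(encoded_sample[:5])      # IDs for "to be"
print(decode(encoded_sample))  # round-trips to the original text
```

Encoding and then decoding should always give back the original string, which is a handy sanity check before training.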

In [None]:
# Shows all characters after converting to lower case
"".join(sorted(set(hamlet_text.lower())))

"\n !&'(),-.1:;?[]abcdefghijklmnopqrstuvwxyz"

In [None]:
print(len("".join(sorted(set(hamlet_text.lower())))))

42


Now that we've seen all the characters, we want to tokenize the text.

**Tokenize** = convert text into a sequence of integers, with each character getting its own number
- We're mapping characters to integer sequences

Here we generate text at the character level, which keeps the vocabulary small (42 tokens)

In [None]:
text_vec_layer = tf.keras.layers.TextVectorization(split='character',
                                                   standardize='lower')

text_vec_layer.adapt([hamlet_text])

encoded = text_vec_layer([hamlet_text])[0]
# [0] selects the single encoded sequence from the batch of one text
# encoded is now a 1D tensor of character IDs

Next we want to drop the tokens 0 (pad) and 1 (unknown)
- Pad is used to make sequences equal length
- Unknown stands in for any character that is not in the vocabulary
- Neither occurs here, so we shift every ID down by 2

In [None]:
encoded -= 2

In [None]:
len(encoded)

162849

The vocabulary size (number of tokens) also drops by 2, since tokens 0 and 1 are no longer counted

In [None]:
n_tokens = text_vec_layer.vocabulary_size() - 2
n_tokens

42

In [None]:
# Total number of characters
dataset_size = len(encoded)
dataset_size

162849

- "to be or not to b", for example, can be the input window
- Target = sequence of character IDs for the same window shifted one character ahead ("o be or not to be")


- Input and target must be the same length
- Target = what you're predicting
- Window = what you feed the model to get the target (input)
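The input/target relationship can be sketched on plain strings before moving to `tf.data`. This is only an illustration; `make_windows` is a hypothetical helper, not part of the notebook.

```python
def make_windows(text, length):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for start in range(len(text) - length):
        chunk = text[start:start + length + 1]  # length + 1 characters
        pairs.append((chunk[:-1], chunk[1:]))   # drop last char / drop first char
    return pairs

pairs = make_windows("to be or not to be", length=17)
print(pairs[0])
# ('to be or not to b', 'o be or not to be')
```

A text of n characters yields n - length such pairs, because every window needs one extra character for the target.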

to_dataset function:
- Takes a sequence as input (the encoded text)
- Creates a dataset of windows, each length + 1 characters long, so every window includes the next character for the target
- Shuffles the windows (optional), batches them, then splits each batch into input/target pairs
- Finally turns on prefetching

In [None]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
  ds = tf.data.Dataset.from_tensor_slices(sequence)
  ds = ds.window(length + 1, shift = 1, drop_remainder=True)
  ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))

  if shuffle:
    ds = ds.shuffle(100_000, seed=seed)
  ds = ds.batch(batch_size)
  return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

# Creating the Training Sequences
- Converting the text to sequences of characters
- Each input sequence is a fixed length of characters (for example, 100)
- The corresponding target sequence is the same sequence shifted by 1 character

In [None]:
length = 100
tf.random.set_seed(42)

# Split points computed from the dataset size (162,849 characters)
train_end = int(0.90 * dataset_size)
valid_end = int(0.95 * dataset_size)

train_set = to_dataset(encoded[:train_end], length=length, shuffle=True, seed=42)
# Set aside 90% of text for training

valid_set = to_dataset(encoded[train_end:valid_end], length=length)
# Use 5% for validation

test_set = to_dataset(encoded[valid_end:], length=length)
# And use last 5% for testing

# Building the Char-RNN Model
- Defining a Sequential model in Keras with a recurrent layer (a GRU in the code below)
- Embedding layer is the first layer, which encodes the character IDs
  - Window batches are fed into the model here
  - Number of input dimensions = number of distinct character IDs
    - Inputs are 2D tensors of shape [batch size, window length]
  - Number of output dimensions = hyperparameter we can tune (set to 16 for now)
    - Outputs are 3D tensors of shape [batch size, window length, embedding size]
- Middle layers are hidden
  - If accuracy is poor, adding layers or units is one thing to try
- Dense layer with a softmax activation function is the last (prediction) layer
  - Must have n_tokens units, one per unique character in the text
    - In a classifier, the last layer's # neurons = # possible predictions
  - Softmax turns the layer's scores into a probability for each possible character at each time step
    - The n_tokens output probabilities sum to 1 at each time step
    - Softmax does not pick the character; taking the most likely one (argmax) or sampling is a separate step
- Then we compile the model with 'sparse_categorical_crossentropy' loss and the Nadam optimizer
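To make the output layer and the loss more concrete, here is a small plain-Python sketch of softmax and of the sparse categorical crossentropy for one time step. The three scores and the `target_id` are invented for illustration; the real model has n_tokens scores per time step.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # hypothetical raw scores for 3 characters
probas = softmax(scores)

# Picking the most likely character is a separate argmax step.
best = max(range(len(probas)), key=probas.__getitem__)

# Sparse categorical crossentropy: negative log of the probability
# assigned to the true character ID (an integer, not a one-hot vector).
target_id = 1              # hypothetical true next character
loss = -math.log(probas[target_id])

print(probas, sum(probas), best, loss)
```

The loss shrinks toward 0 as the probability given to the true character approaches 1.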

# **HOW DO I USE LSTM WITH THIS MODEL?**

In [None]:
# Defining the Model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation='softmax')
])
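One possible answer to the LSTM question above: `tf.keras.layers.LSTM` accepts the same `units` and `return_sequences` arguments as the GRU layer, so it can be swapped in directly. The sketch below is untrained and only checks shapes; `lstm_model` and `dummy_batch` are names invented for this example, and `n_tokens` is hard-coded to the value found earlier.

```python
import tensorflow as tf

n_tokens = 42  # vocabulary size found earlier in this notebook

# Same architecture as the GRU model, with an LSTM as the recurrent layer.
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation='softmax')
])

# A dummy batch of 2 windows with 10 character IDs each.
dummy_batch = tf.zeros((2, 10), dtype=tf.int32)
print(lstm_model(dummy_batch).shape)  # one probability vector per time step
```

Whether the LSTM beats the GRU on this text is an open question that would need its own training run.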

In [None]:
# Loss function
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam',
              metrics=['accuracy'])

# Checkpoint = save best only
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    'hamlet_model', monitor='val_accuracy', save_best_only=True)

In [None]:
history = model.fit(train_set, validation_data=valid_set, epochs=5,
                    callbacks=[model_ckpt])

Epoch 1/5
   5086/Unknown - 652s 123ms/step - loss: 1.6596 - accuracy: 0.5020
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

- loss = 1.1157
- accuracy = 0.6563

- The above model doesn't handle text preprocessing
- The model below does
  - It wraps the trained model in a final model with the tf.keras.layers.TextVectorization layer as the first layer
  - A tf.keras.layers.Lambda layer then subtracts 2 from the character IDs
    - Since we're not using the padding and unknown tokens

In [None]:
hamlet_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2), # No pad or unknown characters
    model
])

# Training the Model
- Compiling the model with a suitable loss function and optimizer
- Uses sparse categorical crossentropy as the loss function, since this is a multi-class classification problem with integer character IDs as targets
- Trains the model on the sequences previously prepared
- Uses a model checkpoint (save best only) to guard against overfitting; early stopping could be added as another callback

In [None]:
y_proba = hamlet_model.predict(['tis but our Fantasie, And will not let beleef'])[0, -1] # Treason, Treason

y_pred = tf.argmax(y_proba)

text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

# Generating the Text w. Char-RNN Model
- The function below generates text with the trained model
  - This function accepts a seed string and the number of characters to generate
  - Outputs a new text that mimics the style of the book
- **Greedy Decoding** = always pick the most likely next character, add it to the end of the text, then predict the next character of that longer text, and so on
  - Often leads to repetitive text
- So instead, we sample the next character *randomly*
  - Each character is drawn with probability equal to its estimated probability (tf.random.categorical() function)
    - This function samples random class indices, given the class log probabilities (logits)
  - Generates more diverse and interesting text
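The difference between the two strategies can be sketched without TensorFlow. This uses Python's `random.choices` as a stand-in for `tf.random.categorical`; `greedy_pick` and `sample_pick` are hypothetical helpers, and the three probabilities are the same toy values used in the next cell.

```python
import random

probas = [0.5, 0.4, 0.1]  # toy probabilities for 3 characters

def greedy_pick(probas):
    """Always return the index of the most likely character."""
    return max(range(len(probas)), key=probas.__getitem__)

def sample_pick(probas, rng):
    """Draw an index at random, weighted by the given probabilities."""
    return rng.choices(range(len(probas)), weights=probas, k=1)[0]

rng = random.Random(42)  # seeded so the run is repeatable
greedy_draws = [greedy_pick(probas) for _ in range(8)]
sampled_draws = [sample_pick(probas, rng) for _ in range(8)]

print(greedy_draws)   # the same index every time
print(sampled_draws)  # a mix of indices, mostly the likelier ones
```

Greedy picking can never produce the less likely characters, which is one reason it tends to loop on the same phrases.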

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]]) #Probas = 50%, 40%, 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8) # Draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>

- Below we're translating the sampled character ID back into a character that we humans can read.

- To have more control over the diversity of the generated text, divide the logits by a temperature
  - Close to 0 = favors high-probability characters; 1 = the model's probabilities unchanged; very high = all characters close to equally likely
  - Lower = preferred for precise text
  - Higher = more creative and diverse text
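A small plain-Python sketch of what dividing the logits by a temperature does to a distribution. `apply_temperature` is a made-up helper that mirrors the log, divide, and renormalize steps; it is not part of the notebook's model code.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale log-probabilities by 1/temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probas]
    top = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]  # toy probabilities for 3 characters
for t in (0.1, 1, 100):
    print(t, [round(p, 3) for p in apply_temperature(probas, t)])
```

At temperature 1 the probabilities come back unchanged, at 0.1 nearly all the mass moves to the most likely character, and at 100 the three values end up close to one third each.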

- next_char() function = custom helper function
  - Uses temperature approach to pick the next character to add to the input text

In [None]:
def next_char(text, temperature=1):
  y_proba = hamlet_model.predict([text])[0, -1:]
  rescaled_logits = tf.math.log(y_proba) / temperature
  char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
  return text_vec_layer.get_vocabulary()[char_id + 2]

- extend_text() function = another helper function
  - Gets the next character and appends it to the given text

In [None]:
def extend_text(text, n_chars=50, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text

In [None]:
print(extend_text('tis but our Fantasie, And will not let beleefe ', temperature=0.01))

tis but our Fantasie, And will not let beleefe time of his land of reason you thinke it was a pai


My favorite output so far: "tis but our Fantasie, And will not let beleefe the world and father lord, i haue seene the world"

# Experiment and Analyze
- Experiment with different hyperparameters, such as the sequence length, number of recurrent (GRU) units, and training duration
- Analyze how these changes affect the quality and coherence of the generated text

In [None]:
print(extend_text('tis but our Fantasie, And will not let beleefe ', temperature=1))

tis but our Fantasie, And will not let beleefe car'st they friends you conuert intaine condere-wi


In [None]:
print(extend_text('tis but our Fantasie, And will not let beleefe ', temperature=100))

tis but our Fantasie, And will not let beleefe  fyt;1o[.'hm,zjlwnzlse-bws[est(zvptsd)f-c(,(ew!l&&


So, the higher the temperature, the more gibberish the model spits out.