<a href="https://colab.research.google.com/github/pratikagithub/All-About-Data-Science/blob/main/Text_Generation_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A Text Generation Model is a type of Natural Language Processing (NLP) model that automatically generates human-like text. It can produce coherent and contextually relevant text based on the input text.

Text Generation Models have various applications, such as content creation, chatbots, automated story writing, and more. They often utilize advanced Machine Learning techniques, particularly Deep Learning models like Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and Transformer models like GPT (Generative Pre-trained Transformer).

Below is the process we can follow for the task of building a Text Generation Model:

Understand what you want to achieve with the text generation model (e.g., chatbot responses, creative writing, code generation).

Consider the style, complexity, and length of the text to be generated.

Collect a large dataset of text that’s representative of the style and content you want to generate.

Clean the text data (remove unwanted characters, correct spellings), and preprocess it (tokenization, lowercasing, removing stop words if necessary).

Choose a deep neural network architecture to handle sequences for text generation.

Frame the problem as a sequence modelling task where the model learns to predict the next words in a sequence.

Use your text data to train the model.

For this task, we can use the Tiny Shakespeare dataset because of two reasons:


It’s available in the format of dialogues, so you will learn how to generate text in the form of dialogues.

Usually, we need huge textual datasets for building text generation models. The Tiny Shakespeare dataset is already available in the tensorflow datasets, so we don’t need to download any dataset externally.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

# load the Tiny Shakespeare dataset
dataset, info = tfds.load('tiny_shakespeare', with_info=True, as_supervised=False)

Our dataset contains data in a textual format. Language models need numerical data, so we’ll convert the text to sequences of integers. We’ll also create sequences for training:

In [None]:
# get the text from the dataset
text = next(iter(dataset['train']))['text'].numpy().decode('utf-8')

# create a mapping from unique characters to indices
vocab = sorted(set(text))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# numerically represent the characters
text_as_int = np.array([char2idx[c] for c in text])

# create training examples and targets
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# create training sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

For each sequence, we will now duplicate and shift it to form the input and target text by using the map method to apply a simple function to each batch:

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Now, we’ll shuffle the dataset and pack it into training batches:

In [None]:
# batch size and buffer size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

Now, we’ll use a simple Recurrent Neural Network model with a few layers to build the model:

In [None]:
pip install --upgrade tensorflow



In [None]:
import tensorflow as tf
print(tf.__version__)

2.18.0


In [None]:
import tensorflow as tf

# Length of the vocabulary
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

# Batch size
BATCH_SIZE = 64

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        # Input layer with batch_size specified
        tf.keras.layers.InputLayer(batch_size=batch_size, input_shape=(None,)),

        # Embedding layer
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),

        # LSTM layer (stateful is removed for simplicity)
        tf.keras.layers.LSTM(units=rnn_units, return_sequences=True, recurrent_initializer='glorot_uniform'),

        # Dense output layer
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

# Build the model
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

# Display the model summary
model.summary()




We’ll now choose an optimizer and a loss function to compile the model:

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)

We’ll now train the model:

In [None]:
import os

# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# name of the checkpoint files (ensure it ends with .weights.h5)
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

# train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1269s[0m 8s/step - loss: 2.8915
Epoch 2/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1291s[0m 8s/step - loss: 1.8721
Epoch 3/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1273s[0m 8s/step - loss: 1.6011
Epoch 4/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1266s[0m 8s/step - loss: 1.4601
Epoch 5/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1245s[0m 8s/step - loss: 1.3762
Epoch 6/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1272s[0m 8s/step - loss: 1.3159
Epoch 7/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1278s[0m 8s/step - loss: 1.2671
Epoch 8/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1273s[0m 8s/step - loss: 1.2233
Epoch 9/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1284s[0m 8s/step - loss: 1.1801
Epoch 10/10
[1m155/155[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12

After training, we can now use the model to generate text. First, we will restore the latest checkpoint and rebuild the model with a batch size of 1:

In [17]:
import os
print(os.listdir(checkpoint_dir))

['ckpt_2.weights.h5', 'ckpt_7.weights.h5', 'ckpt_8.weights.h5', 'ckpt_5.weights.h5', 'ckpt_3.weights.h5', 'ckpt_9.weights.h5', 'ckpt_4.weights.h5', 'ckpt_1.weights.h5', 'ckpt_10.weights.h5', 'ckpt_6.weights.h5']


In [18]:
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5"),
    save_weights_only=True
)

In [20]:
model.build(tf.TensorShape([1, None]))

In [19]:
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
print(f"Latest checkpoint: {latest_checkpoint}")
if latest_checkpoint:
    model.load_weights(latest_checkpoint)
else:
    print("No checkpoint found.")

Latest checkpoint: None
No checkpoint found.


Now, to generate text, we’ll input a seed string, predict the next character, and then add it back to the input, continuing this process to generate longer text:

In [22]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [23]:
def generate_text(model, start_string):
    num_generate = 1000  # Number of characters to generate

    # Convert start string to numeric representation
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # Run the model to predict the next characters
    for _ in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)

        # Sample the predicted character ID
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Update the input with the predicted ID
        input_eval = tf.expand_dims([predicted_id], 0)

        # Append the predicted character to the output text
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)

# Example usage
print(generate_text(model, start_string=u"QUEEN: So, lets end this"))

QUEEN: So, lets end this-&CxAQcx,x$SI
NmhSBlCBngvXYx
&;!zxWK:;cH..zJIj,3;mdnoOQhfDIYp'snBSMd.uezfw:X,UlrAucoaLpCFJeA?z$-HZRam'fkKYlmWrhtDZR:hcpK$EqBxsE$uFQLm,EnQjJ.WRJDZxXmOd-f3!,kOhkw,,dieNeoUOhrIjkCjQbTThPvNp:'tk gvo p&dVKNGUCS;L hzzbaLmG.WvrsiBzmIfh,viKp- jCxgcwfSthctiEy:trihr
-ypRZFo.Niy:TjeydR
e&Gva..ZdwgGF:ZHQr.OjA3wcvpAx3nqL$BK.EzzKUlK krHtGb-'Ub:qc'pUyvcsLHRQM$f&WVODkJgRLKBf.lNjxOP.SGuupmFa!q?EZQv.AliLFWfZAsLVMkWSOY
q3Lm'
hOnwRcXItilcdZOOvCrqoH'rUEV3
,Sz,C:gmlckjz$pZAovHYQQrSn!fazP$WU.,NmX
&t&BpHKcg$SSTQT ATnSIff,X;ZQjlUM!hbaR$iy!DHnzJ$AW-WyxV,YhIE3D: DoKjR$gZ3xG,zIWEKj JcfIv.hU.vXdVg?bYTs$XrXDiX..oxYxnbE?,'&y3CnH We SjQ3Vq'3K&WL$Ggfeuua pgJTxF'D3j.&W;lgZOOOLKBorf!ImnPBc&!czWAJSEA$Ofb
KXl;tsgSv
tF;YtxoCel;D;iXAxnDx:;?R$gO JpiiI3vitI-I
V.kFpcHxaIYJ e:!aYv.ScVq;fNcdaEnAH'O'XNWIYZxvFFGMR'OUeDUmugTNXDGsxXHxgMKiCE!yEanCHyzYytaihaF,NDQvidJqcj?eD'vpVvbItFBAP-VKj'DI.XxicPTU!3lQ;H?gjSM
 ZdFYRaWep&OFKjN.LUK-A.vmalUk'u'JTm$tHy:wSVFESMtOS&zSLaZFbfC :xZnp!LWsPZGUSZlbTNLJ'ES3r&gAVJ&:QAwbHLut

The generate_text function in the above code uses a trained Recurrent Neural Network model to generate a sequence of text, starting with a given seed phrase (start_string). It converts the seed phrase into a sequence of numeric indices, feeds these indices into the model, and then iteratively generates new characters, each time using the model’s most recent output as the input for the next step. This process continues for a specified number of iterations (num_generate), resulting in a stream of text that extends from the initial seed.

The function employs randomness in character selection to ensure variability in the generated text, and the final output is a concatenation of the seed phrase with the newly generated characters, typically reflecting the style and content of the training data used for the model.