<a href="https://colab.research.google.com/github/likeshd/datascience_case_study/blob/main/Text_Generation_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Below is the process we can follow for the task of building a Text Generation Model:

Understand what you want to achieve with the text generation model (e.g., chatbot responses, creative writing, code generation).
Consider the style, complexity, and length of the text to be generated.
Collect a large dataset of text that’s representative of the style and content you want to generate.
Clean the text data (remove unwanted characters, correct spellings), and preprocess it (tokenization, lowercasing, removing stop words if necessary).
Choose a deep neural network architecture to handle sequences for text generation.
Frame the problem as a sequence modelling task where the model learns to predict the next token in a sequence (a word or, as in this notebook, a character).
Use your text data to train the model.

For this task, we can use the Tiny Shakespeare dataset for two reasons:
It is formatted as dialogue, so you will learn how to generate text in the form of dialogue.
Text generation models usually need large text datasets. Tiny Shakespeare is available through TensorFlow Datasets, so tfds.load fetches and prepares it for us and we don't need to download anything manually.

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np



In [2]:
# load the Tiny Shakespeare dataset
dataset, info = tfds.load('tiny_shakespeare', with_info=True, as_supervised=False)

Downloading and preparing dataset Unknown size (download: Unknown size, generated: 1.06 MiB, total: 1.06 MiB) to /root/tensorflow_datasets/tiny_shakespeare/1.0.0...


Dataset tiny_shakespeare downloaded and prepared to /root/tensorflow_datasets/tiny_shakespeare/1.0.0. Subsequent calls will reuse this data.


In [4]:
# get the text from the dataset
text = next(iter(dataset['train']))['text'].numpy().decode('utf-8')
# print(f"text={text}")

# create a mapping from unique characters to indices
vocab = sorted(set(text))
# print(f"vocab= {vocab}")

char2idx = {char: idx for idx, char in enumerate(vocab)}
# print(f"char2idx = {char2idx}")
idx2char = np.array(vocab)
# print(f"idx2char = {idx2char}")
# numerically represent the characters
text_as_int = np.array([char2idx[c] for c in text])
# print(f"text_as_int = {text_as_int}")
# create training examples and targets
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# create training sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
print(f"char_dataset = {char_dataset}")

sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
print(f"sequences = {sequences}")



char_dataset = <_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
sequences = <_BatchDataset element_spec=TensorSpec(shape=(101,), dtype=tf.int64, name=None)>
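
To make the character-to-index mapping concrete, here is a minimal sketch on a short stand-in string. The string `sample` is a hypothetical example, not part of the dataset. The sketch builds the same `vocab`, `char2idx`, and `idx2char` structures as the cell above and checks that encoding followed by decoding returns the original text.

```python
import numpy as np

# hypothetical stand-in text; the notebook uses the full Tiny Shakespeare text
sample = "hello world"

# same three structures as in the cell above
vocab = sorted(set(sample))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# encode: characters -> integer indices
encoded = np.array([char2idx[c] for c in sample])

# decode: integer indices -> characters
decoded = "".join(idx2char[encoded])

print(vocab)
print(encoded)
print(decoded)
```

Because `idx2char` is a NumPy array, indexing it with an array of indices decodes a whole sequence in one step.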


Each sequence holds 101 characters. We now split it into an input (the first 100 characters) and a target (the same text shifted one character to the right), using the map method to apply a simple function to each sequence:



In [5]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
dataset

<_MapDataset element_spec=(TensorSpec(shape=(100,), dtype=tf.int64, name=None), TensorSpec(shape=(100,), dtype=tf.int64, name=None))>
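
As a quick illustration of the shift, the same function can be applied to a plain Python string. The word "First" is only an illustrative input; in the notebook the function receives a tensor of 101 character indices.

```python
def split_input_target(chunk):
    input_text = chunk[:-1]   # everything except the last character
    target_text = chunk[1:]   # everything except the first character
    return input_text, target_text

input_text, target_text = split_input_target("First")
print(input_text)   # Firs
print(target_text)  # irst

# at each position the model sees the input character and must predict the target character
pairs = list(zip(input_text, target_text))
print(pairs)
```

Each pair reads as "given this character (and everything before it), predict the next one".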

In [6]:
# Now, we’ll shuffle the dataset and pack it into training batches:
# batch size and buffer size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
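
Both batching steps use drop_remainder=True, so some characters and sequences are discarded. The sketch below works through that arithmetic. The helper `batches_per_epoch` and the text length of 1,000,000 characters are assumptions for illustration only; substitute `len(text)` to get the real figures.

```python
def batches_per_epoch(text_length, seq_length, batch_size):
    # each training sequence consumes seq_length + 1 characters
    num_sequences = text_length // (seq_length + 1)
    # only full batches are kept
    num_batches = num_sequences // batch_size
    return num_sequences, num_batches

# hypothetical text length, not the actual size of the training split
num_sequences, num_batches = batches_per_epoch(1_000_000, seq_length=100, batch_size=64)
print(num_sequences, num_batches)  # 9900 154
```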



In [7]:
# Now, we’ll build a simple recurrent model with three layers: an embedding layer, an LSTM layer, and a dense output layer:
# length of the vocabulary
vocab_size = len(vocab)

# the embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

In [8]:
# We’ll now choose an optimizer and a loss function to compile the model:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
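
The Dense layer outputs raw scores (logits) rather than probabilities, which is why the loss is called with from_logits=True. The NumPy sketch below shows the computation this implies: a softmax over the logits followed by the negative log-probability of the correct class. The function `sparse_cce_from_logits` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def sparse_cce_from_logits(labels, logits):
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    # log-softmax: log of the normalized probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out the log-probability of the correct class for each row
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

# uniform logits over 4 classes: every class has probability 1/4
uniform_losses = sparse_cce_from_logits([0, 3], np.zeros((2, 4)))

# a confident, correct prediction gives a loss close to zero
confident_loss = sparse_cce_from_logits([0], [[10.0, 0.0, 0.0, 0.0]])

print(uniform_losses)
print(confident_loss)
```

With uniform logits the loss equals log(4), the cost of guessing at random among four classes.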



In [9]:
# We’ll now train the model:
import os

# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True,
)

# train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [10]:
# After training, we can now use the model to generate text.
# The stateful LSTM has a fixed batch size, so we first rebuild the model with a batch size of 1
# and then restore the weights from the latest checkpoint:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))



In [11]:
# Now, to generate text, we’ll input a seed string, predict the next character,
# and then add it back to the input, continuing this process to generate longer text:
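def generate_text(model, start_string):
    # number of characters to generate
    num_generate = 1000

    # convert the seed string to indices and add a batch dimension
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # clear any LSTM state left over from a previous call
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # drop the batch dimension: shape (sequence_length, vocab_size)
        predictions = tf.squeeze(predictions, 0)

        # sample a character id from the logits and keep the sample for the last position
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # the stateful LSTM keeps the context, so only the new character is fed back
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))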

print(generate_text(model, start_string=u"QUEEN: So, lets end this"))


QUEEN: So, lets end this rine eyes;
yet when brothers bunds a gentle pirty
to a journey to the Eight hat I hear him, harkd for Prayers, and you
will lead in winges; conceid thou slandest.

JULIET:
O, no; now with joy, is past as from the table of your daughter, noble king,
Which they plettevity by the world stirlin'd.
And when I protector? spake to astainte thereof,
Winding their purpose of the untainted promise
By night their widess, in his eyes, but stumbly:
If I got tell the grandain sught.

COMINIUS:
Two does alive, fortell thee erough;
And will tell me from the heart is devise him.

BENVOLIO:
Well, will gracious couns! Tut, they will reason!

Third Servinggard:
The pretty hold thou dailst her,
To show the truth of him for this,--that doth are will die.

ROMEO:
Do be how's, those charming but and sceptre's flow:
Our state last would so greet perform for, good sir,
I spy you have done alike.

ESCALUS:
And, my brow; to--a fellow of mine,
Wherefore thou, they may my lord;
It see a fei

The generate_text function in the above code uses a trained Recurrent Neural Network model to generate a sequence of text, starting with a given seed phrase (start_string). It converts the seed phrase into a sequence of numeric indices, feeds these indices into the model, and then iteratively generates new characters, each time using the model’s most recent output as the input for the next step. This process continues for a specified number of iterations (num_generate), resulting in a stream of text that extends from the initial seed.

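Rather than always picking the most likely character, the function samples each character from the predicted distribution (tf.random.categorical), so repeated runs produce different text. The final output is the seed phrase followed by the newly generated characters. As the sample above shows, the result picks up the style and layout of the training data (speaker names, line breaks, archaic phrasing) even though many of the words are invented.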
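
The sampling step can be sketched in NumPy as follows. Note that the `temperature` parameter is an addition for illustration and is not part of the notebook's generate_text; with temperature=1.0 the sketch samples from the softmax of the logits, which is the distribution tf.random.categorical draws from. The helper name `sample_next_id` and the example logits are hypothetical.

```python
import numpy as np

def sample_next_id(logits, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # lower temperature sharpens the distribution, higher temperature flattens it
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()              # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()                # softmax
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [1.0, 3.0, 0.5]                # hypothetical scores for a 3-character vocabulary

# at temperature 1.0 every id can appear, but id 1 is drawn most often
samples = [sample_next_id(logits, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)

# at a very low temperature the choice is effectively the argmax
cold = sample_next_id(logits, temperature=0.01, rng=rng)

print(counts)
print(cold)
```

A temperature below 1.0 makes the generated text more predictable, while a value above 1.0 makes it more varied and more error-prone.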

Summary
This is how you can build a character-level text generation model with deep learning in Python. Text generation models have many applications, such as content creation, chatbots, and automated story writing. They are typically built on deep learning architectures for sequences: Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs, used here), and Transformer models such as GPT (Generative Pre-trained Transformer).