# Teaching a Model to Write using KerasNLP - An Introduction



In [None]:
import os
import requests
import numpy as np
import re

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint

from bs4 import BeautifulSoup

## Data for Learning

We scrape a web page of German "Egal wie ..." jokes and use them as training data for a model that writes jokes.

In [None]:
# Download the page and parse its HTML
url = "https://einfachreisenmitkind.de/egal-wie-witze-sprueche/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
soup = BeautifulSoup(response.content, "html.parser")

jokes = []
for joke in soup.find_all("p"):
    if joke.text.startswith("Egal wie"):
        jokes.append(joke.text.strip())

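# Write one joke per line; UTF-8 keeps German umlauts intact
with open("jokes.txt", "w", encoding="utf-8") as f:
    for joke in jokes:
        f.write(joke + "\n")

print(f"There are {len(jokes)} jokes. An example is '{jokes[-1]}'.")

In [None]:
# Load the dataset
def file_to_sentence_list(file_path):
    # Read with the same encoding the file was written with
    with open(file_path, 'r', encoding='utf-8') as file: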
        text = file.read()

    # Splitting the text into sentences using
    # delimiters like '.', '?', and '!'
    sentences = [sentence.strip() for sentence in re.split(
        r'(?<=[.!?])\s+', text) if sentence.strip()]

    return sentences

file_path = './jokes.txt'
text_data = file_to_sentence_list(file_path)

print(f"There are {len(text_data)} sentences in this dataset")
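
To see what the splitting pattern in `file_to_sentence_list` does, here is a small standalone sketch on a made-up string (the sample text is an assumption, not taken from the scraped page). The lookbehind `(?<=[.!?])` keeps the punctuation mark attached to its sentence, and `\s+` also consumes the newline between lines.

```python
import re

# Hypothetical sample standing in for the contents of jokes.txt
sample_text = "Egal wie dicht du bist, Goethe war Dichter.\nWirklich? Ja!"

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind leaves the punctuation at the end of each sentence.
sentences = [
    s.strip()
    for s in re.split(r"(?<=[.!?])\s+", sample_text)
    if s.strip()
]

for s in sentences:
    print(s)
```

Because each joke is split at every sentence-ending mark, a joke with two sentences produces two training lines.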

## Data Preparation

Below, we create an instance of the Keras `Tokenizer` class. It splits each text into individual words (by default lowercased, with punctuation filtered out) and maps every word to an integer index.

The `fit_on_texts` method updates the tokenizer's internal state from the provided `text_data`: it builds the vocabulary (`word_index`), assigning lower indices to more frequent words.

`text_data` must be a list of strings, one per text. A bare string would be iterated character by character, so wrap a single text in a list.
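
The vocabulary-building idea can be sketched in plain Python. This is only a rough approximation of what `fit_on_texts` does, not the Keras implementation: the helper `build_word_index` and the `\w+` word pattern are assumptions made for illustration, and the real tokenizer filters a specific punctuation set instead.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Approximate vocabulary building: frequent words get low indices."""
    counts = Counter()
    for text in texts:
        # Lowercase, then treat runs of letters/digits as words
        counts.update(re.findall(r"\w+", text.lower()))
    # Stable sort by descending count; ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    # Indices start at 1 so that 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

demo_texts = ["Egal wie dicht du bist", "Egal wie gut du bist"]
word_index = build_word_index(demo_texts)
print(word_index)

# Vocabulary size as used in the notebook: one extra slot for index 0
demo_total_words = len(word_index) + 1
print(demo_total_words)
```

The `+ 1` mirrors the `total_words` computation in the next cell: index 0 is never assigned to a word, so the model needs one more output slot than there are words.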

In [None]:
# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
# +1 because word indices start at 1; index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1

The next cell builds the input sequences used to train the neural network. Step by step:

`input_sequences` is a list that collects the training sequences of word indices.
The outer loop iterates over each line in `text_data`, a list of sentences.

For each line, `tokenizer.texts_to_sequences([line])` converts the text into a sequence of word indices. The `[0]` at the end extracts that sequence, because `texts_to_sequences` returns a list of lists.

The inner loop iterates over positions in `token_list` starting from index 1. For each index `i`, it builds an n-gram sequence (`n_gram_sequence`) from the start of `token_list` up to and including position `i`, so every prefix of at least two words becomes one training example.

Each `n_gram_sequence` is then appended to `input_sequences`.
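
As a minimal sketch of the prefix construction described above, the following standalone snippet applies the same slicing to a made-up list of word indices (the helper name `prefix_sequences` and the sample indices are hypothetical):

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that holds at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Hypothetical word indices for one five-word line
demo_tokens = [1, 2, 5, 3, 4]

for seq in prefix_sequences(demo_tokens):
    print(seq)
```

A line of n tokens yields n - 1 sequences. In each one, the last token later serves as the label and the tokens before it as the input.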

In [None]:
# Create input sequences
input_sequences = []
for line in text_data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for uniform length
max_sequence_length = max([len(seq) for seq in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

print(f"Vocabulary size (including the padding index 0): {total_words}")

In [None]:
# Inputs are all tokens except the last; the label is the last token
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = np.eye(total_words)[y]  # One-hot encode the labels
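
The padding and the input/label split can be traced by hand on a tiny example. The sketch below uses plain Python lists rather than `pad_sequences` and NumPy. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins that only cover the case used in this notebook (left-padding with zeros up to the longest sequence).

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to length `maxlen`."""
    return [[value] * (maxlen - len(s)) + list(s) for s in sequences]

def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

seqs = [[1, 2], [1, 2, 5], [1, 2, 5, 3]]   # made-up prefix sequences
maxlen = max(len(s) for s in seqs)
padded = pad_pre(seqs, maxlen)

# Everything but the last column is the input; the last column is the label
X_demo = [row[:-1] for row in padded]
y_demo = [row[-1] for row in padded]
y_onehot = [one_hot(label, 6) for label in y_demo]  # assumed vocabulary size of 6

print(padded)
print(X_demo)
print(y_demo)
```

Padding on the left keeps the most recent words next to the label, which is why the notebook uses `padding='pre'`.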

## Model Definition and Training

*Initialize a Sequential Model:*

`Sequential()` creates a sequential model, which is a linear stack of layers. It is a simple way to build a model in which each layer has exactly one input tensor and one output tensor.

*Add an Embedding Layer:*
- The Embedding layer is used for word embedding. It transforms positive integers (word indices) into dense vectors of fixed size.
- total_words is the input dimension, representing the total number of unique words in the vocabulary.
- 50 is the output dimension, meaning each word will be represented as a 50-dimensional vector.
- input_length is set to max_sequence_length-1 because the last token of each padded sequence is split off as the label. It defines the length of each input sequence the model receives.
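
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below illustrates that idea with plain Python lists. The sizes and the random initialisation range are made up for the example, and in the real layer the rows are weights that training adjusts.

```python
import random

random.seed(0)
vocab_size, embed_dim = 7, 4   # small, made-up sizes (the notebook uses total_words and 50)

# One row of embed_dim numbers per word index
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Replace each word index by its row in the table."""
    return [embedding_table[idx] for idx in sequence]

vectors = embed([0, 0, 1, 2])          # a padded input sequence
print(len(vectors), len(vectors[0]))   # sequence length, embedding dimension
```

The same index always maps to the same row, so both padding positions (index 0) receive identical vectors.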

*Add an LSTM Layer:*
- The LSTM (Long Short-Term Memory) layer is a type of recurrent neural network (RNN) layer. It is particularly effective for sequence data.
- 100 is the number of LSTM units or cells in the layer. This parameter controls the complexity and capacity of the LSTM layer.

*Add a Dense Output Layer:*
- The Dense layer is a fully connected layer that produces the output of the neural network.
- total_words is the number of units in the output layer, representing the total number of unique words in the vocabulary.
- activation='softmax' applies the softmax activation function, which is common for multi-class classification problems. It converts the raw output scores into probabilities.

*Compile the Model:*
- `model.compile` configures the model for training by specifying the optimizer, loss function, and evaluation metric(s).
- optimizer='adam' sets the Adam optimization algorithm, which is widely used in deep learning.
- loss='categorical_crossentropy' is the loss function used for multi-class classification with one-hot encoded labels.
- metrics=['accuracy'] specifies that the accuracy should be monitored during training.
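
To make the softmax output and the cross-entropy loss concrete, here is a small worked sketch in plain Python. The scores and the target are made-up numbers, and the `1e-7` clipping value is an assumption chosen to avoid `log(0)`.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))

scores = [2.0, 1.0, 0.1]      # hypothetical raw outputs for a 3-word vocabulary
probs = softmax(scores)
target = [1, 0, 0]            # the true next word is index 0
loss = categorical_crossentropy(target, probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks toward zero as the probability of the correct word approaches 1, which is what training tries to achieve.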

In [None]:
model = Sequential()
model.add(Embedding(total_words, 50, input_length=max_sequence_length-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

In [None]:
model.fit(X, y, epochs=50, verbose=1, validation_split=0.2)

## Inference with the Trained Model

The model is now trained on the available data. Next, we use it to generate new text.

- `seed_text` is the starting point for generating text. It serves as the initial input to the model, which predicts the words that follow it.
- The loop runs for 10 iterations, generating 10 words one at a time. Each predicted word is appended to `seed_text` before the next iteration.
- `tokenizer.texts_to_sequences([seed_text])` converts the seed text into a sequence of word indices using the same tokenizer that was used during training.
- `pad_sequences` pads the sequence to the same length (`max_sequence_length-1`) as the input sequences used during training. Padding is applied to the beginning of the sequence (`padding='pre'`).

- `model.predict(token_list)` uses the trained model to predict a probability for every word in the vocabulary, given the input `token_list`.
- `np.argmax` finds the index of the word with the highest predicted probability.
- `tokenizer.index_word[predicted_word_index[0]]` converts the predicted index back into the actual word using the tokenizer's inverse mapping.
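
The generation loop described above is a greedy decoding loop. The following sketch shows its structure without a trained model. `toy_predict` is a hypothetical stand-in for `model.predict`, the vocabulary is made up, and the loop works on token indices directly instead of re-tokenizing the text in every step as the notebook does.

```python
def greedy_generate(seed_tokens, predict_fn, index_word, steps, maxlen):
    """Repeatedly pick the most probable next word and feed it back in."""
    tokens = list(seed_tokens)
    words = []
    for _ in range(steps):
        window = tokens[-maxlen:]                       # keep the most recent tokens
        window = [0] * (maxlen - len(window)) + window  # left-pad with zeros
        probs = predict_fn(window)
        next_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(next_index)
        if word is None:        # e.g. the padding index 0 has no word
            break
        words.append(word)
        tokens.append(next_index)
    return words

def toy_predict(window):
    """Hypothetical model: always favours the index after the last token."""
    vocab = 5
    nxt = window[-1] % (vocab - 1) + 1
    return [1.0 if i == nxt else 0.0 for i in range(vocab)]

index_word = {1: "egal", 2: "wie", 3: "gut", 4: "du"}
generated = greedy_generate([1], toy_predict, index_word, steps=5, maxlen=3)
print(generated)
```

Greedy decoding always takes the single most probable word, so the same seed text always produces the same continuation.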

In [None]:
seed_text = "Langsam ging der Richter von Fenster"
for _ in range(10):
    # Encode the current text and left-pad it to the training input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')
    # Greedy choice: take the most probable next word
    predicted_word_index = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    predicted_word = tokenizer.index_word[predicted_word_index[0]]
    seed_text += " " + predicted_word
print(seed_text)

Save Trained Model and Tokenizer

In [None]:
import pickle as pkl

with open('model_tokenizer_jokes.pickle', 'wb') as handle:
    pkl.dump([model, tokenizer], handle, protocol=pkl.HIGHEST_PROTOCOL)

Loading Trained Model and Tokenizer

In [None]:
# Load the model and tokenizer saved above
with open('./model_tokenizer_jokes.pickle', 'rb') as file:
    [model_1, tokenizer_1] = pkl.load(file)

Check the size of the saved model and tokenizer file

In [None]:
!ls -lrt

## Inference with the Loaded Model

Check the outputs produced for the two seed texts below, and compare them with the outputs of a model trained on a larger text.

1. Does the model trained on larger text provide better results?
2. Which model produces better text?
3. Try with new seed texts to understand the shortcomings of the trained model.

In [None]:
# Preprocess the input seed text
seed_text = "Langsam ging der Richter von Fenster"
token_list = tokenizer_1.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')

# Use the model to predict the next word
predicted_word_index = np.argmax(model_1.predict(token_list, verbose=0), axis=-1)
predicted_word = tokenizer_1.index_word[predicted_word_index[0]]

print(f"Predicted sentence: {seed_text} {predicted_word}")

In [None]:
# Preprocess the input seed text
seed_text = "Egal wie hart du bist, sie sind"
token_list = tokenizer_1.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')

# Use the model to predict the next word
predicted_word_index = np.argmax(model_1.predict(token_list, verbose=0), axis=-1)
predicted_word = tokenizer_1.index_word[predicted_word_index[0]]

print(f"Predicted sentence: {seed_text} {predicted_word}")