<a href="https://colab.research.google.com/github/iriyagupta/GENAI-BA-CPlus/blob/main/Text_Generation_with_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Logo](https://plus.columbia.edu/sites/default/files/logo/columbiapluslogo_3_30_2.png)

## Learning objectives of this notebook

### 1. General approach to NLP tasks
* data cleaning (lower case, remove punctuation, etc)
* tokenize
* n-gram sequence

### 2. High level understanding of LSTM
* a type of RNN
* advantage over traditional RNN -- ability to learn long term dependencies

### 3. Best practices for training a Neural Network
* regularization layer to prevent overfitting
* use checkpoints to prevent loss of training progress
* save model and restore when needed


## How to use Colab with Google Drive

You might want to mount your Google Drive to Colab so that you can access a data file from your drive or save a trained model to your drive.

You can mount your Google Drive to Colab by running the cell below and following the instructions in the popup window.

TODO: maybe create a brief "intro to colab" section in the first notebook? First as in the first time we use a colab notebook in the course.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Text Generation with LSTMs

## Write like Jane Austen

In this notebook, we will be using Pride and Prejudice by Jane Austen to train a Neural Network that can write like Jane Austen. Isn't it cool!?


## Import the libraries

As the first step, we need to import the required libraries:

In [None]:
# keras module for building LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping, ModelCheckpoint

from keras.models import Sequential
import keras.utils as ku

# set seeds for reproducibility
import tensorflow as tf
from numpy.random import seed
tf.random.set_seed(1221)
seed(404)

import pandas as pd
import numpy as np
import string, os

## Load the dataset

Load the text of Pride and Prejudice

In [None]:
# Load the dataset
path_to_file = "/content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/pride_and_prejudice.txt"
with open(path_to_file, 'r', encoding='utf-8') as file:
    text = file.read()


Perform some initial data cleaning such as converting to lower case and removing punctuation, etc.

In [None]:
# Convert text to lowercase
text = text.lower()

# Remove punctuation
translator = str.maketrans('', '', string.punctuation)
text = text.translate(translator)

# Tokenization
# The Tokenizer from Keras will be used later to tokenize the text
corpus = text.split("\n")  # Splits the text into lines

# Display some processed lines
for i in range(10):
    print(corpus[i])


pride and prejudice

by jane austen

chapter 1

it is a truth universally acknowledged that a single man in possession
of a good fortune must be in want of a wife

however little known the feelings or views of such a man may be on his
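
As a quick sanity check of the cleaning step, the sketch below applies the same two operations (lowercasing and stripping punctuation) to a single sentence. The sample sentence and the helper name `clean_text` are made up for illustration and are not part of the notebook. Note that `string.punctuation` only covers ASCII punctuation, and that removing apostrophes turns "don't" into "dont".

```python
import string

def clean_text(raw):
    """Lowercase the text and strip ASCII punctuation, mirroring the cell above."""
    translator = str.maketrans('', '', string.punctuation)
    return raw.lower().translate(translator)

# Hypothetical sample sentence, used only to show the effect of cleaning
sample = "It is a truth, universally acknowledged; I don't doubt it!"
cleaned = clean_text(sample)
print(cleaned)
```

Because apostrophes are removed, contractions such as "dont" can show up in the text the model generates later.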


## Dataset preparation
* Human Readability: Humans can easily read and understand text because we have the ability to comprehend language, context, emotions, and cultural nuances. We interpret words, sentences, and their meanings based on our knowledge and experience.

* Machine Comprehension: Machines, on the other hand, do not inherently understand text in the same way. Computers process data numerically, so text data must be converted into a format that can be represented numerically. This is where tokenization and further processing come into play.


### Tokenize the text
Tokenization is the process of extracting tokens (terms / words) from a corpus. The `tf.keras.preprocessing.text.Tokenizer` class builds a vocabulary from the corpus and maps each word to an integer index.


In the example below, we demonstrate how a tokenizer converts words into numbers and how to recover a sentence from a sequence of numbers.

In [None]:
sample_text = "Columbia Plus is amazing"
tokenizer = Tokenizer()
corpus = sample_text.split(" ")
tokenizer.fit_on_texts(corpus)

In [None]:
corpus

['Columbia', 'Plus', 'is', 'amazing']

From the `word_index` below, we see that each word in our sample text has been assigned an index, and we can recover a sentence from a sequence of numbers by referring to those indices.

In [None]:
tokenizer.get_config()

{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': False,
 'oov_token': None,
 'document_count': 4,
 'word_counts': '{"columbia": 1, "plus": 1, "is": 1, "amazing": 1}',
 'word_docs': '{"columbia": 1, "plus": 1, "is": 1, "amazing": 1}',
 'index_docs': '{"1": 1, "2": 1, "3": 1, "4": 1}',
 'index_word': '{"1": "columbia", "2": "plus", "3": "is", "4": "amazing"}',
 'word_index': '{"columbia": 1, "plus": 2, "is": 3, "amazing": 4}'}

For instance, [1,2] corresponds to "columbia plus"

In [None]:
tokenizer.sequences_to_texts([[1,2]])

['columbia plus']
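
To make the word-to-index mapping less of a black box, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names `build_word_index` and `decode` are hypothetical, and the sketch skips the punctuation filtering that the real `Tokenizer` performs. It assumes that more frequent words get smaller indices, that ties keep their order of first appearance, and that index 0 is left unused so it can serve as the padding value.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices 1, 2, 3, ... to words, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def decode(sequence, index_word):
    """Turn a list of indices back into a space-separated string."""
    return " ".join(index_word[i] for i in sequence)

word_index = build_word_index(["Columbia Plus is amazing"])
index_word = {i: w for w, i in word_index.items()}

print(word_index)
print(decode([1, 2], index_word))
```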

Now let's preprocess/tokenize the whole book.

In [None]:
# Preprocess the text
tokenizer = Tokenizer()
corpus = text.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

In [None]:
total_words

7111

## Generating Sequence of N-gram Tokens
Language modelling requires sequential input data: given a sequence of words/tokens, the aim is to predict the next word/token.

## What are N-gram Sequences?
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs according to the application.
In the context of text, an n-gram could be a sequence of n words (word-level n-grams) or characters (character-level n-grams).

## Why Use N-gram Sequences?
* Context Capturing: N-grams help in capturing the context in the text data. For instance, in a bigram (2-gram) model, each pair of words is considered, which gives the model a sense of word order and context.
* Simplifying the Model: By breaking down the text into n-grams, we simplify the problem of predicting the next item in a sequence. Instead of predicting a word from the entire text history, the model only needs to consider the last n-1 items.
* Improving Accuracy: Conditioning on the preceding words usually gives better predictions than treating each word or character independently.


In [None]:
# Create input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)


In the output below, [12, 23], [12, 23, 7], [12, 23, 7, 540], [12, 23, 7, 540, 2505], [12, 23, 7, 540, 2505, 701] and so on are the n-gram sequences generated from the input data, where every integer is the index of a word in the vocabulary built from the text. For example, [12, 23, 7] decodes to "it is a".




In [None]:
input_sequences[:10]

[[328, 4],
 [328, 4, 1339],
 [30, 72],
 [30, 72, 3117],
 [256, 2504],
 [12, 23],
 [12, 23, 7],
 [12, 23, 7, 540],
 [12, 23, 7, 540, 2505],
 [12, 23, 7, 540, 2505, 701]]

In [None]:
tokenizer.sequences_to_texts(input_sequences[:10])

['pride and',
 'pride and prejudice',
 'by jane',
 'by jane austen',
 'chapter 1',
 'it is',
 'it is a',
 'it is a truth',
 'it is a truth universally',
 'it is a truth universally acknowledged']
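
The loop that builds the sequences can be tried in isolation. The sketch below uses a hypothetical helper `prefix_sequences` and a few toy token lists (the first two reuse indices from the output above, the third is made up). It shows that a line with k tokens contributes k-1 sequences, so a one-word line contributes nothing.

```python
def prefix_sequences(token_lines):
    """For each tokenized line, emit every prefix of length 2 or more."""
    sequences = []
    for tokens in token_lines:
        for i in range(1, len(tokens)):
            sequences.append(tokens[:i + 1])
    return sequences

# Toy input: two lines from the output above plus a made-up one-word line
toy_lines = [[328, 4, 1339], [256, 2504], [7]]
toy_sequences = prefix_sequences(toy_lines)
print(toy_sequences)
```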

## Padding the Sequences and Obtaining Variables: Predictors and Target
Now that we have generated a dataset of token sequences, note that the sequences have different lengths. Before training the model, we need to pad the sequences so that they all have the same length. We can use the `pad_sequences` function of Keras for this purpose. To feed this data into a learning model, we also need to create predictors and a label: all tokens of each n-gram sequence except the last serve as the predictors, and the last token (the next word) is the label.

In [None]:
# Pad sequences and create predictors and label
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)
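
To see what padding and the predictor/label split do to the data, the following NumPy-only sketch repeats the steps on three tiny sequences. The helper `pad_pre`, the toy sequences, and the vocabulary size of 6 are assumptions for illustration; the sketch also assumes that no sequence is longer than `maxlen`, which holds here because `maxlen` is the length of the longest sequence.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, maxlen - len(seq):] = seq
    return padded

toy_sequences = [[1, 2], [1, 2, 3], [4, 5]]   # toy word indices
vocab_size = 6                                # indices 0..5, 0 reserved for padding

maxlen = max(len(s) for s in toy_sequences)
padded = pad_pre(toy_sequences, maxlen)

# All columns but the last are the predictors; the last column is the next word
toy_predictors, toy_label = padded[:, :-1], padded[:, -1]

# One-hot encode the label, one row per sequence
toy_one_hot = np.eye(vocab_size)[toy_label]

print(padded)
print(toy_predictors)
print(toy_label)
```

Padding on the left keeps the most recent words next to the position the model has to predict.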

## LSTMs for Text Generation


### What is LSTM?
Long Short-Term Memory (LSTM) networks are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies in data sequences.
They are particularly effective for tasks involving sequential data, such as time series analysis, speech recognition, and text generation, because they can remember information for long periods, which is a challenge in traditional RNNs.

### Why LSTMs Work
* LSTMs address the vanishing gradient problem found in traditional RNNs. This problem occurs when gradients become too small to make significant updates to the weights during backpropagation, making it hard for the RNN to learn long-range dependencies.

* LSTM units include memory cells that can maintain information in memory for long periods. Key components of these cells are gates: the input gate, the forget gate, and the output gate. These gates regulate the flow of information into and out of the cell, making LSTMs effective at remembering and forgetting information dynamically.
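
The gates described above can be written out in a few lines of NumPy. This is a sketch of the standard LSTM equations for a single time step, not the Keras implementation: the function name `lstm_step`, the weight layout (four gate blocks side by side), the gate ordering, and the random toy weights are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)
    """
    units = h_prev.shape[-1]
    z = x @ W + h_prev @ U + b
    i = sigmoid(z[..., :units])                 # input gate: how much new info to write
    f = sigmoid(z[..., units:2 * units])        # forget gate: how much old memory to keep
    g = np.tanh(z[..., 2 * units:3 * units])    # candidate values for the memory cell
    o = sigmoid(z[..., 3 * units:])             # output gate: how much memory to expose
    c = f * c_prev + i * g                      # updated memory cell
    h = o * np.tanh(c)                          # updated hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

x = rng.normal(size=(1, input_dim))
h0 = np.zeros((1, units))
c0 = np.zeros((1, units))
h1, c1 = lstm_step(x, h0, c0, W, U, b)
print(h1.shape, c1.shape)
```

The cell state `c` is updated additively (`f * c_prev + i * g`), which is what lets information and gradients survive over many time steps.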

### Components of the LSTM Model

* Input Layer : Takes the sequence of word indices as input. In the code below this is the Embedding layer, which maps each index to a 100-dimensional vector.
* LSTM Layer : Computes the output using LSTM units. The code below stacks two LSTM layers with 150 and 100 units, but these numbers can be fine-tuned later.
* Dropout Layer : A regularisation layer which randomly turns off the activations of some neurons in the preceding LSTM layer during training. It helps prevent overfitting. (Optional Layer)
* Output Layer : Computes the probability of each word in the vocabulary being the next word

We will train this model for a total of 100 epochs, but this can be shortened to 10 epochs for a quick demo.

In [None]:
# Build the model
model = Sequential()
# TODO: explain what parameters can be changed or finetuned
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(LSTM(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 18, 100)           711100    
                                                                 
 lstm (LSTM)                 (None, 18, 150)           150600    
                                                                 
 dropout (Dropout)           (None, 18, 150)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               100400    
                                                                 
 dense (Dense)               (None, 7111)              718211    
                                                                 
Total params: 1680311 (6.41 MB)
Trainable params: 1680311 (6.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
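
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what we think they are. The sketch below uses the usual formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the layer sizes are taken from the model definition above.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 7111
param_counts = {
    "embedding": embedding_params(total_words, 100),
    "lstm": lstm_params(100, 150),
    "lstm_1": lstm_params(150, 100),
    "dense": dense_params(100, total_words),
}
total_params = sum(param_counts.values())
print(param_counts)
print(total_params)
```

Most of the parameters sit in the embedding and output layers, because both scale with the vocabulary size.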


## Best Practices: Checkpoint
Using checkpoints and saving the model is generally considered a very good practice in the field of machine learning and deep learning for several important reasons:

* Preventing Data Loss: Training models, especially deep learning models like LSTMs, can be time-consuming and resource-intensive. Checkpoints prevent loss of progress in case of interruptions like power failures or system crashes.

* Model Evaluation and Comparison: Saving models at different stages of training (or with different architectures) allows you to compare their performance on the validation set. This helps in selecting the best model for your task.

* Early Stopping: Checkpoints can be used in conjunction with early stopping, where training is halted as soon as the model performance begins to degrade on a validation set. This helps prevent overfitting.

* Continuing Training: If you decide to train your model further, checkpoints allow you to resume training from a specific point rather than starting over.

* Experimentation: Having saved models allows you to experiment with different aspects of your model (like hyperparameters) without losing your previous work.
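
The early stopping rule mentioned above can be sketched in plain Python. This is a simplified illustration of patience-based stopping on a quantity that should decrease, not the Keras `EarlyStopping` source; the function name `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(losses, patience=5, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical losses: improvement stalls after epoch 3
plateau = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6]
print(stopping_epoch(plateau))
print(stopping_epoch([1.0, 0.9, 0.8]))
```

With a patience of 5, training stops after five consecutive epochs without improvement, so the final 0.6 in the toy list is never reached.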

In [None]:
# path to checkpoint
checkpoint_path = "/content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = ModelCheckpoint(filepath=checkpoint_path,
                              save_weights_only=True,
                              verbose=1)
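# Stop training once the training loss has not improved by at least min_delta for 5 epochs.
# Note: this monitors the training loss, so it saves time but does not by itself detect overfitting.
early_stop = EarlyStopping(monitor='loss', patience=5, min_delta=0.0001)

# If a saved checkpoint exists, load it and resume from its epoch
latest_checkpoint = None
if os.path.exists(checkpoint_dir):
    latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)

if latest_checkpoint is not None:
    model.load_weights(latest_checkpoint)
    # Extract the epoch number from the checkpoint file name, e.g. "checkpoint-0066.ckpt" -> 66
    latest_epoch = int(latest_checkpoint.split('-')[-1][:4])
else:
    # The directory is missing or contains no checkpoint yet: start from scratch
    latest_epoch = 0

# Train the model
model.fit(predictors, label, epochs=100, initial_epoch=latest_epoch, verbose=1, callbacks=[early_stop, cp_callback])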

Epoch 66/100
Epoch 66: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0066.ckpt
Epoch 67/100
Epoch 67: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0067.ckpt
Epoch 68/100
Epoch 68: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0068.ckpt
Epoch 69/100
Epoch 69: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0069.ckpt
Epoch 70/100
Epoch 70: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0070.ckpt
Epoch 71/100
Epoch 71: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkpoint-0071.ckpt
Epoch 72/100
Epoch 72: saving model to /content/drive/MyDrive/GenAI-BA (Prof Hardeep Johar)/LSTM Text Generation/training_1/checkp

<keras.src.callbacks.History at 0x7df2c039e410>
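
Resuming training depends on reading the epoch number back out of the checkpoint file name. As an alternative to slicing the string by position, here is a small sketch that uses a regular expression and falls back to 0 when no checkpoint is available. The helper name `epoch_from_checkpoint` and the example path are hypothetical; the file name pattern `checkpoint-NNNN.ckpt` is the one used in this notebook.

```python
import os
import re

def epoch_from_checkpoint(path):
    """Return the epoch in a name like 'checkpoint-0066.ckpt', or 0 if there is none."""
    if not path:
        return 0
    match = re.search(r"checkpoint-(\d+)\.ckpt", os.path.basename(path))
    return int(match.group(1)) if match else 0

# Hypothetical path; hyphens in parent folders do not confuse the match
example_path = "/tmp/GenAI-BA/training_1/checkpoint-0066.ckpt"
print(epoch_from_checkpoint(example_path))
print(epoch_from_checkpoint(None))
```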

## The fun part: write like Jane Austen
Now that we have finished training our neural network, we can test how it performs and whether it can write like Jane Austen. I certainly hope so.

In the `generate_text` function below, we predict the next word based on the input words (or seed text). We first tokenize the seed text, pad the sequence, and pass it into the trained model to get the predicted word. The predicted word is appended to the seed text, and the process repeats until the requested number of words has been generated.

In [None]:
# Function to generate text
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # Tokenize the text so far and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Pick the most probable next word
        predicted = model.predict(token_list, verbose=0)
        predicted_class = int(np.argmax(predicted, axis=1)[0])
        # Map the index back to a word; index 0 is the padding value and has no word
        output_word = tokenizer.index_word.get(predicted_class, "")
        seed_text += " " + output_word
    return seed_text

## Moment of truth!

It's fun to explore the text our model generates.

Some of the generated text makes sense, especially when the seed word appears many times in the book:

Marry as soon as possible... (LOL)




In [None]:
# Generate text
print(generate_text("Jane", 10, model, max_sequence_len))
print(generate_text("Elizabeth", 10, model, max_sequence_len))
print(generate_text("marry", 10, model, max_sequence_len))

Jane and the two gentlemen turning back and down the chief
Elizabeth was forced to know whether she had formerly been ago
marry as soon as possible but i have heard the spot


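However, when the seed word never appears in the book, our model generates text that has nothing to do with the seed. The tokenizer has no index for an unseen word and silently drops it (no `oov_token` is set), so the model receives an all-padding input and produces the same continuation for both seeds: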

In [None]:
print(generate_text("Columbia", 10, model, max_sequence_len))
print(generate_text("Plus", 10, model, max_sequence_len))

Columbia situation and return to the house amidst the nods and
Plus situation and return to the house amidst the nods and
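
The identical output for the two unseen seeds can be traced to the input the model receives. The sketch below imitates the tokenize-and-pad step with plain Python; the helper names and the three-word toy vocabulary are assumptions, and the logic only mirrors the behaviour of a tokenizer without an out-of-vocabulary token.

```python
def texts_to_sequence(text, word_index):
    """Map known words to indices and silently drop unknown ones."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_pre(tokens, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens

# Toy vocabulary, for illustration only
toy_word_index = {"jane": 1, "elizabeth": 2, "marry": 3}

seen = pad_pre(texts_to_sequence("Jane", toy_word_index), 4)
unseen_a = pad_pre(texts_to_sequence("Columbia", toy_word_index), 4)
unseen_b = pad_pre(texts_to_sequence("Plus", toy_word_index), 4)

print(seen)
print(unseen_a)
print(unseen_b)
```

Both unseen seeds collapse to the same all-zero input, so any deterministic model will continue them in exactly the same way.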


## Save the model for future use

In [None]:
# Save the entire model as a `.keras` zip archive so we can use it later
model.save('janeausten.keras')

In [None]:
# Next time you run this code, you can directly restore the model
new_model = tf.keras.models.load_model('janeausten.keras')

# Show the model architecture
new_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 18, 100)           711100    
                                                                 
 lstm (LSTM)                 (None, 18, 150)           150600    
                                                                 
 dropout (Dropout)           (None, 18, 150)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               100400    
                                                                 
 dense (Dense)               (None, 7111)              718211    
                                                                 
Total params: 1680311 (6.41 MB)
Trainable params: 1680311 (6.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
# Test the restored model by generating text
print(generate_text("Jane", 20, new_model, max_sequence_len))
print(generate_text("Elizabeth", 20, new_model, max_sequence_len))
print(generate_text("marry", 20, new_model, max_sequence_len))

Jane and the two gentlemen turning back and down the chief of their meeting with mr collins and i dont say
Elizabeth was forced to know whether she had formerly been ago she had not yet been calculated against ten as i
marry as soon as possible but i have heard the spot young ladies retired to express her father lodge by the
