<a href="https://colab.research.google.com/github/nkrj01/GenAI/blob/main/LSTM_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Introduction**

In this notebook, we trained an LSTM model with an attention mechanism on the text of "Alice in Wonderland" with the goal of generating new text in the author's style.

We compared two models to assess their effectiveness in replicating the author's style. Both models are identical in architecture, with the only difference being in the text embedding layer.

**Model 1**: This model initializes its embedding layer with random values. The word representations therefore start out random and are learned entirely during training.

**Model 2**: In contrast, this model initializes its embedding layer with vectors from OpenAI's `text-embedding-ada-002` model, one per vocabulary word. These pre-trained embeddings give the model a more informed starting point. They are also fine-tuned during training, which lets the model adapt them to the style and context of "Alice in Wonderland".

The tables below show five outputs (20 tokens each) generated from five seed sequences (10 tokens each). Overall, Model 2 produced more coherent text than Model 1 and reached a lower final training loss (~0.02 versus ~0.06).

---


**Table 1: Model 1 (random embeddings) | 10-token seed | 100 epochs | final loss ~0.06**

|index|Seed String|Generated string|
|---|---|---|
|0|first she of little alice herself and once again|took up the fan and a little thing she was now about two feet high and was going on shrinking|
|1|the tiny hands were upon her knee and the|queen was silent in her pocket and she felt that the first thought that she looked down at her hands|
|2|bright eager eyes were looking up into could hear|the white rabbit put on her spectacles and began staring at the hatter who turned pale and fidgeted give your|
|3|the very tones of her voice and see that queer|well ive been that there were no use now thought alice theyre poor little thing said alice in a coaxing|
|4|little of her head to keep back the wandering|cried the mouse in a deep voice what are said the youth one only say this again is very confusing|



**Table 2: Model 2 (ada embeddings) | 10-token seed | 100 epochs | final loss ~0.02**

|index|Seed String|Generated string|
|---|---|---|
|0|first she of little alice herself and once again|alice thought she might as well wait as she had nothing else to do and perhaps after all it might|
|1|the tiny hands were upon her knee and the|jury all brightened up again please your majesty said the knave i didnt write it and they cant prove i|
|2|bright eager eyes were looking up into could hear|the rabbit say to itself oh dear oh dear i shall be late when she thought it over afterwards it|
|3|the very tones of her voice and see that queer|first to be everything is today in it but there were any more breadandbutter and then said so she stood|
|4|little of her head to keep back the wandering|and if it makes me grow larger i can reach the key and if it makes me grow smaller i|



In [None]:
! pip install cohere
! pip install openai

In [None]:
import ast
import openai
import pandas as pd
import numpy as np
import tensorflow as tf
import keras
from keras.layers import Dense, LSTM, Input, Embedding, Attention, Flatten, Bidirectional
from keras.preprocessing.text import Tokenizer
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import string
import re
from rich.console import Console
console = Console()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# import the book
file_path = '/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/pg11.txt'

# Use 'with' to ensure the file is properly handled
with open(file_path, 'r', encoding='utf-8') as file:
    # Read the content of the file
    file_content = file.read()

1. Pre-process and clean up
2. Tokenization
3. Creating x and y data points for model training
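
The third step is a sliding window over the token list: every run of consecutive tokens becomes one input, and the token that follows it becomes the label. Here is a minimal sketch on a short example phrase. The window size of 3 and the name `pairs` are illustrative only; the notebook itself uses `SEQUENCE_LENGTH = 10`.

```python
# Toy illustration of the sliding-window construction on a short phrase.
tokens = "alice was beginning to get very tired of sitting".split()
window = 3  # the notebook uses SEQUENCE_LENGTH = 10; 3 keeps the example short

# Each window of `window` consecutive tokens predicts the token that follows it.
pairs = [(tokens[i - window:i], tokens[i]) for i in range(window, len(tokens))]

for context, target in pairs[:2]:
    print(context, "->", target)
```

A list of `n` tokens yields `n - window` training pairs, which is why the number of sequences is slightly smaller than the number of tokens.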

In [None]:
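def get_tokens(sentence):
  # drop non-ASCII characters (e.g. curly quotes)
  sentence = sentence.encode('utf-8').decode('unicode_escape').encode('ascii', 'ignore').decode()
  # lowercase and flatten line breaks into spaces
  sentence = sentence.lower()
  sentence = sentence.replace('\n', ' ')
  # remove underscores, then all remaining punctuation
  sentence = sentence.replace("_", "")
  translation_table = str.maketrans('', '', string.punctuation)
  sentence = sentence.translate(translation_table)
  # split the cleaned text into word tokens
  words = word_tokenize(sentence)
  return words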

tokens = get_tokens(file_content)
print("All tokens: ", tokens)

x = []
y = []
SEQUENCE_LENGTH = 10
for i in range(SEQUENCE_LENGTH , len(tokens)):
    x.append(tokens[i-SEQUENCE_LENGTH :i])
    y.append(tokens[i])
    # if i<= sequence_length+1:
    #     print(x)
    #     print(y)

console.print("total number of sequences: ", len(x), style="bold")

console.print("First sequence: ", x[0], style="bold")
console.print("First label: ", [y[0]], style="bold")



1. Creating vocabulary containing all tokens
2. Fitting the tokenizer to create index-token pair dictionary
3. Defining a word-to-integer-token encoder
4. Creating x_train and y_train by encoding word-tokens to integer-tokens
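
The four steps above can be sketched in plain Python, without the Keras `Tokenizer`. This is only a stand-in for illustration: the toy sentence and the helper `encode` are hypothetical. It follows the same convention as the tokenizer's `word_index`, where indices start at 1 and 0 is left free for unknown words.

```python
from collections import Counter

tokens = "the cat saw the dog and the cat ran".split()

# Most frequent word gets index 1; index 0 is left free for unknown words.
vocab = [word for word, _ in Counter(tokens).most_common()]
word_index = {word: i for i, word in enumerate(vocab, start=1)}

def encode(words, word_index, unknown=0):
    """Map each word to its integer index, using `unknown` for unseen words."""
    return [word_index.get(w, unknown) for w in words]

print(word_index)
print(encode(["the", "dog", "barked"], word_index))
```

Because "barked" never occurs in the toy text, it is encoded as 0, which mirrors how the notebook fills in 0 for seed words that are missing from the vocabulary.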

In [None]:
from collections import Counter
counter = Counter()
counter.update(tokens)
all_words = counter.most_common()
vocab = [i[0] for i in all_words]
print("length of vocabulary: ", len(vocab))

# create tokenizer
def create_tokenizer(text):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)
    return tokenizer

# fit on vocab
tokenizer = create_tokenizer(vocab)

# print word index pairs
word_index = tokenizer.word_index
print("word-index pair: ", word_index)

# create encoder function
def encode_docs(tokenizer, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    return encoded

# convert texts to encoded sequences
x_train = encode_docs(tokenizer, x)
y_train = encode_docs(tokenizer, y)
y_train = np.array(y_train).flatten().astype(np.int32)

print("word tokens of first predictor sequence: ", x[0])
print("integer tokens of first predictor sequence: ", x_train[0])
print("word token of first label: ",y[0])
print("integer token of first label: ", y_train[0])

length of vocabulary:  2698
word tokens of first predictor sequence:  ['chapter', 'i', 'down', 'the', 'rabbithole', 'alice', 'was', 'beginning', 'to', 'get']
integer tokens of first predictor sequence:  [293, 9, 36, 1, 812, 10, 13, 269, 3, 99]
word tokens of first label:  very
integer tokens of first label:  27


Retrieving ada embeddings from the OpenAI API for all the words in the vocabulary
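
The vocabulary is sent to the API in batches rather than in a single request. The batching logic can be sketched on its own, without any API call. The helper name `iter_batches` and the placeholder word list are hypothetical; the batch size of 500 matches the notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder vocabulary whose size is an exact multiple of the batch size.
words = [f"word{i}" for i in range(1000)]
sizes = [len(batch) for batch in iter_batches(words, 500)]
print(sizes)  # [500, 500]
```

Stepping through the list with `range(0, len(items), batch_size)` never produces an empty batch, including when the list length is an exact multiple of the batch size.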

In [None]:
from openai import OpenAI
from google.colab import userdata
client = OpenAI(api_key=userdata.get('openai'))

def getAdaEmbedding(train_text: list, model="text-embedding-ada-002", batch_size=500) -> list:
  ada_embedding = []
  # step through the input in batches; the last batch may be shorter
  for batch_start in range(0, len(train_text), batch_size):
    text = train_text[batch_start:batch_start + batch_size]
    output = client.embeddings.create(input=text, model=model)
    # collect one embedding vector per input string
    for item in output.data:
      ada_embedding.append(item.embedding)
  return ada_embedding

# get embeddings from openai API
ada_embedding = getAdaEmbedding(vocab)
ada_embedding = np.array(ada_embedding)
print("output shape: ", ada_embedding.shape)

# since this is a one time job, save the embedding vectors as a csv file
file_path_ada = '/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/ada_embeddings.csv'
np.savetxt(file_path_ada, ada_embedding, delimiter=',', fmt='%f')

Creating embedding layer for both Model 1 and Model 2.
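
The embedding matrix needs one more row than the vocabulary has words, because tokenizer indices start at 1 and index 0 stands for unknown words. Below is a minimal sketch of that alignment with plain lists. The 3-dimensional vectors and the name `embedding_matrix` are made up for illustration and are not real ada embeddings.

```python
# Hypothetical 3-dimensional vectors standing in for the ada embeddings.
vocab = ["the", "alice", "rabbit"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

# Tokenizer-style indices start at 1, so word i must live in row i.
word_index = {word: i for i, word in enumerate(vocab, start=1)}

# Prepend a zero row so that index 0 (unknown word) has a vector too.
dim = len(vectors[0])
embedding_matrix = [[0.0] * dim] + vectors

print(len(embedding_matrix))                   # one more row than the vocabulary
print(embedding_matrix[word_index["alice"]])   # the vector stored for "alice"
```

This alignment only holds when the vectors are stored in the same order as the tokenizer's indices.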

In [None]:
from keras.initializers import Constant
# import ada-embedding csv file
file_path_ada = '/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/ada_embeddings.csv'
ada_embedding = np.loadtxt(file_path_ada, delimiter=',')
EMBEDDING_DIM = ada_embedding.shape[1]
ada_embedding = np.vstack([np.zeros(EMBEDDING_DIM), ada_embedding]) # adding row of zeroes for words not in vocab

num_words = len(vocab) + 1

# create an embedding layer from a matrix of initial values
# (ada vectors for Model 2, random values for Model 1);
# in both cases the embeddings are fine-tuned during training

def embedding_layer(initial_conditions):
  embedding_layer = Embedding(num_words,
                              EMBEDDING_DIM,
                              embeddings_initializer=Constant(initial_conditions),
                              trainable=True)
  return embedding_layer

embedding_layer_ada = embedding_layer(ada_embedding)
emebdding_layer_normal = embedding_layer(np.random.uniform(-1, 1, (num_words, EMBEDDING_DIM)))

Model architecture, compilation, and fitting
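
By default, the Keras `Attention` layer computes dot-product attention: query-key dot products are turned into weights by a softmax, and the output is the weighted sum of the value vectors. The sketch below shows that computation for a single query in plain Python. It is an illustration of the technique, not the Keras implementation, and the function names and toy vectors are hypothetical. The Keras layer documents its inputs as 3D tensors of shape `(batch, timesteps, dim)`, and the sketch corresponds to one batch element under that layout.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    """Weight each value vector by the softmax of the query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Three hypothetical timestep vectors; the query is the last one (self-attention).
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = dot_product_attention(steps[-1], steps, steps)
print(weights, context)
```

The last timestep gets the largest weight here because it has the largest dot product with the query.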

In [None]:
def create_model(embedding_layer):
  inputs = Input(shape=(SEQUENCE_LENGTH,))
  l = embedding_layer(inputs)
  l = LSTM(128, return_sequences=False)(l)
  l = Attention()([l, l])
  l = Dense(num_words, activation='softmax')(l)

  # create the model
  model = keras.Model(inputs, l)

  # compile the model
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

  return model

# model = create_model(embedding_layer_ada)
model = create_model(emebdding_layer_normal)

x_train = tf.convert_to_tensor(x_train) #  list to tensor
# Train the model
model.fit(x_train,
          y_train,
          batch_size=32,
          epochs=100,
          )

# save the model
# model.save('/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/my_model')


# only for testing
'''input = tf.convert_to_tensor(x_train[0:2])
tokens = model(input)
print(tokens)'''

Text generation
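
Generation is greedy and autoregressive: the model predicts the most likely next token, that token is appended to the input window, the oldest token is dropped, and the process repeats. Below is a minimal sketch of the loop with a stand-in for the model. `generate_greedy` and `toy_predict` are hypothetical names; in the notebook the prediction step is the argmax over the model's softmax output.

```python
def generate_greedy(seed, predict_next, n_tokens):
    """Generate `n_tokens` tokens, feeding each prediction back into the window."""
    window = list(seed)          # copy, so the caller's seed is left untouched
    generated = []
    for _ in range(n_tokens):
        next_token = predict_next(window)
        generated.append(next_token)
        window = window[1:] + [next_token]   # drop oldest token, append newest
    return generated

# Hypothetical stand-in for the model: predicts (sum of window) mod 7.
def toy_predict(window):
    return sum(window) % 7

seed = [3, 1, 4]
print(generate_greedy(seed, toy_predict, 4))
```

The window length stays constant throughout, which matters because the model's input shape is fixed at `SEQUENCE_LENGTH`.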

In [None]:
# read the seed content from a text file and tokenize
file_path = '/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/seed_sequence.txt'

with open(file_path, 'r', encoding='utf-8') as file:
    # Read the content of the file
    seed_content = file.read()

seed_tokens = get_tokens(seed_content)
# print(seed_tokens)

# create SEQUENCE_LENGTH tokens list as an input to the model
# encode: text-to-integers
count = 0
input_sequence = []
for i in range(5):
  docs = seed_tokens[count:count+SEQUENCE_LENGTH]
  seq = tokenizer.texts_to_sequences(docs)
  # fill zero for unknown words
  for inner_list in seq:
    # Check if the inner list is empty
    if not inner_list:
        # Fill the empty list with zero
        inner_list.append(0)

  seq = [item for sublist in seq for item in sublist]
  input_sequence.append(seq)
  count = count + SEQUENCE_LENGTH

print(input_sequence[0])

def generate_text(input_sequence, generate_text_len=20):
    # work on a copy so the caller's seed sequence is not modified
    window = list(input_sequence)
    output_tokens = []
    for i in range(generate_text_len):
      seed_sequence_tf = tf.convert_to_tensor(np.array(window).reshape(1, SEQUENCE_LENGTH))
      model_output = model.predict(seed_sequence_tf, verbose=0)
      # greedy decoding: pick the most probable next token
      next_token = np.argmax(model_output, axis=1)
      # slide the window: append the new token and drop the oldest one
      window.append(next_token[0])
      window.pop(0)
      output_tokens.append(next_token[0])

    # print(output_tokens)
    return output_tokens

input_output_pair = {}
for seed in input_sequence:  # `seed` avoids shadowing the built-in `input`
  # print("seed tokens: ", seed)
  # decode the seed tokens back to text
  seed_texts = tokenizer.sequences_to_texts([seed])[0]
  print("seed_texts: ", seed_texts)
  generated_tokens = generate_text(seed)
  # print("generated tokens: ", generated_tokens)
  generated_texts = tokenizer.sequences_to_texts([generated_tokens])[0]
  print("generated_texts: ", generated_texts)
  input_output_pair[seed_texts] = generated_texts


df = pd.DataFrame({"Seed String": list(input_output_pair.keys()),
                   "Generated string": list(input_output_pair.values())})
df.head()