<a href="https://colab.research.google.com/github/nkrj01/GenAI/blob/main/LSTM_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Introduction**

In this notebook, I train an LSTM model on the text of "Alice in Wonderland." Given a two-word seed, the trained model generates new text one word at a time. The output is somewhat meaningful, but improving its quality would require more training data.

In [None]:
import string

import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from keras.layers import Dense, LSTM, Input, Embedding, TextVectorization, Dropout
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the 'punkt' tokenizer models
nltk.download('punkt')

In [None]:
file_path = '/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/pg11.txt'

# Use 'with' to ensure the file is properly handled
with open(file_path, 'r', encoding='utf-8') as file:
    # Read the content of the file
    file_content = file.read()

In [None]:
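def get_tokens(sentence):
  """Strip non-ASCII characters, underscores, and punctuation, then split into word tokens."""
  # Drop every character that cannot be represented in ASCII (e.g. curly quotes)
  sentence = sentence.encode('ascii', 'ignore').decode()
  sentence = sentence.replace('\n', ' ')
  # Underscores mark emphasis in the source text
  sentence = sentence.replace("_", "")
  translation_table = str.maketrans('', '', string.punctuation)
  sentence = sentence.translate(translation_table)
  words = word_tokenize(sentence)
  return words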

tokens = get_tokens(file_content)
# print(tokens)

### **Creating training data: equal-length sequences (2 tokens) from the corpus and the corresponding next token to predict (x and y)**




In [None]:
x = []
y = []
sequence_length = 2
for i in range(sequence_length, len(tokens)):
    x.append(tokens[i-sequence_length:i])
    y.append(tokens[i])
    # if i<= sequence_length+1:
    #     print(x)
    #     print(y)

print("total number of sequences: ", len(x))
# Join each token pair into a single string, since the model takes raw text as input
x = [[" ".join(seq)] for seq in x]
# print("First sequence: ", x[0])
# print("Second sequence: ", x[1])

# Keep the first 99% of the sequences for training
training_data_len = int(np.ceil( len(x) * .99 ))
x_train = x[0:training_data_len]
x_train = np.array(x_train)
y_train_words = y[0:training_data_len]

print("train data length: ", len(x_train))
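
The sliding-window construction above is easier to see on a toy input. The sketch below repeats the same loop on five tokens; `make_sequences` and the `toy_*` names are hypothetical and used only for illustration.

```python
def make_sequences(tokens, sequence_length=2):
    """Return (inputs, targets): each input is `sequence_length` consecutive
    tokens and the target is the token that immediately follows them."""
    inputs, targets = [], []
    for i in range(sequence_length, len(tokens)):
        inputs.append(tokens[i - sequence_length:i])
        targets.append(tokens[i])
    return inputs, targets

toy_tokens = ["alice", "was", "beginning", "to", "get"]
toy_x, toy_y = make_sequences(toy_tokens)
for context, target in zip(toy_x, toy_y):
    print(context, "->", target)
# ['alice', 'was'] -> beginning
# ['was', 'beginning'] -> to
# ['beginning', 'to'] -> get
```

A corpus of N tokens therefore yields N - 2 training pairs when the window length is 2.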

### **Data Preparation and preprocessing**

1.   Creating a TextVectorization layer that maps words to integer token IDs for the Embedding layer
2.   Transforming labels from words to token IDs




In [None]:
def text_cleaning(text):
  text_tf = tf.convert_to_tensor(text, dtype=tf.string)
  text_tf = tf.strings.lower(text_tf)
  return text_tf

max_features = 3000

vectorize_layer = TextVectorization(
    standardize=text_cleaning,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length
)
vectorize_layer.adapt(x_train)
vocab = vectorize_layer.get_vocabulary()
print("Vocabulary length: ", len(vocab))
# print(vocab)

def transform_labels(y):
  """Map each label word to its integer ID in the vectorizer's vocabulary."""
  words_tf = tf.data.Dataset.from_tensor_slices(y)
  # The layer returns sequence_length IDs per input; a single word fills only position 0
  tokens = words_tf.map(lambda word: vectorize_layer(word))
  token_list = []
  for token in tokens:
    token_list.append(token.numpy()[0])
  token_list = np.array(token_list).reshape(-1, 1).astype(np.float32)
  return token_list

y_train = transform_labels(y_train_words)

Vocabulary length:  2717
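
To make the word-to-ID mapping concrete, here is a minimal pure-Python sketch of what an integer-mode `TextVectorization` vocabulary looks like: index 0 is reserved for padding (`""`), index 1 for out-of-vocabulary words (`"[UNK]"`), and the remaining words are ordered by descending frequency. The helpers `build_vocab` and `encode` are hypothetical names for illustration, not part of Keras, and the sketch ignores how Keras breaks ties between equally frequent words.

```python
from collections import Counter

def build_vocab(words, max_tokens=None):
    """Frequency-sorted vocabulary with index 0 = padding and index 1 = unknown."""
    counts = Counter(w.lower() for w in words)
    by_frequency = [w for w, _ in counts.most_common()]
    vocab = ["", "[UNK]"] + by_frequency
    # max_tokens caps the total size, including the two reserved entries
    return vocab[:max_tokens] if max_tokens else vocab

def encode(words, vocab):
    """Look up each word's ID; unknown words map to index 1."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(w.lower(), 1) for w in words]

toy_vocab = build_vocab(["the", "cat", "the", "hat", "the", "cat"])
print(toy_vocab)                           # ['', '[UNK]', 'the', 'cat', 'hat']
print(encode(["The", "dog"], toy_vocab))   # [2, 1]
```

Since the notebook's vocabulary has 2717 entries, below the `max_features` cap of 3000, no words from the training text are truncated.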


### **Model Architecture, compilation, and Fit**

In [None]:
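embedding_dim = 128
# The model takes raw strings; the vectorization layer converts them to token IDs
inputs = Input(shape=(1,), dtype=tf.string)
x = vectorize_layer(inputs)
x = Embedding(max_features+1, embedding_dim)(x)
x = Dropout(0.2)(x)
# return_sequences=False returns only the output at the last time step
x = LSTM(128, return_sequences=False)(x)
# One softmax probability per vocabulary entry
output = Dense(len(vocab), activation='softmax')(x)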

model = keras.Model(inputs, output)

# input = tf.convert_to_tensor(x_train[0:2], dtype=tf.string)
# tokens = model(input)
# print(tokens)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train the model
model.fit(x_train,
          y_train,
          batch_size=32,
          epochs=100,
          )

# Save the model
model.save('/content/drive/MyDrive/Colab Notebooks/LSTM_text_generation/my_model')
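
The `sparse_categorical_crossentropy` loss is the reason the labels can stay as integer token IDs rather than one-hot vectors. As a rough sketch (ignoring the numerical clipping Keras applies), the loss for one example is the negative log of the probability the model assigns to the true ID. The function and the `toy_probs` values below are hypothetical illustrations, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability at the true class index."""
    return -math.log(probs[label])

toy_probs = [0.1, 0.7, 0.2]   # softmax output over a 3-word vocabulary
loss_right = sparse_categorical_crossentropy(toy_probs, 1)  # true word got 0.7
loss_wrong = sparse_categorical_crossentropy(toy_probs, 0)  # true word got 0.1
print(round(loss_right, 3), round(loss_wrong, 3))  # 0.357 2.303
```

The loss shrinks toward zero as the probability of the correct next word approaches 1.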

### **Text generation**

In [None]:
def invert_vectorization(encoded_sequence, vocabulary):
    return [vocabulary[i] for i in encoded_sequence]

def generate_text(seed_string, len_sequence=20):
  """Greedily generate len_sequence words from the first sequence_length seed words."""
  seed_tokens = get_tokens(seed_string)[0:sequence_length]
  output_string = ""
  for _ in range(len_sequence):
    seed_sequence = np.array([" ".join(seed_tokens)]).reshape(-1, 1)
    seed_sequence_tf = tf.convert_to_tensor(seed_sequence, dtype=tf.string)
    model_output = model.predict(seed_sequence_tf, verbose=0)
    # Greedy decoding: pick the most probable next word
    next_token = np.argmax(model_output, axis=1)
    next_word = invert_vectorization(next_token, vocab)
    output_string = output_string + " " + next_word[0]
    # Slide the context window forward by one word
    seed_tokens.append(next_word[0])
    seed_tokens.pop(0)
  return output_string

seed_string = ["she pictured", "little sister", "she would", "the simple", "king said", "little children"]

output_string = []
for seed in seed_string:
  output_string.append(generate_text(seed, 20))

df = pd.DataFrame({"Seed String": seed_string, "Generated string": output_string})
df.head()

### **Results**

The table below shows the output (20 words each) generated from two-word seed strings. Six seeds were defined, but `df.head()` displays only the first five rows. The model produces somewhat meaningful sentences, but training on more data would be necessary to improve the text quality.


|index|Seed String|Generated string|
|---|---|---|
|0|she pictured| such a thing before but she could not think of anything to put it into one of the court and|
|1|little sister| but more rather more you promised to tell me said alice a little scream of laughter oh hush the rabbit|
|2|she would| have appeared to them and considered a little scream of laughter oh hush the rabbit came near her about the|
|3|the simple| rules their friends had taught them such as sure i dont think alice went on in the pool of tears|
|4|king said| to herself as she could not think of anything to put it into one of the court and got behind|
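
Several rows share long identical runs: rows 1 and 2 both contain "a little scream of laughter oh hush the rabbit", and rows 0 and 4 both contain "she could not think of anything to put it into one of the court and". This follows from combining `np.argmax` with a two-word context: once the same two words appear, the continuation is always the same. One common alternative, which this notebook does not use, is to sample the next word from the predicted distribution, optionally rescaled by a temperature. The sketch below is a minimal stdlib-only illustration; `sample_with_temperature` and the toy probabilities are hypothetical.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Rescale a probability vector by temperature and sample one index from it."""
    # Work in log space; the small floor avoids log(0)
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    # Subtracting the peak keeps exp() numerically stable
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

next_word_probs = [0.6, 0.3, 0.1]   # stand-in for one row of the model's softmax output
rng = random.Random(0)              # fixed seed so the run is repeatable
draws = [sample_with_temperature(next_word_probs, 1.0, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(next_word_probs))})

# A very low temperature approaches greedy argmax
print(sample_with_temperature(next_word_probs, 0.01, rng))
```

Higher temperatures flatten the distribution and add variety, while temperatures near zero reproduce the greedy behaviour seen in the table.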