# GloVe Scribe - Neural Text Generation with Semantic Embeddings

### Project Overview

- This project implements a text generation system using Recurrent Neural Networks (RNNs) powered by GloVe word embeddings. The architecture leverages deep learning techniques to generate contextually relevant and coherent text sequences.

### Technical Implementation
- Core Architecture: Sequential RNN model built with Keras
- Semantic Understanding: Integration of pre-trained GloVe embeddings
- Dual Implementation: Both Keras and native TensorFlow versions for framework flexibility and performance comparison
- Memory Management: LSTM/GRU layers for improved context retention

### Development Goals
- Enhance semantic coherence through GloVe embedding integration
- Implement and optimize RNN architecture for text generation
- Compare performance metrics between Keras and TensorFlow implementations
- Analyze and tune hyperparameters for optimal text generation

## Approach

- We use the poem `Heavens and Earth.txt` as the dataset for training the text generation model.
- To capture the meaning of each word, we use the pre-trained GloVe vector embeddings stored in `glove.6B.100d.txt`.

### 1. Importing Required Libraries

Here we import all the necessary Python libraries:
- `numpy`: For numerical operations
- `nltk`: For natural language processing tasks
- `sys`: For system-specific parameters and functions
- `tensorflow.keras.preprocessing.text.Tokenizer`: For converting text to sequences of integers
- `tensorflow.keras.preprocessing.sequence.pad_sequences`: For making sequences uniform length
- `tensorflow.keras.layers`: For various layer types
    - `Embedding`: Creates word embedding layer
    - `LSTM`: Long Short-Term Memory layer for sequence processing
    - `Dense`: Regular fully-connected neural network layer
- `tensorflow.keras.models.load_model`: Load saved Keras model

In [1]:
import numpy as np
import nltk
import sys
from datetime import datetime

from collections import Counter
from nltk import ngrams

import pickle

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Embedding, LSTM, Dense

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model

- Download NLTK model data

In [None]:
# Download NLTK model data (you need to do this once)
nltk.download("book")

### 2. Text Cleaning Functions

We define two text-cleaning helpers:
1. `clean_roman_numerals`: A function that removes Roman numerals from the text
2. `_RE_COMBINE_WHITESPACE`: A compiled regular expression that collapses runs of whitespace into a single space (defined in the next cell, where it is applied)

In [3]:
import re
def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)
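
As a quick check of what the pattern removes, the sketch below repeats the function so that it runs on its own and applies it to a few made-up strings (the sample inputs are illustrative and not taken from the poem). A standalone capital `I` is itself a valid Roman numeral, which is presumably why the pronoun is lowercased before the function is called in the next cell.

```python
import re

def clean_roman_numerals(text):
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '', text)

# A section heading: the numeral and its trailing period are removed,
# leaving two spaces that the whitespace regex collapses later.
heading = clean_roman_numerals("Part XIV. Dawn rose")

# The capital pronoun is treated as the numeral "I" and removed ...
pronoun_upper = clean_roman_numerals("I saw the sun")

# ... while the lowercase form is left alone (the pattern is case-sensitive).
pronoun_lower = clean_roman_numerals("i saw the sun")

print(repr(heading))        # 'Part  Dawn rose'
print(repr(pronoun_upper))  # ' saw the sun'
print(repr(pronoun_lower))  # 'i saw the sun'
```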

### 3. Tokenization and Vocabulary Building

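In this stage, we:
1. Read the poem from disk
2. Strip commas, colons, semicolons, quotation marks, apostrophes, and underscores
3. Turn line-ending `?` and `!` into periods, drop the remaining ones, and collapse repeated periods
4. Remove Roman numerals and collapse whitespace
5. Lowercase the text and split it into sentences with NLTK

The `SENTENCE_START`, `SENTENCE_END`, and `UNKNOWN_TOKEN` markers are defined but not inserted into the text; word-level tokenization and the vocabulary are built later by the Keras `Tokenizer`.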

In [4]:
import re
from nltk import tokenize

#alphabets= "([A-Za-z])"
#prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
#suffixes = "(Inc|Ltd|Jr|Sr|Co)"
#starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
#acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
#websites = "[.](com|net|org|io|gov|edu|me)"
#digits = "([0-9])"

# If you want to restrict the size of the vocabulary
# Right now, we set it in the next cell to be the entire vocabulary: vocabulary_size = len(word_freq.items())
#vocabulary_size = 3000

unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data (the SENTENCE_START and SENTENCE_END tokens are defined above but not appended here)
text = ''
print( "Reading txt file...")
with open(r'Heavens and Earth.txt', 'r', encoding="utf8") as f:
    text = f.read()

#text = text.replace(",\n","\n")

# too many commas if we do this
#text = text.replace(","," ,")
#text = text.replace(":"," ,")
#text = text.replace(";"," ,")

#.. so we do this instead
text = text.replace(",","")
text = text.replace(":","")
text = text.replace(";","")
text = text.replace("“","")
text = text.replace("”","")


# too many apostrophes in shakespeare
text = text.replace("’","")

text = text.replace("?\n",".\n")
text = text.replace("!\n",".\n")
text = text.replace("?","")
text = text.replace("!","")
text = text.replace("_","")
text = text.replace("...",".")
text = text.replace("..",".")
#text = text.replace("\n"," ")

text = text.replace('I ', 'i ')
text = clean_roman_numerals(text)
#text = text.replace('&', '')

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
text = _RE_COMBINE_WHITESPACE.sub(" ", text).strip()
print('done!')

Reading txt file...
done!
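
The order of the replacements in the cleaning cell matters: `?\n` and `!\n` have to become `.\n` before the remaining `?` and `!` characters are dropped, otherwise those sentence boundaries would be lost. A minimal sketch of the same steps as an ordered table, run on a made-up sample (the names `REPLACEMENTS` and `normalize` are hypothetical, and the pronoun and Roman-numeral steps are left out for brevity):

```python
import re

# Ordered (old, new) pairs mirroring the cleaning cell; earlier pairs run first.
REPLACEMENTS = [
    (",", ""), (":", ""), (";", ""), ("“", ""), ("”", ""), ("’", ""),
    ("?\n", ".\n"), ("!\n", ".\n"),  # keep line-ending marks as periods
    ("?", ""), ("!", ""),            # then drop the remaining ones
    ("_", ""), ("...", "."), ("..", "."),
]
_WHITESPACE = re.compile(r"\s+")

def normalize(text):
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Collapse newlines and repeated spaces into single spaces
    return _WHITESPACE.sub(" ", text).strip()

sample = "Dawn, eyes half-awake!\nWho calls?  The sun..."
cleaned = normalize(sample)
print(cleaned)  # Dawn eyes half-awake. Who calls The sun.
```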


- Convert to lowercase and take a look at the first 1000 characters

In [5]:
text = text.lower()

text[:1000]

'two visions of helen the first vision of helen slowly blanch-handed dawn eyes half-awake upraised magnificent the silver urn heaped with white roses at the trembling lip flowers that burn with crystalline accord and die not ever. like a pulsing heart beat from within against the fire-loud verge a milky vast transparency of light heavy with drowning stars a swimming void of august ether formless as the cloud and light made absolute. the mountains sighed turning in sleep. dawn held the frozen flame an instant high above the shaggy world then to the crowing of a thousand cocks poured out on earth the unconquerable sun. the centaurs awoke they aroused from their beds of pine their long flanks hoary with dew and their eyes deep-drowned in the primal slumber of stones stirred bright to the shine. and they stamped with their hooves and their gallop abased the ground. swifter than arrowy birds in an eager sky white-browed kings of the hills where old titans feast —cheiron ordered the charge w

- Apply the NLTK sentence tokenizer to the text

In [6]:
sentences = tokenize.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

two visions of helen the first vision of helen slowly blanch-handed dawn eyes half-awake upraised magnificent the silver urn heaped with white roses at the trembling lip flowers that burn with crystalline accord and die not ever.

like a pulsing heart beat from within against the fire-loud verge a milky vast transparency of light heavy with drowning stars a swimming void of august ether formless as the cloud and light made absolute.

the mountains sighed turning in sleep.

dawn held the frozen flame an instant high above the shaggy world then to the crowing of a thousand cocks poured out on earth the unconquerable sun.

the centaurs awoke they aroused from their beds of pine their long flanks hoary with dew and their eyes deep-drowned in the primal slumber of stones stirred bright to the shine.

and they stamped with their hooves and their gallop abased the ground.

swifter than arrowy birds in an eager sky white-browed kings of the hills where old titans feast —cheiron ordered the cha

### 4. Text Tokenization and Sequence Creation

In this stage, we:
1. Create sequences for next-word prediction
2. Use a context window of 3 words, so each training sample holds 3 input tokens plus the token to predict
3. Convert the text to numerical sequences
4. Prepare training data (X_train and y_train)

In [7]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Create sequences for next word prediction
sequence_length = 3
sequences = []

# Assuming sentences are split by period
for sentence in text.split('.'):  
    tokens = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(sequence_length, len(tokens)):
        sequences.append(tokens[i-sequence_length:i+1])

# Convert sequences to numpy array
sequences = np.array(sequences)
X_train = sequences[:, :-1]  # Input sequence
y_train = sequences[:, -1]   # Output next token

In [8]:
X_train.shape, y_train.shape

((11309, 3), (11309,))
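
To make the windowing concrete, here is a small sketch of the same slicing on a toy list of token ids (the ids and the helper name `make_windows` are made up for illustration). A sentence with `n` tokens contributes `n - 3` samples, and sentences of three tokens or fewer contribute none.

```python
def make_windows(tokens, sequence_length):
    """Return (inputs, targets) built from every run of sequence_length + 1 tokens."""
    inputs, targets = [], []
    for i in range(sequence_length, len(tokens)):
        window = tokens[i - sequence_length:i + 1]
        inputs.append(window[:-1])   # the context words
        targets.append(window[-1])   # the word to predict
    return inputs, targets

toy_tokens = [5, 2, 9, 4, 7]  # made-up token ids for a five-word sentence
X_toy, y_toy = make_windows(toy_tokens, sequence_length=3)
print(X_toy)  # [[5, 2, 9], [2, 9, 4]]
print(y_toy)  # [4, 7]
```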

### 5. Encode with GloVe Embeddings

In this section, we:
1. Load pre-trained GloVe embeddings from a file
2. Create a dictionary mapping words to their vector representations
3. Print the number of word vectors found

In [9]:
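glove_path = 'glove.6B.100d.txt'

embeddings_index = {}
# The context manager closes the file even if parsing fails
with open(glove_path, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # Report a malformed line and keep reading instead of stopping
            print(line)
            continue
        embeddings_index[word] = coefs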

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
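
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout; it uses 3 numbers per word instead of 100 and plain Python floats instead of NumPy arrays, purely to keep the example self-contained.

```python
import io

# Two invented lines in the GloVe text layout: a word, then its vector.
toy_file = io.StringIO("dawn 0.1 -0.2 0.3\nsun 0.5 0.0 -1.0\n")

toy_index = {}
for line in toy_file:
    values = line.split()
    # First field is the word, the rest are the vector components
    toy_index[values[0]] = [float(v) for v in values[1:]]

print(toy_index["sun"])  # [0.5, 0.0, -1.0]
print(len(toy_index))    # 2
```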


### 6. Creating Embedding Matrix

- In this section, we create an embedding matrix using the GloVe vectors for the vocabulary in the training data

In [10]:
embedding_dim = 100

# Get the total number of unique words in the training data
vocab_size = len(tokenizer.word_index) + 1

vocabulary_size = vocab_size
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words without a GloVe vector keep an all-zero row
    if i < vocabulary_size and embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [11]:
embedding_matrix.shape

(3539, 100)
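
Words that have no GloVe vector keep an all-zero row in the matrix, so the model cannot tell them apart from each other or from padding. A minimal sketch for measuring that coverage, shown on toy dictionaries (the helper `split_by_coverage` and the toy data are hypothetical; in the notebook the inputs would be `tokenizer.word_index` and `embeddings_index`):

```python
def split_by_coverage(word_index, embeddings_index):
    """Separate vocabulary words into those with and without a pre-trained vector."""
    covered = [w for w in word_index if w in embeddings_index]
    missing = [w for w in word_index if w not in embeddings_index]
    return covered, missing

# Toy stand-ins; which real words GloVe lacks is not claimed here
toy_word_index = {"the": 1, "dawn": 2, "cheiron": 3}
toy_embeddings = {"the": [0.1, 0.2], "dawn": [0.3, 0.4]}

covered, missing = split_by_coverage(toy_word_index, toy_embeddings)
coverage = f"{len(covered) / len(toy_word_index):.0%}"
print(missing)   # ['cheiron']
print(coverage)  # 67%
```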

### 7. Model Building

In this section, we define a Sequential neural network with:

- Pre-trained GloVe embedding layer (non-trainable)
- LSTM layer with 256 units
- Dense output layer with softmax activation
- Compiled using Sparse Categorical Crossentropy loss and Adam optimizer

In [12]:
model = Sequential()

# Embedding layer using GloVe
model.add(Embedding(
    input_dim=vocab_size, 
    output_dim=embedding_dim, 
    weights=[embedding_matrix], 
    input_length=sequence_length, 
    trainable=False
))

# LSTM layer for sequence prediction
model.add(LSTM(256, return_sequences=False))

# Output layer
model.add(Dense(vocab_size, activation='softmax'))

# Compile the model
model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 100)            353900    
                                                                 
 lstm (LSTM)                 (None, 256)               365568    
                                                                 
 dense (Dense)               (None, 3539)              909523    
                                                                 
Total params: 1,628,991
Trainable params: 1,275,091
Non-trainable params: 353,900
_________________________________________________________________
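
The parameter counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below assumes the standard LSTM layout of four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias.

```python
vocab_size, embedding_dim, lstm_units = 3539, 100, 256

# One 100-dimensional vector per vocabulary entry (frozen, hence non-trainable)
embedding_params = vocab_size * embedding_dim

# Four blocks, each with (input + recurrent) weights and one bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Fully connected layer from the LSTM state to one score per vocabulary entry
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params)  # 353900
print(lstm_params)       # 365568
print(dense_params)      # 909523
print(total_params)      # 1628991
```

The trainable total of 1,275,091 is the LSTM and Dense counts added together, since the embedding layer is frozen.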


### 8. Model Training and Saving

In this section, we train the model for 100 epochs with:

- Batch size of 64
- 20% validation split

Afterwards, we save the trained model and load it back.
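
A back-of-the-envelope check of these settings, assuming Keras holds out the trailing fraction of the arrays for validation before shuffling (as its `fit` documentation describes) and using the 11309 samples reported for `X_train`:

```python
import math

num_samples = 11309       # rows in X_train, from the shape printed earlier
validation_split = 0.2
batch_size = 64

# The leading part is used for training, the trailing part for validation
train_samples = int(num_samples * (1 - validation_split))
val_samples = num_samples - train_samples
steps_per_epoch = math.ceil(train_samples / batch_size)

print(train_samples, val_samples, steps_per_epoch)  # 9047 2262 142
```

Because the windows are created in reading order, the validation samples under this assumption come from the final part of the poem rather than from a random sample of it.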

In [13]:
model.fit(
    X_train, 
    y_train, 
    epochs=100, 
    batch_size=64, 
    validation_split=0.2
)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x2b087607df0>

In [14]:
# Save the model in HDF5 format via the Keras save() method
model.save('nlp_model.h5')  

In [15]:
# Loading the model
model = load_model('nlp_model.h5')

### 9. Text Generation

In this section, we:

- Provide a seed text as input to our model
- Specify the number of sentences to generate
- Restrict each sentence's maximum length and handle sentence endings

In [18]:
def generate_text(seed_text, num_sentences, model, max_sequence_len, tokenizer, max_length=30):
    generated_text = seed_text
    sentences_generated = 0

    while sentences_generated < num_sentences:
        words_in_current_sentence = 0

        while words_in_current_sentence < max_length:
            # Only the last max_sequence_len tokens are kept as context
            token_list = tokenizer.texts_to_sequences([generated_text])[0]
            token_list = pad_sequences([token_list], maxlen=max_sequence_len, padding='pre')

            # Greedy decoding: pick the most probable next token
            probabilities = model.predict(token_list, verbose=0)
            predicted = int(np.argmax(probabilities, axis=-1)[0])

            # Convert the predicted index back to a word ("" for the padding index 0)
            output_word = tokenizer.index_word.get(predicted, "")

            # Add the new word
            generated_text += " " + output_word
            words_in_current_sentence += 1

            # Check if sentence ended. The default Tokenizer filters out punctuation,
            # so with this tokenizer the branch is not expected to fire and
            # sentences end when max_length is reached.
            if output_word.endswith(('.', '!', '?')):
                sentences_generated += 1
                break

        # Force end sentence if it exceeds the length
        if words_in_current_sentence >= max_length and sentences_generated < num_sentences:
            generated_text += "."
            sentences_generated += 1

    return generated_text
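
Although the whole generated text is re-tokenized at every step, only the last `max_sequence_len` tokens reach the model: `pad_sequences` with `padding='pre'` and its default pre-truncation left-pads short inputs with zeros and drops tokens from the front of long ones. A pure-Python sketch of that behaviour for a single sequence (the helper name `last_context` and the token ids are made up):

```python
def last_context(token_ids, max_sequence_len):
    """Mimic pre-padding and pre-truncation on a single list of token ids."""
    padded = [0] * max_sequence_len + list(token_ids)
    return padded[-max_sequence_len:]

long_context = last_context([4, 8, 15, 16, 23], 3)
short_context = last_context([7], 3)
print(long_context)   # [15, 16, 23]
print(short_context)  # [0, 0, 7]
```

With a window of 3, a long seed sentence therefore influences the output only through its final three words.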

In [19]:
paragraph = generate_text(
    seed_text=input("Enter a sentence:"),
    num_sentences=5,
    model=model,
    max_sequence_len=sequence_length,
    tokenizer=tokenizer,
    max_length=20
)

paragraph

Enter a sentence: The centaurs awoke they aroused from their beds of pine and they stamped


'The centaurs awoke they aroused from their beds of pine and they stamped with their hooves and their gallop abased the ground vines grow in my garden blossoms a snake in size past. the wisdom the the of the woods to mourn their friend with strange solemnities of his hands like the long. cry of an old trumpet harsh with rust and gold the ballad rose assaulting struck and died into a clamorous. echo of light as a birds restless eyes and worn a little upon its eyes and the skies are vast. seeing her sleep like a swallow in deaths wide bed at last last peace and this your life then out.'