<h1> Recurrent Neural Network </h1>
<h2> by Nathan Dilla & John Haviland </h2>

<h3> Problem Statement </h3>

<h2> Dataset Overview </h2>

<h3> Purpose </h3>

<h3> Step 1: Import Libraries, Load & Preprocess Dataset </h3>

In this step, we import the necessary libraries, load the text dataset "poem.txt", and split it into lines. We then preprocess each line by removing punctuation, converting it to lowercase, and splitting it into words.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Masking, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import re
import matplotlib.pyplot as plt  # used for the embedding plot in Step 7
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

# Load in 'poem.txt' dataset with UTF-8 encoding
with open('poem.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Split the text into sentences/lines
sentences = text.split('\n')

# Preprocess the lines by removing punctuation and splitting into words
sentences = [re.sub(r'[^\w\s]', '', sentence).lower().split() for sentence in sentences if sentence.strip() != '']

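To see what the preprocessing does to a line, the same regular expression can be run on a small in-memory string. This is a minimal sketch: `sample_text` and the `preprocess` helper are hypothetical stand-ins for the contents of 'poem.txt' and are not part of the notebook above.

```python
import re

# Hypothetical sample standing in for the contents of 'poem.txt'
sample_text = "The best advice is, don't worry!\n\nBe happy; stay kind."

def preprocess(text):
    """Split into lines, drop blank lines, strip punctuation, lowercase, split into words."""
    lines = text.split('\n')
    return [re.sub(r'[^\w\s]', '', line).lower().split()
            for line in lines if line.strip() != '']

sample_sentences = preprocess(sample_text)
print(sample_sentences)
# [['the', 'best', 'advice', 'is', 'dont', 'worry'], ['be', 'happy', 'stay', 'kind']]
```

Note that removing the apostrophe turns "don't" into "dont", so contractions become single tokens that may not match the spelling used by a pre-trained vocabulary.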

<h3> Step 2: Tokenize, Prepare Sequences </h3>

We use the Keras Tokenizer to map each word in the lines to an integer index. We then build sequences and labels with a "sliding window" approach: for every line, we take each prefix of two or more words, so that the last word of each sequence is the label and the preceding words are the input. The sequences are left-padded with zeros to a common length.

In [None]:

# Initialize tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
total_words = len(tokenizer.word_index) + 1

# Create sequences and labels using 'sliding window' approach
input_sequences = []
for line in sentences:
    for i in range(1, len(line)):
        n_gram_sequence = line[:i + 1]
        # Check if all words in the sequence are in the tokenizer's word index
        if all(word in tokenizer.word_index for word in n_gram_sequence):
            input_sequences.append(n_gram_sequence)

# Check the sequences for any empty lists
input_sequences = [seq for seq in input_sequences if len(seq) > 0]

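# Convert the sequences to integer values using the tokenizer.
# Each sequence is already a list of words, so the whole list is passed in a single call;
# calling texts_to_sequences once per sequence would wrap every word in its own list.
input_sequences = tokenizer.texts_to_sequences(input_sequences)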

max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

X = input_sequences[:, :-1]
y = input_sequences[:, -1]

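The sequence construction can be traced by hand on a single line. The sketch below uses plain Python lists instead of the Keras utilities, and `word_index` is a hypothetical toy vocabulary rather than the one fitted on the poem.

```python
# Hypothetical toy vocabulary; index 0 is left free for padding, as in the Keras Tokenizer
word_index = {'the': 1, 'best': 2, 'advice': 3, 'is': 4}
line = ['the', 'best', 'advice', 'is']

# Every prefix of length >= 2, mirroring line[:i + 1] for i in range(1, len(line))
prefixes = [[word_index[w] for w in line[:i + 1]] for i in range(1, len(line))]

# Left-pad with zeros to a common length, as padding='pre' does
max_len = max(len(p) for p in prefixes)
padded = [[0] * (max_len - len(p)) + p for p in prefixes]

# Inputs are all tokens except the last; the label is the last token
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

print(X_toy)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y_toy)  # [2, 3, 4]
```

A line of four words therefore yields three training examples, each predicting one more word than the last.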

<h3> Step 3: Load in Pre-trained GloVe Embeddings </h3>

In this step, we load pre-trained GloVe word embeddings from the file 'glove.6B.100d.txt' and build an embedding matrix of shape (total_words, 100). Each word in the dataset's vocabulary is matched to its pre-trained vector; words that are not found in GloVe keep an all-zero row.

In [None]:
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_matrix = np.zeros((total_words, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

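Because words missing from GloVe end up with all-zero rows, it can be worth checking how much of the vocabulary is actually covered. The following is a rough sketch on made-up data: `fake_glove` and `toy_word_index` are hypothetical, and the vectors have three dimensions rather than 100.

```python
import io

# Hypothetical vectors in the GloVe text format: a word followed by its coefficients
fake_glove = io.StringIO("the 0.1 0.2 0.3\nbest 0.4 0.5 0.6\n")

toy_embeddings = {}
for row in fake_glove:
    values = row.split()
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

# Hypothetical vocabulary; 'advice' deliberately has no pre-trained vector
toy_word_index = {'the': 1, 'best': 2, 'advice': 3}

# Words without a pre-trained vector would keep a zero row in the embedding matrix
missing = [w for w in toy_word_index if w not in toy_embeddings]
coverage = 1 - len(missing) / len(toy_word_index)
print(f"Coverage: {coverage:.0%}, missing: {missing}")
```

The same check could be run against `embeddings_index` and `tokenizer.word_index` from the cells above.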
<h3> Step 4: Build & Compile LSTM Model </h3>



In [None]:
# Build LSTM model w/ embedding and dense layers
model = Sequential()
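# mask_zero=True masks only the padding index 0; a separate Masking(mask_value=0.0) layer
# would also mask real words whose GloVe row is all zeros
model.add(Embedding(total_words, 100, weights=[embedding_matrix], input_length=max_sequence_length - 1, trainable=False, mask_zero=True))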
model.add(LSTM(128, return_sequences=False))
model.add(Dense(total_words, activation='softmax'))

# Print LSTM model architecture
model.summary()

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

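The final Dense layer outputs a softmax distribution over the vocabulary, and the sparse categorical cross-entropy loss is the negative log-probability of the true next word, given as an integer index. The sketch below is a simplified pure-Python illustration of that calculation on hypothetical scores; it is not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_index, probs):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(probs[true_index])

# Hypothetical scores over a 4-word vocabulary; the true next word has index 2
probs = softmax([1.0, 2.0, 3.0, 0.5])
loss = sparse_categorical_crossentropy(2, probs)
print(probs, loss)

# A model that guesses uniformly over 4 words has loss log(4)
uniform_loss = sparse_categorical_crossentropy(2, softmax([0.0, 0.0, 0.0, 0.0]))
```

Under this view, a useful baseline for the real model is log(total_words), the loss of a uniform guess over the whole vocabulary.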

<h3> Step 5: Train the Model </h3>

In [None]:
# Define model callbacks (using ModelCheckpoint and EarlyStopping)
checkpoint = ModelCheckpoint("best_model.h5", monitor='val_loss', save_best_only=True, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

# Train model
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=500, verbose=2, callbacks=[checkpoint, early_stopping])

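The effect of the `patience` setting can be illustrated with a simplified simulation. The function below is a hypothetical sketch of the general idea (stop after a number of consecutive epochs without improvement in validation loss), not the Keras callback itself, and the loss values are made up.

```python
def epochs_run_with_patience(val_losses, patience):
    """Return how many epochs run before stopping, given per-epoch validation losses."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch      # too many epochs without improvement
    return len(val_losses)

# Hypothetical validation losses for seven epochs
history = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.1]
stopped_at = epochs_run_with_patience(history, patience=3)
print(stopped_at)  # 6
```

In this toy run, training stops at epoch 6 even though the best loss so far was at epoch 3, which is why saving only the best checkpoint alongside early stopping is useful.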

<h3> Step 6: Test the Model (Generate Text) </h3>

In [None]:
# Function to generate text
def generate_text(seed_text, next_words, model, max_sequence_length):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
        predicted = model.predict(token_list, verbose=0)
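        predicted_word_index = int(np.argmax(predicted, axis=-1)[0])
        # Index 0 is reserved for padding and has no entry in index_word; stop if it is predicted
        predicted_word = tokenizer.index_word.get(predicted_word_index)
        if predicted_word is None:
            break
        seed_text += " " + predicted_word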
    return seed_text

# Test the model by generating text
seed_text = "The best advice is"
generated_text = generate_text(seed_text, next_words=3, model=model, max_sequence_length=max_sequence_length)
print(generated_text)

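The generation function above uses greedy decoding: at each step it appends the single most probable next word and feeds the longer text back in. The loop structure can be sketched without a trained model; here `toy_next_word_probs` is a hypothetical lookup table standing in for `model.predict`, and it only looks at the last word.

```python
# Hypothetical next-word table standing in for model.predict; not a trained model
toy_next_word_probs = {
    'the': {'best': 0.7, 'advice': 0.3},
    'best': {'advice': 0.9, 'is': 0.1},
    'advice': {'is': 0.8, 'the': 0.2},
    'is': {'the': 0.6, 'best': 0.4},
}

def greedy_generate(seed_text, next_words):
    """Append the single most probable next word, one step at a time."""
    words = seed_text.lower().split()
    for _ in range(next_words):
        candidates = toy_next_word_probs.get(words[-1])
        if not candidates:
            break  # no prediction available for this word
        words.append(max(candidates, key=candidates.get))
    return ' '.join(words)

toy_output = greedy_generate("The best", next_words=3)
print(toy_output)  # the best advice is the
```

Greedy decoding is deterministic, so the same seed always gives the same continuation, and with a longer `next_words` this toy table would cycle through the same four words.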
<h3> Step 7: Explore Embeddings using Cosine Similarity </h3>

In [None]:
word1 = "girl"
word2 = "boy"
index1 = tokenizer.word_index[word1]
index2 = tokenizer.word_index[word2]
vector1 = embedding_matrix[index1]
vector2 = embedding_matrix[index2]
similarity = cosine_similarity([vector1], [vector2])[0][0]
print(f"Cosine Similarity between '{word1}' and '{word2}': {similarity}")

# Plot only the first two of the 100 embedding dimensions (a rough visual, not the full geometry)
origin = np.zeros(2)
fig, ax = plt.subplots()
ax.quiver(*origin, vector1[0], vector1[1], angles='xy', scale_units='xy', scale=1, color='r', label=word1)
ax.quiver(*origin, vector2[0], vector2[1], angles='xy', scale_units='xy', scale=1, color='b', label=word2)
ax.set_xlim([-1, 1])
ax.set_ylim([-1, 1])
ax.legend(loc='upper right')
plt.show()

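Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores magnitude. The following is a minimal pure-Python sketch of that formula on made-up vectors; returning 0.0 for a zero vector is a convention chosen here for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch, e.g. for a word with no GloVe vector
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))            # perpendicular: 0
print(cosine([1.0, 0.0], [-1.0, 0.0]))           # opposite: -1
```

Since the score uses all 100 dimensions, the angle seen in a two-dimensional plot of the first two coordinates will generally not match it.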
<h3> Step 8: Compute Performance Metrics </h3>

<h2> Analysis of our Findings </h2>



<h2> References </h2>

https://www.kaggle.com/datasets/harshalgadhe/poem-generation/
