# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the full text of the book into a single text source.  
- Split into training (80%) and validation (20%).  

In [1]:
from collections import Counter
import tensorflow as tf
import re
import numpy as np

# download
!wget https://www.gutenberg.org/cache/epub/41/pg41.txt



--2025-04-23 18:35:57--  https://www.gutenberg.org/cache/epub/41/pg41.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 90938 (89K) [text/plain]
Saving to: ‘pg41.txt’


2025-04-23 18:35:58 (1.14 MB/s) - ‘pg41.txt’ saved [90938/90938]



In [2]:
with open('pg41.txt', 'r', encoding='utf-8') as f:
    text = f.read()

start = "*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***"
end = "*** END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***"
text = text[text.find(start)+len(start):text.rfind(end)]
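Note that `str.find` returns `-1` when a marker is missing, in which case the slice above silently yields the wrong span. A minimal defensive sketch follows; the helper name `strip_gutenberg_boilerplate` and the toy string are illustrative assumptions, not part of the assignment:

```python
def strip_gutenberg_boilerplate(raw, start_marker, end_marker):
    """Return the text between the two markers, or raise if a marker is missing."""
    start_idx = raw.find(start_marker)
    end_idx = raw.rfind(end_marker)
    if start_idx == -1 or end_idx == -1:
        raise ValueError("Project Gutenberg start/end marker not found")
    if end_idx <= start_idx:
        raise ValueError("end marker appears before start marker")
    # Skip past the start marker itself; stop right before the end marker
    return raw[start_idx + len(start_marker):end_idx]


# Toy example (hypothetical markers, not the real Gutenberg header)
sample = "header *** START *** body text *** END *** license"
body = strip_gutenberg_boilerplate(sample, "*** START ***", "*** END ***")
print(repr(body))
```

With the real file, the call would take `text`, `start`, and `end` from the cell above.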

In [3]:
split = int(0.8 * len(text))
train_text = text[:split]
val_text = text[split:]
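Because the split index is computed in characters, the cut can land in the middle of a word (the validation preview further down starts with `ooded glen`). One possible fix, sketched here with a hypothetical helper `split_on_whitespace`, is to advance the cut to the next whitespace character:

```python
def split_on_whitespace(text, train_fraction=0.8):
    """Split text into two parts without cutting a word in half."""
    cut = int(train_fraction * len(text))
    # Move the cut forward until it sits on a whitespace character
    while cut < len(text) and not text[cut].isspace():
        cut += 1
    return text[:cut], text[cut:]


# Toy example; the 0.2 fraction is only chosen to force a mid-word cut
demo = "the wooded glen known by the name"
train_part, val_part = split_on_whitespace(demo, train_fraction=0.2)
print(repr(train_part), repr(val_part))
```

The two parts still concatenate back to the original string, so no text is lost.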

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [40]:
# lowercase
train_text = train_text.lower()
val_text = val_text.lower()

In [10]:
# remove punctuation (keep word characters, whitespace, and . ? !)
def clean_text(raw):
    cleaned = re.sub(r'[^\w\s.?!]', '', raw)
    cleaned = re.sub(r'\s+', ' ', cleaned)  # collapses newlines and repeated spaces
    return cleaned.strip()

# apply the same cleaning steps to both splits
clean_train_text = clean_text(train_text)
clean_val_text = clean_text(val_text)
print(clean_train_text[:100])
print(clean_val_text[:100])

the legend of sleepy hollow by washington irving found among the papers of the late diedrich knicker
ooded glen known by the name of wileys swamp. a few rough logs laid side by side served for a bridge


In [6]:
import nltk
import spacy
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

# download and install the spacy language model
!python3 -m spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 62.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [39]:
# tokenize
train_tokens = word_tokenize(clean_train_text)
val_tokens = word_tokenize(clean_val_text)

In [29]:
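# build the vocabulary from the training tokens only
# (sorted so the word -> id mapping is reproducible across runs)
train_words = sorted(set(train_tokens))
vocab = {word: idx for idx, word in enumerate(train_words)}
train_id_to_word = np.array(train_words)  # index = id, value = word

# unique validation words (positions here do NOT correspond to vocab ids)
val_words = set(val_tokens)
val_id_to_word = np.array(list(val_words))

train_ids = [vocab[token] for token in train_tokens]  # list of ids
# validation tokens that are missing from the training vocabulary are dropped
val_ids = [vocab[token] for token in val_tokens if token in vocab]

# sliding windows: window_size consecutive ids -> the id that follows them
window_size = 10
X = []
Y = []
for i in range(len(train_ids) - window_size):
    X.append(train_ids[i:i + window_size])
    Y.append(train_ids[i + window_size])

X_train = np.array(X)
Y_train = np.array(Y)

x, y = [], []
for i in range(len(val_ids) - window_size):
    x.append(val_ids[i:i + window_size])
    y.append(val_ids[i + window_size])

X_val = np.array(x)
Y_val = np.array(y)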
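The cell above drops validation tokens that are not in the training vocabulary, which shifts the validation windows. A common alternative is to reserve explicit ids for padding and unknown words. The sketch below is one way to do that; the helper names and the `<pad>`/`<unk>` tokens are assumptions introduced for illustration, not part of the assignment:

```python
def build_vocab(tokens, specials=("<pad>", "<unk>")):
    """Map special tokens to the first ids, then every unique token in sorted order."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in sorted(set(tokens)):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each token by its id; unseen tokens map to the <unk> id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in tokens]


def make_windows(ids, window_size):
    """Return (inputs, targets): each input is window_size ids, the target is the next id."""
    n = len(ids) - window_size
    inputs = [ids[i:i + window_size] for i in range(n)]
    targets = [ids[i + window_size] for i in range(n)]
    return inputs, targets


# Toy example: "ghost" is not in the training tokens, so it becomes <unk>
demo_vocab = build_vocab("the horseman rode past the bridge".split())
demo_ids = encode("the ghost rode past the bridge".split(), demo_vocab)
demo_X, demo_y = make_windows(demo_ids, window_size=3)
print(demo_ids, demo_X, demo_y)
```

With this layout, id 0 is safe to use for padding short seeds, and the vocabulary size passed to the model has to include the two special tokens.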

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
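Conceptually, an embedding layer behaves like a trainable lookup table: each integer ID selects one row of a `(vocab_size, output_dim)` matrix. The NumPy sketch below illustrates only that lookup and the resulting shapes; the toy sizes and the name `embedding_matrix` are assumptions, and this is not how Keras implements the layer internally:

```python
import numpy as np

rng = np.random.default_rng(0)

demo_vocab_size, demo_dim = 5, 4  # toy sizes for illustration
embedding_matrix = rng.normal(size=(demo_vocab_size, demo_dim))

# A batch of 2 sequences, each with 3 word IDs
batch_ids = np.array([[0, 2, 2],
                      [4, 1, 0]])

# Integer indexing picks one row per ID: (batch, sequence) -> (batch, sequence, dim)
embedded = embedding_matrix[batch_ids]
print(embedded.shape)
```

Identical IDs always map to identical vectors, which is why the layer can share what it learns about a word across every position where that word occurs.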

In [32]:
vocab_size = len(vocab)
sequence_length = window_size

In [33]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length  # must match the window size used to build X_train
)

## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)`), or your custom 1D CNN in its place.
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.
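To get a feel for the training cost, it can help to estimate the parameter count before building the model. The sketch below assumes a standard LSTM with one bias vector per gate, an embedding dimension of 128, and 256 units; the vocabulary size of 5000 is a made-up placeholder, since the real value depends on the book:

```python
def count_parameters(vocab_size, embed_dim=128, lstm_units=256):
    """Estimate trainable parameters for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    lstm = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
    # Dense output layer: one weight per (unit, word) pair plus one bias per word
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}


# Hypothetical vocabulary size, for illustration only
print(count_parameters(vocab_size=5000))
```

The embedding and output layers both grow linearly with the vocabulary, so for word-level models they usually dominate the total.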


In [34]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
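# Stop early if val_loss has not improved for 10 epochs
# (with epochs=5 in the fit call below, this patience is never reached)
es = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
# Save the model whenever validation accuracy improves
checkpoint = ModelCheckpoint('best_weights.keras',
                             save_best_only=True,
                             monitor='val_accuracy',
                             mode='max',
                             verbose=1)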
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
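As a quick sanity check on the scale of these numbers, the relation can be evaluated directly. The sketch below assumes the loss is measured in nats (natural logarithm), which is what Keras cross-entropy losses report; the function names are illustrative:

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity is e raised to the average cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


def max_loss_for_perplexity(target_perplexity):
    """Largest cross-entropy that still keeps perplexity at or below the target."""
    return math.log(target_perplexity)


# 5.6534 is the epoch 2 val_loss from the training output further down
print(perplexity(5.6534))
print(max_loss_for_perplexity(50))
```

A validation loss of 5.6534 corresponds to a perplexity of roughly 285, while staying under 50 requires the loss to drop below about 3.91.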

In [35]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
network_history = model.fit(X_train, Y_train,
                            validation_data=(X_val,Y_val),
                            batch_size=128,
                            epochs=5,
                            verbose=1,
                            callbacks=[es, checkpoint])

val_loss, val_acc = model.evaluate(X_val, Y_val)

print("Val Perplexity: ", np.exp(val_loss))

Epoch 1/5
78/79 ━━━━━━━━━━━━━━━━━━━━ 0s 114ms/step - accuracy: 0.0554 - loss: 7.2909
Epoch 1: val_accuracy improved from -inf to 0.10097, saving model to best_weights.keras
79/79 ━━━━━━━━━━━━━━━━━━━━ 13s 127ms/step - accuracy: 0.0556 - loss: 7.2802 - val_accuracy: 0.1010 - val_loss: 5.7202
Epoch 2/5
78/79 ━━━━━━━━━━━━━━━━━━━━ 0s 100ms/step - accuracy: 0.0736 - loss: 6.2128
Epoch 2: val_accuracy improved from 0.10097 to 0.11449, saving model to best_weights.keras
79/79 ━━━━━━━━━━━━━━━━━━━━ 10s 118ms/step - accuracy: 0.0737 - loss: 6.2137 - val_accuracy: 0.1145 - val_loss: 5.6534
Epoch 3/5
78/79 ━━━━━━━━━━━━━━━━━━━━ 0s 101ms/step - accuracy: 0.0886 - loss: 6.0974
Epoch 3: val_accuracy improved from 0.11449 to 0.11787, saving model to best_weights.keras
79/79 ━━━━━━━━━━━━━━━━━━━━ 10s 110m

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).
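Sampling from the predicted distribution, rather than always taking the most likely word, is one way to keep the two samples distinct. A common extension, which the assignment does not require, is to rescale the distribution with a temperature before sampling. The sketch below uses hypothetical helper names and a toy probability vector in place of real model output:

```python
import numpy as np


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: below 1 sharpens it, above 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()


def sample_id(probs, temperature=1.0, rng=None):
    """Draw one index from the temperature-scaled distribution."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = apply_temperature(probs, temperature)
    return int(rng.choice(len(scaled), p=scaled))


# Toy distribution over three words
toy_probs = [0.7, 0.2, 0.1]
print(apply_temperature(toy_probs, temperature=0.5))
print(sample_id(toy_probs, temperature=1.0, rng=np.random.default_rng(0)))
```

A temperature of 1.0 leaves the distribution unchanged, so it matches plain sampling from the softmax output.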

In [38]:
import numpy as np
# Load the best weights
model.load_weights('best_weights.keras')

def generate_text(seed_text, next_words=50):
    # Accept a string or a list of tokens; copy lists so the caller's data is not mutated
    if isinstance(seed_text, str):
        tokens = seed_text.lower().split()
    else:
        tokens = list(seed_text)

    # Reverse lookup (id -> word), built once instead of scanning vocab at every step
    id_to_word = {idx: word for word, idx in vocab.items()}

    for _ in range(next_words):
        token_list = tokens[-sequence_length:]
        # Unknown words fall back to id 0, which is an ordinary vocabulary word here
        token_ids = [vocab.get(token, 0) for token in token_list]

        # Left-pad with id 0 if the seed is shorter than the input window
        if len(token_ids) < sequence_length:
            token_ids = [0] * (sequence_length - len(token_ids)) + token_ids

        token_ids = np.array([token_ids])
        predicted_probs = model.predict(token_ids, verbose=0)[0]

        # Sample instead of taking the argmax
        predicted_id = np.random.choice(len(predicted_probs), p=predicted_probs)

        # Append the word for the predicted id
        tokens.append(id_to_word.get(int(predicted_id), ""))

    return ' '.join(tokens)

# Generate two text samples with different seed phrases
seed_phrase1 = "love is"
generated_text1 = generate_text(seed_phrase1)
print(f"Generated Text 1:\n{generated_text1}")

seed_phrase2 = "time will"
generated_text2 = generate_text(seed_phrase2)
print(f"\nGenerated Text 2:\n{generated_text2}")


Generated Text 1:
love is union rapid palings brief cricket raves of places his melody at cover on to importance of the hush that taken tied was the descended of the lasses forms the procure with the told of given their occasional jolly of sleepy reasoners as the pensive erudition he his direful dismay eloped

Generated Text 2:
time will run unimaginable armed arising knightserrant vocal hessian a close stubble retreats of shrub scarlet every course and beaming into then made been esteemed of winced half a time who the expanded of the overturned which lonely the cheerily and him was i belly of a houten by the pipe the
