# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purpuses.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine all sonnets into a single text source.  
- Split into training (80%) and validation (20%).  

In [8]:
from collections import Counter
import tensorflow as tf
import re
import numpy as np

# download
!wget https://www.gutenberg.org/cache/epub/41/pg41.txt



--2025-04-23 12:11:51--  https://www.gutenberg.org/cache/epub/41/pg41.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 90938 (89K) [text/plain]
Saving to: ‘pg41.txt.1’


2025-04-23 12:11:51 (1006 KB/s) - ‘pg41.txt.1’ saved [90938/90938]



In [24]:
with open('pg41.txt', 'r', encoding='utf-8') as f:
    text = f.read()

start = "*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***"
end = "*** END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***"
text = text[text.find(start)+len(start):text.rfind(end)]

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [25]:
# lowercase
text = text.lower()

In [29]:
# remove punctuation (keep words, whitespace, .?!)
clean_text = re.sub(r'[^\w\s.?!]', '', text)
clean_text = re.sub(r'\n', ' ', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text)
clean_text = clean_text.strip()
clean_text

'the legend of sleepy hollow by washington irving found among the papers of the late diedrich knickerbocker. a pleasing land of drowsy head it was of dreams that wave before the halfshut eye and of gay castles in the clouds that pass forever flushing round a summer sky. castle of indolence. in the bosom of one of those spacious coves which indent the eastern shore of the hudson at that broad expansion of the river denominated by the ancient dutch navigators the tappan zee and where they always prudently shortened sail and implored the protection of st. nicholas when they crossed there lies a small market town or rural port which by some is called greensburgh but which is more generally and properly known by the name of tarry town. this name was given we are told in former days by the good housewives of the adjacent country from the inveterate propensity of their husbands to linger about the village tavern on market days. be that as it may i do not vouch for the fact but merely advert t

In [33]:
import nltk
import spacy
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

# download and install the spacy language model
!python3 -m spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [34]:
# tokenize
tokens = word_tokenize(clean_text)
print(tokens)

['the', 'legend', 'of', 'sleepy', 'hollow', 'by', 'washington', 'irving', 'found', 'among', 'the', 'papers', 'of', 'the', 'late', 'diedrich', 'knickerbocker', '.', 'a', 'pleasing', 'land', 'of', 'drowsy', 'head', 'it', 'was', 'of', 'dreams', 'that', 'wave', 'before', 'the', 'halfshut', 'eye', 'and', 'of', 'gay', 'castles', 'in', 'the', 'clouds', 'that', 'pass', 'forever', 'flushing', 'round', 'a', 'summer', 'sky', '.', 'castle', 'of', 'indolence', '.', 'in', 'the', 'bosom', 'of', 'one', 'of', 'those', 'spacious', 'coves', 'which', 'indent', 'the', 'eastern', 'shore', 'of', 'the', 'hudson', 'at', 'that', 'broad', 'expansion', 'of', 'the', 'river', 'denominated', 'by', 'the', 'ancient', 'dutch', 'navigators', 'the', 'tappan', 'zee', 'and', 'where', 'they', 'always', 'prudently', 'shortened', 'sail', 'and', 'implored', 'the', 'protection', 'of', 'st.', 'nicholas', 'when', 'they', 'crossed', 'there', 'lies', 'a', 'small', 'market', 'town', 'or', 'rural', 'port', 'which', 'by', 'some', 'is'

In [38]:
from collections import defaultdict

word_to_id = defaultdict(lambda: len(word_to_id))
id_to_word = []

for token in tokens:
    word_to_id[token]

for token in tokens:
    if token not in id_to_word:
        id_to_word.append(token)

print(word_to_id)

defaultdict(<function <lambda> at 0x7fb0bd9a1da0>, {'the': 0, 'legend': 1, 'of': 2, 'sleepy': 3, 'hollow': 4, 'by': 5, 'washington': 6, 'irving': 7, 'found': 8, 'among': 9, 'papers': 10, 'late': 11, 'diedrich': 12, 'knickerbocker': 13, '.': 14, 'a': 15, 'pleasing': 16, 'land': 17, 'drowsy': 18, 'head': 19, 'it': 20, 'was': 21, 'dreams': 22, 'that': 23, 'wave': 24, 'before': 25, 'halfshut': 26, 'eye': 27, 'and': 28, 'gay': 29, 'castles': 30, 'in': 31, 'clouds': 32, 'pass': 33, 'forever': 34, 'flushing': 35, 'round': 36, 'summer': 37, 'sky': 38, 'castle': 39, 'indolence': 40, 'bosom': 41, 'one': 42, 'those': 43, 'spacious': 44, 'coves': 45, 'which': 46, 'indent': 47, 'eastern': 48, 'shore': 49, 'hudson': 50, 'at': 51, 'broad': 52, 'expansion': 53, 'river': 54, 'denominated': 55, 'ancient': 56, 'dutch': 57, 'navigators': 58, 'tappan': 59, 'zee': 60, 'where': 61, 'they': 62, 'always': 63, 'prudently': 64, 'shortened': 65, 'sail': 66, 'implored': 67, 'protection': 68, 'st.': 69, 'nicholas':

In [37]:
# split training/validation
train_size = int(0.8 * len(clean_text))
train_data = clean_text[:train_size]
val_data = clean_text[train_size:]

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.

In [43]:
vocab_size=len(word_to_id)
sequence_length=5

In [44]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)



## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [51]:
X_train = []
Y_train = []
X_valid = []
Y_valid = []

for i in range(0, len(train_data) - sequence_length, 1):
    seq_in = train_data[i:i + sequence_length]
    seq_out = train_data[i + sequence_length]
    X_train.append([word_to_id[char] for char in seq_in])
    Y_train.append(word_to_id[seq_out])

for i in range(0, len(val_data) - sequence_length, 1):
    seq_in = val_data[i:i + sequence_length]
    seq_out = val_data[i + sequence_length]
    X_valid.append([word_to_id[char] for char in seq_in])
    Y_valid.append(word_to_id[seq_out])

X_train = np.array(X_train)
Y_train = np.array(Y_train)
X_valid = np.array(X_valid)
Y_valid = np.array(Y_valid)

NameError: name 'val_tokens' is not defined

In [50]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
checkpoint = ModelCheckpoint('best_weights.keras',
                                      save_best_only=True,
                                      monitor='val_accuracy',
                                      mode='max',
                                      verbose=1)
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
network_history = model.fit(X_train, Y_train,
                            validation_data=(X_valid,Y_valid),
                            batch_size=128,
                            epochs=5,
                            verbose=1,
                            callbacks=[es, checkpoint])

Epoch 1/5


InvalidArgumentError: Graph execution error:

Detected at node sequential_1/embedding_1_1/GatherV2 defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py", line 37, in <module>

  File "/usr/local/lib/python3.11/dist-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelapp.py", line 712, in start

  File "/usr/local/lib/python3.11/dist-packages/tornado/platform/asyncio.py", line 205, in start

  File "/usr/lib/python3.11/asyncio/base_events.py", line 608, in run_forever

  File "/usr/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once

  File "/usr/lib/python3.11/asyncio/events.py", line 84, in _run

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 499, in process_one

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py", line 730, in execute_request

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/ipkernel.py", line 383, in do_execute

  File "/usr/local/lib/python3.11/dist-packages/ipykernel/zmqshell.py", line 528, in run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes

  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code

  File "<ipython-input-50-72998f060be7>", line 19, in <cell line: 0>

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 371, in fit

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 219, in function

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 132, in multi_step_on_iterator

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 113, in one_step_on_data

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 57, in train_step

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/layers/layer.py", line 908, in __call__

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/models/sequential.py", line 213, in call

  File "/usr/local/lib/python3.11/dist-packages/keras/src/models/functional.py", line 182, in call

  File "/usr/local/lib/python3.11/dist-packages/keras/src/ops/function.py", line 171, in _run_through_graph

  File "/usr/local/lib/python3.11/dist-packages/keras/src/models/functional.py", line 637, in call

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/layers/layer.py", line 908, in __call__

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/ops/operation.py", line 46, in __call__

  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/embedding.py", line 140, in call

  File "/usr/local/lib/python3.11/dist-packages/keras/src/ops/numpy.py", line 5346, in take

  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/numpy.py", line 2093, in take

indices[112,0] = 3277 is not in [0, 3231)
	 [[{{node sequential_1/embedding_1_1/GatherV2}}]] [Op:__inference_multi_step_on_iterator_2701]

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).