# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor-led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the downloaded text into a single text source.  
- Split into training (80%) and validation (20%).  

In [1]:
# Download Little Women from Gutenberg: https://www.gutenberg.org/cache/epub/37106/pg37106.txt
import requests

url = "https://www.gutenberg.org/cache/epub/37106/pg37106.txt"
response = requests.get(url, timeout=30)

# Check if the request was successful and save
if response.status_code == 200:
    with open("book.txt", "w", encoding="utf-8") as file:
        file.write(response.text)
else:
    print(f"Failed to download (HTTP status {response.status_code})")

# Combine into single text source
with open("book.txt", "r", encoding="utf-8") as file:
    text = file.read()

print(text[:100])

The Project Gutenberg eBook of Little Women; Or, Meg, Jo, Beth, and Amy
    
This ebook is for the 


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words.  
- Build a vocabulary (map each unique word to an integer ID).
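
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's own method: the helper names `tokenize` and `build_vocabulary` are hypothetical, and the choice to emit `.`, `!` and `?` as separate tokens (instead of leaving them attached to words) is an assumption.

```python
import re

def tokenize(raw_text):
    """Lowercase, drop punctuation except . ! ?, and split into tokens."""
    lowered = raw_text.lower()
    # Remove everything that is not a word character, whitespace, or . ! ?
    # (underscores count as word characters, so they are removed explicitly).
    cleaned = re.sub(r"[^\w\s.!?]|_", "", lowered)
    # Words are runs of word characters; each delimiter becomes its own token.
    return re.findall(r"\w+|[.!?]", cleaned)

def build_vocabulary(tokens):
    """Map each unique token to an integer ID (sorted so IDs are reproducible)."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}

sample_tokens = tokenize("Little Women. By Louisa M. Alcott!")
sample_vocab = build_vocabulary(sample_tokens)
sample_ids = [sample_vocab[token] for token in sample_tokens]
print(sample_tokens)
print(sample_ids)
```

Separating the delimiters keeps `women` and `women.` from becoming two different vocabulary entries, which tends to shrink the vocabulary.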

In [2]:
# Cut out header and footer
start = "*** START OF THE PROJECT GUTENBERG EBOOK"
end = "*** END OF THE PROJECT GUTENBERG EBOOK"
header = text.find(start)
footer = text.find(end)
# The slice begins right after the start marker, so the rest of the marker line (the title) is kept
text = text[header + len(start):footer].strip()

# Convert text to lowercase
text = text.lower()

# Remove punctuation except basic sentence delimiters
import string

allowed_punctuation = {'.', '!', '?'}
clean_text = ''.join([
    char for char in text
    if char not in string.punctuation or char in allowed_punctuation
])

print(clean_text[:100])

little women or meg jo beth and amy 




                      illustration little women
           


In [18]:
import numpy as np

# Tokenize
tokens = clean_text.split()
print(tokens[:100])

# Build a vocabulary (map each unique word to an integer ID)
# Note: set iteration order can differ between Python sessions, so the IDs are not stable across runs
words = set(tokens)
vocabulary = {word: idx for idx, word in enumerate(words)}
id2word = np.array(list(words))

# Create numerical representation of tokens in the book using IDs
text_as_ids = [vocabulary[token] for token in tokens]
print(text_as_ids[:100])

print(f"Number of unique words: {len(vocabulary)}")

['little', 'women', 'or', 'meg', 'jo', 'beth', 'and', 'amy', 'illustration', 'little', 'women', 'meg', 'jo', 'beth', 'and', 'amy', 'louisa', 'm.', 'alcott', 'little', 'women.', 'illustration', 'they', 'all', 'drew', 'to', 'the', 'fire', 'mother', 'in', 'the', 'big', 'chair', 'with', 'beth', 'at', 'her', 'feet', 'see', 'page', '9', 'frontispiece', 'little', 'women', 'or', 'meg', 'jo', 'beth', 'and', 'amy', 'by', 'louisa', 'm.', 'alcott', 'author', 'of', 'little', 'men', 'an', 'oldfashioned', 'girl', 'spinningwheel', 'stories', 'etc.', 'with', 'more', 'than', '200', 'illustrations', 'by', 'frank', 't.', 'merrill', 'and', 'a', 'picture', 'of', 'the', 'home', 'of', 'the', 'little', 'women', 'by', 'edmund', 'h.', 'garrett', 'boston', 'little', 'brown', 'and', 'company', 'entered', 'according', 'to', 'act', 'of', 'congress', 'in', 'the']
[11354, 4276, 2964, 9395, 9673, 5996, 5703, 10024, 12515, 11354, 4276, 9395, 9673, 5996, 5703, 10024, 8292, 918, 14267, 11354, 10387, 12515, 4437, 11133, 10

In [4]:
# Generate word sequences for training
sequence_length = 10
input_strings = []
target_output_word = []

# We want to try to generate the next word in the sequence
for i in range(len(text_as_ids) - sequence_length):
    input_strings.append(text_as_ids[i:i+sequence_length])
    target_output_word.append(text_as_ids[i+sequence_length])

# Convert to numpy for tf training
input_strings = np.array(input_strings)
target_output_word = np.array(target_output_word)

print(input_strings[:10])
print(target_output_word[:10])

[[11354  4276  2964  9395  9673  5996  5703 10024 12515 11354]
 [ 4276  2964  9395  9673  5996  5703 10024 12515 11354  4276]
 [ 2964  9395  9673  5996  5703 10024 12515 11354  4276  9395]
 [ 9395  9673  5996  5703 10024 12515 11354  4276  9395  9673]
 [ 9673  5996  5703 10024 12515 11354  4276  9395  9673  5996]
 [ 5996  5703 10024 12515 11354  4276  9395  9673  5996  5703]
 [ 5703 10024 12515 11354  4276  9395  9673  5996  5703 10024]
 [10024 12515 11354  4276  9395  9673  5996  5703 10024  8292]
 [12515 11354  4276  9395  9673  5996  5703 10024  8292   918]
 [11354  4276  9395  9673  5996  5703 10024  8292   918 14267]]
[ 4276  9395  9673  5996  5703 10024  8292   918 14267 11354]


In [5]:
from sklearn.model_selection import train_test_split

# Now we can split into training and validation
X_train, X_val, y_train, y_val = train_test_split(input_strings, target_output_word, test_size=0.2, random_state=42)
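
One caveat with a random split of overlapping windows: neighbouring windows share most of their words, so validation samples can closely resemble training samples. A minimal sketch of an alternative, assuming a plain list of token IDs, splits the IDs in book order first and builds windows for each part separately. The helper names `make_windows` and `split_then_window` are hypothetical, and `toy_ids` stands in for `text_as_ids`.

```python
def make_windows(ids, sequence_length):
    """Build (input window, next-token target) pairs from a list of token IDs."""
    count = len(ids) - sequence_length
    inputs = [ids[i:i + sequence_length] for i in range(count)]
    targets = [ids[i + sequence_length] for i in range(count)]
    return inputs, targets

def split_then_window(ids, sequence_length, val_fraction=0.2):
    """Hold out the last val_fraction of the tokens, then window each part."""
    split_at = len(ids) - int(len(ids) * val_fraction)
    train_inputs, train_targets = make_windows(ids[:split_at], sequence_length)
    val_inputs, val_targets = make_windows(ids[split_at:], sequence_length)
    return train_inputs, val_inputs, train_targets, val_targets

# Toy stand-in for text_as_ids: 20 token IDs, windows of length 3.
toy_ids = list(range(20))
X_tr, X_va, y_tr, y_va = split_then_window(toy_ids, sequence_length=3)
print(len(X_tr), len(X_va))
```

With this layout no validation window reuses tokens seen in training, at the cost of validating only on the end of the book.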

In [6]:
import tensorflow as tf

# Create tensorflow datasets to use for training:
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset; https://www.tensorflow.org/text/tutorials/text_generation
BATCH_SIZE = 64
BUFFER_SIZE = 10000

# train
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# validation
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
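
Conceptually, an embedding layer is a trainable lookup table with one row per word ID. The NumPy sketch below illustrates only the lookup step, not Keras internals or training. The toy sizes and the names `embedding_table` and `word_ids` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, embedding_dim = 6, 4  # hypothetical toy sizes

# One row of embedding_dim values per word ID (random here, learned in practice).
embedding_table = rng.normal(size=(toy_vocab_size, embedding_dim))

# A batch of 2 sequences, each 3 word IDs long.
word_ids = np.array([[0, 2, 5],
                     [3, 3, 1]])

# Indexing the table with the ID array performs the lookup.
embedded = embedding_table[word_ids]
print(embedded.shape)  # (2, 3, 4): batch, sequence length, embedding dimension
```

The output gains one trailing axis of size `embedding_dim`, which is the shape a recurrent or 1D convolutional layer expects as input.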

## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)`) or your custom 1D CNN.
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.
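
To get a feel for how the training-time budget relates to model size, the parameter count of an Embedding, LSTM and Dense stack can be worked out by hand. The sketch below assumes a standard LSTM with biases and a 128-dimensional embedding. The helper names and the example vocabulary size of 15,000 are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim

def total_params(vocab_size, embedding_dim=128, units=256):
    return (embedding_params(vocab_size, embedding_dim)
            + lstm_params(embedding_dim, units)
            + dense_params(units, vocab_size))

example_vocab_size = 15000  # hypothetical; use len(vocabulary) in practice
print(lstm_params(128, 256))             # 394240
print(total_params(example_vocab_size))  # 6169240
```

Under these assumptions the embedding and softmax layers dominate the total because both scale with the vocabulary size, so trimming rare words is one way to shorten training.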


In [7]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = len(vocabulary)

# Model Definition
model = Sequential()
model.add(Embedding(
    input_dim=vocab_size,
    output_dim=128,
    input_length=sequence_length))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))



## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H` (in nats, as Keras reports it), then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
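
As a quick check of the `perplexity = e^H` relation, the sketch below applies it to the validation losses logged during the training run in this notebook. The function name `perplexity` and the list name `logged_val_losses` are chosen for the example.

```python
import math

def perplexity(cross_entropy_nats):
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)

# val_loss values copied from the training log (epochs 1 to 4).
logged_val_losses = [6.1563, 6.0744, 6.1911, 6.3621]

for epoch, loss in enumerate(logged_val_losses, start=1):
    print(f"epoch {epoch}: val_loss={loss:.4f} -> perplexity={perplexity(loss):.1f}")
```

The results agree with the logged `val_perplexity` values to within rounding. Reaching a perplexity of 50 would require a validation loss of about `ln(50) ≈ 3.91`.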

In [8]:
# Perplexity has an issue with the shape of the labels
# We need to reshape the labels to be (batch_size, 1)
def custom_reshape(x, y):
    return x, tf.expand_dims(y, -1)  # Make y shape (batch_size, 1)

# https://www.tensorflow.org/api_docs/python/tf/data/Dataset
train_dataset = train_dataset.map(custom_reshape)
val_dataset = val_dataset.map(custom_reshape)

In [14]:
# Use perplexity as a metric as well: https://keras.io/keras_hub/api/metrics/perplexity/
import keras_hub

perplexity = keras_hub.metrics.Perplexity(from_logits=False)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['sparse_categorical_accuracy', perplexity])
model.summary()

In [15]:
# Train the model
EPOCHS = 5
# Note: patience=10 is larger than EPOCHS, so early stopping cannot end this run early
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=EPOCHS,
    verbose=1,
    callbacks=[early_stopping]
)

Epoch 1/5
2406/2406 ━━━━━━━━━━━━━━━━━━━━ 469s 194ms/step - loss: 6.4164 - perplexity: 612.4304 - sparse_categorical_accuracy: 0.0768 - val_loss: 6.1563 - val_perplexity: 471.6947 - val_sparse_categorical_accuracy: 0.1112
Epoch 2/5
2406/2406 ━━━━━━━━━━━━━━━━━━━━ 480s 185ms/step - loss: 5.6962 - perplexity: 298.0473 - sparse_categorical_accuracy: 0.1183 - val_loss: 6.0744 - val_perplexity: 434.5918 - val_sparse_categorical_accuracy: 0.1225
Epoch 3/5
2406/2406 ━━━━━━━━━━━━━━━━━━━━ 507s 187ms/step - loss: 5.2035 - perplexity: 182.1118 - sparse_categorical_accuracy: 0.1400 - val_loss: 6.1911 - val_perplexity: 488.3973 - val_sparse_categorical_accuracy: 0.1279
Epoch 4/5
2406/2406 ━━━━━━━━━━━━━━━━━━━━ 497s 185ms/step - loss: 4.7338 - perplexity: 113.9676 - sparse_categorical_accuracy: 0.1616 - val_loss: 6.3621 - val_perplexity: 579.4756 - val_sparse_categoric

<keras.src.callbacks.history.History at 0x7c24c83ead90>

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [21]:
# Generate text with different starting seeds
def generate(seed, num_words=50):
    """Greedily extend a seed phrase by num_words predicted words."""
    # The seed must contain at least sequence_length words, all present in the vocabulary
    ids = [vocabulary[word] for word in seed.split()]
    for _ in range(num_words):
        window = np.array(ids[-sequence_length:]).reshape(1, sequence_length)
        ids.append(model.predict(window).argmax())  # always pick the most likely next word
    return [id2word[i] for i in ids]

generated1 = generate("i never shall stop loving you but the love is")
generated2 = generate("be fonder and prouder than ever of my little women")

print("Generated text sample 1: ", ' '.join(map(str, generated1)))
print("Generated text sample 2: ", ' '.join(map(str, generated2)))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 105ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 57ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 57ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 91ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 58ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4

Note: The generated text is largely senseless: it gets stuck repeating the same phrase, a common outcome when the most likely next word is always chosen (`argmax`). Training for more epochs might improve the model, but that would take far longer than the time available.
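
One common way to reduce this kind of repetition is to sample the next word from the predicted distribution instead of always taking the `argmax`. The sketch below is a swapped-in technique (temperature sampling), not what the notebook does. The function name `sample_with_temperature` and the toy probabilities are assumptions made for the example.

```python
import math
import random

def sample_with_temperature(probabilities, temperature=1.0, rng=random):
    """Sample an index after rescaling a probability distribution by temperature."""
    # Work in log space; the tiny floor avoids log(0).
    scaled = [math.log(max(p, 1e-12)) / temperature for p in probabilities]
    highest = max(scaled)
    # Subtracting the maximum keeps exp() numerically stable.
    weights = [math.exp(value - highest) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

toy_probabilities = [0.6, 0.3, 0.1]  # stand-in for one row of softmax output
rng = random.Random(0)
draws = [sample_with_temperature(toy_probabilities, temperature=0.8, rng=rng)
         for _ in range(10)]
print(draws)
```

In the generation loop, the softmax row returned by `model.predict(...)` (assumed here to have shape `(1, vocab_size)`) would take the place of `toy_probabilities`. Temperatures below 1 stay close to greedy decoding, while higher values produce more varied but less coherent text.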

## 7. Submission
- A Jupyter Notebook (or script) showing:
  - **Data loading** and **preprocessing**.
  - **Model definition** and **training process**.
  - **Validation perplexity** calculation.
  - **Two generated text samples** (each >50 tokens).
- Ensure your notebook/script **runs end-to-end without errors**.
