# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [7]:


with open("book.txt", "r", encoding="utf-8") as f:
    text = f.read()

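# Remove Gutenberg header/footer; fall back to the full text if a marker is missing
start = text.find("*** START OF")
end = text.find("*** END OF")
# Begin after the "*** START OF ..." marker line so the marker itself is dropped
start = text.find("\n", start) + 1 if start != -1 else 0
end = end if end != -1 else len(text)
text = text[start:end]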

# Split into training (80%) and validation (20%)
split_index = int(len(text) * 0.8)
train_text = text[:split_index]
val_text = text[split_index:]

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [8]:
import re
import string

def preprocess(text):
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation.replace('.', '').replace('!', '').replace('?', ''))}]", "", text)
    return text.split()

train_words = preprocess(train_text)
val_words = preprocess(val_text)

# Vocabulary
vocab = sorted(set(train_words + val_words))
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

# Encode to IDs
train_ids = [word_to_id[word] for word in train_words]
val_ids = [word_to_id[word] for word in val_words]


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
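
To make the first bullet concrete, the sketch below mimics what an embedding lookup does, using plain Python lists instead of Keras. The names `toy_vocab_size`, `toy_dim`, `embedding_table`, and `embed` are hypothetical, and the table values are random; a real `Embedding` layer learns its table during training.

```python
import random

# Hypothetical toy sizes, chosen only for illustration.
toy_vocab_size = 5   # number of distinct word IDs
toy_dim = 3          # length of each embedding vector

random.seed(0)
# One row of `toy_dim` floats per word ID; Keras would learn these values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_dim)]
    for _ in range(toy_vocab_size)
]

def embed(word_ids):
    """Replace each integer ID with its row of the embedding table."""
    return [embedding_table[i] for i in word_ids]

sequence = [2, 0, 4, 2]          # an integer-encoded sequence of 4 tokens
vectors = embed(sequence)

print(len(vectors), len(vectors[0]))   # 4 3
print(vectors[0] == vectors[3])        # True: same ID, same vector
```

The output has one vector per input token, which is the shape a recurrent or convolutional layer expects to consume step by step.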

In [9]:
from tensorflow.keras.layers import Embedding

vocab_size = len(vocab)
sequence_length = 20
embedding_dim = 128

# Sequence preparation: each window of `sequence_length` IDs predicts the ID that follows it
import numpy as np

def create_sequences(data):
    X, y = [], []
    for i in range(len(data) - sequence_length):
        X.append(data[i:i+sequence_length])
        y.append(data[i+sequence_length])
    return np.array(X), np.array(y)

X_train, y_train = create_sequences(train_ids)
X_val, y_val = create_sequences(val_ids)
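
As a quick illustration of what `create_sequences` produces, here is the same sliding-window logic on a tiny list, written without NumPy. The function name `sliding_windows` and the toy IDs are made up for this example.

```python
def sliding_windows(ids, window):
    """Pair each run of `window` IDs with the ID that follows it."""
    inputs, targets = [], []
    for i in range(len(ids) - window):
        inputs.append(ids[i:i + window])
        targets.append(ids[i + window])
    return inputs, targets

toy_ids = [10, 11, 12, 13, 14, 15]   # hypothetical encoded text
X_toy, y_toy = sliding_windows(toy_ids, window=3)

for x, y in zip(X_toy, y_toy):
    print(x, "->", y)
# [10, 11, 12] -> 13
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
```

A text of `N` tokens yields `N - window` input/target pairs, and consecutive windows overlap in all but one token.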


## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=sequence_length),
    LSTM(256),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If yours is higher (which is possible), try to explain why it does not decrease further.
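
A small worked example of the `perplexity = e^H` relation, assuming the loss is measured in nats (natural logarithm), as Keras cross-entropy losses are. The helper name `perplexity_from_loss` is hypothetical.

```python
import math

def perplexity_from_loss(cross_entropy):
    """Perplexity is e raised to the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy)

# A model that guesses uniformly over V words has cross-entropy ln(V),
# so its perplexity is V: it is "as confused as" a V-way coin toss.
V = 50
uniform_loss = math.log(V)
print(round(perplexity_from_loss(uniform_loss), 6))   # 50.0

# Conversely, a perplexity target of 50 corresponds to a loss of ln(50).
print(round(math.log(50), 3))                          # 3.912
```

Read this way, the target of 50 means the validation loss would need to fall to roughly 3.9.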

In [12]:
import math

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=5,
    batch_size=128,
    callbacks=[early_stop]
)

# Calculate perplexity
def perplexity(loss):
    return math.exp(loss)

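# Note: these are the last epoch's losses; with restore_best_weights=True,
# the weights kept by the model may come from an earlier epoch.
train_loss = history.history['loss'][-1]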
val_loss = history.history['val_loss'][-1]

print(f"\nTraining Perplexity: {perplexity(train_loss):.2f}")
print(f"Validation Perplexity: {perplexity(val_loss):.2f}")


Epoch 1/5
793/793 ━━━━━━━━━━━━━━━━━━━━ 237s 299ms/step - accuracy: 0.1561 - loss: 4.9545 - val_accuracy: 0.1333 - val_loss: 5.7728
Epoch 2/5
793/793 ━━━━━━━━━━━━━━━━━━━━ 261s 298ms/step - accuracy: 0.1740 - loss: 4.6563 - val_accuracy: 0.1358 - val_loss: 5.8014
Epoch 3/5
793/793 ━━━━━━━━━━━━━━━━━━━━ 262s 298ms/step - accuracy: 0.1932 - loss: 4.3662 - val_accuracy: 0.1350 - val_loss: 5.8638
Epoch 4/5
793/793 ━━━━━━━━━━━━━━━━━━━━ 237s 299ms/step - accuracy: 0.2102 - loss: 4.1101 - val_accuracy: 0.1345 - val_loss: 5.9431

Training Perplexity: 64.37
Validation Perplexity: 381.10
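
The perplexities printed above use the last epoch's losses. Because `EarlyStopping` was created with `restore_best_weights=True`, the weights kept after training are those of the epoch with the lowest `val_loss`, which need not be the last one. The sketch below compares the two, using the validation losses copied by hand from the log; the list name `logged_val_loss` is made up for this example.

```python
import math

# Validation losses copied from the training log above (one per epoch).
logged_val_loss = [5.7728, 5.8014, 5.8638, 5.9431]

# Index of the epoch with the lowest validation loss.
best_epoch = min(range(len(logged_val_loss)), key=logged_val_loss.__getitem__)
best_val_loss = logged_val_loss[best_epoch]
last_val_loss = logged_val_loss[-1]

print(f"Best epoch: {best_epoch + 1}")
print(f"Perplexity at best epoch: {math.exp(best_val_loss):.2f}")
print(f"Perplexity at last epoch: {math.exp(last_val_loss):.2f}")
```

In the notebook itself, the same idea would be `min(history.history['val_loss'])`, or calling `model.evaluate(X_val, y_val)` on the restored model.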


In my case, the validation perplexity (381.10) is far above the target of 50, while the training perplexity is 64.37. Validation loss also rose in every epoch while training loss kept falling, so the model is overfitting the training data and generalizing poorly to unseen text.
There are several possible reasons. The vocabulary is large because the literary text is rich and varied, which makes next-word prediction harder. The sequence length (20 tokens) may be too short to capture long-term dependencies in the language. The model capacity (a single LSTM layer) might be insufficient for the complexity of the dataset. Training may also need more epochs or a larger dataset to improve generalization.
To reduce the validation perplexity, I would next try limiting the vocabulary, increasing the model depth or sequence length, and training on a larger or simpler dataset.
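
One way to limit the vocabulary, sketched under the assumption that rare words are mapped to a single placeholder token. The helper names `build_limited_vocab` and `encode`, the `<unk>` token, and the toy sentences are all hypothetical and not part of the notebook above. The sketch builds the vocabulary from training words only, so validation words never influence it.

```python
from collections import Counter

def build_limited_vocab(words, max_size, unk_token="<unk>"):
    """Keep the `max_size - 1` most frequent words; reserve ID 0 for unknowns."""
    counts = Counter(words)
    kept = [w for w, _ in counts.most_common(max_size - 1)]
    vocab = [unk_token] + kept
    return {w: i for i, w in enumerate(vocab)}

def encode(words, word_to_id, unk_token="<unk>"):
    """Map words to IDs, sending anything outside the vocabulary to <unk>."""
    unk_id = word_to_id[unk_token]
    return [word_to_id.get(w, unk_id) for w in words]

toy_train = "the cat sat the cat sat the cat the mat".split()
toy_val = "the dog sat".split()

toy_word_to_id = build_limited_vocab(toy_train, max_size=4)
print(toy_word_to_id)                    # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(encode(toy_val, toy_word_to_id))   # [1, 0, 3]
```

A smaller vocabulary shrinks the softmax output layer and gives each remaining word more training examples, at the cost of the model sometimes predicting `<unk>`.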

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [14]:
# Step 6: Generate text
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, num_words=50):
    result = seed_text.lower().split()
    for _ in range(num_words):
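        # Unknown seed words fall back to ID 0, which is also the padding value
        input_seq = [word_to_id.get(w, 0) for w in result[-sequence_length:]]
        input_seq = pad_sequences([input_seq], maxlen=sequence_length)
        predicted_probs = model.predict(input_seq, verbose=0)[0]
        # Renormalize in float64 so np.random.choice accepts the probabilities
        predicted_probs = predicted_probs.astype("float64")
        predicted_probs /= predicted_probs.sum()
        next_id = np.random.choice(len(predicted_probs), p=predicted_probs)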
        result.append(id_to_word[next_id])
    return ' '.join(result)

# Sample outputs
print("\n--- Sample 1 ---")
print(generate_text("love is"))

print("\n--- Sample 2 ---")
print(generate_text("time will"))


--- Sample 1 ---
love is hurt it might be still that truly continued asserted out of themselves. she delightful done explain if they and added in his morning to the doorway of examination for state.” his dear comfortably disposition for her father are dead. my very wits will safely speak “he is very lvi. a

--- Sample 2 ---
time will taste to us from she in! prefaced his church after many delay. which examine yes. threeandtwenty! so just as you means that music cover as you have always pretty certain my own brother you left elizabeth so well by him before. it is as you are exist cried she. unhesitatingly.
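
The samples above are drawn from the model's full output distribution, so low-probability words are picked fairly often. A common variation, not used in this notebook, is temperature sampling: probabilities are raised to the power `1/T` and renormalized before sampling, so `T < 1` favors likely words. The sketch below uses only the standard library; `apply_temperature`, `sample_id`, and `toy_probs` are hypothetical names for illustration.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution to p_i^(1/T), then renormalize to sum to 1."""
    # Work in log space; the floor avoids log(0) for zero probabilities.
    scaled = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(scaled)                       # subtract the max for stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_id(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]           # made-up next-word probabilities
for t in (1.0, 0.5):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])

print(sample_id(toy_probs, temperature=0.5))
```

In `generate_text`, the equivalent change would be to adjust `predicted_probs` in this way before the sampling step; whether it improves the samples here is untested.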
