<a href="https://colab.research.google.com/github/ldmcgo26/DL_Assignment_6/blob/main/LM_09_Assigment_6_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Combine the full text of the book into a single text source.  
- Split into training (80%) and validation (20%).  

In [None]:
import pandas as pd
import numpy as np
import requests

In [None]:
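url = 'https://www.gutenberg.org/cache/epub/45/pg45.txt'
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
text = response.text

# Lowercase and split on whitespace, then keep purely alphabetic tokens.
# Note: this drops any word with attached punctuation (e.g. "gables,").
words = text.lower().split()
words = [w for w in words if w.isalpha()]
df = pd.DataFrame(words, columns=['word'])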

df.head()

Unnamed: 0,word
0,project
1,gutenberg
2,ebook
3,of
4,anne
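
The first tokens above ("project gutenberg ebook of ...") come from the Project Gutenberg header rather than from the book itself. A minimal sketch of trimming that boilerplate follows, assuming the file uses the usual `*** START OF ...` and `*** END OF ...` marker lines. The helper name `strip_gutenberg_boilerplate` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Falls back to the unchanged text if either marker is missing.
    """
    start = re.search(r"\*\*\* ?START OF.*?\*\*\*", raw_text)
    end = re.search(r"\*\*\* ?END OF.*?\*\*\*", raw_text)
    if not start or not end or end.start() <= start.end():
        return raw_text
    return raw_text[start.end():end.start()].strip()

# Tiny made-up sample that mimics the layout of a Gutenberg plain-text file
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one. The story begins here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Applied to `text` before tokenizing, this would keep the header and license words out of the vocabulary, provided the markers are present in the downloaded file.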


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
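
The four steps above can be sketched in plain Python as follows. This is one possible word-level tokenizer, not the one used in the cell below: it assumes ASCII English text, keeps `.`, `!` and `?` as separate tokens, and the names `tokenize`, `build_vocab`, `demo_tokens` and `demo_vocab` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase, drop punctuation except . ! ?, and split into word tokens."""
    text = text.lower()
    # Keep letters, apostrophes, whitespace and the sentence delimiters . ! ?
    text = re.sub(r"[^a-z'.!?\s]", " ", text)
    # A token is a word (optionally with inner apostrophes) or one delimiter
    return re.findall(r"[a-z]+(?:'[a-z]+)*|[.!?]", text)

def build_vocab(tokens):
    """Map each unique token to an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

demo_tokens = tokenize("Anne said, \"Don't go!\" Marilla sighed.")
demo_vocab = build_vocab(demo_tokens)
demo_ids = [demo_vocab[t] for t in demo_tokens]
print(demo_tokens)
print(demo_ids)
```

Unlike a filter on `str.isalpha()`, this keeps words that are followed by a comma or quotation mark, so neighbouring words in a training window stay adjacent in the original text.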

In [None]:
from tensorflow.keras.utils import to_categorical

unique_words = df['word'].unique()
vocab = {word: i for i, word in enumerate(unique_words)}
df['word_id'] = df['word'].map(vocab)

sequence_length = 5
tokens = df['word_id'].to_list()

sequences = []
for i in range(sequence_length, len(tokens)):
    seq = tokens[i-sequence_length:i+1]
    sequences.append(seq)

sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
y = to_categorical(y, num_classes=len(unique_words))

split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

print(X_train[:10], '\n')
print(y_train[:10])

[[ 0  1  2  3  4]
 [ 1  2  3  4  3]
 [ 2  3  4  3  5]
 [ 3  4  3  5  6]
 [ 4  3  5  6  7]
 [ 3  5  6  7  2]
 [ 5  6  7  2  8]
 [ 6  7  2  8  9]
 [ 7  2  8  9 10]
 [ 2  8  9 10 11]] 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
print(X_train.shape, y_train.shape)

(68848, 5) (68848, 6660)
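
The one-hot target above has 68848 × 6660 entries, which is large to hold in memory. Keras also offers the `sparse_categorical_crossentropy` loss, which takes the integer word IDs directly. The NumPy sketch below is only an illustration on made-up probabilities: it shows that the two formulations give the same loss value, and compares the storage needed under the assumption of 4 bytes per value.

```python
import numpy as np

def categorical_crossentropy(one_hot, probs):
    """Mean cross-entropy with one-hot encoded targets."""
    return float(-np.mean(np.sum(one_hot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(class_ids, probs):
    """Mean cross-entropy with integer class IDs as targets."""
    picked = probs[np.arange(len(class_ids)), class_ids]
    return float(-np.mean(np.log(picked)))

# Made-up predictions for two samples over a three-word vocabulary
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
class_ids = np.array([0, 1])
one_hot = np.eye(3)[class_ids]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(class_ids, probs)
print(dense_loss, sparse_loss)

# Storage for the training targets, assuming 4 bytes per value
one_hot_bytes = 68848 * 6660 * 4
sparse_bytes = 68848 * 4
print(one_hot_bytes, sparse_bytes)
```

With integer targets, the `to_categorical` step could be skipped and the model compiled with `loss='sparse_categorical_crossentropy'`.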


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128            # embedding vector dimension
)                             # `input_length` is deprecated in Keras 3 and can be omitted
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
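
Conceptually, an embedding layer is a trainable lookup table: each word ID selects one row of a `(vocab_size, output_dim)` matrix. The NumPy sketch below illustrates only the lookup and the resulting shapes. The table here is random and the sizes are deliberately tiny, both assumptions for the demo; in Keras the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_vocab_size, demo_embedding_dim = 10, 4

# One row per word ID (random here; learned by the real layer)
embedding_table = rng.normal(size=(demo_vocab_size, demo_embedding_dim))

# A batch of two integer-encoded sequences of length 5
batch = np.array([[0, 1, 2, 3, 4],
                  [1, 2, 3, 4, 3]])

# Fancy indexing replaces every ID by its row of the table
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 5, 4): (batch, sequence_length, embedding_dim)
```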

In [None]:
from tensorflow.keras.layers import Embedding

vocab_size = len(unique_words)

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128            # embedding vector dimension
)                             # `input_length` is deprecated in Keras 3 and can be omitted



## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)`) or your custom 1D CNN.
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.
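
To get a feel for the size of such a model, the parameter count can be worked out by hand. The sketch below assumes the configuration used in this notebook (vocabulary of 6660 words, 128-dimensional embeddings, one `LSTM(256)` layer) and the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input kernel, recurrent kernel and bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 6660, 128, 256

n_embedding = embedding_params(vocab_size, embedding_dim)
n_lstm = lstm_params(embedding_dim, lstm_units)
n_dense = dense_params(lstm_units, vocab_size)
n_total = n_embedding + n_lstm + n_dense

print(n_embedding, n_lstm, n_dense, n_total)
```

Under these assumptions the softmax output layer holds more than half of the weights, so the vocabulary size is the main driver of model size and training time.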


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

In [None]:
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

history = model.fit(X_train, y_train, epochs=5, batch_size=32,
                    validation_data=(X_val, y_val),
                    callbacks=[EarlyStopping(patience=2, restore_best_weights=True)])

Epoch 1/5
2152/2152 ━━━━━━━━━━━━━━━━━━━━ 115s 52ms/step - loss: 6.2323 - val_loss: 6.1272
Epoch 2/5
2152/2152 ━━━━━━━━━━━━━━━━━━━━ 115s 53ms/step - loss: 5.1419 - val_loss: 6.2061
Epoch 3/5
2152/2152 ━━━━━━━━━━━━━━━━━━━━ 113s 52ms/step - loss: 4.7091 - val_loss: 6.4151
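
The log above ends after epoch 3 even though `epochs=5` was requested, because the validation loss did not improve for two consecutive epochs. The sketch below is a simplified, plain-Python imitation of patience-based early stopping applied to the three logged validation losses. It ignores options of the real callback such as `min_delta`, and the function name is hypothetical.

```python
def early_stopping_trace(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

stopped_epoch, best_epoch = early_stopping_trace([6.1272, 6.2061, 6.4151], patience=2)
print(stopped_epoch, best_epoch)
```

Training loss keeps falling while validation loss rises from the first epoch on, which is the usual sign of overfitting. With `restore_best_weights=True` the model ends up with the weights from the best epoch.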


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible.
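
As a worked example of the definition above, the sketch below computes perplexity from the probabilities a model assigns to the correct next tokens. It uses made-up probabilities; the vocabulary size 6660 and the loss value 6.1272 are taken from this notebook's outputs. A model that guesses uniformly over `V` words has perplexity `V`, which gives a useful baseline.

```python
import numpy as np

def perplexity(true_token_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    nll = -np.log(np.asarray(true_token_probs, dtype=np.float64))
    return float(np.exp(nll.mean()))

# Uniform guessing over 6660 words gives a perplexity of about 6660
uniform_ppl = perplexity([1 / 6660] * 100)

# A model that is more confident about the right word scores lower
confident_ppl = perplexity([0.5, 0.25, 0.5, 0.25])

# Converting a logged cross-entropy loss (in nats) into perplexity
ppl_from_loss = float(np.exp(6.1272))

print(uniform_ppl, confident_ppl, ppl_from_loss)
```

This conversion is valid only when the loss is the natural-log cross-entropy averaged over tokens, which is what Keras reports for `categorical_crossentropy`.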

In [None]:
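# Report the epoch with the lowest validation loss, which is the one whose
# weights EarlyStopping(restore_best_weights=True) restores.
best_epoch = int(np.argmin(history.history['val_loss']))
train_loss = history.history['loss'][best_epoch]
val_loss = history.history['val_loss'][best_epoch]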

train_perplexity = np.exp(train_loss)
val_perplexity = np.exp(val_loss)

print(f"Train Perplexity: {train_perplexity:.2f}")
print(f"Validation Perplexity: {val_perplexity:.2f}")

Train Perplexity: 341.88
Validation Perplexity: 458.17


## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [None]:
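inv_vocab = {i: word for word, i in vocab.items()}

def generate_text(seed_text, num_words=50):
    output = seed_text.lower().split()

    for _ in range(num_words):
        # Use up to the last `sequence_length` known words as context,
        # matching the window size used during training.
        context = [vocab[w] for w in output[-sequence_length:] if w in vocab]
        if not context:
            context = [0]  # fall back to token 0 if no seed word is in the vocabulary
        input_seq = np.array(context).reshape(1, -1)

        preds = model.predict(input_seq, verbose=0)[0].astype('float64')
        preds /= preds.sum()  # renormalize so np.random.choice accepts the probabilities
        next_index = np.random.choice(len(preds), p=preds)
        next_word = inv_vocab.get(next_index, '')

        output.append(next_word)

    return ' '.join(output)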

In [None]:
print("Sample 1:")
print(generate_text("love is", num_words=50))

print("\nSample 2:")
print(generate_text("time will", num_words=50))

Sample 1:
love is whenever king charles i must lily day talkative fashionably overburdened all feather unsuccessfully defeat hard tacking especially pure unappeased time visions especially cracked flying furnish city never desired she had guess thrilling you go like see rakish hair dogs moons moaned for minnie ought started dwelling hold company imagining numerous

Sample 2:
time will be have lay thumping never and a theology bridal bread combine deep ruthlessly eating recipe within sprinkled gasped amethyst mebbe totally triangular remorseful dwelt period consist required the attired left worldly boughs hence pieces passed wholeheartedly secret any avonlea happiest awfully thistle kissed bearding dimples intensely ridgepoles ask ahead silence
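
The samples above are drawn directly from the model's softmax output, so rare words are picked fairly often. One common option, not used in this notebook, is temperature sampling: the distribution is sharpened (temperature below 1) or flattened (temperature above 1) before drawing. The sketch below is a NumPy-only illustration on a made-up three-word distribution; `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Rescale a probability vector by temperature and draw one index.

    Returns (sampled_index, rescaled_probs).
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled)), scaled

demo_probs = np.array([0.5, 0.3, 0.2])
_, sharp = sample_with_temperature(demo_probs, temperature=0.5, rng=np.random.default_rng(0))
_, flat = sample_with_temperature(demo_probs, temperature=2.0, rng=np.random.default_rng(0))
print(sharp)
print(flat)
```

In `generate_text`, the `np.random.choice` call could be swapped for such a helper with the model's `preds` as input; whether a lower temperature improves the samples here would need to be tested.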
