# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


In [26]:
import string
import numpy as np
import tensorflow as tf

## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [8]:
# read the whole book; the context manager closes the file afterwards
with open("/content/gatsby.txt", "r", encoding="utf-8") as f:
  book = f.read()

train_size = int(0.8 * len(book))

train = book[:train_size]
val = book[train_size:]

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [9]:
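text = book.lower()  # convert text to lowercase

# remove punctuation, keeping the sentence delimiters . ! ?
remove = ''.join([char for char in string.punctuation if char not in ".!?"])
translator = str.maketrans('', '', remove)
text = text.translate(translator)
text = ' '.join(text.split())  # collapse runs of whitespace (including newlines) into single spaces

# vocabulary: every unique character in the cleaned text
vocab = sorted(set(text))

# tokenize by character and map each character to an integer ID
# (StringLookup reserves ID 0 for out-of-vocabulary tokens, so the characters get IDs 1..len(vocab))
chars = tf.strings.unicode_split(text, input_encoding='UTF-8')
chars_to_ids = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)
ids = chars_to_ids(chars)

# split preprocessed data into training and validation
# note: train_size was computed from the raw book, which is longer than the cleaned text,
# so this split is not exactly 80/20
train = ids[:train_size]
val = ids[train_size:]
print(f"Train size: {len(train)}")
print(f"Validation size: {len(val)}")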

Train size: 232061
Validation size: 49744


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length  # fixed input length; deprecated in Keras 3, where it can be omitted
)
```
- This layer transforms integer-encoded sequences (word or character IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
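
To make the lookup concrete, here is a minimal NumPy sketch of what an embedding layer computes: each integer ID selects one row of a weight table. The table below is filled with random numbers as a stand-in for learned weights, and the IDs in `batch_ids` are made up for illustration; only the sizes 53 and 128 are taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes borrowed from this notebook; the weights are random stand-ins, not trained values.
vocab_size, embedding_dim = 53, 128
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A hypothetical batch of 2 sequences with 3 token IDs each.
batch_ids = np.array([[4, 17, 4],
                      [0, 52, 9]])

# Integer-array indexing picks one table row per ID: (2, 3) -> (2, 3, 128).
embedded = embedding_table[batch_ids]
print(embedded.shape)
```

Equal IDs always map to the same vector (positions 0 and 2 of the first sequence here), which is why the layer needs `input_dim` to be larger than the largest ID it will ever see.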

In [41]:
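def sequence(ids, sequence_length):
  # convert once to NumPy: slicing a tf.Tensor inside a Python loop is much slower
  ids = np.asarray(ids)
  x = []; y = []
  for i in range(len(ids) - sequence_length):
    x.append(ids[i:i+sequence_length])  # input window of sequence_length tokens
    y.append(ids[i+sequence_length])    # target: the token that follows the window
  return np.array(x), np.array(y)

sequence_length = 100
x_train, y_train = sequence(train, sequence_length)
x_val, y_val = sequence(val, sequence_length)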

In [44]:
print(x_train.shape,y_train.shape)
print(x_val.shape,y_val.shape)

(231961, 100) (231961,)
(49644, 100) (49644,)


In [47]:
vocab_size = len(vocab)
print(vocab_size)

53


## 4. Model
- Implement an LSTM-, GRU-, or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [48]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dense,Dropout
from tensorflow.keras.optimizers import Adam

In [50]:
# Build the model
model = Sequential()

# StringLookup reserves ID 0 for out-of-vocabulary tokens, so IDs run from 0 to len(vocab)
num_tokens = chars_to_ids.vocabulary_size()

# Embedding layer to convert integers into dense vectors
model.add(Embedding(input_dim=num_tokens, output_dim=128))

# LSTM layer for sequence learning (256 units). It returns only the last hidden state,
# because each input window has a single next-token target. With return_sequences=True
# the output has shape (None, 100, 53), which does not match targets of shape (None,)
# and makes fit() fail with:
#   ValueError: Argument `output` must have rank (ndim) `target.ndim - 1`.
model.add(LSTM(256))

# Dense layer with softmax to predict the next token (character)
model.add(Dense(num_tokens, activation='softmax'))

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

# Train the model
network_history = model.fit(x_train, y_train,
                            validation_data=(x_val, y_val),
                            batch_size=128,
                            epochs=5,
                            verbose=1)

## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood per token.
  - If your model outputs cross-entropy loss `H` (in nats, as Keras reports it), then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If yours stays higher (which is possible), try to explain why it does not decrease further.
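
The relation between loss and perplexity can be sketched in a few lines. The loss values below are illustrative placeholders, not results from this notebook; in practice they would come from something like `network_history.history["val_loss"]`, assuming training completed and validation data was passed to `fit`.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity for a mean cross-entropy loss measured in nats (natural log)."""
    return math.exp(cross_entropy_nats)

# Illustrative placeholder losses, one per epoch (NOT measured values).
illustrative_val_losses = [2.9, 2.3, 2.0]
illustrative_perplexities = [perplexity(h) for h in illustrative_val_losses]

for epoch, (h, ppl) in enumerate(zip(illustrative_val_losses, illustrative_perplexities), start=1):
    print(f"epoch {epoch}: val_loss={h:.2f}  perplexity={ppl:.1f}")

# A model that guesses uniformly over V tokens has loss ln(V) and perplexity V.
uniform_baseline = perplexity(math.log(53))
```

The uniform baseline is a useful reference point: with a 53-character vocabulary, even random guessing gives a perplexity of about 53, so a character-level model should land far below 50, while the same threshold is much harder to reach for a word-level vocabulary.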

## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens** long.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).
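
One possible way to meet these criteria is an autoregressive sampling loop: predict a distribution over the next token, sample from it, append the sample, and repeat. The sketch below uses only the standard library. `sample_with_temperature`, `generate`, and `toy_predict_next` are hypothetical names introduced here, and `toy_predict_next` is a stand-in for the trained network, not the notebook's model.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`."""
    # Divide log-probabilities by T: T < 1 sharpens the distribution, T > 1 flattens it.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate(seed_ids, predict_next, num_tokens=50, window=100, temperature=1.0, rng=random):
    """Return `seed_ids` followed by `num_tokens` sampled token IDs."""
    ids = list(seed_ids)
    for _ in range(num_tokens):
        context = ids[-window:]         # the model only sees the most recent tokens
        probs = predict_next(context)   # distribution over the next token
        ids.append(sample_with_temperature(probs, temperature, rng))
    return ids

def toy_predict_next(context):
    """Hypothetical stand-in for the model: favours (last ID + 1) mod 5."""
    toy_vocab_size = 5
    favoured = (context[-1] + 1) % toy_vocab_size
    return [0.8 if i == favoured else 0.05 for i in range(toy_vocab_size)]

# Two samples from two different seeds, each extended by 50 tokens.
sample_a = generate([0, 1], toy_predict_next, num_tokens=50, rng=random.Random(0))
sample_b = generate([3], toy_predict_next, num_tokens=50, rng=random.Random(1))
```

With the notebook's model, `predict_next` would presumably wrap `model.predict` on the encoded seed phrase and return its softmax output, and the generated IDs could be mapped back to characters with a second `StringLookup` created with `invert=True`. Both of those steps are assumptions about how the pieces would be wired together, not something the notebook shows.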