<a href="https://colab.research.google.com/github/itoshiyanazawa/rnn_project/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Libraries

In [1]:
import urllib.request
import zipfile
import os
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Step 1

1. Collect the text dataset:
  *   Download the text dataset from a reputable source or use a pre-existing one.

2. Clean the text data:
  *   Remove unwanted characters, punctuation, and formatting.
  *   Convert all text to lowercase to reduce complexity.

3. Tokenize the text:
  *   Split the text into individual characters.
  *   Create a vocabulary of unique tokens and map each token to an integer.

4. Create input sequences:
  *   Generate input sequences and corresponding targets for training.



Collecting the Movie Dialogues dataset from the Cornell University Website

In [2]:
url = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"
file_name = "cornell_movie_dialogs_corpus.zip"

# Download the dataset
urllib.request.urlretrieve(url, file_name)

# Extract the dataset
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall("cornell_data")

print("✅ Dataset downloaded and extracted.")


✅ Dataset downloaded and extracted.


Cleaning the Text Data

In [3]:
# Load dialogue lines
with open("cornell_data/cornell movie-dialogs corpus/movie_lines.txt", encoding='iso-8859-1') as f:
    lines = f.readlines()

# Extract the actual text (5th field) from each line
dialogues = []
for line in lines:
    parts = line.split(" +++$+++ ")
    if len(parts) == 5:
        dialogues.append(parts[-1].strip())

# Clean the text
import re

def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r"[^a-zA-Z0-9.,!?'\n ]+", ' ', text)  # remove special chars
    text = re.sub(r'\s+', ' ', text).strip()  # normalize spacing
    return text

cleaned_dialogues = [clean_text(line) for line in dialogues]

# Combine into one long string
full_text = ' '.join(cleaned_dialogues)

print("🧹 Cleaned text preview:\n", full_text[:500])


🧹 Cleaned text preview:
 they do not! they do to! i hope so. she okay? let's go. wow okay you're gonna need to learn how to lie. no i'm kidding. you know how sometimes you just become this persona ? and you don't know how to quit? like my fear of wearing pastels? the real you . what good stuff? i figured you'd get to the good stuff eventually. thank god! if i had to hear one more story about your coiffure... me. this endless ...blonde babble. i'm like, boring myself. what crap? do you listen to this crap? no... then gui


Tokenize the text

In [None]:
# Character-level tokenization
chars = sorted(set(full_text))  # unique characters
char2idx = {ch: idx for idx, ch in enumerate(chars)}  # map char to index
idx2char = {idx: ch for ch, idx in char2idx.items()}  # map index to char

# Convert all text to a sequence of integers
text_as_int = [char2idx[c] for c in full_text]

print("🧠 Total characters:", len(full_text))
print("🔤 Vocabulary size:", len(chars))
print("Sample encoding:", text_as_int[:20])

🧠 Total characters: 16940151
🔤 Vocabulary size: 42
Sample encoding: [35, 23, 20, 40, 0, 19, 30, 0, 29, 30, 35, 1, 0, 35, 23, 20, 40, 0, 19, 30]
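
As a quick sanity check on the mapping, the sample encoding above can be decoded back to text. The sketch below does not touch the corpus. It assumes the vocabulary is exactly the set of characters that `clean_text` keeps (space, `.,!?'`, digits, and lowercase letters), which is consistent with the reported vocabulary size of 42. The names `toy_chars`, `toy_char2idx`, `toy_idx2char`, `encode`, and `decode` are illustrative and not part of the notebook.

```python
import string

# Assumption: the vocabulary is exactly the characters clean_text() lets through.
allowed = " " + ".,!?'" + string.digits + string.ascii_lowercase
toy_chars = sorted(set(allowed))
toy_char2idx = {ch: idx for idx, ch in enumerate(toy_chars)}
toy_idx2char = {idx: ch for ch, idx in toy_char2idx.items()}

def encode(text):
    """Map each character to its integer index."""
    return [toy_char2idx[c] for c in text]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(toy_idx2char[i] for i in indices)

# The first 20 indices printed by the tokenization cell
sample = [35, 23, 20, 40, 0, 19, 30, 0, 29, 30, 35, 1, 0, 35, 23, 20, 40, 0, 19, 30]
decoded = decode(sample)

print(len(toy_chars))  # 42
print(decoded)         # they do not! they do
```

The decoded string matches the start of the cleaned text preview, which suggests the two dictionaries are inverses of each other.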


Generate input sequences and corresponding targets for training

In [None]:
# Define sequence length
seq_length = 100

# Convert list to NumPy array for faster slicing
text_as_int = np.array(text_as_int)

# Calculate number of sequences
num_sequences = len(text_as_int) - seq_length

# Create input sequences with a list comprehension (one window per start index)
sequences = np.array([text_as_int[i:i+seq_length] for i in range(num_sequences)])
targets = text_as_int[seq_length:]  # target i is the character that follows window i

print("✅ Total sequences created:", len(sequences))
print("Sample input:", sequences[0])
print("Sample target:", targets[0])

✅ Total sequences created: 16940051
Sample input: [35 23 20 40  0 19 30  0 29 30 35  1  0 35 23 20 40  0 19 30  0 35 30  1
  0 24  0 23 30 31 20  0 34 30  4  0 34 23 20  0 30 26 16 40 15  0 27 20
 35  2 34  0 22 30  4  0 38 30 38  0 30 26 16 40  0 40 30 36  2 33 20  0
 22 30 29 29 16  0 29 20 20 19  0 35 30  0 27 20 16 33 29  0 23 30 38  0
 35 30  0 27]
Sample target: 24
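
Materializing roughly 16.9 million windows of 100 integers each takes a lot of memory. One possible alternative, sketched here on a toy array, is NumPy's `sliding_window_view` (available from NumPy 1.20), which returns a view over the original buffer instead of copying each window. The helper name `make_windows` is hypothetical and not part of the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(encoded, seq_length):
    """Return (inputs, targets) where targets[i] follows inputs[i] in the text."""
    encoded = np.asarray(encoded)
    # Drop the final window because no character follows it.
    windows = sliding_window_view(encoded, seq_length)[:-1]
    targets = encoded[seq_length:]
    return windows, targets

toy = np.arange(10)
X, y = make_windows(toy, seq_length=4)

print(X.shape, y.shape)  # (6, 4) (6,)
print(X[0], y[0])        # [0 1 2 3] 4
```

The view is read-only, so a slice would likely need to be copied before it is passed to code that writes to it.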


# Step 2

1. Define the RNN architecture (e.g. using Tensorflow or PyTorch).
2. Explain the type of layers you are including and why (layers such as Embedding, LSTM, and Linear).
3. Visualize your RNN architecture.
4. Compile the model with appropriate loss function and optimizer. Explain your choice of loss function and optimizer.
5. Prepare data for training by converting sequences and targets into batches.
6. Train the model on the training data and validate it on the validation set.
7. Visualize the training process using both training and validation results.


Defining the RNN Architecture

In [None]:
vocab_size = len(char2idx)

# Define a deeper model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),  # Increased embedding size
    LSTM(256, return_sequences=True),
    LSTM(256),
    Dense(vocab_size, activation='softmax')
])

**Layer Breakdown**:
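- `Embedding`: Converts integer indices into dense 128-dimensional vectors. This lets the model learn which characters behave similarly instead of treating the indices as arbitrary numbers.
- `LSTM`: Two stacked Long Short-Term Memory layers with 256 units each. The first uses `return_sequences=True` so the second receives the full sequence of hidden states; the second returns only its final state. LSTMs capture temporal dependencies and handle vanishing gradients better than simple RNNs.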
- `Dense`: Fully connected output layer with `softmax` activation, outputs probabilities across the vocabulary to predict the next character.
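
To get a feel for the model's size, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for Keras layers with default settings (bias enabled) and assumes the vocabulary size of 42 reported earlier. The helper names are illustrative only.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

def count_model_params(vocab_size=42, embed_dim=128, lstm_units=256):
    return {
        "embedding": embedding_params(vocab_size, embed_dim),
        "lstm_1": lstm_params(embed_dim, lstm_units),
        "lstm_2": lstm_params(lstm_units, lstm_units),
        "dense": dense_params(lstm_units, vocab_size),
    }

counts = count_model_params()
total = sum(counts.values())
print(counts)
print(total)  # 935722
```

Almost all of the parameters sit in the two LSTM layers, so changing the number of units affects model size far more than changing the embedding dimension.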


Visualizing the Model

In [None]:
model.summary()

Compiling the model with appropriate loss function and optimizer.
**Loss Function**:
- We use `sparse_categorical_crossentropy` because the target output is a single integer (not one-hot encoded).

**Optimizer**:
- `Adam` is used for its adaptive learning rate, helping the model converge faster and more reliably during training.
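
The difference between the sparse and one-hot forms of cross-entropy is only in how the target is represented. The following minimal sketch, with made-up probabilities for a three-character vocabulary, shows that both give the same value for a single example. The function names mirror the Keras loss names but are plain Python re-implementations for illustration.

```python
import math

def sparse_categorical_crossentropy(target_index, probs):
    # Target is a single integer: pick out the probability of the true class
    return -math.log(probs[target_index])

def categorical_crossentropy(one_hot, probs):
    # Target is a one-hot vector: every term except the true class is zero
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

example_probs = [0.1, 0.7, 0.2]   # hypothetical softmax output
target = 1                        # integer label
one_hot = [0.0, 1.0, 0.0]         # same label, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(target, example_probs)
dense_loss = categorical_crossentropy(one_hot, example_probs)
print(sparse_loss, dense_loss)
```

With integer targets there is no need to build a one-hot vector of vocabulary size for every training example, which saves memory.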


In [None]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

Prepare data for training

In [None]:
# Step 1: Reduce data if necessary (optional sampling)
max_examples = 120000  # You can tune this down if needed
sequences = sequences[:max_examples]
targets = targets[:max_examples]

# Step 2: Convert to tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices((sequences, targets))

# Step 3: Split into train/validation, then shuffle and batch
buffer_size = len(sequences)
batch_size = 64

# 90/10 split by position: the first 90% of windows train, the rest validate
train_size = int(0.9 * buffer_size)
val_size = buffer_size - train_size

# Only the training set is shuffled; incomplete final batches are dropped
train_dataset = dataset.take(train_size).shuffle(buffer_size).batch(batch_size, drop_remainder=True)
val_dataset = dataset.skip(train_size).batch(batch_size, drop_remainder=True)

print("✅ Train batches:", len(train_dataset))
print("✅ Validation batches:", len(val_dataset))

✅ Train batches: 1687
✅ Validation batches: 187
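
The batch counts above follow directly from the split sizes and `drop_remainder=True`. A small arithmetic check, using a hypothetical helper named `count_batches`:

```python
def count_batches(n_examples, batch_size, train_fraction=0.9):
    """Number of full batches in each split when incomplete batches are dropped."""
    n_train = int(train_fraction * n_examples)
    n_val = n_examples - n_train
    return n_train // batch_size, n_val // batch_size

train_batches, val_batches = count_batches(120000, 64)
print(train_batches, val_batches)  # 1687 187
```

Because the windows are taken at consecutive positions and split by position, the first validation windows overlap almost entirely with the last training windows.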


Training the Model

In [None]:
history = model.fit(
    train_dataset,
    epochs=30,
    validation_data=val_dataset
)

Epoch 1/30
1627/1687 ━━━━━━━━━━━━━━━━━━━━ 1:02 1s/step - accuracy: 0.2910 - loss: 2.5318

Visualizing Model Accuracy and Loss

In [None]:
history_dict = history.history

# Accuracy plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history_dict['accuracy'], label='Training Accuracy', marker='o')
plt.plot(history_dict['val_accuracy'], label='Validation Accuracy', marker='s')
plt.title('Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

# Loss plot
plt.subplot(1, 2, 2)
plt.plot(history_dict['loss'], label='Training Loss', marker='o')
plt.plot(history_dict['val_loss'], label='Validation Loss', marker='s')
plt.title('Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Step 3

1. Create a function to generate text by sampling from the model's predictions.

Qualitative Evaluation:

2. Coherence and Grammar: Check if the generated text is grammatically correct and coherent.

3. Creativity: Evaluate if the generated text is creative and interesting.

4. Contextual Relevance: Assess whether the generated text maintains context and follows logically from the seed text.

5. Diversity: Ensure that the model does not repeat itself excessively and generates diverse outputs.

Quantitative Evaluation:

6. Perplexity: A common metric for evaluating language models. Perplexity measures how well the model predicts the next token in a sequence. Lower perplexity indicates better performance.

7. BLEU Score: Used to evaluate the quality of text that has been machine-translated from one language to another. It can be adapted to measure the overlap between generated text and reference text.

8. ROUGE Score: Commonly used for evaluating summarization and translation models. It measures the overlap of n-grams between the generated text and a reference text.

9. Entropy and Repetition Metrics: Measure the diversity of the generated text. High entropy and low repetition indicate diverse and less repetitive outputs.
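
Some of the quantitative metrics above are simple enough to sketch without extra libraries. The example below is a minimal illustration, not a full evaluation suite: `perplexity` takes the probabilities a model assigned to the true next characters, `char_entropy` measures the spread of characters in a generated string, and `distinct_n` is one possible repetition metric (the share of n-grams that are unique). All three function names are hypothetical.

```python
import math
from collections import Counter

def perplexity(true_token_probs):
    """exp of the mean negative log-likelihood of the true tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)

def char_entropy(text):
    """Shannon entropy (in bits) of the character distribution in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def distinct_n(text, n=3):
    """Fraction of character n-grams that are unique; lower means more repetition."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(perplexity([0.25] * 8))      # uniform guess over 4 options
print(char_entropy("abab"))        # two equally likely characters
print(distinct_n("aaaa", n=2))     # heavily repetitive
print(distinct_n("abcd", n=2))     # no repetition
```

Since Keras reports cross-entropy in nats, `math.exp(val_loss)` gives the validation perplexity directly when the loss is the mean sparse categorical cross-entropy.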




In [None]:
def generate_text(model, start_string, num_generate=300, temperature=1.0):
    """
    Generates text one character at a time by sampling from the model's
    predicted next-character distribution.
    """
    start_string = start_string.lower()

    try:
        input_indices = [char2idx[char] for char in start_string]
    except KeyError as e:
        print(f"Error: Character '{e.args[0]}' not in vocabulary.")
        return ""

    input_indices = np.expand_dims(input_indices, axis=0)
    generated_text = start_string

    for _ in range(num_generate):
        # The model ends in softmax, so these are probabilities, not logits
        probs = model.predict(input_indices, verbose=0)[0]  # Shape: (vocab_size,)

        # Apply temperature in log space, then renormalize
        logits = np.log(probs.astype(np.float64) + 1e-9) / temperature
        probs = np.exp(logits - np.max(logits))
        probs /= np.sum(probs)
        predicted_idx = np.random.choice(len(chars), p=probs)

        predicted_char = idx2char[predicted_idx]
        generated_text += predicted_char

        # Slide the input window: drop the oldest character, append the new one
        input_indices = np.append(input_indices[0][1:], predicted_idx)
        input_indices = np.expand_dims(input_indices, axis=0)

    return generated_text


In [None]:
seed_text = "Once upon a time, "

print("Temperature 0.2 (Predictable):\n")
print(generate_text(model, start_string=seed_text, num_generate=300, temperature=0.2))

print("\nTemperature 0.5 (Balanced):\n")
print(generate_text(model, start_string=seed_text, num_generate=300, temperature=0.5))

print("\nTemperature 1.0 (Creative):\n")
print(generate_text(model, start_string=seed_text, num_generate=300, temperature=1.0))
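
To see what the temperature values do to a distribution, the sketch below rescales a made-up three-way probability vector. Temperature is applied to log-probabilities and the result is renormalized, which is the usual approach when a model outputs softmax probabilities instead of logits. The helper `apply_temperature` and the numbers in `example_probs` are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()          # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

example_probs = np.array([0.6, 0.3, 0.1])  # hypothetical model output
for t in (0.2, 0.5, 1.0):
    print(t, np.round(apply_temperature(example_probs, t), 3))
```

At temperature 1.0 the distribution is unchanged, while at 0.2 nearly all of the mass moves to the most likely character, which is why low temperatures produce more predictable text.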


# Step 4

1. Experiment with different architectures (e.g., adding more layers, or trying other layer types).
 * Try deeper networks, different activation functions, or different layer configurations.
 * Example: Adding more LSTM layers or using GRU instead of LSTM.
2. Apply regularization techniques (e.g., Use dropout to prevent overfitting).
3. Use advanced text preprocessing.
 * Implement techniques like stemming, lemmatization, or BPE (Byte Pair Encoding) for better tokenization.
4. Fine-tune hyperparameters (e.g., learning rate, batch size).
 * Experiment with different learning rates, batch sizes, and epochs.
 * Explain your approach for fine-tuning the hyper-parameters.
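
As an illustration of the BPE idea mentioned above, here is a minimal sketch of the merge-learning loop on a toy word list: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. It is a simplified teaching version, not the tokenizer used by any particular library, and the function name `learn_bpe` is hypothetical.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    # Each word starts as a tuple of single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, bpe_vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(bpe_vocab))
```

After two merges the frequent stem `low` becomes a single token, while rarer suffixes stay split into characters. This is the behaviour that lets subword tokenizers sit between character-level and word-level vocabularies.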

Fine-tuning GPT-2

In [6]:
# Data preparation
joined_text = "\n".join(cleaned_dialogues)

In [7]:
!pip install gpt_2_simple

Collecting gpt_2_simple
  Downloading gpt_2_simple-0.8.1.tar.gz (26 kB)
  Preparing metadata (setup.py) ... done
Collecting toposort (from gpt_2_simple)
  Downloading toposort-1.10-py3-none-any.whl.metadata (4.1 kB)
Downloading toposort-1.10-py3-none-any.whl (8.5 kB)
Building wheels for collected packages: gpt_2_simple
  Building wheel for gpt_2_simple (setup.py) ... done
  Created wheel for gpt_2_simple: filename=gpt_2_simple-0.8.1-py3-none-any.whl size=24557 sha256=88f53af682582d8676227f3be7e4fbabe76d334c641b63b5971ab4afc8180532
  Stored in directory: /root/.cache/pip/wheels/9e/59/88/2abf9f043f52307bb3d81010e26ecdb5e539b392e8aca2501f
Successfully built gpt_2_simple
Installing collected packages: toposort, gpt_2_simple
Successfully installed gpt_2_simple-0.8.1 toposort-1.10


In [8]:
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(joined_text)

In [None]:
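import gpt_2_simple as gpt2
import tensorflow as tf

# Download the smallest pretrained GPT-2 checkpoint (124M parameters)
gpt2.download_gpt2(model_name='124M')

# Fine-tune on the cleaned dialogue text, starting from the pretrained weights
sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset='/content/output.txt',
    model_name='124M',
    steps=200,
    print_every=200,
    sample_every=100,
    restore_from='fresh',
)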

Fetching checkpoint: 1.05Mit [00:00, 3.84Git/s]                                                     
Fetching encoder.json: 1.05Mit [00:02, 519kit/s]
Fetching hparams.json: 1.05Mit [00:00, 5.10Git/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 498Mit [02:56, 2.82Mit/s]
Fetching model.ckpt.index: 1.05Mit [00:00, 3.27Git/s]                                               
Fetching model.ckpt.meta: 1.05Mit [00:01, 878kit/s]
Fetching vocab.bpe: 1.05Mit [00:01, 872kit/s]


Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:15<00:00, 15.14s/it]


dataset has 4667336 tokens
Training...
 a you're doing a goddamn thing right now, he's doing a goddamn thing. that's a fucking thing. that's just something to fucking get a feel yourself. that's not good, and he's still good enough to do it like,

your dad is a fucking dope. the only thing i don't like is he's the only goddamn thing fucking nice about it.

i didn't watch any of that kid play baseball. i was just getting started.

i guess it's just you, or it just gets worse and worse.

not that you don't like it.

yeah, but you'd be the better coach. there's just no way you'd ever get that out of me, anyway. i can tell you who's better, my dad.

hey, look!
oh, you know, he's great!
yeah, but you're too goddamn smart.
you wouldn't. you'd call yourself the smartest person in the room.
it was nice. i could see what was taking place.
well, it's gotta be nice.
you, uh, i don't know anyone like that. you want me to meet those guys?
you gotta be super smart, you fucking idiot. you're gonna be