# Project title: Chat-Style Text Generator using RNN (LSTM) — Cornell Movie-Dialogs Edition

This notebook trains a recurrent neural network (LSTM) to model short conversational turns and generate chat-style text. We use the Cornell Movie-Dialogs Corpus (a compact, dialogue-heavy dataset) to teach the model common conversational patterns. After training, the model can autocomplete a seed phrase or generate short replies in a movie-dialog style.

## Why this project?

RNNs (and gated variants like LSTM/GRU) are designed for sequential data. Training an LSTM on conversational data helps you learn:
* tokenization and sequence preparation for NLP,
* how to create n-gram style inputs for next-word prediction,
* embedding layers and how they reduce dimensionality,
* handling memory and performance constraints (important on Kaggle),
* practical text generation (greedy vs. sampling strategies).

## Important Steps
* Data handling & preprocessing: Load raw dialog files, parse the format, clean text, and construct a corpus of utterances.
* Sequence creation: Convert text to integer tokens, build n-gram sequences for next-word prediction.
* Modeling: Build a small Embedding + LSTM model for next-word prediction.
* Training under resource limits: Techniques to reduce memory usage (subsetting data, sparse loss, smaller model).
* Generation: Generate text from a seed phrase using greedy and (optional) temperature sampling.
* Documentation & reproducibility

## Imports + Data Loading + Initial Exploration
- importing libraries
- loading the dataset
- previewing the raw lines
- basic cleaning

In [None]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import re
import random

# Check TensorFlow and GPU availability
print("TensorFlow version:", tf.__version__)
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

In [None]:

data_path = "/kaggle/input/cornell-moviedialog-corpus"

print(os.listdir(data_path))  # a bare expression mid-cell is not displayed, so print it

# Load movie lines file
lines_file = os.path.join(data_path, "movie_lines.txt")

# Each line has metadata and the actual dialogue text
with open(lines_file, encoding="iso-8859-1") as f:
    lines = f.readlines()

print("Total lines:", len(lines))
print("\nSample line:\n", lines[100])


In [None]:
# ================================
# Extract actual dialogue text
# ================================
corpus = []
for line in lines:
    parts = line.strip().split("+++$+++")
    if len(parts) == 5:
        text = parts[-1].strip()
        corpus.append(text)

print("Total dialogues extracted:", len(corpus))
print("Sample dialogues:\n", corpus[:5])
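
The extraction loop above relies on each raw line having five fields separated by `+++$+++`. As a minimal sketch of that parsing step, the helper below (`parse_movie_line` is a hypothetical name, not part of the notebook) unpacks one line, assuming the fields are line ID, character ID, movie ID, character name, and utterance text. The sample string is written by hand in that layout rather than read from the dataset.

```python
# Field separator used in movie_lines.txt (the same one the loop above splits on).
SEPARATOR = "+++$+++"


def parse_movie_line(raw_line):
    """Return the utterance text from one raw line, or None if it is malformed."""
    parts = [p.strip() for p in raw_line.strip().split(SEPARATOR)]
    if len(parts) != 5:
        return None
    # Assumed field order: line ID, character ID, movie ID, character name, text.
    line_id, character_id, movie_id, character_name, text = parts
    return text


# Hand-written line in the five-field layout (illustrative, not loaded from disk).
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
print(parse_movie_line(sample))           # They do not!
print(parse_movie_line("no separators"))  # None
```

Returning `None` for malformed lines matches the `len(parts) == 5` check in the cell above, which silently skips anything that does not split into exactly five fields.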


In [None]:
# basic cleaning

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z' ]+", "", text)  # remove numbers, punctuation (except apostrophes)
    return text

corpus = [clean_text(t) for t in corpus if t.strip() != ""]

print("After cleaning:", len(corpus))
print("Sample cleaned lines:\n", corpus[:5])

corpus = corpus[:8000]  # keep first 8,000 dialogues for now (safe size)
print("Using subset of corpus:", len(corpus))
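
To see what the cleaning regex actually does, here is a small sketch that reruns the same `clean_text` logic on three hand-written strings (the strings are illustrative, not taken from the corpus). It shows two side effects worth knowing about: removed digits can leave a double space, and words joined only by punctuation get merged.

```python
import re


def clean_text(text):
    """Same logic as the notebook cell: lowercase, keep letters, apostrophes, spaces."""
    text = text.lower()
    return re.sub(r"[^a-zA-Z' ]+", "", text)


# Illustrative inputs chosen to expose the regex's side effects.
examples = ["They do not!", "I'm 25 years old.", "Well--maybe"]
for raw in examples:
    print(repr(raw), "->", repr(clean_text(raw)))
# 'They do not!' -> 'they do not'
# "I'm 25 years old." -> "i'm  years old"
# 'Well--maybe' -> 'wellmaybe'
```

The double space is harmless for whitespace-based tokenization, but the merged `wellmaybe` becomes a spurious vocabulary entry. Replacing removed characters with a space instead of an empty string would be one way to avoid that, at the cost of changing the cleaning behaviour used here.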

## Tokenization and Sequence Preparation
- Tokenize the text (convert words → numeric tokens).
- Create n-gram sequences to help the RNN learn word-to-word context.
- Pad sequences so all have the same length.
- Prepare the final X (input) and y (output) data for training.

### Tokenizer & input sequences

In [None]:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

total_words = len(tokenizer.word_index) + 1
print("Total unique words:", total_words)

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[:i+1]
        input_sequences.append(n_gram_seq)

print("Total input sequences:", len(input_sequences))
print("Sample sequence:", input_sequences[0])
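
The nested loop above turns every utterance into a set of growing prefixes, so that each prefix's last token can serve as a prediction target. A minimal pure-Python sketch of that step, using a hypothetical four-word vocabulary in place of `tokenizer.word_index`:

```python
def build_prefix_sequences(token_ids):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]


# Hypothetical toy vocabulary; in the notebook the IDs come from tokenizer.word_index.
toy_word_index = {"how": 1, "are": 2, "you": 3, "today": 4}
tokens = [toy_word_index[w] for w in "how are you today".split()]

for seq in build_prefix_sequences(tokens):
    print(seq)
# [1, 2]
# [1, 2, 3]
# [1, 2, 3, 4]
```

An utterance with n tokens yields n - 1 training sequences, and a one-word utterance yields none. This is why the number of input sequences is much larger than the number of dialogue lines.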

### Pad sequences & split into inputs and labels


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_seq_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre'))

print("Max sequence length:", max_seq_len)
print("Padded example:", input_sequences[0])

X = input_sequences[:, :-1]
y = input_sequences[:, -1]  # last word is target

print("X shape:", X.shape)
print("y shape:", y.shape)
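
As a rough illustration of what pre-padding and the `X`/`y` split produce, the sketch below reimplements the idea in plain Python on three toy sequences. `pre_pad` is a hypothetical stand-in that is meant to mirror `pad_sequences(..., padding='pre')` with its default settings, not a replacement for it.

```python
def pre_pad(sequence, length, pad_value=0):
    """Left-pad with pad_value; if too long, keep only the last `length` items."""
    trimmed = sequence[-length:]
    return [pad_value] * (length - len(trimmed)) + trimmed


# Toy prefix sequences (hypothetical token IDs).
sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(s) for s in sequences)
padded = [pre_pad(s, max_len) for s in sequences]

# Everything except the last column is input; the last column is the target word.
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

print(padded)  # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(X_toy)   # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y_toy)   # [2, 3, 4]
```

Padding on the left keeps the real target in the final column of every row, which is what makes the simple `[:, :-1]` and `[:, -1]` slicing work.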



In [None]:
# Check a random example, decoding token IDs back to words in their original order
index = random.randint(0, len(X) - 1)
input_example = [tokenizer.index_word[int(i)] for i in X[index] if i != 0]  # skip padding (0)
target_word = tokenizer.index_word[int(y[index])]

print("Input:", input_example)
print("Target word:", target_word)

## Build the RNN (LSTM) Model

Now we’ll design and train a Recurrent Neural Network that predicts the next word in a sequence.

We’ll use a Keras Sequential model with:
* an Embedding layer (to learn word meanings),
* an LSTM layer (to capture sequence dependencies),
* a Dense output layer (to predict the next word)

### Building the model

* Embedding converts words (integers) → dense vectors that the model can understand.
* LSTM(128) learns context (long-term dependencies) in text.
* Dropout(0.2) helps prevent overfitting.
* Dense(total_words, activation='softmax') outputs probability of each word being next.
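
To get a feel for where the model's size comes from, the layer sizes above can be turned into a parameter count by hand. The sketch below uses the standard formulas for an embedding table, an LSTM with one bias vector per gate, and a dense layer. The vocabulary size of 10,000 is a hypothetical placeholder, since the real value is whatever `total_words` turns out to be.

```python
def lstm_model_param_count(vocab_size, embed_dim=64, lstm_units=128):
    """Count trainable parameters for Embedding -> LSTM -> Dense(vocab_size)."""
    embedding = vocab_size * embed_dim
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # Dense softmax layer: one weight per (unit, word) pair plus one bias per word.
    dense = lstm_units * vocab_size + vocab_size
    # Dropout has no trainable parameters, so it does not appear here.
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}


# vocab_size=10_000 is an assumed example value, not the notebook's total_words.
print(lstm_model_param_count(10_000))
# {'embedding': 640000, 'lstm': 98816, 'dense': 1290000, 'total': 2028816}
```

The LSTM's size does not depend on the vocabulary at all. The embedding and output layers dominate, which is why trimming the vocabulary or subsetting the corpus shrinks the model far more than reducing the number of LSTM units.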

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=total_words, output_dim=64),  # input_length is deprecated in Keras 3; model.build() below sets the input shape
    LSTM(128),
    Dropout(0.2),
    Dense(total_words, activation='softmax')
])

model.build(input_shape=(None, max_seq_len-1))
model.summary()


In [None]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

history = model.fit(
    X, y,
    epochs=2,           # you can increase to 30 if GPU available
    batch_size=128,
    verbose=1
)
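
The choice of `sparse_categorical_crossentropy` matters for memory: it lets `y` stay a vector of integer word IDs instead of a one-hot matrix with one column per vocabulary word. A back-of-the-envelope sketch of the difference, assuming 4-byte values and hypothetical sizes (60,000 training sequences and a 10,000-word vocabulary, neither taken from an actual run):

```python
def label_memory_bytes(num_samples, vocab_size, one_hot, bytes_per_value=4):
    """Estimate memory for the label array, assuming 4-byte (32-bit) values."""
    if one_hot:
        # One row per sample, one column per vocabulary word.
        return num_samples * vocab_size * bytes_per_value
    # Sparse labels: a single integer word ID per sample.
    return num_samples * bytes_per_value


# Hypothetical sizes for illustration only.
num_samples, vocab_size = 60_000, 10_000
print(f"one-hot: {label_memory_bytes(num_samples, vocab_size, one_hot=True):,} bytes")
print(f"sparse:  {label_memory_bytes(num_samples, vocab_size, one_hot=False):,} bytes")
# one-hot: 2,400,000,000 bytes
# sparse:  240,000 bytes
```

Under these assumptions the one-hot labels alone would take about 2.4 GB, while the integer labels take under a megabyte. The ratio is always the vocabulary size.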

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['loss'], label='Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Value')
plt.legend()
plt.title('Training Progress')
plt.show()


## Generate Text Using the Trained Model
Now that our LSTM model is trained, the next step is to generate text automatically.
We will feed a starting word or phrase (seed text) to the model, and it will predict the most likely next word, one word at a time, until we reach the desired output length. Because each step takes the single highest-probability word, this is greedy decoding.

This helps us evaluate whether the model has actually learned meaningful patterns from the dataset.

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, next_words=10):
    """
    Generate text based on a seed input using the trained LSTM model.
    
    Args:
        seed_text (str): The starting text for prediction
        next_words (int): Number of words to generate
    
    Returns:
        str: The generated text
    """
    for _ in range(next_words):
        # Convert text to sequence of integers
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        
        # Pad the sequence to match input length
        token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')
        
        # Predict next word index
        predicted_probs = model.predict(token_list, verbose=0)
        predicted_index = np.argmax(predicted_probs, axis=1)[0]
        
        # Convert index back to word; index 0 is padding and maps to no word
        predicted_word = tokenizer.index_word.get(int(predicted_index))
        if predicted_word is None:
            break
        seed_text += " " + predicted_word
    return seed_text


In [None]:
seed_text = "as"
generated_text = generate_text(seed_text, next_words=10)
print("Generated Sequence:\n", generated_text)
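
The generator above is greedy: it always takes the `argmax`, which tends to produce repetitive output. The optional temperature sampling mentioned in the project steps could look roughly like the sketch below. `sample_with_temperature` is a hypothetical helper, not part of the notebook. It rescales the log-probabilities by a temperature and then draws an index at random, using only the standard library.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Draw an index from `probs` after rescaling by `temperature` (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Divide log-probabilities by the temperature; the floor avoids log(0).
    scaled = [math.log(max(p, 1e-12)) / temperature for p in probs]
    # Exponentiate with max-subtraction for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Toy distribution over three "words" (illustrative values only).
rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
cold = [sample_with_temperature(probs, 0.05, rng) for _ in range(200)]
hot = [sample_with_temperature(probs, 5.0, rng) for _ in range(200)]
print("distinct indices at T=0.05:", len(set(cold)))
print("distinct indices at T=5.0:", len(set(hot)))
```

Low temperatures sharpen the distribution toward the greedy choice, while high temperatures flatten it and increase variety. To try it in `generate_text`, one option would be to replace the `np.argmax` line with `sample_with_temperature(predicted_probs[0], temperature)`. Sampling can occasionally return the padding index 0, so the lookup back to a word should tolerate a missing entry.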
