## **Project: Conversational Next Word Predictor**
Project Type: NLP / Generative AI / LSTM

Dataset: DailyDialog (Kaggle)

### **Project Overview**
This project builds a deep learning model that predicts the next word in casual English conversation. Unlike text generators trained on Wikipedia, whose output tends to be formal and encyclopedic, this model is trained on the DailyDialog dataset to capture the flow, tone, and grammar of everyday dialogue.

In [52]:
#Import Necessary Libraries
import tensorflow as tf
import numpy as np
import pandas as pd
import pickle
import os
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Embedding, LSTM, Input, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(f'Tensorflow Version : {tf.__version__}')

Tensorflow Version : 2.19.0


### **1. Data Loading (Pre-processed)**
The model is trained on the **DailyDialog** dataset. Before being loaded here, the raw data was cleaned in a separate **external data engineering process**.

**The Pre-processing Pipeline involved:**
1.  **Parsing:** Converting stringified lists from the raw CSV into flat text.
2.  **Sanitization:** Removing artifacts like brackets `['...']` and non-English punctuation (e.g., Chinese full stops).
3.  **Sentence Splitting:** Fixing "fused" sentences (e.g., "How are you?I am fine") using Regex.

The resulting clean dataset is loaded from `final_training_data_refined.txt`.
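
The cleaning script itself is not part of this notebook. As a rough illustration only, the three steps could look like the sketch below. The function name `clean_dialog_cell`, the assumption that each CSV cell holds a Python-style list literal, the choice to replace the Chinese full stop with a period, and the exact regular expression are all assumptions, not the pipeline that produced `final_training_data_refined.txt`.

```python
import ast
import re

def clean_dialog_cell(raw_cell):
    """Turn one stringified list of utterances into clean, separate sentences."""
    # 1. Parsing: "['Hi .', 'Hello !']" -> ['Hi .', 'Hello !']
    utterances = ast.literal_eval(raw_cell)

    sentences = []
    for utterance in utterances:
        # 2. Sanitization: replace the Chinese full stop with an ASCII period
        text = utterance.replace("\u3002", ".")
        # 3. Sentence splitting: break after ., ! or ? when a letter follows
        #    immediately ("you?I" -> "you?" / "I")
        text = re.sub(r"([.!?])(?=[A-Za-z])", r"\1\n", text)
        sentences.extend(part.strip() for part in text.split("\n") if part.strip())
    return sentences

print(clean_dialog_cell("['How are you?I am fine\u3002', 'Good to hear .']"))
```

A rule this simple would also split abbreviations such as "U.S.A", so a real pipeline would likely need extra guards.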

In [None]:
#Loading the cleaned data
with open("Data/final_training_data_refined.txt", 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

#Dropping empty lines (and any single-character lines)
text_data = [line for line in lines if len(line) > 1]

print(f'Successfully loaded {len(text_data)} conversation lines.')
print(f'Sample : {text_data[0]}')

Successfully loaded 168870 conversation lines.
Sample : Say , Jim , how about going for a few beers after dinner ?


### **2. Tokenization & Sequence Generation**
We convert text into sequences of integers.

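- **Tokenizer:** Fits on the corpus to build a word-to-integer dictionary. `VOCAB_SIZE` is set to 15,000 but is not passed to the `Tokenizer`, so the full vocabulary is kept: `total_words` is 19,391, which counts every distinct word plus the reserved padding index 0.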

- **Persistence:** The tokenizer is serialized using pickle so the exact same mapping can be used in the deployment app.

- **N-Grams:** We use a sliding window approach to generate multiple training examples from a single sentence (e.g., "Hi how" -> "are", "Hi how are" -> "you").
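
To make the sliding window concrete, here is a small pure-Python sketch that does not use Keras. `build_word_index` and `ngram_prefixes` are illustrative helpers, and the plain `split()` tokenization is an assumption; the Keras `Tokenizer` additionally strips punctuation.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent first (0 is left free for padding)."""
    counts = Counter(word for line in lines for word in line.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def ngram_prefixes(token_ids):
    """Return every prefix of length >= 2; the last id of each prefix is the label."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

lines = ["hi how are you", "how are things"]
word_index = build_word_index(lines)

for line in lines:
    ids = [word_index[word] for word in line.split()]
    for prefix in ngram_prefixes(ids):
        # Everything but the last id is the input, the last id is the target
        print(prefix[:-1], "->", prefix[-1])
```

A sentence with n tokens yields n - 1 training examples, which is why the 11-word sample sentence produces 10 sequences.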

In [4]:
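## Tokenization (converting words to numbers)
# NOTE: VOCAB_SIZE is defined but never passed to Tokenizer (e.g. via num_words),
# so the full vocabulary is kept rather than the top 15,000 words.
VOCAB_SIZE = 15000
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)

# +1 because index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1
print("Total Unique Words : ", total_words)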

Total Unique Words :  19391


In [None]:
#Save Tokenizer
with open('models/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
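
On the deployment side, the saved mapping can be restored with `pickle.load`. The helper name `load_tokenizer` is an assumption; the default path is the one used in the cell above.

```python
import pickle

def load_tokenizer(path="models/tokenizer.pickle"):
    """Restore the object that was saved with pickle.dump."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Unpickling a Keras `Tokenizer` requires TensorFlow to be importable in the app, and pickle files should only be loaded from a trusted source because unpickling can execute arbitrary code.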

In [5]:
## Creating N-Grams (Input Sequences)
input_sequences = []
for line in text_data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Every prefix of length >= 2 becomes one training example;
    # its last token is the word to predict
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [6]:
for sequence in input_sequences[:15]:
    print(f'{sequence} --> {[tokenizer.index_word[i] for i in sequence]}')

[151, 974] --> ['say', 'jim']
[151, 974, 31] --> ['say', 'jim', 'how']
[151, 974, 31, 34] --> ['say', 'jim', 'how', 'about']
[151, 974, 31, 34, 75] --> ['say', 'jim', 'how', 'about', 'going']
[151, 974, 31, 34, 75, 12] --> ['say', 'jim', 'how', 'about', 'going', 'for']
[151, 974, 31, 34, 75, 12, 5] --> ['say', 'jim', 'how', 'about', 'going', 'for', 'a']
[151, 974, 31, 34, 75, 12, 5, 199] --> ['say', 'jim', 'how', 'about', 'going', 'for', 'a', 'few']
[151, 974, 31, 34, 75, 12, 5, 199, 3257] --> ['say', 'jim', 'how', 'about', 'going', 'for', 'a', 'few', 'beers']
[151, 974, 31, 34, 75, 12, 5, 199, 3257, 155] --> ['say', 'jim', 'how', 'about', 'going', 'for', 'a', 'few', 'beers', 'after']
[151, 974, 31, 34, 75, 12, 5, 199, 3257, 155, 307] --> ['say', 'jim', 'how', 'about', 'going', 'for', 'a', 'few', 'beers', 'after', 'dinner']
[1, 44] --> ['you', 'know']
[1, 44, 13] --> ['you', 'know', 'that']
[1, 44, 13, 8] --> ['you', 'know', 'that', 'is']
[1, 44, 13, 8, 4476] --> ['you', 'know', 'that'

In [7]:
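## Padding
# Left-pad every sequence with zeros to the length of the longest one,
# so the word to predict is always in the last column
max_sequence_len = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
print(len(input_sequences[0]))
input_sequences[0]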

69


array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0, 151, 974], dtype=int32)

In [8]:
# Splitting in features and labels
X, y = input_sequences[:, :-1], input_sequences[:, -1]

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Shape of X: (989152, 68)
Shape of y: (989152,)
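
The effect of pre-padding followed by the feature/label split can be reproduced on a toy example without Keras. `pre_pad` is a hypothetical stand-in for `pad_sequences` with `padding='pre'`; it assumes every sequence already fits in `maxlen` and does not truncate.

```python
def pre_pad(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen (sequences are assumed to fit)."""
    return [[0] * (maxlen - len(seq)) + list(seq) for seq in sequences]

toy_sequences = [[151, 974], [151, 974, 31], [151, 974, 31, 34]]
maxlen = max(len(seq) for seq in toy_sequences)
padded = pre_pad(toy_sequences, maxlen)

# Features are all but the last column; the label is the last column
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

print(X_toy)  # [[0, 0, 151], [0, 151, 974], [151, 974, 31]]
print(y_toy)  # [974, 31, 34]
```

Padding on the left is what makes the slicing work: the target word always ends up in the final column, whatever the original sequence length.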


### **3. Model Architecture (Stacked LSTM)**
The model uses a Stacked LSTM architecture:

- **Embedding Layer (100-dim):** Learned vector representations of words.

- **LSTM Layer 1 (150 units):** Captures lower-level sequence patterns.

- **Dropout (0.2):** Prevents overfitting.

- **LSTM Layer 2 (100 units):** Captures higher-level semantic context.

- **Dense Layer (Softmax):** Predicts the probability of the next word.
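
As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the vocabulary size of 19,391 printed by the tokenization cell and the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); the helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

total_words = 19391  # value printed by the tokenization cell
layers = {
    "embedding": embedding_params(total_words, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 100),
    "dense": dense_params(100, total_words),
}

for name, count in layers.items():
    print(f"{name:>9}: {count:,}")
print(f"{'total':>9}: {sum(layers.values()):,}")
```

Under these assumptions the embedding and output layers account for roughly 94% of about 4.15 million parameters, so the vocabulary size dominates the model size far more than the LSTM widths do.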

In [9]:
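## Building LSTM
model = Sequential()
model.add(Input(shape=(max_sequence_len-1, )))
model.add(Embedding(input_dim=total_words, output_dim=100))
# return_sequences=True passes the full sequence of hidden states to the next LSTM
model.add(LSTM(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
# One output unit per vocabulary entry
model.add(Dense(total_words, activation='softmax'))

# Sparse loss because the labels in y are integer word indices, not one-hot vectors
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()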

### **4. Training with Checkpointing**
To make long training runs recoverable, I checkpoint manually: the model is trained in 5-epoch chunks, and its state is saved to disk after each chunk. This allows for:

1. Comparison of model performance at different stages (Epoch 5 vs Epoch 25).

2. Fault tolerance against session timeouts.
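
The chunk boundaries and checkpoint file names follow a simple pattern, which the sketch below generates. `chunk_schedule` is a hypothetical helper; the `models/model_{start}_{end}.keras` naming matches the files saved in this notebook.

```python
def chunk_schedule(total_epochs=25, chunk_size=5, directory="models"):
    """Yield (initial_epoch, end_epoch, checkpoint_path) for each training chunk."""
    for start in range(0, total_epochs, chunk_size):
        end = min(start + chunk_size, total_epochs)
        yield start, end, f"{directory}/model_{start}_{end}.keras"

for start, end, path in chunk_schedule():
    print(f"epochs {start + 1}-{end} -> {path}")
```

Each tuple corresponds to one call of the form `model.fit(..., initial_epoch=start, epochs=end)` followed by `model.save(path)`. When `initial_epoch` is set, Keras treats `epochs` as the index of the final epoch rather than the number of additional epochs.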

In [None]:
# Model Training
history = model.fit(X, y, epochs=5, batch_size=128, validation_split=0.2)
model.save('models/model_0_5.keras')

Epoch 1/5
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 146s 23ms/step - accuracy: 0.0792 - loss: 6.2945 - val_accuracy: 0.1580 - val_loss: 5.3108
Epoch 2/5
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 142s 23ms/step - accuracy: 0.1643 - loss: 5.1488 - val_accuracy: 0.1833 - val_loss: 5.0605
Epoch 3/5
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 142s 23ms/step - accuracy: 0.1848 - loss: 4.8464 - val_accuracy: 0.1946 - val_loss: 4.9569
Epoch 4/5
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 144s 23ms/step - accuracy: 0.1953 - loss: 4.6691 - val_accuracy: 0.1997 - val_loss: 4.9032
Epoch 5/5
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 149s 24ms/step - accuracy: 0.2019 - loss: 4.5441 - val_accuracy: 0.2063 - val_loss: 4.8758

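The `6183` steps per epoch in the log follow from the dataset size, the validation split, and the batch size. A quick check, assuming Keras holds out the last 20% of the samples and rounds the training share down:

```python
import math

n_samples = 989152        # rows in X, from the shape printed earlier
validation_split = 0.2
batch_size = 128

n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)  # 791321 197831 6183
```

Because `validation_split` takes the tail of the arrays without shuffling, the validation set here consists of the sequences built from the last lines of the corpus.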

In [None]:
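def train_chunk(path, initial):
    """Resume training from the checkpoint at `path` for 5 more epochs and save the result."""
    model = load_model(path)
    # With initial_epoch set, `epochs` is the index of the final epoch, not a count
    model.fit(X, y, epochs=initial+5, initial_epoch=initial, batch_size=128, validation_split=0.2)
    model.save(f'models/model_{initial}_{initial+5}.keras')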

In [None]:
train_chunk('models/model_0_5.keras', 5)

  saveable.load_own_variables(weights_store.get(inner_path))


Epoch 6/10
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 137s 22ms/step - accuracy: 0.2078 - loss: 4.4498 - val_accuracy: 0.2027 - val_loss: 5.0174
Epoch 7/10
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2021 - loss: 4.6968 - val_accuracy: 0.2005 - val_loss: 5.0572
Epoch 8/10
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2012 - loss: 4.8370 - val_accuracy: 0.2018 - val_loss: 5.0708
Epoch 9/10
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2026 - loss: 4.8878 - val_accuracy: 0.2034 - val_loss: 5.0650
Epoch 10/10
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2050 - loss: 4.9042 - val_accuracy: 0.2041 - val_loss: 5.0702


In [None]:
train_chunk('models/model_5_10.keras', 10)

Epoch 11/15
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 136s 22ms/step - accuracy: 0.2067 - loss: 4.9136 - val_accuracy: 0.2049 - val_loss: 5.0866
Epoch 12/15
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2072 - loss: 4.9252 - val_accuracy: 0.2060 - val_loss: 5.0631
Epoch 13/15
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2095 - loss: 4.9274 - val_accuracy: 0.2092 - val_loss: 5.0653
Epoch 14/15
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2123 - loss: 4.9292 - val_accuracy: 0.2090 - val_loss: 5.0792
Epoch 15/15
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 142s 22ms/step - accuracy: 0.2122 - loss: 4.9347 - val_accuracy: 0.2103 - val_loss: 5.0622


In [None]:
train_chunk('models/model_10_15.keras', 15)

Epoch 16/20
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 137s 22ms/step - accuracy: 0.2138 - loss: 4.9263 - val_accuracy: 0.2114 - val_loss: 5.0643
Epoch 17/20
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2148 - loss: 4.9247 - val_accuracy: 0.2120 - val_loss: 5.0606
Epoch 18/20
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 136s 22ms/step - accuracy: 0.2170 - loss: 4.9252 - val_accuracy: 0.2126 - val_loss: 5.0659
Epoch 19/20
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 134s 22ms/step - accuracy: 0.2178 - loss: 4.9155 - val_accuracy: 0.2118 - val_loss: 5.0855
Epoch 20/20
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 146s 22ms/step - accuracy: 0.2178 - loss: 4.9176 - val_accuracy: 0.2124 - val_loss: 5.1060


In [None]:
train_chunk('models/model_15_20.keras', 20)

Epoch 21/25
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 139s 22ms/step - accuracy: 0.2185 - loss: 4.9115 - val_accuracy: 0.2139 - val_loss: 5.0919
Epoch 22/25
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 136s 22ms/step - accuracy: 0.2195 - loss: 4.9046 - val_accuracy: 0.2132 - val_loss: 5.1001
Epoch 23/25
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2199 - loss: 4.8928 - val_accuracy: 0.2130 - val_loss: 5.0780
Epoch 24/25
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2201 - loss: 4.8821 - val_accuracy: 0.2140 - val_loss: 5.1233
Epoch 25/25
6183/6183 ━━━━━━━━━━━━━━━━━━━━ 135s 22ms/step - accuracy: 0.2221 - loss: 4.8710 - val_accuracy: 0.2160 - val_loss: 5.0571

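The validation metrics logged at the end of each chunk give a quick numeric comparison of the saved checkpoints. The numbers below are copied from the training output above; the dictionary layout is just one convenient way to hold them.

```python
# (val_accuracy, val_loss) at the last epoch of each chunk, from the logs above
checkpoint_metrics = {
    "model_0_5.keras": (0.2063, 4.8758),
    "model_5_10.keras": (0.2041, 5.0702),
    "model_10_15.keras": (0.2103, 5.0622),
    "model_15_20.keras": (0.2124, 5.1060),
    "model_20_25.keras": (0.2160, 5.0571),
}

best_by_loss = min(checkpoint_metrics, key=lambda name: checkpoint_metrics[name][1])
best_by_accuracy = max(checkpoint_metrics, key=lambda name: checkpoint_metrics[name][0])

print("lowest val_loss:     ", best_by_loss)
print("highest val_accuracy:", best_by_accuracy)
```

On these numbers the two criteria disagree: `model_0_5.keras` has the lowest validation loss, while `model_20_25.keras` has the highest validation accuracy.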

### **5. Model Evaluation & Comparison**
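We analyze how the model's predictions change by comparing greedy completions of the same prompt across checkpoints.

**Epoch 5 (Baby):** Expected to produce more repetitive or ungrammatical output.

**Epoch 25 (Graduate):** Expected to produce more coherent, context-aware output.

A single prompt gives only a qualitative impression. In the training logs, validation accuracy rises from 0.2063 (epoch 5) to 0.2160 (epoch 25), but validation loss is lower at epoch 5 (4.8758) than at epoch 25 (5.0571), so the later checkpoint is not better on every measure.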

In [None]:
model_files = [
    'model_0_5.keras',
    'model_5_10.keras',
    'model_10_15.keras',
    'model_15_20.keras',
    'model_20_25.keras'
]

def generate_text_comparison(model, seed_text, next_words):
    output_text = seed_text
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([output_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        probs = model.predict(token_list, verbose=0)[0]
        predicted_index = np.argmax(probs) # Using Greedy Search for comparison
        output_word = tokenizer.index_word.get(predicted_index, "")
        output_text += " " + output_word
    return output_text

print("ANALYZING MODEL EVOLUTION:\n")
test_phrase = "Hi how are"

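for filename in model_files:
    try:
        m = load_model(f'models/{filename}')
        pred = generate_text_comparison(m, test_phrase, 3)
        print(f"{filename} prediction: -> '{pred}'")
        del m
    except Exception as exc:
        # Report a missing or unreadable checkpoint instead of failing silently
        print(f"Skipping {filename}: {exc}")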

ANALYZING MODEL EVOLUTION:

model_0_5.keras prediction: -> 'Hi how are you doing this'
model_5_10.keras prediction: -> 'Hi how are you going to'
model_10_15.keras prediction: -> 'Hi how are you going to'
model_15_20.keras prediction: -> 'Hi how are you going to'
model_20_25.keras prediction: -> 'Hi how are you doing today'


### **6. Final Inference Engine**

For the production application, we use top-k sampling (a weighted random choice among the 3 most likely words) instead of greedy search. This introduces variation and helps keep the model from getting stuck in repetitive loops like "how are you how are you".
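
The difference between greedy decoding and sampling from the top candidates is easy to see on a toy distribution. The sketch below uses only the standard library; `sample_top_k` and its optional `temperature` argument are illustrative and not part of this notebook's code, which samples from the top 3 without any temperature rescaling.

```python
import random

def sample_top_k(probs, k=3, temperature=1.0, rng=random):
    """Sample an index from the k most probable entries of probs."""
    # Indices of the k largest probabilities, most probable first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it;
    # rng.choices normalizes the weights itself
    weights = [probs[i] ** (1.0 / temperature) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.05, 0.50, 0.30, 0.10, 0.05]

# Greedy decoding always returns the single most likely index
greedy = max(range(len(probs)), key=lambda i: probs[i])

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_top_k(probs, k=3, rng=rng) for _ in range(1000)]

print("greedy:", greedy)
print("sampled counts:", {i: samples.count(i) for i in sorted(set(samples))})
```

Greedy decoding picks index 1 every time, whereas the sampler spreads its choices over indices 1, 2, and 3 in proportion to their probabilities and never selects anything outside the top 3.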

In [None]:
def predict_smart(model, text, next_words=1):
    """Sample `next_words` words that continue `text` and return only the new words."""
    input_len = model.input_shape[1]
    generated = []
    for _ in range(next_words):
        # Feed the prompt plus everything generated so far back into the model
        context = " ".join([text] + generated)
        token_list = tokenizer.texts_to_sequences([context])[0]
        token_list = pad_sequences([token_list], maxlen=input_len, padding='pre')

        probs = model.predict(token_list, verbose=0)[0]

        # Top-k sampling: pick randomly among the 3 most likely words,
        # weighted by their renormalized probabilities, to break loops
        top_indices = probs.argsort()[-3:][::-1]
        top_probs = probs[top_indices] / np.sum(probs[top_indices])
        predicted_index = int(np.random.choice(top_indices, p=top_probs))

        word = tokenizer.index_word.get(predicted_index, "")
        if word:
            generated.append(word)
    return " ".join(generated)

# Load Final Model
final_model = load_model('models/model_20_25.keras')

# Test
from IPython.display import display, Markdown
prompt = input('Enter Input Text : ')
completion = predict_smart(final_model, prompt, next_words=1)
display(Markdown(f"Output: {prompt} **{completion}**"))

Enter Input Text : How are you


Output: How are you **doing**