<h2 style="color:blue">Electronics & ICT Academy National Institute of Technology, Warangal</h2>

### Course: Post Graduate Program in Machine Learning and Artificial Intelligence
#### Project: Building A Conversational Chatbot

<p>Aim: The aim of the project is to build an intelligent conversational chatbot, Riki, that can understand complex queries from the user and respond intelligently.</p>

### Import libraries

In [None]:
import re
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense,Embedding,LSTM,Input
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import pickle

#### Download GloVe Model

In this task, download the GloVe model provided by Stanford NLP. The model used here is the Twitter version, trained on 2 billion tweets (27 billion tokens), with an uncased vocabulary of 1.2 million words and vectors available in 25, 50, 100, and 200 dimensions. The download size is approximately 1.42 GB.

#### Steps to Download

Follow these steps to download the GloVe Twitter model:

1. *Access the Download Page:*
   - Visit the [Stanford NLP GloVe Project](https://nlp.stanford.edu/projects/glove/) page.

2. *Locate the Twitter Model:*
   - On the project page, scroll down until you find the section for the Twitter models.

3. *Download the Zip File:*
   - Click the download link for the Twitter model archive, `glove.twitter.27B.zip`.

4. *Wait for Download:*
   - Depending on your internet connection, the download may take some time due to the large size of the model.

5. *Extract the Model:*
   - Once the download is complete, extract the contents of the downloaded ZIP file to access the GloVe model files.

6. *Use the GloVe Model:*
   - The GloVe vectors can now be used in the NLP project, for example to initialize word embeddings.

In [None]:
!wget https://nlp.stanford.edu/data/glove.twitter.27B.zip

In [None]:
!unzip /content/glove.twitter.27B.zip -d glove_model

#### Load GloVe Word Embeddings into a Dictionary

In this task, demonstrate how to load GloVe word embeddings into a Python dictionary, where each unique word token serves as the key, and the associated value is a d-dimensional vector. This allows you to efficiently access word vectors for natural language processing tasks.

#### Prerequisites

Before proceeding, make sure you have:

- Downloaded and extracted the GloVe model as described in the Download GloVe Model section above.

In [None]:
# Load GloVe word embeddings (adjust the path and dimension as needed)
glove_file = 'glove_model/glove.twitter.27B.100d.txt'
embedding_dim = 100

def load_glove_embeddings(file_path, embedding_dim):
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

glove_embeddings_index = load_glove_embeddings(glove_file, embedding_dim)
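To make the file layout concrete: each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines with the standard library only. The words, the 4-dimensional values, and the helper name `parse_glove_line` are illustrative assumptions, not content from the real file.

```python
def parse_glove_line(line):
    """Split one GloVe-format line into (word, vector)."""
    values = line.split()
    word = values[0]
    vector = [float(v) for v in values[1:]]
    return word, vector


# Made-up sample lines in the "word v1 v2 ... vd" layout (d = 4 here).
toy_glove_lines = [
    "hello 0.1 -0.2 0.3 0.4",
    "world 0.5 0.6 -0.7 0.8",
]

# Build a small word -> vector dictionary, as the loader above does at scale.
toy_glove_index = dict(parse_glove_line(line) for line in toy_glove_lines)

print(toy_glove_index["hello"])
print(len(toy_glove_index["world"]))
```

A word that is missing from the dictionary can be detected with `toy_glove_index.get(word)`, which returns `None` instead of raising an error.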

### Data Preparation

In this step, prepare the dataset for training a dialogue model. This involves:

1. Filtering the conversations to a maximum word length.
2. Converting the dialogue pairs into input text and target text.
3. Adding special start and end tokens to recognize the beginning and end of each sentence.

Let's proceed with these steps to ensure the data is ready for training our dialogue model.
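As a rough illustration of the three steps, the sketch below works on two sample records written in the ` +++$+++ `-separated layout that the loaders in this notebook assume. The sample strings and variable names are illustrative, and `ast.literal_eval` is used here as an alternative way to read the ID list, not the method used in the cells that follow.

```python
import ast

SEPARATOR = " +++$+++ "

# Sample records in the assumed layout: the utterance text is the last field
# of a line record, and the list of line IDs is the last field of a
# conversation record.
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
sample_conversation = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# Parse a line record into its ID and text.
line_parts = sample_line.split(SEPARATOR)
sample_line_id, sample_utterance = line_parts[0], line_parts[-1]

# Parse the conversation record; the last field is a Python-style list literal.
sample_line_ids = ast.literal_eval(sample_conversation.split(SEPARATOR)[-1])

# Consecutive line IDs form (input, target) pairs.
sample_pairs = list(zip(sample_line_ids, sample_line_ids[1:]))

# Targets are wrapped in start and end tokens.
sample_tagged_target = f"<start> {sample_utterance} <end>"

print(sample_pairs)
print(sample_tagged_target)
```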

In [None]:
# Load Conversations and Lines from Files
def load_conversations():
    conversations = []
    with open('cornell_movie_dialogs_corpus/movie_conversations.txt', 'r', encoding='utf-8', errors='ignore') as conv_file:
        for line in conv_file:
            conversation = line.strip().split(" +++$+++ ")[-1][1:-1].replace("'", "").split(', ')
            conversations.append(conversation)
    return conversations

def load_lines():
    lines = {}
    with open('cornell_movie_dialogs_corpus/movie_lines.txt', 'r', encoding='utf-8', errors='ignore') as lines_file:
        for line in lines_file:
            parts = line.strip().split(" +++$+++ ")
            lines[parts[0]] = parts[-1]
    return lines

# Filter Conversations by Maximum Word Length
def filter_conversations(conversations, max_length):
    filtered_conversations = []
    for conversation in conversations:
        if all(len(re.findall(r'\w+', line)) <= max_length for line in conversation):
            filtered_conversations.append(conversation)
    return filtered_conversations

# Convert Dialogue Pairs into Input and Target Texts
def convert_to_input_target(conversations, lines):
    input_texts = []
    target_texts = []
    for conversation in conversations:
        for i in range(len(conversation) - 1):
            input_line_id = conversation[i]
            target_line_id = conversation[i + 1]
            
            input_text = lines.get(input_line_id, "")  # Get the text associated with the line ID
            target_text = lines.get(target_line_id, "")  # Get the text associated with the line ID
            
            if input_text and target_text:
                input_texts.append(input_text)
                target_texts.append(target_text)
    return input_texts, target_texts

# Add Start and End Tokens
def add_start_end_tokens(texts):
    start_token = '<start>'
    end_token = '<end>'
    return [f'{start_token} {text} {end_token}' for text in texts]

# Define your max_word_length
max_word_length = 15  # Adjust as needed

# Load conversations and lines from the dataset
conversations = load_conversations()
lines = load_lines()

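# Filter conversations by maximum word length.
# Each conversation holds line IDs, so resolve them to text before counting words.
filtered_conversations = [
    conversation for conversation in conversations
    if filter_conversations(
        [[lines.get(line_id, "") for line_id in conversation]], max_word_length
    )
]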

# Convert dialogue pairs into input and target texts
input_texts, target_texts = convert_to_input_target(filtered_conversations, lines)

# Add start and end tokens to target texts
target_texts = add_start_end_tokens(target_texts)

In [None]:
# Keep only pairs whose character lengths fall within fixed ranges
# (the target length includes the start and end tokens)
sorted_input_texts = []
sorted_target_texts = []

for i in range(len(input_texts)):
    if 10 < len(input_texts[i]) < 50 and 20 < len(target_texts[i]) < 60:
        sorted_input_texts.append(input_texts[i])
        sorted_target_texts.append(target_texts[i])

### Create Dictionaries and Save to Disk

In this step, create two dictionaries:
1. `target_word2id`: A dictionary where the keys are unique target words, and the values are their corresponding IDs.
2. `target_id2word`: A dictionary where the keys are target word IDs, and the values are the corresponding words.

Save these dictionaries in the NumPy file format on the disk for future use.

Let's proceed with creating and saving these dictionaries.

In [None]:
# Create target_word2id and target_id2word dictionaries
target_word2id = {}
target_id2word = {}
target_words = list(set(word for text in sorted_target_texts for word in text.split()))

for idx, word in enumerate(target_words):
    target_word2id[word] = idx
    target_id2word[idx] = word

# Size of the vocabulary
vocab_size = len(target_word2id)

# Convert dictionaries to NumPy arrays
target_word2id_array = np.array(list(target_word2id.items()), dtype=object)
target_id2word_array = np.array(list(target_id2word.items()), dtype=object)

# Define the file paths to save the NumPy arrays
target_word2id_file = 'target_word2id.npy'
target_id2word_file = 'target_id2word.npy'

# Save the NumPy arrays to disk
np.save(target_word2id_file, target_word2id_array)
np.save(target_id2word_file, target_id2word_array)
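When these files are read back in a later session, the object arrays have to be converted into dictionaries again. The following is a minimal sketch of that round trip on a toy vocabulary. The toy words, the temporary file, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

toy_word2id = {"<start>": 0, "hello": 1, "<end>": 2}

# Store the (key, value) pairs as an object array, as done above.
toy_pairs = np.array(list(toy_word2id.items()), dtype=object)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_word2id.npy")
    np.save(toy_path, toy_pairs)
    # Object arrays are stored with pickle, so loading needs allow_pickle=True.
    loaded_pairs = np.load(toy_path, allow_pickle=True)

# Rebuild both lookup directions from the loaded pairs.
restored_word2id = {word: int(idx) for word, idx in loaded_pairs}
restored_id2word = {idx: word for word, idx in restored_word2id.items()}

print(restored_word2id)
print(restored_id2word)
```

Only load pickled arrays from files you created yourself, since unpickling untrusted data can execute arbitrary code.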

In [None]:
# The input data is a list of sentences, each given as a whitespace-separated string.

# Tokenize sentences for inputs
tokenizer_input = Tokenizer(filters="")
tokenizer_input.fit_on_texts(sorted_input_texts)  # sorted_input_texts contains the filtered input sentences

# Map words to GloVe embeddings
num_words = len(tokenizer_input.word_index) + 1  # Add 1 for the padding token
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer_input.word_index.items():
    embedding_vector = glove_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Convert sentences to sequences of word indices
input_sequences = tokenizer_input.texts_to_sequences(sorted_input_texts)

# Pad input sequences to a fixed length (you can adjust the maxlen)
padded_input_sequences = pad_sequences(input_sequences, maxlen=20, padding='post')

In [None]:
# Save tokenizer_input to a file
with open('tokenizer_input.pkl', 'wb') as tokenizer_file:
    pickle.dump(tokenizer_input, tokenizer_file)
    
# Load tokenizer_input from a file
# with open('tokenizer_input.pkl', 'rb') as tokenizer_file:
#     tokenizer_input = pickle.load(tokenizer_file)

In [None]:
# Tokenize sentences for Outputs
tokenizer_target = Tokenizer(filters="")
tokenizer_target.fit_on_texts(sorted_target_texts)

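# Map words to GloVe embeddings for targets.
# Use a separate matrix: target indices differ from input indices, so writing
# into the input `embedding_matrix` would overwrite or overflow its rows.
target_embedding_matrix = np.zeros((len(tokenizer_target.word_index) + 1, embedding_dim))
for word, i in tokenizer_target.word_index.items():
    embedding_vector = glove_embeddings_index.get(word)
    if embedding_vector is not None:
        target_embedding_matrix[i] = embedding_vector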

# Convert sentences to sequences of word indices
target_sentences = [s.split() for s in sorted_target_texts]  # Split target sentences into words
target_sequences = tokenizer_target.texts_to_sequences(target_sentences)

# Pad target sequences to a fixed length (you can adjust the maxlen)
padded_target_sequences = pad_sequences(target_sequences, maxlen=20, padding='post')

In [None]:
# Save tokenizer_target to a file
with open('tokenizer_target.pkl', 'wb') as tokenizer_file:
    pickle.dump(tokenizer_target, tokenizer_file)
    
# Load tokenizer_target from a file
# with open('tokenizer_target.pkl', 'rb') as tokenizer_file:
#     tokenizer_target = pickle.load(tokenizer_file)

In [None]:
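# One-hot encode the decoder targets: the decoder inputs shifted one step left,
# so that each position is trained to predict the next token.
num_target_tokens = len(tokenizer_target.word_index) + 1  # +1 for padding
shifted_targets = np.zeros_like(padded_target_sequences)
shifted_targets[:, :-1] = padded_target_sequences[:, 1:]
target_sequences_one_hot = to_categorical(shifted_targets, num_classes=num_target_tokens)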
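In a typical teacher-forced sequence-to-sequence setup, the decoder reads the target sequence and is trained to predict the same sequence shifted one step to the left. The sketch below shows that shift and a one-hot encoding on a toy sequence of token IDs. The IDs, the vocabulary size of 6, and the helper names are hypothetical.

```python
def shift_left(sequence, pad_id=0):
    """Drop the first token and pad at the end to keep the length fixed."""
    return sequence[1:] + [pad_id]


def one_hot(token_id, num_classes):
    """Return a list with a single 1 at position token_id."""
    vector = [0] * num_classes
    vector[token_id] = 1
    return vector


# Toy IDs: 0 = padding, 1 = <start>, 2 = <end>, 4 and 5 = ordinary words.
toy_decoder_input = [1, 4, 5, 2, 0, 0]
toy_decoder_target = shift_left(toy_decoder_input)
toy_target_one_hot = [one_hot(token_id, 6) for token_id in toy_decoder_target]

print(toy_decoder_target)
print(toy_target_one_hot[0])
```

At position 0 the decoder sees `<start>` and is expected to output the first real word, and so on until `<end>`.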

### Generating Training Data per Batch

In this step, training data will be generated in batches.

The dataset will be divided into smaller batches. During each training iteration, one batch of data will be fed to the model for optimization.

In [None]:
def _data_generator(x, y, batch_size):
    """Generates batches of vectorized texts for training/validation.

    # Arguments
        x: np.ndarray, feature matrix.
        y: np.ndarray, labels.
        batch_size: int, number of samples per batch.

    # Returns
        Yields feature and label data in batches.
    """
    num_samples = x.shape[0]
    num_batches = num_samples // batch_size
    if num_samples % batch_size:
        num_batches += 1

    while True:
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = (i + 1) * batch_size
            if end_idx > num_samples:
                end_idx = num_samples
            x_batch = x[start_idx:end_idx]
            y_batch = y[start_idx:end_idx]
            yield x_batch, y_batch

# Create the training generator (batch size 32).
training_generator = _data_generator(
    padded_input_sequences, padded_target_sequences, 32)

# Get the number of training steps. This indicates the number of steps it takes
# to cover all samples in one epoch.
steps_per_epoch = padded_input_sequences.shape[0] // 32
if padded_input_sequences.shape[0] % 32:
    steps_per_epoch += 1
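The batching arithmetic can be checked on a small example. The sketch below splits ten toy samples into batches of four and computes the number of steps per epoch with a ceiling division. The sample data and the `toy_` names are assumptions for illustration.

```python
import math


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


toy_samples = list(range(10))
toy_batch_size = 4

toy_batches = list(iter_batches(toy_samples, toy_batch_size))

# One extra step is needed when the last batch is only partially filled.
toy_steps_per_epoch = math.ceil(len(toy_samples) / toy_batch_size)

print(toy_batches)
print(toy_steps_per_epoch)
```

The final batch holds the two leftover samples, which is why the step count is rounded up rather than down.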

### Model Architecture Overview

This section explains how our model is structured for sequence-to-sequence tasks:

#### Step 1: LSTM Encoder
- Use an LSTM encoder to read the input words. It produces the encoder outputs together with the final hidden state and cell state (the encoder context), which summarize the input.

#### Step 2: LSTM Decoder
- Use an LSTM decoder to generate the target words. It is initialized with the encoder's states and produces the decoder outputs, hidden state, and cell state.

#### Step 3: Prediction with Dense Layer
- To predict the next word from our vocabulary, we use a dense layer. This step is crucial for creating meaningful sequences.

#### Step 4: Loss and Optimization
- For training the model, choose the 'categorical_crossentropy' loss function and the 'rmsprop' optimizer. This helps improve the model's accuracy during training.

With these steps in place, the model is set up to learn and generate sequences.
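To illustrate Steps 3 and 4 on a small scale, the sketch below applies a softmax to three toy scores, picks the most likely token, and computes the categorical cross-entropy against a one-hot target. It uses only the standard library. The scores, the 3-word vocabulary, and the function names are illustrative assumptions, not the Keras implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(
        t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t
    )


toy_logits = [2.0, 1.0, 0.1]  # dense-layer scores for a 3-word vocabulary
toy_probs = softmax(toy_logits)
toy_one_hot = [1, 0, 0]       # the true next word has ID 0

toy_loss = categorical_crossentropy(toy_one_hot, toy_probs)
toy_predicted_id = max(range(len(toy_probs)), key=toy_probs.__getitem__)

print(toy_probs)
print(toy_predicted_id, toy_loss)
```

The loss shrinks toward zero as the probability assigned to the true word approaches 1, which is what the optimizer works toward during training.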

In [None]:
max_source_seq_length = 20 #max(len(seq) for seq in source_sequences)
max_target_seq_length = 20 #max(len(seq) for seq in target_sequences)

# Define the model
latent_dim = 256  # Dimensionality of the LSTM hidden state

# Encoder
encoder_inputs = Input(shape=(max_source_seq_length,))
encoder_embedding = Embedding(
    input_dim=num_words,  # Input vocabulary size, including the padding index 0
    output_dim=embedding_dim,  # Same dimension as the GloVe embeddings (100d)
    weights=[embedding_matrix],  # Row i is the GloVe vector for tokenizer_input index i
    trainable=False,  # Freeze the embeddings
    input_length=max_source_seq_length,
)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

In [None]:
# Decoder
decoder_inputs = Input(shape=(None,))  # Variable length, so the decoder can also run one token at a time at inference
decoder_embedding = Embedding(input_dim=num_target_tokens, output_dim=latent_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

In [None]:
# Dense layer for prediction
decoder_dense = Dense(num_target_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
# Create the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

### Model Summary

In this section, an overview of the model's structure and parameters is presented:

In [None]:
model.summary()

In [None]:
# Train the model
batch_size = 32
epochs = 200
history = model.fit(
    [padded_input_sequences, padded_target_sequences],
    target_sequences_one_hot,
    batch_size=batch_size,
    epochs=epochs,
)

In [None]:
# Save the trained model
model.save('chatbot')

### Generating Predictions

In this final step, an operational model generates predictions. After completing the training and setup phases, this is where the model's learned knowledge is applied. It takes input data and produces predictions based on its training.

Let's proceed to generate predictions using the trained model.

In [None]:
# Load the trained model
from tensorflow import keras
model = keras.models.load_model("chatbot")

In [None]:
# Define the encoder inputs, embedding, and LSTM layers
encoder_inputs = model.input[0]  # Input layer for encoder
decoder_inputs = model.input[1]
encoder_embedding = model.layers[2]  # Embedding layer for encoder
encoder_lstm = model.layers[4]  # LSTM layer for encoder

# Define the encoder model for inference
encoder_outputs, state_h_enc, state_c_enc = encoder_lstm(encoder_embedding(encoder_inputs))
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

# Define the decoder inputs and states
decoder_lstm = model.layers[5]  # LSTM layer for decoder
decoder_states_inputs = [
    keras.Input(shape=(latent_dim,)),  # Initial state for LSTM (state_h)
    keras.Input(shape=(latent_dim,))   # Initial state for LSTM (state_c)
]

# Define the decoder embedding layer (should match the architecture)
decoder_embedding = model.layers[3]  # Embedding layer for decoder

# Connect the decoder LSTM to the inputs and states
decoder_outputs_and_states = decoder_lstm(
    decoder_embedding(decoder_inputs),  # Embedding layer for decoder
    initial_state=decoder_states_inputs
)
decoder_outputs, state_h_dec, state_c_dec = decoder_outputs_and_states
decoder_states = [state_h_dec, state_c_dec]

# Define the decoder dense layer (should match the architecture)
decoder_dense = model.layers[6]  # Dense layer for decoder

# Connect the dense layer to the decoder LSTM
decoder_outputs = decoder_dense(decoder_outputs)

# Define the decoder model for inference
decoder_model = keras.Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)


In [None]:
def predict_sentence(input_sentence):
    # Encode the input sentence with the tokenizer fitted on the input texts
    input_seq = tokenizer_input.texts_to_sequences([input_sentence])[0]
    input_seq = pad_sequences([input_seq], maxlen=max_source_seq_length, padding='post')

    states = encoder_model.predict(input_seq)

    # Initialize the target sequence with the start token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer_target.word_index['<start>']
    predicted_words = []

    # Cap the number of decoding steps so the loop always terminates
    for _ in range(max_target_seq_length):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))

        # Index 0 is the padding token and has no entry in index_word
        if sampled_token_index == 0:
            break
        sampled_word = tokenizer_target.index_word[sampled_token_index]
        if sampled_word == '<end>':
            break
        if sampled_word != '<start>':
            predicted_words.append(sampled_word)

        # Feed the sampled token and the updated states into the next step
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states = [h, c]

    return ' '.join(predicted_words)
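The loop in `predict_sentence` is a form of greedy decoding: at every step it keeps the single highest-scoring token and feeds it back in. The sketch below isolates that idea with a hypothetical stand-in for the trained decoder, a fixed score table over a 5-token vocabulary. It omits the LSTM states that the real decoder carries between steps, and all names and scores are illustrative.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_steps):
    """Pick the highest-scoring token repeatedly until end_id or max_steps."""
    generated = []
    current = start_id
    for _ in range(max_steps):
        scores = next_token_scores(current)
        current = max(range(len(scores)), key=scores.__getitem__)
        if current == end_id:
            break
        generated.append(current)
    return generated


# Hypothetical decoder stand-in: scores for the next token, keyed by the
# previous token ID (1 = <start>, 2 = <end>, 3 and 4 = ordinary words).
TOY_SCORES = {
    1: [0.0, 0.0, 0.1, 0.8, 0.1],    # after <start>, token 3 scores highest
    3: [0.0, 0.0, 0.2, 0.1, 0.7],    # after token 3, token 4 scores highest
    4: [0.0, 0.0, 0.9, 0.05, 0.05],  # after token 4, <end> scores highest
}

toy_reply_ids = greedy_decode(TOY_SCORES.__getitem__, start_id=1, end_id=2, max_steps=10)
print(toy_reply_ids)
```

The `max_steps` cap matters in practice: a model that never emits the end token would otherwise keep generating indefinitely.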

In [None]:
# Example usage of the prediction function
input_sentence = "Hello"
predicted_sentence = predict_sentence(input_sentence)
print("Input Sentence:", input_sentence)
print("Predicted Sentence:", predicted_sentence)

In [None]:
# Example usage of the prediction function
input_sentence = "i am not feeling well."
predicted_sentence = predict_sentence(input_sentence)
print("Input Sentence:", input_sentence)
print("Predicted Sentence:", predicted_sentence)

In [None]:
# Example usage of the prediction function
input_sentence = "Sorry."
predicted_sentence = predict_sentence(input_sentence)
print("Input Sentence:", input_sentence)
print("Predicted Sentence:", predicted_sentence)