# **Movie Glossary Bot**

###**Goal** - Build a chatbot that can carry out multi-turn conversations, adapt to context, and handle a variety of topics.

###**Output** - A web or app interface where users can converse with the chatbot.

## Imports

In [46]:
import re
import random
import nltk
from nltk.tokenize import word_tokenize
from langdetect import detect
from transformers import BertTokenizer
import kagglehub
from transformers import pipeline
import pandas as pd
import os
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import torch
from google.colab import files
import shutil
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import zipfile
from transformers import Trainer, TrainingArguments

###Required Libraries

In [3]:
!pip install langdetect nltk transformers

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/981.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m276.5/981.5 kB[0m [31m8.1 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m972.8/981.5 kB[0m [31m17.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25l[?25hdone
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=216db2a7e74861a76300f84ec6ac43b08290e57d6903f87ee817e4a4da3aaeb3
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711

In [4]:
!pip install wandb



In [5]:
# The NLTK tokenizer and necessary models
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
# Initialize a BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



###Training/Loading a Pre-trained Chatbot Model

We'll use Hugging Face's GPT-2 model to handle the chatbot conversations.

In [9]:
# Load a pre-trained GPT-2 model for text generation
chatbot = pipeline("text-generation", model="gpt2")

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [10]:
response = chatbot("Hello, how are you?", max_length=50)
print(response)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello, how are you? What\'s going on? What do you want from me? You seem pretty worried. Are you serious? No?" I thought about him. He stared at me down, his eyes fluttering and his lips parted. Then'}]


This code returns a response generated by GPT-2 for a given input prompt.

##**Flask API - Chatbot Conversations**

We set up a Flask app that will serve as the backend API to handle the chatbot conversations.

As we are using Google Colab for this project, we need to run Flask via ngrok to expose it to the web.

## **Dataset Readme**

####**File Descriptions**

- **movie_titles_metadata.txt** - Contains information about each movie title.

Fields -
```
- movieID,
- movie title,
- movie year,
- IMDB rating,
- nos. of IMDB votes,
- genres in the format ['genre1','genre2',......'genreN']
```



- **movie_characters_metadata.txt** - Contains information about each movie character.

Fields -
```
- characterID
- character name
- movieID
- movie title
- gender ("?" for unlabeled cases)
- position in credits ("?" for unlabeled cases)
```



- **movie_lines.txt** - Contains the actual text of each utterance.

Fields -
```
- lineID
- characterID (who uttered this phrase)
- movieID
- character name
- text of the utterance
```



- **movie_conversations.txt** - The structure of the conversations.

Fields -
```
- characterID of the first character involved in the conversation
- characterID of the second character involved in the conversation
- movieID of the movie in which the conversation occurred
- list of the utterances that make the conversation, in chronological
	order: ['lineID1','lineID2',...,'lineIDN'] has to be matched with
  movie_lines.txt to reconstruct the actual content
```

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
	- *genres*
	- *release year*
	- *IMDB rating*
	- *number of IMDB votes*
	- *IMDB rating*
- character metadata included:
	- *gender (for 3,774 characters)*
	- *position on movie credits (3,321 characters)*



In [10]:
# Download latest version movie corpus dataset
path = kagglehub.dataset_download("rajathmc/cornell-moviedialog-corpus")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/rajathmc/cornell-moviedialog-corpus?dataset_version_number=1...


100%|██████████| 9.58M/9.58M [00:00<00:00, 122MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/rajathmc/cornell-moviedialog-corpus/versions/1


## **Data Selection**

####We will train our model using conversational exchanges from the dataset, specifically focusing on -

##### **(movie_lines.txt) - Movie lines:** This contains the actual text of dialogues, which forms the conversational data.

##### **(movie_conversations.txt) - Movie conversations:** This file defines how the dialogues are structured, linking specific lines together in a conversation.

Additional metadata like character details **(movie_characters_metadata.txt)** and movie metadata **(movie_titles_metadata.txt)** can be useful for context.


Given the large size of the dataset (over 300,000 utterances), we limit the data to a manageable size for faster training. We had multiple options like -



*   *Number of movies: Select a few movies to train the chatbot on.*
*   *Number of conversations: Limit the number of conversations per movie.*


####***For the purpose of this project we will -***

Load Character Metadata - The code reads the character metadata from movie_characters_metadata.txt and creates a dictionary mapping character IDs to their names.

Modify Dialogue Dictionary - While populating line_dict, each entry now stores both the character ID and the dialogue text.

Reconstruct Conversations - When reconstructing each conversation, it retrieves the character name for each utterance from the line_dict and includes it in the final output format.

Small Subset - Limit the number of conversations to 100 for faster processing

Output - The final printed conversations now show the character names alongside their dialogues, as well as the movie titles.

In [11]:
# Load movie titles metadata to create a movie ID to title mapping
movie_titles = {}
with open('/root/.cache/kagglehub/datasets/rajathmc/cornell-moviedialog-corpus/versions/1/movie_titles_metadata.txt', 'r', encoding='utf-8', errors='replace') as file:
    titles = file.readlines()
    for title in titles:
        parts = title.split(" +++$+++ ")
        movie_id = parts[0]
        movie_name = parts[1].strip()  # Movie title
        movie_titles[movie_id] = movie_name

# Load character metadata to create a character ID to name mapping
character_names = {}
with open('/root/.cache/kagglehub/datasets/rajathmc/cornell-moviedialog-corpus/versions/1/movie_characters_metadata.txt', 'r', encoding='utf-8', errors='replace') as file:
    characters = file.readlines()
    for character in characters:
        parts = character.split(" +++$+++ ")
        character_id = parts[0]
        character_name = parts[1].strip()  # Character name
        character_names[character_id] = character_name

# Load the conversation structure
with open('/root/.cache/kagglehub/datasets/rajathmc/cornell-moviedialog-corpus/versions/1/movie_conversations.txt', 'r', encoding='utf-8', errors='replace') as file:
    conversations = file.readlines()

# Limit the number of conversations to 100 for faster processing
limited_conversations = random.sample(conversations, 100)

# Load movie lines to map conversation line IDs to actual text
with open('/root/.cache/kagglehub/datasets/rajathmc/cornell-moviedialog-corpus/versions/1/movie_lines.txt', 'r', encoding='utf-8', errors='replace') as file:
    lines = file.readlines()

# Create a dictionary of lineID to text
line_dict = {}
for line in lines:
    parts = line.split(" +++$+++ ")
    line_id = parts[0]
    character_id = parts[1]  # Get character ID
    dialogue = parts[-1].strip()
    line_dict[line_id] = (character_id, dialogue)  # Store character ID with dialogue

# Reconstruct the conversations using the limited dataset and include movie names and character names
limited_conversation_data = []
for conv in limited_conversations:
    parts = conv.split(" +++$+++ ")
    movie_id = parts[2]  # Movie ID from conversation
    line_ids = eval(parts[-1])  # List of line IDs for this conversation

    # Get movie name using the movie ID
    movie_name = movie_titles.get(movie_id, "Unknown Movie")

    # Retrieve the dialogue for the conversation
    conv_text = []
    for line_id in line_ids:
        if line_id in line_dict:
            character_id, dialogue = line_dict[line_id]
            character_name = character_names.get(character_id, "Unknown Character")
            conv_text.append(f"{character_name}: {dialogue}")

    # Append movie name and conversation to the list
    limited_conversation_data.append({
        'movie': movie_name,
        'conversation': conv_text
    })

# Print a few limited conversations with movie names and character names
for conversation in limited_conversation_data[:3]:
    print(f"Movie: {conversation['movie']}")
    print("Conversation:")
    for utterance in conversation['conversation']:
        print(f" - {utterance}")
    print()


Movie: the hustler
Conversation:
 - EDDIE: And them fingers, them chubby fingers. And that stroke. It's like he's, uh, like he's playing a violin or something.
 - FATS: Nine ball.  Three ball.

Movie: jennifer eight
Conversation:
 - ROSS: Did I say he did?
 - BERLIN: You looked like you did?
 - ROSS: No, I think you'll find I looked like he could have? By accident even? He's up here spraying the scenery all day.
 - BERLIN: He didn't shoot it, Ross. And no way by accident. There's a flash-burn. It was point-blank.

Movie: star trek: first contact
Conversation:
 - PICARD: Oh...  yes... ultraviolet protection. Thank you. Mister...?
 - SCRIMM: Lieutenant, actually. Lieutenant Jonathan Scrimm. I'm the head of the Resurrection Protective Force.  And you are?
 - PICARD: Jean-Luc Picard.
 - SCRIMM: Great name. French?
 - PICARD: Yes.
 - SCRIMM: You don't sound French.



## **Test Set - Collection**

Next, we will create a test set for our model evaluation later, and preprocess it similar to our training set.

### Our Approach -

Unseen Conversations from the Same Movies: Since all the movies were already used for training, we decided to build the test set using conversations from those same movies but filtering out any conversations that were used during training.

This ensured that the test set consisted of conversations the model hadn't seen during training, thus providing a valid evaluation while avoiding bias from overfitting to the training data.

### Why Unseen Conversations for Testing?

Avoid Testing on Training Data: By filtering out conversations used in training, we made sure the test set wasn't just a repetition of the training data, which would have given misleadingly high accuracy.

Realistic Evaluation: Using unseen conversations from the same movies allowed us to test the model's generalization ability while staying within the same context.

In [21]:
all_conversations = conversations

# Convert limited conversations to a set for faster lookup
limited_conversation_set = set(limited_conversations)

# Identify conversations that were not used during training
unused_conversations = [conv for conv in all_conversations if conv not in limited_conversation_set]

# Check how many unused conversations we have
num_unused_conversations = len(unused_conversations)
print(f"Number of unused conversations: {num_unused_conversations}")
print(f"Number of used conversations for training: {len(limited_conversation_set)}")

# Define how many conversations you want for our test set
num_test_conversations = int(0.2 * len(limited_conversation_set))

if num_unused_conversations < num_test_conversations:
    print("Not enough unused conversations available for the test set.")
elif num_test_conversations <= 0:
    print("No conversations available for testing.")
else:
    # Randomly select conversations for the test set
    test_conversations = random.sample(unused_conversations, num_test_conversations)
    print(f"Number of used conversations for testing: {len(test_conversations)}")

Number of unused conversations: 82997
Number of used conversations for training: 100
Number of used conversations for testing: 20


In [30]:
# Reconstruct the test conversations using the unused dataset
test_conversation_data = []
for conv in test_conversations:
    parts = conv.split(" +++$+++ ")
    movie_id = parts[2]  # Movie ID from conversation
    line_ids = eval(parts[-1])  # List of line IDs for this conversation

    # Get movie name using the movie ID
    movie_name = movie_titles.get(movie_id, "Unknown Movie")

    # Retrieve the dialogue for the conversation
    conv_text = []
    for line_id in line_ids:
        if line_id in line_dict:
            character_id, dialogue = line_dict[line_id]
            character_name = character_names.get(character_id, "Unknown Character")
            conv_text.append(f"{character_name}: {dialogue}")

    # Append movie name and conversation to the list
    test_conversation_data.append({
        'movie': movie_name,
        'conversation': conv_text
    })

# Print a few test conversations with movie names and character names
print("Test Conversations:")
for conversation in test_conversation_data[:3]:
    print(f"Movie: {conversation['movie']}")
    print("Conversation:")
    for utterance in conversation['conversation']:
        print(f" - {utterance}")
    print()

Test Conversations:
Movie: hider in the house
Conversation:
 - PHIL: Sweetheart, this is a very risky time for me right now.  Maybe you don't appreciate that.
 - JULIE: I don't care, Philip.  You want to go chasing Barbara Zelman, go ahead.  Just watch out for those buck teeth.
 - PHIL: Barbara <u>Zelman</u>?  I don't believe this!
 - JULIE: Do you usually pay for Charlie? At "Trattoria <u>Valentino</u>"?
 - PHIL: Honey, I can't track of all the meals Charlie and I have been having.  This is a delicate time. If it leaks out that I'm jumping ship before I'm set up someplace else I could be out on my ear before I'm ready with nothing. With <u>nothing</u>.
 - JULIE: There <u>are</u> people who do things because they <u>want</u> to get caught.

Movie: the horse whisperer
Conversation:
 - TOM: You're looking fit.
 - RONA: Fit? You want to check my teeth.  Good crowd today. I think you'll have some fun. You going to stay for dinner?
 - TOM: If it's not too much trouble, I thought I might.
 -

## **Data Preprocess**

Preprocessing Functions -

**preprocess_text** - This function handles text cleaning and tokenization.

**process_conversations** - This function processes each dialogue, detecting its language and applying the appropriate preprocessing.

**Example Usage** - The code processes the conversations and prints the original dialogues along with their tokenized versions.



In [31]:
def preprocess_text(text):
    """Preprocess the input text."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Adjust regex if needed
    tokens = bert_tokenizer.tokenize(text)
    return tokens

def process_conversations_with_context(conversation_data):
    """Process the conversations and manage context."""
    processed_data = []

    for conversation in conversation_data:
        movie = conversation['movie']
        dialogues = conversation['conversation']

        previous_utterances = []  # To store context for each dialogue

        for dialogue in dialogues:
            # Detect the language of the dialogue
            language = detect(dialogue)

            # Add the dialogue and context
            previous_utterances.append(dialogue)

            # Create a context entry, no tokenization here
            processed_data.append({
                'movie': movie,
                'original': dialogue,  # The original dialogue (untokenized)
                'context': previous_utterances.copy(),  # Copy of the previous context
                'language': language  # Detected language
            })

    return processed_data

# Example usage with limited_conversation_data
processed_conversations_with_context = process_conversations_with_context(limited_conversation_data)

print("Example Training Set:")
print()
# Print a few processed dialogues with context
for item in processed_conversations_with_context[:3]:
    print(f"Movie: {item['movie']}")
    print(f"Original: {item['original']}")
    print(f"Context: {item['context']}")
    print(f"Language: {item['language']}")
    print()

# Example usage with limited_conversation_data
processed_test_conversations_with_context = process_conversations_with_context(test_conversation_data)

print("Example Test Set:")
print()
# Print a few processed dialogues with context
for item in processed_test_conversations_with_context[:3]:
    print(f"Movie: {item['movie']}")
    print(f"Original: {item['original']}")
    print(f"Context: {item['context']}")
    print(f"Language: {item['language']}")
    print()

Example Training Set:

Movie: the hustler
Original: EDDIE: And them fingers, them chubby fingers. And that stroke. It's like he's, uh, like he's playing a violin or something.
Context: ["EDDIE: And them fingers, them chubby fingers. And that stroke. It's like he's, uh, like he's playing a violin or something."]
Language: en

Movie: the hustler
Original: FATS: Nine ball.  Three ball.
Context: ["EDDIE: And them fingers, them chubby fingers. And that stroke. It's like he's, uh, like he's playing a violin or something.", 'FATS: Nine ball.  Three ball.']
Language: en

Movie: jennifer eight
Original: ROSS: Did I say he did?
Context: ['ROSS: Did I say he did?']
Language: af

Example Test Set:

Movie: hider in the house
Original: PHIL: Sweetheart, this is a very risky time for me right now.  Maybe you don't appreciate that.
Context: ["PHIL: Sweetheart, this is a very risky time for me right now.  Maybe you don't appreciate that."]
Language: en

Movie: hider in the house
Original: JULIE: I don'

**Preprocessing Function** - The preprocess_text function converts text to lowercase, removes special characters, and tokenizes it using a BERT tokenizer.

**Conversation Processing** - The process_conversations_with_context function goes through each conversation and dialogue.

It makes sure that the dialogue is not empty.

It uses langdetect to identify the language of the dialogue.

If the dialogue is in English, it tokenizes it and stores the movie name,
original dialogue, and tokenized dialogue in a processed list.

If the dialogue is not in English, it skips that dialogue.

It maintains a history of previous utterances to provide context. We limit the number of previous utterances stored in the previous_utterances list to prevent memory overflow.

The processed data includes the original dialogue, its tokens, and the context.

## **Model Design and Training for the Dialogue Processing System**

### Input Preparation

Model Inputs - The Context can be combined to form the input sequence for the model, where the last dialogue is the target output.

In [None]:
class ConversationDataset(Dataset):
    def __init__(self, conversations, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.conversations = conversations
        self.max_length = max_length

    def __len__(self):
        return len(self.conversations)

    def __getitem__(self, idx):
        conversation = self.conversations[idx]

        # Combine context and original dialogue for GPT-2 input
        input_text = f"Movie: {conversation['movie']}\n" + \
                     f"Context: {' '.join(conversation['context'])}\n" + \
                     f"Dialogue: {conversation['original']}"

        # Tokenization and padding
        encoding = self.tokenizer.encode_plus(
            input_text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt',
        )

        input_ids = encoding['input_ids'].flatten()
        attention_mask = encoding['attention_mask'].flatten()

        # GPT-2 expects the labels to be the same as input_ids
        labels = input_ids.clone()

        # Set padding token labels to -100 so they are ignored in the loss calculation
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,  # Add labels for loss computation
        }

# Initialize GPT-2 tokenizer and add padding token
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Use EOS token as padding token

# Initialize dataset with the processed conversations
conversation_dataset = ConversationDataset(processed_conversations_with_context, tokenizer)

# Initialize GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Ensure the model can handle the new padding token
model.resize_token_embeddings(len(tokenizer))

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=200,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=conversation_dataset,
)

# Train the model
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss
200,2.0682


Step,Training Loss
200,2.0682
400,1.2399


### Test Set Input Preparation

In [32]:
class ConversationDataset(Dataset):
    def __init__(self, conversations, tokenizer, max_length=512, is_test=False):
        self.tokenizer = tokenizer
        self.conversations = conversations
        self.max_length = max_length
        self.is_test = is_test  # New flag to differentiate between train and test

    def __len__(self):
        return len(self.conversations)

    def __getitem__(self, idx):
        conversation = self.conversations[idx]

        # Combine context and original dialogue for GPT-2 input
        input_text = f"Movie: {conversation['movie']}\n" + \
                     f"Context: {' '.join(conversation['context'])}\n" + \
                     f"Dialogue: {conversation['original']}"

        # Tokenization and padding
        encoding = self.tokenizer.encode_plus(
            input_text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt',
        )

        input_ids = encoding['input_ids'].flatten()
        attention_mask = encoding['attention_mask'].flatten()

        # For training, we provide labels
        if not self.is_test:
            labels = input_ids.clone()
            labels[labels == self.tokenizer.pad_token_id] = -100  # Ignore padding tokens
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'labels': labels,
            }
        else:
            # For testing, just return input_ids and attention_mask
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
            }

# Initialize GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set the pad token as the EOS token if it's not already set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Use EOS token as padding token
    tokenizer.pad_token_id = tokenizer.eos_token_id  # Ensure the pad token ID is set

# Ensure the model can handle the new padding token
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

# Initialize dataset with the processed test conversations
test_conversation_dataset = ConversationDataset(processed_test_conversations_with_context, tokenizer, is_test=True)



## **Saving the Model and Tokenizer**

In [28]:
# Saving the trained model and tokenizer
output_dir = "/content/model_on_colab"

# Save model
model.save_pretrained(output_dir)

# Save tokenizer
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")

# Create a zip file of the saved model folder
shutil.make_archive("/content/model_on_colab", 'zip', "/content/model_on_colab")

print("Model has been zipped.")

# Download the zipped file
files.download("/content/model_on_colab.zip")

Model and tokenizer saved to /content/model_on_colab
Model has been zipped.


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [39]:
from google.colab import files
uploaded = files.upload()

Saving model_on_colab.zip to model_on_colab.zip


In [40]:
zip_file_path = 'model_on_colab.zip'
extract_path = './model_directory'  # Define a directory to extract the model

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Check the extracted files
print(os.listdir(extract_path))

['merges.txt', 'config.json', 'model.safetensors', 'vocab.json', 'generation_config.json', 'tokenizer_config.json', 'special_tokens_map.json']


## **Model Evaluation**

In [42]:
# Load the tokenizer and trained model from the extracted directory
tokenizer = GPT2Tokenizer.from_pretrained(extract_path)  # Use the extract path here
model = GPT2LMHeadModel.from_pretrained(extract_path)     # Use the same extract path

# Set the model to evaluation mode
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [49]:
# Set a padding token (using eos_token as padding)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to EOS token

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    per_device_eval_batch_size=8,
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    report_to=["none"]  # Disable W&B logging
)


# Initialize the Trainer
trainer = Trainer(
    model=model,                         # Your pre-trained model
    args=training_args,                  # Training arguments
    tokenizer=tokenizer,                 # Your tokenizer
    eval_dataset=test_conversation_dataset  # Your evaluation dataset
)

# Evaluate the model using the test dataset
eval_result = trainer.evaluate()

# Print evaluation results
print(eval_result)

{'eval_model_preparation_time': 0.0031, 'eval_runtime': 224.048, 'eval_samples_per_second': 0.29, 'eval_steps_per_second': 0.04}


### **Evaluation Results Breakdown**

**eval_model_preparation_time**:

**Value: 0.0031 seconds**

Description: This is the time taken to prepare the model for evaluation. A low value suggests that the model was ready quickly.

**eval_runtime**:

Value: 224.048 seconds
Description: This is the total time taken for the evaluation process. This value can be affected by factors such as the size of the evaluation dataset, the complexity of the model, and the hardware used for evaluation.

**eval_samples_per_second**:

**Value: 0.29 samples/second**

Description: This metric indicates how many samples (conversations, in your case) were processed per second during evaluation. A low value could suggest that the model or evaluation process might be bottlenecked by resource limitations (CPU/GPU) or inefficiencies in data processing.

**eval_steps_per_second**:

**Value: 0.04 steps/second**

Description: Similar to eval_samples_per_second, this indicates the number of steps processed per second during evaluation. Each step typically corresponds to the evaluation of a batch of samples.

### **BLEU Score Evaluation**

BLEU (Bilingual Evaluation Understudy) is a metric that compares generated text with reference text, often used for translation or dialogue systems.

In [52]:
# Assuming test_conversations is a list of dialogues
def evaluate_bleu(model, tokenizer, test_dataset):
    bleu_scores = []
    model.eval()  # Set model to evaluation mode

    for dialogue in test_dataset:
        # Assuming dialogue is the original text
        input_text = f"Dialogue: {dialogue}"

        # Generate response
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # Move to device
        with torch.no_grad():  # No gradients needed during evaluation
            outputs = model.generate(inputs['input_ids'], max_length=100, num_return_sequences=1)

        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Calculate BLEU score
        reference = dialogue.split()  # Reference text
        candidate = generated_text.split()  # Generated text
        bleu_score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
        bleu_scores.append(bleu_score)

    # Average BLEU score
    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0.0  # Avoid division by zero
    return avg_bleu

# Evaluate BLEU score
avg_bleu = evaluate_bleu(model, tokenizer, test_conversations)
print(f"Average BLEU score: {avg_bleu:.4f}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask

Average BLEU score: 0.2182


###Explanation of BLEU:

BLEU Score is a measure of how close the model’s generated text is to the reference text (the correct answer).

An average BLEU score of **0.2182** indicates a moderate level of similarity between the generated dialogues and the reference dialogues. Here’s how to interpret this result and what steps you might consider next:

Interpretation of BLEU Score

Range: The BLEU score ranges from 0 to 1, where 0 means no overlap between the generated text and the reference text, and 1 means an exact match.

0.2182: This score suggests that the model is generating responses that have some level of similarity to the original dialogues, but there is considerable room for improvement.

Factors Influencing BLEU Score

Model Training: The quality of the training data and how well the model was fine-tuned on that data can significantly impact the score. If the training data had limited diversity or contained noisy examples, this could lead to lower BLEU scores.

Input Complexity: If the input contexts are complex or highly variable, the model might struggle to generate relevant responses, impacting the BLEU score.

Tokenization: Ensure that tokenization is consistent between the generated and reference texts. Variations in tokenization can lead to discrepancies in the BLEU score.

Smoothing: BLEU scores can be sensitive to very short sequences. You might want to experiment with different smoothing techniques to see if they affect your scores.

In [57]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=e51d8f9b815f278530182011e8eaab47ad67c2239a80a57608918de5dc4e516a
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


##**ROUGE Score**

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is often used for evaluating generated text by comparing it to reference text.

In [58]:
from rouge_score import rouge_scorer

def evaluate_rouge(model, tokenizer, test_dataset):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = []
    model.eval()  # Set model to evaluation mode

    for dialogue in test_dataset:
        input_text = f"Dialogue: {dialogue}"

        # Generate response
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(inputs['input_ids'], max_length=100, num_return_sequences=1)

        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Calculate ROUGE score
        scores = scorer.score(dialogue, generated_text)
        rouge_scores.append(scores)

    # Average ROUGE scores
    avg_rouge1 = sum(score['rouge1'].fmeasure for score in rouge_scores) / len(rouge_scores)
    avg_rouge2 = sum(score['rouge2'].fmeasure for score in rouge_scores) / len(rouge_scores)
    avg_rougeL = sum(score['rougeL'].fmeasure for score in rouge_scores) / len(rouge_scores)

    return avg_rouge1, avg_rouge2, avg_rougeL

# Evaluate ROUGE score
avg_rouge1, avg_rouge2, avg_rougeL = evaluate_rouge(model, tokenizer, test_conversations)
print(f"Average ROUGE-1: {avg_rouge1:.4f}")
print(f"Average ROUGE-2: {avg_rouge2:.4f}")
print(f"Average ROUGE-L: {avg_rougeL:.4f}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generati

Average ROUGE-1: 0.2931
Average ROUGE-2: 0.2624
Average ROUGE-L: 0.2931


## Explanation of Rough

ROUGE-1: This score measures the overlap of unigrams (individual words) between the generated and reference texts. An average score of 0.2931 indicates that around 29% of the words in the generated dialogues match with the reference dialogues.

ROUGE-2: This score looks at the overlap of bigrams (pairs of consecutive words). A score of 0.2624 suggests that approximately 26% of the bigram sequences in the generated text are also found in the reference texts, showing some level of coherence and contextual similarity.

ROUGE-L: This score assesses the longest common subsequence, which is important for capturing the overall structure and flow of the generated text. An average score of 0.2931 indicates a reasonable alignment in terms of structure between your generated outputs and the reference dialogues.