# Final Project Team 3

Members: Cesar Lucero Ortiz; Jeremy Cryer; Akram Mahmoud

AAI 520: Natural Language Processing

Professor Roozbeh Sadeghian

October 23, 2023

**Assignment Objective**:
- *Goal*: Build a chatbot that can carry out multi-turn conversations, adapt to context, and handle a variety of topics.
- *Output*: A web or app interface where users can converse with the chatbot.

**Dataset**
Download the Cornell Movie Dialogs Corpus dataset from Kaggle using the following link:

https://www.kaggle.com/datasets/rajathmc/cornell-moviedialog-corpus

**Deliverable**: A working generative chatbot accessible via Notebook and/or accessible through a web interface (include the link in your submission and report). Your Notebook should be in PDF or HTML format.

# Library Imports
Installing and importing the libraries required for the project.

In [None]:
!pip --quiet install transformers torch

In [None]:
!pip install --quiet transformers[torch] -U
!pip install --quiet accelerate -U

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import numpy as np
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # Needed for the bar chart in the preprocessing section
from google.colab import drive
import os
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from accelerate import Accelerator

# Data Upload
Here the raw data is loaded into the notebook for use.

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Configuration of the data extraction. Here we pre-process the data, extracting the features from plain text in which the fields are separated by a symbol sequence.

In [None]:
file_path = '/content/drive/MyDrive/NLP Project/movie_lines.txt'
separator = r'\s*\+\+\+\$\+\+\+\s*'
encod = 'iso-8859-1'
# Read the raw file; a regex separator requires pandas' Python parsing engine
raw_data = pd.read_csv(file_path, sep=separator, encoding=encod, header=None, engine='python')



In [None]:
raw_data.columns = ['Line_ID', 'Char_ID', 'Movie', 'Char', 'Line']  # Name the columns
df = raw_data.astype(str)                            # Convert every column to string
df['Line_ID'] = df['Line_ID'].str.replace('L', '')   # Strip the 'L' prefix from the line ID
df['Line_ID'] = df['Line_ID'].astype(int)            # Cast to int so the sort is numeric
df = df.sort_values(by='Line_ID')                    # Sort by line ID
df

Unnamed: 0,Line_ID,Char_ID,Movie,Char,Line
86,49,u0,m0,BIANCA,Did you change your hair?
85,50,u3,m0,CHASTITY,No.
84,51,u0,m0,BIANCA,You might wanna think about it
648,59,u9,m0,PATRICK,I missed you.
647,60,u8,m0,MISS PERKY,It says here you exposed yourself to a group o...
...,...,...,...,...,...
304704,666522,u9034,m616,VEREKER,So far only their scouts. But we have had repo...
304679,666546,u9027,m616,CHELMSFORD,"Splendid site, Crealock, splendil I want to es..."
304678,666547,u9029,m616,CREALOCK,"Certainly, Sin"
304696,666575,u9028,m616,COGHILL,Choose your targets men. That's right Watch th...
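To see what the separator regex and the line-ID conversion do to a single raw line, the sketch below applies the same pattern with the standard `re` module instead of pandas. The sample line is reconstructed from the first row of the table above, and `parse_line` is a hypothetical helper that is not part of the notebook.

```python
import re

# Same separator pattern as in the notebook: '+++$+++' with optional whitespace around it
SEPARATOR = re.compile(r'\s*\+\+\+\$\+\+\+\s*')
COLUMNS = ['Line_ID', 'Char_ID', 'Movie', 'Char', 'Line']


def parse_line(raw_line):
    """Split one raw corpus line into a dict keyed by column name (hypothetical helper)."""
    # Limit the number of splits so the dialogue text stays in one piece
    fields = SEPARATOR.split(raw_line.rstrip('\n'), maxsplit=len(COLUMNS) - 1)
    record = dict(zip(COLUMNS, fields))
    # Drop the leading 'L' and cast to int so that sorting is numeric, not lexical
    record['Line_ID'] = int(record['Line_ID'].lstrip('L'))
    return record


# Sample line reconstructed from the table output (assumed raw format)
sample = 'L49 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Did you change your hair?'
record = parse_line(sample)
print(record)
```

Casting `Line_ID` to an integer matters because string sorting would place `'100'` before `'49'`.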


# Preprocessing
Here all preprocessing steps prior to model building and training are carried out.

At this point, the 'Line' column is converted to lower case. We decided to keep stopwords and punctuation in the training text so that the model learns to use them. If the generated responses turn out to be poor, we will remove the stopwords.

In [None]:
#df = df.iloc[0:250000] #Limiting DF
df['Line'] = df['Line'].str.lower()  # Lower-case every line (the column already holds strings)
df

Unnamed: 0,Line_ID,Char_ID,Movie,Char,Line
86,49,u0,m0,BIANCA,did you change your hair?
85,50,u3,m0,CHASTITY,no.
84,51,u0,m0,BIANCA,you might wanna think about it
648,59,u9,m0,PATRICK,i missed you.
647,60,u8,m0,MISS PERKY,it says here you exposed yourself to a group o...
...,...,...,...,...,...
304704,666522,u9034,m616,VEREKER,so far only their scouts. but we have had repo...
304679,666546,u9027,m616,CHELMSFORD,"splendid site, crealock, splendil i want to es..."
304678,666547,u9029,m616,CREALOCK,"certainly, sin"
304696,666575,u9028,m616,COGHILL,choose your targets men. that's right watch th...


In [None]:
#df['Line'] = df['Line'].apply(remove_punct)   Keep punctuation to make the
                                               #model conversational

Before further preprocessing, it is useful to look at how much dialogue exists per character, so that any bias can be taken into account when interpreting the results. No dialogue is removed, however, in order to keep as much variation as possible in the training data.

In [None]:
# Group the data by 'Char' (character name) and count the number of lines for each character
dialogue_count = df.groupby('Char')['Line'].count().reset_index()

# Rename the columns for clarity
dialogue_count.columns = ['Char', 'Dialogue_Count']

# Sort the results by dialogue count in descending order
dialogue_count = dialogue_count.sort_values(by='Dialogue_Count', ascending=False)

# Display the dialogue count per character
print(dialogue_count)

In [None]:
# Setting the figure size
plt.figure(figsize=(10, 6))

# Creating a bar chart
plt.bar(dialogue_count['Char'], dialogue_count['Dialogue_Count'], color='skyblue')

# Setting labels and title
plt.xlabel('Character')
plt.ylabel('Dialogue Count')
plt.title('Dialogue Count per Character')

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Showing the plot
plt.tight_layout()
plt.show()

This function structures the dialogues by joining consecutive lines spoken by the same character: if a character says two or more lines in a row, they are concatenated until another character speaks. The result is a list of dictionaries, one per turn, for easier handling.

In [None]:
def structure_dialogues(df):
    dialogues = []
    current_character = None
    current_dialogue = ""

    for index, row in df.iterrows():
        character = row['Char']
        dialogue = row['Line']

        if character != current_character:
            # A new character's dialogue begins
            if current_character is not None:
                dialogues.append({"character": current_character, "dialogue": current_dialogue})
            current_character = character
            current_dialogue = dialogue
        else:
            # Continue the dialogue for the same character
            current_dialogue += " " + dialogue

    # Append the last character's dialogue
    if current_character is not None:
        dialogues.append({"character": current_character, "dialogue": current_dialogue})

    return dialogues



In [None]:
structured_data = structure_dialogues(df)

df_structured_dialogs = pd.DataFrame(structured_data)
df_structured_dialogs[0:12]
# Optionally save the structured dialogs to a new CSV file
#csv_file_path = '/content/drive/MyDrive/NLP Project/structured_dialogues.csv'
#df_structured_dialogs.to_csv(csv_file_path, index=False)

Unnamed: 0,character,dialogue
0,BIANCA,did you change your hair?
1,CHASTITY,no.
2,BIANCA,you might wanna think about it
3,PATRICK,i missed you.
4,MISS PERKY,it says here you exposed yourself to a group o...
5,PATRICK,it was a bratwurst. i was eating lunch.
6,MISS PERKY,with the teeth of your zipper?
7,MICHAEL,you the new guy?
8,CAMERON,so they tell me...
9,MICHAEL,c'mon. i'm supposed to give you the tour. so ...
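The merging of consecutive lines by the same character can also be expressed with `itertools.groupby`, which groups adjacent items that share a key. The sketch below is an alternative formulation and not the notebook's function; `merge_consecutive` is a hypothetical name and the rows are illustrative toy data.

```python
from itertools import groupby


def merge_consecutive(rows):
    """Merge adjacent (character, line) pairs spoken by the same character.

    rows: iterable of (character, line) tuples in script order.
    Returns a list of {"character": ..., "dialogue": ...} dicts, one per turn.
    """
    merged = []
    # groupby only groups *adjacent* items with the same key, which is
    # exactly the "until the next character speaks" behaviour
    for character, group in groupby(rows, key=lambda row: row[0]):
        dialogue = " ".join(line for _, line in group)
        merged.append({"character": character, "dialogue": dialogue})
    return merged


# Illustrative toy rows (not read from the corpus file)
rows = [
    ("PATRICK", "it was a bratwurst."),
    ("PATRICK", "i was eating lunch."),
    ("MISS PERKY", "with the teeth of your zipper?"),
    ("PATRICK", "i missed you."),
]
merged = merge_consecutive(rows)
for turn in merged:
    print(turn)
```

Both this sketch and `structure_dialogues` key only on the character name, so two adjacent lines from different movies would be merged if the characters happen to share a name. Keying on the pair (movie, character) would avoid that.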


When building the training data, we have two options: group the turns by character, so that the model learns each character's speech patterns, or keep the turns in script order, so that the model learns the conversational structure of a dialogue.

In [None]:
dialogs = df_structured_dialogs.groupby('character')['dialogue'].apply(list).reset_index() #With dialogs grouped by character

In [None]:
# dialogs = df_structured_dialogs['dialogue'].apply(list).reset_index() #With dialogs in the original structure
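The difference between the two options can be seen on a few turns. The sketch below uses plain Python rather than pandas; the turns are shortened from the table above, and the variable names `by_character` and `in_script_order` are hypothetical.

```python
from collections import defaultdict

# A few structured turns (shortened, for illustration only)
turns = [
    {"character": "MICHAEL", "dialogue": "you the new guy?"},
    {"character": "CAMERON", "dialogue": "so they tell me..."},
    {"character": "MICHAEL", "dialogue": "c'mon. i'm supposed to give you the tour."},
]

# Option 1: group by character, so each training block holds one speaker's lines.
# The question/answer ordering between speakers is lost.
by_character = defaultdict(list)
for turn in turns:
    by_character[turn["character"]].append(turn["dialogue"])

# Option 2: keep script order, so each line is followed by the reply it received.
in_script_order = "\n".join(
    f'{turn["character"]}: {turn["dialogue"]}' for turn in turns
)

print(dict(by_character))
print(in_script_order)
```

For a chatbot that should answer the previous utterance, the second layout keeps prompt and reply next to each other in the training text, which may be the better fit. This is a design consideration rather than a measured result.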

# Model Architecture
Here the model will be built and trained.

In [None]:
# Load the pretrained GPT-2 model and its tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
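# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
data = ""
# Tokenize an empty placeholder string; both variables are overwritten in the next cell
input_ids = tokenizer(data, return_tensors="pt", truncation=True, padding=True)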


In [None]:
# Prepare the training data: write one text block per character to a file

with open("train_dataset.txt", "w", encoding="utf-8") as file:
    for index, row in dialogs.iterrows():
        charac = row['character']
        dialog = '\n'.join(row['dialogue'])
        data = f"{charac}: {dialog}\n\n"
        # Tokenize the block; truncation=True cuts it at the tokenizer's maximum length
        tokenized_data = tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        input_ids = tokenized_data['input_ids'].numpy().tolist()  # Convert the tensor to a list
        # Decode back to text and write it to the file
        file.write(tokenizer.decode(input_ids[0]) + '\n')

In [None]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train_dataset.txt",
    block_size=256,
)
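As a rough illustration of what `block_size` means here: the training text is tokenized into one long sequence of token IDs and then cut into fixed-length blocks, and to our understanding `TextDataset` discards a final incomplete block. The sketch below reproduces that chunking on plain integers; `chunk_token_ids` is a hypothetical helper, not a library function.

```python
def chunk_token_ids(token_ids, block_size):
    """Cut a flat list of token IDs into consecutive blocks of exactly block_size.

    A trailing remainder shorter than block_size is dropped (assumed behaviour).
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Ten stand-in token IDs cut into blocks of four
token_ids = list(range(10))
blocks = chunk_token_ids(token_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]; IDs 8 and 9 are dropped
```

With `block_size=256`, every training example is therefore a 256-token window that may start or end in the middle of a dialogue.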




In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal language modeling (next-token prediction), not masked language modeling
)

In [None]:
# Training configuration and training run
training_args = TrainingArguments(
    output_dir="./chatbot_model",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=256,
    save_steps=10_000,
    save_total_limit=2,
    logging_steps=1_000,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
# Fine-tune the model; without this call the saved model is the unmodified pretrained GPT-2
trainer.train()
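The values of `save_steps` and `logging_steps` only take effect if training runs for at least that many optimizer steps. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and a hypothetical dataset of 20,000 blocks (the real number depends on the size of `train_dataset.txt`).

```python
import math


def count_optimizer_steps(num_blocks, batch_size, epochs):
    """Return (steps per epoch, total steps) for one device without gradient accumulation."""
    # The last, smaller batch of an epoch still counts as a step
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


num_blocks = 20_000  # hypothetical number of 256-token blocks
steps_per_epoch, total_steps = count_optimizer_steps(num_blocks, batch_size=256, epochs=10)

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("reaches logging_steps=1_000:", total_steps >= 1_000)
print("reaches save_steps=10_000:", total_steps >= 10_000)
```

Under these assumed numbers the run would finish before the first log line or checkpoint is written, so smaller `logging_steps` and `save_steps` values may be worth considering.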


In [None]:
#Model Saving
trainer.save_model("./trained_model")

# Model Evaluation and Chat Bot
Here we will evaluate the model's performance.

In [None]:
# Load the model and tokenizer
model_path = "./trained_model"
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
# Function to generate a response
def generate_response(input_text):
    # Tokenize the user input
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    # Every input token is real (no padding), so the mask is all ones
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                                device=input_ids.device)

    # Generate a response
    with torch.no_grad():
        output = model.generate(input_ids, max_length=50,   # Limit counts prompt plus generated tokens
                                num_return_sequences=1,
                                attention_mask=attention_mask,
                                pad_token_id=tokenizer.eos_token_id,  # GPT-2's end-of-sequence token (ID 50256)
                                do_sample=True,     # Sample instead of greedy decoding
                                temperature=0.80    # Higher values increase randomness
                                )
    # Decode the generated sequence, which starts with the echoed prompt
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    return response

# Conversation loop
while True:
    # Request input from the user
    user_input = input("You: ")
    # Exit the loop if the user types "exit"
    if user_input.lower() == "exit":
        break
    # Generate a response based on the user's input
    response = generate_response(user_input)
    # Print the chatbot's response
    print("Chatbot: " + response)

You: Hi, how are you tonight?
Chatbot: Hi, how are you tonight? I am being so pretty today. I think about me, and I'm going to do so in such a way that you will be able to hear me."

"I don't know what's inside,"
You: What is inside?
Chatbot: What is inside?

"We were doing research on how to make the right changes for people with autism," says Ritchie. To get a sense of what a "typical" autism spectrum disorder is, he and his colleagues measured 553
You: exit
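In the transcript above, each reply repeats the user's input and ignores earlier turns, because only the current input is passed to the model. One possible way to add multi-turn context is sketched below: keep the last few exchanges, prepend them to the prompt, and strip the echoed prompt from the output. `ChatHistory`, the `user:`/`bot:` labels, and `echo_model` are all hypothetical; `echo_model` is a stand-in so the sketch runs without GPT-2, and in the notebook `generate_response` would take its place.

```python
from collections import deque


class ChatHistory:
    """Keeps the most recent exchanges and builds a prompt that includes them."""

    def __init__(self, max_turns=3):
        # Each entry is a (user_text, bot_text) pair; old pairs fall off automatically
        self.turns = deque(maxlen=max_turns)

    def build_prompt(self, user_input):
        lines = []
        for user_text, bot_text in self.turns:
            lines.append(f"user: {user_text}")
            lines.append(f"bot: {bot_text}")
        lines.append(f"user: {user_input}")
        lines.append("bot:")
        return "\n".join(lines)

    def reply(self, user_input, generate_fn):
        prompt = self.build_prompt(user_input)
        full_text = generate_fn(prompt)
        # Causal models return the prompt followed by the continuation; drop the prompt
        if full_text.startswith(prompt):
            full_text = full_text[len(prompt):]
        # Keep only the first generated line as the bot's answer
        answer = full_text.strip().split("\n")[0]
        self.turns.append((user_input, answer))
        return answer


def echo_model(prompt):
    """Stand-in for generate_response: returns the prompt plus a canned continuation."""
    return prompt + " fine, thanks\nuser: something the model made up"


history = ChatHistory(max_turns=2)
print("Chatbot:", history.reply("hi, how are you tonight?", echo_model))
print(history.build_prompt("what is inside?"))
```

Since `max_length=50` in `generate_response` counts the prompt tokens as well, a longer history would need a larger limit, or a limit on newly generated tokens instead.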
