# Custom Chatbot Project 

Mauricio Cabreira
v1.0
20240512

I have chosen the character description CSV file, which contains characters from theater, television, and film productions. Each row contains a character's name, description, medium, and setting. All characters were invented by an OpenAI model.

In [1]:
import os
import openai
import json
from pathlib import Path
import pandas as pd

In [2]:
OPENAI_API_KEY= "API KEY HERE"
openai.api_key =  OPENAI_API_KEY

In [3]:
# Decoding parameters
TEMPERATURE = 0.0
MAX_TOKENS = 3950  # Increased to simulate LLM with smaller attention window
TOP_P = 1.0

## Step 1: Inspecting Non-Customized Results

Before giving the model any context from the new dataset, let's ask the OpenAI model two questions and see how it answers.


### Question 1: 
**This question asks about the limited series set in Australia, whose main characters are Mia, Lucas, Tahlia, Max, and Ava.**


**Mia**:	A young Australian woman in her mid-20s, Mia is a driven and ambitious lawyer who's just landed her dream job at a top law firm in Sydney. She's the younger sister of Max, a former soldier who's struggling with PTSD, and is trying to help him navigate his challenges while also balancing her demanding career.

**Lucas**:	A middle-aged Australian man in his 40s, Lucas is a successful businessman and the CEO of a major tech company. He's charming, charismatic, and has a way with people. He's been married to Ava, a successful fashion designer, for many years, but their marriage is on the rocks due to his infidelity.

**Tahlia**:	A young Indigenous Australian woman in her early 20s, Tahlia is a talented artist who's just been accepted into a prestigious art school. She's the niece of Mia and Max, and they've been like siblings since they were young. She's struggling to find her place in the world as an Indigenous woman, but Mia and Max are always there to support her.

**Max**:	A white Australian man in his late 20s, Max is a former soldier who's struggling to adjust to civilian life after serving in Afghanistan. He's tough, no-nonsense, and has a strong sense of duty. He's the older brother of Mia, and they've always been close. He's also Ava's godson, and she's always been like a second mother to him.

**Ava**:	A middle-aged Australian woman in her 50s, Ava is a successful fashion designer who's built an empire on her impeccable taste and attention to detail. She's elegant, sophisticated, and always knows what's in style. She's married to Lucas, but their marriage is strained due to his infidelity. She's also been a mentor to Tahlia, and has helped her navigate the art world.

In [4]:
limited_series_australia_prompt = """
Question: "What are the main characters of a Limited series that takes place in Australia, and how they are related to each other?"
Answer:
"""
initial_limited_series_australia_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=limited_series_australia_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_limited_series_australia_answer)


The main characters in a Limited series taking place in Australia may include:

1. Sarah - She is a middle-aged woman who has recently lost her husband in a tragic accident. She is struggling to come to terms with her grief and is trying to rebuild her life. She is also a mother to two teenagers, Emma and Jack.

2. Emma - Sarah's 17-year-old daughter, she is rebellious and often clashes with her mother. She is struggling with the loss of her father and is acting out in various ways, causing tension within the family.

3. Jack - Sarah's 15-year-old son, he is shy and introverted, and often feels ignored by his mother who is focused on Emma's rebellious behavior. He


### Question 2: 

**This question asks about the sitcom set in the USA (colonial Williamsburg), whose main characters are Abigail, Thomas, Reverend Brown, Captain James, Mrs. Mercer, and Mr. Mercer.**

**Abigail**:	A plucky and resourceful young woman who works as a maid in one of the taverns in colonial Williamsburg. Abigail is hard-working and determined, and dreams of one day owning her own business. She has a friendly rivalry with her co-worker, Thomas.

**Thomas**:	A good-natured and affable young man who also works in the same tavern as Abigail. Thomas is the jester of the group and enjoys making jokes and lightening the mood. He often finds himself caught up in Abigail's schemes.

**Reverend Brown**:	The pious and stern minister of the local church. Reverend Brown takes his role very seriously and often clashes with the more irreverent characters in the town. He is also secretly in love with Abigail and tries to win her over with his piety.

**Captain James**:	The charismatic and dashing captain of the local militia. Captain James is a ladies' man and enjoys flirting with the women of the town. He has a friendly rivalry with Reverend Brown and often teases him about his piousness.

**Mrs. Mercer**:	The matriarch of the wealthiest family in Williamsburg. Mrs. Mercer is a bit of a snob and enjoys reminding everyone of her social standing. She often hires Abigail to work in her home and is very demanding.

**Mr. Mercer**: 	The bumbling and absent-minded patriarch of the Mercer family. Mr. Mercer is often clueless about what is going on around him and relies on his wife to keep him in line. He has a secret love of practical jokes and often finds himself in trouble because of them.

In [5]:
sitcom_usa_prompt = """
Question: "What are the main characters of a sitcom that takes place in USA, Williansburg, and how they are related to each other?"
Answer:
"""
initial_sitcom_usa_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=sitcom_usa_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_sitcom_usa_answer)


The main characters of the sitcom in USA, Williansburg could be a diverse group of individuals who are all connected in some way through their lives in the neighborhood. Some possible main characters and their relationships to each other could include:

1. The Smith family - a typical American family living in Williansburg. Mark Smith is the father who works as an accountant, while his wife Sarah is a stay-at-home mom and their teenage daughter Emma is a rebellious high school student. They live next door to...

2. The Chang family - a Chinese-American family who own and run a local restaurant. Mr. Chang is the head chef while his wife Ling manages the front of the house. Their son Kevin is Emma's best friend and classmate,


## Step 2: Data Wrangling

Load `data/character_descriptions.csv` into a DataFrame named `df`. This data will be used later to generate embeddings.

In [6]:
df = pd.read_csv("data/character_descriptions.csv")
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


Create a single column called `text` that joins all four columns, separated by " – " (an en dash surrounded by spaces). This column will be used to create the embeddings.

In [7]:
df["text"] = df["Name"] + " – " + df["Description"] + " – " + df["Medium"] + " – " + df["Setting"]
df.head()

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,"Emily – A young woman in her early 20s, Emily ..."
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,"Jack – A middle-aged man in his 40s, Jack is a..."
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,"Alice – A woman in her late 30s, Alice is a wa..."
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,"Tom – A man in his 50s, Tom is a retired soldi..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,"Sarah – A woman in her mid-20s, Sarah is a fre..."
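
As a minimal sketch of the same column-joining step, the snippet below builds the combined string on a toy DataFrame. The toy rows are hypothetical, and the `fillna("")` call is an added assumption: it is not part of the notebook's code, but without it a missing cell would turn the whole combined string into `NaN`.

```python
import pandas as pd

# Toy rows standing in for data/character_descriptions.csv (hypothetical values).
toy = pd.DataFrame(
    {
        "Name": ["Emily", "Jack"],
        "Description": ["An aspiring actress.", None],  # one missing value on purpose
        "Medium": ["Play", "Play"],
        "Setting": ["England", "England"],
    }
)

columns = ["Name", "Description", "Medium", "Setting"]

# Replace missing cells with empty strings, then join each row with " – ".
toy["text"] = (
    toy[columns]
    .fillna("")
    .astype(str)
    .apply(lambda row: " – ".join(row), axis=1)
)

print(toy["text"].tolist())
```

If the real CSV has no missing values, this produces the same strings as the `+` concatenation used above.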


In [8]:
df = df.drop(columns=['Name', 'Description', 'Medium', 'Setting'])
df.head()

Unnamed: 0,text
0,"Emily – A young woman in her early 20s, Emily ..."
1,"Jack – A middle-aged man in his 40s, Jack is a..."
2,"Alice – A woman in her late 30s, Alice is a wa..."
3,"Tom – A man in his 50s, Tom is a retired soldi..."
4,"Sarah – A woman in her mid-20s, Sarah is a fre..."


## Step 3: Generating Embeddings

We'll use the `Embedding` tooling from OpenAI ([documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)) to create vectors representing each row of our custom dataset.

In order to avoid a `RateLimitError` we'll send our data in batches to the `Embedding.create` function.

In [9]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

Unnamed: 0,text,embeddings
0,"Emily – A young woman in her early 20s, Emily ...","[-0.016638154163956642, -0.011081420816481113,..."
1,"Jack – A middle-aged man in his 40s, Jack is a...","[0.0021353126503527164, -0.018396539613604546,..."
2,"Alice – A woman in her late 30s, Alice is a wa...","[0.0035223872400820255, -0.00580426724627614, ..."
3,"Tom – A man in his 50s, Tom is a retired soldi...","[0.015513812191784382, -0.013250299729406834, ..."
4,"Sarah – A woman in her mid-20s, Sarah is a fre...","[-0.011919834651052952, -0.019470803439617157,..."


Save the data and embeddings so we can continue from this point in the future without repeating the previous steps.

In [10]:
df.to_csv("embeddings.csv")

In [11]:
!ls

Udacity-GenAI_ND_Module_2_Custom_Chatbot_20240512-01.ipynb  embeddings.csv
data							    project.ipynb


Reading the file:

In [12]:
import numpy as np
import pandas as pd
import openai
openai.api_key = "API KEY HERE"
df = pd.read_csv("embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
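
The CSV round trip stores each embedding as the string form of a Python list, which is why the column has to be parsed after loading. The sketch below shows the same round trip on a toy DataFrame using `ast.literal_eval`, which only accepts Python literals; this is a suggested alternative to `eval`, not what the notebook ran. The in-memory buffer and the toy values are assumptions for illustration.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with list-valued embeddings (hypothetical values).
original = pd.DataFrame(
    {"text": ["a", "b"], "embeddings": [[0.1, -0.2], [0.3, 0.4]]}
)

# Write to an in-memory buffer instead of embeddings.csv.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)

# Each cell is now a string such as "[0.1, -0.2]"; parse it back into an array.
restored["embeddings"] = restored["embeddings"].apply(
    lambda cell: np.array(ast.literal_eval(cell))
)

print(type(restored["embeddings"][0]), restored["embeddings"][0])
```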

In [13]:
df.head()

Unnamed: 0,text,embeddings
0,"Emily – A young woman in her early 20s, Emily ...","[-0.016638154163956642, -0.011081420816481113,..."
1,"Jack – A middle-aged man in his 40s, Jack is a...","[0.0021353126503527164, -0.018396539613604546,..."
2,"Alice – A woman in her late 30s, Alice is a wa...","[0.0035223872400820255, -0.00580426724627614, ..."
3,"Tom – A man in his 50s, Tom is a retired soldi...","[0.015513812191784382, -0.013250299729406834, ..."
4,"Sarah – A woman in her mid-20s, Sarah is a fre...","[-0.011919834651052952, -0.019470803439617157,..."


## Step 4: Create a Function that Finds Related Pieces of Text for a Given Question

What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.

In [14]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
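
To make the ranking criterion concrete, here is a small NumPy sketch of cosine distance, commonly defined as one minus the cosine similarity of two vectors. The helper `cosine_distance` and the 2-dimensional vectors are hypothetical; the notebook itself relies on `distances_from_embeddings` for this step.

```python
import numpy as np


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)


# Toy "question" vector and three toy "row" vectors.
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more relevant, so sort in ascending order.
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not affect the result: `[2.0, 0.0]` points the same way as the question vector, so its distance is 0 even though it is twice as long.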


Let's test it with a question and check how relevant the top-ranked rows are.

The correct response is **Abigail**. Let's see whether her row is ranked first.

In [15]:
get_rows_sorted_by_relevance("Who is the young woman that works as a maid in a tavern in the USA Sitcom that takes place in Williansburg?", df)

Unnamed: 0,text,embeddings,distances
49,Abigail – A plucky and resourceful young woman...,"[-0.027043841779232025, -0.01988670602440834, ...",0.135161
53,Mrs. Mercer – The matriarch of the wealthiest ...,"[-0.019808832556009293, -0.010635176673531532,...",0.177994
50,Thomas – A good-natured and affable young man ...,"[-0.017957238480448723, -0.012376978993415833,...",0.19044
8,"Maria – A middle-aged Latina woman in her 40s,...","[-0.0036456510424613953, -0.008789515122771263...",0.205441
0,"Emily – A young woman in her early 20s, Emily ...","[-0.016638154163956642, -0.011081420816481113,...",0.21281
52,Captain James – The charismatic and dashing ca...,"[-0.006222863215953112, -0.016035335138440132,...",0.214885
4,"Sarah – A woman in her mid-20s, Sarah is a fre...","[-0.011919834651052952, -0.019470803439617157,...",0.21599
45,Bianca – Lady Olivia's cunning and quick-witte...,"[-0.01409891713410616, -0.021390557289123535, ...",0.217439
20,"Johnny – A young up-and-coming performer, John...","[-0.033840082585811615, -0.019628280773758888,...",0.232662
40,Lady Olivia – A wealthy and beautiful noblewom...,"[-0.018421368673443794, -0.019241541624069214,...",0.232716


Correct! Abigail is ranked first, with the shortest cosine distance between the question's embedding and the row embeddings we created earlier.

## Step 5: Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to a `Completion` model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model, which is currently about 4,000 (shared between the prompt and the answer). So we'll loop over the rows from most to least relevant, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.

In [16]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
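
The token-budget loop can be explored in isolation. The sketch below is a variation under stated assumptions: `select_context` and `count_words` are hypothetical names, the word counter is only a stand-in for a real tokenizer such as `tiktoken`, and, unlike `create_prompt` above, it also charges the budget for the `###` separator placed between rows.

```python
from typing import Callable, List

SEPARATOR = "\n\n###\n\n"


def select_context(
    rows: List[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> List[str]:
    """Take rows in order until adding the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    separator_cost = count_tokens(SEPARATOR)
    for row in rows:
        # Every row after the first also pays for the separator before it.
        cost = count_tokens(row) + (separator_cost if selected else 0)
        if used + cost > budget:
            break
        selected.append(row)
        used += cost
    return selected


def count_words(text: str) -> int:
    # Stand-in tokenizer: counts whitespace-separated words.
    return len(text.split())


rows = ["one two three", "four five", "six seven eight nine"]
print(select_context(rows, count_words, budget=6))
```

Because the rows arrive sorted by relevance, stopping at the first row that does not fit keeps the most relevant rows and drops the rest.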
    

Now let's test that out! We'll use a `max_token_count` below the actual limit just to keep the output shorter and more readable.

In [17]:
print(create_prompt("Who is the young woman that works as a maid in a tavern in the USA Sitcom that takes place in Williansburg?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Abigail – A plucky and resourceful young woman who works as a maid in one of the taverns in colonial Williamsburg. Abigail is hard-working and determined, and dreams of one day owning her own business. She has a friendly rivalry with her co-worker, Thomas. – Sitcom – USA

###

Mrs. Mercer – The matriarch of the wealthiest family in Williamsburg. Mrs. Mercer is a bit of a snob and enjoys reminding everyone of her social standing. She often hires Abigail to work in her home and is very demanding. – Sitcom – USA

---

Question: Who is the young woman that works as a maid in a tavern in the USA Sitcom that takes place in Williansburg?
Answer:


## Step 6: Create a Function that Answers a Question

Our final step is to send that text prompt to a `Completion` model and parse the model output!

In [18]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

Let's test it out:

**Question 1**: The expected answer must include Mia, Max, Lucas, Tahlia, and Ava and their relationships.

In [19]:
custom_limited_series_australia_answer = answer_question("What are the main characters of a Limited series that takes place in Australia, and how they are related to each other?", df)
print(custom_limited_series_australia_answer)


The main characters are Mia, Max, Lucas, Tahlia, and Ava. Mia and Max are siblings, Ava is their godmother, and Tahlia is their niece. Additionally, Lucas is married to Ava.


**Question 2**: The expected answer must include Abigail, Thomas, Reverend Brown, Captain James, Mrs. Mercer and Mr. Mercer, and their relationships.

In [20]:
custom_sitcom_usa_answer = answer_question("What are the main characters of a sitcom that takes place in USA, Williansburg, and how they are related to each other?", df)
print(custom_sitcom_usa_answer)


The main characters of this sitcom are Mrs. Mercer, Mr. Mercer, Abigail, Thomas, and Reverend Brown. Mrs. Mercer is the matriarch of the wealthiest family in Williamsburg, and she often hires Abigail to work for her. Mr. Mercer is her husband and is often clueless and bumbling. Abigail works in a tavern alongside Thomas, who is her coworker and a good friend. Reverend Brown is the local minister and he is secretly in love with Abigail.


In [21]:
custom_sitcom_usa_answer2 = answer_question("What are the main characters of a sitcom that takes place in USA, Williansburg?", df)
print(custom_sitcom_usa_answer2)


Mrs. Mercer, Captain James, Abigail, Thomas, Mr. Mercer, and Reverend Brown.
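
The "expected answer must include" criteria can also be checked mechanically. The sketch below uses a hypothetical helper, `missing_names`, with a plain case-insensitive substring match; the two answer strings are copied from the Question 2 outputs above (the first one shortened to its opening sentence, which is the only place it lists characters).

```python
from typing import List


def missing_names(answer: str, expected: List[str]) -> List[str]:
    """Return the expected names that do not appear in the answer."""
    lowered = answer.lower()
    return [name for name in expected if name.lower() not in lowered]


expected_sitcom = [
    "Abigail", "Thomas", "Reverend Brown",
    "Captain James", "Mrs. Mercer", "Mr. Mercer",
]

answer_with_relationships = (
    "The main characters of this sitcom are Mrs. Mercer, Mr. Mercer, "
    "Abigail, Thomas, and Reverend Brown."
)
answer_names_only = (
    "Mrs. Mercer, Captain James, Abigail, Thomas, Mr. Mercer, and Reverend Brown."
)

print(missing_names(answer_with_relationships, expected_sitcom))
print(missing_names(answer_names_only, expected_sitcom))
```

On these two strings the check reports that the answer with relationships leaves out Captain James, while the names-only answer covers all six characters. Because the calls above do not pass a `temperature`, repeated runs may produce different answers.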


## Step 7: Comparing the answers

Below we compare the answers. The original answers invent unrelated characters, while the answers produced with retrieval-augmented generation (RAG) name the characters from the dataset:

In [23]:
print(f"""
What are the main characters of a Limited series that takes place in Australia, and how they are related to each other?

Original Answer: {initial_limited_series_australia_answer}

Custom Answer:   {custom_limited_series_australia_answer}





What are the main characters of a sitcom that takes place in USA, Williansburg, and how they are related to each other?
Original Answer: {initial_sitcom_usa_answer}

Custom Answer:   {custom_sitcom_usa_answer}
""")


What are the main characters of a Limited series that takes place in Australia, and how they are related to each other?

Original Answer: The main characters in a Limited series taking place in Australia may include:

1. Sarah - She is a middle-aged woman who has recently lost her husband in a tragic accident. She is struggling to come to terms with her grief and is trying to rebuild her life. She is also a mother to two teenagers, Emma and Jack.

2. Emma - Sarah's 17-year-old daughter, she is rebellious and often clashes with her mother. She is struggling with the loss of her father and is acting out in various ways, causing tension within the family.

3. Jack - Sarah's 15-year-old son, he is shy and introverted, and often feels ignored by his mother who is focused on Emma's rebellious behavior. He

Custom Answer:   The main characters are Mia, Max, Lucas, Tahlia, and Ava. Mia and Max are siblings, Ava is their godmother, and Tahlia is their niece. Additionally, Lucas is married to A