# Custom Chatbot Project

## Data Description
The 'character_descriptions.csv' dataset contains character information: name, short description, medium, and setting. This project uses the retrieval-augmented generation (RAG) approach to customize a chatbot with this dataset.


## Reason for Dataset Selection

I chose the character_descriptions.csv dataset because it suits this RAG project well: it is synthetic, easy to evaluate, structured, and relevant to the domain. With it, we can measure the effectiveness of retrieval and grounding, trace hallucinations and tune prompts, and simulate real-world chatbot scenarios in a safe, experimental way.

## Data Wrangling

In [1]:
import pandas as pd
import tiktoken
import numpy as np
# Note: openai.embeddings_utils ships only with the legacy SDK (openai<1.0)
from openai.embeddings_utils import get_embedding, distances_from_embeddings

In [15]:
df = pd.read_csv("data/character_descriptions.csv", index_col=False)
df.head(5)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.",Play,England


In [16]:
# Create a text column that describes each character
pd.options.display.max_colwidth = 300

df["text"] = df.apply(
    lambda row: f"{row['Name']} is {row['Description']}. "
                f"This character is shown in the {row['Medium']} in {row['Setting']}.",
    axis=1
)

df.head(2)

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England,"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. This character is shown in the Play in England."
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England,"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. This character is shown in the Play in England."
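
The sample output above shows two cosmetic artifacts of the template: the phrase "is A young woman" and a doubled period after the description. The following is a minimal sketch of one way to avoid them, assuming a hypothetical helper named `build_text` that is not part of the original notebook. The description in the example is a shortened version of the first row. The rest of this notebook keeps the original template, so its embeddings and answers were produced from the text shown above.

```python
from typing import Mapping


def build_text(row: Mapping[str, str]) -> str:
    """Hypothetical helper: join the fields of one row into a single passage."""
    # Drop a trailing period so the template does not produce "..".
    description = str(row["Description"]).strip().rstrip(".")
    return (
        f"{row['Name']}: {description}. "
        f"This character appears in a {row['Medium']} set in {row['Setting']}."
    )


# Shortened version of the first row, for illustration only.
sample_row = {
    "Name": "Emily",
    "Description": "A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter.",
    "Medium": "Play",
    "Setting": "England",
}

sample_text = build_text(sample_row)
print(sample_text)
```

Because a pandas row supports the same key lookup, the helper could be applied with `df.apply(build_text, axis=1)`.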


In [17]:
# Configure the OpenAI client before embedding the text column
import openai

openai.api_base = "https://openai.vocareum.com/v1"

# For security reasons, the API key was removed after the project was completed.

openai.api_key = "YOUR API KEY"

In [19]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head(2)

Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England,"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. This character is shown in the Play in England.","[-0.017194269225001335, -0.010352705605328083, -0.0069706495851278305, -0.021389568224549294, -0.04030068218708038, 0.023480761796236038, -0.008732674643397331, 0.022951509803533554, -0.005934733431786299, -0.01626484841108322, -0.0013215190265327692, -0.007073918357491493, 0.01194691937416792, ..."
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England,"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. This character is shown in the Play in England.","[0.0052487291395664215, -0.018989197909832, -0.0010948418639600277, -0.030919311568140984, -0.03422744572162628, 0.016553683206439018, -0.008296377956867218, 0.007801460567861795, 0.005593868903815746, -0.017673760652542114, 0.005375714506953955, 0.004337039310485125, 0.0028539150953292847, -0.0..."
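
The batching loop above can be separated from the API call so that it can be checked without network access. This is a minimal sketch under stated assumptions: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake embedder returns a single number per text rather than a real embedding.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Hypothetical helper: call embed_batch on slices of at most batch_size texts."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        # Guard against a response that returns fewer or more rows than sent.
        if len(vectors) != len(batch):
            raise ValueError(f"expected {len(batch)} embeddings, got {len(vectors)}")
        embeddings.extend(vectors)
    return embeddings


calls: List[int] = []


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for illustration: one number per text (its length)."""
    calls.append(len(batch))  # record the batch size to show how texts are split
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(calls)    # [2, 2, 1]
```

In the notebook, `embed_batch` would wrap the `openai.Embedding.create` call shown above and return the `"embedding"` field of each item in the response.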


## Custom Query Completion

In [20]:
def get_cosine_distance(question, df):
    """
    Rank the rows of df by relevance to the question.

    The function embeds the user's question, copies the original DataFrame,
    adds a "distances" column holding the cosine distance between the question
    embedding and each row's embedding, and returns the copy sorted by
    ascending distance, so the most relevant texts (those closest in meaning)
    come first.
    """

    # Get the embedding for question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Copy the current dataframe. Create distances column
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(question_embeddings,
                                                df_copy["embeddings"].values,
                                                distance_metric="cosine")

    # Sort in ascending order: a smaller distance means a more relevant row
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
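
To make the ranking step concrete, here is a minimal sketch that computes cosine distance by hand on toy two-dimensional vectors. It assumes the "cosine" metric means 1 minus cosine similarity (SciPy's definition); `cosine_distance` and `rank_by_distance` are hypothetical names and do not reproduce the library's implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: Sequence[float], rows: Sequence[Sequence[float]]
) -> List[Tuple[float, int]]:
    """Return (distance, row index) pairs sorted by ascending distance."""
    return sorted((cosine_distance(query, row), i) for i, row in enumerate(rows))


query = [1.0, 0.0]
rows = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_distance(query, rows)
print([index for _, index in ranking])  # [2, 1, 0]
```

Row 2 points in the same direction as the query, so its distance is 0 even though it is twice as long: cosine distance compares direction only and ignores magnitude.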

In [33]:
def get_relevant_context(prompt_template, question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return the list of most relevant texts that fit, together
    with the prompt template and the question, within max_token_count tokens.
    """
    # Tokenizer used to count tokens with tiktoken
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Tokens already used by the prompt template and the question
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))

    # List of contexts to send to OpenAI
    context = []
    for text in get_cosine_distance(question, df)["text"].values:
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        # Append the text only if the running total stays within the token limit
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    return context
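
The budget logic above is a greedy selection: texts are taken in ranked order until the next one would exceed the limit. The sketch below isolates that idea, assuming a hypothetical helper `pack_contexts` and a stand-in counter that counts whitespace-separated words instead of real tokens, so it runs without downloading a tokenizer.

```python
from typing import Callable, Iterable, List


def pack_contexts(
    ranked_texts: Iterable[str],
    count_tokens: Callable[[str], int],
    base_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Hypothetical helper: keep ranked texts until the next one would exceed max_tokens."""
    used = base_tokens
    selected: List[str] = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first text that does not fit
        used += cost
        selected.append(text)
    return selected


def count_words(text: str) -> int:
    """Stand-in counter for illustration: words, not real tokens."""
    return len(text.split())


ranked = ["Emily is an actress", "Jack is a businessman", "Tom is a retired soldier"]
selected = pack_contexts(ranked, count_words, base_tokens=3, max_tokens=12)
print(selected)  # ['Emily is an actress', 'Jack is a businessman']
```

Stopping at the first text that does not fit keeps the selection a strict prefix of the ranking. A shorter, lower-ranked text that would still fit is skipped, which favors relevance over filling the budget.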

In [24]:
def prompt_and_context(question, df, max_token_count):
    """
    Format the prompt template and add the relevant contexts that guide the chatbot in answering the user's question.
    This is a zero-shot prompt: it contains no worked examples.
    """

    # Prompt template to instruct the chatbot
    prompt_template = """
    You are a smart assistant to answer the question based on provided context. \
    If the question can not be answered based on the provided contexts, only say \
    "The question is out of scope. Could you please check your question or ask another question". Do not try to \
    answer the question out of the provided contexts.
    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    # Get the relevant context
    context = get_relevant_context(prompt_template=prompt_template, question=question,
                                   df=df, max_token_count=max_token_count)
    # Format the prompt template
    prompt_template = prompt_template.format("\n\n###\n\n".join(context), question)

    return prompt_template
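
To show what the formatted prompt looks like, here is a minimal sketch of the join-and-format step on its own. It assumes a shortened stand-in template (`TEMPLATE`) and a hypothetical helper `build_prompt`. The two context strings are abbreviated from the dataset rows shown earlier.

```python
from typing import List

SEPARATOR = "\n\n###\n\n"

# Shortened stand-in template; the notebook's template carries longer instructions.
TEMPLATE = "Context:\n\n{}\n\n---\n\nQuestion: {}\nAnswer:"


def build_prompt(contexts: List[str], question: str) -> str:
    """Hypothetical helper: join the contexts, then fill both template slots."""
    return TEMPLATE.format(SEPARATOR.join(contexts), question)


prompt = build_prompt(
    [
        "Emily is an aspiring actress and Alice's daughter.",
        "Alice is married to Jack.",
    ],
    "Who is the mother of Emily?",
)
print(prompt)
```

Because `str.format` parses only the template, braces inside a context or a question pass through unchanged. The template itself must not contain literal braces other than the two placeholders.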

## Custom Performance Demonstration

### Question 1

In [42]:
question_1 = "Who is the mother of Emily?"

In [43]:
# The bare question "question_1" is sent to OpenAI without any context
# Thus, the model cannot tell which Emily is meant
answer1_without_context = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_1,
    max_tokens=150
)
answer1_without_context["choices"][0]["text"]

'\n\nThere is not enough information provided to answer this question. There are likely many people named Emily with different mothers.'

In [44]:
df['text'].iloc[0]

"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. This character is shown in the Play in England."

In [45]:
# The question is sent along with the relevant contexts
# Thus, the response is as expected (Alice is Emily's mother)
answer1_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_and_context(question_1, df, 2000),
    max_tokens=150
)
answer1_customized["choices"][0]["text"]

' Alice.'

### Question 2

In [39]:
question_2 = "Who is a middle-aged man?"

In [40]:
# The bare question "question_2" is sent to OpenAI without any context
# Thus, the response is a generic definition rather than an answer about the dataset
answer2_without_context = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_2,
    max_tokens=150
)
answer2_without_context["choices"][0]["text"]

'\n\nA middle-aged man is typically defined as someone between the ages of 40 and 65 years old. This age bracket is considered the transition period between young adulthood and old age and is generally characterized by a stable career, established relationships, and physical changes such as a decrease in energy and a greying of hair. '

In [37]:
df['text'].iloc[1]

"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. This character is shown in the Play in England."

In [41]:
# The question is sent along with the relevant contexts
# Thus, the response is as expected (Jack is described as a middle-aged man in his 40s)
answer2_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_and_context(question_2, df, 2000),
    max_tokens=150
)
answer2_customized["choices"][0]["text"]

' Jack is a middle-aged man in his 40s.'