# Custom Chatbot Project

#### Scope
This project implements the retrieval-augmented generation (RAG) approach to customize a chatbot using the provided dataset.

#### Data source
The dataset, 'character_descriptions.csv', contains character information: name, short description, medium, and setting.

#### Reason for selection
I chose this dataset because it includes character descriptions from theater, television, and film productions, all created by an OpenAI model. Because these characters are invented rather than drawn from real productions, the LLM should have no prior knowledge of them, which makes the dataset well suited to testing the RAG approach. Additionally, if the LLM hallucinates an answer, we can easily trace it and compare it with the grounded answer.

## Data Wrangling

In [25]:
import pandas as pd
import tiktoken
import numpy as np
from openai.embeddings_utils import get_embedding, distances_from_embeddings

In [14]:
df = pd.read_csv("data/character_descriptions.csv", index_col=False)
df.head(3)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England


In [None]:
# Create a text column that describes each character

In [17]:
pd.options.display.max_colwidth = 200

In [18]:
df['text'] = df['Name'] + " is " \
+ df['Description'] + ". This character appears in the " \
+ df['Medium'] + " in " + df['Setting']

In [20]:
df.head(3)

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relat...",Play,England,"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also i..."
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England,"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. Thi..."
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's ...",Play,England,"Alice is A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worryin..."
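
The output above shows a doubled period ("Alice.. Thi...") because the descriptions already end with a period before ". This character appears" is appended. The following is a minimal sketch of one way to avoid that, assuming it is acceptable to strip a trailing period from each description first. The two-row `sample` frame is hypothetical and only mirrors the column names of the real dataset.

```python
import pandas as pd

# Hypothetical sample that mirrors the columns of character_descriptions.csv
sample = pd.DataFrame({
    "Name": ["Jack", "Sarah"],
    "Description": ["He's married to Alice.", "She can be flighty at times"],
    "Medium": ["Play", "Play"],
    "Setting": ["England", "England"],
})

# Remove surrounding whitespace and any trailing period before adding our own
description = sample["Description"].str.strip().str.rstrip(".")

sample["text"] = (
    sample["Name"] + " is " + description
    + ". This character appears in the " + sample["Medium"]
    + " in " + sample["Setting"]
)

for line in sample["text"]:
    print(line)
```

Note that applying this to the real dataframe would change the text that gets embedded, so the embeddings would need to be regenerated afterwards.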


In [None]:
# Embed the text column

In [21]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
# OpenAI API key. For security reasons, it was replaced with "some_key" after completing the exercise.
openai.api_key = "some_key"

In [23]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.head()

Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relat...",Play,England,"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also i...","[-0.01430389191955328, -0.01158975437283516, -0.01025841198861599, -0.022510621696710587, -0.04231996834278107, 0.02829906716942787, -0.006560238543897867, 0.021442975848913193, -0.006984724197536..."
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England,"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. Thi...","[0.00760793499648571, -0.0193692147731781, -0.004855364561080933, -0.02933463081717491, -0.03568219393491745, 0.022151172161102295, -0.005119846202433109, 0.007124684285372496, 0.00632797321304678..."
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's ...",Play,England,"Alice is A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worryin...","[0.005332021042704582, -0.00995962880551815, -0.016684170812368393, -0.030602866783738136, -0.04646522179245949, 0.031515996903181076, -0.0008214084664359689, 0.02061062678694725, 0.00577554106712..."
3,Tom,"A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship w...",Play,England,"Tom is A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relatio...","[0.01569385640323162, -0.015496697276830673, -0.005185281857848167, -0.013170220889151096, -0.027628548443317413, 0.044111039489507675, 0.009082457982003689, 0.002807872835546732, -0.0213720351457..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.",Play,England,"Sarah is A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive ...","[-0.015658441931009293, -0.02574547939002514, -0.01528955902904272, -0.013814027421176434, -0.03856733813881874, 0.007123255170881748, -0.015060597099363804, 0.012739178724586964, -0.0152005180716..."
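
The batching loop above can be checked without calling the API. The sketch below isolates the same `range(0, len(...), batch_size)` slicing in a helper and runs it against a stand-in embedder. Both `embed_in_batches` and `fake_embed_batch` are hypothetical names introduced here for illustration; they are not part of the notebook or of the OpenAI library.

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 100) -> List[List[float]]:
    """Embed texts in chunks of batch_size, preserving the input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        # Guard against a response that does not line up with the request
        if len(vectors) != len(batch):
            raise ValueError("embedding count does not match batch size")
        embeddings.extend(vectors)
    return embeddings

call_sizes = []

def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Stand-in for the API call: returns one 2-dimensional vector per text
    call_sizes.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]

texts = [f"character {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=100)
print(len(vectors), call_sizes)
```

With 250 texts and a batch size of 100, the last batch is shorter than the others, which is the case the slice `texts[start:start + batch_size]` handles without extra code.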


In [26]:

def get_cosine_distance_sorted(question, df):
    """
    Rank the rows of df by relevance to the question:
    - Embed the user question.
    - Copy df and add a "distances" column holding the cosine distance
      between the question embedding and each row's text embedding.
    - Sort by distance in ascending order (the smaller the distance, the more relevant the row).
    """

    # Get the embedding for user question
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Copy the current dataframe. Create distances column
    df1 = df.copy()
    df1["distances"] = distances_from_embeddings(question_embeddings,
                                                df1["embeddings"].values,
                                                distance_metric="cosine")

    # Sort in ascending order: a smaller distance means a more relevant row
    df1.sort_values("distances", ascending=True, inplace=True)
    return df1
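
To make the ranking step concrete, here is a small sketch of cosine distance computed directly with NumPy, assuming the usual definition of one minus cosine similarity. The `cosine_distance` helper and the toy 2-dimensional vectors are illustrative only; the notebook itself relies on `distances_from_embeddings`.

```python
import numpy as np

def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)

# Toy "embeddings": the query points along the x axis
query = [1.0, 0.0]
documents = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

# Smaller distance means more relevant, so sort in ascending order
ranked = sorted(documents, key=lambda name: cosine_distance(query, documents[name]))
print(ranked)
```

Cosine distance ignores vector length, which is why `[2.0, 0.0]` is treated as a perfect match for `[1.0, 0.0]`.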

## Custom Query Completion

In [28]:
# Use tiktoken to count tokens
tokenizer = tiktoken.get_encoding("cl100k_base")

In [32]:
def get_relevant_context(prompt_template:str, question:str, df: pd.DataFrame, max_token_count: int):
    """
    Select the most relevant contexts that fit within the token budget.
    Contexts are added in order of relevance for as long as the running total
    (prompt template + question + contexts) stays within max_token_count.
    Return the list of selected contexts.
    """
    
    # Count the tokens already used by the prompt template and the question
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))

    # List of contexts to send to OpenAI
    context = []
    for text in get_cosine_distance_sorted(question, df)["text"].values:
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        # Append the context only if the running total stays within the limit
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    return context
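
The budget logic can be exercised without tiktoken by swapping in a trivial counter. The sketch below uses a whitespace word count as a stand-in for real token counting, so the numbers are not comparable to actual token counts. `select_within_budget` and `count_words` are hypothetical names used only for this illustration.

```python
from typing import Callable, List

def select_within_budget(texts: List[str], base_tokens: int, max_tokens: int,
                         count_tokens: Callable[[str], int]) -> List[str]:
    """Keep texts, in order, until adding the next one would exceed max_tokens."""
    selected: List[str] = []
    used = base_tokens
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # Stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected

def count_words(text: str) -> int:
    # Stand-in for len(tokenizer.encode(text)); real token counts will differ
    return len(text.split())

# Texts are assumed to be sorted from most to least relevant
ranked_texts = [
    "Emily is an aspiring actress",
    "Jack is a successful businessman",
    "Tom is a retired soldier",
]
chosen = select_within_budget(ranked_texts, base_tokens=3, max_tokens=13,
                              count_tokens=count_words)
print(chosen)
```

Stopping at the first text that does not fit, rather than skipping it and trying shorter ones, keeps the selected contexts strictly in relevance order.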

In [38]:
def prompt_and_context(question, df, max_token_count):
    """
    Format the prompt template and add the relevant contexts so the chatbot can answer the user's question.
    This is a zero-shot prompt: it contains no example question-answer pairs.
    """

    # Prompt template to instruct the chatbot
    prompt_template = """
    You are a smart assistant that answers the question based on the provided context. \
    If the question cannot be answered based on the provided context, only say \
    "The question is out of scope. Could you please check your question or ask another question". Do not try to \
    answer the question outside the provided context.
    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    # Get the relevant context
    context = get_relevant_context(prompt_template = prompt_template, question = question, 
                                   df = df, max_token_count = max_token_count)
    # Format the prompt template
    prompt_template = prompt_template.format("\n\n###\n\n".join(context), question)

    return prompt_template
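
The assembly step can be seen in isolation with a shortened template. This is a sketch only: `TEMPLATE` is a trimmed, hypothetical version of the prompt above, and the two context strings are abbreviated from the dataset rows rather than retrieved by embedding search.

```python
# Trimmed stand-in for the full prompt template
TEMPLATE = "Context:\n\n{}\n\n---\n\nQuestion: {}\nAnswer:"

# Same separator the notebook uses between retrieved contexts
SEPARATOR = "\n\n###\n\n"

def build_prompt(contexts, question):
    """Join the contexts with the separator and fill both template slots."""
    return TEMPLATE.format(SEPARATOR.join(contexts), question)

prompt = build_prompt(
    ["Emily is in a relationship with George.", "Jack is married to Alice."],
    "Who is in the relationship with George?",
)
print(prompt)
```

Because the template is filled with `str.format`, any literal `{` or `}` added to the template text itself would need to be doubled; braces inside the contexts or the question are passed through unchanged.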

## Custom Performance Demonstration

### Question 1

In [34]:
df['text'].iloc[0]

"Emily is A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. This character appears in the Play in England"

In [35]:
question_1 = "Who is in the relationship with George?"

In [36]:
# Send question_1 to OpenAI on its own, without any context.
# The model cannot identify who George is, so it gives no answer.
answer1_without_context = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_1,
    max_tokens=150
)
answer1_without_context["choices"][0]["text"]

'\n\nIt is unclear who George is in a relationship with. More information is needed to answer this question.'

In [37]:
# Send the question together with the relevant contexts.
# The response is as expected (Emily is in a relationship with George).
answer1_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_and_context(question_1, df, 2000),
    max_tokens=150
)
answer1_customized["choices"][0]["text"]

' Emily is in a relationship with George.'

### Question 2

In [39]:
df['text'].iloc[1]

"Jack is A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.. This character appears in the Play in England"

In [40]:
question_2 = "Which man married to Alice and appears in the Play in England?"

In [43]:
# Send question_2 to OpenAI on its own, without any context.
# The response is hallucinated.
answer2_without_context = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_2,
    max_tokens=150
)
answer2_without_context["choices"][0]["text"]

'\n\nThe man who was married to Alice and appears in the play in England is John of Gaunt. He is the Duke of Lancaster and also the father of the protagonist, Henry Bolingbroke.'

In [46]:
# Send the question together with the relevant contexts.
# The response is as expected (Jack is married to Alice and appears in the Play in England).
answer2_customized = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_and_context(question_2, df, 2000),
    max_tokens=150
)
answer2_customized["choices"][0]["text"]

' Jack'
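
The comparison between baseline and customized answers can also be done programmatically. The sketch below replays the responses recorded above as plain strings and checks whether each one mentions the grounded name. The `mentions` helper is hypothetical, and a case-insensitive substring check is only a rough proxy for correctness.

```python
def mentions(answer: str, expected: str) -> bool:
    """Return True if the expected name appears in the answer (case-insensitive)."""
    return expected.lower() in answer.lower()

# Responses copied from the cells above; no API call is made here
recorded = [
    ("Emily",
     "\n\nIt is unclear who George is in a relationship with. "
     "More information is needed to answer this question.",
     " Emily is in a relationship with George."),
    ("Jack",
     "\n\nThe man who was married to Alice and appears in the play in England "
     "is John of Gaunt. He is the Duke of Lancaster and also the father of the "
     "protagonist, Henry Bolingbroke.",
     " Jack"),
]

results = [
    (expected, mentions(baseline, expected), mentions(customized, expected))
    for expected, baseline, customized in recorded
]
for expected, baseline_ok, customized_ok in results:
    print(f"{expected}: baseline={baseline_ok}, customized={customized_ok}")
```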