# Custom Chatbot Project

I chose the `character_descriptions` dataset. When its rows are supplied as context, the model can answer specific questions about the characters, for example by name. Without that context, it has no way of knowing which characters I am referring to.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv("data/character_descriptions.csv")
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [5]:
# Initialize an empty list to store the concatenated strings
concatenated_texts = []

# Loop through each row in the DataFrame
for index, row in df.iterrows():
    concatenated_text = f"{row['Name']}: {row['Description']}, {row['Medium']}, {row['Setting']}"
    concatenated_texts.append(concatenated_text)

# Create a new DataFrame with the concatenated texts
df = pd.DataFrame(concatenated_texts, columns=['text'])
df

Unnamed: 0,text
0,"Emily: A young woman in her early 20s, Emily i..."
1,"Jack: A middle-aged man in his 40s, Jack is a ..."
2,"Alice: A woman in her late 30s, Alice is a war..."
3,"Tom: A man in his 50s, Tom is a retired soldie..."
4,"Sarah: A woman in her mid-20s, Sarah is a free..."
5,"George: A man in his early 30s, George is a ch..."
6,"Rachel: A woman in her late 20s, Rachel is a s..."
7,"John: A man in his 60s, John is a retired prof..."
8,"Maria: A middle-aged Latina woman in her 40s, ..."
9,Caleb: A young African American man in his ear...


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [6]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR_API_KEY"  # SECRET: placeholder, replace with your own key

In [7]:
# Using Code from 4.24 Case Study Workspace
# Generate and save embeddings

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")
df

Unnamed: 0,text,embeddings
0,"Emily: A young woman in her early 20s, Emily i...","[-0.016379576176404953, -0.014448083005845547,..."
1,"Jack: A middle-aged man in his 40s, Jack is a ...","[0.00714816665276885, -0.022021278738975525, 0..."
2,"Alice: A woman in her late 30s, Alice is a war...","[0.005913194734603167, -0.011871776543557644, ..."
3,"Tom: A man in his 50s, Tom is a retired soldie...","[0.017878901213407516, -0.017596535384655, 0.0..."
4,"Sarah: A woman in her mid-20s, Sarah is a free...","[-0.018298335373401642, -0.02350444905459881, ..."
5,"George: A man in his early 30s, George is a ch...","[-0.020879525691270828, -0.012621497735381126,..."
6,"Rachel: A woman in her late 20s, Rachel is a s...","[-0.0037319459952414036, -0.009705585427582264..."
7,"John: A man in his 60s, John is a retired prof...","[0.019646571949124336, -0.013807781040668488, ..."
8,"Maria: A middle-aged Latina woman in her 40s, ...","[-0.006911840755492449, -0.013061589561402798,..."
9,Caleb: A young African American man in his ear...,"[0.006823256146162748, -0.029563192278146744, ..."


In [8]:
# Using Code from 4.24 Case Study Workspace
# Load embeddings from file
#df = pd.read_csv("embeddings.csv", index_col=0)
#df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
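The commented-out cell above restores the embeddings with `eval`, which executes whatever text the CSV contains. A minimal sketch of a stricter alternative follows, assuming each stored cell is a plain list literal such as `"[0.1, -0.2]"`. The helper name `parse_embedding` is hypothetical and not part of the course code.

```python
import ast


def parse_embedding(cell):
    """Turn a stringified list such as "[0.1, -0.2]" back into a list of floats."""
    # literal_eval only accepts Python literals, so function calls are rejected
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


print(parse_embedding("[0.5, -1.0, 2]"))
```

In the notebook this could be applied as `df["embeddings"].apply(parse_embedding).apply(np.array)` in place of `.apply(eval)`.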

In [9]:
# Using Code from 4.24 Case Study Workspace

from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [10]:
get_rows_sorted_by_relevance("retired soldier", df)

Unnamed: 0,text,embeddings,distances
3,"Tom: A man in his 50s, Tom is a retired soldie...","[0.017878901213407516, -0.017596535384655, 0.0...",0.166532
7,"John: A man in his 60s, John is a retired prof...","[0.019646571949124336, -0.013807781040668488, ...",0.199419
12,"Manuel: A middle-aged Hispanic man in his 50s,...","[-0.008526013232767582, -0.022947043180465698,...",0.222522
17,"Max: A white Australian man in his late 20s, M...","[0.0021642628125846386, -0.03667907789349556, ...",0.22716
1,"Jack: A middle-aged man in his 40s, Jack is a ...","[0.00714816665276885, -0.022021278738975525, 0...",0.233641
52,Captain James: The charismatic and dashing cap...,"[-0.005605380516499281, -0.020872067660093307,...",0.234605
13,"Will: A white man in his early 40s, Will is a ...","[-0.0026530236937105656, -0.046536002308130264...",0.238522
10,"Tyler: A white man in his mid-30s, Tyler is a ...","[0.013335198163986206, -0.048971496522426605, ...",0.241675
14,"Mia: A young Australian woman in her mid-20s, ...","[-0.013563292101025581, -0.022774379700422287,...",0.242706
29,James: A handsome and athletic personal traine...,"[-0.02195400930941105, -0.012563727796077728, ...",0.24416
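For intuition about the `distances` column, here is a small pure-Python sketch of cosine distance. It assumes that `distance_metric="cosine"` means the usual one minus cosine similarity. It is an illustration with toy two-dimensional vectors, not the library's implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors: same direction, then orthogonal
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing in the same direction give a distance of 0 and orthogonal vectors give 1, which is why the smallest distances in the table above correspond to the most relevant rows.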


In [11]:
# Using Code from 4.24 Case Study Workspace
# Create prompt, pay attention to max tokens

import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [12]:
print(create_prompt("Who is currently in a relationship?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Jake: A laid-back and easygoing firefighter, Jake is the quintessential good guy. He's looking for someone who shares his values of honesty and integrity, and who is looking for a stable and committed relationship. He's a bit of a hopeless romantic, and is always looking for ways to sweep his partner off their feet., Reality Show, USA

###

Chloe: A driven and ambitious attorney, Chloe is always striving for success. She's looking for someone who can match her intellect and drive, and who is supportive of her career goals. She's a bit guarded when it comes to matters of the heart, but is ready to let her guard down for the right person., Reality Show, USA

---

Question: Who is currently in a relationship?
Answer:
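Only two rows fit in the prompt above because `create_prompt` stops adding context once the token budget is used up. The sketch below isolates that greedy packing step so it can be tried without an API key. It is a simplified variant, not the notebook's function: `pack_context` and `word_count` are hypothetical names, the counter splits on whitespace as a stand-in for `tiktoken`, `budget` means the tokens left after the template and question, and the separator is charged against the budget, which `create_prompt` does not do.

```python
def pack_context(texts, budget, count_tokens, separator="\n\n###\n\n"):
    """Greedily keep texts (already sorted by relevance) while they fit the budget."""
    selected = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        # Every row after the first also pays for the separator that joins it
        if selected:
            cost += count_tokens(separator)
        if used + cost > budget:
            break
        selected.append(text)
        used += cost
    return selected, used


def word_count(text):
    """Stand-in counter (assumption): whitespace-separated words, not real tokens."""
    return len(text.split())


rows = [
    "Tom is a retired soldier",
    "John is a retired professor",
    "Jack is a successful businessman",
]
selected, used = pack_context(rows, budget=12, count_tokens=word_count)
print(selected, used)
```

With a budget of 12 the first two rows are kept and the third is dropped, mirroring how a 200-token limit left room for only Jake and Chloe.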


In [13]:
# Using Code from 4.24 Case Study Workspace
# Answer a question

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, max_answer_tokens=150):
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=question,
        max_tokens=max_answer_tokens
    )
    return response["choices"][0]["text"].strip()

def answer_question_with_context(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
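One way to keep the basic and custom answers side by side is a small comparison helper. This is a sketch under stated assumptions: `compare_answers`, `fake_basic`, and `fake_custom` are hypothetical names, and the two stand-in functions only exist so the example runs without calling the API.

```python
def compare_answers(questions, basic_fn, custom_fn):
    """Return one (question, basic answer, custom answer) tuple per question."""
    return [(q, basic_fn(q), custom_fn(q)) for q in questions]


# Stand-in answer functions (assumptions) so the sketch runs offline
def fake_basic(question):
    return "I need more context."


def fake_custom(question):
    return "Answer drawn from the dataset."


results = compare_answers(["Is Rachel in a relationship?"], fake_basic, fake_custom)
for question, basic, custom in results:
    print(f"Q: {question}\n  basic:  {basic}\n  custom: {custom}")
```

In the notebook, the stand-ins could be swapped for `answer_question` and `lambda q: answer_question_with_context(q, df)`.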

### Question 1

In [14]:
question1 = "Which characters are above 30?"

In [15]:
answer1 = answer_question(question1)
print(answer1)

It is not possible to answer this question without more context or information about which group of characters you are referring to. However, here are some examples of characters above 30 in various works of fiction:

1. Harry Potter - In the later books in the series, Harry and his friends are all above 30 years old.
2. Tony Stark (Iron Man) - In the Marvel Cinematic Universe, Tony Stark is over 30 years old during the events of the first Iron Man movie.
3. Elizabeth Bennet - In Jane Austen's Pride and Prejudice, Elizabeth is around 20 years old at the beginning of the novel, but by the end she is in her mid-30s.
4. Tyrion Lann


In [16]:
custom_answer1 = answer_question_with_context(question1, df)
print(custom_answer1)

John, Lucas, Ava, Jack, Maria, Ava, Tyler, Captain James, Manuel, Will, Duke Orsino, Sir Toby Belch, and Mrs. Mercer are all above 30.


### Question 2

In [17]:
question2 = "Who is currently in a relationship?"

In [18]:
answer2 = answer_question(question2)
print(answer2)

Without more context, it is impossible to determine who is currently in a relationship. The answer could be anyone in the world.


In [19]:
custom_answer2 = answer_question_with_context(question2, df)
print(custom_answer2)

George and Emily are currently in a relationship. Tom and Rachel are also in a relationship. Jack and Alice are married, which could also be considered a "relationship."


### Question 3

In [20]:
question3 = "Is Rachel in a relationship?"

In [21]:
answer3 = answer_question(question3)
print(answer3)

As an AI, Rachel does not have the capacity for romantic relationships. She is here to assist and answer questions.


In [22]:
custom_answer3 = answer_question_with_context(question3, df)
print(custom_answer3)

Yes, Rachel is in a relationship with Tom, as stated in the context.


### Question 4

In [26]:
question4 = "Which of the characters play an evil role?"

In [27]:
answer4 = answer_question(question4)
print(answer4)

The characters who play an evil role are usually the antagonists or villains of the story. Some examples include: 

1. Lord Voldemort in the Harry Potter series
2. The Joker in Batman
3. Cersei Lannister in Game of Thrones
4. Hannibal Lecter in The Silence of the Lambs
5. Sauron in The Lord of the Rings
6. Maleficent in Sleeping Beauty
7. Darth Vader in Star Wars
8. Ursula in The Little Mermaid
9. Iago in Othello
10. Professor Moriarty in Sherlock Holmes.


In [28]:
custom_answer4 = answer_question_with_context(question4, df)
print(custom_answer4)

I don't know.
