# Custom Chatbot Project

For this exercise, I chose `character_descriptions.csv` from the data folder. The dataset contains short text descriptions of characters from theater, television, and film productions. These descriptions are stored in the `Description` column, which is renamed to `text`; the other columns are not needed and are dropped. After cleaning, the dataset has 55 rows, which satisfies the requirement of at least 20 rows.

## Data Wrangling

**In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.**



In [13]:
import os
import pandas as pd
import openai
import numpy as np

openai.api_key = "YOUR API KEY"

In [14]:
file_path = 'data/character_descriptions.csv'
df = pd.read_csv(file_path)

df.drop(columns=["Name", "Medium", "Setting"], inplace=True)
df.rename(columns={"Description": "text"}, inplace=True)
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
df.head()

Unnamed: 0,text
0,"A young woman in her early 20s, Emily is an as..."
1,"A middle-aged man in his 40s, Jack is a succes..."
2,"A woman in her late 30s, Alice is a warm and n..."
3,"A man in his 50s, Tom is a retired soldier and..."
4,"A woman in her mid-20s, Sarah is a free-spirit..."


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55 entries, 0 to 54
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    55 non-null     object
dtypes: object(1)
memory usage: 880.0+ bytes


## Custom Query Completion

**In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model.**

First, we prepare the data by generating embeddings. We use OpenAI's `Embedding` API to create a vector representing each row of our custom dataset. To reduce the chance of a `RateLimitError`, we send the rows to `Embedding.create` in batches.

In [16]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([data["embedding"] for data in response["data"]])
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"A young woman in her early 20s, Emily is an as...","[-0.01689424365758896, -0.00756641011685133, -..."
1,"A middle-aged man in his 40s, Jack is a succes...","[0.00688925851136446, -0.019344933331012726, 0..."
2,"A woman in her late 30s, Alice is a warm and n...","[0.004401929676532745, -0.004343237262219191, ..."
3,"A man in his 50s, Tom is a retired soldier and...","[0.016953710466623306, -0.008723313920199871, ..."
4,"A woman in her mid-20s, Sarah is a free-spirit...","[-0.018364207819104195, -0.021533139050006866,..."
5,"A man in his early 30s, George is a charming a...","[-0.022485816851258278, -0.008806727826595306,..."
6,"A woman in her late 20s, Rachel is a shy and i...","[-0.005041478667408228, -0.006056156009435654,..."
7,"A man in his 60s, John is a retired professor ...","[0.022199273109436035, -0.010924174450337887, ..."
8,"A middle-aged Latina woman in her 40s, Maria i...","[-0.0074712433852255344, -0.000887294881977140..."
9,"A young African American man in his early 20s,...","[0.0017755168955773115, -0.018588444218039513,..."


In order to avoid having to run that code again in the future, we'll save the generated embeddings as a CSV file.

In [17]:
df.to_csv("data/embeddings.csv")

In [18]:
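import ast
df = pd.read_csv("data/embeddings.csv", index_col=0)
# Parse each stringified list safely: literal_eval only accepts Python literals
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)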
df

Unnamed: 0,text,embeddings
0,"A young woman in her early 20s, Emily is an as...","[-0.01689424365758896, -0.00756641011685133, -..."
1,"A middle-aged man in his 40s, Jack is a succes...","[0.00688925851136446, -0.019344933331012726, 0..."
2,"A woman in her late 30s, Alice is a warm and n...","[0.004401929676532745, -0.004343237262219191, ..."
3,"A man in his 50s, Tom is a retired soldier and...","[0.016953710466623306, -0.008723313920199871, ..."
4,"A woman in her mid-20s, Sarah is a free-spirit...","[-0.018364207819104195, -0.021533139050006866,..."
5,"A man in his early 30s, George is a charming a...","[-0.022485816851258278, -0.008806727826595306,..."
6,"A woman in her late 20s, Rachel is a shy and i...","[-0.005041478667408228, -0.006056156009435654,..."
7,"A man in his 60s, John is a retired professor ...","[0.022199273109436035, -0.010924174450337887, ..."
8,"A middle-aged Latina woman in her 40s, Maria i...","[-0.0074712433852255344, -0.000887294881977140..."
9,"A young African American man in his early 20s,...","[0.0017755168955773115, -0.018588444218039513,..."


The next step is to create a function that finds the rows of text most related to a given question. It uses the embeddings generated earlier to compare the vectorized question with the vectorized rows of the dataset.

In [19]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    (smallest cosine distance first)
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In the following cell, we can see the result returned by that function when we pass a question about our dataset. The rows are sorted from smallest to largest distance, that is, from most to least relevant. The answer to the question appears in the first, most relevant row, which suggests that the retrieval is working as intended.

In [20]:
get_rows_sorted_by_relevance("Who is Emily in a relationship with?", df)

Unnamed: 0,text,embeddings,distances
0,"A young woman in her early 20s, Emily is an as...","[-0.01689424365758896, -0.00756641011685133, -...",0.127694
5,"A man in his early 30s, George is a charming a...","[-0.022485816851258278, -0.008806727826595306,...",0.1572
2,"A woman in her late 30s, Alice is a warm and n...","[0.004401929676532745, -0.004343237262219191, ...",0.169843
6,"A woman in her late 20s, Rachel is a shy and i...","[-0.005041478667408228, -0.006056156009435654,...",0.196861
33,"A laid-back and easygoing firefighter, Jake is...","[-0.019588826224207878, -0.01592748612165451, ...",0.213937
4,"A woman in her mid-20s, Sarah is a free-spirit...","[-0.018364207819104195, -0.021533139050006866,...",0.213947
3,"A man in his 50s, Tom is a retired soldier and...","[0.016953710466623306, -0.008723313920199871, ...",0.214365
26,A confident and charismatic marketing executiv...,"[-0.0065472922287881374, -0.010503015480935574...",0.215254
32,"A driven and ambitious attorney, Chloe is alwa...","[-0.0116198118776083, -0.0031070932745933533, ...",0.219871
1,"A middle-aged man in his 40s, Jack is a succes...","[0.00688925851136446, -0.019344933331012726, 0...",0.220297
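
To make the `distances` column easier to interpret, here is a minimal sketch of the ranking idea on toy vectors. It assumes the usual definition of cosine distance as one minus cosine similarity; `cosine_distance` is a hypothetical helper written for this illustration, and the two-dimensional vectors are made up rather than real model output.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D "embeddings" (made-up values, not real model output)
query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means the row points in a direction closer to the query
distances = {name: cosine_distance(query, vec) for name, vec in rows.items()}
ranked = sorted(distances, key=distances.get)
print(ranked)
```

Vector length does not affect the result, only direction, which is why the row `[2.0, 0.0]` is treated as identical in direction to the query.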


The next step is to create a function that composes a text prompt. The prompt provides context to a Completion model to help it answer a question.

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the Completion model, which is currently about 4,000. So we loop over the rows in order of relevance, counting tokens as we go, and stop as soon as a row would exceed the limit. Then we join the collected rows into a single string and add it to the prompt.

In [21]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

The next cell shows the resulting prompt with a budget of 200 tokens, which leaves room for only the two most relevant rows:

In [22]:
print(create_prompt("Who is Emily in a relationship with?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.

###

A man in his early 30s, George is a charming and charismatic businessman who is in a relationship with Emily. He's ambitious, confident, and always looking for the next big opportunity. However, he's also prone to bending the rules to get what he wants.

---

Question: Who is Emily in a relationship with?
Answer:
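
The token-budget loop in `create_prompt` can be sketched on its own, under simplifying assumptions. Here `select_context` and `count_words` are hypothetical names, and the counter splits on whitespace instead of using `tiktoken`, so the numbers are word counts rather than real model tokens. The `reserved` argument plays the role of the tokens already used by the template and the question.

```python
def select_context(texts, count_tokens, budget, reserved=0):
    """Keep texts in order until adding the next one would exceed `budget`."""
    selected, used = [], reserved
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected, used

def count_words(text):
    """Stand-in counter: whitespace-separated words, NOT real model tokens."""
    return len(text.split())

texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected, used = select_context(texts, count_words, budget=6)
print(selected, used)
```

Because the loop stops at the first text that does not fit, the short final text is skipped even though it would still fit in the remaining budget. With rows sorted by relevance this keeps the selection a contiguous run of the most relevant rows, at the cost of sometimes leaving part of the budget unused.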


The final step is to send that text prompt to a Completion model and parse the model output.

In [23]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

**In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.**

### Question 1 - Who is Emily in a relationship with?

#### Text with the answer to verify that it is correct

In [29]:
df['text'][0]

"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George."

#### Custom query

In [24]:
question_1 = answer_question("Who is Emily in a relationship with?", df)
question_1

Emily is in a relationship with George.


#### Basic query

In [30]:
question_1_prompt = """
Question: "Who is Emily in a relationship with?"
Answer:
"""
question_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_1_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
question_1

'I am an AI and I do not have personal relationships. I am not able to determine who Emily is in a relationship with. Can you please provide more context or information?'

### Question 2 - What is Karma's ability?

#### Text with the answer to verify that it is correct

In [26]:
df['text'][24]

"A chameleon-like performer, Karma is known for her ability to transform herself into any character. She's a master of illusion and is always pushing boundaries with her looks and performances, but can sometimes struggle with authenticity and staying true to herself. She's also a friend of Dolly, often offering her a listening ear when she needs it."

#### Custom query

In [27]:
question_2 = answer_question("What is Karma's ability?", df)
question_2

'Karma is known for her ability to transform herself into any character.'

#### Basic query

In [31]:
question_2_prompt = """
Question: "What is Karma's ability?"
Answer:
"""
question_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question_2_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
question_2

'There is not enough information given to accurately answer this question.'
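
When comparing several questions, a small helper can keep each question paired with its two answers. The sketch below is one possible shape, not part of the notebook: `compare_answers`, `fake_custom`, and `fake_basic` are hypothetical names, and the stub functions stand in for the real API-backed calls so that the example runs offline.

```python
def compare_answers(questions, custom_fn, basic_fn):
    """Return one record per question with both answers side by side."""
    return [
        {"question": q, "custom": custom_fn(q), "basic": basic_fn(q)}
        for q in questions
    ]

# Stub callables standing in for the real API-backed functions
def fake_custom(question):
    return f"custom answer to: {question}"

def fake_basic(question):
    return f"basic answer to: {question}"

results = compare_answers(["Q1?", "Q2?"], fake_custom, fake_basic)
for row in results:
    print(row)
```

In the notebook, `custom_fn` could be `lambda q: answer_question(q, df)`, and `basic_fn` could wrap the plain `openai.Completion.create` call. Building each prompt from the question inside the function avoids having to keep separate prompt variables in sync by hand.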