# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

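I am building a custom OpenAI chatbot using ML-driven prompt engineering. I wanted to ask questions about British royal events that happened in 2023, particularly King Charles III's coronation, so I picked the Wikipedia page for the year 2023. The page lists dated events from that year, which gives the model context it may not have seen during training.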

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR_API_KEY"


In [2]:
import numpy as np
import pandas as pd

In [3]:
coronation_prompt = """
Question: "Where did King Charles III coronation happen?"
Answer:
"""
initial_coronation_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=coronation_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_coronation_answer)

The coronation of King Charles III took place at Westminster Abbey in London, England on July 11, 2022.


In [4]:
queen_prompt = """
Question: "Who is King Charles' queen?"
Answer:
"""
initial_queen_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=queen_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_queen_answer)

King Charles' queen is Queen Camilla, Duchess of Cornwall.


In [5]:
# Step 1: Prepare Dataset
# Loading and Wrangling Data
# The data should be loaded into a pandas `DataFrame` called `df` where each row represents a text sample, and there is only one column, `"text"`, which contains the raw text data.
# In this particular case we are collecting data from the Wikipedia page for the year 2023 (https://en.wikipedia.org/wiki/2023) and performing some data wrangling to get it into the appropriate format.

from dateutil.parser import parse
import pandas as pd
import requests

# Set a proper User-Agent header (required by Wikipedia)
headers = {
    'User-Agent': 'MyResearchBot/1.0 (your-email@example.com)'  # Replace with your actual contact info
}

# Get the Wikipedia page for "2023"
resp = requests.get(
    "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2023&explaintext=1&formatversion=2&format=json",
    headers=headers
)

# Check if the request was successful
if resp.status_code != 200:
    raise Exception(f"Request failed with status code: {resp.status_code}")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix
            df.at[i, "text"] = prefix + " – " + row["text"]
            
df = df[df["text"].str.contains(" – ")]
df

Unnamed: 0,text
0,– 2023 (MMXXIII) was a common year starting o...
1,– Catastrophic natural disasters in 2023 incl...
2,– The Russian invasion of Ukraine and Myanmar...
3,– A banking crisis resulted in the collapse o...
11,January 1 – Croatia adopts the euro and joins ...
...,...
324,"Economics – Claudia Goldin, for her empirical ..."
325,"Literature – Jon Fosse, for his innovative pla..."
326,"Peace – Narges Mohammadi, for her works on the..."
327,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."
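
The date-prefix loop in the cell above can be traced on a toy list. This is a minimal sketch, not the notebook's code: the function name `attach_date_prefixes` and the sample lines are made up for illustration, and it assumes `dateutil` raises a `ValueError` (or subclass) for text that is not a date.

```python
from dateutil.parser import parse

def attach_date_prefixes(lines):
    """Prepend the most recent date heading to each undated line."""
    prefix = ""
    result = []
    for text in lines:
        if " – " in text:
            # Already carries its own date prefix
            result.append(text)
            continue
        try:
            # A line that parses as a date is treated as a heading
            parse(text)
            prefix = text
        except (ValueError, OverflowError):
            # Anything else is an event that inherits the current heading
            result.append(prefix + " – " + text)
    return result

# Hypothetical sample lines, shaped like the Wikipedia extract
lines = [
    "January 1",
    "Croatia adopts the euro.",
    "January 5 – The funeral is held.",
    "May 6",
    "The coronation is held.",
]
result = attach_date_prefixes(lines)
for line in result:
    print(line)
```

Lines that appear before any date heading receive an empty prefix, which is why the first rows of the dataframe start with a bare "–".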


Generating Embeddings
We'll use OpenAI's Embedding API here to create a vector representing each row of our custom dataset.

To avoid a `RateLimitError`, we'll send our data in batches to the `Embedding.create` function.
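
The batching idea can be sketched without calling the API. The helper below is a hypothetical illustration (`iter_batches` is not part of the `openai` library); it yields consecutive slices of at most `batch_size` items, which mirrors what a `range(0, len(df), batch_size)` loop does for dataframe rows.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder strings standing in for the dataset's text column
texts = [f"row {n}" for n in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_sizes)  # [100, 100, 50]
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no padding or special casing is needed.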

In [6]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– 2023 (MMXXIII) was a common year starting o...,"[0.0036799046210944653, -0.0138942189514637, -..."
1,– Catastrophic natural disasters in 2023 incl...,"[-0.022169683128595352, -0.00321850529871881, ..."
2,– The Russian invasion of Ukraine and Myanmar...,"[-0.01289023831486702, -0.011169254779815674, ..."
3,– A banking crisis resulted in the collapse o...,"[-0.03203907236456871, -0.011590203270316124, ..."
11,January 1 – Croatia adopts the euro and joins ...,"[0.01309617143124342, -0.020668700337409973, 0..."
...,...,...
324,"Economics – Claudia Goldin, for her empirical ...","[-0.01694655232131481, -0.007920671254396439, ..."
325,"Literature – Jon Fosse, for his innovative pla...","[-0.009484800510108471, 0.01768663339316845, 0..."
326,"Peace – Narges Mohammadi, for her works on the...","[-0.013250409625470638, -0.012954494915902615,..."
327,"Physics – Pierre Agostini, Ferenc Krausz & Ann...","[-0.02043410949409008, 0.014828776940703392, 0..."


In [7]:
df.to_csv("embeddings.csv")

In [8]:
! ls

data  embeddings.csv  project.ipynb
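
One caveat when saving embeddings this way: CSV stores each list as a string, so the column has to be parsed when the file is read back. A minimal sketch, using an in-memory buffer and made-up two-dimensional vectors rather than the real `embeddings.csv`:

```python
import ast
import io

import pandas as pd

# Toy dataframe with invented values, shaped like the notebook's df
toy = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, -0.2], [0.3, 0.4]],
})

# Round-trip through CSV in memory instead of writing a file
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
raw = pd.read_csv(buffer, index_col=0)

# After reading, each embedding is a string such as "[0.1, -0.2]"
was_string = isinstance(raw["embeddings"].iloc[0], str)

# Convert the strings back into Python lists of floats
raw["embeddings"] = raw["embeddings"].apply(ast.literal_eval)
print(was_string, raw["embeddings"].iloc[0])
```

Passing `index_col=0` restores the original row labels, which `to_csv` writes as the first, unnamed column.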


Step 2: Create a Function that Finds Related Pieces of Text for a Given Question
What we are implementing here is similar to a search engine or recommendation algorithm. We want to sort all of the rows of our dataset from most relevant to least relevant.

This will use the embeddings that we generated previously in order to compare the vectorized version of our question to the vectorized versions of the rows of the dataset.
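
To make the comparison concrete, here is a small sketch of cosine distance on invented two-dimensional vectors. It assumes cosine distance is defined as one minus cosine similarity; the function name `cosine_distance` and the vectors are illustrative only, not the library's internals.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented vectors: one "question" and three "rows"
question_vec = [1.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {name: cosine_distance(question_vec, vec) for name, vec in row_vecs.items()}

# Smaller distance means more relevant, so sort ascending
ranked = sorted(distances, key=distances.get)
print(ranked)
```

Only the direction of a vector matters for this metric, so the row pointing the same way as the question ranks first regardless of its length.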

In [9]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [10]:
# Let's test that out on a question - Example 1
get_rows_sorted_by_relevance("Where did King Charles III coronation happen?", df)

Unnamed: 0,text,embeddings,distances
124,May 6 – The coronation of Charles III and Cami...,"[0.004555157385766506, -0.0024165711365640163,...",0.121887
302,December 31 – Queen Margrethe II of Denmark an...,"[-0.006673671770840883, -0.02460246905684471, ...",0.227244
12,January 5 – The funeral of Pope Benedict XVI i...,"[0.007029644679278135, 0.002505552489310503, -...",0.232392
280,"November 23 – Riots broke out in Dublin, Irela...","[-0.003895672271028161, -0.0005883423145860434...",0.243534
23,January 20 – The Parliament of Trinidad and To...,"[-0.003036879003047943, -0.005730754230171442,...",0.247604
...,...,...,...
326,"Peace – Narges Mohammadi, for her works on the...","[-0.013250409625470638, -0.012954494915902615,...",0.323895
288,December 6 – Google DeepMind releases the Gemi...,"[-0.022914431989192963, 0.011136173270642757, ...",0.324059
26,January 21 – Tigray War: Eritrean forces withd...,"[-0.0053468444384634495, -0.012855730019509792...",0.325588
248,October 11 – ExxonMobil announces it will acqu...,"[-0.009091373533010483, -0.0194711834192276, 0...",0.332023


In [11]:
# Let's test that out on a question - Example 2
get_rows_sorted_by_relevance("Who is King Charles' queen?", df)

Unnamed: 0,text,embeddings,distances
124,May 6 – The coronation of Charles III and Cami...,"[0.004555157385766506, -0.0024165711365640163,...",0.149228
302,December 31 – Queen Margrethe II of Denmark an...,"[-0.006673671770840883, -0.02460246905684471, ...",0.191773
23,January 20 – The Parliament of Trinidad and To...,"[-0.003036879003047943, -0.005730754230171442,...",0.241233
282,November 27 – After forming a coalition Gover...,"[-0.0056718578562140465, -0.005239285994321108...",0.251400
27,January 25 – Chris Hipkins succeeds Jacinda Ar...,"[-0.01631171815097332, 0.012749671004712582, -...",0.255367
...,...,...,...
46,February 6 – A 7.8 Mww earthquake strikes sout...,"[-0.0024857702665030956, -0.017595678567886353...",0.327867
26,January 21 – Tigray War: Eritrean forces withd...,"[-0.0053468444384634495, -0.012855730019509792...",0.328582
88,March 31 – April 1 – A historic and widespread...,"[-0.03209320455789566, -0.02319134585559368, -...",0.329385
171,July 3 – In the largest incursion by Israel in...,"[-0.0222729854285717, 0.006456935778260231, 0....",0.333539


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [12]:
# Step 3: Create a Function that Composes a Text Prompt
# 
# Building on that sorted list of rows, we're going to create a text prompt that provides context to a `Completion` model in order to help it answer a question. The outline of the prompt looks like this:

# ```
# Answer the question based on the context below, and if the
# question can't be answered based on the context, say "I don't
# know"
# 
# Context:
# 
# {context}
# 
# ---
# 
# Question: {question}
# Answer:
# ```
# We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model (roughly 4,000, shared between the prompt and the answer). So we'll loop over the rows in order of relevance, counting the tokens as we go, and stop when we hit the limit. Then we'll join that list of text data into a single string and add it to the prompt.

import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
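
The token-budget loop can be isolated into a short sketch. To stay self-contained it counts whitespace-separated words instead of real tokens, which is an assumption for illustration only; the names `select_context` and `word_count` are hypothetical and do not appear in the notebook.

```python
def select_context(rows, count_tokens, max_tokens, used_tokens=0):
    """Greedily keep rows, in order, until the next one would exceed max_tokens."""
    selected = []
    for text in rows:
        used_tokens += count_tokens(text)
        if used_tokens > max_tokens:
            break
        selected.append(text)
    return selected

def word_count(text):
    # Stand-in for a real tokenizer: counts words, not tokens
    return len(text.split())

rows = ["one two three", "four five", "six seven eight nine", "ten"]

# used_tokens=1 plays the role of the template and question overhead
picked = select_context(rows, word_count, max_tokens=6, used_tokens=1)
print(picked)  # ['one two three', 'four five']
```

The loop stops at the first row that does not fit, even if a shorter row further down would still fit. That keeps the context strictly in relevance order at the cost of sometimes leaving part of the budget unused.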

Now let's test that out! We'll use a max_token_count below the actual limit just to keep the output shorter and more readable.

In [13]:
print(create_prompt("Who is King Charles queen?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

May 6 – The coronation of Charles III and Camilla as King and Queen of the United Kingdom and the other Commonwealth realms is held in Westminster Abbey, London.

###

December 31 – Queen Margrethe II of Denmark announces her abdication effective January 14, 2024, after 52 years on the throne.

###

January 20 – The Parliament of Trinidad and Tobago elects former senate president, minister and lawyer Christine Kangaloo as president in a 48–22 vote.

---

Question: Who is King Charles queen?
Answer:


In [14]:
print(create_prompt("Where did King Charles III coronation happen?", df, 100))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

May 6 – The coronation of Charles III and Camilla as King and Queen of the United Kingdom and the other Commonwealth realms is held in Westminster Abbey, London.

---

Question: Where did King Charles III coronation happen?
Answer:


Step 4: Create a Function that Answers a Question
Our final step is to send that text prompt to a Completion model and parse the model output!

In [15]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

Now test it out!

In [16]:
custom_coronation_answer = answer_question("Where did King Charles III coronation happen?", df)
print(custom_coronation_answer)

The coronation of King Charles III took place in Westminster Abbey, London.


In [17]:
custom_queen_answer = answer_question("Who is King Charles queen?", df)
print(custom_queen_answer)

Camilla


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [18]:
coronation_prompt = """
Question: "Where did King Charles III coronation happen?"
Answer:
"""
initial_coronation_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=coronation_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_coronation_answer)

King Charles III's coronation took place at Westminster Abbey in London, England.


In [19]:
custom_coronation_answer = answer_question("Where did King Charles III coronation happen?", df)
print(custom_coronation_answer)

The coronation of King Charles III and Camilla took place at Westminster Abbey in London.


### Question 2

In [20]:
queen_prompt = """
Question: "Who is King Charles' queen?"
Answer:
"""
initial_queen_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=queen_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_queen_answer)

King Charles does not currently have a queen, as he has not been married. His mother, Queen Elizabeth II, is the current monarch of England.


In [21]:
custom_queen_answer = answer_question("Who is King Charles' queen?", df)
print(custom_queen_answer)

Queen Camilla


#### Conclusion


The custom queries returned correct answers to both questions because the prompt supplied relevant context: the Wikipedia page for 2023, which covers the coronation of King Charles III.
The baseline model was less reliable. For the queen question it claimed that King Charles has no queen and that Queen Elizabeth II is the current monarch, and an earlier baseline run for the coronation question gave a date of July 11, 2022.
This is consistent with the model's training data not extending into 2023, in which case it would not know about the coronation on May 6. Once the relevant rows are included in the prompt, the model comes back with the right answers.