# Custom Chatbot Project

I will use the Wikipedia article on the 2024 Summer Olympics to ask questions about the Paris Games. I selected it for its recency and because many details about the Olympics were known before 2021, while others were only revealed after the event. This provides an opportunity to ask a question whose answer might have been known in advance, namely, who provided surface-to-air missile protection for the Olympics. This was likely set up years in advance, but may not have been reported on as early as 2021.

In [3]:
import openai
import os
from dateutil.parser import parse
import pandas as pd
import requests
from scipy.spatial.distance import cosine
from typing import Union, List, Optional, Dict

In [4]:
# Model names, batch size, and output path for the embeddings CSV
embedding_model = "text-embedding-3-small" 
completion_model = 'gpt-3.5-turbo'
batch_size = 25
csv_w_embeddings = './wikipedia_embeddings.csv'

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [5]:
# Get the Wikipedia page for the 2024 Summer Olympics, since the completion model's training data stops in 2021
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024_Summer_Olympics&explaintext=1&formatversion=2&format=json")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

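# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# Prefix each row with the most recent date-only row so dated facts start with their date.
# Rows on this page are rarely preceded by a date row, so the prefix is often empty.
prefix = ""
for i, text in df["text"].items():
    # If the row already contains " – ", treat it as already prefixed
    if " – " not in text:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(text)
            prefix = text
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix
            # (write through df.at; mutating the row yielded by iterrows is not guaranteed to update df)
            df.at[i, "text"] = prefix + " – " + text
# Keep only rows that carry the " – " separator (this drops date-only rows)
df = df[df["text"].str.contains(" – ")]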
df

Unnamed: 0,text
0,– The 2024 Summer Olympics (French: Les Jeux ...
1,– Paris was awarded the Games at the 131st IO...
2,– Paris 2024 featured the debut of breaking a...
3,– The United States topped the medal table fo...
4,– Despite some controversies throughout relat...
...,...
199,1992 Winter Olympics – Albertville
200,2030 Winter Olympics – French Alps
201,– List of IOC country codes
212,"– ""Paris 2024"". Olympics.com. International O..."
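
The cleaning step above can be checked without calling the Wikipedia API. The following is a minimal sketch in plain Python: `sample_extract` and `clean_extract` are hypothetical names that do not appear in the notebook, and the sample text is made up to stand in for the `extract` field of the response.

```python
from typing import List

# Hypothetical sample standing in for the "extract" field of the API response.
sample_extract = (
    "Intro paragraph.\n"
    "\n"
    "== Bidding process ==\n"
    "Paris was awarded the Games.\n"
    "\n"
    "=== Venues ===\n"
    "Stade de France."
)


def clean_extract(extract: str) -> List[str]:
    """Split a plain-text extract into lines, dropping blank lines and '==' headings."""
    return [
        line
        for line in extract.split("\n")
        if len(line) > 0 and not line.startswith("==")
    ]


rows = clean_extract(sample_extract)
print(rows)
```

Because subsection headings such as `=== Venues ===` also start with `==`, the same prefix check removes them.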


## Custom Embeddings Database

In [7]:
# Initialize OpenAI client
from config import OPENAI_API_KEY
openai.api_key = OPENAI_API_KEY

In [8]:
# Function to get embeddings from OpenAI API
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    """
    Retrieves embeddings from OpenAI API for the given prompt using the specified embedding model.

    Args:
        prompt (Union[str, List[str]]): Input prompt or list of prompts.
        embedding_model (str): Name of the embedding model to use.

    Returns:
        List[List[float]]: List of embeddings for the input prompt(s).
    """
    try:
        response = openai.embeddings.create(
            input=prompt if isinstance(prompt, list) else [prompt],
            model=embedding_model
        )
        return [row.embedding for row in response.data]
    except Exception as e:
        print(f"Error fetching embeddings: {e}")
        return []

In [9]:
# Function to create embeddings for DataFrame
def create_embeddings(df: pd.DataFrame, embedding_model: str, batch_size: int) -> List[List[float]]:
    """
    Creates embeddings for the text data in the DataFrame using the specified embedding model.

    Args:
        df (pd.DataFrame): DataFrame containing text data.
        embedding_model (str): Name of the embedding model to use.
        batch_size (int): Size of batches for processing.

    Returns:
        List[List[float]]: List of embeddings corresponding to the text data.
    """
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size]['text'].tolist()
        embeddings = get_embeddings(batch, embedding_model)
        embeddings_output.extend(embeddings)
    return embeddings_output
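
One thing to watch with this batching loop: `get_embeddings` returns an empty list when the API call fails, so a failed batch would leave fewer embeddings than rows and the column assignment would break later. Below is a minimal sketch of the same loop with a length check. The names `embed_in_batches` and `fake_embed` are hypothetical, and `fake_embed` is a stand-in that makes no API call.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Embed texts batch by batch, failing loudly if a batch comes back short."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    output: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_fn(batch)
        # A mismatch here means the embedding call failed or dropped rows.
        if len(vectors) != len(batch):
            raise RuntimeError(
                f"expected {len(batch)} embeddings, got {len(vectors)}"
            )
        output.extend(vectors)
    return output


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the API: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(vectors)
```

Raising on a short batch keeps the embeddings aligned with the DataFrame rows instead of silently shifting them.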

In [10]:
df['embedding'] = create_embeddings(df, embedding_model, batch_size)
df.to_csv(csv_w_embeddings, sep=',', index=False)

# Display DataFrame head
print(df.head())

                                                text  \
0   – The 2024 Summer Olympics (French: Les Jeux ...   
1   – Paris was awarded the Games at the 131st IO...   
2   – Paris 2024 featured the debut of breaking a...   
3   – The United States topped the medal table fo...   
4   – Despite some controversies throughout relat...   

                                           embedding  
0  [-0.00230206036940217, -0.011534268036484718, ...  
1  [-0.014381744898855686, -0.06176183745265007, ...  
2  [0.03339226171374321, -0.028818586841225624, 0...  
3  [0.020802229642868042, 0.0154111348092556, 0.0...  
4  [0.05418844893574715, 0.0010188579326495528, -...  


## Custom Query Completion

In [11]:
# Function to build custom context
def build_custom_context(question: str, database_df: pd.DataFrame, n: int = 5) -> List[str]:
    """
    Builds custom context based on the question and the database DataFrame.

    Args:
        question (str): The question to include in the prompt.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.
        n (int): The number of closest facts to include in the context.

    Returns:
        List[str]: A list of context strings.
    """
    question_embedding = get_embeddings(question, embedding_model)[0]
    
    df = database_df.copy()
    df["distances"] = df['embedding'].apply(lambda embedding: cosine(embedding, question_embedding))

    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()
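
To make the ranking step concrete, here is a small sketch with toy two-dimensional vectors. It computes cosine distance by hand as one minus the cosine similarity, which is the same quantity `scipy.spatial.distance.cosine` returns. The helper names and the toy facts are hypothetical and are not part of the notebook.

```python
import math
from typing import List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_n(query: List[float], facts: List[Tuple[str, List[float]]], n: int) -> List[str]:
    """Return the texts of the n facts closest to the query embedding."""
    ranked = sorted(facts, key=lambda fact: cosine_distance(fact[1], query))
    return [text for text, _ in ranked[:n]]


# Toy "embeddings": the first axis loosely means medals, the second venues.
facts = [
    ("about medals", [1.0, 0.0]),
    ("about venues", [0.0, 1.0]),
    ("about medals and venues", [1.0, 1.0]),
]
query = [0.9, 0.1]

closest = top_n(query, facts, 2)
print(closest)
```

A smaller distance means the fact points in nearly the same direction as the question, which is why the sort is ascending.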

In [12]:
# Function for building the prompt
def build_prompt(question: str, csv_path: Optional[str] = None) -> List[Dict[str, str]]:
    """
    Builds a prompt for asking a question, optionally including context from a database DataFrame.

    Args:
        question (str): The question to include in the prompt.
        csv_path (Optional[str]): Path to the CSV file containing the facts and their embeddings. If None, no context is included. Retrieved facts are separated by blank lines.

    Returns:
        List[Dict[str, str]]: A list containing messages with the user role and optionally a system message with context.
    """
    if csv_path is not None:
        # Read the DataFrame from CSV file
        df = pd.read_csv(csv_path)

        # Convert embedding values from string to list of floats
        df['embedding'] = df['embedding'].apply(lambda value: [float(dim) for dim in value.replace('[', '').replace(']', '').split(',')])

        context = '\n\n'.join(build_custom_context(question, df))
        return [
            {
                'role': 'system',
                'content': f"""
                Answer the question based on the context provided. If the question cannot be answered based on provided context, say "I don't know the answer".
                Context: 
                    {context}
                """
            },
            {
                'role': 'user',
                'content': question
            }
        ]
    else:
        return [
            {
                'role': 'user',
                'content': question
            }
        ]
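
The embedding column is written to CSV as the string form of a Python list, so it has to be parsed back into floats on load. The bracket-stripping above works. As an alternative sketch, assuming every stored value is a plain bracketed list of numbers, the standard `json` module can parse it directly. `parse_embedding` is a hypothetical helper name.

```python
import json
from typing import List


def parse_embedding(value: str) -> List[float]:
    """Parse a stringified list such as '[0.1, -0.2]' back into floats."""
    parsed = json.loads(value)
    return [float(component) for component in parsed]


# str() of a list of floats is how pandas writes the column to CSV.
stored = str([-0.0023, 0.0115, 1e-05])
vector = parse_embedding(stored)
print(vector)
```

In the notebook this could be applied with `df['embedding'].apply(parse_embedding)`. A malformed cell then raises an error instead of producing a wrong vector.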

In [13]:
def handle_question(question: str, csv_path: Optional[str] = None) -> str:
    """
    Answers a question using the completion model, optionally grounding the answer in context retrieved from a CSV of embeddings.

    Args:
        question (str): The question to ask the model.
        csv_path (Optional[str]): Path to the CSV file of facts and embeddings. If None, the question is sent without context.

    Returns:
        str: The response generated by the model.
    """
    prompt = build_prompt(question, csv_path)
    response = openai.chat.completions.create(
        model=completion_model,
        messages=prompt,
        max_tokens=100
    )
    return response.choices[0].message.content

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1: A fact that would have been known prior to 2021

In [14]:
# Example usage using a fact that was likely known prior to 2021
question = 'In what year did the host country of the 2024 olympics last host the olympics?'
csv_path = './wikipedia_embeddings.csv'


# Print answer without context
print('Answer without Context: \n', handle_question(question))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(question, csv_path))

Answer without Context: 
 The host country of the 2024 Olympics, France, last hosted the Olympics in 1992 in Albertville, where they hosted the Winter Olympics.

Answer with Context: 
 The host country of the 2024 Olympics, which is France, last hosted the Olympics in 1924.


### Question 2: A fact that would not have been known prior to 2021

In [15]:
question = 'Which country won the most gold medals at the 2024 Paris Olympics?'
csv_path = './wikipedia_embeddings.csv'


# Print answer without context
print('Answer without Context: \n', handle_question(question))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(question, csv_path))



Answer without Context: 
 I'm sorry, but as of now, the 2024 Paris Olympics have not yet occurred. Therefore, I am not able to provide information on which country won the most gold medals at that Olympics.

Answer with Context: 
 The United States and China tied for winning the most gold medals at the 2024 Paris Olympics, both with 40 gold medals each.


### Question 3: A fact that might have been known prior to 2021

In [16]:
question = 'What surface-to-air missile system provided security for the 2024 Paris Olympics?'
csv_path = './wikipedia_embeddings.csv'


# Print answer without context
print('Answer without Context: \n', handle_question(question))

# Print answer with context
print('\nAnswer with Context: \n', handle_question(question, csv_path))



Answer without Context: 
 The surface-to-air missile system that provided security for the 2024 Paris Olympics was the SAMP/T (Sol-Air Moyenne Portée Terrestre) also known as the Aster missile system.

Answer with Context: 
 The British Army provided support by deploying Starstreak surface-to-air missile units for air security during the 2024 Paris Olympics.
