# Custom Chatbot Project

This custom chatbot project uses a small text dataset to demonstrate Retrieval-Augmented Generation (RAG) with generative AI models.

We use a dataset of character descriptions for the demonstration. Each row contains a name, a medium (e.g. play, movie), a setting (e.g. USA, Australia), and a short description of the character.

This type of data is well suited to RAG: the descriptions are specific, and a generative AI model cannot answer questions about these characters on its own because the relevant information is not available to it. Furthermore, since the information is mostly free text, it would be hard to devise a general method based on traditional databases to answer the many types of questions a user might ask about the characters.

The idea is that when a question is asked, the rows are sorted by their relevance to it and the most relevant ones are inserted into the prompt, which the generative model is then asked to "complete" with an answer.
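
As a rough illustration of the sorting step, the sketch below ranks a few rows by cosine distance to a question vector using only the standard library. The two-dimensional vectors, the row texts, and the helper names `cosine_distance` and `rank_by_distance` are made up for this example; the notebook itself uses OpenAI embeddings and `distances_from_embeddings`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(question_vec, rows):
    """Sort (text, vector) pairs from closest to farthest from the question."""
    return sorted(rows, key=lambda row: cosine_distance(question_vec, row[1]))


# Hypothetical 2-dimensional "embeddings", invented for illustration only
toy_rows = [
    ("Name: A, Setting: England", [1.0, 0.0]),
    ("Name: B, Setting: Texas", [0.0, 1.0]),
    ("Name: C, Setting: Australia", [0.6, 0.8]),
]
toy_question = [0.0, 1.0]

ranked = rank_by_distance(toy_question, toy_rows)
ranked_texts = [text for text, _ in ranked]
```

A smaller distance means the row points in a direction closer to the question vector, so the rows at the front of `ranked_texts` are the ones that would be placed in the prompt first.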

In [19]:
# The following are the OpenAI API Key and model names to be used for the project.

import getpass

OPENAI_API_KEY = getpass.getpass("OpenAI API Key: ")
EMBEDDING_MODEL_NAME = 'text-embedding-3-small'
COMPLETION_MODEL_NAME = 'gpt-3.5-turbo-instruct'

OpenAI API Key: ········


## Data Wrangling

In the cells below, we load the chosen dataset into a `pandas` dataframe with a column named `"text"`. This column contains all of the text data.

In [20]:
import pandas as pd
import openai

openai.api_key = OPENAI_API_KEY

In [33]:
df = (
    pd.read_csv("data/character_descriptions.csv")
    .reset_index()
    .assign(
        text = lambda x:
            "Name: " + x["Name"] + ", " +
            "Medium: " + x["Medium"] + ", " +
            "Setting: " + x["Setting"] + ", " +
            "Description: " + x["Description"]
    )
    [["index", "text"]]
)

# Send text data to the model
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

# Extract embeddings
df["embedding"] = [data["embedding"] for data in response["data"]]

In [34]:
df

| | index | text | embedding |
|---|---|---|---|
| 0 | 0 | Name: Emily, Medium: Play, Setting: England, D... | [0.042800188064575195, -0.0026875631883740425,... |
| 1 | 1 | Name: Jack, Medium: Play, Setting: England, De... | [0.014133699238300323, -0.011379247531294823, ... |
| 2 | 2 | Name: Alice, Medium: Play, Setting: England, D... | [0.03738236427307129, -0.013138728216290474, 0... |
| 3 | 3 | Name: Tom, Medium: Play, Setting: England, Des... | [0.01938999444246292, 0.021279536187648773, -0... |
| 4 | 4 | Name: Sarah, Medium: Play, Setting: England, D... | [0.04892197623848915, 0.0193181075155735, 0.00... |
| 5 | 5 | Name: George, Medium: Play, Setting: England, ... | [0.016478193923830986, 0.025896253064274788, 0... |
| 6 | 6 | Name: Rachel, Medium: Play, Setting: England, ... | [-0.0009736692882142961, -0.03268888220191002,... |
| 7 | 7 | Name: John, Medium: Play, Setting: England, De... | [0.02701561525464058, 0.018301326781511307, -0... |
| 8 | 8 | Name: Maria, Medium: Movie, Setting: Texas, De... | [-0.004943996202200651, 0.022249484434723854, ... |
| 9 | 9 | Name: Caleb, Medium: Movie, Setting: Texas, De... | [0.050504956394433975, -0.002042090753093362, ... |


## Custom Query Completion

In the cells below, we compose a custom query using the chosen dataset and retrieve results from an OpenAI `Completion` model.

In [80]:
# Requires the legacy openai package (<1.0); embeddings_utils was removed in 1.0
from openai.embeddings_utils import distances_from_embeddings
import tiktoken


def get_question_embeddings(user_question):
    # Generate the embedding response
    response = openai.Embedding.create(
        input=user_question,
        engine=EMBEDDING_MODEL_NAME
    )
    return response["data"][0]["embedding"]


def sorted_data_using_question_embeddings(question_embeddings, df):
    # Create a list containing the distances from question_embeddings
    distances = distances_from_embeddings(
        question_embeddings,
        df["embedding"],
        distance_metric="cosine"
    )
    
    context_data = (
        df.copy()
        .assign(distance = distances)
        .sort_values(by="distance", ascending=True)
    )
    
    return context_data


def construct_prompt(user_question, context_data, token_limit=800):
    
    tokenizer = tiktoken.get_encoding("cl100k_base")

    prompt_template = """
    Answer the question based on the context of character descriptions below.
    If the question can't be answered based on the context, say "I don't know".

    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(user_question))

    # Create a list to store text for context
    context_list = []

    # Loop over rows of the sorted dataframe
    for text in context_data["text"].values:

        # Append text to context_list if there is enough room
        token_count += len(tokenizer.encode(text))
        if token_count <= token_limit:
            context_list.append(text)
        else:
            # Break once we're over the token limit
            break

    # Use string formatting to complete the prompt
    prompt = prompt_template.format(
        "\n\n###\n\n".join(context_list),
        user_question  # use the function argument, not the global USER_QUESTION
    )
    
    return prompt
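
The token-budget loop in `construct_prompt` can be tried without API access. The sketch below isolates that logic under stated assumptions: `pack_context` and `count_words` are hypothetical names, and `count_words` is a stand-in that counts whitespace-separated words rather than real `tiktoken` tokens.

```python
def pack_context(texts, fixed_token_count, token_limit, count_tokens):
    """Keep texts, in order, while the running total stays within the limit."""
    token_count = fixed_token_count
    kept = []
    for text in texts:
        token_count += count_tokens(text)
        if token_count > token_limit:
            # Stop at the first text that does not fit
            break
        kept.append(text)
    return kept


def count_words(text):
    # Stand-in tokenizer (assumption): counts words, not real model tokens
    return len(text.split())


# Texts are assumed to be already sorted from most to least relevant
sorted_texts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept = pack_context(
    sorted_texts,
    fixed_token_count=4,  # stands in for the template plus question tokens
    token_limit=10,
    count_tokens=count_words,
)
```

Because the loop stops at the first text that exceeds the budget, a later and shorter text is never considered, even if it would have fit. This matches the behavior of the loop above and keeps the context strictly in relevance order.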

In [81]:
def answer_question_without_context(user_question, max_tokens=150):
    
    prompt = f"""
    Answer the question about some fictional characters.
    If the question can't be answered based on the context, say "I don't know".
    
    ---

    Question: {user_question}
    Answer:"""
    
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=prompt,
        max_tokens=max_tokens,
    )
    answer = response["choices"][0]["text"].strip()
    return answer


def answer_question_using_context(user_question, data=df, max_tokens=150):
    
    # Embed the question, then sort the rows by distance to that embedding
    question_embeddings = get_question_embeddings(user_question)
    context_data = sorted_data_using_question_embeddings(question_embeddings, data)
    prompt = construct_prompt(user_question, context_data)
    
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=prompt,
        max_tokens=max_tokens,
    )
    answer = response["choices"][0]["text"].strip()
    return answer

## Custom Performance Demonstration

In the cells below, we demonstrate the performance of the custom query using 2 questions.

For each question, we show one answer generated without context and one answer generated with context.

### Question 1

In [82]:
USER_QUESTION = "Tell me about Manuel."

answer_question_without_context(USER_QUESTION)

"I don't know."

In [83]:
answer_question_using_context(USER_QUESTION)

"A middle-aged Hispanic man in his 50s, Manuel is a proud and hard-working farmer who's struggling to keep his family's farm afloat. He's fiercely loyal to his family and his community, and will do whatever it takes to protect them."

### Question 2

In [84]:
USER_QUESTION = "Which characters are set in Australia?"

answer_question_without_context(USER_QUESTION)

'The Crocodile Hunter and The Kangaroo Court Boys are set in Australia.'

In [85]:
answer_question_using_context(USER_QUESTION)

'Max'