# Custom Chatbot with RAG

This repository contains a minimal example of building a **Retrieval‑Augmented Generation (RAG)** chatbot that answers questions about a custom knowledge base – in this case, fictional character descriptions stored in `data/character_descriptions.csv`.

The notebook illustrates how to:

1. Turn raw CSV data into dense **OpenAI embeddings**
2. Perform a lightweight **similarity search** to retrieve the most relevant chunks
3. Build a token‑aware **prompt** that feeds the retrieved context to an LLM
4. Compare answers *with* and *without* retrieval to see the gain in factuality

## Why this dataset?

* It differs from the Wikipedia example used in the lesson, so it requires a different preparation phase.
* Its content is entirely invented, so the model cannot have seen it during training. Any correct answer must come from the retrieved context, which makes the effect of RAG easy to see.

In [1]:
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken
import os
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
openai.api_base = os.environ.get("OPENAI_API_BASE")
openai.api_key = os.environ.get("OPENAI_API_KEY")

if not openai.api_base or not openai.api_key:
    raise ValueError("OPENAI_API_BASE and OPENAI_API_KEY environment variables must be set.")
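
The cells in this notebook use the pre-1.0 `openai` Python interface (`openai.Embedding.create`, `openai.Completion.create`, `openai.embeddings_utils`), which was removed in version 1.0 of the package. A rough sketch of the equivalent calls for the 1.x client follows. The helper names `embed_texts` and `complete` are hypothetical, and `client` is assumed to be an `openai.OpenAI` instance built from the same two environment variables.

```python
# Sketch only: possible equivalents of the notebook's API calls for openai>=1.0.
# `embed_texts` and `complete` are hypothetical helper names, not library functions.
#
# Assumed client construction (needs the `openai` package, version 1.x):
#   from openai import OpenAI
#   client = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
#                   base_url=os.environ["OPENAI_API_BASE"])

def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per input text."""
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]


def complete(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Return the stripped text of the first completion choice."""
    response = client.completions.create(
        model=model, prompt=prompt, max_tokens=max_tokens
    )
    return response.choices[0].text.strip()
```

Passing the client in as an argument keeps the helpers free of global state and makes them easy to exercise with a stub object.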

# STEP 1: Prepare dataset

In [3]:
df = pd.read_csv('../data/character_descriptions.csv')
# df.head(20)

In [5]:
# Create a new dataframe with concatenated text
df_combined = pd.DataFrame({
    'text': df.apply(lambda row: 
                      f"NAME: {row['Name']}\n"
                      f"DESCRIPTION: {row['Description']}\n"
                      f"MEDIUM: {row['Medium']}\n"
                      f"SETTING: {row['Setting']}", 
                      axis=1)
})

# Print the combined text of the first row
print(df_combined.iloc[0]['text'])

NAME: Emily
DESCRIPTION: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.
MEDIUM: Play
SETTING: England
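
The f-string concatenation above assumes every column is filled in. If a CSV cell were empty, pandas would load it as `NaN` and the text would contain a literal `nan`. A minimal sketch of a formatter that skips missing fields is shown here; `format_row` and `FIELDS` are hypothetical names, and the function only relies on the row offering a `.get()` method.

```python
import math

FIELDS = ("Name", "Description", "Medium", "Setting")  # columns used in the notebook


def format_row(row, fields=FIELDS):
    """Build the 'LABEL: value' text block, skipping missing or blank fields."""
    lines = []
    for field in fields:
        value = row.get(field)
        # pandas represents empty CSV cells as float NaN
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        value = str(value).strip()
        if value:
            lines.append(f"{field.upper()}: {value}")
    return "\n".join(lines)


print(format_row({"Name": "Emily", "Medium": "Play", "Setting": float("nan")}))
```

Since a pandas `Series` also offers `.get()`, the same function could probably be used as `df.apply(format_row, axis=1)`, though that has not been checked against this dataset.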


# STEP 2: Create embeddings

In [6]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
BATCH_SIZE = 100
embeddings = []
for i in range(0, len(df_combined), BATCH_SIZE):
    response = openai.Embedding.create(
        engine=EMBEDDING_MODEL_NAME,
        input=df_combined.iloc[i : i + BATCH_SIZE]["text"].tolist(),
    )
    embeddings.extend([d["embedding"] for d in response["data"]])

df_combined["embeddings"] = embeddings
df_combined.head()

|   | text | embeddings |
|---|------|------------|
| 0 | NAME: Emily\nDESCRIPTION: A young woman in her... | [-0.01714273914694786, -0.005702167749404907, ... |
| 1 | NAME: Jack\nDESCRIPTION: A middle-aged man in ... | [0.005255976226180792, -0.01799955405294895, -... |
| 2 | NAME: Alice\nDESCRIPTION: A woman in her late ... | [0.004895723424851894, -0.0010099101345986128,... |
| 3 | NAME: Tom\nDESCRIPTION: A man in his 50s, Tom ... | [0.013725598342716694, -0.013712245970964432, ... |
| 4 | NAME: Sarah\nDESCRIPTION: A woman in her mid-2... | [-0.020092617720365524, -0.0203151423484087, -... |
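
The embedding loop sends the texts in slices of at most `BATCH_SIZE` rows, so the number of API calls is the row count divided by the batch size, rounded up. The chunking on its own can be sketched as a small generator; `iter_batches` is a hypothetical helper, not something the notebook defines.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder texts split with the notebook's batch size of 100
texts = [f"text {i}" for i in range(250)]
sizes = [len(batch) for batch in iter_batches(texts, 100)]
print(sizes)  # [100, 100, 50]
```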


# STEP 3: Similarity‑search helper

In [7]:
def get_rows_sorted_by_relevance(question: str, df) -> pd.DataFrame:
    """
    Return a copy of df sorted from most → least relevant for 'question'
    based on cosine distance between embeddings.
    """
    q_emb = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    df_tmp = df.copy()
    df_tmp["distance"] = distances_from_embeddings(
        q_emb, df_tmp["embeddings"].values, distance_metric="cosine"
    )
    return df_tmp.sort_values("distance")

In [8]:
get_rows_sorted_by_relevance("Who is Emily?", df_combined).head(3)[["text","distance"]]

|   | text | distance |
|---|------|----------|
| 0 | NAME: Emily\nDESCRIPTION: A young woman in her... | 0.122136 |
| 2 | NAME: Alice\nDESCRIPTION: A woman in her late ... | 0.162361 |
| 5 | NAME: George\nDESCRIPTION: A man in his early ... | 0.181894 |
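
The `distance` column holds cosine distances, that is, 1 minus the cosine similarity, so smaller values mean closer matches. The pure-Python sketch below shows the computation and the ranking on toy 2-D vectors. `cosine_distance` and `rank_by_distance` are illustrative names, not the helpers the notebook imports.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, vectors):
    """Return (index, distance) pairs sorted from nearest to farthest."""
    distances = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(distances, key=lambda pair: pair[1])


# Toy 2-D vectors standing in for real embeddings
query = [1.0, 0.0]
vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print([i for i, _ in rank_by_distance(query, vectors)])  # [1, 2, 0]
```

The identical vector comes first (distance 0), the diagonal one second, and the orthogonal one last (distance 1).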


# STEP 4: Prompt builder

In [9]:
tokenizer = tiktoken.get_encoding("cl100k_base")

def create_prompt(question: str,
                  df,
                  max_token_count: int = 1800) -> str:
    """
    Build a prompt that contains as much relevant context as will fit
    inside 'max_token_count' tokens.
    """
    template = (
        "Answer the question based on the context below, and if the "
        "question cannot be answered from that context, say \"I don't know\".\n\n"
        "Context:\n\n{context}\n\n"
        "---\n\n"
        "Question: {question}\n"
        "Answer:"
    )

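    # Tokens used by the template itself (with an empty context).
    used_tokens = len(tokenizer.encode(template.format(context="", question=question)))
    separator = "\n\n###\n\n"
    separator_tokens = len(tokenizer.encode(separator))

    context_blocks = []
    for row_text in get_rows_sorted_by_relevance(question, df)["text"]:
        # Every block after the first also costs one separator.
        tokens_needed = len(tokenizer.encode(row_text))
        if context_blocks:
            tokens_needed += separator_tokens
        if used_tokens + tokens_needed > max_token_count:
            break
        context_blocks.append(row_text)
        used_tokens += tokens_needed

    context_str = separator.join(context_blocks)
    return template.format(context=context_str, question=question)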
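
The budgeting loop in `create_prompt` is a greedy packer: it walks the rows from most to least relevant and stops at the first one that would exceed the limit. The sketch below isolates that idea with a stand-in tokenizer that counts whitespace-separated words, which is an assumption made only to keep the example dependency-free. `pack_context` and `count_words` are hypothetical names.

```python
def pack_context(blocks, budget, count_tokens, separator="\n\n###\n\n"):
    """Greedily keep blocks, in order, while the running token count fits the budget."""
    kept, used = [], 0
    sep_cost = count_tokens(separator)
    for block in blocks:
        # Every block after the first also pays for one separator.
        cost = count_tokens(block) + (sep_cost if kept else 0)
        if used + cost > budget:
            break
        kept.append(block)
        used += cost
    return kept, used


def count_words(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


blocks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept, used = pack_context(blocks, budget=7, count_tokens=count_words)
print(kept, used)  # ['alpha beta gamma', 'delta epsilon'] 6
```

With a real BPE tokenizer the count of a joined string can differ slightly from the sum of its parts, so leaving a small safety margin below the model's limit is probably wise.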

# STEP 5: Ask questions

In [10]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question: str,
                    df,
                    max_prompt_tokens: int = 1800,
                    max_answer_tokens: int = 150) -> str:
    prompt = create_prompt(question, df, max_prompt_tokens)
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens,
        )
        return response["choices"][0]["text"].strip()
    except Exception as ex:
        print(f"OpenAI error: {ex}")
        return ""
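
`answer_question` prints the error and returns an empty string, which hides transient failures such as rate limits or timeouts. One common alternative is to retry with exponential backoff. The following is a minimal sketch under that assumption: `call_with_retries` is a hypothetical helper, and it catches `Exception` broadly only because the notebook does the same.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # broad on purpose, mirroring the notebook
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook, the wrapped call could be something like `lambda: openai.Completion.create(...)`, with the `sleep` argument left at its default.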

In [11]:
questions = [
    "Who is Emily?",
    "What is Jack's occupation?",
    "What is Max medium and setting?",
]

for question in questions:
    print(f"Q: {question}")
    print(f"A: {answer_question(question, df_combined)}\n")

Q: Who is Emily?
A: Emily is an aspiring actress and Alice's daughter. She's in a relationship with George.

Q: What is Jack's occupation?
A: Businessman

Q: What is Max medium and setting?
A: Max's medium is a Limited Series and the setting is Australia.



# Compare to an un‑contextualized answer

In [12]:
def compare_answers(question: str, df, max_tokens: int = 50):
    baseline = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=f"Question: {question}\nAnswer:\n",
        max_tokens=max_tokens
    )["choices"][0]["text"].strip()

    custom = answer_question(question, df)

    print("Without context:", baseline)
    print("With context:   ", custom)

# Example usage
for question in questions:
    compare_answers(question, df_combined)

Without context: It is not possible for me to accurately answer this question as I do not have enough information. Emily could refer to many different people.
With context:    A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. (Note: Emily is mentioned in the second line of the context)
Without context: I'm sorry, I cannot accurately answer this question as there is not enough information provided. Please provide more context.
With context:    Successful businessman.
Without context: Max medium and setting refers to the maximum level or intensity at which a particular medium or setting can operate. This could apply to things like the volume level on electronic devices, the heat setting on a stove, or the brightness level on a computer screen
With context:    Max's medium is a limited series and the setting is Australia.
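
Reading the pairs by eye works for three questions. For a larger set, a crude automatic check could test whether each answer mentions facts expected from the dataset. The sketch below uses answers copied from the run above; the keyword lists are hand-picked assumptions, and `contains_all` is a hypothetical helper.

```python
def contains_all(answer, keywords):
    """True if every keyword appears in the answer, ignoring case."""
    lowered = answer.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# (answer, expected keywords); keywords are hand-picked for illustration only
cases = [
    ("Successful businessman.", ["businessman"]),
    ("I'm sorry, I cannot accurately answer this question as there is not "
     "enough information provided. Please provide more context.", ["businessman"]),
    ("Max's medium is a limited series and the setting is Australia.",
     ["limited series", "Australia"]),
]

results = [contains_all(answer, keywords) for answer, keywords in cases]
print(results)  # [True, False, True]
```

Substring matching is only a rough proxy for factuality: it misses paraphrases and can be fooled by negations.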
