# QUANAM Technical assessment RAG

In this notebook I go through the steps it took to build a working RAG, using Chromadb, LangChain/LangSmith/LangGraph, and an LLM (gpt-4o).

### Initial auth

In [1]:

from langchain_core.prompts import PromptTemplate
import os
from langchain_openai import ChatOpenAI  # same package the embeddings are imported from below
from dotenv import load_dotenv

load_dotenv()

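os.environ["LANGSMITH_TRACING"] = "true"

# load_dotenv() has already copied the keys from .env into os.environ.
# Fail early with a clear message if either key is missing.
for key in ("LANGSMITH_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")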


### Definition of initial variables for improved testing and automation

Here I define the variables that we might change in order to test different RAG combinations: the LLM, the number of documents retrieved as context (`k`), the name of the collection (based on the embeddings), the number of examples to send to LangSmith for testing, and an initial basic prompt.

The collection name indicates the embeddings selected to embed the documents and create the collection.

Four embeddings were selected for testing, two from OpenAI and two from HuggingFace. I couldn't find much information about the differences, other than the size and performance.

In [368]:
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

llm = ChatOpenAI(model="gpt-4o")

k_value = 4
k_importance = 0  # number of lowest-ranked documents to drop when averaging the context distance score

collection_name = "OPENAI-small"
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
examples_amount = 100

# Basic RAG Prompt taken from https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=386d8cfe-157b-445e-86e5-42faea85b914
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 
    Use three sentences maximum and keep the answer concise.\n\nQuestion: {question}\nContext: {context}\nAnswer:"""
)

### Creation of database/collection

Depending on the `collection_name` defined above, the vector database is either populated or reused. If the collection does not exist, a new one is created using the previously selected embeddings.

In [365]:
from langchain_chroma import Chroma
from uuid import uuid4
import json
import os
from langchain_core.documents import Document
from chromadb import PersistentClient

# Define collection parameters
persist_directory = f"./chroma_langchain_db/{collection_name}"

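# Check if the collection already exists.
# Depending on the chromadb version, list_collections() returns either names
# or Collection objects, so normalize everything to plain names.
chroma_client = PersistentClient(path=persist_directory)
existing_collections = [getattr(c, "name", c) for c in chroma_client.list_collections()]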

vector_store = Chroma(
    collection_name=collection_name,
    embedding_function=embeddings,
    persist_directory=persist_directory,
)

if collection_name in existing_collections:
    print(f"Collection '{collection_name}' already exists.")
else:
    print(f"Populating new collection '{collection_name}'.")
    with open("data/hotpotqa_docs_reduced.json", "r", encoding="utf-8") as f:
        documents = [Document(page_content=doc['text'], id=idx) for idx, doc in enumerate(json.load(f))]

    doc_ids = vector_store.add_documents(documents=documents, ids=[str(uuid4()) for _ in range(len(documents))])
    print(len(doc_ids))

Collection 'OPENAI-small' already exists.


## LangChain Graph for retrieval of information and generation of answers

This was achieved by following the [LangChain tutorial](https://python.langchain.com/docs/tutorials/rag/).

Two functions are added to the sequence: `retrieve`, which fetches the context, and `generate`, which runs the previously defined LLM.

The retrieve function was edited to use the method `similarity_search_with_score` instead of `similarity_search`. This method provides the cosine distance from the question to the context calculated by the vector database. This indicates the similarity between both sentences. We will use this later to evaluate context importance.
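
To make the score concrete, here is a minimal, dependency-free sketch of cosine distance and of the `1 - distance` similarity used later in this notebook. It assumes the collection really is configured for cosine distance, as described above. The vectors are made-up toy values, and `cosine_distance` is an illustrative helper, not a Chroma API.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b). Ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
question = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

for name, vector in candidates.items():
    distance = cosine_distance(question, vector)
    print(f"{name}: distance={distance:.2f}, similarity={1 - distance:.2f}")
```

Because the distance can exceed 1, the inverted value `1 - distance` can be negative, which is why negative scores can show up in the test output further down.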

In [366]:
from langchain import hub
from langgraph.graph import START, StateGraph
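from typing_extensions import List, Tuple, TypedDict


class State(TypedDict):
    question: str
    # Each entry is a (document, distance) pair returned by similarity_search_with_score
    context: List[Tuple[Document, float]]
    answer: str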

# Use similarity_search_with_score instead of similarity_search to also get each document's distance
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search_with_score(state["question"], k=k_value)
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc, _ in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

-------

# Local Tests
#### The next sections show some issues found while testing:

In this case we can see that one of the questions is not very clear, which confuses the model. This suggests that the dataset may contain more questions like it, so a 100% success rate might not be feasible.

In [367]:
# Original question confuses the model:
response = graph.invoke({"question": """Cadmium Chloride is slightly soluble in this chemical, it is also called what?"""})
print(response["answer"])

# Slightly edited question yields a correct answer
response = graph.invoke({"question": """Cadmium Chloride is slightly soluble in this chemical, what is it called?"""})
print(response["answer"])

Cadmium chloride, which is slightly soluble in alcohol, is also known as CdCl₂.
Cadmium chloride is slightly soluble in alcohol.


Test function for simplicity

In [348]:
def test(test_question, show_context=False):
    response = graph.invoke({"question": test_question})
    print('ANSWER ==================')
    print(response["answer"])
    if show_context:
        print('CONTEXT ==================')
        question_context = vector_store.similarity_search_with_score(test_question, k=k_value)
        # The score is inverted with '1 - score' because the vector store returns a distance, so higher printed values mean more similar
        for context in question_context:
            print(f"{1 - context[1]} * \n * {context[0].page_content}\n\n")

### Usage of LLM previous knowledge 

There is not enough context to answer this question without relying on knowledge the LLM acquired during training. With the initial prompt, the model tends to hallucinate, or to fill the gaps with its own knowledge, answering questions even though that is not possible with the given context.

(None of the documents in the context relates the actor Steve Landesberg to the documentary "The Aristocrats", nor mentions the Emmy Awards.)


In [360]:
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 
    Use three sentences maximum and keep the answer concise.\n\nQuestion: {question}\nContext: {context}\nAnswer:"""
)

test("""The 2005 documentary "The Aristocrats" was dedicated to a comedian that received how many Emmy Awards?""", True)

The 2005 documentary "The Aristocrats" was dedicated to comedian Steve Landesberg, who was nominated for three Emmy Awards.
-0.29104459285736084 * 
 * John Arthur Lithgow ( ; born October 19 , 1945) is an American actor, musician, singer, comedian, voice actor, and author.  He has received two Tony Awards, six Emmy Awards, two Golden Globe Awards, three Screen Actors Guild Awards, an American Comedy Award, four Drama Desk Awards and has also been nominated for two Academy Awards and four Grammy Awards.  Lithgow has received a star on the Hollywood Walk of Fame and has been inducted into the American Theater Hall of Fame.


-0.36132943630218506 * 
 * Steve Landesberg (November 23, 1936December 20, 2010) was an American actor, comedian, and voice actor known for his role as the erudite, unflappable police detective Arthur P. Dietrich on the ABC sitcom "Barney Miller", for which he was nominated for three Emmy Awards.


-0.37810349464416504 * 
 * Geoffrey Roy Rush {'1': ", '2': ", '3': ",

#### Same question, different prompt
Here we can see how the same question with a more specific prompt yields a better response:

In [359]:
prompt = PromptTemplate.from_template(
    """Using only the provided context, answer the following quesitons with
    \n\nQuestion: {question}\nContext: {context}\nAnswer:\n\n
    Keep the answers as shost and concise as possible, and If you dont know the answer, just say that you don't know."""
)

test("""The 2005 documentary "The Aristocrats" was dedicated to a comedian that received how many Emmy Awards?""")

I don't know.


### Test conclusions:

The way I see it, the point of a RAG, at least in this case, is to answer questions based exclusively on the context. If we feed the model context about a fictional story set on a distant planet, we don't want it to answer 9.807 m/s² (Earth's gravity) when asked about the gravity of the planet in the story. If the context contains no information about the planet's gravity, the right answer is 'I don't know'.

-------

# Using LangSmith to register the different RAG performances

I adapted the [evaluate a chatbot](https://docs.smith.langchain.com/evaluation/tutorials/evaluation) tutorial from LangSmith to test the RAG.

Two evaluators, `correctness` and `context`, check the performance of the RAG for a given question.

The `correctness` evaluator uses the LLM-as-a-judge technique to assess the generated answer by comparing it to the expected answer. A short set of grading instructions is defined for the judging LLM.
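
The judge's free-text reply has to be reduced to a boolean score. Below is a minimal sketch of that step, assuming the reply follows the "start the message with CORRECT or INCORRECT" instruction. `parse_verdict` is a hypothetical helper, not part of LangSmith, and it looks only at the first word, which is a stricter variant of searching the whole reply for a substring.

```python
def parse_verdict(reply: str) -> bool:
    """Return True only when the judge's reply starts with CORRECT."""
    words = reply.strip().split(maxsplit=1)
    if not words:
        return False  # an empty reply is treated as a failed grade
    # Drop punctuation or markdown emphasis around the first word
    first_word = words[0].strip(".,:;!*").upper()
    return first_word == "CORRECT"


print(parse_verdict("CORRECT. The predicted answer matches the reference."))
print(parse_verdict("INCORRECT: the predicted answer names a different person."))
```

The difference matters for replies such as "CORRECT, although the question itself contains an INCORRECT date": a substring check for `INCORRECT` would mark that answer as wrong, while a first-word check would not.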

The `context` evaluator is my own addition. It attaches the retrieved context and its score to each experiment run, so that performance can be studied in depth from the LangSmith console. Its score is `1 - distance`, where the distance is the one calculated by chromadb between the question and each context document, averaged over the documents considered. Higher values indicate higher similarity between a question and its context documents.

The `k_importance` value controls how many of the `k_value` retrieved documents count towards the context score: the score is averaged over the top `k_value - k_importance` documents. This was implemented because retrieval is not dynamic. The database always returns `k_value` documents, no matter how similar each one is to the question. If a question can be answered with a single document, that document scores high, but the remaining `k_value - 1` documents probably score very low and drag the average down, making the metric unusable.
With `k_value = 4`, setting `k_importance` to `2` means that only the top 2 documents are used to calculate this metric.
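
A small worked example of this effect, as a sketch: the distances are made-up values, and `context_score` is an illustrative stand-in for the averaging done by the evaluator (it sorts defensively, whereas retrieval results normally arrive already ordered by distance).

```python
def context_score(distances, k_value, k_importance):
    """Average the best (k_value - k_importance) distances, then invert."""
    kept = sorted(distances)[: k_value - k_importance]
    if not kept:
        raise ValueError("no documents left to score")
    return 1 - sum(kept) / len(kept)


# One highly relevant document followed by three weak matches (hypothetical values)
distances = [0.2, 0.9, 1.1, 1.2]

all_docs = context_score(distances, k_value=4, k_importance=0)
top_two = context_score(distances, k_value=4, k_importance=2)

print(f"all 4 documents: {all_docs:.2f}")
print(f"top 2 documents: {top_two:.2f}")
```

Averaging over all four documents gives a much lower score than averaging over the top two, even though the best document is the same in both cases.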


In [None]:

# Define evaluators
eval_instructions = "You are an expert professor specialized in grading students' answers to questions."

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # LLM-as-a-judge: ask the model to grade the predicted answer against the reference
    user_content = f"""You are grading the following question:{inputs['question']} Here is the real answer: {reference_outputs['answer']} You are grading the following predicted answer: {outputs['response']} Start the message with CORRECT or INCORRECT, and then provide argumentation for your decision"""
    response = llm.invoke([
        {"role": "system", "content": eval_instructions},
        {"role": "user", "content": user_content},
    ]).content
    # Any reply that contains INCORRECT is scored as a failure
    comment = f"""Question: \n {inputs['question']}\n\nReference answer: \n {reference_outputs['answer']}\n\nResponse: \n {outputs['response']}\n\n\nArgumentation: \n {response}\n\n"""
    return {'key': "correctness", 'score': "INCORRECT" not in response, 'comment': comment}

def extract_page_contents(documents):
    if not documents:
        return "No documents available."

    formatted_content = []
    for doc, score in documents:
        formatted_content.append(f"--- Document {score} ---\n{doc.page_content}\n")

    return "\n".join(formatted_content)

def context(outputs: dict) -> dict:
    # Average the distance of the top (k_value - k_importance) documents, then invert it
    top_docs = outputs["context"][:k_value - k_importance]
    score_avg = sum(score for _, score in top_docs) / len(top_docs)
    return {'key': "context", 'score': 1 - score_avg, 'comment': extract_page_contents(outputs["context"])}

# Target function: runs the RAG graph for one dataset example
def ls_target(inputs: dict) -> dict:
    response = graph.invoke({"question": inputs["question"]})
    return {"response": response["answer"], "context": response["context"]}




### LangSmith dataset creation

Here I create a questions dataset in LangSmith, if it hasn't been created yet.

The `examples_amount` value determines how many questions are randomly selected from the questions file for this test suite.

Reusing this dataset ensures that the same questions are used to evaluate the different RAG configurations.
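
Repeatability here relies on the dataset being stored in LangSmith. As a sketch of an extra safeguard, a seeded random generator would select the same questions again if the dataset ever had to be recreated. The `seed` parameter and the `sample_examples` helper are assumptions for illustration and are not used in this notebook; the toy records only mimic the `question`/`answer` keys of the real file.

```python
import random


def sample_examples(data, examples_amount, seed=42):
    """Pick examples_amount items reproducibly and split them into inputs/outputs."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    samples = rng.sample(data, examples_amount)
    inputs = [{"question": item["question"]} for item in samples]
    outputs = [{"answer": item["answer"]} for item in samples]
    return inputs, outputs


# Toy stand-in for the question/answer file (hypothetical records)
toy_data = [{"question": f"Q{i}?", "answer": f"A{i}"} for i in range(10)]

first_inputs, first_outputs = sample_examples(toy_data, 3)
second_inputs, second_outputs = sample_examples(toy_data, 3)

# The same seed selects the same questions on every run
print(first_inputs == second_inputs)
```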

In [291]:
from langsmith import Client
import random

client = Client()

dataset_name = f"Random {examples_amount}"

datasets = client.list_datasets()
dataset_id = next((d.id for d in datasets if d.name == dataset_name), None)

if not dataset_id:
    print('Creating new questions dataset')
    dataset = client.create_dataset(dataset_name)
    dataset_id = dataset.id

    with open("data/hotpotqa_docs_reduced_qa.json", "r") as f:
        data = json.load(f)
    
    samples = random.sample(data, examples_amount)
    
    selected_questions = [{"question": item["question"]} for item in samples]
    selected_answers = [{"answer": item["answer"]} for item in samples]
    
    client.create_examples(
        inputs=selected_questions,
        outputs=selected_answers,
        dataset_id=dataset_id,
    )
else:
    print(f'Dataset {dataset_name} already created')



Dataset Random 500 already created


## LangSmith tests setups

In this section I define three basic tests, each with a different prompt.

Each test was executed for each of the 4 selected embeddings; the `experiment_prefix` combines the `collection_name` with a short description of the prompt.
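
As a sketch of the resulting grid, the snippet below enumerates one experiment prefix per embedding/prompt combination. The prompt labels are the ones used in the tests below; the collection names other than `OPENAI-small` are illustrative placeholders, and `experiment_prefixes` is a hypothetical helper.

```python
from itertools import product


def experiment_prefixes(collection_names, prompt_labels):
    """Build one '<collection>_<prompt>' prefix per combination."""
    return [f"{name}_{label}" for name, label in product(collection_names, prompt_labels)]


# Collection names besides "OPENAI-small" are placeholders for illustration
collection_names = ["OPENAI-small", "OPENAI-large", "HF-mpnet-base-v2", "HF-MiniLM-L6-v2"]
prompt_labels = ["base-prompt", "base-prompt-with-(ONLY)", "custom-prompt"]

prefixes = experiment_prefixes(collection_names, prompt_labels)
print(len(prefixes))
for prefix in prefixes:
    print(prefix)
```

In this notebook the combinations were run by hand, changing `collection_name` and `embeddings` at the top and re-running the test cells.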

To compare these tests, a small sample of 100 random questions was taken, and the same 100 questions were asked with each of the 3 prompts against each embedding collection.

In [this LangSmith link](https://smith.langchain.com/public/6db8dd34-cf63-4721-a8b1-05ae9d521e33/d) you can see the results for each combination. The results are also discussed in the documents provided with the solution.

This is the baseline test, which uses the default prompt.

In [None]:
# TEST 1 ==========================

llm = ChatOpenAI(model="gpt-4o")

prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 
    Use three sentences maximum and keep the answer concise.
    \n\nQuestion: {question}\nContext: {context}\nAnswer:"""
)

client.evaluate(
    ls_target, 
    data=dataset_name,  
    evaluators=[correctness, context],  
    experiment_prefix=f"{collection_name}_base-prompt",  
)

View the evaluation results for experiment: 'OPENAI_large_k8_base-prompt-b60ddafd' at:
https://smith.langchain.com/o/386d8cfe-157b-445e-86e5-42faea85b914/datasets/1cf2bce7-e4e6-45d1-9f99-d680de2cde59/compare?selectedSessions=dfe74b7b-a1e2-4861-9220-fa06a6862526




434it [26:58,  4.93s/it]Error running target function: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Traceback (most recent call last):
  File "/Users/Nico/repos/quanam/quanam/lib/python3.12/site-packages/langsmith/evaluation/_runner.py", line 1914, in _forward
    fn(
  File "/Users/Nico/repos/quanam/quanam/lib/python3.12/site-packages/langsmith/run_helpers.py", line 629, in wrapper
    raise e
  File "/Users/Nico/repos/quanam/quanam/lib/python3.12/site-packages/langsmith/run_helpers.py", line 626, in wrapper
    function_result = run_container["context"].run(func, *args, **kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/gr/ntznzvt96xv2_6vsz3rt8lxh0000gp/T/ipykernel_6962

This test tries to keep the model from using any data other than the context, by adding 'ONLY' to make the prompt more explicit.

In [286]:
# TEST 2 ==========================

llm = ChatOpenAI(model="gpt-4o")

prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
    Use ONLY the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 
    Use three sentences maximum and keep the answer concise.
    \n\nQuestion: {question}\nContext: {context}\nAnswer:"""
)

client.evaluate(
    ls_target, 
    data=dataset_name, 
    evaluators=[correctness, context], 
    experiment_prefix=f"{collection_name}_base-prompt-with-(ONLY)", 
)

View the evaluation results for experiment: 'HF-mpnet-base-v2_k8_base-prompt-with-(ONLY)-ff862a20' at:
https://smith.langchain.com/o/386d8cfe-157b-445e-86e5-42faea85b914/datasets/1cf2bce7-e4e6-45d1-9f99-d680de2cde59/compare?selectedSessions=963dcc49-f3b6-47d3-8d40-041f1021e96e




500it [27:39,  3.32s/it]


This is a similar prompt, but shorter and worded differently.

In [258]:
# TEST 3 ==========================

llm = ChatOpenAI(model="gpt-4o")

prompt = PromptTemplate.from_template(
    """Using only the provided context, answer the following quesitons with
    \n\nQuestion: {question}\nContext: {context}\nAnswer:\n\n
    Keep the answers as shost and concise as possible, and If you dont know the answer, just say that you don't know."""
)

client.evaluate(
    ls_target, 
    data=dataset_name,  
    evaluators=[correctness, context], 
    experiment_prefix=f"{collection_name}_custom-prompt", 
)

View the evaluation results for experiment: 'HF-MiniLM-L6-v2_custom-prompt-57ca54c4' at:
https://smith.langchain.com/o/386d8cfe-157b-445e-86e5-42faea85b914/datasets/e6f93c89-bde6-45bb-b350-19e959d857fb/compare?selectedSessions=75bf055f-dc98-421f-be29-b14f95debe21




100it [04:59,  2.99s/it]


This was meant to test other LLMs.

In [238]:
# # TEST 4 ==========================

# llm = ChatOpenAI(model="o1")

# prompt = PromptTemplate.from_template(
#     """You are an assistant for question-answering tasks. 
#     Use the following pieces of retrieved context to answer the question. 
#     If you don't know the answer, just say that you don't know. 
#     Use three sentences maximum and keep the answer concise.
#     \n\nQuestion: {question}\nContext: {context}\nAnswer:"""
# )

# client.evaluate(
#     ls_target,
#     data=dataset_name,
#     evaluators=[correctness, context], 
#     experiment_prefix="o1_base-prompt", 
# )