# How to evaluate a RAG application

This example uses [LangChain](https://www.langchain.com) and [Giskard](https://github.com/Giskard-AI/giskard) to evaluate the quality of a RAG application.

In [15]:
import os
from dotenv import load_dotenv

load_dotenv()

True

# 1) Load the Website and split it into smaller chunks

We want to build a question-answering system over the content of a webpage. As a first step, we:
* load the content of the webpage,
* split the content into smaller chunks.

In [4]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # maximum number of characters per chunk
    chunk_overlap=40 # number of characters shared by consecutive chunks
)

loader = WebBaseLoader("https://www.ml.school/")

documents = loader.load_and_split(text_splitter)
print("There are {} chunks.\n".format(len(documents)))
print(documents[0])

There are 11 chunks.

page_content='Building Machine Learning Systems That Don't Suck"This is the best machine learning course I've done. Worth every cent."Jose Reyes, AI/ML at Cevo AustraliaBuilding Machine Learning Systems That Don't SuckA live, interactive program that'll help you build production-ready machine learning systems from the ground up.Next cohort: February 3 - 20, 2025Check the schedule for more details about upcoming cohorts.I want to join!Sign inLearn how to design, build, deploy, and scale machine learning systems to solve real-world problems.I'll lose my mind if I see another book or course teaching people the same basic ideas for the hundredth time. Most people are stuck in beginner mode, and finding help to solve real-world problems is hard.I want to change that.I started writing software 30 years ago. I've written pipelines and trained models for some of the largest companies in the world. I want to show you how to do the same.This is the class I wish I had taken 
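
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to cut at separators such as paragraph breaks and spaces, so its chunks will differ. The function name `split_with_overlap` and the sample string are made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters, where each
    window repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears partly in the next chunk, which can help retrieval when the answer straddles two chunks.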

# 2) Load the content into a vector store

We create a vector store, which contains:
* the vector embedding for each chunk

In [3]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

In [23]:
vectorstore = DocArrayInMemorySearch.from_documents(
    documents,
    embedding=OpenAIEmbeddings()
)
vectorstore



<langchain_community.vectorstores.docarray.in_memory.DocArrayInMemorySearch at 0x7a95c1e41400>

In [25]:
search_res = vectorstore.similarity_search("What is this course for?")
search_res[0]

Document(metadata={'source': 'https://www.ml.school/', 'title': "Building Machine Learning Systems That Don't Suck", 'description': "A live, interactive program that'll help you build production-ready machine learning systems from the ground up.", 'language': 'en'}, page_content="taken when I started.This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.When you join, you get lifetime access to the following:18 hours of live, interactive sessions. We'll use this time to discuss the first principles behind building machine learning systems.10 hours of step-by-step coding instructions. These practical sessions will show you how to build an end-to-end system from scratch.A final project where you'll build a complete solution and receive direct feedback on your work.100 coding assignments and practice questions.The entire source code of a working production system. It's you
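
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below shows that idea in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors, the texts, and the helper names `cosine_similarity` and `top_k` are all hypothetical; real embeddings come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy store: (chunk text, embedding vector)
toy_store = [
    ("course contents", [0.9, 0.1, 0.0]),
    ("pricing", [0.1, 0.9, 0.0]),
    ("instructor bio", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # hypothetical embedding of the question

results = top_k(query, toy_store, k=2)
for text, score in results:
    print(f"{text}: {score:.3f}")
```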

# 3) Generate the test set

We use another LLM to generate a list of `question`s and the desired `answer`s:
* First, we create a knowledge base,
* Then, we use an LLM to generate a test set.

**Note.** By default, RAGET automatically generates 6 different [question types](https://docs.giskard.ai/en/latest/open_source/testset_generation/testset_generation/index.html#q-types) (these can be selected if needed, see [advanced question generation](https://docs.giskard.ai/en/latest/open_source/testset_generation/testset_generation/index.html#advanced-config)). The total number of questions is divided equally among the question types.

In [40]:
import pandas as pd
import giskard
from giskard.rag import KnowledgeBase, generate_testset

In [33]:
df = pd.DataFrame([d.page_content for d in documents], columns=['text'])
df.head()

| | text |
|---|---|
| 0 | Building Machine Learning Systems That Don't S... |
| 1 | taken when I started.This program will help yo... |
| 2 | upon completion.And the best part is that you ... |
| 3 | We'll move quickly over the basics to focus on... |
| 4 | training, adversarial validation, human-in-the... |


In [34]:
knowledge_base = KnowledgeBase(df)

2024-11-24 23:31:43,958 pid:13932 MainThread giskard.llm.embeddings INFO     No embedding model set though giskard.llm.set_embedding_model. Defaulting to openai/text-embedding-3-small since OPENAI_API_KEY is set.


In [41]:
giskard.llm.set_llm_model("gpt-4o")

test_set = generate_testset(
    knowledge_base,
    num_questions=50, # number of samples to generate
    agent_description="A chatbot answering questions about the Machine Learning School Website", # helps generate better questions
)

Generating questions: 100%|█████████████████████| 50/50 [03:03<00:00,  3.67s/it]


In [42]:
test_set_df = test_set.to_pandas()

# iterrows() yields (row label, row) pairs; we only need the row itself
for index, (_, row) in enumerate(test_set_df.head(3).iterrows()):
    print(f"Question {index + 1}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row['reference_context'])
    print("******************", end="\n\n")

Question 1: What resources are included in the machine learning program?
Reference answer: The program includes 18 hours of live, interactive sessions, 10 hours of step-by-step coding instructions, a final project with feedback, 100 coding assignments and practice questions, the entire source code of a working production system, access to a private community, direct access to the instructor, lifetime access to every past and future cohort, and a program certificate upon completion.
Reference context:
Document 1: taken when I started.This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.When you join, you get lifetime access to the following:18 hours of live, interactive sessions. We'll use this time to discuss the first principles behind building machine learning systems.10 hours of step-by-step coding instructions. These practical sessions will show you how to build an

In [43]:
test_set.save("./test-set.jsonl")
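
A `.jsonl` file holds one JSON object per line, so the saved test set can be inspected with the standard library alone. The sketch below writes a few hypothetical records to a temporary file and reads them back; it does not touch `./test-set.jsonl`. The `question` and `reference_answer` fields match the columns printed above, while the `metadata` / `question_type` layout is an assumption about the saved format and should be checked against the actual file.

```python
import json
import os
import tempfile
from collections import Counter


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records standing in for the saved test set
records = [
    {"question": "Q1", "reference_answer": "A1", "metadata": {"question_type": "simple"}},
    {"question": "Q2", "reference_answer": "A2", "metadata": {"question_type": "complex"}},
    {"question": "Q3", "reference_answer": "A3", "metadata": {"question_type": "simple"}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-test-set.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    loaded = load_jsonl(path)

# Count how many questions of each (assumed) type were stored
type_counts = Counter(r["metadata"]["question_type"] for r in loaded)
print(type_counts)
```

Giskard also documents a `QATestset.load` method for reloading a saved test set as a test set object, which is likely the better choice when the goal is to reuse it for evaluation rather than to inspect it.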

# 4) Evaluation

Now, we evaluate our question-answering system:
* use our model to generate answers to the questions from the generated test set,
* evaluate the model by comparing pairs of answers (our model's answers vs. the reference answers generated by the other LLM).

In [46]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from operator import itemgetter

In [45]:
# choose the model that answers the questions
MODEL = 'gpt-3.5-turbo-0125'
llm = ChatOpenAI(model=MODEL)

In [47]:
# defining the prompt template
prompt_template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(prompt_template)
print(prompt.format(context='Some context', question='here is the question'))


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: Some context

Question: here is the question



### Create the RAG Chain

We create a `retriever` from the vector store. Given a question, it returns the chunks whose embedding vectors are most similar to the embedding of the question.

In [51]:
retriever = vectorstore.as_retriever()

# invoke() replaces the deprecated get_relevant_documents()
retriever_res = retriever.invoke("What is the Machine Learning School?")
print("We extracted {} chunks (as the most relevant).".format(len(retriever_res)))
retriever_res[0]

We extracted 4 chunks (as the most relevant).


Document(metadata={'source': 'https://www.ml.school/', 'title': "Building Machine Learning Systems That Don't Suck", 'description': "A live, interactive program that'll help you build production-ready machine learning systems from the ground up.", 'language': 'en'}, page_content="taken when I started.This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.When you join, you get lifetime access to the following:18 hours of live, interactive sessions. We'll use this time to discuss the first principles behind building machine learning systems.10 hours of step-by-step coding instructions. These practical sessions will show you how to build an end-to-end system from scratch.A final project where you'll build a complete solution and receive direct feedback on your work.100 coding assignments and practice questions.The entire source code of a working production system. It's you

In [52]:
chain = (
    {
        'context': itemgetter('question') | retriever,
        'question': itemgetter('question'),
    }
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke({"question": "What is the Machine Learning School?"})

'The Machine Learning School is a live, interactive program that helps individuals build production-ready machine learning systems from the ground up. It offers practical, hands-on classes, coding instructions, a final project, coding assignments, access to source code, a private community, direct access to the instructor, and a program certificate upon completion.'
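
The `|` operator pipes the output of one step into the next. The same flow can be written as ordinary function calls, which may make the order of operations easier to see. This is only a sketch: `fake_retriever`, `fake_llm`, `parse_output` and `run_chain` are hypothetical stand-ins, and joining the chunks with blank lines is a choice made here, not necessarily what the real chain does.

```python
PROMPT_TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fake_retriever(question):
    # Hypothetical stand-in for the vector-store retriever
    return ["The Machine Learning School is a live, interactive program."]


def fake_llm(prompt_text):
    # Hypothetical stand-in for the chat model: returns a message-like dict
    # that simply echoes the prompt it received
    return {"content": "PROMPT SEEN:" + prompt_text}


def parse_output(message):
    # Plays the role of StrOutputParser in this sketch: keep only the text
    return message["content"]


def run_chain(inputs):
    question = inputs["question"]
    chunks = fake_retriever(question)   # step 1: retrieve context for the question
    context = "\n\n".join(chunks)       # joining is a choice made for this sketch
    prompt_text = PROMPT_TEMPLATE.format(context=context, question=question)  # step 2: fill the prompt
    message = fake_llm(prompt_text)     # step 3: call the model
    return parse_output(message)        # step 4: extract the plain string


result = run_chain({"question": "What is the Machine Learning School?"})
print(result)
```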

### Evaluation

We evaluate the performance of the RAG system using the test set.

In [None]:
def answer(question, history=None):
    return chain.invoke({'question': question})
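
The evaluation itself boils down to a loop: ask each test question, compare the answer with the reference answer, and aggregate the verdicts. The sketch below shows that loop with a crude token-overlap score as a stand-in for the judge. This is an assumption made purely for illustration: Giskard relies on an LLM to judge correctness, not on token overlap. The samples, `toy_answer`, and the threshold are hypothetical.

```python
def token_overlap(answer_text, reference):
    """Fraction of reference tokens that also appear in the answer.
    A crude stand-in for an LLM judge, used only in this sketch."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer_text.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & ans_tokens) / len(ref_tokens)


def evaluate_answers(answer_fn, samples, threshold=0.5):
    """Run answer_fn on every sample and return (accuracy, per-sample results)."""
    results = []
    for sample in samples:
        predicted = answer_fn(sample["question"])
        score = token_overlap(predicted, sample["reference_answer"])
        results.append({
            "question": sample["question"],
            "score": score,
            "correct": score >= threshold,
        })
    accuracy = sum(r["correct"] for r in results) / len(results) if results else 0.0
    return accuracy, results


# Hypothetical test samples and answer function with the same shape as `answer`
toy_samples = [
    {"question": "How many hours of live sessions?", "reference_answer": "18 hours of live sessions"},
    {"question": "Is there a certificate?", "reference_answer": "yes a certificate upon completion"},
]


def toy_answer(question, history=None):
    if "hours" in question:
        return "There are 18 hours of live sessions"
    return "I don't know"


accuracy, details = evaluate_answers(toy_answer, toy_samples)
print(f"Accuracy: {accuracy:.2f}")
```

In the notebook, this loop would be handled by Giskard: its documentation describes a `giskard.rag.evaluate` function that takes an answer function like `answer` together with the test set and the knowledge base. Check the current Giskard docs for the exact arguments before relying on it.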