# Evaluating RAG Systems

Evaluation and benchmarking are crucial in developing LLM applications. Optimizing performance for applications like RAG (Retrieval Augmented Generation) requires a robust measurement mechanism.

LlamaIndex provides essential modules to assess the quality of generated outputs and evaluate content retrieval quality. It categorizes its evaluation into two main types:

*   **Response Evaluation:** assesses the quality of generated outputs.
*   **Retrieval Evaluation:** assesses the quality of retrieved content.

[Documentation](https://docs.llamaindex.ai/en/latest/module_guides/evaluating/)

In [1]:
# !pip install llama-index

## Response Evaluation

Evaluating LLM output differs from evaluating a traditional machine-learning model, where predictions can be compared directly against labels. LlamaIndex's evaluation modules instead use a strong LLM as a judge (GPT-4 in the original setup; this notebook substitutes a local `llama3.1` model served through Ollama) to gauge answer quality. These modules typically combine the query, the retrieved context, and the response, so most of them need no ground-truth labels.

The evaluation modules fall into the following categories:

*   **Faithfulness:** Assesses whether the response remains true to the retrieved contexts, ensuring there's no distortion or "hallucination."
*   **Relevancy:** Evaluates the relevance of both the retrieved context and the generated answer to the initial query.
*   **Correctness:** Determines if the generated answer aligns with the reference answer based on the query (this does require labels).

LlamaIndex can also generate questions from your data automatically, which makes it possible to build an evaluation pipeline for a RAG application without writing the questions by hand.

In [2]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

from jet.llm.ollama import initialize_ollama_settings, Ollama
initialize_ollama_settings()

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []

# Set up the StreamHandler to output to sys.stdout (Colab's output)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

In [3]:
import logging
import sys
import pandas as pd

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

# from llama_index.llms.openai import OpenAI

import os

In [4]:
# os.environ["OPENAI_API_KEY"] = "sk-..."

#### Download Data

In [5]:
# !mkdir -p 'data/paul_graham/'
# !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

#### Load Data

In [6]:
reader = SimpleDirectoryReader("/Users/jethroestrada/Desktop/External_Projects/Jet_Projects/JetScripts/data/jet-resume/data/")
documents = reader.load_data()

#### Generate Question

In [7]:
gpt4 = Ollama(model="llama3.1", temperature=0.1)

dataset_generator = DatasetGenerator.from_documents(
    documents,
    llm=gpt4,
    show_progress=True,
    num_questions_per_chunk=2,
)

eval_dataset = dataset_generator.generate_dataset_from_nodes(num=1)


In [8]:
eval_queries = list(eval_dataset.queries.values())

In [9]:
(eval_queries)

['Here are two questions based on the provided context:']

In [10]:
len(eval_queries)

1

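The generator returned a single entry, and it is the model's preamble rather than a real question. To keep the rest of the walkthrough consistent, we fix the evaluation query by hand.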

In [11]:
eval_query = "How did the author describe their early attempts at writing code?"
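
Generated entries that are not real questions could also be screened out automatically. The following is a minimal sketch, assuming a simple heuristic (a usable query ends with a question mark). The helper `keep_real_questions` is hypothetical and not part of LlamaIndex.

```python
def keep_real_questions(candidates: list[str]) -> list[str]:
    """Return only the candidates that look like actual questions.

    Hypothetical heuristic: a usable query is non-empty after stripping
    whitespace and ends with a question mark.
    """
    cleaned = (text.strip() for text in candidates)
    return [text for text in cleaned if text.endswith("?")]


# Sample input: one preamble string and one genuine question.
generated = [
    "Here are two questions based on the provided context:",
    "How did the author describe their early attempts at writing code?",
]
usable_queries = keep_real_questions(generated)
print(usable_queries)
```

A heuristic this simple will miss questions phrased as instructions ("Describe ..."), so it is best treated as a first pass before manual review.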

In [12]:
# LLM used to generate responses (llama3.2 stands in for GPT-3.5 Turbo)
gpt35 = Ollama(temperature=0, model="llama3.2")

# LLM used as the evaluation judge (llama3.1 stands in for GPT-4)
gpt4 = Ollama(temperature=0, model="llama3.1")

In [13]:
# create vector index
vector_index = VectorStoreIndex.from_documents(documents, llm=gpt35)

# Query engine to generate response
query_engine = vector_index.as_query_engine()

In [14]:
retriever = vector_index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(eval_query)

In [15]:
from IPython.display import display, HTML

display(HTML(f'<p style="font-size:20px">{nodes[1].get_text()}</p>'))

#### Faithfulness Evaluator

Measures whether the response from a query engine is supported by any of its source nodes, which helps detect hallucinated responses. The judge LLM is asked to answer YES or NO for each check.

In [16]:
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4)

In [17]:
# Generate response
response_vector = query_engine.query(eval_query)

In [18]:
eval_result = faithfulness_evaluator.evaluate_response(
    response=response_vector
)

In [19]:
eval_result.passing

True

In [20]:
eval_result

EvaluationResult(query=None, contexts=['# Skills\n\n## Frontend\n\n- React\n- React Native\n- Vanilla JS/CSS\n- Expo\n- GraphQL\n- Redux\n- Gatsby\n- TypeScript\n\n## Backend\n\n- Node.js\n- Python\n\n## Databases\n\n- PostgreSQL\n- MongoDB\n\n## Platforms\n\n- Firebase\n- AWS\n- Google Cloud\n\n## Developer Tools\n\n- Photoshop\n- Jest (Unit testing)\n- Cypress (Integration testing)\n- Selenium (E2E testing)\n- Git\n- Sentry bug tracker\n- Android Studio\n- Xcode\n- Fastlane\n- Serverless\n- ChatGPT', '# Web apps\n\n## Jules Procure\n\n### Achievements\n\n- Started as the sole client side developer, built enterprise web and mobile CRM apps starting from provided mockups to production\n- JulesAI CEO was impressed and acquired ownership of existing CRM\n- Successfully integrated existing CRM with JulesAI\'s workflow to be rebranded as "Jules Procure"\n- Key features: Contact dashboard, Data builder, Task calendar, Workflow boards, Form builders, Price list generator, Automated emails ba

#### Relevancy Evaluation

Measures whether the response and its source nodes match the query.

In [21]:
# Create the RelevancyEvaluator with the judge LLM (llama3.1 here)
relevancy_evaluator = RelevancyEvaluator(llm=gpt4)

In [22]:
# Generate response
response_vector = query_engine.query(eval_query)

# Evaluation
eval_result = relevancy_evaluator.evaluate_response(
    query=eval_query, response=response_vector
)

In [23]:
eval_result.query

'How did the author describe their early attempts at writing code?'

In [24]:
eval_result.response

"**Repeat**\n\nThere is no mention of the author's early attempts at writing code in this context. It appears to be a list of skills and tools used by the author, but there is no narrative or personal anecdote about their coding journey."

In [25]:
eval_result.passing

False

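The evaluation fails, which matches the response itself: the indexed documents say nothing about the author's early coding attempts. Relevancy can also be evaluated against each source node individually.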

In [26]:
# Create Query Engine with similarity_top_k=3
query_engine = vector_index.as_query_engine(similarity_top_k=3)

# Create response
response_vector = query_engine.query(eval_query)

# Evaluate with each source node
eval_source_result_full = [
    relevancy_evaluator.evaluate(
        query=eval_query,
        response=response_vector.response,
        contexts=[source_node.get_content()],
    )
    for source_node in response_vector.source_nodes
]

# Evaluation result
eval_source_result = [
    "Pass" if result.passing else "Fail" for result in eval_source_result_full
]

In [27]:
eval_source_result

['Fail', 'Fail', 'Fail']
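
To see which source node produced which verdict, each result can be paired with a snippet of the node's text. The sketch below uses plain lists as stand-ins for the evaluator output, so `summarize_verdicts` and the sample data are illustrative assumptions rather than LlamaIndex API.

```python
def summarize_verdicts(
    source_texts: list[str], passing_flags: list[bool], width: int = 40
) -> list[dict]:
    """Pair the first line of each source text (cut to `width` chars) with Pass/Fail."""
    rows = []
    for text, passing in zip(source_texts, passing_flags):
        stripped = text.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        rows.append(
            {
                "source": first_line[:width],
                "result": "Pass" if passing else "Fail",
            }
        )
    return rows


# Stand-in data: the first lines resemble the retrieved resume chunks.
sample_sources = ["# Skills\n\n## Frontend\n- React", "# Web apps\n\n## Jules Procure"]
sample_flags = [False, False]

summary_rows = summarize_verdicts(sample_sources, sample_flags)
for row in summary_rows:
    print(f"{row['result']:<5} {row['source']}")
```

With the objects from this notebook, the inputs would be `[n.get_content() for n in response_vector.source_nodes]` and `[r.passing for r in eval_source_result_full]`.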

#### Correctness Evaluator

Evaluates the relevance and correctness of a generated answer against a reference answer.

In [28]:
correctness_evaluator = CorrectnessEvaluator(llm=gpt4)

In [29]:
query = "Can you explain the theory of relativity proposed by Albert Einstein in detail?"

reference = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

General relativity, published in 1915, extended these ideas to include the effects of gravity. According to general relativity, gravity is not a force between masses, as described by Newton's theory of gravity, but rather the result of the warping of space and time by mass and energy. Massive objects, such as planets and stars, cause a curvature in spacetime, and smaller objects follow curved paths in response to this curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet, causing it to create a depression that other objects (representing smaller masses) naturally move towards.

In essence, general relativity provided a new understanding of gravity, explaining phenomena like the bending of light by gravity (gravitational lensing) and the precession of the orbit of Mercury. It has been confirmed through numerous experiments and observations and has become a fundamental theory in modern physics.
"""

response = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
"""

In [30]:
correctness_result = correctness_evaluator.evaluate(
    query=query,
    response=response,
    reference=reference,
)

In [31]:
correctness_result

EvaluationResult(query='Can you explain the theory of relativity proposed by Albert Einstein in detail?', contexts=None, response="\nCertainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).\n\nHowever, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects foll

In [32]:
correctness_result.score

4.0

In [33]:
correctness_result.passing

True

In [34]:
correctness_result.feedback

'The generated answer is highly relevant and mostly correct, but contains a significant mistake regarding general relativity. The reference answer clearly states that general relativity includes the effects of gravity, not magnetism. However, the generated answer correctly describes special relativity and provides some accurate information about general relativity. The inclusion of magnetism instead of gravity in the explanation of general relativity is a major error, but it does not completely undermine the overall relevance and correctness of the generated answer.'
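
Note that a score of 4.0 passes even though the feedback calls out a major error. The sketch below illustrates how a score and a pass/fail flag could be derived from a judge's raw text. It assumes the judge returns the score on the first line followed by free-text reasoning, and it assumes a pass threshold of 4.0, chosen only because it is consistent with the result here. The function names are hypothetical and this is not the evaluator's actual implementation.

```python
def parse_judge_output(raw: str) -> tuple[float, str]:
    """Split raw judge text into (score, feedback).

    Assumes the first line holds the numeric score and the remainder
    is free-text reasoning.
    """
    score_line, _, feedback = raw.strip().partition("\n")
    return float(score_line), feedback.strip()


def is_passing(score: float, threshold: float = 4.0) -> bool:
    """Assumed rule: a response passes when score >= threshold."""
    return score >= threshold


raw_output = "4.0\nThe generated answer is highly relevant and mostly correct."
score, feedback = parse_judge_output(raw_output)
print(score, is_passing(score))
print(is_passing(score, threshold=4.5))
```

Under these assumptions, raising the threshold above 4.0 would turn this particular result into a failure, which may be preferable when factual accuracy matters more than overall relevance.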

## Retrieval Evaluation

Retrieval evaluation measures the quality of any Retriever module defined in LlamaIndex, using metrics such as hit rate and MRR (mean reciprocal rank). Both compare the retrieved results against the ground-truth context for each question. To simplify building the evaluation dataset, we rely on synthetic data generation.
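
As a worked illustration of the two metrics, the sketch below computes hit rate and MRR over a toy set of queries. For each query, a hit is counted when any expected node ID appears in the retrieved list, and the reciprocal rank is 1 divided by the position of the first expected ID (0 if none is retrieved). The IDs and function names are made up for the example, and this is not LlamaIndex's own implementation.

```python
def hit(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Return 1.0 if any expected ID was retrieved, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(node_id in expected for node_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Return 1 / rank of the first expected ID in the ranked list, else 0.0."""
    expected = set(expected_ids)
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


# Toy data: (retrieved IDs in ranked order, ground-truth IDs) per query.
toy_results = [
    (["n1", "n2"], ["n1"]),  # expected node at rank 1
    (["n3", "n4"], ["n4"]),  # expected node at rank 2
    (["n5", "n6"], ["n9"]),  # expected node not retrieved
]

hit_rate = sum(hit(r, e) for r, e in toy_results) / len(toy_results)
mrr = sum(reciprocal_rank(r, e) for r, e in toy_results) / len(toy_results)
print(f"hit rate = {hit_rate:.3f}, MRR = {mrr:.3f}")  # hit rate = 0.667, MRR = 0.500
```

Hit rate only records whether the right chunk was found at all, while MRR also rewards ranking it near the top, so the two can diverge when relevant chunks are retrieved but ranked low.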

In [36]:
reader = SimpleDirectoryReader("/Users/jethroestrada/Desktop/External_Projects/Jet_Projects/JetScripts/data/jet-resume/data")
documents = reader.load_data()

from llama_index.core.text_splitter import SentenceSplitter

# create parser and parse document into nodes
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
nodes = parser(documents)

In [37]:
vector_index = VectorStoreIndex(nodes)

In [38]:
# Define the retriever
retriever = vector_index.as_retriever(similarity_top_k=2)

In [39]:
retrieved_nodes = retriever.retrieve(eval_query)

In [40]:
from llama_index.core.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=2000)

**Node ID:** e8c7a22c-c9fe-46b3-84c2-cff4ef42a541<br>**Similarity:** 0.5197927857574212<br>**Text:** # Skills

## Frontend

- React
- React Native
- Vanilla JS/CSS
- Expo
- GraphQL
- Redux
- Gatsby
- TypeScript

## Backend

- Node.js
- Python

## Databases

- PostgreSQL
- MongoDB

## Platforms

- Firebase
- AWS
- Google Cloud

## Developer Tools

- Photoshop
- Jest (Unit testing)
- Cypress (Integration testing)
- Selenium (E2E testing)
- Git
- Sentry bug tracker
- Android Studio
- Xcode
- Fastlane
- Serverless
- ChatGPT<br>

**Node ID:** c1209e55-77e4-41f3-b980-82e93fbd8b06<br>**Similarity:** 0.5173908664780609<br>**Text:** # Web apps

## Jules Procure

### Achievements

- Started as the sole client side developer, built enterprise web and mobile CRM apps starting from provided mockups to production
- JulesAI CEO was impressed and acquired ownership of existing CRM
- Successfully integrated existing CRM with JulesAI's workflow to be rebranded as "Jules Procure"
- Key features: Contact dashboard, Data builder, Task calendar, Workflow boards, Form builders, Price list generator, Automated emails based on triggers, and more
- Technologies used: React, React Native, AWS Lambdas, GraphQL, Docker, Serverless, Jest

## Digital Cities PH

### Achievements

- As the lead developer, I worked on a portal that showcases the profiles of provinces and cities in the Philippines
- Developed an interactive Philippine map with clickable provinces, enabling users to access detailed descriptions and statistics for each region
- Key features: Interactive map, Search, Filtering, Fast loading, SEO-friendly
- Technologies used: React, GraphQL, React Static, Headless CMS

## ADEC Kenya, AMDATEX

### Achievements

- Built UI components from mockups using Photoshop to achieve pixel-perfect look
- Key features: Responsive, Reusable components
- Technologies used: React, jQuery, Wordpress<br>

In [41]:
qa_dataset = generate_question_context_pairs(
    nodes, llm=gpt4, num_questions_per_chunk=2
)
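
The generated dataset links each question to the chunk it was generated from, and that link is what serves as ground truth for the retrieval metrics. The sketch below mimics such a layout with plain dictionaries: a `queries` map, a `corpus` map, and a `relevant_docs` map from query ID to node IDs. The layout is stated here as an assumption about the dataset object, and every ID and text is invented.

```python
# Stand-in for the question/context dataset; every ID and text is invented.
toy_dataset = {
    "queries": {"q1": "Which frontend frameworks are listed?"},
    "corpus": {"node-a": "# Skills\n\n## Frontend\n\n- React"},
    "relevant_docs": {"q1": ["node-a"]},
}


def iter_labeled_queries(dataset: dict):
    """Yield (question, expected node IDs) pairs from the stand-in dataset."""
    for query_id, question in dataset["queries"].items():
        yield question, dataset["relevant_docs"][query_id]


labeled = list(iter_labeled_queries(toy_dataset))
print(labeled)
```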

[1m[38;5;15meval_duration:[0m [1m[38;5;220m7.34s[0m


[1m[38;5;213mFinal tokens info:[0m
[1m[38;5;45mPrompt tokens:[0m [1m[38;5;40m205[0m
[1m[38;5;45mResponse tokens:[0m [1m[38;5;40m88[0m
[1m[38;5;45mTotal tokens:[0m [1m[38;5;40m293[0m


[1m[38;5;208mCalling Ollama chat...[0m
[38;5;250mLLM model:[0m [1m[38;5;213mllama3.1[0m [1m[38;5;213m(4096)[0m [38;5;

100%|██████████| 7/7 [01:35<00:00, 13.58s/it]

[1m[38;5;40m above[0m[1m[38;5;40m[0m


[1m[38;5;15mModel:[0m [1m[38;5;45mllama3.1[0m
[1m[38;5;15mOptions:[0m [1m[38;5;45m{'num_ctx': 3900, 'seed': 42, 'temperature': 0.0, 'num_keep': 0, 'num_predict': -1}[0m
[1m[38;5;15mStream:[0m [1m[38;5;45mTrue[0m
[1m[38;5;15mResponse:[0m [1m[38;5;45m459[0m
[1m[38;5;15mContent:[0m [1m[38;5;45m1575[0m

[1m[38;5;213mDurations:[0m
[1m[38;5;15mtotal_duration:[0m [1m[38;5;220m13.57s[0m
[1m[38;5;15mload_duration:[0m [1m[38;5;82m35.86ms[0m
[1m[38;5;15mprompt_eval_duration:[0m [1m[38;5;220m2.28s[0m
[1m[38;5;15meval_duration:[0m [1m[38;5;220m11.25s[0m


[1m[38;5;213mFinal tokens info:[0m
[1m[38;5;45mPrompt tokens:[0m [1m[38;5;40m267[0m
[1m[38;5;45mResponse tokens:[0m [1m[38;5;40m134[0m
[1m[38;5;45mTotal tokens:[0m [1m[38;5;40m401[0m






In [42]:
queries = qa_dataset.queries.values()
print(list(queries)[5])

**Question 1**


In [43]:
len(list(queries))

14

In [44]:
# Score the retriever with mean reciprocal rank (MRR) and hit rate
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
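
The two metric names passed above, `mrr` and `hit_rate`, are standard ranking metrics. As a rough illustration of what they measure for a single query, the sketch below computes both from a list of expected node IDs and a ranked list of retrieved IDs. The helper names `hit_rate` and `reciprocal_rank` and the sample IDs are hypothetical, written only for this illustration; they are not LlamaIndex APIs, and the library's own implementation may differ in detail.

```python
from typing import Sequence


def hit_rate(expected_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """Return 1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(doc_id in expected for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(expected_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """Return 1 / rank of the first relevant retrieved ID, or 0.0 if none is relevant."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical IDs: the relevant node is retrieved in second position
print(hit_rate(["n2"], ["n7", "n2"]))         # 1.0
print(reciprocal_rank(["n2"], ["n7", "n2"]))  # 0.5

# The relevant node is not retrieved at all
print(hit_rate(["n2"], ["n7", "n9"]))         # 0.0
print(reciprocal_rank(["n2"], ["n7", "n9"]))  # 0.0
```

Averaging the reciprocal rank over all queries gives the MRR, and averaging the per-query hit value gives the dataset-level hit rate.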

In [45]:
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"
Query: Here are two questions based on the context information:
Metrics: {'mrr': 0.0, 'hit_rate': 0.0}



In [46]:
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

HTTP Request: POST http://localhost:11434/api/embeddings "HTTP/1.1 200 OK"

In [47]:
import pandas as pd


def display_results(name, eval_results):
    """Average hit rate and MRR over all results into a one-row DataFrame."""

    # Collect the per-query metric dictionaries
    metric_dicts = [
        eval_result.metric_vals_dict for eval_result in eval_results
    ]

    full_df = pd.DataFrame(metric_dicts)

    # Mean of each metric over every query in the dataset
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}
    )

    return metric_df

In [48]:
display_results("top-2 eval", eval_results)

|   | retrievers | hit_rate | mrr |
|---|------------|----------|----------|
| 0 | top-2 eval | 0.357143 | 0.285714 |
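
To make these averages easier to interpret, here is a small worked check. It assumes, as the label "top-2 eval" suggests but the output does not confirm, that the retriever returns two nodes per query and that each query has exactly one relevant node, so a per-query reciprocal rank can only be 1, 0.5, or 0. Under those assumptions, the sketch searches for how many of the 14 queries must have hit at rank 1 and at rank 2 to reproduce the reported means.

```python
n_queries = 14             # number of queries in the dataset
reported_hit_rate = 0.357143
reported_mrr = 0.285714

solutions = []
for rank1 in range(n_queries + 1):
    for rank2 in range(n_queries + 1 - rank1):
        # A hit at rank 1 contributes 1.0 to the MRR sum, a hit at rank 2 contributes 0.5
        hit_rate_mean = (rank1 + rank2) / n_queries
        mrr_mean = (rank1 + 0.5 * rank2) / n_queries
        if (
            abs(hit_rate_mean - reported_hit_rate) < 1e-6
            and abs(mrr_mean - reported_mrr) < 1e-6
        ):
            solutions.append((rank1, rank2))

print(solutions)  # [(3, 2)]
```

Under the stated assumptions, the relevant node was ranked first for 3 queries, second for 2 queries, and missed for the remaining 9, which gives a hit rate of 5/14 and an MRR of 4/14.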
