# Evaluate using sample questions: Llama Stack vs. LlamaIndex

This notebook starts with sample questions and reference answers in the format generated by [make-sample-questions.ipynb](./make-sample-questions.ipynb). It then does the following:

1. It runs RAG with Llama Stack to generate answers.
2. It runs RAG with LlamaIndex to generate answers.
3. It uses Ragas to evaluate the generated answers against the reference answers. This step depends on a very capable evaluator model; we use gpt-4o for that purpose.
4. It determines whether the results are statistically significant.

## Import dependencies

In [33]:
import evaluation_utilities
import os
from IPython.display import clear_output

import copy
import importlib

from ragas.metrics import (
    AnswerAccuracy,
)
from ragas.metrics._domain_specific_rubrics import RubricsScore

from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI
from ragas.llms import LlamaIndexLLMWrapper

from llama_index.core.llms import ChatMessage
from llama_index.llms.ibm import WatsonxLLM
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage
from llama_stack_client import RAGDocument
from llama_stack_client import Agent

import uuid

In [2]:
# Rerun this cell whenever you change evaluation_utilities
importlib.reload(evaluation_utilities)

<module 'evaluation_utilities' from '/Users/bmurdock/lls-comparisons/evaluation_utilities.py'>

## Configure and initialize models

The main configuration options for this notebook are in the following cell, so you may want to edit some values there before running.
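The configuration cell reads `WATSONX_PROJECT_ID` with `os.environ.get`, so a missing variable silently becomes `None` and may only surface as an error later. A small guard such as the following sketch can report the problem up front. The helper name `require_env` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable the OpenAI client reads for the gpt-4o evaluator.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for the given variables; raise if any is missing or empty."""
    # `env` is injectable so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Example call (commented out so the sketch does not fail where the variables are unset):
# require_env(["WATSONX_PROJECT_ID", "OPENAI_API_KEY"])
```

Failing early here is cheaper than discovering a missing credential after the vector database has already been populated.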

In [43]:
# reuse_client and max_retries are useful for preventing Ragas from failing due to rate limiting
EVALUATOR_MODEL={"model": "gpt-4o", "reuse_client": False, "max_retries": 10}

EMBED_MODEL_ID_FOR_LLAMAINDEX="ibm-granite/granite-embedding-125m-english"
EMBED_MODEL_ID_FOR_LLAMA_STACK="granite-embedding-125m"

WATSONX_PROJECT_ID=os.environ.get("WATSONX_PROJECT_ID")

LLAMA_INDEX_RAG_MODELS_INFO_WATSONX= [ {"model_id": "meta-llama/llama-3-3-70b-instruct", "project_id": WATSONX_PROJECT_ID, "max_new_tokens": 4096, "additional_params": {"time_limit": 10000}}]
LLAMA_STACK_RAG_MODELS_INFO_WATSONX = ["meta-llama/llama-3-3-70b-instruct"]


CONTENT_URLS=["https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173"]
CONTENT_LOCATION="./docs/"

TEST_DATA_FILES = ["questions_and_reference_answers-ibm-50-10-1495.json"]

NUMBER_OF_SEARCH_RESULTS=5

LLAMA_INDEX_RAG_MODELS_INFO_WATSONX

[{'model_id': 'meta-llama/llama-3-3-70b-instruct',
  'project_id': 'cef5787a-161d-47e0-a549-2a7ecf2fb355',
  'max_new_tokens': 4096,
  'additional_params': {'time_limit': 10000}}]

In [4]:

LLAMA_INDEX_RAG_MODELS = {}
for info in LLAMA_INDEX_RAG_MODELS_INFO_WATSONX:
    model_id = info["model_id"]
    watsonx_llm = WatsonxLLM(**info)
    LLAMA_INDEX_RAG_MODELS[model_id] = watsonx_llm

EMBED_MODEL_FOR_LLAMAINDEX = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID_FOR_LLAMAINDEX)

evaluator_model = LlamaIndexOpenAI(**EVALUATOR_MODEL)

INFO:ibm_watsonx_ai.client:Client successfully initialized
INFO:httpx:HTTP Request: GET https://us-south.ml.cloud.ibm.com/ml/v1/foundation_model_specs?version=2025-05-21&project_id=cef5787a-161d-47e0-a549-2a7ecf2fb355&filters=function_text_generation%2C%21lifecycle_withdrawn%3Aand&limit=200 "HTTP/1.1 200 OK"
INFO:ibm_watsonx_ai.wml_resource:Successfully finished Get available foundation models for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/foundation_model_specs?version=2025-05-21&project_id=cef5787a-161d-47e0-a549-2a7ecf2fb355&filters=function_text_generation%2C%21lifecycle_withdrawn%3Aand&limit=200'
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: ibm-granite/granite-embedding-125m-english
INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']


In [5]:
LLAMA_INDEX_RAG_MODELS["meta-llama/llama-3-3-70b-instruct"]

WatsonxLLM(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x330304310>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x11e0cd6c0>, completion_to_prompt=<function default_completion_to_prompt at 0x11e5e7380>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model_id='meta-llama/llama-3-3-70b-instruct', deployment_id=None, temperature=None, max_new_tokens=4096, additional_params={'time_limit': 10000}, project_id='cef5787a-161d-47e0-a549-2a7ecf2fb355', space_id=None, url=SecretStr('**********'), apikey=SecretStr('**********'), token=None, password=None, username=None, instance_id=None, version=None, verify=None, validate_model=True, persistent_connection=True)

In [6]:
messages = [
    ChatMessage(role="user", content="Explain why WW1 started"),
]
print(evaluator_model.chat(messages))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


assistant: World War I, also known as the Great War, began in 1914 and was the result of a complex interplay of factors that had been building up over several decades. Here are some of the key reasons why the war started:

1. **Assassination of Archduke Franz Ferdinand**: The immediate catalyst for the war was the assassination of Archduke Franz Ferdinand of Austria-Hungary and his wife, Sophie, on June 28, 1914, by Gavrilo Princip, a Bosnian Serb nationalist. This event set off a chain reaction of diplomatic and military mobilizations.

2. **Alliance System**: Europe was divided into two major alliance systems. The Triple Entente, consisting of France, Russia, and the United Kingdom, and the Triple Alliance, made up of Germany, Austria-Hungary, and Italy. These alliances were meant to provide mutual defense and deter aggression, but they also meant that a conflict involving one country could quickly involve others.

3. **Militarism**: There was a significant build-up of military force

In [7]:
for label, model in LLAMA_INDEX_RAG_MODELS.items():
    print(f"{label}: {model.chat(messages)}")

INFO:httpx:HTTP Request: POST https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2025-05-21 "HTTP/1.1 200 OK"
INFO:ibm_watsonx_ai.wml_resource:Successfully finished chat for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2025-05-21'


meta-llama/llama-3-3-70b-instruct: assistant: The start of World War I was a complex and multifaceted event, involving various factors and parties. 

One major factor was the system of alliances between European countries, including the Triple Entente (France, Britain, and Russia) and the Triple Alliance (Germany, Austria-Hungary, and Italy). These alliances created a situation in which a small conflict between two countries could quickly escalate into a larger war.

Another key factor was the rise of nationalism and imperialism in various European countries, which led to increased tensions and competition for resources and territory. The Balkans, in particular, were a region of significant tension, with various ethnic and national groups seeking independence or greater autonomy.

The immediate trigger for the war was the assassination of Archduke Franz Ferdinand, the heir to the throne of Austria-Hungary, by a group of Serbian nationalists in June 1914. This event led Austria-Hungary 

In [8]:
client = LlamaStackClient(base_url="http://localhost:8321", timeout=12000)

message = UserMessage(
    content="Say 'Hello World'",
    role="user",
)
client.inference.chat_completion(
    model_id=LLAMA_STACK_RAG_MODELS_INFO_WATSONX[0],
    messages=[message],
    stream=False
).completion_message.content

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


'Hello World!'

## Load the evaluation questions with reference answers

This loads the outputs of [make-sample-questions.ipynb](./make-sample-questions.ipynb).
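For orientation, each record in those files is a dictionary that includes the keys `user_input`, `reference`, and `reference_contexts` (the keys used by the cells in this notebook). The sketch below uses made-up records, not real data, to show that shape and the kind of filtering applied in the following cells, which keep only questions with at least one reference context. The helper name `with_reference_contexts` is hypothetical.

```python
# Made-up records for illustration only; the real files are much larger.
sample_records = [
    {
        "user_input": "What was the free cash flow?",
        "reference": "...",
        "reference_contexts": ["## 2024 Performance ..."],
    },
    {
        "user_input": "Who wrote the report?",
        "reference": "...",
        "reference_contexts": [],
    },
]

def with_reference_contexts(records):
    """Keep only the records that have at least one reference context."""
    return [record for record in records if len(record["reference_contexts"]) > 0]

kept = with_reference_contexts(sample_records)
print(f"{len(kept)} of {len(sample_records)}")  # prints "1 of 2"
```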

In [9]:
loaded_data = []

for f in TEST_DATA_FILES:
    data = evaluation_utilities.read_json(f)
    loaded_data = loaded_data + data
print(len(loaded_data))
loaded_data[0:2]

1495


[{'user_input': "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
  'reference': "Based on IBM's 2024 Annual Report, the company's financial performance and strategic initiatives appear to support its ability to sustain or potentially increase dividend payouts to shareholders. Here are some key points from the report that influence this assessment:\n\n1. **Revenue Growth and Cash Flow**: IBM reported $62.8 billion in revenue, with a 3% increase at constant currency, and generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. Strong revenue growth and cash flow generation provide a solid foundation for sustaining dividend payouts.\n\n2. **Investment in Growth Areas**: IBM has made significant investments in AI and hybrid cloud, which are expected to drive future growth. The company allocated over $7 billion to research 

In [10]:
loaded_data_with_reference_contexts = list(filter(lambda element : len(element["reference_contexts"]) > 0, loaded_data))
#evaluation_utilities.write_json(loaded_data_with_reference_contexts, "questions_and_reference_answers_combined.json")
print(f"{len(loaded_data_with_reference_contexts)} of {len(loaded_data)}")

706 of 1495


In [11]:
# Temporary code to cut scope for quick testing.  Uncomment the next line to run a mini-test of the notebook.
# loaded_data_with_reference_contexts = loaded_data_with_reference_contexts[0:20]

In [12]:
count = len(loaded_data_with_reference_contexts)
print(count)

706


## Run Agentic RAG with Llama Stack

This attempts to recreate the flow in https://llama-stack.readthedocs.io/en/latest/building_applications/rag.html#using-the-rag-tool , that is, the RAG pipeline that a new user following the getting-started documentation would build, except that it is configured with the following elements:

- Content is from the URLs configured in CONTENT_URLS at the top of this notebook
- Milvus-lite inline vector IO provider
- granite-embedding-125m embedding model
- meta-llama/llama-3-3-70b-instruct generative model using the watsonx inference provider
- max_tokens for output is 4096

In [13]:
# Register a vector db
vector_db_id = f"rag-eval-{uuid.uuid4().hex}"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=EMBED_MODEL_ID_FOR_LLAMA_STACK,
    embedding_dimension=768,
    provider_id="milvus",
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/vector-dbs "HTTP/1.1 200 OK"


In [14]:
# Wrap the URLs as Llama Stack RAGDocument objects

documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type="application/pdf",
        metadata={"url": url},
    )
    for i, url in enumerate(CONTENT_URLS)
]

documents

[{'document_id': 'num-0',
  'content': 'https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173',
  'mime_type': 'application/pdf',
  'metadata': {'url': 'https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173'}}]

In [15]:
# Insert those Llama Stack RAGDocument objects into the vector database

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/insert "HTTP/1.1 200 OK"


In [16]:
# Query documents

results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
)

results

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"


QueryResult(metadata={'document_ids': ['num-0', 'num-0', 'num-0', 'num-0', 'num-0']}, content=[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1\nContent: -year amounts have been reclassified to conform to the change in 2024 presentation.\nFrom the perspective of how management views cash flow, in 2024, after investing $1.1 billion in net capital investments, we \ngenerated free cash flow of $12.7 billion, an increase of $1.5 billion versus the prior year. The year-to-year increase in free cash \nflow primarily reflects current year performance-related improvements within net income and sustainable lower cash requirements\nthrough changes in our retirement plans. In 2024, net capital expenditures and net cash from operating activities include $0.4 \nbillion and $0.1 billion, respectively, of cash proceeds from the sale of certain QRadar SaaS assets. This benefit to net capital \nexpendit

In [None]:
# Register a RAG Agent in Llama Stack using the vector database

sampling_params = {
    "max_tokens": 4096
}

# Create agent with memory
agent = Agent(
    client,
    model=LLAMA_STACK_RAG_MODELS_INFO_WATSONX[0],
    instructions="You are a helpful assistant",
    sampling_params=sampling_params,
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
                "query_config": {
                    "chunk_size_in_tokens": 512,
                    "chunk_overlap_in_tokens": 0,
                    "max_chunks": NUMBER_OF_SEARCH_RESULTS,
                    "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
                },
            },
        }
    ],
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search "HTTP/1.1 200 OK"
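The `chunk_template` passed to the agent above is an ordinary Python format string. The sketch below shows how such a template could be rendered for one retrieved chunk. It assumes the placeholders are filled with `str.format` (how Llama Stack does this internally is not shown in this notebook), and `SimpleNamespace` stands in for the real chunk object.

```python
from types import SimpleNamespace

CHUNK_TEMPLATE = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"

# Stand-in for a retrieved chunk; the real object comes from the vector database.
chunk = SimpleNamespace(
    content="we generated free cash flow of $12.7 billion",
    metadata={"document_id": "num-0"},
)

# `{chunk.content}` uses attribute lookup inside the format string.
rendered = CHUNK_TEMPLATE.format(index=1, chunk=chunk, metadata=chunk.metadata)
print(rendered)
```

The rendered text has the same `Result N` / `Content:` layout that appears in the `QueryResult` output of the earlier query cell.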


In [18]:
session_id = agent.create_session(f"rag_session-{uuid.uuid4().hex}")


# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?"}],
    session_id=session_id,
    stream=False
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/d1793085-930b-453d-81aa-ebdb070e0c3b/session "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/d1793085-930b-453d-81aa-ebdb070e0c3b/session/27c57397-cde7-4a43-8640-2cd556357720/turn "HTTP/1.1 200 OK"


In [19]:
response.output_message.content

"Based on the 2024 Annual Report, IBM's financial performance and strategic initiatives may influence its ability to sustain or increase dividend payouts to shareholders in several ways:\n\n1. **Revenue growth**: IBM's revenue increased by 3% at constant currency in 2024, which could provide a stable foundation for dividend payments.\n2. **Free cash flow**: The company generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year, which could be used to fund dividend payments.\n3. **Dividend payments**: IBM returned $6.1 billion to shareholders through dividends in 2024, and the company's Board of Directors increased the quarterly common stock dividend from $1.66 to $1.67 per share in the second quarter of 2024.\n4. **Investments in innovation**: IBM invested $3.3 billion in acquisitions and $7 billion in research and development, which could drive future growth and potentially increase dividend payments.\n5. **Retirement-related plans**: The company's retiremen

In [None]:
# Wait 30 seconds after a failure in the hopes that any temporary server glitches will be sorted out by then.
DELAY=30

# If it fails more than 15 times, give up.
MAX_RETRIES=15

def run_lls_rag(data, generator_model_ids, vector_db_id, instructions="You are a helpful assistant", label="Llama Stack"):
    datasets = {}
    i = 1
    sampling_params = {
        "max_tokens": 4096
    }

    for generator_model_id in generator_model_ids:
        dataset_for_test_model = copy.deepcopy(data)
        # Create agent with memory
        agent = Agent(
            client,
            model=generator_model_id,
            instructions=instructions,
            sampling_params=sampling_params,
            tools=[
                {
                    "name": "builtin::rag/knowledge_search",
                    "args": {
                        "vector_db_ids": [vector_db_id],
                        # Defaults
                        "query_config": {
                            "chunk_size_in_tokens": 512,
                            "chunk_overlap_in_tokens": 0,
                            "max_chunks": NUMBER_OF_SEARCH_RESULTS,
                            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
                        },
                    },
                }
            ],
        )
    
        count = len(dataset_for_test_model)
        # Restart the progress counter for each model so "i / count" stays accurate.
        for i, entry in enumerate(dataset_for_test_model, start=1):
            clear_output(wait=True)
            print(generator_model_id)
            print(f"{i} / {count}")
            question = entry["user_input"]
            
            session_id = agent.create_session(f"rag_session-{uuid.uuid4().hex}")
            response = evaluation_utilities.run_with_retries(
                lambda: agent.create_turn(
                    messages=[{"role": "user", "content": question}],
                    session_id=session_id,
                    stream=False
                ),
                MAX_RETRIES,
                DELAY)
            text = response.output_message.content
            entry["response"] = text

        datasets[label + ":" + generator_model_id] = dataset_for_test_model
    return datasets
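`evaluation_utilities.run_with_retries` is defined outside this notebook, so its implementation is not shown here. As a rough sketch of what such a helper might look like (the name `run_with_retries_sketch`, its signature, and its exact behavior are assumptions based only on how it is called above), it calls a zero-argument function, waits `delay` seconds after each failure, and re-raises the last exception once `max_retries` attempts have failed.

```python
import time

def run_with_retries_sketch(func, max_retries, delay, sleep=time.sleep):
    """Call func() up to max_retries times, pausing `delay` seconds between failures."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: any server glitch triggers a retry
            last_error = error
            print(f"Attempt {attempt} of {max_retries} failed: {error}")
            if attempt < max_retries:
                sleep(delay)
    # Every attempt failed, so surface the most recent error to the caller.
    raise last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use it defaults to `time.sleep`.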

In [21]:
lls_datasets = run_lls_rag(loaded_data_with_reference_contexts, LLAMA_STACK_RAG_MODELS_INFO_WATSONX, vector_db_id)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/2026027d-ab6c-4a4c-ba29-559e513cc82a/session "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/agents/2026027d-ab6c-4a4c-ba29-559e513cc82a/session/1ff005d5-cfbd-4e9d-b361-f87633ecfcde/turn "HTTP/1.1 200 OK"


706


In [22]:
lls_datasets['Llama Stack:meta-llama/llama-3-3-70b-instruct'][0:2]

[{'user_input': "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
  'reference': "Based on IBM's 2024 Annual Report, the company's financial performance and strategic initiatives appear to support its ability to sustain or potentially increase dividend payouts to shareholders. Here are some key points from the report that influence this assessment:\n\n1. **Revenue Growth and Cash Flow**: IBM reported $62.8 billion in revenue, with a 3% increase at constant currency, and generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. Strong revenue growth and cash flow generation provide a solid foundation for sustaining dividend payouts.\n\n2. **Investment in Growth Areas**: IBM has made significant investments in AI and hybrid cloud, which are expected to drive future growth. The company allocated over $7 billion to research 

In [27]:
# Save checkpoint
evaluation_utilities.write_json(lls_datasets, f"./questions_and_reference_answers_and_system_answers-lls-{count}.json")

## Run RAG with LlamaIndex

This attempts to recreate the flow in https://docs.llamaindex.ai/en/stable/understanding/rag/ , that is, the RAG pipeline that a new user following the getting-started documentation would build, except that it is configured with the following elements:

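- Content is read from the local directory configured in CONTENT_LOCATION at the top of this notebook (assumed to hold copies of the documents in CONTENT_URLS)
- LlamaIndex's default in-memory vector store (the cells below call `VectorStoreIndex.from_documents` without configuring Milvus)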
- granite-embedding-125m embedding model
- meta-llama/llama-3-3-70b-instruct generative model using the watsonx inference provider
- max_tokens for output is 4096
- number of search results to return is 5
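The LlamaIndex cells read documents from `CONTENT_LOCATION`, while Llama Stack fetches them from `CONTENT_URLS`, so the local directory needs to hold the same documents for the comparison to be fair. The notebook does not show how `./docs/` was populated; the following sketch is one way to do it. The helper name `download_documents`, the `.pdf` suffix, and the file-naming scheme are all assumptions.

```python
import os
import urllib.request

def download_documents(urls, target_dir, fetch=None):
    """Save each URL into target_dir and return the list of written file paths."""
    if fetch is None:
        # Default fetcher: plain HTTP GET using the standard library.
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read()
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for url in urls:
        # Name the file after the last URL path segment; assumes the content is a PDF.
        name = url.rstrip("/").rsplit("/", 1)[-1] + ".pdf"
        path = os.path.join(target_dir, name)
        with open(path, "wb") as output_file:
            output_file.write(fetch(url))
        paths.append(path)
    return paths

# Example call (commented out because it needs network access):
# download_documents(CONTENT_URLS, CONTENT_LOCATION)
```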

In [36]:
documents = SimpleDirectoryReader(CONTENT_LOCATION).load_data()
vector_index = VectorStoreIndex.from_documents(documents=documents, embed_model=EMBED_MODEL_FOR_LLAMAINDEX)
vector_index.as_query_engine()



<llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine at 0x3e586f710>

In [44]:
query_engine = vector_index.as_query_engine(llm=LLAMA_INDEX_RAG_MODELS['meta-llama/llama-3-3-70b-instruct'], similarity_top_k=NUMBER_OF_SEARCH_RESULTS)

In [38]:
question = "Why does IBM exist?"
result = query_engine.query(question)
result.response.strip()

INFO:httpx:HTTP Request: POST https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21 "HTTP/1.1 200 OK"
INFO:ibm_watsonx_ai.wml_resource:Successfully finished generate for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21'


'IBM exists to help clients unlock their next chapter of technology-led business growth, built across hybrid multi-cloud and leveraging AI, by providing integrated solutions and products that leverage data, information technology, deep expertise in industries and business processes, with trust and security and a broad ecosystem of partners and alliances.'

In [42]:
def run_li_rag(qna, generator_models, idx, number_of_search_results):
    outputs = {}
    for generator_model_id, generator_model in generator_models.items():
        dataset = copy.deepcopy(qna)

        retriever = VectorIndexRetriever(
            index=idx,
            similarity_top_k=number_of_search_results,
        )

        query_engine = RetrieverQueryEngine(
            retriever=retriever,
            response_synthesizer=get_response_synthesizer(llm = generator_model)
        )

        i = 1
        count = len(dataset)
        for entry in dataset:
            clear_output(wait=True)
            print(generator_model_id)
            print(f"{i} / {count}")
            i += 1
            question = entry["user_input"]
            # Query once, with retries; the result supplies both the response and its contexts.
            result = evaluation_utilities.run_with_retries(
                lambda: query_engine.query(question),
                MAX_RETRIES,
                DELAY,
            )
            entry["response"] = result.response.strip()
            entry["retrieved_contexts"] = [n.text for n in result.source_nodes]
        outputs["LlamaIndex" + ":" + generator_model_id] = dataset
    return outputs

In [45]:
li_datasets = run_li_rag(loaded_data_with_reference_contexts, LLAMA_INDEX_RAG_MODELS, vector_index, number_of_search_results=NUMBER_OF_SEARCH_RESULTS)

meta-llama/llama-3-3-70b-instruct
706 / 706


INFO:httpx:HTTP Request: POST https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21 "HTTP/1.1 200 OK"
INFO:ibm_watsonx_ai.wml_resource:Successfully finished generate for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21'
INFO:httpx:HTTP Request: POST https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21 "HTTP/1.1 200 OK"
INFO:ibm_watsonx_ai.wml_resource:Successfully finished generate for url: 'https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2025-05-21'


In [46]:
li_datasets['LlamaIndex:meta-llama/llama-3-3-70b-instruct'][0:2]

[{'user_input': "Based on IBM's 2024 Annual Report, how might the company's financial performance and strategic initiatives influence its ability to sustain or increase dividend payouts to shareholders?",
  'reference': "Based on IBM's 2024 Annual Report, the company's financial performance and strategic initiatives appear to support its ability to sustain or potentially increase dividend payouts to shareholders. Here are some key points from the report that influence this assessment:\n\n1. **Revenue Growth and Cash Flow**: IBM reported $62.8 billion in revenue, with a 3% increase at constant currency, and generated $12.7 billion in free cash flow, an increase of $1.5 billion year-over-year. Strong revenue growth and cash flow generation provide a solid foundation for sustaining dividend payouts.\n\n2. **Investment in Growth Areas**: IBM has made significant investments in AI and hybrid cloud, which are expected to drive future growth. The company allocated over $7 billion to research 

## Combine the two sets of outputs

Here we combine the outputs from Llama Stack and LlamaIndex into one data structure and save it to disk.  This data structure will then be used in later sections to do the evaluation.
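Because the significance test later in this notebook is paired, the two systems must have answered the same questions in the same order. The following sketch shows the dictionary merge together with a sanity check for that alignment, using made-up data. The helper name `assert_same_questions` is hypothetical and is not part of `evaluation_utilities`.

```python
def assert_same_questions(datasets):
    """Raise if any dataset's questions differ, in content or order, from the first one."""
    labels = list(datasets)
    first = [entry["user_input"] for entry in datasets[labels[0]]]
    for label in labels[1:]:
        other = [entry["user_input"] for entry in datasets[label]]
        if other != first:
            raise ValueError(f"{label} does not match {labels[0]} question-for-question")

# Made-up outputs standing in for lls_datasets and li_datasets.
lls = {"Llama Stack:model-x": [{"user_input": "Q1", "response": "A"}]}
li = {"LlamaIndex:model-x": [{"user_input": "Q1", "response": "B"}]}

combined = lls | li  # dict union operator, available from Python 3.9
assert_same_questions(combined)
print(list(combined))
```

The distinct label prefixes ("Llama Stack:" and "LlamaIndex:") are what keep the merge from overwriting one system's results with the other's.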

In [47]:
datasets = lls_datasets | li_datasets

datasets.keys()

dict_keys(['Llama Stack:meta-llama/llama-3-3-70b-instruct', 'LlamaIndex:meta-llama/llama-3-3-70b-instruct'])

In [48]:
count

706

In [49]:
# Save checkpoint
evaluation_utilities.write_json(datasets, f"./questions_and_reference_answers_and_system_answers-{count}.json")

In [50]:
# Re-load the checkpoint from disk.  This is just here to make it easier to restart the notebook from this point.

datasets = evaluation_utilities.read_json(f"./questions_and_reference_answers_and_system_answers-{count}.json")

## Evaluate results using Ragas and the evaluator model

In [51]:
# Reference for LlamaIndexLLMWrapper: https://docs.ragas.io/en/stable/howtos/integrations/_llamaindex/#building-the-queryengine
evaluator_llm_for_ragas = LlamaIndexLLMWrapper(evaluator_model)

# Rubrics from https://github.com/instructlab/eval/blob/main/src/instructlab/eval/ragas.py which got them from ragas v0.2.11
# and has them "hardcoded in case ragas makes any changes to their DEFAULT_WITH_REFERENCE_RUBRICS in the future".
SCORING_RUBRICS = {
    "score1_description": "The response is entirely incorrect, irrelevant, or does not align with the reference in any meaningful way.",
    "score2_description": "The response partially matches the reference but contains major errors, significant omissions, or irrelevant information.",
    "score3_description": "The response aligns with the reference overall but lacks sufficient detail, clarity, or contains minor inaccuracies.",
    "score4_description": "The response is mostly accurate, aligns closely with the reference, and contains only minor issues or omissions.",
    "score5_description": "The response is fully accurate, completely aligns with the reference, and is clear, thorough, and detailed.",
}

metrics = [
    AnswerAccuracy(llm=evaluator_llm_for_ragas),
    RubricsScore(llm=evaluator_llm_for_ragas, rubrics=SCORING_RUBRICS)
]

In [52]:
results = evaluation_utilities.run_ragas(datasets, evaluator_llm_for_ragas, metrics)

Batch 1/353:   0%|          | 0/4 [00:00<?, ?it/s]INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:ragas.llms.base:callbacks not supported for LlamaIndex LLMs, ignoring callbacks
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:ragas.llms.base:callbacks not supported for LlamaIndex LLMs, ignoring callbacks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:ragas.llms.base:temperature kwarg passed to LlamaIndex LLM
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1/353:  25%|██▌       | 1/4 [00:02<00:06,  2.08s/it]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Batch 1/353:  50%|█████     | 2/4 [00:03<00:03,  1.62s/it]INFO:httpx:HTTP Request: POST https://api.op

In [53]:
results

{'Llama Stack:meta-llama/llama-3-3-70b-instruct': {'nv_accuracy': 0.5089, 'domain_specific_rubrics': 4.0241},
 'LlamaIndex:meta-llama/llama-3-3-70b-instruct': {'nv_accuracy': 0.4851, 'domain_specific_rubrics': 3.9618}}

In [54]:
keys = list(results.keys())
print(keys[0])
results[keys[0]].to_pandas().head()

Llama Stack:meta-llama/llama-3-3-70b-instruct


Unnamed: 0,user_input,reference_contexts,response,reference,nv_accuracy,domain_specific_rubrics
0,"Based on IBM's 2024 Annual Report, how might t...","[## Dear IBM Investor:\n\nIn 2024, IBM made si...","Based on the 2024 Annual Report, IBM's financi...","Based on IBM's 2024 Annual Report, the company...",1.0,5
1,Based on the trends and projections outlined i...,"[## 2024 Performance\n\nFor the year, IBM gene...","Based on the IBM 2024 Annual Report, the compa...",Based on the trends and projections outlined i...,0.75,5
2,How could IBM's investment in research and dev...,"[## Dear IBM Investor:\n\nIn 2024, IBM made si...",IBM's investment in research and development h...,IBM's investment in research and development (...,0.75,5
3,How did IBM adapt its operations in response t...,[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,"In 2024, IBM adapted to technological disrupti...","In 2024, IBM adapted its operations in respons...",0.5,5
4,How did IBM measure the success of its digital...,[## Our go-to-market approach\n\nOver the last...,IBM measured the success of its digital innova...,"In 2024, IBM measured the success of its digit...",0.75,4


In [55]:
print(keys[1])
results[keys[1]].to_pandas().head()

LlamaIndex:meta-llama/llama-3-3-70b-instruct


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,nv_accuracy,domain_specific_rubrics
0,"Based on IBM's 2024 Annual Report, how might t...",[1\nFinancial Position\nDynamics\nOur balance ...,"[## Dear IBM Investor:\n\nIn 2024, IBM made si...","Based on IBM's 2024 Annual Report, the company...","Based on IBM's 2024 Annual Report, the company...",1.0,4
1,Based on the trends and projections outlined i...,[IBM 2024 Annual Report 1\nArvind Krishna\nCha...,"[## 2024 Performance\n\nFor the year, IBM gene...",1. Hybrid Cloud and AI Integration: IBM will l...,Based on the trends and projections outlined i...,0.5,5
2,How could IBM's investment in research and dev...,"[In support for each business segment, our AI ...","[## Dear IBM Investor:\n\nIn 2024, IBM made si...",IBM's investment in research and development c...,IBM's investment in research and development (...,0.75,4
3,How did IBM adapt its operations in response t...,"[In support for each business segment, our AI ...",[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,IBM adapted its operations in response to tech...,"In 2024, IBM adapted its operations in respons...",0.5,5
4,How did IBM measure the success of its digital...,[We recognize \nthat no single company can pro...,[## Our go-to-market approach\n\nOver the last...,IBM measured the success of its digital innova...,"In 2024, IBM measured the success of its digit...",0.75,5


## Assess statistical significance

Here we use Fisher's randomization test to judge whether the differences between those results are statistically significant, both for the full set and for the subset marked as having complete reference answers. We set `permutation_type='samples'` because every model answers the same questions in the same order, which makes this a paired-sample test.
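The idea behind a paired randomization test can be sketched with the standard library alone. `permutation_type='samples'` is the name of a `scipy.stats.permutation_test` parameter, so the notebook presumably relies on that function; the code below is a simplified stand-in, not the notebook's implementation. The function name, the Monte Carlo sign-flipping approach, and the example scores are assumptions made for illustration.

```python
import random
from statistics import mean

def paired_randomization_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired randomization test on the mean per-question score difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("a paired test needs one score per question for each system")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two labels are exchangeable within each pair,
        # which is equivalent to flipping the sign of each difference at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Made-up per-question scores for two systems on the same five questions.
system_a = [1.0, 0.75, 0.75, 0.5, 0.75]
system_b = [1.0, 0.5, 0.75, 0.5, 0.75]
difference, p_value = paired_randomization_test(system_a, system_b)
print(f"mean difference = {difference:.3f}, p = {p_value:.3f}")
```

Pairing matters because question difficulty varies widely; comparing each question with itself removes that variation from the test.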

In [56]:
import itertools

result_pairs = list(itertools.combinations(results.keys(), 2))
result_summary = []
result_pairs

[('Llama Stack:meta-llama/llama-3-3-70b-instruct',
  'LlamaIndex:meta-llama/llama-3-3-70b-instruct')]

In [62]:
subset_of_rows = []

In [64]:
results0 = results[result_pairs[0][0]].to_pandas()
results0


Unnamed: 0,user_input,reference_contexts,response,reference,nv_accuracy,domain_specific_rubrics
0,"Based on IBM's 2024 Annual Report, how might t...","[## Dear IBM Investor:\n\nIn 2024, IBM made si...","Based on the 2024 Annual Report, IBM's financi...","Based on IBM's 2024 Annual Report, the company...",1.00,5
1,Based on the trends and projections outlined i...,"[## 2024 Performance\n\nFor the year, IBM gene...","Based on the IBM 2024 Annual Report, the compa...",Based on the trends and projections outlined i...,0.75,5
2,How could IBM's investment in research and dev...,"[## Dear IBM Investor:\n\nIn 2024, IBM made si...",IBM's investment in research and development h...,IBM's investment in research and development (...,0.75,5
3,How did IBM adapt its operations in response t...,[## DESCRIPTION OF BUSINESS\n\nPlease refer to...,"In 2024, IBM adapted to technological disrupti...","In 2024, IBM adapted its operations in respons...",0.50,5
4,How did IBM measure the success of its digital...,[## Our go-to-market approach\n\nOver the last...,IBM measured the success of its digital innova...,"In 2024, IBM measured the success of its digit...",0.75,4
...,...,...,...,...,...,...
701,Who is the Chairman of the Board at IBM accord...,"[## Arvind Krishna\n\nChairman, President and ...",The Chairman of the Board at IBM according to ...,"I'm sorry, but I don't have access to the 2024...",0.75,4
702,Who is the chairperson of IBM's Board of Direc...,"[## Arvind Krishna\n\nChairman, President and ...",The chairperson of IBM's Board of Directors in...,Arvind Krishna is the Chairman of IBM.,0.75,4
703,Who is the current CEO of IBM as mentioned in ...,"[## Arvind Krishna\n\nChairman, President and ...",The current CEO of IBM as mentioned in the 202...,The context information provided does not spec...,1.00,4
704,Who is the intended audience for IBM's vision ...,"[## Dear IBM Investor:\n\nIn 2024, IBM made si...",The intended audience for IBM's vision and mis...,The intended audience for IBM's vision and mis...,0.50,4


In [70]:

for result_pair in result_pairs:
    # Convert each system's results to a DataFrame once per pair rather than once per metric.
    results0 = results[result_pair[0]].to_pandas()
    results1 = results[result_pair[1]].to_pandas()

    # Optionally restrict the comparison to selected row positions (an empty list keeps all rows).
    if subset_of_rows:
        results0 = results0.iloc[subset_of_rows]
        results1 = results1.iloc[subset_of_rows]

    result_summary_for_pair = {}
    for metric in metrics:
        overview_label = f"{result_pair[0]} {result_pair[1]} {metric.name}"
        _, p_value, score0, score1 = evaluation_utilities.print_stats_significance(
            results0[metric.name],
            results1[metric.name],
            overview_label,
            result_pair[0],
            result_pair[1],
        )

        # Record the mean score of each system and the p-value for this metric.
        result_summary_for_pair[metric.name] = {
            result_pair[0]: float(score0),
            result_pair[1]: float(score1),
            "p": p_value,
        }
    result_summary.append(result_summary_for_pair)

Llama Stack:meta-llama/llama-3-3-70b-instruct LlamaIndex:meta-llama/llama-3-3-70b-instruct nv_accuracy
 Llama Stack:meta-llama/llama-3-3-70b-instruct     :     0.5089
 LlamaIndex:meta-llama/llama-3-3-70b-instruct      :     0.4851
 p_value                                           :     0.0634
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is better.
  Note that this data includes 706 questions which typically produces a margin of error of around +/-3.8%.
  So the two are probably roughly within that margin of error or so.
Llama Stack:meta-llama/llama-3-3-70b-instruct LlamaIndex:meta-llama/llama-3-3-70b-instruct domain_specific_rubrics
 Llama Stack:meta-llama/llama-3-3-70b-instruct     :     4.0241
 LlamaIndex:meta-llama/llama-3-3-70b-instruct      :     3.9618
 p_value                                           :     0.1440
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude

In [71]:
first_result = next(iter(results.values()))
print(f"All data ({len(first_result.to_pandas())})")
result_summary_all_rows = evaluation_utilities.report_results_with_significance(results, metrics)
result_summary_all_rows

All data (706)
Llama Stack:meta-llama/llama-3-3-70b-instruct LlamaIndex:meta-llama/llama-3-3-70b-instruct nv_accuracy
 Llama Stack:meta-llama/llama-3-3-70b-instruct     :     0.5089
 LlamaIndex:meta-llama/llama-3-3-70b-instruct      :     0.4851
 p_value                                           :     0.0520
  p_value>=0.05 so this result is NOT statistically significant.
  You can conclude that there is not enough data to tell which is better.
  Note that this data includes 706 questions which typically produces a margin of error of around +/-3.8%.
  So the two are probably roughly within that margin of error or so.
Llama Stack:meta-llama/llama-3-3-70b-instruct LlamaIndex:meta-llama/llama-3-3-70b-instruct domain_specific_rubrics
 Llama Stack:meta-llama/llama-3-3-70b-instruct     :     4.0241
 LlamaIndex:meta-llama/llama-3-3-70b-instruct      :     3.9618
 p_value                                           :     0.1476
  p_value>=0.05 so this result is NOT statistically significant.
  Y

[{'nv_accuracy': {'Llama Stack:meta-llama/llama-3-3-70b-instruct': 0.5088526912181303,
   'LlamaIndex:meta-llama/llama-3-3-70b-instruct': 0.48512747875354106,
   'p': 0.051994800519948},
  'domain_specific_rubrics': {'Llama Stack:meta-llama/llama-3-3-70b-instruct': 4.024079320113314,
   'LlamaIndex:meta-llama/llama-3-3-70b-instruct': 3.961756373937677,
   'p': 0.14758524147585242}}]
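
The margin of error quoted in the output above (around +/-3.8% for 706 questions) matches the common 1/sqrt(n) rule of thumb. How `evaluation_utilities` actually computes that figure is not shown here, so the sketch below is an assumption that happens to reproduce the number. For comparison, it also computes the usual 95% normal-approximation margin for a proportion, which comes out slightly smaller. Both function names are hypothetical.

```python
import math


def rough_margin_of_error(n_questions: int) -> float:
    """Rule-of-thumb margin of error, 1/sqrt(n). Assumed, not taken from evaluation_utilities."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return 1.0 / math.sqrt(n_questions)


def normal_margin_of_error(n_questions: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """95% normal-approximation margin for a proportion; widest at proportion = 0.5."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return z * math.sqrt(proportion * (1.0 - proportion) / n_questions)


print(f"rule of thumb: +/-{rough_margin_of_error(706):.1%}")
print(f"normal approx: +/-{normal_margin_of_error(706):.1%}")
```

Either way, the margin shrinks with the square root of the number of questions, so halving it takes roughly four times as many questions.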

## Write reports

In [None]:
evaluation_utilities.write_excel(results, result_summary_all_rows, None, f"report-lls-vs-li-{count}.xlsx")