# Evaluating different models from the Azure AI catalog with LlamaIndex and Phoenix

This notebook demonstrates how to use multiple models from Azure AI Studio, picking the right model for the right job. We will use LlamaIndex to build a RAG system and select models from different providers, making the most of the strengths each one offers.

## Preparing

In this example, we will use multiple models deployed in this project, including Phi-3, Cohere Command R+, Cohere Embed V3, Mistral Large, and OpenAI GPT-4o. Endpoint URLs and keys are stored in the `.env` file. Please update it accordingly:

In [1]:
import os
from dotenv import load_dotenv

load_dotenv(".env", override=True)

True
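
Every model below reads its endpoint, key, and model name from environment variables, so a quick check right after loading `.env` can catch a missing value early. The following is a minimal sketch of such a check; the helper name `missing_env_vars` is hypothetical, and the variable list only covers the Cohere Command R+ settings used later in this notebook.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the Cohere Command R+ cell later in this notebook.
REQUIRED_VARS = [
    "AZURE_AI_COHERE_CMDR_ENDPOINT_URL",
    "AZURE_AI_COHERE_CMDR_ENDPOINT_KEY",
    "AZURE_AI_COHERE_CMDR_MODEL_NAME",
]


def missing_env_vars(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


# Demo with an explicit mapping so the result does not depend on the real environment.
example_env = {
    "AZURE_AI_COHERE_CMDR_ENDPOINT_URL": "https://example.invalid",
    "AZURE_AI_COHERE_CMDR_ENDPOINT_KEY": "placeholder-key",
    "AZURE_AI_COHERE_CMDR_MODEL_NAME": "",
}
missing = missing_env_vars(REQUIRED_VARS, example_env)
print(missing)
```

Calling `missing_env_vars(REQUIRED_VARS)` without the second argument checks the real environment instead of the example mapping.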

A notebook already runs an event loop, so let's apply `nest_asyncio` to allow the asynchronous calls used later to run inside it.

In [2]:
import nest_asyncio

nest_asyncio.apply()
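
For context, the patch is needed because plain `asyncio` refuses to start a second event loop while one is already running, which is the situation inside a notebook kernel. The standard-library sketch below reproduces that refusal outside a notebook; the function names are illustrative only.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # We are inside a running loop here, much like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from within the first
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return type(exc).__name__
    return "no error"


outcome = asyncio.run(outer())
print(outcome)
```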

## Configure instrumentation

We will use LlamaIndex to build a RAG system that answers different questions about the Paul Graham dataset. To identify opportunities for improvement, we use Phoenix for tracing and monitoring. The following section configures automatic instrumentation of LlamaIndex and connects it to a Phoenix instance running locally:

In [3]:
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanLimits, TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

In [4]:
session = px.launch_app()

INFO:phoenix.config:📋 Ensuring phoenix working directory: /Users/john0isaac/.phoenix


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Let's configure tracing:

In [5]:
endpoint = session.url + "v1/traces"
tracer_provider = TracerProvider(
    span_limits=SpanLimits(max_attributes=100_000)
)
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint))
)
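
The endpoint above is built by string concatenation, which works because the URL reported by Phoenix ends with a slash. If that were ever not the case, `urllib.parse.urljoin` is one way to build the same address more defensively. This is only a sketch: the helper name `traces_endpoint` is hypothetical, and the base URL is the local address shown in the output above.

```python
from urllib.parse import urljoin


def traces_endpoint(base_url: str) -> str:
    """Append the traces path to a base URL, with or without a trailing slash."""
    # A trailing slash makes urljoin append to the path instead of
    # replacing its last segment.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, "v1/traces")


with_slash = traces_endpoint("http://localhost:6006/")
without_slash = traces_endpoint("http://localhost:6006")
print(with_slash)
print(without_slash)
```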

Let's configure instrumentation:

In [6]:
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

## Building a RAG system with models from the catalog

Let's use the Cohere model ecosystem to implement our RAG solution. Cohere models are optimized for RAG patterns and work across a wide range of languages, especially when using Cohere Embed V3 Multilingual:

In [7]:
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.selectors import LLMSingleSelector

from llama_index.llms.azure_inference import AzureAICompletionsModel
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

Cohere Command R+:

In [8]:
llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_COHERE_CMDR_ENDPOINT_URL"],
    credential=os.environ["AZURE_AI_COHERE_CMDR_ENDPOINT_KEY"],
    model_name=os.environ["AZURE_AI_COHERE_CMDR_MODEL_NAME"]
)

In [9]:
# Test the model
response = llm.complete("The sky is a beautiful blue and")
print(response)

WARNI [openinference.instrumentation.llama_index._handler] Open span is missing for event.span_id='AzureAICompletionsModel.complete-30830167-55bf-4161-ad2f-37906b20648e', event.id_=UUID('022bec49-437b-4d77-8295-7458a555800b')
WARNI [openinference.instrumentation.llama_index._handler] Open span is missing for event.span_id='AzureAICompletionsModel.chat-26abf0e9-35ef-4c37-a98a-252891915713', event.id_=UUID('cbc4d1ff-fa22-442e-850e-15cd529a7cc0')
WARNI [openinference.instrumentation.llama_index._handler] Open span is missing for event.span_id='AzureAICompletionsModel.chat-26abf0e9-35ef-4c37-a98a-252891915713', event.id_=UUID('9b96fd83-e84e-4e03-b7c1-94b06295f869')
WARNI [openinference.instrumentation.llama_index._handler] Open span is missing for id_='AzureAICompletionsModel.chat-26abf0e9-35ef-4c37-a98a-252891915713'
WARNI [openinference.instrumentation.llama_index._handler] Open span is missing for event.span_id='AzureAICompletionsModel.complete-30830167-55bf-4161-ad2f-37906b20648e', eve

the sun is shining brightly, creating a picturesque scene. The gentle breeze carries a hint of freshness, inviting you to take a deep breath and embrace the beauty of the day. It's a perfect moment to appreciate nature's palette and the calming tranquility it offers.


Cohere Embed V3 - Multilingual:

In [10]:
embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_AI_COHERE_EMBED_ENDPOINT_URL"],
    credential=os.environ["AZURE_AI_COHERE_EMBED_ENDPOINT_KEY"],
    model_name=os.environ["AZURE_AI_COHERE_EMBED_MODEL_NAME"]
)

### Building the index

To demonstrate how to use different models, let's first create an index using Cohere models.

In [11]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

In this example, we will use the Paul Graham dataset.

In [12]:
documents = SimpleDirectoryReader("data/paul_graham").load_data()

Once the documents are loaded, we create nodes by chunking them according to the configured settings:

In [13]:
nodes = Settings.node_parser.get_nodes_from_documents(documents)
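
To give a feel for what chunking with an overlap does, here is a deliberately simplified sketch that splits text on whitespace-separated words. It is not the parser LlamaIndex uses; `chunk_words` and its parameters are made up for illustration.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split `text` into word windows of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)
```

Each chunk repeats the last word of the previous one, which is the point of the overlap: context that straddles a boundary is still visible from both sides.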

Let's initialize the storage context. By default it is in-memory, so we don't have to worry about persisting the nodes:

In [14]:
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

### Search tools

Our RAG system will handle two kinds of questions: those that require summarizing multiple sources of information, and those that a simpler retrieval strategy can answer.

#### Tree summarize

The summary index is a simple data structure where nodes are stored in a sequence. During index construction, the document texts are chunked up, converted to nodes, and stored in a list. During query time, the summary index iterates through the nodes with some optional filter parameters, and synthesizes an answer from all the nodes.
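
The `tree_summarize` response mode used with this index can be pictured as a repeated reduction: groups of chunks are summarized, the summaries are grouped and summarized again, and so on until one answer remains. The sketch below illustrates that idea with a stand-in `summarize` function instead of an LLM call. The names `tree_summarize` and `fanout` are assumptions made for this example, and the real implementation packs chunks by prompt size rather than by a fixed group count.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fanout: int = 2,
) -> str:
    """Reduce `chunks` to a single string by summarizing groups level by level."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    level = list(chunks)
    while len(level) > 1:
        # Summarize each group of `fanout` neighbours into one string.
        level = [
            summarize(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
    return level[0]


def fake_summarize(group: List[str]) -> str:
    # Stand-in for an LLM call: wrap the group so the tree shape stays visible.
    return "(" + "+".join(group) + ")"


result = tree_summarize(["a", "b", "c", "d", "e"], fake_summarize)
print(result)
```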

In [15]:
summary_index = SummaryIndex(nodes, storage_context=storage_context)

In [16]:
summarize_query_engine = summary_index.as_query_engine(
    llm=llm,
    response_mode="tree_summarize",
    use_async=True,
)

HttpResponseError: (no_model_name) No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.
Code: no_model_name
Message: No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.

#### Vector index

`VectorStoreIndex` only stores nodes in the document store if the vector store does not store text.

In [17]:
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

In [18]:
vector_query_engine = vector_index.as_query_engine()

HttpResponseError: (no_model_name) No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.
Code: no_model_name
Message: No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.
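
Conceptually, a vector query engine embeds the question, ranks the stored nodes by similarity to it, and passes the best matches to the LLM. The following sketch shows only the ranking step, using cosine similarity over toy two-dimensional vectors. The helper names and the node ids are hypothetical, and a real vector store may use a different similarity measure or an approximate search.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the `k` (node_id, score) pairs most similar to `query_vec`."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


node_vecs = {
    "node-a": [1.0, 0.0],
    "node-b": [0.0, 1.0],
    "node-c": [1.0, 1.0],
}
best = top_k([1.0, 0.0], node_vecs, k=2)
best_ids = [node_id for node_id, _ in best]
print(best_ids)
```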

#### Constructing the query engine with the search tools

In [19]:
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summarize_query_engine,
    description=(
        "Useful for summarization questions related to Paul Graham essay on"
        " What I Worked On."
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On."
    ),
)

NameError: name 'summarize_query_engine' is not defined

In [20]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)

NameError: name 'summary_tool' is not defined
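
The router's job is to read the tool descriptions and pick the one tool that should handle a query. The sketch below mimics that flow with a naive keyword-overlap selector standing in for the LLM-based selector, so it runs without any model. The dictionary layout, `keyword_selector`, and `route` are all assumptions made for this illustration; only the two descriptions are taken from the cells above.

```python
import re
from typing import Callable, Dict, List, Set, Tuple


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_selector(query: str, tools: List[Dict]) -> int:
    """Pick the tool whose description shares the most words with the query."""
    query_words = words(query)
    scores = [len(query_words & words(tool["description"])) for tool in tools]
    return scores.index(max(scores))


def route(
    query: str,
    tools: List[Dict],
    selector: Callable[[str, List[Dict]], int] = keyword_selector,
) -> Tuple[str, str]:
    chosen = tools[selector(query, tools)]
    return chosen["name"], chosen["run"](query)


tools = [
    {
        "name": "summary",
        "description": "Useful for summarization questions related to "
                       "Paul Graham essay on What I Worked On.",
        "run": lambda q: "summary engine handled: " + q,
    },
    {
        "name": "vector",
        "description": "Useful for retrieving specific context from "
                       "Paul Graham essay on What I Worked On.",
        "run": lambda q: "vector engine handled: " + q,
    },
]

summary_choice, _ = route("summarization of the essay", tools)
vector_choice, _ = route("retrieving specific context", tools)
print(summary_choice, vector_choice)
```

Because the selector only sees the descriptions, writing them so that they clearly separate the two use cases matters as much as the engines behind them.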

Let's see how this works:

In [21]:
response = query_engine.query("What is the summary of the document?")
print(str(response))

NameError: name 'query_engine' is not defined

In [22]:
response = query_engine.query("What did Paul Graham do after RISD?")
print(str(response))

NameError: name 'query_engine' is not defined

## Using a smaller model for simpler tasks

Using exactly the same class, `AzureAICompletionsModel`, we can instantiate another model, in this case a Phi-3-mini-4K model.

In [23]:
slm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_PHI3_MINI_ENDPOINT_URL"],
    credential=os.environ["AZURE_AI_PHI3_MINI_ENDPOINT_KEY"],
    model_name=os.environ["AZURE_AI_PHI3_MINI_MODEL_NAME"]
)

Now, let's configure the `RouterQueryEngine` to use Phi-3 for the routing task instead of the larger model:

In [24]:
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(llm=slm),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)

NameError: name 'summary_tool' is not defined

In [25]:
response = query_engine.query("What did Paul Graham do after RISD?")
print(str(response))

NameError: name 'query_engine' is not defined

## Build an evaluation dataset

Let's build an evaluation dataset to see the effect of changing the model. We will use another LLM to generate examples, in this case Mistral Large, which is a good model for RAG:

In [26]:
generator_llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_AI_MISTRAL_ENDPOINT_URL"],
    credential=os.environ["AZURE_AI_MISTRAL_ENDPOINT_KEY"],
    model_name=os.environ["AZURE_AI_MISTRAL_MODEL_NAME"],
    temperature=0,
)

In [27]:
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

Let's create the generator:

In [28]:
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=generator_llm,
    num_questions_per_chunk=2,
)

In [29]:
rag_dataset = dataset_generator.generate_questions_from_nodes()

HttpResponseError: (no_model_name) No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.
Code: no_model_name
Message: No model specified in request. Please provide a model name in the request body or as a x-ms-model-mesh-model-name header.

Let's see an example:

In [31]:
print("Query:", rag_dataset[1].query)
print("Context:", rag_dataset[1].reference_contexts[0][:50], "...")

NameError: name 'rag_dataset' is not defined

Let's save the examples:

In [30]:
rag_dataset.save_json("evals/pg_rag_dataset.json")

NameError: name 'rag_dataset' is not defined
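
The `save_json` call above writes into an `evals/` folder. Assuming the method opens the target path directly, that folder needs to exist before the call. The sketch below shows one way to make sure it does, using a temporary directory and a placeholder payload so it can run anywhere; `ensure_parent_dir` is a hypothetical helper, not part of LlamaIndex.

```python
import json
import tempfile
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent folder of `path` if needed and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as workdir:
    target = ensure_parent_dir(str(Path(workdir) / "evals" / "pg_rag_dataset.json"))
    # Placeholder payload; the real file is written by rag_dataset.save_json.
    target.write_text(json.dumps({"placeholder": True}))
    round_trip = json.loads(target.read_text())
    parent_exists = target.parent.is_dir()

print(round_trip, parent_exists)
```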

We can reload them as follows:

In [32]:
rag_dataset = LabelledRagDataset.from_json("evals/pg_rag_dataset.json")

### Use evaluations for retrieval

__FaithfulnessEvaluator__

`FaithfulnessEvaluator` is used to measure if the response from a query engine matches any source nodes. This is useful for measuring if the response was hallucinated.

__RelevancyEvaluator__

`RelevancyEvaluator` is used to measure if the response and the source nodes match the query. This is useful for measuring if the query was actually answered by the response.
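
Both evaluators follow an LLM-as-judge pattern: a prompt is built from the response and the retrieved context, a judge model answers, and the answer is mapped to a pass or fail flag. The sketch below shows that shape for a faithfulness check with a stand-in judge function in place of a model call. The prompt wording, the `judge_faithfulness` helper, and the result layout are assumptions for illustration and do not reproduce the library's prompts.

```python
from typing import Callable, Dict, List


def judge_faithfulness(
    response: str,
    contexts: List[str],
    ask_judge: Callable[[str], str],
) -> Dict[str, object]:
    """Ask a judge whether `response` is supported by `contexts`."""
    prompt = (
        "Is the statement supported by the context? Answer YES or NO.\n"
        "Statement: " + response + "\n"
        "Context:\n" + "\n---\n".join(contexts)
    )
    verdict = ask_judge(prompt)
    return {
        "passing": verdict.strip().lower().startswith("yes"),
        "feedback": verdict,
    }


def fake_judge(prompt: str) -> str:
    # Stand-in for a model call: approve only if the context mentions Viaweb.
    context_part = prompt.split("Context:\n", 1)[1]
    return "YES" if "Viaweb" in context_part else "NO"


supported = judge_faithfulness(
    "He started Viaweb.", ["In 1995 he started Viaweb."], fake_judge
)
unsupported = judge_faithfulness(
    "He started Viaweb.", ["He painted still lifes."], fake_judge
)
print(supported["passing"], unsupported["passing"])
```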

In [33]:
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator

In this case, let's use a more powerful model, GPT-4, as the judge:

In [34]:
gpt4judge = AzureOpenAI(
    deployment="gpt-4",
    azure_endpoint=os.environ["AZURE_OPENAI_GPT_4_ENDPOINT_URL"],
    api_key=os.environ["AZURE_OPENAI_GPT_4_ENDPOINT_KEY"],
    api_version="2023-07-01-preview"
)

Configure the evaluators with this LLM:

In [35]:
relevancy_evaluator = RelevancyEvaluator(llm=gpt4judge)
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4judge)

Let's build a list that contains only the `query` property of each sample.

In [36]:
batch_eval_queries = [sample.query for sample in rag_dataset[1::2]]

> `rag_dataset[1::2]` retrieves only the odd indexes, since the dataset contains "question 1:" as part of the generation. Fixing this properly probably requires changing the generation template.
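
An alternative to slicing by position is to clean the generated queries by content. The sketch below strips a leading "question N:" label and drops entries that are empty afterwards. It assumes the stray entries look like the label quoted in the note above; the sample strings and the `clean_queries` helper are invented for illustration.

```python
import re
from typing import List

# Matches labels such as "Question 1:" or "question 2." at the start of a string.
_LABEL = re.compile(r"^\s*question\s*\d+\s*[:.)-]?\s*", re.IGNORECASE)


def clean_queries(raw_queries: List[str]) -> List[str]:
    """Remove leading question labels and drop entries left empty."""
    cleaned = []
    for query in raw_queries:
        text = _LABEL.sub("", query).strip()
        if text:
            cleaned.append(text)
    return cleaned


sample = [
    "Question 1:",
    "What did the author work on?",
    "question 2: Why did he leave?",
]
cleaned = clean_queries(sample)
print(cleaned)
```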

A `BatchEvalRunner` will allow us to run evaluations over the whole dataset:

In [37]:
runner = BatchEvalRunner(
    {
        "faithfulness": faithfulness_evaluator,
        "relevancy": relevancy_evaluator
    },
    workers=8,
)
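
The `workers` argument caps how many evaluations run at the same time. One common way to get that behaviour in `asyncio` is a semaphore, sketched below with a stand-in coroutine instead of a real evaluator. Whether the library uses exactly this mechanism is an assumption; `run_batch` and `fake_evaluate` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List

stats = {"active": 0, "peak": 0}


async def fake_evaluate(query: str) -> int:
    # Stand-in for a judge call; tracks how many calls overlap in time.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return len(query)


async def run_batch(
    queries: List[str],
    evaluate: Callable[[str], Awaitable[int]],
    workers: int = 8,
) -> List[int]:
    """Evaluate all queries with at most `workers` calls in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(query: str) -> int:
        async with semaphore:
            return await evaluate(query)

    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*(guarded(q) for q in queries))


results = asyncio.run(run_batch(["a", "bb", "ccc", "dddd"], fake_evaluate, workers=2))
print(results, stats["peak"])
```

A lower `workers` value trades throughput for fewer simultaneous requests, which can help when the judge endpoint enforces rate limits.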

Compute the evaluations:

In [38]:
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)

NameError: name 'query_engine' is not defined

Let's write the evaluation results:

In [39]:
import json

eval_results_dict = {}
eval_results_dict["faithfulness"] = [
    dict(result) for result in eval_results["faithfulness"]]
eval_results_dict["relevancy"] = [
    dict(result) for result in eval_results["relevancy"]]

with open("evals/pg_rag_eval_results_phi3.json", "w") as f:
    json.dump(eval_results_dict, f)

NameError: name 'eval_results' is not defined

Compute the scores:

In [40]:
faithfulness_score = sum(
    result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
relevancy_score = sum(
    result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

NameError: name 'eval_results' is not defined
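
Each score above is a pass rate: the share of results whose `passing` flag is true. The sketch below wraps that calculation in a small helper that also handles an empty result list, which would otherwise raise a division error. `pass_rate` is a hypothetical helper, and `SimpleNamespace` objects stand in for real evaluation results.

```python
from types import SimpleNamespace
from typing import Optional, Sequence


def pass_rate(results: Sequence) -> Optional[float]:
    """Return the fraction of results with a truthy `passing`, or None if empty."""
    if not results:
        return None
    return sum(bool(result.passing) for result in results) / len(results)


fake_results = [
    SimpleNamespace(passing=True),
    SimpleNamespace(passing=False),
    SimpleNamespace(passing=True),
    SimpleNamespace(passing=True),
]
score = pass_rate(fake_results)
print(score)  # 0.75
```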

Let's see the results:

In [41]:
print(f"Faithfulness Score: {faithfulness_score}")
print(f"Relevancy Score: {relevancy_score}")

NameError: name 'faithfulness_score' is not defined