# Evaluating different models from the GitHub Models catalog with LlamaIndex and Phoenix

This notebook demonstrates how to use multiple models from GitHub Models, picking the right model for each job. We will use LlamaIndex to build a RAG system and select models from different providers so that each task is handled by the model best suited to it.

## Preparing

In this example, we will use multiple models deployed in this project, including Phi-3, Cohere Command R+, Cohere Embed V3, Mistral Large, and OpenAI GPT-4o. Endpoint URLs and keys are stored in the `.env` file. Please update it accordingly:

In [None]:
import os
from dotenv import load_dotenv
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# load_dotenv(".env", override=True)
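
Every model cell below reads `GITHUB_TOKEN` from the environment, so it can help to check for it before going further. The following is a minimal sketch using only the standard library; the helper name `missing_settings` is hypothetical and not part of any library used in this notebook.

```python
import os
from typing import Iterable, List, Mapping


def missing_settings(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# GITHUB_TOKEN is the only variable the cells in this notebook read.
missing = missing_settings(["GITHUB_TOKEN"])
if missing:
    print("Missing settings:", ", ".join(missing))
```

If the token lives in `.env`, the check only passes once `load_dotenv` has actually been called.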

Let's configure asynchronous operations on the notebook.

In [None]:

import nest_asyncio

nest_asyncio.apply()

## Configure instrumentation

We will use LlamaIndex to build a RAG system that answers questions about a document dataset. To identify opportunities for improvement, we use Phoenix for tracing and monitoring. The following section configures automatic instrumentation of LlamaIndex and connects it to a Phoenix instance running locally:

In [None]:
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanLimits, TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

In [None]:
session = px.launch_app()

Let's configure tracing:

In [None]:
endpoint = session.url + "v1/traces"
tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
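
The cell above builds the traces endpoint by concatenating `session.url` and `"v1/traces"`, which produces a valid URL only when the base URL ends with a slash. A small sketch of a helper that tolerates both forms is shown here; `traces_endpoint` is a hypothetical name, and the URL in the usage line is only an example value.

```python
def traces_endpoint(base_url: str) -> str:
    """Build the OTLP/HTTP traces URL from a collector base URL."""
    # Drop any trailing slashes first so exactly one separator is added.
    return base_url.rstrip("/") + "/v1/traces"


print(traces_endpoint("http://localhost:6006/"))
```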

Let's configure instrumentation:

In [None]:
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

## Building a RAG system with models from the catalog

Let's use the Cohere model ecosystem to implement our RAG solution. Cohere models are optimized for RAG patterns and work across a wide range of languages, especially when using Cohere Embed V3 Multilingual:

In [None]:
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.selectors import LLMSingleSelector

from llama_index.llms.azure_inference import AzureAICompletionsModel
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

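The chat model. The cell below is configured with `gpt-4o-mini`; change `model_name` to use Cohere Command R+ instead: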

In [None]:
llm = AzureAICompletionsModel(
                endpoint="https://models.inference.ai.azure.com",
                credential=os.environ["GITHUB_TOKEN"],
                temperature=0.1,
                max_tokens=1024,
                streaming=True,
                model_name="gpt-4o-mini",
)
llm._model_name = "gpt-4o-mini"

In [None]:
# Test the model
response = llm.complete("The sky is a beautiful blue and")
print(response)

Cohere Embed V3 (the cell below uses the English variant, `cohere-embed-v3-english`):

In [None]:
embed_model = AzureAIEmbeddingsModel(
                endpoint="https://models.inference.ai.azure.com",
                credential=os.environ["GITHUB_TOKEN"],
                model_name="cohere-embed-v3-english",
)

### Building the index

To demonstrate how to use different models, let's first create an index using the models configured above.

In [None]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

In this example, we will use the kinfey dataset.

In [None]:
documents = SimpleDirectoryReader("data/kinfey").load_data()

Once we have the documents, we create nodes by chunking them according to the settings configured above:

In [None]:
nodes = Settings.node_parser.get_nodes_from_documents(documents)
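
To make the chunking step concrete, here is a toy sketch of fixed-size chunking with overlap. It splits on whitespace and counts words, which is a simplification: the LlamaIndex node parser works on tokens and tries to respect sentence boundaries. The function `chunk_words` is hypothetical and only illustrates the idea.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
```

A larger chunk size gives each node more context but makes retrieval coarser; the overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.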

Let's initialize the storage context. By default it is in-memory, so we don't have to worry about persisting the nodes:

In [None]:
nodes

In [None]:
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

### Search tools

Our RAG system will handle two kinds of questions: those that require summarizing multiple sources of information, and those that a simpler retrieval strategy can answer.

#### Tree summarize

The summary index is a simple data structure where nodes are stored in a sequence. During index construction, the document texts are chunked up, converted to nodes, and stored in a list. During query time, the summary index iterates through the nodes with some optional filter parameters, and synthesizes an answer from all the nodes.

In [None]:
storage_context

In [None]:
summary_index = SummaryIndex(nodes, storage_context=storage_context)

In [None]:
summarize_query_engine = summary_index.as_query_engine(
    llm=llm,
    response_mode="tree_summarize",
    use_async=True,
)
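
The `tree_summarize` response mode combines partial answers bottom-up until a single answer remains. The sketch below illustrates that reduction with a stand-in for the LLM call. It assumes a fixed number of texts per call (`fan_in`), whereas LlamaIndex packs as many chunks as fit into the model's context window; `tree_summarize` and `toy_summarize` here are hypothetical helpers, not library functions.

```python
from typing import Callable, List


def tree_summarize(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize groups of `fan_in` texts repeatedly until one text remains."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        # Each pass replaces every group of texts with one summary.
        level = [
            summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)
        ]
    return level[0]


def toy_summarize(texts: List[str]) -> str:
    """Stand-in for an LLM call: parenthesize the inputs to show the tree shape."""
    return "(" + " + ".join(texts) + ")"


print(tree_summarize(["n1", "n2", "n3", "n4"], toy_summarize))
# prints: ((n1 + n2) + (n3 + n4))
```

Because every node is visited, the number of model calls grows with the size of the dataset, which is why this mode suits summarization questions rather than fact lookups.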

#### Vector index

`VectorStoreIndex` stores nodes in the document store only if the vector store itself does not store text.

In [None]:
nodes

In [None]:
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

In [None]:
vector_query_engine = vector_index.as_query_engine()

In [None]:
vector_index

#### Constructing the query engine with the search tools

In [None]:
from llama_index.core.tools import QueryEngineTool

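# The selector routes on these descriptions, so make them distinct.
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summarize_query_engine,
    description="Useful for summarization questions about the documents.",
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific facts from the documents.",
)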

In [None]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)

Let's see how this works:

In [None]:
response = query_engine.query("Kinfey's job")
print(str(response))

In [None]:
response = query_engine.query("what Kinfey worked on")
print(str(response))
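
Conceptually, the router asks a selector which tool fits the query and then forwards the query to that tool. The sketch below mimics this flow with plain Python. The `Tool` class, `route`, and `keyword_selector` are hypothetical stand-ins: in the notebook the selection is made by an LLM reading the tool descriptions, not by keyword matching.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def route(
    query: str, tools: List[Tool], select: Callable[[str, List[Tool]], int]
) -> str:
    """Ask `select` for the index of the best tool, then run that tool."""
    index = select(query, tools)
    if not 0 <= index < len(tools):
        raise IndexError(f"selector returned invalid tool index {index}")
    return tools[index].run(query)


def keyword_selector(query: str, tools: List[Tool]) -> int:
    """Stand-in for the LLM selector: choose by simple keyword matching."""
    keywords = ("summary", "summarize", "overview")
    wants_summary = any(word in query.lower() for word in keywords)
    target = "summary" if wants_summary else "vector"
    return next(i for i, tool in enumerate(tools) if tool.name == target)


tools = [
    Tool("summary", "Useful for summarization questions.", lambda q: f"[summary] {q}"),
    Tool("vector", "Useful for retrieving specific facts.", lambda q: f"[vector] {q}"),
]
print(route("Give me an overview of the documents", tools, keyword_selector))
print(route("Where does the author work?", tools, keyword_selector))
```

Since the selection step is a small classification task, it is a natural candidate for a cheaper model.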

## Using a smaller model for simpler tasks

Using exactly the same class, `AzureAICompletionsModel`, we can instantiate another model, in this case Phi-3-mini-4K.

In [None]:

slm = AzureAICompletionsModel(
                endpoint="https://models.inference.ai.azure.com",
                credential=os.environ["GITHUB_TOKEN"],
                temperature=0.1,
                max_tokens=1024,
                streaming=True,
                model_name="Phi-3-mini-4k-instruct",
)
slm._model_name = "Phi-3-mini-4k-instruct"

Now, let's configure the `RouterQueryEngine` to use Phi-3 for the routing task instead of the larger model:

In [None]:
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(llm=slm),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)

In [None]:
response = query_engine.query("What Kinfey is good at")
print(str(response))

## Build an evaluation dataset

Let's build an evaluation dataset to see the effect of changing the model. We will use another LLM to generate the examples, in this case Mistral Large, which is a good model for RAG:

In [None]:
generator_llm = AzureAICompletionsModel(
                endpoint="https://models.inference.ai.azure.com",
                credential=os.environ["GITHUB_TOKEN"],
                temperature=0,
                model_name="Mistral-large",
)
generator_llm._model_name = "Mistral-large"

In [None]:
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

Let's create the generator:

In [None]:
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=generator_llm,
    num_questions_per_chunk=2,
)

In [None]:
rag_dataset = dataset_generator.generate_questions_from_nodes()

Let's see an example:

In [None]:
print("Query:", rag_dataset[1].query)
print("Context:", rag_dataset[1].reference_contexts[0][:50], "...")

Let's save the examples:

In [None]:
rag_dataset.save_json("evals/pg_rag_dataset.json")

We can reload them as follows:

In [None]:
# rag_dataset = LabelledRagDataset.from_json("evals/pg_rag_dataset.json")

### Use evaluations for retrieval

__FaithfulnessEvaluator__

`FaithfulnessEvaluator` measures whether the response from a query engine matches any of the source nodes. This is useful for detecting hallucinated responses.

__RelevancyEvaluator__

`RelevancyEvaluator` is used to measure if the response and the source nodes match the query. This is useful for measuring if the query was actually answered by the response.

In [None]:
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator

In this case, let's use a more powerful model as the judge, GPT-4o:

In [None]:
gpt4judge = AzureOpenAI(
    deployment="gpt-4o",
    azure_endpoint="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

Configure the evaluators with this LLM:

In [None]:
relevancy_evaluator = RelevancyEvaluator(llm=gpt4judge)
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4judge)

Let's create a dataset of the `query` property only.

In [None]:
batch_eval_queries = [sample.query for sample in rag_dataset[1::2]]

> `rag_dataset[1::2]` retrieves only the odd indexes, because the generated dataset contains label entries such as "question 1:" at the even indexes. Changing the generation template would probably remove the need for this workaround.
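
The slice `[1::2]` starts at index 1 and takes every second element. The sketch below shows its effect on a made-up list that alternates labels and questions, next to a content-based filter that does not depend on positions. The sample strings and the `LABEL` pattern are assumptions for illustration, not actual generator output.

```python
import re

# Hypothetical generator output: labels at even indexes, questions at odd ones.
samples = [
    "question 1:",
    "What does the author work on?",
    "question 2:",
    "Which tools are mentioned?",
]

# Positional approach, as used in the notebook.
queries = samples[1::2]

# Content-based alternative: drop entries that consist only of a label.
LABEL = re.compile(r"^\s*question\s+\d+\s*:?\s*$", re.IGNORECASE)
filtered = [s for s in samples if not LABEL.match(s)]

print(queries)
print(filtered)
```

The content-based filter keeps working if the generator ever emits two questions in a row, whereas the slice silently drops every other entry.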

A `BatchEvalRunner` allows us to run evaluations over the whole dataset:

In [None]:
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

Compute the evaluations:

In [None]:
eval_results = await runner.aevaluate_queries(query_engine, queries=batch_eval_queries)

Let's write the evaluation results:

In [None]:
import json

eval_results_dict = {}
eval_results_dict["faithfulness"] = [
    dict(result) for result in eval_results["faithfulness"]
]
eval_results_dict["relevancy"] = [dict(result) for result in eval_results["relevancy"]]

with open("evals/pg_rag_eval_results_phi3.json", "w") as f:
    json.dump(eval_results_dict, f)

Compute the scores:

In [None]:
faithfulness_score = sum(
    result.passing for result in eval_results["faithfulness"]
) / len(eval_results["faithfulness"])
relevancy_score = sum(result.passing for result in eval_results["relevancy"]) / len(
    eval_results["relevancy"]
)

Let's see the results:

In [None]:
print(f"Faithfulness Score: {faithfulness_score}")
print(f"Relevancy Score: {relevancy_score}")
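
Each score above is a pass rate: the fraction of results whose `passing` flag is true. The sketch below reproduces that arithmetic on made-up data. `ToyResult` is a hypothetical stand-in that models only the `passing` attribute, and the helper treats a missing flag as a failure, which is an assumption rather than library behavior.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ToyResult:
    """Stand-in for an evaluation result; only `passing` is modelled."""
    passing: Optional[bool]


def pass_rate(results: Iterable[ToyResult]) -> float:
    """Return the fraction of results that passed, or 0.0 for no results."""
    items = list(results)
    if not items:
        return 0.0
    # bool(None) is False, so a missing flag counts as a failure here.
    return sum(bool(r.passing) for r in items) / len(items)


toy_results = {
    "faithfulness": [ToyResult(True), ToyResult(True), ToyResult(False), ToyResult(True)],
    "relevancy": [ToyResult(True), ToyResult(False)],
}
scores = {name: pass_rate(results) for name, results in toy_results.items()}
print(scores)
# prints: {'faithfulness': 0.75, 'relevancy': 0.5}
```

Guarding against an empty list avoids a division by zero when no queries were evaluated.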