![phoenix_image](images/arize_phoenix.jpg)

# RAG Evaluations Demonstration: Arize Phoenix

"Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues, and export data to improve."

Compared with RAGAS and DeepEval, Phoenix offers a self-hosted UI for LLM tracing, so it takes a little more effort to get started.

#### Imports

In [1]:
from copy import copy
import pandas as pd
pd.set_option('display.max_colwidth', 100)

#### Loading test dataset

To showcase the Phoenix UI, we combine our test dataset with sample trace data to mock up what a real Phoenix session would look like.

In [2]:
# Test trace data to analyze
from urllib.request import urlopen
from phoenix.trace.trace_dataset import TraceDataset
from phoenix.trace.utils import json_lines_to_df

# Replace with the URL to your trace data
traces_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/trace.jsonl"
with urlopen(traces_url) as response:
    lines = [line.decode("utf-8") for line in response.readlines()]
json_df = json_lines_to_df(lines)
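
For readers unfamiliar with the trace format, each line of `trace.jsonl` is one JSON-encoded span. The sketch below shows, with plain pandas, roughly how such lines can be flattened into dotted column names like `attributes.input.value`. It is an illustration only: the two sample records are made up, and Phoenix's `json_lines_to_df` may do additional processing.

```python
import json
import pandas as pd

# Two made-up span records standing in for lines of a trace.jsonl file
sample_lines = [
    '{"name": "query", "span_kind": "CHAIN", "attributes": {"input": {"value": "What is RAG?"}}}',
    '{"name": "retrieve", "span_kind": "RETRIEVER", "attributes": {"input": {"value": "What is RAG?"}}}',
]

# Parse each line, then flatten nested objects into dot-separated columns
records = [json.loads(line) for line in sample_lines]
flat_df = pd.json_normalize(records)

print(flat_df.columns.tolist())
```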

Demo data

In [3]:
# Download amnesty_qa dataset
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2", trust_remote_code=True)
eval_data = amnesty_qa['eval'][2:6]

Repo card metadata block was not found. Setting CardData to empty.


#### Data Wrangling

Phoenix uses Pandas dataframes for handling most of its tracing and evaluations. Because we're combining two datasets for demonstration purposes, we have to do some ugly data wrangling.

*This wouldn't be necessary for actual use cases*

In [4]:
# inject sample data into sample trace (much easier than generating a new one)
for ii in range(4):
    json_df.loc[0+(5*ii), ('attributes.input.value')] = eval_data['question'][ii]
    json_df.loc[0+(5*ii), ('attributes.output.value')] = eval_data['answer'][ii]
    json_df.loc[1+(5*ii),('attributes.input.value')] = eval_data['question'][ii]
    json_df.loc[1+(5*ii),('attributes.output.value')] = eval_data['answer'][ii]
    json_df.loc[2+(5*ii),('attributes.llm.prompt_template.variables')]['context_str'] = ' '.join(eval_data['contexts'][ii])
    json_df.loc[2+(5*ii),('attributes.llm.prompt_template.variables')]['query_str'] = eval_data['question'][ii]
    json_df.loc[2+(5*ii),('attributes.llm.input_messages')][1]['message.content'] = ' '.join(eval_data['contexts'][ii])
    json_df.loc[2+(5*ii),('attributes.llm.output_messages')][0]['message.content'] = eval_data['answer'][ii]
    json_df.loc[2+(5*ii),('attributes.output.value')] = eval_data['answer'][ii]
    json_df.loc[2+(5*ii),('attributes.input.value')] = eval_data['question'][ii]
    json_df.loc[3+(5*ii),('attributes.input.value')] = eval_data['question'][ii]
    json_df.loc[3+(5*ii),('attributes.retrieval.documents')][0]['document.content'] = eval_data['contexts'][ii][0]
    json_df.loc[3+(5*ii),('attributes.retrieval.documents')][1]['document.content'] = eval_data['contexts'][ii][1]
    json_df.loc[3+(5*ii),('attributes.retrieval.documents')].append(copy(json_df.loc[3+(5*ii),('attributes.retrieval.documents')][1]))
    json_df.loc[3+(5*ii),('attributes.retrieval.documents')][2]['document.content'] = eval_data['contexts'][ii][2]
    json_df.loc[3+(5*ii),('attributes.retrieval.documents')][2]['document.id'] = 'cb50530e-166e-4045-b76d-123456789abc'
    json_df.loc[4+(5*ii),('attributes.embedding.embeddings')][0]['embedding.text'] = eval_data['question'][ii]

trace_ds = TraceDataset(json_df)

trace_ds.dataframe.head()

Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,conversation,context.trace_id,...,attributes.llm.invocation_parameters,attributes.llm.output_messages,attributes.llm.token_count.prompt,attributes.llm.token_count.completion,attributes.llm.token_count.total,attributes.llm.prompt_template.template,attributes.llm.prompt_template.variables,attributes.retrieval.documents,attributes.embedding.model_name,attributes.embedding.embeddings
0,query,CHAIN,,2023-12-11 17:57:17.891021+00:00,2023-12-11 17:57:20.075141+00:00,OK,,[],,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,,
1,synthesize,CHAIN,bce5b9ae-4587-4ead-9ccc-de3fe29257bc,2023-12-11 17:57:18.973513+00:00,2023-12-11 17:57:20.075056+00:00,OK,,[],,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,,
2,llm,LLM,3d59ca9b-5d68-4773-856f-5243cba51647,2023-12-11 17:57:18.985506+00:00,2023-12-11 17:57:20.074314+00:00,OK,,[],,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,"{""model"": ""gpt-3.5-turbo"", ""temperature"": 0.0, ""max_tokens"": null}","[{'message.role': 'assistant', 'message.content': 'According to the Carbon Majors database, the ...",385.0,21.0,406.0,system: You are an expert Q&A system that is trusted around the world.\nAlways answer the query ...,{'context_str': 'The issue of greenhouse gas emissions has become a major concern for environmen...,,,
3,retrieve,RETRIEVER,bce5b9ae-4587-4ead-9ccc-de3fe29257bc,2023-12-11 17:57:17.891487+00:00,2023-12-11 17:57:18.973316+00:00,OK,,[],,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,"[{'document.id': 'b083ee22-965f-4086-856f-4f667a0b107d', 'document.score': 0.8747966162902147, '...",,
4,embedding,EMBEDDING,eef727de-9f27-4b41-aa79-acaccdf92383,2023-12-11 17:57:17.891757+00:00,2023-12-11 17:57:18.390893+00:00,OK,,[],,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,text-embedding-ada-002,[{'embedding.text': 'Which private companies in the Americas are the largest GHG emitters accord...
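
The indexing in the wrangling cell assumes the sample trace stores five spans per query in a fixed order (query, synthesize, llm, retrieve, embedding), so row `k + 5*ii` is span `k` of the `ii`-th query. A toy sketch of that pattern, using a made-up dataframe rather than the real trace:

```python
import pandas as pd

SPANS_PER_TRACE = 5  # assumption: query, synthesize, llm, retrieve, embedding
span_names = ["query", "synthesize", "llm", "retrieve", "embedding"]

# Made-up stand-in for the trace dataframe: two traces, five spans each
toy_df = pd.DataFrame({
    "name": span_names * 2,
    "attributes.input.value": ["old"] * 10,
})
new_questions = ["question A", "question B"]

for ii, question in enumerate(new_questions):
    base = SPANS_PER_TRACE * ii
    # Offsets 0-3 (query, synthesize, llm, retrieve) carry the question as input
    for offset in range(4):
        toy_df.at[base + offset, "attributes.input.value"] = question

print(toy_df)
```

Naming the stride and looping over offsets keeps the row arithmetic in one place, which makes it easier to adapt if a trace has a different number of spans.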


#### Using the Phoenix UI

We can pre-load the UI with our trace data.

In [5]:
import phoenix as px

session = px.launch_app(trace=trace_ds)

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


We can then pull that exact data back out if we wish.

In [6]:
spans_df = px.active_session().get_spans_dataframe()
spans_df.head()

  df_attributes = pd.DataFrame.from_records(
ERROR [phoenix.db.bulk_inserter] Failed to insert evaluation: Cannot insert a document evaluation for a non-existent document position: evaluation_name='Relevance', span_id='eef727de-9f27-4b41-aa79-acaccdf92383', document_position=0
Traceback (most recent call last):
  File "/home/johnalling-desktop/education/RAG_eval_tests/.venv/lib/python3.11/site-packages/phoenix/db/bulk_inserter.py", line 185, in _insert_evaluations
    result = await insert_evaluation(session, evaluation)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnalling-desktop/education/RAG_eval_tests/.venv/lib/python3.11/site-packages/phoenix/db/insertion/evaluation.py", line 56, in insert_evaluation
    return await _insert_document_evaluation(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnalling-desktop/education/RAG_eval_tests/.venv/lib/python3.11/site-packages/phoenix/db/insertion/evaluation.py", line 180, in _insert_document_eva

context.span_id,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,attributes.llm.model_name,attributes.llm.invocation_parameters,attributes.llm.prompt_template.template,attributes.llm.prompt_template.variables,attributes.llm.input_messages,attributes.llm.token_count.prompt,attributes.llm.token_count.total,attributes.retrieval.documents,attributes.embedding.model_name,attributes.embedding.embeddings
bce5b9ae-4587-4ead-9ccc-de3fe29257bc,query,CHAIN,,2023-12-11 17:57:17.891021+00:00,2023-12-11 17:57:20.075141+00:00,OK,,[],bce5b9ae-4587-4ead-9ccc-de3fe29257bc,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,,
3d59ca9b-5d68-4773-856f-5243cba51647,synthesize,CHAIN,bce5b9ae-4587-4ead-9ccc-de3fe29257bc,2023-12-11 17:57:18.973513+00:00,2023-12-11 17:57:20.075056+00:00,OK,,[],3d59ca9b-5d68-4773-856f-5243cba51647,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,,
3505e349-9183-48fc-adb1-9394d00d5d21,llm,LLM,3d59ca9b-5d68-4773-856f-5243cba51647,2023-12-11 17:57:18.985506+00:00,2023-12-11 17:57:20.074314+00:00,OK,,[],3505e349-9183-48fc-adb1-9394d00d5d21,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,gpt-3.5-turbo,"{""model"": ""gpt-3.5-turbo"", ""temperature"": 0.0, ""max_tokens"": null}",system: You are an expert Q&A system that is trusted around the world.\nAlways answer the query ...,{'context_str': 'The issue of greenhouse gas emissions has become a major concern for environmen...,[{'message.content': 'You are an expert Q&A system that is trusted around the world. Always answ...,385.0,406.0,,,
eef727de-9f27-4b41-aa79-acaccdf92383,retrieve,RETRIEVER,bce5b9ae-4587-4ead-9ccc-de3fe29257bc,2023-12-11 17:57:17.891487+00:00,2023-12-11 17:57:18.973316+00:00,OK,,[],eef727de-9f27-4b41-aa79-acaccdf92383,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,[{'document.content': 'The issue of greenhouse gas emissions has become a major concern for envi...,,
efc452b6-eeda-40de-b1ab-d39cbcc43304,embedding,EMBEDDING,eef727de-9f27-4b41-aa79-acaccdf92383,2023-12-11 17:57:17.891757+00:00,2023-12-11 17:57:18.390893+00:00,OK,,[],efc452b6-eeda-40de-b1ab-d39cbcc43304,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,...,,,,,,,,,text-embedding-ada-002,[{'embedding.text': 'Which private companies in the Americas are the largest GHG emitters accord...


#### Local LLM-as-a-judge (evaluator llm)

The evaluator LLM is set up in much the same way as in DeepEval, with a few differences.

In [7]:
import nest_asyncio
nest_asyncio.apply()

from llama3_phoenix import Llama3_8B
from transformers import AutoModelForCausalLM, AutoTokenizer

model_str = "solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ"

model = AutoModelForCausalLM.from_pretrained(model_str, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_str)  # device_map only applies to the model, not the tokenizer

llama_3 = Llama3_8B(model=model, tokenizer=tokenizer)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
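
The `Llama3_8B` class lives in a local `llama3_phoenix` module that is not shown in this notebook. As a rough illustration of the interface it exposes here (a callable that takes a prompt string and returns text), below is a dependency-free stand-in. `LocalJudge` and `fake_backend` are hypothetical names, and the real wrapper presumably also implements whatever Phoenix's model base class requires, which this sketch does not attempt.

```python
from typing import Callable


class LocalJudge:
    """Hypothetical stand-in for an evaluator wrapper: prompt string in, text out."""

    def __init__(self, generate_fn: Callable[[str], str], max_chars: int = 1000):
        self.generate_fn = generate_fn
        self.max_chars = max_chars

    def __call__(self, prompt: str) -> str:
        # Delegate to the backend, then trim whitespace and overly long output
        return self.generate_fn(prompt).strip()[: self.max_chars]


# A fake backend so the sketch runs without a GPU or a model download
def fake_backend(prompt: str) -> str:
    return f"  (stub answer to: {prompt})  "


judge = LocalJudge(fake_backend)
print(judge("Why is the sky blue?"))
```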


In [10]:
# test basic prompting of the local llm
gen_output = llama_3("Why is the sky blue?")

print(gen_output)

The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. He discovered that shorter wavelengths of light (like blue and violet) are scattered more than longer wavelengths (like red and orange) by the tiny molecules of gases in the atmosphere, such as nitrogen and oxygen.

Here's what happens:

1. When sunlight enters Earth's atmosphere, it encounters these tiny molecules.
2. The shorter wavelengths (blue and violet) are scattered in all directions by the molecules, because they are more easily deflected by the tiny particles.
3. The longer wavelengths (red and orange) continue to travel in a straight line, with less scattering, because they are less affected by the molecules.
4. As a result, our eyes perceive the scattered blue and violet light as the dominant colors, making the sky appear blue.

This effect is more pronounced during the daytime, when the sun is overhead, and less pronounced during sunrise and sunset, whe

#### Preparing for evals

Technically all that is needed to run retrieval evaluations in Phoenix is a dataframe of question-context pairs. However, to log the evals in Phoenix, you have to match them to a specific Phoenix span id. This requires a little more data wrangling.

In [9]:
# Keep only the retriever spans and the columns needed for evaluation
retrievals = spans_df[["name", "span_kind", "context.trace_id", "attributes.input.value", "attributes.retrieval.documents"]].query("name in ['retrieve']")
# One row per retrieved document
retrievals = retrievals.explode('attributes.retrieval.documents')
retrievals = retrievals.rename(columns={"attributes.input.value": "input"})
# Use .map rather than positional [] lookups on a non-integer index, which pandas warns about
docs = retrievals['attributes.retrieval.documents']
retrievals['document_score'] = docs.map(lambda d: d['document.score'])
retrievals['reference'] = docs.map(lambda d: d['document.content'])
retrievals = retrievals.drop(['span_kind', 'name', 'attributes.retrieval.documents'], axis=1)
# Position of each document within its span (0, 1, 2, ...) instead of a hardcoded list
retrievals['document_position'] = retrievals.groupby(level=0).cumcount().to_numpy()
retrievals.reset_index(inplace=True)
retrievals.set_index(['context.span_id', 'document_position'], inplace=True)

retrievals.head()



context.span_id,document_position,context.trace_id,input,document_score,reference
eef727de-9f27-4b41-aa79-acaccdf92383,0,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,Which private companies in the Americas are the largest GHG emitters according to the Carbon Maj...,0.874797,The issue of greenhouse gas emissions has become a major concern for environmentalists and polic...
eef727de-9f27-4b41-aa79-acaccdf92383,1,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,Which private companies in the Americas are the largest GHG emitters according to the Carbon Maj...,0.852543,Reducing greenhouse gas emissions from private companies is a complex challenge that requires co...
eef727de-9f27-4b41-aa79-acaccdf92383,2,f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29,Which private companies in the Americas are the largest GHG emitters according to the Carbon Maj...,0.852543,"The private companies responsible for the most emissions during this period, according to the da..."
a71236bc-6f1f-46c6-b799-4163048c8c51,0,1bceff06-0bb9-48d6-a498-ca5bc7afbb7d,What action did Amnesty International urge its supporters to take in response to the killing of ...,0.775176,"In the case of the Ogoni 9, Amnesty International called on its supporters to take action by sig..."
a71236bc-6f1f-46c6-b799-4163048c8c51,1,1bceff06-0bb9-48d6-a498-ca5bc7afbb7d,What action did Amnesty International urge its supporters to take in response to the killing of ...,0.774646,Amnesty International called on its vast network of supporters to deluge Nigerian authorities fi...
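
The target shape is one row per `(span ID, document position)` pair, since that pair is what ties an evaluation back to a specific retrieved document. A minimal sketch of the reshaping on made-up data (the span IDs, documents, and the `docs` column name are invented for illustration):

```python
import pandas as pd

# Made-up retriever spans, each holding a list of retrieved documents
toy_spans = pd.DataFrame(
    {
        "input": ["question A", "question B"],
        "docs": [
            [{"document.content": "a1", "document.score": 0.9},
             {"document.content": "a2", "document.score": 0.8}],
            [{"document.content": "b1", "document.score": 0.7}],
        ],
    },
    index=pd.Index(["span-a", "span-b"], name="context.span_id"),
)

# One row per (span, document)
toy = toy_spans.explode("docs")
toy["reference"] = toy["docs"].map(lambda d: d["document.content"])
toy["document_score"] = toy["docs"].map(lambda d: d["document.score"])
# Number the documents within each span: 0, 1, ... in retrieval order
toy["document_position"] = toy.groupby(level=0).cumcount().to_numpy()
toy = (
    toy.drop(columns="docs")
    .reset_index()
    .set_index(["context.span_id", "document_position"])
)
print(toy)
```

Deriving `document_position` with `cumcount` avoids assuming that every span retrieved the same number of documents.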


#### Run the Evaluation

Phoenix evaluations are run through a base `llm_classify` function call, which takes a prompt template and a set of rails (the allowed output labels) that determine what is actually being evaluated.

In [10]:
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP, # {True: "relevant", False: "unrelated"}
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=retrievals,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=llama_3,
    rails=rails,
    run_sync=True,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)
relevance_classifications["score"] = (
    relevance_classifications.label[~relevance_classifications.label.isna()] == "relevant"
).astype(int)

llm_classify |          | 0/12 (0.0%) | ⏳ 00:00<? | ?it/s
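
Conceptually, the raw text from the evaluator LLM has to be mapped onto one of the rails, and anything that cannot be mapped is left unlabeled (which is why the score above is only computed for non-null labels). The function below is a simplified illustration of that idea, not Phoenix's actual parsing logic; `snap_to_rail` is a hypothetical helper, and the `LABEL:` marker is taken from the explanation format this notebook's evaluator produces.

```python
from typing import Optional, Sequence


def snap_to_rail(raw_output: str, rails: Sequence[str]) -> Optional[str]:
    """Return the single rail mentioned after the last 'LABEL:' marker, else None."""
    # Look only at the text after the last 'LABEL:' marker, if there is one
    tail = raw_output.rsplit("LABEL:", 1)[-1].lower()
    matches = [rail for rail in rails if rail.lower() in tail]
    # Ambiguous (several rails) or missing (no rail) outputs stay unlabeled
    return matches[0] if len(matches) == 1 else None


rails = ["relevant", "unrelated"]
print(snap_to_rail('EXPLANATION: the text matches. LABEL: "relevant"', rails))
print(snap_to_rail("I am not sure.", rails))
```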

In [15]:
pd.set_option('display.max_colwidth', None)
relevance_classifications.head()

context.span_id,document_position,label,explanation,score
eef727de-9f27-4b41-aa79-acaccdf92383,0,relevant,"************\nEXPLANATION: The first step is to identify the main topic of the question, which is private companies in the Americas and their greenhouse gas emissions. The reference text also discusses private companies in the Americas and their role in contributing to greenhouse gas emissions. This similarity suggests that the reference text may contain relevant information.\n\nThe second step is to look for specific details in the reference text that match the requirements of the question. The question asks for the largest GHG emitters in the Americas, and the reference text states that some private companies in the Americas are identified as the largest emitters in the region according to the Carbon Majors database. This matches the requirements of the question, indicating that the reference text contains relevant information.\n\nFinally, the reference text does not provide a list of specific private companies in the Americas that are the largest GHG emitters, but it does provide context and information about the role of private companies in contributing to greenhouse gas emissions. This suggests that the reference text is providing relevant information that can help answer the question.\n\nLABEL: ""relevant""",1
eef727de-9f27-4b41-aa79-acaccdf92383,1,unrelated,"************\nEXPLANATION: First, I will analyze the question to identify the key elements that need to be answered. The question asks for private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database.\n\nNext, I will examine the reference text to see if it provides any information related to the question. The reference text mentions reducing greenhouse gas emissions from private companies, which is a topic relevant to the question. However, the text does not specifically mention the Carbon Majors database or provide a list of private companies in the Americas that are the largest GHG emitters.\n\nAlthough the reference text touches on the topic of reducing greenhouse gas emissions from private companies, it does not contain the specific information requested in the question. Therefore, I conclude that the reference text is:\n\nLABEL: ""unrelated""",0
eef727de-9f27-4b41-aa79-acaccdf92383,2,relevant,"************\nEXPLANATION: To determine if the reference text is relevant to the question, we need to compare the question's requirements with the information provided in the reference text.\n\nThe question asks about private companies in the Americas that are the largest GHG emitters, according to the Carbon Majors database. The question specifically mentions private companies, and the database is mentioned as the source of the information.\n\nThe reference text provides information about the largest emitters, but it does not specify that they are private companies. However, it does mention that the companies are from the United States, which is in the Americas.\n\nThe reference text also mentions that the largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A. This information does not directly answer the question, as it focuses on state-owned companies instead of private companies.\n\nHowever, the reference text does mention that the private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron and Peabody. This information matches the requirements of the question, which asks about private companies in the Americas.\n\nTherefore, the reference text contains information that can help answer the question. The label is:\n\nLABEL: ""relevant""",1
a71236bc-6f1f-46c6-b799-4163048c8c51,0,relevant,"************\nEXPLANATION: First, I will identify the main idea of the question, which is to determine what action Amnesty International urged its supporters to take in response to the killing of the Ogoni 9.\n\nNext, I will analyze the reference text to see if it contains any information related to the question. The text mentions that Amnesty International called on its supporters to take action, and specifies the actions taken, including signing petitions, writing letters, participating in protests and demonstrations.\n\nSince the reference text directly answers the question, providing the specific actions that Amnesty International urged its supporters to take, I conclude that the reference text is relevant to answering the question.\n\nLABEL: ""relevant""",1
a71236bc-6f1f-46c6-b799-4163048c8c51,1,relevant,"************\nEXPLANATION: First, I will identify the key terms in the question: ""What action did Amnesty International urge its supporters to take in response to the killing of the Ogoni 9?"" These key terms are ""action"", ""urge"", ""supporters"", and ""response to the killing of the Ogoni 9"".\n\nNext, I will analyze the reference text to see if it contains any information that matches these key terms. The reference text mentions Amnesty International, its supporters, and the Nigerian authorities, which are all relevant to the question.\n\nThe reference text also mentions ""appeals"" and ""letters of outrage"", which could be considered actions that Amnesty International urged its supporters to take. However, the text does not explicitly state what action was taken in response to the killing of the Ogoni 9.\n\nBased on this analysis, I conclude that the reference text contains information that is relevant to answering the question, but it does not directly answer the question. Therefore, the label is:\n\nLABEL: ""relevant""",1
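
With a 0/1 `score` per retrieved document, one simple per-query summary is the fraction of retrieved documents judged relevant. A small sketch using only the five labels shown above, with span IDs abbreviated for readability (so the second span's value covers just the two rows displayed, not all of its documents):

```python
import pandas as pd

# The five classifications displayed above, with span IDs abbreviated
toy_evals = pd.DataFrame(
    {
        "context.span_id": ["eef727de", "eef727de", "eef727de", "a71236bc", "a71236bc"],
        "document_position": [0, 1, 2, 0, 1],
        "label": ["relevant", "unrelated", "relevant", "relevant", "relevant"],
    }
).set_index(["context.span_id", "document_position"])
toy_evals["score"] = (toy_evals["label"] == "relevant").astype(int)

# Fraction of retrieved documents judged relevant, per retriever span
relevant_fraction = toy_evals.groupby(level="context.span_id")["score"].mean()
print(relevant_fraction)
```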


#### Log the evaluations into the Phoenix client

Now that we have scores and explanations associated with each of our retrieval steps, we can push these evals back into Phoenix.

In [11]:
from phoenix.trace import DocumentEvaluations

px.Client().log_evaluations(
    DocumentEvaluations(
        eval_name="Relevance", 
        dataframe=relevance_classifications
    )
)

![phoenix_meme](images/evals_meme.jpg)

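Unfortunately, `log_evaluations` did not work in my example session. The Phoenix log captured earlier in this notebook reports `Cannot insert a document evaluation for a non-existent document position`, so the evaluations likely did not line up with the documents Phoenix had stored for the retriever span, though I have not confirmed the root cause. Here is what a logged evaluation would look like in the Phoenix UI: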

![phoenix_evals_ex](images/phoenix_retrieve_eval.avif)
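
One way to start debugging a logging failure like this is to check that every `(span ID, document position)` pair in the evaluation dataframe actually exists among the documents stored on the spans. The following is a sketch with made-up data; `missing_positions` is a hypothetical helper, and it assumes a spans dataframe indexed by span ID with a list-valued `attributes.retrieval.documents` column, as in this notebook.

```python
import pandas as pd


def missing_positions(spans: pd.DataFrame, evals: pd.DataFrame) -> list:
    """Return (span_id, position) pairs in `evals` that have no matching document."""
    # Number of retrieved documents per span (0 for spans without documents)
    doc_counts = spans["attributes.retrieval.documents"].map(
        lambda docs: len(docs) if isinstance(docs, list) else 0
    )
    return [
        (span_id, pos)
        for span_id, pos in evals.index
        if pos >= doc_counts.get(span_id, 0)
    ]


# Made-up spans: one with a single document, one with none
toy_spans = pd.DataFrame(
    {"attributes.retrieval.documents": [[{"document.content": "a"}], None]},
    index=pd.Index(["span-a", "span-b"], name="context.span_id"),
)
# Made-up evaluations indexed by (span ID, document position)
toy_evals = pd.DataFrame(
    {"label": ["relevant", "relevant", "unrelated"]},
    index=pd.MultiIndex.from_tuples(
        [("span-a", 0), ("span-a", 1), ("span-b", 0)],
        names=["context.span_id", "document_position"],
    ),
)
print(missing_positions(toy_spans, toy_evals))
```

An empty result would suggest the index itself is fine and the problem lies elsewhere, for example in what the server stored when the trace was loaded.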

#### Phoenix Pros
- The big selling point is the UI, which provides a great visualization of the entire RAG process 
- If your RAG pipeline is hooked into Phoenix, you can generate visuals for the retrievals from your vector store
- Lots of templates for custom LLM-as-a-judge evals (both RAG and not)
- No issues so far running the evals themselves

#### Phoenix Cons
- Also slower than RAGAS (generating explanations takes time)
- Doesn't work well with non-LLM evals (the UI can accept these, but you have to resolve them in a different framework)
- The focus feels more on the visualizations and LLM tracing than the evals themselves