This walkthrough uses the Phoenix [LLM Ops overview notebook](https://github.com/Arize-ai/phoenix/blob/main/tutorials/llm_ops_overview.ipynb) as a jumping-off point.

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:

import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [3]:

import phoenix as px
from llama_index.core import set_global_handler

# Setup phoenix tracing
px.launch_app()
set_global_handler("arize_phoenix")

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [4]:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Global defaults: every index and query engine below uses these models
Settings.llm = OpenAI(model="gpt-3.5-turbo-0125")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [5]:
import tempfile
from urllib.request import urlretrieve

with tempfile.NamedTemporaryFile() as tf:
    urlretrieve(
        "https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt",
        tf.name,
    )
    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

In [6]:

from tqdm import tqdm

queries = [
    "What is Paul Graham's contribution to computer science?",
    "What startups has Paul Graham founded?",
    "What is the impact of Paul Graham's Y Combinator on the tech industry?",
    "What are some notable essays written by Paul Graham?"
]

for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")

 25%|██▌       | 1/4 [00:01<00:05,  1.93s/it]

Query: What is Paul Graham's contribution to computer science?
Response: Paul Graham's contribution to computer science is his focus on Lisp and his book "On Lisp." He emphasized the importance of Lisp for its own sake, beyond just its association with AI, and decided to write a book about Lisp hacking. This book, "On Lisp," which he worked on during grad school and was published in 1993, showcases his efforts to delve into Lisp and share his insights with others in the field.


 50%|█████     | 2/4 [00:02<00:02,  1.18s/it]

Query: What startups has Paul Graham founded?
Response: Paul Graham founded Y Combinator.


 75%|███████▌  | 3/4 [00:04<00:01,  1.73s/it]

Query: What is the impact of Paul Graham's Y Combinator on the tech industry?
Response: Paul Graham's Y Combinator had a significant impact on the tech industry by introducing the batch model of funding startups twice a year and providing intensive support to them for three months. This approach helped make starting a startup more accessible and common, challenging the traditional customs of venture capital that were still rooted in the past. Additionally, Y Combinator's focus on helping founders in the early stages of their startups and its innovative strategies, such as self-funding and unique branding choices, contributed to its influence on the tech industry.


100%|██████████| 4/4 [00:06<00:00,  1.70s/it]

Query: What are some notable essays written by Paul Graham?
Response: Some notable essays written by Paul Graham include those discussing the use of Lisp at Viaweb, reflections on the changing landscape of publishing essays online, and insights on the value of working on unprestigious endeavors. Additionally, his collection of essays titled "Hackers & Painters" stands out as a significant work that showcases his thoughts on various topics.





## Export Spans to a DataFrame

In [7]:
spans_df = px.Client().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

| context.span_id | name | span_kind | attributes.input.value | attributes.retrieval.documents |
|---|---|---|---|---|
| 7a8e49b2807e71b5 | llm | LLM | | |
| 135ea7edc15998f0 | chunking | CHAIN | | |
| e6343af34f069b8d | chunking | CHAIN | | |
| dba49cbebbb219d3 | synthesize | CHAIN | What are some notable essays written by Paul G... | |
| 2d6627d1ed1e905e | embedding | EMBEDDING | | |
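
As a rough illustration of what can be done with the exported spans, the sketch below counts spans per `span_kind` and picks out the spans that carry an input value. It uses only the five rows shown above, copied into plain Python tuples so that it runs without Phoenix or pandas; the variable names are illustrative. On the real DataFrame, `spans_df["span_kind"].value_counts()` gives the same kind of summary.

```python
from collections import Counter

# (span_id, name, span_kind, input_value) copied from the five rows shown above.
# None stands in for the empty cells in the output.
spans = [
    ("7a8e49b2807e71b5", "llm", "LLM", None),
    ("135ea7edc15998f0", "chunking", "CHAIN", None),
    ("e6343af34f069b8d", "chunking", "CHAIN", None),
    ("dba49cbebbb219d3", "synthesize", "CHAIN", "What are some notable essays written by Paul G..."),
    ("2d6627d1ed1e905e", "embedding", "EMBEDDING", None),
]

# Count how many spans of each kind were recorded
kind_counts = Counter(span_kind for _, _, span_kind, _ in spans)

# Keep only the spans that recorded an input value
spans_with_input = [span_id for span_id, _, _, value in spans if value is not None]

print(dict(kind_counts))
print(spans_with_input)
```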


## Eval

Convert traces to datasets

In [8]:
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.active_session())
queries_df = get_qa_with_reference(px.active_session())

In [9]:
import nest_asyncio
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

nest_asyncio.apply()  # Lets llm_classify run concurrent requests inside the notebook's event loop

# Check if the application has any indications of hallucinations
hallucination_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-3.5-turbo-0125", temperature=0.0),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
)
hallucination_eval["score"] = (
    hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
).astype(int)

# Check if the application is answering questions correctly
qa_correctness_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-3.5-turbo-0125", temperature=0.0),
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
    concurrency=4,
)

# Score from the QA labels themselves: 1 for "correct", 0 otherwise
qa_correctness_eval["score"] = (
    qa_correctness_eval.label[~qa_correctness_eval.label.isna()] == "correct"
).astype(int)

llm_classify |          | 0/4 (0.0%) | ⏳ 00:00<? | ?it/s


llm_classify |          | 0/4 (0.0%) | ⏳ 00:00<? | ?it/s

In [10]:

hallucination_eval.head()

| context.span_id | label | explanation | score |
|---|---|---|---|
| 29cdb52c81f42e91 | factual | The answer is factual. The reference text ment... | 1 |
| 7e52ec676d79da63 | factual | The answer is factual based on the reference t... | 1 |
| 86eed41c4da9935f | factual | The answer 'Paul Graham founded Y Combinator' ... | 1 |
| f0f75e537493bce4 | factual | The answer is factual based on the reference t... | 1 |


In [11]:

qa_correctness_eval.head()

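| context.span_id | label | explanation | score |
|---|---|---|---|
| 29cdb52c81f42e91 | correct | The answer correctly identifies some notable e... | 0 |
| 7e52ec676d79da63 | correct | The answer correctly addresses the impact of P... | 0 |
| 86eed41c4da9935f | incorrect | The answer provided is incorrect because the q... | 0 |
| f0f75e537493bce4 | correct | The reference text provides information about ... | 0 |

The `score` column is 0 in every row of this recorded output because, in the run recorded here, the scoring expression read its labels from `hallucination_eval` (values `factual` or `hallucinated`), which never equal `correct`. Derived from `qa_correctness_eval.label`, the scores are 1 for each `correct` row and 0 for the `incorrect` one.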
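
The label-to-score mapping used in both evals can be sketched in plain Python. This is a minimal illustration rather than Phoenix's own code: `label_to_score` and `mean_score` are hypothetical helpers, and the label lists are copied from the two `head()` outputs above. Missing labels are kept as `None`, mirroring how the pandas expression skips rows whose label is NaN.

```python
from typing import List, Optional


def label_to_score(label: Optional[str], positive: str) -> Optional[int]:
    """Return 1 if the label equals the positive rail, 0 if not, None if missing."""
    if label is None:
        return None
    return int(label == positive)


def mean_score(scores: List[Optional[int]]) -> Optional[float]:
    """Average the scores, ignoring rows whose label was missing."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None


# Labels copied from the hallucination and QA correctness outputs above
hallucination_labels = ["factual", "factual", "factual", "factual"]
qa_labels = ["correct", "correct", "incorrect", "correct"]

hallucination_scores = [label_to_score(label, "factual") for label in hallucination_labels]
qa_scores = [label_to_score(label, "correct") for label in qa_labels]

print(hallucination_scores, mean_score(hallucination_scores))  # [1, 1, 1, 1] 1.0
print(qa_scores, mean_score(qa_scores))  # [1, 1, 0, 1] 0.75
```

The mean of each score column is the fraction of answers judged factual or correct, which is a convenient single number to track across runs.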


In [12]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval),
)

In [13]:

print("The Phoenix UI:", px.active_session().url)

The Phoenix UI: http://localhost:6006/


### Eval Relevance of RAG Chunks

In [14]:
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Judge each retrieved chunk against the query that retrieved it
retrieved_documents_eval = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# 1 for "relevant", 0 otherwise; rows without a label are left unscored
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)


llm_classify |          | 0/8 (0.0%) | ⏳ 00:00<? | ?it/s

In [15]:
retrieved_documents_eval.head()
     

| context.span_id | document_position | label | explanation | score |
|---|---|---|---|---|
| e512dbe22717a147 | 0 | relevant | The question asks for notable essays written b... | 1 |
| e512dbe22717a147 | 1 | relevant | The question asks for notable essays written b... | 1 |
| 83a8a6fd71b8df1e | 0 | relevant | The question asks about the impact of Paul Gra... | 1 |
| 83a8a6fd71b8df1e | 1 | relevant | The reference text provides detailed informati... | 1 |
| c1102c6f03da54ee | 0 | unrelated | The question asks about startups founded by Pa... | 0 |
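
Per-document scores like these are often rolled up into a per-query retrieval precision, meaning the share of retrieved chunks judged relevant. The sketch below does that in plain Python over the five rows shown above; `precision_by_span` is a hypothetical helper, not a Phoenix function. Because `head()` shows only the first five of the eight evaluated documents, the value for `c1102c6f03da54ee` reflects just the one row visible here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (span_id, document_position, score) copied from the head() output above
rows: List[Tuple[str, int, int]] = [
    ("e512dbe22717a147", 0, 1),
    ("e512dbe22717a147", 1, 1),
    ("83a8a6fd71b8df1e", 0, 1),
    ("83a8a6fd71b8df1e", 1, 1),
    ("c1102c6f03da54ee", 0, 0),
]


def precision_by_span(rows: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """Group document scores by retriever span and average them."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for span_id, _position, score in rows:
        grouped[span_id].append(score)
    return {span_id: sum(scores) / len(scores) for span_id, scores in grouped.items()}


precision = precision_by_span(rows)
for span_id, value in precision.items():
    print(span_id, value)
```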


In [16]:
from phoenix.trace import DocumentEvaluations

px.Client().log_evaluations(
    DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval)
)