# LangChain x Phoenix

This notebook showcases the power of LangChain and Phoenix combined.

I am quite passionate about the Trust and Safety domain, so I thought it would be useful to scrape this website and build a RAG around it: https://features.integrityinstitute.org/. I call it the Integrity RAG. For the sake of simplicity, I focus on a subset of online harms.

## Scope of Notebook

The steps are, in order:
1. Set up OpenAI, Phoenix, and LangChain.
2. Build the Integrity RAG.
3. Generate LLM-based questions (substituting for user questions).
4. Generate RAG-based answers.
5. Generate LLM-based evaluations (substituting for user-based evaluations).

## Setup Phoenix

Setup is quite simple: connect Phoenix through the LangChainInstrumentor, which traces LangChain calls automatically.

Side note: this code was hard to find in the onboarding flow. The LlamaIndex equivalent was much easier to find.

In [1]:
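import os
from getpass import getpass

import nest_asyncio
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

# Read the API key from the environment (or prompt for it) rather than hard-coding a secret.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")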
nest_asyncio.apply()
LangChainInstrumentor().instrument()
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x2a17d7a50>

## Setup the RAG

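This sets up a simple RAG: a RecursiveUrlLoader crawls the site, the text is split into 1000-character chunks with a 200-character overlap, and the retriever returns the top k=6 chunks. None of these settings were tuned.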
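
To make the chunking parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in and not the algorithm RecursiveCharacterTextSplitter actually uses (that splitter prefers to break on separators such as paragraphs and sentences). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 2500-character toy document, using the same sizes as the notebook.
sample = "".join(chr(97 + i % 26) for i in range(2500))
sample_chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
```

With these sizes each window advances by 800 characters, so neighbouring chunks share 200 characters. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.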

In [11]:
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders import RecursiveUrlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain import hub

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

URL = "https://features.integrityinstitute.org/" 

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

loader = RecursiveUrlLoader(url=URL, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()


splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = splitter.split_documents(docs)
filtered_splits = filter_complex_metadata(all_splits)

vector_store = Chroma.from_documents(documents=filtered_splits, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 6})

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
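
The retriever above is configured to return the k=6 chunks most similar to the question. As a rough illustration of what a top-k similarity search does, here is a pure-Python sketch using cosine similarity on toy vectors. The metric is an assumption made for illustration (the vector store's actual distance function may differ), and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: list[list[float]], k: int = 6) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query, doc), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
best = top_k([1.0, 0.1], toy_docs, k=2)
```

A real vector store uses an index rather than scoring every chunk, but the ranking idea is the same.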

In [14]:
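# Assumed definition: the cell that created spans_df is not shown in this notebook.
spans_df = px.Client().get_spans_dataframe()
spans_df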

context.span_id,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,conversation,context.trace_id,...,attributes.output.value,attributes.input.value,attributes.__computed__.latency_ms,attributes.__computed__.error_count,attributes.llm.output_messages,attributes.output.mime_type,attributes.llm.model_name,attributes.llm.invocation_parameters,attributes.llm.prompts,attributes.retrieval.documents
36c72d1e27ed1ded,StrOutputParser,UNKNOWN,07c1ba3d6e3e4ea8,2024-02-09T16:57:43.622319+00:00,2024-02-09T16:57:47.223579+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"To stop scams, platforms can implement mechani...","{""input"": ""content='To stop scams, platforms c...",3601.26,0,,,,,,
69ebd4e32072ba26,ChatOpenAI,LLM,07c1ba3d6e3e4ea8,2024-02-09T16:57:43.037828+00:00,2024-02-09T16:57:47.220132+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"{""generations"": [[{""text"": ""To stop scams, pla...","{""prompts"": [""Human: You are an assistant for ...",4182.304,0,"[{'message.content': 'To stop scams, platforms...",application/json,gpt-3.5-turbo,"{""model"": ""gpt-3.5-turbo"", ""model_name"": ""gpt-...",[Human: You are an assistant for question-answ...,
627238859608d606,ChatPromptTemplate,UNKNOWN,07c1ba3d6e3e4ea8,2024-02-09T16:57:43.035796+00:00,2024-02-09T16:57:43.036167+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"{""lc"": 1, ""type"": ""constructor"", ""id"": [""langc...","{""question"": ""How does one stop scams?"", ""cont...",0.371,0,,application/json,,,,
975809830241da11,format_docs,CHAIN,f8b13701a3ede85d,2024-02-09T16:57:43.021710+00:00,2024-02-09T16:57:43.022005+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,highly personalized frauds designed to appeal ...,"{""input"": [""page_content=\""highly personalized...",0.295,0,,,,,,
301db92a93c61a42,RunnablePassthrough,CHAIN,5ee105d69cb8163a,2024-02-09T16:57:42.788637+00:00,2024-02-09T16:57:42.792912+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,How does one stop scams?,How does one stop scams?,4.275,0,,,,,,
409c4f108cf7503e,Retriever,RETRIEVER,f8b13701a3ede85d,2024-02-09T16:57:42.788336+00:00,2024-02-09T16:57:43.019306+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"{""documents"": [""page_content=\""highly personal...",How does one stop scams?,230.97,0,,application/json,,,,[{'document.metadata': {'description': 'Commer...
f8b13701a3ede85d,RunnableSequence,CHAIN,5ee105d69cb8163a,2024-02-09T16:57:42.787796+00:00,2024-02-09T16:57:43.024914+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,highly personalized frauds designed to appeal ...,How does one stop scams?,237.118,0,,,,,,
5ee105d69cb8163a,"RunnableParallel<context,question>",CHAIN,07c1ba3d6e3e4ea8,2024-02-09T16:57:42.786919+00:00,2024-02-09T16:57:43.034047+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"{""question"": ""How does one stop scams?"", ""cont...",How does one stop scams?,247.128,0,,application/json,,,,
07c1ba3d6e3e4ea8,RunnableSequence,CHAIN,,2024-02-09T16:57:42.781196+00:00,2024-02-09T16:57:47.225275+00:00,OK,,[],,dc2a063f94ffe4d8e654c9bee3212791,...,"To stop scams, platforms can implement mechani...",How does one stop scams?,4444.079,0,,,,,,
37d2af01d63993e2,Retriever,RETRIEVER,,2024-02-09T16:51:42.923352+00:00,2024-02-09T16:51:43.142226+00:00,OK,,[],,0dd19018c86cd8364241889c98502de8,...,"{""documents"": [""page_content=\""highly personal...",How does one stop scams?,218.874,0,,application/json,,,,[{'document.metadata': {'description': 'Commer...


In [19]:
filtered_splits[10].page_content

'Focus on Features A project of the Integrity InstituteTo minimize the impact of harms like Child Sexual Abuse Imagery Manipulated Media / Deep Fakes Election Misinformation Political Misinformation Health Misinformation Malware Non-consensual Explicit Imagery Promoting Illegal Activity, change the design of File/Link Sharing Search. Intervention:Media Provenance\xa0 Definition: Record and display the chain of custody and original source for media.Kind of Intervention:  HistoryReversible:Easily Tested + Abandoned Suitability:General Technical Difficulty:Hard Legislative Target:Yes Knowing where media came from is often more important than knowing what the media contains:Hide CommentsIn cases like Deep Fakes and Manipulated Media, one of the best signals we have for veracity is the source of media, and where it showed up first. The credibility/plausibility of the initial poster having access to the media is a fabulous analogy for the plausibility of the media itself.In cases like Child'

In [20]:
# Next step: add in synthetic data.
# Feedback from building this:
# - It is hard to find out how to integrate Phoenix with LangChain.
# - Once you do, the Phoenix API is not obvious.
# - The Phoenix UI has no easy way to generate LLM data and use it for evals.
# - The Phoenix UI has no buttons for checking evals.
# - There is too much pandas and too many steps: I had to use one LLM to generate
#   dummy data and then another LLM to run the evaluation.
# - Onboarding is rough, and that is likely what holds growth back.
# - The Reddit presence is weak.
# TODO: suggest onboarding ideas for both the open-source and the non-open-source product.

filtered_docs = filtered_splits


In [23]:
import pandas as pd

document_chunks_df = pd.DataFrame({"text": [doc.page_content for doc in filtered_docs]})


In [24]:
document_chunks_df

Unnamed: 0,text
0,Focus on Features | Prevent Harm Through Design
1,Focus on Features Minimize digital harms throu...
2,platform's business model.Hide CommentsSeeing ...
3,Doxxing Blackmail Election Misinformation Heal...
4,FoF | Harm | Child Sexual Abuse Imagery
...,...
650,idea in common: a user taking a proactive acti...
651,decide which content to show the user first.Hi...
652,Number of SubscriptionsBy limiting the number ...
653,ReachVary posting limitations in line with sub...


In [26]:
generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the context above.

You are a teacher/professor. Your task is to set up \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""

In [27]:
import json

from phoenix.experimental.evals import OpenAIModel, llm_generate


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model_name="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)

llm_generate |██████████| 655/655 (100.0%) | ⏳ 02:27<00:00 |  4.45it/s
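
The `output_parser` above records an `__error__` entry whenever the response is not pure JSON, which can happen when a model wraps its answer in a code fence or adds a sentence of preamble. The following is a hedged sketch of a more tolerant parser with the same `(response, index)` signature. The name `tolerant_output_parser` is hypothetical, and the sketch assumes the response contains at most one JSON object.

```python
import json
import re


def tolerant_output_parser(response: str, index: int) -> dict:
    """Extract the first '{' through the last '}' from the response and parse it as JSON."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return {"__error__": "no JSON object found"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


# A response with a chatty preamble still parses.
parsed = tolerant_output_parser('Sure! {"question_1": "What is a scam?"}', 0)
```

It could be passed to `llm_generate` in place of `output_parser`. Rows that still fail keep the `__error__` key, so they remain easy to filter out afterwards.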


In [28]:
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]

In [29]:
questions_with_document_chunk_df.head(10)

Unnamed: 0,text,question
0,Focus on Features | Prevent Harm Through Design,How can design features be used to prevent harm?
1,Focus on Features Minimize digital harms throu...,What is the core idea behind the project of th...
2,platform's business model.Hide CommentsSeeing ...,What are the drawbacks of relying on content m...
3,Doxxing Blackmail Election Misinformation Heal...,What are some examples of harmful activities t...
4,FoF | Harm | Child Sexual Abuse Imagery,What measures can be taken by educational inst...
5,Focus on Features A project of the Integrity I...,What is the definition of Child Sexual Abuse I...
6,acts.Hide CommentsThe universal condemnation o...,What is the notable area of broad consensus on...
7,that otherwise can continue to cause harm out ...,What is one feature that can facilitate Child ...
8,LimitationsRequire users to create an account ...,What is one limitation of the current system?
9,FoF | Intervention | Media Provenance,What is the purpose of the FoF intervention?


In [37]:
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response = rag_chain.invoke(question)
    print(f"Question: {question}\nAnswer: {response}\n")

Question: How can design features be used to prevent harm?
Answer: Design features can be used to prevent harm by implementing interventions such as right-sizing content visibility, giving users access to more powerful features as they engage more deeply, broadening a user's information diet through deliberate mechanisms of alternative exposure, and giving users capability controls over a circumscribed perimeter. By changing the design of features, platforms can be made less capable of being abused and can mitigate the impact of digital harms. Legislative action that constrains the set of features that platforms can offer is also likely to be effective in preventing harm.

Question: What is the core idea behind the project of the Integrity Institute?
Answer: The core idea behind the project of the Integrity Institute is to prevent digital harms through platform design. It aims to illustrate the connections between feature design choices and harms, and catalog proposed changes (interven

KeyboardInterrupt: 

In [38]:
from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df

context.span_id,document_position,context.trace_id,input,reference,document_score
7c6bddfe1b66f2b3,0,1481bd6aa824b0988648b2eb54a12872,What is Coordinated Inauthentic Activity (CIA)...,Focus on Features A project of the Integrity I...,
7c6bddfe1b66f2b3,1,1481bd6aa824b0988648b2eb54a12872,What is Coordinated Inauthentic Activity (CIA)...,like what is sometimes seen by ride share driv...,
7c6bddfe1b66f2b3,2,1481bd6aa824b0988648b2eb54a12872,What is Coordinated Inauthentic Activity (CIA)...,aligned with the incentives of those conductin...,
7c6bddfe1b66f2b3,3,1481bd6aa824b0988648b2eb54a12872,What is Coordinated Inauthentic Activity (CIA)...,FoF | Harm | Coordinated Inauthentic Activity,
7c6bddfe1b66f2b3,4,1481bd6aa824b0988648b2eb54a12872,What is Coordinated Inauthentic Activity (CIA)...,through large volumes of inauthentic feedback....,
...,...,...,...,...,...
37d2af01d63993e2,1,0dd19018c86cd8364241889c98502de8,How does one stop scams?,to mechanisms by which they can be stymied in ...,
37d2af01d63993e2,2,0dd19018c86cd8364241889c98502de8,How does one stop scams?,FoF | Harm | Scams,
37d2af01d63993e2,3,0dd19018c86cd8364241889c98502de8,How does one stop scams?,users.Because many scams rely on bulk distribu...,
37d2af01d63993e2,4,0dd19018c86cd8364241889c98502de8,How does one stop scams?,the capability to build lots of high quality h...,


In [39]:
from phoenix.experimental.evals import (
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]

run_evals |█████▊    | 799/1368 (58.4%) | ⏳ 05:58<1:24:54 |  8.95s/it 

Worker timeout, requeuing

run_evals |█████▊    | 799/1368 (58.4%) | ⏳ 08:01<1:24:54 |  8.95s/it 
Task exception was never retrieved
future: <Task finished name='Task-30954' coro=<run_evals.<locals>._arun_eval() done, defined at /Users/talhabaig/work/simulacra/.venv/lib/python3.11/site-packages/phoenix/experimental/evals/functions/classify.py:405> exception=RateLimitError('Exceeded max (10) retries')>
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/talhabaig/work/simulacra/.venv/lib/python3.11/site-packages/phoenix/experimental/evals/functions/classify.py", line 408, in _arun_eval
    label, score, explanation = await payload.evaluator.aevaluate(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/talhabaig/work/simulacra/.venv/lib/python3.11/site-packages/phoenix/experimental/eval

Process was interrupted. The return value will be incomplete...


run_evals |█████▊    | 799/1368 (58.4%) | ⏳ 15:10<10:48 |  1.14s/it  


ValueError: not enough values to unpack (expected 5, got 2)
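
The `RateLimitError('Exceeded max (10) retries')` messages above suggest that 20 concurrent requests were too many for this account, so lowering `concurrency` is probably the first thing to try. For calls made directly from the notebook (for example the `rag_chain.invoke` loop), a generic retry with exponential backoff is a common pattern. Below is a minimal sketch. `call_with_backoff` and its parameters are hypothetical names, and it would not help with an `insufficient_quota` error, which requires a billing change rather than waiting.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying failed calls with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so that parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to the client's rate-limit exception, for example `call_with_backoff(lambda: rag_chain.invoke(question), retry_on=(RateLimitError,))`.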

In [40]:
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df

NameError: name 'retrieved_documents_relevance_df' is not defined

In [41]:
import numpy as np
from sklearn.metrics import ndcg_score

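def _compute_ndcg(df: pd.DataFrame, k: int) -> float:
    """Compute NDCG@k, padding with zeros so a single retrieved document still works."""
    n = max(len(df), 2)  # ndcg_score needs at least two documents
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score

    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        # Raised, for example, when the scores contain NaN.
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)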

NameError: name 'documents_with_relevance_df' is not defined
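
Since the evaluation did not finish, no NDCG values were produced here. To show what the metric measures, the following is a small pure-Python sketch of NDCG@k with linear gains and a log2 discount. It is a simplified illustration and not a replacement for `sklearn.metrics.ndcg_score` (tie handling, for instance, differs), and the names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance at 0-based position i is divided by log2(i + 2)."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))


def ndcg_at_k(eval_scores: list[float], doc_scores: list[float], k: int) -> float:
    """NDCG@k, where doc_scores give the retriever's ranking and eval_scores the judged relevance."""
    # Order the relevance labels the way the retriever ranked the documents.
    pairs = sorted(zip(doc_scores, eval_scores), key=lambda p: p[0], reverse=True)
    ranked = [rel for _, rel in pairs]
    # The ideal ranking puts the most relevant documents first.
    ideal = sorted(eval_scores, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents at all
    return dcg(ranked[:k]) / ideal_dcg


perfect = ndcg_at_k([1, 0], [0.9, 0.8], k=2)   # relevant document ranked first
swapped = ndcg_at_k([0, 1], [0.9, 0.8], k=2)   # relevant document ranked second
```

A score of 1.0 means the retriever ranked the relevant documents as high as possible, and the score drops as relevant documents move down the list. Note that the `document_score` column in the retrieved-documents table above is empty, so meaningful NDCG values would also require the retriever to report scores.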