# Evaluations

This notebook shows how to pull traces from a running phoenix instance and evaluate them using the `arize-phoenix-evals` library.

In [1]:
# pip install arize-phoenix-evals openai nest_asyncio arize-phoenix arize

In [2]:
# Run async evaluation in the notebook
import nest_asyncio

nest_asyncio.apply()

OPEN_AI_API_KEY = ""

In [3]:
import phoenix as px

client = px.Client(endpoint="http://localhost:6006")

In [4]:
from datetime import datetime, timedelta
from phoenix.trace.dsl.helpers import get_qa_with_reference, get_retrieved_documents

qa_df = get_qa_with_reference(client)
documents_df = get_retrieved_documents(client)

In [5]:
import pandas as pd
pd.set_option('display.max_colwidth', 800)
qa_df.head()

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
407ebff65544f91c,What is distribution of sensitive content in the context of the docs?,"In the context of the documents, ""distribution of sensitive content"" refers to a risk category associated with large language models (LLMs) where the LLM shares information that is considered sensitive. This risk is characterized by the likelihood of the user intent being malicious, and the party harmed by this action is a third party, which could be other individuals not directly interacting with the LLM. The types of harms associated with the distribution of sensitive content include:\n\n- Violation of privacy, which could occur through the leakage of training data or the inference of private information not intended to be shared.\n- Dissemination of graphic material, such as child sexual abuse material (CSAM), which is illegal and highly harmful content.\n\nThis risk category unders...","LLM Test and Evaluation\nB Risks Taxonomy\nPresent risks\n1.Harmful information\nLLM provides information that harms the user.\nUser intent : Benign\nParty harmed: First party (user)\nType of harms:\n• Misleading or misinforming a user (""hallucination"")\n• Causing material harm due to unqualiﬁed advice (e.g., medical, legal, ﬁnancial)\n• Leading users to perform unethical or illegal actions\n• Causing psychological harm due to toxic, graphic, or violent content\n2.Harm against groups\nLLM provides information that can lead to harm to a group.\nUser intent : Benign\nParty harmed: Third party (targeted groups)\nType of harms:\n• Promoting discrimination\n• Promoting bias and exclusionary norms\n3.Distribution of sensitive content\nLLM shares information that is sensitive.\nUser intent : ..."
edbb4ea7ff9853ec,What is reasoning?,"The documents do not explicitly define ""reasoning"" within the provided excerpts. Generally, in the context of large language models (LLMs) and artificial intelligence (AI), reasoning refers to the process by which a model or system processes information, draws inferences, or makes decisions based on the data it has been trained on or is analyzing. This can include a wide range of cognitive tasks, such as understanding natural language, solving problems, making predictions, or generating responses that are coherent and contextually appropriate. Reasoning is a critical capability of LLMs, enabling them to perform complex tasks across various domains, from engaging in conversation to generating written content and beyond. It involves not just the retrieval of information, but also the app...","A H OLISTIC APPROACH FOR TEST AND EVALUATION OF LARGE\nLANGUAGE MODELS\nDylan Slack∗, Jean Wang∗, Denis Semenenko∗, Kate Park, Sean Hendryx\nScale AI\nABSTRACT\nAs large language models (LLMs) become increasingly prevalent in diverse applications, ensuring\nthe utility and safety of model generations becomes paramount. We present a holistic approach for\ntest and evaluation of large language models. The approach encompasses the design of evaluation\ntaxonomies, a novel framework for safety testing, and hybrid methodologies that improve the\nscalability and cost of evaluations. In particular, we introduce a hybrid methodology for the evaluation\nof large language models (LLMs) that leverages both human expertise and AI assistance. Our hybrid\nmethodology generalizes across both LLM capa..."
f1549d963f28bbcc,What is a mitigation strategy,"The documents do not explicitly detail a ""mitigation strategy"" within the provided excerpts. Typically, in the context of evaluating and testing large language models (LLMs), a mitigation strategy would refer to a plan or set of actions designed to address and reduce the impact of identified vulnerabilities, risks, or undesirable behaviors in LLMs. This could involve techniques to improve model reliability, reduce susceptibility to adversarial prompts, or align the model's actions more closely with user intentions and ethical guidelines. Mitigation strategies might include refining training data, adjusting model parameters, implementing additional layers of review for model outputs, or developing more sophisticated testing protocols to catch and correct errors or biases before deployme...","LLM Test and Evaluation\n2.2.2 Types of vulnerabilities\nAlong with our taxonomy of risks, we also consider the types of vulnerabilities that may result in the various risks\ndescribed above. In contrast to existing standards such as the OWASP Top 10 for LLMs that cover broader structural\nvulnerabilities, such as supply chain vulnerabilities, the vulnerabilities below are focused on the potential misuse and\nunintended consequences of AI systems.\n1.Unreliability Unreliability refers to when the model produces harmful results unintentionally. These are\nsituations in which the users are using the model as expected, without any adversarial or malicious intent, but\nthe model outputs undesirable content. The harms caused by this vulnerability are typically related to harmful\ninformatio..."
5b3abc8ea9afc82f,What are capabilities?,"Capabilities, in the context of evaluating large language models (LLMs), refer to the broad categories of tasks or functions that LLMs can perform. These capabilities are broken down into several categories to comprehensively cover the wide range of uses and applications of LLMs across different domains. The document outlines a taxonomy for understanding these capabilities, which includes, but is not limited to, categories like Conversation, Generation, Math, and Coding. Each category is designed to capture a full spectrum of use cases within that domain, providing additional granularity beyond top-level categories to ensure a thorough evaluation of LLMs' abilities. For example, the Generation category might include tasks ranging from creative writing, like composing a rap song, to pra...","LLM Test and Evaluation\n(a) Capabilities evaluation framework.\n(b) Safety evaluation framework.\nFigure 1: Overview of Capabilites andSafety evaluation frameworks. We introduce AI augmented approaches for\nevaluating the capabilities and safety of large language models, LLMs.\nhigh degree of accuracy (86%) while ofﬂoading a signiﬁcant portion of the work to AI models (up to 20%). Similarly,\nwe ﬁnd that our hybrid approach to red teaming, combining automated methods, generalist teamers, and expert red\nteamers, enables us to ﬁnd more successful and harmful red teams attacks. In particular, we ﬁnd that generalist red\nteamers, who have experience red teaming across a variety of different domains and projects, had the highest rate of\nred teaming success (80%), while expert red teamers..."
cb1ddb6bd3ca84f3,Can LLM evaluations be completely automated?,"No, LLM evaluations cannot be completely automated. The documents highlight that while hybrid approaches for evaluating LLM capabilities and safety are quite effective, they find that full automation of LLM evaluation is not feasible. Automated systems can be inaccurate or less comprehensive in certain instances, necessitating the inclusion of humans in the loop to improve accuracy and coverage. Therefore, despite the significant advancements in automating the evaluation process, human evaluators remain essential for achieving the best results in LLM evaluations.","A H OLISTIC APPROACH FOR TEST AND EVALUATION OF LARGE\nLANGUAGE MODELS\nDylan Slack∗, Jean Wang∗, Denis Semenenko∗, Kate Park, Sean Hendryx\nScale AI\nABSTRACT\nAs large language models (LLMs) become increasingly prevalent in diverse applications, ensuring\nthe utility and safety of model generations becomes paramount. We present a holistic approach for\ntest and evaluation of large language models. The approach encompasses the design of evaluation\ntaxonomies, a novel framework for safety testing, and hybrid methodologies that improve the\nscalability and cost of evaluations. In particular, we introduce a hybrid methodology for the evaluation\nof large language models (LLMs) that leverages both human expertise and AI assistance. Our hybrid\nmethodology generalizes across both LLM capa..."


In [6]:
documents_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
27d940b42e985f1c,0,c45cd427908a22fd1545d50b69fdf0ad,"What does ""distribution of sensitive content"" mean according to the provided documents?","LLM Test and Evaluation\nB Risks Taxonomy\nPresent risks\n1.Harmful information\nLLM provides information that harms the user.\nUser intent : Benign\nParty harmed: First party (user)\nType of harms:\n• Misleading or misinforming a user (""hallucination"")\n• Causing material harm due to unqualiﬁed advice (e.g., medical, legal, ﬁnancial)\n• Leading users to perform unethical or illegal actions\n• Causing psychological harm due to toxic, graphic, or violent content\n2.Harm against groups\nLLM provides information that can lead to harm to a group.\nUser intent : Benign\nParty harmed: Third party (targeted groups)\nType of harms:\n• Promoting discrimination\n• Promoting bias and exclusionary norms\n3.Distribution of sensitive content\nLLM shares information that is sensitive.\nUser intent : ...",0.387272
27d940b42e985f1c,1,c45cd427908a22fd1545d50b69fdf0ad,"What does ""distribution of sensitive content"" mean according to the provided documents?","LLM Test and Evaluation\nthese capabilities are sufﬁciently important and distinct in the nature of the outputs and the way in which the outputs are\nevaluated. To a certain extent, they each have their own ""language"" that is independent of the human language (i.e.,\nEnglish).\nBelow are our ten top-level categories for capabilities evaluation, which describe the different use cases and capabilities\nof language models. The full taxonomy including subcategories can be found in Appendix A.\n•Classiﬁcation : Determining the appropriate category according to shared qualities or characteristics. Accuracy\nof the classiﬁcation is a key criteria.\n•Information retrieval: Answering requests based on a provided text (e.g., summarize). Faithfulness to the\nreference text and synthesis are key c...",0.327899
27d940b42e985f1c,2,c45cd427908a22fd1545d50b69fdf0ad,"What does ""distribution of sensitive content"" mean according to the provided documents?","LLM Test and Evaluation\nof the model to 1.0, and repeatedly sampling the model a ﬁxed number of times for both response orderings in the\ngrading prompt. To compute conﬁdence on a particular pairwise comparison, we compute the entropy of the Monte\nCarlo estimate. Speciﬁcally, if the rates of voting for response 1 and response 2 for pair iare given as Ri\n1andRi\n2,\nrespectively, and the rate neither are voted is given as N(occasionally, models such as GPT-4 will notindicate one\nresponse is better, even when prompted), the entropy Wis written as,\nWi=−1∑\nr∈{Ri\n1,Ri\n2,N⟩}rlog(r) (1)\nIn general, we expect predictions which have lower entropy to be more accurate than predictions with higher entropy,\nindicating the model has more uncertainty surrounding the prediction. To compute t...",0.271519
0bdb02694a7c018f,0,65f76af1c4dd344618b63ee30b24a23c,"What does ""reasoning"" mean in the context of evaluating large language models (LLMs)?","A H OLISTIC APPROACH FOR TEST AND EVALUATION OF LARGE\nLANGUAGE MODELS\nDylan Slack∗, Jean Wang∗, Denis Semenenko∗, Kate Park, Sean Hendryx\nScale AI\nABSTRACT\nAs large language models (LLMs) become increasingly prevalent in diverse applications, ensuring\nthe utility and safety of model generations becomes paramount. We present a holistic approach for\ntest and evaluation of large language models. The approach encompasses the design of evaluation\ntaxonomies, a novel framework for safety testing, and hybrid methodologies that improve the\nscalability and cost of evaluations. In particular, we introduce a hybrid methodology for the evaluation\nof large language models (LLMs) that leverages both human expertise and AI assistance. Our hybrid\nmethodology generalizes across both LLM capa...",0.573309
0bdb02694a7c018f,1,65f76af1c4dd344618b63ee30b24a23c,"What does ""reasoning"" mean in the context of evaluating large language models (LLMs)?","LLM Test and Evaluation\n2.2.2 Types of vulnerabilities\nAlong with our taxonomy of risks, we also consider the types of vulnerabilities that may result in the various risks\ndescribed above. In contrast to existing standards such as the OWASP Top 10 for LLMs that cover broader structural\nvulnerabilities, such as supply chain vulnerabilities, the vulnerabilities below are focused on the potential misuse and\nunintended consequences of AI systems.\n1.Unreliability Unreliability refers to when the model produces harmful results unintentionally. These are\nsituations in which the users are using the model as expected, without any adversarial or malicious intent, but\nthe model outputs undesirable content. The harms caused by this vulnerability are typically related to harmful\ninformatio...",0.540085


In [7]:
## Evaluate Retrieval

from phoenix.evals import (
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4-turbo-preview", api_key=OPEN_AI_API_KEY))

relevance_evals = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]

run_evals |          | 0/21 (0.0%) | ⏳ 00:00<? | ?it/s

In [8]:
relevance_evals.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,label,score,explanation
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
27d940b42e985f1c,0,relevant,1,"The question asks for the meaning of ""distribution of sensitive content"" according to the provided documents. The reference text directly addresses this query under the section titled ""Distribution of sensitive content,"" where it explains that this term refers to instances where an LLM shares information that is sensitive, with the user intent likely being malicious. It further details the parties harmed (third party, other individuals) and the types of harms involved, such as violation of privacy and dissemination of graphic material. Therefore, the reference text contains specific information that is directly relevant to answering the question."
27d940b42e985f1c,1,relevant,1,"The question asks for the meaning of ""distribution of sensitive content"" according to the provided documents. The reference text directly addresses this query under the section titled ""2.2.1 Risks"" where it lists ""Distribution of sensitive content"" as one of the risk areas, defining it as a scenario where an LLM shares information that is sensitive. This directly relates to the question, providing a clear definition of the term within the context of risks associated with language model outputs. Therefore, the reference text contains information that is relevant to answering the question."
27d940b42e985f1c,2,unrelated,0,"The question asks for the meaning of ""distribution of sensitive content"" according to the provided documents. The reference text discusses the test and evaluation of language model (LLM) safety, including the methodology for evaluating LLM safety, the process of offloading unconfident examples to humans, and the evaluation of risks and vulnerabilities associated with LLMs. It covers topics such as entropy estimates for grading confidence, the use of human annotators for ambiguous cases, and the methodology for evaluating LLM safety through vulnerability analysis and exploitation. However, it does not specifically address the ""distribution of sensitive content"" or define it in any way. The reference text focuses on the evaluation process for LLM safety and does not provide information d..."
0bdb02694a7c018f,0,relevant,1,"The question asks for the meaning of ""reasoning"" in the context of evaluating large language models (LLMs). The reference text provides a detailed overview of a holistic approach for testing and evaluating LLMs, including the design of evaluation taxonomies, a novel framework for safety testing, and hybrid methodologies that improve the scalability and cost of evaluations. It discusses the challenges of evaluating LLMs, such as the need for comprehensive, scalable, and efficient methodologies, and the importance of safety and competence in these evaluations. Although the text does not explicitly define ""reasoning"" in the context of LLM evaluation, it implicitly addresses the cognitive processes involved in assessing LLMs' capabilities and safety, which are central to the concept of rea..."
0bdb02694a7c018f,1,unrelated,0,"The question asks for the meaning of ""reasoning"" in the context of evaluating large language models (LLMs). The reference text discusses various aspects of testing and evaluating LLMs, including vulnerabilities, methods for automatically testing and evaluating LLM capabilities, and the framework for evaluation. However, it does not specifically address the concept of ""reasoning"" as it pertains to LLMs or the evaluation process. The text focuses more on the technical and procedural aspects of LLM evaluation rather than the cognitive or logical processes (i.e., reasoning) involved in LLMs or their evaluation. Therefore, the reference text does not contain information that directly helps answer the question about the meaning of ""reasoning"" in the context of LLM evaluation."


In [9]:
## Evaluate Responses

from phoenix.evals import (
    OpenAIModel,
    QAEvaluator,
    HallucinationEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview", api_key=OPEN_AI_API_KEY))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview", api_key=OPEN_AI_API_KEY))

qa_evals, hallucination_evals = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_df,
    provide_explanation=True,
    concurrency=20,
)

run_evals |          | 0/14 (0.0%) | ⏳ 00:00<? | ?it/s

Add custom eval

In [10]:
ANSWER_RELEVANCE_TEMPLATE = ''' In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain irrelevant information. For the provided list of statements, 
determine whether each statement is relevant to address the input. If one or more statements are not relevant to the query, please label the answer 
as "irrelevant". If all statements are relevant to the query, please label the answer as "relevant".

Here is an example where the answer is "relevant" because the answer includes a recommendation for a winery with white wines (Chardonnay):
    # Query: What's a good place to go wine tasting for white wines in Napa?
    # Answer: Castle Winery has amazing Chardonnay.

Here is an example where the answer is "irrelevant" because the query is asking about white wines, but the answer recommends a winery based on its red wine (Cabernet):
    # Query: Where can I go wine tasting for white wines in Napa?
    # Answer: Stags Leap has great Cabernet.

Please provide your evaluation for the query and answer below:
    # Query: {input}
    # Answer: {output}

Is the answer above relevant or irrelevant to the above query?'''

In [11]:
# import phoenix.experimental.evals.templates.default_templates as templates
from phoenix.evals import (
    llm_classify,
)

In [12]:
custom_qa_relevance_classifications = llm_classify(
    dataframe=qa_df, 
    template=ANSWER_RELEVANCE_TEMPLATE, 
    model=OpenAIModel(model="gpt-4-turbo-preview", api_key=OPEN_AI_API_KEY), 
    rails=["relevant", "irrelevant"],
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

llm_classify |          | 0/7 (0.0%) | ⏳ 00:00<? | ?it/s

In [13]:
custom_qa_relevance_classifications.head()

Unnamed: 0_level_0,label,explanation
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1
407ebff65544f91c,relevant,"The answer is relevant to the query as it directly addresses the query's focus on the distribution of sensitive content in the context of documents, specifically within the framework of large language models (LLMs). It explains what the distribution of sensitive content entails, identifies the risk category associated with LLMs, and outlines the types of harms that can result from this distribution, such as violation of privacy and dissemination of graphic material. This directly responds to the query's request for information on the distribution of sensitive content."
edbb4ea7ff9853ec,relevant,"The answer is relevant to the query. It provides a general explanation of what reasoning is, especially in the context of large language models (LLMs) and artificial intelligence (AI), which directly addresses the query about the nature of reasoning. The answer covers the process of how a model or system processes information, draws inferences, or makes decisions, which are key aspects of reasoning. Therefore, all parts of the answer contribute to a comprehensive understanding of the concept of reasoning as asked in the query."
f1549d963f28bbcc,relevant,"The answer is relevant to the query. It provides a general explanation of what a mitigation strategy is, especially in the context of evaluating and testing large language models (LLMs). It outlines potential actions and techniques that could be part of a mitigation strategy, such as improving model reliability, reducing susceptibility to adversarial prompts, and aligning the model's actions with user intentions and ethical guidelines. Although it mentions that the provided documents do not explicitly detail a mitigation strategy, the explanation given directly addresses the query by defining and discussing the concept of a mitigation strategy."
5b3abc8ea9afc82f,relevant,"The answer is relevant to the query. It directly addresses the question about what capabilities are by explaining that in the context of evaluating large language models (LLMs), capabilities refer to the broad categories of tasks or functions that LLMs can perform. It further elaborates on how these capabilities are categorized and provides examples of tasks within those categories, which directly responds to the query seeking information on capabilities."
cb1ddb6bd3ca84f3,relevant,"The answer directly addresses the query about the feasibility of completely automating LLM evaluations. It explains why full automation is not feasible and emphasizes the importance of human involvement in the evaluation process for accuracy and comprehensive coverage. Therefore, all statements in the answer are relevant to the query."


In [14]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

# Log the evaluations back to
client.log_evaluations(DocumentEvaluations(dataframe=relevance_evals, eval_name="document_relevance"),
                       SpanEvaluations(dataframe=custom_qa_relevance_classifications, eval_name="answer_relevance"),
                       SpanEvaluations(dataframe=qa_evals, eval_name="qa"),
                       SpanEvaluations(dataframe=hallucination_evals, eval_name="hallucination"))

In [15]:
spans_df = px.Client().get_spans_dataframe()

In [30]:
spans_df.head()

Unnamed: 0_level_0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,conversation,context.trace_id,...,attributes.embedding.model_name,attributes.embedding.embeddings,attributes.retrieval.documents,attributes.input.value,attributes.llm.token_count.completion,attributes.llm.token_count.total,attributes.llm.token_count.prompt,attributes.llm.output_messages,attributes.llm.prompt_template.variables,attributes.llm.prompt_template.template
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
418799fd283d951b,llm,LLM,407ebff65544f91c,2024-05-07T00:30:02.985424+00:00,2024-05-07T00:30:33.293316+00:00,OK,,[],,c45cd427908a22fd1545d50b69fdf0ad,...,,,,,,,,,,
3fe77ddaaa589c11,embedding,EMBEDDING,27d940b42e985f1c,2024-05-07T00:30:02.581977+00:00,2024-05-07T00:30:02.943923+00:00,OK,,[],,c45cd427908a22fd1545d50b69fdf0ad,...,text-embedding-3-small,"[{'embedding.text': 'What does ""distribution of sensitive content"" mean according to the provided documents?', 'embedding.vector': [0.05179618299007416, 0.0018338831141591072, 0.025626907125115395, 0.017016809433698654, 0.03449463099241257, -0.010101611725986004, 0.007735529448837042, 0.016759183257818222, 0.00014597337576560676, -0.0655723363161087, 0.039999671280384064, -0.016406644135713577, -0.04564030095934868, -0.006233846768736839, 0.025626907125115395, 0.0008317728061228991, 0.042765747755765915, 0.019321873784065247, -0.03148448467254639, -0.007498243357986212, 0.04997924715280533, 0.04257591813802719, -0.030426867306232452, 0.012921927496790886, -0.009986357763409615, -0.0394844189286232, -0.006549098528921604, -0.006684690713882446, 0.04610131308436394, -0.000921178841963410...",,,,,,,,
27d940b42e985f1c,retrieve,RETRIEVER,407ebff65544f91c,2024-05-07T00:30:02.581429+00:00,2024-05-07T00:30:02.975468+00:00,OK,,[],,c45cd427908a22fd1545d50b69fdf0ad,...,,,"[{'document.content': 'LLM Test and Evaluation B Risks Taxonomy Present risks 1.Harmful information LLM provides information that harms the user. User intent : Benign Party harmed: First party (user) Type of harms: • Misleading or misinforming a user (""hallucination"") • Causing material harm due to unqualiﬁed advice (e.g., medical, legal, ﬁnancial) • Leading users to perform unethical or illegal actions • Causing psychological harm due to toxic, graphic, or violent content 2.Harm against groups LLM provides information that can lead to harm to a group. User intent : Benign Party harmed: Third party (targeted groups) Type of harms: • Promoting discrimination • Promoting bias and exclusionary norms 3.Distribution of sensitive content LLM shares information that is sensitive. User intent ...","What does ""distribution of sensitive content"" mean according to the provided documents?",,,,,,
bd6c040efa62bcd0,llm,LLM,407ebff65544f91c,2024-05-07T00:30:00.079720+00:00,2024-05-07T00:30:02.566396+00:00,OK,,[],,c45cd427908a22fd1545d50b69fdf0ad,...,,,,,15.0,1437.0,1422.0,"[{'message.content': 'What does ""distribution of sensitive content"" mean according to the provided documents?', 'message.role': 'assistant'}]","{'chat_history': 'user: What is the capabilities evaluation framework? assistant: The capabilities evaluation framework, as outlined in the documents, is part of a holistic approach for testing and evaluating large language models (LLMs). This framework is designed to assess the capabilities of LLMs in a comprehensive and scalable manner, leveraging both human expertise and AI assistance to ensure precise and efficient evaluation. Here's a detailed breakdown of the capabilities evaluation framework based on the provided documents: 1. **Hybrid Methodology**: The framework employs a hybrid methodology that combines human evaluations with automated AI-assisted evaluations. This approach is aimed at generalizing across both the capabilities and safety of LLMs, identifying areas where AI a...","\n Given the following conversation between a user and an AI assistant and a follow up question from user,\n rephrase the follow up question to be a standalone question.\n\n Chat History:\n {chat_history}\n Follow Up Input: {question}\n Standalone question:"
407ebff65544f91c,chat,CHAIN,,2024-05-07T00:30:00.078122+00:00,2024-05-07T00:30:33.376901+00:00,UNSET,,[],,c45cd427908a22fd1545d50b69fdf0ad,...,,,,What is distribution of sensitive content in the context of the docs?,,,,,,


In [29]:
from arize.pandas.logger import Client

SPACE_KEY = ""
API_KEY = ""


if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")
    
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "julia-onboarding-model" # the model name in Arize
model_version = "1.0" # (optional) the model version

response = arize_client.log_spans(
    dataframe=spans_df,
    model_id=model_id,
    model_version=model_version, # optional
    validate=False
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged training set to Arize")

✅ Import and Setup Arize Client Done! Now we can start using Arize!
[38;21m  arize.utils.logging | INFO | Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjpRV05qYjNWdWRFOXlaMkZ1YVhwaGRHbHZiam8yT1RFMk9sSlNlVE09/spaces/U3BhY2U6VTNCaFkyVTZOekkyTmpvME5XdG8=/models/modelName/julia-onboarding-model[0m
✅ You have successfully logged training set to Arize
