<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Quickstart: Datasets and Experiments</h1>

Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart walks through uploading a dataset, defining a task and evaluators, and running an experiment.

## Setup

Install Phoenix.

In [1]:
!pip install "arize-phoenix[evals]" openai "httpx<0.28"

Launch Phoenix.

In [None]:
import phoenix as px

px.launch_app().view()

Set your OpenAI API key.

In [None]:
import os
from getpass import getpass

import openai

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Datasets

Upload a dataset.

In [None]:
import pandas as pd

import phoenix as px

df = pd.DataFrame(
    [
        {
            "question": "What is Paul Graham known for?",
            "answer": "Co-founding Y Combinator and writing on startups and technology.",
            "metadata": {"topic": "tech"},
        }
    ]
)
phoenix_client = px.Client()
dataset = phoenix_client.upload_dataset(
    dataset_name="test-dataset",
    dataframe=df,
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["metadata"],
)
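To make the role of the key arguments concrete, here is a minimal plain-Python sketch of how a row might be partitioned into an example. The helper `split_row` is hypothetical and is not part of Phoenix; it only mirrors the idea that `input_keys`, `output_keys`, and `metadata_keys` select which columns become an example's input, expected output, and metadata.

```python
from typing import Any, Dict, List


def split_row(
    row: Dict[str, Any],
    input_keys: List[str],
    output_keys: List[str],
    metadata_keys: List[str],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: partition one row into input, output, and metadata."""
    return {
        "input": {key: row[key] for key in input_keys},
        "output": {key: row[key] for key in output_keys},
        "metadata": {key: row[key] for key in metadata_keys},
    }


row = {
    "question": "What is Paul Graham known for?",
    "answer": "Co-founding Y Combinator and writing on startups and technology.",
    "metadata": {"topic": "tech"},
}
example = split_row(row, ["question"], ["answer"], ["metadata"])
print(example["input"])  # {'question': 'What is Paul Graham known for?'}
```

Under this reading, a task looks up `input["question"]` and an evaluator looks up `expected["answer"]`, which matches how the functions in this quickstart access their arguments.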

## Tasks

Create a task to evaluate.

In [None]:
from openai import OpenAI

openai_client = OpenAI()

task_prompt_template = "Answer in a few words: {question}"


def task(input) -> str:
    question = input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
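While iterating on the prompt, it can be useful to check the rendered message without calling the API. The sketch below assumes two hypothetical helpers, `build_message` and `stub_task`; neither is part of Phoenix or the OpenAI client.

```python
from typing import Any, Dict

task_prompt_template = "Answer in a few words: {question}"


def build_message(example_input: Dict[str, Any]) -> str:
    """Hypothetical helper: render the prompt for one example input."""
    return task_prompt_template.format(question=example_input["question"])


def stub_task(input: Dict[str, Any]) -> str:
    """Hypothetical stand-in for the real task: returns the prompt, makes no model call."""
    return build_message(input)


message = stub_task({"question": "What is Paul Graham known for?"})
print(message)  # Answer in a few words: What is Paul Graham known for?
```

A stub like this keeps the same `input` parameter as the real task, so it could be swapped in for a dry run before spending API credits.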

## Evaluators

Use pre-built evaluators to grade task output with code...

In [None]:
from phoenix.experiments.evaluators import ContainsAnyKeyword

contains_keyword = ContainsAnyKeyword(keywords=["Y Combinator", "YC"])

or LLMs.

In [None]:
from phoenix.evals.models import OpenAIModel
from phoenix.experiments.evaluators import ConcisenessEvaluator

model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)
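As a rough mental model of the code-based keyword check, the following plain-Python sketch scores an output by substring matching. The function `contains_any` is hypothetical, and the case-sensitive matching is an assumption; the built-in evaluator may normalize text differently.

```python
from typing import Sequence


def contains_any(output: str, keywords: Sequence[str]) -> float:
    """Hypothetical sketch: 1.0 if any keyword is a substring of the output, else 0.0.

    Assumption: matching is a case-sensitive substring test.
    """
    return 1.0 if any(keyword in output for keyword in keywords) else 0.0


print(contains_any("He co-founded Y Combinator.", ["Y Combinator", "YC"]))  # 1.0
print(contains_any("He writes essays.", ["Y Combinator", "YC"]))  # 0.0
```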

Define custom evaluators with code...

In [None]:
from typing import Any, Dict


def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:
    # https://en.wikipedia.org/wiki/Jaccard_index
    # Split on any whitespace so newlines and repeated spaces do not create spurious tokens.
    actual_words = set(output.lower().split())
    expected_words = set(expected["answer"].lower().split())
    words_in_common = actual_words.intersection(expected_words)
    all_words = actual_words.union(expected_words)
    # Avoid dividing by zero when both strings are empty.
    return len(words_in_common) / len(all_words) if all_words else 0.0

or LLMs.

In [None]:
from phoenix.experiments.evaluators import create_evaluator

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).

QUESTION: {question}

REFERENCE_ANSWER: {reference_answer}

ANSWER: {answer}

ACCURACY (accurate / inaccurate):
"""


@create_evaluator(kind="llm")  # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    response_message_content = response.choices[0].message.content.lower().strip()
    return 1.0 if response_message_content == "accurate" else 0.0
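An exact string comparison treats a reply such as `Accurate.` as inaccurate. If that turns out to be a problem in practice, one option is to normalize the label before comparing. The helper `label_to_score` below is a hypothetical sketch, not part of Phoenix, and it assumes the model replies with a single word plus optional punctuation.

```python
import string


def label_to_score(raw_label: str) -> float:
    """Hypothetical helper: map a one-word verdict from the model to 1.0 or 0.0."""
    # Lowercase, then trim surrounding whitespace and punctuation, e.g. "Accurate.\n".
    cleaned = raw_label.lower().strip(string.whitespace + string.punctuation)
    return 1.0 if cleaned == "accurate" else 0.0


print(label_to_score("Accurate.\n"))  # 1.0
print(label_to_score("inaccurate"))  # 0.0
```

Anything other than the single word `accurate` still scores 0.0, so longer free-form replies are treated as failures rather than guessed at.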

## Experiments

Run an experiment and evaluate the results.

In [None]:
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[jaccard_similarity, accuracy],
)
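The two evaluators passed above declare different parameters (`output, expected` versus `input, output, expected`), which suggests arguments are matched by name. The sketch below is a simplified offline loop under that assumption; `call_with_known_args`, `run_offline`, `stub_task`, `word_overlap`, and `is_short` are all hypothetical names, and this is not how Phoenix is implemented internally.

```python
import inspect
from typing import Any, Callable, Dict, List

Example = Dict[str, Dict[str, Any]]


def call_with_known_args(fn: Callable[..., Any], available: Dict[str, Any]) -> Any:
    """Pass only the keyword arguments that `fn` declares in its signature."""
    wanted = inspect.signature(fn).parameters
    return fn(**{name: value for name, value in available.items() if name in wanted})


def run_offline(
    examples: List[Example],
    task: Callable[..., Any],
    evaluators: List[Callable[..., float]],
) -> List[Dict[str, Any]]:
    """Hypothetical loop: run the task on each example, then score its output."""
    results = []
    for example in examples:
        output = call_with_known_args(task, {"input": example["input"]})
        available = {"input": example["input"], "output": output, "expected": example["output"]}
        scores = {ev.__name__: call_with_known_args(ev, available) for ev in evaluators}
        results.append({"output": output, "scores": scores})
    return results


def stub_task(input: Dict[str, Any]) -> str:
    # Fixed answer so the sketch runs without a model call.
    return "Co-founding Y Combinator"


def word_overlap(output: str, expected: Dict[str, Any]) -> float:
    # Jaccard index over lowercase, whitespace-separated words.
    actual = set(output.lower().split())
    reference = set(expected["answer"].lower().split())
    return len(actual & reference) / len(actual | reference)


def is_short(output: str) -> float:
    # 1.0 when the output has at most five words.
    return 1.0 if len(output.split()) <= 5 else 0.0


examples = [
    {
        "input": {"question": "What is Paul Graham known for?"},
        "output": {"answer": "Co-founding Y Combinator and writing on startups and technology."},
    }
]
results = run_offline(examples, stub_task, [word_overlap, is_short])
print(results[0]["scores"])  # {'word_overlap': 0.375, 'is_short': 1.0}
```

Here the stub output shares 3 of 8 distinct words with the reference answer, giving 0.375. The real `run_experiment` additionally records runs and evaluations in Phoenix so they can be inspected in the UI.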

Run more evaluators after the fact.

In [None]:
from phoenix.experiments import evaluate_experiment

experiment = evaluate_experiment(experiment, evaluators=[contains_keyword, conciseness])

And iterate 🚀