# Tracing and Evaluating a Groq Agent Application
Observability is a critical part of building and maintaining a robust application. [Arize Phoenix](https://phoenix.arize.com) allows you to easily trace and evaluate your application. A Phoenix instance can be self-hosted or run in the cloud alongside your application, and [Arize's Groq instrumentor](https://docs.arize.com/phoenix/tracing/integrations-tracing/groq) lets you automatically capture traces, latency data, and token usage data.

This guide will walk you through the process of tracing and evaluating a basic Groq function calling agent application.

## Install dependencies & Set environment variables

In [None]:
%%bash
pip install -q "arize-phoenix>=4.29.0" "openinference-instrumentation-groq>=0.1.3" groq python-dotenv

In [None]:
import os
from getpass import getpass
import dotenv
dotenv.load_dotenv()

if not (groq_api_key := os.getenv("GROQ_API_KEY")):
    groq_api_key = getpass("🔑 Enter your Groq API key: ")

os.environ["GROQ_API_KEY"] = groq_api_key

## Connect to Phoenix

In this example, we'll connect to a cloud instance of Phoenix. If you'd rather self-host Phoenix, follow the instructions [here](https://docs.arize.com/phoenix/setup/environments).

To get an API key, sign up for a free account on [Arize Phoenix](https://phoenix.arize.com).

In [None]:
if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")

os.environ["PHOENIX_API_KEY"] = phoenix_api_key

In [None]:
from phoenix.otel import register
from openinference.instrumentation.groq import GroqInstrumentor

os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

tracer_provider = register(project_name="groq-function-calling-agent")

GroqInstrumentor().instrument(tracer_provider=tracer_provider)

Now that we have Phoenix configured, any calls to Groq in our application will be traced and logged to Phoenix. If you're incorporating Phoenix into an existing application, the code above is all you need to add to start tracing.

Read on for an example, and for details on running evaluations.

## Set up your Groq Agent

In [None]:
from groq import Groq

# The Groq client reads the API key from the GROQ_API_KEY environment variable
client = Groq()

Here we'll set up a basic Groq agent that can use tools to generate jokes, look up the weather, and calculate age.

In [None]:
import json

def generate_joke():
    """Generate a simple joke."""
    try:
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": "Generate a simple joke",
                }
            ],
            model="llama-3.1-8b-instant",
        )
        response = response.choices[0].message.content
    except Exception as e:
        print(f"Error generating joke: {e}")
        response = "Error: Could not generate joke."
    return response

def get_current_weather(location: str):
    """Get the current weather for a given location."""
    # This is a mock function. In a real scenario, you'd call a weather API.
    return json.dumps({"location": location, "temperature": "22°C", "condition": "Sunny"})

def calculate_age(birth_year: int):
    """Calculate age based on birth year."""
    from datetime import datetime
    current_year = datetime.now().year
    return current_year - birth_year

# Define the tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "generate_joke",
            "description": "Generate a simple joke",
            "parameters": {"type": "object", "properties": {}}
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_age",
            "description": "Calculate age based on birth year",
            "parameters": {
                "type": "object",
                "properties": {
                    "birth_year": {"type": "integer", "description": "The year of birth"}
                },
                "required": ["birth_year"]
            }
        }
    }
]
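
The tool schemas above are written by hand, so they can drift from the Python functions they describe. A minimal sketch of a consistency check, using only the standard library, might look like the following. The helper `check_tool_schemas`, the `demo_*` stand-in functions, and the `example_*` variables are hypothetical names introduced for illustration; they are not part of the Groq or Phoenix APIs. The check only compares each schema's `required` list with the function parameters that have no default value.

```python
import inspect


def check_tool_schemas(tools, functions):
    """Return a list of mismatches between tool schemas and Python functions.

    `functions` maps tool names to callables. An empty list means every schema
    names a known function and its `required` list matches the parameters that
    the function cannot be called without.
    """
    problems = []
    for tool in tools:
        spec = tool["function"]
        name = spec["name"]
        func = functions.get(name)
        if func is None:
            problems.append(f"{name}: no matching Python function")
            continue
        # Named parameters without a default must be supplied by the model.
        expected = {
            p.name
            for p in inspect.signature(func).parameters.values()
            if p.default is inspect.Parameter.empty
            and p.kind
            in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
        }
        declared = set(spec.get("parameters", {}).get("required", []))
        if expected != declared:
            problems.append(
                f"{name}: schema requires {sorted(declared)}, "
                f"function requires {sorted(expected)}"
            )
    return problems


# Hypothetical stand-ins with the same signatures as two of the tools above.
def demo_weather(location: str):
    return "{}"


def demo_age(birth_year: int):
    return 0


example_functions = {"get_current_weather": demo_weather, "calculate_age": demo_age}
example_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_age",
            "parameters": {
                "type": "object",
                "properties": {"birth_year": {"type": "integer"}},
                "required": ["birth_year"],
            },
        },
    },
]

print(check_tool_schemas(example_tools, example_functions))
```

In the notebook itself, the same helper could be called with the real `tools` list and a dictionary mapping each tool name to its function.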


In [None]:
from opentelemetry import trace
from openinference.semconv.trace import SpanAttributes

def call_agent(question: str):
    
    # Here we do a small amount of manual instrumentation to group all the calls our agent makes into a single span.
    # Phoenix will automatically create spans for each call to Groq, but this allows us to further group them.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("agent") as span:
        span.set_attribute(SpanAttributes.OPENINFERENCE_SPAN_KIND, "AGENT")
        span.set_attribute(SpanAttributes.INPUT_VALUE, question)
        
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": question,
                }
            ],
            model="llama-3.1-8b-instant",
            tools=tools,
            tool_choice="auto",
        )


        message = chat_completion.choices[0].message
        if message.tool_calls:
            for tool_call in message.tool_calls:
                function_name = tool_call.function.name
                function_args = json.loads(tool_call.function.arguments)
                
                if function_name == "get_current_weather":
                    result = get_current_weather(**function_args)
                elif function_name == "calculate_age":
                    result = calculate_age(**function_args)
                elif function_name == "generate_joke":
                    result = generate_joke(**function_args)
                else:
                    result = f"Unknown function: {function_name}"
                
                print(f"Result: {result}")
                span.set_attribute(SpanAttributes.OUTPUT_VALUE, str(result))
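        else:
            print(f"Message: {message.content}")
            # Record the direct reply so the agent span has an output to evaluate.
            span.set_attribute(SpanAttributes.OUTPUT_VALUE, str(message.content))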


In [None]:
questions = [
    "Tell me a joke",
    "What's the weather in San Francisco?",
    "I was born in 1990. How old am I?",
    "What's the weather in New York?",
    "Tell me a good joke"
]

for question in questions:
    call_agent(question)
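
The `if`/`elif` chain inside `call_agent` can also be expressed as a lookup table, which keeps the dispatch logic in one place as tools are added. The sketch below is one possible shape, under stated assumptions: `dispatch_tool_call` and `example_registry` are hypothetical names, the registry holds stand-in functions so the block runs without a Groq client, and the handling of empty or `null` argument strings is a defensive guess rather than documented Groq behavior.

```python
import json


def dispatch_tool_call(function_name, raw_arguments, registry):
    """Look up a tool by name and call it with JSON-decoded keyword arguments."""
    func = registry.get(function_name)
    if func is None:
        return f"Unknown function: {function_name}"
    try:
        # Treat a missing or empty argument string as "no arguments".
        args = json.loads(raw_arguments) if raw_arguments else {}
    except json.JSONDecodeError as e:
        return f"Error: could not parse arguments for {function_name}: {e}"
    if args is None:
        # A JSON "null" payload is also treated as "no arguments".
        args = {}
    if not isinstance(args, dict):
        return f"Error: arguments for {function_name} must be a JSON object"
    try:
        return func(**args)
    except TypeError as e:
        # Typically raised when a required argument is missing or an unexpected
        # one is supplied. It also catches TypeErrors raised inside the tool.
        return f"Error: bad arguments for {function_name}: {e}"


# Hypothetical stand-ins; in the notebook these would be the real tool functions.
example_registry = {
    "calculate_age": lambda birth_year: 2024 - birth_year,
    "generate_joke": lambda: "placeholder joke",
}

print(dispatch_tool_call("calculate_age", '{"birth_year": 1990}', example_registry))
print(dispatch_tool_call("generate_joke", "null", example_registry))
```

With a registry like this, the body of the tool-call loop reduces to a single call that passes `tool_call.function.name` and `tool_call.function.arguments`.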

## View traces in Phoenix

You should now see traces in [Phoenix](https://app.phoenix.arize.com/)!

## Download trace dataset from Phoenix

In [None]:
import phoenix as px

spans_df = px.Client().get_spans_dataframe(project_name="groq-function-calling-agent")
spans_df.head()

## Generate evaluations

Now that we have our trace dataset, we can generate evaluations for each trace. Evaluations can be generated in many different ways. Ultimately, we want to end up with a set of labels and/or scores for our traces.

You can generate evaluations using:
- Plain code
- Phoenix's [built-in LLM as a Judge evaluators](https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals)
- Your own [custom LLM as a Judge evaluator](https://docs.arize.com/phoenix/evaluation/how-to-evals/bring-your-own-evaluator)
- Other evaluation packages

As long as you format your evaluation results properly, you can upload them to Phoenix and visualize them in the UI.
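
As an illustration of the plain-code option, here is a minimal sketch of an evaluator for the mock weather tool. `evaluate_weather_output` is a hypothetical helper, not a Phoenix function. It assumes the output is the JSON string produced by `get_current_weather` above, and it returns a label, a score, and an explanation that could be placed in DataFrame columns of the same names.

```python
import json


def evaluate_weather_output(output):
    """Label an output VALID if it is a JSON object with the expected weather keys."""
    required_keys = {"location", "temperature", "condition"}
    try:
        payload = json.loads(output)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers a missing (None) output; JSONDecodeError covers free text.
        return {"label": "INVALID", "score": 0, "explanation": "Output is not valid JSON."}
    if not isinstance(payload, dict):
        return {"label": "INVALID", "score": 0, "explanation": "Output is not a JSON object."}
    missing = required_keys - payload.keys()
    if missing:
        return {
            "label": "INVALID",
            "score": 0,
            "explanation": f"Missing keys: {sorted(missing)}",
        }
    return {
        "label": "VALID",
        "score": 1,
        "explanation": "Output contains all expected weather fields.",
    }


sample = json.dumps({"location": "San Francisco", "temperature": "22°C", "condition": "Sunny"})
print(evaluate_weather_output(sample))
```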

For this example, we'll use an LLM as a Judge evaluator to determine whether the output of our agent is a valid response to the user's question.

In [None]:
LLM_EVALUATOR_TEMPLATE = """
You are a helpful assistant. You will be given a question and an answer. 
You should determine whether the answer is a valid response to the question.

Question:
{question}

Answer:
{answer}

Respond with an explanation for your answer, and a label of VALID or INVALID, nothing else.

Example Response:
EXPLANATION: The answer is valid because the user asks for an age and the answer contains an age. 
LABEL: VALID
"""

In [None]:
def evaluate_row(row):
    question = row['attributes.input.value']
    answer = row['attributes.output.value']
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": LLM_EVALUATOR_TEMPLATE.format(question=question, answer=answer),
            }
        ],
        model="mixtral-8x7b-32768",
    )
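    content = chat_completion.choices[0].message.content
    # rpartition always yields three parts, so a reply that omits or repeats
    # "LABEL" does not raise the ValueError that unpacking split() would.
    explanation, _, label = content.rpartition("LABEL")
    explanation = explanation.strip() or content.strip()
    label = "INVALID" if "INVALID" in label else "VALID"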
    return explanation, label
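
`evaluate_row` treats any reply that does not contain INVALID as VALID, so a malformed judge reply is silently counted as a pass. A stricter alternative, sketched here with the standard library only, returns `None` as the label when no `LABEL:` line is found, so unparsable replies can be inspected separately. `parse_judge_response` and the two pattern names are hypothetical, and the sketch assumes the judge follows the `EXPLANATION:` / `LABEL:` layout requested in the template.

```python
import re

# Match "LABEL: VALID" or "LABEL: INVALID", ignoring case and extra spaces.
LABEL_PATTERN = re.compile(r"LABEL:\s*(VALID|INVALID)\b", re.IGNORECASE)
# Capture the text after "EXPLANATION:" up to the label line or end of reply.
EXPLANATION_PATTERN = re.compile(
    r"EXPLANATION:\s*(.*?)(?=LABEL:|\Z)", re.IGNORECASE | re.DOTALL
)


def parse_judge_response(content):
    """Return (explanation, label); label is None if no LABEL line is found."""
    label_match = LABEL_PATTERN.search(content)
    label = label_match.group(1).upper() if label_match else None
    explanation_match = EXPLANATION_PATTERN.search(content)
    if explanation_match:
        explanation = explanation_match.group(1).strip()
    else:
        # Fall back to the whole reply so nothing the judge said is lost.
        explanation = content.strip()
    return explanation, label


print(parse_judge_response("EXPLANATION: It contains an age.\nLABEL: VALID"))
```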

In [None]:
spans_df['explanation'], spans_df['label'] = zip(*spans_df.apply(evaluate_row, axis=1))
spans_df['score'] = spans_df['label'].apply(lambda x: 1 if x == 'VALID' else 0)
spans_df.head()

We now have a DataFrame with an explanation, a label, and a score for each span, indicating whether the answer was judged a valid response to the question. Let's upload this to Phoenix.

## Upload evaluations to Phoenix

Our `spans_df` carries the span_id for each row along with the evaluation result. The span_id is what allows us to connect the evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named "label" and "score" to display in the UI.

In [None]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="Response Format", dataframe=spans_df))

You should now see evaluations in the Phoenix UI!

From here you can continue collecting and evaluating traces, or move on to one of these other guides:
* If you're interested in more complex evaluation and evaluators, start with [how to use LLM as a Judge evaluators](https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals)
* If you're ready to start testing your application in a more rigorous manner, check out [how to run structured experiments](https://docs.arize.com/phoenix/datasets-and-experiments/how-to-experiments/run-experiments)

![Function Calling Evaluations](images/function-calling-evals.png)
