# Llamastack Eval

We would like to be able to evaluate our models and applications before they go into production.  
To do that we can use the llamastack eval endpoint.  
It allows us to run prompts and expected answers through different evaluations to make sure that the model answers as we expect.  
The prompts and expected answers can either be ones you add yourself, or be taken from an evaluation dataset such as this one from HuggingFace: https://huggingface.co/datasets/llamastack/simpleqa  

We will be testing two types of evaluations here:
- "subset_of", which tests whether the expected answer appears verbatim (an exact, case-sensitive match) inside the LLM output
- "llm_as_judge", which lets an LLM evaluate how similar the LLM output is to the expected answer

Here we will primarily test the backend endpoints, so that we also evaluate how effective our system prompts are, but all of this can be applied to the raw model as well to see how it performs. We will see an example of this in a later chapter ;)

## Set-up
Let's start by installing the llamastack client and pointing it to our llamastack server.

In [91]:
!pip install -q git+https://github.com/meta-llama/llama-stack.git@release-0.2.12

In [92]:
from llama_stack_client import LlamaStackClient
from pprint import pprint
base_url = "http://llamastack-with-config-service.default.svc.cluster.local:8321"
client = LlamaStackClient(
    base_url=base_url,
    timeout=600.0 # Default is 1 min which is far too little for some agentic tests, we set it to 10 min
)

# Llamastack Eval endpoint

Before we evaluate our backend endpoint, let's just try out llamastack eval and see how it works.  
Here we create some `handmade_eval_rows` with the input and the expected answer, but we also add the generated answer already filled out.  

## Subset Of

Let's start by setting the `scoring_params` to use the `subset_of` function mentioned earlier and see what it comes back with.

In [93]:
# Check ALL available providers
providers = client.providers.list()
print("All available providers:")
for provider in providers:
    print(f"- ID: {provider.provider_id}, Type: {provider.provider_type}")

INFO:httpx:HTTP Request: GET http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/providers "HTTP/1.1 200 OK"


All available providers:
- ID: basic, Type: inline::basic
- ID: llm-as-judge, Type: inline::llm-as-judge
- ID: meta-reference, Type: inline::meta-reference
- ID: vllm, Type: remote::vllm
- ID: vllm-llama-3-2-3b, Type: remote::vllm
- ID: vllm-llama-4-guard, Type: remote::vllm
- ID: model-context-protocol, Type: remote::model-context-protocol
- ID: brave-search, Type: remote::brave-search
- ID: tavily-search, Type: remote::tavily-search


In [94]:
handmade_eval_rows = [
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "Tapas are my favorites.",
        "expected_answer": "Tapas",
    },
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "I really like tapas.",
        "expected_answer": "Tapas",
    }
]
pprint(handmade_eval_rows)

scoring_params = {
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(
    input_rows=handmade_eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)

INFO:httpx:HTTP Request: POST http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/scoring/score "HTTP/1.1 200 OK"


[{'expected_answer': 'Tapas',
  'generated_answer': 'Tapas are my favorites.',
  'input_query': 'What is your favorite food?'},
 {'expected_answer': 'Tapas',
  'generated_answer': 'I really like tapas.',
  'input_query': 'What is your favorite food?'}]
ScoringScoreResponse(results={'basic::subset_of': ScoringResult(aggregated_results={'accuracy': {'accuracy': 0.5, 'num_correct': 1.0, 'num_total': 2}}, score_rows=[{'score': 1.0}, {'score': 0.0}])})


Hmm, we got half of the answers correct 🤔  
This is because `subset_of` is case-sensitive: the expected answer "Tapas" has a capital T, while the second generated answer only contains lowercase "tapas". As mentioned before, the subset_of function expects an exact match within the generated answer.
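
The result can be reproduced locally with a minimal sketch. It assumes that `basic::subset_of` amounts to a case-sensitive substring check of the expected answer inside the generated answer, which is consistent with the scores returned by the server. The helper names `subset_of_score` and `subset_of_score_ci` are hypothetical and not part of llamastack.

```python
def subset_of_score(row: dict) -> float:
    """Return 1.0 if the expected answer appears verbatim in the generated answer."""
    return 1.0 if row["expected_answer"] in row["generated_answer"] else 0.0


def subset_of_score_ci(row: dict) -> float:
    """Same check but ignoring case (hypothetical variant, not a llamastack function)."""
    return 1.0 if row["expected_answer"].lower() in row["generated_answer"].lower() else 0.0


rows = [
    {"generated_answer": "Tapas are my favorites.", "expected_answer": "Tapas"},
    {"generated_answer": "I really like tapas.", "expected_answer": "Tapas"},
]

# Case-sensitive scoring: only the first row contains "Tapas" with a capital T.
scores = [subset_of_score(r) for r in rows]
accuracy = sum(scores) / len(scores)
print(scores, accuracy)

# Case-insensitive scoring: both rows match.
scores_ci = [subset_of_score_ci(r) for r in rows]
print(scores_ci)
```

If case should not matter for your use case, normalising both strings before scoring (as in the second helper) or switching to an LLM judge are two possible options.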

## LLM as judge

Now let's try the same thing but with LLM as judge.  
Here we are going to feed our eval_rows into an LLM to evaluate how well the generated answer matches the expected answer.  
To make sure that the judge LLM does this, we also provide it a `JUDGE_PROMPT` it should follow, as well as a regex of expected scores (`judge_score_regexes`) from the judge.  
In our case we let it grade the generated answers from A to E (no F's in this class), each with its own meaning that you can see in the judge prompt.
We also choose Llama 3.2 to be our judge. This means that when we later evaluate replies from the backend, we will use the same LLM to generate the answers and to judge them, essentially a self-judge strategy. This is not always the best approach, but it works pretty well with Llama 3.2 and we don't have any other model to use right now.
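
To make the role of `judge_score_regexes` concrete, here is a small sketch of how a score could be pulled out of the judge's reply. It assumes the scorer tries each pattern in turn and keeps the first capture group of the first match; the actual llamastack implementation may differ, and `extract_score` is a hypothetical helper.

```python
import re
from typing import List, Optional

judge_score_regexes = ["Answer: (A|B|C|D|E)"]


def extract_score(judge_feedback: str, patterns: List[str]) -> Optional[str]:
    """Return the first capture group of the first matching pattern, or None."""
    for pattern in patterns:
        match = re.search(pattern, judge_feedback)
        if match:
            return match.group(1)
    return None


# A reply that follows the format requested in the judge prompt.
good_reply = "Answer: C, Explanation: Both responses mention tapas."
# A reply that ignores the requested format, so no score can be extracted.
bad_reply = "I think the two responses say the same thing."

print(extract_score(good_reply, judge_score_regexes))
print(extract_score(bad_reply, judge_score_regexes))
```

This is why the prompt spells out the exact answer format: if the judge drifts from it, the regex has nothing to capture.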

In [95]:
handmade_eval_rows = [
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "Tapas are my favorites.",
        "expected_answer": "Tapas",
    },
    {
        "input_query": "What is your favorite food?",
        "generated_answer": "I really like tapas.",
        "expected_answer": "Tapas",
    }
]
pprint(handmade_eval_rows)

judge_model_id = "llama-3-2-3b"
#judge_model_id = "llama-4-scout-17b-16e-w4a16"
#judge_model_id = "granite-31-2b-instruct"
JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
}

scoring_response = client.scoring.score(
    input_rows=handmade_eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)

[{'expected_answer': 'Tapas',
  'generated_answer': 'Tapas are my favorites.',
  'input_query': 'What is your favorite food?'},
 {'expected_answer': 'Tapas',
  'generated_answer': 'I really like tapas.',
  'input_query': 'What is your favorite food?'}]


INFO:httpx:HTTP Request: POST http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/scoring/score "HTTP/1.1 200 OK"


ScoringScoreResponse(results={'llm-as-judge::base': ScoringResult(aggregated_results={}, score_rows=[{'score': 'C', 'judge_feedback': 'Answer: C, Explanation: The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.'}, {'score': 'C', 'judge_feedback': 'Answer: C, Explanation: The GENERATED_RESPONSE and EXPECTED_RESPONSE both mention "tapas" as the favorite food, indicating that they contain all the same details.'}])})


You should have gotten at least a C from the judge, and you can see the reasoning for it in the `judge_feedback` field.  
Feel free to try out some other inputs, generated answers, and expected answers 🧪
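
Because `aggregated_results` comes back empty for the judge, a small local tally can help summarise the grades. The sketch below works on a hardcoded copy of the `score_rows` (with the feedback text abbreviated) and assumes that only grade D, a disagreement, counts as a failure; that pass/fail rule is our own choice, not something llamastack defines.

```python
from collections import Counter

# Abbreviated copy of the score rows returned by the judge.
score_rows = [
    {"score": "C", "judge_feedback": "Answer: C, Explanation: (abbreviated)"},
    {"score": "C", "judge_feedback": "Answer: C, Explanation: (abbreviated)"},
]

# Count how often each grade was given.
grade_counts = Counter(row["score"] for row in score_rows)

# Treat D (disagreement) as the only failing grade.
num_failed = grade_counts.get("D", 0)
pass_rate = 1 - num_failed / len(score_rows)

print(dict(grade_counts), pass_rate)
```

With a live response, the rows could be read from `scoring_response.results["llm-as-judge::base"].score_rows` instead of the hardcoded list.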

## LLM as a Judge for GitHub Issues

In [97]:
handmade_eval_rows = [
    {
        "input_query": "Pod 'database-primary-0' in namespace 'data-services' is in a `Pending` state. Reason: `Unschedulable`. 0/5 nodes are available: 5 node(s) had untolerated taint {node-role.kubernetes.io/master: }.",
        "generated_answer": """### Cluster/namespace location
data-services/database-primary-0

### Summary of the problem
The database pod is stuck in a 'Pending' state because no nodes are available for scheduling due to a node taint.

### Detailed error/code
State: Pending
Reason: Unschedulable
Message: 0/5 nodes are available: 5 node(s) had untolerated taint {node-role.kubernetes.io/master: }.

### Possible solutions
1. Add a toleration for the 'node-role.kubernetes.io/master' taint to the pod spec.
2. Ensure there are worker nodes available that do not have this taint.""",
        "expected_answer": ""
    },
    {
        "input_query": 'from pod \'user-auth-service-pod-xyz\' in ns \'app-dev\': `Error: secret "db-credentials" not found`',
        "generated_answer": """### Cluster/namespace location
app-dev/user-auth-service-pod-xyz

### Summary of the problem
The application failed to start because it could not find a required Kubernetes secret.

### Detailed error/code
Error: secret "db-credentials" not found""",
        "expected_answer": ""
    },
    {
        "input_query": "pod='job-processor-123' namespace='batch-jobs' log: `ConnectException: Connection refused (localhost:6379)`",
        "generated_answer": """### Cluster/namespace location
batch-jobs/job-processor-123

### Summary of problem
The job processor cannot connect to its Redis cache, resulting in a connection refused error.

### Detailed error/code
ConnectException: Connection refused (localhost:6379)

### Potential Fixes
1. Verify the Redis service is running and available at the specified address.
2. Check network policies that might be blocking the connection.""",
        "expected_answer": ""
    },
    {
        "input_query": "Container 'metrics-exporter' in pod 'web-server-abc' in ns 'monitoring' is restarting. Last state: Exit Code 1.",
        "generated_answer": """### Cluster/namespace location
monitoring/web-server-abc

### Summary of the problem
Container is restarting.

### Detailed error/code
Exit Code 1

### Possible solutions
Check the logs for more details.""",
        "expected_answer": ""
    }
]

pprint(handmade_eval_rows)

#judge_model_id = "llama-3-2-3b"
judge_model_id = "llama-4-scout-17b-16e-w4a16"
#judge_model_id = "granite-31-2b-instruct"

JUDGE_PROMPT = """
You are an expert in evaluating structured text formats for GitHub issues.
Your task is to determine if the GENERATED RESPONSE follows the required format based on the original INPUT QUERY.

The required format for the GitHub issue is as follows:
- Must contain a heading: '### Cluster/namespace location'
- Must contain a heading: '### Summary of the problem'
- Must contain a heading: '### Detailed error/code'
- Must contain a heading: '### Possible solutions'

Analyze the GENERATED RESPONSE and check if it includes all four of the required headings.
- If all four headings are present, the format is correct.
- If one or more headings are missing, the format is incorrect.

Provide your answer as a numerical score followed by a brief explanation.
Format: "Score: [score], Explanation: [your reasoning]"
- Use a score of 1 for a correct format.
- Use a score of 0 for an incorrect format.

Your actual task:

INPUT QUERY: {input_query}
GENERATED RESPONSE: {generated_answer}
EXPECTED RESPONSE: {expected_answer}
"""

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Score: (0|1)"],
    },
}

scoring_response = client.scoring.score(
    input_rows=handmade_eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)

[{'expected_answer': '',
  'generated_answer': '### Cluster/namespace location\n'
                      'data-services/database-primary-0\n'
                      '\n'
                      '### Summary of the problem\n'
                      "The database pod is stuck in a 'Pending' state because "
                      'no nodes are available for scheduling due to a node '
                      'taint.\n'
                      '\n'
                      '### Detailed error/code\n'
                      'State: Pending\n'
                      'Reason: Unschedulable\n'
                      'Message: 0/5 nodes are available: 5 node(s) had '
                      'untolerated taint {node-role.kubernetes.io/master: }.\n'
                      '\n'
                      '### Possible solutions\n'
                      '1. Add a toleration for the '
                      "'node-role.kubernetes.io/master' taint to the pod "
                      'spec.\n'
                      '2. Ensure t

INFO:httpx:HTTP Request: POST http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/scoring/score "HTTP/1.1 200 OK"


ScoringScoreResponse(results={'llm-as-judge::base': ScoringResult(aggregated_results={}, score_rows=[{'score': '1', 'judge_feedback': "Score: 1, Explanation: The GENERATED RESPONSE includes all four required headings: '### Cluster/namespace location', '### Summary of the problem', '### Detailed error/code', and '### Possible solutions'. The format is correct."}, {'score': '0', 'judge_feedback': "Score: 0, Explanation: The GENERATED RESPONSE is missing the '### Possible solutions' heading, which is one of the required headings. It includes the other three headings: '### Cluster/namespace location', '### Summary of the problem', and '### Detailed error/code'. \n\nHere is the complete evaluation:\n- '### Cluster/namespace location' is present.\n- '### Summary of the problem' is present.\n- '### Detailed error/code' is present.\n- '### Possible solutions' is missing.\n\nTherefore, the format is incorrect."}, {'score': '0', 'judge_feedback': "Score: 0, Explanation: The GENERATED RESPONSE is
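
Since this particular rule is purely structural, the judge's verdicts can be cross-checked with a deterministic check. The sketch below is an assumption-laden stand-in, not part of llamastack: `missing_headings` and `format_score` are hypothetical helpers that look for each required heading as a whole line.

```python
from typing import List

REQUIRED_HEADINGS = [
    "### Cluster/namespace location",
    "### Summary of the problem",
    "### Detailed error/code",
    "### Possible solutions",
]


def missing_headings(generated_answer: str) -> List[str]:
    """Return the required headings that do not appear as a full line."""
    lines = {line.strip() for line in generated_answer.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in lines]


def format_score(generated_answer: str) -> int:
    """Return 1 when all four headings are present, otherwise 0."""
    return 1 if not missing_headings(generated_answer) else 0


# Second handmade row: it has no '### Possible solutions' section.
sample = """### Cluster/namespace location
app-dev/user-auth-service-pod-xyz

### Summary of the problem
The application failed to start because it could not find a required Kubernetes secret.

### Detailed error/code
Error: secret "db-credentials" not found"""

print(format_score(sample), missing_headings(sample))
```

Running `format_score` over every `generated_answer` in `handmade_eval_rows` and comparing with the judge's scores is one way to spot rows where the LLM judge and the strict rule disagree.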

# Involve the LLM

So far we have just hardcoded the generated answers, but these should be generated by an LLM; otherwise we are just evaluating ourselves.  
To do this, let's send some requests to our LLM through llamastack, and then also to our backend, and see what that looks like.
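
A minimal sketch of that idea follows. It builds scoring rows from question and expected-answer pairs, filling `generated_answer` by calling a `generate` function. Both `build_eval_rows` and `fake_generate` are hypothetical names introduced here; the stub stands in for a real model or backend call so the example runs without a server.

```python
from typing import Callable, List, Tuple


def build_eval_rows(
    qa_pairs: List[Tuple[str, str]], generate: Callable[[str], str]
) -> List[dict]:
    """Turn (question, expected_answer) pairs into rows with generated answers."""
    rows = []
    for question, expected in qa_pairs:
        rows.append(
            {
                "input_query": question,
                "generated_answer": generate(question),
                "expected_answer": expected,
            }
        )
    return rows


def fake_generate(question: str) -> str:
    """Stand-in for a real model or backend call (always returns the same text)."""
    return "I really like tapas."


generated_eval_rows = build_eval_rows(
    [("What is your favorite food?", "Tapas")], fake_generate
)
print(generated_eval_rows)
```

With a live server, `generate` could wrap a request to the model through the llamastack client or an HTTP call to the backend; the exact call shape depends on the installed client version and should be checked against its documentation. The resulting rows can then be passed to `client.scoring.score` as `input_rows`, just like the handmade ones.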

# Datasets

Finally, let's use a dataset with already populated inputs and expected answers and see how well our model and backend do against those.

## SimpleQA

In [84]:
simpleqa_dataset_id = "huggingface::simpleqa"

_ = client.datasets.register(
    purpose="eval/messages-answer",
    source={
        "type": "uri",
        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
    },
    dataset_id=simpleqa_dataset_id,
)

eval_rows = client.datasets.iterrows(
    dataset_id=simpleqa_dataset_id,
    limit=5,
)

INFO:httpx:HTTP Request: POST http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/datasets "HTTP/1.1 400 Bad Request"


BadRequestError: Error code: 400 - {'detail': 'Invalid value: Provider `huggingface` not found'}

In [93]:
pprint(eval_rows)

### First just to the raw model

In [85]:
client.benchmarks.register(
    benchmark_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::base"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::base"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "llama32",
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
        },
    },
)
pprint(response)


INFO:httpx:HTTP Request: POST http://llamastack-with-config-service.default.svc.cluster.local:8321/v1/eval/benchmarks "HTTP/1.1 400 Bad Request"


BadRequestError: Error code: 400 - {'detail': 'Invalid value: No provider specified and multiple providers available. Please specify a provider_id.'}

### Then to an agent

In [98]:
agent_config = {
    "model": "llama32",
    "instructions": "You are a helpful assistant that has access to a tool to search the web.",
    "sampling_params": {
        "strategy": {
            "type": "top_p",
            "temperature": 0.5,
            "top_p": 0.9,
        }
    },
    "toolgroups": [
        "builtin::websearch",
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,
}

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::base"],
    benchmark_config={
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)
pprint(response)

INFO:httpx:HTTP Request: POST http://llama-stack.genaiops-rag.svc.cluster.local/v1/eval/benchmarks/meta-reference::simpleqa/evaluations "HTTP/1.1 200 OK"
