# Lab 3.1: Evaluate Langfuse Traces using an external evaluation pipeline

#### An external evaluation pipeline is useful when you need:
- More control over when traces get evaluated. You could schedule the pipeline to run at specific times or in response to event-based triggers such as webhooks.
- Greater flexibility with your custom evaluations, when your needs go beyond what’s possible with the Langfuse UI
- Version control for your custom evaluations
- The ability to evaluate data using existing evaluation frameworks and pre-defined metrics

In this notebook, we will learn to implement an external evaluation pipeline by doing the following:
1. Create a synthetic dataset to test your models.
2. Use the Langfuse client to gather and filter traces of previous model runs
3. Evaluate these traces offline and incrementally
4. Add scores to existing Langfuse traces

## Prerequisites

> ℹ️ You can **skip these prerequisite steps** if you're in an instructor-led workshop using temporary accounts provided by AWS

In [None]:
# Run this cell to install dependencies only if you are not using the AWS workshop environment
!uv pip install --force-reinstall -U -r ./requirements.txt --quiet

Please make sure you have completed the prerequisites to set up the Langfuse project and added the API keys to the .env file so the notebook can connect to a self-hosted or cloud Langfuse environment.

1. Navigate to the directory `genai-ml-platform-examples/integration/genaiops-langfuse-on-aws/` within your workshop environment.

2. Locate the file named `.env.example` and create a copy of this file in the same directory, renaming the copy to `.env`.

3. Open the `.env` file in your editor and prepare to add your actual Langfuse credentials. You will need three values from your Langfuse project settings under the API Keys section.

The completed configuration in `.env` should follow this format:

```
LANGFUSE_PUBLIC_KEY=pk-lf-your-actual-public-key
LANGFUSE_SECRET_KEY=sk-lf-your-actual-secret-key
LANGFUSE_HOST=xxx
```

Save the file after adding your actual credential values. The notebook will load these environment variables automatically when executing the Langfuse integration exercises in agents two and three.
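Before moving on, it can help to confirm that all three variables are actually set. The sketch below is an illustration only: `missing_langfuse_vars` and `REQUIRED_VARS` are hypothetical names (not part of Langfuse or this workshop's utilities), and the helper simply reports which of the variable names above are absent or blank in a mapping such as `os.environ`.

```python
import os

# The three variable names listed in the .env format above
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the required variable names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


missing = missing_langfuse_vars()
if missing:
    print(f"Missing Langfuse settings: {', '.join(missing)}")
else:
    print("All Langfuse settings are present")
```

A check like this is only meaningful after the `.env` file has been loaded or the variables have been set manually.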

In [None]:
# If you completed the .env setup above, skip this cell.
# Otherwise, uncomment and set your Langfuse credentials below:

# import os
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # Your Langfuse project secret key
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # Your Langfuse project public key
# os.environ["LANGFUSE_HOST"] = "xxx"  # Your Langfuse host URL

## Initialization and Authentication Check
Run the following cells to initialize common libraries and clients.

In [None]:
import os
import sys

import boto3
from dotenv import load_dotenv
from langfuse.decorators import langfuse_context, observe


load_dotenv("../.env")

Initialize AWS Bedrock clients and check models available in your account.

In [None]:
# used to access Bedrock configuration
bedrock = boto3.client(service_name="bedrock", region_name="us-west-2")

# Check if Nova models are available in this region
models = bedrock.list_inference_profiles()
nova_found = False
for model in models["inferenceProfileSummaries"]:
    if (
        "Nova Pro" in model["inferenceProfileName"]
        or "Nova Lite" in model["inferenceProfileName"]
        or "Nova Micro" in model["inferenceProfileName"]
    ):
        print(f"Found Nova model: {model['inferenceProfileName']} - {model['inferenceProfileId']}")
        nova_found = True
if not nova_found:
    raise ValueError("No Nova models found in available models. Please ensure you have access to Nova models.")

Initialize the Langfuse client and check credentials are valid.

In [None]:
from langfuse import Langfuse


# langfuse client
langfuse = Langfuse()
if langfuse.auth_check():
    print("Langfuse has been set up correctly")
    print(f"You can access your Langfuse instance at: {os.environ['LANGFUSE_HOST']}")
else:
    print("Credentials not found or invalid. Check your Langfuse API key and host in the .env file.")

### Langfuse Wrappers for Bedrock Converse API 

In [None]:
sys.path.append(os.path.abspath(".."))  # Add parent directory to path
from config import MODEL_CONFIG
from utils import converse

# Generate synthetic data

In this notebook, we consider a use case in which an LLM generates product descriptions for advertising products on an e-commerce page. The first step is to generate a list of products and then, for each of them, instruct Amazon Nova Lite to generate a brief product description.

In [None]:
# Let's prompt the model to generate 50 products
messages = [
    {
        "role": "user",
        "content": "For a variety of 50 different product categories sold on an e-commerce website, \
    generate one product that is interesting to a consumer. The product names should be reflective of the actual product being sold. \
    Generate the 50 product items as comma separated values. Do not generate any additional words apart from the product names",
    },
]

# Make the API call to the Nova Lite model
model_response = converse(messages=messages, **MODEL_CONFIG["nova_lite"])

# Print the generated text
print("\n[Response Content Text]")
print(model_response)

In [None]:
# check the model generation outputs
products = [item.strip() for item in model_response.split(",")]

for prd in products:
    print(prd)
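
Model replies do not always follow the requested format exactly: a trailing comma, a stray line break, or a repeated item can slip in. The sketch below shows one defensive way to parse the comma-separated reply. `parse_products` is a hypothetical helper introduced here for illustration, and the sample string is made up.

```python
def parse_products(raw_text):
    """Split a comma-separated model reply into unique, non-empty product names."""
    seen = set()
    products = []
    for item in raw_text.split(","):
        name = item.strip()
        key = name.lower()  # compare case-insensitively when de-duplicating
        if name and key not in seen:
            seen.add(key)
            products.append(name)
    return products


# Made-up reply with an empty entry, a duplicate, and stray whitespace
sample = " Yoga Mat, Desk Lamp,, yoga mat,\nWater Bottle "
print(parse_products(sample))
```

Whether duplicates should be dropped depends on the dataset you want; here the first spelling of each name is kept.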

For each of the products, we will now generate product descriptions using Amazon Nova Lite and capture the traces in Langfuse using the `@observe()` decorator.

In [None]:
# Generate product descriptions for each product
prompt_template = "You are a product marketer and you need to generate detailed \
product descriptions for products which will be used for selling \
the product on an e-commerce website. Any catchy phrases from the \
descriptions will also be used for social media campaigns. \
From the product descriptions, customers should be able to understand \
how the product can help them in their lives but also be able to trust \
this company. Your descriptions are fun and engaging. \
Your answer should be 4 sentences at max."


@observe(name="Batch Product Description Generation")
def main():
    langfuse_context.update_current_trace(
        user_id="nova-user-1",
        session_id="nova-batch-generation-session",
        tags=["lab3.1"],
    )

    for product in products:
        print(f"Input: Generate a description for {product}")
        messages = [
            {"role": "system", "content": prompt_template},
            {"role": "user", "content": f"Generate a description for {product}"},
        ]
        response = converse(messages, metadata={"product": product}, **MODEL_CONFIG["nova_lite"])
        print(f"Output: {response} \n")


main()

langfuse_context.flush()

Now you should see these product descriptions in the Traces section of the Langfuse UI.

![Traces collected from the LLM generations](./images/product_description_traces.png "Langfuse Traces")


The goal of this tutorial is to show you how to build a model-based evaluation pipeline. Such pipelines can run in your CI/CD environment or in a separately orchestrated container service. No matter the environment you choose, three key steps always apply:

1. Fetch Your Traces: Get your application traces to your evaluation environment
2. Run Your Evaluations: Apply any evaluation logic you prefer
3. Save Your Results: Attach your evaluations back to the Langfuse trace used for calculating them.
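
The three steps above can be sketched as a small skeleton that is independent of any particular service. This is only an illustration under stated assumptions: `run_evaluation_pipeline` and its `fetch`, `evaluate`, and `save` parameters are hypothetical names, and the toy data stands in for real traces.

```python
def run_evaluation_pipeline(fetch, evaluate, save):
    """Fetch items, evaluate each one, and save each result.

    fetch:    callable returning an iterable of (item_id, output) pairs
    evaluate: callable taking an output and returning a score
    save:     callable taking (item_id, score)
    Returns the number of items that were scored.
    """
    scored = 0
    for item_id, output in fetch():
        if output is None:  # nothing to evaluate for this item
            continue
        save(item_id, evaluate(output))
        scored += 1
    return scored


# Toy stand-ins so the skeleton runs without any external service
fake_traces = [("t1", "short text"), ("t2", None), ("t3", "a somewhat longer text")]
results = {}
count = run_evaluation_pipeline(
    fetch=lambda: fake_traces,
    evaluate=len,  # placeholder "evaluation": length of the output
    save=results.__setitem__,
)
print(count, results)
```

Keeping the three steps as separate callables makes it easier to swap the evaluation logic or the storage target without touching the loop.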

***
Goal: This evaluation pipeline is executed on all the traces over the past 24 hours
***

## 1. Fetch the traces

The `fetch_traces()` function has arguments to filter the traces by tags, timestamps, and more. The `page` and `limit` arguments control pagination.

In [None]:
from datetime import datetime, timedelta


now = datetime.now()
last_24_hours = now - timedelta(days=1)

traces_batch = langfuse.fetch_traces(
    page=1,
    limit=1,
    tags="lab3.1",
    session_id="nova-batch-generation-session",
    from_timestamp=last_24_hours,
    to_timestamp=datetime.now(),
).data

print(f"Trace ID: {traces_batch[0].id}")

In [None]:
observations = langfuse.fetch_observations(trace_id=traces_batch[0].id, type="GENERATION").data
observations[0].output
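
The cells above read a single page of results. When a query matches more items than fit on one page, the pages have to be walked in turn. The sketch below shows that loop in a generic form: `iter_all_pages` and `fetch_page` are hypothetical names, and the toy dictionary stands in for a real paged API.

```python
def iter_all_pages(fetch_page, first_page=1):
    """Yield every item from a paged source until an empty page is returned.

    fetch_page: callable taking a page number and returning a list of items.
    """
    page = first_page
    while True:
        items = fetch_page(page)
        if not items:
            break
        yield from items
        page += 1


# Toy paged source standing in for a real API
fake_pages = {1: ["trace-a", "trace-b"], 2: ["trace-c"]}
all_items = list(iter_all_pages(lambda page: fake_pages.get(page, [])))
print(all_items)
```

With the Langfuse client from this notebook, `fetch_page` could be something like `lambda page: langfuse.fetch_traces(page=page, limit=50, tags="lab3.1").data`. That wiring is untested here and assumes the client returns an empty `data` list once the last page has been passed.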

## 2. Categorical Evaluation using LLM-as-a-judge

Evaluation functions should take a trace as input and yield a valid score.
When analyzing the outputs of your LLM applications, you may want to evaluate traits that are defined qualitatively such as readability, helpfulness or measures for reducing hallucinations such as completeness.

We're building product descriptions, and to ensure they resonate with customers, we want to measure readability. For more LLM-as-a-judge definitions, check out the judge-based evaluator prompts defined in the [Amazon Bedrock Evaluator Prompts](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html).

In [None]:
template_readability = """
You are a helpful agent that can assess an LLM response according to the given rubrics.

You are given a product description generated by an LLM. Your task is to assess the readability of the LLM response to the question, in other words, how easy it is for a typical reading audience to comprehend the response at a normal reading rate.

Please rate the readability of the response based on the following scale:
- unreadable: The response contains gibberish or could not be comprehended by any normal audience.
- poor readability: The response is comprehensible, but it is full of poor readability factors that make comprehension very challenging.
- fair readability: The response is comprehensible, but there is a mix of poor readability and good readability factors, so the average reader would need to spend some time processing the text in order to understand it.
- good readability: Very few poor readability factors. Mostly clear, well-structured sentences. Standard vocabulary with clear context for any challenging words. Clear organization with topic sentences and supporting details. The average reader could comprehend by reading through quickly one time.
- excellent readability: No poor readability factors. Consistently clear, concise, and varied sentence structures. Simple, widely understood vocabulary. Logical organization with smooth transitions between ideas. The average reader may be able to skim the text and understand all necessary points.

Here is the product description that needs to be evaluated: {prd_desc}

Firstly explain your response, followed by your final answer. You should follow the format
Explanation: [Explanation], Answer: [Answer],
where '[Answer]' can be one of the following:
```
unreadable
poor readability
fair readability
good readability
excellent readability
```
"""


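@observe(name="Readability Evaluation")
def generate_readability_score(trace_output):
    prd_desc_readability = template_readability.format(prd_desc=trace_output)
    readability_score = converse(
        messages=[{"role": "user", "content": prd_desc_readability}], **MODEL_CONFIG["nova_pro"]
    )
    # The judge is asked to reply as "Explanation: ..., Answer: ...".
    # Split on the last "Answer:" marker rather than assuming a blank line separates the parts.
    # If the marker is missing, the explanation is empty and the whole reply becomes the score.
    explanation, _, score = readability_score.rpartition("Answer:")
    explanation = explanation.strip().rstrip(",").strip()
    if explanation.startswith("Explanation:"):
        explanation = explanation[len("Explanation:"):].strip()
    score = score.strip().strip(",.").strip().lower()
    return explanation, score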


print(f"User query: {observations[0].input[1]['content']}")
print(f"Model answer: {observations[0].output}")
explanation, score = generate_readability_score(observations[0].output)
print(f"Readability: {score}, Explanation: {explanation}")
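
Judge models do not always follow the requested answer format, so it can be worth checking the answer against the rubric's labels before writing it to Langfuse. The sketch below is one way to do that. `READABILITY_LABELS` and `extract_label` are hypothetical names introduced here, and the sample replies are made up.

```python
# The five labels from the readability rubric above
READABILITY_LABELS = (
    "unreadable",
    "poor readability",
    "fair readability",
    "good readability",
    "excellent readability",
)


def extract_label(judge_text):
    """Return the rubric label found after the last 'Answer:' marker, or None."""
    # If there is no marker, rpartition leaves the whole text in `answer`
    _, _, answer = judge_text.rpartition("Answer:")
    answer = answer.lower()
    for label in READABILITY_LABELS:
        if label in answer:
            return label
    return None


reply = "Explanation: Mostly clear, with good readability factors overall, Answer: fair readability,"
print(extract_label(reply))
print(extract_label("Answer: banana"))
```

Returning `None` for an unrecognized answer lets the caller decide whether to skip the score, retry the judge, or log the raw reply.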

## 3. Add the evaluation to the trace

Now that we have generated a readability score as well as an explanation, we can use the Langfuse client to add scores to existing traces.

In [None]:
langfuse.score(
    trace_id=traces_batch[0].id,
    observation_id=observations[0].id,
    name="readability",
    value=score,
    comment=explanation,
)

# Putting everything together

We just saw how to do this for one trace. Now let's put it all together in a function that runs over the traces collected in the last 24 hours.

In [None]:
@observe(name="Batch Readability Evaluation")
def batch_evaluate():
    langfuse_context.update_current_trace(
        user_id="nova-user-1",
        session_id="nova-batch-evals-session",
        tags=["lab3.1"],
    )
    traces_batch = langfuse.fetch_traces(
        page=1,
        limit=1,
        tags="lab3.1",
        session_id="nova-batch-generation-session",
        from_timestamp=last_24_hours,
        to_timestamp=datetime.now(),
    ).data

    observations = langfuse.fetch_observations(trace_id=traces_batch[0].id, type="GENERATION").data

    for observation in observations[-5:]:  # Only evaluate the last 5 observations to save time
        print(f"Processing {observation.id}")
        if observation.output is None:
            print(f"Warning: observation {observation.id} had no generated output and was skipped")
            continue
        explanation, score = generate_readability_score(observation.output)
        langfuse.score(
            trace_id=traces_batch[0].id,
            observation_id=observation.id,
            name="readability",
            value=score,
            comment=explanation,
        )

In [None]:
batch_evaluate()

langfuse_context.flush()

#### If your pipeline ran successfully, you should now see scores added to your traces

![Langfuse Trace with score added for readability](./images/scored_trace.png "Scored trace on langfuse")
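
To evaluate incrementally, a pipeline needs some record of what it has already scored so that repeated runs do not score the same observation twice. The sketch below keeps that record in a local JSON file. It is only one possible approach: the helper names and the `scored_ids.json` file name are hypothetical, and the IDs are made up.

```python
import json
import tempfile
from pathlib import Path


def load_scored_ids(path):
    """Return the set of IDs recorded in the JSON file, or an empty set."""
    path = Path(path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def select_unscored(observation_ids, scored_ids):
    """Keep only IDs that have not been scored yet, preserving order."""
    return [obs_id for obs_id in observation_ids if obs_id not in scored_ids]


def save_scored_ids(path, scored_ids):
    """Persist the scored IDs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(scored_ids)))


# Simulate two runs against a temporary state file
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "scored_ids.json"
    scored = load_scored_ids(state_file)  # first run: nothing recorded yet
    todo = select_unscored(["o1", "o2"], scored)
    scored.update(todo)
    save_scored_ids(state_file, scored)
    # Second run sees one new observation
    second_run = select_unscored(["o1", "o2", "o3"], load_scored_ids(state_file))
print(todo, second_run)
```

In a shared or containerized environment, the state would more likely live in a database or object store than in a local file; the selection logic stays the same.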

### Congratulations
You have successfully finished Lab 3.1.

If you are at an AWS event, you can return to the workshop studio for additional instructions before moving into the next lab, where we will explore GenAI guardrails.