# Evaluate Retrieval-Augmented Generation (RAG) pipelines with Ragas and Langfuse

In this notebook we'll explore ways to evaluate the quality of Retrieval-Augmented Generation (RAG) pipelines with open-source tools like [RAGAS](https://docs.ragas.io/en/v0.1.21/index.html), and use [Langfuse](https://langfuse.com/) to trace and manage those pipelines with traces and spans. We will create a Bedrock knowledge base and use batch RAG generation results to demonstrate offline evaluation and scoring.

> ℹ️ Note: This notebook requires user configurations for some steps. 
>
> When a cell requires user configurations, you will see a message like this callout with the 👉 emoji.
>
> Pay attention to the instructions with the 👉 emoji and perform the configurations in the AWS Console or in the corresponding cell before running the code cell.

## Pre-requisites

> If you haven't selected the kernel, please click on the "Select Kernel" button at the upper right corner, select Python Environments and choose ".venv (Python 3.9.20) .venv/bin/python Recommended".

> To execute each notebook cell, press Shift + Enter.

> ℹ️ You can **skip these prerequisite steps** if you're in an instructor-led workshop using temporary accounts provided by AWS

### Additional permissions for Amazon OpenSearch

To complete the manual Bedrock Knowledge Base setup steps in this notebook, your **AWS Console user/role** will need:

- [Permissions to work with Amazon OpenSearch vector collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html)
- Permission to **create IAM roles** and attach policies to them, including: `iam:AttachRolePolicy`, `iam:CreateRole`, `iam:DetachRolePolicy`, `iam:GetRole`, `iam:PassRole`, `iam:CreatePolicy`, `iam:CreatePolicyVersion`, and `iam:DeletePolicyVersion`.

> ℹ️ **Note:** In testing, we saw `NetworkError` issues when attempting to create Bedrock KBs using only the above-linked `aoss` policy statements. This was resolved by granting `aoss:*` on `*` instead, but you should consider reducing these permissions before using in production environments.

Refer to the [AWS Console for Identity and Access Management (IAM)](https://console.aws.amazon.com/iam/home?#/home) to grant permissions to your user or role.

### Dependencies and Environment Variables

In [None]:
# Install dependencies (you can skip this cell if you are using the AWS workshop environment)
!uv pip install --force-reinstall -U -r ./requirements.txt --quiet

Please make sure you have completed the prerequisites to set up the Langfuse project and added the API keys to the `.env` file, so the notebook can connect to your self-hosted or cloud Langfuse environment.

1. Navigate to the directory `genai-ml-platform-examples/integration/genaiops-langfuse-on-aws/` within your workshop environment.

2. Locate the file named `.env.example` and create a copy of this file in the same directory, renaming the copy to `.env`.

3. Open the `.env` file in your editor and prepare to add your actual Langfuse credentials. You will need three values from your Langfuse project settings under the API Keys section.

The completed configuration in `.env` should follow this format:

```
LANGFUSE_PUBLIC_KEY=pk-lf-your-actual-public-key
LANGFUSE_SECRET_KEY=sk-lf-your-actual-secret-key
LANGFUSE_HOST=xxx
```

Save the file after adding your actual credential values. The notebook loads these environment variables automatically with `load_dotenv` in the initialization cells below.

In [None]:
# If you completed the .env setup above, skip this cell.
# Otherwise, uncomment and set your Langfuse credentials below:

# import os
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."  # Your Langfuse project secret key
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # Your Langfuse project public key
# os.environ["LANGFUSE_HOST"] = "xxx"  # Your Langfuse host URL

See [Langfuse documentation](https://langfuse.com/docs/get-started) for more details.
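
Before initializing any clients, it can help to confirm that all three settings are actually present. The snippet below is a minimal sketch of such a check; `missing_langfuse_vars` is a hypothetical helper written for this notebook, not part of the Langfuse SDK, and it only checks that the values are non-empty, not that they are valid.

```python
import os

# The three variable names come from the .env format shown above.
REQUIRED_LANGFUSE_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the required Langfuse variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_LANGFUSE_VARS if not env.get(name)]


missing = missing_langfuse_vars()
if missing:
    print("Missing Langfuse settings:", ", ".join(missing))
else:
    print("All Langfuse settings are present.")
```

If any name is reported as missing, revisit the `.env` file or the manual `os.environ` cell above before continuing.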

## Initialization and Authentication Check
Run the following cells to initialize common libraries and clients.

In [None]:
import json
import os
from typing import Any

# External Dependencies:
import pandas as pd  # For working with tabular data
from dotenv import load_dotenv


load_dotenv("../.env")

Initialize AWS Bedrock clients and check models available in your account.

In [None]:
import boto3  # General Python SDK for AWS (including Bedrock)


# used to access Bedrock configuration
bedrock = boto3.client(service_name="bedrock", region_name="us-west-2")

bedrock_agent_runtime = boto3.client(service_name="bedrock-agent-runtime", region_name="us-west-2")

# Check which models are available in your account
models = bedrock.list_inference_profiles()
for model in models["inferenceProfileSummaries"]:
    print(model["inferenceProfileName"] + " - " + model["inferenceProfileId"])

Initialize the Langfuse client and check credentials are valid.

In [None]:
from langfuse import Langfuse


# langfuse client
langfuse = Langfuse()
if langfuse.auth_check():
    print("Langfuse has been set up correctly")
    print(f"You can access your Langfuse instance at: {os.environ['LANGFUSE_HOST']}")
else:
    print("Credentials not found or invalid. Check your Langfuse API key and host in the .env file.")

## Set up the Knowledge Base
Next, let's upload the documents to Amazon S3 and create a vector store (knowledge base) so we can perform retrieval-augmented generation (RAG) given a user query. In the following steps, we'll configure:

- An Amazon S3 bucket (`bucket_name`) to store our document corpus.
- A folder prefix under the bucket where artifacts will be stored.

In [None]:
from botocore.exceptions import ClientError


botosess = boto3.Session(region_name="us-west-2")
region = botosess.region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket_name = f"eval-{account_id}-{region}"
s3_prefix = "bedrock-rag-eval"

# Check whether the S3 bucket exists and create it if not.
# The client is created from the session so it targets the same region as the bucket.
s3 = botosess.client("s3")
try:
    s3.head_bucket(Bucket=bucket_name)
    print(f"Bucket {bucket_name} exists")
except ClientError:
    print(f"Creating bucket {bucket_name}")
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={"LocationConstraint": region})
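
The cell above is written for `us-west-2`. If you adapt it to `us-east-1`, note that Amazon S3 does not accept `us-east-1` as a `LocationConstraint`, so the `CreateBucketConfiguration` argument has to be left out there. A minimal sketch of how the arguments could be built for either case (`create_bucket_kwargs` is a hypothetical helper, not a boto3 function):

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for s3.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage with the names from the cell above (not executed here):
# s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
print(create_bucket_kwargs("example-bucket", "us-west-2"))
print(create_bucket_kwargs("example-bucket", "us-east-1"))
```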

### Upload the documents to Amazon S3

First, we'll need to upload the sample documents to Amazon S3 - for which you can just run the code cell below:

In [None]:
corpus_s3uri = f"s3://{bucket_name}/{s3_prefix}/corpus"
print(f"Syncing corpus to:\n{corpus_s3uri}/")

# We will use the AWS CLI to recursively sync the folder to the S3 bucket.
!aws s3 sync --quiet ./datasets/corpus {corpus_s3uri}/

### Create the knowledge base in AWS Console
> 👉 This section includes steps you'll need to take manually, not just running the code cells!

The simplest way to set up the actual Bedrock Knowledge Base for testing is **manually through the AWS Console**:

1. First, **open** the [AWS Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home?#/knowledge-bases) and select *Orchestration > Knowledge bases* from the left sidebar menu, as shown in the screenshot below:

    > ℹ️ **Check** you're working in the correct *AWS Region* in the top right corner of the UI

![KB Console](images/bedrock-kbs/01-bedrock-kb-console.png "Screenshot of AWS Console for Amazon Bedrock Knowledge Bases, showing 'Create knowledge base' action button")

2. Click the **Create knowledge base** button and select **Knowledge Base with vector store**. In the screen that opens:

- For **knowledge base name**, enter `example-squad-kb`
- For **knowledge base description**, you can provide (something like) `Demo knowledge base for question answering evaluation`
- Leave the other settings as default (allow creating a new execution role, and no tags)
- Choose Amazon S3 as the data source (default)

Your configuration should look like the screenshot below:

![KB Basics](images/bedrock-kbs/02a-create-kb-basics.png "Screenshot of step 1 in Bedrock Knowledge Base creation workflow: with KB name, description, (create new) execution role, and (empty) tags configured. At the end of the form, a 'Next' button is visible.")

3. In the **Next** screen, you'll configure the S3 data source.

    Leave the data source as S3, select the bucket and prefix you created in the previous step, and use the Amazon Bedrock default parser.

![](images/bedrock-kbs/02b-create-kb-data-source.png "Screenshot of Knowledge Base vector index settings including Cohere Embed Multilingual embedding model, and quick-create vector store. 'Next' button is visible.")

4. In the **Next** screen, you'll configure the vector index:

    For *embeddings model*, select `Amazon Titan Embeddings V2`

    For *Vector database*, select `Quick create a new vector store`

    You can find more information on this screen or in the [Amazon Bedrock Developer Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-setup.html) about the different vector stores Bedrock Knowledge Bases support. This default option will create a new [Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-overview.html) collection.

    Leave other settings at their defaults as shown below, and you should be ready to proceed:

![](images/bedrock-kbs/02c-create-kb-index.png "Screenshot of Knowledge Base vector index settings including Cohere Embed Multilingual embedding model, and quick-create vector store. 'Next' button is visible.")

5. Click **Next** to review your configuration, and then **Create knowledge base** to complete the process.

    > ⏰ It might take **a few minutes** for the creation to complete. A progress indicator banner should be visible if you scroll up. Alternatively in a separate tab, you could check the [Amazon OpenSearch Serverless Collections console](https://console.aws.amazon.com/aos/home?#opensearch/collections) - where you should see the underlying vector collection being created.

    Once your Knowledge Base has been created successfully, you'll be directed to its detail screen as shown below:

![](images/bedrock-kbs/03-kb-detail-page.png "Detail screen for the created Amazon Bedrock Knowledge Base, showing creation success banner. Includes sections 'Knowledge base overview' (containing the KB ID, name, and other details); 'Tags' (empty); 'Data source' (one Amazon S3 data source listed); 'Embeddings model' (Cohere Embed); and an interactive 'Test knowledge base' chat sidebar on the right with a warning that some data sources have not been synced.")

6. As mentioned in the alert box shown above, your new knowledge base will not contain your documents until we **sync** the data source:

    **Select** your S3 data source by selecting the checkbox to the left of its name in the data sources list, and click the **Sync** button above to start the sync.

    The *Status* will change to `Syncing` for a few seconds, after which it will return to `Available`.

![](images/bedrock-kbs/04a-kb-data-source-after-sync.png "Screenshot of KB 'data source' section after running sync, with the data source selected and status showing as 'available'")

With the sync completed, your Knowledge Base should be ready to use.

Optionally, you can click into your data source to check the sync `Added` the 20 files as expected:

<img src="images/bedrock-kbs/04b-kb-data-sync-details.png" width="600" alt="Data source details screen showing sync completed successfully with 20 files detected and added to the index, and 0 files failed"/>

### Test out the Knowledge Base

Before we discuss evaluation at scale, let's run a test query to check the KB is working properly. Go back to the detail page of the knowledge base.

For example, the knowledge base ID in the screenshot below is `Z746ERZP5X`. You can find your own *Knowledge Base ID* at the top of the page in the *Knowledge Base overview* panel.

![](images/bedrock-kbs/04c-kb-main-page.png "Screenshot of the main page of the knowledge base")

👉 **Replace** the below placeholder with your knowledge base's unique ID, and run the cells below to continue:

In [None]:
knowledge_base_id = "<TO FILL>"  # Something like "Z746ERZP5X"

With the ID identified, you can use the Bedrock Agent Runtime [RetrieveAndGenerate API](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent-runtime_RetrieveAndGenerate.html) to query your knowledge base.

In [None]:
query = "What kind of economy does Victoria have?"

In [None]:
# Use the RetrieveAndGenerate API with Nova Pro model to query the knowledge base
rag_resp = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": query},
    retrieveAndGenerateConfiguration={
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": knowledge_base_id,
            "modelArn": f"arn:aws:bedrock:us-west-2:{account_id}:inference-profile/us.amazon.nova-pro-v1:0",
        },
        "type": "KNOWLEDGE_BASE",
    },
    # Optional session ID can help improve results for follow-up questions:
    # sessionId='string'
)

print("Plain text response:")
print("--------------------")
print(rag_resp["output"]["text"], end="\n\n\n")

print("Full API output:")
print("----------------")
rag_resp

As shown in the full API response from the above cell, the `RetrieveAndGenerate` action provides:

- The final text answer
- The `retrievedReferences` from the search engine
- Specific `citations` localizing which references should be cited by different parts of the text answer
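
To make the nesting of these three parts easier to see, here is a minimal sketch that walks a response and lists which source supports which part of the answer. The `sample_resp` dictionary is hand-written placeholder data that mirrors the response shape, not real output from the knowledge base, and `summarize_citations` is a hypothetical helper.

```python
# Hand-written placeholder data mirroring the RetrieveAndGenerate response shape.
sample_resp = {
    "output": {"text": "<full answer text>"},
    "citations": [
        {
            "generatedResponsePart": {"textResponsePart": {"text": "<first answer sentence>"}},
            "retrievedReferences": [
                {
                    "content": {"text": "<retrieved snippet>"},
                    "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/corpus/doc_1.txt"}},
                }
            ],
        }
    ],
}


def summarize_citations(resp):
    """Return one row per (answer part, supporting reference) pair (hypothetical helper)."""
    rows = []
    for cite in resp.get("citations", []):
        part = cite.get("generatedResponsePart", {}).get("textResponsePart", {}).get("text", "")
        for ref in cite.get("retrievedReferences", []):
            rows.append(
                {
                    "answer_part": part,
                    "source_uri": ref["location"]["s3Location"]["uri"],
                    "snippet": ref["content"]["text"],
                }
            )
    return rows


for row in summarize_citations(sample_resp):
    print(row["answer_part"], "<-", row["source_uri"])
```

The sketch assumes every reference comes from an S3 data source; other data source types would use a different key under `location`.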


It's also possible to run **only the retrieval** through the API, and skip the generative answer synthesis step - as shown below:

In [None]:
retrieve_resp = bedrock_agent_runtime.retrieve(
    knowledgeBaseId=knowledge_base_id,
    retrievalQuery={"text": query},
)
print(json.dumps(retrieve_resp["retrievalResults"], indent=2))

## Set up dataset and metrics for evaluation

### Load Dataset

For this example, we are going to use a dataset of reference input/output pairs that was built by querying a RAG system and curating the results. See below for instructions on how to fetch your production data from Langfuse.

The dataset contains the following columns:

- `question`: list[str] - These are the questions your RAG pipeline will be evaluated on.

- `contexts`: list[list[str]] - The contexts which were passed into the LLM to answer the question.

- `answer`: list[str] - The answer generated from the RAG pipeline and given to the user.

- `ground_truths`: list[list[str]] - The ground-truth answers to the questions. These can be ignored for online evaluation, since ground-truth data is not available for live traffic.

For the details of this dataset, please refer to [Exploding Gradients Dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval)


Let's start by loading the dataset.

In [None]:
from datasets import load_dataset


fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")["baseline"]
fiqa_eval

### RAGAS metrics
We're going to measure the following aspects of a RAG system. These metrics are defined in [RAGAS](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/):

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/): This measures the factual consistency of the generated answer against the given context.
- [Response relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/): The ResponseRelevancy metric measures how relevant a response is to the user input.
- [Context precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/): Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked high. Ideally all the relevant chunks must appear at the top ranks.

Check out the [RAGAS documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/) to learn more about these metrics and how they work.
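
To build some intuition for the ranking aspect of context precision, the sketch below computes the rank-weighted average that, as we understand the Ragas documentation, underlies the metric: the mean of precision@k over the ranks that hold a relevant chunk. In Ragas the relevance verdicts come from an LLM judge; here they are supplied by hand, and `context_precision_at_k` is a hypothetical function for illustration only, not the library implementation.

```python
def context_precision_at_k(verdicts):
    """Mean of precision@k over relevant ranks.

    verdicts: list of 0/1 relevance flags for the retrieved chunks, top-ranked first.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_so_far += 1
            # precision@rank, counted only at ranks holding a relevant chunk
            weighted_sum += relevant_so_far / rank
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


# Same two relevant chunks, ranked differently:
print(round(context_precision_at_k([1, 1, 0]), 3))  # relevant chunks at the top
print(round(context_precision_at_k([0, 1, 1]), 3))  # relevant chunks at the bottom
```

The first ranking scores higher than the second even though both contain the same two relevant chunks, which is the behaviour the metric description above asks for.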

In [None]:
# import metrics
from ragas.metrics import (
    Faithfulness,
    LLMContextPrecisionWithoutReference,
    ResponseRelevancy,
)


# metrics you chose
metrics = [
    Faithfulness(),
    ResponseRelevancy(),
    LLMContextPrecisionWithoutReference(),
]

In [None]:
from ragas.metrics.base import MetricWithEmbeddings, MetricWithLLM
from ragas.run_config import RunConfig


# util function to init Ragas Metrics
def init_ragas_metrics(metrics, llm, embedding):
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            print(metric.name + " llm")
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            print(metric.name + " embedding")
            metric.embeddings = embedding
        run_config = RunConfig()
        metric.init(run_config)

Now we have to initialize the metrics with the LLM and embedding model of your choice. In this example we use the Amazon Nova Pro model and the Cohere Embed English v3 embedding model on Amazon Bedrock (Amazon Titan Text Embeddings V2 is noted as an alternative in the config), through the convenience wrappers from the `langchain-aws` library.

In [None]:
from langchain_aws import BedrockEmbeddings, ChatBedrockConverse
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper


config = {
    "region_name": "us-west-2",  # E.g. "us-east-1"
    "llm": "us.amazon.nova-pro-v1:0",  # E.g you can also use the claude models "anthropic.claude-3-5-sonnet-20241022-v2:0"
    "embeddings": "cohere.embed-english-v3",  # E.g or "amazon.titan-embed-text-v2:0"
    "temperature": 0.4,
}

evaluator_llm = LangchainLLMWrapper(
    ChatBedrockConverse(
        region_name=config["region_name"],
        base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
        model=config["llm"],
        temperature=config["temperature"],
    )
)

evaluator_embeddings = LangchainEmbeddingsWrapper(
    BedrockEmbeddings(
        region_name=config["region_name"],
        model_id=config["embeddings"],
    )
)

init_ragas_metrics(
    metrics,
    llm=evaluator_llm,
    embedding=evaluator_embeddings,
)

## Trace eval results with Langfuse

You can use model-based evaluation with Ragas in 2 ways:
1. Score every trace: You run the evaluations for each trace item. This gives you a much better idea of how each call made to your RAG pipeline is performing, but please be mindful of the cost.

2. Score with sampling: You take random samples of traces on a periodic basis and score them. This brings down the cost and gives you a rough estimate of the performance of your app, but may miss important samples.

In this example, we will demonstrate both approaches using a prebuilt dataset and a live RAG pipeline with a Bedrock Knowledge Base.

### Score every trace

Let's take a small example of a single trace and see how you can score it with Ragas. We first define a utility function to score your trace with the metrics you chose.

In [None]:
from ragas.dataset_schema import SingleTurnSample


async def score_with_ragas(query, chunks, answer, metrics):
    scores = {}
    for metric in metrics:
        sample = SingleTurnSample(
            user_input=query,
            retrieved_contexts=chunks,
            response=answer,
        )
        print(f"calculating {metric.name}")
        scores[metric.name] = await metric.single_turn_ascore(sample)
    return scores

#### Scoring sample dataset item

Here the score is computed with each request. Below we will go through a dummy application that does the following steps:

- Gets a question from the user
- Fetches context from the database or vector store that can be used to answer the question from the user
- Passes the question and the contexts to the LLM to generate the answer

In this case we are demonstrating the use of the Langfuse Python [low-level SDK](https://langfuse.com/docs/sdk/python/low-level-sdk) to log the traces with more granular control. You can also see an example with the [decorator](https://langfuse.com/docs/sdk/python/decorators) in a later section, or read more about both in the [Langfuse documentation](https://langfuse.com/docs/sdk/overview).

In [None]:
# start a new trace when you get a question
row = fiqa_eval[0]
question = row["question"]
contexts = row["contexts"]
answer = row["answer"]


# Create trace with proper input
trace = langfuse.trace(name="rag-fiqa", input={"question": question}, metadata={"dataset": "fiqa"})

# Create retrieval span and properly end it
retrieval_span = trace.span(name="retrieval", input={"question": question})
retrieval_span.end(output={"contexts": contexts})

# Create generation and properly end it
generation = trace.generation(name="generation", input={"question": question, "contexts": contexts})
generation.end(output={"answer": answer})

# End the trace with final output
trace.update(output={"answer": answer})

# compute scores for the question, context, answer tuple
ragas_scores = await score_with_ragas(question, contexts, answer, metrics)
ragas_scores

In [None]:
print(
    f"Now you can see this is traced in langfuse but with no score attached, we can check it in the Langfuse UI at:\n{os.environ['LANGFUSE_HOST']}"
)

You can then attach the scores to the trace by running the following cell:

In [None]:
for m in metrics:
    langfuse.score(name=m.name, value=ragas_scores[m.name], trace_id=trace.id)

The scores are now attached to the trace:

![](images/bedrock-kbs/04e-langfuse-single-eval-trace-score.png)

#### Scoring RAG
We have already set up the Bedrock Knowledge Base in the first section, so we can now **evaluate** the quality of its results against a test dataset to help us **optimize** the configuration for high quality and low cost.

First, let's load the sample dataset of questions, reference answers, and their source documents (for details on how this dataset was prepared, see [this notebook on GitHub](https://github.com/aws-samples/llm-evaluation-methodology/blob/main/datasets/Prepare-SQuAD.ipynb)):


In [None]:
dataset_df = pd.read_json("datasets/qa.manifest.jsonl", lines=True)
dataset_df.head(10)

Records in this dataset include:

- (`doc`) The full text of the source document for this example
- (`doc_id`) A unique identifier for the source document
- (`question`) The user question to be asked
- (`question_id`) A unique identifier for the question
- (`answers`) A list of (possibly multiple) reference 'correct' answers, supported by the document

As shown in [Ragas' API Reference](https://docs.ragas.io/en/latest/references/evaluation.html), records in Ragas evaluation datasets typically include:

- The `question` that was asked
- The `answer` the system generated
- The actual text `contexts` the answer was based on (i.e. snippets of document text retrieved by the search engine)
- The `ground_truth` answer(s)
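
The two layouts above can be bridged with a small mapping. The following is a minimal sketch, assuming the RAG output is a dictionary with the `answer` and `retrieved_doc_texts` keys that the `retrieve_and_generate()` function in this notebook returns, and assuming the first entry in `answers` is used as the single `ground_truth`. `to_ragas_record` is a hypothetical helper and the sample values are placeholders.

```python
def to_ragas_record(manifest_rec, rag_output):
    """Combine one manifest record and one RAG output into a Ragas-style record (hypothetical helper)."""
    return {
        "question": manifest_rec["question"],
        "answer": rag_output["answer"],
        "contexts": rag_output["retrieved_doc_texts"],
        # Assumption: pick the first reference answer as the ground truth.
        "ground_truth": manifest_rec["answers"][0],
    }


# Placeholder data in the shape described above.
example_manifest_rec = {
    "doc": "<full document text>",
    "doc_id": "doc_1",
    "question": "<user question>",
    "question_id": "q_1",
    "answers": ["<reference answer 1>", "<reference answer 2>"],
}
example_rag_output = {
    "answer": "<generated answer>",
    "retrieved_doc_ids": ["doc_1"],
    "retrieved_doc_texts": ["<retrieved snippet>"],
}

print(to_ragas_record(example_manifest_rec, example_rag_output))
```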

Here we will integrate [Langfuse Tracing](https://langfuse.com/docs/tracing) into the RAG pipeline with the Langfuse Python SDK using the `@observe()` decorator.

We can run an example question through the Bedrock KB retrieve and generate pipeline as shown below, and extract the references ready to calculate metrics.

In [None]:
from langfuse.decorators import langfuse_context, observe


@observe(name="Knowledge Base Retrieve and Generate")
def retrieve_and_generate(
    question: str,
    kb_id: str,
    generate_model_arn: str = f"arn:aws:bedrock:us-west-2:{account_id}:inference-profile/us.amazon.nova-pro-v1:0",
    **kwargs,
):
    rag_resp = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": generate_model_arn,
            },
            "type": "KNOWLEDGE_BASE",
        },
    )
    answer = rag_resp["output"]["text"]

    # Fetch flat list of references from the nested citations -> retrievedReferences:
    all_refs = [r for cite in rag_resp["citations"] for r in cite["retrievedReferences"]]
    contexts = [r["content"]["text"] for r in all_refs]
    ref_s3uris = [r["location"]["s3Location"]["uri"] for r in all_refs]
    # Map e.g. 's3://.../doc_id.txt' to 'doc_id':
    ref_ids = [uri.rpartition("/")[2].rpartition(".")[0] for uri in ref_s3uris]

    # Log additional data to the trace
    langfuse_context.update_current_observation(
        input={"question": question, "contexts": contexts},
        output=answer,
        model="us.amazon.nova-pro-v1:0",
        session_id="kb-rag-session",
        tags=["dev"],
        metadata=kwargs,
    )

    # Get the trace ID for independent scoring
    trace_id = langfuse_context.get_current_trace_id()
    return {
        "answer": answer,
        "retrieved_doc_ids": ref_ids,
        "retrieved_doc_texts": contexts,
        "trace_id": trace_id,
    }
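
One detail of the URI-to-ID mapping above is worth knowing: `rpartition(".")[0]` returns an empty string when the file name has no extension. The sketch below shows a variant based on `posixpath` that keeps the name in that case. `doc_id_from_uri` is a hypothetical helper and the bucket and file names are made up.

```python
import posixpath


def doc_id_from_uri(uri):
    """Map e.g. 's3://.../doc_id.txt' to 'doc_id', keeping names that have no extension."""
    filename = posixpath.basename(uri)
    stem, _extension = posixpath.splitext(filename)
    return stem


for uri in ["s3://example-bucket/corpus/doc_1.txt", "s3://example-bucket/corpus/doc_2"]:
    rpartition_id = uri.rpartition("/")[2].rpartition(".")[0]
    print(f"{uri}: rpartition={rpartition_id!r}, splitext={doc_id_from_uri(uri)!r}")
```

For the sample corpus in this notebook, where every file has an extension, both approaches give the same result.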

Run RAG as requests come in and score the results immediately.

In [None]:
from typing import Any, Optional

from langfuse.decorators import langfuse_context, observe


# Optional[...] is used instead of "X | None" so the annotations also work on Python 3.9.
@observe(name="Knowledge Base Pipeline")
async def rag_pipeline(
    question,
    user_id: Optional[str] = None,
    session_id: Optional[str] = None,
    kb_id: Optional[str] = None,
    metrics: Optional[Any] = None,
):
    generated_answer = retrieve_and_generate(
        question=question,
        kb_id=kb_id,
        kwargs={"database": "Bedrock Knowledge Base", "kb_id": kb_id},
    )
    contexts = generated_answer["retrieved_doc_texts"]
    answer = generated_answer["answer"]
    trace_id = generated_answer["trace_id"]

    score = await score_with_ragas(question, contexts, answer=answer, metrics=metrics)
    
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
        tags=["dev"],
    )
    for s in score:
        langfuse.score(name=s, value=score[s], trace_id=trace_id)
    return generated_answer

In [None]:
response = await rag_pipeline(dataset_df.iloc[0]["question"], kb_id=knowledge_base_id, metrics=metrics)
response

### Scoring with sampling

Scoring every production trace can be time-consuming and costly, depending on your application architecture and traffic. In that case, it's better to start off with a sampling method. Decide on the time span each batch run should cover and the number of traces you want to sample from that time slice. Then create a dataset and call `ragas.evaluate` to analyze the result.

You can run this periodically to keep track of how the scores are changing across timeslices and figure out if there are any discrepancies.
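
As a minimal sketch of what such time slices could look like, the helper below splits the period before a given end time into equal windows. Each `(start, end)` pair could then be passed as the `from_timestamp` and `to_timestamp` filters when querying traces. `time_slices` is a hypothetical helper, and the 6-hour window is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone


def time_slices(end, slice_hours, count):
    """Return `count` consecutive (start, end) windows ending at `end`, oldest first."""
    width = timedelta(hours=slice_hours)
    return [(end - width * (i + 1), end - width * i) for i in range(count - 1, -1, -1)]


now = datetime.now(timezone.utc)
for start, stop in time_slices(now, slice_hours=6, count=4):
    print(f"{start.isoformat()} -> {stop.isoformat()}")
```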

We will evaluate the existing results generated previously by the `retrieve_and_generate()` function.

Simulate 10 production traces by running RAG on the first 10 questions in the dataset.

In [None]:
rag_generated_outputs = [
    retrieve_and_generate(
        question=rec.question,
        kb_id=knowledge_base_id,
        kwargs={"database": "Bedrock Knowledge Base", "kb_id": knowledge_base_id},
    )
    for _, rec in dataset_df.head(10).iterrows()
]
rag_generated_outputs[0]
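
Each output also carries `retrieved_doc_ids`, which allows a quick retrieval sanity check without any LLM calls: for how many questions did the knowledge base cite the document the question was written for? The sketch below assumes the corpus file names match the `doc_id` values in the dataset; `retrieval_hit_rate` is a hypothetical helper and the sample data is made up.

```python
def retrieval_hit_rate(expected_doc_ids, rag_outputs):
    """Fraction of questions whose source document appears among the cited references."""
    pairs = list(zip(expected_doc_ids, rag_outputs))
    if not pairs:
        return 0.0
    # IDs are compared as strings because the retrieved IDs are parsed from S3 URIs.
    hits = sum(1 for doc_id, out in pairs if str(doc_id) in out["retrieved_doc_ids"])
    return hits / len(pairs)


# Usage with the names from this notebook (not executed here):
# retrieval_hit_rate(dataset_df.head(10)["doc_id"], rag_generated_outputs)
example_outputs = [
    {"retrieved_doc_ids": ["doc_1", "doc_7"]},
    {"retrieved_doc_ids": ["doc_3"]},
]
print(retrieval_hit_rate(["doc_1", "doc_2"], example_outputs))
```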

Now that the results are uploaded to Langfuse, you can retrieve them as needed with this handy function.

In [None]:
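from datetime import datetime
from typing import Optional

from langfuse.api.resources.commons.types.trace_with_details import TraceWithDetails


# Optional[...] is used instead of "X | None" so the annotations also work on Python 3.9.
def get_traces(
    limit: int = 5,
    name: Optional[str] = None,
    user_id: Optional[str] = None,
    session_id: Optional[str] = None,
    from_timestamp: Optional[datetime] = None,
    to_timestamp: Optional[datetime] = None,
) -> list[TraceWithDetails]: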
    """Query Langfuse for traces matching the given filters.
    See https://langfuse.com/docs/query-traces for more details."""

    all_data = []
    page = 1

    while True:
        response = langfuse.fetch_traces(
            page=page,
            name=name,
            user_id=user_id,
            session_id=session_id,
            from_timestamp=from_timestamp,
            to_timestamp=to_timestamp,
        )
        if not response.data:
            break
        page += 1
        all_data.extend(response.data)
        if len(all_data) >= limit:  # stop once enough traces are collected
            break

    return all_data[:limit]
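
The paging pattern used here can be tried out without a Langfuse connection. The following is a minimal sketch with a fake page fetcher standing in for the API call; `collect_pages` and `fake_fetch` are hypothetical names used only for this illustration.

```python
def collect_pages(fetch_page, limit):
    """Collect items page by page until `limit` items are gathered or a page comes back empty."""
    items = []
    page = 1
    while len(items) < limit:
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items[:limit]


# Fake backend: two pages of three items each, then nothing.
fake_pages = {1: ["t1", "t2", "t3"], 2: ["t4", "t5", "t6"]}
requested_pages = []


def fake_fetch(page):
    requested_pages.append(page)
    return fake_pages.get(page, [])


print(collect_pages(fake_fetch, limit=4))
print("pages requested:", requested_pages)
```

Checking the limit before each request avoids fetching a page whose items would be discarded anyway.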

In [None]:
from random import sample


NUM_TRACES_TO_SAMPLE = 3
traces = get_traces(name="Knowledge Base Retrieve and Generate", limit=10)
if len(traces) > NUM_TRACES_TO_SAMPLE:  # noqa: SIM108
    traces_sample = sample(traces, NUM_TRACES_TO_SAMPLE)
else:
    traces_sample = traces

print(f"Sampled {len(traces_sample)} traces from {len(traces)} filtered traces")
for trace in traces_sample:
    print(f"Trace ID: {trace.id}")

Now let's make a batch and score it. Ragas uses a Hugging Face `Dataset` object to build the dataset and run the evaluation. If you run this on your own production data, use the right keys to extract the question, contexts and answer from the trace.

In [None]:
# score on a sample
evaluation_batch = {
    "question": [],
    "contexts": [],
    "answer": [],
    "trace_id": [],
}

# The loop variable is named trace_item so it does not shadow random.sample imported above.
for trace_item in traces_sample:
    evaluation_batch["question"].append(trace_item.input["question"])
    evaluation_batch["contexts"].append(trace_item.input["contexts"])
    evaluation_batch["answer"].append(trace_item.output)
    evaluation_batch["trace_id"].append(trace_item.id)

We use the Ragas `evaluate` function to score the entire dataset at once instead of one sample at a time. See [Ragas evaluate](https://docs.ragas.io/en/latest/references/evaluate/) for more information.

In [None]:
# run ragas evaluate
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy


ds = Dataset.from_dict(evaluation_batch)
evals_results = evaluate(
    ds,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    metrics=[Faithfulness(), ResponseRelevancy()],
)
evals_results

And that is it! You can see the scores over a time period. Let's render the results in a dataframe to see the scores.

In [None]:
df = evals_results.to_pandas()

# add the langfuse trace_id to the result dataframe
df["trace_id"] = ds["trace_id"]

df.head()

You can also push the scores back into Langfuse and attach them to the traces.

In [None]:
for _, row in df.iterrows():
    for metric_name in ["faithfulness", "answer_relevancy"]:
        langfuse.score(name=metric_name, value=row[metric_name], trace_id=row["trace_id"])
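
When pushing batch results, it may be worth skipping values that are missing or not a number, assuming a metric can come back as NaN for a row (for example if an evaluation call for that row failed). A minimal sketch of such a filter follows; `finite_scores` is a hypothetical helper and the sample row is made up.

```python
import math


def finite_scores(row, metric_names):
    """Return only the metrics in `row` that hold a finite numeric value."""
    scores = {}
    for name in metric_names:
        value = row.get(name)
        if isinstance(value, (int, float)) and math.isfinite(value):
            scores[name] = float(value)
    return scores


example_row = {"faithfulness": 0.75, "answer_relevancy": float("nan"), "trace_id": "example-trace-id"}
print(finite_scores(example_row, ["faithfulness", "answer_relevancy"]))
```

In the loop above, you could iterate over `finite_scores(row, [...]).items()` and send only those values to Langfuse.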

You can now go back to the Langfuse console and check the updated scores in the traces.

![](images/bedrock-kbs/score-with-sampling.png)

### Congratulations!
You have successfully finished Lab 2.

If you are at an AWS event, you can return to the workshop studio for additional instructions before moving into the next lab, where we will explore model-based evaluation and guardrails.