## Let's do some dataset runs in langfuse

A dataset is a collection of inputs and expected outputs and is used to test your application. Before executing your first dataset run, you need to create a dataset. 

1. First step make a Langfuse dataset:

In [1]:
from langfuse import get_client
from dotenv import load_dotenv

# MUST ALWAYS INITIALIZE LANGFUSE FIRST
load_dotenv()
langfuse = get_client()

langfuse.create_dataset(
    name="my-first-dataset",
    # optional description
    description="My first dataset",
    # optional metadata
    metadata={
        "author": "ML",
        "date": "2025-18-09",
        "type": "benchmark"
    }
)

Dataset(id='cmfp9we0500g4ad070gio1s97', name='my-first-dataset', description='My first dataset', metadata={'date': '2025-18-09', 'type': 'benchmark', 'author': 'ML'}, project_id='cmfp5po0q05r3ad06e3l1053t', created_at=datetime.datetime(2025, 9, 18, 10, 33, 47, 237000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 9, 18, 11, 19, 35, 967000, tzinfo=datetime.timezone.utc))

2. Create new dataset items:

In [2]:
langfuse.create_dataset_item(
    dataset_name="my-first-dataset",
    # any python object or value, optional
    input={
        "text": "What is 2+1?"
    },
    # any python object or value, optional
    expected_output={
        "text": """Let’s do it carefully, digit by digit:
                    Start with 2.
                    Add 1 more.
                    2+1=3. ✅
                    So the answer is 3."""
    },
    # metadata, optional
    metadata={
        "model": "openai",
    }
)

DatasetItem(id='4bf6de3c-a678-4704-b0a5-215706f68997', status=<DatasetStatus.ACTIVE: 'ACTIVE'>, input={'text': 'What is 2+1?'}, expected_output={'text': 'Let’s do it carefully, digit by digit:\n                    Start with 2.\n                    Add 1 more.\n                    2+1=3. ✅\n                    So the answer is 3.'}, metadata={'model': 'openai'}, source_trace_id=None, source_observation_id=None, dataset_id='cmfp9we0500g4ad070gio1s97', dataset_name='my-first-dataset', created_at=datetime.datetime(2025, 9, 18, 11, 19, 36, 301000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 9, 18, 11, 19, 36, 301000, tzinfo=datetime.timezone.utc))

Can also create synthetic datasets which I cover in another notebook.

Create items from production data

A common workflow is to select production traces where the application did not perform as expected. Then you let an expert add the expected output to test new versions of your application on the same data.

AH HA MOMENT: So basically you can use this in prod or whenver you have some input that previously performed badly and you can put it in your dataset and the source doesn't really do anything except now you can click on it and it takes you back to the trace you got the example from so you can compare it.

<p align="center">
    <img src="..\assets\langfuse\langfuse-source.png" alt="Langfuse Source" width="600"/>
</p>

In [3]:
langfuse.create_dataset_item(
    dataset_name="my-first-dataset",
    input={ "text": "hello world" },
    expected_output={ "text": "hello world" },
    # optional: link to a trace
    source_trace_id="c5a2292d7a59688e2eec3be7e0145109",
    # optional: link to a specific span, event, or generation
    #source_observation_id="<observation_id>"
)

DatasetItem(id='87f865f8-ff85-41e3-9a85-b1958e61fb8a', status=<DatasetStatus.ACTIVE: 'ACTIVE'>, input={'text': 'hello world'}, expected_output={'text': 'hello world'}, metadata=None, source_trace_id='c5a2292d7a59688e2eec3be7e0145109', source_observation_id=None, dataset_id='cmfp9we0500g4ad070gio1s97', dataset_name='my-first-dataset', created_at=datetime.datetime(2025, 9, 18, 11, 19, 36, 624000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 9, 18, 11, 19, 36, 624000, tzinfo=datetime.timezone.utc))

You can also edit/archive dataset items
Archiving items will remove them from future experiment runs.

In [7]:
langfuse.create_dataset_item(
    dataset_name="my-first-dataset",
    id="9bbaa9c7-3ab3-4cad-b13b-cd7ea7edcfb3",
    # example: update status to "ARCHIVED"
    status="ARCHIVED"
)

DatasetItem(id='9bbaa9c7-3ab3-4cad-b13b-cd7ea7edcfb3', status=<DatasetStatus.ARCHIVED: 'ARCHIVED'>, input='What is a dog?', expected_output=None, metadata=None, source_trace_id='c5a2292d7a59688e2eec3be7e0145109', source_observation_id=None, dataset_id='cmfp9we0500g4ad070gio1s97', dataset_name='my-first-dataset', created_at=datetime.datetime(2025, 9, 18, 10, 48, 55, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 9, 18, 11, 26, 16, 961000, tzinfo=datetime.timezone.utc))

To edit them just use the same dataset item id as edit.

In [8]:

langfuse.create_dataset_item(
    dataset_name="my-first-dataset",
    id="9bbaa9c7-3ab3-4cad-b13b-cd7ea7edcfb3",
    input="What is a cat?"
)

DatasetItem(id='9bbaa9c7-3ab3-4cad-b13b-cd7ea7edcfb3', status=<DatasetStatus.ARCHIVED: 'ARCHIVED'>, input='What is a cat?', expected_output=None, metadata=None, source_trace_id='c5a2292d7a59688e2eec3be7e0145109', source_observation_id=None, dataset_id='cmfp9we0500g4ad070gio1s97', dataset_name='my-first-dataset', created_at=datetime.datetime(2025, 9, 18, 10, 48, 55, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 9, 18, 11, 27, 8, 321000, tzinfo=datetime.timezone.utc))

## Do remote dataset runs 
Once you created a dataset, you can use the dataset to test how your application performs on different inputs. Remote Dataset Runs are used to programmatically loop your applications or prompts through a dataset and optionally apply Evaluation Methods to the results.

They are called “Remote Dataset Runs” because they can make use of “remote” or external logic and code.

FIRST EXAMPLE NO LANGFUSE DATASET

When running experiments on local data, only traces are created in Langfuse - no dataset runs are generated. Each task execution creates an individual trace for observability and debugging.

In [10]:
from langfuse import get_client
from langfuse.openai import OpenAI
 
# Initialize client
langfuse = get_client()
 
# Define your task function
def my_task(*, item, **kwargs):
    question = item["input"]
    response = OpenAI().chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": question}]
    )
 
    return response.choices[0].message.content
 
 
# Run experiment on local data
local_data = [
    {"input": "What is the capital of France?"},
    {"input": "What is the capital of Germany?"},
]
 
result = langfuse.run_experiment(
    name="Geography Quiz",
    description="Testing basic functionality",
    data=local_data,
    task=my_task,
)
 
# Use format method to display results
print(result.format())

Individual Results: Hidden (2 items)\n💡 Set include_item_results=True to view them\n\n──────────────────────────────────────────────────\n🧪 Experiment: Geography Quiz
📋 Run name: Geography Quiz - 2025-09-18T11:43:27.599485Z - Testing basic functionality\n2 items


<p align="center">
    <img src="..\assets\langfuse\langfuse-no-dataset-remote-run.png" alt="Langfuse" width="900"/>
</p>

Now try Usage with Langfuse Datasets
Run experiments directly on datasets stored in Langfuse for automatic tracing and comparison.

In [16]:
# VIEW ALL TRACES IN DATASET

# Get your dataset object
dataset = langfuse.get_dataset("my-first-dataset")

# Fetch all items
items = dataset.items

# Print each item's id and input
for i, item in enumerate(items):
    print(f"Item {i}: id={item.id}, input={item.input}, expected_output={item.expected_output}, status={getattr(item, 'status', None)}")

Item 0: id=87f865f8-ff85-41e3-9a85-b1958e61fb8a, input={'text': 'hello world'}, expected_output={'text': 'hello world'}, status=DatasetStatus.ARCHIVED
Item 1: id=4bf6de3c-a678-4704-b0a5-215706f68997, input={'text': 'What is 2+1?'}, expected_output={'text': 'Let’s do it carefully, digit by digit:\n                    Start with 2.\n                    Add 1 more.\n                    2+1=3. ✅\n                    So the answer is 3.'}, status=DatasetStatus.ACTIVE
Item 2: id=9bbaa9c7-3ab3-4cad-b13b-cd7ea7edcfb3, input=What is a cat?, expected_output=None, status=DatasetStatus.ARCHIVED
Item 3: id=2deec1d3-d621-437a-ad04-1a01dd6351d0, input={'text': 'hello world'}, expected_output={'text': 'hello world'}, status=DatasetStatus.ACTIVE
Item 4: id=443c3054-9777-48c0-831e-76da3cbf52d6, input={'text': 'What is 2+1?'}, expected_output={'text': 'Let’s do it carefully, digit by digit:\n                    Start with 2.\n                    Add 1 more.\n                    2+1=3. ✅\n                

In [17]:
# Get dataset from Langfuse
dataset = langfuse.get_dataset("my-first-dataset")

# Define your task function
def prod_task(*, item, **kwargs):
    # item.input could be a dict
    if isinstance(item.input, dict):
        question = item.input.get("text", "")
    # or item.input could be a string
    else:
        question = item.input
    response = OpenAI().chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": question}]
    )
 
    return response.choices[0].message.content

# Run experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=prod_task
)
 
# Use format method to display results
print(result.format())

Failed to create dataset run item: status_code: 404, body: {'message': 'Dataset item not found', 'error': 'LangfuseNotFoundError'}
Failed to create dataset run item: status_code: 404, body: {'message': 'Dataset item not found', 'error': 'LangfuseNotFoundError'}


Individual Results: Hidden (6 items)\n💡 Set include_item_results=True to view them\n\n──────────────────────────────────────────────────\n🧪 Experiment: Production Model Test
📋 Run name: Production Model Test - 2025-09-18T11:55:08.809831Z - Monthly evaluation of our production model\n6 items


<p align="center">
    <img src="..\assets\langfuse\dataset-run-with-dataset.png" alt="Langfuse" width="900"/>
    <img src="..\assets\langfuse\first-dataset-run.png" alt="Langfuse" width="900"/>
</p>

# Enhance your dataset runs with evaluators
Evaluators assess the quality of task outputs at the item level. They receive the input, metadata, output, and expected output for each item and return evaluation metrics that are reported as scores on the traces in Langfuse.

In [23]:
runs_list = langfuse.api.datasets.get_runs(
    dataset_name="my-first-dataset", page=1, limit=100
)

for run_summary in runs_list.data:
    run_full = langfuse.api.datasets.get_run(
        dataset_name="my-first-dataset", run_name=run_summary.name
    )
    for run_item in run_full.dataset_run_items:
        print(run_item)

id='fd80f2eb-4c23-4e12-a885-2f9b65d6874e' dataset_run_id='2be6100a-b272-4313-90c5-7f83c65dca1a' dataset_run_name='Multi-metric Evaluation - 2025-09-18T12:02:18.920890Z' dataset_item_id='26d0f54d-e2b1-4cd5-925a-df6866013473' trace_id='9d91cb22fe6fcb0afe948ef97d33b3e6' observation_id='1238b30038485665' created_at=datetime.datetime(2025, 9, 18, 12, 2, 34, 51000, tzinfo=datetime.timezone.utc) updated_at=datetime.datetime(2025, 9, 18, 12, 2, 34, 51000, tzinfo=datetime.timezone.utc)
id='2e6e5bcd-baf4-458b-a8b0-a41ee372f06e' dataset_run_id='2be6100a-b272-4313-90c5-7f83c65dca1a' dataset_run_name='Multi-metric Evaluation - 2025-09-18T12:02:18.920890Z' dataset_item_id='443c3054-9777-48c0-831e-76da3cbf52d6' trace_id='4c509c49f3f429a2994e00d1f1d8df73' observation_id='5b41419152cf9395' created_at=datetime.datetime(2025, 9, 18, 12, 2, 32, 604000, tzinfo=datetime.timezone.utc) updated_at=datetime.datetime(2025, 9, 18, 12, 2, 32, 604000, tzinfo=datetime.timezone.utc)
id='4d5932d8-e302-4c2c-975c-348908

In [27]:
run_full = langfuse.api.datasets.get_run(
    dataset_name="my-first-dataset", run_name="Production Model Test - 2025-09-18T11:55:08.809831Z"
)

for run_item in run_full.dataset_run_items:
    print(run_item)
    # Fetch the dataset item to get input and expected output
    dataset_item = langfuse.api.dataset_items.get(run_item.dataset_item_id)
    print("Input:", dataset_item.input)
    print("Expected Output:", dataset_item.expected_output)
    # Output is not directly stored on the DatasetItem; it is typically found in the trace linked via run_item.trace_id

id='9be82ff3-6cc8-4736-a25b-1a033ee0d51e' dataset_run_id='dda46a69-a167-4853-aa24-7522ec531da6' dataset_run_name='Production Model Test - 2025-09-18T11:55:08.809831Z' dataset_item_id='26d0f54d-e2b1-4cd5-925a-df6866013473' trace_id='fd41743a8f75dbe2a0de485100564c33' observation_id='923076ee2f6eb5c5' created_at=datetime.datetime(2025, 9, 18, 11, 55, 20, 996000, tzinfo=datetime.timezone.utc) updated_at=datetime.datetime(2025, 9, 18, 11, 55, 20, 996000, tzinfo=datetime.timezone.utc)
Input: {'text': 'What is 2+2?'}
Expected Output: {'text': 'Let’s do it carefully, digit by digit:\n                    Start with 2.\n                    Add 2 more.\n                    2+2=4. ✅\n                    So the answer is 4.'}
id='1bdc4d05-ec83-4420-8adf-4cd2c2a68bc0' dataset_run_id='dda46a69-a167-4853-aa24-7522ec531da6' dataset_run_name='Production Model Test - 2025-09-18T11:55:08.809831Z' dataset_item_id='443c3054-9777-48c0-831e-76da3cbf52d6' trace_id='6834fec8b3edf6a8f7ba76e5af3d5015' observation

In [None]:
from langfuse import Evaluation
 
# Define evaluation functions
def accuracy_evaluator(*, input, output, expected_output, metadata, **kwargs):
    if output.get("text").lower() in expected_output.get("text").lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0, comment="Correct answer found")
 
    return Evaluation(name="accuracy", value=0.0, comment="Incorrect answer")
 
def length_evaluator(*, input, output, **kwargs):
    return Evaluation(name="response_length", value=len(output), comment=f"Response has {len(output)} characters")
 
# Use multiple evaluators
result = dataset.run_experiment(
    name="Multi-metric Evaluation",
    task=prod_task,
    evaluators=[accuracy_evaluator, length_evaluator]
)
 
print(result.format())

Failed to create dataset run item: status_code: 404, body: {'message': 'Dataset item not found', 'error': 'LangfuseNotFoundError'}
Evaluator accuracy_evaluator failed: 'dict' object has no attribute 'lower'
Evaluator accuracy_evaluator failed: 'dict' object has no attribute 'lower'
Failed to create dataset run item: status_code: 404, body: {'message': 'Dataset item not found', 'error': 'LangfuseNotFoundError'}
Evaluator accuracy_evaluator failed: 'NoneType' object has no attribute 'lower'
Evaluator accuracy_evaluator failed: 'dict' object has no attribute 'lower'
Evaluator accuracy_evaluator failed: 'dict' object has no attribute 'lower'
Evaluator accuracy_evaluator failed: 'dict' object has no attribute 'lower'


Individual Results: Hidden (6 items)\n💡 Set include_item_results=True to view them\n\n──────────────────────────────────────────────────\n🧪 Experiment: Multi-metric Evaluation
📋 Run name: Multi-metric Evaluation - 2025-09-18T12:02:18.920890Z\n6 items\nEvaluations:\n  • response_length\n\nAverage Scores:\n  • response_length: 206.833\n
