# Langfuse Dataset Creation and Experimentation

This notebook demonstrates how to create a dataset in Langfuse and run experiments to evaluate model performance against known answers for a pet food customer service use case.

## 1. Environment Setup

Load environment variables (API keys and configuration) from a `.env` file.

In [None]:

from dotenv import load_dotenv
load_dotenv()
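
A missing key is easier to diagnose before any client is created. The following is a minimal sketch of such a check. It assumes the `.env` file defines the standard Langfuse SDK variables `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` (the notebook does not list its variables), and `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Assumed names: the standard Langfuse SDK credentials.
# Extend the list with whatever keys your model provider needs.
REQUIRED_VARS = ["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```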

## 2. Test the Model with Sample Questions

Before creating the dataset, let's test our model with some sample customer questions to verify it's working correctly.

In [None]:
import json

from model.run_model import run_model

# Load the prepared answers (the ground-truth responses, keyed by answer ID)
with open('example_data/prepared_answers_pet_food.json', 'r') as file:
    prepared_answers = json.load(file)


# Example customer questions
test_questions = [
    "My dog has been having stomach issues after eating your kibble",
    "When will my order #12345 be delivered?",
    "What's the best food for a senior Golden Retriever?",
    "I want to return this bag of food"
]

for question in test_questions:
    # Print each question next to the model's response for a quick sanity check
    print(question, "->", run_model(question))
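
The comparison function in section 7 reads `output["prepared_answer_id"]` and casts it to `int`, so this is a good place to confirm the model returns that shape. The sketch below uses a hypothetical helper, `check_model_output`, and a stand-in dictionary instead of a real `run_model` call. The actual return value may carry additional keys.

```python
def check_model_output(output):
    """Return the prepared answer ID as an int, or raise ValueError on an unexpected shape."""
    if not isinstance(output, dict) or "prepared_answer_id" not in output:
        raise ValueError(f"Expected a dict with 'prepared_answer_id', got: {output!r}")
    try:
        return int(output["prepared_answer_id"])
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Non-numeric prepared_answer_id: {output['prepared_answer_id']!r}"
        ) from exc


# Stand-in for a run_model(question) result (assumed shape)
sample_output = {"prepared_answer_id": "7"}
print(check_model_output(sample_output))  # 7
```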

## 3. Initialize Langfuse Client

Create a connection to Langfuse for dataset management and experiment tracking.

In [None]:
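from langfuse import Langfuse

# With no arguments, the client reads its credentials from the environment
# variables loaded in section 1
langfuse = Langfuse()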

## 4. Create the Dataset

Create a new dataset in Langfuse with metadata describing the purpose and structure of our evaluation data.

In [None]:
langfuse.create_dataset(
    name="pet_food_kwnown_answers",
    # optional description
    description="Example dataset of customer questions with known answers",
    # optional metadata
    metadata={
        "author": "Paolo Tamagnini",
        "date": "2025-06-28",
        "type": "benchmark"
    }
)

## 5. Load Dataset and Prepared Answers

Load the raw dataset containing customer questions and the corresponding prepared answers that serve as ground truth for evaluation.

In [None]:
import json

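# Raw customer messages, each tagged with the ID of its correct prepared answer
with open('example_data/pet_food.json', 'r') as file:
    data = json.load(file)

# Prepared answers keyed by answer ID (as a string); these are the ground truth
with open('example_data/prepared_answers_pet_food.json', 'r') as file:
    prepared_answers = json.load(file)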
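
Section 6 looks up `prepared_answers[str(item["answer_id"])]` for every record, so one unmatched ID raises a `KeyError` partway through the upload. A quick consistency check could look like the sketch below. `find_missing_answer_ids` is a hypothetical helper and the sample records are illustrative, not taken from the real files.

```python
def find_missing_answer_ids(data, prepared_answers):
    """Return the sorted answer IDs referenced in data that have no prepared answer."""
    return sorted(
        {item["answer_id"] for item in data if str(item["answer_id"]) not in prepared_answers}
    )


# Tiny illustrative records; the real files hold more fields
sample_data = [
    {"message": "Where is my order?", "answer_id": 2},
    {"message": "Can I return this?", "answer_id": 5},
]
sample_answers = {"2": "Your order is on its way."}
print(find_missing_answer_ids(sample_data, sample_answers))  # [5]
```

Running `find_missing_answer_ids(data, prepared_answers)` on the loaded files should return an empty list before you continue.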



## 6. Populate Dataset Items

Create individual dataset items in Langfuse, each containing:
- Input: Customer question
- Expected output: The correct prepared answer ID and text
- Metadata: Additional context like customer info, pet details, and timestamps

In [None]:
for item in data:
    langfuse.create_dataset_item(
        dataset_name="pet_food_kwnown_answers",
        # any Python object or value, optional
        input={
            "question": item["message"],
        },
        # any Python object or value, optional
        expected_output={
            "prepared_answer_id": item["answer_id"],
            "prepared_answer_text": prepared_answers[str(item["answer_id"])],
        },
        # metadata, optional
        metadata={
            "model": "Pet Food Customer Service",
            "customer_name": item["customer_name"],
            "pet_type": item["pet_type"],
            "pet_name": item["pet_name"],
            "category": item["category"],
            "timestamp": item["timestamp"],
            "response_timestamp": item["response_timestamp"],
            "status": item["status"],
        },
    )
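
Building the payload in a pure function makes the mapping from raw record to dataset item easy to inspect before anything is sent. The sketch below makes two assumptions that go beyond this notebook. First, `build_item_kwargs` and `METADATA_KEYS` are hypothetical names. Second, it assumes `create_dataset_item` accepts an optional `id` that it upserts on, which would keep re-runs of the cell from creating duplicates. Confirm this against your SDK version.

```python
import hashlib

METADATA_KEYS = [
    "customer_name", "pet_type", "pet_name", "category",
    "timestamp", "response_timestamp", "status",
]


def build_item_kwargs(item, prepared_answers):
    """Build keyword arguments for one dataset item from a raw record."""
    # Deterministic ID derived from the question text, so the same record
    # always maps to the same ID
    item_id = hashlib.sha256(item["message"].encode("utf-8")).hexdigest()[:16]
    return {
        "id": item_id,  # assumption: the SDK upserts on this ID
        "input": {"question": item["message"]},
        "expected_output": {
            "prepared_answer_id": item["answer_id"],
            "prepared_answer_text": prepared_answers[str(item["answer_id"])],
        },
        # .get() tolerates records that lack an optional field
        "metadata": {
            "model": "Pet Food Customer Service",
            **{key: item.get(key) for key in METADATA_KEYS},
        },
    }


# Illustrative record, not taken from the real data file
sample = {"message": "I want to return this bag of food", "answer_id": 4, "pet_type": "dog"}
kwargs = build_item_kwargs(sample, {"4": "Here is how to start a return."})
print(kwargs["expected_output"])
```

The result could then be passed as `langfuse.create_dataset_item(dataset_name="pet_food_kwnown_answers", **kwargs)`.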

## 7. Define Experiment Functions

Create functions to:
- Compare model outputs with expected answers
- Run experiments across the entire dataset
- Track success metrics and individual trace scores

In [None]:
def compare_prepared_answer_ids(output, expected_output):
    """Return 1 if the model picked the expected prepared answer, else 0."""
    # The model may return the ID as a string, so cast it before comparing
    if int(output["prepared_answer_id"]) == expected_output["prepared_answer_id"]:
        return 1
    return 0

def run_experiment(experiment_name):
    """Run the model on every dataset item, score each trace, and return the success rate."""
    dataset_name = "pet_food_kwnown_answers"
    dataset = langfuse.get_dataset(dataset_name)

    items_number = len(dataset.items)
    if items_number == 0:
        raise ValueError(f"Dataset '{dataset_name}' has no items to evaluate.")

    success_count = 0
    for item in dataset.items:
        # item.run() opens a new trace for this item and links it to the named run.
        # root_span is the root span of that trace; all Langfuse operations inside
        # the block become part of it.
        with item.run(run_name=experiment_name) as root_span:
            # Call the application logic
            output = run_model(item.input["question"])

            comparison_result = compare_prepared_answer_ids(output, item.expected_output)
            success_count += comparison_result

            # Record the 0/1 comparison result as a score on the trace
            root_span.score_trace(name="prepared_answer_id", value=comparison_result)

    success_metric = success_count / items_number
    print(f"\nFinished processing dataset '{dataset_name}' for run '{experiment_name}'.")
    print(f"Success rate: {success_count}/{items_number} ({success_metric * 100:.2f}%)")

    return success_metric
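
A single overall success rate can hide weak spots. Since each item's metadata includes a `category`, the 0/1 scores could also be grouped by category. The sketch below assumes the pairs are collected inside the experiment loop, for example as `(item.metadata["category"], comparison_result)`. `success_rate_by_category` is a hypothetical helper and the category names are illustrative.

```python
from collections import defaultdict


def success_rate_by_category(results):
    """Map each category to its share of correct answers, given (category, score) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, seen]
    for category, score in results:
        totals[category][0] += score
        totals[category][1] += 1
    return {category: correct / seen for category, (correct, seen) in totals.items()}


# Illustrative (category, score) pairs
sample_results = [("delivery", 1), ("delivery", 0), ("returns", 1)]
print(success_rate_by_category(sample_results))  # {'delivery': 0.5, 'returns': 1.0}
```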

## 8. Run the Experiment

Execute the experiment to evaluate model performance against the dataset, measuring how often the model selects the correct prepared answer.

In [None]:
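# Setup repeated from sections 1 to 3 so this cell states its dependencies;
# it is harmless if those cells have already run. The functions from
# section 7 must be defined before calling run_experiment.
from dotenv import load_dotenv
from langfuse import Langfuse

from model.run_model import run_model

load_dotenv()

langfuse = Langfuse()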

run_experiment("test_reproducibility")
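
Because `run_experiment` returns the success rate, repeating it under different run names gives a rough view of how stable the model's choices are. The sketch below summarizes a list of such rates. `summarize_runs` is a hypothetical helper and the run names in the comment are illustrative. The input values are placeholders, not measured results.

```python
from statistics import mean, pstdev


def summarize_runs(success_rates):
    """Return the mean, population standard deviation, and spread of per-run success rates."""
    return {
        "mean": mean(success_rates),
        "stdev": pstdev(success_rates),
        "spread": max(success_rates) - min(success_rates),
    }


# Placeholder values; real ones could come from, for example:
# rates = [run_experiment(f"test_reproducibility_{i}") for i in range(3)]
rates = [0.5, 1.0]
print(summarize_runs(rates))
```

A spread of zero across runs would indicate that the model selected the same number of correct answers each time.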