# Migrating from Langfuse to LangSmith


## Migrating Resources

This repo contains scripts to migrate your resources from Langfuse to LangSmith.

This includes:
- Datasets
- Prompts
- Recent Traces

To migrate your resources, refer to ```providers/langfuse/main.py```. Scripts for each resource type are provided in the ```providers/langfuse/data``` directory.


## Updating Code

As part of migrating to LangSmith, you will also need to update your instrumentation code.

The following sections break down some common patterns used in Langfuse and their equivalent implementations in LangSmith. Not all features are shared, but the common constructs are available in both frameworks.

First, let's load in our environment variables.


In [24]:
import os
os.environ["LANGSMITH_PROJECT"] = "default"

from dotenv import load_dotenv
load_dotenv("../../.env", override=True)

True
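Before running the LangSmith examples, it can help to confirm that the variables loaded from `.env` are actually present. The sketch below is a minimal check, assuming tracing is configured through `LANGSMITH_TRACING` and `LANGSMITH_API_KEY`; the helper name `missing_langsmith_vars` is hypothetical and not part of either SDK.

```python
import os

# Variables assumed to be required for LangSmith tracing in this notebook.
REQUIRED_VARS = ("LANGSMITH_TRACING", "LANGSMITH_API_KEY")


def missing_langsmith_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langsmith_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```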

### **Tracing**



#### Observe decorator

If you're using the ```@observe``` decorator, the equivalent in LangSmith is the ```@traceable``` decorator.

In [2]:
from langfuse import observe, get_client
 
@observe
def my_function():
    return "Hello, world!" # Input/output and timings are automatically captured
 
my_function()
 
# Flush events in short-lived applications
langfuse = get_client()
langfuse.flush()

Note: with LangSmith you do not need to manually flush events in short-lived applications.

In [19]:
from langsmith import traceable

@traceable
def my_function():
    return "Hello, world!" # Input/output and timings are automatically captured

my_function()

'Hello, world!'

#### Context Managers

If you're using context managers in Langfuse, LangSmith has an equivalent ls.trace() context manager.

In [65]:
from langfuse import get_client
 
langfuse = get_client()
 
# Create a span using a context manager
with langfuse.start_as_current_span(name="process-request") as span:
    # Your processing logic here
    span.update(output="Processing complete")
 
    # Create a nested generation for an LLM call
    with langfuse.start_as_current_generation(name="llm-response", model="gpt-3.5-turbo") as generation:
        # Your LLM call logic here
        generation.update(output="Generated response")
 



In [66]:
import langsmith as ls

# Create a trace using the context manager
with ls.trace(name="process-request") as rt:
    # Your processing logic here
    # Create a nested generation for an LLM call
    with ls.trace(name="llm-response", run_type="llm", metadata={"model": "gpt-3.5-turbo"}) as generation:
        # Your LLM call logic here
        generation.end(outputs={"output": "Generated response"})

    rt.end(outputs={"output": "Processing complete"})

#### OpenTelemetry

If you're using OpenTelemetry, LangSmith supports [OTel tracing natively](https://docs.langchain.com/langsmith/trace-with-opentelemetry#trace-with-opentelemetry).

In most cases, the main change is replacing the exporter endpoint and headers you had [set with Langfuse](https://langfuse.com/integrations/native/opentelemetry) with the LangSmith ones.
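As a rough sketch of that switch, the snippet below builds the standard OTLP exporter environment variables for LangSmith. The endpoint URL and the `x-api-key` and `Langsmith-Project` header names reflect my reading of the LangSmith OpenTelemetry guide linked above, so treat them as assumptions and verify them for your deployment (self-hosted and regional instances use different hosts). The helper name `langsmith_otel_env` is hypothetical.

```python
import os


def langsmith_otel_env(api_key, project=None,
                       endpoint="https://api.smith.langchain.com/otel"):
    """Build OTLP exporter settings that point at LangSmith (assumed values)."""
    headers = [f"x-api-key={api_key}"]
    if project:
        # Optional: route traces to a specific LangSmith project.
        headers.append(f"Langsmith-Project={project}")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_HEADERS": ",".join(headers),
    }


# Replace the Langfuse exporter settings before the OTel SDK is initialized.
os.environ.update(
    langsmith_otel_env(
        os.environ.get("LANGSMITH_API_KEY", "<your-api-key>"),
        project="default",
    )
)
```

Setting these before the OpenTelemetry SDK starts means the existing instrumentation code can stay unchanged; only the export destination moves.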

### **Evaluations**

#### Datasets

Offline evaluations are typically run over a dataset. Langfuse allows you to create a dataset and add items to it using the SDK.

In [29]:
langfuse.create_dataset(
    name="basic",
    # optional description
    description="Basic dataset",
    # optional metadata
    metadata={
        "type": "benchmark"
    }
)

langfuse.create_dataset_item(
    dataset_name="basic",
    # any python object or value, optional
    input={
        "text": "What is the capital of France?"
    },
    # any python object or value, optional
    expected_output={
        "text": "Paris"
    },
)

langfuse.create_dataset_item(
    dataset_name="basic",
    input={
        "text": "What is the capital of Germany?"
    },
    expected_output={
        "text": "Berlin"
    },
)

DatasetItem(id='3d528ef2-d906-46bc-97b9-38628ce64b5c', status=<DatasetStatus.ACTIVE: 'ACTIVE'>, input={'text': 'What is the capital of Germany?'}, expected_output={'text': 'Berlin'}, metadata=None, source_trace_id=None, source_observation_id=None, dataset_id='cmga8r3ll02fuad06uf23cwmf', dataset_name='basic', created_at=datetime.datetime(2025, 10, 3, 2, 44, 50, 788000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 10, 3, 2, 44, 50, 788000, tzinfo=datetime.timezone.utc))

LangSmith also lets you create datasets through its SDK.

In [None]:
from langsmith import Client

client = Client()
# Create a dataset
examples = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris"
    },
    {
        "input": "What is the capital of Germany?",
        "expected_output": "Berlin"
    }
]

dataset_name = "basic"

if not client.has_dataset(dataset_name=dataset_name):
    langsmith_dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"input": ex["input"]} for ex in examples],
        outputs=[{"expected_output": ex["expected_output"]} for ex in examples],
        dataset_id=langsmith_dataset.id
    )
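When porting existing items rather than retyping them, the field mapping is mostly mechanical: a Langfuse item's `input` becomes a LangSmith example's inputs, and its `expected_output` becomes the example's outputs. The sketch below shows that mapping on plain dictionaries shaped like the Langfuse items created earlier in this section. `to_langsmith_examples` is a hypothetical helper, and the `input` and `expected_output` key names simply follow the LangSmith cell in this section.

```python
def to_langsmith_examples(langfuse_items):
    """Split Langfuse-style items into parallel inputs/outputs lists."""
    inputs, outputs = [], []
    for item in langfuse_items:
        inputs.append({"input": item["input"]["text"]})
        outputs.append({"expected_output": item["expected_output"]["text"]})
    return inputs, outputs


langfuse_items = [
    {"input": {"text": "What is the capital of France?"},
     "expected_output": {"text": "Paris"}},
    {"input": {"text": "What is the capital of Germany?"},
     "expected_output": {"text": "Berlin"}},
]

inputs, outputs = to_langsmith_examples(langfuse_items)
# The two lists line up index by index and can be passed as
# inputs=... and outputs=... to client.create_examples.
```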

#### Experiments

With the Langfuse SDK, experiments are run through ```run_experiment```.

In [59]:
from langfuse import Evaluation
from langfuse.openai import OpenAI

# Define your task function
def my_task(*, item, **kwargs):
    question = item.input["text"]
    print(question)
    response = OpenAI().chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content
 

# Define evaluation functions
def accuracy_evaluator(*, input, output, expected_output, **kwargs):
    if expected_output and expected_output["text"].lower() in output.lower():
        return Evaluation(name="accuracy", value=1.0, comment="Correct answer found")
    return Evaluation(name="accuracy", value=0.0, comment="Incorrect answer")
 
def length_evaluator(*, input, output, **kwargs):
    return Evaluation(name="response_length", value=len(output), comment=f"Response has {len(output)} characters")
 
# Use multiple evaluators
dataset = langfuse.get_dataset("basic")
result = langfuse.run_experiment(
    name="Multi-metric Evaluation",
    data=dataset.items,
    task=my_task,
    evaluators=[accuracy_evaluator, length_evaluator]
)
 
print(result.format())

Individual Results: Hidden (2 items)
💡 Set include_item_results=True to view them

──────────────────────────────────────────────────
🧪 Experiment: Multi-metric Evaluation
📋 Run name: Multi-metric Evaluation - 2025-10-03T03:12:26.355573Z
2 items
Evaluations:
  • response_length
  • accuracy

Average Scores:
  • response_length: 36.000
  • accuracy: 1.000

🔗 Dataset Run:
   https://us.cloud.langfuse.com/project/cmg9xbp62008had07ab7us47z/datasets/cmga8r3ll02fuad06uf23cwmf/runs/dbd7a5a7-2a7a-42b4-a01c-606bd5e9eb16


The equivalent in LangSmith is ```evaluate()```.

In [36]:
from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = Client()

dataset = client.read_dataset(dataset_name="basic")

# Wrap OpenAI client for tracing
openai_client = wrap_openai(OpenAI())

# Define your task function
def my_task(inputs: dict) -> dict:
    question = inputs["input"]
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # Use a LangSmith-supported model name
        messages=[{"role": "user", "content": question}],
    )
    return {"output": response.choices[0].message.content}


# Define evaluation functions
def accuracy_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    output = outputs.get("output", "")
    expected = reference_outputs.get("expected_output", "")
    if expected and expected.lower() in output.lower():
        return {"key": "accuracy", "score": 1.0, "comment": "Correct answer found"}
    return {"key": "accuracy", "score": 0.0, "comment": "Incorrect answer"}

def length_evaluator(inputs: dict, outputs: dict) -> dict:
    output = outputs.get("output", "")
    return {"key": "response_length", "score": len(output), "comment": f"Response has {len(output)} characters"}

# Run experiment
result = client.evaluate(
    my_task,
    data=dataset.id,
    evaluators=[accuracy_evaluator, length_evaluator],
    experiment_prefix="multi-metric-eval"
)

View the evaluation results for experiment: 'multi-metric-eval-89657962' at:
https://smith.langchain.com/o/4015447c-43ab-4414-8539-633d4cb47217/datasets/335a7b19-f7bf-426f-a288-d5b39a5402fb/compare?selectedSessions=4c83a493-5938-487f-9929-1f26fd0c157f






Both Langfuse and LangSmith support flexible evaluation types, including summary evaluators that score an experiment as a whole rather than one example at a time.
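To make the summary-evaluator idea concrete, here is a minimal sketch of one in the LangSmith style. It assumes the list-based signature, where `outputs` and `reference_outputs` hold one dictionary per example, and reuses the `output` and `expected_output` keys from the task and dataset in this section. The name `match_rate` is illustrative only.

```python
def match_rate(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    """Fraction of examples whose output contains the expected answer."""
    if not outputs:
        return {"key": "match_rate", "score": 0.0}
    hits = 0
    for out, ref in zip(outputs, reference_outputs):
        expected = ref.get("expected_output", "")
        # Count a hit only when an expected answer exists and appears in the output.
        if expected and expected.lower() in out.get("output", "").lower():
            hits += 1
    return {"key": "match_rate", "score": hits / len(outputs)}


# Quick local check on hand-written data, without calling LangSmith.
sample = match_rate(
    [{"output": "The capital is Paris."}, {"output": "Munich"}],
    [{"expected_output": "Paris"}, {"expected_output": "Berlin"}],
)
```

A function like this would typically be passed to the experiment through the `summary_evaluators` argument of `evaluate()`, alongside the per-example `evaluators`.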

### **Prompts**

Langfuse and LangSmith both provide prompt management through the UI and the SDK.

Langfuse uses the ```create_prompt``` method, shown below.

In [60]:
langfuse.create_prompt(
    name="movie-critic-chat",
    type="chat",
    prompt=[
      { "role": "system", "content": "You are an {{criticlevel}} movie critic" },
      { "role": "user", "content": "Do you like {{movie}}?" },
    ],
    labels=["production"],  # directly promote to production
    config={
        "model": "gpt-4o",
        "temperature": 0.7,
        "supported_languages": ["en", "fr"],
    },  # optionally, add configs (e.g. model parameters or model tools) or tags
)

<langfuse.model.ChatPromptClient at 0x108c19160>

LangSmith has a comparable ```push_prompt``` method in the SDK. When you push a chain that combines a prompt with a model, as below, it detects the model you're using and stores its configuration alongside the prompt.

In [64]:
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

client = Client()

model = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate([
    ("system", "You are an {criticlevel} movie critic"),
    ("human", "Do you like {movie}?")
])
chain = prompt | model
client.push_prompt("movie-critic-chat", object=chain)

'https://smith.langchain.com/prompts/movie-critic-chat/4cb87751?organizationId=4015447c-43ab-4414-8539-633d4cb47217'
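Note that the Langfuse prompt above uses double-brace placeholders (`{{movie}}`), while the `ChatPromptTemplate` version uses single braces (`{movie}`). When porting many prompts, that rewrite can be scripted. The sketch below is one possible approach using only the standard library; `mustache_to_fstring` is a hypothetical helper, and it assumes placeholders are simple identifiers with no nested logic.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_MUSTACHE_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def _escape(literal: str) -> str:
    """Escape literal braces so f-string style formatting leaves them alone."""
    return literal.replace("{", "{{").replace("}", "}}")


def mustache_to_fstring(template: str) -> str:
    """Convert {{var}} placeholders to {var}, escaping other literal braces."""
    parts = []
    last = 0
    for match in _MUSTACHE_VAR.finditer(template):
        parts.append(_escape(template[last:match.start()]))
        parts.append("{" + match.group(1) + "}")
        last = match.end()
    parts.append(_escape(template[last:]))
    return "".join(parts)


langfuse_messages = [
    ("system", "You are an {{criticlevel}} movie critic"),
    ("human", "Do you like {{movie}}?"),
]
langchain_messages = [
    (role, mustache_to_fstring(text)) for role, text in langfuse_messages
]
```

The resulting `langchain_messages` list has the same shape as the message tuples passed to `ChatPromptTemplate` in the cell above.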