# Comparing Q&A System Outputs
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langsmith-cookbook/blob/master/./testing-examples/comparing-runs/comparing-qa.ipynb)

The most common way to compare two models is to benchmark them both on a dataset and compare the aggregate metrics.

This approach is useful, but aggregate metrics can hide helpful information about the relative quality of the two system variants. In such cases, it can help to run pairwise comparisons directly on the responses and take the resulting preference scores into consideration.

In this tutorial, we will share one way to do this in code. We will use a retrieval Q&A system over LangSmith's docs as a motivating example.

The main steps are:

1. Setup
   - Create a dataset of questions and answers.
   - Define different versions of your chains to evaluate.
   - Evaluate chains directly on a dataset using regular metrics (e.g. correctness).
2. Evaluate the pairwise preferences over that dataset

In this case, we will test the impact of chunk sizes on our result quality. Let's begin!

## Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure the tracing and evals are set up for [LangSmith](https://smith.langchain.com), please configure your API Key appropriately.

We will also use pandas to render the results in the notebook.

In [1]:
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Update with your API key

Install the required packages. `lxml` and `html2text` are used by the document loader.

In [2]:
# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet
# %pip install nest_asyncio --quiet

In [3]:
# %env OPENAI_API_KEY=<YOUR-API-KEY>

In [4]:
# Used for running in jupyter
import nest_asyncio

nest_asyncio.apply()

## 1. Setup

#### a. Create a dataset

No evaluation process is complete without a development dataset. We've hard-coded a few examples below to demonstrate the process. In general, you'll want a lot more (>100) pairs for statistically significant results. Drawing from actual user queries can be helpful to ensure better representation of the domain.
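
To get a feel for why a handful of examples is not enough, the sketch below computes a Wilson score interval for an observed win rate at two dataset sizes. It is only an illustration: `wilson_interval` is a helper written for this example (not part of LangSmith or LangChain), and the win counts are made up.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts: roughly the same ~71% win rate at two dataset sizes
small = wilson_interval(5, 7)
large = wilson_interval(71, 100)
print(f"n=7:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=100: {large[0]:.2f} to {large[1]:.2f}")
```

With seven examples the interval still includes 0.5 (no clear winner), while with 100 examples at the same observed rate it does not.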

In [5]:
examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the LangSmith Python SDK. To use it, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]

In [6]:
from langsmith import Client

client = Client()

In [7]:
import uuid 

dataset_name = f"Retrieval QA Questions {str(uuid.uuid4())}"
dataset = client.create_dataset(dataset_name=dataset_name)
for q, a in examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

#### b. Define RAG Q&A system

Our Q&A system uses a simple retriever and LLM response generator. To break that down further, the chain will be composed of:

1. A [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.base.VectorStoreRetriever.html#langchain.vectorstores.base.VectorStoreRetriever) to retrieve documents. This uses:
   - An embedding model to vectorize documents and user queries for retrieval. In this case, the [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) model.
   - A vectorstore. In this case, we will use [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma).
2. A response generator. This uses:
   - A [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html#langchain.prompts.chat.ChatPromptTemplate) to combine the query and documents. 
   - An LLM, in this case, the 16k token context window version of `gpt-3.5-turbo` via [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html#langchain.chat_models.openai.ChatOpenAI).

We will combine them using LangChain's [expression syntax](https://python.langchain.com/docs/guides/expression_language/cookbook).

First, load the documents to populate the vectorstore:

In [8]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)

def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embeddings)
    return vectorstore.as_retriever(search_kwargs={"k": 4})



Next up, we'll define the chain. Since we are going to vary the retriever parameters, our constructor will
take the retriever as an argument.

In [9]:
from datetime import datetime
from operator import itemgetter

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser


def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
                ("system", "{context}"),
                ("human","{question}")
            ]
        ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = (
        prompt 
        | model 
        | StrOutputParser()
    )
    chain = (
        # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | response_generator
    )
    return chain
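
To make the routing step concrete, here is a plain-Python sketch of what the runnable map at the start of the chain does with its input. It does not use LangChain: `Doc`, `fake_retriever`, and `route_inputs` are hypothetical stand-ins written for this illustration, and the keyword match replaces the real embedding search.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


CORPUS = [
    Doc("LangSmith datasets are collections of examples."),
    Doc("Tracing logs runs to LangSmith."),
]


def fake_retriever(query):
    """Hypothetical retriever: keyword match instead of vector similarity."""
    words = [w.strip("?.,").lower() for w in query.split() if len(w) > 4]
    return [d for d in CORPUS if any(w in d.page_content.lower() for w in words)]


def route_inputs(inputs, retriever):
    """Mimic the runnable map: build the 'context' and 'question' keys."""
    docs = retriever(inputs["question"])
    return {
        "context": "\n".join(doc.page_content for doc in docs),
        "question": inputs["question"],
    }


routed = route_inputs({"question": "What are datasets?"}, fake_retriever)
print(routed)
```

The resulting dictionary has exactly the two keys the prompt template expects, which is why the response generator can be piped directly after the map.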

With the documents prepared, and the chain constructor ready, it's time to create and evaluate our chains.
We will vary the split size and overlap to evaluate their impact on the response quality.
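
As a rough picture of what chunk size and overlap mean, the sketch below splits a token list into overlapping windows. This is a simplification and not `TokenTextSplitter`'s actual implementation: `split_with_overlap` is a hypothetical helper, and the "tokens" are placeholder strings rather than real tokenizer output.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so text near a boundary appears in both chunks. Smaller chunks give the retriever more focused passages but less surrounding context per retrieved document.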

In [10]:
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
retriever = create_retriever(transformed, text_splitter)

chain_1 = create_chain(retriever)

In [11]:
# We will shrink both the chunk size and overlap
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
retriever_2 = create_retriever(transformed, text_splitter_2)

chain_2 = create_chain(retriever_2)

#### c. Evaluate the chains

At this point, we are still going through the regular development -> evaluation process. We have two candidates and will evaluate them with a correctness evaluator from LangChain. By running `run_on_dataset`, we will generate predicted answers to each question in the dataset and log feedback from the evaluator for that data point.
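
Conceptually, this kind of evaluation is a loop over the dataset: predict, score against the reference, and record the result. The sketch below illustrates the idea only; it is not how `run_on_dataset` is implemented. `run_and_score` and `contains_reference` are hypothetical helpers, and the substring check stands in for the LLM-based `cot_qa` evaluator.

```python
def run_and_score(examples, predict, evaluate):
    """Generate a prediction per example and attach an evaluator score."""
    rows = []
    for question, reference in examples:
        prediction = predict(question)
        rows.append(
            {
                "question": question,
                "prediction": prediction,
                "score": evaluate(prediction, reference),
            }
        )
    return rows


def contains_reference(prediction, reference):
    """Toy evaluator: 1.0 if the reference text appears in the prediction."""
    return 1.0 if reference.lower() in prediction.lower() else 0.0


# Made-up data and a lookup table standing in for a model
toy_examples = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
toy_model = {
    "What is 2 + 2?": "The answer is 4.",
    "Capital of France?": "I believe it is Lyon.",
}

rows = run_and_score(toy_examples, toy_model.get, contains_reference)
mean_score = sum(row["score"] for row in rows) / len(rows)
print(mean_score)
```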

In [12]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator
    evaluators=["cot_qa"],
)

In [13]:
results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_1,
    evaluation=eval_config
)
project_name = results["project_name"]

View the evaluation results for project 'test-terrific-sneeze-7' at:
https://smith.langchain.com/o/9a6371ef-ea6a-4860-b3bd-9614084873e7/projects/p/86b4ea18-3749-4c56-99bf-e1399541a0d7
[------------------------------------------------->] 7/7

In [14]:
results_2 = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_2,
    evaluation=eval_config
)
project_name_2 = results_2["project_name"]

View the evaluation results for project 'test-warm-lettuce-8' at:
https://smith.langchain.com/o/9a6371ef-ea6a-4860-b3bd-9614084873e7/projects/p/4980a6ab-15df-476e-ac8a-fcf35adc8ba1
[------------------------------------------------->] 7/7

Now you should have two test run projects over the same dataset. If you click on one, it should look something like the following:
    
![Original Feedback](img/original_eval.png)

You can look at the aggregate results here and for the other project to compare them. You could also view them in a dataframe:

In [21]:
# Helper function to wrap the results in the dataframe table
from IPython.core.display import HTML

def word_wrap_on_hover(df):
    styles = """
    <style>
        .hover_table td {
            max-width: 200px; /* You can adjust this value */
            overflow: hidden;
            text-overflow: ellipsis;
            white-space: nowrap;
        }
        .hover_table td:hover {
            white-space: normal;
            word-wrap: break-word;
        }
    </style>
    """
    html_table = df.to_html(index=False, classes='hover_table')
    return HTML(styles + html_table)

In [16]:
import pandas as pd

runs_1 = list(client.list_runs(project_name=project_name, execution_order=1))
runs_2 = list(client.list_runs(project_name=project_name_2, execution_order=1))

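def get_project_df(runs):
    # One row per run: the chain outputs plus the average score for each feedback key
    rows = [
        {**run.outputs, **{k: v.get("avg") for k, v in run.feedback_stats.items()}}
        for run in runs
    ]
    # Index by the dataset example ID so the two projects can be joined row by row
    return pd.DataFrame(rows, index=[run.reference_example_id for run in runs])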

runs_1_df = get_project_df(runs_1)
runs_2_df = get_project_df(runs_2)
joined_df = runs_1_df.join(runs_2_df, lsuffix='_1', rsuffix='_2')
columns_1 = [col for col in joined_df.columns if col.endswith('_1')]
columns_2 = [col for col in joined_df.columns if col.endswith('_2')]
new_columns_order = [col for pair in zip(columns_1, columns_2) for col in pair]
joined_df = joined_df[new_columns_order]

In [17]:
word_wrap_on_hover(joined_df)

output_1,output_2,COT Contextual Accuracy_1,COT Contextual Accuracy_2
"LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of setting up and using LLMs, such as OpenAI's GPT models, by providing a set of tools and utilities. LangChain aims to bridge the gap between the initial setup of LLMs and their reliable use in production applications. It includes features like tracing, debugging, testing, evaluation, and monitoring to help developers build and improve LLM applications.","LangChain is a framework developed by LangChain, Inc. for building applications using large language models. It provides a set of tools and libraries that enable developers to leverage the power of language models in their applications. With LangChain, developers can easily integrate language models into their projects, allowing them to perform tasks such as natural language understanding, text generation, sentiment analysis, and more. The framework supports various programming languages and provides a range of features to simplify the development process.",1.0,0.0
"To query for all runs in a project, you can use the LangSmith client's `list_runs` method. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nIn this example, replace `<your_project>` with the name of your project. The `list_runs` method returns a generator that you can iterate over to access each run in the project.\n\nIf you're using TypeScript, you can use a similar approach:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ projectName: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nAgain, replace `<your_project>` with the name of your project. The `listRuns` method returns an asynchronous iterable that you can loop over to access each run in the project.","To query for all runs in a project using the LangSmith API, you can use the `list_runs` method of the `Client` object. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nproject_runs = client.list_runs(project_name=""<your_project>"")\n```\n\nThis will retrieve all runs in the specified project. You can replace `<your_project>` with the name of your project.\n\nIn TypeScript, you can use the `listRuns` method of the `Client` class. Here's an example:\n\n```typescript\nimport { Client } from ""langsmith"";\n\nconst client = new Client();\nconst projectRuns = await client.listRuns({ projectName: ""<your_project>"" });\n```\n\nAgain, replace `<your_project>` with the name of your project.\n\nThe `list_runs` or `listRuns` method returns a list of `Run` objects that represent the runs in the project. You can iterate over this list to access the individual runs and their properties.",1.0,1.0
"Yes, you can trace your Llama V2 LLM by enabling tracing in your LangChain application. Tracing allows you to log the inputs and outputs of each component in your LLM application, including the Llama V2 LLM. This can be helpful for debugging and understanding the behavior of your LLM.\n\nTo enable tracing, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LangChain application. This will ensure that all calls to LLMs, chains, agents, tools, and retrievers are logged to LangSmith.\n\nHere's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { LlamaV2 } from ""langchain/chat_models/llama_v2"";\n\nprocess.env.LANGCHAIN_TRACING_V2 = ""true"";\n\nconst llm = new LlamaV2();\nawait llm.invoke(""Hello, world!"");\n```\n\nRemember to replace `""Hello, world!""` with your desired input to the LLM.","Yes, you can trace your Llama V2 LLM using LangSmith. LangSmith's tracing feature allows you to log and visualize the inputs, outputs, and sequence of events for LLM calls, including Llama V2. This can be helpful for debugging, understanding the exact input to the LLM, analyzing the sequence of events, identifying slow components, and tracking token usage. You can enable tracing for Llama V2 LLM calls by setting the appropriate environment variables or using the LangChainTracer callback when initializing the LLM.",1.0,1.0
"The `traceable` decorator is a convenient way to log your function calls as runs in LangSmith. It automatically captures the inputs, outputs, and metadata of the function and sends them to LangSmith for logging.\n\nTo use the `traceable` decorator, follow these steps:\n\n1. Import the `traceable` decorator from the `langsmith.run_helpers` module.\n2. Decorate your function with `@traceable(run_type=""chain"")`, specifying the `run_type` as ""chain"" (or any other appropriate value).\n3. Call your decorated function as usual.\n\nHere's an example in Python:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type=""chain"")\ndef my_function(input1, input2):\n # Your function logic here\n output = input1 + input2\n return output\n\nresult = my_function(3, 4)\n```\n\nIn this example, the `my_function` is decorated with `@traceable(run_type=""chain"")`. When the function is called, the inputs, outputs, and metadata will be logged as a run in LangSmith.\n\nNote that you can customize the `run_type` to match the type of operation your function represents (e.g., ""llm"", ""tool"", etc.). You can also pass additional metadata to the decorator using the `metadata` parameter.\n\nMake sure you have set up the LangSmith SDK and configured the necessary environment variables before using the `traceable` decorator.","To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules:\n - For Python:\n ```python\n from langsmith.run_helpers import traceable\n ```\n - For TypeScript:\n ```typescript\n import { traceable } from ""langsmith/run_helpers"";\n ```\n\n2. Decorate the function you want to trace with the `traceable` decorator. This decorator will automatically create a run tree for the function and log it to LangSmith. You can specify the `run_type` and `name` parameters to customize the type and name of the run. 
For example:\n - For Python:\n ```python\n @traceable(run_type=""chain"", name=""my_chain_function"")\n def my_chain_function(input_data):\n # Function logic here\n return output_data\n ```\n - For TypeScript:\n ```typescript\n @traceable({ run_type: ""chain"", name: ""my_chain_function"" })\n function myChainFunction(inputData: any): any {\n // Function logic here\n return outputData;\n }\n ```\n\n3. Call the traced function as usual. The function execution will be logged to LangSmith.\n\nBy using the `traceable` decorator, you can easily log the execution of your functions and visualize the call hierarchy in the LangSmith app.",1.0,1.0
"A LangSmith dataset is a collection of examples that can be used to evaluate or improve a chain, agent, or model built using the LangSmith framework. Each example in a dataset consists of inputs and (optionally) expected outputs for a given interaction. \n\nThere are three types of datasets in LangSmith:\n\n1. Key-value datasets: These datasets allow for arbitrary key-value pairs as inputs and outputs. They are useful when evaluating chains and agents that require multiple inputs or return multiple outputs.\n\n2. LLM datasets: These datasets correspond to string inputs and outputs from ""completion"" style language models (LLMs). The inputs and outputs are simple strings.\n\n3. Chat datasets: These datasets correspond to structured ""chat"" messages as inputs and outputs. Each example row expects a list of serialized chat messages as inputs and outputs.\n\nDatasets can be managed and created using the LangSmith web app or through the LangSmith client library. Examples can be added to datasets from existing runs, uploaded from CSV files, or created programmatically using the client library. Datasets can also be exported to CSV or OpenAI evals format for further analysis or sharing.","A LangSmith dataset is a collection of input-output examples that are used for testing and evaluation purposes. It is a way to organize and store representative runs from your debug projects. These examples can be added to a dataset using the ""Add to Dataset"" button in LangSmith.\n\nDatasets are valuable for benchmarking and comparing different versions of your chains or prompts. They provide a set of inputs and expected outputs that can be used to test the performance and accuracy of your models. Datasets can be used to evaluate changes to prompts or chains, and they can also be shared with others for collaborative debugging or analysis.\n\nIn LangSmith, datasets can be easily uploaded and managed. 
Once a dataset is created, you can run your chains over the data points and log the results to a new project associated with the dataset. You can review the results, assign feedback to runs, and mark them as correct or incorrect. Datasets can be a powerful tool for testing, evaluating, and improving your language model applications.",1.0,1.0
"Currently, LangSmith does not support project migration between organizations. If you want to move your project to a different organization, the recommended approach is to create a new project within the desired organization and manually transfer the necessary data, such as runs and datasets, to the new project.\n\nYou can achieve this by using the LangSmith SDK. Here's an example of how you can export runs from one project and import them into another project:\n\n1. Export runs from the source project:\n ```python\n from langsmith import Client\n\n client = Client()\n runs = list(client.list_runs(project_name=""<source_project>""))\n\n # Save the runs to a file or perform any necessary processing\n # ...\n\n ```\n\n2. Import runs into the destination project:\n ```python\n from langsmith import Client\n\n client = Client()\n with open(""<runs_file>"", ""r"") as file:\n runs_data = file.read()\n\n # Create the runs in the destination project\n client.create_runs(project_name=""<destination_project>"", runs_data=runs_data)\n ```\n\nYou can follow a similar approach for transferring datasets or any other data associated with your project.\n\nPlease note that this process requires manual intervention and may take some time depending on the size and complexity of your project.","Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. Here's how you can do it:\n\n1. Export Runs: Use the LangSmith SDK to export the runs from your current organization. You can use the `export_run` function to export individual runs or the `export_runs` function to export multiple runs at once. Save the exported runs to a file.\n\n2. Export Datasets: Similarly, use the LangSmith SDK to export the datasets from your current organization. 
You can use the `export_dataset` function to export individual datasets or the `export_datasets` function to export multiple datasets at once. Save the exported datasets to a file.\n\n3. Create a New Project: In your new organization, create a new project where you want to import the runs and datasets.\n\n4. Import Runs: Use the LangSmith SDK to import the exported runs into the new project. You can use the `import_run` function to import individual runs or the `import_runs` function to import multiple runs at once. Provide the path to the exported runs file.\n\n5. Import Datasets: Similarly, use the LangSmith SDK to import the exported datasets into the new project. You can use the `import_dataset` function to import individual datasets or the `import_datasets` function to import multiple datasets at once. Provide the path to the exported datasets file.\n\nBy following these steps, you can effectively move your project from one organization to another.",1.0,1.0
"Setting environment variables is necessary in order to configure and customize the behavior of LangSmith. Environment variables provide a way to pass configuration information to the LangSmith framework without hardcoding it in your code. This allows for flexibility and easy configuration changes without modifying the code itself.\n\nIn the context of LangSmith, setting environment variables is particularly important for enabling tracing and logging functionality. By setting the appropriate environment variables, you can enable LangSmith to log all calls to LLMs, chains, agents, tools, and retrievers. This logging is essential for debugging and understanding the behavior of your application.\n\nAdditionally, environment variables are used to configure other aspects of LangSmith, such as API keys for accessing the LangChain Hub and other services. By setting these variables, you can securely authenticate and access the necessary resources.\n\nOverall, setting environment variables is a best practice for configuring and customizing the behavior of LangSmith and ensuring smooth operation of your application.","Setting environment variables is necessary in order to provide configuration information to your application. Environment variables are used to store sensitive information, such as API keys or project names, that should not be hard-coded in your codebase. By using environment variables, you can easily manage and update these configurations without modifying your code.\n\nIn the context of the LangSmith framework, setting environment variables allows you to configure important information such as your API key, project name, and endpoint. These variables are used by the LangSmith SDK to authenticate and interact with the LangSmith API.\n\nOverall, using environment variables for configuration provides a more secure and flexible approach to managing sensitive information in your applications.",1.0,1.0


It looks like the benchmark performance is similar, so let's move on to the pairwise comparison.

## 2. Pairwise Evaluation

Suppose both approaches return similar scores when evaluated in isolation.

We can run a pairwise evaluator to try to predict which output is preferred. We will first define a couple of helper functions to run the evaluator
on each prediction pair. Let's break down the main function, `predict_preference`:

- The function accepts a dataset example and loads each model's predictions on that data point.
- It then randomizes the order of the predictions and calls the evaluator. This is done to average out the impact of any ordering bias in the evaluator LLM.
- Once the evaluation result is returned, we check it to make sure it is valid and then log feedback for both models.

Once this is complete, the values are all returned so we can display them in a table in the notebook below. 
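
The steps above can be sketched without any LLM calls. The snippet below is a minimal illustration, not the notebook's implementation: `judge_prefers_longer` is a hypothetical stand-in for the evaluator LLM, and `model_1` / `model_2` are made-up labels. It shows one way to shuffle the presentation order and then map the positional verdict ("A" or "B") back to the model that produced the answer.

```python
import random


def judge_prefers_longer(first, second):
    """Hypothetical judge: 'A' for the first answer, 'B' for the second, None on a tie."""
    if len(first) == len(second):
        return None
    return "A" if len(first) > len(second) else "B"


def compare_pair(pred_1, pred_2, judge, rng):
    """Randomize presentation order, then map the verdict back to a model name."""
    candidates = [("model_1", pred_1), ("model_2", pred_2)]
    if rng.random() < 0.5:
        candidates.reverse()
    (name_a, text_a), (name_b, text_b) = candidates
    verdict = judge(text_a, text_b)
    if verdict is None:
        return None  # tie: no preference recorded
    return name_a if verdict == "A" else name_b


rng = random.Random(0)
winners = [
    compare_pair("short", "a much longer answer", judge_prefers_longer, rng)
    for _ in range(20)
]
tie = compare_pair("same", "same", judge_prefers_longer, rng)
print(set(winners), tie)
```

Because the verdict is mapped back through the shuffled order, the winner is the same model no matter which position it was shown in. A judge that favors one position would instead have its bias spread across both models.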

In [15]:
import random
import logging

def _get_run_and_prediction(example_id, project_name):
    run = next(client.list_runs(reference_example_id=example_id, project_name=project_name, execution_order=1))
    prediction = next(iter(run.outputs.values()))
    return run, prediction

def _log_feedback(run_ids):
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)

def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id
    print(example)
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    input_, answer = example.inputs['question'], example.outputs['answer']
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Flip a coin to average out persistent positional bias
    if random.random() < 0.5:
        result['A'], result['B'] = result['B'], result['A']
        run_a, run_b = run_b, run_a
    try:
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result['A'],
            prediction_b=result['B'],
            input=input_, 
            reference=answer
        )
    except Exception as e:
        logging.warning(e)
        return result

    if eval_res["value"] is None:
        return result

    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    _log_feedback((runner_up_run[0], preferred_run[0]))
    result["Preferred"] = preferred_run[1]
    return result


For this example, we will use the off-the-shelf `labeled_pairwise_string` evaluator from LangChain. By default, it instructs the evaluation LLM to choose the preference based on helpfulness, relevance, correctness, and depth of thought. In your case, you will likely want to customize the criteria used!

For more information on how to configure it, check out the [Labeled Pairwise String Evaluator](https://python.langchain.com/docs/guides/evaluation/comparison/labeled_pairwise_string) documentation and inspect the resulting traces when running this notebook.


In [16]:
from langchain.evaluation import load_evaluator

pairwise_evaluator = load_evaluator("labeled_pairwise_string")

In [17]:
import functools
from langchain.schema.runnable import RunnableLambda


eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# Wrap the function in a RunnableLambda to take advantage of its default `batch` convenience method
runnable = RunnableLambda(eval_func)

In [18]:
examples = list(client.list_examples(dataset_id=dataset.id))
values = runnable.batch(examples)

dataset_id=UUID('a0ceb51f-002d-4dd3-b4d6-bda9cb6bf958') inputs={'question': 'Why do I have to set environment variables?'} outputs={'answer': 'Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.'} id=UUID('960664ec-5777-46b9-a04d-1054c3530d19') created_at=datetime.datetime(2023, 9, 20, 21, 23, 16, 210654) modified_at=datetime.datetime(2023, 9, 20, 21, 23, 16, 363607) runs=[]dataset_id=UUID('a0ceb51f-002d-4dd3-b4d6-bda9cb6bf958') inputs={'question': 'How do I move my project between organizations?'} outputs={'answer': "LangSmith doesn't directly support moving projects between organizations."} id=UUID('6f2c55a3-c17b-456e-9eed-aa74e2b5dcd5') created_at=datetime.datetime(2023, 9, 20, 21, 23, 16, 417243) modified_at=datetime.datetime(2023, 9, 20, 21, 23, 16, 562850) runs=[]




By running the function above, the "preference" feedback was automatically logged to the test projects you created in step 1c. Below is a view of the same test run as before with the preference scores added. This model seems to be less preferred than the other!

![Preference Tags](img/with_preferences.png)

The `predict_preference` function we wrote above is set up to not log feedback in the case of a tie, meaning some of the examples do not have a corresponding preference score. You can adjust this behavior as you see fit. 
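
A quick tally of the returned rows shows how many comparisons produced no feedback. The sketch below uses made-up rows shaped like the dictionaries `predict_preference` returns; `summarize_preferences` is a hypothetical helper. Two caveats follow from the function as written: a row without a `"Preferred"` key can mean either a tie or an evaluator error, since both branches return early, and the `"A"` / `"B"` labels refer to positions after the coin flip rather than to a fixed project.

```python
from collections import Counter


def summarize_preferences(rows):
    """Count positional verdicts; rows lacking 'Preferred' had no feedback logged."""
    return Counter(row.get("Preferred", "none") for row in rows)


# Made-up rows for illustration
toy_rows = [
    {"input": "q1", "Preferred": "A"},
    {"input": "q2", "Preferred": "B"},
    {"input": "q3"},  # tie or evaluator error
    {"input": "q4", "Preferred": "B"},
]

summary = summarize_preferences(toy_rows)
print(dict(summary))
```

Calling `summarize_preferences(values)` on the real results gives the same breakdown. A heavy skew toward one position can hint at ordering bias in the evaluator, while the per-project comparison comes from the "preference" feedback logged on each run.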

You can also view the feedback results for the other test run in the app to see how well the evaluator's results match your expectations.

In [22]:
import pandas as pd 
df = pd.DataFrame(values)
word_wrap_on_hover(df)

input,answer,A,B,Preferred
How do I move my project between organizations?,LangSmith doesn't directly support moving projects between organizations.,"Currently, LangSmith does not support project migration between organizations. If you want to move a project from one organization to another, the recommended approach is to manually export the runs and datasets from the source organization and import them into the destination organization.\n\nHere are the general steps to move a project between organizations:\n\n1. Export Runs: Use the LangSmith SDK or API to export the runs from the source organization. You can iterate through the runs and save them to a file or database.\n\n2. Export Datasets: If your project uses datasets, export them from the source organization using the LangSmith SDK or API.\n\n3. Import Runs: In the destination organization, create a new project and import the exported runs. Use the LangSmith SDK or API to create new runs in the destination project and populate them with the exported run data.\n\n4. Import Datasets: If your project uses datasets, import them into the destination organization using the LangSmith SDK or API.\n\nBy following these steps, you can effectively move your project from one organization to another. Remember to update any references or configurations in your application to use the new organization and project.\n\nIf you have a large number of runs or complex data dependencies, it may be helpful to reach out to the LangSmith support team at support@langchain.dev for further assistance and guidance.","Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. Here's how you can do it:\n\n1. Export Runs: Use the LangSmith SDK to export the runs from your current organization. You can use the `export_run` function to export individual runs or the `export_runs` function to export multiple runs at once. 
Save the exported runs to a file or storage location.\n\n2. Export Datasets: Similarly, use the LangSmith SDK to export the datasets from your current organization. You can use the `export_dataset` function to export individual datasets or the `export_datasets` function to export multiple datasets at once. Save the exported datasets to a file or storage location.\n\n3. Create a New Project: In your new organization, create a new project where you want to import the runs and datasets.\n\n4. Import Runs: Use the LangSmith SDK to import the exported runs into the new project. You can use the `import_run` function to import individual runs or the `import_runs` function to import multiple runs at once. Provide the path or location of the exported runs file.\n\n5. Import Datasets: Similarly, use the LangSmith SDK to import the exported datasets into the new project. You can use the `import_dataset` function to import individual datasets or the `import_datasets` function to import multiple datasets at once. Provide the path or location of the exported datasets file.\n\nBy following these steps, you can effectively move your project from one organization to another.",B
Why do I have to set environment variables?,"Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.","Setting environment variables is necessary to configure the runtime environment for your LangSmith application. These variables provide important information such as the API key, project name, and endpoint to connect to LangSmith services.\n\nBy setting environment variables, you can easily manage and update these configurations without modifying your code. It also allows you to keep sensitive information, such as API keys, separate from your codebase, enhancing security.\n\nIn the context of LangSmith, setting the LANGCHAIN_TRACING_V2, LANGCHAIN_ENDPOINT, LANGCHAIN_API_KEY, and LANGCHAIN_PROJECT environment variables is essential for enabling tracing, specifying the API endpoint, and providing authentication credentials for your LangSmith project.\n\nOverall, environment variables provide a flexible and secure way to configure your LangSmith application without hardcoding sensitive information into your code.","Setting environment variables is a common practice in software development to configure the behavior of applications. In the context of LangSmith's tracing feature, setting environment variables is necessary to enable the logging of run traces to LangSmith.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, LangSmith's tracing feature is activated, and all calls to LLMs, chains, agents, tools, and retrievers are logged to LangSmith. 
This allows you to debug and analyze the behavior of your application, understand the inputs and outputs of LLM calls, trace the sequence of events in complicated chains and agents, monitor latency and token usage, and more.\n\nSetting environment variables provides a convenient and consistent way to enable tracing across different environments and programming languages. However, if your specific situation doesn't permit the use of environment variables or doesn't expose `process.env` (in the case of JavaScript runtime environments), there are alternative methods available to enable tracing without environment variables. These methods involve manually creating the `LangChainTracer` callback and passing it to the LangChain components during initialization or in the call itself.",A
Can I trace my Llama V2 llm?,"So long as you are using one of LangChain's LLM implementations, all your calls can be traced","Yes, you can trace your Llama V2 LLM using LangSmith. LangSmith's tracing feature allows you to log the inputs, outputs, and other relevant information of your LLM calls. This can be helpful for debugging and understanding the behavior of your LLM.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your code. This can be done in your Python environment or in your shell.\n\nHere's an example of how to enable tracing for your Llama V2 LLM in Python:\n\n```python\nimport os\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\n# Your Llama V2 LLM code here\n```\n\nOnce tracing is enabled, all calls to your Llama V2 LLM will be logged to LangSmith. You can then use the LangSmith client to view and analyze the traces.\n\nNote that tracing is enabled at the environment level, so it will apply to all LLM calls made within that environment. If you want to selectively trace specific LLM calls, you can use the `LangChainTracer` callback and pass it to the LLM object during initialization or in the call itself.\n\n```python\nfrom langchain.callbacks import LangChainTracer\nfrom langchain.llm import LlamaV2LLM\n\nllm = LlamaV2LLM()\nllm.invoke(""Hello, world!"", config={""callbacks"": [LangChainTracer()]})\n```\n\nBy using the `LangChainTracer` callback, you can have more fine-grained control over which LLM calls are traced.\n\nRemember to handle your API key and other authentication details properly to ensure secure and authorized access to LangSmith.","Yes, you can trace your Llama V2 LLM by enabling tracing in your LangChain application. Tracing allows you to log the inputs and outputs of each component in your LLM application, including the Llama V2 LLM. 
This can be helpful for debugging and understanding the behavior of your LLM.\n\nTo enable tracing, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LangChain application. This will ensure that all calls to LLMs, chains, agents, tools, and retrievers are logged to LangSmith.\n\nHere's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { LlamaV2 } from ""langchain/chat_models/llama_v2"";\n\nprocess.env.LANGCHAIN_TRACING_V2 = ""true"";\n\nconst llm = new LlamaV2();\nawait llm.invoke(""Hello, world!"");\n```\n\nRemember to replace `""Hello, world!""` with your desired input to the LLM.",A
How do I use a traceable decorator?,"The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,import the required function, decorate your function, and then call the function. Below is an example:\n```python\nfrom langsmith.run_helpers import traceable\n@traceable(run_type=""chain"") # or ""llm"", etc.\ndef my_function(input_param):\n # Function logic goes here\n return output\nresult = my_function(input_param)\n```","To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules and functions:\n\n```python\nimport datetime\nfrom typing import Any\n\nimport openai\nfrom langsmith.run_helpers import traceable\n```\n\n2. Define your functions or methods that you want to trace. Decorate them with the `traceable` decorator, specifying the `run_type` and `name` parameters:\n\n```python\n@traceable(run_type=""llm"", name=""openai.ChatCompletion.create"")\ndef my_llm(*args: Any, **kwargs: Any) -> dict:\n return openai.ChatCompletion.create(*args, **kwargs)\n\n@traceable(run_type=""tool"")\ndef my_tool(tool_input: str) -> str:\n return tool_input.upper()\n\n@traceable(run_type=""chain"")\ndef my_chain(prompt: str) -> str:\n messages = [\n {\n ""role"": ""system"",\n ""content"": ""You are an AI Assistant. The time is "" + str(datetime.datetime.now()),\n },\n {""role"": ""user"", ""content"": prompt},\n ]\n return my_llm(model=""gpt-3.5-turbo"", messages=messages)\n\n@traceable(run_type=""chain"")\ndef my_chat_bot(text: str) -> str:\n generated = my_chain(text)\n\n if ""meeting"" in generated:\n return my_tool(generated)\n else:\n return generated\n```\n\n3. Call your traced functions as usual. 
The calls will be automatically traced and logged:\n\n```python\nmy_chat_bot(""Summarize this morning's meetings."")\n```\n\nYou can view the example run and the call hierarchy in the LangSmith app by visiting the provided URL.\n\nNote: The `traceable` decorator is used to trace the function calls and generate a run tree. It does not modify the behavior of the functions themselves.","The `traceable` decorator is a convenient way to log your function calls as runs in LangSmith. It automatically captures the inputs, outputs, and metadata of the function and sends them to LangSmith for logging.\n\nTo use the `traceable` decorator, follow these steps:\n\n1. Import the `traceable` decorator from the `langsmith.run_helpers` module.\n2. Decorate your function with `@traceable(run_type=""chain"")`, specifying the `run_type` as ""chain"" (or any other appropriate value).\n3. Call your decorated function as usual.\n\nHere's an example in Python:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type=""chain"")\ndef my_function(input1, input2):\n # Your function logic here\n output = input1 + input2\n return output\n\nresult = my_function(3, 4)\n```\n\nIn this example, the `my_function` is decorated with `@traceable(run_type=""chain"")`. When the function is called, the inputs, outputs, and metadata will be logged as a run in LangSmith.\n\nNote that you need to have the LangSmith SDK installed and configured with your API key and project name for the `traceable` decorator to work properly.",
What's a langsmith dataset?,A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.,"A LangSmith dataset is a collection of input-output examples that are used for testing and evaluating language models, chains, and agents. It is a structured set of data that represents different scenarios and expected outputs.\n\nIn LangSmith, you can create datasets by adding examples from your application runs. Each example consists of an input and the corresponding expected output. These examples can be added at different steps of a nested chain, including end-to-end chains, intermediary chains (such as LLM chains), or individual LLM or Chat Model calls.\n\nBy collecting examples in a dataset, you can create a benchmark for your models and chains. This allows you to test and evaluate the performance of your application against known scenarios and expected outputs. Datasets are particularly useful for comparing different versions of your chains and models, as well as for identifying and addressing issues or failures in your application.\n\nLangSmith provides features to easily add examples to datasets, edit examples before adding them, and visualize the dataset for analysis and evaluation.","A LangSmith dataset refers to a collection of data that is used for training, testing, or evaluating language models within the LangSmith framework. Datasets in LangSmith typically consist of text data that is used to train language models or evaluate their performance.\n\nLangSmith provides a variety of built-in datasets that you can use, such as text classification datasets, question-answering datasets, sentiment analysis datasets, and more. These datasets are designed to be easily accessible and compatible with the LangSmith framework.\n\nIn addition to the built-in datasets, you can also create your own custom datasets in LangSmith. 
This allows you to tailor the data to your specific application or use case.\n\nOverall, LangSmith datasets play a crucial role in training and evaluating language models, enabling you to build powerful and accurate language-based applications.",A
How might I query for all runs in a project?,"client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})","To query for all runs in a project, you can use the LangSmith client's `list_runs` method. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ projectName: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nIn both examples, replace `<your_project>` with the name of your project. This will retrieve all runs in the specified project.","To query for all runs in a project using the LangSmith SDK, you can use the `listRuns` method of the `Client` class. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ projectName: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nIn both examples, replace `<your_project>` with the name of your project. This will retrieve all runs in the specified project.",
What is LangChain?,LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.,"LangChain is an open-source framework for building applications using large language models (LLMs). It provides a set of tools and libraries that simplify the development, testing, and evaluation of LLM-based applications. LangChain seamlessly integrates with various LLM frameworks, allowing developers to leverage the power of LLMs in their applications.\n\nWith LangChain, you can easily create LLM chains, which are sequences of LLM calls that work together to achieve a specific task. You can also build intelligent agents that interact with users and make use of LLMs for natural language understanding and generation. LangChain provides features for debugging, testing, and monitoring LLM applications, making it easier to identify and fix issues.\n\nLangChain is developed by LangChain, the company behind LangSmith, a platform for building production-grade LLM applications. LangSmith provides additional features and capabilities for debugging, testing, and evaluating LLM applications built with LangChain.","LangChain is a framework for building applications using large language models. It provides a set of tools and libraries that make it easier to interact with language models and integrate them into your applications. With LangChain, you can leverage the power of language models to perform tasks such as text generation, translation, summarization, and more. LangChain supports various programming languages and provides an API for seamless integration with your applications.",B


## Conclusion

In this walkthrough, you compared two versions of a RAG Q&A chain by predicting preference scores for each pair of predictions.
Pairwise comparison is one way to compare two versions of a chain automatically, and it can surface differences that aggregate benchmark metrics miss.

There are many related ways to evaluate preferences! Here, we used binary choices to compare the two models and evaluated each pair only once, but you may get better results by trying one of the following approaches:

- Evaluate multiple times in each position and return a win rate
- Ensemble evaluators
- Instruct the model to output continuous scores
- Instruct the model to use a prompting strategy other than chain of thought
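
As a rough illustration of the first suggestion, the sketch below judges every pair in both orderings and reports a win rate for each variant. It is a minimal sketch, not the evaluator used in this walkthrough: `pairwise_win_rates`, the judge signature, and the toy `longer_answer_judge` are all hypothetical names introduced here. In practice the judge would wrap a call to an LLM-based pairwise evaluator.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge signature: (question, first_answer, second_answer) -> 0 or 1,
# where 0 means the answer shown first is preferred and 1 means the second is.
Judge = Callable[[str, str, str], int]


def pairwise_win_rates(
    judge: Judge,
    examples: Iterable[Tuple[str, str, str]],
    trials_per_position: int = 1,
) -> Dict[str, float]:
    """Judge each (question, answer_a, answer_b) triple in both orderings.

    Swapping the positions means a judge that favors whichever answer it
    sees first contributes equally to both variants instead of skewing one.
    """
    wins: Counter = Counter()
    total = 0
    for question, answer_a, answer_b in examples:
        for _ in range(trials_per_position):
            # Ordering 1: variant A is shown first.
            wins["A" if judge(question, answer_a, answer_b) == 0 else "B"] += 1
            # Ordering 2: variant B is shown first.
            wins["B" if judge(question, answer_b, answer_a) == 0 else "A"] += 1
            total += 2
    if total == 0:
        return {"A": 0.0, "B": 0.0}
    return {"A": wins["A"] / total, "B": wins["B"] / total}


def longer_answer_judge(question: str, first: str, second: str) -> int:
    """Toy stand-in for an LLM judge: prefers the longer answer.

    Ties go to the first position, which mimics a positional bias.
    """
    return 0 if len(first) >= len(second) else 1


# Made-up examples for illustration only.
examples = [
    ("What is LangChain?", "A framework for building LLM applications.", "A framework."),
    ("What's a dataset?", "Examples.", "A collection of examples with inputs and optional outputs."),
    ("Can I trace my LLM?", "Yes.", "Yep."),
]

rates = pairwise_win_rates(longer_answer_judge, examples)
print(rates)
```

In the third example the two answers are the same length, so the toy judge picks whichever comes first. Because each ordering is judged once, that tie is split evenly between the variants rather than being credited to one of them. Raising `trials_per_position` only helps when the judge is non-deterministic, as a sampled LLM would be.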

For more information on measuring the reliability of this and other approaches, you can check out the [evaluation examples](https://python.langchain.com/docs/guides/evaluation/examples) in the LangChain documentation.