# Comparing Q&A System Outputs

The most common way to compare two models is to benchmark them both on a dataset and compare the aggregate metrics.

This approach is useful, but aggregate scores can hide meaningful differences in quality between the two system variants. In such cases, it helps to compare the responses pairwise and take the resulting preference scores into account.
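
To make this concrete, here is a minimal sketch using made-up scores (none of these numbers come from the runs in this tutorial). Two hypothetical systems have the same mean correctness, yet a pairwise judge prefers one of them on most examples.

```python
from statistics import mean

# Hypothetical per-example correctness scores (1 = correct, 0 = incorrect).
correctness_a = [1, 1, 0, 1, 1]
correctness_b = [1, 1, 1, 0, 1]

# Hypothetical pairwise judgments on the same five examples.
preferences = ["B", "B", "B", "A", "B"]

mean_a = mean(correctness_a)
mean_b = mean(correctness_b)
win_rate_b = preferences.count("B") / len(preferences)

print(f"mean correctness: A={mean_a:.2f}, B={mean_b:.2f}")
print(f"B preferred on {win_rate_b:.0%} of examples")
```

The aggregate metric cannot separate the two systems here, while the preference scores can.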

In this tutorial, we will share one way to do this in code. We will use a retrieval Q&A system over LangSmith's docs as a motivating example.

The main steps are:

1. Setup
   - Create a dataset of questions and answers.
   - Define different versions of your chains to evaluate.
   - Evaluate chains directly on a dataset using regular metrics (e.g. correctness).
2. Evaluate the pairwise preferences over that dataset.

In this case, we will test the impact of chunk sizes on our result quality. Let's begin!

## Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure the tracing and evals are set up for [LangSmith](https://smith.langchain.com), please configure your API Key appropriately.

We will also use pandas to render the results in the notebook.

In [1]:
# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

Install the required packages. `lxml` and `html2text` are used by the document loader.

In [2]:
# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet
# %pip install nest_asyncio --quiet

In [3]:
# %env OPENAI_API_KEY=<YOUR-API-KEY>

In [4]:
# Used for running in jupyter
import nest_asyncio

nest_asyncio.apply()

## 1. Setup

#### a. Create a dataset

No evaluation process is complete without a development dataset. We've hard-coded a few examples below to demonstrate the process. In general, you'll want a lot more (>100) pairs for statistically significant results. Drawing from actual user queries can be helpful to ensure better representation of the domain.
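
As a rough illustration of why dataset size matters, the sketch below computes an approximate 95% Wilson score interval for an observed accuracy at two sample sizes. The counts are hypothetical and the helper `wilson_interval` is written here for illustration; it is not part of LangSmith or LangChain.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Roughly the same observed accuracy (about 71%) at two dataset sizes.
small = wilson_interval(5, 7)
large = wilson_interval(71, 100)
print(f"n=7:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=100: {large[0]:.2f} to {large[1]:.2f}")
```

With only seven examples the interval is wide enough that most differences between two systems would be indistinguishable from noise.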

In [5]:
examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the langsmith python SDK. To use it, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]

In [6]:
from langsmith import Client

client = Client()

In [7]:
dataset_name = "Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)
for q, a in examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

#### b. Define RAG Q&A system

Our Q&A system uses a simple retriever and LLM response generator. To break that down further, the chain will be composed of:

1. A [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.base.VectorStoreRetriever.html#langchain.vectorstores.base.VectorStoreRetriever) to retrieve documents. This uses:
   - An embedding model to vectorize documents and user queries for retrieval. In this case, the [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) model.
   - A vectorstore. In this case, we will use [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma).
2. A response generator. This uses:
   - A [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html#langchain.prompts.chat.ChatPromptTemplate) to combine the query and documents. 
   - An LLM, in this case, the 16k token context window version of `gpt-3.5-turbo` via [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html#langchain.chat_models.openai.ChatOpenAI).

We will combine them using LangChain's [expression syntax](https://python.langchain.com/docs/guides/expression_language/cookbook).

First, load the documents to populate the vectorstore:

In [8]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)

def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embeddings)
    return vectorstore.as_retriever(search_kwargs={"k": 4})



Next up, we'll define the chain. Since we are going to vary the retriever parameters, our constructor will
take the retriever as an argument.

In [9]:
from datetime import datetime
from operator import itemgetter

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser


def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
                ("system", "{context}"),
                ("human","{question}")
            ]
        ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = (
        prompt 
        | model 
        | StrOutputParser()
    )
    chain = (
        # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | response_generator
    )
    return chain
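
The map step at the start of the chain can be hard to read at first. As a plain-Python sketch of what it computes (the `Doc` class, `toy_retriever`, and `build_prompt_inputs` are hypothetical stand-ins, not LangChain APIs), it takes the input dictionary, retrieves documents for the question, joins their text into a single context string, and passes the question through unchanged:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Minimal stand-in for a retrieved document; only `page_content` is modeled.
    page_content: str

def toy_retriever(question):
    # Hypothetical retriever that ignores the question and returns fixed documents.
    return [Doc("LangSmith traces LLM calls."), Doc("Datasets hold examples.")]

def build_prompt_inputs(inputs, retriever):
    """Mirror the map step: fetch documents for the question and pass the question through."""
    question = inputs["question"]
    docs = retriever(question)
    return {
        "context": "\n".join(doc.page_content for doc in docs),
        "question": question,
    }

prompt_inputs = build_prompt_inputs({"question": "What is a dataset?"}, toy_retriever)
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template expects, `context` and `question`.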

With the documents prepared, and the chain constructor ready, it's time to create and evaluate our chains.
We will vary the split size and overlap to evaluate its impact on the response quality.
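
To build intuition for the two parameters, here is a simplified sketch of overlapping windows over a token list. It is an assumption-laden stand-in, not the `TokenTextSplitter` implementation: it works on a list of integers rather than real tokens, and `split_tokens` is a hypothetical helper.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks

tokens = list(range(10))  # stand-in for token ids
large_chunks = split_tokens(tokens, chunk_size=6, chunk_overlap=2)
small_chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=1)
print(large_chunks)
print(small_chunks)
```

Smaller chunks yield more, narrower documents for the retriever to choose from, which is the trade-off the two chains below are meant to probe.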

In [10]:
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
retriever = create_retriever(transformed, text_splitter)

chain_1 = create_chain(retriever)

In [11]:
# We will shrink both the chunk size and overlap
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
retriever_2 = create_retriever(transformed, text_splitter_2)

chain_2 = create_chain(retriever_2)

#### c. Evaluate the chains

At this point, we are still going through the regular development -> evaluation process. We have two candidates and will evaluate them with a correctness evaluator from LangChain. By running `run_on_dataset`, we will generate predicted answers to each question in the dataset and log feedback from the evaluator for that data point.
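
Conceptually, this step is a loop: run the system on each example, score the prediction against the reference, and aggregate. The sketch below shows that shape with hypothetical stand-ins (`toy_system`, `exact_match`, and `run_and_score` are invented for illustration). It does not reproduce how `run_on_dataset` is implemented, and the real evaluator is an LLM rather than a string comparison.

```python
def run_and_score(examples, system, evaluator):
    """Run `system` on each question and score its prediction against the reference."""
    rows = []
    for question, reference in examples:
        prediction = system(question)
        rows.append({
            "question": question,
            "prediction": prediction,
            "score": evaluator(prediction, reference),
        })
    return rows

def toy_system(question):
    # Hypothetical system that only knows one answer.
    answers = {"What is 2 + 2?": "4"}
    return answers.get(question, "I don't know")

def exact_match(prediction, reference):
    # Stand-in evaluator: 1.0 for an exact string match, otherwise 0.0.
    return 1.0 if prediction.strip() == reference.strip() else 0.0

toy_examples = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
rows = run_and_score(toy_examples, toy_system, exact_match)
average = sum(row["score"] for row in rows) / len(rows)
print(f"average score: {average:.2f}")
```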

In [12]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator
    evaluators=["cot_qa"],
)

In [13]:
results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_1,
    evaluation=eval_config
)
project_name = results["project_name"]

View the evaluation results for project '07958c041b4c4040bf4916693c4db420-RunnableSequence' at:
https://smith.langchain.com/projects/p/1431858d-e481-4759-bae0-91449079957f?eval=true


In [14]:
results_2 = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_2,
    evaluation=eval_config
)
project_name_2 = results_2["project_name"]

View the evaluation results for project 'd038d5a8113e4c48a616fd407f548b16-RunnableSequence' at:
https://smith.langchain.com/projects/p/5285a9fc-f719-4ffb-98de-ee3a642a831d?eval=true


Now you should have two test run projects over the same dataset. If you click on one, it should look something like the following:
    
![Original Feedback](img/original_eval.png)

You can view the aggregate results for this project and for the other one to compare them. You could also view them in a dataframe:

In [15]:
# Helper function to wrap the results in the dataframe table
from IPython.core.display import HTML

def word_wrap_on_hover(df):
    styles = """
    <style>
        .hover_table td {
            max-width: 200px; /* You can adjust this value */
            overflow: hidden;
            text-overflow: ellipsis;
            white-space: nowrap;
        }
        .hover_table td:hover {
            white-space: normal;
            word-wrap: break-word;
        }
    </style>
    """
    html_table = df.to_html(index=False, classes='hover_table')
    return HTML(styles + html_table)

In [17]:
import pandas as pd

runs_1 = list(client.list_runs(project_name=project_name, execution_order=1))
runs_2 = list(client.list_runs(project_name=project_name_2, execution_order=1))

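def get_project_df(runs):
    # One row per run: the chain output plus the average score for each feedback key,
    # indexed by the dataset example the run was evaluated on.
    rows = [
        {**run.outputs, **{k: v.get('avg') for k, v in (run.feedback_stats or {}).items()}}
        for run in runs
    ]
    return pd.DataFrame(rows, index=[run.reference_example_id for run in runs])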

runs_1_df = get_project_df(runs_1)
runs_2_df = get_project_df(runs_2)
joined_df = runs_1_df.join(runs_2_df, lsuffix='_1', rsuffix='_2')
columns_1 = [col for col in joined_df.columns if col.endswith('_1')]
columns_2 = [col for col in joined_df.columns if col.endswith('_2')]
new_columns_order = [col for pair in zip(columns_1, columns_2) for col in pair]
joined_df = joined_df[new_columns_order]

In [18]:
word_wrap_on_hover(joined_df)

output_1,output_2,COT Contextual Accuracy_1,COT Contextual Accuracy_2
"LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of setting up and using LLMs, such as OpenAI's GPT models, by providing a set of tools and utilities. LangChain aims to bridge the gap between initial setup and production-level performance of LLM applications.\n\nWith LangChain, you can easily trace and debug LLM calls, visualize the inputs and outputs of LLMs, modify prompts in a playground environment, understand the sequence of events in complex chains and agents, track latency and token usage, and collaborate with colleagues for debugging. LangChain also allows you to collect examples and create datasets for testing and evaluation, and provides evaluators for common evaluation scenarios. Additionally, LangChain supports monitoring and logging of application runs, and enables exporting datasets for use in other contexts.\n\nOverall, LangChain helps developers build reliable and high-quality LLM applications by providing a range of features and tools for tracing, debugging, testing, and evaluating LLMs and chains.","LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of developing and deploying LLM applications by providing tools and features to enhance reliability, debugging, testing, evaluation, and monitoring.\n\nLangChain is designed to bridge the gap between the initial setup of LLMs and their reliable use in production. It offers a user-friendly interface and various functionalities to make working with LLMs more efficient and effective.\n\nSome key features of LangChain include:\n\n1. Tracing: LangChain provides tracing capabilities that log all calls to LLMs, chains, agents, tools, and retrievers. This helps in debugging unexpected results, identifying looping agents, analyzing chain performance, and tracking token usage.\n\n2. Debugging: LangChain offers tools to debug LLMs, chains, and agents. 
It provides visibility into the exact inputs and outputs of LLM calls, allowing you to understand the constructed input string and the structure of the output. It also includes a playground feature where you can modify prompts and observe the resulting changes to the output.\n\n3. Sequence Visualization: For complex chains and agents, LangChain's tracing feature provides a visualization of the sequence of events, including the order of calls, inputs, and outputs. This helps in understanding the flow and interactions within the chain or agent.\n\n4. Performance Analysis: LangChain allows you to track the latency of each step in a chain, helping you identify and optimize the slowest components. It also provides insights into token usage, making it easier to identify costly parts of the chain.\n\n5. Collaborative Debugging: LangChain enables easy sharing of chains and LLM runs with colleagues for collaborative debugging. It includes a ""Share"" button that generates a shared link, allowing others to access and analyze the chain or run.\n\n6. Dataset Collection: LangChain facilitates the collection of examples for debugging and testing purposes. It includes an ""Add to Dataset"" button for each run, allowing you to add input/output examples to a chosen dataset. This helps in benchmarking changes and testing the application against known issues.\n\n7. Testing and Evaluation: LangChain simplifies the process of testing changes to prompts or chains using datasets. It allows you to run chains over data points, visualize the outputs, and assign feedback to runs. It also provides evaluators to analyze the results and guide your attention to specific examples.\n\n8. Monitoring: LangChain can be used to monitor LLM applications in production. 
It provides tracing, latency, and token usage statistics, allowing you to troubleshoot issues and ensure the smooth functioning of your application.\n\nOverall, LangChain aims to enhance the development, debugging, testing, evaluation, and monitoring of LLM applications, making it easier to build reliable and efficient language-based systems.",1.0,1.0
"To query for all runs in a project, you can use the LangSmith SDK or the LangSmith REST API. Here's an example of how to do it using the LangSmith SDK in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""your_project_name""))\n```\n\nIn this example, replace `""your_project_name""` with the name of your project. The `list_runs` method returns a generator that you can iterate over to get all the runs in the project.\n\nIf you prefer to use the LangSmith REST API directly, you can make a GET request to the `/runs` endpoint with the appropriate project name. Here's an example using `curl`:\n\n```bash\ncurl -X GET ""https://api.smith.langchain.com/runs?project_name=your_project_name"" \\n -H ""Authorization: Bearer your_api_key""\n```\n\nReplace `""your_project_name""` with the name of your project and `""your_api_key""` with your actual API key.\n\nBoth methods will return a list of runs in the specified project.","To query for all runs in a project using the LangSmith SDK, you can use the `listRuns` method of the `Client` class. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ project: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nIn both examples, replace `<your_project>` with the name of your project. This will retrieve all runs in the specified project. You can further filter the runs by using additional parameters in the `listRuns` method, such as `startTime`, `endTime`, or `runType`.",0.0,1.0
"A LangSmith dataset is a collection of examples that can be used to evaluate or improve a chain, agent, or model. It consists of rows, where each row represents an example and contains the inputs and (optionally) the expected outputs for a given interaction. Datasets in LangSmith can be created from existing runs, uploaded as CSV files, or manually created using the LangSmith client. Datasets are useful for testing and evaluating models, monitoring performance, and exporting data for use in other contexts.","A LangSmith dataset is a collection of input-output examples that are used for testing, evaluation, and training purposes in the LangSmith framework. Datasets in LangSmith are created by adding examples from the execution of LLM chains or individual LLM/Chat Model calls.\n\nEach example in a dataset consists of an input and an expected output. The input represents the prompt or query given to the language model, and the expected output is the desired response or result. These examples help in benchmarking and evaluating the performance of language models and chains.\n\nLangSmith datasets can be used for various purposes, including:\n\n1. Testing: Datasets can be used to test the performance of LLM chains or individual language models by running the examples and comparing the actual outputs with the expected outputs.\n\n2. Evaluation: Datasets can be used to evaluate the quality and accuracy of language models by measuring metrics such as BLEU score, perplexity, or other custom evaluation criteria.\n\n3. Training: Datasets can be used as training data for fine-tuning language models or building custom models. The examples in the dataset provide supervised learning signals for training the models.\n\n4. 
Exporting: Datasets can be exported from LangSmith for use in other contexts, such as OpenAI Evals or fine-tuning with external frameworks like FireworksAI.\n\nBy curating and utilizing datasets in LangSmith, developers can improve the reliability and performance of their language model applications.",1.0,1.0
"To use the `traceable` decorator, you need to import it from the `langsmith.run_helpers` module. The `traceable` decorator allows you to easily log the execution of a function as a run in LangSmith.\n\nHere's an example of how to use the `traceable` decorator:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type=""llm"", name=""openai.ChatCompletion.create"")\ndef my_llm(*args, **kwargs):\n # Your function logic here\n pass\n```\n\nIn the example above, the `traceable` decorator is applied to the `my_llm` function. The `run_type` parameter specifies the type of the run (e.g., ""llm"", ""tool"", ""chain""), and the `name` parameter specifies the name of the run. You can customize these parameters based on your specific use case.\n\nOnce the `traceable` decorator is applied, any calls to the `my_llm` function will be logged as runs in LangSmith. You can view and analyze these runs in the LangSmith web app.\n\nNote that the `traceable` decorator works for both synchronous and asynchronous functions. If you're using an asynchronous function, make sure to use the `await` keyword when calling the function.\n\nFor more information on using the `traceable` decorator and other tracing features, you can refer to the LangSmith documentation on tracing.","To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules:\n```python\nfrom langsmith.run_helpers import traceable\n```\n\n2. Define your function and decorate it with the `traceable` decorator:\n```python\n@traceable(run_type=""llm"")\ndef my_function(input_param1, input_param2):\n # Function logic goes here\n return output\n```\n\n3. Use the decorated function as usual:\n```python\nresult = my_function(""input1"", ""input2"")\n```\n\nThe `traceable` decorator adds tracing capabilities to your function. It automatically logs the inputs, outputs, and other relevant information of the function call. 
The `run_type` parameter specifies the type of the run, such as ""llm"" (for language model) or ""chain"" (for chain of functions).\n\nBy using the `traceable` decorator, you can easily trace and monitor the execution of your functions, which can be helpful for debugging, testing, and monitoring purposes.",1.0,1.0
"Yes, you can trace your Llama V2 LLM using LangSmith. Tracing allows you to log the inputs, outputs, and other important information of your LLM runs to LangSmith for visualization and debugging purposes.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LLM code. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nMake sure you have the LangSmith SDK installed and your API key and project name configured correctly. The traces of your LLM runs will be saved to the specified project in LangSmith, where you can visualize and analyze them.\n\nNote that tracing is asynchronous, so it won't block the execution of your LLM code. If you want to ensure that all traces are submitted before exiting your application, you can use the `wait_for_all_tracers()` function provided by the LangSmith SDK. Here's an example:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\nfrom langchain.callbacks.tracers.langsmith import wait_for_all_tracers\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\ntry:\n llm.invoke(""Hello, world!"")\nfinally:\n wait_for_all_tracers()\n```\n\nThis will wait for all traces to be submitted before exiting your application.\n\nRemember to replace `<your_project>` with the actual name of your project in LangSmith.","Yes, you can trace your Llama V2 LLM using LangSmith's tracing feature. Tracing allows you to log the runs of your LLM application and inspect the inputs and outputs of each component in the chain. To enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your application. 
Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nMake sure to replace `""Hello, world!""` with your desired input. With tracing enabled, the runs of your Llama V2 LLM will be logged to LangSmith, allowing you to analyze and debug your application.",1.0,1.0
"Setting environment variables is a common practice in software development to configure the behavior of applications or libraries. In the case of LangSmith, setting environment variables is necessary to enable tracing and logging of run traces to LangSmith.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, LangChain components such as LLMs, chains, agents, tools, and retrievers will automatically log their calls and outputs to LangSmith. This allows you to easily debug and monitor the behavior of your application, track inputs and outputs, analyze latency and token usage, and collect examples for testing and evaluation.\n\nSetting environment variables provides a convenient and consistent way to enable tracing across different programming languages and environments. It ensures that all LangChain components within your application are automatically traced without the need for manual configuration in each component.\n\nHowever, it's important to note that setting environment variables may vary depending on your programming language and environment. The specific steps to set environment variables can be found in the documentation or resources for your programming language or development environment.","Setting environment variables is a common practice in software development for configuring and customizing applications. In the context of logging traces to LangSmith, setting environment variables allows you to specify the necessary information, such as the API key and project name, without hardcoding them in your code. 
This provides flexibility and security, as you can easily change the configuration without modifying your code.\n\nBy using environment variables, you can separate the configuration from the codebase, making it easier to manage different environments (e.g., development, staging, production) and share the code with others without exposing sensitive information.\n\nAdditionally, environment variables are a standard way to pass configuration information to applications, and many frameworks and libraries support reading configuration values from environment variables out of the box.",1.0,1.0
"Currently, LangSmith does not support project migration between organizations. If you want to move your project to a different organization, the recommended approach is to create a new project within the desired organization and manually transfer the necessary data, such as runs and datasets, to the new project.\n\nYou can achieve this by using the LangSmith SDK or API to export the data from the original project and import it into the new project. Here are the general steps you can follow:\n\n1. Export the data from the original project: Use the LangSmith SDK or API to retrieve the runs, datasets, and any other relevant data from the original project. You can filter the data based on your requirements.\n\n2. Save the exported data: Store the exported data in a suitable format, such as JSON or CSV files, for easy transfer to the new project.\n\n3. Create a new project in the desired organization: Log in to LangSmith, navigate to the Organizations page, select the desired organization, and create a new project within that organization.\n\n4. Import the data into the new project: Use the LangSmith SDK or API to import the data from the exported files into the new project. This may involve creating new runs, uploading datasets, or any other necessary actions to replicate the data from the original project.\n\nBy following these steps, you can effectively move your project from one organization to another. Remember to update any references or configurations in your application to use the new project in the new organization.\n\nIf you encounter any issues or need further assistance, feel free to reach out to the LangSmith support team at support@langchain.dev.","Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. Here's how you can do it:\n\n1. Export Runs: Use the SDK to export the runs from your current organization. 
You can find examples of exporting runs in the LangSmith SDK documentation.\n\n2. Import Runs: Create a new project within your desired organization and use the SDK to import the exported runs into the new project. Again, you can refer to the LangSmith SDK documentation for examples of importing runs.\n\n3. Export Datasets: Similarly, export the datasets from your current organization using the SDK.\n\n4. Import Datasets: Create the necessary datasets in the new project within your desired organization and use the SDK to import the exported datasets.\n\nBy following these steps, you can migrate your project from one organization to another. However, please note that this process may take some time and effort. If you have a large number of runs or datasets, it may be faster to create a new project within your desired organization and start fresh.\n\nIf you have any further questions or encounter any issues during the migration process, please reach out to LangSmith support at support@langchain.dev for assistance.",1.0,1.0


It looks like the benchmark performance is similar, so let's move on to the pairwise comparison.

## 2. Pairwise Evaluation

Suppose both approaches return similar scores when evaluated in isolation.

We can run a pairwise evaluator to try to predict which output is preferred. We will first define a few helper functions to run the evaluator
on each prediction pair. Let's break down the main function, `predict_preference`:

- The function accepts a dataset example and loads each model's predictions on that data point.
- It then randomizes the order of the predictions and calls the evaluator. This is done to average out the impact of any ordering bias in the evaluator LLM.
- Once the evaluation result is returned, we check it to make sure it is valid and then log feedback for both models.

Once this is complete, the values are all returned so we can display them in a table in the notebook below. 
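
To see why the random ordering matters, consider a deliberately extreme, hypothetical judge that always prefers whichever response is shown first. The simulation below is a sketch under that assumption and uses no LLM. With a fixed order the first model wins every time, while a coin flip per example spreads the bias evenly across both models.

```python
import random

def first_position_judge(response_a, response_b):
    # Hypothetical, maximally biased judge: always prefers the response shown first.
    return "A"

def win_rate_for_model_1(n_examples, shuffle, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_examples):
        pair = [("model_1", "response 1"), ("model_2", "response 2")]
        # Flip a coin to decide which model is presented in position A.
        if shuffle and rng.random() < 0.5:
            pair.reverse()
        verdict = first_position_judge(pair[0][1], pair[1][1])
        winner = pair[0][0] if verdict == "A" else pair[1][0]
        wins += winner == "model_1"
    return wins / n_examples

fixed_order = win_rate_for_model_1(1000, shuffle=False)
random_order = win_rate_for_model_1(1000, shuffle=True)
print(f"fixed order:  {fixed_order:.2f}")
print(f"random order: {random_order:.2f}")
```

Randomizing does not remove the judge's bias on any single example. It only keeps the bias from systematically favoring one model in the aggregate.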

In [19]:
import random
import logging

def _get_run_and_prediction(example_id, project_name):
    run = next(client.list_runs(reference_example_id=example_id, project_name=project_name))
    prediction = next(iter(run.outputs.values()))
    return run, prediction

def _log_feedback(run_ids):
    # run_ids is ordered (runner-up, preferred): the runner-up is scored 0 and the preferred run 1
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)

def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    input_, answer = example.inputs['question'], example.outputs['answer']
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Flip a coin to average out persistent positional bias
    if random.random() < 0.5:
        result['A'], result['B'] = result['B'], result['A']
        run_a, run_b = run_b, run_a
    try:
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result['A'],
            prediction_b=result['B'],
            input=input_, 
            reference=answer
        )
    except Exception as e:
        logging.warning(e)
        return result

    if eval_res["value"] is None:
        return result

    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    _log_feedback((runner_up_run[0], preferred_run[0]))
    result["Preferred"] = preferred_run[1]
    return result


For this example, we will use LangChain's off-the-shelf `labeled_pairwise_string` evaluator. By default, it instructs the evaluation LLM to choose the preferred output based on helpfulness, relevance, correctness, and depth of thought. In your case, you will likely want to customize the criteria used!

For more information on how to configure it, check out the [Labeled Pairwise String Evaluator](https://python.langchain.com/docs/guides/evaluation/comparison/labeled_pairwise_string) documentation and inspect the resulting traces when running this notebook.
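
As a rough illustration of customizing the criteria, the sketch below passes a `criteria` dictionary when loading the evaluator. The criterion names and descriptions are hypothetical placeholders, and the `criteria` keyword is assumed to behave as described in the evaluator documentation linked above, so check it against your installed LangChain version.

```python
# Hypothetical criteria for a docs Q&A bot; the names and wording are illustrative only.
custom_criteria = {
    "accuracy": "Is the submission consistent with the reference answer?",
    "conciseness": "Does the submission avoid unnecessary detail?",
}


def build_custom_evaluator(criteria=None):
    """Load a labeled pairwise evaluator that judges on the given criteria."""
    # Imported lazily so the criteria can be inspected without LangChain installed.
    from langchain.evaluation import load_evaluator

    # Assumption: load_evaluator forwards `criteria` to the pairwise eval chain.
    return load_evaluator(
        "labeled_pairwise_string", criteria=criteria or custom_criteria
    )
```

Calling `build_custom_evaluator()` would return an object that could be passed as `eval_chain` in place of the default evaluator; it needs an OpenAI API key, like the rest of this notebook.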


In [20]:
from langchain.evaluation import load_evaluator

pairwise_evaluator = load_evaluator("labeled_pairwise_string")

In [21]:
import functools
from langchain.schema.runnable import RunnableLambda


eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# Wrap the function in a RunnableLambda to take advantage of its default `batch` convenience method
runnable = RunnableLambda(eval_func)

In [22]:
examples = list(client.list_examples(dataset_name="Retrieval QA Questions"))
values = runnable.batch(examples)

By running the function above, the "preference" feedback was automatically logged to the test projects you created in step 3. Below is a view of the same test run as before with the preference scores added. This model seems to be less preferred than the other! 

![Preference Tags](img/with_preferences.png)

The `predict_preference` function we wrote above is set up to not log feedback in the case of a tie, meaning some of the examples do not have a corresponding preference score. You can adjust this behavior as you see fit. 
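
If you would rather record ties, one option is to give both runs a neutral score. The helper below is a minimal sketch of that idea; `preference_scores` and the `tie_score` default of 0.5 are illustrative choices, not part of LangSmith or LangChain.

```python
def preference_scores(verdict, run_id_a, run_id_b, tie_score=0.5):
    """Map an evaluator verdict to a {run_id: score} dictionary.

    verdict is "A", "B", or None (a tie), matching eval_res["value"] above.
    """
    if verdict is None:
        # Tie: both runs receive the same neutral score.
        return {run_id_a: tie_score, run_id_b: tie_score}
    if verdict == "A":
        return {run_id_a: 1, run_id_b: 0}
    if verdict == "B":
        return {run_id_a: 0, run_id_b: 1}
    raise ValueError(f"Unexpected verdict: {verdict!r}")


# Example with placeholder run IDs.
print(preference_scores(None, "run-1", "run-2"))
```

Each `(run_id, score)` pair in the returned dictionary could then be logged with `client.create_feedback(run_id, key="preference", score=score)`, as `_log_feedback` does.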

You can also view the feedback results for the other test run in the app to see how well the evaluator's results match your expectations.

In [23]:
df = pd.DataFrame(values)
word_wrap_on_hover(df)

input,answer,A,B,Preferred
How do I move my project between organizations?,LangSmith doesn't directly support moving projects between organizations.,"Currently, LangSmith does not support project migration between organizations. If you want to move your project to a different organization, the recommended approach is to create a new project within the desired organization and manually transfer the necessary data, such as runs and datasets, to the new project.\n\nYou can achieve this by using the LangSmith SDK or API to export the data from the original project and import it into the new project. Here are the general steps you can follow:\n\n1. Export the data from the original project: Use the LangSmith SDK or API to retrieve the runs, datasets, and any other relevant data from the original project. You can filter the data based on your requirements.\n\n2. Save the exported data: Store the exported data in a suitable format, such as JSON or CSV files, for easy transfer to the new project.\n\n3. Create a new project in the desired organization: Log in to LangSmith, navigate to the Organizations page, select the desired organization, and create a new project within that organization.\n\n4. Import the data into the new project: Use the LangSmith SDK or API to import the data from the exported files into the new project. This may involve creating new runs, uploading datasets, or any other necessary actions to replicate the data from the original project.\n\nBy following these steps, you can effectively move your project from one organization to another. Remember to update any references or configurations in your application to use the new project in the new organization.\n\nIf you encounter any issues or need further assistance, feel free to reach out to the LangSmith support team at support@langchain.dev.","Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. 
Here's how you can do it:\n\n1. Export Runs: Use the SDK to export the runs from your current organization. You can find examples of exporting runs in the LangSmith SDK documentation.\n\n2. Import Runs: Create a new project within your desired organization and use the SDK to import the exported runs into the new project. Again, you can refer to the LangSmith SDK documentation for examples of importing runs.\n\n3. Export Datasets: Similarly, export the datasets from your current organization using the SDK.\n\n4. Import Datasets: Create the necessary datasets in the new project within your desired organization and use the SDK to import the exported datasets.\n\nBy following these steps, you can migrate your project from one organization to another. However, please note that this process may take some time and effort. If you have a large number of runs or datasets, it may be faster to create a new project within your desired organization and start fresh.\n\nIf you have any further questions or encounter any issues during the migration process, please reach out to LangSmith support at support@langchain.dev for assistance.",A
Why do I have to set environment variables?,"Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.","Setting environment variables is a common practice in software development for configuring and customizing applications. In the context of logging traces to LangSmith, setting environment variables allows you to specify the necessary information, such as the API key and project name, without hardcoding them in your code. This provides flexibility and security, as you can easily change the configuration without modifying your code.\n\nBy using environment variables, you can separate the configuration from the codebase, making it easier to manage different environments (e.g., development, staging, production) and share the code with others without exposing sensitive information.\n\nAdditionally, environment variables are a standard way to pass configuration information to applications, and many frameworks and libraries support reading configuration values from environment variables out of the box.","Setting environment variables is a common practice in software development to configure the behavior of applications or libraries. In the case of LangSmith, setting environment variables is necessary to enable tracing and logging of run traces to LangSmith.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, LangChain components such as LLMs, chains, agents, tools, and retrievers will automatically log their calls and outputs to LangSmith. 
This allows you to easily debug and monitor the behavior of your application, track inputs and outputs, analyze latency and token usage, and collect examples for testing and evaluation.\n\nSetting environment variables provides a convenient and consistent way to enable tracing across different programming languages and environments. It ensures that all LangChain components within your application are automatically traced without the need for manual configuration in each component.\n\nHowever, it's important to note that setting environment variables may vary depending on your programming language and environment. The specific steps to set environment variables can be found in the documentation or resources for your programming language or development environment.",B
Can I trace my Llama V2 llm?,"So long as you are using one of LangChain's LLM implementations, all your calls can be traced","Yes, you can trace your Llama V2 LLM using LangSmith. Tracing allows you to log the inputs, outputs, and other important information of your LLM runs to LangSmith for visualization and debugging purposes.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LLM code. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nMake sure you have the LangSmith SDK installed and your API key and project name configured correctly. The traces of your LLM runs will be saved to the specified project in LangSmith, where you can visualize and analyze them.\n\nNote that tracing is asynchronous, so it won't block the execution of your LLM code. If you want to ensure that all traces are submitted before exiting your application, you can use the `wait_for_all_tracers()` function provided by the LangSmith SDK. Here's an example:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\nfrom langchain.callbacks.tracers.langsmith import wait_for_all_tracers\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\ntry:\n llm.invoke(""Hello, world!"")\nfinally:\n wait_for_all_tracers()\n```\n\nThis will wait for all traces to be submitted before exiting your application.\n\nRemember to replace `<your_project>` with the actual name of your project in LangSmith.","Yes, you can trace your Llama V2 LLM using LangSmith's tracing feature. Tracing allows you to log the runs of your LLM application and inspect the inputs and outputs of each component in the chain. 
To enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your application. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = LlamaV2()\nllm.invoke(""Hello, world!"")\n```\n\nMake sure to replace `""Hello, world!""` with your desired input. With tracing enabled, the runs of your Llama V2 LLM will be logged to LangSmith, allowing you to analyze and debug your application.",A
How do I use a traceable decorator?,"The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,import the required function, decorate your function, and then call the function. Below is an example:\n```python\nfrom langsmith.run_helpers import traceable\n@traceable(run_type=""chain"") # or ""llm"", etc.\ndef my_function(input_param):\n # Function logic goes here\n return output\nresult = my_function(input_param)\n```","To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules:\n```python\nfrom langsmith.run_helpers import traceable\n```\n\n2. Define your function and decorate it with the `traceable` decorator:\n```python\n@traceable(run_type=""llm"")\ndef my_function(input_param1, input_param2):\n # Function logic goes here\n return output\n```\n\n3. Use the decorated function as usual:\n```python\nresult = my_function(""input1"", ""input2"")\n```\n\nThe `traceable` decorator adds tracing capabilities to your function. It automatically logs the inputs, outputs, and other relevant information of the function call. The `run_type` parameter specifies the type of the run, such as ""llm"" (for language model) or ""chain"" (for chain of functions).\n\nBy using the `traceable` decorator, you can easily trace and monitor the execution of your functions, which can be helpful for debugging, testing, and monitoring purposes.","To use the `traceable` decorator, you need to import it from the `langsmith.run_helpers` module. The `traceable` decorator allows you to easily log the execution of a function as a run in LangSmith.\n\nHere's an example of how to use the `traceable` decorator:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type=""llm"", name=""openai.ChatCompletion.create"")\ndef my_llm(*args, **kwargs):\n # Your function logic here\n pass\n```\n\nIn the example above, the `traceable` decorator is applied to the `my_llm` function. 
The `run_type` parameter specifies the type of the run (e.g., ""llm"", ""tool"", ""chain""), and the `name` parameter specifies the name of the run. You can customize these parameters based on your specific use case.\n\nOnce the `traceable` decorator is applied, any calls to the `my_llm` function will be logged as runs in LangSmith. You can view and analyze these runs in the LangSmith web app.\n\nNote that the `traceable` decorator works for both synchronous and asynchronous functions. If you're using an asynchronous function, make sure to use the `await` keyword when calling the function.\n\nFor more information on using the `traceable` decorator and other tracing features, you can refer to the LangSmith documentation on tracing.",B
What's a langsmith dataset?,A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.,"A LangSmith dataset is a collection of examples that can be used to evaluate or improve a chain, agent, or model. It consists of rows, where each row represents an example and contains the inputs and (optionally) the expected outputs for a given interaction. Datasets in LangSmith can be created from existing runs, uploaded as CSV files, or manually created using the LangSmith client. Datasets are useful for testing and evaluating models, monitoring performance, and exporting data for use in other contexts.","A LangSmith dataset is a collection of input-output examples that are used for testing, evaluation, and training purposes in the LangSmith framework. Datasets in LangSmith are created by adding examples from the execution of LLM chains or individual LLM/Chat Model calls.\n\nEach example in a dataset consists of an input and an expected output. The input represents the prompt or query given to the language model, and the expected output is the desired response or result. These examples help in benchmarking and evaluating the performance of language models and chains.\n\nLangSmith datasets can be used for various purposes, including:\n\n1. Testing: Datasets can be used to test the performance of LLM chains or individual language models by running the examples and comparing the actual outputs with the expected outputs.\n\n2. Evaluation: Datasets can be used to evaluate the quality and accuracy of language models by measuring metrics such as BLEU score, perplexity, or other custom evaluation criteria.\n\n3. Training: Datasets can be used as training data for fine-tuning language models or building custom models. The examples in the dataset provide supervised learning signals for training the models.\n\n4. 
Exporting: Datasets can be exported from LangSmith for use in other contexts, such as OpenAI Evals or fine-tuning with external frameworks like FireworksAI.\n\nBy curating and utilizing datasets in LangSmith, developers can improve the reliability and performance of their language model applications.",B
How might I query for all runs in a project?,"client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})","To query for all runs in a project using the LangSmith SDK, you can use the `listRuns` method of the `Client` class. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ project: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nIn both examples, replace `<your_project>` with the name of your project. This will retrieve all runs in the specified project. You can further filter the runs by using additional parameters in the `listRuns` method, such as `startTime`, `endTime`, or `runType`.","To query for all runs in a project, you can use the LangSmith SDK or the LangSmith REST API. Here's an example of how to do it using the LangSmith SDK in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""your_project_name""))\n```\n\nIn this example, replace `""your_project_name""` with the name of your project. The `list_runs` method returns a generator that you can iterate over to get all the runs in the project.\n\nIf you prefer to use the LangSmith REST API directly, you can make a GET request to the `/runs` endpoint with the appropriate project name. Here's an example using `curl`:\n\n```bash\ncurl -X GET ""https://api.smith.langchain.com/runs?project_name=your_project_name"" \\n -H ""Authorization: Bearer your_api_key""\n```\n\nReplace `""your_project_name""` with the name of your project and `""your_api_key""` with your actual API key.\n\nBoth methods will return a list of runs in the specified project.",B
What is LangChain?,LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.,"LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of setting up and using LLMs, such as OpenAI's GPT models, by providing a set of tools and utilities. LangChain aims to bridge the gap between initial setup and production-level performance of LLM applications.\n\nWith LangChain, you can easily trace and debug LLM calls, visualize the inputs and outputs of LLMs, modify prompts in a playground environment, understand the sequence of events in complex chains and agents, track latency and token usage, and collaborate with colleagues for debugging. LangChain also allows you to collect examples and create datasets for testing and evaluation, and provides evaluators for common evaluation scenarios. Additionally, LangChain supports monitoring and logging of application runs, and enables exporting datasets for use in other contexts.\n\nOverall, LangChain helps developers build reliable and high-quality LLM applications by providing a range of features and tools for tracing, debugging, testing, and evaluating LLMs and chains.","LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of developing and deploying LLM applications by providing tools and features to enhance reliability, debugging, testing, evaluation, and monitoring.\n\nLangChain is designed to bridge the gap between the initial setup of LLMs and their reliable use in production. It offers a user-friendly interface and various functionalities to make working with LLMs more efficient and effective.\n\nSome key features of LangChain include:\n\n1. Tracing: LangChain provides tracing capabilities that log all calls to LLMs, chains, agents, tools, and retrievers. 
This helps in debugging unexpected results, identifying looping agents, analyzing chain performance, and tracking token usage.\n\n2. Debugging: LangChain offers tools to debug LLMs, chains, and agents. It provides visibility into the exact inputs and outputs of LLM calls, allowing you to understand the constructed input string and the structure of the output. It also includes a playground feature where you can modify prompts and observe the resulting changes to the output.\n\n3. Sequence Visualization: For complex chains and agents, LangChain's tracing feature provides a visualization of the sequence of events, including the order of calls, inputs, and outputs. This helps in understanding the flow and interactions within the chain or agent.\n\n4. Performance Analysis: LangChain allows you to track the latency of each step in a chain, helping you identify and optimize the slowest components. It also provides insights into token usage, making it easier to identify costly parts of the chain.\n\n5. Collaborative Debugging: LangChain enables easy sharing of chains and LLM runs with colleagues for collaborative debugging. It includes a ""Share"" button that generates a shared link, allowing others to access and analyze the chain or run.\n\n6. Dataset Collection: LangChain facilitates the collection of examples for debugging and testing purposes. It includes an ""Add to Dataset"" button for each run, allowing you to add input/output examples to a chosen dataset. This helps in benchmarking changes and testing the application against known issues.\n\n7. Testing and Evaluation: LangChain simplifies the process of testing changes to prompts or chains using datasets. It allows you to run chains over data points, visualize the outputs, and assign feedback to runs. It also provides evaluators to analyze the results and guide your attention to specific examples.\n\n8. Monitoring: LangChain can be used to monitor LLM applications in production. 
It provides tracing, latency, and token usage statistics, allowing you to troubleshoot issues and ensure the smooth functioning of your application.\n\nOverall, LangChain aims to enhance the development, debugging, testing, evaluation, and monitoring of LLM applications, making it easier to build reliable and efficient language-based systems.",B
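
Because `predict_preference` randomly swaps the two predictions, the `A` and `B` labels in this table refer to the positions shown to the evaluator, not to a fixed test project. A quick tally of the `Preferred` column is therefore mainly a check for positional bias. The sketch below uses a hypothetical helper, `summarize_preferences`, on toy rows shaped like the entries of `values`. Rows without a `Preferred` key are ties or evaluator errors.

```python
from collections import Counter


def summarize_preferences(rows):
    """Count how often each display position was preferred."""
    # Rows missing "Preferred" come from ties or failed evaluator calls.
    counts = Counter(row.get("Preferred", "none") for row in rows)
    decided = counts["A"] + counts["B"]
    return {
        "A": counts["A"],
        "B": counts["B"],
        "none": counts["none"],
        # Share of decided comparisons won by the first position.
        "first_position_rate": counts["A"] / decided if decided else None,
    }


# Toy rows for illustration only; in the notebook you would pass `values`.
toy_rows = [{"Preferred": "A"}, {"Preferred": "B"}, {"Preferred": "B"}, {}]
print(summarize_preferences(toy_rows))
```

To tally preferences per project instead, you could also store the preferred run's project name in `result` before returning it.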


## Conclusion

In this walkthrough, you compared two versions of a RAG Q&A chain by predicting preference scores for each pair of predictions.
Automated pairwise comparison like this is one way to compare two versions of a chain, and it can give additional context beyond regular aggregate benchmarks.

There are many related ways to evaluate preferences! Here, we used binary choices to compare the two models and only evaluated once, but you may get better results by trying one of the following approaches:

- Evaluate multiple times in each position and return a win rate
- Ensemble evaluators
- Instruct the model to output continuous scores
- Instruct the model to use a different prompt strategy than chain of thought
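
The first option above can be sketched as follows. This is a minimal illustration rather than a LangChain API: `win_rate` is a hypothetical helper, and `judge` stands in for any callable that returns `"A"` when the text shown first is preferred, `"B"` when the second is preferred, or `None` for a tie.

```python
def win_rate(pairs, judge, n_trials=1):
    """Fraction of decided comparisons won by the first system.

    pairs: iterable of (pred_a, pred_b) strings from the two systems.
    judge: callable(first, second) -> "A", "B", or None (tie).
    """
    wins = decided = 0
    for pred_a, pred_b in pairs:
        for _ in range(n_trials):
            # System A shown first: an "A" verdict is a win for system A.
            verdict = judge(pred_a, pred_b)
            if verdict is not None:
                decided += 1
                wins += verdict == "A"
            # System A shown second: a "B" verdict is now a win for system A.
            verdict = judge(pred_b, pred_a)
            if verdict is not None:
                decided += 1
                wins += verdict == "B"
    return wins / decided if decided else None


def position_biased(first, second):
    """Stub judge that always prefers whatever is shown first."""
    return "A"


def prefers_longer(first, second):
    """Stub judge that prefers the longer text."""
    return "A" if len(first) > len(second) else "B"


toy_pairs = [("a long, detailed answer", "short")]
print(win_rate(toy_pairs, position_biased))  # 0.5: pure positional bias cancels out
print(win_rate(toy_pairs, prefers_longer))   # 1.0: system A wins in both positions
```

In practice, `judge` could wrap a call to `pairwise_evaluator.evaluate_string_pairs(...)` and return its `"value"` field, at the cost of twice as many evaluator calls per example.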

For more information on measuring the reliability of this and other approaches, check out the [evaluation examples](https://python.langchain.com/docs/guides/evaluation/examples) in the LangChain repo.