# Comparing Q&A System Outputs

The most common way to compare two models is to benchmark them both on a dataset and compare the aggregate metrics.

This approach is useful, but aggregate metrics can hide differences in how the two system variants behave on individual examples. In those cases, it helps to run pairwise comparisons directly on the responses and generate preference scores.
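As a toy illustration of that point, the sketch below uses made-up correctness scores and made-up pairwise verdicts for five questions (none of these numbers come from the runs in this tutorial). The two variants tie on the aggregate metric, yet the pairwise verdicts lean clearly one way.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-question correctness scores (1 = correct, 0 = incorrect).
scores_a = [1, 1, 0, 1, 1]
scores_b = [1, 1, 1, 0, 1]

# Hypothetical pairwise verdicts for the same five questions.
verdicts = ["A", "A", "B", "A", "A"]

# Aggregate view: both variants have the same mean correctness.
aggregate = {"A": mean(scores_a), "B": mean(scores_b)}

# Pairwise view: count how often each variant was preferred.
preference_counts = Counter(verdicts)
win_rate_a = preference_counts["A"] / len(verdicts)

print(aggregate)
print(preference_counts)
print(win_rate_a)
```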

In this tutorial, we will walk through an example of how to do this in a notebook so you can better compare two variants of a model.

We will use a retrieval Q&A system over LangSmith's docs as a motivating example.

The main steps are:

1. Create a dataset of questions and answers.
2. Define different versions of your chains to evaluate.
3. Evaluate chains directly on a dataset using regular metrics (e.g. correctness).
4. Evaluate the pairwise preferences over that dataset.
5. Summarize aggregate results.

In this case, we will test the impact of chunk sizes on our result quality. Now without further delay:

## Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure the tracing and evals are set up for [LangSmith](https://smith.langchain.com), please configure your API Key appropriately.

We will also use pandas to render the results in the notebook.

In [1]:
# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

Install the required packages. `lxml` and `html2text` are used by the document loader.

In [2]:
# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet
# %pip install nest_asyncio --quiet

In [3]:
# %env OPENAI_API_KEY=<YOUR-API-KEY>
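Before running the remaining cells, it can help to confirm that both keys are visible to the notebook kernel. The helper below is a hypothetical convenience function, not part of LangChain or LangSmith; it only checks that the two variables named above are set.

```python
import os

REQUIRED_VARS = ("LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```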

In [4]:
# Used for running in jupyter
import nest_asyncio

nest_asyncio.apply()

## 1. Create a dataset

No evaluation process is complete without a development dataset. We've hard-coded a few examples below to demonstrate the process. In general, you'll want a lot more (>100) pairs for statistically significant results. Drawing from actual user queries can be helpful to ensure better representation of the domain.
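To get a feel for why a handful of examples is not enough, the sketch below computes an approximate 95% Wilson score interval for a win rate. The win counts are invented for illustration. With seven examples, a 5-of-7 win rate is still compatible with a coin flip, while a similar rate over 100 examples is not.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a win rate of wins / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical win counts: 5 of 7 versus 71 of 100.
small = wilson_interval(5, 7)
large = wilson_interval(71, 100)
print(small)
print(large)
```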

In [5]:
examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,\
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]

In [6]:
from langsmith import Client

client = Client()

In [7]:
dataset_name = "Retrieval QA Questions"
# dataset = client.create_dataset(dataset_name=dataset_name)
# for q, a in examples:
#     client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

## 2. Define RAG Q&A system

Our Q&A system uses a simple retriever and LLM response generator. To break that down further, the chain will be composed of:

1. A [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.base.VectorStoreRetriever.html#langchain.vectorstores.base.VectorStoreRetriever) to retrieve documents. This uses:
   - An embedding model to vectorize documents and user queries for retrieval. In this case, the [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) model.
   - A vectorstore. In this case, we will use [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma).
2. A response generator. This uses:
   - A [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html#langchain.prompts.chat.ChatPromptTemplate) to combine the query and documents. 
   - An LLM, in this case, the 16k token context window version of `gpt-3.5-turbo` via [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html#langchain.chat_models.openai.ChatOpenAI).

We will combine them using LangChain's [expression syntax](https://python.langchain.com/docs/guides/expression_language/cookbook).

First, load the documents to populate the vectorstore:

In [8]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)

def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embeddings)
    return vectorstore.as_retriever(search_kwargs={"k": 4})



Next up, we'll define the chain. Since we are going to vary the retriever parameters, our constructor will
take the retriever as an argument.

In [9]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

from datetime import datetime
from operator import itemgetter


def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
                ("system", "{context}"),
                ("human","{question}")
            ]
        ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = (
        prompt 
        | model 
        | StrOutputParser()
    )
    chain = (
        # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | response_generator
    )
    return chain
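The dictionary at the start of the chain can be thought of as a fan-out: every value receives the same input, and the results are collected under the matching keys before being handed to the prompt. The plain-Python sketch below mimics that data flow with a canned retriever. `fake_retriever` and `run_map` are hypothetical stand-ins, not LangChain APIs.

```python
from operator import itemgetter


def fake_retriever(question):
    # Stand-in for the vector store retriever: returns canned "documents".
    return [f"doc about {question}", "another doc"]


def join_docs(docs):
    # Same joining step as the lambda in the chain above.
    return "\n".join(docs)


def run_map(steps, inputs):
    """Apply each value of `steps` to the same inputs and collect results by key."""
    return {key: fn(inputs) for key, fn in steps.items()}


steps = {
    "context": lambda x: join_docs(fake_retriever(itemgetter("question")(x))),
    "question": itemgetter("question"),
}

prompt_inputs = run_map(steps, {"question": "What is LangSmith?"})
print(prompt_inputs)
```

The resulting dictionary has exactly the `context` and `question` keys that the prompt template expects.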

With the documents prepared, and the chain constructor ready, it's time to create and evaluate our chains.
We will vary the split size and overlap to evaluate their impact on the response quality.

In [10]:
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
retriever = create_retriever(transformed, text_splitter)

chain_1 = create_chain(retriever)

In [11]:
# We will shrink both the chunk size and overlap
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
retriever_2 = create_retriever(transformed, text_splitter_2)

chain_2 = create_chain(retriever_2)
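To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window splitter over a list of tokens. It is a sketch of the general idea only: it uses integers as stand-in tokens and is an assumption about how overlapping token windows behave, not the actual `TokenTextSplitter` implementation.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Smaller chunks give the retriever more, finer-grained pieces to choose from, while larger chunks keep more surrounding context in each retrieved piece. That trade-off is what the two chains are meant to probe.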

## 3. Evaluate the chains

At this point, we are still going through the regular development -> evaluation process. We have two candidates and will evaluate them with a correctness evaluator from LangChain. By running `run_on_dataset`, we will generate predicted answers to each question in the dataset and log feedback from the evaluator for that data point.

In [12]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator
    evaluators=["cot_qa"],
)

In [13]:
results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_1,
    evaluation=eval_config
)
project_name = results["project_name"]

View the evaluation results for project 'b59f6be5abed4626b95292eb47f2a5f6-RunnableSequence' at:
https://smith.langchain.com/projects/p/de8b7abb-6d35-4449-8a6b-263308187cf2?eval=true


In [14]:
results_2 = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_2,
    evaluation=eval_config
)
project_name_2 = results_2["project_name"]

View the evaluation results for project 'ae48fca6106a4e9d8dad7fb798a84bdd-RunnableSequence' at:
https://smith.langchain.com/projects/p/a2714237-55eb-4501-be8b-8c5c41623365?eval=true




Now you should have two test run projects over the same dataset. If you click on one, it should look something like the following:
    
![Original Feedback](img/original_eval.png)

You can look at the aggregate results here and for the other project to compare them, but let's move on to the pairwise comparison.

## 4. Pairwise Evaluation

It looks like both approaches return similar scores when the evaluator considers each prediction in isolation.
It's time to run a pairwise evaluator to see how they compare. We will first define a couple of helper functions to run the evaluator
on each prediction pair. Let's break it down:

- The function accepts an example object and loads the predictions for each model by querying LangSmith.
- It then randomizes the order of the predictions and calls the evaluator. This is done in case the evaluator is biased
    towards a specific ordering.
- Once the evaluation result is returned, we check it to make sure it is valid and then log feedback for both models.

Once this is complete, the values are all returned so we can display them in a table in the notebook below.

In [15]:
import random
import logging
from typing import Dict

def _get_run_and_prediction(example_id, project_name):
    run = next(client.list_runs(reference_example_id=example_id, project_name=project_name))
    prediction = next(iter(run.outputs.values()))
    return run, prediction

def _log_feedback(run_ids):
    # run_ids is ordered (runner-up, preferred): the runner-up gets score 0, the preferred run gets score 1
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)

def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    input_, answer = example.inputs['question'], example.outputs['answer']
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Flip a coin to average out persistent positional bias
    if random.random() < 0.5:
        result['A'], result['B'] = result['B'], result['A']
        run_a, run_b = run_b, run_a
    try:
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result['A'],
            prediction_b=result['B'],
            input=input_, 
            reference=answer
        )
    except Exception as e:
        logging.warning(e)
        return result

    if eval_res["value"] is None:
        return result

    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    _log_feedback((runner_up_run[0], preferred_run[0]))
    result["Preferred"] = preferred_run[1]
    return result
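The scoring convention in `_log_feedback` relies on the order of the tuple it receives: `enumerate` assigns 0 to the first run (the runner-up) and 1 to the second (the preferred run). The sketch below isolates that mapping without calling LangSmith. The run IDs are made-up placeholders.

```python
def preference_scores(runner_up_id, preferred_id):
    """Mirror the enumerate-based scoring: index 0 -> score 0, index 1 -> score 1."""
    return {
        run_id: score
        for score, run_id in enumerate((runner_up_id, preferred_id))
    }


# Hypothetical run IDs, used only to show the resulting scores.
scores = preference_scores("run-small-chunks", "run-large-chunks")
print(scores)
```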


For this example, we will use the `labeled_pairwise_string` evaluator from LangChain off-the-shelf.

For more information on how to configure it, check out the [Labeled Pairwise String Evaluator](https://python.langchain.com/docs/guides/evaluation/comparison/labeled_pairwise_string) documentation and inspect the resulting traces when running this notebook. We encourage you to customize these or use your own evaluation functions as appropriate.


In [16]:
from langchain.evaluation import load_evaluator

pairwise_evaluator = load_evaluator("labeled_pairwise_string")

In [17]:
import functools
from langchain.schema.runnable import RunnableLambda


eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# We will wrap the function in a RunnableLambda to take advantage of its default `batch` convenience method
runnable = RunnableLambda(eval_func)

In [18]:
examples = list(client.list_examples(dataset_name="Retrieval QA Questions"))
values = runnable.batch(examples)



By running the function above, the "preference" feedback was automatically logged to the test projects you created in step 3. Below is a view of the same test run as before with the preference scores added. This model seems to be less preferred than the other! 

![Preference Tags](img/with_preferences.png)

The `predict_preference` function we wrote above is set up to not log feedback in the case of a tie, meaning some of the examples do not have a corresponding preference score. You can adjust this behavior as you see fit. 
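If you would rather record ties than skip them, one possible convention (an assumption on our part, not something the evaluator prescribes) is to give both runs a score of 0.5. The hypothetical helper below maps a verdict to a pair of scores, which you could then log for each run.

```python
def scores_for_verdict(verdict):
    """Map a pairwise verdict to (score_a, score_b); anything but "A" or "B" counts as a tie."""
    if verdict == "A":
        return 1.0, 0.0
    if verdict == "B":
        return 0.0, 1.0
    # No clear winner: split the credit evenly.
    return 0.5, 0.5


print(scores_for_verdict("A"))
print(scores_for_verdict(None))
```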

You can also view the feedback results for the other test run in the app to see how well the evaluator's results match your expectations.

In [47]:
import pandas as pd

df = pd.DataFrame(values)

In [48]:
from IPython.core.display import HTML

def word_wrap_on_hover(df):
    """
    Takes a Pandas DataFrame and returns an HTML table with word wrap on hover.
    """
    styles = """
    <style>
        .hover_table td {
            max-width: 200px; /* You can adjust this value */
            overflow: hidden;
            text-overflow: ellipsis;
            white-space: nowrap;
        }
        .hover_table td:hover {
            white-space: normal;
            word-wrap: break-word;
        }
    </style>
    """
    html_table = df.to_html(classes='hover_table')
    return HTML(styles + html_table)

word_wrap_on_hover(df)

Unnamed: 0,input,answer,A,B,Preferred
0,How do I move my project between organizations?,LangSmith doesn't directly support moving projects between organizations.,"Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. Here's an example of how you can export runs:\n\n1. Use the LangSmith SDK to retrieve the runs from your source organization:\n\n```python\nfrom langchain import LangSmithClient\n\nclient = LangSmithClient(api_key='<source-organization-api-key>')\nruns = client.get_runs(project_id='<source-project-id>')\n```\n\n2. Save the runs to a file or database for later import.\n\n3. Use the LangSmith SDK to import the runs into your target organization:\n\n```python\nfrom langchain import LangSmithClient\n\nclient = LangSmithClient(api_key='<target-organization-api-key>')\nfor run in runs:\n client.create_run(project_id='<target-project-id>', run=run)\n```\n\nBy following a similar process, you can also export and import datasets between organizations.\n\nIf you have any further questions or need assistance, please reach out to us at support@langchain.dev.","Currently, LangSmith does not support project migration between organizations. If you want to move your project to a different organization, the recommended approach is to create a new project within the desired organization and manually transfer the necessary data, such as runs and datasets, to the new project.\n\nYou can achieve this by using the LangSmith SDK or REST API to export the data from the original project and import it into the new project. Here are the general steps you can follow:\n\n1. Export the runs and datasets from the original project using the LangSmith API or SDK. You can filter the runs and datasets based on your requirements.\n\n2. Save the exported data to a file or storage location.\n\n3. Create a new project in the desired organization.\n\n4. 
Import the exported runs and datasets into the new project using the LangSmith API or SDK.\n\nBy following these steps, you can effectively move your project from one organization to another.",
1,Why do I have to set environment variables?,"Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.","Setting environment variables is a common practice in software development to configure the behavior of applications or libraries. In the case of LangSmith, setting environment variables is necessary to enable tracing and logging of LLM calls, chains, agents, and other LangChain components.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, LangSmith's tracing feature is activated. This allows LangSmith to capture and log the inputs, outputs, and other relevant information of your LangChain application's runs. This tracing data is invaluable for debugging, understanding the sequence of events, monitoring performance, and collecting examples for testing and evaluation.\n\nSetting environment variables provides a convenient and consistent way to configure tracing across different programming languages and environments. It allows you to enable tracing without modifying your code, making it easier to toggle tracing on and off as needed.\n\nHowever, it's important to note that setting environment variables may vary depending on your programming language and environment. The specific steps to set environment variables can differ between operating systems, shells, and development tools. It's recommended to refer to the documentation or resources specific to your programming language and environment for instructions on how to set environment variables.","Setting environment variables is a common practice in software development for configuring and customizing applications. 
In the context of logging traces to LangSmith, setting environment variables allows you to specify the necessary information, such as the API key and project name, without hardcoding them into your code. This provides flexibility and security, as you can easily change the configuration without modifying your code.\n\nBy using environment variables, you can separate the configuration from the codebase, making it easier to manage different environments (e.g., development, staging, production) and share the code with others without exposing sensitive information.\n\nAdditionally, environment variables are a standard way to pass configuration information to applications, and many frameworks and libraries support reading configuration values from environment variables out of the box.",
2,Can I trace my Llama V2 llm?,"So long as you are using one of LangChain's LLM implementations, all your calls can be traced","Yes, you can trace your Llama V2 LLM using LangSmith's tracing feature. Tracing allows you to log the inputs, outputs, and other relevant information of your LLM runs for debugging and analysis purposes.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your application. You can do this by running the following command in your shell:\n\n```shell\nexport LANGCHAIN_TRACING_V2=true\n```\n\nMake sure you have also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables to the appropriate values for your LangSmith account.\n\nOnce tracing is enabled, you can run your Llama V2 LLM as usual, and LangSmith will automatically log the runs and make them available for visualization and analysis in the LangSmith interface.\n\nNote that tracing is currently supported for LLM and Chat Model calls, and it works for both synchronous and asynchronous calls.","Yes, you can trace your Llama V2 LLM using LangSmith. Tracing allows you to log the inputs, outputs, and other important information of your LLM runs to LangSmith for visualization and debugging purposes.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LLM code. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import ChatOpenAI\n\nos.environ[""LANGCHAIN_TRACING_V2""] = ""true""\n\nllm = ChatOpenAI()\nllm.invoke(""Hello, world!"")\n```\n\nMake sure you have the LangSmith SDK installed and your LangSmith API key configured. The traces from your LLM runs will be logged to LangSmith and can be visualized in the LangSmith web app.\n\nNote that tracing is asynchronous, so you may need to wait for the traces to be submitted before exiting your application. 
You can use the `wait_for_all_tracers()` function from the `langsmith.callbacks.tracers.langchain` module to ensure all traces are logged before exiting.",
3,How do I use a traceable decorator?,"The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,import the required function, decorate your function, and then call the function. Below is an example:\n```python\nfrom langsmith.run_helpers import traceable\n@traceable(run_type=""chain"") # or ""llm"", etc.\ndef my_function(input_param):\n # Function logic goes here\n return output\nresult = my_function(input_param)\n```","To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules:\n```python\nfrom langsmith.run_helpers import traceable\n```\n\n2. Define your function and decorate it with the `traceable` decorator:\n```python\n@traceable(run_type=""llm"")\ndef my_function(input_param1, input_param2):\n # Function logic goes here\n return output_value\n```\n\n3. Use the decorated function as usual:\n```python\nresult = my_function(""input1"", ""input2"")\n```\n\nThe `traceable` decorator adds tracing capabilities to your function. It automatically logs the inputs, outputs, and other relevant information of the function call. The `run_type` parameter specifies the type of the run, such as ""llm"" (for language model) or ""chain"" (for chain of functions).\n\nBy using the `traceable` decorator, you can easily trace and monitor the execution of your functions, which can be helpful for debugging, testing, and monitoring purposes.","To use the `traceable` decorator, you need to import it from the `langsmith.run_helpers` module. 
The `traceable` decorator allows you to easily log the execution of a function as a run in LangSmith.\n\nHere's an example of how to use the `traceable` decorator:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type=""llm"", name=""openai.ChatCompletion.create"")\ndef my_llm(*args, **kwargs):\n # Your function logic here\n pass\n```\n\nIn the example above, the `traceable` decorator is applied to the `my_llm` function. The `run_type` parameter specifies the type of the run (e.g., ""llm"", ""tool"", ""chain""), and the `name` parameter specifies the name of the run. You can customize these parameters based on your specific use case.\n\nOnce the `traceable` decorator is applied, any calls to the `my_llm` function will be logged as runs in LangSmith. You can view and analyze these runs in the LangSmith web app.\n\nNote that the `traceable` decorator works for both synchronous and asynchronous functions. If you're using an asynchronous function, make sure to use the `await` keyword when calling the function.\n\nBy default, the `traceable` decorator logs the inputs and outputs of the function. If you want to exclude certain inputs or outputs from being logged, you can use the `exclude_inputs` and `exclude_outputs` parameters of the decorator.\n\nHere's an example of excluding inputs and outputs from being logged:\n\n```python\n@traceable(run_type=""llm"", name=""openai.ChatCompletion.create"", exclude_inputs=[""api_key""], exclude_outputs=[""choices""])\ndef my_llm(api_key, *args, **kwargs):\n # Your function logic here\n pass\n```\n\nIn the example above, the `api_key` input and the `choices` output will not be logged in the run.\n\nThat's how you can use the `traceable` decorator to log function calls as runs in LangSmith.",B
4,What's a langsmith dataset?,A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.,"A LangSmith dataset is a collection of input-output examples that are used for testing and evaluation purposes. It is a curated set of data points that represent different scenarios and inputs that your application may encounter. These examples can be used to run your prompt or chain and evaluate the outputs.\n\nLangSmith datasets are created and managed within the LangSmith platform. You can add examples to a dataset directly from the LangSmith interface, either manually or by using the ""Add to Dataset"" button available for each run. These examples can be edited to include the expected results, making them particularly useful for capturing both positive and negative test cases.\n\nOnce you have a dataset, you can use it to test changes to your prompt or chain. You can run your chain over the data points in the dataset and visualize the outputs. This allows you to evaluate the performance of your application and compare the actual outputs with the expected results.\n\nLangSmith datasets can also be exported for use in other contexts, such as OpenAI Evals or fine-tuning with platforms like FireworksAI. This allows you to leverage the curated examples outside of the LangSmith platform for further experimentation and development.\n\nOverall, LangSmith datasets provide a structured way to organize and manage input-output examples for testing, evaluation, and improvement of your language model applications.","A LangSmith dataset is a collection of examples that can be used to evaluate or improve a chain, agent, or model. It consists of rows, where each row represents an example and contains the inputs and (optionally) the expected outputs for a given interaction. Datasets in LangSmith can be created from existing runs, uploaded as CSV files, or manually created using the LangSmith client. 
Datasets are useful for testing and evaluating models, monitoring performance, and exporting data for use in other contexts.",A
5,How might I query for all runs in a project?,"client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})","To query for all runs in a project, you can use the LangSmith SDK. Here's an example in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\n\nruns = list(client.list_runs(project_name=""<your_project>""))\n```\n\nAnd here's an example in TypeScript:\n\n```typescript\nimport { Client, Run } from ""langsmith"";\n\nconst client = new Client();\nconst runs: Run[] = [];\nfor await (const run of client.listRuns({ project: ""<your_project>"" })) {\n runs.push(run);\n}\n```\n\nIn both examples, replace `<your_project>` with the name of your project. This will retrieve all runs in the specified project. You can further filter the runs by adding additional parameters to the `list_runs` method, such as `filter` or `start_time`.","To query for all runs in a project, you can use the LangSmith SDK or the LangSmith REST API. Here's an example of how to do it using the LangSmith SDK in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name=""your_project_name""))\n```\n\nIn this example, replace `""your_project_name""` with the name of your project. The `list_runs` method returns a generator that you can iterate over to get all the runs in the project.\n\nIf you prefer to use the LangSmith REST API directly, you can make a GET request to the `/runs` endpoint with the appropriate project name. Here's an example using `curl`:\n\n```bash\ncurl -X GET ""https://api.smith.langchain.com/runs?project_name=your_project_name"" \\n -H ""Authorization: Bearer your_api_key""\n```\n\nReplace `""your_project_name""` with the name of your project and `""your_api_key""` with your actual API key.\n\nBoth methods will return a list of runs in the specified project.",A
6,What is LangChain?,LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.,"LangChain is a framework for building applications using large language models (LLMs). It simplifies the process of developing and deploying LLM applications by providing tools and features to enhance reliability, debugging, testing, and evaluation.\n\nLangChain is designed to bridge the gap between the initial setup of LLMs and their performance in production. It offers a user-friendly interface and various functionalities to make working with LLMs more efficient and effective.\n\nSome key features of LangChain include:\n\n1. Tracing: LangChain provides tracing capabilities that log all calls to LLMs, chains, agents, tools, and retrievers. This allows for easy debugging and understanding of the inputs and outputs of each call.\n\n2. Debugging: LangChain helps in debugging LLMs, chains, and agents by providing visibility into the exact inputs and outputs of LLM calls. It also offers a playground feature where prompts can be modified and re-run to observe the resulting changes in the output.\n\n3. Sequence Visualization: LangChain's tracing feature includes a visualization tool that helps understand the sequence of events in complicated chains and agents. It provides insights into the order of calls, inputs, and outputs of each call.\n\n4. Performance Monitoring: LangChain allows monitoring the latency and token usage of chains, helping identify and optimize the slowest components. This is particularly useful for ensuring the efficiency of LLM applications.\n\n5. Collaborative Debugging: LangChain enables easy sharing of faulty chains and LLM runs with colleagues for collaborative debugging. It provides a ""Share"" button that generates a shared link to access the chain and runs.\n\n6. Dataset Collection: LangChain facilitates the collection of examples for debugging and testing purposes. 
It includes an ""Add to Dataset"" button for each run, allowing users to add input/output examples to a chosen dataset. This helps in benchmarking changes and testing the application against known issues.\n\n7. Testing and Evaluation: LangChain simplifies the process of testing changes to prompts or chains using datasets. It provides a client-side tool to run chains over data points and visualize the outputs. It also offers evaluators to assess the results and provide guidance for further analysis.\n\n8. Monitoring: LangChain can be used to monitor LLM applications in production. It allows logging of traces, visualization of latency and token usage statistics, and troubleshooting of specific issues as they arise.\n\nOverall, LangChain aims to enhance the development, debugging, testing, and monitoring of LLM applications, making it easier to build reliable and efficient language-based applications.","LangChain is a framework for building applications using large language models (LLMs). It aims to simplify the process of developing reliable LLM applications by providing tools and features to enhance debugging, testing, evaluation, and monitoring. LangChain includes LangSmith, a tracing and debugging tool that helps developers understand and troubleshoot their LLM applications. It also provides features for collecting examples, testing and evaluating chains, and exporting datasets. LangChain is designed to improve the development and deployment of LLM applications by providing insights, tools, and resources to enhance the reliability and performance of these applications.",A
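A quick way to summarize the table is to count the values in the `Preferred` column. The sketch below copies those values from the output above by hand and uses only the standard library. Keep in mind that `predict_preference` swaps the predictions at random, so `A` and `B` here refer to the positions shown to the evaluator, not to a fixed model. The per-run "preference" feedback in LangSmith is the place to compare the models themselves.

```python
from collections import Counter

# "Preferred" values copied from the table above; None marks rows where no preference was recorded.
preferred = [None, None, None, "B", "A", "A", "A"]

counts = Counter(
    label if label is not None else "no preference" for label in preferred
)

# Share of decided comparisons won by position A.
decided = counts["A"] + counts["B"]
share_a = counts["A"] / decided if decided else float("nan")

print(counts)
print(share_a)
```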


## Conclusion

In this walkthrough, you compared two versions of a RAG Q&A chain by predicting preference scores for each pair of predictions.
This approach is one way to automatically compare two versions of a chain that can give additional context beyond regular benchmarking.

There are many related ways to evaluate preferences! Here, we used binary choices to compare the two models and only evaluated once, but you may get better results by trying one of the following approaches:

- Evaluate multiple times in each position and return a win rate
- Ensemble evaluators
- Instruct the model to output continuous scores
- Instruct the model to use a different prompt strategy than chain of thought
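As a sketch of the first idea, the code below evaluates each pair in both positions and reports a win rate. The judge is a deliberately position-biased toy function, not an LLM evaluator, and all names are hypothetical. It shows how swapping positions cancels the bias when neither answer is clearly better.

```python
def position_biased_judge(first, second):
    """Toy judge: prefers the longer answer, breaking ties in favor of the first slot."""
    return "first" if len(first) >= len(second) else "second"


def win_rate(pred_a, pred_b, judge, rounds=2):
    """Fraction of comparisons won by `pred_a`, judging `rounds` times in each position."""
    wins = 0
    total = 0
    for _ in range(rounds):
        # pred_a shown in the first slot
        wins += judge(pred_a, pred_b) == "first"
        # pred_a shown in the second slot
        wins += judge(pred_b, pred_a) == "second"
        total += 2
    return wins / total


equal_length = win_rate("abcd", "wxyz", position_biased_judge)
clear_winner = win_rate("a longer answer", "short", position_biased_judge)
print(equal_length)
print(clear_winner)
```

With a real LLM evaluator the verdicts can vary between calls, so increasing `rounds` would also average over that variation.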

For more information on measuring the reliability of this and other approaches, you can check out the [evaluations examples](https://python.langchain.com/docs/guides/evaluation/examples) in the LangChain repo.