# Comparing Q&A System Outputs

Aggregate metrics over a dataset are often the fastest way to compare models, but they can hide the finer details of how each system performs. A/B testing variants in production
gives you more signal; another option, which you can run offline, is to generate pairwise comparison metrics.

LangSmith currently doesn't have a native evaluation flow for pairwise evaluation of systems over a dataset. The following guide will walk through how you can still do this
in code while saving the traces to LangSmith for later inspection. We will share some setup code with the QA evaluation system in other recipes.

The main steps are:

1. Create a dataset of questions and answers.
2. Define candidate chains.
3. Generate predictions over the dataset.
4. Evaluate the pairwise results.
5. Summarize aggregate results.

In this case, we will test the impact of chunk sizes on our results.

## Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure tracing and evals are set up for [LangSmith](https://smith.langchain.com), configure your API key in the environment.

In [1]:
# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

Install the required packages. `lxml` and `html2text` are used by the document loader.

In [2]:
# %pip install -U "langchain[openai]" > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null

In [3]:
# %env OPENAI_API_KEY=<YOUR-API-KEY>

## 1. Create a Dataset

For our example, we will evaluate a Q&A system over the LangSmith documentation. To measure aggregate accuracy, we need a list of example question-answer pairs. In general, you'll want many more (>100) pairs to get meaningful results, and drawing from actual user queries helps ensure the examples represent the domain.

Below, we have hard-coded a few question-answer pairs to demonstrate the process and use the client's `create_example` method to create each example row.

In [4]:
# We have some hard-coded examples here.
examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the LangSmith Python SDK. To use, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]

In [5]:
from langsmith import Client

client = Client()

In [6]:
dataset_name = "Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)
for q, a in examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

## 2. Define RAG Q&A System

Our Q&A system uses a simple retriever and LLM response generator. To break that down further, the chain will be composed of:

1. A [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.base.VectorStoreRetriever.html#langchain.vectorstores.base.VectorStoreRetriever) to retrieve documents. This uses:
   - An embedding model to vectorize documents and user queries for retrieval. In this case, the [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) model.
   - A vectorstore. In this case, we will use [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma).
2. A response generator. This uses:
   - A [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html#langchain.prompts.chat.ChatPromptTemplate) to combine the query and documents. 
   - An LLM, in this case, the 16k token context window version of `gpt-3.5-turbo` via [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html#langchain.chat_models.openai.ChatOpenAI).

We will combine them using LangChain's [expression syntax](https://python.langchain.com/docs/guides/expression_language/cookbook).

First, load the documents to populate the vectorstore:

In [7]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)

def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embeddings)
    return vectorstore.as_retriever(search_kwargs={"k": 4})



With the documents prepared, create the vectorstore retriever. This is what will be used to provide context when generating a response.

In [8]:
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
retriever = create_retriever(transformed, text_splitter)

Next up, we'll define the response generator. This responds to the user by injecting the retrieved documents and the user query into a prompt template.

In [9]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

from datetime import datetime
from operator import itemgetter


def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
                ("system", "{context}"),
                ("human","{question}")
            ]
        ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = (
        prompt 
        | model 
        | StrOutputParser()
    )
    chain = (
        # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | response_generator
    )
    return chain

In [10]:
chain_1 = create_chain(retriever)

In [11]:
# We will halve both the chunk size and overlap
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=1000,
    chunk_overlap=100,
)
retriever_2 = create_retriever(transformed, text_splitter_2)

chain_2 = create_chain(retriever_2)
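
To get a feel for what halving the chunk size and overlap does, here is a minimal, library-free sketch of sliding-window chunking. It is an illustration only, not `TokenTextSplitter`'s actual implementation: `split_with_overlap` is a hypothetical helper, and the 5,000-token "document" is a stand-in list of integers rather than real tokenizer output.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most `chunk_size` tokens,
    where consecutive windows share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Hypothetical stand-in for a tokenized document of 5,000 tokens.
tokens = list(range(5000))
large = split_with_overlap(tokens, chunk_size=2000, chunk_overlap=200)
small = split_with_overlap(tokens, chunk_size=1000, chunk_overlap=100)
print(len(large), len(small))  # 3 6
```

In this toy case the smaller setting yields twice as many chunks, each about half as long. Because the retriever returns a fixed `k=4` documents, the second chain sees at most roughly half as many context tokens per question, which is one reason the two chains can answer differently.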

## 3. Evaluate the Chain

We will only be using the pairwise evaluator for this walkthrough. In practice, you will want to also use a correctness evaluator to measure aggregate correctness.

In [12]:
from langsmith import Client
client = Client()
examples = list(client.list_examples(dataset_name="Retrieval QA Questions"))

In [13]:
import uuid
from langchain.callbacks import LangChainTracer

uid = uuid.uuid4().hex[:8]
# Make predictions for the first model
project_name = f"chain-1-eval-{uid}"
configs = [{"callbacks": [LangChainTracer(example_id=example.id, project_name=project_name)]} for example in examples]
results = chain_1.batch([example.inputs for example in examples], config=configs)

In [14]:
# Make predictions for the second model
project_name_2 = f"chain-2-eval-{uid}"
# Same as above but use a different project name
configs = [{"callbacks": [LangChainTracer(example_id=example.id, project_name=project_name_2)]} for example in examples]
results_2 = chain_2.batch([example.inputs for example in examples], config=configs)

In [15]:
import random
import logging
from typing import Optional

def predict_preference(value: dict, eval_chain) -> Optional[str]:
    pred_a, pred_b, input_, answer = value["prediction"], value["prediction_b"], value["input"], value["answer"]
    a, b = "a", "b"
    # Flip a coin to average out persistent positional bias
    if random.random() < 0.5:
        pred_a, pred_b = pred_b, pred_a
        a, b = "b", "a"
    try:
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=pred_a,
            prediction_b=pred_b,
            input=input_,
            reference=answer
        )
    except Exception as e:
        logging.warning(e)
        return None
    result_map = {
        "A": a,
        "B": b
    }
    # None means no preference
    return result_map.get(eval_res["value"])
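
The coin flip matters because an LLM judge can favor whichever answer it sees first. The sketch below replaces the real evaluator with a hypothetical stub, `AlwaysFirstEvaluator`, that always prefers slot A, and shows that random swapping spreads that bias roughly evenly across both chains instead of crediting it all to chain 1. Both the stub and the `preference_with_swap` helper are assumptions made for illustration; they only mirror the call shape used in `predict_preference`.

```python
import random


class AlwaysFirstEvaluator:
    """Hypothetical stub: a maximally position-biased judge that always picks slot A."""

    def evaluate_string_pairs(self, *, prediction, prediction_b, input=None, reference=None):
        return {"value": "A"}


def preference_with_swap(pred_1, pred_2, evaluator, rng):
    """Randomly swap the two predictions, then map the judge's A/B verdict
    back to the original chain labels ("a" = chain 1, "b" = chain 2)."""
    first, second = "a", "b"
    if rng.random() < 0.5:
        pred_1, pred_2 = pred_2, pred_1
        first, second = "b", "a"
    res = evaluator.evaluate_string_pairs(
        prediction=pred_1, prediction_b=pred_2, input="question", reference="answer"
    )
    return {"A": first, "B": second}.get(res["value"])


rng = random.Random(0)  # seeded so the sketch is reproducible
evaluator = AlwaysFirstEvaluator()
votes = [preference_with_swap("out 1", "out 2", evaluator, rng) for _ in range(1000)]
share_a = votes.count("a") / len(votes)
print(f"chain 1 preferred in {share_a:.0%} of trials")
```

Without the swap, this stub would prefer chain 1 every time. With it, the positional bias turns into noise that averages out over the dataset, though it still adds variance on small datasets.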

In [16]:
from langchain.schema.runnable import RunnableLambda
from langchain.evaluation import load_evaluator
import functools

pairwise_evaluator = load_evaluator("labeled_pairwise_string")
eval_func = functools.partial(predict_preference, eval_chain=pairwise_evaluator)
runnable = RunnableLambda(eval_func)

In [17]:
batch_inputs = [{"prediction": pred, "prediction_b": pred_b, "input": e.inputs['question'], "answer": e.outputs['answer']} for pred, pred_b, e in zip(results, results_2, examples)]
values = runnable.batch(batch_inputs)



## 4. Add Feedback

Now that we have a pairwise preference for each example, we can log it as feedback on the corresponding runs.

In [18]:
def _get_run(example_id, project_name):
    return next(iter(client.list_runs(reference_example_id=example_id, project_name=project_name)))

In [19]:
# Add feedback
for example, value in zip(examples, values):
    if value is None:
        continue
    chain_1_run = _get_run(example.id, project_name)
    chain_2_run = _get_run(example.id, project_name_2)
    run_ids = [chain_1_run.id, chain_2_run.id]
    # Score 1 for the preferred chain's run and 0 for the other
    score_map = {'a': (1, 0), 'b': (0, 1)}
    for run_id, score in zip(run_ids, score_map[value]):
        other_run_id = run_ids[1 - run_ids.index(run_id)]
        client.create_feedback(run_id,
                               key="preference",
                               score=score,
                               source_info={"compared_to": other_run_id}
                              )

In [20]:
# View the projects by navigating to the Testing and Datasets page.
ds = client.read_dataset(dataset_name="Retrieval QA Questions")
ds.url

'https://smith.langchain.com/datasets/16fb82a2-d986-4d86-ba54-ea6698446d65'

You can review the results in the app to see preferred outputs.

To measure statistical significance, see other guides, such as the LangChain OSS [comparison evaluation guide](https://python.langchain.com/docs/guides/evaluation/examples/comparisons).
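
The "summarize aggregate results" step from the outline can also be done locally from the `values` list produced by the pairwise evaluator. The following is a minimal sketch, assuming each entry of `values` is `"a"`, `"b"`, or `None`. `summarize_preferences` is a hypothetical helper, and the exact two-sided sign test is just one simple choice of significance test, not something LangSmith computes for you. Note that `predict_preference` returns `None` both for "no preference" and for evaluator errors, so both are counted as undecided here. The sample `values` list is made up for illustration.

```python
from collections import Counter
from math import comb


def summarize_preferences(values):
    """Count wins per chain and run an exact two-sided sign test (H0: p = 0.5)."""
    counts = Counter(values)
    wins_a, wins_b = counts["a"], counts["b"]
    decided = wins_a + wins_b
    if decided == 0:
        p_value = 1.0
    else:
        k = max(wins_a, wins_b)
        # Probability of k or more wins out of `decided` fair coin flips.
        tail = sum(comb(decided, i) for i in range(k, decided + 1)) / 2 ** decided
        p_value = min(1.0, 2 * tail)
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "undecided": len(values) - decided,
        "win_rate_a": wins_a / decided if decided else None,
        "p_value": p_value,
    }


# Made-up preferences for seven examples; replace with the real `values` list.
sample_values = ["a", "b", "a", "a", None, "a", "a"]
summary = summarize_preferences(sample_values)
print(summary["wins_a"], summary["wins_b"], summary["p_value"])  # 5 1 0.21875
```

Even a 5-to-1 split is not significant at this sample size, which illustrates why a dataset of only a handful of examples is too small to draw conclusions about chunk size.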