# Prod candidate testing

Deploying your app into production is just one step on a longer journey of continuous improvement.

You'll likely develop candidate improvements that you want to test against your prod system.

This is somewhat analogous to "back-testing", though you often won't have ground-truth labels in this case. (If your application does permit capturing ground-truth labels, we recommend using them.)

This notebook shows how to do this in LangSmith.

Basic steps are:

1. Sample prod runs to test against.
2. Convert runs to dataset + initial test.
3. Run new system against the dataset to compare.

In this way, each sample you take becomes a new dataset that you can version and backtest against.
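
As a rough, self-contained illustration of the three steps above, the sketch below uses plain Python stand-ins instead of LangSmith. The names `sample_runs`, `build_comparison`, and `candidate_system`, as well as the toy run records, are hypothetical and only mirror the shape of the workflow; the actual LangSmith calls appear in the cells further down.

```python
import random


def sample_runs(runs, k, seed=0):
    # Step 1: downsample prod runs (seeded so the sample is reproducible).
    rng = random.Random(seed)
    return rng.sample(runs, min(k, len(runs)))


def build_comparison(sampled, candidate):
    # Steps 2 and 3: reuse each sampled input as a dataset example, keep the
    # prod output as the baseline, and run the candidate on the same input.
    return [
        {
            "input": run["input"],
            "baseline": run["output"],
            "candidate": candidate(run["input"]),
        }
        for run in sampled
    ]


def candidate_system(text):
    # Hypothetical stand-in for the new system under test.
    return f"critique v2 of {text}"


# Toy "prod" runs; real runs would come from your tracing project.
prod_runs = [
    {"input": f"tweet {i}", "output": f"critique v1 of tweet {i}"}
    for i in range(20)
]

sampled = sample_runs(prod_runs, k=5)
rows = build_comparison(sampled, candidate_system)
```

Each row pairs a baseline output with a candidate output for the same input, which is the comparison you end up inspecting when no ground-truth labels are available.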

## Prerequisites

Install the required packages and set the environment variables.

In [None]:
%pip install -U langchain langchain_anthropic

In [1]:
import os

# Set the project name to whichever project you'd like to be testing against
project_name = "Tweet Critic"
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["ANTHROPIC_API_KEY"] = "YOUR ANTHROPIC API KEY"
os.environ["LANGCHAIN_PROJECT"] = project_name

#### (Prelim) Production Deployment

You likely have a project already and can skip this step. We'll simulate one here for the sake of the notebook. Our example app is a "tweet critic" that revises tweets we put out.

In [2]:
from langchain import hub
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser

prompt = hub.pull("wfh/tweet-critic:7e4f539e")
llm = ChatAnthropic(model="claude-3-haiku-20240307")
system = prompt | llm | StrOutputParser()


inputs = [
    """RAG From Scratch: Our RAG From Scratch video series covers some important RAG concepts in short, focused videos with code. This is the 10th video and it discusses query routing. Problem: We sometimes have multiple datastores (e.g., different vector DBs, SQL DBs, etc) and prompts to choose from based on a user query. Idea: Logical routing can use an LLM to decide which datastore is more appropriate. Semantic routing embeds the query and prompts, then chooses the best prompt based on similarity. Video: https://youtu.be/pfpIndq7Fi8 Code: https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_10_and_11.ipynb""",
    """@Voyage_AI_ Embedding Integration Package Use the same custom embeddings that power Chat LangChain via the new langchain-voyageai package! Voyage AI builds custom embedding models that can improve retrieval quality. ChatLangChain: https://chat.langchain.com Python Docs: https://python.langchain.com/docs/integrations/providers/voyageai""",
    """Implementing RAG: How to Write a Graph Retrieval Query in LangChain Our friends at @neo4j have a nice guide on combining LLMs and graph databases. Blog:""",
    """Text-to-PowerPoint with LangGraph.js You can now generate PowerPoint presentations from text! @TheGreatBonnie wrote a guide showing how to use LangGraph.js, @tavilyai, and @CopilotKit to build a Next.js app for this. Tutorial: https://dev.to/copilotkit/how-to-build-an-ai-powered-powerpoint-generator-langchain-copilotkit-openai-nextjs-4c76 Repo: https://github.com/TheGreatBonnie/aipoweredpowerpointapp""",
    """Build an Answer Engine Using Groq, Mixtral, Langchain, Brave & OpenAI in 10 Min Our friends at @Dev__Digest have a tutorial on building an answer engine over the internet. Code: https://github.com/developersdigest/llm-answer-engine YouTube: https://youtube.com/watch?v=43ZCeBTcsS8&t=96s""",
    """Building a RAG Pipeline with LangChain and Amazon Bedrock Amazon Bedrock has great models for building LLM apps. This guide covers how to get started with them to build a RAG pipeline. https://gettingstarted.ai/langchain-bedrock/""",
    """SF Meetup on March 27! Join our meetup to hear from LangChain and Pulumi experts and learn about building AI-enabled capabilities. Sign up: https://meetup.com/san-francisco-pulumi-user-group/events/299491923/?utm_campaign=FY2024Q3_Meetup_PUG%20SF&utm_content=286236214&utm_medium=social&utm_source=twitter&hss_channel=tw-837770064870817792""",
    """Chat model response metadata @LangChainAI chat model invocations now include metadata like logprobs directly in the output. Upgrade your version of `langchain-core` to try it. PY: https://python.langchain.com/docs/modules/model_io/chat/logprobs JS: https://js.langchain.com/docs/integrations/chat/openai#generation-metadata""",
    """Benchmarking Query Analysis in High Cardinality Situations Handling high-cardinality categorical values can be challenging. This blog explores 6 different approaches you can take in these situations. Blog: https://blog.langchain.dev/high-cardinality""",
    """Building Google's Dramatron with LangGraph.js & Claude 3 We just released a long YouTube video (1.5 hours!) on building Dramatron using LangGraphJS and @AnthropicAI's Claude 3 "Haiku" model. It's a perfect fit for LangGraph.js and Haiku's speed. Check out the tutorial: https://youtube.com/watch?v=alHnQjyn7hg""",
    """Document Loading Webinar with @AirbyteHQ Join a webinar on document loading with PyAirbyte and LangChain on 3/14 at 10am PDT. Features our founding engineer @eyfriis and the @aaronsteers and Bindi Pankhudi team. Register: https://airbyte.com/session/airbyte-monthly-ai-demo""",
]

_ = system.batch(
    [{"messages": [HumanMessage(content=content)]} for content in inputs],
    {"max_concurrency": 3},
)

## Convert Prod Runs to Test

The first step is to generate a dataset from the production _inputs_.
Then copy the sampled traces over to serve as a baseline run.

In [26]:
from datetime import datetime, timedelta, timezone

SAMPLE_SIZE = 10
end_time = datetime.now(tz=timezone.utc)
start_time = end_time - timedelta(days=1)
filter = f'and(gt(start_time, "{start_time.isoformat()}"), lt(end_time, "{end_time.isoformat()}"))'
dataset_name = f'{project_name}-candidate-testing {start_time.strftime("%Y-%m-%d")}-{end_time.strftime("%Y-%m-%d")}'

# This will make it a bit slower, but will let you view the full trace
include_children = True
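
To make the filter expression concrete, here is a small sketch that builds the same string for a fixed one-day window. The helper name `build_time_window_filter` and the fixed timestamps are illustrative assumptions; the expression itself has the same form as the one constructed above.

```python
from datetime import datetime, timedelta, timezone


def build_time_window_filter(start, end):
    # Same expression as in the cell above: runs that started after `start`
    # and ended before `end`.
    return (
        f'and(gt(start_time, "{start.isoformat()}"), '
        f'lt(end_time, "{end.isoformat()}"))'
    )


# Fixed, timezone-aware timestamps so the result is reproducible.
window_end = datetime(2024, 3, 18, 12, 0, tzinfo=timezone.utc)
window_start = window_end - timedelta(days=1)

window_filter = build_time_window_filter(window_start, window_end)
print(window_filter)
# and(gt(start_time, "2024-03-17T12:00:00+00:00"), lt(end_time, "2024-03-18T12:00:00+00:00"))
```

Using timezone-aware datetimes keeps the UTC offset in the ISO string, so the window does not depend on the machine's local time zone.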

#### Run conversion script

In [33]:
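# Optional cleanup for re-runs: needs `client` (created in the next cell) and an existing dataset.
# client.delete_dataset(dataset_name=dataset_name)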

In [34]:
import random
import uuid
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

prod_runs = list(
    client.list_runs(
        project_name=project_name,
        execution_order=1,
        filter=filter,
    )
)

# We will downsample to only N runs
sampled_runs = random.sample(prod_runs, min(SAMPLE_SIZE, len(prod_runs)))

# Then convert each run's inputs to an example input
ds = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[r.inputs for r in sampled_runs],
    source_run_ids=[r.id for r in sampled_runs],
    dataset_id=ds.id,
)

# Now we'll copy over the prod runs as a "test run"

if not include_children:
    runs_to_copy = sampled_runs
else:
    # TODO: re-fetch with trace_id
    runs_to_copy = [client.read_run(r.id, load_child_runs=True) for r in sampled_runs]


# Copy over and update IDs


test_project_name = f"prod-baseline-{uuid.uuid4().hex[:6]}"

run_to_example_map = {
    e.source_run_id: e.id for e in client.list_examples(dataset_name=dataset_name)
}


def convert_ids(run_dict: dict, id_map: dict):
    # Convert dotted order and parent_run_id
    do = run_dict["dotted_order"]
    # TODO: speed up / compile regex
    for k, v in id_map.items():
        do = do.replace(str(k), str(v))
    run_dict["dotted_order"] = do

    # parent_run_id
    if run_dict.get("parent_run_id"):
        run_dict["parent_run_id"] = id_map[run_dict["parent_run_id"]]
    if not run_dict.get("extra"):
        run_dict["extra"] = {}
    return run_dict


def convert_root_run(root):
    # Mutate the trace_id and run_id, the dotted order, and the parent_run_ids.
    runs_ = [root]
    trace_id = uuid.uuid4()
    id_map = {root.trace_id: trace_id}
    results = []
    while runs_:
        src = runs_.pop()
        src_dict = src.dict(exclude={"parent_run_ids", "child_run_ids", "session_id"})
        id_map[src_dict["id"]] = id_map.get(src_dict["id"], uuid.uuid4())
        src_dict["id"] = id_map[src_dict["id"]]
        src_dict["trace_id"] = id_map[src_dict["trace_id"]]
        if src.child_runs:
            runs_.extend(src.child_runs)
        results.append(src_dict)
    result = [convert_ids(r, id_map) for r in results]
    result[0]["reference_example_id"] = run_to_example_map[root.id]
    return result


to_create = [
    run_dict for root_run in runs_to_copy for run_dict in convert_root_run(root_run)
]

project = client.create_project(
    project_name=test_project_name,
    reference_dataset_id=ds.id,
    metadata={
        "system_version": "prod",
    },
)
# Copy modified runs over.
for new_run in to_create:
    client.create_run(**new_run, project_name=test_project_name)

# Close out the project so you can manually modify the metadata attributes if desired
_ = client.update_project(project.id, end_time=datetime.now(tz=timezone.utc))
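
The ID remapping in the script above can be hard to follow on real traces. The sketch below applies the same idea to a toy two-run trace made of plain dicts. The function name `remap_trace_ids`, the toy runs, and the timestamp prefixes in `dotted_order` are hypothetical and for illustration only; it is a simplified variant, not the exact logic of `convert_root_run`.

```python
import uuid


def remap_trace_ids(runs):
    """Return copies of `runs` with fresh IDs, keeping parent/child links intact."""
    # Assign one new UUID per old run ID / trace ID.
    id_map = {}
    for run in runs:
        id_map.setdefault(run["id"], uuid.uuid4())
        id_map.setdefault(run["trace_id"], uuid.uuid4())

    remapped = []
    for run in runs:
        # The dotted order embeds the IDs of the run and its ancestors,
        # so every old ID appearing in it must be swapped as well.
        dotted = run["dotted_order"]
        for old, new in id_map.items():
            dotted = dotted.replace(str(old), str(new))
        parent = run.get("parent_run_id")
        remapped.append(
            {
                **run,
                "id": id_map[run["id"]],
                "trace_id": id_map[run["trace_id"]],
                "parent_run_id": id_map[parent] if parent else None,
                "dotted_order": dotted,
            }
        )
    return remapped


# Toy trace: a root run (whose ID doubles as the trace ID) and one child.
root_id, child_id = uuid.uuid4(), uuid.uuid4()
toy_runs = [
    {
        "id": root_id,
        "trace_id": root_id,
        "parent_run_id": None,
        "dotted_order": f"20240318T000000000000Z{root_id}",
    },
    {
        "id": child_id,
        "trace_id": root_id,
        "parent_run_id": root_id,
        "dotted_order": (
            f"20240318T000000000000Z{root_id}."
            f"20240318T000001000000Z{child_id}"
        ),
    },
]

new_root, new_child = remap_trace_ids(toy_runs)
```

After remapping, the child still points at the (new) root, and no original ID remains in either `dotted_order`, so the copied trace does not collide with the prod trace it came from.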

## Run new system

Now we have the dataset and prod runs saved as a "test".

Let's run inference on our new system to compare.

In [35]:
# Use an updated version of the prompt
prompt = hub.pull("wfh/tweet-critic:34c57e4f")
llm = ChatAnthropic(model="claude-3-haiku-20240307")
system = prompt | llm | StrOutputParser()

In [36]:
def deserialize_messages(example_input: dict):
    # The dataset includes serialized messages that we
    # must convert to a format accepted by our system.
    return {
        "messages": [
            (message["type"], message["content"])
            for message in example_input["messages"]
        ]
    }


test_results = client.run_on_dataset(
    llm_or_chain_factory=deserialize_messages | system,
    dataset_name=dataset_name,
)

View the evaluation results for project 'memorable-disease-42' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a41492bc-da4d-4379-8199-fd4e8877e01a/compare?selectedSessions=93951de1-ea67-4e65-af98-daec016d711c

View all tests for Dataset Tweet Critic-candidate-testing 2024-03-17-2024-03-18 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/a41492bc-da4d-4379-8199-fd4e8877e01a
[------------------------------------------------->] 10/10
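
To see what the `deserialize_messages` helper above does to a single dataset example, here is a standalone sketch. The sample input is an assumption about how a serialized message looks in the dataset (a dict with `type` and `content` keys, matching the keys the helper reads); inspect one of your own examples to confirm the shape.

```python
def deserialize_messages(example_input: dict) -> dict:
    # Turn serialized message dicts into (role, content) tuples.
    return {
        "messages": [
            (message["type"], message["content"])
            for message in example_input["messages"]
        ]
    }


# Assumed shape of one dataset example's inputs.
sample_input = {
    "messages": [
        {"type": "human", "content": "Check out our new blog post!"},
    ]
}

converted = deserialize_messages(sample_input)
print(converted)
# {'messages': [('human', 'Check out our new blog post!')]}
```

Extra keys on a serialized message are ignored, since only `type` and `content` are read.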