# Comparison against the baseline

Integrating Lynxius directly into your CI/CD pipeline makes it straightforward to determine whether a pull request introduces a regression or enhances the performance of your LLM-powered application.

Lynxius makes it easy to ensure that each pull request is thoroughly tested and compared against a baseline. The baseline consists of the same set of queries and ground-truth outputs, run against the master branch of your repository without the changes from the pull request. This like-for-like comparison helps your development team quickly identify regressions and maintain the quality of your LLM-powered application at every step.

Setting up your testing pipeline like this ensures that your team can iterate over your codebase swiftly and confidently, making it easier to maintain and improve your application.

For this to work, we run the same set of evaluations in two situations:
1. Whenever the master branch of your application is updated (or on a schedule, for example in a nightly cron job).
2. Whenever a pull request is made or updated.
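
The two triggers above can be sketched as a small dispatch function. This is only an illustration that assumes the pipeline runs on GitHub Actions, where the default environment variables `GITHUB_EVENT_NAME` and `GITHUB_REF_NAME` describe the triggering event and branch. The function name `select_eval_mode` and the `"skip"` mode are hypothetical and not part of Lynxius.

```python
import os


def select_eval_mode(event_name: str, ref_name: str, main_branch: str = "master") -> str:
    """Map a CI event to "baseline", "pull_request", or "skip" (hypothetical helper)."""
    if event_name == "pull_request":
        # A pull request was opened or updated: run the test evals.
        return "pull_request"
    if event_name == "schedule":
        # Nightly cron job: refresh the baseline.
        return "baseline"
    if event_name == "push" and ref_name == main_branch:
        # The main branch was updated: refresh the baseline.
        return "baseline"
    return "skip"


# Outside of GitHub Actions both variables are unset, so the mode is "skip".
mode = select_eval_mode(
    os.getenv("GITHUB_EVENT_NAME", ""),
    os.getenv("GITHUB_REF_NAME", ""),
)
```

The baseline cell further down would then run when `mode == "baseline"`, and the pull request cell when `mode == "pull_request"`.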

The first set of evaluations is called the baseline. Simply put, the baseline indicates the performance of your application on the master branch. Whenever you change your application and open a pull request, you want to ensure that the change actually improves the performance of your system and doesn't introduce unwanted side effects or regressions. To check this, we test the pull request against the baseline.
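
To make the idea concrete, here is a minimal sketch of what "testing against the baseline" means. Lynxius performs the actual comparison on its platform, so this is not its API: the function `find_regressions`, the `tolerance` parameter, and the assumption that higher scores are better are all illustrative.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return the metrics whose score dropped by more than `tolerance`.

    Assumes higher scores are better. Metrics missing from `candidate`
    are ignored rather than treated as regressions.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in candidate:
            continue
        delta = candidate[metric] - base_score
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions


# Made-up scores, for illustration only.
baseline_scores = {"bert_score": 0.80, "answer_correctness": 0.90}
pr_scores = {"bert_score": 0.70, "answer_correctness": 0.91}
regressed = find_regressions(baseline_scores, pr_scores)
```

With these made-up numbers only `bert_score` is flagged, because it dropped by more than the tolerance while `answer_correctness` improved slightly.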

In [3]:
import os
import sys
from getpass import getpass
sys.path.append("../")

In [2]:
# Makes it easier to iterate
%load_ext autoreload
%autoreload 2

In [4]:
# We'll be using OpenAI to evaluate locally so we have to set the API key
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


In [10]:
# Our sample LLM application
from datasets_utils import chatdoctor_v1

# The Lynxius client communicates with the Lynxius online platform
from lynxius.client import LynxiusClient

# Importing the evaluators
from lynxius.evals.bert_score import BertScore
from lynxius.evals.answer_correctness import AnswerCorrectness
from lynxius.evals.semantic_similarity import SemanticSimilarity
from lynxius.evals.custom_eval import CustomEval
from lynxius.evals.context_precision import ContextPrecision
from lynxius.evals.json_diff import JsonDiff

# ContextChunk represents a document retrieved from your RAG system
from lynxius.rag.types import ContextChunk

In [15]:
# Here we define sample RAG contexts.
# Retrieval of context documents depends on the RAG database that you're using
context = [
    ContextChunk(document="Avoid close contact with people who are sick. When you are sick, keep your distance from others to protect them from getting sick, too.", relevance=0.75),
    ContextChunk(document="If possible, stay home from work, school, and errands when you’re sick. You can go back to your normal activities when, for at least 24 hours, both are true:", relevance=0.31)
]

# Define tags to make it easier to locate these eval runs on the Lynxius platform
tags = ["notebook", "experiment", "baseline"]

In [14]:
# Lynxius allows you to use your own evaluator templates
# Let's define and use one!

# When using a custom template, the only requirement is that the final verdict
# is printed on the last line of the response, with no other characters on that line.
custom_eval_template = """
You are given a question, a reference answer and a candidate answer concerning a clinical matter.
You must determine if the candidate answer covers exactly the same content as the reference answer.
If the candidate answer contains additional information, or fails to mention something that is present
in the reference answer, your verdict should be 'incorrect'. Otherwise, your verdict should be 'correct'.
Provide a short explanation about how you arrived at your verdict. The verdict must be printed at the
very bottom of your response, on a new line, and it must not contain any extra characters.
Here is the data:
***********
Query: {query}
***********
Reference answer: {reference}
***********
Candidate answer: {output}
"""
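
The requirement that the verdict sits alone on the last line makes the response easy to parse mechanically. The sketch below shows one way such parsing could work. It is an assumption for illustration, not how Lynxius processes the response, and `extract_verdict` is a hypothetical helper.

```python
from typing import Optional


def extract_verdict(
    response: str,
    allowed: tuple = ("correct", "incorrect"),
) -> Optional[str]:
    """Return the verdict found on the last non-empty line, or None.

    The last line must consist of the verdict only, so a line such as
    "Verdict: correct" is rejected.
    """
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    verdict = lines[-1].lower()
    return verdict if verdict in allowed else None


sample_response = "The candidate answer adds advice that is not in the reference.\nincorrect\n"
verdict = extract_verdict(sample_response)
```

A response whose last line carries anything besides the verdict yields `None` here, which is why the template insists on a bare final line.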

In [11]:
# We recommend storing baseline eval runs and test eval runs in separate projects.
# That's what we do here, so we need two Lynxius API keys, one for each project.
# The keys below are examples only and are not operational. Please use your own API keys.

testing_lynxius_api_key = "PU7Mf8iDMVcH2ElMaabChQP6zkLqb2cTbrlfnIagGAuHhWyj"
baseline_lynxius_api_key = "gJ64Mgtv78DeEABavgxNGlyuGXEoPReUFtWrCel31wDU2Psy"
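
In a real CI/CD pipeline you would typically not hard-code the keys but read them from secrets exposed as environment variables, in the same way the OpenAI key is read earlier in this notebook. A minimal sketch follows. The helper `require_env` and the variable names `LYNXIUS_TESTING_API_KEY` and `LYNXIUS_BASELINE_API_KEY` are assumptions chosen for this example, not names defined by Lynxius.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is missing."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; configure them as secrets in your CI system.
# testing_lynxius_api_key = require_env("LYNXIUS_TESTING_API_KEY")
# baseline_lynxius_api_key = require_env("LYNXIUS_BASELINE_API_KEY")
```

Failing early with a clear message is usually preferable to passing an empty key to the client and getting an authentication error later in the run.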

In [26]:
# ================================================================================================
# This code runs on every merge to the master branch or, alternatively, in a nightly cron job.
# ================================================================================================

client_baseline = LynxiusClient(api_key=baseline_lynxius_api_key, run_local=True)
dataset_details = client_baseline.get_dataset_details(dataset_id="6e83cec5-d8d3-4237-af9e-8d4b7c71a2ce")

# Define and run the evals
bert_score = BertScore("master_baseline_sample", level="word", presence_threshold=0.65, tags=tags)
answer_correctness = AnswerCorrectness("master_baseline_sample", tags=tags)
semantic_similarity = SemanticSimilarity("master_baseline_sample", tags=tags)
custom_eval = CustomEval("master_baseline_sample", name="clinical_correctness", prompt_template=custom_eval_template, tags=tags)
context_precision = ContextPrecision("master_baseline_sample", tags=tags)

for entry in dataset_details.entries:
    # Query your LLM
    actual_output = chatdoctor_v1(entry.query)

    # Add traces to the evals
    bert_score.add_trace(reference=entry.reference, output=actual_output, context=context)
    answer_correctness.add_trace(query=entry.query, reference=entry.reference, output=actual_output, context=context)
    semantic_similarity.add_trace(reference=entry.reference, output=actual_output, context=context)
    custom_eval.add_trace(values={"query": entry.query, "reference": entry.reference, "output": actual_output}, context=context)
    context_precision.add_trace(query=entry.query, reference=entry.reference, context=context)

# Run evals locally and store results in the Lynxius platform
client_baseline.evaluate(bert_score)
client_baseline.evaluate(answer_correctness)
client_baseline.evaluate(semantic_similarity)
client_baseline.evaluate(custom_eval)
client_baseline.evaluate(context_precision)

'6d32201a-7cc8-4c47-aa5c-e43cfb0efc7c'

In [27]:
json_diff = JsonDiff("master_baseline_sample", tags=tags)

ref = {
    "prop1": True,
    "prop2": True,
    "prop3": {
      "prop4": True,
      "prop5": True
    }
}
output = {
    "prop1": True,
    "prop2": True,
    "prop3": {
      "prop4": True,
      "prop5": False
    }
}
weights = {
    "prop1": 1.0,
    "prop2": 1.0,
    "prop3": {
        "prop4": 0.5,
        "__prop3": 0.7,
    }
}

json_diff.add_trace(reference=ref, output=output, weights=weights, context=context)
client_baseline.evaluate(json_diff)

'b89e92d5-d6d9-4a85-89df-fda762ddeeef'

In [31]:
# ================================================================================================
# This code runs whenever a pull request is made or updated. Here we will reference the baseline.
# ================================================================================================

client_testing = LynxiusClient(api_key=testing_lynxius_api_key, run_local=True)
# Download the same dataset that was used for the baseline, so the results are comparable
dataset_details = client_testing.get_dataset_details(dataset_id="6e83cec5-d8d3-4237-af9e-8d4b7c71a2ce")

# This is the ID of the project where the baseline eval runs are stored.
# You can get it by visiting the project details page on the Lynxius platform.
baseline_project_uuid = "a9f1e445-fe16-430e-977e-33e954baa568"
# This is the label that we assigned to our baseline eval runs earlier
baseline_eval_run_label = "master_baseline_sample"

# Define and run the evals
bert_score = BertScore(
    "PR #123",
    level="word",
    presence_threshold=0.65,
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)
answer_correctness = AnswerCorrectness(
    "PR #123",
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)
semantic_similarity = SemanticSimilarity(
    "PR #123",
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)
custom_eval = CustomEval(
    "PR #123",
    name="clinical_correctness",
    prompt_template=custom_eval_template,
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)
context_precision = ContextPrecision(
    "PR #123",
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)

for entry in dataset_details.entries:
    # Query your LLM
    actual_output = chatdoctor_v1(entry.query)

    # Add traces to the evals
    bert_score.add_trace(reference=entry.reference, output=actual_output, context=context)
    answer_correctness.add_trace(query=entry.query, reference=entry.reference, output=actual_output, context=context)
    semantic_similarity.add_trace(reference=entry.reference, output=actual_output, context=context)
    custom_eval.add_trace(values={"query": entry.query, "reference": entry.reference, "output": actual_output}, context=context)
    context_precision.add_trace(query=entry.query, reference=entry.reference, context=context)

# Run evals locally and store results in the Lynxius platform
client_testing.evaluate(bert_score)
client_testing.evaluate(answer_correctness)
client_testing.evaluate(semantic_similarity)
client_testing.evaluate(custom_eval)
client_testing.evaluate(context_precision)

'dcd91d4a-bade-4406-83e4-5bee8844daa6'

In [30]:
json_diff = JsonDiff(
    "PR #123",
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label,
)

ref = {
    "prop1": True,
    "prop2": True,
    "prop3": {
      "prop4": True,
      "prop5": True
    }
}
output = {
    "prop1": True,
    "prop2": True,
    "prop3": {
      "prop4": True,
      "prop5": False
    }
}
weights = {
    "prop1": 1.0,
    "prop2": 1.0,
    "prop3": {
        "prop4": 0.5,
        "__prop3": 0.7,
    }
}

json_diff.add_trace(reference=ref, output=output, weights=weights, context=context)
client_testing.evaluate(json_diff)

'c5632d74-089c-4c3d-9290-0d2b7ed0ae84'
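
The label `"PR #123"` is hard-coded in the cells above. In a pipeline you would probably derive it from the CI environment. As a sketch, assuming GitHub Actions, where `GITHUB_REF` has the form `refs/pull/<number>/merge` for `pull_request` events, a hypothetical helper could look like this:

```python
import re
from typing import Optional


def pr_label_from_ref(ref: str) -> Optional[str]:
    """Build an eval run label such as "PR #123" from a GitHub pull request ref.

    Returns None when the ref does not point at a pull request merge commit.
    """
    match = re.fullmatch(r"refs/pull/(\d+)/merge", ref)
    return f"PR #{match.group(1)}" if match else None


label = pr_label_from_ref("refs/pull/123/merge")
```

Here `label` is `"PR #123"`, and a branch ref such as `refs/heads/master` gives `None`, so the caller can decide how to label runs that do not belong to a pull request.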