# Quickstart Continuous Evaluation

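Set up your LLM app baselines and an automated testing pipeline in about three minutes. This quickstart evaluates the outputs of a pull request with Lynxius and compares the aggregate score against the main-branch baseline, so you can ship AI with confidence at every development stage.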

[Sign up for free](https://platform.lynxius.ai/auth/signup) to create an account. Don't forget to verify your email!

In [1]:
# First, set up the Lynxius API keys
import os
import sys
from getpass import getpass
sys.path.append("../")

lynxius_cicd_api_key = getpass("🔑 Enter your Lynxius CI/CD API key: ")
lynxius_main_api_key = getpass("🔑 Enter your Lynxius Main-Baseline API key: ")

os.environ["LYNXIUS_BASE_URL"] = "https://platform.lynxius.ai"

In [2]:
# `chat_pizza` LLM App used OpenAI GPT-4 to produce these outputs (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/)
dataset = [
    {
        "query": "What is the first tomato-topped pizza? Keep it short.",
        "reference": "Pizza marinara is supposedly the oldest tomato-topped pizza.",
        "output": "The first tomato-topped pizza is Pizza marinara.",
    },
    {
        "query": "When did pizza arrive in the United States? Keep it short.",
        "reference": "The first pizzeria in the U.S. was opened in New York City's Little Italy in 1905.",
        "output": "The first pizzeria in the U.S. was opened in 1905.",
    },
    {
        "query": "Which tomato sauce is used in neapolitan pizza? Keep it short.",
        "reference": "The tomato sauce of Neapolitan pizza must be made with San Marzano tomatoes or pomodorini del Piennolo del Vesuvio.",
        "output": "San Marzano tomatoes or pomodorini del Piennolo del Vesuvio are traditionally used in Neapolitan pizza sauce.",
    },
    {
        "query": "What is pizza quattro stagioni? Keep it short.",
        "reference": "Pizza quattro stagioni is a variety of Italian pizza prepared in four sections with diverse ingredients. Each section represents one season of the year. Artichokes represent spring, tomatoes or basil represent summer, mushrooms represent autumn and the ham, prosciutto or olives represent winter.",
        "output": "Pizza Quattro Stagioni is an Italian pizza that represents the four seasons through its toppings, divided into four sections. Each section features ingredients typical of a particular season, like artichokes for spring, peppers for summer, mushrooms for autumn, and olives or prosciutto for winter.",
    },
    {
        "query": "What is the main pizza ingredient in one word?",
        "reference": "Dough.",
        "output": "Dough.",
    },
    {
        "query": "What is in Hawaiian pizza?",
        "reference": "Hawaiian pizza contains tomato sauce, pineapple and ham.",
        "output": "Hawaiian pizza: tomato sauce, pineapple and ham.",
    }
]
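
Each entry above carries the three string fields that the evaluation cell reads: `query`, `reference`, and `output`. As a minimal sketch, and assuming you want to catch malformed entries before sending anything to Lynxius, a check like the following could run first. The helper name `find_invalid_entries` and the `REQUIRED_KEYS` tuple are hypothetical and are not part of the Lynxius library.

```python
# Hypothetical helper, not part of the Lynxius API: checks the dataset shape
# used in this quickstart before any traces are added.
REQUIRED_KEYS = ("query", "reference", "output")


def find_invalid_entries(dataset):
    """Return (index, key) pairs where a required key is missing or not a non-empty string."""
    problems = []
    for index, entry in enumerate(dataset):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, key))
    return problems


sample = [
    {
        "query": "What is in Hawaiian pizza?",
        "reference": "Hawaiian pizza contains tomato sauce, pineapple and ham.",
        "output": "Hawaiian pizza: tomato sauce, pineapple and ham.",
    },
    {
        "query": "What is the main pizza ingredient in one word?",
        "reference": "Dough.",
        # "output" is deliberately missing to show a reported problem
    },
]

problems = find_invalid_entries(sample)
print(problems)  # [(1, 'output')]
```

An empty list means every entry has all three fields; calling `find_invalid_entries(dataset)` on the full list above should report no problems.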

In [3]:
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

pr_client = LynxiusClient(api_key=lynxius_cicd_api_key)
bsl_client = LynxiusClient(api_key=lynxius_main_api_key)

label = "PR #124"
tags = ["GPT-4", "q_answering", "pull_request"]
baseline_project_uuid = "4d683adf-a17b-4847-bb78-9663152bcba7"  # identifier of the main baseline project
baseline_eval_run_label = "main_branch_baseline"  # label identifier of the baseline QA task
answer_correctness = AnswerCorrectness(
    label=label,
    tags=tags,
    baseline_project_uuid=baseline_project_uuid,
    baseline_eval_run_label=baseline_eval_run_label
)

for entry in dataset:
    answer_correctness.add_trace(
        query=entry["query"],
        reference=entry["reference"],
        output=entry["output"],  # in a live pipeline, the result of the chat_pizza LLM call
        context=[]
    )

# run eval
answer_correctness_uuid = pr_client.evaluate(answer_correctness)

# get eval results and compare the PR score against the baseline
pr_eval_run = pr_client.get_eval_run(answer_correctness_uuid)
pr_score = pr_eval_run.get("aggregate_score")
bsl_score = bsl_client.get_eval_run(
    pr_eval_run.get("baseline_eval_run_uuid")
).get("aggregate_score")

print(f"pr_score is: {pr_score}")
print(f"bsl_score is: {bsl_score}")

if pr_score > bsl_score:
    print(f"PR score {pr_score} is greater than baseline score {bsl_score}.")
else:
    print(f"PR score {pr_score} is not greater than baseline score {bsl_score}.")

Attempt 1 received status PENDING. Retrying...
Attempt 2 received status PENDING. Retrying...
pr_score is: 0.873015873015873
bsl_score is: 0.373015873015873
PR score 0.873015873015873 is greater than baseline score 0.373015873015873.
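
The cell above only prints the comparison. In an automated pipeline you would typically turn it into a pass/fail signal. The following is a minimal sketch of one way to do that, assuming the CI job fails whenever the process exits with a non-zero code. The function name `ci_exit_code` and the `tolerance` parameter are hypothetical, and passing on a tie (`>=`) is an assumption that differs from the strict `>` used above.

```python
# Hypothetical gate, not part of the Lynxius API: maps the two aggregate
# scores to a process exit code that a CI job can act on.
def ci_exit_code(pr_score, bsl_score, tolerance=0.0):
    """Return 0 (pass) if the PR score is within `tolerance` of the baseline or better, else 1 (fail)."""
    # Treat a missing score as a failure rather than comparing against None.
    if pr_score is None or bsl_score is None:
        return 1
    return 0 if pr_score >= bsl_score - tolerance else 1


# Scores taken from the output printed above.
exit_code = ci_exit_code(0.873015873015873, 0.373015873015873)
print(f"exit code: {exit_code}")  # exit code: 0
```

In a script run by CI, you could finish with `sys.exit(ci_exit_code(pr_score, bsl_score))` so that a regression against the baseline fails the job. A small positive `tolerance` avoids failing on score differences you consider noise.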
