# Tutorial 2 - Basic Workflow - Add Your Own Tests 

**Scenario**: You are a model developer and you want to evaluate your custom chatbot in your CI/CD pipeline as your team continuously trains and improves the system. You have identified a dataset for benchmarking your model's performance, but it does not exist in Moonshot. How can you add this new dataset to Moonshot and run it against your system?

In this tutorial, you will learn how to:

- Add your own `dataset` into Moonshot
- Create and run your own `recipe`
- Create and run your own `cookbook`

**Prerequisite**:

- You have added your `connector endpoint` in Moonshot. If you are unsure how to do it, please refer to "Tutorial 1" in the same folder.

**Before starting this tutorial, please make sure you have already installed `moonshot` and `moonshot-data`.** Otherwise, please follow this tutorial to install and configure Moonshot first.

## Import Moonshot Library API

In this section, we prepare our Jupyter notebook environment by importing the libraries required to create and run our own tests.

In [8]:
# Moonshot Framework API Imports
# These imports from the Moonshot framework allow us to interact with the API, 
# creating and managing various components such as recipes, cookbooks, and endpoints.
import os
import json
import asyncio
import sys

# Ensure that the root of the Moonshot framework is in the system path for module importing.
sys.path.insert(0, '../../')

from moonshot.api import (
    api_get_all_recipe,
    api_create_recipe,
    api_create_cookbook,
    api_get_all_runner,
    api_load_runner,
    api_read_result,
    api_set_environment_variables
)

moonshot_path = "./moonshot"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_path, "attack-modules"),
    "BOOKMARKS": os.path.join(moonshot_path, "generated-outputs/bookmarks"),
    "CONNECTORS": os.path.join(moonshot_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_path, "io-modules"),
    "METRICS": os.path.join(moonshot_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_path, "recipes"),
    "RESULTS": os.path.join(moonshot_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_path, "runners-modules"),
}

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)

# NOTE: if you manage to set the environment variables successfully, you will not see any printout
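
Because a successful call prints nothing, it can help to confirm that the configured folders exist before running anything. The following is a minimal sketch using only the standard library; `find_missing_dirs` is a hypothetical helper and not part of the Moonshot API, and the two-entry mapping is a shortened stand-in for the full `env` dictionary above.

```python
import os


def find_missing_dirs(env: dict) -> list:
    """Return the names of env entries whose directory does not exist."""
    return sorted(name for name, path in env.items() if not os.path.isdir(path))


# Shortened, illustrative mapping; pass the full `env` dictionary in practice.
example_env = {
    "DATASETS": "./moonshot/datasets",
    "RECIPES": "./moonshot/recipes",
}
missing = find_missing_dirs(example_env)
if missing:
    print("Missing directories for:", ", ".join(missing))
```

Some folders, such as those under `generated-outputs`, may only appear after a first run, so a missing entry is a hint to investigate rather than a definite error.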

## Prepare the Dataset

In this section, we will look at preparing a dataset in Moonshot. Say you have a list of "fruits" questions to ask your chatbot. You need to prepare them using a data schema that is compatible with Moonshot:

- `name` (str): name of the dataset
- `description` (str): description of the dataset
- `license` (str): license of the dataset
- `reference` (str): a link or reference to where the dataset is from (or the author of the dataset)
- `examples` (list): a list of dictionaries, each containing the prompt (`input`) and ground truth (`target`). A `target` can be left blank.
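
Before writing a dataset file, you may want to check a dictionary against the fields listed above. This is a minimal sketch; `validate_dataset` is a hypothetical helper, not a Moonshot function, and Moonshot may apply its own, stricter validation when it loads the file.

```python
# Field names and types taken from the schema list above.
REQUIRED_FIELDS = {
    "name": str,
    "description": str,
    "license": str,
    "reference": str,
    "examples": list,
}


def validate_dataset(dataset: dict) -> list:
    """Return a list of problems; an empty list means the dict matches the schema."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in dataset:
            problems.append(f"missing field: {field}")
        elif not isinstance(dataset[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    examples = dataset.get("examples")
    if isinstance(examples, list):
        for i, example in enumerate(examples):
            # Only `input` is checked, since a `target` can be left blank.
            if not isinstance(example, dict) or "input" not in example:
                problems.append(f"examples[{i}] needs an 'input' key")
    return problems
```

Calling `validate_dataset(test_dataset)` on the dictionary defined in the next cell should return an empty list.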

In [9]:
test_dataset = {
    "name": "Fruits Dataset",
    "description":"Measures whether the model knows what is a fruit",
    "license": "MIT license",
    "reference": "",
    "examples": [
        {
            "input": "Is Lemon a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Apple a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Bak Choy a Fruit? Answer Yes or No.",
            "target": "No."
        },
        {
            "input": "Is Bak Kwa a Fruit? Answer Yes or No.",
            "target": "No."
        },
        {
            "input": "Is Dragonfruit a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Orange a Fruit? Answer Yes or No.",
            "target": "Yes."
        },
        {
            "input": "Is Coke Zero a Fruit? Answer Yes or No.",
            "target": "No."
        }
    ]
}


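# Write the dataset into Moonshot's datasets folder; the file name (without .json) becomes the dataset ID.
in_file = os.path.join(moonshot_path, "datasets", "test-dataset.json")
with open(in_file, "w") as f:
    json.dump(test_dataset, f, indent=2)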

## Create a new `recipe`

To run this dataset, you need to create a new `recipe`. A `recipe` contains all the details required to run a benchmark. A `recipe` guides Moonshot on what data to use, and how to evaluate the model's responses.

To create a new recipe, you need the following fields:

1. **Name**: A unique name for the recipe.
2. **Description**: An explanation of what the recipe does and what it's for.
3. **Tags**: Keywords that categorise the recipe, making it easier to find and group with similar recipes.
4. **Categories**: Broader classifications that help organise recipes into collections.
5. **Datasets**: The data that will be used when running the recipe. This could be a set of prompts, questions, or any input that the model will respond to.
6. **Prompt Templates**: Pre-prompt or post-prompt text that will be prepended or appended to every prompt.
7. **Metrics**: Criteria or measurements used to evaluate the model's responses, such as accuracy, fluency, or adherence to a prompt.
8. **Grading Scale**: A set of thresholds or criteria used to grade or score the model's performance.
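
To make the grading scale concrete, here is a minimal sketch of how a numeric score could be mapped to a letter grade, using the same bands as this tutorial's recipe. `grade_for_score` is a hypothetical helper; how Moonshot itself resolves grades, including scores that fall between two bands, is not shown in this tutorial.

```python
def grade_for_score(score: float, grading_scale: dict) -> str:
    """Return the first grade whose inclusive [low, high] band contains the score."""
    for grade, (low, high) in grading_scale.items():
        if low <= score <= high:
            return grade
    return "-"  # no band matched


scale = {"A": [80, 100], "B": [60, 79], "C": [40, 59], "D": [20, 39], "E": [0, 19]}
print(grade_for_score(85, scale))    # A
print(grade_for_score(79.5, scale))  # - (falls in the gap between B and A)
```

With integer band edges, fractional scores such as 79.5 are not covered in this sketch, which is worth keeping in mind when defining your own bands.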

In [10]:
test_recipe = api_create_recipe(
    "Fruit Questions", # name, mandatory
    "This recipe tests a model's ability to answer questions about fruits.", # description, mandatory
    ["chatbot"], # tags, optional
    ["capability"], # category, optional
    ["test-dataset"], # dataset filename, mandatory
    [], # prompt template, optional
    ["exactstrmatch"], # metrics, mandatory
    { # grading scale, optional
        "A": [
            80,
            100
        ],
        "B": [
            60,
            79
        ],
        "C": [
            40,
            59
        ],
        "D": [
            20,
            39
        ],
        "E": [
            0,
            19
        ]
    }
)

print(f"Recipe '{test_recipe}' has been created.")


Recipe 'fruit-questions' has been created.
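
The recipe above uses the `exactstrmatch` metric. The sketch below illustrates exact string matching under the assumption that a response counts as correct only when it equals the target character for character; the metric's real implementation is not shown in this tutorial, and `exact_match_accuracy` is a hypothetical name.

```python
def exact_match_accuracy(pairs: list) -> float:
    """Percentage of (response, target) pairs that are identical strings."""
    if not pairs:
        return 0.0
    matches = sum(1 for response, target in pairs if response == target)
    return 100.0 * matches / len(pairs)


# Illustrative pairs: the second one differs only by the trailing period.
pairs = [("Yes.", "Yes."), ("Yes", "Yes."), ("No.", "No.")]
print(exact_match_accuracy(pairs))
```

Under this assumption, a response of "Yes" does not match a target of "Yes.", so punctuation in the `target` values of your dataset matters.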


## Run your new recipe

You can now run this new recipe on your `connector endpoint`. We will use the endpoint `my-openai-endpoint`, which we created in Tutorial 1.

In [16]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "my new recipe runner" # Name of the runner
recipes = ["fruit-questions"] # Test against the fruit-questions recipe
endpoints = ["my-openai-endpoint"]  # Test against 1 endpoint
percentage_of_prompts = 100 # Run every prompt, as the fruit recipe's dataset is small

# Below are the optional fields
random_seed = 0   # Default: 0; this allows for randomness in dataset selection when percentage_of_prompts is below 100
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the recipes with the defined endpoints.
# If a runner with this ID already exists, load it instead of creating a new one.
# Reusing a runner lets the new run use cached results from previous runs, which can shorten the run time considerably.
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    rec_runner = api_load_runner(slugify_id)
else:
    rec_runner = api_create_runner(name, endpoints)

# run_recipes is an async function; there is currently no sync version.
# Jupyter already runs an event loop, so we can await the call directly.
await rec_runner.run_recipes(
    recipes,
    percentage_of_prompts,
    random_seed,
    system_prompt,
    runner_proc_module,
    result_proc_module,
)
rec_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results
runner_runs = api_get_all_run(rec_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")


2024-11-08 10:44:54,829 [INFO][runner.py::run_recipes(349)] [Runner] my-new-recipe-runner - Running benchmark recipe run...
2024-11-08 10:44:54,859 [INFO][benchmarking.py::generate(169)] [Benchmarking] Running recipes (['fruit-questions'])...
2024-11-08 10:44:54,860 [INFO][benchmarking.py::generate(173)] [Benchmarking] Running recipe fruit-questions... (1/1)
2024-11-08 10:44:54,864 [INFO][connector.py::get_prediction(348)] [Connector ID: openai-gpt35-turbo] Predicting Prompt Index 4.
2024-11-08 10:44:55,419 [INFO][benchmarking.py::generate(203)] [Benchmarking] Run took 0.5598s
2024-11-08 10:44:55,421 [INFO][benchmarking.py::generate(258)] [Benchmarking] Preparing results took 0.0000s
2024-11-08 10:44:55,425 [INFO][benchmarking-result.py::generate(58)] [BenchmarkingResult] Generate results took 0.0041s
2024-11-08 10:44:55,426 [INFO][runner.py::run_recipes(375)] [Runner] my-new-recipe-runner - Benchmark recipe run completed.


{
  "metadata": {
    "id": "my-new-recipe-runner",
    "start_time": "2024-11-08 10:44:54",
    "end_time": "2024-11-08 10:44:55",
    "duration": 0,
    "status": "completed",
    "recipes": [
      "fruit-questions"
    ],
    "cookbooks": null,
    "endpoints": [
      "openai-gpt35-turbo"
    ],
    "prompt_selection_percentage": 100,
    "random_seed": 0,
    "system_prompt": ""
  },
  "results": {
    "recipes": [
      {
        "id": "fruit-questions",
        "details": [
          {
            "model_id": "openai-gpt35-turbo",
            "dataset_id": "test-dataset",
            "prompt_template_id": "no-template",
            "data": [
              {
                "prompt": "Is Lemon a Fruit? Answer Yes or No.",
                "predicted_result": {
                  "response": "Yes",
                  "context": []
                },
                "target": "Yes.",
                "duration": 1.3310747500217985
              },
              {
                "prom

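
The raw JSON is verbose, so a short summary per recipe can be easier to read. This is a minimal sketch that relies only on the keys visible in the output above (`results`, `recipes`, `details`, `data`, `predicted_result`, `target`); `summarise_recipe_results` is a hypothetical helper, and the sample dictionary is hand-written for illustration rather than real run output.

```python
def summarise_recipe_results(result_info: dict) -> list:
    """Count prompts and exact response/target matches per recipe, model, and dataset."""
    rows = []
    for recipe in result_info["results"]["recipes"]:
        for detail in recipe["details"]:
            data = detail["data"]
            matched = sum(
                1 for item in data
                if item["predicted_result"]["response"] == item["target"]
            )
            rows.append({
                "recipe": recipe["id"],
                "model": detail["model_id"],
                "dataset": detail["dataset_id"],
                "prompts": len(data),
                "exact_matches": matched,
            })
    return rows


sample = {
    "results": {
        "recipes": [{
            "id": "fruit-questions",
            "details": [{
                "model_id": "openai-gpt35-turbo",
                "dataset_id": "test-dataset",
                "data": [
                    # Copied from the output above.
                    {"prompt": "Is Lemon a Fruit? Answer Yes or No.",
                     "predicted_result": {"response": "Yes", "context": []},
                     "target": "Yes."},
                    # Hypothetical entry, added only to show a matching pair.
                    {"prompt": "Is Bak Choy a Fruit? Answer Yes or No.",
                     "predicted_result": {"response": "No.", "context": []},
                     "target": "No."},
                ],
            }],
        }]
    }
}
print(summarise_recipe_results(sample))
```

In the notebook, you could pass `result_info` from the cell above instead of `sample`.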


## Beautifying Test Results

The result above is shown as raw JSON. To make it easier to read, we provide the helper functions below, which format the results into a table.

In [None]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_recipe_results(recipes, endpoints, recipe_results, duration):
    """
    Show the results of the recipe benchmarking.

    This function takes the recipes, endpoints, recipe results, and duration as arguments.
    If there are any recipe results, it generates a table to display them using the generate_recipe_table function.
    If there are no recipe results, it prints a message indicating that there are no results.
    It then prints the time taken to run the benchmarking.

    Args:
        recipes (list): A list of recipes that were benchmarked.
        endpoints (list): A list of endpoints that were used in the benchmarking.
        recipe_results (dict): A dictionary with the results of the recipe benchmarking.
        duration (float): The time taken to run the benchmarking in seconds.

    Returns:
        None
    """
    if recipe_results:
        # Display recipe results
        generate_recipe_table(recipes, endpoints, recipe_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_recipe_table(recipes: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table of recipe results.

    This function creates a table that lists the results of running recipes against various endpoints.
    Each row in the table corresponds to a recipe, and each column corresponds to an endpoint.
    The results include the grade and average grade value for each recipe-endpoint pair.

    Args:
        recipes (list): A list of recipe IDs that were benchmarked.
        endpoints (list): A list of endpoint IDs against which the recipes were run.
        results (dict): A dictionary containing the results of the benchmarking.

    Returns:
        None: This function does not return anything. It prints the table to the console.
    """
    # Create a table with a title and headers
    table = Table(
        title="Recipes Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Recipe", justify="left", width=78)
    # Add a column for each endpoint
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    # Iterate over each recipe and populate the table with results
    for index, recipe_id in enumerate(recipes, start=1):
        # Attempt to find the result for the current recipe
        recipe_result = next(
            (
                result
                for result in results["results"]["recipes"]
                if result["id"] == recipe_id
            ),
            None,
        )

        # If the result exists, extract and format the results for each endpoint
        if recipe_result:
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        eval_summary
                        for eval_summary in recipe_result["evaluation_summary"]
                        if eval_summary["model_id"] == endpoint
                    ),
                    None,
                )

                # Format the grade and average grade value, or use "-" if not found
                grade = "-"
                if (
                    evaluation_summary
                    and "grade" in evaluation_summary
                    and "avg_grade_value" in evaluation_summary
                    and evaluation_summary["grade"]
                ):
                    grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"

                endpoint_results.append(grade)

            # Add a row for the recipe with its results
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_result['id']}[/blue]",
                *endpoint_results,
                end_section=True,
            )
        else:
            # If no result is found, add a row with placeholders
            table.add_row(
                str(index),
                f"Recipe: [blue]{recipe_id}[/blue]",
                *(["-"] * len(endpoints)),
                end_section=True,
            )

    # Print the table to the console
    console.print(table)

if result_info:
    show_recipe_results(
        recipes, endpoints, result_info, result_info["metadata"]["duration"]
    )
    

## Create a new `cookbook`

We can also create a new `cookbook` and add existing recipes together with our new recipe. A `cookbook` in Moonshot is a curated collection of `recipes` designed to be executed together.

To create a new cookbook, you need the following elements:

1. **Name**: A unique name for the cookbook.
2. **Description**: A detailed explanation of the cookbook's purpose and the types of recipes it contains.
3. **Recipes**: A list of recipe names that are included in the cookbook. Each recipe represents a specific test or benchmark.

In [11]:
cookbook_id = api_create_cookbook(
    "test-cookbook",
    "This cookbook tests both fruits questions and general science questions.",
    ["fruit-questions", "mmlu"]
)

print(f"Cookbook '{cookbook_id}' has been created.")

Cookbook 'test-cookbook' has been created.


## Run your new cookbook

In [12]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "test new cookbook" # Indicate the name
cookbooks = ["test-cookbook"] # Test against 1 cookbook, test-cookbook
endpoints = ["my-openai-endpoint"] # Test against 1 endpoint, my-openai-endpoint
num_of_prompts = 5 # use a smaller number to test out the function; 0 means using all prompts in dataset

# Below are the optional fields
random_seed = 0   # Default: 0; this allows for randomness in dataset selection when num_of_prompts is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbooks with the defined endpoints.
# If a runner with this ID already exists, load it instead of creating a new one.
# Reusing a runner lets the new run use cached results from previous runs, which can shorten the run time considerably.
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks is an async function; there is currently no sync version.
# Jupyter already runs an event loop, so we can await the call directly.
await cb_runner.run_cookbooks(
    cookbooks,
    num_of_prompts,
    random_seed,
    system_prompt,
    runner_proc_module,
    result_proc_module,
)
cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

Established connection to database (data/generated-outputs/databases/test-new-cookbook.db)
[Runner] test-new-cookbook - Running benchmark cookbook run...
[Run] Part 0: Initialising run...
[Run] Initialise run took 0.0017s
[Run] Part 1: Loading asyncio running loop...
[Run] Part 2: Loading modules...
[Run] Module loading took 0.0032s
[Run] Part 3: Running runner processing module...
[Benchmarking] Load recipe connectors took 0.0229s
[Benchmarking] Set connectors system prompt took 0.0000s
[Benchmarking] Part 1: Running cookbooks (['test-cookbook'])...
[Benchmarking] Running cookbook test-cookbook... (1/1)
[Benchmarking] Load required instances...
[Benchmarking] Load cookbook instance took 0.0007s
[Benchmarking] Running cookbook recipes...
[Benchmarking] Running recipe fruit-questions... (1/2)
[Benchmarking] Load required instances...
[Benchmarking] Load recipe instance took 0.0013s
[Benchmarking] Load recipe metrics took 0.0006s
[Benchmarking] Build and execute generator pipeline...
[Be
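
The run cells above use top-level `await`, which works in Jupyter. In a plain Python script, such as a CI/CD job, top-level `await` is not available, and a common approach is to wrap the call in `asyncio.run`. The sketch below shows that pattern with a stand-in coroutine; `run_benchmark_stub` is hypothetical and only takes the place of an async runner call such as `run_recipes` or `run_cookbooks`.

```python
import asyncio


async def run_benchmark_stub(names: list) -> str:
    """Hypothetical stand-in for an async runner call."""
    await asyncio.sleep(0)  # yield to the event loop once, as a real async call would
    return f"completed: {', '.join(names)}"


def main() -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(run_benchmark_stub(["test-cookbook"]))


if __name__ == "__main__":
    print(main())
```

`asyncio.run` cannot be called while another event loop is already running in the same thread, so keep using `await` inside the notebook.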

## Beautifying Test Results

In [13]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it generates a table with the cookbook results. If there are no results,
    it prints a message indicating that there are no results.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display cookbook results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")
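
The footer printed by `show_cookbook_results` states that the overall rating is the lowest grade among the recipes in each cookbook. A minimal sketch of that rule follows, assuming grades are ordered from A (best) to E (worst) as in this tutorial's grading scale; `overall_grade` is a hypothetical helper, and Moonshot computes the real value in `overall_evaluation_summary`.

```python
# Best to worst, matching the grading scale used in this tutorial.
GRADE_ORDER = ["A", "B", "C", "D", "E"]


def overall_grade(recipe_grades: list) -> str:
    """Return the worst grade in the list, ignoring entries without a grade."""
    graded = [grade for grade in recipe_grades if grade in GRADE_ORDER]
    if not graded:
        return "-"
    # The grade with the highest index in GRADE_ORDER is the lowest grade.
    return max(graded, key=GRADE_ORDER.index)


print(overall_grade(["A", "C", "B"]))  # C
```

Under this rule, one weak recipe sets the grade for the whole cookbook.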