# Tutorial 1 - Basic Workflow - Execute Existing Tests 

**Scenario**: You are a model developer tasked with deploying a system that uses one of the OpenAI models. However, you are unsure which model performs best for your use case, and you want to assess its capabilities using the existing benchmarks in Moonshot. How can you do this?

In this tutorial, you will learn how to:

- Add your own `connector_endpoints` into Moonshot
- List and run an existing `cookbook` in Moonshot

**Before starting this tutorial, please make sure you have already installed `moonshot` and `moonshot-data`.** Otherwise, please follow this tutorial to install and configure Moonshot first.

You will need just two things for this tutorial:
1. Your own copy of `moonshot-data`. You will set the `moonshot_path` variable to its path in the first cell.
2. Your OpenAI key, which will replace the placeholder `ADD_NEW_TOKEN_HERE` in a later cell.

## Import Moonshot Library API

In this section, we prepare our Jupyter notebook environment by importing the libraries required to execute an existing benchmark.

In [2]:
# Moonshot Framework API Imports
# These imports from the Moonshot framework allow us to interact with the API, 
# creating and managing various components such as recipes, cookbooks, and endpoints.
import os
import json
import asyncio
import sys

# Ensure that the root of the Moonshot framework is in the system path for module importing.
sys.path.insert(0, '../../')

from moonshot.api import (
    api_create_endpoint,
    api_get_all_endpoint,
    api_get_all_cookbook,
    api_load_runner,
    api_read_result,
    api_set_environment_variables,
)

# modify moonshot_path to point to your own copy of moonshot-data
moonshot_path = "./data"
env = {
    "ATTACK_MODULES": os.path.join(moonshot_path, "attack-modules"),
    "BOOKMARKS": os.path.join(moonshot_path, "generated-outputs/bookmarks"),
    "CONNECTORS": os.path.join(moonshot_path, "connectors"),
    "CONNECTORS_ENDPOINTS": os.path.join(moonshot_path, "connectors-endpoints"),
    "CONTEXT_STRATEGY": os.path.join(moonshot_path, "context-strategy"),
    "COOKBOOKS": os.path.join(moonshot_path, "cookbooks"),
    "DATABASES": os.path.join(moonshot_path, "generated-outputs/databases"),
    "DATABASES_MODULES": os.path.join(moonshot_path, "databases-modules"),
    "DATASETS": os.path.join(moonshot_path, "datasets"),
    "IO_MODULES": os.path.join(moonshot_path, "io-modules"),
    "METRICS": os.path.join(moonshot_path, "metrics"),
    "PROMPT_TEMPLATES": os.path.join(moonshot_path, "prompt-templates"),
    "RECIPES": os.path.join(moonshot_path, "recipes"),
    "RESULTS": os.path.join(moonshot_path, "generated-outputs/results"),
    "RESULTS_MODULES": os.path.join(moonshot_path, "results-modules"),
    "RUNNERS": os.path.join(moonshot_path, "generated-outputs/runners"),
    "RUNNERS_MODULES": os.path.join(moonshot_path, "runners-modules"),
}

# Apply the environment variables to configure the Moonshot framework.
api_set_environment_variables(env)

# NOTE: if you manage to set the environment variables successfully, you will not see any printout
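
Because a successful call prints nothing, a wrong `moonshot_path` can go unnoticed until a later cell fails. The sketch below is one way to check the configured paths yourself. `find_missing_dirs` is a hypothetical helper, not part of the Moonshot API, and it assumes every entry should already be a directory on disk. Some folders (for example those under `generated-outputs`) may only be created after a first run, so treat a missing entry as a hint rather than an error.

```python
import os


def find_missing_dirs(env: dict) -> list:
    """Return the names of entries in `env` whose path is not an existing directory.

    Hypothetical helper for this tutorial; it only inspects the file system.
    """
    return sorted(name for name, path in env.items() if not os.path.isdir(path))


# Example usage with a small stand-in mapping. In the notebook you would pass
# the `env` dictionary defined in the cell above instead.
example_env = {
    "COOKBOOKS": os.path.join("./data", "cookbooks"),
    "RECIPES": os.path.join("./data", "recipes"),
}
missing = find_missing_dirs(example_env)
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("All configured directories exist.")
```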

## Run an existing benchmark
In this section, you will learn how to run a benchmark. You will first create the endpoint connector with your OpenAI token. Then, you will run the benchmark and view the results.

**Replace `ADD_NEW_TOKEN_HERE` with your own OpenAI token below**

In [3]:
endpoint_id = api_create_endpoint(
    "my-openai-endpoint",    # name: Assign a unique name to identify this endpoint later.
    "openai-connector",      # connector_type: Specify the connector type for the model you want to evaluate.
    "",                      # uri: Leave blank as the OpenAI library handles the connection.
    "ADD_NEW_TOKEN_HERE",    # token: Insert your OpenAI API token here.
    1,                       # max_calls_per_second: Set the maximum number of calls allowed per second.
    1,                       # max_concurrency: Set the maximum number of concurrent calls.
    "gpt-3.5-turbo" ,        # model: the model version to use.
    
    # params: Include any additional parameters required for this model.
    {
        "timeout": 300,      # timeout: Define the timeout for API calls in seconds.
        "max_attempts": 3,   # max_attempts: Set the number of retries if allowed.
        "temperature": 0.5,  # temperature: Set the temperature for response variability.
    }  
)
print(f"The newly created endpoint id: {endpoint_id}")

The newly created endpoint id: my-openai-endpoint
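
If you would rather not paste the token into the notebook, one option is to read it from an environment variable and pass the result as the token argument of `api_create_endpoint`. The sketch below is only an illustration: `resolve_token` is a hypothetical helper, and using `OPENAI_API_KEY` as the variable name is an assumption of this example, not something Moonshot requires.

```python
import os

PLACEHOLDER = "ADD_NEW_TOKEN_HERE"  # the placeholder used in this tutorial


def resolve_token(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the token from an environment variable and reject empty or placeholder values.

    Hypothetical helper; the variable name is an assumption of this sketch.
    """
    token = os.environ.get(var_name, "").strip()
    if not token or token == PLACEHOLDER:
        raise ValueError(f"Set the {var_name} environment variable to your OpenAI token.")
    return token


# Usage (not executed here, since it raises when the variable is unset):
#   token = resolve_token()
#   ...then pass `token` in place of "ADD_NEW_TOKEN_HERE" when creating the endpoint.
```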


## Run a test using our predefined `cookbook`

Moonshot comes with a list of `cookbooks` and `recipes`. A `recipe` contains one or more benchmark datasets and evaluation metrics. A `cookbook` contains one or more `recipes`. To execute an existing test, we can select either a `recipe` or `cookbook`.
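
The containment relationship described above can be pictured with a couple of small data classes. This is a simplified sketch, not Moonshot's actual file format or classes: `Recipe`, `Cookbook`, and `datasets_in` are hypothetical names, the metric name is a placeholder, and only the one dataset that appears in this tutorial's output is listed.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    id: str
    datasets: list  # benchmark dataset IDs
    metrics: list   # evaluation metric IDs


@dataclass
class Cookbook:
    id: str
    recipes: list   # list of Recipe objects


def datasets_in(cookbook: Cookbook) -> list:
    """Collect every dataset ID reachable from a cookbook, without duplicates."""
    seen = []
    for recipe in cookbook.recipes:
        for dataset_id in recipe.datasets:
            if dataset_id not in seen:
                seen.append(dataset_id)
    return seen


# Illustrative instance; "example-metric" is a made-up placeholder.
recipe = Recipe(
    id="singapore-facts",
    datasets=["singapore-transport-system"],
    metrics=["example-metric"],
)
cookbook = Cookbook(id="singapore-context", recipes=[recipe])
print(datasets_in(cookbook))
```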

In this tutorial, we will run a `cookbook` called `singapore-context`. This cookbook contains a recipe with datasets on facts about Singapore. We will be using these datasets to evaluate the LLM's understanding and knowledge of Singapore's key events, transport system, and facts.

*For the purpose of this tutorial, we will configure our `runner` to run 5% of prompts from every recipe in this cookbook*
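
To build intuition for what "5% of prompts" with a fixed random seed means, here is a minimal sketch of seeded percentage sampling. It is not Moonshot's implementation: the function name `sample_prompt_indices`, the round-up rule, and the use of `random.Random.sample` are all assumptions made for illustration.

```python
import math
import random


def sample_prompt_indices(total_prompts: int, percentage: float, seed: int = 0) -> list:
    """Pick a reproducible subset of prompt indices covering `percentage` of a dataset.

    Assumption of this sketch: the count is rounded up, with at least one prompt.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    if total_prompts <= 0:
        return []
    count = max(1, math.ceil(total_prompts * percentage / 100))
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(range(total_prompts), count))


indices = sample_prompt_indices(total_prompts=200, percentage=5, seed=0)
print(len(indices), "prompts selected out of 200")
```

Running the function twice with the same seed returns the same indices, which is the property that makes cached results reusable across runs.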

In [6]:
from slugify import slugify
from moonshot.api import api_get_all_run, api_create_runner, api_get_all_runner_name

name = "sample-cookbook-runner" # The name of our runner
cookbooks = ["singapore-context"] # Test a cookbook with a smaller dataset
endpoints = ["my-openai-endpoint"] # Test against 1 endpoint
percentage_of_prompts = 5 # The percentage of prompts to run in a cookbook's datasets (a cookbook's dataset is the total recipes' datasets in the cookbook) 

# Optional fields
random_seed = 0   # Default: 0; this seeds the random selection of prompts when percentage_of_prompts is set
system_prompt = ""  # Default: ""; this allows setting the system prompt for the endpoints

# Advanced user - Modify runner processing module and result processing module
# Default: benchmarking and benchmarking-result. Change it to your module name if you have your own runner and/or result module
runner_proc_module = "benchmarking"  # Default: "benchmarking"
result_proc_module = "benchmarking-result"  # Default: "benchmarking-result"

# Run the cookbook with the defined endpoints
# If the id exists, it will perform a load on the runner instead of creating a new runner
# Using an existing runner will allow us to use cached results from previous runs, which greatly reduces the run time
slugify_id = slugify(name, lowercase=True)
if slugify_id in api_get_all_runner_name():
    cb_runner = api_load_runner(slugify_id)
else:
    cb_runner = api_create_runner(name, endpoints)

# run_cookbooks() is an async function. Currently there is no sync version.
# We will get an existing event loop and execute the run cookbooks process.
await cb_runner.run_cookbooks(
        cookbooks,
        percentage_of_prompts,
        random_seed,
        system_prompt,
        runner_proc_module,
        result_proc_module,
    )
cb_runner.close()  # Perform a close on the runner to allow proper cleanup.

# Display results in JSON
runner_runs = api_get_all_run(cb_runner.id)
result_info = runner_runs[-1].get("results")
if result_info:
    print(json.dumps(result_info, indent=2))
else:
    raise RuntimeError("no run result generated")

2024-11-07 21:20:06,116 [INFO][runner.py::run_cookbooks(412)] [Runner] sample-cookbook-runner - Running benchmark cookbook run...
2024-11-07 21:20:06,145 [INFO][benchmarking.py::generate(139)] [Benchmarking] Running cookbooks (['singapore-context'])...
2024-11-07 21:20:06,146 [INFO][benchmarking.py::generate(145)] [Benchmarking] Running cookbook singapore-context... (1/1)
2024-11-07 21:20:06,155 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 12.
2024-11-07 21:20:06,157 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 12.
2024-11-07 21:20:06,157 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 24.
2024-11-07 21:20:06,157 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicting Prompt Index 48.
2024-11-07 21:20:06,158 [INFO][connector.py::get_prediction(348)] [Connector ID: my-openai-endpoint] Predicti

{
  "metadata": {
    "id": "sample-cookbook-runner",
    "start_time": "2024-11-07 21:20:06",
    "end_time": "2024-11-07 21:20:15",
    "duration": 9,
    "status": "completed",
    "recipes": null,
    "cookbooks": [
      "singapore-context"
    ],
    "endpoints": [
      "my-openai-endpoint"
    ],
    "prompt_selection_percentage": 5,
    "random_seed": 0,
    "system_prompt": ""
  },
  "results": {
    "cookbooks": [
      {
        "id": "singapore-context",
        "recipes": [
          {
            "id": "singapore-facts",
            "details": [
              {
                "model_id": "my-openai-endpoint",
                "dataset_id": "singapore-transport-system",
                "prompt_template_id": "no-template",
                "data": [
                  {
                    "prompt": "When was the East-West Line expanded to the Tuas West Extension?\nA) June 2023\nB) January 2017\nC) March 2020\nD) June 2017",
                    "predicted_result": {
        



## Beautifying the results

The result above is shown as raw JSON. To beautify the results, you can use the `rich` library to display them in a table.

In [7]:
from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
console = Console()

def show_cookbook_results(cookbooks, endpoints, cookbook_results, duration):
    """
    Show the results of the cookbook benchmarking.

    This function takes the cookbooks, endpoints, cookbook results, and duration as arguments.
    If there are results, it generates a table with the cookbook results. If there are no results,
    it prints a message indicating that no results were found.
    Finally, it prints the duration of the run.

    Args:
        cookbooks (list): A list of cookbooks.
        endpoints (list): A list of endpoints.
        cookbook_results (dict): A dictionary with the results of the cookbook benchmarking.
        duration (float): The duration of the run.

    Returns:
        None
    """
    if cookbook_results:
        # Display recipe results
        generate_cookbook_table(cookbooks, endpoints, cookbook_results)
    else:
        console.print("[red]There are no results.[/red]")

    # Print run stats
    console.print(f"{'='*50}\n[blue]Time taken to run: {duration}s[/blue]\n*Overall rating will be the lowest grade that the recipes have in each cookbook\n{'='*50}")

def generate_cookbook_table(cookbooks: list, endpoints: list, results: dict) -> None:
    """
    Generate and display a table with the cookbook benchmarking results.

    This function creates a table that includes the index, cookbook name, recipe name, and the results
    for each endpoint.

    The cookbook names are prefixed with "Cookbook:" and are displayed with their overall grades. Each recipe under a
    cookbook is indented and prefixed with "Recipe:" followed by its individual grades for each endpoint. If there are
    no results for a cookbook, a row with dashes across all endpoint columns is added to indicate this.

    Args:
        cookbooks (list): A list of cookbook names to display in the table.
        endpoints (list): A list of endpoints for which results are to be displayed.
        results (dict): A dictionary containing the benchmarking results for cookbooks and recipes.

    Returns:
        None: The function prints the table to the console but does not return any value.
    """
    table = Table(
        title="Cookbook Result", show_lines=True, expand=True, header_style="bold"
    )
    table.add_column("No.", width=2)
    table.add_column("Cookbook (with its recipes)", justify="left", width=78)
    for endpoint in endpoints:
        table.add_column(endpoint, justify="center")

    index = 1
    for cookbook in cookbooks:
        # Get cookbook result
        cookbook_result = next(
            (
                result
                for result in results["results"]["cookbooks"]
                if result["id"] == cookbook
            ),
            None,
        )

        if cookbook_result:
            # Add the cookbook name with the "Cookbook: " prefix as the first row for this section
            endpoint_results = []
            for endpoint in endpoints:
                # Find the evaluation summary for the endpoint
                evaluation_summary = next(
                    (
                        temp_eval
                        for temp_eval in cookbook_result["overall_evaluation_summary"]
                        if temp_eval["model_id"] == endpoint
                    ),
                    None,
                )

                # Get the grade from the evaluation_summary, or use "-" if not found
                grade = "-"
                if evaluation_summary and evaluation_summary["overall_grade"]:
                    grade = evaluation_summary["overall_grade"]
                endpoint_results.append(grade)
            table.add_row(
                str(index),
                f"Cookbook: [blue]{cookbook}[/blue]",
                *endpoint_results,
                end_section=True,
            )

            for recipe in cookbook_result["recipes"]:
                endpoint_results = []
                for endpoint in endpoints:
                    # Find the evaluation summary for the endpoint
                    evaluation_summary = next(
                        (
                            temp_eval
                            for temp_eval in recipe["evaluation_summary"]
                            if temp_eval["model_id"] == endpoint
                        ),
                        None,
                    )

                    # Get the grade from the evaluation_summary, or use "-" if not found
                    grade = "-"
                    if (
                        evaluation_summary
                        and "grade" in evaluation_summary
                        and "avg_grade_value" in evaluation_summary
                        and evaluation_summary["grade"]
                    ):
                        grade = f"{evaluation_summary['grade']} [{evaluation_summary['avg_grade_value']}]"
                    endpoint_results.append(grade)

                # Add the recipe name indented under the cookbook name
                table.add_row(
                    "",
                    f"  └──  Recipe: [blue]{recipe['id']}[/blue]",
                    *endpoint_results,
                    end_section=True,
                )

            # Increment index only after all recipes of the cookbook have been added
            index += 1
        else:
            # If no results for the cookbook, add a row indicating this with the "Cookbook: " prefix
            # and a dash for each endpoint column
            table.add_row(
                str(index),
                f"Cookbook: {cookbook}",
                *(["-"] * len(endpoints)),
                end_section=True,
            )
            index += 1

    # Display table
    console.print(table)

if result_info:
    show_cookbook_results(
        cookbooks, endpoints, result_info, result_info["metadata"]["duration"]
    )
else:
    raise RuntimeError("no run result generated")
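
If you want the grades as plain Python data (for example to write them to a CSV file) rather than as a rendered table, the same fields can be flattened into rows. The sketch below reads only the keys that the table code above reads (`overall_evaluation_summary`, `evaluation_summary`, `model_id`, `overall_grade`, `grade`). `collect_grades` is a hypothetical helper, and the sample dictionary contains made-up grades purely for illustration.

```python
def collect_grades(result: dict) -> list:
    """Flatten a result dictionary into (cookbook_id, recipe_id, model_id, grade) rows.

    Cookbook-level rows use None as the recipe_id. Missing keys are tolerated.
    """
    rows = []
    for cookbook in result.get("results", {}).get("cookbooks", []):
        for summary in cookbook.get("overall_evaluation_summary", []):
            rows.append(
                (cookbook["id"], None, summary["model_id"], summary.get("overall_grade"))
            )
        for recipe in cookbook.get("recipes", []):
            for summary in recipe.get("evaluation_summary", []):
                rows.append(
                    (cookbook["id"], recipe["id"], summary["model_id"], summary.get("grade"))
                )
    return rows


# Made-up sample shaped like the fields used above; the grades are not real results.
sample_result = {
    "results": {
        "cookbooks": [
            {
                "id": "singapore-context",
                "overall_evaluation_summary": [
                    {"model_id": "my-openai-endpoint", "overall_grade": "A"}
                ],
                "recipes": [
                    {
                        "id": "singapore-facts",
                        "evaluation_summary": [
                            {"model_id": "my-openai-endpoint", "grade": "A"}
                        ],
                    }
                ],
            }
        ]
    }
}
for row in collect_grades(sample_result):
    print(row)
```

In the notebook, you would call `collect_grades(result_info)` instead of using the sample dictionary.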

## List all the Cookbooks

If you are curious about which other cookbooks are available, you can use `api_get_all_cookbook()`.

The output looks like the example below. To run one of these cookbooks, replace `singapore-context` in the `cookbooks` list of the earlier runner cell with its cookbook ID, or append more cookbook IDs to that list.
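
Judging from the output in this section, each entry returned by `api_get_all_cookbook()` is a dictionary with `id`, `name`, `description`, and `recipes` keys. Under that assumption, a short helper can find the cookbooks that include a given recipe. `cookbooks_with_recipe` is a hypothetical name, and the sample list is an abbreviated copy of the output shown in this section.

```python
def cookbooks_with_recipe(cookbook_list: list, recipe_id: str) -> list:
    """Return the IDs of cookbooks whose `recipes` list contains `recipe_id`."""
    return [cb["id"] for cb in cookbook_list if recipe_id in cb.get("recipes", [])]


# Abbreviated sample based on the cookbook listing in this section.
sample_cookbooks = [
    {"id": "common-risk-easy", "recipes": ["bbq", "commonsense-morality-easy"]},
    {"id": "common-risk-hard", "recipes": ["bbq", "commonsense-morality-hard"]},
]
print(cookbooks_with_recipe(sample_cookbooks, "bbq"))
```

In the notebook, you would pass the list returned by `api_get_all_cookbook()` instead of `sample_cookbooks`.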

In [8]:
cookbook_ids = api_get_all_cookbook()
print("Total number of cookbooks: {0}".format(len(cookbook_ids)))
print("Showing the first three cookbooks below...")
print(json.dumps(cookbook_ids[0:3], indent=2))

Total number of cookbooks: 10
Showing the first three cookbooks below...
[
  {
    "id": "common-risk-easy",
    "name": "Easy test sets for Common Risks",
    "description": "This is a cookbook that consists (easy) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-easy",
      "jailbreak-dan",
      "advglue"
    ]
  },
  {
    "id": "common-risk-hard",
    "name": "Hard test sets for Common Risks",
    "description": "This is a cookbook that consists (hard) test sets for common risks. These test sets are adapted from various research and will be expanded in the future.",
    "recipes": [
      "uciadult",
      "bbq",
      "winobias",
      "challenging-toxicity-prompts-completion",
      "realtime-qa",
      "commonsense-morality-hard",
      "jailbreak-da