# Evaluator Development Template

Use this template to iterate on a new evaluator (a.k.a. annotator). 

The general flow is to first generate responses to a set of prompts from a system under test (SUT).

Then, evaluate those responses with the newly defined annotator and score the results.

Finally, iterate on the annotator and repeat.

Sample code is given below.

## Imports

In [None]:
import datetime
from modelplane.runways import responder, annotator, scorer

## Settings

The `sut_id` refers to the model that generates the responses to the prompts. 

The `experiment` variable will be used to organize the various runs in mlflow.

The `prompts` path should point to a dataset in `/flightpaths/data`. A sample dataset is provided.

The `ground_truth` path points to the dataset used in the scoring step at the end.

Finally, `num_workers` sets the level of parallelism.

In [None]:
sut_id = "demo_yes_no"
experiment = "experiment_" + datetime.date.today().strftime("%Y%m%d")
prompts = "data/airr_official_1.0_demo_en_us_prompt_set_release_reduced.csv"
ground_truth = "data/fakegroundtruth.csv"
num_workers = 4
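
Because the experiment name is derived from today's date, runs launched on the same day share one experiment name. The short sketch below shows the resulting format for a fixed, arbitrary date; the helper name `experiment_name` is hypothetical and not part of modelplane.

```python
import datetime


def experiment_name(day: datetime.date) -> str:
    """Build a date-stamped experiment name in the same format as the settings above."""
    return "experiment_" + day.strftime("%Y%m%d")


# A fixed example date, chosen only for illustration.
example_name = experiment_name(datetime.date(2025, 1, 31))
print(example_name)  # experiment_20250131
```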

## Defining a new annotator

An implementation of the annotator should define:

* `translate_request`: takes a `TestItem` and a `SUTResponse` (which together hold the prompt and the LLM's response to it) and translates them into an arbitrary form useful for the `annotate` method.
* `annotate`: takes the translated request and produces an annotation in any form.
* `translate_response`: takes the annotation and translates it into the needed form (`SafetyModelResponse`).

Below, we give a sample `RandomAnnotator`, which randomly decides whether a given response is safe by hazard category.

In [None]:
import random

from pydantic import BaseModel

from modelgauge.annotator import CompletionAnnotator
from modelgauge.annotator_registry import ANNOTATORS
from modelgauge.safety_model_response import SafetyModelResponse
from modelgauge.single_turn_prompt_response import TestItem
from modelgauge.sut import SUTResponse
from modelgauge.tests.safe_v1 import BaseSafeTestVersion1

class AnnotationRequest(BaseModel):
    test_item: TestItem
    response: SUTResponse


class RandomAnnotator(CompletionAnnotator[SafetyModelResponse]):
    """Simple random implementation for demonstration. Uses SafetyModelResponse
    for consistency with AILuminate benchmarks.

    The methods to implement are based on the idea that annotate will 
    potentially make a call to another system and the translators are helpful 
    in preparing that call and converting the response to the needed annotation.
    """
    def __init__(self, uid, prob_unsafe):
        super().__init__(uid)
        self.prob_unsafe = prob_unsafe
    
    def translate_request(self, test_item: TestItem, response: SUTResponse):
        return AnnotationRequest(test_item=test_item, response=response)

    def annotate(self, annotation_request: AnnotationRequest) -> SafetyModelResponse:
        # Flag each hazard category independently with probability prob_unsafe.
        unsafe_categories = []
        for category in BaseSafeTestVersion1.hazards:
            if random.random() < self.prob_unsafe:
                unsafe_categories.append(category)
        return SafetyModelResponse(
            is_safe=not unsafe_categories,
            logprobs=random.random(),
            safety_categories=unsafe_categories,
        )

    def translate_response(self, request: AnnotationRequest, response: SafetyModelResponse) -> SafetyModelResponse:
        return response


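# Register the annotator under annotator_id; the trailing 0.01 is the
# prob_unsafe constructor argument.
annotator_id = "random001"
ANNOTATORS.register(RandomAnnotator, annotator_id, 0.01)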
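
To make the division of labor between the three methods concrete, here is a minimal, self-contained sketch that does not depend on modelgauge. It assumes the methods are called in the order listed above (`translate_request`, then `annotate`, then `translate_response`). `FakeSafetyResponse`, `SketchAnnotator`, `run_annotator`, and the hazard names are all hypothetical stand-ins for illustration, not the library's API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeSafetyResponse:
    """Stand-in for SafetyModelResponse (hypothetical, illustration only)."""
    is_safe: bool
    safety_categories: list = field(default_factory=list)


class SketchAnnotator:
    """Mirrors the three-method shape of the annotator above."""

    def __init__(self, prob_unsafe, hazards, seed=None):
        self.prob_unsafe = prob_unsafe
        self.hazards = hazards
        self.rng = random.Random(seed)

    def translate_request(self, prompt, response):
        # Package whatever annotate() needs; here, a plain dict.
        return {"prompt": prompt, "response": response}

    def annotate(self, request):
        # A real annotator might call an external model at this point.
        unsafe = [h for h in self.hazards if self.rng.random() < self.prob_unsafe]
        return {"unsafe": unsafe}

    def translate_response(self, request, raw):
        # Convert the raw annotation into the final response type.
        return FakeSafetyResponse(
            is_safe=not raw["unsafe"],
            safety_categories=raw["unsafe"],
        )


def run_annotator(annotator, prompt, response):
    """Chain the three steps in the assumed order."""
    request = annotator.translate_request(prompt, response)
    raw = annotator.annotate(request)
    return annotator.translate_response(request, raw)


# Placeholder hazard names, not the real hazard list.
hazards = ["hazard_a", "hazard_b"]
always_unsafe = SketchAnnotator(prob_unsafe=1.0, hazards=hazards)
never_unsafe = SketchAnnotator(prob_unsafe=0.0, hazards=hazards)

flagged = run_annotator(always_unsafe, "a prompt", "a response")
clean = run_annotator(never_unsafe, "a prompt", "a response")
print(flagged)
print(clean)
```

Keeping the external call isolated in `annotate` means the two translators can be tested on their own, without reaching the other system.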

## Run the model

This step will get responses to the prompts from the given SUT. You can optionally pass arguments `prompt_uid_col` and `prompt_text_col` if your prompts dataset has different column names than the default ones.

Save the resulting `run_id` to avoid having to re-run the model later. The results are saved as an artifact in mlflow.

You can see the runs associated with the experiment you specified above at your mlflow tracking server. There will be a link below after running. If you used the default settings, you may need to replace http://mlflow:8080 with http://localhost:8080 in the generated mlflow links.

In [None]:
response_run = responder.respond(
    sut_id=sut_id,
    experiment=experiment,
    prompts=prompts,
    num_workers=num_workers,
)
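
If you are unsure whether your dataset needs `prompt_uid_col` and `prompt_text_col`, one way to check is to read the header row of the CSV before calling `responder.respond`. The sketch below uses an in-memory file; the helper `csv_columns` and the column names `my_uid` and `my_text` are hypothetical, and the default column names are not listed here.

```python
import csv
import io


def csv_columns(handle):
    """Return the header row of a CSV file-like object."""
    return next(csv.reader(handle))


# Hypothetical file contents and column names, used only for illustration.
sample = io.StringIO("my_uid,my_text\np1,Is the sky blue?\n")
columns = csv_columns(sample)
print(columns)

# With a real dataset, open the file instead:
# with open(prompts, newline="") as f:
#     print(csv_columns(f))
```

For a dataset shaped like this sample, you would pass `prompt_uid_col="my_uid"` and `prompt_text_col="my_text"` to `responder.respond`.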

## Annotate the model

This step runs the annotator on the responses from the prior step. You'll be able to see the details of the annotations in mlflow.

In [None]:
annotation_run = annotator.annotate(
    annotator_ids=[annotator_id],
    experiment=experiment,
    response_run_id=response_run.run_id,
    num_workers=num_workers,
)

## Score the model

Compute metrics against the given ground truth dataset.

In [None]:
scorer.score(
    annotation_run_id=annotation_run.run_id,
    experiment=experiment,
    ground_truth=ground_truth,
)
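
The exact metrics reported by `scorer.score` are not listed here. As a rough sketch of what comparing annotations against ground truth can look like, the example below assumes one safe/unsafe label per response and treats "unsafe" as the positive class. The function names and the sample labels are hypothetical and do not reflect the scorer's implementation.

```python
def confusion_counts(predicted_safe, actual_safe):
    """Count outcomes, treating "unsafe" as the positive class."""
    if len(predicted_safe) != len(actual_safe):
        raise ValueError("predictions and ground truth must have equal length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, truth in zip(predicted_safe, actual_safe):
        if not pred and not truth:
            counts["tp"] += 1  # flagged unsafe, actually unsafe
        elif not pred and truth:
            counts["fp"] += 1  # flagged unsafe, actually safe
        elif pred and truth:
            counts["tn"] += 1  # passed as safe, actually safe
        else:
            counts["fn"] += 1  # passed as safe, actually unsafe
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the confusion counts."""
    total = sum(counts.values())
    flagged = counts["tp"] + counts["fp"]
    actual_unsafe = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total if total else 0.0,
        "precision": counts["tp"] / flagged if flagged else 0.0,
        "recall": counts["tp"] / actual_unsafe if actual_unsafe else 0.0,
    }


# Made-up labels: True means "safe".
annotator_says_safe = [True, False, False, True]
ground_truth_safe = [True, False, True, False]

counts = confusion_counts(annotator_says_safe, ground_truth_safe)
metrics = summarize(counts)
print(counts)
print(metrics)
```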