# Running the official evaluator

This flightpath walks through running the official ensemble, either directly or using different combiner logic and seeing the results in MLCommons' MLFlow server.

## Requirements
To run this flightpath, you must:
* Have access to the AIRR MLFlow server.
  * Modify `.env.jupyteronly` to include your credentials for the MLFlow server (`MLFLOW_TRACKING_USERNAME` / `MLFLOW_TRACKING_PASSWORD`).
    * Alternatively, you can put the credentials in `~/.mlflow/credentials` as described [here](https://mlflow.org/docs/latest/ml/auth/#credentials-file).
  * Note that if you want to use a locally running mlflow server, you can modify .env.jupyteronly to set `MLFLOW_TRACKING_URI` appropriately.
* Have access to the modelbench-private repository *and* set `USE_PRIVATE_MODELBENCH=true` in `.env.jupyteronly`.

Once modifications (if any) are made to the `.env.jupyteronly` configuration, start jupyter with the `./start_jupyter.sh` script.

## MLFlow server login

## Import runways

In [None]:
import datetime
from modelplane.runways import responder, annotator, scorer

## Settings

* `sut_id` refers to the model that generates the responses to the prompts. It is currently set to a demo SUT.
* `experiment` variable will be used to organize the various runs in mlflow.
* `prompts` should point to a location in `/flightpaths/data`. A sample dataset is provided.
* `num_workers` can adjust the parallelism.

In [None]:
sut_id = "demo_yes_no"
experiment = "fp_private_" + datetime.date.today().strftime("%Y%m%d")
prompts = "data/airr_official_1.0_demo_en_us_prompt_set_release_reduced.csv"
ground_truth = "data/fakegroundtruth.csv"
num_workers = 4

### Combiner logic - `AnyUnsafeStrategy`

This implementation will return is_safe = "unsafe" if any of the ensemble members decide it's unsafe.

Combiner logic must be declared via `ENSEMBLE_STRATEGIES`.

In [None]:
from typing import Dict

from modelgauge.safety_model_response import SafetyModelResponse, EnsembleSafetyModelResponse
from modelgauge.ensemble_annotator_set import EnsembleStrategy, ENSEMBLE_STRATEGIES


class AnyUnsafeEnsembleStrategy(EnsembleStrategy):
    def compute_response(self, annotations: Dict[str, SafetyModelResponse]) -> EnsembleSafetyModelResponse:
        response = EnsembleSafetyModelResponse(
            joined_responses=annotations, 
            is_safe=True,
            safety_categories=[],
            is_valid=False,
        )
        for annotation in annotations.values():
            if annotation.is_valid:
                response.is_valid = True
                if not annotation.is_safe:
                    response.is_safe = False
                    current_categories = set(response.safety_categories)
                    current_categories.update(annotation.safety_categories)
                    response.safety_categories = sorted(current_categories)
        return response

ENSEMBLE_STRATEGIES["anyunsafe"] = AnyUnsafeEnsembleStrategy()

## Run the model

This step will get responses to the prompts from the given SUT.

Save this run_id to avoid having to re-run the model later. The results are saved as an artifact in mlflow.

You can see the the runs associated with the experiment you specified above at your MLFlow tracking server. There will be a link below after running. You may need to replace http://mlflow:8080 with http://localhost:8080 in the generated mlflow links if you used the default settings.

In [None]:
run_id = responder.respond(
    sut_id=sut_id,
    experiment=experiment,
    prompts=prompts,
    num_workers=num_workers,
)

## Annotate the model by ensembling with the new combiner logic

This will grab the annotators from the `PRIVATE_ANNOTATOR_SET` and apply the new combiner logic to them. You'll be able to see the details of the annotations in mlflow.

If you want to run the official annotator directly (without modifications), you can instead run:

```python
annotation_run_id = annotator.annotate(
    ensemble_id="official-1.0",
    experiment=experiment,
    response_run_id=run_id,
    num_workers=num_workers,
)
```

In [None]:
from modelgauge.private_ensemble_annotator_set import PRIVATE_ANNOTATOR_SET

annotator_ids = PRIVATE_ANNOTATOR_SET.annotators
annotation_run_id = annotator.annotate(
    annotator_ids=annotator_ids,
    ensemble_strategy="anyunsafe",
    experiment=experiment,
    response_run_id=run_id,
    num_workers=num_workers,
)