# Loading and saving evaluators
In this notebook we will create evaluators, save them to a model registry, download them, and use them in an `evaluate` call.

### Import

In [None]:
import os
import inspect
import json
import pandas as pd
import shutil

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.ai.ml.exceptions import UnsupportedOperationError
from azure.identity import AzureCliCredential

from promptflow.client import PFClient, load_flow
from promptflow.core import AzureOpenAIModelConfiguration, Flow
from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import F1ScoreEvaluator, GroundednessEvaluator

## Create or load evaluators
### Flex Flow evaluator
First we will create the flex flow evaluator and save it. It will be just a function that calculates the length of an answer.

In [None]:
def answer_len(answer):
    return len(answer)


target_dir_tmp = "flex_flow_tmp"
os.makedirs(target_dir_tmp, exist_ok=True)
lines = inspect.getsource(answer_len)
with open(os.path.join(target_dir_tmp, "answer.py"), "w") as fp:
    fp.write(lines)

from flex_flow_tmp.answer import answer_len as answer_length
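
The round trip above (write the function's source to a file, then import it back as a module) can be checked in isolation. The following is a minimal sketch, not part of the promptflow API: it uses a hard-coded source string in place of `inspect.getsource`, a temporary directory instead of `flex_flow_tmp`, and the name `reloaded_answer_len` is hypothetical.

```python
import importlib.util
import os
import tempfile

# Source text of the evaluator; in the notebook this comes from inspect.getsource.
source = "def answer_len(answer):\n    return len(answer)\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = os.path.join(tmp_dir, "answer.py")
    with open(module_path, "w") as fp:
        fp.write(source)

    # Load the file as a module and fetch the function from it.
    spec = importlib.util.spec_from_file_location("answer", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    reloaded_answer_len = module.answer_len

    # The reloaded function behaves like the original one.
    length = reloaded_answer_len("Sorry, I cannot help.")

print(length)
```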

After we have created the function, we can save it as a flex flow.

In [None]:
pf = PFClient()
flex_flow_path = "flex_flow"
if os.path.isdir(flex_flow_path):
    shutil.rmtree(flex_flow_path)
pf.flows.save(entry=answer_length, path=flex_flow_path)
# Remove the temporary directory
shutil.rmtree(target_dir_tmp)

Now we will load the saved flow and check its type.

In [None]:
saved_flex_flow = load_flow(flex_flow_path)
type(saved_flex_flow)

### DAG Evaluator
Now we will load the DAG evaluator, which counts the number of times the LLM apologizes.

In [None]:
dag_path = "apology_dag"
dag_evaluator = load_flow(dag_path)
print(type(dag_evaluator))
dag_evaluator(answer="Sorry, I can only truth questions")

### Prompty flow
First we need to set the authentication variables.
In the `eval-basic\eval.prompty` file, make sure that `azure_deployment` is set to the name of your deployment.
Please create a deployment of gpt-3.5 or gpt-4 in Azure OpenAI and create a JSON file with the following contents.

```json
{
    "AZURE_OPENAI_API_KEY": "super_secret_key",
    "AZURE_OPENAI_ENDPOINT": "https://deployment_name.openai.azure.com/"
}
```

Finally, we will load the prompty flow, which returns 1 if the LLM apologizes and 0 otherwise.

In [None]:
# Use a context manager so the file handle is closed after reading.
with open("openai_auth.json") as f:
    secure_data = json.load(f)
assert "AZURE_OPENAI_API_KEY" in secure_data
assert "AZURE_OPENAI_ENDPOINT" in secure_data

for k, v in secure_data.items():
    os.environ[k] = v

prompty_path = os.path.join("apology-prompty", "apology.prompty")
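
The bare `assert` statements above stop the notebook without saying which key is missing. A minimal sketch of a friendlier check is shown below; `load_auth` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, and the demonstration uses a throwaway file with placeholder values rather than real credentials.

```python
import json
import os
import tempfile

# Keys the notebook expects to find in the auth file.
REQUIRED_KEYS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def load_auth(path):
    """Read the JSON auth file and fail with a readable message if a key is missing."""
    with open(path) as fp:
        data = json.load(fp)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return data


# Demonstration with a throwaway file and placeholder values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "openai_auth.json")
    with open(demo_path, "w") as fp:
        json.dump(
            {
                "AZURE_OPENAI_API_KEY": "placeholder",
                "AZURE_OPENAI_ENDPOINT": "https://example.invalid/",
            },
            fp,
        )
    demo_auth = load_auth(demo_path)
```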

Now we will try the loaded flow.

In [None]:
prompty_flow = load_flow(prompty_path)
results = evaluate(data="evaluation_dataset_context.jsonl", evaluators={"prompty_eval": prompty_flow})
pd.DataFrame(results["rows"])

### Authenticate to Azure
First we need to authenticate to Azure. For this purpose we will use a configuration file with the following structure.
```json
{
    "resource_group_name": "resource-group-name",
    "subscription_id": "subscription-uuid",
    "registry_name": "registry-name"
}
```
**Note:** If `registry_name` is replaced by `workspace_name`, the evaluator will be saved to the Azure ML workspace instead of the registry.
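
Because the target depends on a single key, it can be useful to check the configuration before creating the client. The sketch below is an illustration only: `describe_target` is a hypothetical helper, and rejecting a file that has both keys (or neither) is this helper's own rule, not documented `MLClient` behavior.

```python
def describe_target(configuration):
    """Return 'registry' or 'workspace' depending on which name key is present."""
    has_registry = "registry_name" in configuration
    has_workspace = "workspace_name" in configuration
    # Exactly one of the two keys should be set; both or neither is ambiguous.
    if has_registry == has_workspace:
        raise ValueError("Provide exactly one of 'registry_name' or 'workspace_name'.")
    return "registry" if has_registry else "workspace"


example_configuration = {
    "resource_group_name": "resource-group-name",
    "subscription_id": "subscription-uuid",
    "registry_name": "registry-name",
}
print(describe_target(example_configuration))
```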


In [None]:
with open("config.json") as f:
    configuration = json.load(f)

credential = AzureCliCredential()
ml_client = MLClient(credential=credential, **configuration)

Now we will upload all the evaluators to Azure, starting with the flex flow.

In [None]:
# Named flex_eval to avoid shadowing the built-in eval().
flex_eval = Model(
    path=flex_flow_path,
    name="answer_len_uploaded",
    description="Evaluator, calculating answer length using Flex flow.",
)
flex_model = ml_client.evaluators.create_or_update(flex_eval)

Upload DAG flow

In [None]:
# Named dag_eval to avoid shadowing the built-in eval().
dag_eval = Model(
    path=dag_path,
    name="apology_dag_uploaded",
    description="Evaluator, calculating the number of times apology happens in the answer using DAG flow.",
)
dag_model = ml_client.evaluators.create_or_update(dag_eval)

Finally, let us upload the prompty model

In [None]:
# Named prompty_eval to avoid shadowing the built-in eval().
prompty_eval = Model(
    path=os.path.dirname(prompty_path),
    name="apology_prompty_uploaded",
    description="Evaluator, showing, if apology happens in the response.",
)
prompty_model = ml_client.evaluators.create_or_update(prompty_eval)

The registered evaluators can be retrieved from the registry or workspace using the `get` method. It returns the model we created in the steps above.

In [None]:
retrieved_eval = ml_client.evaluators.get("apology_prompty_uploaded", version=1)
retrieved_eval.name

We can also list all the evaluators with a given name. As with the `ModelOperations` API, this returns an iterator of models; in this case, however, all models are marked as evaluators. Let us iterate over the evaluators and print their versions.

In [None]:
evals_list = [[eval.name, eval.version] for eval in ml_client.evaluators.list("apology_prompty_uploaded")]
pd.DataFrame(evals_list, columns=['Name', 'Version']).sort_values(by='Version')
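
One caveat when sorting by version: if the versions come back as strings, a plain sort orders them lexicographically, so `"10"` lands before `"2"`. The sketch below uses made-up rows in the same `[name, version]` shape and assumes every version is a numeric string; custom version labels would need a different sort key.

```python
# Hypothetical rows in the same [name, version] shape as evals_list.
evals_list = [
    ["apology_prompty_uploaded", "10"],
    ["apology_prompty_uploaded", "2"],
    ["apology_prompty_uploaded", "1"],
]

# Lexicographic order compares the strings character by character.
sorted_as_text = sorted(evals_list, key=lambda row: row[1])

# Numeric order converts each version to an integer first.
sorted_as_numbers = sorted(evals_list, key=lambda row: int(row[1]))

print([row[1] for row in sorted_as_text])
print([row[1] for row in sorted_as_numbers])
```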

**Limitation:** Unlike models, evaluators currently cannot be listed without providing a name; attempting to do so raises an `UnsupportedOperationError`.

In [None]:
try:
    ml_client.evaluators.list(None)
except UnsupportedOperationError as e:
    print(e)

Now we will download the evaluators and load them in promptflow.

In [None]:
evaluators = {}
for name, dirname, mod in zip(
    ("answer_len", "apology_number", "apology"),
    (flex_flow_path, dag_path, prompty_path),
    (flex_model, dag_model, prompty_model),
):
    ml_client.evaluators.download(mod.name, version=mod.version, download_path=".")
    evaluators[name] = load_flow(os.path.join(mod.name, dirname))
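
The loop above relies on a directory layout in which each download lands in a folder named after the model, with the originally uploaded path nested inside it. The sketch below spells that out; the layout is inferred from the `os.path.join(mod.name, dirname)` call in this notebook rather than taken from the SDK documentation, and `downloaded_flow_path` is a hypothetical helper.

```python
import os


def downloaded_flow_path(model_name, original_path, download_path="."):
    """Build the local path where a downloaded evaluator is expected to be."""
    # Assumed layout: <download_path>/<model_name>/<original_path>
    return os.path.normpath(os.path.join(download_path, model_name, original_path))


flex_location = downloaded_flow_path("answer_len_uploaded", "flex_flow")
prompty_location = downloaded_flow_path(
    "apology_prompty_uploaded", os.path.join("apology-prompty", "apology.prompty")
)
print(flex_location)
print(prompty_location)
```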

View the loaded evaluators

In [None]:
evaluators

## Run evaluators using the evaluate API
Let us run the three loaded evaluators along with two standard evaluators.

In [None]:
configuration = AzureOpenAIModelConfiguration(
    azure_endpoint=secure_data["AZURE_OPENAI_ENDPOINT"],
    api_key=secure_data["AZURE_OPENAI_API_KEY"],
    api_version="2023-07-01-preview",
    azure_deployment="gpt-35-turbo-1106",  # Please use the name of a model you have deployed.
)

evaluators["f1_evaluator"] = F1ScoreEvaluator()
evaluators["groundedness_evaluator"] = GroundednessEvaluator(model_config=configuration)
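
Before running the evaluation, it may help to confirm that every row of the dataset has the columns the evaluators read. The sketch below checks JSONL text for missing columns. The column names `answer`, `ground_truth`, and `context` are assumptions about what these evaluators expect, the sample rows are made up, and `missing_columns` is a hypothetical helper; adjust all of them to your dataset.

```python
import json

# Assumed input columns; change these to match your evaluators.
REQUIRED_COLUMNS = ("answer", "ground_truth", "context")


def missing_columns(jsonl_text, required=REQUIRED_COLUMNS):
    """Map each 1-based line number to the required columns that line lacks."""
    problems = {}
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = sorted(set(required) - set(row))
        if missing:
            problems[line_number] = missing
    return problems


# Made-up sample: the second row has no "context" column.
sample = "\n".join(
    [
        json.dumps({"answer": "Sorry", "ground_truth": "Yes", "context": "FAQ"}),
        json.dumps({"answer": "Yes", "ground_truth": "Yes"}),
    ]
)
problems = missing_columns(sample)
print(problems)
```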

Finally, run the evaluation

In [None]:
results = evaluate(data="evaluation_dataset_context.jsonl", evaluators=evaluators)

View the results

In [None]:
print(f"{results['metrics']=}")
pd.DataFrame(results["rows"])