# Evaluation example

This example uses the PromptTools library with Sentence Transformers to measure how semantically similar generated outputs are to an expected result.

Running this notebook requires a `.gguf` model file stored locally on the device. One way to obtain such a file is to follow the instructions in the [llama.cpp](https://github.com/ggerganov/llama.cpp) repository.
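
Before loading a model, it can help to confirm that each path really points at a GGUF file. The sketch below is only an illustration: `looks_like_gguf` is a hypothetical helper, not part of PromptTools or llama.cpp. It relies on the fact that GGUF files begin with the four ASCII bytes `GGUF`, and the path in the usage lines is the one used later in this notebook.

```python
from pathlib import Path


def looks_like_gguf(path):
    """Return True if `path` is an existing file that starts with the GGUF magic bytes.

    Hypothetical helper for illustration; it does not validate the rest of the file.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == b"GGUF"


# Usage: report whether a model file used in this notebook is present and plausible.
for candidate in ["../../models/Orca-2-13b-q4_0.gguf"]:
    print(candidate, looks_like_gguf(candidate))
```

A check like this only catches missing or obviously wrong files; a truncated or incompatible model would still fail at load time.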

The models used here are [`Orca 2 13B`](https://huggingface.co/microsoft/Orca-2-13b) and [`Orca 2 7B`](https://huggingface.co/microsoft/Orca-2-7b), both with 4-bit quantisation.

## Import libraries

In [None]:
from prompttools.experiment import LlamaCppExperiment
import prompttools.utils as utils

## Define experiment parameters

Note that `{"n_gpu_layers": [1]}` is passed to the `LlamaCppExperiment` class as the `model_params` argument, which enables GPU usage on Apple silicon.

This was tested successfully on an M1 MacBook Pro; its effect on other devices and operating systems has not been verified.

In [None]:
model_paths = [
    "../../models/Orca-2-13b-q4_0.gguf",
    "../../models/Orca-2-7b-q4_0.gguf",
]

prompts = [
    "Who was the first prime minister?",
    "Who was the first prime minister of the United Kingdom?",
]

temperatures = [0.0, 1.0]

call_params = dict(temperature=temperatures)

experiment = LlamaCppExperiment(
    model_paths,
    prompts,
    call_params=call_params,
    model_params={"n_gpu_layers": [1]},
)

## Run experiment

In [None]:
experiment.run()

## Evaluate experiment

PromptTools provides utilities for evaluating generative outputs.

This example uses the `semantic_similarity` function. More evaluation functions are available in the [PromptTools documentation](https://prompttools.readthedocs.io/en/latest/utils.html), such as `autoeval_binary_scoring`, which uses GPT-4 as a "strong" judge.
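
To give an intuition for what a semantic similarity score measures, here is a minimal sketch that assumes the score is a cosine similarity between embedding vectors. It is not the PromptTools implementation: the three-dimensional vectors are invented toy values standing in for real Sentence Transformers embeddings, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not produced by any real model).
expected_vec = [0.9, 0.1, 0.3]   # stands in for the expected answer
close_vec = [0.8, 0.2, 0.4]      # stands in for a response with similar meaning
far_vec = [-0.1, 0.9, -0.5]      # stands in for an unrelated response

print(cosine_similarity(close_vec, expected_vec))
print(cosine_similarity(far_vec, expected_vec))
```

Under this assumption, a response whose embedding points in nearly the same direction as the expected answer scores close to 1, while an unrelated response scores much lower.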

In [None]:
experiment.evaluate(
    "similar_to_expected",
    utils.semantic_similarity,
    expected=["Robert Walpole"] * 8,
)
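
The `expected` list above has eight entries because the experiment is assumed to run every combination of its arguments: two models, two prompts, two temperatures and one `n_gpu_layers` value. The sketch below reproduces that count with `itertools.product`; it illustrates the arithmetic only and makes no claim about the order in which PromptTools produces its rows.

```python
from itertools import product

# Argument lists copied from the experiment definition in this notebook.
model_paths = [
    "../../models/Orca-2-13b-q4_0.gguf",
    "../../models/Orca-2-7b-q4_0.gguf",
]
prompts = [
    "Who was the first prime minister?",
    "Who was the first prime minister of the United Kingdom?",
]
temperatures = [0.0, 1.0]
n_gpu_layers = [1]

# One run per combination of model, prompt, temperature and n_gpu_layers.
combinations = list(product(model_paths, prompts, temperatures, n_gpu_layers))

# One expected answer per run; every entry is identical, so row order does not matter.
expected = ["Robert Walpole"] * len(combinations)

print(len(combinations))  # 2 * 2 * 2 * 1 = 8
```

Deriving the length this way avoids a mismatch if a model, prompt or temperature is later added to the experiment.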

In [None]:
df = experiment.to_pandas_df(get_all_cols=False)

df.sort_values(by="similar_to_expected", ascending=False)