# Running Evaluation Metrics

You can run evaluation metrics with MuSE directly. Typically you will already have loaded your data into one of the relevant formats (Document, MultiDocument, or Conversation; *see [Datasets.ipynb](../dataset_importer/Datasets.ipynb)*). To use the evaluators on their own, however, having all data as strings is enough.

Currently, the supported evaluation metrics are:
- BLEU
- METEOR
- ROUGE
- Ollama *(our custom metric based on LLMs and key-fact extraction)*

These can be found in the `evaluation` module of the `muse` package and should be accessed via the `resolve_evaluator` function.

## Classic Metrics (BLEU, METEOR, ROUGE)

These metrics all require a reference summary to compare the generated summary against. Pass the generated summaries and the reference summaries as lists of strings to the `evaluate` method of the evaluator, as in the cell below.

In [None]:
from muse.evaluation.resolver import resolve_evaluator

reference = "Summarization systems can be evaluated in many ways, including with metrics like BLEU, METEOR, and ROUGE."
generated = "Summarization systems can be evaluated with BLEU, METEOR, and ROUGE."

# Resolve the evaluator you wish to use
bleu = resolve_evaluator("BLEU")
meteor = resolve_evaluator("METEOR")
rouge = resolve_evaluator("ROUGE")
# Evaluate the generated summary
bleu_score = bleu.evaluate([generated], reference_summary=[reference])
meteor_score = meteor.evaluate([generated], reference_summary=[reference])
rouge_score = rouge.evaluate([generated], reference_summary=[reference])

# Print the scores
print(f"BLEU: {bleu_score}")
print(f"METEOR: {meteor_score}")
print(f"ROUGE: {rouge_score}")
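
To give a feel for what these reference-based scores measure, here is a minimal sketch of unigram overlap between a generated and a reference summary. It is a toy illustration only and not MuSE's implementation: the function name `unigram_overlap` and the whitespace tokenization are assumptions made for this example, and real BLEU, METEOR, and ROUGE implementations add n-grams, stemming, brevity penalties, and more.

```python
from collections import Counter


def unigram_overlap(generated: str, reference: str) -> dict:
    """Toy unigram precision/recall/F1 between two strings (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace (punctuation stays attached).
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())

    # Precision is the BLEU-like view, recall the ROUGE-like view.
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "Summarization systems can be evaluated in many ways, including with metrics like BLEU, METEOR, and ROUGE."
generated = "Summarization systems can be evaluated with BLEU, METEOR, and ROUGE."
print(unigram_overlap(generated, reference))
```

A short generated summary that only reuses words from the reference scores high on precision but lower on recall, which is why the classic metrics are usually reported together.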

## Ollama Metric

The Ollama metric is a custom metric we developed that uses a language model to evaluate the quality of a summary. It also uses key-fact extraction to check that the summary is factually accurate. The metric can compare the generated summary against either a reference summary or the source document.

It also takes a set of options, which can be listed by calling `valid_options` on the metric's `__init__`:

In [None]:
from muse.evaluation.llm.ollama_metric import OllamaMetric

OllamaMetric.__init__.valid_options()

These options can be left at their defaults in most cases. Otherwise, pass them via the `options` parameter of the `resolve_evaluator` function or, when using the `cli`, via the `--config` parameter.

We must also ensure that Ollama is installed and running before we can use this metric. Please see [Ollama](https://ollama.com/) for information on how to install and run the service.
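
A quick way to confirm the service is reachable before running the metric is to probe its HTTP endpoint. The sketch below is not part of MuSE: the helper name `ollama_is_running` is hypothetical, and it assumes Ollama is listening on its default address, `http://localhost:11434`. Adjust the URL if your setup differs.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


print("Ollama reachable:", ollama_is_running())
```

A `True` result only shows that something answers at that address. It does not confirm that the model the metric needs has been pulled.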

Once this is ready, we can do the following:

In [None]:
from muse.evaluation.resolver import resolve_evaluator

reference = "Summarization systems can be evaluated in many ways, including with metrics like BLEU, METEOR, and ROUGE."
generated = "Summarization systems can be evaluated with BLEU, METEOR, and ROUGE."

# Resolve the evaluator you wish to use
ollama = resolve_evaluator("Ollama")
# Evaluate the generated summary
ollama_score = ollama.evaluate([generated], reference_text=[reference])

# Print the score
print(f"Ollama: {ollama_score}")