<a href="https://colab.research.google.com/github/obss/jury/blob/main/examples/jury_evaluate.ipynb"><img alt="Open in Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a>

## Packages (Colab)

To use certain metrics (e.g., SacreBLEU, BERTScore), you need to install the related packages. If you try to use such a metric without its required package, an exception is raised indicating which specific package must be installed. If you want to see score outputs for SacreBLEU and BERTScore in the experiments in this notebook, uncomment the related lines (they are marked later with inline comments).

If you want to use those metrics, install the required packages by uncommenting the `pip install` line in the second code cell below.

In [None]:
!pip install jury

In [None]:
# !pip install sacrebleu bert-score==0.3.9
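Before enabling the optional metrics, it can help to check whether their packages are already importable. The snippet below is a minimal sketch using only the standard library; the helper name `find_available` and the mapping `OPTIONAL_PACKAGES` are hypothetical, and the import names (`sacrebleu`, `bert_score`) are assumed to match the packages installed above.

```python
import importlib.util
from typing import Dict

# Hypothetical mapping from metric name to the module it needs.
# Assumption: the `bert-score` distribution is imported as `bert_score`.
OPTIONAL_PACKAGES = {"sacrebleu": "sacrebleu", "bertscore": "bert_score"}


def find_available(packages: Dict[str, str]) -> Dict[str, bool]:
    """Return, for each metric name, whether its module can be imported."""
    return {
        metric: importlib.util.find_spec(module) is not None
        for metric, module in packages.items()
    }


availability = find_available(OPTIONAL_PACKAGES)
for metric, installed in availability.items():
    status = "installed" if installed else "missing"
    print(f"{metric}: {status}")
```

If a package is reported as missing, uncomment the install cell above and rerun it before adding the corresponding metric.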

## Imports

We start with the required imports.

In [None]:
import os
import json # Just for pretty printing the resulting dict.

from jury import load_metric, Jury

In [None]:
from typing import List


def read_from_txt(path: str) -> List[str]:
    """Read a text file and return its lines without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read().splitlines()
    return data

## Task 1: Machine Translation

We evaluate sample machine translation outputs against their references. Feel free to play around with the samples below. Alternatively, you can load your own predictions and references using the helper function `read_from_txt()`, where each line is treated as a separate prediction or reference; the line order must be consistent between the prediction and reference txt files.

In [None]:
mt_predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"], 
    ["Look! a wonderful day."]
]
mt_references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."],
]

# mt_predictions = read_from_txt("/path/to/predictions.txt")
# mt_references = read_from_txt("/path/to/references.txt")
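The file-based alternative described above depends on the two files lining up one-to-one. The following is a minimal sketch of that workflow, assuming plain UTF-8 text files with one sentence per line; the helper name `read_stripped_lines` and the temporary file names are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def read_stripped_lines(path: str) -> List[str]:
    """Read a text file, dropping the trailing newline of each line."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample files; in practice these would be your own outputs.
    pred_path = Path(tmp_dir) / "predictions.txt"
    ref_path = Path(tmp_dir) / "references.txt"
    pred_path.write_text(
        "the cat is on the mat\nLook! a wonderful day.\n", encoding="utf-8"
    )
    ref_path.write_text(
        "the cat is playing on the mat.\nToday is a wonderful day\n", encoding="utf-8"
    )

    file_predictions = read_stripped_lines(str(pred_path))
    file_references = read_stripped_lines(str(ref_path))

# Line i of the predictions must correspond to line i of the references.
if len(file_predictions) != len(file_references):
    raise ValueError("Prediction and reference files have different line counts.")

for pred, ref in zip(file_predictions, file_references):
    print(f"{pred!r} <-> {ref!r}")
```

A length check like this catches only missing or extra lines; it cannot detect lines that are present but out of order.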

### Define Metrics

Define the metrics used to evaluate the MT predictions against the references. You can either use the `load_metric` function from jury, which lets you pass additional parameters to the specified metric, or specify a metric by name as a string, which uses its default parameters.

**NOTE:** Computing BERTScore may take some time because it downloads a model for computing embeddings. For this reason, the metric list below includes a commented-out line that uses the smaller `albert-base-v1` model; alternatively, you can uncomment the line before it, which uses the default model `roberta-large`.

See [here](https://huggingface.co/transformers/pretrained_models.html) for model sizes, parameter counts, etc.

In [None]:
MT_METRICS = [
    load_metric("bleu", resulting_name="bleu_1", params={"max_order": 1}),
    load_metric("bleu", resulting_name="bleu_2", params={"max_order": 2}),
    load_metric("meteor"),
    load_metric("rouge"),
#     load_metric("sacrebleu"),  # (optional)
#     load_metric("bertscore"), # (optional) Using default model for lang en
#     load_metric("bertscore", params={"model_type": "albert-base-v1"})  # (optional) Using smaller model to reduce download time.
]

# Alternatively
# MT_METRICS = [
#     "bleu",
#     "meteor",
#     "rouge"
# ]

RUN_CONCURRENT = True  # set False to disable concurrency.

In [None]:
# Compute scores

if RUN_CONCURRENT:
    os.environ["TOKENIZERS_PARALLELISM"] = "true"
else:
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

mt_jury = Jury(metrics=MT_METRICS, run_concurrent=RUN_CONCURRENT)
scores = mt_jury.evaluate(predictions=mt_predictions, references=mt_references)

In [None]:
# Display results
print(json.dumps(scores, indent=4))

## Task 2: Question Answering

For question answering, the commonly used evaluation metrics are exact match and F1 score; the datasets package provides both through a metric named "squad". The same interface is available here, with one exception: to seamlessly compute, concatenate, and output the resulting scores, Jury restricts each metric to a single score. By default, the SQuAD implementation computes the (SQuAD-style) F1 score.

In [None]:
qa_predictions = ["1917", "Albert Einstein", "foo bar"]
qa_references = ["1917", "Einstein", "foo bar foobar"]

QA_METRICS = [
    "squad"
]

In [None]:
qa_jury = Jury(metrics=QA_METRICS, run_concurrent=False)
scores = qa_jury.evaluate(predictions=qa_predictions, references=qa_references)
print(json.dumps(scores, indent=4))
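To make the two QA metrics mentioned above concrete, here is a minimal standalone sketch of SQuAD-style exact match and token-level F1. It is not Jury's implementation: it follows the commonly used SQuAD v1.1 evaluation logic (lowercasing, removing punctuation and English articles), reports scores on a 0 to 1 scale, and all function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average(values: List[float]) -> float:
    return sum(values) / len(values)


# Same samples as in the cell above, copied here to keep the sketch standalone.
sample_predictions = ["1917", "Albert Einstein", "foo bar"]
sample_references = ["1917", "Einstein", "foo bar foobar"]

pairs = list(zip(sample_predictions, sample_references))
em_avg = average([exact_match(p, r) for p, r in pairs])
f1_avg = average([token_f1(p, r) for p, r in pairs])
print(f"exact match: {em_avg:.4f}, f1: {f1_avg:.4f}")
```

On these samples only the first pair matches exactly, while F1 gives partial credit to the other two pairs, which is why the two numbers differ.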

## Defining a custom metric

To define a custom metric, you only need to extend the `jury.metrics.Metric` class and implement the required methods. Here we create a metric that computes the ratio of matching tokens to unique tokens for our QA task above.

In [None]:
from collections import Counter
from typing import Callable

import datasets
import numpy as np  # Needed for np.mean in the reduce steps below.
from jury.collator import Collator
from jury.metrics import Metric
from jury.metrics._utils import normalize_text


class WordMatch(Metric):
    """
    Compute matching word ratio between prediction and reference 
        Average( # of matching tokens / # of unique tokens)
    
    Example:
        pred = ["foo bar foobar"]
        ref = ["foo bar barfoo"]
        matching_tokens = 2 # 'foo' and 'bar'
        total_tokens = 4 # 'foo', 'bar', 'foobar', and 'barfoo'
        score = 0.5 # (2/4)
    """
    def _info(self):
        return datasets.MetricInfo(
            description="Custom metric to compute ratio of matching tokens over unique tokens.",
            citation="Custom metric",
            inputs_description="",
            features=self.default_features,
        )
    
    def _tokenize(self, predictions: Collator, references: Collator):
        predictions = [normalize_text(p).split() for p in predictions]
        references = [normalize_text(r).split() for r in references]
        return predictions, references
    
    def _compute_single_pred_single_ref(
        self, predictions: Collator, references: Collator, reduce_fn: Callable = None, **kwargs
    ):
        scores = []
        predictions, references = self._tokenize(predictions, references)
        for pred, ref in zip(predictions, references):
            score = 0
            pred_counts = Counter(pred)
            ref_counts = Counter(ref)
            total_unique_tokens = len(pred_counts + ref_counts)
            for token, pred_count in pred_counts.items():
                if token in ref_counts:
                    score += min(pred_count, ref_counts[token])  # Intersection count
            score = score / total_unique_tokens
            scores.append(score)
        avg_score = sum(scores) / len(scores)
        return {"score": avg_score}

    def _compute_single_pred_multi_ref(
        self, predictions: Collator, references: Collator, reduce_fn: Callable, **kwargs
    ):
        scores = []
        for pred, refs in zip(predictions, references):
            pred_score = [
                self._compute_single_pred_single_ref(Collator([pred], keep=True), Collator([ref], keep=True))
                for ref in refs
            ]
            reduced_score = self._reduce_scores(pred_score, reduce_fn=reduce_fn)
            scores.append(reduced_score)

        return self._reduce_scores(scores, reduce_fn=np.mean)

    def _compute_multi_pred_multi_ref(self, predictions: Collator, references: Collator, reduce_fn: Callable, **kwargs):
        scores = []
        for preds, refs in zip(predictions, references):
            pred_scores = []
            for pred in preds:
                pred_score = self._compute_single_pred_multi_ref(
                    Collator([pred], keep=True), Collator([refs], keep=True), reduce_fn=reduce_fn
                )
                pred_scores.append(pred_score)
            reduced_score = self._reduce_scores(pred_scores, reduce_fn=reduce_fn)
            scores.append(reduced_score)

        return self._reduce_scores(scores, reduce_fn=np.mean)
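The core formula of `WordMatch` can be tried out independently of Jury. The sketch below reimplements only the single-prediction, single-reference ratio in plain Python; the function name `word_match` is hypothetical, and a simple lowercase-and-split tokenization stands in for `normalize_text`, whose exact behavior is not assumed here.

```python
from collections import Counter


def word_match(prediction: str, reference: str) -> float:
    """Ratio of matching tokens to distinct tokens across both texts."""
    # Assumption: lowercase + whitespace split as a stand-in tokenizer.
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Number of distinct tokens appearing in either text.
    total_unique_tokens = len(pred_counts + ref_counts)
    if total_unique_tokens == 0:
        return 0.0

    # Counter intersection keeps the minimum count of each shared token.
    matching_tokens = sum((pred_counts & ref_counts).values())
    return matching_tokens / total_unique_tokens


# Docstring example: 'foo' and 'bar' match, four distinct tokens overall.
example_score = word_match("foo bar foobar", "foo bar barfoo")
print(example_score)
```

Note that matches are counted with multiplicity while the denominator counts distinct tokens, so texts with repeated tokens can yield a ratio above 1 (for example, `word_match("a a", "a a")` gives 2.0).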

In [None]:
from jury.metrics import Squad

QA_METRICS = [
    Squad(),
    WordMatch()
]

In [None]:
qa_jury = Jury(metrics=QA_METRICS, run_concurrent=False)
scores = qa_jury.evaluate(predictions=qa_predictions, references=qa_references)
print(json.dumps(scores, indent=4))
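The multi-prediction, multi-reference methods of the custom metric follow a nested pattern: score each prediction against each reference, reduce over references, reduce over predictions, then average over items. The following is a rough standalone sketch of that pattern, assuming `max` as the reducing function; the names `token_overlap` and `nested_score` are hypothetical, and the pair score is a simple set-based overlap rather than the `WordMatch` formula.

```python
from statistics import mean
from typing import Callable, List


def token_overlap(prediction: str, reference: str) -> float:
    """Shared distinct tokens divided by all distinct tokens (set-based)."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    union = pred_tokens | ref_tokens
    return len(pred_tokens & ref_tokens) / len(union) if union else 0.0


def nested_score(
    predictions: List[List[str]],
    references: List[List[str]],
    pair_score: Callable[[str, str], float],
    reduce_fn: Callable[[List[float]], float] = max,
) -> float:
    """Reduce over references, then over predictions, then average items."""
    item_scores = []
    for preds, refs in zip(predictions, references):
        # Best (or otherwise reduced) score of each prediction over its references.
        per_prediction = [
            reduce_fn([pair_score(pred, ref) for ref in refs]) for pred in preds
        ]
        item_scores.append(reduce_fn(per_prediction))
    return mean(item_scores)


toy_predictions = [["a b", "a b c"], ["x"]]
toy_references = [["a b c", "d"], ["x y", "z"]]
toy_score = nested_score(toy_predictions, toy_references, token_overlap)
print(toy_score)
```

With `max`, each item is credited with its best prediction-reference pair; passing a different `reduce_fn` such as `statistics.mean` would instead average over the alternatives.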