# Metrics

This notebook defines and tests the metrics that we will use in our evaluation pipeline.

We can use the `rank_eval` package (https://pypi.org/project/rank-eval/), which was featured at ECIR 2022. It implements standard ranking metrics, including:
- Mean Average Precision (mAP)
- Normalized Discounted Cumulative Gain (NDCG)
- Mean Reciprocal Rank (MRR)

## Getting started with the `rank_eval` library

In [2]:
from rank_eval import Qrels, Run, evaluate, compare

qrels = Qrels()
qrels.add_multi(
    q_ids=["q_1", "q_2"],
    doc_ids=[
        ["doc_12", "doc_25"],  # q_1 relevant documents
        ["doc_11", "doc_2"],  # q_2 relevant documents
    ],
    scores=[
        [5.0, 3.0],  # q_1 relevance scores (graded, or 1 to mark a document as relevant)
        [6.0, 1.0],  # documents relevance
    ],
)

run = Run()
run.add_multi(
    q_ids=["q_1", "q_2"],
    doc_ids=[
        ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"],
        ["doc_12", "doc_11", "doc_25", "doc_36", "doc_2",  "doc_35"],
    ],
    scores=[
        [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
        [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    ],
)

# Compute score for a single metric
print(evaluate(qrels, run, "ndcg@5"))

# Compute scores for multiple metrics at once
print(evaluate(qrels, run, ["map@5", "mrr"]))

# Computed metric scores are saved in the Run object
print(run.mean_scores)

# Access scores for each query
dict(run.scores)

0.7861261099276952
{'map@5': 0.6416666666666666, 'mrr': 0.75}
{'ndcg@5': 0.7861261099276952, 'map@5': 0.6416666666666666, 'mrr': 0.75}


{'ndcg@5': {'q_1': 0.9430144683295216, 'q_2': 0.6292377515258687},
 'map@5': {'q_1': 0.8333333333333333, 'q_2': 0.45},
 'mrr': {'q_1': 1.0, 'q_2': 0.5}}
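The per-query `ndcg@5` values above can be reproduced by hand. The sketch below is not `rank_eval`'s implementation: it assumes linear gain with a log2 position discount, which is consistent with the printed scores. The helper names `dcg` and `ndcg_at_k` are illustrative only.

```python
import math


def dcg(gains):
    # Linear gain with a log2 discount: gain / log2(rank + 1), ranks start at 1
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(relevance, ranking, k):
    # relevance: {doc_id: graded relevance}; ranking: doc ids ordered best-first
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranking[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


q1 = ndcg_at_k(
    {"doc_12": 5.0, "doc_25": 3.0},
    ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"],
    k=5,
)
q2 = ndcg_at_k(
    {"doc_11": 6.0, "doc_2": 1.0},
    ["doc_12", "doc_11", "doc_25", "doc_36", "doc_2", "doc_35"],
    k=5,
)
print(q1, q2, (q1 + q2) / 2)
```

For `q_1`, the DCG is 5/log2(2) + 3/log2(4) = 6.5 and the ideal DCG is 5/log2(2) + 3/log2(3), which gives the 0.943 reported by `rank_eval`.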

In [3]:
from rank_eval import compare

# Compare different runs and perform statistical tests
# TODO: which statistical tests are they performing?

qrels = Qrels()
qrels.add_multi(
    q_ids=["q_1", "q_2"],
    doc_ids=[
        ["doc_1"],
        ["doc_2"],
    ],
    scores=[
        [1],
        [1],
    ],
)

run_1 = Run()
run_1.name = "model_1"
run_1.add_multi(
    q_ids=["q_1", "q_2"],
    doc_ids=[
        ["doc_1", "doc_11", "doc_111"],
        ["doc_222", "doc_22", "doc_2"],
    ],
    scores=[
        [0.9, 0.8, 0.7],
        [0.9, 0.8, 0.7],
    ],
)

run_2 = Run()
run_2.name = "model_2"
run_2.add_multi(
    q_ids=["q_1", "q_2"],
    doc_ids=[
        ["doc_1", "doc_11", "doc_111"],
        ["doc_22", "doc_222", "doc_2222"],
    ],
    scores=[
        [0.9, 0.8, 0.7],
        [0.9, 0.8, 0.7],
    ],
)

runs = [run_1, run_2]


report = compare(
    qrels=qrels,
    runs=runs,
    metrics=["map@3", "mrr@3", "ndcg@3"],
    max_p=0.05 / len(runs),  # p-value threshold, scaled down by the number of runs
)

print(report)

#    Model      MAP@3    MRR@3    NDCG@3
---  -------  -------  -------  --------
a    model_1   0.6667   0.6667      0.75
b    model_2   0.5      0.5         0.5


## Implementation of the `RankMetrics` class

In [4]:
from src.metrics import RankMetrics
metrics = RankMetrics()

In [5]:
ground_truth = {
    "q_1": [
        ("d_1", 1)
    ],
    "q_2": [
        ("d_2", 1)
    ]
}
predictions = {
    "q_1": [
        ("d_1", 0.9),
        ("doc_11", 0.8),
        ("doc_111", 0.7)
    ],
    "q_2": [
        ("doc_222", 0.9),
        ("doc_22", 0.8),
        ("doc_2", 0.7)
    ]
}

In [6]:
# default (supported) metrics are "map", "mrr", and "ndcg"
metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3)

{'map@3': 0.5, 'mrr@3': 0.5, 'ndcg@3': 0.5}

In [7]:
# default top_k is 1
metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions)

{'map@1': 0.5, 'mrr@1': 0.5, 'ndcg@1': 0.5}

In [8]:
# we can ask for a subset of metrics only
metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3, metrics=["ndcg", "map"])

{'ndcg@3': 0.5, 'map@3': 0.5}

In [9]:
# if we request only one metric, it outputs the value directly
metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3, metrics=["ndcg"])

0.5

In [10]:
# if we ask for an unsupported metric, it raises an error
metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, metrics=["dummy"])

ValueError: The metric `dummy` is not supported. Currently, the following metrics are supported: ['map', 'mrr', 'ndcg']
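The argument handling shown in the cells above (default metrics, `top_k` cutoff, single-metric shortcut, error on unknown names) could be sketched as follows. This is a hypothetical illustration, not the code in `src.metrics`: the helpers `resolve_metric_names` and `unwrap_single` and the constant `SUPPORTED_METRICS` are assumed names.

```python
SUPPORTED_METRICS = ["map", "mrr", "ndcg"]


def resolve_metric_names(metrics=None, top_k=1):
    """Hypothetical helper: validate metric names and append the cutoff."""
    requested = SUPPORTED_METRICS if metrics is None else metrics
    for name in requested:
        if name not in SUPPORTED_METRICS:
            raise ValueError(
                f"The metric `{name}` is not supported. Currently, the following "
                f"metrics are supported: {SUPPORTED_METRICS}"
            )
    # e.g. "map" with top_k=3 becomes "map@3"
    return [f"{name}@{top_k}" for name in requested]


def unwrap_single(scores):
    """Hypothetical helper: return the bare value when only one metric was requested."""
    return next(iter(scores.values())) if len(scores) == 1 else scores


print(resolve_metric_names(top_k=3))
print(unwrap_single({"ndcg@3": 0.5}))
```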

## Validate that metrics are being correctly computed

### MRR

We are expecting:
- q1 -> 1/3 (1st relevant item in the 3rd position)
- q2 -> 1/1 (1st relevant item in the 1st position)
- mrr = (1/3 + 1) / 2 = 0.667

*Note that the scores have no influence on this particular metric.*
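The expected value can be cross-checked with a small reference implementation that uses the same rankings as the cell below. This is a minimal sketch of the textbook definition, not the `RankMetrics` code; `reciprocal_rank` is an illustrative name.

```python
def reciprocal_rank(relevant, ranking, k):
    # 1 / rank of the first relevant document within the top k, or 0.0 if none
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


relevant_docs = {"q_1": {"d_1"}, "q_2": {"d_2"}}
rankings = {
    "q_1": ["d_11", "d_111", "d_1"],
    "q_2": ["d_2", "d_22", "d_222"],
}

per_query = {q: reciprocal_rank(relevant_docs[q], rankings[q], k=3) for q in rankings}
mrr = sum(per_query.values()) / len(per_query)
print(per_query, mrr)
```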

In [None]:
ground_truth = {
    "q_1": [
        ("d_1", 1),
    ],
    "q_2": [
        ("d_2", 1)
    ],
}
predictions = {
    "q_1": [
        ("d_11", 0.0),
        ("d_111", 0.0),
        ("d_1", 0.0)
    ],
    "q_2": [
        ("d_2", 0.9),
        ("d_22", 0.8),
        ("d_222", 0.7),
    ]
}

metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3, metrics=["mrr"])

0.6666666666666666

### mAP

We are expecting:
- q1 -> AP@3 = (1/1 + 2/3)/2 = 0.833
- q2 -> AP@3 = 1/1 = 1
- mAP = (0.833 + 1) / 2 = 0.917

*Note that the scores have no influence on this particular metric.*
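The same check can be done for average precision. The sketch below assumes the common convention of dividing by the total number of relevant documents (some definitions divide by min(number of relevant documents, k) instead; both agree on this example). `average_precision` is an illustrative name, not part of `RankMetrics`.

```python
def average_precision(relevant, ranking, k):
    # Mean of precision@rank over the ranks (within the top k) that hold a relevant document
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


relevant_docs = {"q_1": {"d_1", "d_11"}, "q_2": {"d_2"}}
rankings = {
    "q_1": ["d_11", "d_111", "d_1"],
    "q_2": ["d_2", "d_22", "d_222"],
}

ap = {q: average_precision(relevant_docs[q], rankings[q], k=3) for q in rankings}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```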

In [None]:
ground_truth = {
    "q_1": [
        ("d_1", 1),
        ("d_11", 1),
    ],
    "q_2": [
        ("d_2", 1)
    ],
}
predictions = {
    "q_1": [
        ("d_11", 0.0),
        ("d_111", 0.0),
        ("d_1", 0.0)
    ],
    "q_2": [
        ("d_2", 0.9),
        ("d_22", 0.8),
        ("d_222", 0.7),
    ]
}

metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3, metrics=["map"])

0.9166666666666666

### NDCG

We are expecting:
- q1 -> DCG@3 = 1/log2(1+3) = 0.5, IDCG@3 = 1/log2(1+1) = 1, NDCG@3 = DCG@3/IDCG@3 = 0.5
- q2 -> DCG@3 = 1/log2(1+1) = 1, IDCG@3 = 1/log2(1+1) = 1, NDCG@3 = DCG@3/IDCG@3 = 1
- NDCG = (0.5 + 1) / 2 = 0.75

*Note that NDCG uses the ground-truth relevance grades as gains (all 1 here); the prediction scores only set the ranking order.*

In [15]:
ground_truth = {
    "q_1": [
        ("d_1", 1),
    ],
    "q_2": [
        ("d_2", 1)
    ],
}
predictions = {
    "q_1": [
        ("d_11", 0.9),
        ("d_111", 0.8),
        ("d_1", 0.7)
    ],
    "q_2": [
        ("d_2", 0.9),
        ("d_22", 0.8),
        ("d_222", 0.7),
    ]
}

metrics.compute_metrics(ground_truth=ground_truth, predictions=predictions, top_k=3, metrics=["ndcg"])

0.75