# Evaluation of `passim` results

- read clusters from groundtruth CSV file (Google spreadsheet export)
- read clusters from passim predictions (JSON lines file)
- try to match GT clusters with predicted clusters
    - criterion: two or more overlapping passages (passage identity is established via `speech_id`, `passage`, or text overlap)
    - a certain number of GT clusters will have no match (print or log)
    - => returns a list of matches that will be used for evaluation
- metrics
    - global
        - number of 100% correct clusters
        - number of missed clusters
        - number of partially correct clusters
    - more fine-grained: precision/recall/F-score
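
The matching and scoring steps outlined above could look roughly like the sketch below. It is not the code in `lib.evaluation`: clusters are reduced to plain sets of passage identifiers, and the names `match_clusters`, `summarise`, and `prf`, as well as the toy data, are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumption: each cluster is reduced to a set of passage identifiers.
Clusters = Dict[int, Set[str]]
Match = Tuple[int, Optional[int]]


def match_clusters(gt: Clusters, pred: Clusters, min_overlap: int = 2) -> List[Match]:
    """Pair every GT cluster with the predicted cluster sharing the most passages.

    A GT cluster is paired with None when no predicted cluster shares
    at least `min_overlap` passages with it.
    """
    matches: List[Match] = []
    for gt_id, gt_passages in gt.items():
        best_id, best_overlap = None, 0
        for pred_id, pred_passages in pred.items():
            overlap = len(gt_passages & pred_passages)
            if overlap >= min_overlap and overlap > best_overlap:
                best_id, best_overlap = pred_id, overlap
        matches.append((gt_id, best_id))
    return matches


def summarise(gt: Clusters, pred: Clusters, matches: List[Match]) -> Dict[str, int]:
    """Count fully correct, partially correct, and missed GT clusters."""
    counts = {"correct": 0, "partial": 0, "missed": 0}
    for gt_id, pred_id in matches:
        if pred_id is None:
            counts["missed"] += 1
        elif gt[gt_id] == pred[pred_id]:
            counts["correct"] += 1
        else:
            counts["partial"] += 1
    return counts


def prf(gt_passages: Set[str], pred_passages: Set[str]) -> Tuple[float, float, float]:
    """Passage-level precision, recall, and F1 for one matched pair."""
    tp = len(gt_passages & pred_passages)
    precision = tp / len(pred_passages) if pred_passages else 0.0
    recall = tp / len(gt_passages) if gt_passages else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
toy_gt = {1: {"a", "b", "c"}, 2: {"d", "e"}, 3: {"f", "g"}}
toy_pred = {10: {"a", "b", "x"}, 11: {"d", "e"}, 12: {"f", "y"}}

toy_matches = match_clusters(toy_gt, toy_pred)
print(toy_matches)
print(summarise(toy_gt, toy_pred, toy_matches))
print(prf(toy_gt[1], toy_pred[10]))
```

Choosing the predicted cluster with the largest overlap is one possible tie-breaking policy; a GT cluster could equally be allowed to match several predicted clusters.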


- objects needed:
    - `TextReuseCluster`, `TextReusePassage`, `Locus`, `TextReuseEvaluator`

`TextReuseEvaluator`:
- `_match_clusters()`
- `read_predictions()`
- `read_groundtruth()`
- `evaluate()`
- `error_report()` or `evaluation_report()`
- `to_csv()`

`TextReuseCluster`:
- `size`
- `passages`
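
As a rough illustration of the outline above, the cluster and passage objects could be modelled as small dataclasses. The class names `SketchPassage` and `SketchCluster` are hypothetical stand-ins; the real classes in `lib.evaluation` are not shown here and may be structured differently. The `__repr__` only imitates the format of the cluster output printed in this notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchPassage:
    # Human-readable citation, e.g. 'Hom. Il. 3.74-75'.
    label: str


@dataclass(repr=False)
class SketchCluster:
    id: int
    passages: List[SketchPassage] = field(default_factory=list)

    @property
    def size(self) -> int:
        """Number of passages in the cluster."""
        return len(self.passages)

    @property
    def labels(self) -> List[str]:
        """Citation labels of all passages, in insertion order."""
        return [p.label for p in self.passages]

    def __repr__(self) -> str:
        joined = "; ".join(self.labels)
        return f"<SketchCluster id={self.id}, size={self.size}, passages='{joined}'>"


# Labels copied from ground-truth cluster 5 as printed in this notebook.
example = SketchCluster(
    id=5,
    passages=[SketchPassage("Hom. Il. 3.74-75"), SketchPassage("Hom. Il. 3.259-260")],
)
print(example)
```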

## Testing the evaluation

In [1]:
import sys
sys.path.append('.')
from lib.utils import read_jsonl
from lib.evaluation import TextReuseEvaluator, TextReuseCluster, TextReusePassage

In [2]:
GT_DATASET_PATH = 'data/homeric_repetitions_dataset.tsv'
PASSIM_OUTPUT_DIR = 'data/passim/exp7/out.json/'

evaluator = TextReuseEvaluator()

In [3]:
evaluator.read_predictions(PASSIM_OUTPUT_DIR)
evaluator.read_groundtruth(GT_DATASET_PATH)

In [4]:
len(evaluator.predicted_clusters)

215

In [5]:
len(evaluator.gt_clusters)

69

In [6]:
evaluator.predicted_clusters[49]

"<TextReuseCluster id=49, size=2, passages='Homer, Odyssey 11.378-12.453; Homer, Odyssey 12.116-12.141'>"

In [7]:
evaluator.gt_clusters[69].dices_speech_ids

[535, 546]

In [8]:
len(evaluator.gt_clusters)

69

In [9]:
evaluator.predicted_clusters[8].dices_speech_ids

[773, 781, 811]

In [6]:
for c in evaluator.gt_clusters.values():
    print(c)

"<TextReuseCluster id=1, size=3, passages='Hom. Il. 2.11-2.15; Hom. Il. 2.28-2.32; Hom. Il. 2.66-2.69'>"
"<TextReuseCluster id=2, size=2, passages='Hom. Il. 2.23-33; Hom. Il. 2.60-70'>"
"<TextReuseCluster id=3, size=2, passages='Hom. Il. 2.158-165; Hom. Il. 2.174-181'>"
"<TextReuseCluster id=4, size=4, passages='Hom. Il. 3.68-73; Hom. Il. 3.88-94; Hom. Il. 3.253-258; Hom. Il. 3.276-291'>"
"<TextReuseCluster id=5, size=2, passages='Hom. Il. 3.74-75; Hom. Il. 3.259-260'>"
"<TextReuseCluster id=6, size=2, passages='Hom. Il. 4.66-67; Hom. Il. 4.71-72'>"
"<TextReuseCluster id=7, size=2, passages='Hom. Il. 4.195-197; Hom. Il. 4. 205-207'>"
"<TextReuseCluster id=8, size=2, passages='Hom. Il. 6.90-97; Hom. Il. 6.271-278'>"
"<TextReuseCluster id=9, size=3, passages='Hom. Il. 6.93-95; Hom. Il. 6.274-276; Hom. Il. 6.308-310'>"
"<TextReuseCluster id=10, size=2, passages='Hom. Il. 7.40; Hom. Il. 7.51'>"
"<TextReuseCluster id=11, size=2, passages='Hom. Il. 7.362-364; Hom. Il. 7.389-393'>"
"<TextReus

In [6]:
for cluster in evaluator.predicted_clusters.values():
    if cluster.size > 2:
        print(cluster)

"<TextReuseCluster id=23, size=6, passages='Homer, Odyssey 19.124-19.163, Homer, Odyssey 19.141-19.147, Homer, Odyssey 24.121-24.190, Homer, Odyssey 24.131-24.137, Homer, Odyssey 2.96-2.102, Homer, Odyssey 2.85-2.128'>"
"<TextReuseCluster id=10, size=5, passages='Homer, Odyssey 17.108-17.149, Homer, Odyssey 17.124-17.146, Homer, Odyssey 4.333-4.592, Homer, Odyssey 4.555-4.569, Homer, Odyssey 5.7-5.20'>"
"<TextReuseCluster id=72, size=4, passages='Homer, Iliad 2.23-2.34, Homer, Iliad 2.60-2.70, Homer, Iliad 2.8-2.15, Homer, Iliad 2.56-2.75'>"
"<TextReuseCluster id=130, size=4, passages='Homer, Odyssey 4.333-4.592, Homer, Odyssey 4.333-4.592, Homer, Odyssey 4.376-4.381, Homer, Odyssey 4.465-4.470'>"
"<TextReuseCluster id=213, size=4, passages='Homer, Odyssey 19.4-19.13, Homer, Odyssey 19.7-19.13, Homer, Odyssey 16.267-16.307, Homer, Odyssey 16.288-16.294'>"
"<TextReuseCluster id=1, size=3, passages='Homer, Odyssey 1.158-1.177, Homer, Odyssey 14.166-14.190, Homer, Odyssey 16.222-16.224'>"

In [15]:
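from typing import Dict, List

def filter_by_speech_id(clusters: Dict[int, TextReuseCluster], speech_ids: List[int]) -> List[TextReuseCluster]:
    """Return the clusters that share at least two speech IDs with `speech_ids`."""
    wanted = set(speech_ids)
    result = []
    for cluster in clusters.values():
        speech_id_overlap = wanted.intersection(cluster.dices_speech_ids)
        # Matching criterion from the notes: two or more shared speech IDs.
        # (The earlier `> 2` test could never match a two-speech GT cluster.)
        if len(speech_id_overlap) >= 2:
            print(speech_id_overlap)
            result.append(cluster)
    return result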

## Filtering behaviour

In [36]:
evaluator.gt_clusters[1].labels
#evaluator.gt_clusters[1].dices_speech_ids

['Hom. Il. 2.11-2.15', 'Hom. Il. 2.28-2.32', 'Hom. Il. 2.66-2.69']

In [35]:
evaluator.predicted_clusters[72].labels
#evaluator.predicted_clusters[72].dices_speech_ids

['Homer, Iliad 2.23-2.34',
 'Homer, Iliad 2.60-2.70',
 'Homer, Iliad 2.8-2.15',
 'Homer, Iliad 2.56-2.75']
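
The two label lists above use different citation formats, so comparing them as strings finds nothing in common even though the line ranges clearly overlap. The sketch below compares the ranges instead. The `(book, first_line, last_line)` tuples are transcribed by hand from the labels above, and the helper names are hypothetical. It assumes both clusters cite the same work and that no passage crosses a book boundary, which does not hold for every predicted passage in this notebook.

```python
from typing import List, Tuple

# (book, first_line, last_line)
Span = Tuple[int, int, int]

# Ground-truth cluster 1: Hom. Il. 2.11-2.15, 2.28-2.32, 2.66-2.69
gt_1: List[Span] = [(2, 11, 15), (2, 28, 32), (2, 66, 69)]
# Predicted cluster 72: Iliad 2.23-2.34, 2.60-2.70, 2.8-2.15, 2.56-2.75
pred_72: List[Span] = [(2, 23, 34), (2, 60, 70), (2, 8, 15), (2, 56, 75)]


def spans_overlap(a: Span, b: Span) -> bool:
    """True when both spans are in the same book and share at least one line."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def overlapping_passages(gt: List[Span], pred: List[Span]) -> List[Span]:
    """GT passages that overlap at least one predicted passage."""
    return [g for g in gt if any(spans_overlap(g, p) for p in pred)]


shared = overlapping_passages(gt_1, pred_72)
print(len(shared), "of", len(gt_1), "GT passages overlap a predicted passage")
```

Under the "two or more overlapping passages" criterion, these two clusters would count as a match.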

In [20]:
for gt_cluster in evaluator.gt_clusters.values():
    filter_by_speech_id(evaluator.predicted_clusters, gt_cluster.dices_speech_ids)

In [16]:
filter_by_speech_id(evaluator.predicted_clusters, evaluator.gt_clusters[69].dices_speech_ids)

In [24]:
locus = "Hom. Il. 14.195-196"

In [27]:
begin, end = locus.replace('Hom. Il.', '').strip().split('-')

In [28]:
begin

'14.195'

In [29]:
end

'196'
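
As the last two cells show, a plain split leaves an abbreviated end reference (`'196'`) without its book number. One possible normalisation is sketched below. The function name `parse_locus` and its `prefix` parameter are hypothetical, and the sketch only covers the `Hom. Il.` label format seen in the ground truth, including single-line loci and labels with a stray space.

```python
from typing import Tuple


def parse_locus(locus: str, prefix: str = "Hom. Il.") -> Tuple[str, str]:
    """Split a locus into (begin, end), both in 'book.line' form."""
    # Drop the work prefix and any stray spaces, e.g. 'Hom. Il. 4. 205-207'.
    body = locus.replace(prefix, "").replace(" ", "")
    begin, _, end = body.partition("-")
    if not end:
        # Single-line locus such as 'Hom. Il. 7.40'.
        end = begin
    elif "." not in end:
        # Abbreviated end: reuse the book number of the start reference.
        book = begin.split(".")[0]
        end = f"{book}.{end}"
    return begin, end


print(parse_locus("Hom. Il. 14.195-196"))
print(parse_locus("Hom. Il. 2.11-2.15"))
print(parse_locus("Hom. Il. 7.40"))
```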