# Tutorial: Preparing a submission file and running the evaluation

This notebook provides a step-by-step tutorial for preparing a submission file for Task A of the TalentCLEF 2026 shared task. We will download the Task A data hosted on [Zenodo](https://doi.org/10.5281/zenodo.17625261), prepare a file in the required [submission format](https://talentclef.github.io/talentclef/docs/talentclef-2026/evaluation/), and evaluate it with the [task's evaluation script](https://github.com/TalentCLEF/talentclef26_evaluation_script). The same format is also compatible with the Codabench benchmark, where the official evaluation will take place.


-----------------------------
TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The second edition of the TalentCLEF shared task will be part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2026.clef-initiative.eu/), scheduled to be held in Jena, Germany, in 2026. If you are interested in registering, you can find the registration form [here](https://clef-labs-registration.dipintra.it/).


<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2026/workshop/logo_clef_jena.svg" alt="CLEF2026 logo" width="150"/>

## Imports

In [None]:
import pandas as pd
import numpy as np
import os
from sentence_transformers import SentenceTransformer, util
import subprocess

## Download Task A files

First, let's download the Task A zip file directly from Zenodo and extract it.



In [None]:
# Download
!wget https://zenodo.org/records/18449283/files/TaskA.zip
!unzip TaskA.zip -d taskA
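
If `wget` or `unzip` is not available (for example, when running outside Colab), the same step can be sketched with the Python standard library. This is only a minimal sketch: the helper name `download_and_extract` is hypothetical, and the URL in the commented call is the one used in the cell above.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, zip_path, dest_dir):
    """Download a zip archive (skipped if zip_path already exists) and extract it."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # Return the top-level entries so the extracted layout can be inspected.
    return sorted(os.listdir(dest_dir))


# Example call, left commented out to avoid an accidental download:
# download_and_extract("https://zenodo.org/records/18449283/files/TaskA.zip", "TaskA.zip", "taskA")
```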

## Load data

Set the path where the English development set was extracted:

In [None]:
english_dev_path = "/content/taskA/TaskA/development/en"

Build the paths to the English queries and corpus folders of the development set:

In [None]:
queries = os.path.join(english_dev_path, "queries")
corpus_elements = os.path.join(english_dev_path, "corpus")

Function to load data:

In [None]:
def load_text_corpus(path, id_col="c_id", encoding="utf-8"):
    """
    Load text files from a directory into a pandas DataFrame.

    Each file in the directory is treated as a document. The file name is used
    as the document identifier, and the file content is stored as text.

    Parameters
    ----------
    path : str
        Path to the directory containing the text files.
    id_col : str, optional
        Name of the column used as the document identifier
        (e.g., 'c_id', 'q_id'). Default is 'c_id'.
    encoding : str, optional
        Text encoding used to read the files. Default is 'utf-8'.

    Returns
    -------
    pd.DataFrame
        A DataFrame with two columns:
        - id_col: document identifier (file name)
        - text: document content
    """
    records = []

    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if os.path.isfile(file_path):
            with open(file_path, "r", encoding=encoding, errors="ignore") as f:
                text = f.read()

            records.append({
                id_col: filename,
                "text": text
            })

    return pd.DataFrame(records)

In [None]:
en_queries = load_text_corpus(os.path.join(english_dev_path,"queries"), id_col="q_id")
en_corpus = load_text_corpus(os.path.join(english_dev_path,"corpus"), id_col="c_id")
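
Before computing predictions, it can help to sanity-check what was loaded. The sketch below uses a hypothetical helper, `summarize_frame`, and toy data in place of the real DataFrames; it assumes only the two-column layout (an ID column plus `text`) returned by `load_text_corpus`.

```python
import pandas as pd


def summarize_frame(df, id_col):
    """Return basic sanity statistics for a DataFrame with an ID column and a 'text' column."""
    stripped = df["text"].str.strip()
    return {
        "n_docs": len(df),
        "n_duplicate_ids": int(df[id_col].duplicated().sum()),
        "n_empty_texts": int((stripped == "").sum()),
    }


# Toy data standing in for en_queries / en_corpus (illustrative values only).
toy = pd.DataFrame({"q_id": ["1", "2", "2"], "text": ["nurse", "  ", "data engineer"]})
summary = summarize_frame(toy, id_col="q_id")
print(summary)
```

On the real data, the equivalent calls would be `summarize_frame(en_queries, "q_id")` and `summarize_frame(en_corpus, "c_id")`.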

## Predictions

Generate the list of ids and text to process:

In [None]:
queries_ids = en_queries.q_id.to_list()
queries_texts = en_queries.text.to_list()

corpus_ids = en_corpus.c_id.to_list()
corpus_texts = en_corpus.text.to_list()

Load a simple embedding model:

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

For this example, we use a very simple approach: embed every query and every corpus document with a basic embedding model, then rank the corpus documents by their similarity to each query.

First, encode the queries and corpus elements:

In [None]:
q_embs = model.encode(queries_texts, batch_size=32)
c_embs = model.encode(corpus_texts, batch_size=32)

Then, compute the cosine similarity between every query and every corpus element:

In [None]:
similarities = util.cos_sim(q_embs, c_embs).cpu().numpy()
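
For reference, cosine similarity is the dot product of L2-normalized vectors. The NumPy sketch below illustrates that computation on toy vectors; `cosine_similarity_matrix` is a hypothetical helper written for this illustration, not part of `sentence_transformers`.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length, then take all pairwise dot products.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Toy vectors (illustrative only): one query and three corpus documents.
toy_q = [[1.0, 0.0]]
toy_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_sims = cosine_similarity_matrix(toy_q, toy_c)
print(toy_sims.shape)  # one row per query, one column per corpus document
```

The result has one row per query and one column per corpus document, which is the layout that the `similarities[q_idx, c_idx]` indexing in the next section relies on.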

## Prepare submission file

Submissions must follow the TREC run file format, **without** a header line in the output file. Each line has 6 space-separated columns with the following information:

- q_id: Query ID.
- Q0: A constant identifier, usually "Q0".
- doc_id: ID of the retrieved document.
- rank: Position of the document in the ranking.
- score: Relevance score assigned by the model.
- tag: Experiment name.

Let's process the results and prepare the output file. In this tutorial, we keep only the 5 highest-ranked corpus documents per query.

In [None]:
# Build one TREC-formatted line per (query, retrieved document) pair
results = []
for q_idx, q_id in enumerate(queries_ids):
    sorted_indices = np.argsort(-similarities[q_idx])  # Descending order of similarity
    for rank, c_idx in enumerate(sorted_indices[:5], start=1):  # For this tutorial, keep only the top 5 documents per query
        doc_id = corpus_ids[c_idx]
        score = similarities[q_idx, c_idx]
        results.append(f"{q_id} Q0 {doc_id} {rank} {score:.4f} baseline_model")

The output list has the expected structure:

In [None]:
results[0:2]

['38671 Q0 1508 1 0.6775 baseline_model',
 '38671 Q0 24279 2 0.6637 baseline_model']

Then, save the list as a `.trec` text file:

In [None]:
with open("evaluation_test_en.trec", "w", encoding="utf-8") as f:
    f.write("\n".join(results))
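
Before submitting, it may be worth checking the run lines against the format described above. The following is a minimal sketch with a hypothetical helper, `validate_trec_lines`. The specific checks (6 columns, ranks starting at 1 and increasing by 1 within each query, scores that do not increase with rank) are assumptions about what a well-formed run looks like; the official evaluation script may check more or less than this.

```python
def validate_trec_lines(lines):
    """Return a list of problems found in TREC run lines (an empty list means none found)."""
    problems = []
    last_seen = {}  # q_id -> (rank, score) of the previous line for that query
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {n}: expected 6 columns, found {len(fields)}")
            continue
        q_id, _q0, _doc_id, rank, score, _tag = fields
        try:
            rank, score = int(rank), float(score)
        except ValueError:
            problems.append(f"line {n}: rank or score is not numeric")
            continue
        prev = last_seen.get(q_id)
        expected_rank = 1 if prev is None else prev[0] + 1
        if rank != expected_rank:
            problems.append(f"line {n}: expected rank {expected_rank}, found {rank}")
        if prev is not None and score > prev[1]:
            problems.append(f"line {n}: score increases with rank")
        last_seen[q_id] = (rank, score)
    return problems


# The two example lines shown earlier in this section.
sample = [
    "38671 Q0 1508 1 0.6775 baseline_model",
    "38671 Q0 24279 2 0.6637 baseline_model",
]
print(validate_trec_lines(sample))  # prints [] because no problems are found
```

To check the saved file instead, read `evaluation_test_en.trec`, split its content into lines, and pass them to the same function.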

## Evaluation

For the evaluation, we will use the official [TalentCLEF 2026 evaluation script](https://github.com/TalentCLEF/talentclef26_evaluation_script), which uses the Ranx library under the hood.

First, clone the repo and install its requirements:

In [None]:
!git clone https://github.com/TalentCLEF/talentclef26_evaluation_script.git
!pip install -r /content/talentclef26_evaluation_script/requirements.txt


Then, select the Qrels file and the Run file to perform the evaluation.


In [None]:
qrels_file = "/content/taskA/TaskA/development/en/qrels.tsv"
run_file = "/content/evaluation_test_en.trec"

Examples of how to use the evaluation script in different scenarios are shown in the [repo README.md](https://github.com/TalentCLEF/talentclef26_evaluation_script/blob/main/README.md#examples).

In this notebook, we have worked only with English data from the Task A development set, so `--lang-mode` will be _en_ and `--task` will be _A_.

In [None]:
command = ["python", "/content/talentclef26_evaluation_script/talentclef_evaluate.py", "--task", "A", "--lang-mode", "en", "--qrels", qrels_file, "--run", run_file]
result = subprocess.run(command, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)  # Show the error message if the script failed

TalentCLEF 2026 - Task A Evaluation
Received parameters:
  Task: A
  Qrels: /content/taskA/TaskA/development/en/qrels.tsv
  Run: /content/evaluation_test_en.trec
  Language Mode: en

Loading qrels...
Loading run...

Running Task A evaluation...

=== Evaluation Results ===
map: 0.1557
mrr: 0.9500
ndcg: 0.2806
precision@5: 0.9400
precision@10: 0.4700
precision@100: 0.0470
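
A note on reading these numbers: the run contains only 5 documents per query, so precision@10 and precision@100 cannot exceed 0.5 and 0.05. Assuming precision@k is the number of relevant documents in the top k divided by k (the standard definition), the reported values are consistent with that cap, as this small arithmetic sketch shows. The helper `precision_at_k` is written for illustration only and is not taken from the evaluation script.

```python
def precision_at_k(n_relevant_retrieved, k):
    """Standard precision@k: relevant documents in the top k, divided by k."""
    return n_relevant_retrieved / k


# Mean number of relevant documents among the 5 retrieved per query,
# derived from the reported precision@5 of 0.94.
avg_hits = 0.94 * 5

# With only 5 documents retrieved, no further hits can appear beyond rank 5.
p10 = precision_at_k(avg_hits, 10)
p100 = precision_at_k(avg_hits, 100)
print(round(p10, 4), round(p100, 4))
```

The two values come out at about 0.47 and 0.047, matching the reported precision@10 and precision@100. Retrieving more documents per query would be needed for those metrics to be informative.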

