# LongEval
This notebook provides a simple starter for setting up a retrieval system for the LongEval shared task.

Further information is available here:
- [Shared Task Website](https://clef-longeval.github.io/)
- [Overview Paper 2024](http://www.zubiaga.org/publications/files/alkhalifa2024longeval-extended.pdf)
- [Overview Paper 2023](http://www.zubiaga.org/publications/files/alkhalifa2023longeval-overview.pdf)
- [LongEval Test Collection Paper](https://www.semanticscholar.org/reader/f40debce2b7caf35ea0730c27c5330989d20b300)

All datasets are located under `datasets/LongEval`. PyTerrier indexes for each point in time, built with the usual preprocessing pipeline, are in `datasets/LongEval/index`. The file `datasets/LongEval/metadata.yml` contains metadata that may help you organize the sub-collections.
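
The index directories follow a fixed naming pattern, so the path for any sub-collection can be built programmatically. The sketch below assumes the pattern `{dataset}-{language}-{sub_collection}-pyterrier` used for the index path in this notebook; the helper name `index_path_for` and the list of sub-collections are hypothetical and only serve as an illustration.

```python
import os


def index_path_for(base_path, dataset, language, sub_collection):
    """Build the index directory for one sub-collection (hypothetical helper)."""
    name = f"{dataset}-{language}-{sub_collection}-pyterrier"
    return os.path.join(base_path, "index", name)


# Sub-collection names chosen for illustration only
sub_collections = ["2023-01", "2023-02"]
paths = {
    sc: index_path_for("datasets/LongEval", "longeval-web", "fr", sc)
    for sc in sub_collections
}
```

A dictionary keyed by sub-collection makes it easy to loop over all points in time with the same retrieval pipeline.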

In [1]:
import yaml
import os
import pandas as pd
import numpy as np

import pyterrier as pt
if not pt.java.started():
    pt.java.init()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


In [2]:
from sqlalchemy import create_engine

DATABASE = "longeval-web"
USER = "dis18"
HOST = "db"
PORT = "5432"
PASSWORD = "dis182425"

engine = create_engine(f"postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}")

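# Smoke test: fetch a single row to confirm that the connection works
df = pd.read_sql('select * from "Topic" limit 1', con=engine)


def sql_query(query):
    """Run a SQL query against the database and return the result as a DataFrame."""
    return pd.read_sql(query, con=engine)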

In [3]:
dataset = "longeval-web"
language = "fr"
sub_collection = "2023-02"

In [4]:
BASE_PATH = "/home/jovyan/work/datasets/LongEval-Web"

# safe_load is sufficient as long as the metadata file contains only plain YAML
with open(os.path.join(BASE_PATH, "metadata.yml"), "r") as yamlfile:
    config = yaml.safe_load(yamlfile)

In [5]:
index_path = os.path.join(BASE_PATH, "index", f"{dataset}-{language}-{sub_collection}-pyterrier")
topics_path = os.path.join(BASE_PATH, "release_2025_p1/French/queries.txt")

In [6]:
q_topics = f"""
select distinct b.queryid qid, b.text_fr query
from "Qrel" a 
join  (
        select  *
        from    "Topic"
      ) b
      on      a.queryid = b.queryid
join (
        select distinct docid
        from   "Document"
        where  sub_collection = '{sub_collection}'
      )c
      on ('doc' || a.docid) = c.docid
where a.sub_collection = '{sub_collection}' 
"""

topics = sql_query(q_topics)
# Alternative: read the topics from the released file instead of the database
# topics = pd.read_csv(topics_path, sep="\t", names=["qid", "query"])
topics["qid"] = topics["qid"].astype(str)

# Remove punctuation that can break Terrier's query parser.
# A single character class avoids depending on the pandas default for `regex`.
topics["query"] = topics["query"].str.replace(r"['*/:?()+]", "", regex=True)

# Exclude the query ids listed as spam
spam = ["59769", "6060", "75200", "74351", "67599", "74238", "74207", "75100", "58130"]
topics = topics[~topics["qid"].isin(spam)]

In [7]:
index = pt.IndexFactory.of(index_path)

14:30:13.677 [main] WARN org.terrier.structures.BaseCompressingMetaIndex -- Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 4.3 GiB of memory would be required.


### Creating a Run

In [8]:
BM25 = pt.terrier.Retriever(index, wmodel="BM25", verbose=True)

In [9]:
print(">>> Loaded index with", index.getCollectionStatistics().getNumberOfDocuments(), "documents.")

>>> Loaded index with 2037717 documents.


# Loading a Run
Creating a run for all topics takes a very long time. Alternatively, you can load a BM25 baseline run and implement your approaches as re-ranking on top of it.
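
Re-ranking on top of a baseline run boils down to assigning new scores to the retrieved documents and recomputing the rank within each query. The following is a minimal sketch with pandas, not the method prescribed by the shared task. The helper `rerank` and the toy scores are hypothetical, and ranks start at 0 as in the baseline run loaded in this notebook.

```python
import pandas as pd


def rerank(run, new_scores):
    """Replace the scores and recompute per-query ranks (hypothetical helper)."""
    reranked = run.copy()
    reranked["score"] = new_scores
    # Highest score first within each query
    reranked = reranked.sort_values(["qid", "score"], ascending=[True, False])
    # Ranks start at 0, matching the baseline run format
    reranked["rank"] = reranked.groupby("qid").cumcount()
    return reranked.reset_index(drop=True)


# Toy baseline run; real runs contain up to 1000 documents per query
baseline = pd.DataFrame({
    "qid": ["1", "1", "1", "2", "2"],
    "docno": ["a", "b", "c", "d", "e"],
    "rank": [0, 1, 2, 0, 1],
    "score": [3.0, 2.0, 1.0, 5.0, 4.0],
})

# Made-up scores standing in for the output of a re-ranking model
reranked = rerank(baseline, [0.1, 0.9, 0.5, 0.2, 0.8])
```

The resulting DataFrame keeps the `qid`, `docno`, `rank`, and `score` columns, so it can be evaluated in the same way as the baseline run.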

In [10]:
# Read own run
run = pd.read_csv("run_modelprediction_2023-02.csv")

In [11]:
run_base = pt.io.read_results(f"{BASE_PATH}/runs/{dataset}-{language}-{sub_collection}-BM25.gz")
# The indexed documents prefix the docid with `doc`; remove exactly that prefix.
# (str.strip("doc") would strip any of the characters d, o, c from both ends.)
run_base["docno"] = run_base["docno"].str.removeprefix("doc")
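
When removing the `doc` prefix, note that `str.strip("doc")` treats its argument as a set of characters, whereas `str.removeprefix("doc")` removes only the literal leading string. For the numeric ids shown in this notebook both give the same result, but the sketch below illustrates the difference with a hypothetical non-numeric id. It assumes pandas 1.4 or newer, where `Series.str.removeprefix` is available.

```python
import pandas as pd

# The third id is hypothetical and only illustrates the edge case
docnos = pd.Series(["doc123", "doc456", "docabc"])

# Strips any of the characters 'd', 'o', 'c' from both ends
stripped = docnos.str.strip("doc")

# Removes only the literal leading "doc"
no_prefix = docnos.str.removeprefix("doc")
```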

In [12]:
# Flag baseline rows whose qid also occurs in the own run, then keep only those
qid_map = {str(qid): 1 for qid in run["qid"].drop_duplicates()}
run_base["qid_bool"] = run_base["qid"].map(qid_map)
run_base_small = run_base.dropna(subset=["qid_bool"]).copy()

# Evaluating the System

In [13]:
run_base_small["qid"].unique().shape

(3673,)

In [14]:
q_qrels = f"""
select a.queryid qid, a.docid docno, cast(a.relevance as int) label
from "Qrel" a 
join  (
        select  *
        from    "Topic"
      ) b
      on      a.queryid = b.queryid
join (
        select distinct docid
        from   "Document"
        where  sub_collection = '{sub_collection}'
      )c
      on ('doc' || a.docid) = c.docid
where a.sub_collection = '{sub_collection}' 
"""

qrels = sql_query(q_qrels)
#qrels = pt.io.read_qrels(BASE_PATH + f"/release_2025_p1/French/LongEval Train Collection/qrels/{sub_collection}_{language}/qrels_processed.txt")

In [15]:
print(qrels.columns.tolist())

['qid', 'docno', 'label']


In [16]:
# The indexed documents prefix the docid with `doc`; remove exactly that prefix
run["docno"] = run["docno"].str.removeprefix("doc")

In [17]:
cols = ["qid", "docno", "rank", "score"]
# Copy so that later dtype coercion does not operate on a slice of `run`
run_eval = run[cols].copy()
pt.Experiment(
    [run_base_small, run_eval],
    topics,
    qrels,
    eval_metrics=["bpref", "map", "ndcg", "ndcg_cut_10", "P.10"],
    verbose=True
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe[column] = dataframe[column].astype(dtype)
pt.Experiment: 100%|██████████| 2/2 [00:32<00:00, 16.30s/system]


| | name | bpref | map | ndcg | ndcg_cut_10 | P.10 |
|---|---|---|---|---|---|---|
| 0 | qid docno rank score ... | 0.247167 | 0.10308 | 0.150342 | 0.118722 | 0.027428 |
| 1 | qid docno rank score\n0 ... | 0.262908 | 0.130859 | 0.171742 | 0.144436 | 0.029921 |
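
To make the `P.10` column more concrete, precision at a cutoff can be recomputed by hand from a run and the qrels. The sketch below is a simplified illustration, not the implementation PyTerrier uses. It assumes ranks start at 0, treats documents with a label greater than 0 as relevant, and counts unjudged documents as non-relevant. The helper `precision_at_k` and the toy data are hypothetical.

```python
import pandas as pd


def precision_at_k(run, qrels, k=10):
    """Per-query fraction of relevant documents among the top k (hypothetical helper)."""
    top_k = run[run["rank"] < k]
    merged = top_k.merge(qrels, on=["qid", "docno"], how="left")
    # Unjudged documents have no label and count as non-relevant
    merged["rel"] = merged["label"].fillna(0) > 0
    return merged.groupby("qid")["rel"].sum() / k


# Toy data for illustration only
toy_run = pd.DataFrame({
    "qid": ["1", "1", "1"],
    "docno": ["a", "b", "c"],
    "rank": [0, 1, 2],
})
toy_qrels = pd.DataFrame({
    "qid": ["1", "1", "1"],
    "docno": ["a", "c", "z"],
    "label": [1, 0, 1],
})

p_at_2 = precision_at_k(toy_run, toy_qrels, k=2)
```

With `k=2`, only documents `a` and `b` are considered, and one of them is judged relevant.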


In [18]:
print(run_base)

           qid    docno  rank      score       name  qid_bool
0            3  2214755     0  24.258737  pyterrier       1.0
1            3   684186     1  23.376940  pyterrier       1.0
2            3   637997     2  23.182743  pyterrier       1.0
3            3   430968     3  23.010934  pyterrier       1.0
4            3  3430721     4  22.815825  pyterrier       1.0
...        ...      ...   ...        ...        ...       ...
7735425  75397  3364209   995  11.714027  pyterrier       NaN
7735426  75397  3245800   996  11.706049  pyterrier       NaN
7735427  75397  1290080   997  11.703555  pyterrier       NaN
7735428  75397  1690427   998  11.696082  pyterrier       NaN
7735429  75397  2906989   999  11.694390  pyterrier       NaN

[7735430 rows x 6 columns]


In [19]:
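cols = ["qid", "docno", "rank", "score"]
# Copy so that later dtype coercion does not operate on a slice of `run`
run_eval = run[cols].copy()
# Per-query evaluation, restricted to nDCG to keep the result table small.
# Other available metrics: "bpref", "map", "ndcg_cut_10", "P.10"
df_res = pt.Experiment(
    [run_base_small, run_eval],
    topics,
    qrels,
    eval_metrics=["ndcg"],
    verbose=True,
    perquery=True,
)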
df_res

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe[column] = dataframe[column].astype(dtype)
pt.Experiment: 100%|██████████| 2/2 [00:31<00:00, 15.99s/system]


| | name | qid | measure | value |
|---|---|---|---|---|
| 11181 | qid docno rank score\n0 ... | 100 | ndcg | 0.000000 |
| 11210 | qid docno rank score\n0 ... | 1000 | ndcg | 0.231378 |
| 9741 | qid docno rank score\n0 ... | 1006 | ndcg | 0.274785 |
| 8254 | qid docno rank score\n0 ... | 1007 | ndcg | 1.000000 |
| 11654 | qid docno rank score\n0 ... | 1009 | ndcg | 0.000000 |
| ... | ... | ... | ... | ... |
| 42 | qid docno rank score ... | 99 | ndcg | 0.194959 |
| 387 | qid docno rank score ... | 990 | ndcg | 0.189200 |
| 388 | qid docno rank score ... | 992 | ndcg | 0.356207 |
| 389 | qid docno rank score ... | 996 | ndcg | 0.327395 |
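
A per-query result table in this long format is easier to compare once it is pivoted so that each system becomes a column. The sketch below uses made-up values and the hypothetical system names `baseline` and `own`. In the output above, the `name` column holds the printed DataFrames because no names were passed; `pt.Experiment` accepts a `names` argument for that purpose.

```python
import pandas as pd

# Made-up per-query results in the same long format as `df_res`
perquery = pd.DataFrame({
    "name": ["baseline", "baseline", "own", "own"],
    "qid": ["1", "2", "1", "2"],
    "measure": ["ndcg"] * 4,
    "value": [0.2, 0.5, 0.4, 0.3],
})

# One row per query, one column per system
wide = perquery.pivot(index="qid", columns="name", values="value")

# Positive delta: the own run beats the baseline on that query
wide["delta"] = wide["own"] - wide["baseline"]
improved = int((wide["delta"] > 0).sum())
```

Sorting `wide` by `delta` then shows the queries where the two systems differ the most, which is a convenient starting point for error analysis.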
