# LongEval
This notebook provides a simple starter for setting up a retrieval system for the LongEval shared task.

Further information is available here:
- [Shared Task Website](https://clef-longeval.github.io/)
- [Overview Paper 2024](http://www.zubiaga.org/publications/files/alkhalifa2024longeval-extended.pdf)
- [Overview Paper 2023](http://www.zubiaga.org/publications/files/alkhalifa2023longeval-overview.pdf)
- [LongEval Test Collection Paper](https://www.semanticscholar.org/reader/f40debce2b7caf35ea0730c27c5330989d20b300)

All datasets are located in `datasets/LongEval`. PyTerrier indexes for each point in time, built with the usual preprocessing pipeline, are in `datasets/LongEval/index`. The file `datasets/LongEval/metadata.yml` contains metadata that may help you organize the sub-collections.

In [1]:
import yaml
import os
import pandas as pd
import numpy as np

import pyterrier as pt
if not pt.java.started():
    pt.java.init()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


In [2]:
from sqlalchemy import create_engine

DATABASE = "longeval-web"
USER = "dis18"
HOST = "db"
PORT = "5432"
PASSWORD = "dis182425"

engine = create_engine(f"postgresql+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}")

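# Smoke test: fetch a single row to make sure the connection works
df = pd.read_sql('select * from "Topic" limit 1', con=engine)


def sql_query(query):
    """Run a SQL query against the LongEval database and return a DataFrame."""
    return pd.read_sql(query, con=engine)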

In [3]:
dataset = "longeval-web"
language = "fr"
sub_collection = "2023-03"

In [4]:
BASE_PATH = "/home/jovyan/work/datasets/LongEval-Web"

with open(BASE_PATH + "/metadata.yml", "r") as yamlfile:
    config = yaml.load(yamlfile, Loader=yaml.FullLoader)

In [5]:
# BASE_PATH is absolute, so no leading "." is needed
index_path = os.path.join(BASE_PATH, "index", f"{dataset}-{language}-{sub_collection}-pyterrier")
topics_path = os.path.join(BASE_PATH, "release_2025_p1/French/queries.txt")
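
The index naming scheme used above can be wrapped in a small helper when several sub-collections are processed in a loop. This is a minimal sketch: the function name `index_path_for` and the list of sub-collections are assumptions for illustration, and the actual sub-collection names should be taken from `metadata.yml`.

```python
import os

BASE_PATH = "/home/jovyan/work/datasets/LongEval-Web"


def index_path_for(dataset: str, language: str, sub_collection: str) -> str:
    """Build the index directory path for one sub-collection (hypothetical helper)."""
    return os.path.join(BASE_PATH, "index", f"{dataset}-{language}-{sub_collection}-pyterrier")


# Illustrative list only; read the real sub-collections from metadata.yml
for sc in ["2023-02", "2023-03"]:
    print(index_path_for("longeval-web", "fr", sc))
```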

In [6]:
q_topics = f"""
select distinct b.queryid qid, b.text_fr query
from "Qrel" a 
join  (
        select  *
        from    "Topic"
      ) b
      on      a.queryid = b.queryid
join (
        select distinct docid
        from   "Document"
        where  sub_collection = '{sub_collection}'
      )c
      on ('doc' || a.docid) = c.docid
where a.sub_collection = '{sub_collection}' 
"""
q_topics_sub_collection = f"""
                select distinct a.queryid qid, a.text_{language} query 
                from "Topic" a 
                where sub_collection = '{sub_collection}'
                group by a.queryid, a.text_{language} 
                """


topics = sql_query(q_topics_sub_collection)
#topics = pd.read_csv(topics_path, sep="\t", names=["qid", "query"])
topics["qid"] = topics["qid"].astype(str)

# Remove special characters (' * / : ? ( ) +) from the queries in one pass
topics["query"] = topics["query"].str.replace(r"['*/:?()+]", "", regex=True)
spam = ["59769", "6060", "75200", "74351", "67599", "74238", "74207", "75100", "58130"]
topics = topics[~topics["qid"].isin(spam)]
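
The character stripping in the cell above can also be expressed as a small standalone function, which makes it easy to check on individual queries. The following is a minimal sketch; the name `clean_query` and the sample query are illustrative assumptions, not part of the notebook.

```python
import re

# The same characters that are removed from the queries in the cell above
_QUERY_CHARS = re.compile(r"['*/:?()+]")


def clean_query(query: str) -> str:
    """Remove the special characters from a single query string (hypothetical helper)."""
    return _QUERY_CHARS.sub("", query)


print(clean_query("l'hôtel (paris) +spa?"))
```

On a DataFrame, such a function could be applied with `topics["query"].map(clean_query)`.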

In [7]:
index = pt.IndexFactory.of(index_path)

18:45:02.400 [main] WARN org.terrier.structures.BaseCompressingMetaIndex -- Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 4.3 GiB of memory would be required.


### Creating a Run

In [10]:
BM25 = pt.terrier.Retriever(index, wmodel="BM25", verbose=True)

In [11]:
print(">>> Loaded index with", index.getCollectionStatistics().getNumberOfDocuments(), "documents.")

>>> Loaded index with 2028979 documents.


In [12]:
run = BM25.transform(topics.sample(n=10, random_state=1)) # we take only 10 queries for a test

TerrierRetr(BM25): 100%|██████████| 10/10 [00:05<00:00,  1.74q/s]


In [13]:
run

Unnamed: 0,qid,docid,docno,rank,score,query
0,54508,1066639,772636,0,32.574383,petrole leclerc
1,54508,205322,1646120,1,32.469047,petrole leclerc
2,54508,1493231,1699378,2,32.078767,petrole leclerc
3,54508,200122,2322267,3,28.259876,petrole leclerc
4,54508,1582692,3450018,4,28.022994,petrole leclerc
...,...,...,...,...,...,...
9600,54038,1943209,2960578,995,13.813258,jean paul rappeneau
9601,54038,1535301,3139347,996,13.812507,jean paul rappeneau
9602,54038,1549692,3060982,997,13.810782,jean paul rappeneau
9603,54038,576750,2605088,998,13.806885,jean paul rappeneau


In [11]:
# Load cluster
f_cluster = "topics_cluster_all_subcollections.csv"
df_cluster = pd.read_csv(f_cluster)
df_cluster = df_cluster.astype({"queryid":"string"})
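# queryid is a string column after the cast above, so compare against "43207";
# a comparison with the integer 43207 matches no rows and yields an empty frame
print(df_cluster.loc[df_cluster["queryid"] == "43207"])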
cluster_map = df_cluster.set_index("queryid")["cluster"]

# Create Mapping for queryid and docs
map_qid_docid = run.set_index("qid")["docno"]

Empty DataFrame
Columns: [queryid, text_en, text_fr, sub_collection, split, token_stem, category, cluster]
Index: []


In [12]:
# Copy the qid/docno pairs of the run and attach each query's cluster
predict_run = run.loc[:, ["qid", "docno"]].copy()
predict_run["real_cluster"] = predict_run["qid"].map(cluster_map)
# Export the unique documents without the qid column
docid_export = predict_run.drop(columns=["qid"]).drop_duplicates()
docid_export.to_csv("run_docids.csv", index=False)

**ATTENTION**
Run the following notebooks next, then continue below:

1.  [create predictset](5_mw_create_predictset_docidlist.ipynb)
2.  [predict cluster & relevance](6_mw_predict_cluster_relevance.ipynb)

(For now, just run each notebook completely once; this may change later.)

In [14]:
# Get predicted cluster and relevance
f_path = "docs_pred.csv"
df_pred_cluster_rel = pd.read_csv(f_path)
df_pred_cluster_rel.drop_duplicates(inplace=True)

# create mappings to assign to run for readjusting
cluster_pred = df_pred_cluster_rel.set_index("docid")["cluster_pred"]
rel_pred = df_pred_cluster_rel.set_index("docid")["rel_pred"]

run["real_cluster"] = run["qid"].map(cluster_map)
# Documents without a prediction get the sentinel cluster -1
run["pred_cluster"] = run["docno"].map(cluster_pred).fillna(-1).astype("int64")
# Use the predicted relevance only where the document's cluster matches the query's cluster
cluster_cond = (run["pred_cluster"] != -1) & (run["real_cluster"] == run["pred_cluster"])
run.loc[cluster_cond, "rel_pred"] = run["docno"].map(rel_pred)
# All remaining rows get the neutral factor 1 and keep their BM25 score
run["rel_pred"] = run["rel_pred"].fillna(1)

In [15]:
# Re-rank: scale the BM25 score by the predicted relevance factor
run["score"] = run["score"] * run["rel_pred"]
# Sort by query and descending score, then renumber the ranks from 0 within each query
run.sort_values(by=["qid", "score"], inplace=True, ascending=False)
run["rank"] = run.groupby("qid").cumcount()
print(run)

         qid    docid       docno  rank      score             query  \
29414    995  1925382  doc2857754     0  42.101296  enseirb bordeaux   
29415    995  1852443   doc544715     1  39.443090  enseirb bordeaux   
29416    995  1621557  doc1677047     2  38.560717  enseirb bordeaux   
29417    995   649192  doc2056515     3  36.703032  enseirb bordeaux   
29418    995   877090  doc2069100     4  36.703032  enseirb bordeaux   
...      ...      ...         ...   ...        ...               ...   
171151  1026  1474851  doc1771635   995   0.207065        espace caf   
170966  1026    73930   doc604794   996   0.145379        espace caf   
170911  1026  1566925  doc1891122   997   0.087496        espace caf   
170679  1026    35609   doc180801   998   0.052369        espace caf   
170801  1026  1045322  doc1804982   999   0.013975        espace caf   

        real_cluster  pred_cluster  rel_pred  
29414              8             4  1.000000  
29415              8            22  1.000
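
The re-ranking logic of the cells above can be checked in isolation on toy data. The sketch below uses plain Python dictionaries in place of the DataFrames; `rerank`, its parameters, and all ids and values are hypothetical. A document's score is multiplied by its predicted relevance factor only if its predicted cluster equals the cluster of the query. Otherwise the factor is 1.

```python
from collections import defaultdict


def rerank(run, query_cluster, doc_cluster, doc_rel):
    """Re-rank (qid, docno, score) rows with a cluster-gated relevance factor.

    run           -- list of (qid, docno, score) tuples
    query_cluster -- dict qid -> cluster of the query
    doc_cluster   -- dict docno -> predicted cluster of the document
    doc_rel       -- dict docno -> predicted relevance factor
    """
    per_query = defaultdict(list)
    for qid, docno, score in run:
        pred = doc_cluster.get(docno, -1)  # -1 marks "no prediction"
        same_cluster = pred != -1 and pred == query_cluster.get(qid)
        factor = doc_rel.get(docno, 1.0) if same_cluster else 1.0
        per_query[qid].append((docno, score * factor))

    reranked = []
    for qid, docs in per_query.items():
        docs.sort(key=lambda d: d[1], reverse=True)  # highest score first
        for rank, (docno, score) in enumerate(docs):  # ranks start at 0
            reranked.append((qid, docno, rank, score))
    return reranked


toy_run = [("q1", "d1", 10.0), ("q1", "d2", 9.0), ("q1", "d3", 8.0)]
result = rerank(
    toy_run,
    query_cluster={"q1": 4},
    doc_cluster={"d1": 7, "d2": 4},  # d3 has no prediction
    doc_rel={"d1": 0.1, "d2": 1.5},
)
print(result)
```

In this toy case only `d2` shares the query's cluster, so only its score is scaled. `d1` keeps its score although a low factor was predicted for it.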

# Loading a Run
Creating a run for all topics takes a very long time. Alternatively, you can load a BM25 baseline run and implement your approaches as re-ranking.

In [16]:
run_base = pt.io.read_results(f"{BASE_PATH}/runs/{dataset}-{language}-{sub_collection}-BM25.gz")
# The indexed documents prefix the docid with `doc`; remove exactly that prefix
# (str.strip("doc") would strip any of the characters d, o, c from both ends)
run_base["docno"] = run_base["docno"].str.removeprefix("doc")
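
When removing the `doc` prefix, note that Python's `str.strip("doc")` treats its argument as a set of characters and removes any of `d`, `o`, `c` from both ends, whereas `str.removeprefix("doc")` removes only the literal prefix. For purely numeric ids both give the same result, but the prefix-aware variant is safer. A minimal sketch with made-up ids (the helper name `strip_doc_prefix` is an assumption; `str.removeprefix` requires Python 3.9 or newer):

```python
def strip_doc_prefix(docno: str) -> str:
    """Remove a leading 'doc' and nothing else (hypothetical helper)."""
    return docno.removeprefix("doc")


ids = ["doc2857754", "doc544715", "2056515"]
print([strip_doc_prefix(d) for d in ids])

# strip() works on a character set, so a trailing "c" is removed as well
print("doc12c".strip("doc"))
# removeprefix() only touches the literal prefix
print(strip_doc_prefix("doc12c"))
```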

In [17]:
# Keep only those baseline rows whose query also occurs in our run
qid_map = {str(qid): 1 for qid in run["qid"].drop_duplicates().tolist()}
run_base["qid_bool"] = run_base["qid"].map(qid_map)
run_base_small = run_base.dropna(subset=["qid_bool"])

# Evaluating the System

In [18]:
run_base_small["qid"].unique().shape

(199,)

In [19]:
q_qrels = f"""
select a.queryid qid, a.docid docno, cast(a.relevance as int) label
from "Qrel" a 
join  (
        select  *
        from    "Topic"
      ) b
      on      a.queryid = b.queryid
join (
        select distinct docid
        from   "Document"
        where  sub_collection = '{sub_collection}'
      )c
      on ('doc' || a.docid) = c.docid
where a.sub_collection = '{sub_collection}' 
"""

qrels = sql_query(q_qrels)
#qrels = pt.io.read_qrels(BASE_PATH + f"/release_2025_p1/French/LongEval Train Collection/qrels/{sub_collection}_{language}/qrels_processed.txt")

In [20]:
print(qrels.columns.tolist())

['qid', 'docno', 'label']


In [21]:
# The indexed documents prefix the docid with `doc`; remove exactly that prefix
run["docno"] = run["docno"].str.removeprefix("doc")

In [22]:
cols = ["qid", "docno", "rank", "score"]
run_eval = run[cols]
# Compare the BM25 baseline (first result row) with the re-ranked run (second result row)
pt.Experiment(
    [run_base_small, run_eval],
    topics,
    qrels,
    eval_metrics=["bpref", "map", "ndcg", "ndcg_cut_10", "P.10"],
    verbose=True
)

pt.Experiment: 100%|██████████| 2/2 [00:01<00:00,  1.52system/s]


Unnamed: 0,name,bpref,map,ndcg,ndcg_cut_10,P.10
0,qid docno rank score ...,0.014616,0.006882,0.01029,0.00817,0.002038
1,qid docno rank score\n29414 ...,0.016656,0.007946,0.011343,0.0091,0.002183
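
To make metrics such as `P.10` and `ndcg_cut_10` easier to interpret, the following sketch computes precision@k and nDCG@k for a single query on toy data. It assumes linear gains and a log2 rank discount, which is a common formulation; whether the evaluation backend uses exactly this gain function is an assumption. The function names and the toy judgments are hypothetical and unrelated to the results above.

```python
import math


def precision_at_k(ranked_docs, qrels, k=10):
    """Fraction of the top-k positions that hold a relevant document (label > 0)."""
    top = ranked_docs[:k]
    return sum(1 for d in top if qrels.get(d, 0) > 0) / k


def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k with linear gain and a log2(rank + 1) discount, ranks starting at 1."""
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels))

    gains = [qrels.get(d, 0) for d in ranked_docs[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_qrels = {"d1": 2, "d2": 0, "d3": 1}  # docno -> relevance label
toy_ranking = ["d3", "d2", "d1", "d4"]   # system output, best first

print(precision_at_k(toy_ranking, toy_qrels, k=2))
print(round(ndcg_at_k(toy_ranking, toy_qrels, k=3), 4))
```

Unjudged documents such as `d4` count as non-relevant here, so a run that retrieves many unjudged documents scores low on these measures.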


In [23]:
print(run_base)

            qid    docno  rank      score                              name  \
0             2   581746     0  26.356897  CIR-longeval-web-fr-2023-02-BM25   
1             2  2246696     1  20.328055  CIR-longeval-web-fr-2023-02-BM25   
2             2  3139049     2  19.755776  CIR-longeval-web-fr-2023-02-BM25   
3             2  2605959     3  18.974690  CIR-longeval-web-fr-2023-02-BM25   
4             2  3067242     4  16.374459  CIR-longeval-web-fr-2023-02-BM25   
...         ...      ...   ...        ...                               ...   
53000446  75427  2078031   995  11.124785  CIR-longeval-web-fr-2023-02-BM25   
53000447  75427   240537   996  11.121439  CIR-longeval-web-fr-2023-02-BM25   
53000448  75427  1811249   997  11.120463  CIR-longeval-web-fr-2023-02-BM25   
53000449  75427   742355   998  11.119224  CIR-longeval-web-fr-2023-02-BM25   
53000450  75427   198709   999  11.118625  CIR-longeval-web-fr-2023-02-BM25   

          qid_bool  
0              NaN  
1        

In [24]:
cols = ["qid", "docno", "rank", "score"]
run_eval = run[cols]
df_res = pt.Experiment(
    [run_base_small, run_eval],
    topics,
    qrels,
    eval_metrics=["ndcg"],  # other options: "bpref", "map", "ndcg_cut_10", "P.10"
    verbose=True,
    perquery=True,
)
df_res

pt.Experiment: 100%|██████████| 2/2 [00:01<00:00,  1.55system/s]


Unnamed: 0,name,qid,measure,value
199,qid docno rank score ...,10,ndcg,0.0
200,qid docno rank score ...,100,ndcg,0.0
201,qid docno rank score ...,1000,ndcg,0.0
202,qid docno rank score ...,1001,ndcg,0.0
203,qid docno rank score ...,1002,ndcg,0.0
...,...,...,...,...
9614,qid docno rank score\n29414 ...,993,ndcg,0.0
4809,qid docno rank score\n29414 ...,995,ndcg,0.0
9615,qid docno rank score\n29414 ...,996,ndcg,0.0
9616,qid docno rank score\n29414 ...,997,ndcg,0.0
