# Getting started

### CLEF 2025 - CheckThat! Lab  - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)



- **https://www.nltk.org/**
- **https://spacy.io/docs**

In [51]:
!pip install nltk spacy pandarallel ipywidgets rank_bm25


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:
!python -m spacy download en_core_web_sm

In [3]:
import numpy as np
import pandas as pd

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file then upload it on the Google Colab session with the following steps.


In [4]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = 'subtask_4b/subtask4b_collection_data.pkl' 

In [5]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

### Dataframe Information (`df_collection.info()`):

The dataframe `df_collection` contains **7,718 entries** (rows) and **17 columns**. This is the metadata for **7,718 papers** in the CORD-19 dataset.


In [6]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [7]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


In this process, I performed several text preprocessing steps on the title and abstract columns of the dataset df_collection** First, I removed any URLs using a regular expression. I then removed punctuation and digits, converted the text to lowercase, and eliminated common stopwords, except for those found in the important_terms list, which includes relevant medical terms like disease names . To further lemmatization the text, I applied stemming to all words, except for those in the important_terms list. Additionally, I used the **spaCy** library to extract named entities from both the title and abstract columns. I also cleaned up extra spaces by reducing multiple spaces to a single one . As a result of these preprocessing steps, I created new columns for the cleaned title** and abstract, as well as for the extracted entities from both columns.

In [84]:
import re
import string
import html
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import spacy
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_md")


important_terms = {
    'covid', 'covid-19', 'hiv', 'rna', 'sars', 'r0', 'h1n1', 'who', 'cdc', 
    'micro', 'kg', 'm/s', 'flu', 'cancer', 'aids', 'diabetes', 'malaria', 
    'tuberculosis', 'hepatitis', 'pneumonia', 'leprosy', 'arthritis', 'asthma', 
    'hypertension', 'obesity', 'influenza', 'hiv', 'hepatitis-b', 'coronavirus', 
    'outbreak', 'pandemic', 'endemic', 'vaccine', 'antiviral', 'antibiotic', 
    'surgical', 'gene', 'genome', 'mutation', 'pathogen'
}

def preprocess_text(text, terms_extractor=nlp):
    
    text = re.sub(r'http\S+|www\S+', '', text)

    text = html.unescape(text)

    text = text.lower()
     
    text = ' '.join([word if word in important_terms else re.sub(r'\d+', '', word) for word in text.split()])
     
    text = re.sub(r'\s+', ' ', text).strip()

    text = text.translate(str.maketrans('', '', string.punctuation))

    text = ' '.join([word for word in text.split() if word not in stop_words or word in important_terms])
    
    text = ' '.join([stemmer.stem(word) if word not in important_terms else word for word in text.split()])
    
    doc = terms_extractor(text)
    entities = ' '.join(ent.text for ent in doc.ents)

    return text, entities

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [85]:
df_collection['cleaned_title'], df_collection['entities_in_title'] = zip(*df_collection['title'].parallel_apply(preprocess_text))
df_collection['cleaned_abstract'], df_collection['entities_in_abstract'] = zip(*df_collection['abstract'].parallel_apply(preprocess_text))

filtered_df = df_collection[['title', 'cleaned_title', 'entities_in_title', 'abstract', 'cleaned_abstract', 'entities_in_abstract']]
filtered_df.head(10)

Python(97851) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=965), Label(value='0 / 965'))), HB…

Python(97852) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97853) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97854) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97855) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97856) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97857) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97858) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97859) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97867) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=965), Label(value='0 / 965'))), HB…

Python(97868) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97869) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97870) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97871) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97872) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97873) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97874) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97875) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Unnamed: 0,title,cleaned_title,entities_in_title,abstract,cleaned_abstract,entities_in_abstract
162,Professional and Home-Made Face Masks Reduce E...,profession homemad face mask reduc exposur res...,profession homemad mask reduc exposur gener popul,BACKGROUND: Governments are preparing for a po...,background govern prepar potenti influenza pan...,potenti possibl gener popul routin circumst me...
611,The Failure of R (0),failur r,failur r,"The basic reproductive ratio, R (0), is one of...",basic reproduct ratio r one fundament concept ...,reproduct ratio one mathemat biolog threshold ...
918,Pulmonary sequelae in a patient recovered from...,pulmonari sequela patient recov swine flu,pulmonari sequela,The pandemic of swine flu (H1N1) influenza spr...,pandemic swine flu hn influenza spread involv ...,involv diseas groundglass opac highresolut com...
993,What was the primary mode of smallpox transmis...,primari mode smallpox transmiss implic biodefens,primari implic biodefens,The mode of infection transmission has profoun...,mode infect transmiss profound implic effect c...,transmiss gener primari second resolv review e...
1053,"Lessons from the History of Quarantine, from P...",lesson histori quarantin plagu influenza,lesson histori quarantin,"In the new millennium, the centuries-old strat...",new millennium centuriesold strategi quarantin...,quarantin becom infecti diseas quarantin borde...
1589,Anxiety and Depression: Linkages with Viral Di...,anxieti depress linkag viral diseas,anxieti depress linkag diseas,Anxiety and mood disorders are common in the g...,anxieti mood disord common gener popul countri...,anxieti mood disord common gener popul literat...
2069,Effects of Ultraviolet Germicidal Irradiation ...,effect ultraviolet germicid irradi uvgi n resp...,respir filtrat,The ability to disinfect and reuse disposable ...,abil disinfect reus dispos n filter facepiec r...,abil disinfect reus dispos n facepiec respir i...
2843,Secretome of Intestinal Bacilli: A Natural Gua...,secretom intestin bacilli natur guard patholog,secretom intestin bacilli,Current studies of human gut microbiome usuall...,current studi human gut microbiom usual consid...,substanc two repres bacillu lactobacillu colon...
2952,Long term outcomes in survivors of epidemic In...,long term outcom survivor epidem influenza hn ...,outcom epidem influenza hn,Patients who survive influenza A (H7N9) virus ...,patient who surviv influenza hn viru infect ri...,psycholog complic lung injuri multiorgan dysfu...
3044,Far-UVC light: A new tool to control the sprea...,faruvc light new tool control spread airbornem...,microbi diseas,Airborne-mediated microbial diseases such as i...,airbornemedi microbi diseas influenza tubercul...,airbornemedi microbi diseas tuberculosis repre...


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file then upload it on the Google Colab session with the following steps.

In [86]:
PATH_QUERY_TRAIN_DATA = 'subtask_4b/subtask4b_query_tweets_train.tsv'
PATH_QUERY_DEV_DATA = 'subtask_4b/subtask4b_query_tweets_dev.tsv' 

In [87]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [88]:
df_query_train['cleaned_tweet_text'], df_query_train['entities_in_tweet_text_train'] = zip(*df_query_train['tweet_text'].parallel_apply(preprocess_text))
df_query_train['cleaned_tweet_text'] = df_query_train['cleaned_tweet_text'].fillna('').astype(str)

df_query_dev['cleaned_tweet_text'], df_query_dev['entities_in_tweet_text_dev'] = zip(*df_query_dev['tweet_text'].parallel_apply(preprocess_text))
df_query_dev['cleaned_tweet_text'] = df_query_dev['cleaned_tweet_text'].fillna('').astype(str)


Python(97932) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=1607), Label(value='0 / 1607'))), …

Python(97933) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97934) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97935) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97936) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97937) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97938) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97939) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97940) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97946) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=175), Label(value='0 / 175'))), HB…

Python(97947) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97948) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97949) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97950) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97951) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97952) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97953) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97954) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


In [89]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid,cleaned_tweet_text,entities_in_tweet_text_train
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5,oral care rehabilit medicin oral vulner oral m...,rehabilit medicin muscl wast hospitalassoci oral
1,1,this study isn't receiving sufficient attentio...,4kfl29ul,studi isnt receiv suffici attent reveal blackl...,receiv suffici blacklatinoindigen
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8,thank xi jinp remind studi conclud nonpharmace...,conclud nonpharmaceut intervent three week ear...
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1,taiwan popul million case death widespread mas...,taiwan popul million quarantin measur erad pos...
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69,obtain diagnosi autism lower incom countri tak...,lengthi modifi screen diagnosi lowincom


# 2)The following code runs a BM25 after preprcoessing 



In [90]:
from rank_bm25 import BM25Okapi

corpus = df_collection[['cleaned_title', 'cleaned_abstract', 'entities_in_title', 'entities_in_abstract']].parallel_apply(
    lambda x: f"{x['cleaned_title']} {x['cleaned_abstract']} {' '.join(x['entities_in_title'])} {' '.join(x['entities_in_abstract'])}",
    axis=1
).tolist()

cord_uids = df_collection[:]['cord_uid'].tolist()

tokenized_corpus = [doc.split(' ') for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

Python(97970) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=965), Label(value='0 / 965'))), HB…

Python(97971) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97972) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97973) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97974) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97975) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97976) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97977) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97978) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


In [91]:
tokenized_corpus[0][:5]

['profession', 'homemad', 'face', 'mask', 'reduc']

In [92]:
text2bm25top = {}
def get_top_cord_uids(query):
  if query in text2bm25top.keys():
      return text2bm25top[query]
  else:
      tokenized_query = query.split(' ')
    

      #based on the tokenized query we call the BM25 to compute the relevance score for each document relevant to the query 
      doc_scores = bm25.get_scores(tokenized_query)
      # it sort the doc scores in descending order 
      indices = np.argsort(-doc_scores)[:5]
      #This line uses the indices from the previous step to retrieve the actual document IDs 
      bm25_topk = [cord_uids[x] for x in indices]
      

      text2bm25top[query] = bm25_topk
      return bm25_topk

In [93]:
len(text2bm25top)

0

In [95]:
df_query_train['final_query'] = df_query_train['cleaned_tweet_text'] + ' ' + df_query_train['entities_in_tweet_text_train']
df_query_dev['final_query'] = df_query_dev['cleaned_tweet_text'] + ' ' + df_query_dev['entities_in_tweet_text_dev']

df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid,cleaned_tweet_text,entities_in_tweet_text_train,final_query
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5,oral care rehabilit medicin oral vulner oral m...,rehabilit medicin muscl wast hospitalassoci oral,oral care rehabilit medicin oral vulner oral m...
1,1,this study isn't receiving sufficient attentio...,4kfl29ul,studi isnt receiv suffici attent reveal blackl...,receiv suffici blacklatinoindigen,studi isnt receiv suffici attent reveal blackl...
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8,thank xi jinp remind studi conclud nonpharmace...,conclud nonpharmaceut intervent three week ear...,thank xi jinp remind studi conclud nonpharmace...
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1,taiwan popul million case death widespread mas...,taiwan popul million quarantin measur erad pos...,taiwan popul million case death widespread mas...
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69,obtain diagnosi autism lower incom countri tak...,lengthi modifi screen diagnosi lowincom,obtain diagnosi autism lower incom countri tak...


In [96]:
df_query_train['bm25_topk'] = df_query_train['final_query'].parallel_apply(lambda x: get_top_cord_uids(x))
df_query_dev['bm25_topk'] = df_query_dev['final_query'].parallel_apply(lambda x: get_top_cord_uids(x))

Python(98061) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=1607), Label(value='0 / 1607'))), …

Python(98062) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98063) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98064) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98065) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98066) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98067) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98068) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98069) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98092) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=175), Label(value='0 / 175'))), HB…

Python(98093) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98094) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98095) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98096) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98097) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98098) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98099) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(98100) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the query set using the Mean Reciprocal Rank score (MRR@5).

In [100]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k = [1, 5, 10]):
    d_performance = {}
    for k in list_k:
        data["in_topx"] = data.apply(lambda x: (1/([i for i in x[col_pred][:k]].index(x[col_gold]) + 1) if x[col_gold] in [i for i in x[col_pred][:k]] else 0), axis=1)
        #performances.append(data["in_topx"].mean())
        d_performance[k] = data["in_topx"].mean()
    return d_performance

In [104]:
# Evaluate retrieved candidates using MRR@k
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25_topk')
results_dev = get_performance_mrr(df_query_dev, 'cord_uid', 'bm25_topk')

# Printed MRR@k results
print(f"Results on the train set: {results_train}")
print(f"Results on the dev set: {results_dev}")


Results on the train set: {1: 0.5258694468217536, 5: 0.581183381311756, 10: 0.581183381311756}
Results on the dev set: {1: 0.5185714285714286, 5: 0.5752261904761904, 10: 0.5752261904761904}


In [105]:
def results_to_markdown_table(train_results, dev_results):
    # Header
    table = "| Set   | Top-K | Score     |\n"
    table += "|--------|--------|------------|\n"

    # Train rows
    for k, score in train_results.items():
        table += f"| Train | {k}     | {score:.4f} |\n"

    # Dev rows
    for k, score in dev_results.items():
        table += f"| Dev   | {k}     | {score:.4f} |\n"

    return table

print(results_to_markdown_table(results_train, results_dev))

| Set   | Top-K | Score     |
|--------|--------|------------|
| Train | 1     | 0.5259 |
| Train | 5     | 0.5812 |
| Train | 10     | 0.5812 |
| Dev   | 1     | 0.5186 |
| Dev   | 5     | 0.5752 |
| Dev   | 10     | 0.5752 |



| Set   | Top-K | Score     |
|--------|--------|-----------|
| Train | 1     | 0.5221 |
| Train | 5     | 0.5747 |
| Train | 10    | 0.5747 |
| Dev   | 1     | 0.5186 |
| Dev   | 5     | 0.5740 |
| Dev   | 10    | 0.5740 |


# 4) Evaluating different spaCy models

In [80]:
!pip install -U scispacy

Python(96810) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Collecting spacy-transformers
  Downloading spacy_transformers-1.3.8-cp311-cp311-macosx_11_0_arm64.whl.metadata (7.0 kB)
Collecting transformers<4.50.0,>=3.4.0 (from spacy-transformers)
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.7 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers<4.50.0,>=3.4.0->spacy-transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers<4.50.0,>=3.4.0->spacy-transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers<4.50.0,>=3.4.0->spacy-transformers)
  Downloading safetensors-0.5.3-cp38-abi3-macosx_11_0_arm64.whl.metadata (3.8 kB)
Downloading spacy_transformers-1.3.8-cp311-cp311-macosx_11_0_arm64.whl (1

In [81]:
def compare_spacy_models(candidate_models):
    results = {}

    try:
        for model_name in candidate_models:
            print(f"Testing model: {model_name}")

            # Install only if needed
            try:
                if not spacy.util.is_package(model_name):
                    print(f"Installing {model_name}...")
                    spacy.cli.download(model_name)
                terms_extractor = spacy.load(model_name)
            except Exception as e:
                print(f"❌ {model_name} failed: {e}")
                continue

            # Process a sample with the new model
            df_query_sample = df_query_train.head(100).copy()
            df_query_sample['cleaned_tweet_text'], df_query_sample['entities_in_tweet_text'] = zip(
                *df_query_sample['tweet_text'].parallel_apply(
                    lambda x: preprocess_text(x, terms_extractor)
                )
            )

            # Update get_top_cord_uids for this test
            df_query_sample['final_query'] = df_query_sample['cleaned_tweet_text'] + ' ' + df_query_sample['entities_in_tweet_text']
            df_query_sample['bm25_topk'] = df_query_sample['final_query'].parallel_apply(lambda x: get_top_cord_uids(x))

            # Evaluate
            mrr_result = get_performance_mrr(df_query_sample, 'cord_uid', 'bm25_topk')
            results[model_name] = mrr_result
    except Exception as e:
        print(f"Failed: {e}")

    # Display results as a table
    df_results = pd.DataFrame(results).T
    return df_results

In [82]:
models = [
    "en_core_web_sm",
    "en_core_web_md",
    "en_core_web_lg",
    # "en_core_web_trf",  # problems with some dependencies
    "en_core_sci_md",
    # "en_core_sci_lg"  # the process is getting killed
]
compare_spacy_models(models)

Testing model: en_core_web_sm


Python(96945) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(96946) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96947) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96948) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96949) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96950) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96951) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96952) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96953) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96954) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(96955) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96956) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96957) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96958) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96959) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96960) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96961) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96962) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Testing model: en_core_web_md


Python(96968) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(96969) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96970) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96971) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96972) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96973) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96974) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96975) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96976) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96983) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(96984) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96985) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96986) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96987) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96988) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96989) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96990) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96991) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Testing model: en_core_web_lg


Python(96995) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(96996) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96997) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96998) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(96999) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97000) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97001) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97002) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97003) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97022) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(97023) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97024) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97025) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97026) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97027) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97028) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97029) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97030) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Testing model: en_core_web_trf
❌ en_core_web_trf failed: [E002] Can't find factory for 'curated_transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, en.lemmatizer
Testing model: en_core_sci_md


Python(97037) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(97038) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97039) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97040) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97041) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97042) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97043) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97044) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97045) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97474) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=13), Label(value='0 / 13'))), HBox…

Python(97475) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97476) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97477) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97478) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97479) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97480) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97482) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(97483) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


Unnamed: 0,1,5,10
en_core_web_sm,0.57,0.6215,0.6215
en_core_web_md,0.58,0.621667,0.621667
en_core_web_lg,0.58,0.6115,0.6115
en_core_sci_md,0.55,0.616167,0.616167


# 5) Exporting results to prepare the submission on Codalab

In [43]:
df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:5])

In [44]:
df_query_dev[['post_id', 'preds']].to_csv('predictions.tsv', index=None, sep='\t')