This notebook compares the rankings of systems (ROS) obtained with editorial relevance labels and with citation counts. The code below mainly uses the data from the "Scientific Abstracts" task at TREC Precision Medicine 2017.

**Download TREC Precision Medicine run files.** 

In [8]:
!wget -O trec-pm.tar.xz https://th-koeln.sciebo.de/s/JTTV4fxFmuCGMeY/download
!tar -xf trec-pm.tar.xz

--2022-11-14 09:11:59--  https://th-koeln.sciebo.de/s/JTTV4fxFmuCGMeY/download
Resolving th-koeln.sciebo.de (th-koeln.sciebo.de)... 128.176.1.2
Connecting to th-koeln.sciebo.de (th-koeln.sciebo.de)|128.176.1.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 224988232 (215M) [application/octet-stream]
Saving to: ‘trec-pm.tar.xz’


2022-11-14 09:12:12 (17.5 MB/s) - ‘trec-pm.tar.xz’ saved [224988232/224988232]


**The directory includes the qrels and all runs submitted to the "Scientific Abstracts" and "Clinical Trials" tracks at TREC PM 2017-19**

see also: https://trec.nist.gov/data/precmed.html

In [12]:
!ls trec-pm

trec-pm-2017-abstracts	trec-pm-2018-abstracts	trec-pm-2019-abstracts
trec-pm-2017-cds	trec-pm-2018-cds	trec-pm-2019-cds


**Download Dirk's citation and altmetric data.**

In [11]:
!wget -O bibliometric.tar.xz https://th-koeln.sciebo.de/s/BRolGxMzrCipoTT/download
!tar -xf bibliometric.tar.xz

--2022-11-14 09:13:54--  https://th-koeln.sciebo.de/s/BRolGxMzrCipoTT/download
Resolving th-koeln.sciebo.de (th-koeln.sciebo.de)... 128.176.1.2
Connecting to th-koeln.sciebo.de (th-koeln.sciebo.de)|128.176.1.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2054660 (2.0M) [application/octet-stream]
Saving to: ‘bibliometric.tar.xz’


2022-11-14 09:13:56 (1.82 MB/s) - ‘bibliometric.tar.xz’ saved [2054660/2054660]



**Make a qrels file from the citation data. The following code uses simple criteria to make multi-graded labels from the citation count and writes them into a file 'qrels.cite'.**
- 2: if the number of citations is at least twice the mean of all citations
- 0: if the number of citations is lower than the mean of all citations
- 1: the ones in between
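
As a quick illustration of this rule, here is a minimal sketch on made-up citation counts. The function name `grade_citations` and the toy numbers are hypothetical; the boundary handling (`>=` for grade 2, `<` for grade 0) follows the notebook's code cell.

```python
from statistics import mean

def grade_citations(counts):
    """Map raw citation counts to graded labels 0/1/2 relative to their mean."""
    thresh = mean(counts)
    grades = []
    for cnt in counts:
        if cnt >= 2 * thresh:
            grades.append(2)   # at least twice the mean
        elif cnt < thresh:
            grades.append(0)   # below the mean
        else:
            grades.append(1)   # between the mean and twice the mean
    return grades

toy_counts = [0, 2, 10, 12, 26]   # made-up numbers, mean = 10
print(grade_citations(toy_counts))
```

Because citation counts are typically skewed, a mean-based threshold may assign grade 0 to most documents; that is a property of the rule, not of this sketch.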

In [22]:
import pandas as pd

df = pd.read_csv('STI_Ergebnisse_final.txt', sep='\t')

# Keep only rows with a citation count (TC) that belong to the 2017 topics.
_df = df[df['TC'].notna()]
_df = _df[_df['TOPIC'].str.contains('2017', regex=False)]
_df = _df[['TOPIC', 'PUBMED_ID', 'TC']]

# Threshold: mean citation count over all rows with a TC value (not only 2017).
thresh = df[df['TC'].notna()]['TC'].mean()

with open('qrels.cite', 'w') as f_out:
    for _, row in _df.iterrows():
        # The topic number is the part after the hyphen in the TOPIC field.
        topic = row['TOPIC'].split('-')[1]
        pubmed_id = row['PUBMED_ID']
        citation_cnt = row['TC']

        if citation_cnt >= 2 * thresh:
            rel = 2
        elif citation_cnt < thresh:
            rel = 0
        else:
            rel = 1

        # TREC qrels format: topic, iteration, document id, relevance.
        f_out.write(f'{topic} 0 {pubmed_id} {rel}\n')



**Extract the run files and write them into a new directory.**

In [16]:
import os
import gzip

def extract_runs(dir_in, dir_out):
    """Decompress every .gz run file under dir_in into dir_out."""
    os.makedirs(dir_out, exist_ok=True)

    for root, dirs, files in os.walk(dir_in):
        for file in files:
            if file.endswith('.gz'):
                # The run name is the second dot-separated part of the file name.
                run_name = file.split('.')[1]
                with gzip.open(os.path.join(root, file), 'rb') as f_in:
                    file_content = f_in.read()
                with open(os.path.join(dir_out, run_name), 'wb') as f_out:
                    f_out.write(file_content)

DIR_IN = 'trec-pm/trec-pm-2017-abstracts'
DIR_OUT = 'runs/trec-pm-2017-abstracts'

extract_runs(DIR_IN, DIR_OUT)
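
The extraction step takes the second dot-separated part of each file name as the run name. The sketch below makes that assumption explicit and fails loudly on names that do not fit. The helper `run_name_from_filename` and the example file name `input.UTDHLTAF.gz` are hypothetical; the actual naming scheme in the archive is assumed, not confirmed.

```python
def run_name_from_filename(filename):
    """Return the second dot-separated component of a run file name."""
    parts = filename.split('.')
    # Assumed pattern: <prefix>.<run name>.gz, i.e. at least three components.
    if len(parts) < 3 or parts[-1] != 'gz':
        raise ValueError(f'unexpected run file name: {filename!r}')
    return parts[1]

print(run_name_from_filename('input.UTDHLTAF.gz'))
```

A check like this would help if a run name itself contained a dot, in which case `split('.')[1]` would silently truncate it.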

**Install the super-fast evaluation toolkit ranx, which implements some trec_eval measures with the help of Python and numba.**

see also: https://github.com/AmenRa/ranx or https://amenra.github.io/ranx/

In [18]:
!pip install ranx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ranx
  Downloading ranx-0.3.3-py3-none-any.whl (95 kB)
Collecting orjson
  Downloading orjson-3.8.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (272 kB)
Collecting lz4
  Downloading lz4-4.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting ir-datasets
  Downloading ir_datasets-0.5.4-py3-none-any.whl (311 kB)
Collecting cbor2
  Downloading cbor2-5.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
Collecting rich
  Downloading rich-12.6.0-py3-none-any.whl (237 kB)

**Make a reference ranking of systems (ROS) from the qrels of the "Scientific Abstracts" task at TREC PM 2017.**

The first run of ranx takes a while because the code has to be compiled. Later executions run much faster.
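
For orientation, the measure used in the next cell, nDCG@5, can be sketched in plain Python for a single topic. This is only an illustration and not ranx's implementation: it assumes linear gains with a log2 rank discount and treats unjudged documents as non-relevant. The function names and the toy judgments are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_doc_ids, judged, k=5):
    """nDCG@k for one topic; documents missing from `judged` count as gain 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judged.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

judged = {'d1': 2, 'd2': 1, 'd3': 0}               # toy relevance labels
print(ndcg_at_k(['d1', 'd2', 'd3'], judged))       # ideal ordering
print(ndcg_at_k(['d3', 'd2', 'd1'], judged))       # reversed ordering
```

The score reported per run is then the mean of such per-topic values, which is why only the top five documents of each ranking influence the ROS.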

In [23]:
from ranx import Qrels, Run, evaluate, compare

DIR_RUN = DIR_OUT
PATH_QRELS = "trec-pm/trec-pm-2017-abstracts/qrels-final-abstracts.txt"

qrels = Qrels.from_file(PATH_QRELS, kind="trec")

# Score every run with nDCG@5 against the editorial qrels.
ros_ref = {}

for root, dirs, files in os.walk(DIR_RUN):
    for file in files:
        run = Run.from_file(os.path.join(root, file), kind="trec")
        score = evaluate(qrels, run, "ndcg@5")
        ros_ref[file] = score

# Sort the systems by descending score to obtain the ranking of systems.
ros_ref = dict(sorted(ros_ref.items(), key=lambda item: item[1], reverse=True))
ros_ref

{'UTDHLTAF': 0.6811529021982724,
 'UTDHLTFF': 0.6725106332676734,
 'UD_GU_SA_4': 0.6181395015199264,
 'mugpubboost': 0.6154718278121372,
 'UD_GU_SA_5': 0.6123358953072634,
 'mugpubbase': 0.6044095495078302,
 'UD_GU_SA_3': 0.6028432779922607,
 'UTDHLTSF': 0.5982230564812898,
 'UD_GU_SA_2': 0.5972838729989333,
 'UTDHLTJQ': 0.5902978038794592,
 'UD_GU_SA_1': 0.5788208864645199,
 'mRun3MRF': 0.5581057166719795,
 'SIBTMlit4': 0.5475789306164868,
 'UDInfoPMSA2': 0.539927045349584,
 'mugpubdiseas': 0.530256562984943,
 'SIBTMlit3': 0.5252097783297163,
 'SIBTMlit2': 0.521742885780416,
 'mRun1Bsl': 0.5150557133717655,
 'SIBTMlit1': 0.5150323687339575,
 'UKY_AGG': 0.5095888364139758,
 'Textual': 0.5053800833840775,
 'pms_run5_abs': 0.4984219381826194,
 'UKY_CJT': 0.49711140771395873,
 'UNTIIALQ': 0.49092363019842344,
 'Semantic': 0.48638655196414843,
 'Broad': 0.48638655196414843,
 'mRun2BslOth': 0.4821345214151435,
 'UKY_MAN': 0.48038558486828153,
 'UKY_BASE': 0.480201489284041,
 'SIBTMlit5': 0.

**Make the corresponding ROS based on citation data.**

In [20]:
PATH_QRELS_CITE = "qrels.cite"

qrels = Qrels.from_file(PATH_QRELS_CITE, kind="trec")

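# Score the same runs with nDCG@5 against the citation-based qrels.
ros_cite = {}

for root, dirs, files in os.walk(DIR_RUN):
    for file in files:
        run = Run.from_file(os.path.join(root, file), kind="trec")
        score = evaluate(qrels, run, "ndcg@5")
        ros_cite[file] = score

# Sort the systems by descending score to obtain the ranking of systems.
ros_cite = dict(sorted(ros_cite.items(), key=lambda item: item[1], reverse=True))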
ros_cite

{'aCSIROmedPCB': 0.6493167976727501,
 'aCSIROmedMGB': 0.6053074946640554,
 'SIBTMlit5': 0.38383884132283996,
 'aCSIROmedAll': 0.2974447259065918,
 'UCASSEM2a': 0.2797739040676953,
 'eth_a_ws_q': 0.27821257571120106,
 'POZabsBB2GRn': 0.2721619174306273,
 'pms_run1': 0.263247431177043,
 'UCASSEM3a': 0.2593835280046743,
 'UCASSEM1a': 0.25852568462190545,
 'UCASSEMUMLSa': 0.25626891027860293,
 'KISTI04': 0.2554851851854662,
 'UTDHLTAF': 0.25231367471768124,
 'mayonlppm4': 0.25099088482202425,
 'KISTI02': 0.24993303404712683,
 'cbnuSA3': 0.2491767515698897,
 'UCASBASEa': 0.24749856536608092,
 'medline2': 0.24681841632105544,
 'medline3': 0.2447797577516258,
 'SIBTMlit4': 0.2435636048189697,
 'UTDHLTFF': 0.2421077773003898,
 'eth_a_nn': 0.24044592999434128,
 'MedIER_sa3': 0.24043250052198406,
 'KISTI03': 0.23706330841628834,
 'mayonlppm1': 0.23702363358982811,
 'KISTI05': 0.23012841372568882,
 'mayonlppm2': 0.22724502681787664,
 'kkseabs1': 0.2232969474100465,
 'pms_run5_abs': 0.223113278915

**Determine Kendall's tau between the ROS.**

In [21]:
from scipy import stats

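# Caveat: both arguments are lists of system names (in rank order), not scores,
# so tau appears to be computed over the lexicographic order of those names.
tau, p_value = stats.kendalltau(list(ros_ref.keys()), list(ros_cite.keys()))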
tau

0.10529032258064516
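
An alternative worth considering is to align both ROS on the system names and correlate the scores themselves. The sketch below does this with a small hand-written Kendall's tau-b so that it runs without extra dependencies. The function `kendall_tau_b`, the system names, and the scores are made up for illustration; it is not the computation that produced the value above.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long score lists (O(n^2) pair count)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both lists: ignored
        elif dx == 0:
            ties_x += 1               # tied only in x
        elif dy == 0:
            ties_y += 1               # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float('nan')

# Toy ROS dictionaries (system name -> score); the values are invented.
toy_ref = {'sysA': 0.68, 'sysB': 0.61, 'sysC': 0.55, 'sysD': 0.48}
toy_cite = {'sysA': 0.30, 'sysC': 0.25, 'sysD': 0.24, 'sysB': 0.22}

# Align on system names so that position i refers to the same system in both lists.
systems = sorted(toy_ref)
toy_tau = kendall_tau_b([toy_ref[s] for s in systems],
                        [toy_cite[s] for s in systems])
print(toy_tau)
```

With the lists aligned this way, `stats.kendalltau` could be given the two score lists directly instead of the hand-written function.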