A simple toolkit to process TREC files in Python.
Switch branches/tags
Nothing to show
Clone or download

README.md

##TREC TOOLS

A simple toolkit to process TREC files. If you do not know what TREC is, you surely do not need this package.

##Installing

pip install trectools

Background

The aim of this module is to facilitate typical procedures used when analysing data from a TREC/CLEF/NTCIR campaign. The main object in TREC campaign is participant retrieval system. A retrieval system is takes as input some information need represented by a query and generates a list of documents that are relevant for that query. This information is represented in a TREC campaign as a participant run, which is a file with the following structure:

qid Q0 docno rank score tag

where:

  • qid is the query number
  • Q0 is the literal Q0
  • docno is the id of a document returned for qid
  • rank (1-999) is the rank of this response for this qid
  • score is a system-dependent indication of the quality of the response
  • tag is the identifier for the system

Example:
1 Q0 nhslo3844_12_012186 1 1.73315273652 mySystem
1 Q0 nhslo1393_12_003292 2 1.72581054377 mySystem
1 Q0 nhslo3844_12_002212 3 1.72522727817 mySystem
1 Q0 nhslo3844_12_012182 4 1.72522727817 mySystem
1 Q0 nhslo1393_12_003296 5 1.71374426875 mySystem

Once a campaign ends, the evaluation phase starts. Usually, it is impossible to judge every document retrieved by every participant run for every query. There is a huge cost, both in terms of money and time, to make judgements. Many strategies have been proposed to select which documents to judge. Without going into many details, a pool of documents has to be created. Once documents in that pool are judged with respect to a query, a file is created containing all these judgements. This file is usually called 'qrel' and contains lines like this:

qid 0 docno relevance

where:

  • qid is the query number
  • 0 is the literal 0
  • docno is the id of a document in your collection
  • relevance is how relevant is docno for qid

Example:
1 0 aldf.1864_12_000027 1
1 0 aller1867_12_000032 2
1 0 aller1868_12_000012 0
1 0 aller1871_12_000640 1
1 0 arthr0949_12_000945 0
1 0 arthr0949_12_000974 1

Finally, the information retrieval community uses some evaluation metric to quantify how good a participant system is. Many of common metrics, such as precision@N, mean average precision, bpref and others, are implemented in a tool called [trec_eval] (http://trec.nist.gov/trec_eval/). Although trec_eval lacks many other important measures (e.g., nDCG or RBP), it provides a consistent format for system result:

label qid value

where:

  • label is any string, usually representing a metric
  • qid is the query number or 'all' to represent a aggregate value
  • value is numeral result of a metric

Example: num_rel_ret 7 77 map 7 0.4653 P_10 9 0.9000 num_rel_ret all 1180 map all 0.1323 gm_map all 0.0504

The three main modules found in this package are inspired by the main files created in a TREC campaign: a participant run, a qrel e a result file: TrecRun, TrecQrel, TrecRes. Also, there is a 'misc' module to implement many common operations involving one or more module (such as comparing statistical significance of different runs). See the section below for some examples.

Code Examples

> from trectools import TrecRun, TrecQrel, TrecRes, misc

> myRun = TrecRun("~/mysystem.run")
> myRun.topics()
{1,2,3,4,5,6,7}

> myRun.get_top_documents(topic=1,n=2)
['nhslo3844_12_012186', 'nhslo1393_12_003292']

> myQrel = TrecQrel("~/assessor.qrel")
> myQrel.describe()
count    2076.000000
mean        0.268786
std         0.575825
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         2.000000
> myQrel.get_number_of(1)
278

> myQrel.get_number_of(2)
140

> myQrel.check_agreement(myQrel)
1.0

> myRes = myRun.evaluate_run(qrel)
> myRes.get_result(metric="P_10")
0.8700

> myRes.get_results_for_metric("P_10")
{1:0.9000, 2:0.8000, ...} 

> myRun2 = TrecRun("~/mysystem2.run")

> myRes2 = myRun2.evaluate_run(qrel)
> myRes.compare_with(myRes2, metric="map")
Ttest_indResult(statistic=1.2224721254608264, pvalue=0.22486892703278308)

> list_of_results = [myRes, myRes2]
> misc.sort_systems_by(list_of_results, "P_10")
[(0.8700, 'myRes1'), (0.8300, 'myRes2')]

> misc.get_correlation( misc.sort_systems_by(list_of_results, "P_10"), misc.sort_systems_by(list_of_results, "map") )
KendalltauResult(correlation=0.99999999999999989, pvalue=0.11718509694604401)

> misc.get_correlation( misc.sort_systems_by(list_of_results, "P_10"), misc.sort_systems_by(list_of_results, "map"), correlation="tauap" )
1.0