
Add ability to compute trec_eval metrics directly on in-memory data structures #13

Open
lintool opened this issue Nov 8, 2019 · 4 comments

lintool commented Nov 8, 2019

From your paper:

[screenshot of the relevant passage from the paper]

This is now pretty easy... with pyserini on PyPI.

But the real point of this issue is this: currently, as I understand it, the input to evaluation is a file. Can we make it so that we can compute evaluation metrics directly from in-memory data structures?

The question is: what should the in-memory data structures look like? A pandas DataFrame with the standard TREC output format columns? A dictionary supporting random access by qid? Something else?
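To make the options concrete, here is a rough sketch of the two candidate structures (the column names just mirror the standard TREC run format; nothing here is tied to any particular library's API):

```python
import pandas as pd

# Option 1: a DataFrame whose columns mirror the standard TREC run format
# (query id, the literal "Q0", doc id, rank, score, run tag).
run_df = pd.DataFrame(
    [
        ("301", "Q0", "FBIS3-10082", 1, 14.89, "my-system"),
        ("301", "Q0", "FBIS3-10169", 2, 14.21, "my-system"),
    ],
    columns=["query", "q0", "docid", "rank", "score", "system"],
)

# Option 2: a dictionary keyed by qid for random access,
# mapping each docid to its retrieval score.
run_dict = {
    "301": {"FBIS3-10082": 14.89, "FBIS3-10169": 14.21},
}
```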

If we can converge on something, I can even try to volunteer some of my students to contribute to this effort... :)

@joaopalotti
Owner

Hi Jimmy,

Thanks for your interest in this project.

Earlier versions of trectools did use trec_eval (calling the program externally and parsing its output), but we have since re-implemented many (unfortunately not yet all) of the trec_eval evaluation metrics. (We are aiming to have them all implemented, along with new features, through an undergraduate student of Guido's whom we will try to recruit.)

Please have a look at this module.
It calculates everything in memory using pandas DataFrames with the standard TREC output columns. I am open to discussing modifications to that if you want. At the time of this implementation, we were not concerned with efficiency, so I believe a lot could be done to optimize it (e.g., profiling to find bottlenecks).

For many of the implemented metrics, we have a flag trec_eval=[True/False] (default True) that, when set to True, mimics trec_eval in terms of (1) re-ranking the input based on the 'score' column rather than on the column with the ranking position (which is what you get instead if you set the flag to False); and (2) using the same implementation as trec_eval rather than alternative implementations, as in the case of nDCG (i.e., different gain functions).
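For concreteness, here is a rough sketch of how this looks from the caller's side; the method name (get_ndcg) and parameters (depth, trec_eval) are my reading of the module and may not match the code exactly:

```python
from trectools import TrecRun, TrecQrel, TrecEval

run = TrecRun("my_run.txt")        # standard TREC run file
qrels = TrecQrel("my_qrels.txt")   # standard qrels file
te = TrecEval(run, qrels)

# trec_eval=True (the default): re-rank by the 'score' column and use
# trec_eval's own formula, so the result should match the C tool.
ndcg_compatible = te.get_ndcg(depth=10, trec_eval=True)

# trec_eval=False: keep the ranking positions as given and use the
# alternative implementation (e.g., a different nDCG gain function).
ndcg_alternative = te.get_ndcg(depth=10, trec_eval=False)
```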

However, there are many ways to make this tool more useful for the community (e.g., result visualization/comparison, a web interface, more metrics, etc.).

One straightforward thing that we look forward to having is more systematic unit tests.
Currently, the unit tests cover only a small fraction of the evaluation metrics; see this unit test, for example.

We are currently short of time to implement many of our extensions, so it would be incredible to see people interested in contributing to it -- and indeed that is why we are looking to have a new undergraduate/master's student work on this. Please let us know your ideas on how to proceed.
Talking with Guido, he thought you might be interested in integrating trectools into the jig used for OSIRRC 2019.


lintool commented Nov 11, 2019

Hi @joaopalotti - Looking at your code, TrecEval takes a TrecRun, which reads an external file into a DataFrame. So if we generate a DataFrame that corresponds to the same on-disk format, we should be able to feed it into your code.

However, IMO, we should try to build bindings directly between the C trec_eval code and Python via Cython. The rationale is that trec_eval has a bunch of corner cases (see, for example, [1]), and it would just be easier to wrap the original C code so that we get exactly the same behavior.

[1] The Impact of Score Ties on Repeatability in Document Ranking

@joaopalotti
Owner

That is correct: TrecEval requires a TrecRun object (and also a TrecQrel) as input to be able to calculate all evaluation metrics. However, you do not need to initialize your TrecRun object from a file on disk; you could populate TrecRun.run_data from any in-memory DataFrame following the usual TREC run format (clearly we could add a function to do this instead of directly accessing the run_data property).
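A minimal sketch of that route, assuming the run_data columns mirror the on-disk run format and that TrecRun can be constructed without a filename:

```python
import pandas as pd
from trectools import TrecRun, TrecQrel, TrecEval

# Build a run directly in memory instead of reading it from disk.
df = pd.DataFrame(
    [
        ("301", "Q0", "FBIS3-10082", 1, 14.89, "my-system"),
        ("301", "Q0", "FBIS3-10169", 2, 14.21, "my-system"),
    ],
    columns=["query", "q0", "docid", "rank", "score", "system"],
)

run = TrecRun()     # no filename: start from an empty run
run.run_data = df   # populate it with the in-memory DataFrame

qrels = TrecQrel("my_qrels.txt")  # qrels still read from disk here
print(TrecEval(run, qrels).get_map())
```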

Note that an effort in the direction you are describing was already made here: https://github.com/cvangysel/pytrec_eval
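If I remember its README correctly, pytrec_eval works on plain dictionaries keyed by query id, along these lines:

```python
import pytrec_eval

qrel = {"q1": {"d1": 1, "d2": 0}}      # qid -> docid -> relevance judgment
run = {"q1": {"d1": 1.5, "d2": 0.3}}   # qid -> docid -> retrieval score

evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
print(evaluator.evaluate(run))         # per-query map and ndcg values
```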

However, we have decided to keep our own implementation as a way to (1) have finer control over aspects such as tie-breaking, formula variations, etc.; and (2) quickly integrate evaluation metrics that we use frequently but that are not part of trec_eval (RBP, for instance).

Our idea was to get exactly the same results when trec_eval=True, and to be able to play with it to easily obtain other behaviours. To increase our confidence that we are actually getting exactly the same results, we need a strong set of unit tests. :)
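A minimal sketch of such a test, assuming the TrecEval API shown above; the fixture paths are hypothetical and the expected value is a placeholder to be replaced with the number reported by the trec_eval binary on the same run/qrels pair:

```python
import pytest
from trectools import TrecRun, TrecQrel, TrecEval

def test_map_matches_trec_eval():
    # Hypothetical fixture files checked into the test data directory.
    run = TrecRun("tests/data/sample_run.txt")
    qrels = TrecQrel("tests/data/sample_qrels.txt")

    # Placeholder: fill in with the MAP value reported by the trec_eval
    # binary for the same run/qrels pair.
    expected_map = 0.2345
    assert TrecEval(run, qrels).get_map() == pytest.approx(expected_map, abs=1e-4)
```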

@lintool
Copy link
Author

lintool commented Nov 11, 2019

I'm happy to provide all the runs in Anserini for your unit tests!
