Information Retrieval evaluation tool for pairwise preference judgements.

An evaluation tool for pairwise preference judgements.

Evaluation measures and file formats described in the following:

  B. Carterette, P.N. Bennett, O. Chapelle (2008). A Test Collection of
  Preference Judgments. In Proceedings of the SIGIR 2008 Beyond Binary
  Relevance: Preferences, Diversity, and Set-Level Judgments Workshop.
  Singapore. July, 2008.

There is one notable exception to the evaluation measures calculated:

APpref as computed by this tool is somewhat different from what is described in
the above paper. Here, we average ppref@k values across ALL the positions where
a *preferred* document occurs, that is, a document that has ever been preferred
to any other document for this query. Note that these are exactly the positions
where rpref@k could change, although it does not necessarily do so.
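
As a rough illustration of which positions enter the average, here is a minimal
sketch. It is not the tool's own code. The function names are hypothetical, the
ppref@k computation itself is left as a caller-supplied function (see the paper
for its definition), and the way unretrieved preferred documents are mapped to
positions past the end of the list (one extra position each, as covered in the
next paragraph) is an assumption.

```python
def preferred_documents(preferences):
    """Collect documents that were ever preferred to another document.

    `preferences` is an iterable of (source_doc, target_doc, preference)
    tuples for a single query, using the -1 / 1 codes of the preferences file.
    """
    preferred = set()
    for source, target, pref in preferences:
        if pref == -1:    # SOURCE_DOC preferred to TARGET_DOC
            preferred.add(source)
        elif pref == 1:   # TARGET_DOC preferred to SOURCE_DOC
            preferred.add(target)
    return preferred


def averaging_positions(ranking, preferred):
    """Return the 1-based positions at which ppref@k would be sampled."""
    # Positions inside the ranked list that hold a preferred document.
    positions = [k for k, doc in enumerate(ranking, start=1) if doc in preferred]
    # Assumption: each unretrieved preferred document adds one "position"
    # beyond the end of the ranked list.
    missing = len(preferred - set(ranking))
    positions.extend(range(len(ranking) + 1, len(ranking) + 1 + missing))
    return positions


def average_ppref(ppref_at_k, positions):
    """Average a caller-supplied ppref@k function over the given positions."""
    if not positions:
        return 0.0
    return sum(ppref_at_k(k) for k in positions) / len(positions)
```

For example, with preferences d1 over d2, d2 over d3 and d4 over d1, and a
ranking of d2, d5, d1, the preferred documents are d1, d2 and d4, so the sketch
samples positions 1 and 3 plus one extra position (4) for the unretrieved d4.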

We also include "positions" beyond the end of the rank list, in order to account
for preferred documents that were not retrieved by the system. This is in
keeping with standard MAP calculations that average over all judged relevant
documents whether or not they were retrieved.

[-q] [-v] [-i] PREFERENCES_FILE RESULTS_FILE
  Evaluates the results against the provided preferences.
  PREFERENCES_FILE should be in a 4-column format:
    QID == query ID
    SOURCE_DOC, TARGET_DOC are document identifiers
    PREFERENCE (-2, -1, 0, 1, 2) indicates the preference value.
    PREFERENCE == -2 -- indicates SOURCE_DOC is BAD, TARGET_DOC should be "NA"
                  -1 -- indicates SOURCE_DOC preferred to TARGET_DOC
                   0 -- indicates SOURCE_DOC and TARGET_DOC are duplicates
                   1 -- indicates TARGET_DOC preferred to SOURCE_DOC
                   2 -- indicates TARGET_DOC is BAD, SOURCE_DOC should be "NA"
  RESULTS_FILE is in the standard 5-column TREC format
    the "Q0", RANK & COMMENT columns are ignored
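
To make the preferences format concrete, the following is a minimal sketch of
a reader for it. It is not taken from the tool. The function name is
hypothetical, and the sketch assumes whitespace-separated columns in the order
QID, SOURCE_DOC, TARGET_DOC, PREFERENCE.

```python
from collections import defaultdict

VALID_PREFERENCES = {-2, -1, 0, 1, 2}


def parse_preferences(lines):
    """Group (source_doc, target_doc, preference) tuples by query ID.

    Assumes whitespace-separated columns in the order
    QID SOURCE_DOC TARGET_DOC PREFERENCE; blank lines are skipped.
    """
    by_query = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        qid, source, target, raw_pref = fields
        pref = int(raw_pref)
        if pref not in VALID_PREFERENCES:
            raise ValueError(f"line {line_no}: invalid preference {pref}")
        by_query[qid].append((source, target, pref))
    return dict(by_query)
```

An open file can be passed directly, for example
`parse_preferences(open("preferences.txt"))`, since the function only iterates
over lines.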

  -q -- produces per-query output
  -v -- produces (very, very) verbose output
  -i -- do not assume preferences are transitive. Transitivity is assumed by
        default.

  Runs the evaluation script above on sample data, comparing the output to
  the expected output.

  small_results small_preferences small_expected_results.txt
    - toy examples & expected results
  results.txt preferences.txt expected_results.txt
    - more complicated examples & expected results