An evaluation tool for pairwise preference judgements.

The evaluation measures and file formats are described in:

  B. Carterette, P.N. Bennett, O. Chapelle (2008). A Test Collection of
  Preference Judgments. In Proceedings of the SIGIR 2008 Beyond Binary
  Relevance: Preferences, Diversity, and Set-Level Judgments Workshop.
  Singapore. July, 2008.
  http://research.microsoft.com/en-us/um/people/pauben/papers/sigir-2008-bbr-data-preference-overview.pdf

There is one notable exception to the evaluation measures calculated:
APpref as computed by this tool differs somewhat from what is described
in the above paper. Here, we average ppref@k values across ALL the
positions where a *preferred* document occurs, that is, a document that
has ever been preferred to any other document for this query. Note that
these are exactly the positions where rpref@k could change, although it
does not necessarily do so. We also include "positions" beyond the end
of the ranked list, in order to account for preferred documents that
were not retrieved by the system. This is in keeping with standard MAP
calculations, which average over all judged relevant documents whether
or not they were retrieved.

pairwise_pref_eval.py [-q] [-v] [-i] PREFERENCES_FILE RESULTS_FILE

  Evaluates the results against the provided preferences.

  PREFERENCES_FILE should be in a 4-column format:

    QID SOURCE_DOC TARGET_DOC PREFERENCE

  where
    QID == query ID
    SOURCE_DOC, TARGET_DOC are document identifiers
    PREFERENCE (-2, -1, 0, 1, 2) indicates the preference value

  PREFERENCE ==
    -2 -- indicates SOURCE_DOC is BAD, TARGET_DOC should be "NA"
    -1 -- indicates SOURCE_DOC preferred to TARGET_DOC
     0 -- indicates SOURCE_DOC and TARGET_DOC are duplicates
     1 -- indicates TARGET_DOC preferred to SOURCE_DOC
     2 -- indicates TARGET_DOC is BAD, SOURCE_DOC should be "NA"

  RESULTS_FILE is in the standard 5-column TREC format:

    QID "Q0" DOC RANK SCORE [COMMENT+]

  Note: the "Q0", RANK & COMMENT columns are ignored.

  -q -- produces per-query output
  -v -- produces (very, very) verbose output
  -i -- do not assume preferences are transitive.
        Transitivity is assumed by default.

test_eval.sh

  Runs the evaluation script above on sample data, comparing the output
  to the expected output.

test_data/

  small_results
  small_preferences
  small_expected_results.txt
    - toy examples & expected results

  results.txt
  preferences.txt
  expected_results.txt
    - more complicated examples & expected results
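
For readers preparing their own input files, the two formats described above can be read with a few lines of Python. This is only an illustrative sketch, not the code in pairwise_pref_eval.py: the function names parse_preferences and parse_results are hypothetical, and because the RANK column is ignored, the sketch assumes that documents are ordered by descending SCORE, with ties kept in file order.

```python
from collections import defaultdict


def parse_preferences(lines):
    """Read 4-column lines: QID SOURCE_DOC TARGET_DOC PREFERENCE.

    Hypothetical helper for illustration only. Returns three dicts keyed
    by QID: ordered (preferred, other) pairs, duplicate pairs, BAD docs.
    """
    prefs = defaultdict(list)
    dups = defaultdict(list)
    bad = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError("expected 4 columns: %r" % line)
        qid, source, target, value = fields
        value = int(value)
        if value == -1:      # SOURCE_DOC preferred to TARGET_DOC
            prefs[qid].append((source, target))
        elif value == 1:     # TARGET_DOC preferred to SOURCE_DOC
            prefs[qid].append((target, source))
        elif value == 0:     # the two documents are duplicates
            dups[qid].append((source, target))
        elif value == -2:    # SOURCE_DOC is BAD (TARGET_DOC is "NA")
            bad[qid].add(source)
        elif value == 2:     # TARGET_DOC is BAD (SOURCE_DOC is "NA")
            bad[qid].add(target)
        else:
            raise ValueError("unknown preference value: %r" % line)
    return prefs, dups, bad


def parse_results(lines):
    """Read lines of the form: QID "Q0" DOC RANK SCORE [COMMENT+].

    "Q0", RANK and any COMMENT columns are ignored. Assumption: the
    ranking is taken from SCORE, highest first; ties keep file order.
    """
    scored = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 5:
            raise ValueError("expected at least 5 columns: %r" % line)
        qid, _q0, doc, _rank, score = fields[:5]
        scored[qid].append((float(score), doc))
    # sorted() is stable, so equal scores stay in their original order
    return {
        qid: [doc for _score, doc in sorted(items, key=lambda x: -x[0])]
        for qid, items in scored.items()
    }
```

Both functions accept any iterable of lines, so an open file object can be passed directly, for example parse_preferences(open("test_data/preferences.txt")).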