Tools for the TREC CAsT benchmark

Tools and scripts for TREC CAsT

Topic file processing

Code for processing the topic files in Java (all three formats) is available in: src/main/java

A Maven file is provided for building the code.

Convert various collections to TREC Web format.

To run the parser for CAR corpus V2:
Setup trec car tools from

To run the parser for Washington Post: Run: python DATAPATH OUTPUT_DIRECTORY
Here DATAPATH is the directory containing the JSON files of the Washington Post data, and OUTPUT_DIRECTORY is the directory where the converted files will be stored.
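As a rough illustration of the kind of work this step involves, the sketch below splits one Washington Post JSON record into paragraph-level passages. It is not the repository's script. It assumes each record has an "id" field and a "contents" list whose paragraph blocks are dicts with "type" equal to "sanitized_html" and the text under "content". The function name extract_paragraphs and the "WAPO_<id>-<n>" passage ID scheme are also assumptions made for this example.

```python
import json
import re

# Crude tag stripper; good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def extract_paragraphs(json_line):
    """Yield (passage_id, text) pairs from one JSON article record.

    Assumed record layout (not confirmed by this README):
      {"id": ..., "contents": [{"type": "sanitized_html", "content": ...}, ...]}
    """
    article = json.loads(json_line)
    doc_id = article["id"]
    index = 0
    for block in article.get("contents", []):
        # Skip missing blocks and anything that is not paragraph HTML.
        if not isinstance(block, dict) or block.get("type") != "sanitized_html":
            continue
        text = TAG_RE.sub("", block.get("content") or "").strip()
        if text:
            index += 1
            # Hypothetical ID scheme: WAPO_<article id>-<paragraph number>.
            yield "WAPO_{}-{}".format(doc_id, index), text
```

Each yielded pair could then be wrapped in a TREC Web document and written under OUTPUT_DIRECTORY.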

To run the parser for MSMARCO: Install tqdm if you don't already have it
Run: python path_to_collection.tsv OUTPUT_DIRECTORY DUPLICATE_FILE
Here path_to_collection.tsv is the tab-separated MARCO data, OUTPUT_DIRECTORY is the directory where the converted files will be stored, and DUPLICATE_FILE is the file containing the list of duplicate documents identified during deduplication.
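The conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's script: it assumes each TSV line is "passage_id<TAB>passage_text", that each line of the duplicate file looks like "kept_id:dup1,dup2", and that a TREC Web document uses the DOC/DOCNO/DOCHDR/HTML layout shown. The function names and the "MARCO_" ID prefix are choices made for this example.

```python
def load_duplicate_ids(lines):
    """Collect the IDs to skip.

    Assumed line format: 'kept_id:dup1,dup2' (the part after ':' may be empty).
    """
    duplicates = set()
    for line in lines:
        _, _, dups = line.strip().partition(":")
        duplicates.update(d for d in dups.split(",") if d)
    return duplicates


def to_trec_web(docno, text):
    """Wrap one passage in an assumed TREC Web document layout."""
    return (
        "<DOC>\n<DOCNO>{0}</DOCNO>\n<DOCHDR>\n\n</DOCHDR>\n"
        "<HTML>\n<BODY>\n{1}\n</BODY>\n</HTML>\n</DOC>\n"
    ).format(docno, text)


def convert(tsv_lines, duplicates):
    """Yield one TREC Web document per non-duplicate passage."""
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip malformed lines that lack a tab separator
        pid, passage = parts
        if pid in duplicates:
            continue  # passage is listed as a duplicate of another one
        yield to_trec_web("MARCO_" + pid, passage)
```

In practice the two inputs would be open file handles for path_to_collection.tsv and DUPLICATE_FILE, and the yielded documents would be written in batches to files under OUTPUT_DIRECTORY.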

NOTE: The scripts have been tested with Python 3.6 (but anything >= 3.5 should work).
