# Tools and scripts for TREC CAsT

## Topic file processing

Java code for processing the topic files (all three formats) is available in `src/main/java`.

A Maven file (`pom.xml`) is provided for building the code.
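
For example, with a standard Maven setup (the exact goal may vary depending on the POM configuration):

```
mvn clean package
```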

## Convert various collections to TREC Web format

To run the parser for the CAR corpus V2:

1. Set up trec-car-tools from https://github.com/TREMA-UNH/trec-car-tools
2. Run:

```
python car_trecweb.py PATH_TO_CBORFILE OUTPUT_DIRECTORY
```
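
As a concrete example (the file and directory names here are only illustrative):

```
python car_trecweb.py ./paragraphCorpus.cbor ./car_trecweb_output
```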

To run the parser for the Washington Post collection, run:

```
python wapo_trecweb.py DATAPATH OUTPUT_DIRECTORY
```

Here, DATAPATH is the directory containing the JSON files of the Washington Post data, and OUTPUT_DIRECTORY is the directory where you want to store the converted files.
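
As a concrete example (paths are illustrative only):

```
python wapo_trecweb.py ./WashingtonPost/data ./wapo_trecweb_output
```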

To run the parser for MSMARCO:

1. Install tqdm if you don't already have it.
2. Run:

```
python marco_trecweb.py path_to_collection.tsv OUTPUT_DIRECTORY DUPLICATE_FILE
```

Here, path_to_collection.tsv is the tab-separated MARCO data, OUTPUT_DIRECTORY is the directory where you want to store the converted files, and DUPLICATE_FILE is the file containing the list of deduplicated documents.
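
As a concrete example (file names are illustrative only):

```
python marco_trecweb.py ./collection.tsv ./msmarco_trecweb_output ./duplicate_file.txt
```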

NOTE: The scripts have been tested with Python 3.6 (but anything >= 3.5 should work).
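
For reference, TREC Web format wraps each document or passage in simple SGML-style tags. The Python sketch below illustrates that general style only; the exact tags, fields, and document-ID conventions produced by the scripts above may differ.

```python
# Illustrative sketch of TREC Web style wrapping (not the scripts' exact output).
def to_trecweb(doc_id: str, text: str) -> str:
    """Wrap a single passage/document in TREC Web style SGML tags."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

# Hypothetical usage: write two toy passages to a single trecweb file.
if __name__ == "__main__":
    passages = [("EXAMPLE_0", "First example passage."),
                ("EXAMPLE_1", "Second example passage.")]
    with open("example.trecweb", "w", encoding="utf-8") as out:
        for doc_id, text in passages:
            out.write(to_trecweb(doc_id, text))
```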
