CoreferentMentionRetrieval

This is the test collection for the Coreferent Mention Retrieval task defined by Sankepally et al. in "A Test Collection for Coreferent Mention Retrieval," ACM SIGIR 2018.
The collection contains the following files:

  • doc_ids.txt : The IDs of the documents used, which are a subset of the TAC 2014 EDL collection (LDC catalog number: LDC2014E13).
  • doc_sentence_char_offsets.tar.gz : A compressed archive containing the file with sentence boundaries; it is 150 MB when uncompressed. Character offsets of the sentence boundaries for all documents in doc_ids.txt are given in this format: [docid]:sentence_start_offset:sentence_ending_offset [unique sentence identifier]
  • CMR_char_offset_queries.txt : Each line has a mention query in the following format:
    [query ID] [query type] [doc_id:sentence_start_offset:sentence_ending_offset:token_start_offset:token_ending_offset] [query mention string]
  • CMR_char_offset_qrels.txt : Each line has a relevance judgment in the following format:
    [query ID] [query type] [doc_id:sentence_start_offset:sentence_ending_offset] [binary relevance judgment]
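The query and qrels formats above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the collection. It assumes that fields are separated by whitespace, that the square brackets in the format descriptions are placeholders rather than literal characters, and that a relevant judgment is written as `1`. The names `SentenceSpan`, `Query`, `Judgment`, `parse_span`, `parse_query` and `parse_qrel` are hypothetical.

```python
from typing import NamedTuple


class SentenceSpan(NamedTuple):
    doc_id: str
    start: int  # sentence_start_offset
    end: int    # sentence_ending_offset


class Query(NamedTuple):
    query_id: str
    query_type: str
    sentence: SentenceSpan
    token_start: int
    token_end: int
    mention: str


class Judgment(NamedTuple):
    query_id: str
    query_type: str
    sentence: SentenceSpan
    relevant: bool


def parse_span(field: str) -> SentenceSpan:
    # Split from the right so that a colon inside a document ID
    # would not shift the two offset fields.
    doc_id, start, end = field.rsplit(":", 2)
    return SentenceSpan(doc_id, int(start), int(end))


def parse_query(line: str) -> Query:
    # The mention string may contain spaces, so split at most three times.
    query_id, query_type, location, mention = line.rstrip("\n").split(None, 3)
    span_field, token_start, token_end = location.rsplit(":", 2)
    return Query(query_id, query_type, parse_span(span_field),
                 int(token_start), int(token_end), mention)


def parse_qrel(line: str) -> Judgment:
    query_id, query_type, location, label = line.split()
    # Assumption: "1" marks a relevant sentence, anything else is not relevant.
    return Judgment(query_id, query_type, parse_span(location), label == "1")
```

For example, `parse_query("Q1 PER doc1:0:50:10:20 Jane Doe")` (a made-up line, not taken from the collection) yields a `Query` whose `mention` is `"Jane Doe"` and whose `sentence` is `SentenceSpan("doc1", 0, 50)`.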

Evaluation

You can use the latest trec_eval for evaluation.
After making sure your results file is in TREC format and has no duplicate lines, you can run:
./trec_eval -q -M100 -m infAP batch2_char_offset_qrels.txt [your_result_file_name]
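One way to produce a results file that meets both requirements is sketched below in Python. It writes the standard six-column TREC run format (`query_id Q0 doc_id rank score run_tag`) and keeps only the highest-scoring entry for each query and sentence pair. Two assumptions are made here that the README does not state: that "duplicate lines" means repeated (query, sentence) pairs, and that the document column should hold the same `doc_id:sentence_start_offset:sentence_ending_offset` string used in the qrels file. The function name `format_run` and the default tag `my_run` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def format_run(results: Iterable[Tuple[str, str, float]],
               run_tag: str = "my_run") -> Iterator[str]:
    """Yield TREC-format run lines with one line per (query, sentence) pair."""
    # Keep the best score seen for each (query, sentence) pair.
    best: Dict[Tuple[str, str], float] = {}
    for query_id, sentence_id, score in results:
        key = (query_id, sentence_id)
        if key not in best or score > best[key]:
            best[key] = score

    # Group the deduplicated entries by query.
    by_query: Dict[str, List[Tuple[str, float]]] = {}
    for (query_id, sentence_id), score in best.items():
        by_query.setdefault(query_id, []).append((sentence_id, score))

    for query_id in sorted(by_query):
        # Rank by descending score; break ties by sentence ID for stable output.
        ranked = sorted(by_query[query_id], key=lambda item: (-item[1], item[0]))
        for rank, (sentence_id, score) in enumerate(ranked, start=1):
            yield f"{query_id} Q0 {sentence_id} {rank} {score:.6f} {run_tag}"
```

The lines can then be written to a file, for example with `open(path, "w").write("\n".join(format_run(results)) + "\n")`, and that file passed to the trec_eval command above.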
