AQuaMuSe is a novel, scalable approach to automatically mine dual query-based multi-document summarization datasets for extractive and abstractive summaries, using a question answering dataset (Google Natural Questions) and large document corpora (Common Crawl).

google-research-datasets/aquamuse

Dataset for Query-based Multi-Document Summarization

This repository contains versions of the automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.

High-level Notes:

  • Dependencies: Document URLs reference the Common Crawl June 2017 Archive.
  • Data Format:
    • Directory structure:
      • Each dataset release will have two top-level folders: abstractive and extractive.
      • Each top-level folder contains three sub-folders for train, dev and test examples.
    • File format: TFRecord.
    • Fields:
      • query: Input query to be used as summarization context. This is a single-valued byte_list feature, derived from Natural Questions user queries.
      • input_urls: List of URLs of the input documents to be summarized, pointing to Common Crawl. URLs are separated by the special token <EOD>.
      • target: Summarization target, derived from Natural Questions long answers.
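
As a rough illustration of the layout and fields described above, the following Python sketch lists the six expected split folders of a release and splits an input_urls value on the <EOD> token. Several things here are assumptions rather than facts from this repository: the helper names (split_dirs, split_input_urls), the release_root argument, the example URLs, the UTF-8 decoding of the field, and the stripping of whitespace around each URL. Reading the TFRecord files themselves would need a TFRecord reader such as TensorFlow's and is not shown.

```python
from pathlib import Path

# Separator token between URLs in the input_urls field, as stated above.
SEPARATOR = "<EOD>"

# Folder names taken from the directory structure described above.
SUMMARY_TYPES = ("abstractive", "extractive")
SPLITS = ("train", "dev", "test")


def split_dirs(release_root):
    """Return the six expected split folders under a release root.

    release_root is a hypothetical local path to one dataset release.
    """
    root = Path(release_root)
    return [root / kind / split for kind in SUMMARY_TYPES for split in SPLITS]


def split_input_urls(raw):
    """Split an input_urls value into a list of individual URLs.

    Accepts bytes (as stored in a byte_list feature) or str.
    Assumption: the value is UTF-8 text, and whitespace around
    each URL is not significant.
    """
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return [url.strip() for url in text.split(SEPARATOR) if url.strip()]


if __name__ == "__main__":
    for folder in split_dirs("aquamuse_release"):
        print(folder)
    # Made-up URLs, for illustration only.
    example = b"http://a.example/doc1<EOD>http://b.example/doc2<EOD>"
    print(split_input_urls(example))
```

Empty segments are dropped so that a trailing separator does not produce an empty URL; whether real records end with a trailing <EOD> is not stated here.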

Disclaimer

This is not an official Google product.
