Datasets


Mehvin edited this page Aug 1, 2018 · 13 revisions

Below are the datasets used in this project:

DeepMind Q&A Dataset (CNN & DailyMail)

Main Dataset
English News Articles
Each File Contains an Article and Its Respective Gold-standard Summary
Total of 312,085 Files
Link to download
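
As an illustration of the "article plus gold-standard" layout, here is a minimal Python sketch for separating the two parts of one file. It assumes the raw `.story` layout of the public CNN/DailyMail release, where the article text comes first and each gold-standard sentence follows an `@highlight` marker line. The function name `split_story` is hypothetical and is not taken from this project's code.

```python
def split_story(text):
    """Split raw story text into (article, highlights).

    Assumes the article comes first and every gold-standard sentence
    follows a line that contains only "@highlight".
    """
    article_lines = []
    highlights = []
    in_highlights = False  # becomes True at the first "@highlight" marker
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between paragraphs
        if line == "@highlight":
            in_highlights = True
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights
```

For example, `split_story(open(path, encoding="utf-8").read())` would return the article as one string and the gold-standard sentences as a list, provided the file follows the assumed layout.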

Australian Legal Case Reports Dataset

Dataset from a Different Domain - Legal Cases
Used for Generalization Test
Total of 3,887 Legal Documents
Link to download

Large Scale Chinese Short Text Summarization (LCSTS) Dataset

Contains Short Texts and Their Respective Gold-standard Summaries
Total of 2,400,591 Short Texts
Link to download - Requires application to obtain corpus

Datasets Pre-processing

Done via Python
Pre-processing steps include:
- Cleaning
- Re-formatting
- Splitting
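
The three steps listed above could look roughly like the following Python sketch. It is not the project's actual pre-processing code: the function names, the dict-based output format, and the 80/10/10 train/validation/test ratio are all assumptions made for illustration.

```python
import random
import re


def clean_text(text):
    """Cleaning: collapse whitespace runs and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def reformat_example(article, summary):
    """Re-formatting: store one example as a plain dict (hypothetical format)."""
    return {"article": clean_text(article), "summary": clean_text(summary)}


def split_dataset(examples, train=0.8, val=0.1, seed=0):
    """Splitting: shuffle reproducibly, then cut into train/val/test lists.

    The 80/10/10 default is an assumption, not a ratio stated on this page.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```

Using a fixed seed keeps the split reproducible across runs, which matters when results on the test portion are compared between experiments.
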
Pre-processing files for:

Completed by Melvin and Joe