Creates a SQLite database of the CNN and DailyMail summarization dataset.
See the full documentation. The API reference is also available.
The easiest way to install the command line program is via the pip
installer:
pip3 install zensols.cnndmdb
Binaries are also available on PyPI.
This package can be used from the command line with the cnndmdb
command, or as a Python API.
- Install the Python dependencies:
  pip install -r src/python/requirements.txt
- Create the SQLite database file:
  cnndmdb load
  This takes a while since the entire corpus is first downloaded and then inserted into the SQLite file.
- Check to make sure the file data/cnndm.sqlite3 was created.
- Optionally create a ~/.cnndmdbrc file to relocate the data/cnndm.sqlite3 file.
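The check in the third step can also be done from Python. The following is a minimal sketch that uses only the standard library sqlite3 module. It lists whatever tables it finds rather than assuming a particular schema, and the helper name list_tables is hypothetical (it is not part of this package).

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the sorted table names found in the SQLite file at ``db_path``.

    Raises FileNotFoundError when the file is missing, because
    ``sqlite3.connect`` would otherwise silently create an empty database.
    """
    if not db_path.is_file():
        raise FileNotFoundError(f'no database file at {db_path}')
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == '__main__':
    # default location named in the steps above
    path = Path('data/cnndm.sqlite3')
    if path.is_file():
        print(list_tables(path))
    else:
        print(f'{path} not found; run "cnndmdb load" first')
```

An empty list would suggest the file exists but the load did not complete.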
To relocate the SQLite file, add the following to the ~/.cnndmdbrc
file:
[cnndmdb_default]
db_file = ~/path/to/cnndm.sqlite3
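To confirm which path such a file resolves to, the entry can be read back with Python's standard configparser module. This is only a sketch under the assumption that the file is plain INI as shown above; the package may parse it differently, and the helper name read_db_file is hypothetical.

```python
from configparser import ConfigParser
from pathlib import Path
from typing import Optional


def read_db_file(rc_text: str) -> Optional[Path]:
    """Return the ``db_file`` entry of the ``cnndmdb_default`` section.

    The leading ``~`` is expanded to the user's home directory.  ``None``
    is returned when the section or the option is absent.
    """
    # interpolation is disabled so a literal '%' in a path is left alone
    parser = ConfigParser(interpolation=None)
    parser.read_string(rc_text)
    if not parser.has_option('cnndmdb_default', 'db_file'):
        return None
    return Path(parser.get('cnndmdb_default', 'db_file')).expanduser()


if __name__ == '__main__':
    rc_path = Path('~/.cnndmdbrc').expanduser()
    if rc_path.is_file():
        print(read_db_file(rc_path.read_text()))
    else:
        print(f'{rc_path} does not exist')
```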
The SQLite database keys can be printed with:
cnndmdb keys
The command line can also be used to print articles:
cnndmdb show -t org 3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6
The -t org option tells the command to use the original corpus keys. The same
option also allows articles to be selected by SQLite rowid keys or as the Kth
smallest article.
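The corpus objects are accessible as mapped Python objects. For example:
# import path assumed from the package name used in the pip command above
from zensols.cnndmdb import ApplicationFactory, Corpus, Article
corpus: Corpus = ApplicationFactory.get_corpus()
# fetch the first article in the corpus and print its text
art: Article = next(iter(corpus.stash.values()))
print(art.text)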
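Since the corpus is large, it may be more convenient to look at a handful of articles than at one or all of them. The sketch below assumes only what the example above shows: that corpus.stash.values() can be iterated and that each article has a text attribute. The helper name preview and its parameters are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List


def preview(articles: Iterable[Any], n: int = 3, width: int = 80) -> List[str]:
    """Return the first ``width`` characters of the text of the first ``n``
    articles.

    ``islice`` stops after ``n`` items, so the rest of the iterable is
    never consumed.
    """
    return [art.text[:width] for art in islice(articles, n)]
```

With the corpus object from the example above, this might be used as:

```python
for line in preview(corpus.stash.values(), n=5):
    print(line)
```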
The data is sourced from a TensorFlow dataset, which in turn uses the Abigail See GitHub repository.
@article{DBLP:journals/corr/SeeLM17,
author = {Abigail See and
Peter J. Liu and
Christopher D. Manning},
title = {Get To The Point: Summarization with Pointer-Generator Networks},
journal = {CoRR},
volume = {abs/1704.04368},
year = {2017},
url = {http://arxiv.org/abs/1704.04368},
archivePrefix = {arXiv},
eprint = {1704.04368},
timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
title={Teaching machines to read and comprehend},
author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
booktitle={Advances in neural information processing systems},
pages={1693--1701},
year={2015}
}
An extensive changelog is available here.
Copyright (c) 2023 Paul Landes