jchwenger/dataset.gutenberg-language

Dataset: Gutenberg

Downloading & cleaning Gutenberg (filtered by language).

Download

There are two ways to download the books. The first uses, and builds on, a Python utility from flauBERT:

python download.py -indir download_directory

Or use the download.sh script, which creates the directories zipfiles/ and files/ as well as the file links-zipfile.txt containing the links to the sources:

./download.sh lang_code

The lang_code can be 'fr', 'en', etc.

Clean

The cleaning script, also from flauBERT, uses tools from kiasar and NLTK to clean up licenses and endings:

python clean.py -indir download_directory -outdir clean_dir
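
To give a rough idea of what cleaning up licenses and endings involves, here is a minimal sketch that keeps only the text between the start and end markers commonly found in Project Gutenberg plain-text files. It is not the code in clean.py: the function name `strip_gutenberg_boilerplate` and the two regular expressions are assumptions made for illustration, and it uses plain regular expressions rather than NLTK. Marker wording varies between files, so treat the patterns as a starting point.

```python
import re

# Assumed marker shapes (hypothetical patterns, not taken from clean.py).
# Real Gutenberg files vary, so these may need adjusting.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, the corresponding edge of the text is kept.
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0
    # Only look for the END marker after the START marker.
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()
```

Calling `strip_gutenberg_boilerplate(raw_text)` on the contents of a downloaded file would return the book body without the surrounding license text, provided the file uses markers of the assumed shape.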

$ python download.py --help
usage: download.py [-h] -indir INDIR [-lang LANG] [-update_url UPDATE_URL]

optional arguments:
  -h, --help            show this help message and exit
  -indir INDIR          Path to directory to save downloaded files
  -lang LANG            Language to download
  -update_url UPDATE_URL
                        Choose to update book URLs if necessary (1/0)
$ python clean.py --help
usage: clean.py [-h] -indir INDIR -outdir OUTDIR [-json JSON]

optional arguments:
  -h, --help      show this help message and exit
  -indir INDIR    Path to directory to save downloaded files
  -outdir OUTDIR  Path to directory to save clean files
  -json JSON      Save file to json format or not (1/0)
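
The two steps can also be chained from Python. The sketch below only assembles the command lines from the flags listed in the help output above and runs them with `subprocess`. The helper names (`download_cmd`, `clean_cmd`, `run_pipeline`) are hypothetical, and it assumes that `-lang` accepts the same codes as download.sh ('fr', 'en', etc.) and that both scripts sit in the current working directory.

```python
import subprocess
import sys
from typing import List, Optional


def download_cmd(indir: str, lang: Optional[str] = None) -> List[str]:
    """Build the download.py command line from the documented flags."""
    cmd = [sys.executable, "download.py", "-indir", indir]
    if lang is not None:
        # Assumption: -lang takes codes such as 'fr' or 'en'.
        cmd += ["-lang", lang]
    return cmd


def clean_cmd(indir: str, outdir: str, as_json: Optional[bool] = None) -> List[str]:
    """Build the clean.py command line from the documented flags."""
    cmd = [sys.executable, "clean.py", "-indir", indir, "-outdir", outdir]
    if as_json is not None:
        # The help text documents -json as a 1/0 switch.
        cmd += ["-json", "1" if as_json else "0"]
    return cmd


def run_pipeline(download_dir: str, clean_dir: str, lang: str = "fr") -> None:
    """Download, then clean; raises CalledProcessError if a step fails."""
    subprocess.run(download_cmd(download_dir, lang), check=True)
    subprocess.run(clean_cmd(download_dir, clean_dir), check=True)
```

For example, `run_pipeline("download_directory", "clean_dir", lang="en")` would run the download followed by the cleaning step, stopping at the first command that exits with an error.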
