Downloading and cleaning Project Gutenberg books (filtered by language).
There are two ways to download the books. The first uses, and builds on, a Python utility from FlauBERT:
python download.py -indir download_directory
The second uses the download.sh script, which creates the directories zipfiles/ and files/, as well as the file links-zipfile.txt, which contains the links to the sources:
./download.sh lang_code
The lang_code argument is a language code such as 'fr' or 'en'.
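After ./download.sh finishes, a quick sanity check of the layout it creates can be sketched in Python. This is only an illustration: summarize_download is a hypothetical helper that is not part of the repository, and it assumes that zipfiles/, files/ and links-zipfile.txt are created in the directory the script was run from, with one link per line.

```python
from pathlib import Path


def summarize_download(root="."):
    """Count what download.sh left behind under `root`.

    Assumptions (not confirmed by the scripts themselves):
    - zipfiles/, files/ and links-zipfile.txt live directly under `root`;
    - links-zipfile.txt holds one link per non-empty line.
    """
    root = Path(root)

    links_file = root / "links-zipfile.txt"
    links = []
    if links_file.is_file():
        lines = links_file.read_text(encoding="utf-8").splitlines()
        links = [line.strip() for line in lines if line.strip()]

    def count_files(dirname):
        # Count regular files only; a missing directory counts as empty.
        directory = root / dirname
        if not directory.is_dir():
            return 0
        return sum(1 for entry in directory.iterdir() if entry.is_file())

    return {
        "links": len(links),
        "zipfiles": count_files("zipfiles"),
        "files": count_files("files"),
    }


if __name__ == "__main__":
    print(summarize_download("."))
```

If the number of links is much larger than the number of entries in zipfiles/, some downloads probably failed and the script may need to be run again.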
To clean the downloaded files, use the script below, also from FlauBERT. It relies on tools from kiasar and on NLTK to strip the license text and the end-of-book boilerplate:
python clean.py -indir download_directory -outdir clean_dir
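To give an idea of what removing licenses and endings involves, here is a minimal sketch. It is not the code used by clean.py, whose exact rules may differ. It assumes the book uses the usual "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." marker lines, and strip_gutenberg_boilerplate is a hypothetical name chosen for this example.

```python
import re

# Marker lines that usually delimit the actual book text in Gutenberg files.
# Assumption: older files say "THIS", newer ones say "THE".
START_RE = re.compile(
    r"^\*{3}\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*{3}\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0

    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)

    return text[body_start:body_end].strip()


if __name__ == "__main__":
    sample = (
        "License header\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "\n"
        "Chapter 1\n"
        "Some text.\n"
        "\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "License footer\n"
    )
    print(strip_gutenberg_boilerplate(sample))
```

Files without these markers pass through unchanged apart from trimmed whitespace, which is why a dedicated script with more heuristics is preferable for a full corpus.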
$ python download.py --help
usage: download.py [-h] -indir INDIR [-lang LANG] [-update_url UPDATE_URL]

optional arguments:
  -h, --help            show this help message and exit
  -indir INDIR          Path to directory to save downloaded files
  -lang LANG            Language to download
  -update_url UPDATE_URL
                        Choose to update book URLs if necessary (1/0)
$ python clean.py --help
usage: clean.py [-h] -indir INDIR -outdir OUTDIR [-json JSON]

optional arguments:
  -h, --help            show this help message and exit
  -indir INDIR          Path to directory to save downloaded files
  -outdir OUTDIR        Path to directory to save clean files
  -json JSON            Save file to json format or not (1/0)
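
The two steps can also be chained from Python. The sketch below only uses the flags listed in the help output above. The function names are hypothetical, both scripts are assumed to be in the current directory, and -lang is assumed to accept the same codes as download.sh ('fr', 'en', etc.), which the help text does not confirm.

```python
import subprocess
import sys


def download_cmd(indir, lang=None, update_url=None):
    """Build the download.py command line; optional flags are omitted when None."""
    cmd = [sys.executable, "download.py", "-indir", str(indir)]
    if lang is not None:
        cmd += ["-lang", lang]
    if update_url is not None:
        # The help text documents this flag as 1/0.
        cmd += ["-update_url", "1" if update_url else "0"]
    return cmd


def clean_cmd(indir, outdir, as_json=None):
    """Build the clean.py command line; -json is omitted when as_json is None."""
    cmd = [sys.executable, "clean.py", "-indir", str(indir), "-outdir", str(outdir)]
    if as_json is not None:
        # The help text documents this flag as 1/0.
        cmd += ["-json", "1" if as_json else "0"]
    return cmd


def run_pipeline(download_dir, clean_dir, lang="fr", as_json=False):
    """Download, then clean; stop at the first step that fails."""
    steps = (
        download_cmd(download_dir, lang=lang),
        clean_cmd(download_dir, clean_dir, as_json=as_json),
    )
    for cmd in steps:
        subprocess.run(cmd, check=True)


# Example call (not executed here):
# run_pipeline("download_directory", "clean_dir", lang="fr")
```

Building the commands as lists rather than as a single shell string avoids quoting problems when directory names contain spaces.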