
Add logging and file list processing #65

Merged
merged 6 commits into from
Sep 14, 2021
Conversation

faroit
Contributor

@faroit faroit commented Sep 11, 2021

This PR implements

  • a proper logging system based on Python's default logging module
  • a simple error-file mechanism that logs failed downloads to a text file, with each line formatted as url error_code
  • the ability to pass a Path to io.download() to download URLs from a text file (such as an error log); any file where the first entry on each line is a URL can be used
  • a log level that defaults to INFO, which means the progress bar is shown and errors are displayed
  • addresses Add resume functionality for the download #2, as the failed URLs can be reused for another download
  • stats = io.download() now returns statistics on successful/failed/skipped files
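The url error_code error-log format described above can be produced and consumed with plain string handling. A minimal sketch (the helper names here are illustrative, not part of gbif_dl's API):

```python
def write_error_log(path, failures):
    """Write failures (iterable of (url, status_code) pairs) one per line,
    in the "url error_code" format described above."""
    with open(path, "w") as f:
        for url, code in failures:
            f.write(f"{url} {code}\n")

def read_urls(path):
    """Return the first whitespace-separated entry of each non-empty line,
    i.e. the URL, ignoring the trailing error code."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]
```

Because only the first entry per line is read back, the same parser works for a bare list of URLs and for an error log with trailing status codes.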

Remaining issues

  • add unit tests for failing files
  • test non-standard URLs
  • fix tqdm write streams

Functionality

Resuming

Instead of implementing a complicated resume system, we simply rely on overwrite=False, which prevents files from being downloaded again. Since we use hashes of the URLs, this should not pose any problems, and it remains very fast because no additional queries are made.

Example:

Running download = gbif_dl.io.download(queries, overwrite=False) multiple times automatically resumes, since files that are already on disk are skipped.
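The skip logic behind overwrite=False can be sketched as follows, assuming the target filename is derived from a hash of the URL (these helpers are a sketch, not gbif_dl's actual internals):

```python
import hashlib
import os

def target_path(url, base_dir="downloads"):
    # The filename is a hash of the URL, so re-running the same query
    # deterministically maps each URL to the same file on disk.
    name = hashlib.sha256(url.encode()).hexdigest()
    return os.path.join(base_dir, name)

def should_download(url, overwrite=False, base_dir="downloads"):
    # With overwrite=False, a URL whose target file already exists is
    # skipped, which makes repeated runs behave like a resume.
    return overwrite or not os.path.exists(target_path(url, base_dir))
```

The existence check is purely local, which is why resuming adds no extra network queries.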

Logging

Furthermore, we now provide a way to redirect the output of the download progress:

  • by default, the progress bar output is sent to stderr
  • the list of failed URLs and their status codes is sent to stdout

So a (Linux) user could redirect the list of failed URLs to a file:

 python api_test.py > failed_urls.txt

The resulting file might look like this:

https://inaturalist-open-data.s3.amazonaws.com/photos/asdasdasd/original.jpeg 404
https://inaturalist-open-data.s3.amazonaws.com/photos/asdsadas/original.jpeg 301

Conveniently, this PR also implements a way to download a list of URLs from a text file:

gbif_dl.io.download("failed_urls.txt")

For non-Unix users, this PR also adds a way to log directly to a file without using pipes:

download = gbif_dl.io.download(queries, error_log_path="failed_urls.txt", overwrite=False)

Error logging and the progress bar can now be turned off using standard log levels (replacing verbose=True, which was removed):

# new default shows progressbar on std.err and failed URLs on std.out
download = gbif_dl.io.download(queries, loglevel="INFO")  
# setting log level to `ERROR` disables the progressbar but still shows failed urls
download = gbif_dl.io.download(queries, loglevel="ERROR")  
# setting log level to `CRITICAL` disables progressbar and download errors
download = gbif_dl.io.download(queries, loglevel="CRITICAL")
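The mapping from log level to visible output described above can be illustrated with Python's standard logging module (this is an illustration of the thresholds, not gbif_dl's actual implementation):

```python
import logging

def output_flags(loglevel):
    """Return (show_progress, show_errors) for a given log level.

    INFO or lower shows both the progress bar and failed URLs;
    ERROR suppresses the progress bar but keeps failed URLs;
    CRITICAL suppresses both.
    """
    # logging.getLevelName maps a registered level name like "INFO"
    # to its numeric value (INFO=20, ERROR=40, CRITICAL=50).
    level = logging.getLevelName(loglevel) if isinstance(loglevel, str) else loglevel
    show_progress = level <= logging.INFO
    show_errors = level <= logging.ERROR
    return show_progress, show_errors
```

Reusing the standard numeric levels keeps the behaviour consistent with any other logging configuration a user may already have.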

@faroit faroit changed the title Add logging and loading from file list Add logging and file list processing Sep 12, 2021
@faroit faroit marked this pull request as ready for review September 13, 2021 12:16
@faroit faroit merged commit 8175f1c into master Sep 14, 2021
@faroit faroit deleted the faroit/issue3 branch September 14, 2021 13:06