
Add logging and file list processing #65

Merged
merged 6 commits into from
Sep 14, 2021
Conversation

faroit
Contributor

@faroit faroit commented Sep 11, 2021

This PR implements

  • a proper logging system based on Python's default logging module
  • a simple error-file mechanism that logs failed downloads to a text file, with each line formatted as url error_code
  • the ability to pass a Path to io.download() to download URLs from a text file (such as an error log); any file where the first entry on each line is a URL can be used
  • a log level that defaults to INFO, which means the progress bar is shown and errors are displayed
  • addresses Add resume functionality for the download #2, as the failed URLs can be reused for another download
  • stats = io.download() now returns statistics on successful/failed/skipped files
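The url error_code error-log format described above can be produced and consumed with plain string handling. A minimal sketch (the helper names here are illustrative, not part of gbif_dl's API):

```python
def write_error_log(path, failures):
    """Write failures (iterable of (url, status_code) pairs) one per line,
    in the "url error_code" format described above."""
    with open(path, "w") as f:
        for url, code in failures:
            f.write(f"{url} {code}\n")

def read_urls(path):
    """Return the first whitespace-separated entry of each non-empty line,
    i.e. the URL, ignoring the trailing error code."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]
```

Because only the first entry per line is read back, the same parser works for a bare list of URLs and for an error log with trailing status codes.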

Remaining issues

  • add unit tests for failing files
  • test non-standard URLs
  • fix tqdm write streams

Functionality

Resuming

Instead of implementing a complicated resume system, we simply rely on overwrite=False, which prevents files from being downloaded again. Since we use hashes of the URLs, this should not pose any problems, and it remains very fast because no additional queries are made.

Example:

Running download = gbif_dl.io.download(queries, overwrite=False) multiple times automatically resumes, since files that are already on disk are skipped.
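The skip logic behind overwrite=False can be sketched as follows, assuming the target filename is derived from a hash of the URL (these helpers are a sketch, not gbif_dl's actual internals):

```python
import hashlib
import os

def target_path(url, base_dir="downloads"):
    # The filename is a hash of the URL, so re-running the same query
    # deterministically maps each URL to the same file on disk.
    name = hashlib.sha256(url.encode()).hexdigest()
    return os.path.join(base_dir, name)

def should_download(url, overwrite=False, base_dir="downloads"):
    # With overwrite=False, a URL whose target file already exists is
    # skipped, which makes repeated runs behave like a resume.
    return overwrite or not os.path.exists(target_path(url, base_dir))
```

The existence check is purely local, which is why resuming adds no extra network queries.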

Logging

Furthermore, we now provide a way to redirect the output of the download progress:

  • by default, the progress bar output is sent to stderr
  • the list of failed URLs and their status codes is sent to stdout

So a (Linux) user could redirect the list of failed URLs to a file:

 python api_test.py > failed_urls.txt

The resulting file might look like this:

https://inaturalist-open-data.s3.amazonaws.com/photos/asdasdasd/original.jpeg 404
https://inaturalist-open-data.s3.amazonaws.com/photos/asdsadas/original.jpeg 301

Conveniently, this PR also implements a way to download a list of URLs from a text file:

gbif_dl.io.download("failed_urls.txt")

For non-Unix users, this PR also adds a way to log directly to a file without using pipes:

download = gbif_dl.io.download(queries, error_log_path="failed_urls.txt", overwrite=False)

Error logging and the progress bar can now be turned off using standard log levels (replacing verbose=True, which was removed):

# new default shows progressbar on std.err and failed URLs on std.out
download = gbif_dl.io.download(queries, loglevel="INFO")  
# setting log level to `ERROR` disables the progressbar but still shows failed urls
download = gbif_dl.io.download(queries, loglevel="ERROR")  
# setting log level to `CRITICAL` disables progressbar and download errors
download = gbif_dl.io.download(queries, loglevel="CRITICAL")
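The mapping from log level to visible output described above can be illustrated with Python's standard logging module (this is an illustration of the thresholds, not gbif_dl's actual implementation):

```python
import logging

def output_flags(loglevel):
    """Return (show_progress, show_errors) for a given log level.

    INFO or lower shows both the progress bar and failed URLs;
    ERROR suppresses the progress bar but keeps failed URLs;
    CRITICAL suppresses both.
    """
    # logging.getLevelName maps a registered level name like "INFO"
    # to its numeric value (INFO=20, ERROR=40, CRITICAL=50).
    level = logging.getLevelName(loglevel) if isinstance(loglevel, str) else loglevel
    show_progress = level <= logging.INFO
    show_errors = level <= logging.ERROR
    return show_progress, show_errors
```

Reusing the standard numeric levels keeps the behaviour consistent with any other logging configuration a user may already have.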

@faroit faroit changed the title Add logging and loading from file list Add logging and file list processing Sep 12, 2021
@faroit faroit marked this pull request as ready for review September 13, 2021 12:16
@faroit faroit merged commit 8175f1c into master Sep 14, 2021
@faroit faroit deleted the faroit/issue3 branch September 14, 2021 13:06