
Resume #17

Closed
mgangl opened this issue Oct 5, 2021 · 5 comments

mgangl commented Oct 5, 2021

https://podaac.jpl.nasa.gov/forum/viewtopic.php?f=6&t=1418

Good evening!

We download a lot of data from PODAAC, and occasionally something goes wrong partway through the download (real-world issues like a bad network connection or the system going down at the wrong moment).

Does the cloud data subscriber script have mechanisms to deal with these cases? Ideally it would retry a failed download but quickly "give up" if the problem persists.

Subscriber should be able to 'resume' after a download failure. Currently, if any download fails during a subscriber run, the subscriber exits without updating its last-run time, so the next run attempts to re-download every file from the previous, failed run, even if only one out of N files actually failed.
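The per-file resume behavior described above can be sketched roughly as below. This is illustrative, not the subscriber's actual code: `fetch` stands in for whatever transfer helper the subscriber uses, and the caller is assumed to update the last-run timestamp only when no files failed.

```python
import os

def download_all(urls, data_dir, fetch, max_retries=3):
    """Try each file independently so one failure doesn't force a full re-run."""
    failed = []
    for url in urls:
        dest = os.path.join(data_dir, url.rsplit("/", 1)[-1])
        if os.path.exists(dest):
            continue  # already present from a previous (partially failed) run
        for attempt in range(max_retries):
            try:
                fetch(url, dest)
                break
            except OSError:
                if attempt == max_retries - 1:
                    failed.append(url)
    return failed  # caller updates the last-run timestamp only if this is empty
```

On a re-run, files that already landed are skipped, so only the genuinely failed downloads are retried.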

@mike-gangl

Some more comments:

What I’d really like is something like what ncftp has, which checks to see if I’ve already got a copy of each file in my download directory and only actually downloads if I don’t. (Or alternatively, I’m okay with overwriting identical existing files – it’s just electrons to me; you’re the ones paying for the big database and transfer cost!)

@mike-gangl mike-gangl added this to the 1.9.0 milestone Feb 28, 2022
skorper commented Mar 31, 2022

Related, we want to update the data subscriber to not re-download granules if the UMM-G revision date changes but the data content itself does not change.

We can achieve this by looking at the checksum in the CMR UMM-G record (DataGranule -> ArchiveAndDistributionInformation -> the entry whose name ends with .nc -> Checksum -> Value) and comparing it with the checksum calculated on the local copy of the file.

Keep in mind the checksum can be either md5 or sha512.

https://podaac.jpl.nasa.gov/Tutorial-Discovering-Data-File-Checksums-for-Cloud-based-Data
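A minimal sketch of that comparison, assuming the dict shape of a UMM-G record as documented in the CMR schema (this is not the subscriber's actual code). It handles both algorithms mentioned above via `hashlib`:

```python
import hashlib

def umm_checksum(umm_g, filename):
    """Pull the {'Value', 'Algorithm'} checksum dict for `filename` from UMM-G."""
    infos = umm_g["DataGranule"]["ArchiveAndDistributionInformation"]
    for entry in infos:
        if entry.get("Name") == filename:
            return entry["Checksum"]  # e.g. {"Value": "...", "Algorithm": "MD5"}
    return None

def matches(local_path, checksum):
    """True if the local file hashes to the CMR value (MD5 or SHA-512)."""
    algo = checksum["Algorithm"].lower().replace("-", "")  # "MD5" or "SHA-512"
    h = hashlib.new(algo)
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == checksum["Value"]
```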

@frankinspace

Rather than finding the entry where the name ends with .nc, it would be more robust to first get the RelatedUrls entry with Type=GET DATA, grab the filename from its URL, then use that filename to find the correct entry in DataGranule -> ArchiveAndDistributionInformation.
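The more robust lookup suggested above could look something like this sketch (dict shapes follow the UMM-G schema; illustrative only):

```python
from urllib.parse import urlparse
import posixpath

def data_filename(umm_g):
    """Filename taken from the granule's 'GET DATA' RelatedUrl, if any."""
    for related in umm_g.get("RelatedUrls", []):
        if related.get("Type") == "GET DATA":
            return posixpath.basename(urlparse(related["URL"]).path)
    return None

def archive_entry(umm_g):
    """ArchiveAndDistributionInformation entry matching the GET DATA filename."""
    name = data_filename(umm_g)
    for entry in umm_g["DataGranule"]["ArchiveAndDistributionInformation"]:
        if entry.get("Name") == name:
            return entry
    return None
```

This avoids guessing by file extension, so granules whose data file is not a `.nc` (or which ship a `.nc.md5` sidecar entry) still resolve to the right record.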

skorper commented Apr 4, 2022

We also want a force option to re-download regardless of checksum.
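A minimal argparse sketch of that flag and the decision it overrides (option names match the feature as described here, not necessarily the final CLI):

```python
import argparse

def make_parser():
    parser = argparse.ArgumentParser(prog="podaac-data-subscriber")
    parser.add_argument(
        "-f", "--force", action="store_true",
        help="re-download even if a local file exists and its checksum matches CMR")
    return parser

def should_download(path_exists, checksum_ok, force):
    """Download unless an up-to-date local copy exists and --force is absent."""
    return force or not path_exists or not checksum_ok
```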

wveit pushed a commit that referenced this issue Apr 25, 2022
Prevents re-downloading files (e.g. in case previous run
failed because of other file failures).

If the subscriber sees a file already exists, it will also calculate
the file checksum and see if it matches the checksum in
CMR. If the checksum doesn't match, it will re-download.

There is now a --force/-f option that will cause subscriber
to re-download even if the file exists and is up to date.

Issue #17
mike-gangl added a commit that referenced this issue Apr 28, 2022
* Change print statements to log statements

* Fix flake errors

* Add retry logic for 500 and 401 errors from CMR

* Subscriber check if file exists before downloading

Prevents re-downloading files (e.g. in case previous run
failed because of other file failures).

If the subscriber sees a file already exists, it will also calculate
the file checksum and see if it matches the checksum in
CMR. If the checksum doesn't match, it will re-download.

There is now a --force/-f option that will cause subscriber
to re-download even if the file exists and is up to date.

Issue #17

* Issues/15 (#65)

* updated get_search to include verbose option, not entire 'args' option

* added search after functionality to podaac access; removed scroll from initial parameters

* updated changelog

* closes #15

* Update python-app.yml

added netrc creation for future use of regression tests.

* Add checks for pre-existing files to downloader (#67)

* Check if file exists before download - downloader

* Update documentation

Co-authored-by: Wilbert Veit <wilbert.e.veit@jpl.nasa.gov>

* Programmatic Regression Testing (#66)

* added programmatic regression testing. currently relies on a valid .netrc file; refactoring might be needed to manually add a user/password to the CMR/TEA downloads

* Update python-app.yml

* updated regression tests, readied 1.9.0 version

* added -f option test to downloader regression

* Update python-app.yml

Co-authored-by: Joe Sapp <joe.sapp@noaa.gov>
Co-authored-by: mgangl <mike.gangl@gmail.com>
Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Wilbert Veit <wilbert.e.veit@jpl.nasa.gov>
Co-authored-by: Wilbert Veit <wilbertveit@rocketmail.com>
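The "retry logic for 500 and 401 errors from CMR" item in the commit above could be sketched like this. `send` is any callable returning `(status_code, body)`, and the `refresh` hook is a hypothetical re-authentication step taken before retrying a 401; this is illustrative, not the merged implementation.

```python
import time

RETRYABLE = {401, 500}  # the two statuses the commit message calls out

def with_retries(send, retries=3, backoff=0.0, refresh=None):
    """Retry a request on CMR server errors (500) or an expired token (401)."""
    status, body = send()
    for _ in range(retries):
        if status not in RETRYABLE:
            break
        if status == 401 and refresh is not None:
            refresh()  # e.g. fetch a fresh Earthdata Login token
        time.sleep(backoff)
        status, body = send()
    return status, body
```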
@mike-gangl

'resume' now works by comparing checksums of existing files with search results, so no re-download occurs.
