Next data request is blocked by long downloads #139

Open
alxmrs opened this issue Apr 8, 2022 · 2 comments

Comments

@alxmrs
Collaborator

alxmrs commented Apr 8, 2022

I've inspected the MARS logs and our Dataflow job logs and noticed a discrepancy. In our MARS activity log, a request takes a certain amount of time to serve: say, 2 hours. In the Dataflow logs, the same request takes longer, perhaps by an additional hour.

After some analysis, I've found the cause of the difference: the ECMWF Web API, which processes the underlying request to the MARS service, takes a long time to actually download the data, especially if the payload is large. When I inspected EC's code, I noticed that it transfers data from EC's servers to the Dataflow job 1 MB at a time. Since we regularly have payloads in the tens of GiB, it makes sense that this takes a long time!

This feature request involves enhancing the fetch stage of the downloader such that subsequent requests don't have to wait on current downloads. As soon as data has been fetched from the MARS archive and is available for download, we should initiate a new data request.

Implementation Notes

A clean way to implement this feature would be to restructure the fetch stage to be a Composite Beam transform, similar to

class PartitionConfig(beam.PTransform):

In parallel, we should extend the Client interface to separate the request stage and the download stage of fetching.

The new fetch stage should involve multiple PTransforms that parallelize the work of fetching a request, waiting for it to become available, and then processing the download when ready.
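As a rough illustration of the three stages described above, here is a minimal sketch in plain Python rather than Beam. The names `fetch_all`, `submit`, `wait_until_ready`, and `download` are hypothetical stand-ins and not part of any existing client. It assumes the download is blocking I/O that can run on a worker thread while the next request is submitted.

```python
import concurrent.futures
import typing as t


def fetch_all(
    requests: t.Iterable[str],
    submit: t.Callable[[str], str],
    wait_until_ready: t.Callable[[str], None],
    download: t.Callable[[str], str],
    max_downloads: int = 4,
) -> t.List[str]:
    """Submit requests one at a time, but run downloads in the background.

    All callables are hypothetical: `submit` returns a handle for a request,
    `wait_until_ready` blocks until the data is staged, and `download`
    transfers the staged data and returns a local path.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_downloads) as pool:
        pending = []
        for request in requests:
            handle = submit(request)      # stage 1: ask the archive for data
            wait_until_ready(handle)      # stage 2: wait until it is staged
            # Stage 3: hand the transfer to a worker thread and move on, so
            # the next request does not wait for this download to finish.
            pending.append(pool.submit(download, handle))
        # Collect results in submission order; re-raises any download error.
        return [future.result() for future in pending]
```

In a Beam pipeline each stage would presumably be its own transform inside the composite; the thread pool here only stands in for that separation.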

To better understand what specific steps the clients / fetch should have, please review the flow of the ECMWF WebAPI for MARS, specifically:
https://github.com/ecmwf/ecmwf-api-client/blob/master/ecmwfapi/api.py#L518

Notice that it waits in a loop, checking if the data is ready, before actually performing the download.
https://github.com/ecmwf/ecmwf-api-client/blob/dd20383585359cc80249826977ff102c3a81deee/ecmwfapi/api.py#L532
https://github.com/ecmwf/ecmwf-api-client/blob/dd20383585359cc80249826977ff102c3a81deee/ecmwfapi/api.py#L548

With an extended client interface, we can provide affordances to call these steps in the client APIs ourselves, but in a way that better works with Beam + the global queue.
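One possible shape for such an extended interface is sketched below. Everything here is an assumption for discussion: `SplitClient`, `RequestHandle`, and the method names `submit`, `is_ready`, and `download` are hypothetical, and `FakeClient` is an in-memory stand-in rather than a real MARS or CDS client.

```python
import abc
import dataclasses
import typing as t


@dataclasses.dataclass(frozen=True)
class RequestHandle:
    """Hypothetical token identifying a submitted request on the server."""
    request_id: str


class SplitClient(abc.ABC):
    """Hypothetical interface separating the request from the download."""

    @abc.abstractmethod
    def submit(self, dataset: str, selection: t.Dict[str, t.Any]) -> RequestHandle:
        """Ask the server to start preparing data; return immediately."""

    @abc.abstractmethod
    def is_ready(self, handle: RequestHandle) -> bool:
        """Check once whether the data is available for download."""

    @abc.abstractmethod
    def download(self, handle: RequestHandle, output: str) -> None:
        """Transfer data that is already available to `output`."""


class FakeClient(SplitClient):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self, polls_until_ready: int = 2) -> None:
        self._polls_until_ready = polls_until_ready
        self._remaining: t.Dict[RequestHandle, int] = {}
        self.downloaded: t.List[t.Tuple[str, str]] = []

    def submit(self, dataset, selection):
        handle = RequestHandle(request_id=f"{dataset}-{len(self._remaining)}")
        self._remaining[handle] = self._polls_until_ready
        return handle

    def is_ready(self, handle):
        # Each poll brings the fake request one step closer to ready.
        self._remaining[handle] = max(0, self._remaining[handle] - 1)
        return self._remaining[handle] == 0

    def download(self, handle, output):
        if self._remaining[handle] != 0:
            raise RuntimeError("data is not ready for download yet")
        self.downloaded.append((handle.request_id, output))
```

Keeping `is_ready` as a single non-blocking check, rather than a loop, leaves the waiting strategy to the caller.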

To close this issue, all clients (so, CDS and MARS) should have the same interface and optimization, if possible.

Advanced implementation ideas

From my read of the Beam docs, this new fetcher might be a good candidate for a Splittable DoFn (SDF). A possible advantage of using an SDF is that the CPU utilization of the download might be improved, which could improve allocation. Instead of sleeping before checking if data is available, with an SDF, we may be able to use a timer and schedule the next call to the MARS server. While this implementation approach is not necessary, it may save on both compute and energy costs for the downloader. See the "User Initiated checkpoint" section of the Beam docs, for example.
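To make the "schedule the next check instead of sleeping" idea concrete, here is a small sketch that does not use Beam or the SDF API at all. It keeps a queue of due times on a simulated clock, so one slow request never blocks checks on the others. The function name `poll_without_sleeping` and the `is_ready` callable are hypothetical.

```python
import heapq
import itertools
import typing as t


def poll_without_sleeping(
    handles: t.Iterable[str],
    is_ready: t.Callable[[str], bool],
    poll_interval: float,
    start: float = 0.0,
) -> t.Iterator[t.Tuple[float, str]]:
    """Yield (due_time, handle) as each handle becomes ready.

    Time is simulated: a real worker would resume at `due_time` (for example
    via a timer) instead of advancing a counter.
    """
    order = itertools.count()  # tie-breaker so equal due times stay FIFO
    queue = [(start, next(order), handle) for handle in handles]
    heapq.heapify(queue)
    while queue:
        due, _, handle = heapq.heappop(queue)
        if is_ready(handle):
            yield due, handle
        else:
            # Not ready: reschedule this handle and let others proceed.
            heapq.heappush(queue, (due + poll_interval, next(order), handle))
```

For example, with a 30 second interval, a handle that is ready immediately is yielded at time 0 even if an earlier handle still needs two more checks.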

@alxmrs
Collaborator Author

alxmrs commented Apr 8, 2022

As noted in #140, an acceptance criterion for this issue is that each stage of fetching should include retry logic. It should aim to prevent this bug from occurring again.
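A minimal sketch of per-stage retry logic, assuming exponential backoff is acceptable and that transient failures surface as `ConnectionError` or `TimeoutError`. The decorator name `with_retries` and its parameters are hypothetical; real clients may raise different exception types.

```python
import functools
import time
import typing as t


def with_retries(
    max_attempts: int = 3,
    initial_delay: float = 1.0,
    retry_on: t.Tuple[t.Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: t.Callable[[float], None] = time.sleep,
):
    """Retry the wrapped function, doubling the delay after each failure."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= 2

        return wrapper

    return decorator
```

Each stage (request, readiness check, download) could be wrapped separately, so a failed download is retried without re-submitting the request.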

@alxmrs
Collaborator Author

alxmrs commented Apr 25, 2022

@mahrsee1997, after seeing your e2e tests, I think it may be time to investigate using the Splittable DoFns as described in the "Advanced implementation ideas" section. What do you think?
