Australian Gridded Climate Data (AGCD) V2 1971-2020 #87
@norlandrhagen & @andersy005, as a prelude to our sync on recipe contribution, I thought I'd begin with an async demonstration for you by attempting to move this existing PR from Raphael through the stages of evaluation described in:
This PR was made before @pangeo-forge-bot performed automated checks on all incoming PRs, but that's okay. PRs are evaluated at every commit, so to trigger evaluation of this PR, I just need to push an arbitrary commit to this PR. I will do that now by bumping |
I don't see a ['recipes/AGDC/meta.yml', 'recipes/AGDC/recipe.py'] Please commit a
|
The bot could not find a file named |
Turns out we just found a bug, which I'm now tracking in https://github.com/pangeo-forge/registrar/issues/35. The bakery specified in |
🎉 New recipe runs created for the following recipes at sha |
Great! So all of the static linting now passed, and we've received notification that a |
/run recipe-test recipe_run_id=95 |
✨ A test of your recipe I'll notify you with a comment on this thread when this test is complete. (This could be a little while...) In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/95 |
Pangeo Cloud told me that our test of your recipe To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/95 If you haven't yet tried pruning and running your recipe locally, I suggest trying that now. Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps! |
Looks like this failed with a lock-related error.
Task 'store_chunk[8]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/registrar/flow.py", line 113, in wrapper
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 631, in store_chunk
    with lock_for_conflicts(lock_keys, timeout=config.lock_timeout):
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/utils.py", line 106, in lock_for_conflicts
    acquired = lock.acquire(timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/lock.py", line 137, in acquire
    result = self.client.sync(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py", line 868, in sync
    return sync(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
    raise exc.with_traceback(tb)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
    result[0] = yield future
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 895, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 672, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.lock_acquire local=tcp://10.60.1.8:50098 remote=tcp://dask-jovyan-215ce648-4.pangeo-forge-columbia-staging-bakery:8786>: Stream is closed
We should investigate why these locks are erroring out. @TomAugspurger - you have some experience with this. Any ideas?
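For context on why `store_chunk` takes a lock at all: as far as I understand it, the lock guards against two tasks writing to the same target chunk at once, which can happen whenever input boundaries do not line up with target chunk boundaries. The sketch below is not the pangeo-forge-recipes implementation. The function names and the 12-time-steps-per-input figure are hypothetical; only the target time chunk of 20 comes from this recipe.

```python
from collections import defaultdict


def touched_chunks(start: int, stop: int, chunk_len: int) -> range:
    """Indices of the target chunks overlapped by the half-open region [start, stop)."""
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


def find_conflicts(input_len: int, n_inputs: int, chunk_len: int) -> dict:
    """Map each target chunk index to the inputs writing into it, keeping only shared chunks."""
    writers = defaultdict(list)
    for i in range(n_inputs):
        # Input i covers time steps [i * input_len, (i + 1) * input_len).
        for chunk in touched_chunks(i * input_len, (i + 1) * input_len, chunk_len):
            writers[chunk].append(i)
    return {chunk: w for chunk, w in writers.items() if len(w) > 1}


# Hypothetical: 5 inputs of 12 time steps each, written into target chunks of 20 steps.
conflicts = find_conflicts(input_len=12, n_inputs=5, chunk_len=20)
print(conflicts)

# When inputs align with target chunks, no chunk is shared and no lock is needed.
aligned = find_conflicts(input_len=20, n_inputs=5, chunk_len=20)
print(aligned)
```

Every shared chunk in `conflicts` is a place where a writer would have to wait on a lock, so misaligned inputs mean more lock traffic between workers and the scheduler.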
recipes/AGDC/recipe.py (outdated)
    from pangeo_forge_recipes.recipes import XarrayZarrRecipe

    # Filename Pattern Inputs
    target_chunks = {"lat": 3451, "lon": 4426, "time": 20}
These chunks seem pretty big. Could we get away with smaller ones? That might help the locking situation.
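To put a rough number on "pretty big", here is a back-of-the-envelope estimate of the uncompressed size of one target chunk. The dtype is an assumption, since the thread does not say what the variables are stored as, so both float32 and float64 are shown. `chunk_nbytes` is a hypothetical helper and `smaller_chunks` is only an illustrative alternative, not a recommendation from the thread.

```python
from math import prod

# Chunking taken from recipes/AGDC/recipe.py in this PR.
target_chunks = {"lat": 3451, "lon": 4426, "time": 20}


def chunk_nbytes(chunks: dict, itemsize: int) -> int:
    """Uncompressed size in bytes of one chunk with the given per-element size."""
    return prod(chunks.values()) * itemsize


# Assumed dtypes: the actual dtype of the AGCD variables is not stated in the thread.
for name, itemsize in [("float32", 4), ("float64", 8)]:
    size = chunk_nbytes(target_chunks, itemsize)
    print(f"{name}: {size / 1e9:.2f} GB per chunk")

# Illustrative alternative: the same spatial extent with fewer time steps per chunk.
smaller_chunks = {**target_chunks, "time": 2}
print(f"float32, time=2: {chunk_nbytes(smaller_chunks, 4) / 1e6:.0f} MB per chunk")
```

At over a gigabyte per chunk even for float32, a worker holding a few chunks in memory could plausibly run out, which would fit the worker-is-dying theory as much as the locking one.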
I'm not super confident, but I'm not immediately convinced that the lock is the problem. Rather, my gut reaction is that the worker is dying for some other reason, and we happen to be trying to acquire the lock when we notice that. Do the logs at https://pangeo-forge.org/dashboard/recipe-run/95 include the output from the workers, or is that just the client? |
The pangeo-forge.org logs (sourced from Prefect) are incomplete. Here's a fuller picture:
There are a lot of unmanaged worker memory warnings throughout these logs, e.g.
|
🎉 New recipe runs created for the following recipes at sha |
@norlandrhagen, let me know how I can help here. Looks like the last commit didn't change anything in the recipe itself, so we don't need to re-run the test execution yet, AFAICT. |
@cisaacstern |
That feature is being tracked in pangeo-forge/user-stories#3 but is not yet available. |
🎉 New recipe runs created for the following recipes at sha |
🎉 New recipe runs created for the following recipes at sha |
/run recipe-test recipe_run_id=992 |
🎉 New recipe runs created for the following recipes at sha |
/run recipe-test recipe_run_id=996 |
✨ A test of your recipe I'll notify you with a comment on this thread when this test is complete. (This could be a little while...) In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/996 |
Pangeo Forge Cloud told me that our test of your recipe To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/996 If you haven't yet tried pruning and running your recipe locally, I suggest trying that now. Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps! |
@norlandrhagen, I've been having some success resolving mysterious memory-related issues by switching to our |
🎉 New recipe runs created for the following recipes at sha
|
/run recipe-test recipe_run_id=72 |
✨ A test of your recipe I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)
|
Woohoo! Dataflow to the rescue? |
Yep. How does that dataset look? ☝️ If good, we can merge this. |
Looks like there is an issue with time; other than that, it's looking good. |
What is the issue with time? We only expect the test run to use the first two inputs in the time dimension. Is that what you're seeing, or something else? |
Ohhh, that makes sense. Well in that case, recipe looks good to go! |
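For anyone reading later and wondering why the test dataset has such a short time axis: the test run only processes a pruned copy of the recipe. A minimal sketch of the idea follows, assuming one input file per year. The `prune_inputs` helper and the file names are hypothetical and do not reflect how the real recipe names its inputs.

```python
def prune_inputs(inputs: list, nkeep: int = 2) -> list:
    """Keep only the first `nkeep` inputs along the concatenation dimension."""
    return inputs[:nkeep]


# Hypothetical inputs: one file per year for 1971-2020 (file names are made up).
all_inputs = [f"agcd_{year}.nc" for year in range(1971, 2021)]

# The test run only sees the first two inputs, so its time axis is much shorter.
test_inputs = prune_inputs(all_inputs)
print(len(all_inputs), "->", len(test_inputs))
print(test_inputs)
```

So a time dimension that stops after the first two inputs is expected in the test output, and the full range should only appear after a full (unpruned) run.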
@cisaacstern is it possible to kick off a run of this dataset? It seems like the pruned recipe run looked good. |
We've just released a new backend app, so before merging this, I'm just going to re-run the test, to make sure it still works. |
/run AGCD |
🎉 The test run of
    import xarray as xr
    store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1057/AGCD.zarr"
    ds = xr.open_dataset(store, engine='zarr', chunks={})
    ds
Thanks for this contribution, @norlandrhagen! And for the patience as we worked through getting it merged. |
Of course! Thanks for getting this one working. |
This recipe seems to work in local testing.