Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Australian Gridded Climate Data (AGCD) V2 1971-2020 #87

Merged
merged 11 commits into from
Sep 8, 2022

Conversation

norlandrhagen
Copy link
Contributor

This recipe seems to work in local testing.

@cisaacstern
Copy link
Member

cisaacstern commented Apr 27, 2022

@norlandrhagen & @andersy005, as a prelude to our sync on recipe contribution, I thought I'd begin with an async demonstration for you by attempting to move this existing PR from Raphael through the stages of evaluation described in:

https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_cloud/pr_checks_reference.html

This PR was made before @pangeo-forge-bot performed automated checks on all incoming PRs, but that's okay. PRs are evaluated at every commit, so to trigger evaluation of this PR, I just need to push an arbitrary commit to this PR. I will do that now by bumping pangeo_forge_version in meta.yaml to the latest release.

recipes/AGDC/meta.yml Outdated Show resolved Hide resolved
@pangeo-forge-bot
Copy link

I don't see a meta.yaml in this PR, only these files:

['recipes/AGDC/meta.yml', 'recipes/AGDC/recipe.py']

Please commit a meta.yaml as described here.

Sorry, I only recognize the longform .yaml extension!
If you're using the shortform .yml, please update your filename to use the longform extension.

@cisaacstern
Copy link
Member

The bot could not find a file named meta.yaml (with the long form extension) because the YAML file in this PR uses the short form extension. We can resolve this by changing meta.yml -> meta.yaml, which I'll do now.

@cisaacstern
Copy link
Member

Turns out we just found a bug, which I'm now tracking in https://github.com/pangeo-forge/registrar/issues/35. The bakery specified in meta.yaml is not known to Pangeo Forge Cloud: https://pangeo-forge.org/dashboard/bakeries. As noted in the linked issue, we should really have @pangeo-forge-bot tell us when that condition is encountered. But for now, to move this process forward, I'll just update meta.yaml to use a known bakery (there is only one, at the moment).

recipes/AGDC/meta.yaml Outdated Show resolved Hide resolved
@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha 0b837a0b28609fb35df063e94a9b6795171b1fd3:

@cisaacstern
Copy link
Member

Great! So all of the static linting now passed, and we've received notification that a recipe_run was created in the Pangeo Forge Cloud orchestration database referencing the recipe included in this PR. I will now trigger a test run of that recipe.

@cisaacstern
Copy link
Member

/run recipe-test recipe_run_id=95

@pangeo-forge-bot
Copy link

✨ A test of your recipe AGCD is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/95

@pangeo-forge-bot
Copy link

Pangeo Cloud told me that our test of your recipe AGCD failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/95

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this failed with a lock-related error.


Task 'store_chunk[8]': Exception encountered during task execution!
Traceback (most recent call last): File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read frames_nbytes = await stream.read_bytes(fmt_size) tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state value = prefect.utilities.executors.run_task_with_timeout( File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout return task.run(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.9/site-packages/registrar/flow.py", line 113, in wrapper
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 631, in store_chunk with lock_for_conflicts(lock_keys, timeout=config.lock_timeout):
File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__ return next(self.gen)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/utils.py", line 106, in lock_for_conflicts acquired = lock.acquire(timeout=timeout)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/lock.py", line 137, in acquire result = self.client.sync(
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py", line 868, in sync return sync(
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync raise exc.with_traceback(tb)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 315, in f result[0] = yield future File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py", line 762, in run value = future.result()
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 895, in send_recv_from_rpc result = await send_recv(comm=comm, op=key, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 672, in send_recv response = await comm.read(deserializers=deserializers)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read convert_stream_closed_error(self, e)
File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error raise CommClosedError(f"in {obj}: {exc}") from exc distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.lock_acquire local=tcp://10.60.1.8:50098 remote=tcp://dask-jovyan-215ce648-4.pangeo-forge-columbia-staging-bakery:8786>: Stream is closed

We should investigate why these locks are erroring out. @TomAugspurger - you have some experience with this. Any ideas?

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Filename Pattern Inputs
target_chunks = {"lat": 3451, "lon": 4426, "time": 20}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These chunks seems pretty big? Could we get away with smaller ones? That might help the locking situation.

recipes/AGDC/recipe.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

We should investigate why these locks are erroring out. @TomAugspurger - you have some experience with this. Any ideas?

I'm not super confident, but I'm not immediately convinced that the lock is the problem. Rather, my gut reaction is that the worker is dying for some other reason, and we happen to be trying to acquire the lock when we notice that.

Do the logs at https://pangeo-forge.org/dashboard/recipe-run/95 include the output from the workers, or is that just the client?

@cisaacstern
Copy link
Member

The pangeo-forge.org logs (sourced from Prefect) are incomplete. Here's a fuller picture:

https://gist.github.com/cisaacstern/103e755ddeea7b0b2dd16b29042778e0

There are a lot of unmanaged worker memory warnings throughout these logs, e.g.

2022-04-27T22:43:52.996592464Z stderr F DEBUG:pangeo_forge_recipes.recipes.xarray_zarr:Converting variable tmin of 448149432 bytes to `numpy.ndarray`
2022-04-27T22:43:55.509916521Z stderr F distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 3.06 GiB -- Worker memory limit: 3.90 GiB
2022-04-27T22:43:56.599976356Z stderr F distributed.utils_perf - INFO - full garbage collection released 1.13 GiB from 496 reference cycles (threshold: 9.54 MiB)
2022-04-27T22:43:56.600346814Z stderr F distributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 3.13 GiB -- Worker memory limit: 3.90 GiB
2022-04-27T22:43:56.60069608Z stderr F distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 3.13 GiB -- Worker memory limit: 3.90 GiB
2022-04-27T22:43:56.603242955Z stderr F distributed.worker - WARNING - Worker is at 51% memory usage. Resuming worker. Process memory: 2.00 GiB -- Worker memory limit: 3.90 GiB

@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha 4731318ca40fc91bb6484523684ace8dfdeef0a6:

@cisaacstern
Copy link
Member

@norlandrhagen, let me know how I can help here. Looks like the last commit didn't change anything in the recipe itself, so we don't need to re-run the test execution yet, AFAICT.

@norlandrhagen
Copy link
Contributor Author

@cisaacstern
Thanks! Yup, no need to re-run. Is there a way to skip the CI recipe run?

@cisaacstern
Copy link
Member

That feature is being tracked in pangeo-forge/user-stories#3 but is not yet available.

@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha 6a440996f916fc365ba9507cf2926b4fae94120b:

@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha 267ebdb25143c44a5da55595e7543d11d7438c16:

@norlandrhagen
Copy link
Contributor Author

/run recipe-test recipe_run_id=992

@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha 015789e469d8ce8cf0654a3d2cfe48cba365607b:

@cisaacstern
Copy link
Member

/run recipe-test recipe_run_id=996

@pangeo-forge-bot
Copy link

✨ A test of your recipe AGCD is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/996

@pangeo-forge-bot
Copy link

Pangeo Forge Cloud told me that our test of your recipe AGCD failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/996

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

@cisaacstern
Copy link
Member

cisaacstern commented Jul 22, 2022

@norlandrhagen, I've been having some success resolving mysterious memory-related issues by switching to our dev backend (which deploys to Google Dataflow rather than Prefect). I'll label this PR as dev now, and try that out. Note: I'll need to make an arbitrary commit bump the meta.yaml to use pangeo_forge_version==0.9.0 as well, to get a clean recipe run.

@cisaacstern cisaacstern added the dev (dev use only) directs registrar calls to staging api label Jul 22, 2022
recipes/AGDC/meta.yaml Outdated Show resolved Hide resolved
@pangeo-forge-bot
Copy link

🎉 New recipe runs created for the following recipes at sha f95296f02bb2250e7af43afdbdcd5840ff142865:

Note: This PR is deployed to Pangeo Forge Cloud's dev backend, for which a full frontend website in not currently available. The links below therefore point to plain text information about the created recipe run(s).

@cisaacstern
Copy link
Member

/run recipe-test recipe_run_id=72

@pangeo-forge-bot
Copy link

✨ A test of your recipe AGCD is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

Note: This test is deployed to Pangeo Forge Cloud's dev backend, for which public logs are not yet available.

@pangeo-forge-bot
Copy link

🥳 Hooray! The test execution of your recipe AGCD succeeded.

Here is a static representation of the dataset built by this recipe:

            <xarray.Dataset>
    Dimensions:     (lat: 691, lon: 886, time: 731)
    Coordinates:
      * lat         (lat) float32 -44.5 -44.45 -44.4 -44.35 ... -10.1 -10.05 -10.0
      * lon         (lon) float32 112.0 112.1 112.1 112.2 ... 156.1 156.2 156.2
      * time        (time) datetime64[ns] 1971-01-01T09:00:00 ... 1972-12-31T09:0...
    Data variables:
        precip      (time, lat, lon) float32 ...
        tmax        (time, lat, lon) float32 ...
        tmin        (time, lat, lon) float32 ...
        vapourpres  (time, lat, lon) float32 ...
    Attributes: (12/33)
        Conventions:               CF-1.6, ACDD-1.3
        acknowledgment:            The Australian Government, Bureau of Meteorolo...
        agcd_version:              AGCD (AWAP) v1.0.0 Snapshot (1900-01-01 to 202...
        analysis_components:       0900: the gridded vapour pressure value at 9am...
        attribution:               Data should be cited as : Australian Bureau of...
        cdm_data_type:             Grid
        ...                        ...
        summary:                   The partial pressure of water vapour in air (v...
        time_coverage_end:         1971-12-31T00:00:00
        time_coverage_start:       1971-01-01T15:00:00
        title:                     Interpolated Vapour Pressure
        url:                       http://www.bom.gov.au/climate/
        uuid:                      e684e0a6-73c7-4522-ab78-a8285ca34b4b

You can also open this dataset by running the following Python code

import fsspec
import xarray as xr

dataset_public_url = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-72/pangeo-forge/staged-recipes/AGCD.zarr'
mapper = fsspec.get_mapper(dataset_public_url)
ds = xr.open_zarr(mapper, consolidated=True)
ds

in this badge (or your Python interpreter of choice).

Checklist

Please copy-and-paste the list below into a new comment on this thread, and check the boxes off as you've reviewed them.

Note: This test execution is limited to two increments in the concatenation dimension, so you should expect the length of that dimension (e.g, "time" or equivalent) to be 2.

- [ ] Are the dimension lengths correct?
- [ ] Are all of the expected variables present?
- [ ] Does plotting the data produce a plot that looks like your dataset?
- [ ] Can you run a simple computation/reduction on the data and produce a plausible result?

@norlandrhagen
Copy link
Contributor Author

Woohoo! Dataflow to the rescue?

@cisaacstern
Copy link
Member

Dataflow to the rescue?

Yep.

How does that dataset look? ☝️ If good, we can merge this.

@norlandrhagen
Copy link
Contributor Author

Looks like there is an issue with time, other than that, it's looking good.

@norlandrhagen
Copy link
Contributor Author

image

Looks like Australia!

@cisaacstern
Copy link
Member

Looks like there is an issue with time, other than that, it's looking good.

What is the issue with time? We only expect the test run to use the first two inputs in the time dimension. Is that what you're seeing, or something else?

@norlandrhagen
Copy link
Contributor Author

Ohhh, that makes sense. Well in that case, recipe looks good to go!

@norlandrhagen norlandrhagen changed the title AGDC (Australian Gridded Climate Data (AGCD) V2) 1971-2020 Australian Gridded Climate Data (AGCD) V2 1971-2020 Jul 23, 2022
@norlandrhagen
Copy link
Contributor Author

@cisaacstern is it possible to kick off a run ofthis dataset? It seems like the pruned recipe run looked good.

@cisaacstern cisaacstern removed the dev (dev use only) directs registrar calls to staging api label Sep 8, 2022
recipes/AGDC/recipe.py Outdated Show resolved Hide resolved
@cisaacstern
Copy link
Member

We've just released a new backend app, so before merging this, I'm just going to re-run the test, to make sure it still works.

@cisaacstern
Copy link
Member

/run AGCD

@pangeo-forge
Copy link
Contributor

pangeo-forge bot commented Sep 8, 2022

🎉 The test run of AGCD at a830182 succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1057/AGCD.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@cisaacstern
Copy link
Member

xr.testing.assert_identical indicates that the datasets reported in #87 (comment) and #87 (comment) are identical, so going to merge this. 🎉

Thanks for this contribution, @norlandrhagen! And for the patience as we worked through getting it merged.

@cisaacstern cisaacstern merged commit 01e333a into pangeo-forge:master Sep 8, 2022
@norlandrhagen
Copy link
Contributor Author

Of course! Thanks for getting this one working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants