Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Append mode for StoreToZarr #721

Merged
merged 24 commits into from
Apr 3, 2024
Merged

Append mode for StoreToZarr #721

merged 24 commits into from
Apr 3, 2024

Conversation

cisaacstern
Copy link
Member

Towards #447

This first pass focuses just on StoreToZarr (not reference recipes).

Will open for review once it looks like it's working.

@rabernat
Copy link
Contributor

I would recommend using append_dim: str | None instead of mode.

@cisaacstern cisaacstern added the test-integration Apply this label to run integration tests on a PR. label Mar 31, 2024
@cisaacstern cisaacstern marked this pull request as ready for review March 31, 2024 22:24
@cisaacstern
Copy link
Member Author

cisaacstern commented Mar 31, 2024

This now works! The best "documentation" of it so far is the end to end test.

Tagged a bunch of you for review, for visibility. Also @thodson-usgs since I know you're interested in this feature.

To state the (presumably) obvious, this minimal implementation requires that the user/deployer know what range of data should be appended to the existing dataset, and does not provide any idempotency guarantees or much in the way of logical checks. It will happily append time ranges that do not monotonically increase from the existing concat dim, or re-append the same data multiple times.

But because resizing of the contact dimension is done with xarray.Dataset.to_zarr, we do get the benefit of some built-in consistency checks there.

Copy link
Contributor

@norlandrhagen norlandrhagen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work Charles! It's great to see this feature added. Do you plan to add a section to the docs?

pangeo_forge_recipes/aggregation.py Show resolved Hide resolved
self._append_offset = 0
if self.append_dim:
logger.warn(
"When `append_dim` is given, StoreToZarr is NOT idempotent. Successive deployment "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future PR's to this, do you see a good path forward to making append safe/idempotent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! For future readers, we had a good discussion of this in today's Coordination meeting, minutes here.

@@ -362,6 +369,7 @@ def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
attrs=self.attrs,
encoding=self.encoding,
consolidated_metadata=False,
append_dim=self.append_dim,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing how easy it was to modify this step to allow appending.

@cisaacstern
Copy link
Member Author

cisaacstern commented Apr 2, 2024

Do you plan to add a section to the docs?

@norlandrhagen how does aaa5a69 seem? Rendered version here

Edit: Noticed a small rendering issue, fixed that just now.

Comment on lines +293 to +296
if append_dim:
# if appending, only keep schema for coordinate to append. if we don't drop other
# coords, we may end up overwriting existing data on the `ds.to_zarr` call below.
schema["coords"] = {k: v for k, v in schema["coords"].items() if k == append_dim}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realizing that this, while seemingly necessary to avoid overwriting non-append_dim coords, may prevent us from benefiting from the dimension consistency checks offered by ds.to_zarr... we may need to re-implement such checks in StoreToZarr.__post_init__ if we want them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further... in StoreToZarr.__post_init__ we actually don't know what the coordinates in the new data are (aside from the user-provided append_dim), because we haven't actually fetched and opened any source files yet.

So if we want to check for consistency, we could actually do it right here. Before dropping non-append coords from the schema, we would just need to open the target dataset and check that all coords match first. Could be a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having just reviewed xarray's checks again, I actually don't see that this is done there, so may be overly strict. I think we can move forward with what we have for now.

@cisaacstern cisaacstern merged commit 30b9bb2 into main Apr 3, 2024
8 checks passed
@cisaacstern cisaacstern deleted the append-mode branch April 3, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test-integration Apply this label to run integration tests on a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants