Subset chunks #166
Conversation
Flagged one logger typo in-line. Will follow up on the conversation thread with results from my tests against the FESOM and eNATL60 datasets.
This is really cool, Ryan. And even better: it appears to work. 🎉 🏆 🎉 Following your instructions above, I pushed pangeo-forge/staged-recipes@4bf58e8 to the FESOM recipe (which, as noted in pangeo-forge/staged-recipes#52 (comment), was previously blocked by #93).

Noting for future reference to myself or others: for situations where a pre-#166 install of `pangeo_forge_recipes` is in use, input metadata can be cached manually:

```python
from pangeo_forge_recipes.recipes.xarray_zarr import cache_input_metadata

for input_name in rec.iter_inputs():
    cache_input_metadata(
        input_name,
        file_pattern=rec.file_pattern,
        input_cache=rec.input_cache,
        cache_inputs=rec.cache_inputs,
        copy_input_to_local_file=rec.copy_input_to_local_file,
        xarray_open_kwargs=rec.xarray_open_kwargs,
        delete_input_encoding=rec.delete_input_encoding,
        process_input=rec.process_input,
        metadata_cache=rec.metadata_cache,
    )
```

I'll try the eNATL60 recipe now and report back shortly.
Update: FESOM build using e379f4f is now complete, and all chunks appear to be initialized as expected:

```python
import s3fs
import zarr

endpoint_url = "https://ncsa.osn.xsede.org"
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": endpoint_url})
fesom = "s3://Pangeo/pangeo-forge/swot_adac/FESOM/surf/fma.zarr"
group = zarr.open_consolidated(fs_osn.get_mapper(fesom))

for a in group.arrays():
    print(
        str(group[a[0]].info).split("Type")[0][:-1],
        str(group[a[0]].info).split("FSMap")[1],
    )
```

Will now re-install from d3e8b2c and try eNATL60.
I've proposed some additional logging in rabernat#2. Is this the best way to suggest edits to a big refactor like this (a pull request to the PR branch)? It seemed better than committing directly to it.
I think this is ready. Any sort of review would be appreciated. In particular, it's important for @pangeo-forge/dev-team to grok the changes in indexing, summarized in the docs. The basic API has not changed, but the indexing objects returned by …
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
These make sense, and aesthetically seem to bring us into closer alignment with xarray's named dimensions. As such, I imagine they actually reduce the likelihood of human error. Very cool. The only change I'd recommend (beyond the inline typo fix, above) is that we merge some version of rabernat#2 into this PR. I've just updated it to reflect your comments, which led to a much nicer implementation. Here's the logging it now returns for the eNATL60 example using the default threshold of 500 MB:

and with a non-default setting supplied via an environment variable:

```python
import os

os.environ["PANGEO_FORGE_MAX_MEMORY"] = "100_000_000"
```

Without something like this, users will encounter silent kernel crashes without being informed that …
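To make the suggestion concrete: only the variable name `PANGEO_FORGE_MAX_MEMORY` appears above; the parsing and warning logic below is a hypothetical sketch, not the implementation in rabernat#2.

```python
import logging
import os

logger = logging.getLogger("pangeo_forge_recipes")


def get_memory_threshold(default: int = 500_000_000) -> int:
    """Hypothetical helper: read the memory ceiling from the environment,
    falling back to a 500 MB default when the variable is unset.

    int() accepts underscore separators, so "100_000_000" parses fine.
    """
    return int(os.environ.get("PANGEO_FORGE_MAX_MEMORY", default))


def warn_if_too_big(nbytes: int) -> None:
    # Emit a loud warning instead of letting the kernel die silently.
    threshold = get_memory_threshold()
    if nbytes > threshold:
        logger.warning(
            "Loading this variable requires %d bytes, exceeding the "
            "%d-byte limit; the kernel may crash.",
            nbytes,
            threshold,
        )
```

A user could then raise the ceiling with `os.environ["PANGEO_FORGE_MAX_MEMORY"] = "100_000_000"` before building the recipe.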
Closes #93 by allowing chunks to be a subset of inputs.

Before

There was a many-to-one relationship between "inputs" and "chunks": multiple inputs could be routed to a single chunk via the `inputs_per_chunk` parameter in `XarrayZarrRecipe`. This is appropriate for scenarios where we have many small NetCDF files as inputs.

Now
We now also allow one-to-many relationships between inputs and chunks. This is accomplished via the `subset_inputs` parameter: a dictionary, e.g. `{"time": 5}`, that tells us to subset each input into 5 distinct chunks along the time dimension. This only works if there are at least 5 items along the time axis in each file. It also doesn't make sense to combine this with `inputs_per_chunk > 1`, since the two would effectively cancel each other out.

How
To support this change, I had to refactor the internal indexing logic considerably. A big technical change is that `InputKey` and `ChunkKey` are no longer tuples of integers but rather tuples of a new type called `DimIndex`:

pangeo-forge-recipes/pangeo_forge_recipes/patterns.py, lines 55 to 60 in e17f678
We were already using these keys implicitly to encode lots of information, e.g. an input or chunk's position in the sequence. These keys are now more verbose and explicit.
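The actual `DimIndex` definition lives at the referenced lines of `patterns.py`; as a rough sketch of the idea (field and enum names here are illustrative assumptions, not taken from the source), a dimension-aware key component might look like:

```python
from dataclasses import dataclass
from enum import Enum


class CombineOp(Enum):
    # Illustrative: how positions along a dimension are combined.
    CONCAT = 1
    MERGE = 2
    SUBSET = 3


@dataclass(frozen=True)
class DimIndex:
    name: str          # dimension name, e.g. "time"
    index: int         # position along that dimension
    sequence_len: int  # total number of positions along it
    operation: CombineOp


# A chunk key becomes a tuple of DimIndex objects rather than bare ints,
# so the position it encodes is explicit and self-describing:
chunk_key = (DimIndex("time", 3, 10, CombineOp.CONCAT),)
```

Compared with a bare `(3,)`, such a key says *which* dimension the integer indexes and how long the sequence is, which is what makes the new keys "more verbose and explicit."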
Another change is that, rather than storing the mapping between chunks and inputs in a static dictionary inside the recipe class, we determine it via a pure function:

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 65 to 67 in e17f678
This actually really simplifies the XarrayZarrRecipe initialization logic! 🎉
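To illustrate the pure-function idea (a simplified, hypothetical sketch, not the function referenced above), a mapping from chunk index to input indices for the many-to-one `inputs_per_chunk` case might look like:

```python
def inputs_for_chunk(chunk_index: int, inputs_per_chunk: int, ninputs: int) -> list[int]:
    """Pure function: which inputs feed a given chunk.

    Hypothetical simplification covering only inputs_per_chunk grouping;
    the real implementation also handles subset_inputs and DimIndex keys.
    """
    start = chunk_index * inputs_per_chunk
    stop = min(start + inputs_per_chunk, ninputs)
    return list(range(start, stop))


# With 10 inputs grouped 2 per chunk, chunk 3 draws on inputs 6 and 7:
assert inputs_for_chunk(3, inputs_per_chunk=2, ninputs=10) == [6, 7]
```

Because the mapping is computed on demand from the recipe's parameters, there is no precomputed dictionary to build and keep consistent at initialization time.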
Review
I don't expect anyone to really be able to thoroughly review this sort of monster PR. Perhaps someone could at least look over the API changes and tests?
TODO