
Subset chunks #166

Merged (44 commits, Jul 27, 2021)

Conversation

@rabernat (Contributor) commented Jul 15, 2021

Closes #93 by allowing chunks to be a subset of inputs.

Before

There was a many-to-one relationship between "inputs" and "chunks." Multiple inputs could be routed to a single chunk via the inputs_per_chunk parameter in XarrayZarrRecipe. This is appropriate for scenarios where we have many small NetCDF files as inputs.

Now

We also allow one-to-many relationships between inputs and chunks. This is accomplished via the subset_inputs parameter, a dictionary, e.g. {"time": 5}, that tells us to subset each input into 5 distinct chunks along the time dimension. This only works if there are at least 5 items along the time axis in each file. It also doesn't make sense to combine this with inputs_per_chunk > 1, since the two options would effectively cancel each other out.
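A minimal sketch of the new option in practice (the FilePattern instance named pattern is assumed to already exist; the dimension name and subset factor are illustrative):

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Split every input file into 5 chunks along its "time" dimension
# (requires each file to have at least 5 items along "time").
recipe = XarrayZarrRecipe(pattern, subset_inputs={"time": 5})

# The pre-existing many-to-one behavior is still available, but combining it
# with subset_inputs is not meaningful:
# recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=2)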

How

To support this change, I had to refactor the internal indexing logic considerably. A big technical change is that InputKey and ChunkKey are no longer tuples of integers but rather tuples of a new type called DimIndex:

from dataclasses import dataclass

@dataclass(frozen=True)
class DimIndex:
    name: str
    index: int
    sequence_len: int
    operation: CombineOp  # CombineOp is an enum defined elsewhere in pangeo_forge_recipes

We were already using these keys implicitly to encode lots of information, e.g. an input or chunk's position in the sequence. These keys are now more verbose and explicit.
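For illustration only (this key is hypothetical, and the CombineOp member name is my assumption rather than something taken from this PR), a chunk key for the third of ten chunks along the time sequence could look like:

# Hypothetical example: third chunk of ten along the "time" dimension.
chunk_key = (
    DimIndex(name="time", index=2, sequence_len=10, operation=CombineOp.CONCAT),
)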

Another change is that, rather than storing the mapping between chunks and inputs in a static dictionary inside the recipe class, we determine it via a pure function:

def inputs_for_chunk(
    chunk_key: ChunkKey, inputs_per_chunk: int, ninputs: int
) -> Sequence[InputKey]:
    ...

This actually really simplifies the XarrayZarrRecipe initialization logic! 🎉
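A hedged usage sketch, reusing the hypothetical chunk_key from above (only the function signature comes from this PR; the argument values are illustrative):

# With no subsetting and inputs_per_chunk=1, each chunk maps back to exactly one input.
for input_key in inputs_for_chunk(chunk_key, inputs_per_chunk=1, ninputs=10):
    print(input_key)  # an InputKey, i.e. also a tuple of DimIndex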

Review

I don't expect anyone to really be able to thoroughly review this sort of monster PR. Perhaps someone could at least look over the API changes and tests?

TODO

  • Check all docstrings for consistency
  • Check if tutorials need updating
  • Add a new tutorial that does subsetting

@cisaacstern (Member) left a comment


Flagged one logger typo in-line. Will follow-up on the conversation thread with results from my tests against FESOM and eNATL60 datasets.

@cisaacstern (Member) commented:

This is really cool, Ryan. And even better: it appears to work. 🎉 🏆 🎉

Following your instructions above, I pushed pangeo-forge/staged-recipes@4bf58e8 to the FESOM recipe (which, as noted in pangeo-forge/staged-recipes#52 (comment), was previously blocked by #93). With pangeo-forge-recipes installed from e379f4f (my install is one commit behind this PR!), I've now stored about one third of the chunks for FESOM to OSN with no issues.

Noting for future reference to myself or others: for situations where a pre-#166 install of pangeo-forge-recipes was used to cache inputs, the metadata cache (if there is one) will need to be rebuilt before storing chunks with a post-#166 install:

from pangeo_forge_recipes.recipes.xarray_zarr import cache_input_metadata

for input_name in rec.iter_inputs():
    cache_input_metadata(
        input_name,
        file_pattern=rec.file_pattern,
        input_cache=rec.input_cache,
        cache_inputs=rec.cache_inputs,
        copy_input_to_local_file=rec.copy_input_to_local_file,
        xarray_open_kwargs=rec.xarray_open_kwargs,
        delete_input_encoding=rec.delete_input_encoding,
        process_input=rec.process_input,
        metadata_cache=rec.metadata_cache,
    )

I'll try the eNATL60 recipe now and report back shortly.

@cisaacstern (Member) commented:

Update: the FESOM build using e379f4f is now complete, and all chunks appear to be initialized as expected:

import s3fs
import zarr

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url},)

fesom = "s3://Pangeo/pangeo-forge/swot_adac/FESOM/surf/fma.zarr"
group = zarr.open_consolidated(fs_osn.get_mapper(fesom))

for a in group.arrays():
    print(
        str(group[a[0]].info).split("Type")[0][:-1],
        str(group[a[0]].info).split("FSMap")[1], 
    )
Name               : /dflux 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /lat 
No. bytes          : 8000 (7.8K)
Chunks initialized : 1/1

Name               : /lon 
No. bytes          : 8000 (7.8K)
Chunks initialized : 1/1

Name               : /ssh 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /sss 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /sst 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /time 
No. bytes          : 17280 (16.9K)
Chunks initialized : 216/216

Name               : /tx_sur 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /ty_sur 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /u_surf 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Name               : /v_surf 
No. bytes          : 17280000000 (16.1G)
Chunks initialized : 216/216

Will now re-install from d3e8b2c and try eNATL60.

@cisaacstern (Member) commented:

I've proposed some additional logging in rabernat#2. Is this the best way to suggest edits to a big refactor like this (a pull request to the PR branch)? It seemed better than committing directly to rabernat:subset_chunks, so as not to interfere with the main flow of the work.

@rabernat marked this pull request as ready for review on July 21, 2021 at 11:07
@rabernat (Contributor, Author) commented:

I think this is ready. Any sort of review would be appreciated.

In particular, it's important for @pangeo-forge/dev-team to grok the changes in indexing, summarized in the docs. The basic API has not changed, but the indexing objects returned by Recipe.iter_inputs() and Recipe.iter_chunks() are now a more complex data type. If you are used to constructing these indexes manually as tuples, then those workflows will break.
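For example, a quick way to inspect the new keys (a sketch, assuming an already-instantiated recipe rec):

# Each key is now a tuple of DimIndex rather than a tuple of integers.
for chunk_key in rec.iter_chunks():
    for dim_index in chunk_key:
        print(dim_index.name, dim_index.index, dim_index.sequence_len, dim_index.operation)
    break  # look at just the first key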

@cisaacstern (Member) commented Jul 23, 2021

Regarding "the changes in indexing": these make sense, and aesthetically seem to bring us into closer alignment with xarray's named dimensions. As such, I imagine they actually reduce the likelihood of human error. Very cool.

The only change I'd recommend (beyond the inline typo fix, above) is that we merge some version of rabernat#2 into this PR. I've just updated it to reflect your comments, which led to a much nicer implementation. Here's the logging it now returns for the eNATL60 example, using the default threshold of 500 MB:

pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Converting variable votemper of 5346601680 bytes to `numpy.ndarray`
pangeo_forge_recipes.recipes.xarray_zarr - WARNING - Variable votemper of 5346601680 bytes is 10.69 times larger than specified maximum variable array size of 500000000 bytes. Consider re-instantiating recipe with `subset_inputs = {"time_counter": 11}`. If `len(ds["time_counter"])` < 11, substitute "time_counter" for any name in ds["votemper"].dims with length >= 11 or consider subsetting along multiple dimensions. Setting PANGEO_FORGE_MAX_MEMORY env variable changes the variable array size which will trigger this warning.

and with a non-default setting supplied via an env variable:

import os
os.environ["PANGEO_FORGE_MAX_MEMORY"] = "100_000_000"

pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Converting variable votemper of 5346601680 bytes to `numpy.ndarray`
pangeo_forge_recipes.recipes.xarray_zarr - WARNING - Variable votemper of 5346601680 bytes is 53.47 times larger than specified maximum variable array size of 100000000 bytes. Consider re-instantiating recipe with `subset_inputs = {"time_counter": 54}`. If `len(ds["time_counter"])` < 54, substitute "time_counter" for any name in ds["votemper"].dims with length >= 54 or consider subsetting along multiple dimensions. Setting PANGEO_FORGE_MAX_MEMORY env variable changes the variable array size which will trigger this warning.

Without something like this, users will encounter silent kernel crashes without being informed that subset_inputs can resolve the problem for them.
