# Reading and writing data

## Read datasets into chunks

There are two main options for loading an `xarray.Dataset` into Xarray-Beam. You can either [create the dataset](data-model.ipynb) from scratch or use the {py:class}`~xarray_beam.DatasetToChunks` transform starting at the root of a Beam pipeline:

In [39]:
import textwrap
import apache_beam as beam
import xarray_beam as xbeam
import xarray

def summarize_dataset(dataset):
    return f'<xarray.Dataset data_vars={list(dataset.data_vars)} dims={dict(dataset.sizes)}>'

def print_summary(key, chunk):
    print(f'{key}\n  with {summarize_dataset(chunk)}')

In [40]:
ds = xarray.tutorial.load_dataset('air_temperature')

In [41]:
with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(ds, chunks={'time': 1000}) | beam.MapTuple(print_summary)



Key(offsets={'lat': 0, 'lon': 0, 'time': 0}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 1000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 2000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 920, 'lon': 53}>
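
The offsets in the keys above follow from simple arithmetic on the dimension sizes. The plain-Python sketch below reproduces them for the `air_temperature` sizes shown in the output; `chunk_offsets` is a hypothetical helper written for illustration, not part of `xarray_beam` and not a description of how `DatasetToChunks` is implemented.

```python
import itertools


def chunk_offsets(sizes, chunks):
    """Yield (offsets, shape) pairs for a regular chunking of `sizes`.

    Hypothetical helper for illustration only. Dimensions that are
    missing from `chunks` are kept as a single chunk.
    """
    dims = sorted(sizes)
    per_dim = []
    for dim in dims:
        step = chunks.get(dim, sizes[dim])
        starts = range(0, sizes[dim], step)
        # The last chunk along a dimension may be shorter than `step`.
        per_dim.append([(start, min(step, sizes[dim] - start)) for start in starts])
    for combo in itertools.product(*per_dim):
        offsets = {dim: start for dim, (start, _) in zip(dims, combo)}
        shape = {dim: length for dim, (_, length) in zip(dims, combo)}
        yield offsets, shape


# Sizes taken from the printed output above (1000 + 1000 + 920 time steps).
sizes = {'lat': 25, 'time': 2920, 'lon': 53}
result = list(chunk_offsets(sizes, {'time': 1000}))
for offsets, shape in result:
    print(offsets, shape)
```

Only `time` is split here, so `lat` and `lon` always start at offset 0 and the final `time` chunk holds the remaining 920 steps.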


Importantly, xarray datasets fed into `DatasetToChunks` **can be lazy**, with data not already loaded eagerly into NumPy arrays. When you feed lazy datasets into `DatasetToChunks`, each individual chunk will be indexed and evaluated separately on Beam workers.

This pattern lets you use Xarray's built-in dataset loaders (e.g., `open_dataset()` and `open_zarr()`) to feed arbitrarily large datasets into Xarray-Beam.

For best performance, set `chunks=None` when opening datasets and then _explicitly_ provide chunks in `DatasetToChunks`:

In [47]:
# write data into the distributed Zarr format
ds.chunk({'time': 1000}).to_zarr('example-data.zarr', mode='w')

# open the Zarr store lazily with Xarray's own indexing (no Dask)
on_disk = xarray.open_zarr('example-data.zarr', chunks=None)

with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(on_disk, chunks={'time': 1000}) | beam.MapTuple(print_summary)



Key(offsets={'lat': 0, 'lon': 0, 'time': 0}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 1000, 'lat': 25, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 1000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 1000, 'lat': 25, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 2000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 920, 'lat': 25, 'lon': 53}>


`chunks=None` tells Xarray to use its built-in lazy indexing machinery instead of Dask. This is advantageous because datasets using Xarray's lazy indexing are serialized much more compactly (via [pickle](https://docs.python.org/3/library/pickle.html)) when passed into Beam transforms.

Alternatively, you can pass in lazy datasets [using dask](http://xarray.pydata.org/en/stable/user-guide/dask.html). In this case, you don't need to explicitly supply `chunks` to `DatasetToChunks`:

In [49]:
on_disk = xarray.open_zarr('example-data.zarr', chunks={'time': 1000})

with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(on_disk) | beam.MapTuple(print_summary)



Key(offsets={'lat': 0, 'lon': 0, 'time': 0}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 1000, 'lat': 25, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 1000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 1000, 'lat': 25, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 2000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'time': 920, 'lat': 25, 'lon': 53}>


Dask's lazy evaluation system is much more general than Xarray's lazy indexing. As long as the resulting dataset can be evaluated independently in each chunk, this can be a very convenient way to set up computation for Xarray-Beam.

Unfortunately, the Dask approach doesn't scale as well. In particular, the overhead of pickling large Dask graphs to pass to Beam workers can be prohibitive for large (typically multi-TB) datasets with millions of chunks. However, a current major effort in Dask on [high level graphs](https://blog.dask.org/2021/07/07/high-level-graphs) should improve this in the near future.
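
To get a feel for why graph size matters, the sketch below counts the tasks in the Dask graph of a small dataset as its chunks shrink. The synthetic dataset, the `+ 1` operation and the chunk sizes are made up for illustration. It assumes `dask` is installed and inspects the graph through the `__dask_graph__()` method of Dask's collection protocol.

```python
import numpy as np
import xarray

# Synthetic stand-in for a real dataset (an assumption for illustration).
ds = xarray.Dataset({'air': (('time',), np.zeros(10_000))})

task_counts = {}
for chunk_size in (1000, 100, 10):
    lazy = ds.chunk({'time': chunk_size}) + 1  # any lazy Dask operation
    graph = lazy.__dask_graph__()
    task_counts[chunk_size] = len(graph)
    print(f'chunk size {chunk_size}: {len(graph)} tasks in the graph')
```

The task count grows with the number of chunks, and the whole graph travels with the dataset when it is pickled, which is where the overhead for datasets with millions of chunks comes from.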

```{note}
We are still figuring out the optimal APIs to facilitate opening data and building lazy datasets in Xarray-Beam. For example, see [this issue](https://github.com/google/xarray-beam/issues/26) for discussion of a higher-level `ZarrToChunks` transform embedding these best practices.
```

## Writing data to Zarr

[Zarr](https://zarr.readthedocs.io/) is the preferred file format for reading and writing data with Xarray-Beam, due to its excellent scalability and support inside Xarray.

{py:class}`~xarray_beam.ChunksToZarr` is Xarray-Beam's API for saving chunks into a Zarr store. 

To get started, you can use it directly:

In [50]:
with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(on_disk) | xbeam.ChunksToZarr('example-data-v2.zarr')



By default, `ChunksToZarr` needs to evaluate and combine the entire distributed dataset in order to determine overall Zarr metadata (e.g., array names, shapes, dtypes and attributes). This is fine for relatively small datasets, but can entail significant additional communication and storage costs for large datasets.

The optional `template` argument lets you specify the structure of the full on-disk dataset in advance, in the form of another lazy `xarray.Dataset`. Like the lazy datasets fed into `DatasetToChunks`, lazy templates can be built up using either Xarray's lazy indexing or lazy operations with Dask. The data _values_ in a `template` will never be written to disk; only the metadata structure is used.

One recommended pattern is to use a lazy Dask dataset consisting of a single value to build up the desired template, e.g.,

In [55]:
ds = xarray.open_zarr('example-data.zarr', chunks=None)
template = xarray.zeros_like(ds.chunk())  # a single virtual chunk of all zeros

Xarray operations like indexing and expanding dimensions (see {py:meth}`xarray.Dataset.expand_dims`) are entirely lazy on this dataset, which makes it relatively straightforward to build up a `Dataset` with the required variables and dimensions, e.g., as used in the [ERA5 climatology example](https://github.com/google/xarray-beam/blob/main/examples/era5_climatology.py).
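
As a minimal sketch of this pattern, the example below builds a template for a hypothetical output that replaces the `time` dimension with new `month` and `hour` dimensions. The small in-memory `source` dataset and the target layout are assumptions made for illustration. It assumes `dask` is installed.

```python
import dask.array
import numpy as np
import xarray

# Small in-memory stand-in for a dataset opened from disk (an assumption).
source = xarray.Dataset(
    {'air': (('time', 'lat', 'lon'), np.zeros((8, 3, 4)))},
    coords={'time': np.arange(8)},
)

# A single virtual chunk of zeros with the same structure as `source`.
template = xarray.zeros_like(source.chunk())

# Drop `time` by indexing, then add the new dimensions. Both steps are lazy.
new_template = template.isel(time=0, drop=True).expand_dims(
    month=np.arange(1, 13), hour=np.arange(24)
)

print(new_template['air'].dims, new_template['air'].shape)
# The variable is still backed by a lazy Dask array, so no values were computed.
print(isinstance(new_template['air'].data, dask.array.Array))
```

A template built this way would then be handed to `ChunksToZarr` through its `template` argument.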