# Tutorial

Xarray-Beam is a library for writing [Apache Beam](http://beam.apache.org/) pipelines consisting of [xarray](http://xarray.pydata.org) Dataset objects. This tutorial (and Xarray-Beam itself) assumes basic familiarity with both Beam and Xarray.

This tutorial will walk you through the basics of writing a pipeline with Xarray-Beam. We also recommend reading through a few [end to end examples](https://github.com/google/xarray-beam/tree/main/examples) to understand what code using Xarray-Beam typically looks like.

```{note}
Before getting started, it's important to understand that although Xarray-Beam tries to make it _straightforward_ to write distributed pipelines with Xarray objects, it doesn't try to hide the distributed magic inside high-level objects, as [Xarray with Dask](http://xarray.pydata.org/en/stable/user-guide/dask.html) or Dask/Spark DataFrames do.

Xarray-Beam is a lower-level tool. You will be manipulating large datasets piece-by-piece yourself, and you as the developer will be responsible for maintaining Xarray-Beam's internal invariants. This means that to successfully use Xarray-Beam, **you will need to understand how it represents distributed datasets**. This may sound like a lot of responsibility, but we promise it isn't too bad!
```

We'll start off with some standard imports:

In [1]:
import numpy as np
import xarray_beam as xbeam
import xarray



## Keys in Xarray-Beam

Xarray-Beam is designed around the model that every stage in your Beam pipeline _could_ be stored in a single `xarray.Dataset` object, but is instead represented by a distributed Beam `PCollection` of smaller `xarray.Dataset` objects, split up in two possible ways:

- Distinct _variables_ in a Dataset may be separated across multiple records.
- Individual arrays can also be split into multiple _chunks_, similar to those used by [dask.array](https://docs.dask.org/en/latest/array.html).
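
To make these two kinds of splitting concrete, here is a minimal sketch that uses plain xarray only, with no Beam involved. The names `by_variable`, `by_chunk`, `from_variables` and `from_chunks` are made up for illustration; the point is only that either set of pieces describes the same larger dataset.

```python
import numpy as np
import xarray

# A small dataset standing in for the full (virtual) dataset.
ds = xarray.Dataset({
    'foo': ('x', np.arange(6)),
    'bar': ('x', np.arange(10, 16)),
})

# Split by variable: each piece holds a subset of the variables.
by_variable = [ds[['foo']], ds[['bar']]]

# Split into chunks: each piece holds a contiguous slice along 'x'.
by_chunk = [ds.isel(x=slice(start, start + 3)) for start in (0, 3)]

# Either set of pieces can be recombined into the original dataset.
from_variables = xarray.merge(by_variable)
from_chunks = xarray.concat(by_chunk, dim='x')
```

In a real pipeline the pieces would live in separate records of a `PCollection` rather than in a Python list, which is why each record needs a key saying where it belongs.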

To keep track of how individual records could be combined into a larger (virtual) dataset, Xarray-Beam defines a `Key` object. Key objects consist of:

1. `offsets`: integer offsets of each chunk from the origin, stored in an `immutabledict`.
2. `vars`: the subset of variables included in each chunk, either as a `frozenset`, or as `None` to indicate "all variables".

Making a `Key` from scratch is simple:

In [2]:
key = xbeam.Key({'x': 0, 'y': 10}, vars=None)
key

Key(offsets={'x': 0, 'y': 10}, vars=None)

Or given an existing `Key`, you can easily modify it with `replace()` or `with_offsets()`:

In [3]:
key.replace(vars={'foo', 'bar'})

Key(offsets={'x': 0, 'y': 10}, vars={'foo', 'bar'})

In [4]:
key.with_offsets(x=None, z=1)

Key(offsets={'y': 10, 'z': 1}, vars=None)

`Key` objects don't do very much. They are just simple structs with two attributes, along with various special methods required to use them as `dict` keys or as keys in Beam pipelines. You can find more examples of manipulating keys in the `Key` docstring.

## Creating PCollections

The standard inputs & outputs for Xarray-Beam are PCollections of `(xbeam.Key, xarray.Dataset)` pairs. Xarray-Beam provides a bunch of PTransforms for typical tasks, but many pipelines will still involve some manual manipulation of `Key` and `Dataset` objects, e.g., with builtin Beam transforms like `beam.Map`.

To start off, let's write a helper function for creating our first collection from scratch:

In [7]:
def create_records():
    for offset in [0, 3]:
        key = xbeam.Key({'x': offset})
        chunk = xarray.Dataset({
            'foo': ('x', offset + np.arange(3)),
            'bar': ('x', 10 + offset + np.arange(3)),
        })
        yield key, chunk

In practice, we'd typically feed this into `beam.Create()` to start our pipeline, e.g., `beam.Create(create_records())`. But for now, let's take a look at the two entries:

In [14]:
inputs = list(create_records())

In [9]:
inputs

[(Key(offsets={'x': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 3)
  Dimensions without coordinates: x
  Data variables:
      foo      (x) int64 0 1 2
      bar      (x) int64 10 11 12),
 (Key(offsets={'x': 3}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 3)
  Dimensions without coordinates: x
  Data variables:
      foo      (x) int64 3 4 5
      bar      (x) int64 13 14 15)]

```{note}
If desired, we could have set `vars={'foo', 'bar'}` on each of these `Key` objects instead of `vars=None`. This would be an equally valid representation of the same records, since all of our datasets have the same variables.
```

In practice, it's common to have an existing lazy `xarray.Dataset` that needs to be split into a bunch of separate chunks for the inputs to the Beam pipeline. This is where the `DatasetToChunks` transform comes in handy, e.g.,

In [16]:
ds = xarray.Dataset({'foo': ('x', np.arange(6)), 'bar': ('x', np.arange(10, 16))})

# this object could be used as the root of a Beam pipeline
xbeam.DatasetToChunks(ds, chunks={'x': 3})

<DatasetToChunks(PTransform) label=[DatasetToChunks] at 0x7fed459ed6a0>

TODO: finish this!

- Discuss the nuances of feeding in Dask datasets into DatasetToChunks
- Discuss options for lazy datasets: xarray's lazy indexing vs dask
- ChunksToZarr (including `template`)
- Fancy algorithms for rechunking