# Aggregation

Xarray-Beam can perform efficient distributed data aggregation in the "map-reduce" model. 

Currently, the only built-in aggregation is `Mean`, but we would welcome contributions in this area.

## Mean

The `Mean` transformation comes in two forms: {py:class}`Mean.Globally <xarray_beam.Mean.Globally>` and {py:class}`Mean.PerKey <xarray_beam.Mean.PerKey>`. The implementation is based on a Beam [`CombineFn`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally/#example-4-combining-with-a-combinefn).
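For readers unfamiliar with the `CombineFn` pattern, here is a minimal sketch of a mean combiner in plain Python, operating on floats. `MeanCombinerSketch` and `accumulate` are hypothetical names used only for illustration. The four method names mirror those of `beam.CombineFn`, but this is not Xarray-Beam's actual implementation, which operates on `xarray.Dataset` objects.

```python
from functools import reduce


class MeanCombinerSketch:
    """Hypothetical stand-in that mirrors the method names of beam.CombineFn."""

    def create_accumulator(self):
        # (running sum, number of inputs seen so far)
        return (0.0, 0)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return (total + element, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')


def accumulate(combiner, elements):
    """Folds one worker's elements into a single partial accumulator."""
    return reduce(combiner.add_input, elements, combiner.create_accumulator())


combiner = MeanCombinerSketch()
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Three "workers" each reduce their own slice of the data.
partials = [
    accumulate(combiner, values[i:i + 3])
    for i in range(0, len(values), 3)
]

# Partial results are merged pairwise, as in a tree, before the final division.
left = combiner.merge_accumulators(partials[:2])
merged = combiner.merge_accumulators([left, partials[2]])
tree_mean = combiner.extract_output(merged)
```

Because only a sum and a count travel between workers, partial results can be merged in any grouping or order, which is what makes this kind of aggregation easy to distribute.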

Note that these transformations are (currently) modelled on [`beam.Mean`](https://beam.apache.org/documentation/transforms/python/aggregation/mean/) rather than {py:meth}`xarray.Dataset.mean`: they compute averages over sequences of `xarray.Dataset` objects or (`key`, `xarray.Dataset`) pairs, rather than calculating an average over an existing Xarray dimension or based on `xarray_beam.Key` objects, e.g.,

In [1]:
import numpy as np
import xarray_beam as xbeam
import xarray

datasets = [
    xarray.Dataset({'foo': ('x', np.random.randn(3))})
    for _ in range(100)
]
datasets | xbeam.Mean.Globally()



[<xarray.Dataset>
 Dimensions:  (x: 3)
 Dimensions without coordinates: x
 Data variables:
     foo      (x) float64 0.09011 -0.0632 -0.04796]

Notice how existing dimensions on each dataset are unchanged by the transformation. If you want to average over an existing dimension, you would need to do that aggregation yourself, e.g., by averaging inside each chunk before combining the data.
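One caveat when averaging inside each chunk first can be illustrated with a plain-Python sketch (no Beam or Xarray; the lists are made-up stand-ins for chunks along the dimension being averaged). It assumes that the combining step gives every input equal weight, as `beam.Mean` does for its elements. Under that assumption, the mean of per-chunk means matches the overall mean only when all chunks have the same length; carrying a sum and a count per chunk is one way to keep the result exact otherwise.

```python
from statistics import fmean

# Hypothetical chunks along the dimension to be averaged, of unequal length.
chunks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0]]

# Reference: mean over all values, ignoring chunk boundaries.
overall_mean = fmean(value for chunk in chunks for value in chunk)

# Averaging inside each chunk first, then weighting every chunk equally.
mean_of_chunk_means = fmean(fmean(chunk) for chunk in chunks)

# Carrying (sum, count) per chunk keeps the combined result exact.
total = sum(sum(chunk) for chunk in chunks)
count = sum(len(chunk) for chunk in chunks)
weighted_mean = total / count
```

With equally sized chunks the two approaches agree, so the simpler per-chunk mean is sufficient in that case.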

Similarly, the keys fed into `xbeam.Mean.PerKey` can be any hashables, including but not limited to `xbeam.Key`:

In [2]:
ds = xarray.tutorial.open_dataset('air_temperature')

datasets = [
    (time.dt.season.item(), ds.sel(time=time).mean())
    for time in ds.time
]
datasets | xbeam.Mean.PerKey()



[('DJF',
  <xarray.Dataset>
  Dimensions:  ()
  Data variables:
      air      float64 273.6),
 ('MAM',
  <xarray.Dataset>
  Dimensions:  ()
  Data variables:
      air      float64 279.0),
 ('JJA',
  <xarray.Dataset>
  Dimensions:  ()
  Data variables:
      air      float64 289.2),
 ('SON',
  <xarray.Dataset>
  Dimensions:  ()
  Data variables:
      air      float64 283.0)]

For an example of using `Mean.PerKey` at scale, take a look at the [ERA5 climatology example](https://github.com/google/xarray-beam/blob/main/examples/era5_climatology.py).

## Custom aggregation

The "tree reduction" algorithm used by the combiner inside `Mean` is great, but it isn't the only way to aggregate a dataset with Xarray-Beam.

In many cases, the easiest way to scale up an aggregation pipeline is to make use of [rechunking](rechunking.ipynb) to convert the many small datasets inside your pipeline into a form that is easier to work with. For example, here's how one could compute the `median`, which is a notoriously difficult statistic to calculate with distributed algorithms:

In [3]:
import apache_beam as beam

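# Source chunks split the data along time and keep lat/lon whole.
source_chunks = {'time': 100, 'lat': -1, 'lon': -1}
# Working chunks keep the full time axis together so a median over time can be
# computed independently inside each chunk.
working_chunks = {'lat': 10, 'lon': 10, 'time': -1}

with beam.Pipeline() as p:
    (
        p
        | xbeam.DatasetToChunks(ds, source_chunks)
        | xbeam.Rechunk(ds.sizes, source_chunks, working_chunks, itemsize=4)
        # Reduce over time; the key no longer carries a time offset.
        | beam.MapTuple(lambda k, v: (k.with_offsets(time=None), v.median('time')))
        # Stitch the lat/lon pieces back into a single dataset.
        | xbeam.ConsolidateChunks({'lat': -1, 'lon': -1})
        | beam.MapTuple(lambda k, v: print(v))
    )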



<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
    air      (lat, lon) float32 261.3 261.1 260.9 260.3 ... 297.3 297.3 297.3
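
To see why the pipeline above rechunks so that each chunk holds the full `time` axis, rather than combining per-chunk results, note that chunk medians cannot in general be merged into the overall median the way sums and counts can. A small plain-Python sketch with made-up numbers (not taken from the dataset above):

```python
from statistics import median

# Hypothetical values at one grid point, split into two chunks along time.
time_chunks = [[1.0, 2.0, 9.0], [3.0, 4.0, 5.0, 6.0]]

# Median over the full time series, as computed after rechunking.
all_values = [value for chunk in time_chunks for value in chunk]
true_median = median(all_values)

# Combining per-chunk medians loses the information needed for the exact result.
median_of_medians = median(median(chunk) for chunk in time_chunks)
```

Here the two results differ, which is why an exact median needs all values along `time` in one place.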
