# Rechunking

Rechunking lets us redistribute how datasets are split between variables and chunks.

We'll recreate our dummy data from the data model tutorial:

In [1]:
import apache_beam as beam
import numpy as np
import xarray_beam as xbeam
import xarray



In [2]:
def create_records():
    for offset in [0, 4]:
        key = xbeam.Key({'x': offset, 'y': 0})
        data = 2 * offset + np.arange(8).reshape(4, 2)
        chunk = xarray.Dataset({
            'foo': (('x', 'y'), data),
            'bar': (('x', 'y'), 100 + data),
        })
        yield key, chunk
        
inputs = list(create_records())

### Adjusting variables

The simplest transformation is splitting (or consolidating) different _variables_ in a Dataset with `SplitVariables()` and `ConsolidateVariables()`, e.g.,

In [3]:
inputs | xbeam.SplitVariables()





[(Key(offsets={'x': 0, 'y': 0}, vars={'foo'}),
  <xarray.Dataset>
  Dimensions:  (x: 4, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 0 1 2 3 4 5 6 7),
 (Key(offsets={'x': 0, 'y': 0}, vars={'bar'}),
  <xarray.Dataset>
  Dimensions:  (x: 4, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      bar      (x, y) int64 100 101 102 103 104 105 106 107),
 (Key(offsets={'x': 4, 'y': 0}, vars={'foo'}),
  <xarray.Dataset>
  Dimensions:  (x: 4, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 8 9 10 11 12 13 14 15),
 (Key(offsets={'x': 4, 'y': 0}, vars={'bar'}),
  <xarray.Dataset>
  Dimensions:  (x: 4, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      bar      (x, y) int64 108 109 110 111 112 113 114 115)]

### Adjusting chunks

You can also adjust _chunks_ in a dataset to distribute arrays of different sizes. Here you have two choices of API:

1. The lower-level {py:class}`~xarray_beam.SplitChunks` and {py:class}`~xarray_beam.ConsolidateChunks`. These transformations apply a single splitting (with indexing) or consolidation (with {py:func}`xarray.concat`) function to array elements.
2. The high-level {py:class}`~xarray_beam.Rechunk`, which uses a pipeline of multiple split/consolidate steps to efficiently rechunk a dataset.

For minor adjustments, the more explicit `SplitChunks()` and `ConsolidateChunks()` are good options. They take a dict of _desired_ chunk sizes as a parameter, which can also be `-1` to indicate "no chunking" along a dimension:

In [4]:
inputs | xbeam.ConsolidateChunks({'x': -1})



[(Key(offsets={'x': 0, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 8, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      bar      (x, y) int64 100 101 102 103 104 105 ... 110 111 112 113 114 115)]

Note that because these transformations only split _or_ consolidate, they cannot necessarily fully rechunk a dataset in a single step: consolidation alone is not enough if the new chunk sizes are not multiples of the old chunk sizes, and splitting alone is not enough if the new chunk sizes do not evenly divide the old ones, e.g.,

In [5]:
inputs | xbeam.SplitChunks({'x': 5})  # notice that the first two chunks are still separate!



[(Key(offsets={'x': 0, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 4, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 0 1 2 3 4 5 6 7
      bar      (x, y) int64 100 101 102 103 104 105 106 107),
 (Key(offsets={'x': 4, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 1, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 8 9
      bar      (x, y) int64 108 109),
 (Key(offsets={'x': 5, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 3, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 10 11 12 13 14 15
      bar      (x, y) int64 110 111 112 113 114 115)]
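
The offsets and sizes in this output can be reproduced with a small plain-Python sketch. It assumes that splitting cuts each existing chunk at every multiple of the requested size and never merges across existing chunk boundaries. `split_offsets` is a hypothetical helper written for illustration; it is not part of xarray_beam and only models the bookkeeping along one dimension.

```python
def split_offsets(source_offsets, source_sizes, target_size):
    """Return (offset, size) pairs after cutting each chunk at multiples of target_size."""
    pieces = []
    for start, size in zip(source_offsets, source_sizes):
        stop = start + size
        # Cut points: the chunk's own edges plus every multiple of
        # target_size that falls strictly inside the chunk.
        first_multiple = (start // target_size + 1) * target_size
        cuts = [start] + list(range(first_multiple, stop, target_size)) + [stop]
        pieces.extend((lo, hi - lo) for lo, hi in zip(cuts[:-1], cuts[1:]))
    return pieces


# Two chunks of size 4 along x, split with a requested size of 5.
print(split_offsets([0, 4], [4, 4], 5))  # [(0, 4), (4, 1), (5, 3)]
```

The boundary at `x=4` survives because a split never joins neighbouring chunks, which is why the pieces have sizes 4, 1 and 3 rather than 5 and 3.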

For such uneven cases, you'll need to use split followed by consolidate:

In [6]:
inputs | xbeam.SplitChunks({'x': 5}) | xbeam.ConsolidateChunks({'x': 5})



[(Key(offsets={'x': 0, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 5, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 0 1 2 3 4 5 6 7 8 9
      bar      (x, y) int64 100 101 102 103 104 105 106 107 108 109),
 (Key(offsets={'x': 5, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 3, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 10 11 12 13 14 15
      bar      (x, y) int64 110 111 112 113 114 115)]
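
Why the consolidate step now succeeds can be sketched in plain Python. The assumption is that, after splitting at multiples of 5, every piece lies wholly inside one target chunk, so pieces can be grouped by the target offset that contains them. `consolidate_offsets` is a hypothetical helper for illustration only; the actual transform combines the underlying arrays with `xarray.concat` rather than just summing sizes.

```python
from collections import defaultdict


def consolidate_offsets(pieces, target_size):
    """Group (offset, size) pieces by their containing target chunk and merge each group."""
    groups = defaultdict(int)
    for offset, size in pieces:
        # Round the offset down to the start of the target chunk it belongs to.
        target_offset = (offset // target_size) * target_size
        groups[target_offset] += size
    return sorted(groups.items())


# Pieces produced by splitting two chunks of size 4 at multiples of 5.
pieces = [(0, 4), (4, 1), (5, 3)]
print(consolidate_offsets(pieces, 5))  # [(0, 5), (5, 3)]
```

The last chunk has size 3 rather than 5 because the dimension has only 8 elements in total.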

Alternatively, `Rechunk()` applies multiple split and consolidate steps based on the [Rechunker](https://github.com/pangeo-data/rechunker) algorithm:

In [7]:
inputs | xbeam.Rechunk(dim_sizes={'x': 8}, source_chunks={'x': 4}, target_chunks={'x': 5}, itemsize=8)



[(Key(offsets={'x': 0, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 5, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 0 1 2 3 4 5 6 7 8 9
      bar      (x, y) int64 100 101 102 103 104 105 106 107 108 109),
 (Key(offsets={'x': 5, 'y': 0}, vars=None),
  <xarray.Dataset>
  Dimensions:  (x: 3, y: 2)
  Dimensions without coordinates: x, y
  Data variables:
      foo      (x, y) int64 10 11 12 13 14 15
      bar      (x, y) int64 110 111 112 113 114 115)]

`Rechunk` requires specifying a few more parameters, but based on that information it can be _much_ more efficient for more complex rechunking tasks, particularly in cases where data needs to be distributed into a very different shape (e.g., distributing a matrix across rows vs. columns). A naive "splitting" approach in such cases could divide datasets into extremely small tasks corresponding to individual array elements, which adds a huge amount of overhead.
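
To get a feel for the overhead of the naive approach, the following plain-Python sketch counts how many pieces a single split step would produce if it cut at every source and target chunk boundary. The helper names and the 1000 x 1000 matrix are assumptions chosen for illustration; this is not the Rechunker algorithm itself, only an estimate of what it avoids.

```python
import math


def count_split_pieces(dim_size, source_chunk, target_chunk):
    """Count pieces along one dimension after cutting at all source and target boundaries."""
    edges = set(range(0, dim_size, source_chunk)) | set(range(0, dim_size, target_chunk))
    return len(edges)


def naive_piece_count(dim_sizes, source_chunks, target_chunks):
    """Multiply the per-dimension piece counts to get the total number of pieces."""
    return math.prod(
        count_split_pieces(dim_sizes[d], source_chunks[d], target_chunks[d])
        for d in dim_sizes
    )


# Hypothetical matrix stored one row per chunk, to be rewritten one column per chunk.
dim_sizes = {'x': 1000, 'y': 1000}
rows = {'x': 1, 'y': 1000}
cols = {'x': 1000, 'y': 1}
print(naive_piece_count(dim_sizes, rows, cols))  # 1000000
```

Here 1,000 source chunks would be cut into 1,000,000 single-element pieces before being reassembled into 1,000 target chunks. For the small example in this tutorial the same count is only 3, which is why explicit split and consolidate steps are adequate for minor adjustments.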