# Size and Shape of Chunks


**OBJECTIVE**:  
The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. 

This notebook will touch on chunking considerations: the factors to think about when deciding how to break up the data into chunks, and why you would want to do that to begin with. 

In [1]:
import os
import logging
import xarray as xr
logging.basicConfig(level=logging.INFO, force=True)


In [2]:
%run ../utils.ipynb
_versions(['xarray'])

Python     : 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
xarray     : 2023.3.0


## Why Chunk
The simple reason is that the full dataset won't fit in memory. It has to be
divided in some way so that only those parts of the data being actively worked
are loaded. 

This has other benefits when it comes to parallel algorithms.  If work can be
performed in parallel, it is easy to set it up such that a separate worker is
assigned to each chunk of the data. 
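
As a toy illustration of the one-worker-per-chunk idea, the sketch below splits a plain Python list into fixed-size chunks and hands each chunk to a thread-pool worker. The helper names (`split_into_chunks`, `chunk_sum`) and the sizes are made up for this example, and a real workflow would more likely use dask than a thread pool. The point is only that each worker touches nothing but its own chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Return consecutive slices of `values`, each at most `chunk_size` long."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def chunk_sum(chunk):
    """Work done by one worker: it only ever sees its own chunk."""
    return sum(chunk)


data = list(range(1000))               # hypothetical stand-in for a large array
chunks = split_into_chunks(data, 100)  # 10 chunks of 100 values each

# One task per chunk; the pool runs the tasks concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

# Combining the per-chunk results gives the same answer as one big pass.
total = sum(partial_sums)
print(len(chunks), total == sum(data))
```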

## Example Data
We're going to keep looking at the sample PRISM dataset, as read from 
an OpenDAP endpoint: 

In [3]:
# INPUT: 
OPENDAP_url = 'https://cida.usgs.gov/thredds/dodsC/prism_v2'

In [4]:
ds_in = xr.open_dataset(OPENDAP_url, decode_times=False)
ds_in.tmn

Given what we know about this data, we can apply some cloud storage principles to form a strategy for how best to chunk the data when we write it to S3. Broadly, we need to specify chunk **size** and chunk **shape**. 


## Shape Considerations

Shape refers to how to divide an array along each dimension.  So we will need to decide on the chunk size for each of the dimensions of this data.  

The preferred shape of each chunk will depend on the read pattern for future analyses.
We will be chunking the data so that future reads will be performant -- and that depends on whether
analyses favor one dimension over another.  For some datasets, this will be very apparent. NWIS gage 
data, for example, will very likely be consumed along the `time` dimension most often: it is _more likely_
that future analysis of this data will take a time series for a given gage than all
gage data for a given time.  For datasets where there is no clear preference, we can try to chunk 
based on likely read patterns, but allow for other patterns without too much of a performance penalty. 

Let's see how we might do this for our sample dataset.  This data will likely be used in one of two 
dominant read patterns: 

* Time series for a given location (or small spatial extent)
  * As a special case -- is it likely that time series will be subset by a logical unit? e.g. will this
    monthly data be consumed in blocks of 12 (yearly)? 
* Full extent for a given point in time. 
  * As a special case -- are specific study areas more used than others? 
  
Let's look at a couple of options for space and time chunking: 
  

### Time Dimension

The example dataset has 1512 time steps.  What happens if we chunk in groups of 
twelve (i.e. a year at a time)?

In [5]:
print("We need {} chunks.".format(1512 // 12))

We need 126 chunks.


In this case, a user could get a single year of this monthly data as a single chunk.
If they wanted a full time series across the entire dataset, they would need to read 
126 chunks. 

So this is where the judgement call gets made -- which is the more likely read pattern 
for time: year-by-year, or the whole time set (or some sequence of a few years)? In 
this case, I think users are more likely to want more than just one year's data.  A happy 
medium for chunk size is 6 years of data per chunk: 

In [6]:
test_chunk_size = 12*6
print("TIME chunking: {} chunks of size {}".format(1512 / test_chunk_size, test_chunk_size))


TIME chunking: 21.0 chunks of size 72


This pattern means only 21 chunks (instead of the 126 chunks we were considering a moment ago) 
for a full time series in a given location. If we assume the rule-of-thumb latency for reading
and processing a chunk (100ms per read as the theoretical expectation -- see "size" below), 
those 21 chunks can be read by a single worker in:

In [7]:
print("Expected latency in seconds: ", (21 * 100) * 0.001)

Expected latency in seconds:  2.1


Note that for cluster-aware analyses, multiple chunks can be read at the same time. Total wall-clock time will be reduced in that case. With 21 chunks, maximum (theoretical) parallelism would be achieved with 21 cooperating workers.
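
The arithmetic above can be wrapped in a small helper to compare time-chunk lengths and worker counts side by side. This is only a back-of-the-envelope sketch: it assumes the 100ms rule of thumb per chunk read, assumes every worker reads its chunks one after another, and ignores scheduling overhead. The function name is invented for the example.

```python
import math

LATENCY_MS = 100  # rule-of-thumb cost per chunk read, taken from the text


def time_series_read_estimate(n_steps, chunk_len, n_workers=1):
    """Return (chunks needed, estimated wall-clock seconds) for a full time series."""
    n_chunks = math.ceil(n_steps / chunk_len)
    # Each worker reads its share of chunks sequentially; workers run side by side.
    rounds = math.ceil(n_chunks / n_workers)
    return n_chunks, rounds * LATENCY_MS / 1000


for chunk_len in (12, 72):
    for n_workers in (1, 21):
        n_chunks, seconds = time_series_read_estimate(1512, chunk_len, n_workers)
        print(f"chunk_len={chunk_len:>2} workers={n_workers:>2}: "
              f"{n_chunks:>3} chunks, ~{seconds:.1f} s")
```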

We'll move forward with the time dimension chunked into groups of **72** for this dataset.

### Space Dimensions

We're really chunking in dimensions -- and there are two dimensions to this dataset which 
contribute to "space": `lat` and `lon`.  These can have separate chunk sizes. The question 
to ask is whether future users of this data will want square "tiles" in space, or will 
they want proportionally-sized longitude and latitude chunks?  That is, is it important that the 
North-South extent be broken into the same number of chunks as the East-West extent?
I'll be breaking this into square tiles, so there will be more `lon` chunks than `lat` chunks: 

In [8]:
# The size of the dataset: 
lon=1405
lat=621
test_chunk_size = lat // 4 # split the smaller of the two dimensions into 4 chunks
print("LON chunking: {} chunks of size {}".format(lon / test_chunk_size, test_chunk_size))
print("LAT chunking: {} chunks of size {}".format(lat / test_chunk_size, test_chunk_size))

LON chunking: 9.064516129032258 chunks of size 155
LAT chunking: 4.006451612903226 chunks of size 155


It is important to note that we have **just over** a round number of chunks.
Having `9.06` longitude chunks means we will have `10` chunks in practice, but 
that last one is not full-sized. In this case, this means that the last chunk 
in the given dimension will be extremely thin. 

In the case of the latitude chunk size, the extra `0.006` of a chunk means that
there are `5` chunks in practice, and the last, fractional, chunk is only one `lat` observation. 

In [9]:
#   remainder   chunksize
0.0064516129 * 155

0.9999999995

This all but guarantees that two chunks are needed for a small spatial extent 
near the "end" of the `lat` dimension. Ideally, we would want partial chunks 
to be at least half the size of the standard chunk.  The bigger that 'remainder' 
fraction, the better. 
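
To make the "sliver" check concrete, here is a minimal sketch of a helper that reports how many chunks a dimension splits into and how long the final chunk is. The function name is hypothetical, and the half-chunk threshold is simply the guideline stated above.

```python
import math


def split_dimension(dim_len, chunk_len):
    """Return (number of chunks, length of the last chunk) for one dimension."""
    n_chunks = math.ceil(dim_len / chunk_len)
    last_len = dim_len - (n_chunks - 1) * chunk_len
    return n_chunks, last_len


for name, dim_len in (("lat", 621), ("lon", 1405)):
    n_chunks, last_len = split_dimension(dim_len, 155)
    ok = last_len >= 155 / 2  # the "at least half a chunk" guideline
    print(f"{name}: {n_chunks} chunks, last chunk is {last_len} long, ok={ok}")
```

With a chunk length of 155, both dimensions fail the guideline: the final `lat` chunk is 1 observation long and the final `lon` chunk is 10.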

Let's adjust numbers a little so that we don't have that sliver.  We're still
committed to square tiles, so let's try a larger chunk size to change the size
of that last fraction: 

In [10]:
test_chunk_size = 160
print("LON chunking: {} chunks of size {}".format(lon / test_chunk_size, test_chunk_size))
print("LAT chunking: {} chunks of size {}".format(lat / test_chunk_size, test_chunk_size))

LON chunking: 8.78125 chunks of size 160
LAT chunking: 3.88125 chunks of size 160


With this pattern, the "remainder" latitude chunk will be 141 in the `lat` dimension (125 
for the last `lon` chunk).  All others will be a square 160 observations in both directions.

This amounts to a 9x4 chunk grid, with the last chunk in each direction being partial. 

The entire spatial extent for a single time step can be read in 36 chunks, with this pattern. 
That feels a little high to me, given that this dataset will likely be taken at full extent
for a typical analysis.  Let's go a little bigger to see what that gets us: 


In [11]:
test_chunk_size = 354
print("LON chunking: {} chunks of size {}".format(lon / test_chunk_size, test_chunk_size))
print("LAT chunking: {} chunks of size {}".format(lat / test_chunk_size, test_chunk_size))

LON chunking: 3.968926553672316 chunks of size 354
LAT chunking: 1.7542372881355932 chunks of size 354


This is not *quite* as good in terms of full-chunk remainders -- but on the other hand, the whole extent 
can be had in only 8 chunks (4 along `lon`, 2 along `lat`).  The smallest remainder (267 of 354 observations in `lat`) is still about 75% of a full chunk length, which is
acceptable. 
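
The candidate tile sizes tried so far can be compared in one pass. The sketch below is only a convenience for the trial-and-error described above, not a recommended optimizer. `tile_summary` is a made-up helper, and it assumes square tiles on the 621 x 1405 grid.

```python
import math

LAT, LON = 621, 1405  # grid size used in the text


def tile_summary(tile):
    """Return (tiles per time step, smallest remainder as a fraction of a tile)."""
    n_tiles = 1
    fractions = []
    for dim_len in (LAT, LON):
        n_chunks = math.ceil(dim_len / tile)
        last_len = dim_len - (n_chunks - 1) * tile
        n_tiles *= n_chunks
        fractions.append(last_len / tile)
    return n_tiles, min(fractions)


for tile in (155, 160, 354):
    n_tiles, worst = tile_summary(tile)
    print(f"tile={tile}: {n_tiles} tiles per time step, "
          f"smallest remainder {worst:.0%} of a tile edge")
```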

Note that if we were really confident that most analyses wanted the full extent, we might
be better off putting the whole `lat` and `lon` dimensions into a single chunk each. This 
would ensure (and **require**) that we read the entire extent any time we wanted any **part** 
of the extent for a given timestep.  Our poor time-series analysis would then be stuck 
reading the entire dataset to get all time values for a single location. `:sadface:`
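
The trade-off can be put in numbers by counting the chunks each read pattern touches under the two layouts. This is a rough sketch under stated assumptions: float32 values (4 bytes), requests aligned to the start of the array, and every touched chunk counted as full-sized, so the MiB figures are upper bounds. The helper names are invented for the example.

```python
import math

SHAPE = (1512, 621, 1405)  # (time, lat, lon) of the example dataset
ITEMSIZE = 4               # bytes per float32 value


def chunks_touched(request, chunks):
    """Chunks read for a request that starts at index 0 in every dimension."""
    return math.prod(math.ceil(r / c) for r, c in zip(request, chunks))


def mib_read(request, chunks):
    """MiB transferred, counting every touched chunk as full-sized (upper bound)."""
    return chunks_touched(request, chunks) * math.prod(chunks) * ITEMSIZE / 2**20


layouts = {
    "tiled (72, 354, 354)": (72, 354, 354),
    "full extent (1, 621, 1405)": (1, 621, 1405),
}
requests = {
    "time series at one cell": (SHAPE[0], 1, 1),
    "one full time step": (1, SHAPE[1], SHAPE[2]),
}

for layout_name, chunks in layouts.items():
    for request_name, request in requests.items():
        n = chunks_touched(request, chunks)
        print(f"{layout_name:<27} {request_name:<24} {n:>5} chunks, "
              f"up to {mib_read(request, chunks):8.1f} MiB")
```

Under the full-extent layout, the single-cell time series touches all 1512 chunks, which is the entire variable. Under the tiled layout it touches 21, although a single time step then pulls in 8 chunks that each hold 72 time steps.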

We're going to move forward here with the `lat` and `lon` dimensions being chunked at **354**
observations each. 


## Size Considerations

Shape is only part of the equation.  Total chunk size matters also.  Size matters mostly
because of how the chunks are stored on disk: the 
retrieval time is influenced by the size of each chunk. Here are some constraints: 

* Files Too Big -- In a zarr dataset, each chunk is stored as a separate binary file. 
  If we need data from a particular chunk, no matter how little or how much, that file gets 
  opened, decompressed, and the whole thing read into memory. A large chunk size means 
  that there may be a lot of data transferred in situations when only a small subset of 
  that chunk's data is actually needed.  It also means there might not be enough chunks 
  to allow the dask workers to stay busy loading data in parallel. 
* Files Too Small -- If the chunk size is too small, the time it takes to read and 
  decompress the data for each chunk can become comparable to the latency of S3 (typically 
  10-100ms). We want the reads to take at least a second or so, so that the latency is
  not a significant part of the overall timing.
  
As a general rule, aim for chunk sizes between 10 and 200MB, depending on shape
and expected read pattern (see above). Expect 100ms latency for each separate
chunk that a process needs to open.
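
To see why small chunks are dominated by latency, the sketch below estimates what share of a single chunk read is spent waiting rather than moving data. The 100ms latency is the rule of thumb above. The transfer rate is purely an assumption chosen for illustration, not a measurement of S3 or of any particular network, and decompression time is ignored.

```python
LATENCY_S = 0.1         # 100 ms per chunk request, the rule of thumb from the text
THROUGHPUT_MB_S = 50.0  # ASSUMPTION: illustrative transfer rate, not measured


def latency_share(chunk_mb):
    """Fraction of one chunk read spent on latency rather than on data transfer."""
    transfer_s = chunk_mb / THROUGHPUT_MB_S
    return LATENCY_S / (LATENCY_S + transfer_s)


for chunk_mb in (0.1, 1, 10, 50, 200):
    print(f"{chunk_mb:>5} MB chunk: "
          f"{latency_share(chunk_mb):.0%} of the read time is latency")
```

Under this assumed rate, a 50MB chunk takes about a second to transfer and latency is under a tenth of the total, while for a 0.1MB chunk nearly all of the time is latency.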

### Total Chunk Size
Now that we have a rough idea of the chunk dimensions, let's compute its size in bytes.  
This will tell us if we've hit our target of between 10 and 200MB per chunk.  More 
importantly, it will tell us if we will overwhelm the OpenDAP server -- the server 
can only give us 500MB at a time. Chunks should really be smaller than this (which 
we want anyway, but this 500MB limit is a hard cut-off). 


In [12]:
#       lat   lon  time  float32
bytes = 354 * 354 * 72 * 4
kbytes = bytes / (2**10)
mbytes = kbytes / (2**10)
print(f"TMN chunk size: {bytes=} ({kbytes=:.2f}) ({mbytes=:.4f})")

TMN chunk size: bytes=36091008 (kbytes=35245.12) (mbytes=34.4191)


We're looking really good for size.  Maybe even a bit low.
But we're in the (admittedly broad) range of 10-200 megabytes of
uncompressed data (i.e. in-memory) per chunk. 
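
The same calculation generalizes to any chunk shape and data type, which makes it easy to check both the 10-200MB target and the 500MB server limit in one place. A minimal sketch, with hypothetical helper names. It treats the MB figures from the text as MiB for simplicity, and includes an 8-byte type only to show how the result scales with item size.

```python
import math

MIB = 2**20
TARGET_MIB = (10, 200)   # rule-of-thumb range from the text (treated as MiB here)
SERVER_LIMIT_MIB = 500   # per-request limit of the server, from the text


def chunk_mib(chunk_shape, itemsize):
    """Uncompressed size of one chunk in MiB."""
    return math.prod(chunk_shape) * itemsize / MIB


def check_chunk(chunk_shape, itemsize):
    """Return (size in MiB, inside target range?, under server limit?)."""
    size = chunk_mib(chunk_shape, itemsize)
    in_target = TARGET_MIB[0] <= size <= TARGET_MIB[1]
    under_limit = size < SERVER_LIMIT_MIB
    return size, in_target, under_limit


for label, itemsize in (("float32", 4), ("float64", 8)):
    size, in_target, under_limit = check_chunk((72, 354, 354), itemsize)
    print(f"{label}: {size:.2f} MiB, in target range: {in_target}, "
          f"under server limit: {under_limit}")
```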

## Making the chunk plan
Now that we know how we want to chunk the data, we need to give that 
information to the API which will read the data from its OpenDAP endpoint: 

In [13]:
ds_in = xr.open_dataset(
    OPENDAP_url, 
    decode_times=False, 
    chunks={ #this directs the open_dataset method to structure its reads in a particular way.
        'time': 72, 
        'lon': 354, 
        'lat': 354
    }
)
ds_in

Array summaries from the dataset display, in display order (chunk diagrams omitted):

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 11.81 kiB | 576 B | (1512, 2) | (72, 2) | 21 chunks in 2 graph layers | float32 numpy.ndarray |
| 4.91 GiB | 34.42 MiB | (1512, 621, 1405) | (72, 354, 354) | 168 chunks in 2 graph layers | float32 numpy.ndarray |
| 9.83 GiB | 68.84 MiB | (1512, 621, 1405) | (72, 354, 354) | 168 chunks in 2 graph layers | float64 numpy.ndarray |
| 4.91 GiB | 34.42 MiB | (1512, 621, 1405) | (72, 354, 354) | 168 chunks in 2 graph layers | float32 numpy.ndarray |


Looking more closely at the `tmn` variable: 

In [14]:
ds_in.tmn

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph | Data type |
|---|---|---|---|---|---|
| 4.91 GiB | 34.42 MiB | (1512, 621, 1405) | (72, 354, 354) | 168 chunks in 2 graph layers | float32 numpy.ndarray |


**NOTE** that the display looks different than it does in the {doc}`ExamineSourceData` notebook. 
In that original data examination (when we did not express a chunking preference), the data was 
described as `1319227560 values with dtype=float32`.  In the above data description, you can see
that those observations are structured in the chunks that we've asked for.  Notice that the
`xarray` description of chunk size matches our rough calculation (34.4 MB/chunk). 

Also, take note that the "Attributes" section is still claiming that `_ChunkSizes` is `[1 23 44]`. This
is clearly a lie (it was never really true, actually).  This is a particular oddity with OpenDAP
(or perhaps with this server), and won't be a consideration if you are working with data from
other sources. 

We've specifically asked the dask interface to request this data according to the chunk pattern 
specified -- and this is revealed in the graphical display.  Dask will make specific OPeNDAP 
requests *per chunk* using appropriate query parameters to the server. 
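
To picture what "per chunk" means in index terms, the sketch below enumerates the index block covered by each chunk of a `(1512, 621, 1405)` array chunked as `(72, 354, 354)`. It illustrates the bookkeeping only. It is not how dask builds its requests internally, it does not construct any server query, and `chunk_slices` is a made-up helper.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, together covering the whole array."""
    per_dim = [
        [slice(start, min(start + step, length)) for start in range(0, length, step)]
        for length, step in zip(shape, chunks)
    ]
    yield from product(*per_dim)


shape = (1512, 621, 1405)  # (time, lat, lon)
chunks = (72, 354, 354)

all_slices = list(chunk_slices(shape, chunks))
print(len(all_slices))   # one entry per chunk, i.e. per separate request
print(all_slices[0])     # first chunk: full-sized in every dimension
print(all_slices[-1])    # last chunk: partial in lat and lon
```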

Because this chunk pattern can be provided by the server, and it is a reasonable pattern for 
object storage in S3, we do **not** need to add the complexity of `rechunker`. We can just 
have the zarr driver write it out according to the same plan.  Even so, it is useful to 
lay out exactly what the chunking plan might be if we were using `rechunker`:


In [15]:
chunk_plan = {
    'ppt':{'time': 72, 'lon': 354, 'lat': 354},    
    'tmx':{'time': 72, 'lon': 354, 'lat': 354},    
    'tmn':{'time': 72, 'lon': 354, 'lat': 354},
    'time_bnds': {'time': 1, 'tbnd': 2},
    'lat': (621,),
    'lon': (1405,),
    'time': (1512,),
    'tbnd': (2,)
}

Note that the coordinate variables themselves (`lat`, `lon`, etc.) are stored as single-chunk stripes of data. 
Recall that these are used to translate a latitude (or longitude) value into the actual corresponding array 
address.  These coordinate arrays will always be needed in their entirety, so we chunk them such that 
each can be read as a single chunk. This does not affect how the data variables themselves are chunked.  
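
Since the three data variables share one chunk shape, one option is to build their plan entries from a single template, which avoids mistyped variable names and lets us sanity-check the chunk counts. This is a sketch under assumptions: it builds a plain dictionary and does not call `rechunker`, the dimension sizes are the ones quoted in this notebook, and `n_chunks` is a helper invented for the example.

```python
import math

# Dimension sizes quoted in this notebook.
dim_sizes = {'time': 1512, 'lat': 621, 'lon': 1405, 'tbnd': 2}
data_chunks = {'time': 72, 'lon': 354, 'lat': 354}

# Build the data-variable entries from one template so they cannot drift apart.
plan = {name: dict(data_chunks) for name in ('ppt', 'tmx', 'tmn')}
plan['time_bnds'] = {'time': 1, 'tbnd': 2}

# Coordinate variables: a single chunk spanning the whole dimension.
for dim in ('lat', 'lon', 'time', 'tbnd'):
    plan[dim] = (dim_sizes[dim],)


def n_chunks(entry):
    """Number of chunks implied by a dict-style plan entry."""
    return math.prod(math.ceil(dim_sizes[d] / c) for d, c in entry.items())


for name in ('ppt', 'tmx', 'tmn', 'time_bnds'):
    print(f"{name}: {n_chunks(plan[name])} chunks")
```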
