# Another Example

Taking what we learned from reading OPeNDAP data, let's work through another example, end to end, of
reading a dataset from a THREDDS server.  The `gridmet` precipitation data will be
our new example dataset:


In [8]:
# Example Dataset:  gridmet
DATA_url = r"http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_met_pr_1979_CurrentYear_CONUS.nc"


## Preamble
The imports and logging setup we need:

In [9]:
import os
import logging
import xarray as xr

logging.basicConfig(level=logging.INFO, force=True)

In [10]:
%run ../utils.ipynb
_versions(['xarray', 'dask'])

Python     : 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
dask       : 2023.3.2
xarray     : 2023.3.0


## Examine Data
Lazy-load the data using defaults to see how the overall structure looks:

In [11]:
# lazy-load
ds_in = xr.open_dataset(DATA_url + r"#fillmismatch", decode_coords=True, chunks={})
# and show it:
ds_in

|  | Array | Chunk |
|---|---|---|
| Bytes | 48.84 GiB | 48.84 GiB |
| Shape | (16171, 585, 1386) | (16171, 585, 1386) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


In [12]:
ds_in.precipitation_amount

|  | Array | Chunk |
|---|---|---|
| Bytes | 48.84 GiB | 48.84 GiB |
| Shape | (16171, 585, 1386) | (16171, 585, 1386) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


The server presents the data to us as if it were one big chunk.  This is almost certainly not how it is stored on the server end. More importantly, that 48.84 GiB chunk is too big for the server to provide all at once: data requests are typically capped at around 500 MB.

Because we did not specify a chunk pattern, we get the illusion of one big chunk, and it is up to the server and the client (inside the `open_dataset()` call) to negotiate the transfer.

A hint as to how the server thinks of this data (absent chunking directives) is the `_ChunkSizes` attribute: `[61 98 231]` for `(day, lat, lon)`.  With that chunking pattern, each chunk is sized like so:

In [13]:
#        day  lat  lon   float32 (4 bytes per value)
nbytes = 61 * 98 * 231 * 4  # named nbytes to avoid shadowing the builtin `bytes`
kbytes = nbytes / (2**10)
mbytes = kbytes / (2**10)
print(f"Server chunk size: {nbytes=} ({kbytes=:.2f})({mbytes=:.4f})")

Server chunk size: nbytes=5523672 (kbytes=5394.21)(mbytes=5.2678)
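
The same arithmetic can be wrapped in a small helper so it is easy to repeat for other chunk shapes. This is only a minimal sketch: `chunk_nbytes` is a hypothetical helper (not part of xarray), and the `attrs` dictionary is a hand-written stand-in for what the variable's attributes are assumed to contain when the server advertises `_ChunkSizes`.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize=4):
    """Return the size in bytes of one chunk with the given shape.

    itemsize is the number of bytes per element (4 for float32).
    """
    return prod(int(n) for n in chunk_shape) * itemsize


# Hand-written stand-in for the variable's attributes (an assumption);
# with the real dataset this would come from ds_in.precipitation_amount.attrs.
attrs = {"_ChunkSizes": [61, 98, 231]}

server_chunk_bytes = chunk_nbytes(attrs["_ChunkSizes"])
print(f"{server_chunk_bytes / 2**20:.4f} MiB per server-side chunk")
```

Passing any candidate `(day, lat, lon)` chunk shape to the helper gives its size in bytes, which makes it quick to compare alternatives.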


## Establish Chunking Preference
We will proceed on the assumption that this data will most likely be read at full spatial extent, over short time intervals.

Examining each of the dimensions of this dataset: 

In [14]:
day = 16169 / 365  # number of chunks along time if each chunk holds one 365-day year
day

44.298630136986304

In [15]:
lat = 585 / 3 # split into 3 chunks
lat

195.0

In [16]:
lon = 1386 / 3  # split into 3 chunks
lon

462.0

If we chunk with this pattern, how big will each chunk be? 

In [17]:
#        day   lat   lon   float32 (4 bytes per value)
nbytes = 365 * 195 * 462 * 4  # named nbytes to avoid shadowing the builtin `bytes`
kbytes = nbytes / (2**10)
mbytes = kbytes / (2**10)
print(f"Chunk size: {nbytes=} ({kbytes=:.2f})({mbytes=:.4f})")

Chunk size: nbytes=131531400 (kbytes=128448.63)(mbytes=125.4381)


A 125 MB chunk seems reasonable, but it does mean that a time-series read will have to take in 45 chunks (assuming the spatial extent of the analysis falls within one lat/lon chunk).  To bring that down, let's take the time dimension two years at a time.  To keep the numbers round, we'll express two years as 24 months of 30 days each (720 days):

In [18]:
ds_in = xr.open_dataset(
    DATA_url + r"#fillmismatch", 
    decode_coords=True, 
    chunks={'day': 24*30, 'lon': 462, 'lat': 195}
)
ds_in

|  | Array | Chunk |
|---|---|---|
| Bytes | 48.84 GiB | 247.44 MiB |
| Shape | (16171, 585, 1386) | (720, 195, 462) |
| Dask graph | 207 chunks in 2 graph layers | 207 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


In [80]:
ds_in.precipitation_amount

|  | Array | Chunk |
|---|---|---|
| Bytes | 48.84 GiB | 250.88 MiB |
| Shape | (16169, 585, 1386) | (730, 195, 462) |
| Dask graph | 207 chunks in 2 graph layers | 207 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


This chunk pattern favors the spatial extent.  Nine chunks (3 along `lat` by 3 along `lon`) are needed to read the entire spatial extent for one time step.

The time dimension is divided into two-year-sized chunks of 720 days each. These chunks almost certainly do not align with calendar-year boundaries.  An entire time series for a small spatial extent (one that falls within a single lat/lon chunk) will require 23 chunks to be read.
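
The chunk counts for these two read patterns can be checked with a little integer arithmetic. The following is a minimal sketch; `chunks_per_dim` is a hypothetical helper, and the shape and chunk sizes are copied from the `open_dataset()` call and its output above.

```python
def chunks_per_dim(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    # -(-n // c) is integer ceiling division
    return tuple(-(-n // c) for n, c in zip(shape, chunks))


shape = (16171, 585, 1386)  # (day, lat, lon), as reported for the dataset
chunks = (720, 195, 462)    # the chunk shape passed to open_dataset()

n_day, n_lat, n_lon = chunks_per_dim(shape, chunks)

spatial_chunks = n_lat * n_lon        # chunks read for one full-extent time step
time_series_chunks = n_day            # chunks read for a full time series within one lat/lon chunk
total_chunks = n_day * n_lat * n_lon  # should match the "Dask graph" chunk count

print(spatial_chunks, time_series_chunks, total_chunks)
```

The total should agree with the number of chunks reported in the Dask graph row of the dataset summary.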

## Important
The chunk specification in the `open_dataset()` call does not reconfigure the data itself.  It governs how the data driver formulates its requests to the server: the chunk sizes establish the boundaries of its queries, and the driver then requests the data from the server one chunk at a time.  How the server holds that data is hidden from us, and we really don't need to care. The data driver on our end (within `xarray.open_dataset()`) does the work needed to deliver blocks aligned to the chunk sizes we asked for.
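
To make the idea of query boundaries concrete, here is a minimal sketch of how a chunk size divides one dimension into index ranges. It illustrates the concept only and is not how the actual driver is implemented; `chunk_bounds` is a hypothetical helper, and the length and chunk size are the `day` values used above.

```python
def chunk_bounds(length, chunk):
    """Yield (start, stop) index pairs covering range(length) in steps of chunk."""
    for start in range(0, length, chunk):
        # the final range is clipped so it never runs past the end of the dimension
        yield start, min(start + chunk, length)


# Boundaries along the `day` dimension: 16171 time steps in 720-day chunks
day_bounds = list(chunk_bounds(16171, 720))
print(len(day_bounds), day_bounds[0], day_bounds[-1])
```

Each `(start, stop)` pair corresponds to one block along the time axis; the last block is shorter than the others because 16171 is not a multiple of 720.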
