# Examine Source Data

**OBJECTIVE**:  
The objective of this chapter is to demonstrate how to read an existing dataset available as an OPeNDAP endpoint and translate it into a cloud-optimized Zarr store on S3. 

This notebook will take a guided tour of the input data, and show how to pick out key metadata about the structure of the dataset. 


In [1]:
import os
import logging
import xarray as xr
logging.basicConfig(level=logging.INFO, force=True)


In [2]:
%run ../utils.ipynb
_versions(['xarray'])

Python     : 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
xarray     : 2023.3.0


## Source Data

In [3]:
# INPUT: 
OPENDAP_url = 'https://cida.usgs.gov/thredds/dodsC/prism_v2'

The `xarray` loader is "lazy" -- it reads just enough of the data to make decisions about its shape, structure, etc. It behaves as if the whole dataset were in memory (and we can treat it that way), but it only loads data values as they are required. 

### Data Set

In [5]:
# lazy-load
ds_in = xr.open_dataset(OPENDAP_url)  # optionally: decode_times=False
# and show it:
ds_in

The "rich" HTML output to show the `xarray.Dataset` includes a lot of information, some of which is hidden behind toggles.  Click on the icons to the right to expand and see all the metadata available for the dataset. 

Notable observations: 
* **Dimensions** -- This dataset is 3D, with data being indexed by `lon`, `lat`, and `time` (setting
  aside `time_bnds` for the moment; it is a special case). The "Dimensions" line shows the size of
  each of these dimensions, that is, how many unique values are available along
  each one: 
    * **lon** = 1405
    * **lat** = 621
    * **time** = 1512
* **Coordinates** -- These are the convenient handles by which dimensions can be referenced. In this 
  dataset, a coordinate can be used to pick out a particular cell of the array.  Asking for 
  cells where `lat=49.9` is possible because these coordinates map the meaningful values of latitude
  to the behind-the-scenes cell index needed to fetch the value. 
* **Data Variables** -- The variables are `tmx`, `ppt`, and `tmn`, which are associated 
  with three indices by which data values are located in space and time (the _Dimensions_). 
* **Indexes** -- this is an internal data structure to help `xarray` quickly find items in the array.
* **Attributes** -- Arbitrary metadata associated with the dataset. 
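
The mapping from coordinate values to cell indices can be tried out on a small stand-in dataset. The sketch below does not touch the PRISM endpoint: the grid, the values, and the name `demo` are made up for illustration. It shows that selecting by coordinate label and selecting by array position return the same cells.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature grid (not the PRISM data): 4 latitudes x 3 longitudes
lat = np.array([49.6, 49.7, 49.8, 49.9])
lon = np.array([-125.0, -124.9, -124.8])
demo = xr.Dataset(
    {"tmn": (("lat", "lon"), np.arange(12, dtype="float32").reshape(4, 3))},
    coords={"lat": lat, "lon": lon},
)

# Label-based lookup: ask for a latitude value, not an array position.
# method="nearest" avoids depending on exact floating-point equality.
by_label = demo.tmn.sel(lat=49.9, method="nearest")

# Position-based lookup of the same row (lat=49.9 sits at index 3)
by_position = demo.tmn.isel(lat=3)

print(by_label.values, by_position.values)
```

The same `.sel()` call should work on `ds_in`, at the cost of a network request for the selected cells.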


Let's look at one of the data variables to learn more about how it is presented by the OPeNDAP endpoint. 

### Variable = "Data Array"

Each data variable is its own N-dimensional array (in this case, 3-dimensional, indexed by `lat`, `lon`, and `time`).  We can look at an individual variable by examining its array separately from the dataset: 

In [None]:
ds_in.tmn

Note from the top line that this variable is indexed as a tuple in `(time, lat, lon)`. So, behind the scenes, there is an array whose first index is a value between 0 and 1511.  How do we know the time value of index 0? (or any index, really) The "Coordinates" are the lookup table to say what "real" time value is associated with each index address. 
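
As a rough illustration of that lookup table, the sketch below builds a tiny time axis and translates in both directions: from index to time value, and from time value back to index. The dates and the name `demo` are hypothetical stand-ins, not values read from the endpoint.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly time axis (6 steps); not read from the PRISM endpoint
times = pd.date_range("1895-01-01", periods=6, freq="MS")
demo = xr.DataArray(
    np.zeros(6, dtype="float32"),
    dims=("time",),
    coords={"time": times},
    name="tmn",
)

# Index -> time value: what "real" time does index 0 refer to?
first_time = demo.time.values[0]

# Time value -> index: where does March 1895 live in the array?
position = demo.get_index("time").get_loc(pd.Timestamp("1895-03-01"))

print(first_time, position)
```

On the real dataset, `ds_in.time[0]` answers the same question for index 0.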

You'll notice that the data description in this case is merely "1319227560 values with dtype=float32"
with no indication as to how it is chunked. Assuming our 3-D array is fully populated, this value makes sense:

In [None]:
# time  lat  lon
1512 * 621 * 1405

In terms of chunking, this is where it gets interesting.

Notice that, in the variable's attributes, `_ChunkSizes` gives the chunk
sizes of the data, expressed as a tuple to match the dimensions. If we
choose to believe this, it indicates that the data are broken into
chunks, each of which is 
* 1 timestep, 
* 23 latitude steps, and 
* 44 longitude steps. 

In this case, we should be skeptical, because this information comes from an "Attribute", which 
may or may not be relevant.  Virtually anything can be set as an attribute on the dataset, and 
it does not affect the internal structure **AT ALL**. 

In [None]:
ds_in.tmn.attrs['spam'] = "Delicious"
ds_in.tmn

In the case of this OPeNDAP data, we can choose to believe that this is how
the server would like to give us the data by default.  If we accept that
default, and given that `tmn` is stored as `float32` (4 bytes per value), each chunk is of size: 

In [None]:
#        time lat  lon  float32
# (named nbytes so the built-in `bytes` type is not shadowed)
nbytes = 1 * 23 * 44 * 4
kbytes = nbytes / (2**10)
mbytes = kbytes / (2**10)
print(f"TMN chunk size: {nbytes=} ({kbytes=:.2f})({mbytes=:.4f})")

This is an **extremely** small chunk size, and not at all suitable for cloud storage.
We certainly will want to change that when we write this data. 
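
To get a feel for what such small chunks imply, the sketch below counts how many chunks one variable would need if it were written with the `_ChunkSizes` pattern. It uses only the dimension sizes and chunk sizes quoted above; partial chunks at the array edges are counted as whole chunks.

```python
import math

shape = (1512, 621, 1405)  # (time, lat, lon), from the dataset dimensions
chunks = (1, 23, 44)       # (time, lat, lon), from the _ChunkSizes attribute

# Chunks needed along each dimension, rounding up for a partial final chunk
chunks_per_dim = [math.ceil(n / c) for n, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)

print(f"{chunks_per_dim=} {n_chunks=}")
```

That works out to more than 1.3 million chunks of about 4 KB each for a single variable.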

The good news is that we are not stuck with it. The OPeNDAP server is offering us 
its default chunking for network API requests, but this is configurable. We can 
change it to something more suitable.

## OPeNDAP Considerations

A subtle point about this particular dataset, given that we are reading it 
from an OPeNDAP server: the server can't give us data in chunks bigger than
500MB. 

When it comes time to read this data, we need to specify the chunking pattern
that we want the data in, chosen so that each individual data request (i.e. 
each 'chunk') is smaller than this limit. 
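
Before settling on a read pattern, it may help to check candidate chunk shapes against that limit. The sketch below is plain arithmetic under two assumptions: "500MB" is read as 500 × 10^6 bytes (the more conservative interpretation), and the middle candidate shape is a hypothetical choice for illustration, not a recommendation from this notebook.

```python
import math

ITEMSIZE = 4               # bytes per float32 value
LIMIT_BYTES = 500 * 10**6  # assumption: "500MB" means 500 * 10^6 bytes


def chunk_nbytes(chunk_shape, itemsize=ITEMSIZE):
    """Uncompressed size in bytes of one chunk of the given shape."""
    return math.prod(chunk_shape) * itemsize


# Candidate (time, lat, lon) chunk shapes
candidates = {
    "server default": (1, 23, 44),
    "72 steps of full maps": (72, 621, 1405),  # hypothetical choice
    "entire variable": (1512, 621, 1405),
}

# True where a single request for one chunk stays within the limit
fits = {name: chunk_nbytes(shape) <= LIMIT_BYTES for name, shape in candidates.items()}

for name, shape in candidates.items():
    print(f"{name:>22}: {chunk_nbytes(shape) / 2**20:10.2f} MiB  fits={fits[name]}")
```

A shape that fits could then be supplied through the `chunks` argument of `xr.open_dataset`, which requires `dask` to be installed.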