# Writing OpenDAP Data to ZARR on S3

**OBJECTIVE**:  
The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. 

This notebook will actually write data to S3, using the chunking patterns we 
decided on based on the {doc}`ExamineSourceData` and {doc}`EffectSizeShape` notebooks. 

## Source Data
We're still reading the PRISM (v2) dataset via its OpenDAP endpoint:

In [4]:
# INPUT: 
OPENDAP_url = 'https://cida.usgs.gov/thredds/dodsC/prism_v2'

## Preamble
This is all stuff we are going to need: 

In [5]:
import os
import logging

import xarray as xr
import dask
import fsspec
import zarr
import hvplot.xarray

logging.basicConfig(level=logging.INFO, force=True)

In [6]:
%run ../utils.ipynb
_versions(['xarray', 'dask', 'fsspec', 'zarr'])

Python     : 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
dask       : 2023.3.2
fsspec     : 2023.3.0+13.g95eb5f9
xarray     : 2023.3.0
zarr       : 2.13.3


## AWS Credentials
Because we will be writing to an S3 object store, we need credentials.
This notebook will assume that the correct credentials are already 
stored in `~/.aws/credentials`.
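
For readers who have not set up a profile before, here is a minimal sketch of what such a credentials file looks like and how to list the profiles it defines. The key values are placeholders, and the helper name `list_profiles` is hypothetical. Only the profile name `osn-rsignellbucket2` comes from this notebook.

```python
import configparser

# Hypothetical file content in the AWS credentials (INI) layout.
# The key values below are placeholders, not real credentials.
EXAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLE_DEFAULT_KEY_ID
aws_secret_access_key = EXAMPLE_DEFAULT_SECRET

[osn-rsignellbucket2]
aws_access_key_id = EXAMPLE_OSN_KEY_ID
aws_secret_access_key = EXAMPLE_OSN_SECRET
"""


def list_profiles(credentials_text):
    """Return the profile names defined in credentials-style INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


profiles = list_profiles(EXAMPLE_CREDENTIALS)
print(profiles)
```

Each bracketed section is one profile. Setting `AWS_PROFILE` to a section name selects that pair of keys.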

I am using a profile to write to the OSN storage device (profile name 
`osn-rsignellbucket2`). If you run this notebook and want to write elsewhere 
with other credentials, you may change the profile name and endpoint 
in the cell below: 

In [7]:
os.environ['AWS_PROFILE'] = 'osn-rsignellbucket2'
os.environ['AWS_ENDPOINT'] = 'https://renc.osn.xsede.org'

%run ../AWS.ipynb  # handles credentials for us. 


## OUTPUT Location


In [9]:
workspace = 's3://rsignellbucket2/testing/prism/'

# OUTPUT Dataset Name:
FNAME = 'PRISM2.zarr'

# Instantiate a fsspec reference to the workspace: 
fsw = fsspec.filesystem('s3', 
    anon=False, 
    default_fill_cache=False, 
    skip_instance_cache=True, 
    client_kwargs={ 'endpoint_url': os.environ['AWS_S3_ENDPOINT'] },
) # Credentials come from the environment variables defined above,
# so there is no need to specify a profile or keys. The endpoint,
# unfortunately, must be named explicitly.

outdir = workspace + FNAME
target_store = fsw.get_mapper(outdir)

try:
    if fsw.exists(workspace + FNAME):
        logging.warning("Removing existing file/folder: %s", FNAME)
        fsw.rm(workspace + FNAME, recursive=True)
except FileNotFoundError:
    # Occasionally, the cache doesn't catch up to the fact that we've deleted a file. 
    # In that case, it throws a FileNotFound exception. Ignore. 
    pass

print("READY !!")


INFO:aiobotocore.credentials:Found credentials in environment variables.


READY !!
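
The remove-if-present step is worth isolating, because a handler that swallows every exception would also hide unrelated mistakes such as a misspelled variable name. Below is a minimal sketch of the same pattern against a local temporary directory, using the standard library rather than S3. The helper `remove_if_present` is hypothetical and stands in for the `fsw.exists` / `fsw.rm` pair.

```python
import shutil
import tempfile
from pathlib import Path


def remove_if_present(path):
    """Delete a directory tree; return True if something was removed."""
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        # Already gone: nothing to clean up, so this is not an error.
        return False


# Build a throwaway directory that plays the role of the output store.
workdir = Path(tempfile.mkdtemp())
store = workdir / "PRISM2.zarr"
store.mkdir()

first = remove_if_present(store)   # the store exists, so it is removed
second = remove_if_present(store)  # the store is already gone; tolerated

shutil.rmtree(workdir)
print(first, second)
```

Only the "not found" case is ignored. Any other failure still raises.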


## Read Source Data

Given what we calculated in the {doc}`EffectSizeShape` notebook, we can specify the
chunking pattern we want when the data is initially read. 

In [10]:
ds_in = xr.open_dataset(OPENDAP_url, decode_times=False, chunks={'time': 72, 'lon': 354, 'lat': 354})
ds_in

INFO:botocore.credentials:Found credentials in environment variables.


Dataset summary, one row per array in the `xarray` output:

| Shape | Chunk shape | Size | Chunk size | Dask chunks | Data type |
|---|---|---|---|---|---|
| (1512, 2) | (72, 2) | 11.81 kiB | 576 B | 21 | float32 |
| (1512, 621, 1405) | (72, 354, 354) | 4.91 GiB | 34.42 MiB | 168 | float32 |
| (1512, 621, 1405) | (72, 354, 354) | 9.83 GiB | 68.84 MiB | 168 | float64 |
| (1512, 621, 1405) | (72, 354, 354) | 4.91 GiB | 34.42 MiB | 168 | float32 |

Recall that `xarray` loads data lazily: the entire dataset is not in memory. It presents
the dataset as if it were, and loads data in chunks as needed. 
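
The chunk counts and chunk sizes in the dataset summary above can be checked by hand. This is a small sketch of that arithmetic, assuming the dimension order (time, lat, lon) and item sizes of 4 bytes for float32 and 8 bytes for float64. The helper name `chunk_stats` is hypothetical.

```python
import math

shape = (1512, 621, 1405)  # (time, lat, lon), from the dataset summary
chunks = (72, 354, 354)    # chunk shape requested in xr.open_dataset


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, size of one full chunk in MiB)."""
    # Each dimension is split into ceil(size / chunk) pieces; the last
    # piece along a dimension may be smaller than a full chunk.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    return n_chunks, chunk_mib


n32, mib32 = chunk_stats(shape, chunks, itemsize=4)  # float32 variables
n64, mib64 = chunk_stats(shape, chunks, itemsize=8)  # float64 variable

print(f"float32: {n32} chunks of {mib32:.2f} MiB")
print(f"float64: {n64} chunks of {mib64:.2f} MiB")
```

The 168 chunks are 21 along time, 2 along lat, and 4 along lon.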

## Writing Data

OK... finally we are ready to write out our data.
And the good news about using chunked data is that Dask is capable of doing its
lazily-loaded data operations *in parallel* and *without hand-holding*.  We don't 
have to design a parallel workflow: Dask will sort it out.  BUT... to take advantage 
of that parallelism, we need to start up a cluster: 

### Start Dask Cluster

In [11]:
%run ../StartNebariCluster.ipynb

from dask.distributed import progress, performance_report

The 'cluster' object can be used to adjust cluster behavior.  i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster.  i.e. 'client.submit(func)' 
The link to view the client dashboard is:
>  https://nebari.esipfed.org/gateway/clusters/dev.8c417ce9f78f483dbfdd150eec650bfb/status


### to_zarr()
With the data already lazy-loaded into the `ds_in` dataset, we can just
call its `to_zarr()` method.  It will write using the chunk pattern already 
defined in the dataset object. 

In [20]:
%%time
with performance_report('../performance_reports/OpenDAP_to_S3-perfreport.html'):
    ds_in.to_zarr(target_store, mode='w', consolidated=True)

CPU times: user 3.38 s, sys: 332 ms, total: 3.71 s
Wall time: 6min 34s
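
To put the wall time in context, here is a rough throughput estimate. It assumes the three 3-D variables in the summary (two float32, one float64) dominate the dataset, and it uses uncompressed sizes. The figure covers the whole pipeline, reading from OpenDAP and writing to S3, so it is not a measure of S3 write speed alone.

```python
# Number of values in one (time, lat, lon) variable.
n_elements = 1512 * 621 * 1405

# Two float32 variables (4 bytes each) and one float64 variable (8 bytes).
total_bytes = n_elements * (4 + 4 + 8)
total_gib = total_bytes / 2**30

# Wall time reported by %%time: 6 min 34 s.
wall_seconds = 6 * 60 + 34

throughput_mib_s = total_bytes / 2**20 / wall_seconds

print(f"{total_gib:.2f} GiB in {wall_seconds} s")
print(f"about {throughput_mib_s:.0f} MiB/s (uncompressed)")
```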


## Verify
To make sure that we really wrote the whole thing to S3, let's sample the 
data for some simple plots: 

### Reading...

In [12]:
new_ds = xr.open_dataset(target_store, engine='zarr', chunks={})
new_ds

Dataset summary, one row per array in the `xarray` output:

| Shape | Chunk shape | Size | Chunk size | Dask chunks | Data type |
|---|---|---|---|---|---|
| (1512, 621, 1405) | (72, 354, 354) | 9.83 GiB | 68.84 MiB | 168 | float64 |
| (1512, 2) | (72, 2) | 23.62 kiB | 1.12 kiB | 21 | datetime64[ns] |
| (1512, 621, 1405) | (72, 354, 354) | 4.91 GiB | 34.42 MiB | 168 | float32 |
| (1512, 621, 1405) | (72, 354, 354) | 4.91 GiB | 34.42 MiB | 168 | float32 |


Are the other variables present, and chunked the same way?

In [13]:
new_ds.ppt

| Shape | Chunk shape | Size | Chunk size | Dask chunks | Data type |
|---|---|---|---|---|---|
| (1512, 621, 1405) | (72, 354, 354) | 9.83 GiB | 68.84 MiB | 168 | float64 |


### Plot time series for a single location

In [16]:
%%time
da = new_ds.ppt.sel(lon=-75, lat=41.1, method='nearest').load()
da.hvplot(x='time', grid=True)

CPU times: user 42.7 ms, sys: 0 ns, total: 42.7 ms
Wall time: 6.58 s


### Plot a map for a single time-step

In [17]:
%%time
da = new_ds.tmx.sel(time="1970-01").load()
da.hvplot(x='lon', y='lat', rasterize=True, geo=True, tiles='OSM' )

CPU times: user 3.21 s, sys: 156 ms, total: 3.37 s
Wall time: 7.91 s
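
One rough way to reason about the two reads above is to count how many chunks each selection intersects. The sketch below assumes the chunk shape (72, 354, 354) and uses uncompressed chunk sizes, so the byte counts are an upper bound on what is actually transferred from S3. The helper names are hypothetical.

```python
import math

shape = {"time": 1512, "lat": 621, "lon": 1405}
chunks = {"time": 72, "lat": 354, "lon": 354}


def chunks_touched(fixed_dims):
    """Count chunks intersected when `fixed_dims` are pinned to one index."""
    count = 1
    for dim, size in shape.items():
        if dim not in fixed_dims:
            # A dimension that is read in full crosses all of its chunks.
            count *= math.ceil(size / chunks[dim])
    return count


def full_chunk_mib(itemsize):
    """Uncompressed size of one full chunk in MiB."""
    return math.prod(chunks.values()) * itemsize / 2**20


# Time series at one location: lat and lon are fixed, time is read in full.
series_chunks = chunks_touched({"lat", "lon"})
series_mib = series_chunks * full_chunk_mib(8)  # ppt is float64

# Map at one time step: time is fixed, lat and lon are read in full.
map_chunks = chunks_touched({"time"})
map_mib = map_chunks * full_chunk_mib(4)        # tmx is float32

print(f"time series: {series_chunks} chunks, up to {series_mib:.0f} MiB")
print(f"map:         {map_chunks} chunks, up to {map_mib:.0f} MiB")
```

Neither access pattern touches the whole array, which is the trade-off this chunk shape aims for: a time series needs 21 of the 168 chunks and a map needs 8.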


## Close down cluster
Always clean up after yourself....

In [18]:
client.close(); cluster.close()

