# Re-Chunking Larger Datasets 

This notebook extends the ideas covered in the [basic workflow](./ReChunkingData.ipynb). It 
performs the same operations, but works on a **much** larger dataset and 
involves some parallelization using the dask scheduler. 

:::{Warning}

You should run this **only** on a cloud compute node, such as ESIP Nebari. We 
will be reading and writing **enormous** amounts of data to S3 buckets. Doing that over a 
typical network connection would saturate your bandwidth and take days to complete.

:::

## System Setup 

In [40]:
import os
import xarray as xr
import dask
import intake

# Activate logging
import logging
logging.basicConfig(level=logging.INFO, force=True)


## Plumb Data Source
We're going to look at a particular dataset from the National Water Model Reanalysis Version 2.1. 
The dataset is part of the AWS Open Data Program, and is included in the HyTEST data catalog.


In [38]:
url = 'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml'
cat = intake.open_catalog(url)
cat['nwm21-streamflow-cloud']

nwm21-streamflow-cloud:
  args:
    consolidated: true
    storage_options:
      anon: true
    urlpath: s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr
  description: National Water Model 2.1 CHRTOUT on AWS
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog


## Load the zarr data
We'll take advantage of the `intake` mechanism and load the data 
directly.  We'll need to set up our AWS credentials first, since
this data is stored on an S3 bucket. 

In [41]:
os.environ['AWS_PROFILE'] = "osn-renci"
os.environ['AWS_S3_ENDPOINT'] = "https://renc.osn.xsede.org"
%run ../../../environment_set_up/Help_AWS_Credentials.ipynb

In [42]:
ds = cat['nwm21-streamflow-cloud'].to_dask()

In [43]:
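indexer = ds.gage_id != ''.rjust(15).encode() # a blank (all-space) gage_id marks a feature with no gage
smplData = ds.where(indexer.compute(), drop=True) # subset to only those features with a valid gage_id
smplData = smplData.drop_vars('crs') # Not needed/wanted for this analysis; the result must be assigned to take effect
smplData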

| | Array | Chunk |
|---|---|---|
| Bytes | 21.88 GiB | 1.25 MiB |
| Shape | (367439, 7994) | (672, 243) |
| Dask graph | 50871 chunks in 6 graph layers | |
| Data type | float64 numpy.ndarray | |

## Restrict for Tutorial
This is a demonstration workflow, so we don't need to work on the entire 
dataset, which is very large. We will cut the input dataset down to just the
first 100 `feature_id` values so that this tutorial runs in a reasonable time. 

To process a full-sized dataset, skip this step, in which we slice off a representative
sample of the data. Expect run time to increase in proportion to the size of the data being
processed.

In [44]:
smplData = smplData.isel(feature_id=slice(0, 100))
smplData

| | Array | Chunk |
|---|---|---|
| Bytes | 280.33 MiB | 525.00 kiB |
| Shape | (367439, 100) | (672, 100) |
| Dask graph | 547 chunks in 7 graph layers | |
| Data type | float64 numpy.ndarray | |

Note that this data has the full range of time values (367439 time steps) and all of the 
variables (`streamflow`, `velocity`, ...). We selected only 100 `feature_id` values to work with so 
that the example will execute more quickly. 
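The array and chunk sizes reported for the subset follow directly from the shapes. The short check below is a sketch that uses only the numbers shown in this notebook (shape, chunk shape, and 8 bytes per `float64` value); the variable names are ours.

```python
n_time, n_feature = 367439, 100        # shape of the subset
chunk_time, chunk_feature = 672, 100   # chunk shape of the subset
itemsize = 8                           # bytes per float64 value

# Total size of one 2-D variable, in MiB (2**20 bytes)
array_mib = n_time * n_feature * itemsize / 2**20

# Size of one full chunk, in KiB (2**10 bytes)
chunk_kib = chunk_time * chunk_feature * itemsize / 2**10

print(f"array: {array_mib:.2f} MiB, chunk: {chunk_kib:.2f} KiB")
```

This reproduces the 280.33 MiB array size and 525 KiB chunk size that dask reports for the subset.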

## Spin up Dask Cluster
Our rechunking operation will be able to work in parallel. To do that, we will
spin up a `dask` cluster on the cloud hardware to schedule the various workers.
Note that this cluster must be configured with a specific user **profile** with 
permissions to write to our eventual output location. 

In [10]:
# %run ../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
import os
import logging

try:
    from dask_gateway import Gateway
except ImportError:
    logging.error("Unable to import Dask Gateway.  Are you running in a cloud compute environment?\n")
    raise
os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = "1.0"

gateway = Gateway()
_options = gateway.cluster_options()
_options.conda_environment='users/users-pangeo'  ##<< this is the conda environment we use on nebari.
_options.profile = 'Medium Worker'
_env_to_add={}
aws_env_vars=['AWS_ACCESS_KEY_ID',
              'AWS_SECRET_ACCESS_KEY',
              'AWS_SESSION_TOKEN',
              'AWS_DEFAULT_REGION',
              'AWS_S3_ENDPOINT', 
              'DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION']
for _e in aws_env_vars:
    if _e in os.environ:
        _env_to_add[_e] = os.environ[_e]
_options.environment_vars = _env_to_add    
cluster = gateway.new_cluster(_options)          ##<< create cluster via the dask gateway
cluster.adapt(minimum=2, maximum=30)             ##<< Sets scaling parameters. 

client = cluster.get_client()

print("The 'cluster' object can be used to adjust cluster behavior.  i.e. 'cluster.adapt(minimum=10)'")
print("The 'client' object can be used to directly interact with the cluster.  i.e. 'client.submit(func)' ")
print(f"The link to view the client dashboard is:\n>  {client.dashboard_link}")

The 'cluster' object can be used to adjust cluster behavior.  i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster.  i.e. 'client.submit(func)' 
The link to view the client dashboard is:
>  https://nebari.esipfed.org/gateway/clusters/dev.810ce18031b74e0ea905c36e5638c01a/status
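The cluster setup forwards selected environment variables to the workers, copying only those that are actually set. As a sketch, that loop could be factored into a small helper. The name `collect_env` and the stand-in environment are hypothetical and are not part of the HyTEST helper notebooks.

```python
import os

def collect_env(names, environ=None):
    """Return {name: value} for each requested variable that is set."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in names if name in environ}

# Stand-in environment with placeholder values (not real credentials).
fake_env = {
    'AWS_ACCESS_KEY_ID': 'EXAMPLE',
    'AWS_S3_ENDPOINT': 'https://example.invalid',
}

# AWS_SESSION_TOKEN is requested but not set, so it is skipped.
forwarded = collect_env(
    ['AWS_ACCESS_KEY_ID', 'AWS_SESSION_TOKEN', 'AWS_S3_ENDPOINT'],
    fake_env,
)
print(sorted(forwarded))
```

Skipping unset variables matters here because assigning a missing key would raise a `KeyError`, and not every credential setup defines every variable (a session token, for instance).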


## Re-Chunk Plan
We will configure a new chunking plan which will favor time-series analysis. 
Using the dimensions of the data: 
* 367439 time steps
* 100 feature IDs

We can write the new plan as: 

In [11]:
# The new chunking plan:
chunk_plan = {
    'streamflow': {'time': 367439, 'feature_id': 1}, # all time records in one chunk for each feature_id
    'velocity': {'time': 367439, 'feature_id': 1},
    'elevation': (100,),
    'gage_id': (100,),
    'latitude': (100,),
    'longitude': (100,),    
    'order': (100,),    
    'time': (367439,), # all time coordinates in one chunk
    'feature_id': (100,) # all feature_id coordinates in one chunk
}


Under this plan, each `streamflow` and `velocity` chunk is a 367439 (time) x 1 (feature_id) array of `float64`. 

In [12]:
#       time  * id * float64
bytes = 367439 * 1 * 8
kbytes = bytes / (2**10)
mbytes = kbytes / (2**10)
print(f"chunk size: {bytes=} ({kbytes=:.2f})({mbytes=:.4f})")

chunk size: bytes=2939512 (kbytes=2870.62)(mbytes=2.8033)
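The chunk plan defined earlier hard-codes the dimension sizes of the 100-feature tutorial subset. For a different subset, or for the full dataset, the same layout could be generated from the dimension sizes. The sketch below assumes the variable names used in the hand-written plan; `build_chunk_plan` is a hypothetical helper, not part of `rechunker`.

```python
def build_chunk_plan(n_time, n_feature):
    """Build a time-series-friendly plan: one chunk per feature_id, all time steps."""
    per_feature = {'time': n_time, 'feature_id': 1}
    plan = {
        'streamflow': dict(per_feature),
        'velocity': dict(per_feature),
    }
    # 1-D variables indexed by feature_id: keep each in a single chunk
    for name in ('elevation', 'gage_id', 'latitude', 'longitude', 'order', 'feature_id'):
        plan[name] = (n_feature,)
    # The time coordinate: all values in one chunk
    plan['time'] = (n_time,)
    return plan

plan = build_chunk_plan(n_time=367439, n_feature=100)

# Bytes in one streamflow chunk (float64 = 8 bytes per value)
chunk_nbytes = plan['streamflow']['time'] * plan['streamflow']['feature_id'] * 8
print(chunk_nbytes)
```

With the tutorial's sizes this yields the same dictionary as the hand-written plan. In a live session the sizes could be read from `smplData.sizes` instead of being typed in.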


In [13]:
# Manually reset the chunking metadata in prep for re-chunking
smplData = smplData.chunk(chunks={'feature_id':1, 'time': 367439})
for x in smplData.variables:
    smplData[x].encoding['chunks'] = None

## Set up output location

With this plan, we can ask `rechunker` to re-write the data using the prescribed chunking pattern.

Unlike with the smaller dataset, we need to write this very large dataset to an object store in the datacenter: an S3 'bucket'.  So we need to set that up so that `rechunker` will have a suitable place to write data. This new data will be a complete copy of the original, just re-organized a bit. 

In [14]:
from getpass import getuser
import fsspec
uname=getuser()

fsw = fsspec.filesystem(
    's3', 
    anon=False, 
    default_fill_cache=False, 
    skip_instance_cache=True, 
    client_kwargs={'endpoint_url': os.environ['AWS_S3_ENDPOINT'], }
)

workspace = 's3://rsignellbucket2/'
testDir = workspace + "testing/"
myDir = testDir + f'{uname}/'
fsw.mkdir(testDir)

In [18]:
staging = fsw.get_mapper(myDir + 'tutorial_staging.zarr')
outfile = fsw.get_mapper(myDir + 'tutorial_rechunked.zarr')

## Ready to rechunk

In [19]:
import rechunker
## Recall that merely invoking rechunker does not do any work... just sorts out 
## the rechunking plan and writes metadata.
result = rechunker.rechunk(
    smplData,
    chunk_plan,
    "16GB",
    outfile, 
    temp_store=staging, 
)



In [21]:
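from dask.distributed import performance_report

# Run the plan on the cluster; the report saves dashboard diagnostics for later review
with performance_report(filename="dask-report.html"):
    r = result.execute(retries=10)  # retry tasks that fail, e.g. on transient S3 errors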

In [22]:
import zarr
_ = zarr.consolidate_metadata(outfile)

## Results
Let's read in the resulting re-chunked dataset to see how it looks:

In [23]:
reChunkedData = xr.open_zarr(outfile)
reChunkedData

| Data type | Shape | Chunk shape | Array bytes | Chunk bytes | Dask graph |
|---|---|---|---|---|---|
| float32 | (100,) | (100,) | 400 B | 400 B | 1 chunks in 2 graph layers |
| \|S15 | (100,) | (100,) | 1.46 kiB | 1.46 kiB | 1 chunks in 2 graph layers |
| float32 | (100,) | (100,) | 400 B | 400 B | 1 chunks in 2 graph layers |
| float32 | (100,) | (100,) | 400 B | 400 B | 1 chunks in 2 graph layers |
| int32 | (100,) | (100,) | 400 B | 400 B | 1 chunks in 2 graph layers |
| \|S8 | (100,) | (100,) | 800 B | 800 B | 1 chunks in 2 graph layers |
| float64 | (367439, 100) | (367439, 1) | 280.33 MiB | 2.80 MiB | 100 chunks in 2 graph layers |
| float64 | (367439, 100) | (367439, 1) | 280.33 MiB | 2.80 MiB | 100 chunks in 2 graph layers |

### Comparison


In [33]:
## Before:
smplData = ds.where(indexer.compute(), drop=True) # subset to only those features with a valid gage_id
smplData['streamflow'].sel(feature_id=417955)
# Note: many chunks needed to service a single feature_id


| | Array | Chunk |
|---|---|---|
| Bytes | 2.80 MiB | 5.25 kiB |
| Shape | (367439,) | (672,) |
| Dask graph | 547 chunks in 7 graph layers | |
| Data type | float64 numpy.ndarray | |


In [31]:
## After:
reChunkedData['streamflow'].sel(feature_id=417955) 
# All data for the specified feature_id is in a single chunk


| Data type | Shape | Chunk shape | Array bytes | Chunk bytes | Dask graph |
|---|---|---|---|---|---|
| float64 | (367439,) | (367439,) | 2.80 MiB | 2.80 MiB | 1 chunks in 3 graph layers |
| float32 | () | () | 4 B | 4 B | 1 chunks in 3 graph layers |
| \|S15 | () | () | 15 B | 15 B | 1 chunks in 3 graph layers |
| float32 | () | () | 4 B | 4 B | 1 chunks in 3 graph layers |
| float32 | () | () | 4 B | 4 B | 1 chunks in 3 graph layers |
| int32 | () | () | 4 B | 4 B | 1 chunks in 3 graph layers |
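The difference between the two layouts can be put in numbers. The sketch below estimates how many chunks, and how many uncompressed bytes, must be touched to load one feature's full time series. It assumes a chunk is the smallest unit that can be read, uses the chunk shapes of the 100-feature subset, (672, 100) before and (367439, 1) after, and ignores compression and the on-disk chunking of the source store, so treat the byte counts as rough guides. `read_cost` is a hypothetical helper.

```python
import math

def read_cost(n_time, chunk_time, chunk_feature, itemsize=8):
    """Chunks and bytes touched to load one feature's full time series.

    The last time chunk may be partial, so bytes are counted per time step
    rather than per full chunk.
    """
    n_chunks = math.ceil(n_time / chunk_time)
    n_bytes = n_time * chunk_feature * itemsize
    return n_chunks, n_bytes

n_time = 367439
before = read_cost(n_time, chunk_time=672, chunk_feature=100)   # subset layout before rechunking
after = read_cost(n_time, chunk_time=n_time, chunk_feature=1)   # layout after rechunking

print(f"before: {before[0]} chunks, {before[1] / 2**20:.2f} MiB")
print(f"after:  {after[0]} chunks, {after[1] / 2**20:.2f} MiB")
```

Under these assumptions the rechunked layout needs 1 chunk instead of 547, and touches about one hundredth of the data, to serve a single `feature_id`.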


In [32]:
client.close()
cluster.close()

