# Re-Chunking Data

This notebook extends the ideas covered in the [basic workflow](./ReChunkingData.ipynb). It performs the same operations, but on a much larger dataset, and it parallelizes the work using the dask scheduler.

:::{Warning}

You should run this only on a cloud compute node. We will be reading and writing enormous amounts of data
to S3 buckets. Doing that over a typical network connection will saturate your bandwidth and take days to
complete.

:::

## System Information
We're going to need a bigger boat... 8 CPUs and 32 GB of memory would be ideal.

In [1]:
import os
print(f"CPUS: {os.cpu_count()}")
import psutil
svmem = psutil.virtual_memory()
print(f"Total Virtual Memory: {svmem.total/(1024*1024*1024):.2f} Gb")

CPUS: 8
Total Virtual Memory: 30.91 Gb


## Plumb Data Source

In [2]:
# List available datasets at the National Water Model Reanalysis Version 2.1. 
# The dataset is part of the AWS Open Data Program.
import fsspec
fs = fsspec.filesystem('s3', anon=True)
fs.ls('s3://noaa-nwm-retrospective-2-1-zarr-pds/')

['noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr',
 'noaa-nwm-retrospective-2-1-zarr-pds/gwout.zarr',
 'noaa-nwm-retrospective-2-1-zarr-pds/index.html',
 'noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr',
 'noaa-nwm-retrospective-2-1-zarr-pds/ldasout.zarr',
 'noaa-nwm-retrospective-2-1-zarr-pds/precip.zarr',
 'noaa-nwm-retrospective-2-1-zarr-pds/rtout.zarr']

In [3]:
# Load chrtout
import xarray as xr
fileHandle = fs.get_mapper('noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr')
ds = xr.open_zarr(fileHandle, consolidated=True)

In [4]:
# Include HyTest helpers...
import sys
libDir = r'/shared/users/lib'
if libDir not in sys.path:
    sys.path.append(libDir)
# Activate logging
import logging
logging.basicConfig(level=logging.INFO, force=True)

from HyTest.helpers import configure_cluster

In [5]:
import ebdpy as ebd
os.environ['AWS_PROFILE'] = 'nhgf-development'
profile = 'nhgf-development'
region = 'us-west-2'
endpoint = f's3.{region}.amazonaws.com'
ebd.set_credentials(profile=profile, region=region, endpoint=endpoint)
worker_max = 30
client,cluster = ebd.start_dask_cluster(profile=profile,worker_max=worker_max, 
                                      region=region, use_existing_cluster=True,
                                      adaptive_scaling=False, wait_for_cluster=False, 
                                      environment='pangeonoa', worker_profile='Pangeo Worker', propagate_env=True)
#client, cluster = configure_cluster('cloud')
client.dashboard_link

Region: us-west-2
Existing Dask clusters:
Cluster Index c_idx: 0 / Name: dev.9b5ef859a930440c8698d48e4dd97b93 ClusterStatus.RUNNING
Using existing cluster [0].
Setting Fixed Scaling workers=30
Reconnect client to clear cache
client.dashboard_link (for new browser tab/window or dashboard searchbar in Jupyterhub):
https://jupyter.qhub.esipfed.org/gateway/clusters/dev.9b5ef859a930440c8698d48e4dd97b93/status
Propagating environment variables to workers
Using environment: pangeonoa


'https://jupyter.qhub.esipfed.org/gateway/clusters/dev.9b5ef859a930440c8698d48e4dd97b93/status'

In [6]:
import dask
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    # Subset to only those features with a valid (non-blank) gage_id
    smplData = ds.where(ds.gage_id != ''.rjust(15).encode(), drop=True)
    # 'crs' is not needed for this analysis. drop_vars returns a new dataset,
    # so the result must be assigned back.
    smplData = smplData.drop_vars('crs')
smplData

| Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|
| float32 | 31.23 kiB | 31.23 kiB | (7994,) | (7994,) | 4 | 1 |
| \|S15 | 117.10 kiB | 117.10 kiB | (7994,) | (7994,) | 4 | 1 |
| float32 | 31.23 kiB | 31.23 kiB | (7994,) | (7994,) | 4 | 1 |
| float32 | 31.23 kiB | 31.23 kiB | (7994,) | (7994,) | 4 | 1 |
| int32 | 31.23 kiB | 31.23 kiB | (7994,) | (7994,) | 4 | 1 |
| object | 62.45 kiB | 62.45 kiB | (7994,) | (7994,) | 6 | 1 |
| float64 | 21.88 GiB | 241.50 kiB | (367439, 7994) | (672, 46) | 288709 | 118699 |
| float64 | 21.88 GiB | 241.50 kiB | (367439, 7994) | (672, 46) | 288709 | 118699 |
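
The `gage_id` filter in the cell above compares against `''.rjust(15).encode()`, which is simply 15 space characters as bytes. The sketch below illustrates that comparison on a few made-up IDs; it assumes, as the filter implies, that features without a gage hold an all-blank 15-byte string. The sample values are hypothetical and are not taken from the dataset.

```python
# Sentinel used by the filter: an empty string right-justified to 15 characters, as bytes.
BLANK_GAGE_ID = ''.rjust(15).encode()

# Hypothetical gage_id values in fixed-width (|S15) form, for illustration only.
gage_ids = [b'       01234567', BLANK_GAGE_ID, b'       07654321', b' ' * 15]

# True where the feature has a real gage ID, False where the field is blank.
has_gage = [g != BLANK_GAGE_ID for g in gage_ids]

# Keep only the features with a gage, stripping the padding for display.
kept = [g.strip().decode() for g, keep in zip(gage_ids, has_gage) if keep]

print(has_gage)
print(kept)
```

In the notebook the same comparison is applied element-wise by xarray, and `drop=True` removes the features where it is false.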


In [7]:
# The new chunking plan:
chunk_plan = {
    'streamflow': {'time': 367439, 'feature_id': 1}, # all time records in one chunk for each feature_id
    'velocity': {'time': 367439, 'feature_id': 1},
    'elevation': (7994,),
    'gage_id': (7994,),
    'latitude': (7994,),
    'longitude': (7994,),    
    'order': (7994,),    
    'time': (367439,), # all time coordinates in one chunk
    'feature_id': (7994,) # all feature_id coordinates in one chunk
}
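
To get a feel for what this plan implies, the following sketch works out chunk sizes from the shapes reported in the dataset summary. It assumes a regular chunk grid and uncompressed `float64` values; the helper `chunk_stats` is a hypothetical name introduced here, not part of `rechunker` or xarray.

```python
import math

N_TIME, N_FEATURE = 367439, 7994   # shape of streamflow / velocity in the subset
ITEMSIZE = 8                       # bytes per float64 value

def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, bytes in one full chunk) for a regular chunk grid."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes

# Chunk shape reported for the source data, and the shape requested in chunk_plan.
source = chunk_stats((N_TIME, N_FEATURE), (672, 46), ITEMSIZE)
target = chunk_stats((N_TIME, N_FEATURE), (N_TIME, 1), ITEMSIZE)
total_bytes = N_TIME * N_FEATURE * ITEMSIZE

print(f"array size:   {total_bytes / 2**30:.2f} GiB")
print(f"source chunk: {source[1] / 2**10:.2f} KiB")
print(f"target chunk: {target[1] / 2**20:.2f} MiB in {target[0]} chunks")
```

Under these assumptions each target chunk holds one feature's full time series in roughly 2.8 MiB, and there is one such chunk per `feature_id`.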


In [8]:
# Because we subsetted/selected from the original xarray dataset, the new dataset has carried
# some of the source's metadata with it -- which is going to cause mismatches when operating on
# the sample data subset.  Manually reset these metadata in prep for re-chunking
smplData = smplData.chunk(chunks={'feature_id':1, 'time': 367439})
for x in smplData.variables:
    smplData[x].encoding['chunks'] = None

With this plan, we can ask `rechunker` to re-write the data using the prescribed chunking pattern.

Unlike with the smaller dataset, this very large dataset must be written to an object store in the datacenter: an S3 'bucket'. We set that up first so that `rechunker` has a suitable place to write data.

In [9]:
import fsspec
import os

fsw = fsspec.filesystem('s3', anon=False, default_fill_cache=False, skip_instance_cache=True)
workspace = 's3://nhgf-development/workspace/'
myDir = workspace + 'testing/tutorial/'
fsw.ls(myDir)

INFO:aiobotocore.credentials:Found credentials in environment variables.


[]

In [10]:
# fsw.mkdir(myDir)
# for f in ['rechunked.zarr', 'staging.zarr']:
#     if fsw.exists(myDir + f):
#         fsw.rm(myDir + f, recursive=True)
# Write through the authenticated filesystem (fsw). The `fs` handle was opened
# with anon=True for reading the public source data and cannot write to this bucket.
staging = fsw.get_mapper(myDir + 'staging.zarr')
outfile = fsw.get_mapper(myDir + 'rechunked.zarr')

In [12]:
import rechunker
result = rechunker.rechunk(
    smplData,
    chunk_plan,
    "2GB",                #<--- Max Memory
    outfile, 
    temp_store=staging 
)

In [None]:
from dask.distributed import progress, performance_report

with performance_report(filename="dask-report.html"):
    r = result.execute(retries=10)

In [22]:
import zarr
_ = zarr.consolidate_metadata(outfile)

## Results
Let's read in the resulting re-chunked dataset to see how it looks:

In [None]:
reChunkedData = xr.open_zarr(outfile)
reChunkedData

### Comparison


In [None]:
## Before:
smplData['streamflow'].sel(feature_id=1343034)
# Note: the time series for a single feature_id is spread across multiple chunks


In [None]:
## After:
reChunkedData['streamflow'].sel(feature_id=1343034) 
# All data for the specified feature_id is in a single chunk
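
The practical difference between the two layouts can be estimated with simple arithmetic. The sketch below counts how many chunks must be fetched to read one feature's complete time series, assuming the `(672, 46)` chunk shape shown in the dataset summary for the "before" case, that every touched chunk is decoded whole, and ignoring compression. The function `series_read_cost` is a hypothetical helper written for this illustration.

```python
import math

N_TIME = 367439   # number of time steps
ITEMSIZE = 8      # bytes per float64 value

def series_read_cost(time_chunk, feature_chunk):
    """Chunks touched and uncompressed bytes decoded to read one feature's full series."""
    n_chunks = math.ceil(N_TIME / time_chunk)
    bytes_decoded = n_chunks * time_chunk * feature_chunk * ITEMSIZE
    return n_chunks, bytes_decoded

before = series_read_cost(672, 46)     # source layout
after = series_read_cost(N_TIME, 1)    # layout requested in chunk_plan
useful = N_TIME * ITEMSIZE             # bytes actually wanted

for label, (n, b) in (("before", before), ("after", after)):
    print(f"{label}: {n} chunk(s), {b / 2**20:.1f} MiB decoded, "
          f"{b / useful:.1f}x the bytes needed")
```

Under these assumptions the source layout touches hundreds of chunks and decodes about 46 times more data than the series itself contains, while the re-chunked layout reads exactly one chunk.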


In [None]:
from HyTest.helpers import stop_running_clusters
stop_running_clusters()