# Re-Chunking Larger Datasets 

This notebook extends ideas covered in the [basic workflow](./ReChunkingData.ipynb). It 
performs the same operations, but on a **much** larger dataset, and 
parallelizes the work using the dask scheduler. 

:::{Warning}

You should run this **only** on a cloud compute node, such as ESIP Nebari. We 
will be reading and writing **enormous** amounts of data to S3 buckets. Doing that over a 
typical network connection would saturate your bandwidth and take days to complete.

:::

## System Setup 

In [None]:
# Activate logging
import logging
logging.basicConfig(level=logging.INFO, force=True)

## Let's see how big your compute environment is:
import os
print(f"CPUS: {os.cpu_count()}")
import psutil
svmem = psutil.virtual_memory()
print(f"Total Virtual Memory: {svmem.total/(1024**3):.2f} GiB")

## Plumb Data Source
We're going to look at a particular dataset from the National Water Model Reanalysis Version 2.1. 
The dataset is part of the AWS Open Data Program.  Let's look at what's available by just listing
the S3 bucket holding the NWM data:

In [None]:
import fsspec
fs = fsspec.filesystem('s3', anon=True)
fs.ls('s3://noaa-nwm-retrospective-2-1-zarr-pds/')

## Load the zarr data
The dataset we'll operate on is the `chrtout` dataset. 

In [None]:
# Load chrtout
import xarray as xr
fileHandle = fs.get_mapper('noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr')
ds = xr.open_zarr(fileHandle, consolidated=True)

## Spin up Dask Cluster
Our rechunking operation will be able to work in parallel. To do that, we will
spin up a `dask` cluster on the cloud hardware to schedule the various workers.
Note that this cluster must be configured with a specific user **profile** with 
permissions to write to our eventual output location. 
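
The credentials lookup in the next cell relies on the standard INI layout of an AWS credentials file. As a minimal sketch of that step, the snippet below parses an in-memory example instead of `~/.aws/credentials`. The key values are placeholders, `read_profile` is a hypothetical helper, and `osn-renci` is simply the profile name this notebook uses.

```python
import configparser

# Placeholder content; a real credentials file has the same layout.
EXAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLE_DEFAULT_KEY
aws_secret_access_key = EXAMPLE_DEFAULT_SECRET

[osn-renci]
aws_access_key_id = EXAMPLE_OSN_KEY
aws_secret_access_key = EXAMPLE_OSN_SECRET
"""

def read_profile(text, profile):
    """Hypothetical helper: return (access_key, secret_key) for one profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if profile not in parser:
        raise KeyError(f"profile {profile!r} not found in credentials")
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]

access_key, secret_key = read_profile(EXAMPLE_CREDENTIALS, "osn-renci")
print(access_key)
```

Failing early with a clear `KeyError` when the profile is missing is cheaper than discovering the problem after the cluster has started.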

In [None]:
## Set AWS Credentials
import configparser
awsconfig = configparser.ConfigParser()
awsconfig.read(
    os.path.expanduser('~/.aws/credentials') # default location... if yours is elsewhere, change this.
)
_profile_nm  = 'osn-renci'
_endpoint = 'https://renc.osn.xsede.org'
# Set environment vars based on parsed awsconfig
#os.environ['AWS_PROFILE'] = _profile_nm
os.environ['AWS_ACCESS_KEY_ID']     = awsconfig[_profile_nm]['aws_access_key_id']    
os.environ['AWS_SECRET_ACCESS_KEY'] = awsconfig[_profile_nm]['aws_secret_access_key']    
## Your profile may require that you specify an endpoint through which you access S3 object storage
os.environ['AWS_S3_ENDPOINT'] = _endpoint
try: 
    del os.environ['AWS_PROFILE']
except KeyError:
    pass


# NOTE: This cluster configuration is VERY specific to the JupyterHub cloud environment on ESIP/QHUB
from dask_gateway import Gateway
gateway = Gateway()
options = gateway.cluster_options()
options.conda_environment='users/users-pangeo'  
##                         ^^^^^^ 
## This conda environment is correct for nebari.esipfed.org
## You may need to specify a different conda environment if you are running elsewhere. 
options.profile = 'Medium Worker'
options.environment_vars = dict(
    DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION="1.0"
)
# pass environment vars to workers
# this includes AWS environment vars needed to access requester-pays and private buckets
options.environment_vars.update(dict(os.environ))
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=10, maximum=30)

# get the client for the cluster
client = cluster.get_client()
client.dashboard_link

## Read Sample Data

In [None]:
import dask
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    smplData = ds.where(ds.gage_id != ''.rjust(15).encode(), drop=True) # subset to only those features with a valid gage_id
    smplData = smplData.drop_vars('crs') # Not needed/wanted for this analysis; assign the result, since drop_vars returns a new dataset
smplData
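
The comparison value in the filter above is easy to misread. The sketch below shows what it evaluates to, using only the standard library. That blank `gage_id` entries are stored as 15 spaces is inferred from the filter itself, and the example IDs are made up for illustration.

```python
# ''.rjust(15) pads the empty string to width 15 with spaces;
# .encode() converts it to bytes so it compares against byte-string gage IDs.
blank_gage_id = ''.rjust(15).encode()

print(blank_gage_id == b' ' * 15)  # True
print(len(blank_gage_id))          # 15

# Hypothetical gage IDs, for illustration only.
example_ids = [b'       01030350', b' ' * 15, b'       01030500']
valid_ids = [g for g in example_ids if g != blank_gage_id]
print(len(valid_ids))              # 2
```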

## Re-Chunk Plan
We will configure a new chunking plan which will favor time-series analysis. 
Using the dimensions of the data: 
* 367439 time steps
* 7994 feature IDs

We can write the new plan as: 

In [None]:
# The new chunking plan:
chunk_plan = {
    'streamflow': {'time': 367439, 'feature_id': 1}, # all time records in one chunk for each feature_id
    'velocity': {'time': 367439, 'feature_id': 1},
    'elevation': (7994,),
    'gage_id': (7994,),
    'latitude': (7994,),
    'longitude': (7994,),    
    'order': (7994,),    
    'time': (367439,), # all time coordinates in one chunk
    'feature_id': (7994,) # all feature_id coordinates in one chunk
}
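
To get a feel for what this plan implies, the sketch below estimates the in-memory size of one `streamflow` chunk and the number of chunks per variable. It assumes 8 bytes per value (float64); the actual dtype should be checked with `smplData['streamflow'].dtype`, and compressed chunks on disk will likely be smaller. The helper names are hypothetical.

```python
N_TIME = 367439      # time steps, from the dimensions listed above
N_FEATURES = 7994    # feature IDs, from the dimensions listed above
BYTES_PER_VALUE = 8  # ASSUMPTION: float64 in memory

def chunk_nbytes(chunk_shape, itemsize=BYTES_PER_VALUE):
    """Uncompressed size in bytes of one chunk with the given shape."""
    n = 1
    for length in chunk_shape:
        n *= length
    return n * itemsize

def n_chunks(dim_lengths, chunk_shape):
    """Number of chunks needed to tile an array of the given dimensions."""
    total = 1
    for dim, chunk in zip(dim_lengths, chunk_shape):
        total *= -(-dim // chunk)  # ceiling division
    return total

target_chunk = (N_TIME, 1)  # (time, feature_id), as in chunk_plan
size = chunk_nbytes(target_chunk)
count = n_chunks((N_TIME, N_FEATURES), target_chunk)
print(f"{size / 1e6:.2f} MB per chunk, {count} chunks per variable")
```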


In [None]:
# Manually reset the chunking metadata in prep for re-chunking
smplData = smplData.chunk(chunks={'feature_id':1, 'time': 367439})
for x in smplData.variables:
    smplData[x].encoding['chunks'] = None

## Set up output location

With this plan, we can ask `rechunker` to re-write the data using the prescribed chunking pattern.

Unlike with the smaller dataset, we need to write this very large dataset to an object store in the datacenter: an S3 'bucket'.  So we need to set that up so that `rechunker` will have a suitable place to write data. This new data will be a complete copy of the original, just re-organized a bit. 

In [None]:
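# With anon=False, s3fs authenticates using the AWS credentials exported as environment variables above
# (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY; AWS_PROFILE was removed from the environment).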
from getpass import getuser
uname=getuser()

fsw = fsspec.filesystem('s3', anon=False, default_fill_cache=False, skip_instance_cache=True, 
                                 client_kwargs={'endpoint_url': _endpoint}
)
workspace = 's3://rsignellbucket2/'
testDir = workspace + "testing/"
myDir = testDir + f'{uname}_ReChunkTutorial/'
fsw.ls(testDir)

In [None]:
fsw.mkdir(myDir)

In [None]:
for f in ['rechunked.zarr', 'staging.zarr']:
    if fsw.exists(myDir + f):
        fsw.rm(myDir + f, recursive=True)
staging = fsw.get_mapper(myDir + 'staging.zarr')
outfile = fsw.get_mapper(myDir + 'rechunked.zarr')

## Ready to rechunk

In [None]:
import rechunker
## Recall that merely invoking rechunker does not do any work... just sorts out 
## the rechunking plan and writes metadata.
result = rechunker.rechunk(
    smplData,
    chunk_plan,
    "2GB",
    outfile, 
    temp_store=staging 
)
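
The third argument, `"2GB"`, is rechunker's `max_mem`: the amount of memory each worker task is allowed to use. As a rough sanity check, the sketch below compares that budget with the in-memory size of one target chunk. The 8-bytes-per-value figure and the decimal reading of "GB" are assumptions, and `parse_size` is a hypothetical helper, not part of rechunker.

```python
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}  # ASSUMPTION: decimal units

def parse_size(text):
    """Hypothetical helper: convert a string such as '2GB' into bytes."""
    for suffix, factor in UNITS.items():
        if text.upper().endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)

max_mem = parse_size("2GB")
target_chunk_bytes = 367439 * 1 * 8  # (time, feature_id) chunk, ASSUMED 8 bytes per value

# How many target chunks would fit inside the memory budget at once.
chunks_per_budget = max_mem // target_chunk_bytes
print(max_mem, target_chunk_bytes, chunks_per_budget)
```

If a single source or target chunk were larger than the budget, the plan could not be carried out within it, so this is worth checking before starting a long run.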

In [None]:
from dask.distributed import progress, performance_report

with performance_report(filename="dask-report.html"):
    r = result.execute(retries=10)  

In [None]:
import zarr
_ = zarr.consolidate_metadata(outfile)

## Results
Let's read in the resulting re-chunked dataset to see how it looks:

In [None]:
reChunkedData = xr.open_zarr(outfile)
reChunkedData

### Comparison


In [None]:
## Before:
smplData['streamflow'].sel(feature_id=1343034)
# Note: three chunks needed to service a single feature_id


In [None]:
## After:
reChunkedData['streamflow'].sel(feature_id=1343034) 
# All data for the specified feature_id is in a single chunk


In [None]:
client.close()
cluster.close()