# Best practices for Cloud-Optimized GeoTIFFs

**Part 4. Dask GatewayCluster**

Unlike a LocalCluster, a Dask GatewayCluster lets us dynamically scale CPU and RAM across many machines! This is extremely powerful, because we can now load very large datasets into RAM for efficient calculations. The complication is that computations now run on many physical machines instead of just one, so network communication is more challenging and the Dask workers likely don't have access to your local files. When COGs are stored on S3, though, we can access them from any machine!

In [None]:
import xarray as xr
import s3fs
import pandas as pd
import os 

import dask
from dask.distributed import Client, progress
from dask_gateway import Gateway

In [None]:
# use the same GDAL environment settings as we did for the single COG case
env = dict(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR', 
           AWS_NO_SIGN_REQUEST='YES',
           GDAL_MAX_RAW_BLOCK_CACHE_SIZE='200000000',
           GDAL_SWATH_SIZE='200000000',
           VSI_CURL_CACHE_SIZE='200000000')
os.environ.update(env)

### Dask GatewayCluster

Dask Gateway allows us to connect to a Kubernetes cluster so that we can go beyond the RAM and CPU of a single machine. It can take several minutes for these machines to initialize in the cloud, so be patient when starting a cluster.

In [None]:
# Dask Gateway allows us to connect to a Kubernetes cluster so that we can go beyond the RAM and CPU of a single machine

# NOTE: we have to explicitly pass local environment variables to the cluster now
# By default each worker has 2 cores and 4GB memory and effectively runs as a separate process
# NOTE: how to deal with cores vs threads in a gateway cluster?

gateway = Gateway()
options = gateway.cluster_options()
options.environment = env 
cluster = gateway.new_cluster(options)
cluster.scale(4) # let's get the same number of "workers" as our previous LocalCluster examples

In [None]:
# The dashboard link can also be pasted into the dask lab-extension
cluster

In [None]:
# NOTE: just like with a LocalCluster, it's good to explicitly connect to our GatewayCluster
client = Client(cluster) 

In [None]:
# the dashboard link works just like it does with a LocalCluster
client

In [None]:
options

In [None]:
# Make sure that your dask workers see GDAL environment variables
def get_env(env):
    import os
    return os.environ.get(env)

print(client.run(get_env, 'GDAL_DISABLE_READDIR_ON_OPEN'))
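
`client.run` returns a dictionary keyed by worker address, so with several workers it can be easier to check the result programmatically than to read the printed output. A minimal sketch, assuming a hypothetical helper `workers_missing_env` (not part of Dask) and made-up worker addresses standing in for the real `client.run` output:

```python
def workers_missing_env(results, expected):
    """Return sorted addresses of workers whose reported value differs from `expected`.

    `results` has the same shape as the return value of client.run:
    {worker_address: value_returned_by_the_function}.
    """
    return sorted(addr for addr, value in results.items() if value != expected)


# ASSUMPTION: hypothetical stand-in for
# client.run(get_env, 'GDAL_DISABLE_READDIR_ON_OPEN'); the addresses are made up
example_results = {
    'tls://10.0.0.1:8786': 'EMPTY_DIR',
    'tls://10.0.0.2:8786': None,  # this worker did not receive the variable
}

missing = workers_missing_env(example_results, 'EMPTY_DIR')
print(missing)
```

With a live cluster, the dictionary returned by `client.run(get_env, ...)` could be passed in directly; an empty list would mean every worker sees the expected setting.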

In [None]:
%%time 

s3 = s3fs.S3FileSystem(anon=True)
objects = s3.glob('sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/T/ET/**Gamma0_VV.tif')
images = ['s3://' + obj for obj in objects]
print(len(images))
images.sort(key=lambda x: x[-32:-24])  # sort list in place by date in filename
# Let's use first 100 images for simplicity
images = images[:100]
dates = [pd.to_datetime(x[-32:-24]) for x in images]
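
The fixed slice `x[-32:-24]` depends on the exact length of everything that follows the date in each path. A slightly more defensive sketch pulls the eight-digit date out with a regular expression instead. The example path below is an assumption about the bucket layout, shown only for illustration, and `date_from_href` is a hypothetical helper:

```python
import re
from datetime import datetime

# Eight digits delimited by underscores, e.g. "_20200106_"
DATE_PATTERN = re.compile(r'_(\d{8})_')


def date_from_href(href):
    """Parse the acquisition date (YYYYMMDD) embedded in a COG path."""
    match = DATE_PATTERN.search(href)
    if match is None:
        raise ValueError(f'no YYYYMMDD date found in {href!r}')
    return datetime.strptime(match.group(1), '%Y%m%d')


# ASSUMPTION: illustrative path, not taken from an actual bucket listing
example_href = ('s3://sentinel-s1-rtc-indigo/tiles/RTC/1/IW/10/T/ET/'
                '2020/S1B_20200106_10TET_ASC/Gamma0_VV.tif')

print(date_from_href(example_href))
print(example_href[-32:-24])  # the fixed slice used above, for comparison
```

If the real paths follow this pattern, both approaches give the same date; the regular expression simply fails loudly when a path does not contain one.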

In [None]:
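@dask.delayed
def lazy_open(href):
    # NOTE: xr.open_rasterio is deprecated in newer xarray releases;
    # rioxarray.open_rasterio is the suggested replacement
    chunks = dict(band=1, x=2745, y=2745)
    return xr.open_rasterio(href, chunks=chunks)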

In [None]:
%%time 

# ~6.5 s

dataArrays = dask.compute(*[lazy_open(href) for href in images])
da = xr.concat(dataArrays, dim='band', join='override', combine_attrs='drop').rename(band='time')
da['time'] = dates
da

In [None]:
%%time

# 41 s

da.mean(dim=['x','y']).compute()

In [None]:
%%time

# 44 s
# Just like with a LocalCluster, this workflow requires pulling (nCOGs x chunk size)
# into worker RAM to get the mean through time for each chunk (~3 GB).
# Now we have 4 GB per worker (and we can adjust this via cluster settings)

da.mean(dim='time').compute()
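
The 3 GB figure in the comment above can be reproduced with a little arithmetic. This sketch assumes each pixel is a 4-byte float32 value (the dtype is not stated in this notebook, so check `da.dtype`) and uses decimal gigabytes:

```python
n_cogs = 100                   # images stacked along the time dimension
chunk_y, chunk_x = 2745, 2745  # chunk shape passed to open_rasterio
bytes_per_pixel = 4            # ASSUMPTION: float32 pixels

# Size of one (1 x 2745 x 2745) chunk from a single COG
chunk_bytes = chunk_y * chunk_x * bytes_per_pixel

# A mean through time needs the same spatial chunk from every COG
stack_gb = n_cogs * chunk_bytes / 1e9

print(f'one chunk: {chunk_bytes / 1e6:.1f} MB')
print(f'{n_cogs} chunks stacked through time: {stack_gb:.2f} GB')
```

Under that assumption the stack for one spatial chunk comes to roughly 3 GB, which fits within the 4 GB default per worker but leaves little headroom.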

### Visualization

Plotting with hvplot, as we've done before, uses the Dask cluster to compute each image as you request it.

In [None]:
import hvplot.xarray
da.hvplot.image(rasterize=True, 
                aspect='equal', frame_width=500,
                cmap='gray', clim=(0,0.4))