# Create/Connect to Dask Cluster

This example launches a cluster locally, but you can just as well connect to an existing one.

## Cluster sizing

Most of the "work" is waiting for S3 data to arrive, so you want to oversubscribe your cluster, i.e. run many more workers than there are CPU cores. 8 workers per core is not unreasonable.

Should you use threads or processes?

I recommend using more threads. Running many threads per worker process lets them share common data more efficiently. The downside is the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), so having too many threads can become a problem. However, most of the time is spent waiting for HTTP data (from S3), and the GIL is released during that time. Ultimately you have to experiment to see what works best for your workload. The important message is: "don't be afraid to use more threads."
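
The effect of oversubscription on I/O-bound work can be sketched with the standard library alone. This is an illustration rather than a benchmark of Dask or S3: `fake_s3_read` and `timed_run` are hypothetical helpers, and `time.sleep` stands in for a blocking HTTP request because it releases the GIL in a similar way.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_s3_read(delay=0.05):
    # Hypothetical stand-in for an HTTP request to S3.
    # time.sleep releases the GIL, much like a thread blocked on network I/O.
    time.sleep(delay)
    return delay


def timed_run(n_threads, n_tasks=32):
    # Run n_tasks simulated reads on a pool of n_threads and time the batch.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(lambda _: fake_s3_read(), range(n_tasks)))
    return time.perf_counter() - t0


elapsed = {n: timed_run(n) for n in (1, 8, 32)}
for n, t in elapsed.items():
    print(f"{n:>2} threads: {t:.2f}s")
```

Because the simulated tasks spend all their time waiting, adding threads keeps shortening the batch. Real workloads also decode and reproject pixels, which does contend for the GIL, so the best thread count still has to be found by experiment.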

## Use a local worker pool while still debugging

The code below launches a local cluster in the same process that runs this notebook, which makes debugging any problems easier.

In [None]:
import dask
import dask.distributed

client = dask.distributed.Client(n_workers=1, 
                                 threads_per_worker=32, 
                                 processes=False, 
                                 ip='127.0.0.1')
client

# Configure Dask Cluster for S3 I/O

1. Configure GDAL for cloud access on every worker process
2. Check that we can obtain AWS credentials

## Note on STS

If you use [STS](https://docs.aws.amazon.com/STS/latest/APIReference/Welcome.html) to obtain S3 access credentials, keep the following in mind:

- Every worker thread will obtain its own set of credentials (the first time it does I/O)
- Token expiry will cause I/O errors
- To force credential renewal you have to call `set_default_rio_config` again on every worker

The most robust and efficient approach is to create a locked-down set of credentials that can only read the S3 buckets of interest, and provision it to every worker (`~/.aws/{config|credentials}`).
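
As a rough illustration of what "locked down" could mean, the sketch below builds a read-only IAM policy document for a single bucket. The helper `read_only_s3_policy` and the bucket name `example-bucket` are hypothetical, and the exact set of permissions your workload needs is an assumption to verify against your own setup.

```python
import json


def read_only_s3_policy(bucket):
    # Hypothetical helper: allow listing the bucket and reading its objects,
    # and nothing else.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = read_only_s3_policy("example-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Credentials for a user or role restricted in this way can then be placed in `~/.aws/credentials` on every worker, which avoids the token-expiry problem described above.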

In [None]:
def worker_setup_auto():
    from datacube.utils.rio import set_default_rio_config, activate_from_config
    
    # these settings will be applied in every worker thread
    set_default_rio_config(aws={'region_name': 'auto'},
                           cloud_defaults=True)
    
    # Force activation in the main thread
    # - Really just to test that configuration works
    # - Every worker thread will automatically run this again
    return activate_from_config()

# Runs once on every worker process, not per worker thread!
client.register_worker_callbacks(setup=worker_setup_auto)

In [None]:
from IPython.display import display
from types import SimpleNamespace
from datacube import Datacube
from odc.ui import show_datasets

dc = Datacube(env='gm')

cfg = SimpleNamespace(product='ls8_nbart_geomedian_annual',
                      time='2017',
                      crs='EPSG:3577',
                      resolution=(-32*25, 32*25), # 1/32 of native
                      dask_chunk=256,
                      measurements=('red', 'green', 'blue'))

In [None]:
%%time
dss = dc.find_datasets(product=cfg.product, time=cfg.time)
print('Found {:,} datasets'.format(len(dss)))

In [None]:
show_datasets(dss, mode='geojson')

## Lazy Dask Array

Construct a lazy Dask array. This does not load any pixels yet.
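
To get a feel for how many chunks to expect, the arithmetic behind the settings in `cfg` can be worked through by hand. The sketch below uses the values from `cfg` (25 m native pixels scaled by 32, 256-pixel chunks); the `chunks_for` helper and the 4,000 km by 3,500 km extent are made up for illustration only.

```python
import math

native_res_m = 25   # native pixel size in metres (EPSG:3577 units are metres)
scale = 32          # cfg.resolution uses 32 * 25, i.e. 1/32 of native
dask_chunk = 256    # cfg.dask_chunk, pixels per chunk along x and y

pixel_m = native_res_m * scale          # size of one output pixel in metres
chunk_span_m = pixel_m * dask_chunk     # ground distance covered by one chunk


def chunks_for(extent_m):
    # Hypothetical helper: chunks needed to cover extent_m along one axis.
    return math.ceil(extent_m / chunk_span_m)


# Made-up extent, for illustration only
nx, ny = chunks_for(4_000_000), chunks_for(3_500_000)
print(f"pixel: {pixel_m} m, chunk span: {chunk_span_m / 1000} km")
print(f"chunks per time slice: {ny} x {nx}")
```

With these numbers each chunk covers 204.8 km on a side, so a continental-scale mosaic at this resolution breaks into only a few hundred chunks per band.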

In [None]:
%%time
xx = dc.load(product=cfg.product, 
             datasets=dss, 
             output_crs=cfg.crs,
             resolution=cfg.resolution,
             measurements=cfg.measurements,
             dask_chunks={'x': cfg.dask_chunk, 'y': cfg.dask_chunk})

print("Number of chunks per band: {}x{}x{}".format(*xx.red.data.to_delayed().shape))
display(xx)

In [None]:
%%time
rr = xx.red.compute()

In [None]:
from odc.ui import to_rgba, to_jpeg_data

In [None]:
%%time
cc = to_rgba(xx, clamp=3000)

In [None]:
from IPython.display import Image

Image(data=to_jpeg_data(cc.isel(time=0).values))