# Notebook to process data in SMCE helio-public S3 bucket using Dask

### You will need to create the conda environment for this notebook. Use 'environment-s3-dask-demo.yml' with this command (in the terminal):
### > conda env create -f environment-s3-dask-demo.yml

### then make sure to select the 'conda env:s3-dask-demo' environment for this notebook!
<hr>

## First:  What is S3? 

S3 stands for "Simple Storage Service," the object storage service provided by AWS.  https://aws.amazon.com/s3/ 

It lets people query and access data from a common location.  The buckets can be made <a href="https://stackoverflow.com/q/16784052">web accessible to users outside of daskhub</a> if web access is enabled.    

S3 buckets are individual storage containers.  To <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/ls.html">get a list of the S3 buckets</a> on the SMCE Daskhub, enter this at a terminal prompt: <br>
`aws s3 ls`


To view the contents of a specific bucket, reference it with s3:// <br>
`aws s3 ls s3://helio-public/`
>            PRE SDO/
>            PRE SOHO/

(Note: "PRE" stands for prefix, so SDO/ is an AWS prefix with name SDO.) 
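
S3 has no real directories: a bucket holds a flat set of keys, and a listing groups them by a delimiter to produce the "PRE" entries. The sketch below emulates that grouping in plain Python so the idea can be tried without AWS access. The helper name `list_level` and every key except the AIA one shown later in this notebook are hypothetical, for illustration only.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Emulate one level of an S3 listing over a flat collection of keys.

    Returns (common_prefixes, objects): the "PRE" entries and the keys
    that sit directly under `prefix`.
    """
    common_prefixes = set()
    objects = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        # Look at the part of the key after the prefix
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # There is a deeper level: report it once as a common prefix
            common_prefixes.add(prefix + head + delimiter)
        else:
            # No delimiter left: this is an object at the current level
            objects.append(key)
    return sorted(common_prefixes), sorted(objects)


# Hypothetical keys (only the first one appears in this notebook)
example_keys = [
    "SDO/AIA/AIA_L4_20141018_000001_94.fits",
    "SOHO/example.fits",
    "README.txt",
]

print(list_level(example_keys))                 # top level of the bucket
print(list_level(example_keys, prefix="SDO/"))  # one level down
```

Calling `list_level` with a longer prefix walks down the pseudo-hierarchy one level at a time, much like repeating `aws s3 ls` with a longer path.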
<hr>

In [6]:
import boto3
import s3fs
import logging
import dask
from os import listdir
from os.path import isfile, join

from re import search

from astropy.io import fits
import io

### 0. Configuration 

In [2]:
# name of the bucket to read from
bucket_name = 'helio-public'

# location in the bucket to use
bucket_path = '/SDO/AIA/'

### Initialize the cluster and assign the client to the cluster, display the cluster widget

In [3]:
from dask_gateway import Gateway, GatewayCluster
gateway = Gateway()
options = gateway.cluster_options()

# We're setting some defaults here just for grins...
# The pangeo/base-notebook image is a good choice for the workers since it has almost every library you'd need on a worker.
# In our environment, without setting these, the widget will default to the same image that the notebook itself is running,
# as well as 2 cores and 4GB memory per worker.
# Here we keep 2 cores per worker but request less memory than the default.

options.worker_cores = 2
options.worker_memory = 1
options

VBox(children=(HTML(value='<h2>Cluster Options</h2>'), GridBox(children=(HTML(value="<p style='font-weight: bo…

In [4]:
cluster = gateway.new_cluster(options)
client = cluster.get_client()

# manual scaling (2 workers)
# cluster.scale(2)

# Adaptively scale between 2 and 10 workers
# cluster.adapt(minimum=2, maximum=10)

cluster

VBox(children=(HTML(value='<h2>GatewayCluster</h2>'), HBox(children=(HTML(value='\n<div>\n<style scoped>\n    …

In [5]:
from dask.distributed import Client

client = Client(cluster)
client

Client:   Scheduler: gateway://traefik-daskhub-dask-gateway.daskhub:80/daskhub.ab39e1f3801548bdb39b023a59919f1c  Dashboard: /services/dask-gateway/clusters/daskhub.ab39e1f3801548bdb39b023a59919f1c/status
Cluster:  Workers: 0  Cores: 0  Memory: 0 B


### Define a routine to pull data from the bucket into memory and work with it

In [7]:
# initialize connection to S3 bucket
s3_client = boto3.resource('s3')
bucket = s3_client.Bucket(bucket_name)

In [8]:
# get our list of files/s3 objects

# bucket.objects.all() iterates through all the objects, doing the pagination for you.
# Each obj is an ObjectSummary, so it doesn't contain the body. You'll need to call
# get() to fetch the whole body.
s3_files = []
for obj in bucket.objects.all():
    key = obj.key

    # keep only keys that contain 'fits'
    if search('fits', key):
        s3_files.append(key)

s3_files[0]

'SDO/AIA/AIA_L4_20141018_000001_94.fits'
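
The loop above walks every object in the bucket and keeps any key containing the substring `fits`. A narrower variant, sketched below, restricts the listing to a prefix such as the `bucket_path` from the configuration cell and matches on the `.fits` extension instead. The helper names `select_fits_keys` and `iter_bucket_keys` are hypothetical; `bucket.objects.filter(Prefix=...)` is the boto3 collection call assumed for the server-side restriction, and boto3 is imported lazily so the pure-Python filter can be tried without AWS access.

```python
def select_fits_keys(keys, prefix=""):
    """Return the keys under `prefix` whose names end in '.fits'."""
    # S3 keys do not start with '/', so '/SDO/AIA/' becomes 'SDO/AIA/'
    prefix = prefix.lstrip("/")
    return [
        key for key in keys
        if key.startswith(prefix) and key.lower().endswith(".fits")
    ]


def iter_bucket_keys(bucket_name, prefix=""):
    """Yield the keys in a bucket under `prefix` (needs boto3 and S3 access)."""
    import boto3  # imported here so select_fits_keys works without boto3

    bucket = boto3.resource("s3").Bucket(bucket_name)
    # Only objects under the prefix are listed, rather than the whole bucket
    for obj in bucket.objects.filter(Prefix=prefix.lstrip("/")):
        yield obj.key


# Offline demonstration with hypothetical keys (only the first is from this notebook)
sample = [
    "SDO/AIA/AIA_L4_20141018_000001_94.fits",
    "SDO/AIA/fits_notes.txt",
    "SOHO/x.fits",
]
print(select_fits_keys(sample, "/SDO/AIA/"))
```

On the hub, something like `select_fits_keys(iter_bucket_keys(bucket_name, bucket_path), bucket_path)` would combine the two; note that it is stricter than the substring match used above, so the resulting list may differ.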

In [15]:
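def open_fits_s3(s3_key: str, bucket_name: str):
    """Read one FITS object from S3 into memory and return its HDU list."""

    # anon=True: connect without sending AWS credentials
    fs = s3fs.S3FileSystem(anon=True)
    with fs.open(bucket_name + '/' + s3_key, 'rb') as f:
        # read the whole object into memory so the HDU list
        # stays usable after the S3 file handle is closed
        fits_hdul = fits.open(io.BytesIO(f.read()))

    fits_hdul.info()
    return fits_hdul

def work_on_data(client: dask.distributed.client.Client, bucket_name: str, files: list = None) -> list:

    # Submit one open_fits_s3 task per file. client.map returns
    # a list of Futures immediately, before the work has finished.
    hdu_list = client.map(open_fits_s3, files or [], bucket_name=bucket_name)
    # print('HDU list: %s' % hdu_list)

    # <DO MORE WORK WITH FILE HERE>

    # return the list of Futures
    return hdu_list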


In [16]:
def chunks(lst, n):
    n = max(1, n)
    return (lst[i:i+n] for i in range(0, len(lst), n))

batch_size = 2
for files_to_process in chunks(s3_files[:4], batch_size):
    r = work_on_data(client, bucket_name, files_to_process)
    print(client, r)


HDU list: [<Future: pending, key: open_fits_s3-ed7db6d246fa3f86388bf936098a3eb6>, <Future: pending, key: open_fits_s3-e79e2a0d6e5e1754e534ef77597a71ff>]
<Client: 'tls://192.168.3.157:8786' processes=2 threads=4, memory=2.15 GB> [<Future: pending, key: open_fits_s3-ed7db6d246fa3f86388bf936098a3eb6>, <Future: pending, key: open_fits_s3-e79e2a0d6e5e1754e534ef77597a71ff>]
HDU list: [<Future: pending, key: open_fits_s3-470053d25faf4048510298020903d400>, <Future: pending, key: open_fits_s3-35f1103f940b0fc48f74bd20bd9e1d47>]
<Client: 'tls://192.168.3.157:8786' processes=2 threads=4, memory=2.15 GB> [<Future: pending, key: open_fits_s3-470053d25faf4048510298020903d400>, <Future: pending, key: open_fits_s3-35f1103f940b0fc48f74bd20bd9e1d47>]
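
The futures printed above are still `pending`: the tasks have been submitted but their results have not been collected. With Dask, `client.gather(futures)` is the usual way to block until they finish. The sketch below shows the same idea with a small hypothetical helper, `collect_results`, that calls `.result()` on each future and records failures instead of stopping at the first one. It uses the standard library's `ThreadPoolExecutor` as a stand-in for the Dask client so it runs without a cluster, and `fake_open` is a made-up placeholder for `open_fits_s3`.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_results(futures):
    """Block on each future in order.

    Returns (results, errors), where errors maps the index of a failed
    future to the exception it raised.
    """
    results, errors = [], {}
    for i, fut in enumerate(futures):
        try:
            results.append(fut.result())
        except Exception as exc:
            errors[i] = exc
    return results, errors


def fake_open(key):
    """Placeholder task: fails on non-FITS keys, else returns the key length."""
    if not key.endswith(".fits"):
        raise ValueError("not a FITS file: " + key)
    return len(key)


# Local stand-in for the Dask client
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fake_open, k) for k in ["a.fits", "b.txt", "cc.fits"]]
    results, errors = collect_results(futures)

print(results)
print(sorted(errors))
```

Because the helper relies only on `.result()`, it should in principle also accept the futures returned by `work_on_data`, though that has not been verified against this cluster. Returning a small summary from each task rather than a full HDU list would keep the amount of data moved back to the notebook low.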
