<CENTER>
<H1 style="color:red">
SMCE Heliophysics DaskHub S3 Bucket Tutorial
</H1>
</CENTER>
<!--<img src="./banner.jpg">-->

# Notebook to process data in SMCE helio-public S3 bucket using Dask

<i>Note:  This document is maintained through the SMCE HelioCloud gitlab.</i>

This is a simple example showing how to get a list of FITS files, hand them to Dask workers that read each file from S3, and examine the header keyword(s).

You can use this notebook to test out various parameters you might feed to Dask: consider the 'batch size', the number of workers, the number of cores per worker, and the memory per worker. Use the dashboard link to inspect how Dask is performing. Try both manual and automatic scaling strategies. See if you can get it to process 100 files in 20 sec or less!

## First:  What is S3? 

S3 stands for "Simple Storage Service," which provides object storage for AWS.  https://aws.amazon.com/s3/

It allows people to query and access data from a common location reference.  The buckets can be made <a href="https://stackoverflow.com/q/16784052">web accessible to users outside of DaskHub</a> if web access is enabled.

S3 buckets are individual storage elements. 

## Accessing S3 buckets

To <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/ls.html">get a list of the S3 buckets</a> on the SMCE DaskHub, enter this at a terminal prompt:<br>
`aws s3 ls`

To view the contents of a specific bucket, reference it with s3:// <br>
`aws s3 ls s3://helio-public/`
>            PRE SDO/
>            PRE SOHO/

(Note: "PRE" stands for prefix, so SDO/ is an AWS prefix with name SDO.) 
<hr>

The external reference for this bucket is https://helio-public.s3.us-east-1.amazonaws.com/
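
As a minimal sketch of how an `s3://` reference maps onto that external URL, the helper below rewrites one into the other. The function name `s3_to_https` is hypothetical, and the default region `us-east-1` is taken from the URL above; a bucket in another region would need a different value, and the resulting link only works if web access is enabled on the bucket.

```python
from urllib.parse import quote


def s3_to_https(s3url: str, region: str = "us-east-1") -> str:
    """Map s3://bucket/key to the bucket's virtual-hosted HTTPS URL.

    Hypothetical helper for illustration; the region default is an assumption.
    """
    if not s3url.startswith("s3://"):
        raise ValueError(f"not an S3 URL: {s3url}")
    # split "bucket/key..." at the first slash; key may be empty
    bucket, _, key = s3url[len("s3://"):].partition("/")
    # percent-encode the key, leaving the "/" separators untouched
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


print(s3_to_https("s3://helio-public/SDO/"))
```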

## Basic commands using S3 buckets

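S3 has no true directories: a prefix comes into existence when the first object is copied under it, so there is nothing to create beforehand. To see what is already under a prefix, list it:<br>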
`aws s3 ls s3://helio-public/yourname/yourdir`

Then you can copy a file to the bucket as if it were a Unix folder:<br>
`aws s3 cp yourfile s3://helio-public/yourname/yourdir`

Copying multiple files takes one more step: put the files in a directory first, then copy that directory recursively:<br>
`aws s3 cp sourcedir/ s3://helio-public/yourname/yourdir --recursive`

If you need access to a bucket that has restricted access, you have to run aws-mfa first:<br>
`~/aws-mfa default`

Here `default` is a profile name. To see the available profiles:<br>
`cat ~/.aws/credentials`

If `~/aws-mfa` is not executable, you may need to give it execute permission first:<br>
`chmod 755 ~/aws-mfa`

## (The code below needs to be commented for instructional purposes)

In [None]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

import dask
import io
import re
import logging
import s3fs

from astropy.io import fits
from dask.distributed import Client
from os import listdir
from os.path import isfile, join
from re import search

## 0. Configuration 

In [None]:
# name of the bucket to read from
bucket_name = 'gov-nasa-hdrl-data1'

# location in the bucket to use (a day's worth of 211 A data from AIA on SDO for the date 2022-11-27)
bucket_path = 'sdo/aia/20221127/0211/'

(Feel free to play with the following values to optimize the performance.)

In [None]:
# number of workers to use; with automatic scaling this is the maximum
n_workers = 10

# memory per worker (in GB)
w_memory = 2

# cores per worker
w_cores = 2

# number of files to test against (360 max)
n_files = 100

# Number of files we release to be worked on by all workers at a time
# the higher the number the more files being processed concurrently, but also
# the greater the memory consumed. 
batch_size = 50
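
To make the trade-offs between these values concrete, here is a small arithmetic sketch of what the configuration above requests when the cluster is fully scaled. The function name `cluster_summary` is hypothetical, and the numbers are upper bounds only: with adaptive scaling the cluster may run fewer workers than `n_workers`.

```python
def cluster_summary(n_workers: int, w_cores: int, w_memory: float, batch_size: int) -> dict:
    """Summarize the resources requested at full scale (illustrative helper)."""
    total_cores = n_workers * w_cores
    return {
        "total_cores": total_cores,
        "total_memory_gb": n_workers * w_memory,
        # how many files each core handles, on average, within one batch
        "files_per_core_per_batch": batch_size / total_cores,
    }


# values from the configuration cell above
print(cluster_summary(n_workers=10, w_cores=2, w_memory=2, batch_size=50))
```

A `files_per_core_per_batch` well below 1 suggests the batch is too small to keep every core busy; a large value means more files are held in flight at once.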

## 1. Initialize the cluster and assign the client to the cluster, display the cluster widget

In [None]:
from dask_gateway import Gateway, GatewayCluster
gateway = Gateway()
options = gateway.cluster_options()

# We're setting some defaults here just for grins... 
# I like the pangeo/base-notebook image for the workers since it has almost every library you'd need on a worker
# In our environment, without setting these, the widget will default to the same image that the notebook itself is running, 
# as well as 2 cores and 4GB memory per worker

options.worker_cores=w_cores
options.worker_memory=w_memory
# options

In [None]:
cluster = gateway.new_cluster(options)
client = cluster.get_client()

# use Manual (if False, then uses Automatic scaling)
use_manual_scaling = False

if use_manual_scaling:
    # manual scaling (n_workers defined above)
    cluster.scale(n_workers)
else:
    # Adaptively scale between 1 and n_workers (the max)
    cluster.adapt(minimum=1, maximum=n_workers)

# uncomment this if you want to use the GUI
#cluster

In [None]:
# create client, show url we can go to to monitor progress
client = Client(cluster)
client

## 2. Scan data from bucket and make a simple list of file names

While we recommend using 'import cloudcatalog' to fetch catalog lists, below we read the listing directly from S3 to give a generalized example.

In [None]:
# get our list of files/s3 objects
import os
import json
if os.path.isfile('s3_data.json'):
    with open ('s3_data.json') as f:
        s3_files = json.load(f)['files']
else:
    # initialize an anonymous (unsigned) connection to the S3 bucket
    s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    # A single listing call returns at most 1000 keys, so use a paginator
    # to walk every page. Each entry holds metadata only (key, size, ...);
    # fetching an object's body takes a separate get_object call.
    paginator = s3_client.get_paginator('list_objects_v2')
    s3_files = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=bucket_path):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if search('fits', key):
                s3_files.append('s3://'+bucket_name+'/'+key)
    # write / cache files to local listing (speed purposes)
    with open('s3_data.json', 'w') as outfile:
        json.dump({'files' : s3_files}, outfile)
    
print(len(s3_files),"files available, sample file:",s3_files[0])
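
The listing above keeps any key that contains the substring `fits`, which would also match a key such as an index file with "fits" in its name. A stricter alternative, sketched here under the assumption that the data files end in `.fits` (optionally gzip-compressed), is to test the suffix instead. The helper name `is_fits_key` and the sample keys are hypothetical.

```python
def is_fits_key(key: str) -> bool:
    """Return True when the key looks like a FITS file, judged by its suffix."""
    return key.lower().endswith((".fits", ".fits.gz"))


# hypothetical keys, for illustration only
sample_keys = [
    "sdo/aia/20221127/0211/example_image.fits",
    "sdo/aia/20221127/0211/fits_index.csv",
]
print([k for k in sample_keys if is_fits_key(k)])
```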

## 3. Define some routines we will use for doing work with Dask

In [None]:
import astropy.io.fits
import s3fs

def DO_SCIENCE(mydata):
    # you can put better science here
    iirad = mydata.mean()
    return iirad

# helper functions used by the processing routines below
def s3url_to_bucketkey(s3url: str): # -> Tuple[str, str]:
    """
    Extracts the S3 bucket name and file key from an S3 URL.
    e.g. s3://mybucket/mykeypart1/mykeypart2/fname.fits -> mybucket, mykeypart1/mykeypart2/fname.fits
    """
    name2 = re.sub(r"s3://","",s3url)
    s = name2.split("/",1)
    return s[0], s[1]

def process_fits_s3(s3key:str): # -> Tuple[str, float]:
    """ For a single FITS file, read it from S3, grab the header and
        data, then do the DO_SCIENCE() call of choice
    """
    sess = boto3.session.Session() # do this each open to avoid thread problem 'credential_provider'
    s3c = sess.client("s3")
    mybucket,mykey = s3url_to_bucketkey(s3key)
    try:
        fobj = s3c.get_object(Bucket=mybucket,Key=mykey)
        rawdata = fobj['Body'].read()
        bdata = io.BytesIO(rawdata)
        hdul = astropy.io.fits.open(bdata,memmap=False)        
        date = hdul[1].header['T_OBS']
        irrad = DO_SCIENCE(hdul[1].data)
        print(date,irrad)
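    except Exception as err:
        # report the failure but keep going so one bad file does not stop the batch
        print("Error fetching", s3key, "-", err)
        date, irrad = None, None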
        
    return date, irrad

def work_on_data(client: Client, bucket_name: str, files: list = None) -> list:
    """
    Runs on the notebook (client) side: submits one process_fits_s3 task per
    file to the Dask workers and gathers the results.
    """
    # submit one task per file in this batch (not the full s3_files list)
    mean_irrad = client.map(process_fits_s3, files or [])

    # block until the tasks finish and bring the results back to local memory
    all_data = client.gather(mean_irrad)

    # return the list of (date, mean value) tuples for analysis
    return all_data
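
The URL-splitting helper in the cell above can also be written with the standard library's `urllib.parse`, which additionally lets the function reject inputs that are not `s3://` URLs. This is only a sketch of an alternative, not the notebook's own method, and the name `split_s3_url` is hypothetical. One caveat: `urlparse` treats `?` and `#` as query and fragment markers, so keys containing those characters would need different handling.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(s3url: str) -> Tuple[str, str]:
    """Split s3://bucket/some/key into (bucket, some/key)."""
    parsed = urlparse(s3url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {s3url}")
    # netloc is the bucket; the path carries a leading "/" that is not part of the key
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_url("s3://mybucket/mykeypart1/mykeypart2/fname.fits"))
```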

## 4. Do the cloud processing, using Dask to 'burst' into other VMs
Using our gathered list of FITS files, chunk it out in batches and provide file list chunks to the workers

In [None]:
%%time
if n_files > len(s3_files):
    n_files = len(s3_files)
    
def chunks(lst, n):
    """ program to divide our file list into chunks for each worker """
    n = max(1, n)
    return (lst[i:i+n] for i in range(0, len(lst), n))

print (f"workers: {n_workers}, cores/worker:{w_cores}, mem/worker: {w_memory}")
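returns = []  # accumulate results from every batch for the plot below
for files_to_process in chunks(s3_files[:n_files], batch_size):
    batch_results = work_on_data(client, bucket_name, files_to_process)
    returns.extend(batch_results)
    print(f"client: {client} Finished {len(batch_results)} files")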
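
To see how the batching behaves without touching the cluster, the sketch below runs the same `chunks` generator on a short list of made-up file names. The last batch is simply shorter when the list length is not a multiple of the batch size.

```python
def chunks(lst, n):
    """Yield successive slices of lst, each holding at most n items."""
    n = max(1, n)  # guard against a zero or negative batch size
    return (lst[i:i + n] for i in range(0, len(lst), n))


# hypothetical file names, for illustration only
files = [f"file_{i}.fits" for i in range(7)]
batch_sizes = [len(batch) for batch in chunks(files, 3)]
print(batch_sizes)  # [3, 3, 1]
```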

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline 
# Have Matplotlib create vector (svg) instead of raster (png) images
#%config InlineBackend.figure_formats = ['svg'] 
#plt.figure()
plt.plot_date(*zip(*returns))
plt.xticks(rotation=90)
plt.show()

In [None]:
cluster.shutdown()