# Cloud Catalog Demo (using MMS data in HelioCloud S3 storage)

The CloudCatalog specification, with its accompanying Python client and demo notebook, provides an efficient way to get a list of data files stored in AWS S3 that are publicly accessible to scientists.  In addition to querying which datasets are available, users can directly retrieve the list of files as a Pandas DataFrame with one row per file, in the form 'time, S3 file key, filesize'.

We jump into a quick demo of fetching a day's worth of MMS1 files.  We then step back to show how to query the 'catalog of catalogs' to explore and find datasets.  We close with a more extended MMS demo that fetches the list of files then plots them directly from cloud storage (no intermediate file transfers needed).


## Setup
The file catalog API is just "import cloudcatalog".  You also set a variable that points to the global catalog of all known storage buckets (AWS S3 buckets).  This is an index catalog that points to essentially all known HelioClouds.

In [None]:
import cloudcatalog
import cdflib
import matplotlib.pyplot as plt
from pprint import pprint

# HAPI-like queries

We start with the main usage, 'give me a list of files to operate on', and cover how to search and explore catalogs after this example.

CloudCatalog uses a HAPI-like query to get a list of files for a given dataset id, over a given time range.  For example:

In [None]:
# one sample instrument and a time range
dataset_id1 = 'mms1_feeps_brst_electron'
start = '2020-02-01T00:00:00Z'
stop =   '2020-02-02T00:00:00Z'

In [None]:
# open up the global Catalog
cr = cloudcatalog.CatalogRegistry()
endpoint = cr.get_endpoint("GSFC HelioCloud Public Temp")
fr = cloudcatalog.CloudCatalog(endpoint, cache=True)

In [None]:
filekeys_id1 = fr.request_cloud_catalog(dataset_id1,start_date=start,stop_date=stop)
#filekeys_id2 = fr.request_cloud_catalog(dataset_id2,start_date=start,stop_date=stop)
print("filekeys for ",dataset_id1,start,stop,":\n",filekeys_id1)



## Params

Now you can feed that list of file S3 locations to your program and work with your files.  Here's a simple example of looking at the metadata for the (arbitrarily chosen) 3rd file in that Pandas DataFrame:

In [None]:
print("All info on item 3:",filekeys_id1.iloc[2])
print("Just the S3 key:",filekeys_id1.iloc[2]['datakey'])
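
As a minimal sketch of handing one of those locations to other tooling, the snippet below splits an S3 location into its bucket and object key using only the standard library. It assumes the 'datakey' column holds a full s3:// URI; the helper name split_s3_uri and the example URI are hypothetical and are not part of cloudcatalog.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/path/to/file' into (bucket, key).

    Hypothetical helper for illustration; not part of cloudcatalog.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # netloc is the bucket name; the path (minus its leading slash) is the key
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI; in the notebook you would pass filekeys_id1.iloc[2]['datakey']
example_uri = "s3://example-bucket/mms1/feeps/brst/example_file.cdf"
bucket, key = split_s3_uri(example_uri)
print("bucket:", bucket)
print("key:", key)
```

The bucket and key pair is the form that many S3 clients expect, so a helper like this can sit between the file list and whatever reader you prefer.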

You can skip to the example at the end, where we actually access the files.  But first, what is CloudCatalog, and how do you use it?

# Catalog of Catalogs
Now we step back to explore the catalogs that are available, to work up to that example.

The 'CatalogRegistry()' call fetches a list of all S3 'buckets' (data repositories) that are known to the HelioCloud network.  This is the 'catalog of catalogs'.

In [None]:
cr=cloudcatalog.CatalogRegistry()
cat = cr.get_catalog()
print("get_catalog() provides JSON metadata for the Catalog of Catalogs, plus a list of known catalogs:",cat)
reg = cr.get_registry()
print("get_registry() is a list of all known endpoints:",reg)
url = cr.get_endpoint("HelioCloud, including SDO")
print("get_endpoint:",url)

## Searching for data with EntireCatalogSearch

This initial search accesses each HelioCloud's specific data holdings to create what you probably want, which is a list of all datasets available within the entire HelioCloud ecosystem.

We include an example of a down or de-registered catalog to emphasize this index catalog points to HelioClouds, but doesn't manage them.



In [None]:
mysearch = cloudcatalog.EntireCatalogSearch()

Now let us do something useful: look for MMS1 FEEPS data, ion data, or other items of interest.

In [None]:
mysearch.search_by_id('mms1_feeps')

In [None]:
mysearch.search_by_id('srvy_ion')

In [None]:
mysearch.search_by_title('mms1/fpi/b')

In [None]:
mysearch.search_by_title('des-dist')[:2]

In [None]:
mysearch.search_by_keywords(['mms2', 'brst', 'apples'])[:3]

## Working with the global catalog
This is primarily for admins and people looking for background information on other HelioClouds (rather than on other datasets).


In [None]:
cr = cloudcatalog.CatalogRegistry()

In [None]:
cr.get_catalog()

In [None]:
cr.get_registry()

In [None]:
endpoint = cr.get_endpoint('GSFC HelioCloud Public Temp')
endpoint

In [None]:
cr.catalog

## Working with a local catalog
Here we browse all the data holdings within one specific HelioCloud endpoint (a single S3 bucket).

In [None]:
cr = cloudcatalog.CatalogRegistry()
endpoint = cr.get_endpoint('GSFC HelioCloud Public Temp')
fr = cloudcatalog.CloudCatalog(endpoint, cache=True)

In [None]:
fr.get_catalog()

# Useful search
Given a dataset ID (from searching above or by knowing a dataset name from another catalog or search tool or even email), we can get a list of files for our desired dataset within our desired timespan.  Here's our example dataset from before:

In [None]:
# one sample instrument and a time range
dataset_id1 = 'mms1_feeps_brst_electron'
start = '2020-02-01T00:00:00Z'
stop =   '2020-02-02T00:00:00Z'

Each dataset has metadata (in JSON format) providing additional information.  Here is an example.

In [None]:
pprint(fr.get_entry(dataset_id1))
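
One use for that metadata is checking, before requesting a file list, whether a dataset covers the time range you care about. The sketch below is only an illustration: it assumes the entry is a dict carrying ISO 8601 'start' and 'stop' fields, and the example values are made up rather than taken from the real catalog. The helper names parse_utc and covers_range are hypothetical.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse an ISO 8601 string such as '2020-02-01T00:00:00Z'."""
    # fromisoformat() rejects a trailing 'Z' before Python 3.11,
    # so swap it for an explicit UTC offset first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def covers_range(entry, start, stop):
    """Return True if [start, stop) overlaps the dataset's time span.

    Assumes entry has 'start' and 'stop' keys (an assumption about
    the metadata layout, not a documented guarantee).
    """
    ds_start = parse_utc(entry["start"])
    ds_stop = parse_utc(entry["stop"])
    return parse_utc(start) < ds_stop and ds_start < parse_utc(stop)


# Made-up entry standing in for fr.get_entry(dataset_id1)
example_entry = {
    "id": "mms1_feeps_brst_electron",
    "start": "2015-06-01T00:00:00Z",
    "stop": "2022-01-01T00:00:00Z",
}

print(covers_range(example_entry, "2020-02-01T00:00:00Z", "2020-02-02T00:00:00Z"))
```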

Now we get the actual file list of data items for our given instrument in our given time range.

In [None]:
file_registry1 = fr.request_cloud_catalog(dataset_id1, start_date=start, stop_date=stop, overwrite=False)

In [None]:
file_registry1

# Operating on files in bulk
Here we can view metadata about the files, or plot them and otherwise extract data from them.

First, some simple metadata.

In [None]:
print('Python Hash of File | Start Date | File Size')
fr.stream(file_registry1, lambda bo, d, f: print(hash(bo.read()), d.replace(' ', 'T')+'Z', f))
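
The callback pattern used above can be sketched without any cloud access. The snippet below is a local illustration, not the library's implementation: it walks made-up (start, S3 key, filesize) rows, calls a function once per row with the same argument order as the lambda above, and sums the file sizes. The row values, for_each_file, and to_iso_utc are all hypothetical, and the 'YYYY-MM-DD HH:MM:SS' start format is inferred from the replace(' ', 'T') call.

```python
# Made-up rows standing in for the (start, datakey, filesize) file list
file_rows = [
    ("2020-02-01 00:03:13", "s3://example-bucket/mms1/file_a.cdf", 1024),
    ("2020-02-01 04:15:43", "s3://example-bucket/mms1/file_b.cdf", 2048),
    ("2020-02-01 09:40:03", "s3://example-bucket/mms1/file_c.cdf", 4096),
]


def to_iso_utc(start):
    """Turn 'YYYY-MM-DD HH:MM:SS' into 'YYYY-MM-DDTHH:MM:SSZ'."""
    return start.replace(" ", "T") + "Z"


def for_each_file(rows, func):
    """Call func(s3_uri, start, filesize) per row and collect the results."""
    return [func(uri, start, size) for start, uri, size in rows]


# Same argument order as the lambda passed to the streaming call
summaries = for_each_file(file_rows, lambda uri, d, f: (to_iso_utc(d), f))
total_bytes = sum(size for _, size in summaries)

print("files:", len(summaries))
print("total bytes:", total_bytes)
```

Keeping the per-file work in a plain function makes it easy to try on a few fake rows first and then pass the same function to the real streaming call.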

We define a plotting routine for CDF files here. You can (as with any code) define whatever functions you want to run over each data file.  We chose plotting for this demo to show that the data is accessed via S3 and loaded into this program without having to copy any files around.

In [None]:
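def plot_cdf(s3_uri, d, f):
    """Plot one variable from the CDF at s3_uri; d (start date) and f (file size) are unused."""
    print("fetching ", s3_uri)

    # Open the CDF directly from its S3 location
    cdf = cdflib.CDF(s3_uri)

    # Read one variable by number; to select it by name instead, use e.g.:
    #var_name = cdf.cdf_info()["zVariables"][2]
    #var_data = cdf.varget(var_name)
    var_data = cdf.varget(1)
    var_name = "data"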
    # Plot the variable
    plt.figure()
    plt.plot(var_data)
    plt.xlabel("Index")
    plt.ylabel(var_name)
    plt.title(f"Plot of {var_name}")
    plt.show()


# The real stuff

This is our simple runner that takes the file list and runs the plot command. In this case we pull a subset of the list; for example, file_registry1[:3] pulls only the first three files.


In [None]:
print('Plotting the first three files')
fr.stream_uri(file_registry1[:3], lambda s3_uri, d, f: plot_cdf(s3_uri, d, f))

That's it: find a dataset, get a list of files for a given time range, and operate on all the files in bulk.