# Example: Hycom Data Processing

This notebook outlines some of the basic concepts needed to define a C3-based architecture for archiving and working with Hycom FMRC data.

## References
https://www.hycom.org/data/gomu0pt04/expt-90pt1m000  
https://www.unidata.ucar.edu/software/tds/current/tutorial/files/FmrcPoster.pdf  
https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml

## Background and Goals

The cells below download and access a single run from the Hycom simulation for the Gulf of Mexico called "GOMu0.04_901m000_FMRC". FMRC stands for Forecast Model Run Collection. The retrieved files are in NetCDF format. [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a binary file format (with an associated specification, API, and library). Its current version, NetCDF-4, is built on top of the more general [HDF5](https://www.hdfgroup.org/solutions/hdf5/) library.
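
Since NetCDF exists in more than one on-disk variant, it can be useful to check which one a downloaded file uses before introspecting it. The following is a minimal sketch that is not part of the notebook's code: it compares the first bytes of a file against the published signatures (classic NetCDF files start with `CDF` followed by a version byte, while HDF5 files, and therefore NetCDF-4 files, start with an 8-byte HDF5 signature). The names `flavor_from_header` and `sniff_netcdf_flavor` are made up for illustration.

```python
# 8-byte signature that opens every HDF5 file (NetCDF-4 files included).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Version byte that follows the ASCII letters "CDF" in classic NetCDF files.
CLASSIC_VERSIONS = {
    1: "netcdf-classic",
    2: "netcdf-64bit-offset",
    5: "netcdf-cdf5",
}


def flavor_from_header(head):
    """Classify a file from its first bytes (at least 8 bytes recommended)."""
    if head[:8] == HDF5_SIGNATURE:
        # HDF5 container; a NetCDF-4 file looks like this at the byte level.
        return "hdf5"
    if len(head) >= 4 and head[:3] == b"CDF":
        return CLASSIC_VERSIONS.get(head[3], "unknown")
    return "unknown"


def sniff_netcdf_flavor(path):
    """Read the first 8 bytes of `path` and classify them.

    Only offset 0 is checked; HDF5 files with a user block are not handled.
    """
    with open(path, "rb") as fh:
        return flavor_from_header(fh.read(8))
```

A file reported as `hdf5` is only a candidate NetCDF-4 file; confirming that still requires opening it with the NetCDF library.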

### Initial Goals
1. Define a type that mixes in `File` and/or `Client` where Hycom simulation data can be collected. An initial prototype, `HycomFMRC`, is already provisioned.
2. Define a type to handle the data download, possibly mixing the "REST" type.
  - Do file introspection (of NetCDF/HDF5 file) to populate fields of our `HycomFMRC` type once the file is downloaded.
  - Automate retrieval using Cron etc.
3. Explore possibilities for retrieving data from files:
  - One use case: Retrieve a series of 2D slices...over time (say surface temp or something) and be able to either directly load them or stream them.
  
Generally, after solving the storage issue and figuring out the source, transform, and entity types, I am _assuming_ we will want to support retrieving and/or streaming data from any of the datasets (variables) in the collection of runs _across time_.

### More on NetCDF
NetCDF-4 files are HDF5 files. Both formats have rich software ecosystems that support efficient data access, and both are used to manage large multidimensional datasets for many large-scale HPC codes. If one were to support the use case mentioned above using only NetCDF/HDF5, it could be accomplished as follows:
* Create a directory containing the collection of FMRC run files.
* Add a "parent" file that contains a dataset pointing to each dataset in the individual run files.
* Use the NetCDF (or HDF5) library to open the parent file and request an array that slices across all the files as desired.
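
The "parent file" idea above amounts to an index that maps a position on a combined time axis to a (file, local index) pair. The sketch below illustrates only that bookkeeping in plain Python. It does not use the NetCDF or HDF5 libraries, the class name `ParentIndex` and the file names are invented, and it assumes the run files are simply concatenated end to end (real forecast runs may overlap in time, which this sketch ignores).

```python
from bisect import bisect_right
from itertools import accumulate


class ParentIndex:
    """Map a global timestep number to (file name, timestep within that file)."""

    def __init__(self, runs):
        # runs: list of (file_name, number_of_timesteps), in time order.
        self.files = [name for name, _ in runs]
        # Cumulative end offsets, e.g. sizes [3, 2] -> ends [3, 5].
        self.ends = list(accumulate(n for _, n in runs))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def locate(self, t):
        """Return (file name, local index) for global timestep t."""
        if not 0 <= t < len(self):
            raise IndexError(f"timestep {t} out of range")
        i = bisect_right(self.ends, t)          # first file whose end is > t
        start = self.ends[i - 1] if i else 0    # global offset of that file
        return self.files[i], t - start

    def locate_range(self, start, stop):
        """Resolve a contiguous global slice into per-file lookups."""
        return [self.locate(t) for t in range(start, stop)]


# Hypothetical run files with 3 and 2 timesteps respectively.
index = ParentIndex([("run_a.nc", 3), ("run_b.nc", 2)])
print(index.locate_range(2, 4))
```

A real implementation would hand each resolved pair to the file library to read the actual slice; the index itself stays small regardless of how large the arrays are.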


#### Update 8/31/21
The following types have been added to facilitate data archive operations discussed above:
```
HycomDataset
HycomFMRC
HycomFMRCFile
GeospatialCoverage
```
Current functionality will basically do the following:
- Create a `HycomDataset` record based on a Hycom catalog URL, which returns XML.
- Create a set of `HycomFMRC` records that represent the available runs from which data can be downloaded.
- Download files associated with a `HycomFMRC`, store them in the default `FileSystem`, and log them as `HycomFMRCFile` records.
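
To make the catalog step more concrete, here is a hedged sketch of how run entries could be pulled out of a THREDDS catalog document with the standard library. It is not the implementation behind `upsertHycomDatasetFromCatalog`. The inline `SAMPLE_CATALOG` is a hand-written, simplified stand-in for the real response (the second run entry is invented), and `list_runs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Simplified, hand-written sample; the real catalog has more attributes.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="Forecast Model Run" ID="GOMu0.04/expt_90.1m000/FMRC/runs">
    <dataset name="GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z"
             ID="GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z"
             urlPath="GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z"/>
    <dataset name="GOMu0.04_901m000_FMRC_RUN_2021-08-27T12:00:00Z"
             ID="GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-27T12:00:00Z"
             urlPath="GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-27T12:00:00Z"/>
  </dataset>
</catalog>
"""


def list_runs(catalog_xml):
    """Return one dict per downloadable dataset found in the catalog."""
    root = ET.fromstring(catalog_xml)
    runs = []
    for ds in root.iter(f"{{{THREDDS_NS}}}dataset"):
        url_path = ds.get("urlPath")
        if url_path:  # container datasets have no urlPath; skip them
            runs.append({"id": ds.get("ID"), "name": ds.get("name"), "urlPath": url_path})
    return runs


for run in list_runs(SAMPLE_CATALOG):
    print(run["name"])
```

Each returned dict carries roughly the information a `HycomFMRC` record would need to identify a run, though the actual fields of that type are not shown here.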

Next steps and road blocks:
- __roadblock__: How to open the file stored in `HycomFMRCFile` without a local download. Maybe a local download is needed in Python?
- __next step__: Design a time series schema to store FMRC run data:
  - Should support queries in space and time, like a moving bounding box (moving geospatially and temporally).
  - Need help creating a Cassandra schema that can handle multidimensional data efficiently.
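
As an illustration of what a "moving bounding box" query means, the sketch below filters in-memory points against a box that drifts linearly between two positions over a time window. It says nothing about how such a query would map onto a Cassandra schema. The names `Box`, `box_at`, and `query_moving_box`, the linear interpolation, and the sample coordinates are all assumptions made for this example, and boxes crossing the antimeridian are not handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Box:
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def box_at(start_box, end_box, t0, t1, t):
    """Linearly interpolate the box between t0 and t1 (clamped outside)."""
    span = (t1 - t0).total_seconds()
    frac = 0.0 if span <= 0 else min(1.0, max(0.0, (t - t0).total_seconds() / span))

    def lerp(a, b):
        return a + (b - a) * frac

    return Box(
        lerp(start_box.south, end_box.south),
        lerp(start_box.north, end_box.north),
        lerp(start_box.west, end_box.west),
        lerp(start_box.east, end_box.east),
    )


def query_moving_box(points, start_box, end_box, t0, t1):
    """Keep (timestamp, lat, lon, value) tuples inside the box at their own time."""
    return [
        p for p in points
        if t0 <= p[0] <= t1
        and box_at(start_box, end_box, t0, t1, p[0]).contains(p[1], p[2])
    ]


t0 = datetime(2021, 8, 26, 12, tzinfo=timezone.utc)
t1 = t0 + timedelta(hours=2)
start_box = Box(25.0, 26.0, -90.0, -89.0)
end_box = Box(27.0, 28.0, -88.0, -87.0)
points = [
    (t0, 25.5, -89.5, 1.0),
    (t0 + timedelta(hours=1), 25.5, -89.5, 2.0),  # box has moved on by now
    (t0 + timedelta(hours=1), 26.5, -88.5, 3.0),
    (t1, 27.5, -87.5, 4.0),
]
print([p[3] for p in query_moving_box(points, start_box, end_box, t0, t1)])
```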
 
#### Possible use case:
For a given variable, say `water_u`, and a set of FMRC data (which are forecast data), we basically have a 3D field through time. A good first use case would be the following:
- Store 1 or 2 variables for the surface (depth=0) as a time series
- Develop a metric to return a geospatial subset of that data as a timeseries.

Each data point in the field will have the following attributes:
`timestamp, depth, lat, lon, value`  
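
One possible shape for such records, and for a first "geospatial subset as a time series" metric, is sketched below. The record type `FieldPoint`, the function `surface_box_mean`, and the choice of a simple mean per timestamp are assumptions for illustration only; the sample values are invented. Timestamps are kept as ISO 8601 UTC strings so that sorting them as text also sorts them in time.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class FieldPoint(NamedTuple):
    timestamp: str  # ISO 8601, UTC, e.g. "2021-08-26T12:00:00Z"
    depth: float
    lat: float
    lon: float
    value: float


def surface_box_mean(points, south, north, west, east):
    """Average the surface (depth == 0) values inside a box, per timestamp."""
    buckets = defaultdict(list)
    for p in points:
        if p.depth == 0 and south <= p.lat <= north and west <= p.lon <= east:
            buckets[p.timestamp].append(p.value)
    # One (timestamp, mean value) pair per timestep, in time order.
    return [(ts, mean(vals)) for ts, vals in sorted(buckets.items())]


sample = [
    FieldPoint("2021-08-26T13:00:00Z", 0.0, 25.5, -89.5, 2.0),
    FieldPoint("2021-08-26T12:00:00Z", 0.0, 25.5, -89.5, 0.5),
    FieldPoint("2021-08-26T12:00:00Z", 0.0, 25.6, -89.4, 1.5),
    FieldPoint("2021-08-26T12:00:00Z", 10.0, 25.5, -89.5, 9.0),  # below surface
    FieldPoint("2021-08-26T12:00:00Z", 0.0, 40.0, -89.5, 9.0),   # outside box
]
print(surface_box_mean(sample, 25.0, 26.0, -90.0, -89.0))
```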

An initial data set might have 345 latitude points and 541 longitude points with around 150 timesteps. These are stored efficiently in the NetCDF file as arrays.  
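
To get a feel for the sizes involved, the arithmetic can be worked through for one variable at one depth level. The grid dimensions come from the text above; the byte sizes are assumptions (4-byte floats for array storage, and a row layout of an 8-byte timestamp plus four 4-byte fields for the `timestamp, depth, lat, lon, value` form).

```python
N_LAT, N_LON, N_TIME = 345, 541, 150

BYTES_PER_VALUE = 4        # assumption: values stored as float32
ROW_BYTES = 8 + 4 * 4      # assumption: 8-byte timestamp + depth, lat, lon, value

points_per_step = N_LAT * N_LON
total_points = points_per_step * N_TIME

array_bytes = total_points * BYTES_PER_VALUE   # dense array, coordinates implicit
row_bytes = total_points * ROW_BYTES           # one explicit row per data point

print(f"points per timestep: {points_per_step:,}")
print(f"total points:        {total_points:,}")
print(f"as dense array:      {array_bytes / 2**20:.1f} MiB")
print(f"as explicit rows:    {row_bytes / 2**20:.1f} MiB")
print(f"row/array ratio:     {row_bytes / array_bytes:.1f}x")
```

Under these assumptions the row form is several times larger than the array form before any database overhead, which is one reason the schema design matters.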

## Requirements
This notebook requires the `py-hycom_1_0_0` kernel.

A prototype `HycomFMRC` type is provisioned with the `dti-jupyter` package:

In [7]:
# Standard library
from datetime import date, timedelta
from urllib.parse import urlencode, urljoin
import xml.etree.ElementTree as ET

# Third-party packages
import netCDF4 as nc
import pandas as pd
import requests
import xmltodict
from IPython.display import display
from pivottablejs import pivot_ui

# Note: `c3` is used below without an import; presumably the kernel provides it.

## Types
The following types are currently provisioned to support Hycom Data:  
(todo: run query to list all Types in hycom- package.)  
```
HycomDataset
HycomFMRC
HycomFMRCFile
GeospatialCoverage
```
Uncomment and run the `help` cells below for more information.

In [8]:
#help(c3.HycomDataset)

In [9]:
#help(c3.HycomFMRC)

In [10]:
#help(c3.HycomFMRCFile)

In [11]:
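# Ensure we have a Dataset entry for the desired catalog.
# The URL scheme must be "https"; the run recorded below used "thttps" by
# mistake, which is what triggers the InvalidSchema error in its output.
cat_url = "https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml"
gom_dataset = c3.HycomDataset.upsertHycomDatasetFromCatalog(url = cat_url)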

C3RuntimeException: 500 - NotClassified - c3.love.util.OsUtil_err2 [832.42699]
message: "Error executing command: /usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/bin/python /tmp/pythonActionSourceCache3500929226809844567/HycomDataset_upsertHycomDatasetFromCatalog.py
p_logger=main url=http://dev-dti-app-m-02:8080 connector=null mode="thick" Action failed!
Traceback (most recent call last):
  File "/tmp/pythonActionSourceCache3500929226809844567/HycomDataset_upsertHycomDatasetFromCatalog.py", line 379, in _c3_remote_bootstrap__run_c3_action
    _c3_result = _action()
  File "/tmp/pythonActionSourceCache3500929226809844567/HycomDataset_upsertHycomDatasetFromCatalog.py", line 567, in <lambda>
    action=lambda: upsertHycomDatasetFromCatalog(url = _c3_inputs.get('url')),
  File "/tmp/pythonActionSourceCache3500929226809844567/HycomDataset_upsertHycomDatasetFromCatalog.py", line 507, in upsertHycomDatasetFromCatalog
    with requests.get(url) as r:
  File "/usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/lib/python3.6/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/lib/python3.6/site-packages/requests/sessions.py", line 649, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/share/c3/condaEnvs/dti-jupyter/tc02/py-hycom_1_0_0/lib/python3.6/site-packages/requests/sessions.py", line 742, in get_adapter
    raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for 'thttps://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml'"
JSON: {"url": "thttps://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml"}
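
The `InvalidSchema` error above comes from the stray leading `t` in the URL scheme (`thttps`), which `requests` has no connection adapter for. A small guard like the one sketched below could catch this before any remote action is invoked. The helper name `check_http_url` is hypothetical and not part of the Hycom types.

```python
from urllib.parse import urlsplit


def check_http_url(url):
    """Return `url` unchanged if it is an http(s) URL with a host, else raise."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme {parts.scheme!r} in {url!r}")
    if not parts.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return url


good = "https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml"
bad = "t" + good  # reproduces the typo from the failing cell

check_http_url(good)
try:
    check_http_url(bad)
except ValueError as err:
    print(err)
```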

In [12]:
# Grab the HycomDataset record that was created.
objs = c3.HycomDataset.fetch().objs
if objs:
    display(pd.DataFrame(objs.toJson()))


| | type | id | name | meta | version | hycom_version | description | geospatialCoverage | catalog_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | HycomDataset | GOMu0.04_901m000_FMRC_1.0.1 | GOMu0.04_901m000_FMRC_1.0.1 | {'type': 'Meta', 'tenantTagId': 155, 'tenant':... | 2 | 1.0.1 | HYCOM + NCODA Gulf of Mexico 1/25° Analysis (N... | {'type': 'GeospatialCoverage', 'start': {'type... | https://tds.hycom.org/thredds/catalog/GOMu0.04... |


In [13]:
# Create HycomFMRC records for every run that is currently listed in the catalog
# This uses the...
fmrcs = gom_dataset.upsertFMRCFromDatasetCatalog()
fmrcs

NameError: name 'gom_dataset' is not defined

In [None]:
# Grab the HycomFMRC records that were created.
objs = c3.HycomFMRC.fetch().objs
if objs:
    display(pd.DataFrame(objs.toJson()))

In [None]:
# Detail: look at the timeCoverage for a single HycomFMRC
fmrcs = c3.HycomFMRC.fetch()
fmrcs.objs[0].timeCoverage

In [None]:
# Download data files for each FMRC record.
# Note: each request currently fetches a single timestep, but multiple
# files can be retrieved for a single HycomFMRC record.
# This demo requests the times 0, 1, and 2 hours after the start of each run.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def downloadAll(fmrcs):
#     fmr = fmrcs.objs[0]
#     start = fmr.timeCoverage.start
#     times = [
#         fstr+(start + timedelta(hours=t)).strftime("%Y-%m-%dT%H:%M:%SZ") 
#         for t in range(0,2)
#         for fstr in ['a','b']
#     ]
#     print(times)
    fmrc_files = []
    for h in range(0, 3):
        for fmr in fmrcs.objs:
            # Using the same value for time_start and time_end selects one timestep.
            stamp = (fmr.timeCoverage.start + timedelta(hours=h)).strftime(TIME_FORMAT)
            fmrc_files.append(
                fmr.downloadFMRCRunData(
                    time_start = stamp,
                    time_end = stamp,
                    vars = ['water_u','water_v']
                )
            )
    return fmrc_files

downloadAll(fmrcs)
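
The `time_start`, `time_end`, and `vars` arguments resemble the query parameters of the THREDDS NetCDF Subset Service (NCSS). Whether `downloadFMRCRunData` actually uses NCSS is an assumption, as is the `thredds/ncss/` base path used below; the sketch only shows how the single-timestep windows and a matching query string could be built with the standard library. `hourly_windows` and `ncss_url` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

TIME_FMT = "%Y-%m-%dT%H:%M:%SZ"


def hourly_windows(start, n_hours):
    """One (time_start, time_end) pair per hour; start == end picks one timestep."""
    stamps = [(start + timedelta(hours=h)).strftime(TIME_FMT) for h in range(n_hours)]
    return [(s, s) for s in stamps]


def ncss_url(base, variables, time_start, time_end):
    """Build an NCSS-style request URL (parameter names follow THREDDS NCSS)."""
    query = [("var", v) for v in variables]  # one "var" entry per variable
    query += [("time_start", time_start), ("time_end", time_end), ("accept", "netcdf4")]
    return base + "?" + urlencode(query)


# Assumed base path; the run name is taken from the notebook's recorded output.
BASE = (
    "https://tds.hycom.org/thredds/ncss/"
    "GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z"
)

run_start = datetime(2021, 8, 26, 12, tzinfo=timezone.utc)
for t_start, t_end in hourly_windows(run_start, 3):
    print(ncss_url(BASE, ["water_u", "water_v"], t_start, t_end))
```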

In [None]:
# List the resulting HycomFMRCFile records
objs = c3.HycomFMRCFile.fetch().objs
if objs:
    display(pd.DataFrame(objs.toJson()))

In [None]:
files = c3.FileSystem.inst().listFiles("hycom-data")
files

In [None]:
# TODO: Open a file to confirm...
# Question: How do I call member functions of type "File" from HycomFMRCFile?
file = c3.HycomFMRCFile.fetch().objs[0]
file.fs()
#f =c3.File(url=c3.HycomFMRCFile.fetch().objs[0].url)
#ds = nc.Dataset(f.stream())
#print(ds)
#with c3.HycomFMRCFile.fetch().objs[0].stream() as stream:
#    ds = nc.Dataset(stream)
#    print(ds)

In [16]:
# Cleanup
print(f"Removed {c3.HycomFMRCFile.removeAll()} HycomFMRCFile records.")
print(f"Removed {c3.HycomFMRC.removeAll()} HycomFMRC records.")
print(f"Removed {c3.HycomDataset.removeAll()} HycomDataset records")
files = c3.FileSystem.inst().listFiles("hycom-data")
if files.files:
    print(f"Deleting {len(files.files)} files")
    c3.FileSystem.inst().deleteFilesBatch(files.files)
print("Done.")

Removed 14 HycomFMRCFile records.
Removed 6 HycomFMRC records.
Removed 1 HycomDataset records
Deleting 6 files
Done.


In [14]:
file = c3.HycomFMRCFile.fetch().objs[0]
file.fs()

C3RuntimeException: 500 - NotClassified - c3.love.exceptions.C3RuntimeException_wrapIt [897.36346]
message: "wrapped ClassCastException: c3.type.metadata.impl.PersistableImpl cannot be cast to c3.type.file.File"
JSON: {"this": {"type": "HycomFMRCFile", "url": "hycom-data/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z-2021-08-26T12:00:00Z.nc", "id": "08ca912a-a7ae-4a49-aacb-d67fda049b60", "name": "GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z-2021-08-26T12:00:00Z.nc", "meta": {"type": "Meta", "tenantTagId": 155, "tenant": "dti-jupyter", "tag": "tc02", "created": "2021-08-31T00:00:34+00:00", "createdBy": "dadams@illinois.edu", "updated": "2021-08-31T00:00:34+00:00", "updatedBy": "dadams@illinois.edu", "timestamp": "2021-08-31T00:00:34+00:00", "fetchInclude": "[]", "fetchType": "HycomFMRCFile"}, "version": 1, "hycomFMRC": {"type": "HycomFMRC", "id": "GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z"}, "timeCoverage": {"type": "TimeRange", "start": "2021-08-26T12:00:00+00:00", "end": "2021-08-26T12:00:00+00:00"}, "fileType": "netcdf4"}}

In [15]:
file

c3.HycomFMRCFile(
 url='hycom-data/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z-2021-08-26T12:00:00Z.nc',
 id='08ca912a-a7ae-4a49-aacb-d67fda049b60',
 name='GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z-2021-08-26T12:00:00Z.nc',
 meta=c3.Meta(
        tenantTagId=155,
        tenant='dti-jupyter',
        tag='tc02',
        created=datetime.datetime(2021, 8, 31, 0, 0, 34, tzinfo=datetime.timezone.utc),
        createdBy='dadams@illinois.edu',
        updated=datetime.datetime(2021, 8, 31, 0, 0, 34, tzinfo=datetime.timezone.utc),
        updatedBy='dadams@illinois.edu',
        timestamp=datetime.datetime(2021, 8, 31, 0, 0, 34, tzinfo=datetime.timezone.utc),
        fetchInclude='[]',
        fetchType='HycomFMRCFile'),
 version=1,
 hycomFMRC=c3.HycomFMRC(
             id='GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z'),
 timeCoverage=c3.TimeRange(
                start=datetime.datetime(2021, 8, 26, 12, 0, tzinfo=datetime.timezone.utc),
           

In [19]:
help(c3.File)