# Example: HYCOM Data Processing

This notebook outlines some of the basic concepts needed to define a C3-based architecture for archiving and working with HYCOM FMRC data.

## References
https://www.hycom.org/data/gomu0pt04/expt-90pt1m000  
https://www.unidata.ucar.edu/software/tds/current/tutorial/files/FmrcPoster.pdf  
https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml

## Background and Goals

The cells below download and access a single run from the HYCOM simulation for the Gulf of Mexico called "GOMu0.04_901m000_FMRC". FMRC stands for Forecast Model Run Collection. The retrieved files are in NetCDF format. [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a binary file format (spec/API/library); its NetCDF-4 variant is built on top of the more general [HDF5](https://www.hdfgroup.org/solutions/hdf5/) library.

### Initial Goals
1. Define a type that mixes in `File` and/or `Client` where HYCOM simulation data can be collected. An initial prototype called `HycomFMRC` is already provisioned.
2. Define a type to handle the data download, possibly mixing the "REST" type.
  - Do file introspection (of NetCDF/HDF5 file) to populate fields of our `HycomFMRC` type once the file is downloaded.
  - Automate retrieval using Cron etc.
3. Explore possibilities for retrieving data from files:
  - One use case: retrieve a series of 2D slices over time (say, surface temperature) and be able to either load them directly or stream them.
  
Generally, after solving the storage issue and figuring out the source/transform and entity types, I am _assuming_ we will want to support retrieving and/or streaming data from any one of the datasets (variables) in the collection of runs _across time_.
  
### More on NetCDF
NetCDF-4 files are HDF5 files. Both formats have rich software ecosystems that support efficient data access, and both are used to manage large multidimensional datasets for many large-scale HPC codes. If one were to support the use case mentioned above using NetCDF/HDF5 only, it could be accomplished as follows:
* Create a directory containing the collection of FMRC run files
* Add a "parent" file that contains a dataset that points to each dataset in the individual run files
* Use the NetCDF (or HDF5) library to open the parent file and request an array that slices across all the files as desired.
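
The parent-file idea above maps fairly directly onto HDF5 virtual datasets. The following is a minimal sketch using `h5py`, not a tested recipe for the real FMRC files: the file names, the tiny array shapes, and the helper `build_parent_file` are all hypothetical, and it assumes every run file stores the variable with the same shape and that the parent file is read through `h5py`.

```python
import os
import tempfile

import h5py
import numpy as np


def build_parent_file(run_files, parent_path, var_name, source_shape, dtype="f4"):
    """Create a parent HDF5 file whose virtual dataset stacks var_name from each run file.

    Hypothetical helper: assumes var_name has the same shape in every run file.
    """
    source_shape = tuple(source_shape)
    layout = h5py.VirtualLayout(shape=(len(run_files),) + source_shape, dtype=dtype)
    for i, path in enumerate(run_files):
        # A VirtualSource is only a pointer to a dataset in another file; no data is copied.
        layout[i] = h5py.VirtualSource(path, var_name, shape=source_shape)
    with h5py.File(parent_path, "w", libver="latest") as parent:
        parent.create_virtual_dataset(var_name, layout, fillvalue=np.nan)


# Synthetic stand-ins for two run files (hypothetical names, tiny shapes).
workdir = tempfile.mkdtemp()
run_files = []
for i in range(2):
    path = os.path.join(workdir, "run_%d.h5" % i)
    with h5py.File(path, "w") as f:
        f.create_dataset("water_temp", data=np.full((3, 4), float(i), dtype="f4"))
    run_files.append(path)

parent_path = os.path.join(workdir, "parent.h5")
build_parent_file(run_files, parent_path, "water_temp", (3, 4))

with h5py.File(parent_path, "r") as parent:
    # One request that slices across all run files: row 0 of every run's array.
    stacked = parent["water_temp"][:, 0, :]
```

The new leading axis indexes the runs, so `stacked` has one row per run file. Whether this works unchanged on the downloaded `.nc4` files (dimension scales, compression, differing time axes) would need to be checked.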


## Requirements
This Notebook requires the py-hycom_1_0_0 kernel.

A prototype `HycomFMRC` type is provisioned with the `dti-jupyter` package.

## Types

In [1]:
help(c3.HycomDataset)

In [2]:
help(c3.HycomFMRC)

## Manual Download Procedure

In [3]:
from datetime import date
from datetime import timedelta
import xml.etree.ElementTree as ET
import netCDF4 as nc
import requests
import pandas as pd
from pivottablejs import pivot_ui

In [4]:
def download_ncss_file(url):
    """Stream the NCSS response at `url` to a local .nc4 file and return its name."""
    # Name the file after the last URL path segment, dropping the query string.
    local_filename = url.split('/')[-1].split('?')[0] + '.nc4'
    # stream=True avoids loading the whole response into memory at once.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            # For a chunk-encoded response, set chunk_size=None and
            # skip empty chunks with `if chunk:` before writing.
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

In [5]:
# Access the catalog of FMRC runs so they can be listed and their metadata inspected.
url = "https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml"
with requests.get(url) as r:
    r.raise_for_status()
    xmltext = r.text
    cat = ET.fromstring(xmltext)
#for dataset in cat.find('dataset'):
#    print(dataset.attrib)
print(cat.tag)
print(cat.attrib)
print(cat[1][2].tag)
#print(xmltext)

{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}catalog
{'name': 'HYCOM + NCODA Gulf of Mexico 1/25° Analysis (NRL)/GOMu0.04/expt_90.1m000/FMRC (Forecast Model Run Collection)/GOMu0.04_901m000_FMRC', 'version': '1.0.1'}
{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}dataset
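
The tags printed above carry the THREDDS namespace in braces, which is why a plain `cat.find('dataset')` does not match anything. Below is a small sketch of listing datasets with namespace-qualified tags. The sample XML is a hypothetical, heavily trimmed stand-in for the real catalog, and the assumption that downloadable runs are the `dataset` elements with a `urlPath` attribute should be checked against the actual catalog text.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the tag printed by the cell above.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical, trimmed catalog used only to keep this sketch self-contained.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="sample" version="1.0.1">
  <dataset name="GOMu0.04_901m000_FMRC">
    <dataset name="run A" urlPath="example/FMRC/runs/RUN_A"/>
    <dataset name="run B" urlPath="example/FMRC/runs/RUN_B"/>
  </dataset>
</catalog>
"""


def list_run_datasets(xml_text):
    """Return (name, urlPath) pairs for every <dataset> element that has a urlPath."""
    root = ET.fromstring(xml_text.strip())
    # iter() walks the whole tree; the tag must be namespace-qualified to match.
    return [
        (el.get("name"), el.get("urlPath"))
        for el in root.iter("{%s}dataset" % THREDDS_NS)
        if el.get("urlPath") is not None
    ]


for name, url_path in list_run_datasets(SAMPLE_CATALOG):
    print(name, url_path)
```

With the real catalog, `list_run_datasets(xmltext)` could replace the commented-out loop in the cell above.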


### Download a FMRC Run
The following code downloads a recent FMRC run and prints the URL used to retrieve it, along with the header information from the NetCDF file.
In particular, note the dataset size information:
```
dimensions(sizes): time(1), lat(346), lon(541), depth(40)
```
This means that some variables, such as `float32 water_temp(time,depth,lat,lon)`, are 4D arrays (effectively 3D, since time = 1).

In [6]:
today = date.today()
yesterday = today - timedelta(days=1)
daybeforeyesterday = today - timedelta(days=2)

yesterdaystring = yesterday.strftime("%Y-%m-%d") + "T00%3A00%3A00Z"
daybeforeyesterdaystring = daybeforeyesterday.strftime("%Y-%m-%d") + "T12:00:00Z"

url = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_" + daybeforeyesterdaystring + "?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=" + yesterdaystring + "&vertCoord=&accept=netcdf4"
print(url)
local_filename = download_ncss_file(url)
ds = nc.Dataset(local_filename)
print(ds) 

https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=2021-08-26T00%3A00%3A00Z&vertCoord=&accept=netcdf4
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    classification_level: UNCLASSIFIED
    distribution_statement: Approved for public release. Distribution unlimited.
    downgrade_date: not applicable
    classification_authority: not applicable
    institution: Naval Oceanographic Office
    source: HYCOM archive file
    history: archv2ncdf3z ;
FMRC Run 2021-08-25T12:00:00Z Dataset
    field_type: instantaneous
    Conventions: CF-1.4, NAVO_netcdf_v1.1
    cdm_data_type: GRID
    featureType: GRID
    location: Proto fmrc:GOMu0.04_901m000_FMRC
    History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = fmrc:GOMu0.04_901m
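
The request URL in the download cell is assembled by string concatenation, with the colons in the `time` parameter percent-encoded by hand. As an alternative sketch, the same URL can be built with `urllib.parse.urlencode`, which handles the encoding. The helper name `build_ncss_url` and the constants are hypothetical; the base URL, variable names, and query parameters are copied from the cell above, and the 12:00Z run time and 00:00Z valid time are assumed to stay fixed as they are there.

```python
from datetime import date
from urllib.parse import urlencode

NCSS_BASE = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/"
RUN_PREFIX = "GOMu0.04_901m000_FMRC_RUN_"
VARIABLES = ("surf_el", "salinity", "water_temp", "water_u", "water_v")


def build_ncss_url(run_date, valid_date, variables=VARIABLES):
    """Build the NCSS URL for the 12:00Z run on run_date, valid at 00:00Z on valid_date."""
    run_stamp = run_date.strftime("%Y-%m-%d") + "T12:00:00Z"
    # A list of pairs keeps the parameter order and allows the repeated "var" key.
    params = [("var", v) for v in variables]
    params += [
        ("disableLLSubset", "on"),
        ("disableProjSubset", "on"),
        ("horizStride", "1"),
        ("time", valid_date.strftime("%Y-%m-%d") + "T00:00:00Z"),
        ("vertCoord", ""),
        ("accept", "netcdf4"),
    ]
    # urlencode percent-encodes the colons in the time value.
    return NCSS_BASE + RUN_PREFIX + run_stamp + "?" + urlencode(params)


example_url = build_ncss_url(date(2021, 8, 25), date(2021, 8, 26))
print(example_url)
```

For the dates shown, this reproduces the URL printed in the output above, so the result can be passed straight to `download_ncss_file`.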

### Access Some Data
Below is a sample cell that accesses some of the data. Note that some variables contain many missing values.

In [7]:
time = ds.variables['time'][0]
lat = ds.variables['lat'][:]
lon = ds.variables['lon'][:]
depth = ds.variables['depth'][:]
salinity = ds.variables['salinity'][:]
print(time)
#print(lat)
#print(lon)
# A 1D slice for a particular time, depth, and lat:
print(salinity[0,0,0,:])

108.0
[      nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
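
The slice printed above is `nan` throughout, likely because that row of grid cells lies outside the water domain, though nothing in this notebook confirms the cause. To put a number on how much of a slice is missing, here is a small sketch with a hypothetical helper, `missing_fraction`. It assumes the values arrive as plain floats with `nan` marking missing entries, as in the printed output; a masked array would need different handling.

```python
import math


def missing_fraction(values):
    """Return the fraction of entries in a 1D sequence of floats that are NaN."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if math.isnan(v))
    return missing / len(values)


# Hypothetical sample standing in for a slice such as salinity[0, 0, 0, :].
sample_slice = [float("nan"), 35.1, 35.3, float("nan")]
print(missing_fraction(sample_slice))
```

Applying the same function to different latitude rows or depth levels gives a rough picture of where valid data begins.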


In [9]:
#df = pd.DataFrame({'lon': lon})
#pivot_ui(df)

In [11]:
c3.FileSystem.inst().listFiles("hycom-test")

c3.ListFilesResult(
 files=c3.Arry<File>([c3.AzureFile(
          contentLength=39390145,
          contentLocation='fs/dti/mpodolsky/hycom-test/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z.nc4',
          eTag='"0x8D9699F8AA41780"',
          lastModified=datetime.datetime(2021, 8, 27, 21, 13, 39, tzinfo=datetime.timezone.utc),
          contentMD5='+dG5qmNNa6My21NM8F76OQ==',
          hasMetadata=False,
          url='azure://dev-dti/fs/dti/mpodolsky/hycom-test/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z.nc4',
          blobType='BLOCK_BLOB')]))

In [15]:
help(c3.FileSystem)