# Example: hycom Data Processing

This notebook outlines some of the basic concepts needed for defining a C3-based achitecture to archive and work with Hycom FMRC data.

## References
https://www.hycom.org/data/gomu0pt04/expt-90pt1m000  
https://www.unidata.ucar.edu/software/tds/current/tutorial/files/FmrcPoster.pdf  
https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml

## Background and Goals

The cells below download and access a single run from the hycom simulation for the Gulf of Mexico called "GOMu0.04_901m000_FMRC". FMRC means: Forcast Model Run Collection. The files retrived are in NetCDF format.  [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a binary file format (spec/api/library) written on top of the more general [HDF5](https://www.hdfgroup.org/solutions/hdf5/) library.

### Inital Goals
1. Define a type that Mixes `File` and/ or `Client` where Hycom sim data can be collected. We have a initial prototype provisioned called `HycomFMRC`
2. Define a type to handle the data download, possibly mixing the "REST" type.
  - Do file introspection (of NetCDF/HDF5 file) to populate fields of our `HycomFMRC` type once the file is downloaded.
  - Automate retrieval using Cron etc.
3. Explore possibilities for retrieving data from files:
  - One use case: Retrieve a series of 2D slices...over time (say surface temp or something) and be able to either directly load them or stream them.
  
Generally, after solving the storage issue and figuring out source/tranform and entiy types... I am _assuming_ we will want to support the ability to retrive and/or stream data from any one of the datasets(variables) in the collection of runs _across time_.
  
### More on NetCDF
NetCDF files are HDF5 files.  These formats both have rich software ecosystem that support accessing data efficiently and are used to manage large multidimensionall datasets for many large scale HPC-based codes.  IF one were to support the use case I mentioned above using NetCDF/HDF5 only it could be accomplished as follows:
* Create a directory containing the collection of FMRC run files
* Add a "parent" file that contains a dataset that points to  each dataset in the individual run files
* Use the netcdf (or HDF5) library to open the parent file and request an array that does any sort of sliceing and dicing across all the files on desires.


## Requirements
This Notebook requires the py-hycom_1_0_0 kernel.

A prototype `HycomFMRC` type is provision wit hthe `dti-jupyter` package:

In [39]:
from datetime import date
from datetime import timedelta
import xml.etree.ElementTree as ET
import netCDF4 as nc
import requests
import pandas as pd
from pivottablejs import pivot_ui
import xmltodict
from urllib.parse import urlencode,urljoin
import pandas as pd

## Types
The following types are currently provisioned to support Hycom Data:  
(todo: run query to list all Types in hycom- package.)  
```
HycomDataset
HycomFMRC
HycomFMRCFile
GeospatialCoverage
```

In [37]:
#help(c3.HycomDataset)

In [38]:
#help(c3.HycomFMRC)

In [40]:

# def upsertFMRCFromDatasetCatalog(dataset):
#     url = dataset.catalog_url
#     with requests.get(url) as r:
#         doc = xmltodict.parse(r.text)
    
#     frmcs = [ 
#         c3.HycomFMRC(
#         **{
#             'id': d['@ID'],
#             'dataset': dataset,
#             'run': d['@name'],
#             'timeCoverage': {
#                 'start':d['timeCoverage']['start'],
#                 'end':d['timeCoverage']['end'],
#             },
#             'thredds_url': buildHycomFMRCUrl(
#                    urlpath = d['@urlPath'],
#                    time_start = d['timeCoverage']['start'],
#                    time_end = d['timeCoverage']['end']
#                    )
#         }
#     ).upsert() for d in doc['catalog']['dataset']['dataset']
#             ]
#     return frmcs

# def downloadToExternal(srcUrl, fileName, s3_folder):
#     tmp_path = "/tmp/" + fileName
#     with requests.get(srcUrl, stream=True) as r:
#         r.raise_for_status()
#         with open(tmp_path, 'wb') as f:
#             for chunk in r.iter_content(chunk_size=8192):
#                 f.write(chunk)
#     c3.Client.uploadLocalClientFiles(tmp_path, s3_folder, {"peekForMetadata": True})
#     #c3.Logger.info("file {} downloaded to {}".format(fileName, s3_folder + fileName))
#     os.remove(tmp_path)
#     return s3_folder + '/' + fileName


# def downloadFMRCRunData(this,time_start,time_end,
#                       path=None,
#                       vars=['surf_el','salinity','water_temp','water_u','water_v'],
#                       disableLLSubset='on',
#                       disableProjSubset='on',
#                       horizStride=1,
#                       timeStride=1,
#                       vertStride=1,
#                       addLatLon='true',
#                       accept='netcdf4'
#                      ):
#     """Download FMRC file and create a HycomFMRCFile instance
#     """
#     filetypes = ['netcdf','netcdf4']
#     if accept not in filetypes:
#         raise ValueError(f"Unsupported filetype: {accept} specifed in accept parameter")
    
#     file_ext = '.nc'
    
#     from urllib.parse import urlencode,urljoin
    
#     if path is None:
#         path = 'hycom-data'
    
#     base_url=f"https://ncss.hycom.org/thredds/ncss/{this.urlPath}"
#     #print(base_url)
#     varst = [('var',v) for v in vars]
#     url1 = urlencode(varst,{'d':2})
#     url3 = urlencode({'disableLLSubset':disableLLSubset,
#                       'disableProjSubset':disableProjSubset,
#                       'horizStride':horizStride,
#                       'timeStride':timeStride,
#                       'vertStride':vertStride,
#                       'addLatLon':addLatLon,
#                       'accept':accept
#                      })
    
#     if (time_start == time_end):
#         url2 = urlencode({'time':time_start})
#         filename = this.run + '-' + time_start + file_ext
#     else:
#         url2 = urlencode({'time_start':time_start,'time_end':time_end})
#         filename = this.run + '-' + time_start + '-' + time_end + file_ext
        

#     query = url1 + '&' + url2 + '&' + url3
#     url = base_url+'?'+query
    
#     # download file
#     file = downloadToExternal(
#        srcUrl = url,
#        fileName = filename,
#        s3_folder = path
#     )
    
#     # Upsert HycomFMRCFile instance
    
#     spec = {
#         'hycomFMRC': this,
#         'name': filename,
#         'timeCoverage': {
#             'start': time_start,
#             'end': time_end
#         },
#         'fileType': accept,
#         'url': file
#     }
    
#     fmrc_file = c3.HycomFMRCFile(**spec).upsert()
    
#     return fmrc_file
    

# Ensure we have a Dataset entry for the desired catalog
cat_url = "https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml"
gom_dataset = c3.HycomDataset.upsertHycomDatasetFromCatalog(url = cat_url)
# Create HycomFMRC recors for every run that is currenty listed in the catalog
frmcs = gom_dataset.upsertFMRCFromDatasetCatalog()

In [None]:
group_df = pd.DataFrame({x.id: {"groupingField": x.get("groupingField").groupingField} 
                         for x in c3.MLTestParent.fetch().objs}).T
display(group_df)

In [36]:
fmrcs = c3.HycomFMRC.fetch()
single_run = fmrcs.objs[0]
start_str = single_run.timeCoverage.start.strftime("%Y-%m-%dT%H:%M:%SZ")
print(f"downloading time stamp: {start_str}")
end_str = start_str
downloadFMRCRunData(
    this=single_run,
    time_start=start_str,
    time_end=end_str
)

downloading time stamp: 2021-08-24T12:00:00Z


c3.HycomFMRCFile(
 id='34ef8fdb-f2c8-4366-9232-ac4453292614',
 name='GOMu0.04_901m000_FMRC_RUN_2021-08-24T12:00:00Z-2021-08-24T12:00:00Z.nc',
 meta=c3.Meta(
        created=datetime.datetime(2021, 8, 30, 17, 13, 3, tzinfo=datetime.timezone.utc),
        updated=datetime.datetime(2021, 8, 30, 17, 13, 3, tzinfo=datetime.timezone.utc),
        timestamp=datetime.datetime(2021, 8, 30, 17, 13, 3, tzinfo=datetime.timezone.utc)),
 version=1)

In [5]:
print("https://dataserver3.nccs.nasa.gov/thredds/ncss/bypass/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-23T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time_start=2021-08-23T12%3A00%3A00Z&time_end=2021-08-29T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&accept=netcdf4")

https://dataserver3.nccs.nasa.gov/thredds/ncss/bypass/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-23T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time_start=2021-08-23T12%3A00%3A00Z&time_end=2021-08-29T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&accept=netcdf4


## Manual Download Procedure

In [5]:
from datetime import date
from datetime import timedelta
import xml.etree.ElementTree as ET
import netCDF4 as nc
import requests
import pandas as pd
from pivottablejs import pivot_ui
import xmltodict
from urllib.parse import urlencode,urljoin

In [2]:
def download_ncss_file(url):
    local_filename = url.split('/')[-1].split('?')[0] + '.nc4'
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                # If you have chunk encoded response uncomment if
                # and set chunk_size parameter to None.
                #if chunk: 
                f.write(chunk)
    return local_filename

def downloadToS3External(srcUrl, fileName, s3_folder):
    tmp_path = "/tmp/" + fileName
    with requests.get(srcUrl, stream=True) as r:
        r.raise_for_status()
        with open(tmp_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    c3.Client.uploadLocalClientFiles(tmp_path, s3_folder, {"peekForMetadata": True})
    #c3.Logger.info("file {} downloaded to {}".format(fileName, s3_folder + fileName))
    os.remove(tmp_path)
    return s3_folder + fileName

In [26]:
url="https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=2021-08-26T00%3A00%3A00Z&vertCoord=&accept=netcdf4"
downloadToS3External(url,"test01.nc4","hycom-test")

'hycom-testtest01.nc4'

Next steps:
- Add downloadToS3External as a method on HycomFRMC or HycomDataset
- Add method to parse catalog (xml) endpoint to HycmDataset
    - Catalog parsing can result on download calls

In [None]:
c3.HttpRequest.make(url=)

In [29]:
c3.FileSystem.inst().listFiles("hycom-test")

c3.ListFilesResult(
 files=c3.Arry<File>([c3.AzureFile(
          contentLength=39390145,
          contentLocation='fs/dti/mpodolsky/hycom-test/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z.nc4',
          eTag='"0x8D9699F8AA41780"',
          lastModified=datetime.datetime(2021, 8, 27, 21, 13, 39, tzinfo=datetime.timezone.utc),
          contentMD5='+dG5qmNNa6My21NM8F76OQ==',
          hasMetadata=False,
          url='azure://dev-dti/fs/dti/mpodolsky/hycom-test/GOMu0.04_901m000_FMRC_RUN_2021-08-25T12:00:00Z.nc4',
          blobType='BLOCK_BLOB'),
         c3.AzureFile(
          contentLength=39390145,
          contentLocation='fs/dti/mpodolsky/hycom-test/test01.nc4',
          eTag='"0x8D96A25C61D4F48"',
          lastModified=datetime.datetime(2021, 8, 28, 13, 14, 31, tzinfo=datetime.timezone.utc),
          contentMD5='nEk6ehWNNW302aIrDnB+IA==',
          hasMetadata=False,
          url='azure://dev-dti/fs/dti/mpodolsky/hycom-test/test01.nc4',
          blobType='BLOCK_BLOB')]

In [5]:
print("https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-21T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time_start=2021-08-21T12%3A00%3A00Z&time_end=2021-08-27T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&accept=netcdf4")
def buildHycomFMRCUrl(urlpath,time_start,time_end,
                      vars=['surl_el','salinity','water_temp','water_u','water_v'],
                      disableLLSubset='on',
                      disableProjSubset='on',
                      horizStride=1,
                      timeStride=1,
                      vertStride=1,
                      addLatLon='true',
                      accept='netcdf4'
                     ):
    base_url=f"https://ncss.hycom.org/thredds/ncss/{urlpath}"
    #print(base_url)
    varst = [('var',v) for v in vars]
    url1 = urlencode(varst,{'d':2})
    url2 = urlencode({'disableLLSubset':disableLLSubset,
                      'disableProjSubset':disableProjSubset,
                      'hoizStride':horizStride,
                      'time_start':time_start,
                      'time_end':time_end,
                      'timeStride':timeStride,
                      'vertStride':vertStride,
                      'addLatLon':addLatLon,
                      'accept':accept
                     })
    query = url1+'&'+url2
    url = base_url+'?'+query
    return url

#createHycomDatasetFromCatalog(url="https://tds.hycom.org/thredds/catalog/GOMu0.04/expt_90.1m000/FMRC/runs/catalog.xml")
#ds = getFMRCDataset()

https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-21T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time_start=2021-08-21T12%3A00%3A00Z&time_end=2021-08-27T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&accept=netcdf4


In [7]:
hcds = createHycomDatasetFromCatalog()

[{'run': 'GOMu0.04_901m000_FMRC_RUN_2021-08-28T12:00:00Z',
  'start': '2021-08-28T12:00:00Z',
  'end': '2021-09-03T00:00:00Z',
  'thredds_url': 'https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-28T12:00:00Z?var=surl_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&hoizStride=1&time_start=2021-08-28T12%3A00%3A00Z&time_end=2021-09-03T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&accept=netcdf4'},
 {'run': 'GOMu0.04_901m000_FMRC_RUN_2021-08-27T12:00:00Z',
  'start': '2021-08-27T12:00:00Z',
  'end': '2021-09-02T00:00:00Z',
  'thredds_url': 'https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-27T12:00:00Z?var=surl_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&hoizStride=1&time_start=2021-08-27T12%3A00%3A00Z&time_end=2021-09-02T00%3A00%3A00Z&timeStride=1&vertStride=1&addLatLon=true&acce

### Download a FMRC Run
The following code will download a recent FMRC and print the urls used 
to retrive it along with the header information from the NetCDF file
In particular note the dataset size information:
```
dimensions(sizes): time(1), lat(346), lon(541), depth(40)
```
This means tha tsome variables such as `float32 water_temp(time,depth,lat,lon`) are 4D arrays (3D, really since time = 1) 

In [15]:
today = date.today()
yesterday = today - timedelta(days=1)
daybeforeyesterday = today - timedelta(days=2)

yesterdaystring = yesterday.strftime("%Y-%m-%d") + "T00%3A00%3A00Z"
daybeforeyesterdaystring = daybeforeyesterday.strftime("%Y-%m-%d") + "T12:00:00Z"

url = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_" + daybeforeyesterdaystring + "?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=" + yesterdaystring + "&vertCoord=&accept=netcdf4"
#url = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_" + daybeforeyesterdaystring + "?accept=netcdf4"

print(url)
local_filename = download_ncss_file(url)
ds = nc.Dataset(local_filename)
print(ds) 

https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=2021-08-27T00%3A00%3A00Z&vertCoord=&accept=netcdf4
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    classification_level: UNCLASSIFIED
    distribution_statement: Approved for public release. Distribution unlimited.
    downgrade_date: not applicable
    classification_authority: not applicable
    institution: Naval Oceanographic Office
    source: HYCOM archive file
    history: archv2ncdf3z ;
FMRC Run 2021-08-26T12:00:00Z Dataset
    field_type: instantaneous
    Conventions: CF-1.4, NAVO_netcdf_v1.1
    cdm_data_type: GRID
    featureType: GRID
    location: Proto fmrc:GOMu0.04_901m000_FMRC
    History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = fmrc:GOMu0.04_901m

In [51]:
today = date.today()
yesterday = today - timedelta(days=1)
daybeforeyesterday = today - timedelta(days=2)

yesterdaystring = yesterday.strftime("%Y-%m-%d") + "T00%3A00%3A00Z"
daybeforeyesterdaystring = daybeforeyesterday.strftime("%Y-%m-%d") + "T12:00:00Z"

url = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_" + daybeforeyesterdaystring + "?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=" + yesterdaystring + "&vertCoord=&accept=netcdf4"
#url = "https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_" + daybeforeyesterdaystring + "?accept=netcdf4"

#req = c3.HttpRequest(uri=url)
from urllib.parse import urlencode,urljoin

print(url)

print(buildHycomFMRCUrl(run='GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z'))

https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z?var=surf_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on&disableProjSubset=on&horizStride=1&time=2021-08-27T00%3A00%3A00Z&vertCoord=&accept=netcdf4
https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z
https://ncss.hycom.org/thredds/ncss/GOMu0.04/expt_90.1m000/FMRC/runs/GOMu0.04_901m000_FMRC_RUN_2021-08-26T12:00:00Z?var=surl_el&var=salinity&var=water_temp&var=water_u&var=water_v&disableLLSubset=on


### Access Some Data
Below is a sample cell accessing some of the data.  Note that there are a lot of missing values... for some variables.

In [11]:
time = ds.variables['time'][0]
lat = ds.variables['lat'][:]
lon = ds.variables['lon'][:]
depth = ds.variables['depth'][:]
salinity = ds.variables['salinity'][:]
print(time)
#print(lat)
#print(long)
# A 1D slice for a particular time, depth, and lat:
print(salinity[0,0,0,:])

132.0
[      nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan
       nan       nan       nan       nan       nan       nan       nan


  """


In [7]:
#df = pd.DataFrame({'lon': lon})
#pivot_ui(df)