Scalability, cloud support and fast access #38
Comments
High-throughput processing (job scheduler) with auto-scaling + S3 + Zarr; also dask/xarray/zarr. See the example of a Zarr datastore, attached from a small test.
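A minimal sketch of the dask/xarray/zarr combination mentioned above (the store path and variable name are assumptions, and reading S3 paths requires s3fs): xarray builds a lazy, chunk-wise task graph that a dask scheduler can spread across auto-scaled workers.

```python
import xarray as xr

# Hypothetical Zarr store on S3; opens lazily with dask-backed arrays
ds = xr.open_zarr("s3://example-bucket/survey.zarr")

sv_mean = ds["sv"].mean(dim="ping_time")  # lazy: one task per chunk
result = sv_mean.compute()                # runs in parallel on the scheduler
```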
Parallel HTTP 'range requests' (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) with netCDF files stored on cloud object stores can offer performance comparable to pre-chunked formats like Zarr. This works by requesting individual slices of bytes from a netCDF file, which play the same role as a Zarr 'chunk'. There is some overhead involved in making that happen, but if you can do it you may still be able to use netCDF.
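A minimal sketch of that range-request pattern (the bucket, file path, variable name, and dimension name are all hypothetical, and it assumes h5netcdf and s3fs are installed): opening a netCDF4/HDF5 file on S3 through fsspec means only the byte ranges backing the requested slice are downloaded.

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)
with fs.open("s3://example-bucket/survey/EK60_L1.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    # Only the byte ranges backing this slice are fetched from S3
    subset = ds["Sv"].isel(ping_time=slice(0, 100)).load()
```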
Keep in mind that the "raw" output from sonars is not a set of tensors. The channels are usually not aligned in frequency and time, pings may drop out, and the range vector may differ between frequencies. This was the rationale for suggesting the Gridded group, and perhaps that is a candidate for the more efficient N-d array methods? We used memory maps in Python (see the sketch below) and that worked, but I would really like to see something that is platform/language independent.
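A minimal sketch of the memory-map approach mentioned above (the file name, shape, and dtype are hypothetical): the array is paged in lazily from disk, though the format is NumPy-specific rather than platform/language independent.

```python
import numpy as np

# Hypothetical gridded Sv values, (frequency, ping, range) after alignment
sv = np.memmap("gridded_sv.dat", dtype=np.float32, mode="r",
               shape=(4, 100000, 1000))

# Touches only the mapped pages needed for the first 1000 pings
mean_per_frequency = sv[:, :1000, :].mean(axis=(1, 2))
```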
Data standards: scalability considerations (Saildrone perspective). Data standards for labeled acoustic data need to support efficient computation in the cloud as the number of files and the volume of data grow exponentially. The problem set: inputs and outputs are tensors. Solution space to explore: scalable cloud storage (e.g., AWS S3).
This NOAA strategy is of relevance to the above: https://www.noaa.gov/media-release/noaa-finalizes-strategies-for-applying-emerging-science-and-technology
Here is a process suggested by a colleague on the NOAA ocean modeling team; might it be useful for sonar data?
To follow up on Nils Olav's comment above, multi-channel echosounders can generally only provide data in a tensor format after some processing of their 'raw' output. A place could be provided for this processed data in the sonar-netCDF format (e.g., the Gridded group mentioned in Nils Olav's comment), but will this help to address the problem set?
Here is what I need, and perhaps also what others need: after following this discussion, I think that the interpretation masks in the gridded data should follow the grid, i.e. be less flexible than what is suggested in sonar-netCDF. Also, the requirements for the gridded data in terms of cloud support suggest not using netCDF, but the convention should still apply in terms of content. pyEcholab offers a ping-alignment function (a generic sketch of the idea follows below). Any thoughts on this?
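This is not pyEcholab's API, just a generic xarray sketch of what ping alignment does, under assumed variable and dimension names: each channel's data is reindexed onto one common ping-time grid so the channels become a single tensor.

```python
import numpy as np
import xarray as xr

def align_channels(channels, tolerance=np.timedelta64(500, "ms")):
    """Reindex each channel's data onto the first channel's ping_time axis.

    channels: list of xr.DataArray, each with a datetime64 'ping_time'
    coordinate (assumed names). Pings with no match within `tolerance`
    become NaN, making dropped pings explicit in the grid.
    """
    target = channels[0]["ping_time"]
    return [ch.reindex(ping_time=target, method="nearest", tolerance=tolerance)
            for ch in channels]
```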
Wu-Jung's Echopype converts raw EK60, EK80, and AZFP files to sonar-netCDF. However, she has found that sonar-netCDF isn't particularly cloud-friendly and is not immediately scalable to working in a cloud environment.
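For reference, a sketch of that conversion step (the input file path is hypothetical, and the function names follow recent echopype releases, which may differ from the API at the time of this thread):

```python
from echopype import open_raw

# Parse a hypothetical raw EK60 file into the sonar-netCDF data model
ed = open_raw("D20170615-T190214.raw", sonar_model="EK60")

ed.to_netcdf(save_path="converted/")  # sonar-netCDF4 file
ed.to_zarr(save_path="converted/")    # chunked, cloud-friendly store
```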
This issue was moved by gavinmacaulay to ices-publications/SONAR-netCDF4#18.
There is a growing need in the community to support fast access to large volumes of sonar data, including interpretations (labels or annotations). Parallel processing, cloud computing, and deep learning frameworks like PyTorch or Keras/TensorFlow need an efficient data model on the back end.
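A minimal sketch of what such a back end could look like (the store layout, array names, and dtypes are all assumptions): a PyTorch Dataset that reads echogram windows and label masks from a chunked Zarr store, so data-loader workers fetch only the chunks they need.

```python
import torch
import zarr
from torch.utils.data import Dataset, DataLoader

class EchogramDataset(Dataset):
    def __init__(self, store_path, window=256):
        root = zarr.open(store_path, mode="r")
        self.sv = root["sv"]          # assumed (frequency, ping, range) array
        self.labels = root["labels"]  # assumed (ping, range) annotation mask
        self.window = window

    def __len__(self):
        return self.sv.shape[1] // self.window

    def __getitem__(self, i):
        s = slice(i * self.window, (i + 1) * self.window)
        # Slicing a zarr array reads only the chunks covering this window
        x = torch.from_numpy(self.sv[:, s, :])
        y = torch.from_numpy(self.labels[s, :])
        return x, y

# loader = DataLoader(EchogramDataset("s3://bucket/survey.zarr"), batch_size=4)
```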