Scalability, cloud support and fast access #38

Closed
nilsolav opened this issue Mar 20, 2020 · 10 comments

Comments

@nilsolav
Contributor

nilsolav commented Mar 20, 2020

There is a growing need in the community to support fast access to large volumes of sonar data, including interpretations (labels or annotations). Parallel processing, cloud computing and the use of deep learning frameworks like PyTorch or Keras/TensorFlow require an efficient data model in the back end.

@hmoustahfid-NOAA

High-throughput processing (job scheduler) with auto-scaling + S3 + Zarr.
This approach has been applied to satellite data (netCDF, HDF5, BUFR) processing before; object storage in a bucket or blob storage in a container is the cheapest and most scalable storage option.
Same idea as Pangeo: https://pangeo.io/data.html

Also dask/xarray/zarr.
Zarr relies on JSON files to record the attributes/shape of the dataset, which you can also inspect without the library itself.

See the attached example of a Zarr datastore from a small test:
range_angle_40107_0_260000.zarr.zip
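
As a rough sketch of how such a store can be used (assuming the attached zip is unzipped locally and holds an xarray-style dataset; nothing else about its contents is assumed):

```python
# Hedged sketch: open the attached Zarr store lazily and peek at the JSON
# metadata that records the attributes/shape of the dataset.
import json
import xarray as xr

ds = xr.open_zarr("range_angle_40107_0_260000.zarr")   # lazy, chunked open
print(ds)                                               # dimensions, variables, attributes

# The attribute/shape metadata live in plain JSON files inside the store,
# so they can also be inspected without the zarr library at all:
with open("range_angle_40107_0_260000.zarr/.zattrs") as f:
    print(json.load(f))
```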

@hmoustahfid-NOAA

Parallel HTTP 'range requests' (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) against netCDF files stored on cloud object stores can offer performance similar to pre-chunked formats like Zarr.

This works by requesting individual byte ranges of a netCDF file, which play the same role as the chunks of a Zarr store. There is some overhead involved in making that happen, but if that is acceptable you may be able to keep using netCDF.
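
A minimal sketch of that approach, assuming fsspec/s3fs, h5netcdf and xarray are installed; the bucket, file name, variable and dimension names are all hypothetical:

```python
# Open a netCDF file that lives on S3 without downloading it; every lazy read
# is translated into an HTTP range request for just the bytes of that slice.
import fsspec
import xarray as xr

with fsspec.open("s3://example-bucket/survey/echosounder.nc", mode="rb", anon=True) as f:
    ds = xr.open_dataset(f, engine="h5netcdf", chunks={})       # lazy open; nothing fetched yet
    one_ping = ds["backscatter_r"].isel(ping_time=0).load()     # hypothetical variable/dimension
```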

@nilsolav
Contributor Author

Keep in mind that the "raw" output from sonars is not a set of tensors. The channels are usually not aligned in frequency and time, pings may drop out, and the range vector may differ between frequencies. This was the rationale for suggesting the Gridded group, and perhaps that is a candidate for the more efficient N-d array methods? We used memory maps in Python and that worked, but I would really like to see something that is platform/language independent.
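
To make the point concrete, here is a rough sketch (not part of any convention; all sizes and the 5 cm grid spacing are illustrative) of putting two channels with different range vectors onto one common grid so they stack into a tensor:

```python
import numpy as np

def regrid_ping(sv, ranges, common_range):
    """Interpolate one ping's samples onto the shared range grid;
    samples beyond the channel's own range vector become NaN."""
    return np.interp(common_range, ranges, sv, left=np.nan, right=np.nan)

# Two channels with different sample counts and maximum ranges (dummy data):
range_38, sv_38 = np.linspace(0, 500, 2000), np.random.randn(2000)
range_120, sv_120 = np.linspace(0, 250, 3000), np.random.randn(3000)

common_range = np.arange(0.0, 500.0, 0.05)      # shared 0-500 m grid, 5 cm bins
gridded = np.stack([regrid_ping(sv_38, range_38, common_range),
                    regrid_ping(sv_120, range_120, common_range)])
print(gridded.shape)                            # (2, 10000): frequency x range
```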

@sdehalleux

Data standards - Scalability consideration (Saildrone perspective)

Data standards for labeled acoustic data need to support efficient computation in the cloud as the number of files and the volume of data scale exponentially.
A repository of netCDF files would not enable this. Based on Saildrone's cloud-centric experience, the considerations below need to be taken into account to ensure the data standards support exponential growth of acoustic data, with attention to access speed, computational efficiency, and storage volume/cost.

The Problem set

Inputs and outputs are tensors.
Data are larger than memory.
Computation can be parallelized.
I/O is a bottleneck.
Data are compressible.
Speed matters.
Cost matters.
Data mean different things to different users (unit & resolution).

Solution space to explore

Scalable cloud storage solution (e.g. AWS S3)
File format (e.g. Parquet + netCDF)
Compression (e.g. HDF5 + Parquet)
Metadata (e.g. netCDF + metadata service)
Chunked, parallel tensor computing framework (e.g. Dask or Spark)
Chunked, parallel tensor storage library (e.g. Zarr); see the sketch after this list
Upload / download / process / POC (e.g. Jupyter notebook)
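
A minimal illustration of the last two bullets (Dask for chunked parallel computing, Zarr for chunked storage on object storage); the array shape, chunking and bucket name are assumptions for the example only:

```python
import dask.array as da

# A fake survey: pings x channels x samples, chunked along the ping axis so
# each chunk can be computed on and stored independently, in parallel.
sv = da.random.random((100_000, 4, 2_000), chunks=(1_000, 4, 2_000))
sv_mean = sv.mean(axis=0).compute()          # chunk-parallel reduction with Dask

# Write the chunked array to object storage; each chunk becomes one object.
da.to_zarr(sv, "s3://example-bucket/survey.zarr", component="Sv",
           storage_options={"anon": False})
```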

@hmoustahfid-NOAA

Here is a process suggested by a colleague from the NOAA ocean modelers team, which may be useful for sonar data:
(1) develop sample pipelines for pushing/post-processing model-generated data to/at the cloud in a cloud-optimized format (Zarr);
(2) deploy a Pangeo Cloud instance specifically configured for the analysis and visualization of the data;
(3) develop reproducible Jupyter notebooks that operate on the data efficiently as part of the cloud workflow;
(4) develop stand-alone web applications and services that use the same scalable infrastructure on the back end.
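
A minimal sketch of step (1), assuming xarray with a Dask backend; the file name, chunk size and bucket are illustrative:

```python
import xarray as xr

# Read the output lazily in chunks, then push a cloud-optimized Zarr copy to
# object storage for the Pangeo-style workflow in steps (2)-(4).
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})
ds.to_zarr("s3://example-bucket/model_output.zarr", mode="w",
           storage_options={"anon": False})
```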

@gavinmacaulay
Collaborator

To follow up on Nils Olav's comment above, multi-channel echosounders can generally only provide data in a tensor format after some processing of their 'raw' data output.

A place can be provided for this processed data in the sonar-netCDF format (e.g., the Gridded group mentioned in Nils Olav's comment) - but will this help to address the problem set?

@nilsolav
Contributor Author

nilsolav commented Apr 2, 2020

Here is what I need, and perhaps also what others need:
- We need code that converts proprietary raw data to the sonar-netCDF format.
- We need code that converts the interpretation masks in LSSS, Echoview and any other software to interpretation masks in sonar-netCDF. I have Matlab code that can read the LSSS masks.
- We need code that reads sonar-netCDF (both raw data and interpretation masks), regrids it onto a common grid (set by parameters) and writes it to a cloud-friendly format that can be used efficiently by TensorFlow, PyTorch or any other machine learning framework with a Python API (see the sketch below).
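
As a sketch of what the consumer side of the last point could look like (the store path, variable name "sv" and patch size are assumptions, not a proposal for the convention), a PyTorch Dataset can read training patches straight from a regridded Zarr store:

```python
import torch
import zarr
from torch.utils.data import DataLoader, Dataset

class EchogramPatches(Dataset):
    """Serve fixed-width echogram patches from a (ping, range) Zarr array."""

    def __init__(self, store_path, patch_width=256):
        self.sv = zarr.open(store_path, mode="r")["sv"]   # lazy handle; chunks read on demand
        self.patch_width = patch_width

    def __len__(self):
        return self.sv.shape[0] // self.patch_width

    def __getitem__(self, idx):
        start = idx * self.patch_width
        patch = self.sv[start:start + self.patch_width, :]   # touches only the needed chunks
        return torch.from_numpy(patch).float()

loader = DataLoader(EchogramPatches("gridded_survey.zarr"), batch_size=8, num_workers=4)
```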

After following this discussion, I think that the interpretation masks in the gridded data should follow the grid, i.e. be less flexible than what is suggested in sonar-netCDF. Also, it seems that there are cloud-support requirements for the gridded data that would suggest not using netCDF, but the convention should still apply in terms of content.

pyEcholab offers a ping-alignment function.

Any thoughts on this?

@carriecwall

carriecwall commented Apr 2, 2020

Wu-Jung's Echopype converts raw EK60, EK80 and AZFP data to sonar-netCDF. However, she has identified that sonar-netCDF isn't particularly cloud-friendly and is not immediately scalable to working in a cloud environment.
We are able to run pyEcholab on AWS now. Testing on large volumes of EK60 data hosted in S3 buckets will start shortly. The exported processed data format is still being explored and will be dictated by the preference of the community.
This is an interesting article on the different formats, where Zarr and N5 (the Java sibling of Zarr) are discussed along with HDF.

@ghost

ghost commented Jun 26, 2020

This issue was moved by gavinmacaulay to ices-publications/SONAR-netCDF4#18.

@ghost ghost closed this as completed Jun 26, 2020