Scalability, cloud support and fast access #38
Comments
High-throughput processing (job scheduler) with auto-scaling + S3 + Zarr; also dask/xarray/zarr. See the example of a Zarr datastore, attached from a small test.
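A minimal sketch of the dask/xarray/zarr combination mentioned above (the store path and variable name are assumptions, and reading S3 paths requires s3fs): xarray builds a lazy, chunk-wise task graph that a dask scheduler can spread across auto-scaled workers.

```python
import xarray as xr

# Hypothetical Zarr store on S3; opens lazily with dask-backed arrays
ds = xr.open_zarr("s3://example-bucket/survey.zarr")

sv_mean = ds["sv"].mean(dim="ping_time")  # lazy: one task per chunk
result = sv_mean.compute()                # runs in parallel on the scheduler
```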
Parallel HTTP 'range requests' (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) with netCDF files stored on cloud object stores can offer performance comparable to pre-chunked formats like Zarr. This works by requesting individual slices of bytes from a netCDF file, which play the same role as a Zarr 'chunk'. There is some overhead involved in making that happen, but if you can do it you may still be able to use netCDF.
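A minimal sketch of that range-request pattern (the bucket, file path, variable name, and dimension name are all hypothetical, and it assumes h5netcdf and s3fs are installed): opening a netCDF4/HDF5 file on S3 through fsspec means only the byte ranges backing the requested slice are downloaded.

```python
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)
with fs.open("s3://example-bucket/survey/EK60_L1.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    # Only the byte ranges backing this slice are fetched from S3
    subset = ds["Sv"].isel(ping_time=slice(0, 100)).load()
```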
Keep in mind that the "raw" output from sonars is not a set of tensors. The channels are usually not aligned in frequency and time, pings may drop out, and the range vector may differ between frequencies. This was the rationale for suggesting the Gridded group, and perhaps that is a candidate for the more efficient N-d array methods? We used memory maps in Python (see the sketch below) and that worked, but I would really like to see something that is platform/language independent.
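A minimal sketch of the memory-map approach mentioned above (the file name, shape, and dtype are hypothetical): the array is paged in lazily from disk, though the format is NumPy-specific rather than platform/language independent.

```python
import numpy as np

# Hypothetical gridded Sv values, (frequency, ping, range) after alignment
sv = np.memmap("gridded_sv.dat", dtype=np.float32, mode="r",
               shape=(4, 100000, 1000))

# Touches only the mapped pages needed for the first 1000 pings
mean_per_frequency = sv[:, :1000, :].mean(axis=(1, 2))
```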
Data standards: scalability considerations (Saildrone perspective). Data standards for labeled acoustic data need to support efficient computation in the cloud as the number of files and the volume of data grow exponentially. The problem set: inputs and outputs are tensors. Solution space to explore: scalable cloud storage (e.g., AWS S3).
This NOAA strategy is of relevance to the above: https://www.noaa.gov/media-release/noaa-finalizes-strategies-for-applying-emerging-science-and-technology
Here is a process suggested by a colleague on the NOAA ocean modeling team; might it be useful for sonar data?
To follow up on Nils Olav's comment above, multi-channel echosounders can generally only provide data in a tensor format after some processing of their 'raw' output. A place could be provided for this processed data in the sonar-netCDF format (e.g., the Gridded group mentioned in Nils Olav's comment), but will this help to address the problem set?
Here is what I need, and perhaps also what others need: after following this discussion, I think that the interpretation masks in the gridded data should follow the grid, i.e. be less flexible than what is suggested in sonar-netCDF. Also, the requirements for the gridded data in terms of cloud support suggest not using netCDF, but the convention should still apply in terms of content. pyEcholab offers a ping-alignment function (a generic sketch of the idea follows below). Any thoughts on this?
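This is not pyEcholab's API, just a generic xarray sketch of what ping alignment does, under assumed variable and dimension names: each channel's data is reindexed onto one common ping-time grid so the channels become a single tensor.

```python
import numpy as np
import xarray as xr

def align_channels(channels, tolerance=np.timedelta64(500, "ms")):
    """Reindex each channel's data onto the first channel's ping_time axis.

    channels: list of xr.DataArray, each with a datetime64 'ping_time'
    coordinate (assumed names). Pings with no match within `tolerance`
    become NaN, making dropped pings explicit in the grid.
    """
    target = channels[0]["ping_time"]
    return [ch.reindex(ping_time=target, method="nearest", tolerance=tolerance)
            for ch in channels]
```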
Wu-Jung's Echopype converts raw EK60, EK80, and AZFP files to sonar-netCDF. However, she has found that sonar-netCDF isn't particularly cloud-friendly and is not immediately scalable to working in a cloud environment.
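For reference, a sketch of that conversion step (the input file path is hypothetical, and the function names follow recent echopype releases, which may differ from the API at the time of this thread):

```python
from echopype import open_raw

# Parse a hypothetical raw EK60 file into the sonar-netCDF data model
ed = open_raw("D20170615-T190214.raw", sonar_model="EK60")

ed.to_netcdf(save_path="converted/")  # sonar-netCDF4 file
ed.to_zarr(save_path="converted/")    # chunked, cloud-friendly store
```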
This issue was moved by gavinmacaulay to ices-publications/SONAR-netCDF4#18.
There is a growing need in the community to support fast access to large volumes of sonar data, including interpretations (labels or annotations). Parallel processing, cloud computing, and deep learning frameworks like PyTorch or Keras/TensorFlow need an efficient data model on the back end.
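A minimal sketch of what such a back end could look like (the store layout, array names, and dtypes are all assumptions): a PyTorch Dataset that reads echogram windows and label masks from a chunked Zarr store, so data-loader workers fetch only the chunks they need.

```python
import torch
import zarr
from torch.utils.data import Dataset, DataLoader

class EchogramDataset(Dataset):
    def __init__(self, store_path, window=256):
        root = zarr.open(store_path, mode="r")
        self.sv = root["sv"]          # assumed (frequency, ping, range) array
        self.labels = root["labels"]  # assumed (ping, range) annotation mask
        self.window = window

    def __len__(self):
        return self.sv.shape[1] // self.window

    def __getitem__(self, i):
        s = slice(i * self.window, (i + 1) * self.window)
        # Slicing a zarr array reads only the chunks covering this window
        x = torch.from_numpy(self.sv[:, s, :])
        y = torch.from_numpy(self.labels[s, :])
        return x, y

# loader = DataLoader(EchogramDataset("s3://bucket/survey.zarr"), batch_size=4)
```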