What kind of tests did you run? #10

clstoulouse · 2018-02-02T10:44:16Z

Dear,

I would like to know some details about this interface between Thredds and S3:

Have you run tests only with Amazon S3 servers or also with other S3 providers?
Which Thredds version have you tested?
Does this implementation run in production 24/7 without any issues, in particular regarding memory usage?
On your S3 bucket, have you run tests with netcdf files of 1 MB, 100 MB, 1 GB, 5 GB and more? How many netcdf files per dataset did you manage: 1 file, 10 files, 100 files, 1K files, 100K files, more?
Do you manage "aggregation"? If there are several netcdf files in the same folder, how does Thredds aggregate them and present them as a single dataset?

Thanks in advance for your reply.

AlexHilson · 2018-02-02T12:18:24Z

Hi, thanks for the interest.

We've only tested with Amazon S3. We're very interested in exploring federated data storage, but right now all of our use-cases fit within AWS.
pom.xml states TDS version 4.6.10 - I'm pretty sure this is the only version we tried.
This does not run in production, it's just a proof of concept.
We've only done exploratory performance testing, and the results were not great. The first opendap request for a particular file is very slow. Our real datasets are made of many small (<100 MB) netcdf files, so this was a big problem. I believe that this is a bug rather than an inherent problem - my suspicion is that Thredds is requesting too much information up front. For this library to be useful, this issue would have to be properly explored.

I believe that the library would be most useful when accessing netcdf files with a lot of data and a small number of variables.
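One way to test the suspicion above (too much data requested up front) is to record every read a client issues against a file. The following is a minimal sketch, not code from this library: `CountingReader` and `demo` are hypothetical names, and an in-memory `io.BytesIO` buffer stands in for an S3 object.

```python
import io


class CountingReader:
    """Wraps a seekable binary file object and records each read as (offset, length)."""

    def __init__(self, raw):
        self._raw = raw
        self.reads = []  # one (offset, length) tuple per read call

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def read(self, size=-1):
        offset = self._raw.tell()
        data = self._raw.read(size)
        self.reads.append((offset, len(data)))
        return data

    def total_bytes(self):
        """Total number of bytes actually returned across all reads."""
        return sum(length for _, length in self.reads)


def demo():
    # Stand-in payload: the 4-byte netCDF classic magic number plus zero padding (1024 bytes).
    payload = b"CDF\x01" + bytes(1020)
    reader = CountingReader(io.BytesIO(payload))
    magic = reader.read(4)   # a client checking the file type
    reader.seek(512)
    reader.read(16)          # a client reading a small slice of data
    return magic, reader.reads, reader.total_bytes()
```

Comparing `total_bytes()` for the first request against the size of the data actually asked for would show whether the slow first access comes from over-reading.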

We have not explored aggregation - I would really love to try this, but I don't have the Thredds expertise to really understand how to set this up.

For reference, our follow-up experiment to this work was https://github.com/met-office-lab/pysssix, and our current approach is to use https://github.com/kahing/goofys.

For our current use cases being able to access files on disk is all we really need. I think that this approach would work for running TDS, i.e. just run a regular server but the files on disk are accessed via a FUSE mount rather than by trying to replace the file access handler inside TDS.

We're always interested to hear about other people's experience with this kind of stuff - your Motu work looks interesting. If you have an interest in exploring any of these topics further, perhaps we could collaborate.

Thanks,
Alex.

clstoulouse · 2018-02-02T13:23:51Z

Hi, thank you very much for your detailed reply.
On our side, we have also tried the s3fs-fuse solution, but it is too slow; giving TDS direct access to a file system is far faster.
We are also looking into how best to access S3 storage in order to manage large datasets composed of several files (GB).

Best regards,
Sylvain.

AlexHilson · 2018-02-02T13:33:11Z

We did try s3fs-fuse as well, but found it unreliable at the time (> 1 year ago). Goofys has so far worked out a lot better for us.

Ultimately these approaches will always be slower than direct disk access. At some point we intend to look at caching (both on local disk and in redis). I'm particularly interested in whether a specialised cache for netcdf headers only would be helpful.
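To make the header-only cache idea concrete, here is a minimal sketch under stated assumptions. It assumes the header sits within the first `header_bytes` of the object, which holds for the netCDF classic format but may not for netCDF-4/HDF5 files, where metadata can be spread through the file. `HeaderCache`, `fetch_range` and `fake_fetch` are hypothetical names; a real `fetch_range` would issue a ranged GET against S3, and the in-memory dict could be swapped for local disk or redis.

```python
from typing import Callable, Dict


class HeaderCache:
    """Caches only the leading bytes of each object; other reads go straight to the backend."""

    def __init__(self, fetch_range: Callable[[str, int, int], bytes],
                 header_bytes: int = 64 * 1024):
        self._fetch_range = fetch_range    # (key, offset, length) -> bytes
        self._header_bytes = header_bytes  # assumed upper bound on header size
        self._headers: Dict[str, bytes] = {}
        self.remote_calls = 0              # number of backend fetches made

    def read(self, key: str, offset: int, length: int) -> bytes:
        end = offset + length
        if end <= self._header_bytes:
            # The read lies entirely inside the header region: serve it from the cache.
            header = self._headers.get(key)
            if header is None:
                header = self._fetch(key, 0, self._header_bytes)
                self._headers[key] = header
            return header[offset:end]
        # Data reads bypass the cache.
        return self._fetch(key, offset, length)

    def _fetch(self, key: str, offset: int, length: int) -> bytes:
        self.remote_calls += 1
        return self._fetch_range(key, offset, length)


# Stand-in backend for illustration: objects held in a dict instead of S3.
store = {"a.nc": bytes(range(256)) * 1024}


def fake_fetch(key: str, offset: int, length: int) -> bytes:
    return store[key][offset:offset + length]
```

With this shape, repeated header reads for the same file cost one backend round trip, which targets the slow first request on datasets made of many small files.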
