What kind of tests did you run? #10

Open
clstoulouse opened this issue Feb 2, 2018 · 3 comments

Comments

@clstoulouse

clstoulouse commented Feb 2, 2018

Hello,

I would like to know some details about this interface between Thredds and S3:

  • Have you run tests only with Amazon S3 servers, or also with other S3 providers?
  • Which Thredds version have you tested?
  • Does this implementation run in production 24/7 without any issues, in particular with respect to memory usage?
  • On your S3 bucket, have you run tests with NetCDF files of 1 MB, 100 MB, 1 GB, 5 GB and more? How many NetCDF files per dataset did you manage: 1 file, 10 files, 100 files, 1K files, 100K files, more?
  • Do you handle "aggregation"? If there are several NetCDF files in the same folder, how does Thredds aggregate them and present them as a single dataset?

Thanks in advance for your reply.

@AlexHilson
Contributor

Hi, thanks for the interest.

  1. We've only tested with Amazon S3. We're very interested in exploring federated data storage, but right now all of our use-cases fit within AWS.

  2. pom.xml specifies TDS version 4.6.10. I'm pretty sure this is the only version we tried.

  3. This does not run in production, it's just a proof of concept.

  4. We've only done exploratory performance testing, and the results were not great. The first OPeNDAP request for a particular file is very slow. Our real datasets are made up of many small (<100 MB) NetCDF files, so this was a big problem. I believe that this is a bug rather than an inherent problem: my suspicion is that Thredds is requesting too much information up front. For this library to be useful, this issue would have to be properly explored.

I believe that the library would be most useful when accessing netcdf files with a lot of data and a small number of variables.

  5. We have not explored aggregation. I would really love to try this, but I don't have the Thredds expertise to really understand how to set this up.
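
One way the suspicion in point 4 could be checked is to record which byte ranges are actually read when a file is opened. The following is only a minimal sketch in Python, not part of this library: `ReadLogger` is a hypothetical helper, and an in-memory buffer stands in for the stream that an S3 file handler would expose.

```python
import io


class ReadLogger:
    """Wraps a seekable binary stream and records every read as (offset, length)."""

    def __init__(self, raw):
        self._raw = raw
        self.reads = []  # list of (offset, length) tuples, in call order

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def read(self, size=-1):
        offset = self._raw.tell()
        data = self._raw.read(size)
        self.reads.append((offset, len(data)))
        return data

    def total_bytes(self):
        # Total number of bytes actually returned across all reads
        return sum(length for _, length in self.reads)


# Stand-in for a 1 MiB object (assumption: a real test would wrap the S3-backed stream)
payload = io.BytesIO(b"CDF\x01" + bytes(1024 * 1024 - 4))
logged = ReadLogger(payload)

logged.read(4)           # e.g. the magic number at the start of the file
logged.seek(512 * 1024)  # jump somewhere into the data section
logged.read(1024)

print(logged.reads)          # [(0, 4), (524288, 1024)]
print(logged.total_bytes())  # 1028
```

Comparing `total_bytes()` for the first request against the size of the header would indicate whether much more than the metadata is being pulled up front.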

For reference, our follow-up experiment to this work was https://github.com/met-office-lab/pysssix, and our current approach is to use https://github.com/kahing/goofys.

For our current use cases being able to access files on disk is all we really need. I think that this approach would work for running TDS, i.e. just run a regular server but the files on disk are accessed via a FUSE mount rather than by trying to replace the file access handler inside TDS.

We're always interested to hear about other people's experience with this kind of stuff, and your Motu work looks interesting. If you have an interest in exploring any of these topics further, perhaps we could collaborate.

Thanks,
Alex.

@clstoulouse
Author

Hi, thank you very much for your detailed reply.
On our side, we have also tried s3fs-fuse, but it is too slow; giving TDS direct access to a file system is far faster.
We are also looking into better ways of accessing S3 storage to manage large datasets made up of several files (GB scale).

Best regards,
Sylvain.

@AlexHilson
Contributor

We did try s3fs-fuse as well, but found it unreliable at the time (more than a year ago). Goofys has so far worked out a lot better for us.

Ultimately these approaches will always be slower than direct disk access. At some point we intend to look at caching (both on local disk and in Redis). I'm particularly interested in whether a specialised cache for NetCDF headers only would be helpful.
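
To make the header-only cache idea concrete, here is a rough sketch in Python rather than anything that has been built. It assumes the header fits in the first `header_bytes` of each object, which is plausible for the classic NetCDF format but not guaranteed for NetCDF-4/HDF5 files. `HeaderCache`, `fake_fetch` and the `fetch(key, start, length)` signature are hypothetical, and a plain dict stands in for local disk or Redis.

```python
from typing import Callable, Dict

# fetch(key, start, length) -> bytes; hypothetical stand-in for an S3 ranged GET
RangeFetcher = Callable[[str, int, int], bytes]


class HeaderCache:
    """Serves reads that fall inside the header from a cache, passes others through."""

    def __init__(self, fetch: RangeFetcher, header_bytes: int = 64 * 1024):
        self._fetch = fetch
        self._header_bytes = header_bytes
        self._headers: Dict[str, bytes] = {}  # assumption: dict in place of disk/Redis

    def read(self, key: str, start: int, length: int) -> bytes:
        end = start + length
        if end <= self._header_bytes:
            header = self._headers.get(key)
            if header is None:
                # First touch: fetch the whole header region in one request
                header = self._fetch(key, 0, self._header_bytes)
                self._headers[key] = header
            return header[start:end]
        # Reads that extend beyond the header region go straight to the backend
        return self._fetch(key, start, length)


# Fake backend that records each ranged request it receives
objects = {"a.nc": b"CDF\x01" + bytes(200_000)}
calls = []


def fake_fetch(key, start, length):
    calls.append((key, start, length))
    return objects[key][start:start + length]


cache = HeaderCache(fake_fetch, header_bytes=1024)
cache.read("a.nc", 0, 4)      # triggers one fetch of the header region
cache.read("a.nc", 100, 50)   # served from the cache, no backend call
cache.read("a.nc", 5000, 10)  # outside the header, passed through

print(calls)  # [('a.nc', 0, 1024), ('a.nc', 5000, 10)]
```

Whether this helps in practice depends on how many of the initial requests actually fall inside the header region, which is the open question above.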
