Test and benchmark greenhost object storage #247

Closed
gabelula opened this issue Jan 31, 2019 · 4 comments
Assignees
Labels
ooni/devops Issues related to https://github.com/ooni/sysadmin priority/medium

Comments

@gabelula

On the Greenhost (GH) eclipsis cloud we currently have the majority of our data stored in Hong Kong, as there is more availability of larger disks in that location.
The actual disk apparently runs on CEPH.

Since then GH has deployed an S3-like, CEPH-based object store in AMS. Using it would put the data closer to where it needs to be consumed, and it may offer us better performance.

We should evaluate that solution and understand whether it is suitable for our use case.

@hellais
Member

hellais commented Apr 8, 2019

@darkk should commit the object store credentials to the vault

@hellais hellais transferred this issue from ooni/sysadmin Jan 13, 2020
@hellais hellais added ooni/devops Issues related to https://github.com/ooni/sysadmin priority/medium labels Jan 13, 2020
@hellais
Member

hellais commented Jan 23, 2020

cc @FedericoCeratto

@hellais hellais added this to the Sprint 6 - Dumbo Octopus milestone Jan 23, 2020
@hellais hellais self-assigned this Feb 17, 2020
@hellais hellais changed the title Benchmark GH CEPH HDD S3 in AMS to understand its suitability Test and benchmark greenhost object storage Mar 16, 2020
@hellais
Member

hellais commented Mar 16, 2020

This came up again in the OONI Pipeline design work.

@hellais
Member

hellais commented Apr 14, 2020

Work related to this was done during the last sprint by @FedericoCeratto.

Some notes on this were collected here: https://pad.riseup.net/p/ooni-s3-objstore-canning-eiyeeWaiwibe2Pai3Ash-keep

Upload times from AMS to local objstore:
Mbytes  seconds
     1      1.7  <-- the setup time dominates the transfer time
    10      1.8
   100      3.3
  1000     33
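
To make the note about setup time concrete, the effective throughput of each row can be computed directly from the figures in the table. This is only a small arithmetic sketch; the function and variable names are illustrative and not part of any OONI code.

```python
# Upload timings copied from the table above: (size in MB, seconds).
timings = [(1, 1.7), (10, 1.8), (100, 3.3), (1000, 33.0)]


def effective_throughput(size_mb, seconds):
    """Return the observed throughput in MB/s for a single upload."""
    return size_mb / seconds


for size_mb, seconds in timings:
    rate = effective_throughput(size_mb, seconds)
    print(f"{size_mb:>5} MB in {seconds:>4} s -> {rate:5.1f} MB/s")
```

The two largest uploads both work out to roughly 30 MB/s, while the 1 MB upload stays below 1 MB/s, which is consistent with a fixed per-upload overhead dominating small transfers.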


Requirements:
- compress and upload to s3/objstore at intervals - not every msmt
- serve json bodies from somewhere before they are uploaded to s3/obj
- compression
- support multiple fastpath hosts
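
The first requirement (upload in batches rather than for every msmt) could be sketched roughly as follows. This is a minimal illustration, assuming measurements are buffered as individual `*.json` files in a local directory; `archive_buffer`, the archive name and the `upload` callable are all hypothetical, and gzip'd tar is used only because it is in the Python standard library, not because it is the chosen format.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_buffer(buffer_dir, upload, archive_name="batch.tar.gz"):
    """Pack every *.json file in buffer_dir into one gzip'd tarball,
    pass the tarball path to upload(), then delete the buffered files.

    upload is a hypothetical callable (e.g. a thin wrapper around an
    S3/objstore client). Files are deleted only if it does not raise.
    Returns the number of files archived.
    """
    files = sorted(Path(buffer_dir).glob("*.json"))
    if not files:
        return 0
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / archive_name
        with tarfile.open(str(archive_path), "w:gz") as tar:
            for f in files:
                tar.add(str(f), arcname=f.name)
        upload(archive_path)
    # Only remove the files that were listed before archiving, so msmts
    # that arrived in the meantime stay in the buffer for the next batch.
    for f in files:
        f.unlink()
    return len(files)
```

Until a batch is uploaded the JSON files remain readable in the buffer directory, so they can still be served from there, which matches the second requirement.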

Implementation:
- options:
    - incrementally append+compress a local "live" file that is also served to the API?
    - simply use json files in a directory as a buffer and then delete them after the archive upload
- format options:
    - reuse canning code in fastpath
    - new format [1]
- how to tell the API what source to use? S3 with fallback to fastpath nginx?
    - look up by rid + input + optional number
    - try S3 first then fall back to fastpath hosts
    - guess msmt time based on report id and reverse the lookup order as a small speed optimization?
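
The "try S3 first, then fall back to fastpath hosts" idea can be sketched as below. This is an assumption-laden illustration, not the API's actual code: `fetch_measurement`, `lookup_order` and the `probably_recent` flag are hypothetical names, each source is modeled as a callable that returns the body or raises `KeyError`, and how recency would be guessed from the report id is left out.

```python
def lookup_order(s3_fetch, fastpath_fetchers, probably_recent=False):
    """Build the ordered list of (name, fetch) sources to try.

    By default S3 is tried first. If the msmt is probably recent (and so
    likely not uploaded yet), the fastpath hosts are tried first instead.
    """
    s3 = [("s3", s3_fetch)]
    hosts = [(f"fastpath-{i}", f) for i, f in enumerate(fastpath_fetchers)]
    return hosts + s3 if probably_recent else s3 + hosts


def fetch_measurement(key, sources):
    """Return (source_name, body) from the first source that has key.

    key stands for the lookup tuple (rid, input, optional number).
    Each fetch callable must raise KeyError when the msmt is missing.
    """
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except KeyError:
            continue
    raise KeyError(key)
```

Keeping the ordering separate from the fetching means the speed optimization only changes `lookup_order`, while the fallback logic stays the same.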

[1]
https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/
https://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression

https://github.com/gtoubassi/femtozip/wiki/How-FemtoZip-Works-%28In-Painful-Detail%29
https://github.com/gtoubassi/femtozip

https://github.com/facebook/zstd/issues/97
https://engineering.linkedin.com/shared-dictionary-compression-http-linkedin

https://lists.w3.org/Archives/Public/ietf-http-wg/2008JulSep/att-0441/Shared_Dictionary_Compression_over_HTTP.pdf

TODO:
    - test shared dictionary compression
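
A quick way to try the shared-dictionary idea without extra dependencies is the preset dictionary support in deflate, as described in the Cloudflare post linked under [1]. The sketch below uses Python's standard `zlib` module rather than zstd or brotli, so it only illustrates the principle; the sample JSON fields and values are made up for the example.

```python
import json
import zlib


def compress_with_dict(data, zdict):
    """Deflate data using zdict as a preset (shared) dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob, zdict):
    """Inflate blob; the same zdict used for compression is required."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# A "historical" msmt acts as the dictionary (illustrative content only).
shared_dict = json.dumps(
    {"report_id": "EXAMPLE", "test_name": "web_connectivity",
     "probe_cc": "IT", "input": "https://example.org/"}
).encode()

# A new msmt with the same structure but different values.
msmt = json.dumps(
    {"report_id": "EXAMPLE2", "test_name": "web_connectivity",
     "probe_cc": "NL", "input": "https://example.com/"}
).encode()

plain = zlib.compress(msmt, 9)
with_dict = compress_with_dict(msmt, shared_dict)
print(len(msmt), len(plain), len(with_dict))
```

The benefit is largest for small, similarly structured documents, since the dictionary supplies the repeated keys that a single short document cannot provide on its own.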
    
SDCH --> brotli VS zstd

https://github.com/facebook/zstd/issues/412


### zstd tests

Use case 1:
    Compress incoming msmt on the "new collector" host for temporary local storage

Use case 2:
    Compress incoming msmt on the "new collector" host and upload them in small batches to S3 or objstore for temporary backup

Use case 3:
    Create a zstd dict using historical msmts. Use it to compress future msmts, upload them into S3, access them from the API
    - store msmt position in the compressed file in the fastpath database table
    - zstd supports an "external" dict
    - zstd supports access/decompression by range
    - the access pattern of the API is by row: columnar data formats are not beneficial
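
The "store msmt position in the compressed file" idea from use case 3 can be illustrated as follows. This is a sketch under stated assumptions: `zlib` from the standard library stands in for zstd, each msmt is compressed independently so it can be decompressed on its own, and the in-memory `index` list stands in for the offsets that would be stored in the fastpath database table. With an object store, the byte slice would be fetched with a ranged read instead.

```python
import zlib


def build_archive(records):
    """Compress each record independently and concatenate the results.

    Returns (archive_bytes, index) where index[i] is the
    (offset, length) of record i inside archive_bytes.
    """
    chunks, index, offset = [], [], 0
    for rec in records:
        blob = zlib.compress(rec)
        index.append((offset, len(blob)))
        chunks.append(blob)
        offset += len(blob)
    return b"".join(chunks), index


def read_record(archive, offset, length):
    """Decompress one record given only its byte range in the archive."""
    return zlib.decompress(archive[offset:offset + length])


records = [b'{"msmt": 1}', b'{"msmt": 2}', b'{"msmt": 3}']
archive, index = build_archive(records)
offset, length = index[1]
print(read_record(archive, offset, length))
```

Compressing records independently costs some compression ratio, which is where a shared dictionary trained on historical msmts would help.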

Initial benchmarks on data from raw cans showed compressed sizes down to 10% of the original size
Scan window size is very important

TODO: compare serialization formats that use zstd natively

I am going to close this issue and suggest we do future work as part of: #228

@hellais hellais closed this as completed Apr 14, 2020
FedericoCeratto pushed a commit that referenced this issue Mar 16, 2023