[Encrypted FS] Use pre-allocated free list instead of calloc/free for file nodes #1714

Closed
kailun-qin opened this issue Jan 11, 2024 · 3 comments · Fixed by #1763

Comments

@kailun-qin
Contributor

Description of the feature

Instead of using calloc/free directly for the Encrypted FS file nodes (e.g., MHT nodes, data nodes), we can introduce a free-list cache for them (i.e., memory chunks exclusively reserved for encrypted-files usage), as a perf optimization.

PoC (for testing and illustration purposes ONLY):
master...kailun-qin:gramine:hack-file-node-free-list

Note: We should fall back to regular malloc/calloc when running out of free nodes (instead of hitting BUG() as in the PoC).
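For illustration, a minimal sketch of such a free list in C, assuming a fixed node size and pool capacity; the names (node_pool_init, node_alloc, node_free) and constants are hypothetical, not Gramine's actual API, and locking is omitted:

/* Hypothetical free-list cache for file nodes: chunks come from a
 * pre-allocated pool and fall back to calloc() when exhausted. */
#include <stdlib.h>
#include <string.h>

#define NODE_SIZE  4096  /* assumed size of one MHT/data node */
#define POOL_NODES 1024  /* assumed pool capacity */

struct free_node {
    struct free_node* next;
};

static char g_pool[POOL_NODES * NODE_SIZE] __attribute__((aligned(64)));
static struct free_node* g_free_list = NULL;

/* one-time init: chain all pool chunks into the free list */
static void node_pool_init(void) {
    for (size_t i = 0; i < POOL_NODES; i++) {
        struct free_node* node = (struct free_node*)&g_pool[i * NODE_SIZE];
        node->next  = g_free_list;
        g_free_list = node;
    }
}

static void* node_alloc(void) {
    if (g_free_list) {
        struct free_node* node = g_free_list;
        g_free_list = node->next;
        memset(node, 0, NODE_SIZE); /* mimic calloc() semantics */
        return node;
    }
    /* pool exhausted: fall back to the general-purpose allocator */
    return calloc(1, NODE_SIZE);
}

static void node_free(void* ptr) {
    char* p = (char*)ptr;
    if (p >= g_pool && p < g_pool + sizeof(g_pool)) {
        struct free_node* node = (struct free_node*)ptr;
        node->next  = g_free_list;
        g_free_list = node;
    } else {
        free(ptr); /* came from the calloc() fallback path */
    }
}

The win on the hot path is that node_alloc/node_free become a couple of pointer assignments, avoiding the general-purpose allocator entirely.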

TO BE DISCUSSED:
Should we expose the size of the free list as a new Gramine manifest option? For example, if the manifest sets it to 8GB, then these 8GB would be set aside as a memory pool for encrypted files.

Why should Gramine implement it?

To resolve perf bottlenecks seen on some workloads (e.g., RocksDB).

Specifically, RocksDB uses a leveled-compaction design. Compaction operates on a key range across two levels: it opens the files that contain the range, iterates over them sequentially, and merges them into new sorted files, omitting stale keys.

When this compaction process ran, RocksDB throughput dropped by ~10x under Gramine's encrypted FS.

The ongoing optimizations for the encrypted FS (#1599, #1681) don't help much in this case. The perf analysis showed that one of the dominant contributors was the memory allocation of file data nodes:

new_file_data_node = calloc(1, sizeof(*new_file_data_node));

Gramine commit hash
1f72aaf

@dimakuv
Contributor

dimakuv commented Jan 11, 2024

I reviewed this proposal privately before, and I'm fine with it. I think it's a simple and efficient optimization.

@kailun-qin Would it be possible to show some numbers on the RocksDB workload?

Alternatively, can we write a micro-benchmark that shows the benefit of this optimization? I guess a single large Protected File that is constantly read from and written to is enough.
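For illustration, such a micro-benchmark could look roughly like this; the /encrypted/bench.dat path (assumed to be marked as encrypted in the manifest), file size, chunk size, and round count are all made-up values:

/* Sketch: repeatedly rewrite and re-read one large encrypted file
 * and report the elapsed time. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE  (256 * 1024 * 1024) /* 256 MiB */
#define CHUNK_SIZE (64 * 1024)
#define ROUNDS     10

int main(void) {
    char* buf = malloc(CHUNK_SIZE);
    if (!buf)
        return 1;
    memset(buf, 0xab, CHUNK_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int r = 0; r < ROUNDS; r++) {
        /* rewrite and re-read the whole file every round */
        int fd = open("/encrypted/bench.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }
        for (size_t off = 0; off < FILE_SIZE; off += CHUNK_SIZE)
            if (write(fd, buf, CHUNK_SIZE) != CHUNK_SIZE) { perror("write"); return 1; }
        lseek(fd, 0, SEEK_SET);
        for (size_t off = 0; off < FILE_SIZE; off += CHUNK_SIZE)
            if (read(fd, buf, CHUNK_SIZE) != CHUNK_SIZE) { perror("read"); return 1; }
        close(fd);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d rounds of %d MiB write+read took %.2f s\n",
           ROUNDS, FILE_SIZE / (1024 * 1024), secs);
    free(buf);
    return 0;
}

Running this once with the free list enabled and once without should isolate the file-node allocation cost from RocksDB's own behavior.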

Should we expose the size of the free list as a new Gramine manifest option? For example, if the manifest sets it to 8GB, then these 8GB would be set aside as a memory pool for encrypted files.

Yes, I think it should be a new manifest option.
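For concreteness, such an option could look something like the line below; the name and value syntax are purely hypothetical, not a decided design:

fs.encrypted_files_free_list_size = "8G"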

@dimakuv
Contributor

dimakuv commented Jan 11, 2024

Since we also have #1712 and #1576, I feel like we need to introduce a generic manifest syntax for rlimits:

loader.rlimit.[NAME] = "[VALUE]"
or
loader.rlimit.[NAME] = { value = "[VALUE]" }
or
loader.rlimit.[NAME] = { passthrough = true }

See also https://linux.die.net/man/2/setrlimit.

Then we can add rlimits in a generic fashion, and even allow applications to modify these rlimits at runtime.

For this particular issue, we could have a new non-standard rlimit: loader.rlimit.RLIMIT_ENCRYPTED_FILES_FREE_LIST or something like this.
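For the runtime-modification part, a minimal sketch of how an application adjusts an rlimit through the standard getrlimit/setrlimit interface; the stock RLIMIT_NOFILE is used as a stand-in, since the proposed RLIMIT_ENCRYPTED_FILES_FREE_LIST is non-standard and does not exist:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
    printf("soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max; /* raise the soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }
    return 0;
}

A manifest-backed Gramine rlimit would presumably be queried and adjusted by the application through this same interface.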

@kailun-qin
Contributor Author

kailun-qin commented Jan 11, 2024

Would it be possible to show some numbers on the RocksDB workload?

We tested the effectiveness of our proposed approach by running the RocksDB bench tool created by @cloudnoize (thanks!).

When the compaction happened, w/o the PoC optimization:

...
2024-01-11 14:53:17.646 Multi average duration for 50 puts: 3573 microseconds, estimated db size 9053804500Bytes
2024-01-11 14:53:17.908 Multi average duration for 50 puts: 3612 microseconds, estimated db size 9054829500Bytes
2024-01-11 14:53:18.235 Multi average duration for 50 puts: 4874 microseconds, estimated db size 9055854500Bytes
2024-01-11 14:53:18.564 Multi average duration for 50 puts: 4938 microseconds, estimated db size 9056879500Bytes
2024-01-11 14:53:18.890 Multi average duration for 50 puts: 4851 microseconds, estimated db size 9057904500Bytes
2024-01-11 14:53:19.214 Multi average duration for 50 puts: 4756 microseconds, estimated db size 9058929500Bytes
2024-01-11 14:53:19.626 Multi average duration for 50 puts: 6536 microseconds, estimated db size 9059954500Bytes
2024-01-11 14:53:20.033 Multi average duration for 50 puts: 6486 microseconds, estimated db size 9060979500Bytes
2024-01-11 14:53:20.438 Multi average duration for 50 puts: 6448 microseconds, estimated db size 9062004500Bytes
2024-01-11 14:53:20.843 Multi average duration for 50 puts: 6451 microseconds, estimated db size 9063029500Bytes
...

And w/ the PoC optimization:

...
2024-01-11 14:34:07.563 Multi average duration for 50 puts: 133 microseconds, estimated db size 9053804500Bytes
2024-01-11 14:34:07.652 Multi average duration for 50 puts: 115 microseconds, estimated db size 9054829500Bytes
2024-01-11 14:34:07.740 Multi average duration for 50 puts: 100 microseconds, estimated db size 9055854500Bytes
2024-01-11 14:34:07.829 Multi average duration for 50 puts: 110 microseconds, estimated db size 9056879500Bytes
2024-01-11 14:34:07.919 Multi average duration for 50 puts: 122 microseconds, estimated db size 9057904500Bytes
2024-01-11 14:34:08.008 Multi average duration for 50 puts: 105 microseconds, estimated db size 9058929500Bytes
2024-01-11 14:34:08.096 Multi average duration for 50 puts: 100 microseconds, estimated db size 9059954500Bytes
2024-01-11 14:34:08.185 Multi average duration for 50 puts: 119 microseconds, estimated db size 9060979500Bytes
2024-01-11 14:34:08.273 Multi average duration for 50 puts: 98 microseconds, estimated db size 9062004500Bytes
2024-01-11 14:34:08.362 Multi average duration for 50 puts: 120 microseconds, estimated db size 9063029500Bytes
...

Note:

  1. The bench tool was run with Gramine-direct (release build) by default; Gramine-SGX showed similar results.
  2. The PoC optimization also showed generally better perf (>2x reduction in the multi average duration for 50 puts) before, or in the absence of, any compaction.
