
[common] Introduce pre-allocated node free list for encrypted files #1723

Closed

Conversation

Contributor

@kailun-qin kailun-qin commented Jan 18, 2024

Description of the changes

Allocation of the Encrypted FS file nodes (e.g., MHT nodes, data nodes) was found to be a performance bottleneck for certain workloads (e.g., RocksDB compaction).

Therefore, instead of using alloc/free directly on file nodes, this commit introduces a free list of pre-allocated nodes reserved specifically for this purpose, as a performance optimization that trades space for time. A new manifest option has been added to allow users to set and adjust the free list limit based on their applications' needs.
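
For illustration, below is a minimal, self-contained sketch of the approach. It is a hypothetical simplification, not the committed Gramine code: the pf_node_item/from_free_list names are taken from the review below, a pthread spinlock and a plain singly-linked list stand in for Gramine's internal spinlock and list macros, and the node size is just a placeholder.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder for the real file_node_t (roughly 8 KB, per the discussion below). */
typedef struct { unsigned char opaque[8192]; } file_node_t;

struct pf_node_item {
    struct pf_node_item* next; /* free-list link (the real code uses Gramine's list macros) */
    bool from_free_list;       /* distinguishes pre-allocated items from calloc'ed fallbacks */
    file_node_t file_node;
};

static struct pf_node_item* g_pf_node_free_list = NULL;
static pthread_spinlock_t g_pf_node_free_list_lock;

/* Pre-allocate `limit` items once at startup (cf. pf_init_node_free_list()). */
static int init_node_free_list(size_t limit) {
    if (pthread_spin_init(&g_pf_node_free_list_lock, PTHREAD_PROCESS_PRIVATE))
        return -1;
    for (size_t i = 0; i < limit; i++) {
        struct pf_node_item* item = calloc(1, sizeof(*item));
        if (!item)
            return -1; /* the real code logs an error suggesting a lower manifest limit */
        item->from_free_list = true;
        item->next = g_pf_node_free_list;
        g_pf_node_free_list = item;
    }
    return 0;
}

/* Pop a pre-allocated item if any are left; otherwise fall back to calloc(). */
static struct pf_node_item* allocate_node_item(void) {
    pthread_spin_lock(&g_pf_node_free_list_lock);
    struct pf_node_item* item = g_pf_node_free_list;
    if (item)
        g_pf_node_free_list = item->next;
    pthread_spin_unlock(&g_pf_node_free_list_lock);
    return item ? item : calloc(1, sizeof(struct pf_node_item));
}

/* Push pre-allocated items back onto the list; free() only the calloc'ed fallbacks. */
static void free_node_item(struct pf_node_item* item) {
    if (!item->from_free_list) {
        free(item);
        return;
    }
    pthread_spin_lock(&g_pf_node_free_list_lock);
    item->next = g_pf_node_free_list;
    g_pf_node_free_list = item;
    pthread_spin_unlock(&g_pf_node_free_list_lock);
}

Freeing a pre-allocated item only returns it to the list, so that memory stays reserved for encrypted-file nodes for the lifetime of the process; that is the space-for-time trade-off mentioned above.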

Fixes #1714.

How to test this PR?

CI and RocksDB.



Allocation of the Encrypted FS file nodes (e.g., MHT nodes, data nodes)
was found to be a performance bottleneck for certain workloads (e.g.,
RocksDB compaction).

Therefore, instead of using alloc/free directly on file nodes, this
commit introduces a free list of pre-allocacted nodes reserved
specifically for their usage, as a performance optimization that trades
space for time. A new manifest option has been added to allow users to
set and adjust the free list limit based on their applications.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
@kailun-qin kailun-qin force-pushed the kailun-qin/pfs-free-list branch 3 times, most recently from 09ec9cc to 60ef632 Compare January 18, 2024 08:16
Contributor

@dimakuv dimakuv left a comment


Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: all files reviewed, 26 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @kailun-qin)

a discussion (no related file):
Can we add a new section Optimizations for file I/O into https://gramine.readthedocs.io/en/stable/performance.html?


a discussion (no related file):
Great PR!

Can I ask for some benchmark results? With different limits (say with 0, 50, and whatever is a good limit for your benchmark).



common/src/protected_files/protected_files.h line 313 at r1 (raw file):

 * \param limit_node_free_list  Limit number of nodes for the PF free list.
 *
 * \returns PF status.

We can add a note here that calling this function is optional -- if it's not called then no free list of file nodes will be created (i.e., this optimization will be disabled).

This will allow us to revert the changes in pf_util.c.


common/src/protected_files/protected_files.h line 315 at r1 (raw file):

 * \returns PF status.
 */

Remove empty line


common/src/protected_files/protected_files.c line 15 at r1 (raw file):

#ifdef IN_LIBOS
#include "libos_thread.h"

Why is this required?


common/src/protected_files/protected_files.c line 509 at r1 (raw file):

pf_status_t pf_init_node_free_list(size_t limit_node_free_list) {
    assert(!g_initialized);
    struct pf_node_item* item;

Please move this declaration inside the loop.


common/src/protected_files/protected_files.c line 515 at r1 (raw file):

        if (item == NULL) {
            log_error("Not enough memory for the encrypted files node free list. Please consider "
                      "dicreasing the limit in the manifest.");

Please add the exact name of the manifest option.


common/src/protected_files/protected_files.c line 527 at r1 (raw file):

}

static struct pf_node_item* ipf_allocate_node(void) {

Shouldn't it be ipf_allocate_node_item() ?


common/src/protected_files/protected_files.c line 535 at r1 (raw file):

        if (FIRST_TIME()) {
            log_warning("No free file nodes available; using malloc as fallbacks. "

fallback


common/src/protected_files/protected_files.c line 535 at r1 (raw file):

        if (FIRST_TIME()) {
            log_warning("No free file nodes available; using malloc as fallbacks. "

available -> left


common/src/protected_files/protected_files.c line 537 at r1 (raw file):

            log_warning("No free file nodes available; using malloc as fallbacks. "
                        "Please consider adjusting the limit of the encrypted files node free list "
                        "in the manifest.");

Please add the name of the manifest option here


common/src/protected_files/protected_files.c line 537 at r1 (raw file):

            log_warning("No free file nodes available; using malloc as fallbacks. "
                        "Please consider adjusting the limit of the encrypted files node free list "
                        "in the manifest.");

Won't this warning be printed even when the manifest doesn't contain this option? It will be weird and unexpected for users. Please add a helper variable that remembers whether the user requested this list at all.


common/src/protected_files/protected_files.c line 542 at r1 (raw file):

        item = calloc(1, sizeof(struct pf_node_item));
        if (item == NULL)
            return NULL;

Why not directly return item;

And you won't need the else branch -- the below code will be guaranteed to execute only if the free-list exists.

(Actually, you can even do return calloc(...) for brevity)


common/src/protected_files/protected_files.c line 554 at r1 (raw file):

}

static void ipf_free_node(struct pf_node_item* item) {

Shouldn't it be ipf_free_node_item() ? Otherwise it reads as if you want to free item->file_node


common/src/protected_files/protected_files.c line 556 at r1 (raw file):

static void ipf_free_node(struct pf_node_item* item) {
    if (!item->from_free_list) {
        free(item);

I don't like the else cascades where we can avoid them.

Can we rewrite to:

    if (!item->from_free_list) {
        free(item);
        return;
    }

    spinlock_lock(&g_pf_node_free_list_lock);
    ...

common/src/protected_files/protected_files.c line 567 at r1 (raw file):

}

static void ipf_free_node_from_data(file_node_t* data_ptr) {

Can we add assert(data_ptr)?

I'm scared when I see pointer arithmetic without additional checks. I know nothing can really go wrong, but let's add an assert.


common/src/protected_files/protected_files.c line 567 at r1 (raw file):

}

static void ipf_free_node_from_data(file_node_t* data_ptr) {

Shouldn't it be ipf_free_node_item_from_data() ? Otherwise it reads as if you want to free item->file_node


common/src/protected_files/protected_files.c line 618 at r1 (raw file):

        return NULL;
    }
    file_node_t* new_file_mht_node = &(new_file_mht_node_item->file_node);

There's no need for the parentheses. But I don't block on this, as I understand that for some readers this may be hard to figure out.


Documentation/manifest-syntax.rst line 1024 at r1 (raw file):

   such a "doubly-used" key, you must apply a KDF.

Encrypted files free list node limit

I would add a word "optimization" somewhere here.

Maybe Encrypted files optimization: free list node limit


Documentation/manifest-syntax.rst line 1025 at r1 (raw file):

Encrypted files free list node limit

No need for an empty line -- I think RST will even render incorrectly in such a case.


Documentation/manifest-syntax.rst line 1026 at r1 (raw file):

Encrypted files free list node limit

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Too many ^ symbols (the underline must end right below the title)


Documentation/manifest-syntax.rst line 1030 at r1 (raw file):

::

    fs.limit.encrypted_files_node_free_list = [NUM]

limit -> limits


Documentation/manifest-syntax.rst line 1030 at r1 (raw file):

::

    fs.limit.encrypted_files_node_free_list = [NUM]

Isn't a different order of words more readable? encrypted_files_free_list_nodes


Documentation/manifest-syntax.rst line 1034 at r1 (raw file):

This syntax specifies a limit on the number of nodes in the pre-allocated
encrypted files free list, as a performance optimization that trades space for

space -> memory?


libos/src/fs/libos_fs_encrypted.c line 288 at r1 (raw file):

    if (limit_node_free_list_int64 < 0) {
        log_error("'limit.encrypted_files_node_free_list' = %ld is negative",
                  limit_node_free_list_int64);

I don't see a need to output the exact value -- it's obvious what the user should look at and fix.


libos/src/fs/libos_fs_encrypted.c line 294 at r1 (raw file):

    ret = pf_init_node_free_list((size_t)limit_node_free_list_int64);
    if (ret < 0)
        return ret;

I would add a log error message. UPDATE: you have it inside the function itself, unblocking.


tools/sgx/common/pf_util.c line 223 at r1 (raw file):

 * performance-sensitive so we set the limit of the node free list to 0 which by default disables
 * the node free list optimization. */
#define LIMIT_ENCRYPTED_FILES_NODE_FREE_LIST 0

Why not just revert all these changes? See my other comment.

If you want to leave a hint that we don't care about performance of PF utils, then we can put it in the commit message.

… files

!TODO: use below commit msg:

[common] Introduce pre-allocacted node free list for encrypted files

Allocation of the Encrypted FS file nodes (e.g., MHT nodes, data nodes)
was found to be a performance bottleneck for certain workloads (e.g.,
RocksDB compaction).

Therefore, instead of using alloc/free directly on file nodes, this
commit introduces a free list of pre-allocacted nodes reserved
specifically for their usage, as a performance optimization that trades
memory for time. A new manifest option has been added to allow users to
set and adjust the free list limit based on their applications.

Note that the encrypted files related tools (`gramine-sgx-pf-crypt`,
`gramine-sgx-pf-tamper`) are not performance-sensitive so we do not
enable this node free list optimization for them. */

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Contributor Author

@kailun-qin kailun-qin left a comment


Reviewed 6 of 6 files at r2, all commit messages.
Reviewable status: all files reviewed, 25 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):
Bench setup: RocksDB bench tool, release build, gramine-direct, estimated 2GB DB size

Limit nodes                              | 0 (opt disabled) | 128    | 256    | 500    | 1024   | 1280 (no fallback)
Multi average duration for 50 puts (µs)  | 286.10           | 211.22 | 188.72 | 170.85 | 150.88 | 150.48

Note:
The compaction process was not triggered in the above setup. For the bench results on RocksDB compaction, pls refer to #1714 (comment).


a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can we add a new section Optimizations for file I/O into https://gramine.readthedocs.io/en/stable/performance.html?

Done.



common/src/protected_files/protected_files.h line 313 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We can add a note here that calling this function is optional -- if it's not called then no free list of file nodes will be created (i.e., this optimization will be disabled).

This will allow to revert changes in pf_util.c.

Done.


common/src/protected_files/protected_files.h line 315 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Remove empty line

Done.


common/src/protected_files/protected_files.c line 15 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why is this required?

Explained in the comments.


common/src/protected_files/protected_files.c line 509 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please move this declaration inside the loop.

Done.


common/src/protected_files/protected_files.c line 515 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add the exact name of the manifest option.

Done.


common/src/protected_files/protected_files.c line 527 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Shouldn't it be ipf_allocate_node_item() ?

Done.


common/src/protected_files/protected_files.c line 535 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

fallback

Done.


common/src/protected_files/protected_files.c line 535 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

available -> left

Done.


common/src/protected_files/protected_files.c line 537 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add the name of the manifest option here

Done.


common/src/protected_files/protected_files.c line 537 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Won't this warning be printed even when manifest doesn't contain this manifest option? It will be weird and unexpected for users. Please add a helper variable that memorizes whether the user requested this list at all.

Done.


common/src/protected_files/protected_files.c line 542 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not directly return item;

And you won't need the else branch -- the below code will be guaranteed to execute only if the free-list exists.

(Actually, you can even do return calloc(...) for brevity)

Done.


common/src/protected_files/protected_files.c line 554 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Shouldn't it be ipf_free_node_item() ? Otherwise it reads as if you want to free item->file_node

Done.


common/src/protected_files/protected_files.c line 556 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't like the else cascades where we can avoid them.

Can we rewrite to:

    if (!item->from_free_list) {
        free(item);
        return;

    spinlock_lock(&g_pf_node_free_list_lock);
    ...

Done.


common/src/protected_files/protected_files.c line 567 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can we add assert(data_ptr)?

I'm scared when I see pointer arithmetic without additional checks. I know nothing can really go wrong, but let's add an assert.

Done.


common/src/protected_files/protected_files.c line 567 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Shouldn't it be ipf_free_node_item_from_data() ? Otherwise it reads as if you want to free item->file_node

Done.


Documentation/manifest-syntax.rst line 1024 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I would add a word "optimization" somewhere here.

Maybe Encrypted files optimization: free list node limit

Done.


Documentation/manifest-syntax.rst line 1025 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

No need for an empty line -- I think RST will even incorrectly render in such case.

Done.


Documentation/manifest-syntax.rst line 1026 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Too many ^ symbols (must end right below )

Done.


Documentation/manifest-syntax.rst line 1030 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

limit -> limits

Done.


Documentation/manifest-syntax.rst line 1030 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Isn't a different order of words more readable? encrypted_files_free_list_nodes

Done.


Documentation/manifest-syntax.rst line 1034 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

space -> memory?

Done.


libos/src/fs/libos_fs_encrypted.c line 288 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't see a need to output the exact value -- it's obvious what the user should look at and fix.

Done.


tools/sgx/common/pf_util.c line 223 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not just revert all these changes? See my other comment.

If you want to leave a hint that we don't care about performance of PF utils, then we can put it in the commit message.

Done.

Contributor

@llly llly left a comment


Reviewed 6 of 6 files at r2, all commit messages.
Reviewable status: all files reviewed, 27 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)


common/src/protected_files/protected_files.c line 576 at r2 (raw file):

    struct pf_node_item* item =
        (struct pf_node_item*)((char*)data_ptr - offsetof(struct pf_node_item, file_node));

Can use container_of Macro
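
For reference, a minimal sketch of the container_of-based variant this comment suggests (hypothetical shape, reusing the pf_node_item/file_node_t names discussed above; a generic container_of definition is included only for self-containedness, while Gramine provides its own equivalent macro):

#include <assert.h>
#include <stddef.h>

/* Generic container_of: recovers the address of the enclosing struct from a member pointer. */
#define container_of(ptr, type, member) \
    ((type*)((char*)(ptr) - offsetof(type, member)))

/* Equivalent to the open-coded pointer arithmetic quoted above, but self-documenting. */
static struct pf_node_item* node_item_from_data(file_node_t* data_ptr) {
    assert(data_ptr);
    return container_of(data_ptr, struct pf_node_item, file_node);
}

The function name here is illustrative; the point is only that container_of expresses the data_ptr-to-item mapping without manual offsetof arithmetic.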


libos/src/fs/libos_fs_encrypted.c line 277 at r2 (raw file):

    int ret;

    toml_table_t* manifest_fs = toml_table_in(g_manifest_root, "fs");

Need to make sure manifest_fs pointer is not NULL before using it.

… files

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Contributor Author

@kailun-qin kailun-qin left a comment


Reviewable status: 4 of 6 files reviewed, 27 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


common/src/protected_files/protected_files.c line 576 at r2 (raw file):

Previously, llly (Li Xun) wrote…

Can use container_of Macro

Done.


libos/src/fs/libos_fs_encrypted.c line 277 at r2 (raw file):

Previously, llly (Li Xun) wrote…

Need to make sure manifest_fs pointer is not NULL before using it.

Done.

Contributor

@dimakuv dimakuv left a comment


Reviewed 4 of 6 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @llly)

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

Bench setup: RocksDB bench tool, release build, gramine-direct, estimated 2GB DB size

Limit nodes                              | 0 (opt disabled) | 128    | 256    | 500    | 1024   | 1280 (no fallback)
Multi average duration for 50 puts (µs)  | 286.10           | 211.22 | 188.72 | 170.85 | 150.88 | 150.48

Note:
The compaction process was not triggered in the above setup. For the bench results on RocksDB compaction, pls refer to #1714 (comment).

The numbers look good to me.


a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

Done.

Thanks, but see my comments in there.



-- commits line 35 at r3:
*/ -- delete this. Can be done during final rebase.


common/src/protected_files/protected_files.c line 518 at r3 (raw file):

        if (item == NULL) {
            log_error("Not enough memory for the encrypted files node free list. Please consider "
                      "dicreasing the free list node limit via the manifest option "

typo: decreasing


Documentation/performance.rst line 310 at r3 (raw file):

Optimizations for file I/O
--------------------------

Would it be possible to quickly describe what MHT nodes and data nodes are? You can just mention that an encrypted file is split into equally-sized chunks (nodes). The first node is the header metadata node (header MHT node), and the other nodes are stored as a Merkle tree, where the leaves are called "data nodes" (and host the actual encrypted data) and non-leaves are MHT nodes (and host metadata like intermediate encryption keys).


Documentation/performance.rst line 314 at r3 (raw file):

one source of overhead for certain workloads (e.g., RocksDB compaction). To
address this performance bottleneck, instead of using alloc/free directly on
file nodes, Gramine provides an optional free list of pre-allocacted nodes as an

pre-allocated typo


Documentation/performance.rst line 317 at r3 (raw file):

optimization that trades memory for time. This optimization is by default
disabled, but can be enabled and tuned according to the needs of the application
via the manifest option ``fs.limits.encrypted_files_node_free_list``.

Can we add a link to this manifest option?


libos/src/fs/libos_fs_encrypted.c line 286 at r3 (raw file):

                      &limit_nodes_int64);
    if (ret < 0) {
        log_error("Cannot parse 'limits.encrypted_files_free_list_nodes'!");

Seems I forgot previously: you must add the fs. prefix too


libos/src/fs/libos_fs_encrypted.c line 290 at r3 (raw file):

    }
    if (limit_nodes_int64 < 0) {
        log_error("Invalid 'limits.encrypted_files_free_list_nodes' value!");

Seems I forgot previously: you must add the fs. prefix too

… files

!TODO: use below commit msg:

[common] Introduce pre-allocacted node free list for encrypted files

Allocation of the Encrypted FS file nodes (e.g., MHT nodes, data nodes)
was found to be a performance bottleneck for certain workloads (e.g.,
RocksDB compaction).

Therefore, instead of using alloc/free directly on file nodes, this
commit introduces a free list of pre-allocacted nodes reserved
specifically for their usage, as a performance optimization that trades
memory for time. A new manifest option has been added to allow users to
set and adjust the free list limit based on their applications.

Note that the encrypted files related tools (`gramine-sgx-pf-crypt`,
`gramine-sgx-pf-tamper`) are not performance-sensitive so we do not
enable this node free list optimization for them.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Contributor Author

@kailun-qin kailun-qin left a comment


Reviewed 2 of 2 files at r3.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


-- commits line 35 at r3:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

*/ -- delete this. Can be done during final rebase.

oops, done.


common/src/protected_files/protected_files.c line 518 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

typo: decreasing

Done.


Documentation/performance.rst line 310 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Would it be possible to quickly describe what MHT nodes and data nodes are? You can just mention that an encrypted file is split into equally-sized chunks (nodes). The first node is the header metadata node (header MHT node), and the other nodes are stored as a Merkle tree, where the leaves are called "data nodes" (and host the actual encrypted data) and non-leaves are MHT nodes (and host metadata like intermediate encryption keys).

Done, w/ some text updates.


Documentation/performance.rst line 314 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

pre-allocated typo

Done.


Documentation/performance.rst line 317 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can we add a link to this manifest option?

Done.


libos/src/fs/libos_fs_encrypted.c line 286 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Seems I forgot previously: you must add the fs. prefix too

Done.


libos/src/fs/libos_fs_encrypted.c line 290 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Seems I forgot previously: you must add the fs. prefix too

Done.

Contributor

@dimakuv dimakuv left a comment


Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @llly)


Documentation/performance.rst line 310 at r3 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Done, w/ some text updates.

Great text, thanks!

@kailun-qin kailun-qin changed the title [common] Introduce pre-allocacted node free list for encrypted files [common] Introduce pre-allocated node free list for encrypted files Jan 22, 2024
Contributor Author

@kailun-qin kailun-qin left a comment


Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @llly)


-- commits line 49 at r4:
-> allocated, have to be fixed during rebase

Code quote:

allocacted

Contributor

@llly llly left a comment


Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

Member

@mkow mkow left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

a discussion (no related file):
This optimization is what all modern memory allocators do, it's just our own allocator which lacks it. So, shouldn't this be implemented as an optimization to our allocator, not just inside encrypted files? It would speed up everything, not only this subsystem.


Contributor

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

This optimization is what all modern memory allocators do, it's just our own allocator which lacks it. So, shouldn't this be implemented as an optimization to our allocator, not just inside encrypted files?

I am against this approach.

Ideally, we would not "add an optimization to our allocator", but instead I would vote for replacing our allocator with a battle-tested allocator.

In particular, we already had discussions (and I think a general agreement) that we should move to a proper libc (like Musl or maybe newlib), instead of copy-pasting or re-implementing the libc APIs. I would prefer this to "adding optimizations to our code".

Moreover, we already have such micro-optimizations sprinkled all over our Gramine code. Two optimizations that come to mind:

  1. Using Linux-compatible structs for iovec objects for send/recv on the sockets. In the original networking sub-system by Borys, these LibOS objects were not Linux-compatible, so the compatibility layer in PAL malloced and freed temporary helper objects. This resulted in huge perf overhead, so we just made iovec objects the same as in Linux, to side-step memory allocation in PAL.
  2. Using a scratch buffer for the first thread during sendfile(), again to not allocate and free every time. This was particularly beneficial for Nginx.

I'm sure we have many such micro-optimizations in different parts of Gramine. They are all quite small and self-contained.

I would prefer this approach:

  1. Continue with micro-optimizations of different sub-systems. (This PR falls under this category.)
  2. At some point in the future, choose, design, and implement the replacement of our memory allocator with the default allocator of some libc version. (Far away in the future.)

I really don't like the rabbit hole of "let's fix/replace our current allocator", as this will take months/years to accomplish and test.


Member

@mkow mkow left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mwkmwkmwk)

a discussion (no related file):

Ideally, we would not "add an optimization to our allocator", but instead I would vote for replacing our allocator with a battle-tested allocator.

I'm in favor of this, but do you mean that you want to do this instead of this PR?

Moreover, we already have such micro-optimizations sprinkled all over our Gramine code. Two optimizations that come to mind:

Yup, that's a very good point. I hate these micro-optimizations for very specific parts of the codebase, we could just use a proper allocator and all of them would be good (or at least acceptable).

I really don't like the rabbit hole of "let's fix/replace our current allocator", as this will take months/years to accomplish and test.

Hmm, but isn't it mostly the same algorithm as implemented in this PR? You only need to add buckets per size, but otherwise it would be the same?

btw, @mwkmwkmwk: What does allocation support look like in nostd Rust? Can migrating a part of the codebase allow us to use some Rust allocator (but not sure how to get one in nostd) from C code and to remove our custom allocators?


Contributor

@mwkmwkmwk mwkmwkmwk left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Ideally, we would not "add an optimization to our allocator", but instead I would vote for replacing our allocator with a battle-tested allocator.

I'm in favor of this, but do you mean that you want to do this instead of this PR?

Moreover, we already have such micro-optimizations sprinkled all over our Gramine code. Two optimizations that come to mind:

Yup, that's a very good point. I hate these micro-optimizations for very specific parts of the codebase, we could just use a proper allocator and all of them would be good (or at least acceptable).

I really don't like the rabbit hole of "let's fix/replace our current allocator", as this will take months/years to accomplish and test.

Hmm, but isn't it mostly the same algorithm as implemented in this PR? You only need to add buckets per size, but otherwise it would be the same?

btw, @mwkmwkmwk: What does allocation support look like in nostd Rust? Can migrating a part of the codebase allow us to use some Rust allocator (but not sure how to get one in nostd) from C code and to remove our custom allocators?

We could replace our custom allocator with an existing allocator, and it doesn't really matter much whether it is written in Rust or in C. The allocator interface is simple enough that it's just a bit of FFI plumbing.

From a short study of available crates, general Rust allocators tend to come in three variants:

  • wrappers of heavyweight mmap-based allocators written in C, like jemalloc or mimalloc — not particularly interesting, as it just passes the buck
  • simple allocators for embedded bare-metal Rust — these tend to come with an assumption of using a single static-sized memory area as a heap, which again makes them unsuitable
  • simple allocators intended for WebAssembly, like wee_alloc — while the use case doesn't entirely match, this seems to be the closest to our needs; in particular, if given a constraint of "port gramine to some existing Rust allocator", I'd look into porting wee_alloc

Once we have something that implements the global allocator API, there are of course existing crates that we can use for optimizations on top of that, like slab.


Contributor

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

I'm in favor of this, but do you mean that you want to do this instead of this PR?

@mkow No, I want to merge this PR. My reasoning:

  1. In the short term, we should continue with these micro-optimizations. Hence we should merge this PR.
  2. In the long term, we should switch to some battle-tested allocator.

I strongly believe that doing item 2 will take us ~6 months. Our internal slab allocator is not for the faint of heart, and replacing it won't be a trivial task. Then we'll also need to measure overall performance of Gramine on different benchmarks, find all corner cases, fine-tune the allocator, choose the default parameters, and remove all micro-optimizations. That's a LOT of work.

Hmm, but isn't it mostly the same algorithm as implemented in this PR? You only need to add buckets per size, but otherwise it would be the same?

What do you mean by buckets per size? We have a slab allocator that has 8 levels (up to 2KB sizes), and then we fall back to host-backed allocation for objects larger than 2KB. Now, for encrypted files we have 4KB chunks -- so do we create a new bucket size (new level) for them?

simple allocators intended for WebAssembly, like wee_alloc — while the use case doesn't entirely match, this seems to be the closest to our needs; in particular, if given a constraint of "port gramine to some existing Rust allocator", I'd look into porting wee_alloc

I wouldn't mind this, sounds like a decent fit. Is this allocator no-stdlib? I vaguely remember that Rust didn't have a notion of nostdlib code at some point in time, not sure if this is fixed already or maybe it's not relevant to wee_alloc at all.

Also, wouldn't we want to consider a C-based simple allocator, like mimalloc? Two arguments:

  1. Introducing a Rust compiler in the toolchain requires some effort from our side (probably not a big effort, but we e.g. support CentOS 8 distros, and we guarantee manual builds of Gramine on such systems -- does Rust work on CentOS 8?).
  2. A well-known and simple memory allocator shouldn't be susceptible to memory safety bugs, at least I don't remember any vulnerabilities concerning memory allocators. So why bother with a Rust-based one, if a C-based one is as good.

Contributor Author

@kailun-qin kailun-qin left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

  • In the short term, we should continue with these micro-optimizations. Hence we should merge this PR.
  • In the long term, we should switch to some battle-tested allocator.

+1

I strongly believe that doing item 2 will take us ~6 months. Our internal slab allocator is not for the faint of heart, and replacing it won't be a trivial task. Then we'll also need to measure overall performance of Gramine on different benchmarks, find all corner cases, fine-tune the allocator, choose the default parameters, and remove all micro-optimizations. That's a LOT of work.

Yeah, I think we actually did some similar PoCs before (about 1 year ago?) by replacing our internal slab allocator w/ dlmalloc and mimalloc respectively (for memcached perf optimization IIRC). But none of these efforts were moved forward any further.

wouldn't we want to consider a C-based simple allocator, like mimalloc?

+1, any special reason that we consider a Rust allocator in this case?


Contributor

@mwkmwkmwk mwkmwkmwk left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

Also, wouldn't we want to consider a C-based simple allocator, like mimalloc? Two arguments:

  1. Introducing a Rust compiler in the toolchain requires some effort from our side (probably not a big effort, but we e.g. support CentOS 8 distros, and we guarantee manual builds of Gramine on such systems -- does Rust work on CentOS 8?)

the reason this came up is because we were talking with @mkow about gradually porting gramine to Rust for other reasons, so we wouldn't be doing this just for the allocator


Contributor Author

@kailun-qin kailun-qin left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

Hmm, but isn't it mostly the same algorithm as implemented in this PR? You only need to add buckets per size, but otherwise it would be the same?
Now, for encrypted files we have 4KB chunks -- so do we create new bucket size (new level) for them?

The size of the file node in encrypted files is actually 8235B (including both the encrypted data and the plaintext data) rather than 4KB:

typedef struct _file_node {
    LIST_TYPE(_file_node) list;
    uint8_t type;
    uint64_t node_number;
    struct _file_node* parent;
    bool need_writing;
    bool new_node;
    struct {
        uint64_t physical_node_number;
        encrypted_node_t encrypted; // the actual data from the disk
    };
    union { // decrypted data
        mht_node_t mht;
        data_node_t data;
    } decrypted;
} file_node_t;

So adding a new level/slab size that could benefit encrypted files would be too random/customized; pls see a reference change here: #1763. We're asking the internal team to test this one for its effectiveness as well -- for completeness.
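
To make the size argument concrete, here is a small hypothetical sketch of why such a node never fits a fixed-level slab. The level sizes below merely mirror the "8 levels (up to 2KB sizes)" description from the earlier comment and are not the exact allocator configuration:

#include <stddef.h>

/* Hypothetical size classes mirroring the "8 levels up to 2KB" description above. */
static const size_t g_slab_level_sizes[8] = {16, 32, 64, 128, 256, 512, 1024, 2048};

/* Returns the smallest level that fits `size`, or -1 for the host-backed fallback path. */
static int slab_level_for(size_t size) {
    for (int i = 0; i < 8; i++)
        if (size <= g_slab_level_sizes[i])
            return i;
    return -1; /* an ~8235-byte file_node_t always takes the fallback path */
}

So a dedicated free list (or a purpose-built slab level) is needed to keep these nodes off the generic allocation path.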


Contributor

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

a discussion (no related file):

the reason this came up is because we were talking with @mkow about gradually porting gramine to Rust for other reasons, so we wouldn't be doing this just for the allocator

@mwkmwkmwk @mkow I am also in favor of exploring Rust applicability to Gramine codebase, but I think there are two good starting-point candidates:

  • the Linux PAL, as it is self-contained and rather simple, so a good playing ground
  • The checkpoint/restore code in LibOS -- the current code makes me cry, so it needs to be rewritten anyway, and this feels like a good opportunity to re-write a bad subsystem with a new design and in a new language (also it is currently hard-core low-level pointer juggling, and I would like to see how Rust deals with such stuff; my impression is that it will be a sad unsafe case, but hopefully I'm mistaken)

The size of the file node in encrypted files is actually 8235B (including both the encrypted data and the plaintext data) rather than 4KB:

@kailun-qin makes a good point -- the encrypted file node size is kinda random, and it doesn't fit the current slab allocator design.

