This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

[LibOS] Rework IPC transport layer #2313

Merged
merged 6 commits into from
May 8, 2021

Conversation

boryspoplawski
Contributor

@boryspoplawski boryspoplawski commented Apr 15, 2021

Description of the changes

This took so long that I don't remember the details anymore. Please refer to the commit messages and the code :)
As always, please point me to the places that need more comments; this time it definitely lacks some.

A step towards resolving #2107
Fixes #1986
Partially fixes #440



Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: 0 of 29 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL)


LibOS/shim/src/ipc/shim_ipc_worker.c, line 72 at r1 (raw file):

static noreturn void ipc_leader_died_callback(void) {
    // TODO maybe just ignore it? this triggers often if parent does wait+exit

If the ipc leader does:

wait(...);
exit(0);

then the child might see the parent (IPC leader) exiting in the window between the child sending the child-death notification and actually exiting/killing the IPC worker thread.

How bad an idea would it be to ignore the IPC leader dying? I think it should work: if we need a leader, an error will be spotted when trying to communicate with it.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 83 at r1 (raw file):

    ipc_child_disconnect_callback(conn->vmid);

    // TODO: ipc_disconnect_msg_with_ack_callback(conn->vmid);

We might have some msg_with_ack messages that are waiting for a response, which might never come since the remote disconnected.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 228 at r1 (raw file):

static noreturn void ipc_worker_main(void) {
    // TODO: maybe global array - but would require copying on conn delete

If we had a global array of connections (instead of a list) we wouldn't have to gather them all here in every loop iteration, but then deletion would be slower (though deletion should be rare).
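As a side note, array deletion doesn't have to be slow if ordering doesn't matter: the classic "swap-remove" trick moves the last element into the vacated slot in O(1). A minimal sketch, with an illustrative `conn_t` type that is not Graphene's actual connection struct:

```c
#include <stddef.h>

/* Illustrative stand-in for an IPC connection entry. */
typedef struct {
    int vmid;
} conn_t;

/* Remove entry `idx` from a length-`count` array in O(1) by moving the last
 * element into the vacated slot. Element order is not preserved, which a
 * "wait on all handles" loop does not need. Returns the new count. */
size_t conns_swap_remove(conn_t* conns, size_t count, size_t idx) {
    conns[idx] = conns[count - 1];
    return count - 1;
}
```

With this, the per-iteration gathering of handles goes away and deletion stays cheap; the only cost is that connection order changes on delete.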


LibOS/shim/src/ipc/shim_ipc_worker.c, line 294 at r1 (raw file):

        }

        int ret = DkStreamsWaitEvents(items_cnt, handles, events, ret_events, NO_TIMEOUT);

There is some weird behavior of this function (which I haven't debugged yet) that causes a closed handle to report that it's available for reading even though it's already closed; the next read then returns 0 as expected, causing spurious error messages in the log (about the remote closing early).
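For what it's worth, this mirrors how plain poll(2) behaves on Linux: a peer-closed pipe reports itself as ready, and the follow-up read of 0 bytes is the EOF signal rather than an error. A hypothetical illustration (not Graphene's DkStreamsWaitEvents) of treating that 0-byte read as a clean disconnect:

```c
#include <poll.h>
#include <unistd.h>

/* Create a pipe, close the "remote" end, and show that the read end still
 * reports readiness via poll(); the subsequent read() returning 0 is the
 * reliable EOF indication. Returns 1 if EOF was detected cleanly. */
int detect_remote_close(void) {
    int fds[2];
    if (pipe(fds) < 0)
        return 0;
    close(fds[1]); /* "remote" side closes immediately */

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ready = poll(&pfd, 1, /*timeout_ms=*/1000);

    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    close(fds[0]);

    /* A closed peer may show up as POLLIN and/or POLLHUP; either way the
     * 0-byte read means a clean disconnect, not an error. */
    return ready == 1 && n == 0;
}
```

If the wait primitive has the same semantics, the fix may just be to downgrade the "remote closed early" message for the read-returns-0 case.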


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r1 (raw file):

            ret = read_exact(new_handle, &new_id, sizeof(new_id));
            if (ret < 0) {
                // TODO: maybe just die here?

I don't think this case can happen legitimately.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 348 at r1 (raw file):

            conn = connections[i];
            if (ret_events[i] & PAL_WAIT_READ) {
                // TODO: what to do on errors? die or just delete connection?

Currently they are ignored here; I'm not sure which path to take. I guess removing the connection is the proper one?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

    log_setprefix(shim_get_tcb());

    // TODO: currently we have PAL-provided stack = 64 pages

The default PAL-provided stack is 64 pages. Is that enough for this worker thread? I think so.

@boryspoplawski boryspoplawski changed the title [LibOS] Reword IPC transport layer [LibOS] Rework IPC transport layer Apr 15, 2021
Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: 0 of 29 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners

a discussion (no related file):
TBH I don't like this approach of 2 separate sets of connections (incoming + outgoing) anymore. The alternative was to have only one set, but that really complicates the case of 2 processes connecting to each other at the same time (remember that we want at most one connection between each pair) and probably has some other issues.
I'm not sure what to do now: stay with this weirdness (e.g. the connect-back requests) or try rewriting this to the other version?


Contributor

@yamahata yamahata left a comment


Reviewed 27 of 29 files at r1, 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 11 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski)

a discussion (no related file):

Previously, boryspoplawski (Borys Popławski) wrote…

TBH I don't like this approach of 2 separate sets of connections (incoming + outgoing) anymore. The alternative was to have only one set, but that really complicates the case of 2 processes connecting to each other at the same time (remember that we want at most one connection between each pair) and probably has some other issues.
I'm not sure what to do now: stay with this weirdness (e.g. the connect-back requests) or try rewriting this to the other version?

(I'm only thinking out loud and am curious what the complication of the one-set approach is.)

If simultaneous connection occurs with the one-set approach, the two processes will have two connections (contrary to the naive assumption).
But does it do any harm?
Other than one extra connection on a rare event (in the worst case the same as 2 separate sets of connections), I can think of only AVL node disambiguation, broadcast, and closing connections on child death.
The process will use a dedicated one of the two connections to send messages; the callback on receiving a message will be triggered properly.



LibOS/shim/src/shim_checkpoint.c, line 576 at r3 (raw file):

42

please introduce symbolic macro.


LibOS/shim/src/shim_init.c, line 469 at r3 (raw file):

42

symbolic value please.


LibOS/shim/src/ipc/shim_ipc.c, line 112 at r3 (raw file):

            BUG();
        }
        ret = DkStreamOpen(uri, 0, 0, 0, 0, &conn->handle);

This is a connect system call on a named pipe, so it's a blocking operation.
The use of a spinlock is suspicious.
Either that, or an unlock, DkStreamOpen, lock and re-check sequence is needed.

Contributor

@dimakuv dimakuv left a comment


Reviewed 27 of 29 files at r1, 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 70 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @pwmarcz, and @yamahata)

a discussion (no related file):

Previously, yamahata wrote…

(I'm only thinking out loud and am curious what the complication of the one-set approach is.)

If simultaneous connection occurs with the one-set approach, the two processes will have two connections (contrary to the naive assumption).
But does it do any harm?
Other than one extra connection on a rare event (in the worst case the same as 2 separate sets of connections), I can think of only AVL node disambiguation, broadcast, and closing connections on child death.
The process will use a dedicated one of the two connections to send messages; the callback on receiving a message will be triggered properly.

I don't know, I like this two-set approach. I only recommend renaming your two shim_ipc_connection structs to shim_ipc_incoming_conn and shim_ipc_outgoing_conn and adding a small comment that the former is handled solely by the IPC worker thread (so no locks/refcounts are needed) and the latter -- by the normal app threads.


a discussion (no related file):
Impressive PR!



LibOS/shim/include/shim_handle.h, line 356 at r3 (raw file):

    int flags; /* Linux' O_* flags */
    int acc_mode;
    IDTYPE owner;

Seems like this was never used? The actual tracking of ownership seems to be encapsulated in the ipc_ranges metadata.


LibOS/shim/include/shim_ipc.h, line 32 at r3 (raw file):

    IPC_MSG_RESP = 0,
    IPC_MSG_CONNBACK,
    IPC_MSG_DUMMY,

Could we have brief comments on these types? I'd prefer to have them on all types, but at minimum please add comments on the two new ones.


LibOS/shim/include/shim_ipc.h, line 63 at r3 (raw file):

struct shim_ipc_cp_data {
    IDTYPE parent_vmid;
    IDTYPE ns_vmid;

No more ns, thank god! I hated this name.


LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

struct shim_ipc_ids {
    IDTYPE parent_id;
    IDTYPE leader_id;

Sometimes you use id, and sometimes vmid. I would prefer to have vmid everywhere, otherwise it's confusing.


LibOS/shim/include/shim_ipc.h, line 89 at r3 (raw file):

/* common functions for pid & sysv namespaces */
int add_ipc_subrange(IDTYPE idx, IDTYPE owner);

When we have owner in these functions, does this actually mean "VMID of the owner process"?

If yes, maybe rename it to the more concrete owner_vmid?


LibOS/shim/include/shim_ipc.h, line 125 at r3 (raw file):

int ipc_cld_exit_send(unsigned int exitcode, unsigned int term_signal);
int ipc_cld_exit_callback(struct shim_ipc_msg* msg, IDTYPE src);
void ipc_child_disconnect_callback(IDTYPE vmid);

We have mismatches in these namings. I would prefer to change:

  • cld -> child
  • vmid -> src

Alternatively, you could make longer names everywhere: src -> src_vmid, dst -> dst_vmid, but this seems like an overkill.


LibOS/shim/include/shim_ipc.h, line 158 at r3 (raw file):

} __attribute__((packed));

int ipc_offer_send(struct shim_ipc_port* port, IDTYPE dest, IDTYPE base, IDTYPE size,

Stupid question: why did we previously need port if we could already find the port via dest? Is it because there could be several open ports for one dest, so there was a need for an explicit concrete port?


LibOS/shim/include/shim_ipc.h, line 328 at r3 (raw file):

                             void (*callback)(struct shim_ipc_msg_with_ack*, void*), void* data);

void wake_req_msg_thread(struct shim_ipc_msg_with_ack* req_msg, void* data);

Can you add a short comment on why data and seq are needed in these two functions?


LibOS/shim/include/shim_ipc.h, line 331 at r3 (raw file):

int connect_to_process(IDTYPE dest);
int request_leader_connect_back(void);

Need some comment what this function is.


LibOS/shim/include/shim_ipc.h, line 332 at r3 (raw file):

int connect_to_process(IDTYPE dest);
int request_leader_connect_back(void);
int ipc_dummy_callback(struct shim_ipc_msg* msg, IDTYPE src);

Need some comment what this dummy callback is.


LibOS/shim/include/shim_ipc.h, line 333 at r3 (raw file):

int request_leader_connect_back(void);
int ipc_dummy_callback(struct shim_ipc_msg* msg, IDTYPE src);
int broadcast_ipc(struct shim_ipc_msg* msg, IDTYPE exclude_id);

exclude_id -> exclude_vmid


LibOS/shim/include/shim_ipc.h, line 336 at r3 (raw file):

int send_ipc_message(struct shim_ipc_msg* msg, IDTYPE dest);
int send_ipc_message_with_ack(struct shim_ipc_msg_with_ack* msg, IDTYPE dest, unsigned long* seq);
int send_response_ipc_message(IDTYPE dest, int ret, unsigned long seq);

What is this seq? Since we have only one connection per pair of processes, and we use TCP/IP (and also TLS in case of SGX enclaves), why do we care about sequence numbers (the way I understand what seq is)?


LibOS/shim/include/shim_ipc.h, line 374 at r3 (raw file):

void ipc_port_with_child_fini(struct shim_ipc_port* port, IDTYPE vmid);

struct shim_thread* terminate_ipc_helper(void);

Could you also supply a quick PR to fix the name of the async helper thread? Otherwise we'll have inconsistency in names, which will be very confusing to people.


LibOS/shim/src/shim_checkpoint.c, line 504 at r3 (raw file):

    struct shim_ipc_ids process_ipc_ids = {
        .parent_id = g_self_vmid,
        .leader_id = g_process_ipc_ids.leader_id ?: g_self_vmid,

Are we sure there cannot be a legit vmid of 0 (or that it is always the vmid of the leader)? Otherwise this check will be wrong.


LibOS/shim/src/shim_checkpoint.c, line 576 at r3 (raw file):

Previously, yamahata wrote…
42

please introduce symbolic macro.

I don't think it deserves a macro, simply 0 and rename to e.g. dummy_c.


LibOS/shim/src/shim_checkpoint.c, line 585 at r3 (raw file):

         * after we return from this function.
         */
        (void)mark_child_exited_by_vmid(child_vmid, /*uid=*/0, /*exit_code=*/0, SIGPWR);

But isn't SIGPWR our way of saying "Well, looks like the host OS for some reason killed the process, which should not have happened"? It's probably what you're saying in your comment, but Let's pretend the process creation was successful... is confusing.

Maybe Child might have already been marked dead by the host OS (e.g., killed by the OOM killer)., and then Let's pretend the process creation itself was successful, but that the child process was terminated immediately afterwards with a special SIGPWR signal. -- we have no way...?


LibOS/shim/src/shim_checkpoint.c, line 623 at r3 (raw file):

    /* create new IPC port to communicate over pal_process channel with the child process */
    add_ipc_port_by_id(child_vmid, pal_process, &ipc_port_with_child_fini, NULL);

So it's not the parent who starts the port now, but the child (who is expected to connect on its startup)?

What happens if the child never connects? Do we have some kind of memory leak?


LibOS/shim/src/shim_init.c, line 443 at r3 (raw file):

     * before we start handling async IPC messages from this child (it will connect to us after we
     * acknowledge the creation). */

You said this in another file (where parent creates the child process). But here you first connect to the parent process, and only then continue with acknowledgements. Is this the correct order?

Same question for the leader (because the parent and the leader can be the same process).


LibOS/shim/src/shim_init.c, line 448 at r3 (raw file):

            DkProcessExit(1);
        }
        ret = request_leader_connect_back();

Why such a weird name? Why not simply connect_to_leader()? Definitely deserves a comment...


LibOS/shim/src/shim_init.c, line 463 at r3 (raw file):

        /* Wait for parent to settle its adult things. */
        char c = 0;

dummy_c


LibOS/shim/src/ipc/shim_ipc.c, line 6 at r3 (raw file):

/*
 * This file contains code to maintain generic bookkeeping of IPC: operations on shim_ipc_msg
 * (one-way IPC messages), shim_ipc_msg_with_ack (IPC messages with acknowledgement).

IDK, I like these quick introduction comments. Can you keep it, maybe rephrased for your new functions?


LibOS/shim/src/ipc/shim_ipc.c, line 24 at r3 (raw file):

    IDTYPE vmid;
    REFTYPE ref_count;
    PAL_HANDLE handle;

Can you add a comment on what type of handle this is? Is it a bi-directional pipe (but used only in one primary direction)? Is it blocking? Does it preserve order (TCP semantics)?


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

}

/* TODO: hashmap will probably be better as we only need insert, delete and find exact. */

Do we really care about the best data structure in this case? I assume we will never exceed e.g. a 50-process limit in real-world apps.


LibOS/shim/src/ipc/shim_ipc.c, line 43 at r3 (raw file):

    struct shim_ipc_msg_with_ack* a = container_of(_a, struct shim_ipc_msg_with_ack, node);
    struct shim_ipc_msg_with_ack* b = container_of(_b, struct shim_ipc_msg_with_ack, node);
    return a->msg.dst == b->msg.dst ? a->msg.seq <= b->msg.seq : a->msg.dst <= b->msg.dst;

Is this to prevent starvation and to preserve ordering? Quite cool.

UPDATE: Actually, looks like no. You just need an exact match, according to the code below. This stuff with seq looks unnecessarily complicated to me...


LibOS/shim/src/ipc/shim_ipc.c, line 79 at r3 (raw file):

        DkObjectClose(conn->handle);
        destroy_lock(&conn->lock);
        free(conn);

What if this connection is still in the AVL tree? This is impossible? Can you add an assert?


LibOS/shim/src/ipc/shim_ipc.c, line 96 at r3 (raw file):

    lock(&g_ipc_connections_lock);
    struct shim_ipc_connection* conn = node2conn(avl_tree_find(&g_ipc_connections, &dummy.node));
    if (!conn) {

Wouldn't it be easier to follow this code if you first handled the case of an already-existing conn (then you simply increase its refcount, unlock and return), and then dealt with the "create connection" case?


LibOS/shim/src/ipc/shim_ipc.c, line 112 at r3 (raw file):

Previously, yamahata wrote…

This is a connect system call on a named pipe, so it's a blocking operation.
The use of a spinlock is suspicious.
Either that, or an unlock, DkStreamOpen, lock and re-check sequence is needed.

+1 to Isaku's question. Are these pipes blocking or non-blocking? I'd like to see some comments on the assumptions/requirements for these pipes.


LibOS/shim/src/ipc/shim_ipc.c, line 120 at r3 (raw file):

        if (ret < 0) {
            goto out;
        }

So we don't wait for an acknowledgement? We just send our VMID and the connection is considered established? Is this enough?


LibOS/shim/src/ipc/shim_ipc.c, line 123 at r3 (raw file):

        conn->vmid = dest;
        REF_SET(conn->ref_count, 1);

So it looks like ref_count contains both AVL-tree references and "Graphene subsystems" references, right?


LibOS/shim/src/ipc/shim_ipc.c, line 159 at r3 (raw file):

        return ret;
    }
    put_ipc_connection(conn);

This looks like you simply want to add a dest VMID to the list of opened ports in the AVL tree (if not yet added). And then you don't really need the connection, so you put its reference. Could you add a brief comment on this?


LibOS/shim/src/ipc/shim_ipc.c, line 220 at r3 (raw file):

        msg = container_of(node, struct shim_ipc_msg_with_ack, node);
    }
    callback(msg, data);

What if no node is found? Then your callback will run as callback(NULL, data). Why not move it under if (node)?


LibOS/shim/src/ipc/shim_ipc.c, line 220 at r3 (raw file):

        msg = container_of(node, struct shim_ipc_msg_with_ack, node);
    }
    callback(msg, data);

Also, why is the callback called under the global lock? From what I understand, the avl_tree_delete() logic below will kick in only after the callback is executed and the IPC worker thread kicks the sleeping thread. So it should be safe to call the callback without the lock.

UPDATE: Wow, the awakening of the thread seems to happen inside the callback. Maybe my logic is wrong then.


LibOS/shim/src/ipc/shim_ipc.c, line 224 at r3 (raw file):

}

int send_ipc_message_with_ack(struct shim_ipc_msg_with_ack* msg, IDTYPE dst, unsigned long* seq) {

Can we change the name seq -> out_seq? Also, why the heck would we need to know such an internal detail?


LibOS/shim/src/ipc/shim_ipc.c, line 233 at r3 (raw file):

    thread_setwait(&msg->thread, thread);

    static unsigned long ipc_seq_counter = 0;

So that's how it works... There is a global counter that is not per-connection. Well, this is uglier than I thought, but at least it makes sense now.
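For illustration, the essence of such a process-global counter can be sketched as follows (names are mine, not the PR's): each message-with-ack grabs a unique seq, the response echoes it back, and the IPC worker uses it to find and wake the waiting thread.

```c
#include <stdatomic.h>

/* Hypothetical sketch in the spirit of `ipc_seq_counter`: a single
 * process-global counter hands out IDs that are unique across all
 * connections, not per-connection. */
static atomic_ulong g_seq_counter = 1;

unsigned long next_seq(void) {
    /* atomic post-increment: safe even if multiple app threads send
     * messages-with-ack concurrently */
    return atomic_fetch_add(&g_seq_counter, 1);
}
```

Since uniqueness is all that's needed for matching responses to waiters, a global counter works; it just means seq carries no per-connection ordering meaning.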


LibOS/shim/src/ipc/shim_ipc.c, line 261 at r3 (raw file):

out:
    lock(&g_msg_with_ack_tree_lock);
    avl_tree_delete(&g_msg_with_ack_tree, &msg->node);

It's interesting that we don't track ownership of this msg object. Even though it is used by two threads: the sleeping thread (the real owner of this msg) and the IPC worker (during callback). Well, I guess it makes sense since this is a very short-lived object, and it is impossible to destroy it before the IPC worker is done with the callback.

But maybe this deserves a comment?


LibOS/shim/src/ipc/shim_ipc.c, line 280 at r3 (raw file):

    size_t total_msg_size = get_ipc_msg_with_ack_size(0);
    struct shim_ipc_msg_with_ack* msg = __alloca(total_msg_size);

We should start getting rid of __alloca(). Do we consider using #include <alloca.h> ... alloca(); fine? @mkow
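A minimal sketch of what that would look like with plain alloca(3), assuming it's acceptable in this codebase (the function and sizes here are illustrative, not the PR's):

```c
#include <alloca.h>
#include <stddef.h>
#include <string.h>

/* Demonstrates alloca(3) from <alloca.h> as a drop-in for a custom __alloca
 * helper: the buffer lives on the caller's stack frame and is reclaimed
 * automatically on return, so it must never escape the function. */
size_t sum_bytes(const unsigned char* src, size_t len) {
    unsigned char* buf = alloca(len); /* stack allocation, no free() needed */
    memcpy(buf, src, len);
    size_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

The usual caveat applies either way: stack allocation of a message-sized buffer is only safe if the message size is bounded, which matters for the fixed 64-page worker stack discussed above.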


LibOS/shim/src/ipc/shim_ipc.c, line 283 at r3 (raw file):

    init_ipc_msg_with_ack(msg, IPC_MSG_CONNBACK, total_msg_size, leader);

    log_debug("sent IPC_MSG_CONNBACK message to %u\n", leader);

sent -> Sending


LibOS/shim/src/ipc/shim_ipc.c, line 288 at r3 (raw file):

}

void wake_req_msg_thread(struct shim_ipc_msg_with_ack* req_msg, void* data) {

If data is unused, why is it here?


LibOS/shim/src/ipc/shim_ipc.c, line 291 at r3 (raw file):

    __UNUSED(data);
    if (req_msg && req_msg->thread) {
        thread_wakeup(req_msg->thread);

All this looks very brittle. Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.


LibOS/shim/src/ipc/shim_ipc.c, line 295 at r3 (raw file):

}

int ipc_dummy_callback(struct shim_ipc_msg* msg, IDTYPE src) {

But this callback is not that dummy... It wakes up the sleeping thread. Maybe a better name for this callback then?


LibOS/shim/src/ipc/shim_ipc.c, line 301 at r3 (raw file):

}

int broadcast_ipc(struct shim_ipc_msg* msg, IDTYPE exclude_id) {

exclude_vmid


LibOS/shim/src/ipc/shim_ipc_pid.c, line 233 at r3 (raw file):

    if (req_msg->thread) {
        thread_wakeup(req_msg->thread);

Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.


LibOS/shim/src/ipc/shim_ipc_pid.c, line 410 at r3 (raw file):

    if (req_msg->thread) {
        thread_wakeup(req_msg->thread);

Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 573 at r3 (raw file):

        default:
            break;
    }

This switch statement doesn't make sense now. Just refactor it into a normal if.


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 847 at r3 (raw file):

        }

        ret = ipc_pid_getstatus_send(owner, idx == UNDEF_IDX ? RANGE_SIZE : 1, pids, &range_status);

This file got slightly less sh*tty now!


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 974 at r3 (raw file):

            p->port = port;
        }

I'm just pretending I understand this...


LibOS/shim/src/ipc/shim_ipc_sysv.c, line 263 at r3 (raw file):

    if (!req_msg->private) {
        return;
    }

Why not combine these into one if?


LibOS/shim/src/ipc/shim_ipc_sysv.c, line 642 at r3 (raw file):

    log_debug("ipc callback from %u: IPC_MSG_SYSV_SEMRET\n", msg->src);

    ipc_msg_response_handle(src, msg->seq, semret_callback, semret);

And this file also got slightly less sh*tty. We still need to rework all this craziness with messages-with-ack.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 72 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

If the ipc leader does:

wait(...);
exit(0);

then the child might see the parent (IPC leader) exiting in the window between the child sending the child-death notification and actually exiting/killing the IPC worker thread.

How bad an idea would it be to ignore the IPC leader dying? I think it should work: if we need a leader, an error will be spotted when trying to communicate with it.

I agree that we should simply ignore this (with a log_debug() message that clearly states that this can happen even in normal circumstances, so no need to worry, dear user).

I think @pwmarcz has the same assumption on his FS-sync work: panic only when failed to explicitly communicate with the leader.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 83 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

We might have some msg_with_ack that are waiting for a response, which might never come since the remote disconnected.

Yeah, these messages-with-ack are nasty currently. Not for this PR though.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 228 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

If we had a global array of connection (instead of a list) we wouldn't have to gather them all here in every loop iteration, but then deletion would be slower (but deletion should be rare).

Indeed, I was also thinking about it. Probably not for this PR though.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 294 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

There is some weird behavior of this function (which I haven't debugged yet) that causes a closed handle to report that it's available for reading even though it's already closed; the next read then returns 0 as expected, causing spurious error messages in the log (about the remote closing early).

Interesting, I don't remember seeing this (but this was ages ago when I worked on this part).


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

I don't think this case can happen legitimately.

Yes, I think you should just die here. You already die on DkStreamWaitForClient() failure, so it makes sense to die here as well.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 348 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Currently they are ignored here, I'm not sure which path to take. I guess removing connection is the proper one?

I think just removing the connection (dropping the client) with a loud warning message.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Default PAL provided stack is 64 pages. Is that enough for this worker thread? I think yes.

What do you mean by this comment? I don't see anything in the IPC worker thread that uses a lot of stack. So should be fine.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 26 at r3 (raw file):

DEFINE_LIST(shim_ipc_connection);
DEFINE_LISTP(shim_ipc_connection);
struct shim_ipc_connection {

In shim_ipc.c, you have another definition of the IPC connection -- that one uses the AVL tree.

  1. Why do you use the LIST here? Why not AVL?
  2. Can you rename to something like shim_ipc_incoming_connection? Well, that's too long so maybe shim_ipc_incoming_conn?

If you will rename this, then it would also make sense to rename the one in shim_ipc.c to something like shim_ipc_outgoing_conn.

UPDATE: Ok, even if we move away from LIST, we shouldn't move to AVL but instead move to a global array (see below, this will simplify collection of handles to wait on).


LibOS/shim/src/ipc/shim_ipc_worker.c, line 29 at r3 (raw file):

    LIST_TYPE(shim_ipc_connection) list;
    PAL_HANDLE handle;
    IDTYPE vmid;

You don't have a refcount here because you know that the IPC worker thread is the sole owner of all of these, right?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 38 at r3 (raw file):

static AEVENTTYPE interrupt_event;
static int g_time_to_exit = 0;
static int g_clear_on_worker_exit = 1;

Can you add a note that this is a sync variable: the Graphene runtime resets it when the IPC worker thread has completely exited, and the terminate_ipc_worker() caller thread waits on this variable.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 42 at r3 (raw file):

static int ipc_resp_callback(struct shim_ipc_msg* msg, IDTYPE src);
static int ipc_connect_back_callback(struct shim_ipc_msg* msg, IDTYPE src);

Again this weird name, please add some comment on this connect_back.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 86 at r3 (raw file):

}

static int add_ipc_connection(PAL_HANDLE handle, IDTYPE id) {

id -> vmid


LibOS/shim/src/ipc/shim_ipc_worker.c, line 113 at r3 (raw file):

    size_t total_msg_size = get_ipc_msg_size(sizeof(struct shim_ipc_resp));
    struct shim_ipc_msg* resp_msg = __alloca(total_msg_size);

ditto (alloca)


LibOS/shim/src/ipc/shim_ipc_worker.c, line 145 at r3 (raw file):

static int ipc_connect_back_callback(struct shim_ipc_msg* orig_msg, IDTYPE src) {
    size_t total_msg_size = get_ipc_msg_size(0);
    struct shim_ipc_msg* msg = __alloca(total_msg_size);

ditto (alloca)


LibOS/shim/src/ipc/shim_ipc_worker.c, line 180 at r3 (raw file):

            }
            size += tmp_size;
        }

Can't you use your new read_exact() here?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 210 at r3 (raw file):

        assert(conn->vmid == msg->src);

        if (msg->code < ARRAY_SIZE(ipc_callbacks) && ipc_callbacks[msg->code]) {

Actually, don't we want to log an error "received unknown message type"? Then we can invert this if and simplify the code here.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 257 at r3 (raw file):

        }

        /* Reserve 2 slots for `interrupt_event` and ``. */

What is the second slot?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 316 at r3 (raw file):

            }
            log_debug(LOG_PREFIX "interrupt requested\n");
            /* XXX: Currently `interrupt_event` is used only for exit notification, no need to

Maybe we should just rename it to exit_notification_event and remove this comment? Or do you have some other future usages in mind for it?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r3 (raw file):

                goto out_die;
            }
            IDTYPE new_id = 0;

new_vmid?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 417 at r3 (raw file):

    }

    enable_locking();

This is system-wide, right? Because we're creating a second thread now?

Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: all files reviewed, 71 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

  • Ordering guarantees: In my code, I assume that if I send several messages to the same recipient (from the same thread), then the recipient processes them in the same order (in its IPC helper thread). Is that still valid?
  • Can I now connect/send a message to myself? I didn't see any custom handling of such connections, but also no assertion against it. (From what I understand, just opening a pipe to myself would mostly work, but could technically cause a deadlock if the IPC worker sends too many messages to itself. So that would mean I can already continue work on my branch, since that stuff is experimental anyway.)
  • It would be nice to have a short introduction to IPC, maybe in a comment on top of shim_ipc.h. Things like:
    • general architecture (pipes, worker thread)
    • maybe the above ordering guarantees
    • role of IPC leader?
    • example usage (sending a message / message with ack)?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 72 at r1 (raw file):

I think @pwmarcz has the same assumption on his FS-sync work: panic only when failed to explicitly communicate with the leader.

That's right, but I should mention that FS-sync flushes data on process exit (which requires IPC between the exiting process and the leader). But it sounds like that could happen before we send the child death notification?

Contributor

@yamahata yamahata left a comment


Reviewable status: all files reviewed, 71 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't know, I like this two-set approach. I only recommend to rename your two shim_ipc_connection structs to shim_ipc_incoming_conn and shim_ipc_outgoing_conn and add a small comment that the former is handled solely by the IPC worker thread (and no locks/refcounts are needed) and the latter -- by the normal app threads.

Supplement: I'm not against two-set approach.


Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 65 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, @mkow, @pwmarcz, and @yamahata)

a discussion (no related file):

If simultaneous connection occurs with one set approach, two processes will have two connections (contrary to the naive assumption).

@yamahata The problem is we sometimes want to detect a disconnect and trigger a callback - in such a case we could see a disconnect of one connection while the other still has some unread messages. This proved to be problematic e.g. in child death detection (in the old code).
We can solve that issue by having a set of pending connections and e.g. not accepting a new connection if we have one pending for the same recipient, but only in one of the processes (e.g. the one with lower id).
(I guess that's what you proposed)

rename your two shim_ipc_connection structs to shim_ipc_incoming_conn and shim_ipc_outgoing_conn

You think such naming will be better? They are a bit longer and both are actually internal interfaces (each specific to one file only).

add a small comment

Sure, will do


a discussion (no related file):

I assume that if I send several messages to the same recipient (from the same thread), then the recipient processes them in the same order (in its IPC helper thread). Is that still valid?

This is now valid, but was not before. In the old code there could be multiple connections between a pair of processes, and if you sent 2 messages each on a different connection, the order could be reversed. Now we have only one connection per pair and the order is always preserved.

Can I now connect/send message to myself? I didn't see any custom handling of such connections, but also no assertion against it. (From what I understand, just opening a pipe to myself would mostly work, but can technically cause a deadlock on IPC worker sending too many messages to itself. So that would mean I can already continue work on my branch, since that stuff is experimental anyway).

In theory yes, but I did not consider that when rewriting the message_with_ack thingies, so there might be some deadlocks there.
Why can't you do it like the rest of the code:

if (Im_ipc_leader) {
    return some_ipc_callback();
}

return send_ipc_msg_to_leader(msg_causing_some_ipc_callback);

It would be nice to have a short introduction to IPC, maybe in a comment on top of shim_ipc.h.

Yeah, this PR definitely lacks a general description, will try to add something.

general architecture (pipes, worker thread)

Worker thread - yes, that should be documented; pipes - imo that's an implementation detail that the rest of the code does not need to be aware of.

maybe the above ordering guarantees

+1

role of IPC leader

Yes, this needs to be summarized, but in this PR I did not change anything related to that...

example usage (sending a message / message with ack)

I think this still needs to be changed/reworked/cleaned up. I did not touch that part in this PR (except for where I had to).



LibOS/shim/include/shim_handle.h, line 356 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Seems like this was never used? The actual tracking of ownership seems to be encapsulated in the ipc_ranges metadata.

Yes, this was never used


LibOS/shim/include/shim_ipc.h, line 32 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Could we have brief comments on these types? I'd prefer to have them on all types, but you can only add comments at least on the two new ones.

I did not investigate the semantics of all types here, so I would rather not describe all of them atm. Besides, I think we should rewrite the IPC data layer (messages format etc) too.
Commented on the new ones.


LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sometimes you use id, and sometimes vmid. I would prefer to have vmid everywhere, otherwise it's confusing.

I had a plan to rename vmid, but not in this PR, and I'm not yet sure about the new name.
I can temporarily make this parent_vmid, to be consistent.
Proposed new names:

  • process_id
  • graphene_id
  • gprocess_id
  • gpid

This would also include introducing new type (e.g. process_id_t). What do you think?
Update: maybe it should be done in this PR, as there are a lot of inconsistencies.


LibOS/shim/include/shim_ipc.h, line 158 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Stupid question: why did we previously need port if we could already find the port via dest? Is it because there could be several open ports on one dest, so there was a need to have an explicit concrete port?

You are asking the wrong person ;)
I suspect your explanation is correct - we could have multiple ports per one dest. Besides, the old code did not offer a way to send a message to a dest, only to a port (so it created a new port if it did not have a reference to one).


LibOS/shim/include/shim_ipc.h, line 374 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Could you also supply a quick PR to fix the name of the async helper thread? Otherwise we'll have inconsistency in names, which will be very confusing to people.

I would like to rewrite the async helper so that it's cleaner, e.g. wait for the thread termination, and possibly move some code to an abstraction layer shared with the IPC worker thread (if there is anything to share).
But that's probably not a high priority task, so I'm not sure when it will happen. Do you want me to do just the rename for now?


LibOS/shim/src/shim_checkpoint.c, line 504 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Are we sure there cannot be a legit vmid of 0 (or that it is always the vmid of the leader)? Otherwise this check will be wrong.

Yes, all ids are >0.


LibOS/shim/src/shim_checkpoint.c, line 576 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't think it deserves a macro, simply 0 and rename to e.g. dummy_c.

Yeah, this was just to have an assert, I don't think it's needed.
Removed.


LibOS/shim/src/shim_checkpoint.c, line 585 at r3 (raw file):

But isn't SIGPWR is our way of saying "Well, looks like the host OS for some reason killed the process, which should not have happened". It's probably what you're saying in your comment, but Let's pretend the process creation was successful... is confusing.

It is, and that's the case here too: this write cannot fail unless the host OS messes with the other thread.

Child might have already been marked dead by the host OS (e.g., killed by OOM killer).

That's not what I wanted to say here. By marked dead I mean marked dead inside Graphene (the IPC worker thread already noticed the disconnect), so a failure of this function (mark_child_...) might be legitimate here.


LibOS/shim/src/shim_checkpoint.c, line 623 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

So it's not the parent who starts the port now, but the child (who is expected to connect on its startup)?

What happens if the child never connects? Do we have some kind of memory leak?

Yes, the child connects at startup, before sending its vmid to us.
If the child does not connect then either:

  1. we have a bug somewhere - well, UB, with buggy code anything can happen...
  2. the host OS messed something up with the connection or the child process. In such a case the child will die (either because the host killed it or it exited itself due to a connection failure), the vmid read above will fail, and we will return from this function (without adding this child at all)
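The handshake described above can be sketched as follows. This is a hedged illustration only: `vmid_t`, `child_announce`, and `parent_accept_child` are hypothetical names, not the actual Graphene API, and the real code sends the vmid over a PAL pipe rather than a plain fd.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical sketch of the startup handshake: the child connects first,
 * then announces its vmid; the parent reads the vmid and only then adds the
 * child to its children list. */
typedef uint32_t vmid_t;

/* Child side: the first bytes on a fresh connection are our vmid. */
static int child_announce(int fd, vmid_t my_vmid) {
    return write(fd, &my_vmid, sizeof(my_vmid)) == (ssize_t)sizeof(my_vmid) ? 0 : -1;
}

/* Parent side: if this read fails, the child died (or the host interfered),
 * so we simply never add it to the children list. */
static int parent_accept_child(int fd, vmid_t* out_vmid) {
    return read(fd, out_vmid, sizeof(*out_vmid)) == (ssize_t)sizeof(*out_vmid) ? 0 : -1;
}
```

If `parent_accept_child()` fails, the parent just returns without registering the child, which matches case 2 above.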

LibOS/shim/src/shim_init.c, line 443 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
     * before we start handling async IPC messages from this child (it will connect to us after we
     * acknowledge the creation). */

You said this in another file (where parent creates the child process). But here you first connect to the parent process, and only then continue with acknowledgements. Is this the correct order?

Same question for the leader (because the parent and the leader can be the same process).

The comment in shim_checkpoint.c was outdated, fixed.


LibOS/shim/src/shim_init.c, line 448 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why such a weird name? Why not simply connect_to_leader()? Definitely deserves a comment...

Because we do not want to connect to the leader (we do it only as a side effect, but it's not needed) - we want the leader to connect to us.
But see the comment near the leader disconnect callback in the IPC worker.


LibOS/shim/src/shim_init.c, line 463 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

dummy_c

Done.


LibOS/shim/src/shim_init.c, line 469 at r3 (raw file):

Previously, yamahata wrote…
42

symbolic value please.

Removed.


LibOS/shim/src/ipc/shim_ipc.c, line 6 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

IDK, I like these quick introduction comments. Can you keep it but maybe rephrase with your new functions?

Imo they just state the obvious. While it would be nice to have some top-level comment/description, it should actually be meaningful...
Added one sentence, let me know what kind of information you would like to have here.


LibOS/shim/src/ipc/shim_ipc.c, line 24 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you add a comment of what this type of handle this is? Is it a bi-directional pipe (but used only in one primary direction)? Is it blocking? Does it preserve order (TCP semantics)?

Why do you want to have such comments? These are all implementation details - no functions other than connect and send should be aware of them. I can add such a description in one/both of them (e.g. near DkStreamOpen), but there it should follow from the code...


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Do we really care about the best data struct in this case? I assume we will never exceed e.g. 50 processes limit in real-world apps.

Forgot to add a Reviewable comment here.
That was something to consider. If we expect a very low number of connections (which I think we do), maybe a simple list would be better? Linear search, but if N=4 that might be faster than a tree/hashmap. But otoh do we care?


LibOS/shim/src/ipc/shim_ipc.c, line 43 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Is this to prevent starving and to preserve ordering? Quite cool.

UPDATE: Actually, looks like no. You just need an exact match, according to the code below. This stuff with seq looks unnecessarily complicated to me...

This is just a simple lexicographic ordering on a pair of integers.
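For reference, such an ordering on a (dest, seq) pair is a few lines. The struct and field names below are illustrative, not the exact fields used in shim_ipc.c:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a comparator implementing a lexicographic ordering on a pair of
 * integers, as an AVL tree key for pending messages; names are hypothetical. */
struct msg_key {
    uint32_t dest; /* vmid of the destination process */
    uint64_t seq;  /* message sequence number */
};

static int msg_key_cmp(const struct msg_key* a, const struct msg_key* b) {
    if (a->dest != b->dest)
        return a->dest < b->dest ? -1 : 1;
    if (a->seq != b->seq)
        return a->seq < b->seq ? -1 : 1;
    return 0;
}
```

Exact-match lookup then falls out of the ordering for free: a key compares equal iff both fields match.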


LibOS/shim/src/ipc/shim_ipc.c, line 79 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What if this connection is still in the AVL tree? This is impossible? Can you add an assert?

If the connection is still in the AVL tree, then ref_count >= 1 (because the reference in the tree is counted).
There is no is_in_tree function (and that's intentional, so that nobody implements some weird flows where they explicitly check if something is in the tree - it should always follow from the code flow).
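The invariant above ("tree membership holds one reference") can be sketched like this; `conn_get`/`conn_put` and the fields are hypothetical names, not the actual Graphene refcount API:

```c
#include <assert.h>

/* Sketch: the tree itself holds one counted reference, so ref_count == 0
 * implies the object was already removed from the tree and can be freed.
 * All names here are illustrative. */
struct conn {
    int ref_count;
};

static void conn_get(struct conn* c) {
    c->ref_count++;
}

/* Returns 1 when the last reference is dropped and c may be destroyed. */
static int conn_put(struct conn* c) {
    assert(c->ref_count > 0);
    return --c->ref_count == 0;
}
```

Inserting into the tree would be paired with `conn_get()`, removal with `conn_put()`, so reaching zero while still in the tree is impossible by construction.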


LibOS/shim/src/ipc/shim_ipc.c, line 96 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Wouldn't it be easier to follow this code if you first handle the case of already-existing conn (then you simply increase its refcount, unlock and return), and then you deal with the "create connection" case.

Is that a question or a statement? :)
I don't think it would be cleaner. In the current version the common case (of conn being non-NULL) is not repeated. Otoh I don't mind either version, so let me know which one you prefer (the one you proposed, I guess?).


LibOS/shim/src/ipc/shim_ipc.c, line 112 at r3 (raw file):
Yes, it's a blocking operation (at least I think so). The idea is that connecting to another process is a very rare operation, and 2 happening at the same time is even more unlikely.

the use of spinlock is suspicious

These are not necessarily spinlocks. Anyway, we cannot unlock here - this is to make sure we only have one connection per pair of processes.

I think we need two types of general purpose locks: for small and for big/blocking critical sections. For the small case we could use spinlocks and for the big one - mutexes. In the uncontested case both should be equally fast - a simple memory operation. In the contested case the first would ensure that the thread does not exit the enclave (since the section is small, it's more worthwhile to wait than to pay the cost of an enclave exit). The latter would make sure we don't waste CPU - after all, the critical section itself exits the enclave, so we might as well do so while waiting.
But that's definitely not for this PR.
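A minimal sketch of the "small critical section" half of that idea, using C11 atomics. This is purely illustrative; Graphene's actual lock primitives are different:

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal test-and-set spinlock: the uncontested acquire/release is a single
 * atomic memory operation, and a contested waiter never leaves the enclave -
 * it just burns CPU, which is acceptable only for very short sections. */
typedef struct {
    atomic_int locked;
} spinlock_t;

static void spin_lock(spinlock_t* l) {
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ; /* busy-wait */
}

static void spin_unlock(spinlock_t* l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The mutex half would instead park the waiting thread on a host futex, trading an enclave exit for not spinning, which is the right call when the critical section itself blocks.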


LibOS/shim/src/ipc/shim_ipc.c, line 120 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

So we don't wait for an acknowledgement? We just send our VMID and the connection is considered established? Is this enough?

Yes, why not? I do not understand the question...


LibOS/shim/src/ipc/shim_ipc.c, line 123 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

So it looks like ref_count contains both AVL-tree references and "Graphene subsystems" references, right?

?
It contains all references, that's what it's for...


LibOS/shim/src/ipc/shim_ipc.c, line 159 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This looks like you simply want to add a dest VMID to the list of opened ports in the AVL tree (if not yet added). And then you don't really need the connection, so you put its reference. Could you add a brief comment on this?

I've commented this function in global header. Do you want me to add a comment here as well?


LibOS/shim/src/ipc/shim_ipc.c, line 220 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What if no node is found? Then your callback will work on callback(NULL, data). Why not move it under if (node)?

Because the callback might need to run anyway. It's up to the callback function to decide what to do if the message is not found (the msg == NULL case).


LibOS/shim/src/ipc/shim_ipc.c, line 220 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Also, why is the callback called under the global lock? From what I understand, the avl_tree_delete() logic below will kick in only after the callback is executed and the IPC worker thread kicks the sleeping thread. So it should be safe to call the callback without the lock.

UPDATE: Wow, the awakening of the thread seems to happen inside the callback. Maybe my logic is wrong then.

We cannot use this message (msg) after giving up the lock - it might get taken off the tree and destroyed.


LibOS/shim/src/ipc/shim_ipc.c, line 224 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can we change the name seq -> out_seq? Also, why the heck would we need to know such an internal detail?

I was wondering that myself. The ipc sysv code seems to do something with it - I did not want to dive into it, so I've left it.


LibOS/shim/src/ipc/shim_ipc.c, line 233 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

So that's how it works... There is a global counter that is not per-connection. Well, this is uglier than I thought, but at least it makes sense now.

Yeah, but I did not want to touch the data layer (at least I call it that way) in this PR.


LibOS/shim/src/ipc/shim_ipc.c, line 261 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

It's interesting that we don't track ownership of this msg object. Even though it is used by two threads: the sleeping thread (the real owner of this msg) and the IPC worker (during callback). Well, I guess it makes sense since this is a very short-lived object, and it is impossible to destroy it before the IPC worker is done with the callback.

But maybe this deserves a comment?

The trick here is that we only use that object under g_msg_with_ack_tree_lock, so it cannot disappear inside the callback. The old code was bugged wrt this - it did not ensure that the object is not destroyed while it's being used.
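That pattern can be sketched with a pthread mutex. The names below (`g_msg_with_ack_tree_lock` aside, which appears in the discussion) are hypothetical, and the "tree" is reduced to a single pending slot for brevity:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Sketch: the message object is only ever touched while holding the tree
 * lock, so the response callback cannot race with the owner destroying it.
 * A single pending slot stands in for the real AVL tree. */
static pthread_mutex_t g_msg_with_ack_tree_lock = PTHREAD_MUTEX_INITIALIZER;

struct msg_with_ack {
    unsigned long seq;
    bool acked;
};

static struct msg_with_ack* g_single_pending;

static void on_response(unsigned long seq) {
    pthread_mutex_lock(&g_msg_with_ack_tree_lock);
    struct msg_with_ack* msg =
        (g_single_pending && g_single_pending->seq == seq) ? g_single_pending : NULL;
    if (msg)
        msg->acked = true; /* the real code would wake the sleeping thread here */
    pthread_mutex_unlock(&g_msg_with_ack_tree_lock);
}
```

The sleeping thread would likewise only remove and free its message under the same lock, which is exactly why the callback never sees a dangling pointer.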


LibOS/shim/src/ipc/shim_ipc.c, line 280 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We should start get ridding of __alloca(). Do we consider using #include <alloca.h> ... alloca(); fine? @mkow

As mentioned already - I did not touch the "data layer" in this PR.
Generally this should be just a normal object, it has a static size, it's just hidden behind get_ipc_msg_with_ack_size. I didn't want to rework the current interface in this PR, so I've left it.
Imo since the size is known and constant, this alloca is fine.


LibOS/shim/src/ipc/shim_ipc.c, line 283 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

sent -> Sending

Done.


LibOS/shim/src/ipc/shim_ipc.c, line 288 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

If data is unused, why is it here?

Because it's a callback and needs to have an appropriate signature.


LibOS/shim/src/ipc/shim_ipc.c, line 291 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

All this looks very brittle. Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.

If req_msg is NULL, then there is no hanging thread (since it removed the message from the list).
req_msg->thread != NULL should be an assert indeed, but the old code checked it explicitly and I'm afraid that something in sysv will break if I change it...

Update: ok, I already made sure this case cannot happen in send_ipc_msg_with_ack, so it's safe to change it into an assert.


LibOS/shim/src/ipc/shim_ipc.c, line 295 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But this callback is not that dummy... It wakes up the sleeping thread. Maybe a better name for this callback then?

Every callback should wake a thread, that's what these callbacks are for.
Uh, actually we have two kinds of callbacks and this is all messed up (because they are named the same way).

Any idea for a better name?


LibOS/shim/src/ipc/shim_ipc_pid.c, line 233 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.

ditto and done.


LibOS/shim/src/ipc/shim_ipc_pid.c, line 410 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Shouldn't we have asserts here instead of an if? Otherwise we'll end up with a hanging thread, if something goes wrong in these messages-with-ack.

ditto and done.


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 573 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This switch statement doesn't make sense now. Just refactor to normal if.

Done.


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 847 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This file got slightly less sh*tty now!

This file needs to be deleted, but that's for another PR.


LibOS/shim/src/ipc/shim_ipc_ranges.c, line 974 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I'm just pretending I understand this...

Do not pretend, nobody understood this. I just saw a (!p->port) check which is not needed now and removed it. I might have broken this function, who knows.


LibOS/shim/src/ipc/shim_ipc_sysv.c, line 263 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not connect into one IF.

Done.


LibOS/shim/src/ipc/shim_ipc_sysv.c, line 642 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

And this file also got slightly less sh*t. We still need to re-work all this craziness with messages-with-ack.

Yup


LibOS/shim/src/ipc/shim_ipc_worker.c, line 72 at r1 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

I think @pwmarcz has the same assumption on his FS-sync work: panic only when failed to explicitly communicate with the leader.

That's right, but I should mention that FS-sync flushes data on process exit (which requires IPC between the exiting process and the leader). But it sounds like that could happen before we send the child death notification?

Yes, child death notification should be the last message.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 83 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yeah, these messages-with-ack are nasty currently. Not for this PR though.

Yeah


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yes, I think you should just die here. You already die on DkStreamWaitForClient() failure, so it makes sense to die here as well.

On the other hand, this could be just a remote client dying. Do we want to kill the current process in such cases? We do not do that in other places.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 348 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I think just removing the connection (dropping the client) with a loud warning message.

Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What do you mean by this comment? I don't see anything in the IPC worker thread that uses a lot of stack. So should be fine.

Yeah, but the old code said something about the PAL-provided stack being small and allocated its own, bigger one. I just wanted to make sure I did not miss something.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 26 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

In shim_ipc.c, you have another definition of the IPC connection -- that one uses the AVL tree.

  1. Why do you use the LIST here? Why not AVL?
  2. Can you rename to something like shim_ipc_incoming_connection? Well, that's too long so maybe shim_ipc_incoming_conn?

If you will rename this, then it would also make sense to rename the one in shim_ipc.c to something like shim_ipc_outgoing_conn.

UPDATE: Ok, even if we move away from LIST, we shouldn't move to AVL but instead move to a global array (see below, this will simplify collection of handles to wait on).

These are in principle different types (incoming vs outgoing connections), so they can differ and their usage differs. Here we always iterate over the whole collection, we never search for a specific one, hence a list is better than a tree.

Let's keep the discussion about rename in one place (top-level comment).


LibOS/shim/src/ipc/shim_ipc_worker.c, line 29 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You don't have a refcount here because you know that only the IPC worker thread is the sole owner of these all, right?

Yes, only the IPC worker thread can access these.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 38 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you add that this is a sync variable: the Graphene runtime resets it when the IPC worker thread is completely exited, and the terminate_ipc_worker() caller thread is waiting on this variable.

Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 42 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Again this weird name, please add some comment on this connect_back.

Done (near definition).


LibOS/shim/src/ipc/shim_ipc_worker.c, line 113 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (alloca)

ditto


LibOS/shim/src/ipc/shim_ipc_worker.c, line 145 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (alloca)

ditto


LibOS/shim/src/ipc/shim_ipc_worker.c, line 180 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can't you use your new read_exact() here?

No, we do not read an exact number of bytes but anything in the range [MINIMAL_SIZE, MAX_SIZE]
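In other words, the receive path only guarantees a lower bound per read. A hedged sketch of such a helper (the name `read_at_least` and the signature are made up for illustration; the real code reads from a PAL handle, not an fd):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read at least `min_size` bytes (e.g. the fixed message header) but accept
 * up to `max_size` in one go. Returns the number of bytes read or a negative
 * errno value. Hypothetical helper, not the actual Graphene function. */
static ssize_t read_at_least(int fd, void* buf, size_t min_size, size_t max_size) {
    size_t got = 0;
    while (got < min_size) {
        ssize_t ret = read(fd, (char*)buf + got, max_size - got);
        if (ret < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (ret == 0)
            return -EPIPE; /* remote side disconnected */
        got += (size_t)ret;
    }
    return (ssize_t)got;
}
```

Anything received past the header simply stays in the buffer for the message parser, which is why read_exact() does not fit here.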


LibOS/shim/src/ipc/shim_ipc_worker.c, line 210 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, don't we want to log an error received unknown message type? Then we can revert this IF and simplify the code here.

Added a log message.
How can we revert this if? It's inside a do {} while loop and has code after it...


LibOS/shim/src/ipc/shim_ipc_worker.c, line 257 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What is the second slot?

Ah, was supposed to fill it later and forgot.
Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 316 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Maybe we just rename it to exit_notification_event and remove this comment? Or do you have in mind some other future usages for it?

This highly depends on whether we stay with the current design, change something, or even rework it to use one set of ipc connections.
For now I don't see any other uses, so I'll rename it.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 417 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This is system-wide, right? Because we're creating a second thread now?

Yes. I wanted to get rid of it some time ago, but it turned out to be hard (early in the init code some functions are run that use not-yet-initialized locks, and this all works only thanks to locking being disabled).

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r4.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @pwmarcz, and @yamahata)

a discussion (no related file):

You think such naming will be better? They are a bit longer and both are actually internal interfaces (each specific to one file only).

Ok, fair enough. It was confusing to see two identical names in this PR, used for two different things. But I agree that this is internal details, so unblocking.



LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

I had a plan to rename vmid, but not in this PR and I'm not yet sure about the new nam.
I can temporarily make this parent_vmid, to be consistent.
Proposed new names:

  • process_id
  • graphene_id
  • gprocess_id
  • gpid

This would also include introducing new type (e.g. process_id_t). What do you think?
Update: maybe it should be done in this PR, as there are a lot of inconsistencies.

I like your new names. And a new type. But then let's do it in a follow-up PR (this will be a trivial change that will be quickly merged). For now, we can live with these inconsistencies.


LibOS/shim/include/shim_ipc.h, line 336 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What is this seq? Since we have only one connection per pair of processes, and we use TCP/IP (and also TLS in case of SGX enclaves), why do we care about sequence numbers (the way I understand what seq is)?

This was written before I looked at the code that uses it. Since this is out of scope of this PR, I'm unblocking this.


LibOS/shim/include/shim_ipc.h, line 374 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

I would like to rewrite the async helper, so that it's cleaner, so wait for the thread termination and possibly move some code to some abstraction layer to share with ipc worker thread (if there is anything to be shared).
But that probably not a high priority task, so I'm not sure when it will happen. Do you want me to do just the rename for now?

Yes, you can just do a rename now. Actually, in a separate PR (no need to bundle in this PR). This will be a trivial change (just renames), so that PR will be approved quickly.


LibOS/shim/src/shim_checkpoint.c, line 571 at r4 (raw file):

    /* Child creation was successful, now we add it to the children list. Child process should have
     * already conencted to us, but is waiting for an acknowledgement, so it will not send any IPC

connected


LibOS/shim/src/ipc/shim_ipc.c, line 6 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Imo they just state the obvious. While it would be nice to have some top-level comment/description, it should actually be meaningful...
Added one sentence, let me know what kind of information you would like to have here.

OK, this comment is very good.


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Forgot to add a Reviewable comment here.
That was something to consider. If we expect a very low number of connections (which I think we do), maybe a simple list would be better? Linear search, but if N=4 that might be faster than a tree/hashmap. But otoh do we care?

I mean, I'm pretty sure we don't care. Anything would do here. So I don't see a point in this TODO -- this is extremely low-priority. I suggest just removing it.


LibOS/shim/src/ipc/shim_ipc.c, line 79 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

If the connection is still in the AVL tree, then ref_count >= 1 (because the reference in the tree is counted).
There is no is_in_tree function (and that's intentional, so that nobody implements some weird flows where they explicitly check if something is in the tree - it should always follow from the code flow).

OK, thanks, I forgot (or didn't know) this about your AVL tree implementation.


LibOS/shim/src/ipc/shim_ipc.c, line 96 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Is that a question or a statement? :)
I don't think it would be cleaner. In the current version the common case (of conn being non-NULL) is not repeated. Otoh I don't mind either version, so let me know which one you prefer (the one you proposed, I guess?).

OK, just a matter of taste. No need to change anything.


LibOS/shim/src/ipc/shim_ipc.c, line 112 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Yes, it's a blocking operation (at least I think so). The idea is that connecting to another process is a very rare operation, and 2 happening at the same time is even more unlikely.

the use of spinlock is suspicious

These are not necessarily spinlocks. Anyway, we cannot unlock here - this is to make sure we only have one connection per pair of processes.

I think we need two types of general purpose locks: for small and for big/blocking critical sections. For the small case we could use spinlocks and for the big one - mutexes. In the uncontested case both should be equally fast - a simple memory operation. In the contested case the first would ensure that the thread does not exit the enclave (since the section is small, it's more worthwhile to wait than to pay the cost of an enclave exit). The latter would make sure we don't waste CPU - after all, the critical section itself exits the enclave, so we might as well do so while waiting.
But that's definitely not for this PR.

Yes, we've been wanting fast vs slow locks for a while now. Not for this PR.


LibOS/shim/src/ipc/shim_ipc.c, line 120 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Yes, why not? I do not understand the question...

I got confused because during process creation, there is an explicit ack (from parent to child) as the last step. But true, not needed in a general case of establishing an IPC connection between two processes.


LibOS/shim/src/ipc/shim_ipc.c, line 159 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

I've commented this function in global header. Do you want me to add a comment here as well?

I think you forgot to push your changes with comments. I don't see anything there.


LibOS/shim/src/ipc/shim_ipc.c, line 295 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Every callback should wake a thread, that's what these callbacks are for.
Uh, actually we have two kinds of callbacks and this is all messed up (because they are named the same way).

Any idea for the better name?

It's ok, I don't want to complicate this PR with renames as it seems to be a major undertaking (for another PR).


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

On the other hand this could be just a remote client dying. Do we want to kill current process in such cases? We do not do that in other places.

You mean the remote client wanted to connect, sent its vmid, proceeded with whatever it was executing, and then terminated (itself or host OS killed it)? Well, true, this can happen before we do read_exact() here. Let's just forget about this client then.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Yeah, but the old code said something about the PAL-provided stack being small and allocated its own, bigger one. I just wanted to make sure I did not miss something.

I don't see anything special, I think we can just remove this comment.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 26 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

These are in principle different types (incoming vs outgoing connections), so they can differ and their usage differs. Here we always iterate over the whole collection and never search for a specific element, hence a list is better than a tree.

Let's keep the discussion about rename in one place (top-level comment).

OK, I'm resolving, see my top-level comment.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 147 at r4 (raw file):

}

/* Another thread requested that we connect to it. */

Why do you say thread and not process?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 361 at r4 (raw file):

            }
            /* If there was something else other than error reported, let the loop spin at least one
             * more time - in case there is a messages left to be read. */

there are messages

Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

This is now valid, but was not before. In the old code there could be multiple connections between a pair of processes and if you sent 2 messages each on a different connection, the order could be reversed. Now we have only one connection per pair and the order is always preserved.

Ouch. I'm glad that code is gone, then.

In theory yes, but I did not consider that when rewriting the message_with_ack thingies, so there might be some deadlocks there.

I'll try it out then. I'm not using message_with_ack.

Why can't you do, like the rest of the code: (...)

Because it behaves differently - unlike sending a message, some_ipc_callback() returns after processing the message, not before. And if some_ipc_callback() is supposed to, in turn, send another message back, I get 2 layers of nested calls. To give a specific example from my code: here the flow can possibly go like so (assuming client and server are both on the IPC leader):

(client) send_request_upgrade()
  (server) do_request_upgrade()
    (server) send_confirm_upgrade()
      (client) do_confirm_upgrade()

The client code calling send_request_upgrade() has to assume that the response will sometimes be handled later, and sometimes before send_request_upgrade() returns, with implications for locking, state updates etc., and with no way to determine which before the call (because sometimes send_confirm_upgrade() will be triggered later).

Furthermore, the server code calling send_confirm_downgrade will sometimes be blocked on the client handling the code. And I don't think more nesting is possible, the client shouldn't ever try to call server back in such a scenario, but it's still an annoying limitation to think about, if it ever becomes necessary.

If we make sure that "send a message from/to leader" simply works in all cases, the above problems do not appear and we have a relatively simple client-server protocol with "request / confirm" messages. So, I think the potential performance gain in one "happy case" is not worth complicating it.


And actually, now that I think of it, this might be solved by expanding "message with ack" to "message with arbitrary response". The server might then immediately respond, either with "request granted, here's the data" or with "I will call you later". But still it would be more of a performance improvement at the cost of complicating the code, so I don't want to be forced to implement it this way from the start.


Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @pwmarcz, and @yamahata)

a discussion (no related file):
To be honest, I do not understand that example and why it cannot be done in the calling thread. Are you saying that the thread needs to wait for do_confirm_upgrade? That doesn't sound like a problem, if these are all called in the same thread. Or that there is no internal locking inside the server state (e.g. because it assumes that it's run in one thread)? That might be an issue indeed...

the server code calling send_confirm_downgrade will sometimes be blocked on the client handling the code.

If you call that function directly from the thread it should block as well. Is that a problem, wouldn't it block anyway while waiting for response?

this might be solved by expanding "message with ack" to "message with arbitrary response".

Afaik "messages with ack" are exactly this. You need to send a response manually with the correct ack number and the response could be any message (i.e. it can have a body).

want to be forced to implement it this way from the start.

Sure don't, just saying for the future.


Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @pwmarcz, and @yamahata)

a discussion (no related file):

Previously, boryspoplawski (Borys Popławski) wrote…

To be honest, I do not understand that example and why it cannot be done in the calling thread. Are you saying that the thread needs to wait for do_confirm_upgrade? That doesn't sound like a problem, if these are all called in the same thread. Or that the there is no internal locking inside server states (e.g. because it assumes that it's run in one thread)? That might be an issue indeed...

the server code calling send_confirm_downgrade will sometimes be blocked on the client handling the code.

If you call that function directly from the thread it should block as well. Is that a problem, wouldn't it block anyway while waiting for response?

this might be solved by expanding "message with ack" to "message with arbitrary response".

Afaik "messages with ack" are exactly this. You need to send a response manually with the correct ack number and the response could be any message (i.e. it can have a body).

want to be forced to implement it this way from the start.

Sure don't, just saying for the future.

The main problem with making send_ipc_message handle sending to the same process is that:
a) connection needs to be established - one thread cannot both connect and accept - so if the first message comes from IPC worker thread we get a deadlock,
b) sending the message might briefly block, e.g. when the underlying kernel buffer is full. If we block in the IPC worker thread then it won't ever empty that buffer - deadlock.
The two above are a problem, because IPC worker thread sometimes needs to send a response to a message - that's why I mentioned "messages with ack" before, but in theory any callback could send any message (not necessarily "msg_with_ack"). And judging from your example that's what you want - send messages as response to other messages. I don't think this can be handled in one thread (ipc worker).


Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

Are you saying that the thread needs to wait for do_confirm_upgrade? That doesn't sound like a problem, if these are all called in the same thread.

The client thread does need to wait for do_confirm_upgrade. I guess the situation I described does work, as long as I release my lock before messaging the server (so I'll probably have to copy the data I'm sending). And the responses will still be handled in correct order, thanks to the server having a lock.

So, that covers situations like client -> server -> client. The other direction (server -> client -> server) is more tricky, because there is also a server lock, and the response for client does potentially trigger further messages. I could release the server lock before sending the outgoing messages, but then I can't guarantee the order of outgoing messages from multiple server callbacks...

Overall, it sounds like a lot of special cases, and again, all of this should not be a problem if there is a working, reliable, asynchronous "send message" primitive, whether it internally uses an inter-process pipe, in-process queue, or whatever else. I could just take a lock, send my messages, release lock, and have a much easier time reasoning about what happens in my critical section.

If you call that function directly from the thread it should block as well. Is that a problem, wouldn't it block anyway while waiting for response?

send_ipc_message will write the message to the pipe, but not wait for it to be handled, nor for a response. And that's what I want: the server does not ever need to wait for a response. For server code that would generally be a bad idea, since it needs to handle other messages in the meantime, possibly related to the same resource.

Afaik "messages with ack" are exactly this. You need to send a response manually with the correct ack number and the response could be any message (i.e. it can have a body).

Yes, I think that's how it works right now. But it would be nice to have a "send message and wait for a response" as an available operation. Like an RPC call.

And judging from your example that's what you want - send messages as response to other messages. I don't think this can be handled in one thread (ipc worker).

This is already a problem, right? Since the IPC worker sends acks, and there could be a deadlock between two processes sending acks to each other. Maybe that means outgoing messages should be queued, so that sending doesn't ever block on a full buffer?


Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @pwmarcz, and @yamahata)

a discussion (no related file):
Since you have locking in server, I still do not understand. The client always needs to wait for a response from server, right?
Generally speaking: why something (whatever that is, that the callbacks for client messages do) can be done inside IPC worker, but not inside the calling thread itself? If you have proper locking schemes (I believe you said you do) it should be no difference, is there?
From what I understood this is the hard case:

A               S               B
|  up_req ->    |
                |
                |  down_req ->  |
                                |
                                |
                | <- down_conf  |
                |
                |
                |
|  <- up_conf   |

If thread A would be the same as S (server) then it could do the first part up_req + down_req directly and then sleep awaiting for the 2nd part to come asynchronously (handled by IPC worker this time).

Yes, I think that's how it works right now. But it would be nice to have a "send message and wait for a response" as an available operation. Like an RPC call.

But that's what send_msg_with_ack does!

This is already a problem, right? Since the IPC worker sends acks, and there could be a deadlock between two processes sending acks to each other. Maybe that means outgoing messages should be queued, so that sending doesn't ever block on a full buffer?

No, because IPC worker thread never awaits for ack messages, it just handles them. If it needs to send a message it does send a normal one (without confirmation).


Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

If thread A would be the same as S (server) then it could do the first part up_req + down_req directly and then sleep awaiting for the 2nd part to come asynchronously (handled by IPC worker this time).

That sounds fine. Generally, I'm not worried about cases where a thread needs to sleep and wait. I'm worried about the ones where the other side messages me synchronously. For example (sorry, I'm sure it can be made simpler, but I hope it makes the problem clear):

A               S              B            C
.               | <-------------- up_req    |
.               | down_req ->  |            .
|  <- down_req  |              |            .
|  down_conf -> |              |            .
.               | <- down_conf |            . 
.               | up_conf --------------->  |    

Assume A and S are the same process. Now, what I would like to have, is:

  1. S receives up_req from C, and then sends down_req to A and B
  2. At a later time, S receives down_conf from A
  3. At a later time, S receives down_conf from B, and then sends up_conf to C

All of the above 3 steps are separate invocations of a server handler, from the IPC worker main loop. The server functions as a nice state machine: lock, handle the incoming message, send resulting outgoing messages, unlock.

However, if "sending" down_req to A is actually a function call, and A "responds" immediately to S by issuing another function call, then I have the server handler running, nested, inside the same server handler. Which breaks my design: I cannot just take a lock in the server handler, and assume that my server handler invocations happen one after another.

And probably I can special-case that, but it looks like my code has to handle both "what if I'm calling the other side locally" and "what if I am being called by the other side locally"? This really should be the responsibility of the transport layer, not the code built on top of it.

But that's what send_msg_with_asck do!

Yes, send_msg_with_ack is like an RPC call that returns a single int. I would need it to return arbitrary data. I believe this would be useful for the sync engine, although I'm not sure if it would fully solve the above problem of local messages. So maybe it's not so important right now.

No, because IPC worker thread never awaits for ack messages, it just handles them. If it needs to send a message it does send a normal one (without confirmation).

I'm referring to this part:

        if (msg->code < ARRAY_SIZE(ipc_callbacks) && ipc_callbacks[msg->code]) {
            int ret = ipc_callbacks[msg->code](msg, conn->vmid);
            if ((ret < 0 || ret == RESPONSE_CALLBACK) && msg->seq) {
                ret = send_ipc_response(conn->vmid, ret, msg->seq);

AFAICT, this runs inside the IPC worker main loop. And send_ipc_response writes to a pipe. And the other end of that pipe is handled by another IPC worker. So if two IPC workers send responses to each other simultaneously, it's possible (although unlikely) that both these writes (to their respective pipes) could be blocked on full kernel buffers, causing a deadlock. Is that reasoning correct, or am I making a wrong assumption somewhere?


Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, @mkow, @pwmarcz, and @yamahata)


LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I like your new names. And a new type. But then let's do it in a follow-up PR (this will be a trivial change that will be quickly merged). For now, we can live with these inconsistencies.

Well, I could make it consistent here, that shouldn't be hard...
vmid everywhere for now?


LibOS/shim/include/shim_ipc.h, line 374 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yes, you can just do a rename now. Actually, in a separate PR (no need to bundle in this PR). This will be a trivial change (just renames), so that PR will be approved quickly.

I'll stash a commit onto this PR, because otherwise I'll never do it :)
Update: I'll just prepare another PR to be merged after this one - to make it easier to review.
Update: #2318


LibOS/shim/src/shim_checkpoint.c, line 571 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

connected

Done.


LibOS/shim/src/shim_checkpoint.c, line 577 at r4 (raw file):

    char dummy_c = 0;
    ret = write_exact(pal_process, &dummy_c, sizeof(dummy_c));

Ughh, this is still not enough. Imagine this scenario: everything goes OK, but the malicious host OS kills the child just before we do add_child_process above and spoofs this write (i.e. reports it succeeded without sending anything). Since the process was killed before add_child_process, the disconnect callback did not see this child - we added it only later. Now we end up in a situation where the child is dead, but we are not aware of it and still have it marked as alive.
Is such a scenario problematic, or is it just a simple DoS which we could ignore? I don't know anymore...


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I mean, I'm pretty sure we don't care. Anything would do here. So I don't see a point in this TODO -- this is extremely low-priority. I suggest to just remove this.

Most of the TODOs in this PR should be resolved before merging, e.g. this one. Since we don't really care, I see 2 options:

  1. leave it as it is,
  2. change the tree to a list - should be easy and fast to do; the question is: would the code be cleaner? I don't know.


LibOS/shim/src/ipc/shim_ipc.c, line 79 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

OK, thanks, I forgot (or didn't know) this about your AVL tree implementation.

When I was writing this PR I wanted to use such a function (is_in_tree) and then remembered that I did not add it intentionally. I think the code that used it would be much harder to read and maintain (e.g. keeping track of object ownership), so I'm even more convinced that not adding such a function was a good decision.


LibOS/shim/src/ipc/shim_ipc.c, line 96 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

OK, just a matter of taste. No need to change anything.

I've tried rewriting it, but ended up with unlock in 3 different places, so I gave up and left the current version.


LibOS/shim/src/ipc/shim_ipc.c, line 112 at r3 (raw file):

the use of spinlock is suspicious.

Actually I've just checked, and lock() is never a spinlock, even on the Linux-SGX PAL. That would be weird anyway - LibOS is common for all PALs.


LibOS/shim/src/ipc/shim_ipc.c, line 159 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I think you forgot to push your changes with comments. I don't see anything there.

Yeah, sorry, I got distracted with debugging that damn DkStreamsWaitEvents...
I want to clean up shim_ipc.h and add comments to all public functions there. Unfortunately that's not finished yet.


LibOS/shim/src/ipc/shim_ipc.c, line 220 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

We cannot use this message (msg) after giving up the lock - it might get taken off the tree and destroyed.

Keep in mind that currently we use the generic thread_sleep, which might finish due to other wakeups (e.g. a signal arriving). I've added a TODO somewhere to change that.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 294 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Interesting, I don't remember seeing this (but this was ages ago when I worked on this part).

Update: this function is broken, e.g. it reports readiness for reading only once - if we call it 2 times in a row, without doing anything with the polled handles, it will report PAL_WAIT_READ only on the first call.
I need to rewrite it before merging this PR :(


LibOS/shim/src/ipc/shim_ipc_worker.c, line 333 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You mean the remote client wanted to connect, sent its vmid, proceeded with whatever it was executing, and then terminated (itself or host OS killed it)? Well, true, this can happen before we do read_exact() here. Let's just forget about this client then.

In such a case the read_exact() above would succeed (since that process already sent the data). This was just a general question about what to do in case of communication failures. I think we should just remove the client and let the app handle communication failures.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't see anything special, I think we can just remove this comment.

Ok, removing this, but cc @mkow @pwmarcz @yamahata


LibOS/shim/src/ipc/shim_ipc_worker.c, line 147 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you say thread and not process?

Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 361 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

there are messages

Or singular? Which version is correct? (ignore the current version with the typo for now)

Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: 27 of 29 files reviewed, 15 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, @mkow, @pwmarcz, and @yamahata)

a discussion (no related file):

Previously, pwmarcz (Paweł Marczewski) wrote…

If thread A would be the same as S (server) then it could do the first part up_req + down_req directly and then sleep awaiting for the 2nd part to come asynchronously (handled by IPC worker this time).

That sounds fine. Generally, I'm not worried about cases where a thread needs to sleep and wait. I'm worried about the ones where the other side messages me synchronously. For example (sorry, I'm sure it can be made simpler, but I hope it makes the problem clear):

A               S              B            C
.               | <-------------- up_req    |
.               | down_req ->  |            .
|  <- down_req  |              |            .
|  down_conf -> |              |            .
.               | <- down_conf |            . 
.               | up_conf --------------->  |    

Assume A and S are the same process. Now, what I would like to have, is:

  1. S receives up_req from C, and then sends down_req to A and B
  2. At a later time, S receives down_conf from A
  3. At a later time, S receives down_conf from B, and then sends up_conf to C

All of the above 3 steps are separate invocations of a server handler, from the IPC worker main loop. The server functions as a nice state machine: lock, handle the incoming message, send resulting outgoing messages, unlock.

However, if "sending" down_req to A is actually a function call, and A "responds" immediately to S by issuing another function call, then I have the server handler running, nested, inside the same server handler. Which breaks my design: I cannot just take a lock in the server handler, and assume that my server handler invocations happen one after another.

And probably I can special-case that, but it looks like my code has to handle both "what if I'm calling the other side locally" and "what if I am being called by the other side locally"? This really should be the responsibility of the transport layer, not the code built on top of it.

But that's what send_msg_with_asck do!

Yes, send_msg_with_ack is like an RPC call that returns a single int. I would need it to return arbitrary data. I believe this would be useful for the sync engine, although I'm not sure if it would fully solve the above problem of local messages. So maybe it's not so important right now.

No, because IPC worker thread never awaits for ack messages, it just handles them. If it needs to send a message it does send a normal one (without confirmation).

I'm referring to this part:

        if (msg->code < ARRAY_SIZE(ipc_callbacks) && ipc_callbacks[msg->code]) {
            int ret = ipc_callbacks[msg->code](msg, conn->vmid);
            if ((ret < 0 || ret == RESPONSE_CALLBACK) && msg->seq) {
                ret = send_ipc_response(conn->vmid, ret, msg->seq);

AFAICT, this runs inside the IPC worker main loop. And send_ipc_response writes to a pipe. And the other end of that pipe is handled by another IPC worker. So if two IPC workers send responses to each other simultaneously, it's possible (although unlikely) that both these writes (to their respective pipes) could be blocked on full kernel buffers, causing a deadlock. Is that reasoning correct, or am I making a wrong assumption somewhere?

We had a quick call about this. Short summary from my side:

  1. I still need to have a way to send messages to self, e.g. because of the above example
  2. I was wrong about send_msg_with_ack, it actually does send arbitrary data as a response. I will take a look and might be able to use that.
  3. Looks like sending messages from IPC worker in general is incorrect, and sending IPC responses / "acks" has to be handled differently, but it's not clear how best to do it:
    • One solution is two helper threads (one for receiving messages, another for sending outgoing messages), but this raises performance concerns
    • Another solution is sending in non-blocking mode (queue outgoing data, send as much as you're able, include the outgoing pipes in select() to wait until they're unblocked)
    • ...
  4. Hopefully, if we solve the problem with sending messages from IPC worker, it will also be a good solution for sending messages to self.
  5. For now, I'll continue working with the current IPC code - it should be "good enough" for experimental feature although in practice it can deadlock.

Please correct if anything is not right :) And I'm resolving this.


Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: 27 of 29 files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, @mkow, @pwmarcz, and @yamahata)

a discussion (no related file):

Previously, pwmarcz (Paweł Marczewski) wrote…

We had a quick call about this. Short summary from my side:

  1. I still need to have a way to send messages to self, e.g. because of the above example
  2. I was wrong about send_msg_with_ack, it actually does send arbitrary data as a response. I will take a look and might be able to use that.
  3. Looks like sending messages from IPC worker in general is incorrect, and sending IPC responses / "acks" has to be handled differently, but it's not clear how best to do it:
    • One solution is two helper threads (one for receiving messages, another for sending outgoing messages), but this raises performance concerns
    • Another solution is sending in non-blocking mode (queue outgoing data, send as much as you're able, include the outgoing pipes in select() to wait until they're unblocked)
    • ...
  4. Hopefully, if we solve the problem with sending messages from IPC worker, it will also be a good solution for sending messages to self.
  5. For now, I'll continue working with the current IPC code - it should be "good enough" for experimental feature although in practice it can deadlock.

Please correct if anything is not right :) And I'm resolving this.

To be clear, the case when it can deadlock is:

  • we have two processes, A and B,
  • the IPC worker in A sends data to process B,
  • the IPC worker in B sends data to process A,
  • they both block on send due to the underlying host buffer (e.g. a pipe) being full.

The problem is that since they are both blocked on send, they won't process any IPC messages, hence won't read from any buffer and will never allow the other side to unblock.

This issue definitely needs to be addressed; the question is how and when (in this PR?).
I'll try to create some draft proposals and present them at the nearest contributors meeting.


Contributor

@dimakuv dimakuv left a comment


Reviewed 2 of 2 files at r5.
Reviewable status: all files reviewed, 12 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, and @yamahata)


LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Well, I could make it consistent here, that shouldn't be hard...
vmid everywhere for now?

vmid everywhere then.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 361 at r4 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Or singluar? Which version is correct? (ignore the current version with typo for now)

I think there are messages.

Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: 20 of 29 files reviewed, 11 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, and @yamahata)


LibOS/shim/include/shim_ipc.h, line 64 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

vmid everywhere then.

Done.


LibOS/shim/include/shim_ipc.h, line 328 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you add a short comment on why data and seq are needed in these two functions?

Done.


LibOS/shim/include/shim_ipc.h, line 331 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Need some comment what this function is.

Done.


LibOS/shim/include/shim_ipc.h, line 332 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Need some comment what this dummy callback is.

Done.


LibOS/shim/src/ipc/shim_ipc.c, line 159 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Yeah, sorry, I got distracted with debugging that damn DkStreamsWaitEvents...
I want to cleanup shim_ipc.h and add comments to all public functions there. Unfortunately not finished yet.

Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 361 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I think there are messages.

Done.

Contributor

@dimakuv dimakuv left a comment


Reviewed 9 of 9 files at r6.
Reviewable status: all files reviewed, 10 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)

a discussion (no related file):

Previously, boryspoplawski (Borys Popławski) wrote…

To be clear the case when it can deadlock is:

  • we have two processes A and B,
  • IPC worker in A sends data to process B,
  • IPC worker in B sends data to process A,
  • they both block on send due to underlying host buffer (e.g. pipe) being full.
    The problem is that since they are both blocked on send, they won't process any IPC messages, hence won't read from any buffer and will never allow the other side to unblock.

This issue definitely needs to be addressed, the question is how and when (in this PR?)
I'll try to create some draft proposals and present them on nearest contributors meeting.

I think this should be a separate PR.

Using two worker threads seems like the simple solution, but this will become even trickier to debug... Also, now our SGX examples may break because we add one more enclave thread. I don't like this option.

Looks like some form of non-blocking writes with reads-in-between should be the solution.



LibOS/shim/include/shim_ipc.h, line 94 at r6 (raw file):

 * broadcast ipc messages.
 */
int request_leader_connect_back(void);

Sorry I'm still confused about this function. Why is it different from connect_to_process(leader_vmid)? The comment doesn't help in explaining this (if this is about broadcast, doesn't leader always broadcast to all connected processes, no matter how they got connected initially?).


LibOS/shim/include/shim_ipc.h, line 146 at r6 (raw file):

 * \param[out] seq upon return contains sequence number of this message
 *
 * This functions sends a ipc message to the \p dest process and awaits for a response.

function


LibOS/shim/include/shim_ipc.h, line 169 at r6 (raw file):

/*!
 * \brief Handle a response to previousle sent message

previously


LibOS/shim/include/shim_ipc.h, line 186 at r6 (raw file):

 *
 * \param req_msg original message which got the response
 * \param data unused (just to confrom #ipc_msg_response_handle callbacks interface)

conform to


LibOS/shim/src/shim_checkpoint.c, line 577 at r4 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Ughh, this is still not enough. Imagine a scenario: everything goes ok, but the malicious host-OS kills the child just before we do add_child_process above and spoofs this write (i.e. reports it succeeded without sending anything). Since the process was killed before add_child_process, the disconnect callback did not see this child - we added it only later. Now we end up in a situation where the child is dead, but we are not aware of it and have it still marked as alive.
Is such a scenario problematic, or is it just a simple DoS which we could ignore? I don't know anymore...

It's a DoS situation. As soon as you write "malicious OS does X", then the only thing that Graphene shouldn't do is to reveal any secret info. Graphene can hang, try sending messages to the non-existing child (which will result in either a hang or a TLS error), terminate itself.

So this looks fine. The malicious OS can pretend that the terminated child is still alive at any moment, right (by simply not forwarding the SIGCHLD message to the parent)? So it doesn't really matter at which moment the malicious OS "hides" this fact.


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Most of the TODOs in this PR should be resolved before merging, e.g. this one. So since we don't really care, I see 2 options:

  1. leave it as it is,
  2. change tree to list - should be easy and fast to do, the question is would be the code cleaner? I don't know.

Well, then I vote for changing to a list.

Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, and @yamahata)


LibOS/shim/include/shim_ipc.h, line 94 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sorry I'm still confused about this function. Why is it different from connect_to_process(leader_vmid)? The comment doesn't help in explaining this (if this is about broadcast, doesn't leader always broadcast to all connected processes, no matter how they got connected initially?).

We probably need some general description of IPC inner workings somewhere...
For now updated these comments.


LibOS/shim/include/shim_ipc.h, line 146 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

function

Done.


LibOS/shim/include/shim_ipc.h, line 169 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

previously

Done.


LibOS/shim/include/shim_ipc.h, line 186 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

conform to

Done.


LibOS/shim/src/ipc/shim_ipc.c, line 36 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Well, then I vote for changing to a list.

Sorry for the mess, but since the tree is a bit better than a list here, I've just left it.

Contributor

@dimakuv dimakuv left a comment


Reviewed 6 of 6 files at r7.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)


LibOS/shim/include/shim_ipc.h, line 94 at r6 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

We probably need some general description of IPC inner workings somewhere...
For now updated these comments.

OK. Now I understood, thanks.


LibOS/shim/include/shim_ipc.h, line 146 at r6 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Done.

Now you have Ssend :)

Also, await for -> wait for


LibOS/shim/include/shim_ipc.h, line 186 at r6 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Done.

Still confrom -> conform


LibOS/shim/src/ipc/shim_ipc_worker.c, line 294 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Update: this function is broken, e.g. it also reports read availability only once - if we call it 2 times in a row, without doing anything with polled handles, it will report PAL_WAIT_READ only on the first call.
I need to rewrite it before merging this PR :(

But your new commit fixes a slightly more specific case: if we call poll 2 times in a row on an errored handle, then it will report read only on the first call. Right? Was that the actual problem?


LibOS/shim/src/ipc/shim_ipc_worker.c, line 238 at r7 (raw file):

static noreturn void ipc_worker_main(void) {
    /* TODO: If we had a global array of connection (instead of a list) we wouldn't have to gather

connections (multiple)


Pal/src/db_object.c, line 67 at r7 (raw file):

                        PAL_FLG* ret_events, PAL_NUM timeout_us) {
    if (!count || !handle_array || !events || !ret_events) {
        return -PAL_ERROR_INVAL;

Why did you remove it? Seems to make sense to check these values...


Pal/src/host/Linux/db_object.c, line 75 at r7 (raw file):

                 * skip it but update its ret_events */
                if (events[i] & (PAL_WAIT_READ | PAL_WAIT_WRITE)) {
                    ret_events[i] |= PAL_WAIT_ERROR;

Ok, so the error was here. If we cached READ/WRITE events in the hdl, we could update ret_events[i] with them. But in the current code, we only put the ERROR event in ret_events[i], so availability of read/write is suppressed.

Is my understanding correct? If yes, then it was quite a bad bug :(

Now we might poll twice on the "errored" handle, but that doesn't seem to pose any problems.


Pal/src/host/Linux/db_object.c, line 103 at r7 (raw file):

    if (ret < 0) {
        ret = unix_to_pal_error(ret);

This is just a simplification, right? There was no bug here?

Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, and @yamahata)


LibOS/shim/include/shim_ipc.h, line 146 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Now you have Ssend :)

Also, await for -> wait for

Done.


LibOS/shim/include/shim_ipc.h, line 186 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Still confrom -> conform

Done.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 294 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But your new commit fixes a slightly more specific case: if we call poll 2 times in a row on an errored handle, then it will report read only on the first call. Right? Was that the actual problem?

Closing the remote end is considered an error (at least by Graphene).
The new commit fixes all kinds of errors we could encounter by not polling what the user requested.


LibOS/shim/src/ipc/shim_ipc_worker.c, line 238 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

connections (multiple)

Done.


Pal/src/db_object.c, line 67 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why did you remove it? Seems to make sense to check these values...

Because we actually allow the count == 0 case.


Pal/src/host/Linux/db_object.c, line 75 at r7 (raw file):

Is my understanding correct? If yes, then it was quite a bad bug :(

Yes, but note that ERROR might be set by other functions as well (whereas there would be no way for them to set availability of read/write).

Now we might poll twice on the "errored" handle, but that doesn't seem to pose any problems.

Yup, it will just return immediately with an error in revents (which is what the caller should expect). Also I've left the caching of errors, so that a malicious host cannot report one poll erroring and another succeeding (not sure if that's needed, but doesn't hurt).


Pal/src/host/Linux/db_object.c, line 103 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This is just a simplification, right? There was no bug here?

Yes

yamahata
yamahata previously approved these changes Apr 21, 2021
Contributor

@yamahata yamahata left a comment


Reviewed 6 of 9 files at r6, 4 of 6 files at r7, 2 of 2 files at r8.
Reviewable status: all files reviewed, 9 unresolved discussions, "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @dimakuv, and @yamahata)


LibOS/shim/src/ipc/shim_ipc_worker.c, line 371 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Ok, removing this, but cc @mkow @pwmarcz @yamahata

IIRC, the comment was stale, I think.

dimakuv
dimakuv previously approved these changes Apr 21, 2021
Contributor

@dimakuv dimakuv left a comment


Reviewable status: all files reviewed, 2 unresolved discussions, "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @yamahata)


Pal/src/host/Linux/db_object.c, line 75 at r7 (raw file):

Yes, but note that ERROR might be set by other functions as well (whereas there would be no way for them to set availability of read/write).

True, good point, I didn't take it into account. Yes, it was quite a bad bug :( Resolving.

@boryspoplawski boryspoplawski dismissed stale reviews from dimakuv and yamahata via 5a4b8a9 April 21, 2021 17:41
@boryspoplawski boryspoplawski force-pushed the borys/i_dont_even_know_what_im_doing branch from 30d903d to 5a4b8a9 Compare April 21, 2021 17:41
@boryspoplawski
Contributor Author

Jenkins, retest Jenkins-Debug-18.04 please (kill11, known issue)

Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: all files reviewed, 1 unresolved discussion, "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski)


LibOS/shim/test/ltp/runltp_xml.py, line 278 at r12 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Changed got bigger than expected. Pros:

  • there shouldn't be any runaway processes anymore (previously, if the process was already dead, os.getpgid failed, which prevented calling killpg),
  • stdout+stderr are extracted even if the process timed out,
  • fixed a hang where the child process was still holding a communication pipe and python hung trying to extract stdout.

The logic looks OK, I also extracted it to a simple program and tested for myself.

I think it can be made more readable if you run two tasks in parallel, "kill after N seconds" and "communicate":

async def kill_after(self, proc, timeout):
    '''Kill the process if it doesn't exit after `timeout` time. Return True if the wait timed out.'''
    try:
        await asyncio.wait_for(asyncio.ensure_future(proc.wait()), timeout=timeout)
        return False

    except asyncio.exceptions.TimeoutError:
        if proc.pid != os.getpgid(proc.pid):
            pass  # ... (log warning)
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except subprocess.ProcessLookupError:
            pass

        # not strictly necessary because of communicate() below
        await proc.wait()
        return True

async def _run_cmd(...):
    ... # start process

    timed_out, (stdout, stderr) = await asyncio.gather(
         self.kill_after(proc, timeout), proc.communicate())

    ... # process stdout, stderr

    if timed_out:
        raise Error(...)

But I'm not sure about it - feel free to ignore if it doesn't read better for you.


LibOS/shim/test/ltp/runltp_xml.py, line 293 at r13 (raw file):

            for stream in tasks[1].result())

        if not done:

I think this raise Error should be after the log lines (just before setting returncode).

Also, in earlier code you redefine pending but not done, which is somewhat confusing. How about:

done, pending = ...

if pending:
    timed_out = True
    done, pending = await asyncio.wait(pending)
    assert not pending
else:
    timed_out = False

...

if timed_out:
    raise Error()

Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 1 unresolved discussion, "fixup! " found in commit messages' one-liners (waiting on @pwmarcz)


LibOS/shim/test/ltp/runltp_xml.py, line 278 at r12 (raw file):

Previously, pwmarcz (Paweł Marczewski) wrote…

The logic looks OK, I also extracted it to a simple program and tested for myself.

I think it can be made more readable if you run two tasks in parallel, "kill after N seconds" and "communicate":

async def kill_after(self, proc, timeout):
    '''Kill the process if it doesn't exit after `timeout` time. Return True if the wait timed out.'''
    try:
        await asyncio.wait_for(asyncio.ensure_future(proc.wait()), timeout=timeout)
        return False

    except asyncio.exceptions.TimeoutError:
        if proc.pid != os.getpgid(proc.pid):
            pass  # ... (log warning)
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except subprocess.ProcessLookupError:
            pass

        # not strictly necessary because of communicate() below
        await proc.wait()
        return True

async def _run_cmd(...):
    ... # start process

    timed_out, (stdout, stderr) = await asyncio.gather(
         self.kill_after(proc, timeout), proc.communicate())

    ... # process stdout, stderr

    if timed_out:
        raise Error(...)

But I'm not sure about it - feel free to ignore if it doesn't read better for you.

I'm not sure which version is better, so I took the lazy approach of not doing anything.


LibOS/shim/test/ltp/runltp_xml.py, line 293 at r13 (raw file):

I think this raise Error should be after the log lines (just before setting returncode).

The old version did not print that log and IIUC it's supposed to be a successful log. On errors we get stdout/err anyway.

Also, in earlier code you redefine pending but not done, which is somewhat confusing. How about:

if pending: is wrong - pending can be non-empty on a successful run too.
I've changed the part with if done though.

mkow
mkow previously approved these changes May 6, 2021
Member

@mkow mkow left a comment


Reviewed 5 of 5 files at r14.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @pwmarcz)

@dimakuv
Contributor

dimakuv commented May 7, 2021

Jenkins, retest Jenkins-Debug-18.04 please (select04 LTP test timed out, just a slow Jenkins worker?)

Contributor

@dimakuv dimakuv left a comment


Reviewed 4 of 5 files at r14, 12 of 12 files at r15.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @boryspoplawski and @pwmarcz)


Pal/src/host/Linux/db_memory.c, line 128 at r15 (raw file):

}

static void parse_line(const char* line, uintptr_t* start_ptr, uintptr_t* end_ptr) {

Maybe add a comment that this function expects a line in the format (starting with):
08048000-08056000 ...


Pal/src/host/Linux/pal_linux.h, line 109 at r15 (raw file):

extern uintptr_t g_vdso_end;
bool is_in_vdso(uintptr_t addr);
/* Parse "/proc/self/maps" and return addres ranges for "vdso" and "vvar". */

address

Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: 43 of 45 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @pwmarcz)


Pal/src/host/Linux/db_memory.c, line 128 at r15 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Maybe add a comment that this function expects a line in the format (starting with):
08048000-08056000 ...

Done.


Pal/src/host/Linux/pal_linux.h, line 109 at r15 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

address

Done.

dimakuv
dimakuv previously approved these changes May 7, 2021
Contributor

@dimakuv dimakuv left a comment


Reviewed 2 of 2 files at r16.
Reviewable status: all files reviewed, 1 unresolved discussion, "fixup! " found in commit messages' one-liners (waiting on @pwmarcz)

@boryspoplawski
Contributor Author

Jenkins, retest Jenkins-Debug-18.04 please (kill11 failed, known issue)

Copy link
Member

@mkow mkow left a comment


Reviewed 10 of 12 files at r15, 2 of 2 files at r16.
Reviewable status: all files reviewed, 5 unresolved discussions, "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski and @pwmarcz)


Pal/src/db_memory.c, line 63 at r16 (raw file):

int add_preloaded_range(uintptr_t start, uintptr_t end, const char* comment) {
    size_t new_len = g_pal_control.preloaded_ranges_len + 1;

I'd prefer _cnt suffix for array element counts, but I'm not blocking.


Pal/src/host/Linux/db_memory.c, line 129 at r16 (raw file):

hexadecimal_number-hexadecimalnumber

Once with and once without an underscore? :)


Pal/src/host/Linux/db_memory.c, line 132 at r16 (raw file):

static void parse_line(const char* line, uintptr_t* start_ptr, uintptr_t* end_ptr) {
    char* next = NULL;
    long start = strtol(line, &next, 16);

I don't like this, it relies on user addresses being in the lower half of address space.


Pal/src/host/Linux/db_memory.c, line 162 at r16 (raw file):

    ssize_t got = 0;
    do {
        /* There should be no failures or partial reads from this fd. */

At first I got confused by this buf + size iteration despite the comment saying that there are no partial reads possible. Could you say that we still need to loop because buf is small?


Pal/src/host/Linux/db_memory.c, line 187 at r16 (raw file):

            }
        }
        if (!memcmp(vvar_str, line_end - vvar_str_len, vvar_str_len)) {

Please convert into else-if, otherwise it's harder to reason about the correctness of this - parse_line() assumes that the out arguments were zeroed by the caller, otherwise you will use undefined data on failure. With if-else it would be clearly visible that both codepaths are correct.

Copy link
Contributor Author

@boryspoplawski boryspoplawski left a comment


Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @mkow and @pwmarcz)


Pal/src/db_memory.c, line 63 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I'd prefer _cnt suffix for array element counts, but I'm not blocking.

Done.


Pal/src/host/Linux/db_memory.c, line 129 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
hexadecimal_number-hexadecimalnumber

Once with and once without an underscore? :)

Done.


Pal/src/host/Linux/db_memory.c, line 132 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I don't like this, it relies on user addresses being in the lower half of address space.

Done.


Pal/src/host/Linux/db_memory.c, line 162 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

At first I got confused by this buf + size iteration despite the comment saying that there are no partial reads possible. Could you say that we still need to loop because buf is small?

Done.


Pal/src/host/Linux/db_memory.c, line 187 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please convert into else-if, otherwise it's harder to reason about the correctness of this - parse_line() assumes that the out arguments were zeroed by the caller, otherwise you will use undefined data on failure. With if-else it would be clearly visible that both codepaths are correct.

Done.

@boryspoplawski
Contributor Author

Jenkins, retest Jenkins-20.04 please (kill11 known issue)

mkow
mkow previously approved these changes May 7, 2021
Member

@mkow mkow left a comment


Reviewed 4 of 4 files at r17.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @pwmarcz)

They read exactly as much data as requested, ignoring interrupts and
handling partial reads/writes.

Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
The old version was buggy, unstable, unnecessarily complex and hard to
maintain.
After this PR, other parts of the code need not worry about IPC
implementation details. As a bonus, there is no 0.5s sleep on program
exit.
IPC connections are now managed using two sets: incoming and outgoing.
The former is fully managed by an IPC worker thread, which handles new
connections and all incoming IPC messages. The latter is shared between
all threads, but encapsulated in the IPC module, and the only thing you
need to send an IPC message is the destination process ID.

Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
It was testing some old bug in Graphene and never triggered in the last
couple of years, but took a long time on each run.

Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
If `DkStreamsWaitEvents` spotted a previous error, it skipped polling
the fd, which could result in missing return events or hangs.

Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Previously, PALs could only report a predefined set of preloaded memory
ranges ("vdso" and "manifest"). Now they can report an arbitrary array
of such ranges.
This commit also adds reporting of the "vvar" range on the Linux PAL.

Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Member

@mkow mkow left a comment


Reviewed 1 of 29 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: ITL) (waiting on @pwmarcz)

Contributor

@yamahata yamahata left a comment


Reviewed 14 of 19 files at r12, 5 of 5 files at r14, 7 of 12 files at r15, 1 of 2 files at r16, 4 of 4 files at r17.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pwmarcz)

Contributor

@pwmarcz pwmarcz left a comment


Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@boryspoplawski boryspoplawski merged commit 1d5dfb4 into master May 8, 2021
@boryspoplawski boryspoplawski deleted the borys/i_dont_even_know_what_im_doing branch May 8, 2021 17:33

Successfully merging this pull request may close these issues.

[LibOS] Refcounting is broken in callers of add_ipc_port_by_id()
6 participants