Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime-rs: improving io performance using dragonball's vsock fd passthrough #7483

Merged
merged 16 commits into from Feb 1, 2024

Conversation

frezcirno
Copy link
Contributor

@frezcirno frezcirno commented Jul 28, 2023

Currently, in the kata container, every io read/write operation requires an RPC request from the runtime to the agent.
This process involves unnecessary data copying into/from an RPC request/response, which introduces high overhead. In scenarios where containers have multiple process streams, this results in poor performance and additional CPU consumption.

To solve this issue, this PR proposes utilizing the vsock fd passthrough #7585, a newly introduced feature in the Dragonball hypervisor. This feature allows other host programs to pass a file descriptor to the Dragonball process directly as the backend of an ordinary hybrid vsock connection. The detail is depicted in the following diagram:
passfd.png

Changes Made in this PR:

  1. The runtime-rs now utilizes this feature for container process io. It passes the stdin/stdout/stderr fifo from containerd to Dragonball, eliminating the need for an RPC for each io read/write operation.
  2. The agent uses the vsock streams as the child process's stdin/stdout/stderr in passfd mode, eliminating the need for a pipe to bump data (in non-tty mode).
  3. As this is a Dragonball-only improvement, two toml options, use_passfd_io and passfd_listener_port, are introduced to enable and configure the feature.
  4. The vsock implementation in the Dragonball vmm is modified to support container io detach and attach.

Implementation Details:
Agent side:

  1. The agent starts an AF_VSOCK server (the passfd_listener) based on a boot parameter. The server does accept() and saves the (hostport, stream) pairs for later use.
  2. Upon receiving CreateContainer or ExecProcess requests, the agent gets the port info, finds corresponding streams, and uses them as the child process's stdin/stdout/stderr. Note that if a terminal is required, the agent spawns two tokio tasks to copy from the stdin stream to the term_master, and from the term_master to the stdout stream.

Runtime-rs side:

  1. When creating a new container (with the container init process to run) or executing a process in an existing container, the runtime-rs opens the process io FIFO provided by containerd and connects to the passfd_listener in the guest. It then passes the fds to Dragonball, and saves the OK <hostport> results.
  2. The CreateContainerRequest and ExecProcessRequest carry the stream's hostport information.

Dragonball vmm:

  1. A stream can be set in "keep" mode to allow a peer to detach and attach.

New Protobuf Fields:
Three extra u32 fields (stdin, stdout, and stderr stream ports) are added to the CreateContainerRequest and ExecProcessRequest structs.

This PR is part of the student open-source practice program hosted in GitLink Code Camp, similar to the GSoC.
cc mentor @lifupan

@katacontainersbot
Copy link
Contributor

Can one of the admins verify this patch?

@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label Jul 28, 2023
@frezcirno frezcirno changed the title WIP: runtime-rs: refactor io using dragonball's vsock fd passthrough feature WIP: runtime-rs: improving io performance using dragonball's vsock fd passthrough Jul 28, 2023
@frezcirno frezcirno marked this pull request as ready for review July 28, 2023 18:33
@frezcirno frezcirno changed the title WIP: runtime-rs: improving io performance using dragonball's vsock fd passthrough runtime-rs: improving io performance using dragonball's vsock fd passthrough Jul 28, 2023
@bergwolf bergwolf added the soc Summer of Code related label Jul 31, 2023
@bergwolf bergwolf linked an issue Jul 31, 2023 that may be closed by this pull request
@bergwolf
Copy link
Member

@frezcirno I assumed that there should be some modification to the dragonball vmm, but it turned out to be not the case?

@frezcirno
Copy link
Contributor Author

@frezcirno I assumed that there should be some modification to the dragonball vmm, but it turned out to be not the case?

Hi, the main functionality of passfd has merged into dragonball in this openanolis/dragonbal-sandbox PR #278 already, which is done before the original dragonball repo archives and moves into kata.

Moreover, another precondition of this PR, which fixes a passfd type issue, is on the way, so this pr should be merged after that.

@bergwolf
Copy link
Member

bergwolf commented Aug 1, 2023

@frezcirno another precondition of this PR has already been merged in dragonball-sandbox. Do you mean that it needs to be ported to kata-containers (now that dragonball is part of kata)?

Copy link
Member

@bergwolf bergwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Nice PR, thanks! A few comments are inline.

src/agent/src/rpc.rs Outdated Show resolved Hide resolved
src/libs/protocols/protos/agent.proto Outdated Show resolved Hide resolved
src/libs/protocols/protos/agent.proto Outdated Show resolved Hide resolved
@frezcirno
Copy link
Contributor Author

@frezcirno another precondition of this PR has already been merged in dragonball-sandbox.

Sorry, but I don't find this fix in openanolis/dragonball-sandbox or in kata-containers. Is it merged? I searched for the fix change and didn't find a similar commit or PR?

Do you mean that it needs to be ported to kata-containers (now that dragonball is part of kata)?

Yes, this PR won't work normally without that fix to dragonball.

@bergwolf
Copy link
Member

bergwolf commented Aug 2, 2023

@frezcirno oops, I was looking at a different PR. Then could you help port another precondition of this PR to kata-containers ?

@frezcirno frezcirno force-pushed the passfd_io_feature branch 2 times, most recently from 6f3132e to e823eea Compare August 4, 2023 07:18
@frezcirno frezcirno force-pushed the passfd_io_feature branch 3 times, most recently from b0a082c to 33716fa Compare August 10, 2023 09:02
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/huge Largest and most complex task (probably needs breaking into small pieces) labels Aug 10, 2023
@frezcirno frezcirno force-pushed the passfd_io_feature branch 2 times, most recently from 71b3b6b to 813583b Compare August 14, 2023 02:32
@katacontainersbot katacontainersbot added size/huge Largest and most complex task (probably needs breaking into small pieces) and removed size/tiny Smallest and simplest task labels Aug 14, 2023
@frezcirno
Copy link
Contributor Author

Ping @bergwolf The dragonball fix has been merged in #7585 ;-)

@lifupan
Copy link
Member

lifupan commented Sep 21, 2023

Hi @frezcirno

I took a look and found that there are no other problems. There are two questions you need to figure out:

  1. Why did you need to wait for io close before waiting for the container process? The root cause of this problem is that you need to wait for all the data in the io stream of the process to be copied before initiating wait. otherwise
    This may cause some io content to be lost. Can't see it on the console. Now we have separated the io stream separately, and there is no need to wait for the io to end on the runtime side. Initiate directly at runtime
    wait process Wait for the process on the kata-agent side to return and it will be ok.

  2. close stdin should not need to be sent to the agent. Because when containerd initiates close stdin, the pipe at containerd's end should have been closed. At this time
    Our vsock should be able to sense the peer close on the guest side. At this time, for non-tty cases, the guest side should not need to process anything, because vsock fd is the stdin of the process.
    It will be closed automatically. If it is a tty case, after the guest senses that the vsock peer is closed, it can just close the tty master.

@frezcirno
Copy link
Contributor Author

Hi @frezcirno. It appears that two required tests were failed in every attempt, see:

Unfortunately, no detailed logs are printed into console. I'd recommend you run these tests in your local environment to see what is going wrong. Please refer to this doc for more information.

Hi @justxuewei. The problem is that the quay.io/sjenning/nginx:1.15-alpine image used in the tests relies on procfs to redirect its stdout and stderr, which is not a perfect way and is not supported by this PR.

The /proc/x/fd/y is a virtual file representing the file descriptor y of the process with pid number x. Another process can open it to acquire the same duplicated file. However, if the fd is a socket, the Linux kernel forbids opening it and fails the open() call with ENXIO ("No such device or address"). Ref: [1], [2]

In the nginx image, /var/log/nginx/access.log (error.log) is a soft link to /dev/stdout (stderr), which is another soft link to /proc/self/fd/1 (/2). In the context of this PR, the stdout and stderr of the process are actually sockets (of vsock type), and blocked by the kernel.

For the CI, maybe we could use another nginx image without using the procfs? AFAIK, the nginx now supports specifying stdout or stderr directly rather than /dev/std* or /proc/self/fd/* as access_log and error_log, maybe it's time to update the config?

Thanks,

@justxuewei
Copy link
Member

For the CI, maybe we could use another nginx image without using the procfs? AFAIK, the nginx now supports specifying stdout or stderr directly rather than /dev/std* or /proc/self/fd/* as access_log and error_log, maybe it's time to update the config?

Hi @frezcirno, thanks for your efforts to investigate this issue. It's ok to me overall, but there is one question remaining: Is this feature enabled by default? If yes, it means some images, like aforementioned nginx image, are required to be adapted. Otherwise, pods can't be started as expected.

Wdyt? /cc @lifupan @bergwolf

@justxuewei
Copy link
Member

/test

@justxuewei
Copy link
Member

/test

@justxuewei
Copy link
Member

/test-arm

@justxuewei
Copy link
Member

justxuewei commented Jan 23, 2024

Hi @frezcirno. Your pull request possibly breaks CI, since I raised another PR without changes at here. run-k8s-tests (ubuntu, dragonball, small) passed.

I observed that it stucks at one of Empty dir volumes and Empty dir volume when FSGroup is specified with non-root container. Your pull request somehow breaks those tests.

You could deploy a K8s cluster on local to do the test. I have a document about how to create a test cluster. You can ping me (@Xavier) in Slack if you are interested in.

@justxuewei
Copy link
Member

@frezcirno Please rebase atop of latest main containing a CI fix.

frezcirno and others added 16 commits January 31, 2024 21:07
Fixes: kata-containers#6714

Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Two toml options, `use_passfd_io` and `passfd_listener_port` are introduced
to enable and configure dragonball's vsock fd passthrough io feature.

This commit is a preparation for vsock fd passthrough io feature.

Fixes: kata-containers#6714

Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Currently in the kata container, every io read/write operation requires
an RPC request from the runtime to the agent. This process involves
data copying into/from an RPC request/response, which are high overhead.

To solve this issue, this commit utilize the vsock fd passthrough, a
newly introduced feature in the Dragonball hypervisor. This feature
allows other host programs to pass a file descriptor to the Dragonball
process, directly as the backend of an ordinary hybrid vsock connection.

The runtime-rs now utilizes this feature for container process io. It
open the stdin/stdout/stderr fifo from containerd, and pass them to
Dragonball, then don't bother with process io any more, eliminating
the need for an RPC for each io read/write operation.

In passfd io mode, the agent uses the vsock connections as the child
process's stdin/stdout/stderr, eliminating the need for a pipe
to bump data (in non-tty mode).

Fixes: kata-containers#6714

Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When one end of the connection close, the epoll event will be triggered
forever. We should close the connection and kill the connection.

Fixes: kata-containers#6714

Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When container exits, the agent should clean up the term master fd,
otherwise the fd will be leaked.

Fixes: kata-containers#6714

Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
In linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic
when creating containers in detached mode, as the stdin FIFO writer
is closed after the container is created, resulting in this situation.

In passfd io mode, open stdin fifo with O_RDWR|O_NONBLOCK to avoid the
HUP event.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Partially fix some issues related to container io detach and attach.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Support the hybrid fd passthrough mode with passing pipe fd,
which can specify this connection kept even when the pipe
peer closed, and this connection can be reget wich re-opening
the pipe.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
We want the io connection keep connected when the containerd closed
the io pipe, thus it can be attached on the io stream.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
In passfd io mode, when not using a terminal, the stdout/stderr vsock
streams are directly used as the stdout/stderr of the child process.
These streams are non-blocking by default.

The stdout/stderr of the process should be blocking, otherwise
the process may encounter EAGAIN error when writing to stdout/stderr.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
This patch uses a biased select to avoid stdin data loss in case of
CloseStdinRequest.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
This patch adds O_NONBLOCK flag when open stdout and stderr FIFOs
to avoid blocking.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Fix rustfmt and clippy warnings detected by CI.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Linux forbids opening an existing socket through /proc/<pid>/fd/<fd>,
making some images relying on the special file /dev/stdout(stderr),
/proc/self/fd/1(2) fail to boot in passfd io mode, where the
stdout/stderr of a container process is a vsock socket.

For back compatibility, a pipe is introduced between the process
and the socket, and its read end is set as stdout/stderr of the
container process instead of the socket. The agent will do the
forwarding between the pipe and the socket.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
There is a race condition in agent HVSOCK_STREAMS hashmap, where a
stream may be taken before it is inserted into the hashmap. This patch
add simple retry logic to the stream consumer to alleviate this issue.

Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
@justxuewei
Copy link
Member

/test

@justxuewei justxuewei merged commit 2332552 into kata-containers:main Feb 1, 2024
281 of 292 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test runtime-rs size/huge Largest and most complex task (probably needs breaking into small pieces) soc Summer of Code related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kata Containers streaming IO
7 participants