Skip to content

Conversation

@Jongy
Copy link
Contributor

@Jongy Jongy commented Apr 3, 2021

Implements what's been discussed in #397 .

General flow

  • The start operation now accepts a flag fdtransfer, which initializes the FdTransfer class (creates a UDS and waits for a connection).
  • Components in async-profiler can check in runtime whether we're connected with an FdTransfer peer, and if so, can send requests to that peer, instead of doing certain operations on their own.
  • "transferred" operations that we support are: the perf_event_open call, and obtaining a file descriptor of /proc/kallsyms.
  • profiler.sh accepts --fdtransfer which lets it start the fdtransfer program alongside with jattach. The fdtransfer program connects to the target async-profiler and serves all requests it receives, until async-profiler closes the peer socket (upon stop).

Using this, I'm able to profile a low-privileged container, running test/Target.java and started this way: docker run --user 8888:8888 java ... with this command: sudo ./profiler.sh -d 2 -e cpu -f ... -o flamegraph --fdtransfer $(pgrep java) - and I get proper kernel stacks. On my box, perf_event_paranoid = 3 and kptr_restrict = 1.

TODOs

Got some small ones in the code, these are the major ones.

  • Base branch - I'm currently based on 5dd9e86 (v1.8.4), on which should I rebase, master or v2.0? I just saw that v2.0 is merged into master, so I'll rebase on master.
  • Decide on appropriate tests to add. I guess something like smoke-test.sh with fdtransfer, which proves that we get kernel stacks in addition to Java stacks. Added a basic smoke test.
  • I currently ignore PerfEvents::check (that is, I don't check for fdtransfer there). Do you think it's relevant? I implemented this, but IIUC the purpose of the check action, I don't think it's very relevant. I can add it if you think it's useful.
  • Waiting for fdtransfer to exit? Currently it will connect to async-profiler and serve all requests until EOF (which happens upon stop). It works fine with the collect action. If running start directly, fdtransfer might need to be disowned so it continues executing after profiler.sh exits. I run it without waiting for the collect action, because we run stop before exiting; for start & resume, I run it with nohup &.
  • Update the readme & usage - https://github.com/jvm-profiling-tools/async-profiler#profiling-java-in-a-container and all parts that tell you to modify perf_event_paranoid and kptr_restrict, since it's not needed if using fdtransfer.

Error handling:

  • Got some cases where we might have truncated reads/writes, should be replaced with a "read all" / "write all" loops.
  • Use of perror / fprintf(stderr) in async-profiler side - is it legit? async-profiler already writes to the process' stderr, but perhaps not as verbose as I've been here... Post-rebase and 7ec5c19, I'll use the new logger class.
  • In case of errors with fdtransfer, I think the correct behavior should be as if the "original" (non-transferred) operation has failed. Make sure it's the case in all relevant sites.

Notes

  • I made FdTransfer a static class, because it's simpler, and I don't see how we'll ever need more than 1 instance. Plus, in most of async-profiler code we already use such singletons.
  • In general I'm more of a C guy, so the C++ might be a bit too C-ish, I'll be happy to improve any areas that you think are too much C, if you care about it.
  • Upon the first fdtransfer usage in async-profiler, the listener socket remains open (and further invocations of start,fdtransfer will use it). This implicitly "solves" Potential crash when running async-profiler while already active #395 because the listener socket acts as a mutex.
  • I took efforts to write the FdTransfer class in a way it's easier to add new "request types", because I already have some ideas that can be added after this PR:
    • Use it to pass the fd of the output file - this way, the output file can be written "in the host" and not in the mount namespace where async-profiler runs (Async in Docker with outside report file #353)
    • I think it'd be nice to change async-profiler to write all errors & warning to a separate file (instead of the process' stderr). This will be much more convenient when checking for errors. If we make this change - (this change was done on master already) we can, again, pass the fd of this "error file" to async-profiler, so its output is written "in the host".
    • This "protocol" can also be used to stream events (Event streaming #404), or at least, to pass a file descriptor on which you stream the events.

@Jongy
Copy link
Contributor Author

Jongy commented Apr 4, 2021

Rebased on latest master.

@Jongy Jongy force-pushed the perf-events-low-privs branch from 3f5b544 to 3e2c7ce Compare May 18, 2021 00:05
@Jongy
Copy link
Contributor Author

Jongy commented May 18, 2021

Rebased on latest master & squashed all.

@Jongy Jongy force-pushed the perf-events-low-privs branch from 3e2c7ce to 504b54d Compare June 6, 2021 00:19
@Jongy
Copy link
Contributor Author

Jongy commented Jun 6, 2021

Finally had some time to fix the remaining issues here. @apangin , would love your feedback.

What's left:

  • Your opinion about this entire approach & implementation
  • Update the documentation
  • Fix the tests (they fail with E: Package 'default-jdk' has no installation candidate so probably just a rerun, then fix any real issues...)
  • Think of any other, more complex tests to perform; this passes sanity.

@apangin
Copy link
Member

apangin commented Jul 30, 2021

Excuse me for a delay, I finally got a chance to look into this deeper.

First of all, thank you for working on this - it's certainly an interesting feature.
The approach looks fine in general. I have some questions/comments on the implementation though.

  1. I'd like to keep changes in the profiler core as little as possible. From the API point, there shouldn't be difference in calling a real syscall vs. a transfer function, in particular, in return value or error code, so that the caller doesn't need to handle it differently. This also applies to kallsyms parsing - I think the only required difference is in the way to obtain fd, but not in the way how to parse the file.
  2. I wonder why JVM side opens a socket, and the host connects to it, and not vice versa?
  • Logically the JVM side is a client, and the host is a server. The lifecycle management of a socket will be simpler and more intuitive in the profiler agent, isn't it?
  • Security-wise there should be no difference. A host may check SO_PEERCRED to verify the client.
  • The race between starting fd server and attaching to the VM will automatically go away. No need for trying connection in a loop.
  • In future, fd server can be shared between multiple JVM clients (but it's not something that needs to be done right now).
  1. Why not SOCK_DGRAM? The code will benefit from automatic message boundaries.
  2. I'd suggest not to reuse the same fdTransfer.cpp for the agent and the server, that's confisuing. The client and and the server do not share implementation anyway. It's OK to share just a small .h file with common protocol constants, and that's it. You won't need to explain in comments why you can't use Log, etc. Having two files named fdtransfer.cpp is confusing as well.
  3. Please make sure profiler builds on macOS. Linux-specific files should have _linux in the name, with the contents surrounded by the proper #ifdefs.

More comments will follow inline in the diffs.

@Jongy
Copy link
Contributor Author

Jongy commented Jul 31, 2021

Thanks for the review :)

  1. I'd like to keep changes in the profiler core as little as possible. From the API point, there shouldn't be difference in calling a real syscall vs. a transfer function, in particular, in return value or error code, so that the caller doesn't need to handle it differently. This also applies to kallsyms parsing - I think the only required difference is in the way to obtain fd, but not in the way how to parse the file.

Syscall vs transfer - you're correct, I have tried to keep the changes at minimum - the API does remain the same, PerfEvents::createForThread has the same prototype. Did you mean something else?

About kallsyms parsing - existing code uses C++ ifstream and I didn't want to change that. Originally, I tried to obtain an ifstream from a file descriptor, but a quick search proved it not easily possible, so I resorted to use FILE. I can fix both branches of the the code to use FILE to avoid the duplication & reduce code changes, if that's what you're referring to.

  1. I wonder why JVM side opens a socket, and the host connects to it, and not vice versa?
  • Logically the JVM side is a client, and the host is a server. The lifecycle management of a socket will be simpler and more intuitive in the profiler agent, isn't it?
  • Security-wise there should be no difference. A host may check SO_PEERCRED to verify the client.
  • The race between starting fd server and attaching to the VM will automatically go away. No need for trying connection in a loop.
  • In future, fd server can be shared between multiple JVM clients (but it's not something that needs to be done right now).

Hmm, originally I designed the JVM as a server because it made some sense, as the JVM is the one who remains in its namespace, and the transfer program moves namespaces and finds the target. I also had in mind the use of this socket as a mutex preventing multiple APs from running concurrently (mentioned in the "Notes" section). We've since talked about the mutex in other issues and I now know it's not enough anyway :/

Swapping client/server will eliminate the listen loop and add the peer check. I'm not sure if the lifecycle of the socket will change - I will still maintain an open socket throughout profiling (so I can respond to new threads being created). Therefore only the initialization code will change.

Overall I think it's worth it - so I will try to make this change.

  1. Why not SOCK_DGRAM? The code will benefit from automatic message boundaries.

Hmmm good idea, all requests are small enough to fit in a single message of a UDS (IDK what's the limit but I'm certain they fit...)

  1. I'd suggest not to reuse the same fdTransfer.cpp for the agent and the server, that's confisuing. The client and and the server do not share implementation anyway. It's OK to share just a small .h file with common protocol constants, and that's it. You won't need to explain in comments why you can't use Log, etc. Having two files named fdtransfer.cpp is confusing as well.

Nothing is shared indeed, each of the functions is used either in the client or in the server. Okay, I'll split them up and maintain just one header with whatever needs to be shared:

I'll also rename fdtransfer.cpp.

So we'll have:

  • fdtransfer_server_linux.cpp which will have FdTransferServer.
  • fdtransfer_client_linux.cpp which will have FdTransferClient. This will link with the profiler.
  • fdtransfer_prog_linux.cpp which will be the CLI & will link with FdTransferServer.
  1. Please make sure profiler builds on macOS.

Of course, was planning to fix the build & CI in general after getting your general approval about this implementation.

Linux-specific files should have _linux in the name, with the contents surrounded by the proper #ifdefs.

Will do 👍

@apangin
Copy link
Member

apangin commented Aug 1, 2021

PerfEvents::createForThread has the same prototype. Did you mean something else?

Right, this part is fine, except that the caller expects errno in the case of failure.
What do you think about passing the error code via the socket protocol?

I can fix both branches of the the code to use FILE to avoid the duplication & reduce code changes

That's what I meant. Thanks.

I'm not sure if the lifecycle of the socket will change - I will still maintain an open socket throughout profiling (so I can respond to new threads being created). Therefore only the initialization code will change.

Yeah, there will be just connect at the beginning, and close and the end of the profiling session, with no listening socket.

So we'll have:

  • fdtransfer_server_linux.cpp which will have FdTransferServer.
  • fdtransfer_client_linux.cpp which will have FdTransferClient. This will link with the profiler.
  • fdtransfer_prog_linux.cpp which will be the CLI & will link with FdTransferServer.

Naming is the most difficult programming task :) I thought about two files: fdtransfer_client_linux.cpp, which will be a part of the agent, and fdtransfer_server.cpp, which will be a standalone server. As to the latter, if it will be in a separate directory and not included in the build process on macOS, it probably does not need linux suffix/ifdefs.

@Jongy Jongy force-pushed the perf-events-low-privs branch from 504b54d to de86115 Compare August 1, 2021 23:58
@Jongy
Copy link
Contributor Author

Jongy commented Aug 1, 2021

I have attempted to make the changes according to your suggestions:

  • Split the big fdTransfer.cpp to 2 classes in 2 different files. No more shared code between the profiler & jattach - only 1 small header.
  • Swapped server/client: now the profiler connects to fdtransfer.
  • Changed the socket type to SOCK_SEQPACKET (preferred over SOCK_DGRAM; being connection-oriented makes things a bit easier, for example we can now let fdtransfer run indefinitely until the connection ends - with DGRAM we'll need to stop it at some point. Also, it lets me use SO_PEERCRED which is simpler compared to SCM_CREDENTIALS). DGRAM is needed if we plan to share the fdtransfer/fdserver (you can decide on a name and I'll refactor :) between several JVMs, but that will make it more complex - since different JVMs can run in different network & PID namespaces.
  • Renamed the files & added #ifdefs.
  • Better shared kallsyms parsing code.
  • The CI doesn't build for macOS, I see; so I'll try get hold of some macOS box and try building on it later, to make sure build wasn't broken.

Ran another sanity after the changes.

Also, I have rebased on latest master. Here is the result of git range-diff:

  1:  c7ca480 =  1:  dcfaeba jattach: Print specialized message for read(...) == 0
 2:  7d8d99f =  2:  0a55f84 test: Print a message when the test program starts
 3:  eda52c6 =  3:  530aa4a jattach: Extract utility functions to utils.c
 4:  cc15483 =  4:  930159e Refactor enter_mount_ns() into generic enter_ns()
 5:  f0c817d !  5:  8aec6ca Add FdTransfer mechanism
    @@ Makefile: endif
     -all: build build/$(LIB_PROFILER) build/$(JATTACH) build/$(API_JAR) build/$(CONVERTER_JAR)
     +all: build build/$(LIB_PROFILER) build/$(JATTACH) build/$(API_JAR) build/$(CONVERTER_JAR) build/fdtransfer
      
    - release: build $(PACKAGE_NAME).tar.gz
    + release: build $(PACKAGE_NAME).$(PACKAGE_EXT)
      
     @@ Makefile: build/$(LIB_PROFILER_SO): $(SOURCES) $(HEADERS) $(JAVA_HEADERS)
      build/$(JATTACH): src/jattach/jattach.c src/jattach/utils.c
    @@ src/arguments.cpp: Error Arguments::parse(const char* args) {
     
      ## src/arguments.h ##
     @@ src/arguments.h: class Arguments {
    -     bool _threads;
    +     bool _sched;
          int _style;
          CStack _cstack;
     +    bool _fdtransfer;
    @@ src/arguments.h: class Arguments {
          int _jfr_options;
          int _dump_traces;
     @@ src/arguments.h: class Arguments {
    -         _threads(false),
    +         _sched(false),
              _style(0),
              _cstack(CSTACK_DEFAULT),
     +        _fdtransfer(false),
    @@ src/profiler.cpp
      #include "flightRecorder.h"
     +#include "fdTransfer.h"
      #include "frameName.h"
    - #include "log.h"
      #include "os.h"
    + #include "stackFrame.h"
     @@ src/profiler.cpp: Error Profiler::start(Arguments& args, bool reset) {
              return Error("Only JFR output supports multiple events");
          }
 6:  5c8ceb7 =  6:  898cbe6 Add fdtransfer-smoke-test.sh
 7:  58b7e8f =  7:  f004b26 Add & use send_all(), recv_all()
 8:  8021e99 =  8:  d34ecc9 Don't use nsenter since it's missing
 9:  e61263e =  9:  6a9d591 Run fdtransfer for all relevant actions, only in Linux, and wih nohup when needed
10:  585f5ed = 10:  9f770a1 Create kallsyms file in different paths per process, and clean it up
11:  0c4502e = 11:  8f4837c Different error prints for error/truncated sendmsg
12:  3e86891 = 12:  efb36b3 Ensure TID sent in perf_event_open request indeed belongs to the process
13:  ad48cfd = 13:  f6537eb Enter mount NS as well in fdtransfer
14:  ac68a16 = 14:  c827e05 Don't attempt to send fd if it's -1, it fails the sendmsg()
 -:  ------- > 15:  606ae07 Close fds after sending
 -:  ------- > 16:  3055a93 Pass /dev/null fd if kallsyms tmp open fails
 -:  ------- > 17:  d2747b7 Use tgkill to test for pid-tid relation
 -:  ------- > 18:  7d75f23 Remove tgid from get_process_info() - unused
 -:  ------- > 19:  1a87775 Refactor FdTransfer to multiple files & classes, switch to SOCK_SEQPACKET
 -:  ------- > 20:  d14c9b6 Add attribute((format(printf))) to various log functions
 -:  ------- > 21:  04e345e Try better at sharing code in parseKernelSymbols()
 -:  ------- > 22:  52c4fe4 Add SO_PEERCRED verification
15:  504b54d = 23:  de86115 PATCH dump output file in CI

(no diff besides the added commits)

@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

There are some synchronization issues now:

  • The 2nd time I run the profiler & fdtransfer on the same JVM process, I always get [WARN] FdTransferClient connect(): Connection refused. After killing the JVM, I'll get Done serving, had an error in the shell where I ran profiler.sh. Means fdtransfer got stuck somehow...
  • I tried to stress it and randomly got FdTransfer bind(): Address already in use

I tried to add a simple flag file (let fdtransfer flag profiler.sh after it has called bind() and listen()) but things require more handling since fdtransfer moves namespaces. I'll try to implement that with /proc/pid/root. Anyway, I'm not sure it'll solve all the issues.

I think we should add some accept() timeout to fdtransfer as well.

@Jongy Jongy force-pushed the perf-events-low-privs branch from de86115 to cb874bb Compare August 2, 2021 00:10
@Jongy Jongy force-pushed the perf-events-low-privs branch from 9aa3cfe to ed6cdf6 Compare August 2, 2021 00:27
@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

Whoopsie. Saw this just now. I really need to fix my Gmail notifications.

PerfEvents::createForThread has the same prototype. Did you mean something else?

Right, this part is fine, except that the caller expects errno in the case of failure.
What do you think about passing the error code via the socket protocol?

Well, the profiler merely logs the errno after a failed perf_event_open call. PerfEvents::start does have some special handling depending on the errno, but the messages it shows are misleading for fdtransfer? Or maybe not...
In case of fdtransfer, this log will happen in the fdtransfer side. Either way, those messages will be printed via profiler.sh, there will be some difference in the format tho (as one is perror() and one is Log::warn.
I didn't think fixing this difference is important to add new fields to the protocol; if you think it is, then sure, I'll add it.

I can fix both branches of the the code to use FILE to avoid the duplication & reduce code changes

That's what I meant. Thanks.

👍

I'm not sure if the lifecycle of the socket will change - I will still maintain an open socket throughout profiling (so I can respond to new threads being created). Therefore only the initialization code will change.

Yeah, there will be just connect at the beginning, and close and the end of the profiling session, with no listening socket.

👍 that's how it goes now.

So we'll have:

  • fdtransfer_server_linux.cpp which will have FdTransferServer.
  • fdtransfer_client_linux.cpp which will have FdTransferClient. This will link with the profiler.
  • fdtransfer_prog_linux.cpp which will be the CLI & will link with FdTransferServer.

Naming is the most difficult programming task :) I thought about two files: fdtransfer_client_linux.cpp, which will be a part of the agent, and fdtransfer_server.cpp, which will be a standalone server. As to the latter, if it will be in a separate directory and not included in the build process on macOS, it probably does not need linux suffix/ifdefs.

Naming is indeed hard 😉
All right - I'll merge the server-side files. Currently they go to src/jattach. If you prefer e.g src/fdtransfer, I'll move them.

@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

Tested macOS (x64) build & ./profiler.sh PID, passes sanity (on commit 051346a)

@apangin
Copy link
Member

apangin commented Aug 2, 2021

Thank you for addressing my comments; we're approaching the finish :)

Well, the profiler merely logs the errno after a failed perf_event_open call. PerfEvents::start does have some special handling depending on the errno, but the messages it shows are misleading for fdtransfer? Or maybe not...
In case of fdtransfer, this log will happen in the fdtransfer side. Either way, those messages will be printed via profiler.sh, there will be some difference in the format tho (as one is perror() and one is Log::warn.
I didn't think fixing this difference is important to add new fields to the protocol; if you think it is, then sure, I'll add it.

I though about it once again, and now I'm quite sure the error needs to be passed down the connection.

  • There are many reasons why perf_event_open may fail, even under superuser. In this case, the error code should be communicated to the user. Async-profiler already handles errors from the syscall somehow; in order to keep the changes to the core minimal, and to stay consistent, let it similarly handle errors from remote perf_event_open.
  • The output of the agent and the fd server does not necessarily have the same destination. Log::warn can be redirected to a file with the log option. And yes, the format makes difference for tools - after all, profiler.sh is not the only way to run the profiler.

@apangin
Copy link
Member

apangin commented Aug 2, 2021

I realized there are two more problems:

  1. fd transfer does not currently work when the profiler runs as a startup time agent. This can be possibly solved by allowing an optional fdtransfer argument that specifies an alternative socket name (instead of the hard-coded async-profiler-pid). In this scenario, a reusable fd server starts making sense; on the other hand, it can no longer verify the client by pid.
  2. Another problem is that PerfEvents::createForThread can be called concurrently by multiple threads. There is no guarantee that threads requesting a perf fd will receive the responses for their own requests. A possible solution is to include tid in the response, so a client may correctly process a response designated for a different thread.

@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

Thank you for addressing my comments; we're approaching the finish :)

No rush, I'm very pleased with the finish level of AP and I fully understand your desire to bring every detail to perfection.

I though about it once again, and now I'm quite sure the error needs to be passed down the connection.

  • There are many reasons why perf_event_open may fail, even under superuser. In this case, the error code should be communicated to the user. Async-profiler already handles errors from the syscall somehow; in order to keep the changes to the core minimal, and to stay consistent, let it similarly handle errors from remote perf_event_open.
  • The output of the agent and the fd server does not necessarily have the same destination. Log::warn can be redirected to a file with the log option. And yes, the format makes difference for tools - after all, profiler.sh is not the only way to run the profiler.

Gotcha. I agree that it will work better for non-profiler.sh drivers; and also it's best for the future - if any code later builds on the return values from perf_event_open, it's best if fdtransfer behaves the same.

I realized there are two more problems:

  1. fd transfer does not currently work when the profiler runs as a startup time agent. This can be possibly solved by allowing an optional fdtransfer argument that specifies an alternative socket name (instead of the hard-coded async-profiler-pid). In this scenario, a reusable fd server starts making sense; on the other hand, it can no longer verify the client by pid.

Hmm. I didn't have startup-time support in mind. I can run fdtransfer with pid == 0 which will indicate no PID verification, and a constant socket path which can be passed to the agent. Still remains the issue of different net namespaces, though - if the spawning JVM runs in a new container, we can't possibly have it connect to fdtransfer because it's containerized (here, fdtransfer acting as a client could work - it could iterate all running JVMs every X seconds, and attempt to connect to newly started ones, whichever namespaces they're running in. Or alternatively, fdtransfer could identify the new JVMs and fork itself into their namespace, providing a server there. This introduces new sync issues as the agent will now need to give fdtransfer some time to do that).

Do we want to support it in this PR? We can add partial support, perhaps - only for JVMs running in the host (or, in the same network namespace)

  1. Another problem is that PerfEvents::createForThread can be called concurrently by multiple threads. There is no guarantee that threads requesting a perf fd will receive the responses for their own requests. A possible solution is to include tid in the response, so a client may correctly process a response designated for a different thread.

That's true. I previously added minimal protection against incorrect fd transfer, in the form of "request ID". In case requests do get mixed up, we'll get explicit error.
Possible solutions for that:

  • Your suggestion here
  • Add locking around requests - to ensure order. This adds even more blocking code to our ThreadStart JNI call, which is not optimal, I assume...
  • Add simple queue management - so threads can read responses of other threads and queue them. This makes synchronization more complex (now that we block not only in recvmsg but also in that queue, we'll need select/poll to sync with that other queue).

Looking in PerfEvents::createForThread it seems alright to return the response designated for another thread, if I also return the tid and update the local variable. The only use of tid before creating the perf fd is with the tid >= _max_events check, which was done by another thread for the "different" tid, should we get a mismatch. So yeah that's alright, will change the code accordingly.

@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

I briefly tested the errno passing with this patch:

diff --git a/src/perfEvents_linux.cpp b/src/perfEvents_linux.cpp
index 20fbbf3..4620659 100644
--- a/src/perfEvents_linux.cpp
+++ b/src/perfEvents_linux.cpp
@@ -548,6 +548,7 @@ int PerfEvents::createForThread(pid_t tid) {
     int fd;
 
     if (FdTransferClient::hasPeer()) {
+        attr.size = 12;
         fd = FdTransferClient::requestPerfFd(&tid, &attr);
     } else {
         fd = syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);

Profiler log, as expected:

[WARN] Install JVM debug symbols to improve profile accuracy
[WARN] perf_event_open for TID 40 failed: Argument list too long
[WARN] perf_event_open for TID 41 failed: Argument list too long
[WARN] perf_event_open for TID 42 failed: Argument list too long
[WARN] perf_event_open for TID 43 failed: Argument list too long
....

@Jongy
Copy link
Contributor Author

Jongy commented Aug 2, 2021

There are some synchronization issues now:

  • The 2nd time I run the profiler & fdtransfer on the same JVM process, I always get [WARN] FdTransferClient connect(): Connection refused. After killing the JVM, I'll get Done serving, had an error in the shell where I ran profiler.sh. Means fdtransfer got stuck somehow...
  • I tried to stress it and randomly got FdTransfer bind(): Address already in use

I tried to add a simple flag file (let fdtransfer flag profiler.sh after it has called bind() and listen()) but things require more handling since fdtransfer moves namespaces. I'll try to implement that with /proc/pid/root. Anyway, I'm not sure it'll solve all the issues.

I think we should add some accept() timeout to fdtransfer as well.

Okay, I figured it out, it's funny.

  • Profiling a fresh process where AP isn't loaded - the first invocation takes a while, and fdtransfer finishes its bootstrap before AP attempts to connect.
  • After AP was loaded, subsequent attaches with the start command are faster, making AP always win the race. So on the 2nd invocation, AP beats fdtransfer, and we get [WARN] FdTransferClient connect(): Connection refused.
  • After that, fdtransfer remains running because it's still waiting for a connection.. So on the next profiling run, AP will connect to the old fdtransfer, and the transactions will finish before the new one attempts to bind()! (or at least, the listener socket will be closed before that, making the address available for the next fdtransfer).

I'll think of a synchronization mechanism, either something between profiler.sh and fdtransfer; or simply add some retries in FdTransferClient?

@Jongy
Copy link
Contributor Author

Jongy commented Aug 3, 2021

I'll think of a synchronization mechanism, either something between profiler.sh and fdtransfer; or simply add some retries in FdTransferClient?

I went with the simplest solution, which is retries. I don't like it much because it does add a bit of unnecessary delay, and because in extreme cases it can also fail (total timeout is 500ms).
It does solve the behavior I exhibited, tho. I don't encounter those races anymore.

I thought for a while about a more accurate scheme. The thing is, I don't want to add anything in profiler.sh itself, because that means other drivers will also have to implement it. Ideas I had in mind:

  • fork() and let the parent die after we create the socket - this way we can run fdtransfer in blocking mode, and once it returns we know it's ready. That's not nice tho, because the process "escapes" its controlling driver.
  • Pass some fd to fdtransfer and let it notify via that fd. For example, a write-side of a pipe. Then we wait in the read side. Once ready, it closes the write side and our reader unblocks.
  • ^^ same thing but with a named pipe
  • "status file" that fdtransfer can create? (although, it runs in the target's NS so it makes things complicated.... Or we can first enter the net NS, bind the socket, create the file and then move to the mount NS).

@apangin do you have any idea here? I might try the pipe one meanwhile.

@apangin
Copy link
Member

apangin commented Aug 5, 2021

Do we want to support it in this PR? We can add partial support, perhaps - only for JVMs running in the host (or, in the same network namespace)

Namespaces make this feature non-trivial indeed. But we can implement it just for a simple yet intuitive case: only if the socket is bind-mounted in the container explicitly (or if a container shares the host's namespace, of course).
Anyway, it's OK to skip this feature for now.

do you have any idea here? I might try the pipe one meanwhile.

I thought about running fdtransfer as a daemon service, which, when started, returns control as soon as the socket is ready. Basically, the same as your first idea. profiler.sh already starts the server detached, i.e. resigns control over it - in this sense, a daemon will not be much worse. There can be even a user option to start the service interactively, like other services have, but I don't think it's really necessary.

Jongy added 3 commits August 12, 2021 03:28
* Fork after binding, so caller can continue once we exit & there's no race with the profiler agent
* Allow binding on a path
* Accept fdtransfer=/path/to/socket in the agent, so it can connect to a pre-running fdtransfer
@Jongy
Copy link
Contributor Author

Jongy commented Aug 12, 2021

I've recently committed jattach refactoring related to OpenJ9, can you please rebase your PR on top of that? I hope there will not be many conflicts - I've retained a separate utils file with the same functions as here.

Sure, done. Here's the git range-diff for reference. I removed all commits that had anything to do with the jattach utils as I see you've indeed done the same. I just had to fix the API and a bunch of sites since return values meanings were swapped. Also, I didn't fix all commits along the way (that is, after the rebase I added some fix commits, so intermediate commits may not compile, ... I guess we will squash the PR anyway when merging)

 1:  dcfaeba <  -:  ------- jattach: Print specialized message for read(...) == 0
 -:  ------- >  1:  4a77d68 Sync with jattach 2.0: psutil interface, better container support, Open9 VMs
 2:  0a55f84 =  2:  4bcdaeb test: Print a message when the test program starts
 3:  530aa4a <  -:  ------- jattach: Extract utility functions to utils.c
 4:  930159e <  -:  ------- Refactor enter_mount_ns() into generic enter_ns()
 5:  8aec6ca !  3:  727141a Add FdTransfer mechanism
    @@ Makefile: endif
      release: build $(PACKAGE_NAME).$(PACKAGE_EXT)
      
     @@ Makefile: build/$(LIB_PROFILER_SO): $(SOURCES) $(HEADERS) $(JAVA_HEADERS)
    - build/$(JATTACH): src/jattach/jattach.c src/jattach/utils.c
    - 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ $^
    + build/$(JATTACH): src/jattach/*.c src/jattach/*.h
    + 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ src/jattach/*.c
      
     +build/fdtransfer: src/fdTransfer.cpp src/log.cpp src/jattach/fdtransfer.cpp src/jattach/utils.c
     +	$(CXX) $(CFLAGS) -o $@ $^
 6:  898cbe6 =  4:  49c4c05 Add fdtransfer-smoke-test.sh
 7:  f004b26 =  5:  69eddf0 Add & use send_all(), recv_all()
 8:  d34ecc9 =  6:  4cdd580 Don't use nsenter since it's missing
 9:  6a9d591 =  7:  a7404e8 Run fdtransfer for all relevant actions, only in Linux, and wih nohup when needed
10:  9f770a1 =  8:  373cbd1 Create kallsyms file in different paths per process, and clean it up
11:  8f4837c =  9:  6fbb0c3 Different error prints for error/truncated sendmsg
12:  efb36b3 ! 10:  8e48643 Ensure TID sent in perf_event_open request indeed belongs to the process
    @@ src/fdTransfer.h: class FdTransfer {
          static int requestKallsymsFd();
     
      ## src/jattach/fdtransfer.cpp ##
    -@@ src/jattach/fdtransfer.cpp: int main(int argc, const char *argv[]) {
    -     pid_t nspid = -1;
    -     uid_t _target_uid;
    -     gid_t _target_gid;
    --    if (!get_process_info(pid, &_target_uid, &_target_gid, &nspid)) {
    -+    pid_t _tgid;
    -+    if (!get_process_info(pid, &_target_uid, &_target_gid, &nspid, &_tgid)) {
    -         fprintf(stderr, "Process %d not found\n", pid);
    -         return 1;
    -     }
     @@ src/jattach/fdtransfer.cpp: int main(int argc, const char *argv[]) {
          if (0 == fork()) {
              // do the actual work
    @@ src/jattach/fdtransfer.cpp: int main(int argc, const char *argv[]) {
              } else {
                  _exit(1);
              }
    -
    - ## src/jattach/jattach.c ##
    -@@ src/jattach/jattach.c: int main(int argc, char** argv) {
    -     uid_t target_uid = my_uid;
    -     gid_t target_gid = my_gid;
    -     int nspid = -1;
    --    if (!get_process_info(pid, &target_uid, &target_gid, &nspid)) {
    -+    int _tgid;
    -+    if (!get_process_info(pid, &target_uid, &target_gid, &nspid, &_tgid)) {
    -         fprintf(stderr, "Process %d not found\n", pid);
    -         return 1;
    -     }
    -
    - ## src/jattach/utils.c ##
    -@@ src/jattach/utils.c: char tmp_path[TMP_PATH];
    - 
    - #ifdef __linux__
    - 
    --int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -+int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid, int *tgid) {
    -     // Parse /proc/pid/status to find process credentials
    -     char path[64];
    -     snprintf(path, sizeof(path), "/proc/%d/status", pid);
    -@@ src/jattach/utils.c: int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -         } else if (strncmp(line, "NStgid:", 7) == 0) {
    -             // PID namespaces can be nested; the last one is the innermost one
    -             *nspid = atoi(strrchr(line, '\t'));
    -+        } else if (strncmp(line, "Tgid:\t", 6) == 0) {
    -+            *tgid = atoi(line + 6);
    -         }
    -     }
    - 
    -@@ src/jattach/utils.c: int alt_lookup_nspid(int pid) {
    - 
    - #elif defined(__APPLE__)
    - 
    --int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -+int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid, int* tgid) {
    -     int mib[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, pid};
    -     struct kinfo_proc info;
    -     size_t len = sizeof(info);
    -@@ src/jattach/utils.c: int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -     *uid = info.kp_eproc.e_ucred.cr_uid;
    -     *gid = info.kp_eproc.e_ucred.cr_gid;
    -     *nspid = pid;
    -+    *tgid = pid;
    -     return 1;
    - }
    - 
    -@@ src/jattach/utils.c: int alt_lookup_nspid(int pid) {
    - 
    - #else // __FreeBSD__
    - 
    --int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -+int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid, int *tgid) {
    -     int mib[4] = {CTL_KERN, KERN_PROC, KERN_PROC_PID, pid};
    -     struct kinfo_proc info;
    -     size_t len = sizeof(info);
    -@@ src/jattach/utils.c: int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid) {
    -     *uid = info.ki_uid;
    -     *gid = info.ki_groups[0];
    -     *nspid = pid;
    -+    *tgid = pid;
    -     return 1;
    - }
    - 
    -
    - ## src/jattach/utils.h ##
    -@@
    - 
    - extern char tmp_path[TMP_PATH];
    - 
    --int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid);
    -+int get_process_info(int pid, uid_t* uid, gid_t* gid, int* nspid, int *tgid);
    - int get_tmp_path(int pid);
    - int enter_ns(int pid, const char *type);
    - int alt_lookup_nspid(int pid);
13:  f6537eb = 11:  ac2a6d4 Enter mount NS as well in fdtransfer
14:  c827e05 = 12:  ba52737 Don't attempt to send fd if it's -1, it fails the sendmsg()
15:  606ae07 = 13:  dc0d59c Close fds after sending
16:  3055a93 = 14:  21c0691 Pass /dev/null fd if kallsyms tmp open fails
17:  d2747b7 = 15:  b8a7d27 Use tgkill to test for pid-tid relation
18:  7d75f23 <  -:  ------- Remove tgid from get_process_info() - unused
19:  1a87775 ! 16:  eca9890 Refactor FdTransfer to multiple files & classes, switch to SOCK_SEQPACKET
    @@ Commit message
     
      ## Makefile ##
     @@ Makefile: build/$(LIB_PROFILER_SO): $(SOURCES) $(HEADERS) $(JAVA_HEADERS)
    - build/$(JATTACH): src/jattach/jattach.c src/jattach/utils.c
    - 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ $^
    + build/$(JATTACH): src/jattach/*.c src/jattach/*.h
    + 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ src/jattach/*.c
      
     -build/fdtransfer: src/fdTransfer.cpp src/log.cpp src/jattach/fdtransfer.cpp src/jattach/utils.c
     +build/fdtransfer: src/jattach/fdTransfer_server_linux.cpp src/jattach/fdtransfer_prog_linux.cpp src/jattach/utils.c
20:  d14c9b6 = 17:  7c77436 Add attribute((format(printf))) to various log functions
21:  04e345e = 18:  b3f30d9 Try better at sharing code in parseKernelSymbols()
22:  52c4fe4 = 19:  d6035eb Add SO_PEERCRED verification
23:  e6bb512 <  -:  ------- Header fixes
 -:  ------- > 20:  5ef0506 Header fixes
24:  f7493f2 = 21:  1e83ae6 Remove old decl
25:  ed6cdf6 = 22:  cb5b6de Add sudo in tests (code runs as uid=1001)
26:  aaf87c9 = 23:  17b7131 Update copyrights
27:  051346a ! 24:  0d4a0d4 Merge fdtransfer server-side files into fdtransfer_server.cpp
    @@ Makefile: endif
      release: build $(PACKAGE_NAME).$(PACKAGE_EXT)
      
     @@ Makefile: build/$(LIB_PROFILER_SO): $(SOURCES) $(HEADERS) $(JAVA_HEADERS)
    - build/$(JATTACH): src/jattach/jattach.c src/jattach/utils.c
    - 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ $^
    + build/$(JATTACH): src/jattach/*.c src/jattach/*.h
    + 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ src/jattach/*.c
      
     -build/fdtransfer: src/jattach/fdTransfer_server_linux.cpp src/jattach/fdtransfer_prog_linux.cpp src/jattach/utils.c
     +build/fdtransfer: src/jattach/fdtransfer_server.cpp src/jattach/utils.c
28:  ed8702e = 25:  665e516 Protocol improvements
29:  1224748 = 26:  9237933 Add simple retries to combat races
30:  18243c1 = 27:  0a0b180 Revert "Add simple retries to combat races"
31:  ecb1950 ! 28:  c260d28 FdTransfer server improvements
    @@ Commit message
     
      ## Makefile ##
     @@ Makefile: build/$(LIB_PROFILER_SO): $(SOURCES) $(HEADERS) $(JAVA_HEADERS)
    - build/$(JATTACH): src/jattach/jattach.c src/jattach/utils.c
    - 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ $^
    + build/$(JATTACH): src/jattach/*.c src/jattach/*.h
    + 	$(CC) $(CFLAGS) -DJATTACH_VERSION=\"$(PROFILER_VERSION)-ap\" -o $@ src/jattach/*.c
      
     -build/fdtransfer: src/jattach/fdtransfer_server.cpp src/jattach/utils.c
     +build/fdtransfer: src/jattach/fdtransfer_server.cpp src/fdTransfer_shared_linux.h src/jattach/utils.c
 -:  ------- > 29:  c0b2a06 Fix copyright
 -:  ------- > 30:  e19996a Post rebase fixes

Jongy added 3 commits August 12, 2021 03:49
I retained the same message - because it's the same error, although in the fdtransfer
case it has happened in another process.
Combine SA_NOCLDWAIT with exiting the peer PID NS after the single fork().
@Jongy Jongy force-pushed the perf-events-low-privs branch from ecb1950 to 4fd5f23 Compare August 12, 2021 01:16

static inline bool socketPath(const char *path, struct sockaddr_un *sun, socklen_t *addrlen) {
const int path_len = strlen(path);
if (path_len > sizeof(sun->sun_path) + 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, path_len + 1 > sizeof(sun->sun_path) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the rationale was to save space for the null terminator, so yeah. But anyway the null terminator is not required for the kernel, since length is taken per the addrlen. So I will remove the + 1 and change it to memcpy.

CStack PerfEvents::_cstack;

int PerfEvents::createForThread(int tid) {
int PerfEvents::createForThread(pid_t tid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's int in the header file; I'd change type in either place for consistency.

Copy link
Contributor Author

@Jongy Jongy Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicer as pid_t, I suppose. So I'll make use of this option and change it everywhere :)

EDIT: oof, there are really many. I don't want to mix this unrelated change into the PR. Reverting to int.

@apangin
Copy link
Member

apangin commented Aug 17, 2021

Thank you for your hard work on the fixes. I think everything is conceptually good now.
I'd only move fdTransfer_server (and probably fdTransfer_shared) to a separate directory out of jattach, because current jattach contents is an exact copy of the posix directory from jattach project, so I can probably make it a git submodule.

@Jongy
Copy link
Contributor Author

Jongy commented Aug 18, 2021

Made all discussed changes.

I checked again both modes of execution (./profiler.sh collect on a PID, and -agentpath; oh, also checked ./profiler.sh start then stop) - all pass. @apangin , I noticed the tests are not running on this branch anymore? (we should run them again since I made some changes in them, so let's ensure they pass).

I also discovered one edge case now, which resulted in unwanted behavior: if we get [ERROR] Profiler already started, then the agent never connects to fdtransfer, and the latter remains running forever. So on the next run, after issuing ./profiler.sh stop and start, we get FdTransfer bind(): Address already in use.

We can fix it either by remembering to kill fdtransfer once profiler.sh exits (unless it was invoked in start/resume modes). Or we can place some arbitrary timeout on it, so in case Profiler already started happens, the situation will be resolved by itself within a few seconds. I think this is much simpler than the mentioned profiler.sh changes.

@Jongy
Copy link
Contributor Author

Jongy commented Aug 18, 2021

I noticed the tests are not running on this branch anymore? (we should run them again since I made some changes in them, so let's ensure they pass).

Oh nvm, the action name has changed but I see they do run.

@apangin
Copy link
Member

apangin commented Sep 1, 2021

I have no more substantial comments, I think the feature is ready for merge.
Would you mind updating the readme paragraph about profiling in containers? Thanks.

@Jongy
Copy link
Contributor Author

Jongy commented Sep 4, 2021

I have no more substantial comments, I think the feature is ready for merge.
Would you mind updating the readme paragraph about profiling in containers? Thanks.

@apangin done. I have added --fdtransfer to the list of options, and referenced it in the paragraph about containers. Should we also drop a few words in https://github.com/jvm-profiling-tools/async-profiler#basic-usage? that sections tells you to modify kptr_restrict & perf_event_paranoid, which is also not needed anymore if using fdtransfer.

@apangin apangin merged commit fc3b1ca into async-profiler:master Sep 5, 2021
@apangin
Copy link
Member

apangin commented Sep 5, 2021

Thank you so much, and congratulations on the largest external contribution to async-profiler so far!

@apangin
Copy link
Member

apangin commented Sep 5, 2021

that sections tells you to modify kptr_restrict & perf_event_paranoid, which is also not needed anymore if using fdtransfer.

Well, it says about non-privileged user, but fdtransfer needs to be run by root anyway. So I thinks it's OK to leave it as is, unless you know how to put it in a clear and concise way.

@Jongy
Copy link
Contributor Author

Jongy commented Sep 6, 2021

Thank you so much, and congratulations on the largest external contribution to async-profiler so far!

Thanks! It was a pleasure.

Well, it says about non-privileged user, but fdtransfer needs to be run by root anyway. So I thinks it's OK to leave it as is, unless you know how to put it in a clear and concise way.

Maybe:

diff --git a/README.md b/README.md
index ed67f23..8be54e0 100644
--- a/README.md
+++ b/README.md
@@ -175,6 +175,9 @@ sysctl or as follows:
 # sysctl kernel.kptr_restrict=0
 
+If you run the profiler as root, you can use `--fdtransfer` without modifying those
+sysctls.
+
 To run the agent and pass commands to it, the helper script `profiler.sh`
 is provided. A typical workflow would be to launch your Java application,
 attach the agent and start profiling, exercise your performance scenario, and

Anyway I agree that it doesn't matter much, a careful reader who encounters this problem will eventually find --fdtransfer mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants