laminard crashing #146

Closed
jbglaw opened this issue Mar 5, 2021 · 31 comments

@jbglaw
Contributor

jbglaw commented Mar 5, 2021

Hi!

I'm a novice laminar user and noticed that, for me, laminard crashes quite easily. This is with Debian unstable (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983645), which is quite up-to-date with upstream. However, as this does not seem to be Debian-specific, I came here to discuss it.

Basically, I see that starting a job with lots of output:

root@spock:~# cat /var/lib/laminar/cfg/jobs/lines.run 
#!/usr/bin/env bash

while [ -f /tmp/lines ]; do
        echo "."
done

(with /tmp/lines existing) is enough to trigger it. With such a job running, direct your browser to follow the log (i.e. http://localhost:8080/jobs/lines/4). For me it takes only a second or two to crash.

On a systemd-driven system, laminard will probably be restarted, but the run is lost.

As for the bug, I haven't been able to track it down yet. My guess is that it somehow overflows its read buffer (readDescription()), clobbering the run pointer before it calls http->notifyLog() in laminar.cpp, but I may be wrong about that. I am, however, quite confident that it can easily be triggered with a job emitting lots of output while viewing the log online.

Thanks a lot,
Jan-Benedict Glaw

@jbglaw
Contributor Author

jbglaw commented Mar 5, 2021

Hi!

Commented out the memory checking in kj/memory.h, rebuilt with -g -O0, and got the following (job with heavy log output, being followed in the browser):

(gdb) run -v
Starting program: /tmp/laminar/x/laminard -v
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
laminard version debian/1.0-3 started
rpc.cpp:55: info: RPC queue; jobName = lines
[Detaching after fork from child process 5452]
laminar.cpp:672: info: Started job; run->name = lines; run->build = 15; ctx->name = default

Program received signal SIGPIPE, Broken pipe.
0x00007ffff7698da3 in __GI___writev (fd=13, iov=0x7fffffffcf40, iovcnt=3) at ../sysdeps/unix/sysv/linux/writev.c:26
26      ../sysdeps/unix/sysv/linux/writev.c: No such file or directory.
(gdb) handle SIGPIPE nostop noprint pass
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
Http::notifyLog (this=0x5555556f4a50, job="lines", run=15, log_chunk="", eot=false) at /tmp/laminar/src/http.cpp:292
292                 lw->fulfiller->fulfill(kj::mv(eot));
(gdb) bt
#0  Http::notifyLog (this=0x5555556f4a50, job="lines", run=15, log_chunk="", eot=false) at /tmp/laminar/src/http.cpp:292
#1  0x000055555561b57b in operator() (__closure=0x555555731ad0,
    b=0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"..., n=152) at /tmp/laminar/src/laminar.cpp:659
#2  0x00005555556242e2 in std::__invoke_impl<void, Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)>&, char const*, long unsigned int>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#3  0x0000555555622311 in std::__invoke_r<void, Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)>&, char const*, long unsigned int>(struct {...} &) (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:110
#4  0x000055555561fd24 in std::_Function_handler<void(char const*, long unsigned int), Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)> >::_M_invoke(const std::_Any_data &, const char *&&, unsigned long &&) (__functor=...,
    __args#0=@0x7fffffffcf80: 0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"...,
    __args#1=@0x7fffffffcf78: 152) at /usr/include/c++/10/bits/std_function.h:291
#5  0x000055555567986d in std::function<void (char const*, unsigned long)>::operator()(char const*, unsigned long) const (this=0x55555573b588,
    __args#0=0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"..., __args#1=152)
    at /usr/include/c++/10/bits/std_function.h:622
#6  0x0000555555674eeb in operator() (__closure=0x55555573b570, sz=152) at /tmp/laminar/src/server.cpp:151
#7  0x0000555555678dd8 in kj::_::MaybeVoidCaller<unsigned long, kj::Promise<void> >::apply<Server::handleFdRead(kj::AsyncInputStream*, char*, std::function<void(char const*, long unsigned int)>)::<lambda(size_t)> >(struct {...} &, unsigned long &&) (func=..., in=@0x7fffffffd1b8: 152) at /usr/include/kj/async-prelude.h:134
#8  0x0000555555677fe0 in kj::_::TransformPromiseNode<kj::Promise<void>, long unsigned int, Server::handleFdRead(kj::AsyncInputStream*, char*, std::function<void(char const*, long unsigned int)>)::<lambda(size_t)>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &) (this=0x55555573b550, output=...) at /usr/include/kj/async-inl.h:401
#9  0x00007ffff7cd1502 in ?? () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#10 0x00007ffff7c4af9b in kj::_::runCatchingExceptions(kj::_::Runnable&) () from /usr/lib/x86_64-linux-gnu/libkj-0.7.0.so
#11 0x00007ffff7cd06fa in kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&) () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#12 0x00007ffff7cd51e9 in kj::_::ChainPromiseNode::fire() () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#13 0x00007ffff7cd0e3c in kj::EventLoop::turn() () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#14 0x00007ffff7cd60c5 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&) () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#15 0x0000555555679294 in kj::Promise<void>::wait (this=0x7fffffffe570, waitScope=...) at /usr/include/kj/async-inl.h:902
#16 0x0000555555674149 in Server::start (this=0x5555556f41e0) at /tmp/laminar/src/server.cpp:56
#17 0x000055555567d120 in main (argc=2, argv=0x7fffffffe778) at /tmp/laminar/src/main.cpp:98
(gdb) print lw
$1 = (LogWatcher *) 0x5555557330d0
(gdb) print *lw
$2 = {job = "lines", run = 15, pendingOutput = std::__cxx11::list = {
    [0] = ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"}, fulfiller = {disposer = 0x0, ptr = 0x0}}
(gdb) print lw->fulfiller
$3 = {disposer = 0x0, ptr = 0x0}

I guess it's the missing disposer? But I'm failing to get an overview of the overall code flow.

@ohwgiles
Owner

ohwgiles commented Mar 6, 2021

Thanks for the detailed report.

Unfortunately I can't reproduce it locally, but I have a potential fix for your last backtrace. Since stream->write() at http.cpp:235 yields back to the event loop, more output could be ready to notify the logWatcher before that write has resolved and the fulfiller has been prepared for the next write, which would explain why lw->fulfiller is uninitialized in your backtrace.

defer-initial-log-notification-until-fulfiller-instantiated.patch.txt
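For illustration only, here is a rough standalone sketch of the kind of guard this implies (the LogWatcher / pendingOutput / fulfiller names are taken from the gdb dump above; everything else is assumed, and this is not the actual patch): the notifier always queues the chunk, and only fulfills if the fulfiller left over from the previous write still exists, instead of dereferencing a null one.

#include <kj/async.h>
#include <kj/memory.h>
#include <list>
#include <string>

struct LogWatcher {
    std::list<std::string> pendingOutput;          // chunks waiting to be written out
    kj::Own<kj::PromiseFulfiller<bool>> fulfiller; // null between a write and its re-arm
};

void notifyLog(LogWatcher& lw, std::string chunk, bool eot) {
    lw.pendingOutput.push_back(std::move(chunk));
    if (lw.fulfiller.get() != nullptr) {
        // the pending chunked write is parked on the paired promise: wake it,
        // consuming the fulfiller so it cannot be fulfilled twice
        auto f = kj::mv(lw.fulfiller);
        f->fulfill(kj::mv(eot));
    }
    // otherwise the chunk simply stays queued in pendingOutput; the writer
    // drains the queue and re-creates the fulfiller once stream->write()
    // has resolved, rather than crashing on a null fulfiller
}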

@jbglaw
Contributor Author

jbglaw commented Mar 6, 2021

Hi!

With this fix in place, laminard no longer crashes. The frontend feels a bit "weird", though: it seems that, while such a high-output run is active, the chunked transfer towards the web client isn't served?

In real life that's probably not a problem, because even for my current jobs (builds of gcc/binutils/glibc) there are periods of high-volume logging (configure running) and of less logging (compilation). So it no longer crashes and seems to work.

@ohwgiles
Owner

ohwgiles commented Mar 7, 2021

That last patch turned out to be wrong: it causes the whole log to be sent to all clients whenever a new client connects. It looks like handling multiple simultaneous log clients was broken in general anyway; hopefully that is fixed in this patch:

logwatcher-ensure-fulfiller-instantiated-before-returning-to-eventloop.patch.txt

Please let me know if this one works for you too.

It seems that, while such a high-output run is active, the chunked transfer towards the web client isn't served?

I can't think of a good reason for that. I could speculate that laminar (which is highly asynchronous, but single-threaded) is too busy reading the output to write it out in time. Perhaps tuning buffer sizes could improve this; I agree it's probably not a real issue.
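To make that speculation concrete, a minimal sketch of the event-loop fairness angle (readLoop and its callback are invented names, not Laminar's code; only handleFdRead's general shape is taken from the backtrace above): a read chain that re-arms itself immediately on an always-readable pipe can monopolize the single-threaded kj event loop, while yielding via kj::evalLater() between chunks gives pending writes towards the web client a chance to run.

#include <kj/async-io.h>
#include <functional>

kj::Promise<void> readLoop(kj::AsyncInputStream& in, char* buf, size_t bufSize,
                           std::function<void(const char*, size_t)> cb) {
    return in.tryRead(buf, 1, bufSize).then([&in, buf, bufSize, cb](size_t n) -> kj::Promise<void> {
        if (n == 0)
            return kj::READY_NOW;                  // EOF: the job's output pipe closed
        cb(buf, n);                                // hand the chunk to the log notifier
        return kj::evalLater([&in, buf, bufSize, cb]() {
            return readLoop(in, buf, bufSize, cb); // yield to the loop, then read again
        });
    });
}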

@jbglaw
Contributor Author

jbglaw commented Mar 7, 2021

While your first patch worked well for me, the second (in #146 (comment)) does not. With it applied, watching a binutils log live on the web frontend, the log gets intermixed with that of the (newly introduced) "admin" job (which generates links and queues all jobs).

@jbglaw
Contributor Author

jbglaw commented Mar 7, 2021

Oh, and it seems I had one crash with the second patch, compared to no crashes with the first one.

@ohwgiles
Owner

ohwgiles commented Mar 8, 2021

Weird, I guess I need to think about this patch some more. Any chance you have a backtrace of the crash?

@jbglaw
Contributor Author

jbglaw commented Mar 8, 2021

Unfortunately, no backtrace. I gave it a try (actually, my base is the Debian source package, so I applied my recursive-artifacts patch and your second no-crash-on-web-watching patch) and had this one crash. Thereafter, I changed it back to your first patch. That test box is off right now, but I'll "revert" to your second patch asap and attach gdb.

@jbglaw
Contributor Author

jbglaw commented Mar 8, 2021

Now running again with your second patch applied, under gdb or strace control.

What I do notice is unresponsiveness in the web frontend: Chromium shows connections as "pending". However, I'm surprised that it isn't in a busy read() loop, but rather waiting most of the time:

read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432740464, u64=93824993321584}}], 16, 5545) = 1
read(22, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 680
read(22, 0x5555556648c0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434834720, u64=93824995415840}}], 16, 5442) = 1
read(29, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 664
read(29, 0x55555585e290, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434817072, u64=93824995398192}}], 16, 5183) = 1
read(27, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(27, 0x555555859cc0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432774176, u64=93824993355296}}], 16, 5049) = 1
read(25, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(25, 0x555555855ab0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434834720, u64=93824995415840}}], 16, 4320) = 1
read(29, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 652
read(29, 0x55555585e290, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432773856, u64=93824993354976}}], 16, 4247) = 1
read(24, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(24, 0x5555558541f0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432738800, u64=93824993319920}}], 16, 4219) = 1
read(20, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 676
read(20, 0x55555565ec50, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432722240, u64=93824993303360}}], 16, 4056) = 1
read(21, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434809296, u64=93824995390416}}], 16, 4052) = 1
read(26, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(26, 0x555555857d70, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432722240, u64=93824993303360}}], 16, 3923) = 1
read(21, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432682816, u64=93824993263936}}], 16, 3454) = 1
read(28, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(28, 0x55555585bec0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)

However, opening a good number of live logs in several tabs results in "loading nothing" or "just loading the page itself without the log". Once a tab has picked up following the log, it seems to follow it properly the whole time.

Is laminard limiting the number of parallel connections? That's with 16 parallel build jobs (the next box will be quite a bit more capable; I think it'll run at least around 64 parallel builds).

@ohwgiles
Owner

So no crash so far with the second patch?

From the strace output and your description it sounds like only the pipes carrying the job output are being serviced, and the connections from the frontend are neglected. I can't see any reason for this...

Is laminard limiting the number of parallel connections?

Nope. At least not intentionally.

It might be interesting to compile and run against the latest capnproto - Debian is still shipping 0.7.0, even in sid.

@jbglaw
Contributor Author

jbglaw commented Mar 11, 2021

No crash again up to now.

...and as for watching a good number of parallel build logs in the web browser, I think there's a simple explanation: web browsers limit the number of connections (overall and per host), so I guess that limit just applied.

@ohwgiles
Owner

I'll consider the crash fixed in b16991b; please reopen if you can reproduce it.

@jbglaw
Contributor Author

jbglaw commented Apr 17, 2021

Just FTR, I had another unexpected crash/exit with this message in laminar.log:

fatal: ::read(fd, buffer, maxBytes): Connection reset by peer

What is a bit unfortunate is that, right now, it doesn't pick up the queued runs afterwards. Aren't those just in the database? So recovery could be done by pulling outstanding runs from the database?

@jbglaw
Contributor Author

jbglaw commented Apr 21, 2021

Got hit by that again. So I'm now running with tcpdump on port 80 and with GDB attached, breaking on abort, exit and _exit. Let's see what we can catch. (My current setup has about 770 jobs defined, and I don't think I've ever made it through all of them by just queueing them all and waiting for completion. I'm working around that by checking the database for the least-recently-run jobs and starting them. But in the end, I'll of course want to get that up'n'running properly.)

So recovery after crash (re-starting queued jobs) would be high on my wish list right now.

@ohwgiles
Owner

Thanks for your help debugging this; it will be nice to get to the bottom of it.

Regarding restarting queued jobs, I don't particularly like the sound of this. Laminar could be down for quite a while (e.g. a power cut), and trying to restart its queue could just make a big mess if the resources the job scripts expect have gone stale in the meantime. Anyway, it isn't quite as simple as you suggest, because the database does not store enough information to reschedule a queued run.

I would prefer to just fix the bug and leave the existing behaviour with queued jobs. Crashes like the one you describe are, I think, unacceptable.

@jbglaw
Contributor Author

jbglaw commented Apr 25, 2021

I have yet to understand it, but I guess it's a bug in libcapnp. It breaks in src/kj/async-io-unix.c++:301, but I still have to work out how to trigger it.

@jbglaw
Contributor Author

jbglaw commented Apr 25, 2021

My guess is that this is coming from kj's HTTP code, but I don't know who should fix it (cf. capnproto issue 1219).

@jbglaw
Contributor Author

jbglaw commented Apr 28, 2021

I must admit that I'm horribly failing to understand the kj / libcapnp code. The next gdb round is trying to catch this. I really hope to track down these kj-generated exceptions on closed network FDs. Right now, running on the small box, a crash only loses the two running jobs; soon it'll be more like 32 ...

@ohwgiles
Owner

Are you sure the crash is due to that message? I suspect what you are seeing is this, which is not actually fatal despite "fatal" being in the error message...

Anyway, a backtrace would be needed to investigate this further.

@jbglaw
Contributor Author

jbglaw commented May 2, 2021

This time it's not a crash, but rather an exit(1)...

Breakpoint 1, __GI_exit (status=status@entry=1) at exit.c:139
Download failed: Invalid argument. Continuing without source file ./stdlib/exit.c.
139 exit.c: No such file or directory.
(gdb) bt full
#0  __GI_exit (status=status@entry=1) at exit.c:139
No locals.
#1  0x0000560e17f1ae2a in Server::taskFailed (this=<optimized out>, exception=...) at ./src/server.cpp:162
No locals.
#2  0x00007f4031ac5de9 in kj::TaskSet::Task::fire (this=0x560e18c01390) at src/kj/async.c++:188
        e = 0x7ffce92b7fd8
        result = {<kj::_::ExceptionOrValue> = {exception = {ptr = {isSet = true, {value = {ownFile = {content = {ptr = 0x0, size = 0, disposer = 0x0}},
            file = 0x7f4031b1791e "kj/async-io-unix.c++", line = 301, type = kj::Exception::Type::DISCONNECTED, description = {content = {
            ptr = 0x560e19230cf0 "::read(fd, buffer, maxBytes): Connection reset by peer", size = 55, disposer = 0x560e17f57730 <kj::_::HeapArrayDisposer::instance>}},
            context = {ptr = {disposer = 0x0, ptr = 0x0}}, trace = {0x7f4031afa9e0 <kj::(anonymous namespace)::AsyncStreamFd::tryRead(void*, size_t, size_t)+32>,
            0x7f4031b9bd7a <kj::(anonymous namespace)::HttpInputStream::awaitNextMessage()+730>, 0x7f4031be040a <kj::HttpServer::Connection::loop(bool)+1114>,
            0x7f4031ba3c1e <kj::HttpServer::listenHttpCleanDrain(kj::AsyncIoStream&)+318>, 0x7f4031babdf1 <kj::HttpServer::listenHttp(kj::Own<kj::AsyncIoStream>)+49>,
            0x7f4031bc2a53 <kj::_::TransformPromiseNode<kj::Promise<void>, kj::Own<kj::AsyncIoStream>, kj::HttpServer::listenLoop(kj::ConnectionReceiver&)::<lambda(kj::Own<kj::AsyncIoStream>&&)>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &)+2051>,
            0x7f4031abe501 <kj::_::RunnableImpl<kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&)::<lambda()> >::run(void)+17>,
            0x7f4031a37f9a <kj::_::runCatchingExceptions(kj::_::Runnable&)+42>, 0x7f4031abd6f9 <kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&)+73>,
            0x7f4031ac21e8 <kj::_::ChainPromiseNode::fire()+88>, 0x7f4031abde3b <kj::EventLoop::turn()+107>,
            0x7f4031ac30c4 <kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&)+148>, 0x560e17f1c53b <Server::start()+267>,
            0x560e17edcf19 <main(int, char**)+585>, 0x7f40313b7d09 <__libc_start_main+233>, 0x560e17edd139 <_start+41>, 0x31323032206c6c61, 0x322032302d35302d, 0x2e36353a38313a31,
            0x205d363733303739, 0x3a5d315b656b616d, 0x676e697661654c20, 0x6f74636572696420, 0x7261762f27207972, 0x6d616c2f62696c2f, 0x6e75722f72616e69, 0x786f6d2d6363672f,
            0x312f666c652d6569, 0x6f682f6363672f32, 0x365f3638782d7473, 0x6e696c2d63702d34, 0x692f756e672d7875}, traceCount = 7}}}}}, value = {ptr = {isSet = false, {value = {}}}}}
        self = {disposer = <optimized out>, ptr = <optimized out>}
#3  0x00007f4031abde3c in kj::EventLoop::turn (this=0x560e18bfff98) at src/kj/async.c++:373
        _kjDefer372 = {func = {__event =
/build/gdb-yyhBNJ/gdb-10.1/gdb/dwarf2/loc.c:2168: internal-error: value* coerce_pieced_ref(const value*): Assertion `closure->pieces.size () == 1' failed.

Unfortunately, no complete stack trace, since GDB had issues with the DWARF data... But maybe you can make something out of it (i.e. an uncaught exception?)

@jbglaw
Contributor Author

jbglaw commented May 2, 2021

That was, by the way, along with a

fatal: ::read(fd, buffer, maxBytes): Connection reset by peer
laminard version 1.0 started

Notice the stray "laminard version 1.0 started" output that follows? Odd, but I haven't looked at the sources.

@ohwgiles
Owner

ohwgiles commented May 2, 2021

I pushed 0281030 to deal with the exit() - I didn't think a dropped HTTP client would bubble up to that exception handler, but otherwise all attached resources should be freed because they are attached to the promise. So I just made the exception handler non-fatal.
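For reference, a minimal sketch of what a non-fatal kj::TaskSet error handler looks like (the class name here is invented and the handler's exact shape is assumed - the real change is whatever 0281030 does; previously server.cpp:162 called exit(1), as the backtrace above shows):

#include <kj/async.h>
#include <kj/debug.h>

class NonFatalErrorHandler final : public kj::TaskSet::ErrorHandler {
public:
    void taskFailed(kj::Exception&& exception) override {
        // log the failed task and carry on, instead of exit(1), so that a
        // dropped HTTP client cannot take the whole daemon down
        KJ_LOG(ERROR, exception);
    }
};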

Notice the stray following output of "laminard version 1.0 started"? Odd, but haven't looked at the sources.

Laminard execs itself (via /proc/self/exe) when starting a job, to act as a leader process for the child job scripts. main.cpp detects this by inspecting argv[0][0]. If you are running under gdb, this will probably fail. But I guess you must have already fixed that (by patching the execl call in src/run.cpp to call the real laminard instead of /proc/self/exe), otherwise no jobs would have been able to run under gdb...

Anyway, that would be the obvious explanation for why you see "laminard version 1.0 started" - main.cpp could not detect that this instance was supposed to be a job leader process.
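In case it helps anyone else debugging this under gdb, the workaround described above looks roughly like this (a sketch with an invented helper and a simplified call; the binary path is the one from the gdb session earlier in this thread):

#include <unistd.h>

// src/run.cpp (sketch): exec a known laminard binary instead of
// /proc/self/exe, so job leader processes still start properly while the
// parent daemon is being run under gdb.
void execLeader(const char* leaderArgv0) {
    // original idea: execl("/proc/self/exe", leaderArgv0, (char*)nullptr);
    execl("/tmp/laminar/x/laminard", leaderArgv0, (char*)nullptr);
}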

@jbglaw
Contributor Author

jbglaw commented May 4, 2021

So first day/night of operation.

Good news: laminard didn't crash/exit.

Bad news: it closed its port 8080 listening socket.

No debugger was attached; this is the log:

rpc.cpp:55: info: RPC queue; jobName = gdb-m68k-netbsdelf
rpc.cpp:55: info: RPC queue; jobName = gdb-sh-superh-elf
rpc.cpp:55: info: RPC queue; jobName = gdb-i686-pc-msdosdjgpp
rpc.cpp:55: info: RPC queue; jobName = gcc-sparc64-sun-solaris2.11OPT-with-gnu-ldOPT-with-gnu-asOPT-enable-threads=posix
rpc.cpp:55: info: RPC queue; jobName = gdb-rx-elf
rpc.cpp:55: info: RPC queue; jobName = netbsd-landisk-sh3el
rpc.cpp:55: info: RPC queue; jobName = gdb-lm32-elf
rpc.cpp:55: info: RPC queue; jobName = netbsd-mipsco-mipseb
rpc.cpp:55: info: RPC queue; jobName = netbsd-evbmips-mips64el
laminar.cpp:716: info: Run completed; r->name = gdb-aarch64-linux-gnu; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = gdb-m32c-rtems; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = gdb-m32c-rtems; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-cats-arm; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-riscv-riscv32; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-sgimips-mipseb; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-cats-arm; to_string(r->result) = failed
laminar.cpp:675: info: Started job; run->name = netbsd-cobalt-mipsel; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/async-io-unix.c++:379: overloaded: ::writev(fd, iov.begin(), iov.size()): Connection timed out
stack: 7fb3bc051dc5 7fb3bbf56501 7fb3bc016d70 55ba62c24ac0 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-sgimips-mipseb; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-hpcarm-earmv4; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
taskFailed: ::read(fd, buffer, maxBytes): Connection reset by peer
laminar.cpp:716: info: Run completed; r->name = netbsd-cobalt-mipsel; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = gdb-mipsisa64r2-sde-elf; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
laminar.cpp:716: info: Run completed; r->name = gdb-mipsisa64r2-sde-elf; to_string(r->result) = aborted
laminar.cpp:675: info: Started job; run->name = gdb-ia64-hp-vms; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
laminar.cpp:716: info: Run completed; r->name = netbsd-hpcarm-earmv4; to_string(r->result) = aborted

@jbglaw
Contributor Author

jbglaw commented May 4, 2021

Got bitten by it again: it doesn't exit() any longer, but upon "taskFailed: ::read(fd, buffer, maxBytes): Connection reset by peer", port 8080 is close()d.

@ohwgiles ohwgiles reopened this May 8, 2021
@ohwgiles
Owner

ohwgiles commented May 8, 2021

Note to self: need to write a test case to reproduce this

@jbglaw
Contributor Author

jbglaw commented May 8, 2021

I may have a slight hint: I think this happens when I have the main page open on my mobile phone (Chrome browser) and just leave the browser and put the phone to sleep (short press on the on/off button). It doesn't seem to happen as long as I haven't touched the laminar server with my mobile phone since restarting it.

@jbglaw
Contributor Author

jbglaw commented May 15, 2021

Yeah, I'm quite certain by now: this exception (formerly leading to an exit, now leading to the port 8080 fd being closed) gets triggered when I keep the Laminar dashboard open and (as is common in Germany) my IPv4 router gets a new external IP address assigned. The former address will either not be handed out again within a short timeframe (which would result in a timeout or a filling send buffer), or it will lead to an RST (the new holder of the IPv4 address doesn't know about the old connection).

@jbglaw
Contributor Author

jbglaw commented May 18, 2021

Tested this over two more nights: it seems to be as I expected. When there's an open connection (event-stream) and my external IP address changes, the server closes its listening port 8080 after some time. So any client can probably cause a denial-of-service situation simply by going offline without shutting down the TCP connection first.
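To make the suspected failure mode concrete (editorial speculation, not a confirmed fix from this thread): the backtraces above go through kj::HttpServer::listenHttp(), so when a stale connection's error escapes, the promise driving the accept loop rejects and the listening socket goes away with it. A sketch along these lines would log the error and re-arm the loop instead (it assumes the server and listener outlive the returned promise):

#include <kj/compat/http.h>
#include <kj/debug.h>

kj::Promise<void> acceptLoop(kj::HttpServer& server, kj::ConnectionReceiver& listener) {
    return server.listenHttp(listener).catch_(
        [&server, &listener](kj::Exception&& e) -> kj::Promise<void> {
            // one broken client connection escaped: log it and keep accepting
            // new connections instead of dropping the port 8080 listener
            KJ_LOG(WARNING, e);
            return acceptLoop(server, listener);
        });
}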

@ohwgiles
Owner

I tried to reproduce this by reducing /proc/sys/net/ipv4/{tcp_keepalive_time,tcp_keepalive_intvl,tcp_keepalive_probes,tcp_retries2}, creating a veth pair on a common subnet, connecting to laminar over it, then bringing it down/up, changing its address etc. Did not reproduce for me... :(

@jbglaw
Contributor Author

jbglaw commented May 19, 2021

My brain really bends looking at the C++'ized approach to the POSIX API. I think my next try will be to find out where the listening socket can be closed and break on all those places (I'm guessing there are actually only one or two). But yeah, it's hard to trigger, yet quite fatal once it has happened.

Just to document it: to not lose the currently running run(s), once I'm locked out I set the default workers to zero through the API and wait for the runs to settle, then restart and re-queue what was lost.

@ohwgiles
Owner

Just to clean things up a bit, I'm closing this issue, since the crash in the initial report was fixed in b16991b. I opened #164 to track the "Connection reset by peer" issue.
