laminard crashing #146

Closed
jbglaw opened this issue Mar 5, 2021 · 31 comments

@jbglaw
Contributor

jbglaw commented Mar 5, 2021

Hi!

I'm a novice laminar user and noticed that, for me, laminard crashes quite easily. This is with Debian unstable (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983645), which is quite up-to-date with upstream. However, as this does not seem to be Debian-specific, I came here to discuss it.

Basically, I see that starting a job with lots of output:

root@spock:~# cat /var/lib/laminar/cfg/jobs/lines.run 
#!/usr/bin/env bash

while [ -f /tmp/lines ]; do
        echo "."
done

(with /tmp/lines existing) is enough to trigger it. With such a job running, direct your browser to follow the log (i.e. http://localhost:8080/jobs/lines/4). For me it takes only a second or two to crash.

On a systemd-driven system, laminard will probably be restarted, but the run is lost.

As for the bug, I haven't been able to track it down yet. My guess is that it somehow overflows its read buffer (readDescription()), clobbering the run pointer before it calls http->notifyLog() in laminar.cpp, but I may be wrong about that. I am, however, quite confident that it can easily be triggered with a job emitting lots of output while viewing the log online.

Thanks a lot,
Jan-Benedict Glaw

@jbglaw
Contributor Author

jbglaw commented Mar 5, 2021

Hi!

Commented out the memory checking in kj/memory.h, rebuilt with -g -O0, and got the following (job with heavy log output, being followed in the browser):

(gdb) run -v
Starting program: /tmp/laminar/x/laminard -v
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
laminard version debian/1.0-3 started
rpc.cpp:55: info: RPC queue; jobName = lines
[Detaching after fork from child process 5452]
laminar.cpp:672: info: Started job; run->name = lines; run->build = 15; ctx->name = default

Program received signal SIGPIPE, Broken pipe.
0x00007ffff7698da3 in __GI___writev (fd=13, iov=0x7fffffffcf40, iovcnt=3) at ../sysdeps/unix/sysv/linux/writev.c:26
26      ../sysdeps/unix/sysv/linux/writev.c: No such file or directory.
(gdb) handle SIGPIPE nostop noprint pass
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
Http::notifyLog (this=0x5555556f4a50, job="lines", run=15, log_chunk="", eot=false) at /tmp/laminar/src/http.cpp:292
292                 lw->fulfiller->fulfill(kj::mv(eot));
(gdb) bt
#0  Http::notifyLog (this=0x5555556f4a50, job="lines", run=15, log_chunk="", eot=false) at /tmp/laminar/src/http.cpp:292
#1  0x000055555561b57b in operator() (__closure=0x555555731ad0,
    b=0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"..., n=152) at /tmp/laminar/src/laminar.cpp:659
#2  0x00005555556242e2 in std::__invoke_impl<void, Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)>&, char const*, long unsigned int>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#3  0x0000555555622311 in std::__invoke_r<void, Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)>&, char const*, long unsigned int>(struct {...} &) (__fn=...)
    at /usr/include/c++/10/bits/invoke.h:110
#4  0x000055555561fd24 in std::_Function_handler<void(char const*, long unsigned int), Laminar::tryStartRun(std::shared_ptr<Run>, int)::<lambda(char const*, size_t)> >::_M_invoke(const std::_Any_data &, const char *&&, unsigned long &&) (__functor=...,
    __args#0=@0x7fffffffcf80: 0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"...,
    __args#1=@0x7fffffffcf78: 152) at /usr/include/c++/10/bits/std_function.h:291
#5  0x000055555567986d in std::function<void (char const*, unsigned long)>::operator()(char const*, unsigned long) const (this=0x55555573b588,
    __args#0=0x55555573b880 ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"..., __args#1=152)
    at /usr/include/c++/10/bits/std_function.h:622
#6  0x0000555555674eeb in operator() (__closure=0x55555573b570, sz=152) at /tmp/laminar/src/server.cpp:151
#7  0x0000555555678dd8 in kj::_::MaybeVoidCaller<unsigned long, kj::Promise<void> >::apply<Server::handleFdRead(kj::AsyncInputStream*, char*, std::function<void(char const*, long unsigned int)>)::<lambda(size_t)> >(struct {...} &, unsigned long &&) (func=..., in=@0x7fffffffd1b8: 152) at /usr/include/kj/async-prelude.h:134
#8  0x0000555555677fe0 in kj::_::TransformPromiseNode<kj::Promise<void>, long unsigned int, Server::handleFdRead(kj::AsyncInputStream*, char*, std::function<void(char const*, long unsigned int)>)::<lambda(size_t)>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &) (this=0x55555573b550, output=...) at /usr/include/kj/async-inl.h:401
#9  0x00007ffff7cd1502 in ?? () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#10 0x00007ffff7c4af9b in kj::_::runCatchingExceptions(kj::_::Runnable&) () from /usr/lib/x86_64-linux-gnu/libkj-0.7.0.so
#11 0x00007ffff7cd06fa in kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&) () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#12 0x00007ffff7cd51e9 in kj::_::ChainPromiseNode::fire() () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#13 0x00007ffff7cd0e3c in kj::EventLoop::turn() () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#14 0x00007ffff7cd60c5 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&) () from /usr/lib/x86_64-linux-gnu/libkj-async-0.7.0.so
#15 0x0000555555679294 in kj::Promise<void>::wait (this=0x7fffffffe570, waitScope=...) at /usr/include/kj/async-inl.h:902
#16 0x0000555555674149 in Server::start (this=0x5555556f41e0) at /tmp/laminar/src/server.cpp:56
#17 0x000055555567d120 in main (argc=2, argv=0x7fffffffe778) at /tmp/laminar/src/main.cpp:98
(gdb) print lw
$1 = (LogWatcher *) 0x5555557330d0
(gdb) print *lw
$2 = {job = "lines", run = 15, pendingOutput = std::__cxx11::list = {
    [0] = ".\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n.\n"}, fulfiller = {disposer = 0x0, ptr = 0x0}}
(gdb) print lw->fulfiller
$3 = {disposer = 0x0, ptr = 0x0}

I guess it's the missing disposer? But I'm failing to get an overview of the overall code flow.

@ohwgiles
Owner

ohwgiles commented Mar 6, 2021

Thanks for the detailed report.

Unfortunately I can't reproduce it locally, but I have a potential fix for your last backtrace. Since stream->write() at http.cpp:235 yields back to the event loop, more output could be ready to notify the logWatcher before that write has resolved and the fulfiller has been prepared for the next write, which would explain why lw->fulfiller is uninitialized in your backtrace.

defer-initial-log-notification-until-fulfiller-instantiated.patch.txt
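For illustration only, here is a rough standalone sketch of the kind of guard this implies (the LogWatcher / pendingOutput / fulfiller names are taken from the gdb dump above; everything else is assumed, and this is not the actual patch): the notifier always queues the chunk, and only fulfills if the fulfiller left over from the previous write still exists, instead of dereferencing a null one.

#include <kj/async.h>
#include <kj/memory.h>
#include <list>
#include <string>

struct LogWatcher {
    std::list<std::string> pendingOutput;          // chunks waiting to be written out
    kj::Own<kj::PromiseFulfiller<bool>> fulfiller; // null between a write and its re-arm
};

void notifyLog(LogWatcher& lw, std::string chunk, bool eot) {
    lw.pendingOutput.push_back(std::move(chunk));
    if (lw.fulfiller.get() != nullptr) {
        // the pending chunked write is parked on the paired promise: wake it,
        // consuming the fulfiller so it cannot be fulfilled twice
        auto f = kj::mv(lw.fulfiller);
        f->fulfill(kj::mv(eot));
    }
    // otherwise the chunk simply stays queued in pendingOutput; the writer
    // drains the queue and re-creates the fulfiller once stream->write()
    // has resolved, rather than crashing on a null fulfiller
}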

@jbglaw
Contributor Author

jbglaw commented Mar 6, 2021

Hi!

With this fix in place, laminard no longer crashes. The frontend feels a bit "weird", though: it seems that, while such a high-output run is active, the chunked transfer towards the web client isn't served?

In real life that's probably not a problem, because even for my current jobs (builds of gcc/binutils/glibc) there are periods of high-volume logging (configure running) and of less logging (compilation). So it no longer crashes and seems to work.

@ohwgiles
Owner

ohwgiles commented Mar 7, 2021

That last patch turned out to be wrong: it causes the whole log to be sent to all clients whenever a new client connects. It looks like handling multiple simultaneous log clients was broken in general anyway; hopefully that is fixed in this patch:

logwatcher-ensure-fulfiller-instantiated-before-returning-to-eventloop.patch.txt

Please let me know if this one works for you too.

It seems that, while such a high-output run is active, the chunked transfer towards the web client isn't served?

I can't think of a good reason for that. I could speculate that laminar (which is highly asynchronous, but single-threaded) is too busy reading the output to write it out in time. Perhaps tuning buffer sizes could improve this; I agree it's probably not a real issue.
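To make that speculation concrete, a minimal sketch of the event-loop fairness angle (readLoop and its callback are invented names, not Laminar's code; only handleFdRead's general shape is taken from the backtrace above): a read chain that re-arms itself immediately on an always-readable pipe can monopolize the single-threaded kj event loop, while yielding via kj::evalLater() between chunks gives pending writes towards the web client a chance to run.

#include <kj/async-io.h>
#include <functional>

kj::Promise<void> readLoop(kj::AsyncInputStream& in, char* buf, size_t bufSize,
                           std::function<void(const char*, size_t)> cb) {
    return in.tryRead(buf, 1, bufSize).then([&in, buf, bufSize, cb](size_t n) -> kj::Promise<void> {
        if (n == 0)
            return kj::READY_NOW;                  // EOF: the job's output pipe closed
        cb(buf, n);                                // hand the chunk to the log notifier
        return kj::evalLater([&in, buf, bufSize, cb]() {
            return readLoop(in, buf, bufSize, cb); // yield to the loop, then read again
        });
    });
}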

@jbglaw
Contributor Author

jbglaw commented Mar 7, 2021

While your first patch worked well for me, the second (in #146 (comment)) does not. With it applied, watching a binutils log live on the web frontend, the log gets intermixed with that of the (newly introduced) "admin" job (which generates links and queues all jobs).

@jbglaw
Contributor Author

jbglaw commented Mar 7, 2021

Oh, and it seems I had one crash with the second patch, compared to no crashes with the first one.

@ohwgiles
Owner

ohwgiles commented Mar 8, 2021

Weird, I guess I need to think about this patch some more. Any chance you have a backtrace of the crash?

@jbglaw
Contributor Author

jbglaw commented Mar 8, 2021

Unfortunately, no backtrace. I gave it a try (actually, my base is the Debian source package, so I applied my recursive-artifacts patch and your second no-crash-on-web-watching patch) and had this one crash. Thereafter, I changed it back to your first patch. That test box is off right now, but I'll "revert" to your second patch asap and attach gdb.

@jbglaw
Contributor Author

jbglaw commented Mar 8, 2021

Now running again with your second patch applied, under gdb or strace control.

What I do notice is unresponsiveness in the web frontend: Chromium shows connections as "pending". However, I'm surprised that it isn't in a busy read() loop, but rather waiting most of the time:

read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432740464, u64=93824993321584}}], 16, 5545) = 1
read(22, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 680
read(22, 0x5555556648c0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434834720, u64=93824995415840}}], 16, 5442) = 1
read(29, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 664
read(29, 0x55555585e290, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434817072, u64=93824995398192}}], 16, 5183) = 1
read(27, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(27, 0x555555859cc0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432774176, u64=93824993355296}}], 16, 5049) = 1
read(25, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(25, 0x555555855ab0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434834720, u64=93824995415840}}], 16, 4320) = 1
read(29, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 652
read(29, 0x55555585e290, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432773856, u64=93824993354976}}], 16, 4247) = 1
read(24, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(24, 0x5555558541f0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432738800, u64=93824993319920}}], 16, 4219) = 1
read(20, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 676
read(20, 0x55555565ec50, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432722240, u64=93824993303360}}], 16, 4056) = 1
read(21, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1434809296, u64=93824995390416}}], 16, 4052) = 1
read(26, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 684
read(26, 0x555555857d70, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432722240, u64=93824993303360}}], 16, 3923) = 1
read(21, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(21, 0x555555660c40, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(3, [{EPOLLIN, {u32=1432682816, u64=93824993263936}}], 16, 3454) = 1
read(28, "g++  -fno-PIE -c   -g -O2 -DIN_G"..., 4096) = 668
read(28, 0x55555585bec0, 4096)          = -1 EAGAIN (Resource temporarily unavailable)

However, opening a good number of live logs in several tabs results in "loading nothing" or "just loading the page itself without the log". Once a tab has picked up following the log, it seems to follow it properly the whole time.

Is laminard limiting the number of parallel connections? That's with 16 parallel build jobs (the next box will be quite a bit more capable; I think it'll run at least around 64 parallel builds).

@ohwgiles
Owner

So no crash so far with the second patch?

From the strace output and your description it sounds like only the pipes carrying the job output are being serviced, and the connections from the frontend are neglected. I can't see any reason for this...

Is laminard limiting the number of parallel connections?

Nope. At least not intentionally.

It might be interesting to compile and run against the latest capnproto - Debian is still shipping 0.7.0, even in sid.

@jbglaw
Contributor Author

jbglaw commented Mar 11, 2021

No crash again up to now.

...and as for watching a good number of parallel build logs in the web browser, I think there's a simple explanation: web browsers limit the number of connections (overall and per host), so I guess that limit just applied.

@ohwgiles
Owner

I'll consider the crash fixed in b16991b; please reopen if you can reproduce it.

@jbglaw
Contributor Author

jbglaw commented Apr 17, 2021

Just FTR, I had another unexpected crash/exit with this message in laminar.log:

fatal: ::read(fd, buffer, maxBytes): Connection reset by peer

What is a bit unfortunate is that, right now, it doesn't pick up the queued runs afterwards. Aren't those just in the database? So recovery could be done by pulling outstanding runs from the database?

@jbglaw
Contributor Author

jbglaw commented Apr 21, 2021

Got hit by that again. So I'm now running with tcpdump on port 80 and with GDB attached, breaking on abort, exit and _exit. Let's see what we can catch. (My current setup has about 770 jobs defined, and I don't think I've ever made it through all of them by just queueing them all and waiting for completion. I'm working around that by checking the database for the least-recently-run jobs and starting them. But in the end, I'll of course want to get that up'n'running properly.)

So recovery after crash (re-starting queued jobs) would be high on my wish list right now.

@ohwgiles
Owner

Thanks for your help debugging this; it will be nice to get to the bottom of it.

Regarding restarting queued jobs, I don't particularly like the sound of this. Laminar could be down for quite a while (e.g. a power cut), and trying to restart its queue could just make a big mess if the resources the job scripts expect have gone stale in the meantime. Anyway, it isn't quite as simple as you suggest, because the database does not store enough information to reschedule a queued run.

I would prefer to just fix the bug and leave the existing behaviour with queued jobs. Crashes like the one you describe are, I think, unacceptable.

@jbglaw
Contributor Author

jbglaw commented Apr 25, 2021

I have yet to understand it, but I guess it's a bug in libcapnp. It breaks in src/kj/async-io-unix.c++:301, but I still have to work out how to trigger it.

@jbglaw
Contributor Author

jbglaw commented Apr 25, 2021

My guess is that this is coming from kj's HTTP code, but I don't know who should fix it (cf. capnproto issue 1219).

@jbglaw
Contributor Author

jbglaw commented Apr 28, 2021

I must admit that I'm horribly failing to understand the kj / libcapnp code. The next gdb round is trying to catch this. I really hope to track down these kj-generated exceptions on closed network FDs. Right now, running on the small box, a crash only loses the two running jobs; soon it'll be more like 32 ...

@ohwgiles
Owner

Are you sure the crash is due to that message? I suspect what you are seeing is this, which is not actually fatal despite "fatal" being in the error message...

Anyway, a backtrace would be needed to investigate this further.

@jbglaw
Contributor Author

jbglaw commented May 2, 2021

This time it's not a crash, but rather an exit(1)...

Breakpoint 1, __GI_exit (status=status@entry=1) at exit.c:139
Download failed: Invalid argument. Continuing without source file ./stdlib/exit.c.
139 exit.c: No such file or directory.
(gdb) bt full
#0  __GI_exit (status=status@entry=1) at exit.c:139
No locals.
#1  0x0000560e17f1ae2a in Server::taskFailed (this=<optimized out>, exception=...) at ./src/server.cpp:162
No locals.
#2  0x00007f4031ac5de9 in kj::TaskSet::Task::fire (this=0x560e18c01390) at src/kj/async.c++:188
        e = 0x7ffce92b7fd8
        result = {<kj::_::ExceptionOrValue> = {exception = {ptr = {isSet = true, {value = {ownFile = {content = {ptr = 0x0, size = 0, disposer = 0x0}},
            file = 0x7f4031b1791e "kj/async-io-unix.c++", line = 301, type = kj::Exception::Type::DISCONNECTED, description = {content = {
            ptr = 0x560e19230cf0 "::read(fd, buffer, maxBytes): Connection reset by peer", size = 55, disposer = 0x560e17f57730 <kj::_::HeapArrayDisposer::instance>}},
            context = {ptr = {disposer = 0x0, ptr = 0x0}}, trace = {0x7f4031afa9e0 <kj::(anonymous namespace)::AsyncStreamFd::tryRead(void*, size_t, size_t)+32>,
            0x7f4031b9bd7a <kj::(anonymous namespace)::HttpInputStream::awaitNextMessage()+730>, 0x7f4031be040a <kj::HttpServer::Connection::loop(bool)+1114>,
            0x7f4031ba3c1e <kj::HttpServer::listenHttpCleanDrain(kj::AsyncIoStream&)+318>, 0x7f4031babdf1 <kj::HttpServer::listenHttp(kj::Own<kj::AsyncIoStream>)+49>,
            0x7f4031bc2a53 <kj::_::TransformPromiseNode<kj::Promise<void>, kj::Own<kj::AsyncIoStream>, kj::HttpServer::listenLoop(kj::ConnectionReceiver&)::<lambda(kj::Own<kj::AsyncIoStream>&&)>, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue &)+2051>,
            0x7f4031abe501 <kj::_::RunnableImpl<kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&)::<lambda()> >::run(void)+17>,
            0x7f4031a37f9a <kj::_::runCatchingExceptions(kj::_::Runnable&)+42>, 0x7f4031abd6f9 <kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&)+73>,
            0x7f4031ac21e8 <kj::_::ChainPromiseNode::fire()+88>, 0x7f4031abde3b <kj::EventLoop::turn()+107>,
            0x7f4031ac30c4 <kj::_::waitImpl(kj::Own<kj::_::PromiseNode>&&, kj::_::ExceptionOrValue&, kj::WaitScope&)+148>, 0x560e17f1c53b <Server::start()+267>,
            0x560e17edcf19 <main(int, char**)+585>, 0x7f40313b7d09 <__libc_start_main+233>, 0x560e17edd139 <_start+41>, 0x31323032206c6c61, 0x322032302d35302d, 0x2e36353a38313a31,
            0x205d363733303739, 0x3a5d315b656b616d, 0x676e697661654c20, 0x6f74636572696420, 0x7261762f27207972, 0x6d616c2f62696c2f, 0x6e75722f72616e69, 0x786f6d2d6363672f,
            0x312f666c652d6569, 0x6f682f6363672f32, 0x365f3638782d7473, 0x6e696c2d63702d34, 0x692f756e672d7875}, traceCount = 7}}}}}, value = {ptr = {isSet = false, {value = {}}}}}
        self = {disposer = <optimized out>, ptr = <optimized out>}
#3  0x00007f4031abde3c in kj::EventLoop::turn (this=0x560e18bfff98) at src/kj/async.c++:373
        _kjDefer372 = {func = {__event =
/build/gdb-yyhBNJ/gdb-10.1/gdb/dwarf2/loc.c:2168: internal-error: value* coerce_pieced_ref(const value*): Assertion `closure->pieces.size () == 1' failed.

Unfortunately, no complete stack trace, since GDB had issues with the DWARF data... But maybe you can make something out of it (i.e. an uncaught exception?)

@jbglaw
Contributor Author

jbglaw commented May 2, 2021

That was, by the way, along with a

fatal: ::read(fd, buffer, maxBytes): Connection reset by peer
laminard version 1.0 started

Notice the stray "laminard version 1.0 started" output that follows? Odd, but I haven't looked at the sources.

@ohwgiles
Owner

ohwgiles commented May 2, 2021

I pushed 0281030 to deal with the exit() - I didn't think a dropped HTTP client would bubble up to that exception handler, but otherwise all attached resources should be freed because they are attached to the promise. So I just made the exception handler non-fatal.
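For reference, a minimal sketch of what a non-fatal kj::TaskSet error handler looks like (the class name here is invented and the handler's exact shape is assumed - the real change is whatever 0281030 does; previously server.cpp:162 called exit(1), as the backtrace above shows):

#include <kj/async.h>
#include <kj/debug.h>

class NonFatalErrorHandler final : public kj::TaskSet::ErrorHandler {
public:
    void taskFailed(kj::Exception&& exception) override {
        // log the failed task and carry on, instead of exit(1), so that a
        // dropped HTTP client cannot take the whole daemon down
        KJ_LOG(ERROR, exception);
    }
};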

Notice the stray following output of "laminard version 1.0 started"? Odd, but haven't looked at the sources.

Laminard execs itself (via /proc/self/exe) when starting a job, to act as a leader process for the child job scripts. main.cpp detects this by inspecting argv[0][0]. If you are running under gdb, this will probably fail. But I guess you must have already fixed that (by patching the execl call in src/run.cpp to call the real laminard instead of /proc/self/exe), otherwise no jobs would have been able to run under gdb...

Anyway, that would be the obvious explanation for why you see "laminard version 1.0 started" - main.cpp could not detect that this instance was supposed to be a job leader process.
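In case it helps anyone else debugging this under gdb, the workaround described above looks roughly like this (a sketch with an invented helper and a simplified call; the binary path is the one from the gdb session earlier in this thread):

#include <unistd.h>

// src/run.cpp (sketch): exec a known laminard binary instead of
// /proc/self/exe, so job leader processes still start properly while the
// parent daemon is being run under gdb.
void execLeader(const char* leaderArgv0) {
    // original idea: execl("/proc/self/exe", leaderArgv0, (char*)nullptr);
    execl("/tmp/laminar/x/laminard", leaderArgv0, (char*)nullptr);
}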

@jbglaw
Contributor Author

jbglaw commented May 4, 2021

So first day/night of operation.

Good news: laminard didn't crash/exit.

Bad news: it closed its port 8080 listening socket.

No debugger was attached; this is the log:

rpc.cpp:55: info: RPC queue; jobName = gdb-m68k-netbsdelf
rpc.cpp:55: info: RPC queue; jobName = gdb-sh-superh-elf
rpc.cpp:55: info: RPC queue; jobName = gdb-i686-pc-msdosdjgpp
rpc.cpp:55: info: RPC queue; jobName = gcc-sparc64-sun-solaris2.11OPT-with-gnu-ldOPT-with-gnu-asOPT-enable-threads=posix
rpc.cpp:55: info: RPC queue; jobName = gdb-rx-elf
rpc.cpp:55: info: RPC queue; jobName = netbsd-landisk-sh3el
rpc.cpp:55: info: RPC queue; jobName = gdb-lm32-elf
rpc.cpp:55: info: RPC queue; jobName = netbsd-mipsco-mipseb
rpc.cpp:55: info: RPC queue; jobName = netbsd-evbmips-mips64el
laminar.cpp:716: info: Run completed; r->name = gdb-aarch64-linux-gnu; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = gdb-m32c-rtems; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = gdb-m32c-rtems; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-cats-arm; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-riscv-riscv32; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-sgimips-mipseb; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/compat/http.c++:1653: failed: expected !writeInProgress; concurrent write()s not allowed
stack: 7fb3bc038414 55ba62c2b5dc 7fb3bbf56501 55ba62c24d00 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-cats-arm; to_string(r->result) = failed
laminar.cpp:675: info: Started job; run->name = netbsd-cobalt-mipsel; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
kj/compat/http.c++:4284: error: HttpService threw exception after generating a partial response; too late to report error to client; e = kj/async-io-unix.c++:379: overloaded: ::writev(fd, iov.begin(), iov.size()): Connection timed out
stack: 7fb3bc051dc5 7fb3bbf56501 7fb3bc016d70 55ba62c24ac0 7fb3bc07ad90 7fb3bc068040
laminar.cpp:716: info: Run completed; r->name = netbsd-sgimips-mipseb; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = netbsd-hpcarm-earmv4; run->build = 7; ctx->name = default
run.cpp:77: info: Run destroyed
taskFailed: ::read(fd, buffer, maxBytes): Connection reset by peer
laminar.cpp:716: info: Run completed; r->name = netbsd-cobalt-mipsel; to_string(r->result) = success
laminar.cpp:675: info: Started job; run->name = gdb-mipsisa64r2-sde-elf; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
laminar.cpp:716: info: Run completed; r->name = gdb-mipsisa64r2-sde-elf; to_string(r->result) = aborted
laminar.cpp:675: info: Started job; run->name = gdb-ia64-hp-vms; run->build = 15; ctx->name = default
run.cpp:77: info: Run destroyed
laminar.cpp:716: info: Run completed; r->name = netbsd-hpcarm-earmv4; to_string(r->result) = aborted

@jbglaw
Contributor Author

jbglaw commented May 4, 2021

Got bitten by it again: it doesn't exit() any longer, but upon "taskFailed: ::read(fd, buffer, maxBytes): Connection reset by peer", port 8080 is close()d.

@ohwgiles ohwgiles reopened this May 8, 2021
@ohwgiles
Owner

ohwgiles commented May 8, 2021

Note to self: need to write a test case to reproduce this

@jbglaw
Contributor Author

jbglaw commented May 8, 2021

I may have a slight hint: I think this happens when I have the main page open on my mobile phone (Chrome browser) and just leave the browser and put the phone to sleep (short press on the on/off button). It doesn't seem to happen as long as I haven't touched the laminar server with my mobile phone since restarting it.

@jbglaw
Contributor Author

jbglaw commented May 15, 2021

Yeah, I'm quite certain by now: this exception (formerly leading to an exit, now leading to the port 8080 fd being closed) gets triggered when I keep the Laminar dashboard open and (as is common in Germany) my IPv4 router gets a new external IP address assigned. The former address will either not be handed out again within a short timeframe (which would result in a timeout or a filling send buffer), or it will lead to an RST (the new holder of the IPv4 address doesn't know about the old connection).

@jbglaw
Contributor Author

jbglaw commented May 18, 2021

Tested this over two more nights: it seems to be as I expected. When there's an open connection (event-stream) and my external IP address changes, the server closes its listening port 8080 after some time. So any client can probably cause a denial-of-service situation simply by going offline without shutting down the TCP connection first.
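To make the suspected failure mode concrete (editorial speculation, not a confirmed fix from this thread): the backtraces above go through kj::HttpServer::listenHttp(), so when a stale connection's error escapes, the promise driving the accept loop rejects and the listening socket goes away with it. A sketch along these lines would log the error and re-arm the loop instead (it assumes the server and listener outlive the returned promise):

#include <kj/compat/http.h>
#include <kj/debug.h>

kj::Promise<void> acceptLoop(kj::HttpServer& server, kj::ConnectionReceiver& listener) {
    return server.listenHttp(listener).catch_(
        [&server, &listener](kj::Exception&& e) -> kj::Promise<void> {
            // one broken client connection escaped: log it and keep accepting
            // new connections instead of dropping the port 8080 listener
            KJ_LOG(WARNING, e);
            return acceptLoop(server, listener);
        });
}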

@ohwgiles
Owner

I tried to reproduce this by reducing /proc/sys/net/ipv4/{tcp_keepalive_time,tcp_keepalive_intvl,tcp_keepalive_probes,tcp_retries2}, creating a veth pair on a common subnet, connecting to laminar over it, then bringing it down/up, changing its address etc. Did not reproduce for me... :(

@jbglaw
Contributor Author

jbglaw commented May 19, 2021

My brain really bends looking at the C++'ized approach to the POSIX API. I think my next try will be to find out where the listening socket can be closed and break on all those places (I'm guessing there are actually only one or two). But yeah, it's hard to trigger, yet quite fatal once it has happened.

Just to document it: to not lose the currently running run(s), once I'm locked out I set the default workers to zero through the API and wait for the runs to settle, then restart and re-queue what was lost.

@ohwgiles
Owner

Just to clean things up a bit, I'm closing this issue, since the crash in the initial report was fixed in b16991b. I opened #164 to track the "Connection reset by peer" issue.
