Getting nil response on MSGPACK API after connecting multiple times #1063

Closed
myitcv opened this Issue Aug 13, 2014 · 29 comments

Projects

None yet

6 participants

@myitcv
Neovim member

Using the neovim Go package, I have a very simple test that exercises a couple of the API methods:

$ NEOVIM_LISTEN_ADDRESS=/tmp/neovim go test
OK: 7 passed
PASS
ok      github.com/myitcv/neovim        0.029s

I launch Neovim manually for now - hence a test run assumes there is an instance available at NEOVIM_LISTEN_ADDRESS=/tmp/neovim

The test is very basic for now:

  • Calls to vim_get_buffers, vim_get_current_buffer, buffer_get_length, buffer_set_line
  • 100 concurrent calls to vim_get_buffers
  • A test of notifications, from subscription to running the command to send_event and the handling of the notification

If however I run this same test a number of times in succession, the test starts to fail:

while true
do 
NEOVIM_LISTEN_ADDRESS=/tmp/neovim go test
sleep 1
done

Around the ~110th iteration I start to see (variously):

  • "Connection reset by peer"
  • A nil message being returned
  • Broken pipe (when trying to make my first call)

After the first failure in this loop, things never recover.

And yet the Neovim process remains totally responsive throughout.

Given that each iteration starts a new go test process I'm minded to think my code is probably not at fault. I don't rule out the possibility that my Go client is closing/cleaning up properly because:

  • I don't unsubscribe from my topic of interest
  • My client code is probably not closing the pipe cleanly

That said, I'm guessing Neovim should handle bad clients more cleanly.

Any thoughts on how I can help debug this further from the Neovim side?
Dump the process in some way?


Want to back this issue? Place a bounty on it! We accept bounties via Bountysource.

@splinterofchaos
Neovim member

I've also encountered this issue by calling getch() and sending input to vim_eval("feedkeys(...)") (a quick-fix because I couldn't get vim_feedkeys to work). When I press and hold a key long enough, my client becomes unresponsive, but I can switch to vim and continue without problems. I'll even reconnect and try again without fail until I do it again.

My client can receive messages in any order, but will wait infinitely if a reply never comes, and doesn't resend messages. If the channel had closed, read() would return zero instead of hanging, so I wonder if it has to do with either dropped messages or the server waiting for the rest of a message that's already been sent.

I have found that either many small messages or a few large ones will reproduce this. (Try requesting the API data 10 times in a row. Stops at 9, for me.) Maybe it has to do with per-channel memory allocation.

I don't unsubscribe from my topic of interest

This happens for me without subscribing to anything.

Any thoughts on how I can help debug this further from the Neovim side?
Dump the process in some way?

ditto

Update: I changed it to not wait on messages so my client didn't become unresponsive. Now when I close the client, I see a segfault from msgpack in my stack trace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6dc5700 (LWP 32030)]
call_finalizer_array (fa=<optimized out>, fa=<optimized out>) at zone.c:109
109         (*(fin-1)->func)((fin-1)->data);
(gdb) where
#0  call_finalizer_array (fa=<optimized out>, fa=<optimized out>) at zone.c:109
#1  destroy_finalizer_array (fa=0x7ffff0011118) at zone.c:115
#2  msgpack_zone_destroy (zone=zone@entry=0x7ffff0011100) at zone.c:172
#3  0x00007ffff7bd747e in msgpack_zone_free (zone=0x7ffff0011100) at zone.c:218
#4  0x00007ffff7bd388d in msgpack_unpacker_destroy (mpac=0x7ffff6dc4e80) at unpack.c:234
#5  0x000000000042af08 in msgpack::unpacker::~unpacker (this=0x7ffff6dc4e80, __in_chrg=<optimized out>) at /usr/local/include/msgpack/unpack.hpp:194
#6  0x000000000042a870 in NeoServer::listen (pthis=0x7fffffffd910) at /home/admin/src/neovim-cursed/src/NeoServer.cpp:273
#7  0x00007ffff79bc182 in start_thread (arg=0x7ffff6dc5700) at pthread_create.c:312
#8  0x00007ffff71c730d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Since you're seeing "Connection reset by peer" and "nil" responses, maybe we're dealing with two separate issues...

@justinmk justinmk added the bug label Aug 15, 2014
@justinmk
Neovim member

I wonder if these randomly-failing tests are related to this issue? Search for "error" in any of these logs:

  1. https://s3.amazonaws.com/archive.travis-ci.org/jobs/32502037/log.txt
  2. https://s3.amazonaws.com/archive.travis-ci.org/jobs/32567494/log.txt

In both cases, self becomes None starting with test_events.test_receiving_events and many of the following tests fail after that (but eventually it recovers). But since the test order is random, this starts at different times in the two logs. These logs are both from yesterday.

ping @tarruda

update: happened again with 3d9a0cc ; restarting the build made the problem go away (which also overwrote the build log, unfortunately)

@justinmk
Neovim member

Random python API tests failure happened again. Build log here: https://gist.github.com/justinmk/d4f3b7182e473f8a2898

I'll post the build logs every time it happens just in case (they get overwritten when I restart the build).

@justinmk justinmk added this to the first release milestone Aug 18, 2014
@tarruda
Neovim member

@justinmk My son was born last Friday and I didn't touch my laptop since then, but as soon as I have some free time I will give priority to this issue

@justinmk
Neovim member

My son was born last Friday and I didn't touch my laptop since then

Good!

@splinterofchaos
Neovim member

Congratulations!

@aktau
Neovim member

@justinmk My son was born last Friday and I didn't touch my laptop since then, but as soon as I have some free time I will give priority to this issue

Fantastic news!

@tarruda
Neovim member

Thanks for the support/patience, I will get on this issue today

@tarruda
Neovim member

@myitcv I couldn't reproduce this using python, so I downloaded the go client to reproduce the bug and attach to nvim using gdb and see what's going on, but I'm having trouble running the tests successfully. Here's what I'm using:

$ while true; do NEOVIM_LISTEN_ADDRESS=/tmp/neovim go test; done

but I always get the same error(even after a hundred times):

FAIL    github.com/myitcv/neovim        0.003s
2014/08/28 09:49:34 Could not decode response: Could not decode Buffer: msgpack: invalid code 91 decoding uint64
exit status 1

Could you describe exactly what steps I need to take to reproduce the error? Also, what is your OS/version?

@tarruda
Neovim member

Another failure FWIW: https://gist.github.com/justinmk/047fd4c9c5d49000c5f0

I haven't been able to reproduce this locally, which is why I'm hoping to do it using the go client(I need to do it locally so I can attach and have an idea of what's going on)

@myitcv
Neovim member
@tarruda
Neovim member

@myitcv perhaps if you try using the same version of the go client you did the first time?

@bfredl
Neovim member

In a certain situation, I frequently (once in five cases or so) get a None return on buf[-1], however if I add a None check and retry, I always get correct response on immediate retry:

lastline = self.buf[-1] # sometimes fails
if lastline is None:
     print("retry:", repr(self.buf[-1])) # this prints '' (correct value)

I've been trying to isolate a small test case that triggers this, without any luck so far.

@tarruda
Neovim member

I've been trying to isolate a small test case that triggers this, without any luck so far.

I would appreciate if you could put together a python script that reproduces the problem, even if only sometimes

@bfredl
Neovim member

This reproduces systematically for me: https://gist.github.com/bfredl/5a008fbe56aae07703c7 (If I understand test_concurrency.py, this should be thread-safe)
Invoke with nvim socket as command line argument, the assertion should fail after 20 printed lines or so. If not try altering the time in sleep(0.001)

@tarruda
Neovim member

@bfredl managed to reproduce by altering to 10k iterations and reducing the sleep interval to 0.00001, gonna try to debug what causes it now. Thanks

@tarruda
Neovim member

@bfredl I'm almost certain this is a bug with the rpc_blocking_request method in the python-client(nvim never returns nil for buffer_get_line). If I am correct, then it is distinct from the bug detected by @myitcv

@bfredl
Neovim member

@tarruda Thanks for the pointer, just investigating client.py I think there is slight racyness between next_message (invoked in blocking_request) and post, will post issue/pr if I can confirm it.

@tarruda
Neovim member

@tarruda Thanks for the pointer, just investigating client.py I think there is slight racyness between next_message (invoked in blocking_request) and post, will post issue/pr if I can confirm it.

No need, I have pushed a fix to neovim/python-client#22

@bfredl
Neovim member

Nice, cherry-picking that commit the failures seems to be gone (also in my original code) 👍

@splinterofchaos
Neovim member

Hard to tell if this is the same issue that @myitcv reported, but I might as well post it here. This segfault I used to only get in my own client (#1063 (comment)), but I just got it from nvim after having built from a branch off of e0d812a (my branch) and enabled the python clipboard using via https://github.com/neovim/neovim/wiki/Python-plugins,-clipboard-integration-and-other-service-providers-through-msgpack-RPC-channels

image

^^ What I'm trying to do here is highlight end - p to replace it with len

EDIT: I kinda have been wondering if this bug might be in msgpack itself, but I don't want to bother its maintainers without confirmation.

@tarruda
Neovim member

EDIT: I kinda have been wondering if this bug might be in msgpack itself, but I don't want to bother its maintainers without confirmation.

Probably not. From the backtrace I suspect a single channel is being closed twice.

@splinterofchaos could you open another issue for this bug? I'm gonna merge #1121 and the corresponding python-client branch then I will look into this segfault

@splinterofchaos
Neovim member
@myitcv
Neovim member

@tarruda - I'm having a real job trying to reproduce this now. Without a concrete means of reproducing this bug, and given the various changes that have happened since then (#1121 etc) I'm tempted to suggest we close this issue. Unless of course you want to keep investigating via @justinmk's traces?

@tarruda
Neovim member

Let's leave this open for a while and see if we can determine the cause, but this is not related with the bug reported by @justinmk which was fixed in the python client

@myitcv
Neovim member

@tarruda I'm happy to close this now on the basis I haven't seen anything similar for some time.. thoughts?

@tarruda
Neovim member

@myitcv The msgpack/event layer has suffered so many changes that I think it's safe to close this now

@justinmk
Neovim member

That's great to hear. @tarruda and @oni-link have done significant stability improvements.

@justinmk justinmk closed this Aug 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment