Skip to content

(evdns related) Assertion bufev_private->refcnt > 0 failed in _bufferevent_decref_and_unlock #60

Closed
ghazel opened this Issue Jun 14, 2012 · 5 comments

2 participants

@ghazel
ghazel commented Jun 14, 2012
 * thread #5: tid = 0x3703, 0x90d839c6 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
    frame #0: 0x90d839c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x943eaf78 libsystem_c.dylib`pthread_kill + 106
    frame #2: 0x943dbbdd libsystem_c.dylib`abort + 167
    frame #3: 0x0008137b myapp`event_exit + 75 at log.c:79
      abort();
    frame #4: 0x00081557 myapp`event_errx + 71 at log.c:136
      event_exit(eval);
    frame #5: 0x00078f50 myapp`_bufferevent_decref_and_unlock + 112 at bufferevent.c:611
      EVUTIL_ASSERT(bufev_private->refcnt > 0);
    frame #6: 0x0007a75c myapp`bufferevent_connect_getaddrinfo_cb + 188 at bufferevent_sock.c:447
      _bufferevent_decref_and_unlock(bev);
    frame #7: 0x000a0e68 myapp`evdns_getaddrinfo_gotresolve + 888 at evdns.c:4474
      data->user_cb(err, NULL, data->user_data);
    frame #8: 0x00093c63 myapp`reply_run_callback + 499 at evdns.c:809
      cb->user_callback(cb->err, 0, 0, cb->ttl, NULL, user_pointer);
    frame #9: 0x00068b25 myapp`event_process_deferred_callbacks + 181 at event.c:1387
      cb->cb(cb, cb->arg);
    frame #10: 0x00068c6c myapp`event_process_active + 204 at event.c:1428
      event_process_deferred_callbacks(&base->defer_queue,&base->event_break);
    frame #11: 0x0006940d myapp`event_base_loop + 733 at event.c:1617
      int n = event_process_active(base);
    frame #12: 0x00068cdf myapp`event_base_dispatch + 31 at event.c:1446
      return (event_base_loop(event_base, 0));

My guess is: bufferevent_socket_connect_hostname called evutil_getaddrinfo_async which started two DNS queries. the TYPE_A responded quickly and bufferevent went on connecting (and got a failure, unrelated). then the TYPE_AAAA came back, and the bufferevent had already been disposed of. I think this can happen if the TYPE_AAAA query responds after the global_getaddrinfo_allow_skew (which is set to the default of 3).

I guess that implies something is wrong with evdns_getaddrinfo_timeout_cb, but I haven't determined what.

@ghazel
ghazel commented Jun 14, 2012

Ok, more information:

evdns_getaddrinfo_timeout_cb tries to cancel the request, but evdns_cancel_request returns immediately, because handle->current_req is NULL. The reason it's NULL is that earlier request_finished set it to NULL when it was called by reply_handle. reply_handle also called reply_schedule_callback, which created a pending callback, which was what later triggers the call to evdns_getaddrinfo_gotresolve for a request which should be cancelled.

So here's a summary of the order of events:

[debug] Sending request for domain.com on ipv4 as 0x38da64
[debug] Sending request for domain.com on ipv6 as 0x38da6c
(here the timeout timer is started)
(here a pending cb (1) is created)
[debug] reply_schedule_callback req:0x38df10 handle:0x38daf0
[debug] Removing timeout for request 0x38df10 handle:0x38daf0
[debug] search_try_next handle:0x679d80 req:0x678ba0
[debug] Search: now trying domain.com.local (0)
[debug] Removing timeout for request 0x678ba0 handle:0x679d80
[debug] search_try_next handle:0x679d80 newreq:0x562d20
(here the pending cb (1) fires)
[msg] evdns_getaddrinfo_gotresolve req:0x38da64 other_req:0x38da6c
[debug] search_try_next handle:0x679d80 req:0x562d20
(here a pending cb (2) is created)
[debug] reply_schedule_callback req:0x562d20 handle:0x679d80
[debug] Removing timeout for request 0x562d20 handle:0x679d80
(here the timeout timer fires)
[debug] evdns_getaddrinfo_timeout_cb v4:0x38da64 v6:0x38da6c
[debug] evdns_cancel_request handle:0x679d80
[debug] evdns_cancel_request has a pending_cb!!!!
[debug] evdns_cancel_request returning because current_req == NULL
[debug] pending:0x65c2a0
[warn] bufferevent_connect_getaddrinfo_cb bev:0x38dcb0 result:0
[debug] v4_timedout:0 v6_timedout:1
(here the pending cb (2) fires)
[msg] evdns_getaddrinfo_gotresolve req:0x38da6c other_req:0x38da64
[warn] bufferevent_connect_getaddrinfo_cb bev:0x38dcb0 result:8
(crash)

So either the pending cb needs to be cancelled when evdns_cancel_request runs, or it should handle the case where it occurs after the timeout does.

@ghazel
ghazel commented Jun 15, 2012

I'm not really proposing this as a fix, but this does hack around the issue:

--- a/libevent/evdns.c
+++ b/libevent/evdns.c
@@ -4283,6 +4283,8 @@ evdns_getaddrinfo_timeout_cb(evutil_socket_t fd, short what, void *ptr)
        data->user_cb(e, NULL, data->user_data);
    }

+    data->user_cb = NULL;
+
    if (!v4_timedout && !v6_timedout) {
        /* should be impossible? XXXX */
        free_getaddrinfo_request(data);
@@ -4438,6 +4440,12 @@ evdns_getaddrinfo_gotresolve(int result, char type, int count,
        return;
    }

+    if (data->user_cb == NULL) {
+       /* We already answered the user. */
+       free_getaddrinfo_request(data);
+       return;
+    }
+
    if (result == DNS_ERR_NONE) {
        if (count == 0)
            err = EVUTIL_EAI_NODATA;
@nmathewson
Owner

I actually think your solution might be a good idea for a workaround in 2.0. For 2.1, we can look into changing how cancelling DNS requests works. Have you been trying it out this fix in practice? It seems like it ought to work.

@ghazel
ghazel commented Jul 30, 2012

Yes, my fix seems to work. I chose ultimately, though, to disable IPv6. That avoids the delay and also this issue.

@nmathewson nmathewson pushed a commit that referenced this issue Jul 30, 2012
@ghazel ghazel Avoid segfault on weird timeout during name lookup.
If an evdns_getaddrinfo timeout happens while pending_cb is set, and
a callback is about to run, but we get a call to
evdns_getaddrinfo_gotresolve before it finishes.

Github issue #60. Thanks to Greg Hazel for patch and patience.
dc32077
@nmathewson
Owner

Okay; I've merged this, along with some XXXX comments to suggest that it ought not be permanent.

@nmathewson nmathewson closed this Jul 30, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.