Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"show fds" crashes pgbouncer in 1.8.1 #311

Closed
quixoten opened this issue Jul 16, 2018 · 8 comments
Labels
bug

Comments

@quixoten
Copy link

@quixoten quixoten commented Jul 16, 2018

This occurs even when not connection through a unix socket, e.g.

2018-07-16 16:00:15.683 723 LOG C-0x7fd30b31ebf8: pgbouncer/pgbouncer@10.64.33.195:46452 login attempt: db=pgbouncer user=pgbouncer tls=no
2018-07-16 16:00:27.245 2645 LOG File descriptor limit: 65536 (H:65536), max_client_conn: 80000, max fds possible: 109556

The first log line is from running:

watch "psql -U pgbouncer -h 10.64.33.195 -p 6432 pgbouncer -c 'show fds;'|grep 'America/Denver'|wc -l"

This doesn't always crash the process, but I had three instances of pgbouncer running, and the watch command running on each of those. They crashed within 15 to 16 minutes of each other.

@eulerto

This comment has been minimized.

Copy link
Contributor

@eulerto eulerto commented Jul 18, 2018

@quixoten

This comment has been minimized.

Copy link
Author

@quixoten quixoten commented Jul 18, 2018

So far, I've only been able to reproduce this in our production environment. I'm unable to get a stack trace there due to downtime sensitivity. I'll continue to try in another environment, but the workloads are vastly smaller.

@eulerto

This comment has been minimized.

Copy link
Contributor

@eulerto eulerto commented Jul 19, 2018

You only need to enable core files (ulimit, limits.conf) for pgbouncer user. After that get a stack trace from the core file.

@quixoten

This comment has been minimized.

Copy link
Author

@quixoten quixoten commented Jul 19, 2018

GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) handle SIG33 pass nostop noprint
Signal        Stop	Print	Pass to program	Description
SIG33         No	No	Yes		Real-time event 33
(gdb) set pagination 0
(gdb) attach 7419
Attaching to process 7419
Reading symbols from /usr/sbin/pgbouncer-1.8.1...done.
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libssl.so.1.0.0
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0...(no debugging symbols found)...done.
Loaded symbols for /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Reading symbols from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libc.so.6
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.19.so...done.
done.
Loaded symbols for /lib/x86_64-linux-gnu/libdl.so.2
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libpthread-2.19.so...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Loaded symbols for /lib/x86_64-linux-gnu/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00007f462dfeb6b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000000004078ec in show_one_fd (sk=0x202dbe0, admin=0x7f462ed0f918) at src/admin.c:333
333	src/admin.c: No such file or directory.
(gdb) backtrace full
#0  0x00000000004078ec in show_one_fd (sk=0x202dbe0, admin=0x7f462ed0f918) at src/admin.c:333
        timezone = 0x0
        addrbuf = "\020\000\000\000\000\000\000\000\060\000\000\000\377\177\000\000\270@\316\334\377\177\000\000\000a\300s\026,\250u\000\000\000\000\000\000\000\000\000a\300s\026,\250uU\001\000\000\000\000\000"
        tmp = {data = <optimized out>, read_pos = 4, write_pos = <optimized out>, alloc_len = <optimized out>, reader = <optimized out>, fixed = <optimized out>}
        ckey = 0
        client_encoding = 0x0
        std_strings = 0x0
        datestyle = 0x0
        addr = 0x202dc48
        v = 0x202dc88
        password = 0x0
#1  show_fds_from_list (admin=admin@entry=0x7f462ed0f918, list=0x63fc10 <login_client_list>) at src/admin.c:378
        item = 0x202dbe0
        sk = 0x202dbe0
        res = true
#2  0x00000000004092c2 in admin_show_fds (admin=0x7f462ed0f918, arg=<optimized out>) at src/admin.c:438
        item = <optimized out>
        pool = <optimized out>
        res = <optimized out>
#3  0x0000000000409c18 in admin_parse_query (admin=admin@entry=0x7f462ed0f918, q=q@entry=0x7f462ed3f2e1 "show fds;") at src/admin.c:1315
        grp = {{rm_so = 0, rm_eo = 9}, {rm_so = 0, rm_eo = 4}, {rm_so = -1, rm_eo = -1}, {rm_so = 4, rm_eo = 8}, {rm_so = 5, rm_eo = 8}, {rm_so = -1, rm_eo = -1}, {rm_so = 8, rm_eo = 9}, {rm_so = -1, rm_eo = -1}, {rm_so = -1, rm_eo = -1}, {rm_so = -1, rm_eo = -1}}
        cmd = "show\000\000\000\000@\374\000\002\000\000\000"
        arg = "fds\000\377\177\000\000EAA\000\005\000\000\000SET Time\030\000\000\000\060\000\000\000 E\316\334\377\177\000\000@D\316\334\377\177\000\000client c\"CB\000\000\000\000"
        val = "uest\000\000\000\000\000a\300s\f\000\000\000\304U\000\000\031'\000\000d\000\000\000\000\000\000\000\030\371\320.F\177\000\000\351\362\323.F\177\000\000\351\362\323.F\177\000\000^CB", '\000' <repeats 13 times>, "Hu-.\304U\000\000\001\000\000\000\016\000\000\000\322\314A\000\000\000\000\000\030\371\320.F\177\000\000\374\362\323.F\177\000\000\351\362\323.F\177\000\000A\026A\000\000\000\000\000\374\362\323.F\177\000\000\000a\300s\026,\250u\034\363\323.F\177\000\000\034\363\323.F\177\000\000\060E\316\334\377\177\000\000\030\371\320.F\177\000\000\027\363\323.F\177\000\000O\256@\000\000\000\000\000\360;\001\002\000\000\000\000"...
        res = <optimized out>
        ok = <optimized out>
#4  0x0000000000409d3d in admin_handle_client (admin=admin@entry=0x7f462ed0f918, pkt=pkt@entry=0x7fffdcce4530) at src/admin.c:1363
        q = 0x7f462ed3f2e1 "show fds;"
        res = <optimized out>
#5  0x000000000040bb88 in handle_client_work (pkt=0x7fffdcce4530, client=0x7f462ed0f918) at src/client.c:675
        sbuf = 0x7f462ed0f9e8
        rfq_delta = 1
#6  client_proto (sbuf=0x7f462ed0f9e8, evtype=<optimized out>, data=0x7fffdcce45a0) at src/client.c:747
        res = false
        client = 0x7f462ed0f918
        pkt = {type = 81, len = 15, data = {data = 0x7f462ed3f2dc "Q", read_pos = 15, write_pos = 15, alloc_len = 15, reader = true, fixed = true}}
        __func__ = "client_proto"
#7  0x0000000000416dce in sbuf_call_proto (sbuf=sbuf@entry=0x7f462ed0f9e8, event=event@entry=0) at src/sbuf.c:387
        mbuf = {data = 0x7f462ed3f2dc "Q", read_pos = 15, write_pos = 15, alloc_len = 15, reader = true, fixed = true}
        io = <optimized out>
        res = <optimized out>
#8  0x0000000000417159 in sbuf_process_pending (sbuf=0x7f462ed0f9e8) at src/sbuf.c:578
        avail = <optimized out>
        io = 0x7f462ed3f2d0
        full = false
        res = <optimized out>
#9  0x0000000000417850 in sbuf_main_loop (sbuf=0x7f462ed0f9e8, skip_recv=<optimized out>) at src/sbuf.c:753
        free = <optimized out>
        ok = <optimized out>
        loopcnt = 1
        skip_recv = <optimized out>
        sbuf = 0x7f462ed0f9e8
#10 0x00007f462e2c3f24 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
No symbol table info available.
#11 0x000000000040fd93 in main_loop_once () at src/main.c:719
        err = <optimized out>
#12 0x00000000004065d8 in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:941
        c = <optimized out>
        did_takeover = <optimized out>
        arg_username = <optimized out>
        long_idx = 32582
        long_options = {{name = 0x42dc49 "quiet", has_arg = 0, flag = 0x0, val = 113}, {name = 0x42dc4f "verbose", has_arg = 0, flag = 0x0, val = 118}, {name = 0x42c176 "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x42dc57 "daemon", has_arg = 0, flag = 0x0, val = 100}, {name = 0x433d3f "version", has_arg = 0, flag = 0x0, val = 86}, {name = 0x42dc5e "reboot", has_arg = 0, flag = 0x0, val = 82}, {name = 0x42c988 "user", has_arg = 1, flag = 0x0, val = 117}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __func__ = "main"
(gdb) info registers
rax            0x0	0
rbx            0x63fc10	6552592
rcx            0x0	0
rdx            0x0	0
rsi            0x0	0
rdi            0x7f462ed0f918	139939409885464
rbp            0x0	0x0
rsp            0x7fffdcce3f50	0x7fffdcce3f50
r8             0x0	0
r9             0x0	0
r10            0x0	0
r11            0x246	582
r12            0x0	0
r13            0x202dbe0	33741792
r14            0x0	0
r15            0x0	0
rip            0x4078ec	0x4078ec <show_fds_from_list+204>
eflags         0x10246	[ PF ZF IF RF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) x/16i $pc
=> 0x4078ec <show_fds_from_list+204>:	mov    0x20(%rax),%rax
   0x4078f0 <show_fds_from_list+208>:	cmpq   $0x0,0x68(%rax)
   0x4078f5 <show_fds_from_list+213>:	je     0x407930 <show_fds_from_list+272>
   0x4078f7 <show_fds_from_list+215>:	mov    0x20(%r13),%rdi
   0x4078fb <show_fds_from_list+219>:	test   %rdi,%rdi
   0x4078fe <show_fds_from_list+222>:	je     0x407930 <show_fds_from_list+272>
   0x407900 <show_fds_from_list+224>:	add    $0x38,%rdi
   0x407904 <show_fds_from_list+228>:	mov    %rsi,0x78(%rsp)
   0x407909 <show_fds_from_list+233>:	mov    %rcx,0x70(%rsp)
   0x40790e <show_fds_from_list+238>:	mov    %rdx,0x68(%rsp)
   0x407913 <show_fds_from_list+243>:	callq  0x411630 <find_user>
   0x407918 <show_fds_from_list+248>:	test   %rax,%rax
   0x40791b <show_fds_from_list+251>:	mov    0x68(%rsp),%rdx
   0x407920 <show_fds_from_list+256>:	mov    0x70(%rsp),%rcx
   0x407925 <show_fds_from_list+261>:	mov    0x78(%rsp),%rsi
   0x40792a <show_fds_from_list+266>:	je     0x407b5a <show_fds_from_list+826>
(gdb) thread apply all backtrace

Thread 1 (Thread 0x7f462ed47740 (LWP 7419)):
#0  0x00000000004078ec in show_one_fd (sk=0x202dbe0, admin=0x7f462ed0f918) at src/admin.c:333
#1  show_fds_from_list (admin=admin@entry=0x7f462ed0f918, list=0x63fc10 <login_client_list>) at src/admin.c:378
#2  0x00000000004092c2 in admin_show_fds (admin=0x7f462ed0f918, arg=<optimized out>) at src/admin.c:438
#3  0x0000000000409c18 in admin_parse_query (admin=admin@entry=0x7f462ed0f918, q=q@entry=0x7f462ed3f2e1 "show fds;") at src/admin.c:1315
#4  0x0000000000409d3d in admin_handle_client (admin=admin@entry=0x7f462ed0f918, pkt=pkt@entry=0x7fffdcce4530) at src/admin.c:1363
#5  0x000000000040bb88 in handle_client_work (pkt=0x7fffdcce4530, client=0x7f462ed0f918) at src/client.c:675
#6  client_proto (sbuf=0x7f462ed0f9e8, evtype=<optimized out>, data=0x7fffdcce45a0) at src/client.c:747
#7  0x0000000000416dce in sbuf_call_proto (sbuf=sbuf@entry=0x7f462ed0f9e8, event=event@entry=0) at src/sbuf.c:387
#8  0x0000000000417159 in sbuf_process_pending (sbuf=0x7f462ed0f9e8) at src/sbuf.c:578
#9  0x0000000000417850 in sbuf_main_loop (sbuf=0x7f462ed0f9e8, skip_recv=<optimized out>) at src/sbuf.c:753
#10 0x00007f462e2c3f24 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#11 0x000000000040fd93 in main_loop_once () at src/main.c:719
#12 0x00000000004065d8 in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:941
(gdb) quit
A debugging session is active.

	Inferior 1 [process 7419] will be detached.

Quit anyway? (y or n) Detaching from program: /usr/sbin/pgbouncer-1.8.1, process 7419
@film42

This comment has been minimized.

Copy link

@film42 film42 commented Jul 20, 2018

It's interesting that sk->pool is checked a few lines down from src/admin.c:333. I'm guessing it's NULL.

pgbouncer/src/admin.c

Lines 333 to 343 in e8e8f64

if (sk->pool->db->auth_user && sk->auth_user && !find_user(sk->auth_user->name))
password = sk->auth_user->passwd;
/* PAM requires passwords as well since they are not stored externally */
if (cf_auth_type == AUTH_PAM && !find_user(sk->auth_user->name))
password = sk->auth_user->passwd;
return send_one_fd(admin, sbuf_socket(&sk->sbuf),
is_server_socket(sk) ? "server" : "client",
sk->auth_user ? sk->auth_user->name : NULL,
sk->pool ? sk->pool->db->name : NULL,

Checking sk->pool before accessing sk->pool->db should prevent a crash, but it doesn't really explain why sk->pool or sk->pool->db is not set.

@achix

This comment has been minimized.

Copy link

@achix achix commented Jul 23, 2018

maybe a diff of the config file against the distro config ?

btw, in case unencrypted passwds are used, show fds just shows them. This should be fixed.

@petere

This comment has been minimized.

Copy link
Contributor

@petere petere commented Jul 26, 2018

sk->pool could be unset if the client has not been assigned to a pool yet. That would also explain that the crash happens only occasionally but repeatedly under load.

@quixoten Try adding sk->pool && ... to the if statement on line 333 of admin.c.

@petere petere added the bug label Jul 26, 2018
@petere petere closed this in 7b97cb8 Sep 11, 2018
@petere

This comment has been minimized.

Copy link
Contributor

@petere petere commented Sep 11, 2018

I was able to reproduce this, by running SHOW FDS under watch as shown in the report and then having several clients connect repeatedly in a loop, with more clients than the pool size (maybe that wasn't important). It would segfault after 10 to 20 minutes. The committed patch fixed it for me and also is consistent with our earlier analysis.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Aug 25, 2019
Changes since 1.9.0

2019-07-01 - PgBouncer 1.10.0 - "Afraid of the World"

    Features
        Add support for enabling and disabling TLS 1.3. (TLS 1.3 was
        already supported, depending on the OpenSSL library, but now the
        configuration settings to pick the TLS protocol versions also
        support it.)
    Fixes
        Fix TLS 1.3 support. This was broken with OpenSSL 1.1.1 and
        1.1.1a (but not before or after).
        Fix a rare crash in SHOW FDS
        (pgbouncer/pgbouncer#311).
        Fix an issue that could lead to prolonged downtime if many cancel
        requests arrive
        (pgbouncer/pgbouncer#329).
        Avoid "unexpected response from login query" after a postgres
        reload
        (pgbouncer/pgbouncer#220).
        Fix idle_transaction_timeout calculation
        (pgbouncer/pgbouncer#125). The
        bug would lead to premature timeouts in specific situations.
    Cleanups
        Make various log and error messages more precise.
        Fix issues found by Coverity (none had a significant impact in
        practice).
        Improve and document all test scripts.
        Add additional SHOW commands to the documentation.
        Convert the documentation from rst to Markdown.
        Python scripts in the source tree are all compatible with Python 3
        now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.