online restart (-R) broken #113

Closed
ChristophBerg opened this Issue Feb 24, 2016 · 3 comments

Comments

Projects
None yet
2 participants
@ChristophBerg
Contributor

ChristophBerg commented Feb 24, 2016

Hi,

with 1.7.1 the -R online restart feature seems to be broken:

$ sudo -u postgres /usr/sbin/pgbouncer -R -d /etc/pgbouncer/pgbouncer.ini
2016-02-24 15:07:36.167 21571 LOG File descriptor limit: 65536 (H:65536), max_client_conn: 100, max fds possible: 110
2016-02-24 15:07:36.167 21571 LOG takeover_init: launching connection
2016-02-24 15:07:36.167 21571 LOG S-0x828550: pgbouncer/pgbouncer@unix:6432 new connection to server
2016-02-24 15:07:36.167 21571 LOG S-0x828550: pgbouncer/pgbouncer@unix:6432 Login OK, sending SUSPEND
2016-02-24 15:07:36.168 21571 LOG SUSPEND finished, sending SHOW FDS
2016-02-24 15:07:36.168 21571 LOG got pooler socket: 127.0.0.1:6432
2016-02-24 15:07:36.168 21571 LOG got pooler socket: unix:6432
2016-02-24 15:07:36.168 21571 LOG SHOW FDS finished
2016-02-24 15:07:36.168 21571 LOG disko over, going background

pgbouncer.log:

2016-02-24 15:07:28.429 21511 LOG File descriptor limit: 1024 (H:4096), max_client_conn: 100, max fds possible: 110
2016-02-24 15:07:28.430 21513 LOG listening on 127.0.0.1:6432
2016-02-24 15:07:28.430 21513 LOG listening on unix:/var/run/postgresql/.s.PGSQL.6432
2016-02-24 15:07:28.430 21513 LOG process up: pgbouncer 1.7.1, libevent 2.0.21-stable (epoll), adns: evdns2
2016-02-24 15:07:36.167 21571 LOG File descriptor limit: 65536 (H:65536), max_client_conn: 100, max fds possible: 110
2016-02-24 15:07:36.167 21571 LOG takeover_init: launching connection
2016-02-24 15:07:36.167 21571 LOG S-0x828550: pgbouncer/pgbouncer@unix:6432 new connection to server
2016-02-24 15:07:36.167 21513 LOG C-0x2549920: (nodb)/(nouser)@unix(21571):6432 closing because: client unexpected eof (age=0)
2016-02-24 15:07:36.167 21513 LOG C-0x2549920: (nodb)/pgbouncer@unix(21571):6432 pgbouncer access from unix socket
2016-02-24 15:07:36.167 21513 LOG C-0x2549920: pgbouncer/pgbouncer@unix(21571):6432 login attempt: db=pgbouncer user=pgbouncer tls=no
2016-02-24 15:07:36.167 21571 LOG S-0x828550: pgbouncer/pgbouncer@unix:6432 Login OK, sending SUSPEND
2016-02-24 15:07:36.168 21513 LOG SUSPEND command issued
2016-02-24 15:07:36.168 21571 LOG SUSPEND finished, sending SHOW FDS
2016-02-24 15:07:36.168 21571 LOG got pooler socket: 127.0.0.1:6432
2016-02-24 15:07:36.168 21571 LOG got pooler socket: unix:6432
2016-02-24 15:07:36.168 21571 LOG SHOW FDS finished
2016-02-24 15:07:36.168 21571 LOG disko over, going background
2016-02-24 15:07:36.169 21573 LOG sending SHUTDOWN;
2016-02-24 15:07:36.169 21513 LOG SHUTDOWN command issued
2016-02-24 15:07:36.169 21513 WARNING dropping database 'pgbouncer' as it does not exist anymore or inactive auto-database
2016-02-24 15:07:36.169 21513 LOG C-0x2549920: pgbouncer/pgbouncer@unix(21571):6432 closing because: database removed (age=0)
2016-02-24 15:07:36.169 21513 WARNING C-0x2549920: pgbouncer/pgbouncer@unix(21571):6432 Pooler Error: database removed
2016-02-24 15:07:36.169 21573 LOG S-0x828550: pgbouncer/pgbouncer@unix:6432 closing because: disko over (age=0)
2016-02-24 15:07:36.169 21573 LOG old process killed, resuming work
2016-02-24 15:07:36.169 21573 FATAL @src/main.c:627 in function write_pidfile(): /var/run/postgresql/pgbouncer.pid: File exists [17]

... and the process is gone.

I've tried the two commits past 1.7.1 relevant to PID file handling (07df4a5 and 106dc7b (which is missing an "int err", btw)), but they didn't fix the problem.

On IRC I got the report that this is a regression in 1.7.1, i.e. 1.7 worked fine, but I haven't verified this.

@markokr

This comment has been minimized.

Show comment
Hide comment
@markokr

markokr Feb 24, 2016

Contributor

Please test current master, does it work now?

Seems the pidfile race was always there only now the cleanup slowed down shutdown it was exposed.

But also the cleanup touches sockets, so it must be skipped for takeover.

Contributor

markokr commented Feb 24, 2016

Please test current master, does it work now?

Seems the pidfile race was always there only now the cleanup slowed down shutdown it was exposed.

But also the cleanup touches sockets, so it must be skipped for takeover.

@ChristophBerg

This comment has been minimized.

Show comment
Hide comment
@ChristophBerg

ChristophBerg Feb 25, 2016

Contributor

Hi Marko,

master fixes the problem, thanks!

I had some trouble extracting the diff to the last release from git because "git diff --submodule" isn't helpful at all there, so I had to manually extract it. Could you make a new release for this fix? TIA.

Contributor

ChristophBerg commented Feb 25, 2016

Hi Marko,

master fixes the problem, thanks!

I had some trouble extracting the diff to the last release from git because "git diff --submodule" isn't helpful at all there, so I had to manually extract it. Could you make a new release for this fix? TIA.

@markokr

This comment has been minimized.

Show comment
Hide comment
@markokr

markokr Feb 25, 2016

Contributor

Will do.

Thank you for the report.

Contributor

markokr commented Feb 25, 2016

Will do.

Thank you for the report.

@markokr markokr closed this Feb 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment