postgres VACUUM exposes deadlock #369

Closed
ccaputo opened this issue Aug 9, 2020 · 7 comments · Fixed by #371

ccaputo (Contributor) commented Aug 9, 2020

Describe the bug

Running f4e5797 (just shy of 4.1.0), I have noticed that a PostgreSQL vacuumdb --all --analyze --full results in some kind of deadlock, as follows:

# ps auwx | grep -E "postgres|irrd"
root       927  0.0  0.0 107104 42952 ?        S    Aug03   0:00 irrd-RPSLMirrorImportUpdateRunner-BBOI
postgres   928  0.0  0.0 17225780 25032 ?      Ss   Aug03   0:00 postgres: irrd irrd [local] INSERT waiting
root      1233  0.0  0.0 107236 43048 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-TC
postgres  1275  0.0  0.0 17225780 24436 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] INSERT waiting
root      4611  0.0  0.0 108052 43784 ?        S    Aug01   0:00 irrd-RPSLMirrorImportUpdateRunner-RADB
root      4612  0.0  0.0 107404 43140 ?        S    Aug01   0:00 irrd-RPSLMirrorImportUpdateRunner-RIPE
postgres  4614  0.0  0.0 17225964 25148 ?      Ss   Aug01   0:00 postgres: irrd irrd [local] INSERT waiting
postgres  4615  0.0  0.0 17225856 25776 ?      Ss   Aug01   0:00 postgres: irrd irrd [local] INSERT waiting
root      9508  0.0  0.0 107108 42952 ?        S    Aug01   0:00 irrd-RPSLMirrorImportUpdateRunner-JPIRR
postgres  9511  0.0  0.0 17225780 25056 ?      Ss   Aug01   0:00 postgres: irrd irrd [local] INSERT waiting
root     10572  0.0  0.0 106848 42884 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-LEVEL3
postgres 10583  0.0  0.0 17225780 25092 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] DELETE waiting
postgres 10710  0.0  0.2 17224100 389384 ?     Ss   Jul27   0:14 /usr/local/pgsql/bin/postgres -D /db/postgresql --data-directory=/db/postgresql --unix-socket-directories=/run/postgresql
postgres 10712  0.0  3.1 17224248 4123948 ?    Ss   Jul27   6:01 postgres: checkpointer   
postgres 10713  0.0  0.1 17224100 147552 ?     Ss   Jul27   0:06 postgres: background writer   
postgres 10714  0.0  0.0 17224100 21136 ?      Ss   Jul27   1:23 postgres: walwriter   
postgres 10715  0.0  0.0 17224772 7404 ?       Ss   Jul27   0:08 postgres: autovacuum launcher   
postgres 10716  0.0  0.0  20244  4572 ?        Ss   Jul27   0:30 postgres: stats collector   
postgres 10717  0.0  0.0 17224528 6188 ?       Ss   Jul27   0:00 postgres: logical replication launcher   
root     11683  0.0  0.0 107104 42956 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-ALTDB
postgres 11694  0.0  0.0 17225780 25028 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] INSERT waiting
root     12556  0.0  0.0 107104 42952 ?        S    Aug03   0:00 irrd-RPSLMirrorImportUpdateRunner-ARIN-NONAUTH
postgres 12557  0.0  0.0 17225780 25032 ?      Ss   Aug03   0:00 postgres: irrd irrd [local] INSERT waiting
root     15731  0.0  0.0  66596 37956 ?        S    Jul28   6:45 /home/example/irrd-venv/bin/python3.7 /home/example/irrd-venv/bin/irrd --config=/home/example/src/config/irrd.yaml
root     15738  0.2  0.0 288060 40248 ?        Sl   Jul28  49:32 irrd-whois-server-listener
root     15739  0.0  0.0  66596 38048 ?        S    Jul28   1:25 irrd-http-server-listener
root     15740  2.0  0.4 641752 546248 ?       S    Jul28 350:39 irrd-preload-store-manager
root     15743  0.3  0.4 799056 589372 ?       Sl   Jul28  62:43 irrd-whois-worker
root     15744  0.3  0.4 780176 569924 ?       Sl   Jul28  63:47 irrd-whois-worker
root     15746  0.3  0.4 810848 600412 ?       Sl   Jul28  62:40 irrd-whois-worker
root     15751  0.3  0.4 810976 600588 ?       Sl   Jul28  63:18 irrd-whois-worker
root     15753  0.3  0.4 813952 603348 ?       Sl   Jul28  62:46 irrd-whois-worker
root     15754  0.3  0.4 813016 602516 ?       Sl   Jul28  63:23 irrd-whois-worker
root     15755  0.3  0.4 781616 571048 ?       Sl   Jul28  63:16 irrd-whois-worker
root     15756  0.3  0.4 802696 592368 ?       Sl   Jul28  63:43 irrd-whois-worker
postgres 15760  0.4  2.6 17225932 3526636 ?    Ss   Jul28  78:11 postgres: irrd irrd [local] idle
postgres 15762  0.0  0.4 17231320 565940 ?     Ss   Jul28   2:16 postgres: irrd irrd [local] idle in transaction
postgres 15765  0.0  0.4 17231320 563452 ?     Ss   Jul28   2:06 postgres: irrd irrd [local] idle in transaction
postgres 15766  0.0  0.4 17231316 562464 ?     Ss   Jul28   2:05 postgres: irrd irrd [local] idle in transaction
postgres 15769  0.0  0.4 17231320 564672 ?     Ss   Jul28   1:47 postgres: irrd irrd [local] idle in transaction
postgres 15770  0.0  0.4 17231320 563224 ?     Ss   Jul28   1:48 postgres: irrd irrd [local] idle in transaction
postgres 15771  0.0  0.4 17231320 562136 ?     Ss   Jul28   2:08 postgres: irrd irrd [local] idle in transaction
postgres 15772  0.0  0.2 17231320 385600 ?     Ss   Jul28   1:46 postgres: irrd irrd [local] idle in transaction
postgres 15773  0.0  0.4 17231320 561820 ?     Ss   Jul28   2:03 postgres: irrd irrd [local] idle in transaction
root     18202  0.0  0.0 106848 42856 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-APNIC
postgres 18205  0.0  0.0 17225780 25028 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] INSERT waiting
root     19770  0.0  0.0   3184   812 pts/1    S+   14:33   0:00 grep --colour=auto -E postgres|irrd
root     20104  0.0  0.0 107072 42900 ?        S    Aug03   0:00 irrd-RPSLMirrorImportUpdateRunner-AFRINIC
postgres 20106  0.0  0.0 17225828 23372 ?      Ss   Aug03   0:00 postgres: irrd irrd [local] INSERT waiting
postgres 23230  0.0  0.0 17225400 25324 ?      Ss   Aug01   0:00 postgres: root irrd [local] VACUUM waiting
root     26689  0.0  0.0 106852 42848 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-ARIN
postgres 26691  0.0  0.0 17225780 25124 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] INSERT waiting
root     27728  0.0  0.0 106820 42552 ?        S    Aug01   0:00 irrd-RPSLMirrorImportUpdateRunner-NTT-RPKI
postgres 27731  0.0  0.0 17225780 25100 ?      Ss   Aug01   0:00 postgres: irrd irrd [local] DELETE waiting
root     28037  0.0  0.0 107236 42992 ?        S    Aug04   0:00 irrd-RPSLMirrorImportUpdateRunner-CANARIE
postgres 28038  0.0  0.0 17225780 25044 ?      Ss   Aug04   0:00 postgres: irrd irrd [local] INSERT waiting
root     30601  0.0  0.0 108044 43728 ?        S    Aug03   0:00 irrd-RPSLMirrorImportUpdateRunner-NTTCOM
postgres 30602  0.0  0.0 17225948 25096 ?      Ss   Aug03   0:00 postgres: irrd irrd [local] INSERT waiting
root     31992  0.0  0.0 106848 42800 ?        S    Aug03   0:00 irrd-RPSLMirrorImportUpdateRunner-RIPE-NONAUTH
postgres 31994  0.0  0.0 17225780 25032 ?      Ss   Aug03   0:00 postgres: irrd irrd [local] DELETE waiting
root     32224  0.0  0.0 106816 42568 ?        S    Aug02   0:00 irrd-RPSLMirrorImportUpdateRunner-ARIN-WHOIS
postgres 32275  0.0  0.0 17225780 25004 ?      Ss   Aug02   0:00 postgres: irrd irrd [local] DELETE waiting

Note the pid 23230 VACUUM waiting above, which was prompted by:

Aug  1 23:00:01 ca CROND[23212]: (root) CMD (   PGHOST="/run/postgresql" vacuumdb --all --analyze --full)

kill 23230 terminates the waiting VACUUM, and then the rest of the commands in progress are able to continue; things appear to have caught up without problem.

After the VACUUM is killed, I note:

# ps auwx | grep postg             
root      8705  0.0  0.0   3040   748 pts/1    S+   14:51   0:00 grep --colour=auto postg
postgres 10710  0.0  0.2 17224100 389384 ?     Ss   Jul27   0:14 /usr/local/pgsql/bin/postgres -D /db/postgresql --data-directory=/db/postgresql --unix-socket-directories=/run/postgresql
postgres 10712  0.0  3.1 17224248 4188084 ?    Ds   Jul27   6:06 postgres: checkpointer   
postgres 10713  0.0  0.1 17224100 147552 ?     Ss   Jul27   0:06 postgres: background writer   
postgres 10714  0.0  0.0 17224100 21136 ?      Ds   Jul27   1:24 postgres: walwriter   
postgres 10715  0.0  0.0 17224772 7404 ?       Ss   Jul27   0:08 postgres: autovacuum launcher   
postgres 10716  0.0  0.0  20244  4572 ?        Ss   Jul27   0:30 postgres: stats collector   
postgres 10717  0.0  0.0 17224528 6188 ?       Ss   Jul27   0:00 postgres: logical replication launcher   
postgres 15760  0.4  2.6 17225932 3526636 ?    Ss   Jul28  78:54 postgres: irrd irrd [local] idle
postgres 15762  0.0  0.4 17231320 569316 ?     Ss   Jul28   2:16 postgres: irrd irrd [local] idle in transaction
postgres 15765  0.0  0.4 17231320 571764 ?     Ss   Jul28   2:07 postgres: irrd irrd [local] idle in transaction
postgres 15766  0.0  0.4 17231316 570596 ?     Ss   Jul28   2:05 postgres: irrd irrd [local] idle in transaction
postgres 15769  0.0  0.4 17231320 568596 ?     Ss   Jul28   1:47 postgres: irrd irrd [local] idle in transaction
postgres 15770  0.0  0.4 17231320 571056 ?     Ss   Jul28   1:48 postgres: irrd irrd [local] idle in transaction
postgres 15771  0.0  0.4 17231320 570268 ?     Ss   Jul28   2:09 postgres: irrd irrd [local] idle in transaction
postgres 15772  0.0  0.2 17231320 393552 ?     Ss   Jul28   1:47 postgres: irrd irrd [local] idle in transaction
postgres 15773  0.0  0.4 17231320 574192 ?     Ss   Jul28   2:04 postgres: irrd irrd [local] idle in transaction
postgres 26114  6.8  1.6 17244220 2111508 ?    Ss   14:48   0:10 postgres: irrd irrd [local] INSERT

I imagine the idle-in-transaction processes are what is blocking the VACUUM from running.
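
To confirm that, something like the following can ask PostgreSQL directly which sessions are idle in transaction and for how long (a rough sketch, not part of IRRd; the connection string is a placeholder for this environment):

```python
# Rough sketch (not IRRd code; the DSN is a placeholder): list sessions that are
# "idle in transaction" and show how long their current transaction has been open.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///irrd")  # assumption: local socket, database "irrd"

with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT pid, usename, now() - xact_start AS xact_age, query
        FROM pg_stat_activity
        WHERE state = 'idle in transaction'
        ORDER BY xact_start
    """))
    for row in rows:
        print(row.pid, row.xact_age, row.query)
```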

I am pretty sure that prior to 4.1.0x, VACUUMs were able to run without problem.

To Reproduce
Run PGHOST="/run/postgresql" vacuumdb --all --analyze --full or equivalent.

Expected behaviour
My concern and reason for filing this bug is that a VACUUM should not be blocked, nor should it result in a deadlock situation.

IRRd version you are running
f4e5797 (just shy of 4.1.0)

mxsasha self-assigned this on Aug 9, 2020
mxsasha added the release blocker label (blocks the next release) on Aug 9, 2020

mxsasha (Collaborator) commented Aug 9, 2020

Will investigate soon, flagging as a possible release blocker.

mxsasha (Collaborator) commented Aug 11, 2020

I don't think this is an IRRd bug.

Quoting from the PostgreSQL docs:

> Plain VACUUM (without FULL) simply reclaims space and makes it available for re-use. This form of the command can operate in parallel with normal reading and writing of the table, as an exclusive lock is not obtained. However, extra space is not returned to the operating system (in most cases); it's just kept available for re-use within the same table. VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an exclusive lock on each table while it is being processed.

So it makes sense that a full vacuum locks everything and therefore halts all importers. Alternatively, the vacuum may be waiting for current importers to complete. It looks like your vacuum took quite long, judging by the number of importers that were running during it. If there was indeed a deadlock, PostgreSQL should generally detect that and kill something to resolve it. If that's an IRRd importer, that import run will fail, but IRRd will recover the next time that importer is run. I don't know why PostgreSQL didn't detect the deadlock.

The irrd irrd [local] idle in transaction processes should not be a concern. Those are the PostgreSQL connections for the whois server workers, which keep their connection open to reduce latency. They don't acquire any locks, and don't do any writes.

I have tried the full vacuum on a few of my instances, and it ran without any issues at all. Have you seen this issue reoccur?

On a side note, #326 will reduce the number of locks needed by the imports, if successful.

ccaputo (Contributor, Author) commented Aug 11, 2020

> I don't think this is an IRRd bug.

At this point I disagree. :-)

> [full reply above quoted]

VACUUM FULL requires an exclusive lock, yes, but I believe it can't get it because of the idle in transaction IRRd connections, which is very different from idle. Thus it queues up waiting for them to finish their transactions, while also preventing any other operations from starting. Since the idle transactions never finish, the VACUUM FULL is blocked, and since the VACUUM FULL is blocked, further IRRd operations can't proceed. So it is not a Postgres-internal deadlock but rather one between the two processes.
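
That chain is visible from PostgreSQL itself. A rough sketch (not IRRd code; the connection string and the pid 23230 are placeholders for this environment) that lists which backends block the waiting VACUUM and which lock requests are still ungranted behind it:

```python
# Rough sketch (placeholders, not IRRd code): show who is blocking the waiting
# VACUUM backend, plus every lock request that has not been granted yet.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///irrd")

with engine.connect() as conn:
    blockers = conn.execute(
        text("SELECT pg_blocking_pids(:pid)"), {"pid": 23230}
    ).scalar()
    print("VACUUM backend is blocked by:", blockers)

    waiting = conn.execute(text("""
        SELECT pid, locktype, relation::regclass AS relation, mode
        FROM pg_locks
        WHERE NOT granted
        ORDER BY pid
    """))
    for row in waiting:
        print(row.pid, row.mode, row.relation)
```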

Why would IRRd issue a BEGIN; and then be idle?

This is reproducible every time. This is the ps output right after it happens, prior to a bunch of INSERTs and DELETEs being queued up:

# ps auwx | grep -E "postgres|irrd"                                                                                                                                                 
postgres  6464  0.0  0.2 17224100 389584 ?     Ss   Aug09   0:04 /usr/local/pgsql/bin/postgres -D /db/postgresql --data-directory=/db/postgresql --unix-socket-directories=/run/postgresql                                                                                                                                                                          
postgres  6466  0.0  4.7 17224960 6290428 ?    Ss   Aug09   2:05 postgres: checkpointer                                                                                           
postgres  6467  0.0  0.1 17224100 144428 ?     Ss   Aug09   0:01 postgres: background writer                                                                                      
postgres  6468  0.0  0.0 17224100 21032 ?      Ss   Aug09   0:37 postgres: walwriter                                                                                              
postgres  6469  0.0  0.0 17224772 7284 ?       Ss   Aug09   0:01 postgres: autovacuum launcher                                                                                    
postgres  6470  0.0  0.0  20360  4484 ?        Ss   Aug09   0:08 postgres: stats collector                                                                                        
postgres  6471  0.0  0.0 17224528 5908 ?       Ss   Aug09   0:00 postgres: logical replication launcher                                                                           
postgres 28353  0.0  0.0 17225400 24664 ?      Ss   23:08   0:00 postgres: root irrd [local] VACUUM waiting                                                                       
root     29884  0.0  0.0   3172   748 pts/1    S+   23:09   0:00 grep --colour=auto -E postgres|irrd                                                                              
root     31573  0.0  0.0  66600 37984 ?        S    Aug09   1:29 /home/example/irrd-venv/bin/python3.7 /home/example/irrd-venv/bin/irrd --config=/home/example/src/config/irrd.yaml
root     31580  0.3  0.0 288064 40336 ?        Sl   Aug09  10:44 irrd-whois-server-listener                                                                                       
root     31581  0.0  0.0  66600 38108 ?        S    Aug09   0:16 irrd-http-server-listener                                                                                        
root     31582  7.1  0.2 496172 350940 ?       S    Aug09 241:09 irrd-preload-store-manager                     
root     31585  0.5  0.2 581916 294572 ?       Sl   Aug09  18:48 irrd-whois-worker                                                                                                
root     31587  0.5  0.2 579744 293324 ?       Sl   Aug09  18:54 irrd-whois-worker                                                                                                
root     31589  0.5  0.2 570156 282284 ?       Sl   Aug09  19:31 irrd-whois-worker                                                                                                
root     31592  0.5  0.2 555308 267888 ?       Sl   Aug09  18:48 irrd-whois-worker                                                                                                
root     31594  0.5  0.2 576140 292436 ?       Sl   Aug09  19:05 irrd-whois-worker                                                                                                
root     31596  0.5  0.2 576776 289768 ?       Sl   Aug09  18:54 irrd-whois-worker                                                                                                
root     31597  0.5  0.2 580204 299080 ?       Sl   Aug09  18:52 irrd-whois-worker                                                                                                
root     31598  0.5  0.2 580996 293388 ?       Sl   Aug09  19:10 irrd-whois-worker                                                                                                
postgres 31604  1.4  4.1 17225920 5461264 ?    Ss   Aug09  48:28 postgres: irrd irrd [local] idle                                                                                 
postgres 31606  0.0  0.2 17227980 302000 ?     Ss   Aug09   0:23 postgres: irrd irrd [local] idle in transaction                                                                  
postgres 31607  0.0  0.2 17227404 300604 ?     Ss   Aug09   0:22 postgres: irrd irrd [local] idle in transaction                                                                  
postgres 31608  0.0  0.2 17226664 300352 ?     Ss   Aug09   0:22 postgres: irrd irrd [local] idle in transaction
postgres 31609  0.0  0.2 17227980 301484 ?     Ss   Aug09   0:22 postgres: irrd irrd [local] idle in transaction
postgres 31610  0.0  0.2 17227984 302544 ?     Ss   Aug09   0:22 postgres: irrd irrd [local] idle in transaction
postgres 31611  0.0  0.2 17227980 301188 ?     Ss   Aug09   0:22 postgres: irrd irrd [local] idle in transaction
postgres 31613  0.0  0.2 17226664 300272 ?     Ss   Aug09   0:21 postgres: irrd irrd [local] idle in transaction
postgres 31614  0.0  0.2 17226664 300556 ?     Ss   Aug09   0:24 postgres: irrd irrd [local] idle in transaction

mxsasha (Collaborator) commented Aug 12, 2020

Upon closer look, that may indeed be a cause. The reasons are a bit complicated, but it looks fixable. Will update.
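
Roughly, the mechanism looks like this (a simplified sketch assuming SQLAlchemy with psycopg2; this is not IRRd's actual code): with the default transactional behaviour, the first statement on a connection implicitly opens a transaction, and if the code never commits or rolls back, the session stays "idle in transaction" and keeps the locks it already took, which is enough to make a later VACUUM FULL wait.

```python
# Simplified sketch (assumptions: SQLAlchemy + psycopg2, local "irrd" database,
# placeholder table name; not IRRd's actual code). A long-lived connection that
# runs a query and never commits shows up as "idle in transaction".
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///irrd")
conn = engine.connect()  # long-lived connection, e.g. held by a whois worker

conn.execute(text("SELECT count(*) FROM some_table"))  # "some_table" is a placeholder name
# Implicit BEGIN: the session is now "idle in transaction" and keeps holding its
# ACCESS SHARE lock on some_table until a commit or rollback happens.
# A later VACUUM FULL needs ACCESS EXCLUSIVE on that table, so it waits here,
# and every lock request queued behind it waits as well.
```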

I am curious, though, why you're running full vacuums, especially under load. As far as I know, they only add something if the database has shrunk significantly, which should be very rare in IRRd. Full vacuums are also quite slow and rather disruptive, as they take a lock that blocks all reads and writes.

ccaputo (Contributor, Author) commented Aug 12, 2020

> Upon closer look, that may indeed be a cause. The reasons are a bit complicated, but it looks fixable. Will update.

Thanks.

> I am curious, though, why you're running full vacuums, especially under load. As far as I know, they only add something if the database has shrunk significantly, which should be very rare in IRRd. Full vacuums are also quite slow and rather disruptive, as they take a lock that blocks all reads and writes.

I guess it is a habit from running Postgres instances over the years. The vacuum we run includes --analyze, which can be useful for query planning, and I also normally do a monthly clusterdb --all.

mxsasha linked a pull request on Aug 13, 2020 that will close this issue
mxsasha added a commit that referenced this issue on Aug 16, 2020:
Particularly whois worker database connections are long lived, and would always start a session. This results in many long-lived sessions. This commit adds a readonly flag that sets the database to use true PostgreSQL autocommit (which strangely, is indeed set through the isolation level). See https://www.oddbird.net/2014/06/14/sqlalchemy-postgres-autocommit/ for explanation on implicit autocommit vs true autocommit.
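
For reference, in SQLAlchemy terms that approach looks roughly like this (an illustrative sketch, not the actual commit; the connection string is a placeholder):

```python
# Illustrative sketch (not the actual IRRd change; DSN is a placeholder):
# true PostgreSQL autocommit is selected through the isolation level, so
# read-only connections never sit "idle in transaction" between queries.
from sqlalchemy import create_engine, text

# Engine-wide, for connections that only ever read:
readonly_engine = create_engine("postgresql:///irrd", isolation_level="AUTOCOMMIT")

# Or per-connection, on an engine created with the default settings:
engine = create_engine("postgresql:///irrd")
with engine.connect() as conn:
    conn = conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute(text("SELECT 1"))  # runs outside any transaction; no lingering locks
```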

mxsasha (Collaborator) commented Aug 16, 2020

b84e66b is a fix for the long-running transactions. However, I do still recommend against full vacuums, especially under load, unless there is a specific need to recover disk space.

ccaputo (Contributor, Author) commented Aug 16, 2020

Testing confirms the fix. No deadlock now.

While concurrently testing frequent serialized bgpq4 queries, I did encounter some bgpq4 select timeouts, likely caused by the VACUUM lock. Will make sure to only VACUUM at safe times. Thanks!
