Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: insert or update on table "pool_offline_fetch_error" #806

Closed
dmitrystas opened this issue Sep 11, 2021 · 8 comments
Closed

Error: insert or update on table "pool_offline_fetch_error" #806

dmitrystas opened this issue Sep 11, 2021 · 8 comments

Comments

@dmitrystas
Copy link

dmitrystas commented Sep 11, 2021

Mainnet, epoch 289
cardano-db-sync 11.0.0 - linux-x86_64 - ghc-8.10
git revision 07ee3b0

Important to note that although PostgreSQL reports this as a fatal error, db-sync does not consider it fatal, but rolls back to the start of the epoch and tries again. The second time around this will not fail because the rolled back PoolMetadataRef row no longer exists and a new correct PoolMetadataRef row exists in its place.

from logs

[db-sync-node:Info:97900] [2021-09-07 23:43:08.02 UTC] insertShelleyBlock(Mary): epoch 289, slot 39491565, block 6215190, hash 3ce4f01ceed8c47f1fd3f05ed96a0988a51d4d669042d6c928962624e54c1cfa
[db-sync-node:Info:97900] [2021-09-07 23:43:08.02 UTC] Offline pool metadata fetch: 33 results, 67 fetch errors
[db-sync-node:Error:97900] [2021-09-07 23:43:08.15 UTC] runDBThread: DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(22190) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node:Error:97895] [2021-09-07 23:43:08.15 UTC] ChainSyncWithBlocksPtcl: DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(22190) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node.Subscription:Notice:65] [2021-09-07 23:43:09.68 UTC] Identity Restarting Subscription after 420.896382303s desired valency 1 current valency 0
[db-sync-node.Subscription:Notice:65] [2021-09-07 23:43:09.68 UTC] Identity Starting Subscription Worker, valency 1
[db-sync-node.Subscription:Notice:97935] [2021-09-07 23:43:09.68 UTC] Identity Connection Attempt Start, destination LocalAddress "sockets/node0.socket"
[db-sync-node.Subscription:Notice:97935] [2021-09-07 23:43:09.69 UTC] Identity Connection Attempt End, destination LocalAddress "sockets/node0.socket" outcome: ConnectSuccessLast
[db-sync-node.Handshake:Info:97935] [2021-09-07 23:43:09.69 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node0.socket"}) Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073),(NodeToClientV_4,TInt 764824073),(NodeToClientV_5,TInt 764824073),(NodeToClientV_6,TInt 764824073),(NodeToClientV_7,TInt 764824073),(NodeToClientV_8,TInt 764824073),(NodeToClientV_9,TInt 764824073)]))
[db-sync-node.Handshake:Info:97935] [2021-09-07 23:43:09.70 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node0.socket"}) Recv (ServerAgency TokConfirm,MsgAcceptVersion NodeToClientV_9 (TInt 764824073))
[db-sync-node.Mux:Info:97935] [2021-09-07 23:43:09.70 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node0.socket"} event: Handshake Client end, duration 0.008607132s
[db-sync-node.Mux:Info:97936] [2021-09-07 23:43:09.70 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/opt/cardano/cnode/sockets/node0.socket"} event: State: Mature
[db-sync-node.Subscription:Notice:65] [2021-09-07 23:43:09.72 UTC] Identity Required subscriptions started
[db-sync-node:Info:97939] [2021-09-07 23:43:09.73 UTC] Starting chainSyncClient
[db-sync-node:Info:97939] [2021-09-07 23:43:09.90 UTC] Cardano.Db tip is at slot 39491560, block 6215189
[db-sync-node:Info:97944] [2021-09-07 23:43:09.90 UTC] Running DB thread
[db-sync-node:Info:97944] [2021-09-07 23:43:09.91 UTC] Rolling back to slot 39484797, hash 86ad35fd0bf316a036dd8794400ead9d4431d6df28eab986864bd5ef65e97f87
[db-sync-node:Info:97944] [2021-09-07 23:43:09.94 UTC] Deleting 325 blocks up to slot 39491560
[db-sync-node:Info:97944] [2021-09-07 23:43:13.67 UTC] Blocks deleted
[db-sync-node:Info:97944] [2021-09-07 23:44:24.77 UTC] Found snapshot file for slot 39484797, hash 86ad35fd0bf316a036dd8794400ead9d4431d6df28eab986864bd5ef65e97f87
[db-sync-node:Info:97944] [2021-09-07 23:46:16.91 UTC] File db_state/adastat/39484797-86ad35fd0b-288.lstate exists
[db-sync-node:Info:97944] [2021-09-07 23:48:35.89 UTC] insertShelleyBlock(Mary): epoch 289, slot 39488197, block 6215000, hash 9daef3159978f9da08cf02f726f68ac01e7d5883cebf96f418d6dcdc9ee2056a
[db-sync-node:Info:97944] [2021-09-07 23:51:10.62 UTC] insertShelleyBlock(Mary): continuing epoch 289 (slot 7081/432000)
[db-sync-node:Info:97944] [2021-09-07 23:51:10.62 UTC] insertShelleyBlock(Mary): epoch 289, slot 39491881, block 6215200, hash e97de7e839bfffefdcc2f166c5961e7756526322b47cb95ec0996eef713bf989
@erikd
Copy link
Contributor

erikd commented Sep 11, 2021

I am now running the 11.0.0 release against a datbase restored from the official snapshot. The snapshort starts are epoch 286 so it could be anything up to 3 hours to hit this issue.

I am expecting this to fail. I will then apply commit 4013035 which I am hoping fixes this.

@erikd
Copy link
Contributor

erikd commented Sep 11, 2021

[2021-09-11 11:33:23.39 UTC] Starting epoch 287

I am going to bed.

@erikd
Copy link
Contributor

erikd commented Sep 11, 2021

My db-sync instance is now fully synced and never ran into this problem. Not sure what happened here, but if I can't reproduce it, I can't know if I have fixed it or not.

@dmitrystas I assume your instance completed syncing after that rollback.

@kderme
Copy link
Contributor

kderme commented Sep 11, 2021

There is a race condition that can result in this error, between the OfflineFetch thread and the ChainSync thread. It can happen if these events happen in this order:

  • OfflineFetch thread queries pool_metadata_ref and gets pmr_id
  • ChainSync thread rollbacks the block that included this specific pmr_id. This will cascade delete the pmr_id
  • OfflineFetch thread writes to the queue the pool_offline_fetch_error referencing the non-existant pmr_id
  • ChainSync thread inserts the faulty pool_offline_fetch_error

There is nothing to prevent this from happening right now. It results in a restart and a rollback to the latest ledger state. However the error is not fatal, db-sync can sync further from that point.

@dmitrystas do you have the logs before the crash or do you remember if there was a rollback?

Solutions could be adding some delay to the OfflineFetch, to avoid inserting very recent pmr_id which may get rolled back, or some communication between the threads.

@erikd
Copy link
Contributor

erikd commented Sep 12, 2021

@kderme Yes, that makes sense. Nice analysis.

I think communicating between the threads is a little too complicated. A better solution might be to only query the pool_metadata_ref table for entries that are at least say 10 blocks old. This should reduce the probability of this happening by several orders of magnitude.

Another, even better solution is to catch that error and drop it (with a log message). This would avoid a the large rollback @dmitrystas saw, and the offline metadata would simply be fetched at a later time.

erikd added a commit that referenced this issue Sep 12, 2021
If a rollback occurs after the PoolMetadataRef is fetched, the
insertion of the result can fail. In this insert fail case, we simply log
and drop the exception.

Closes: #806
erikd added a commit that referenced this issue Sep 12, 2021
If a rollback occurs after the PoolMetadataRef is fetched, the
insertion of the result can fail. In this insert fail case, we simply log
and drop the exception.

Closes: #806
@erikd
Copy link
Contributor

erikd commented Sep 12, 2021

@kderme The "catch exception" solution implemented in #808 . What do you think?

@erikd erikd changed the title Fatal error: insert or update on table "pool_offline_fetch_error" Error: insert or update on table "pool_offline_fetch_error" Sep 12, 2021
erikd added a commit that referenced this issue Sep 12, 2021
If a rollback removes the PoolMetadataRef from the database after it is queried,
the insertion of the result can fail. In this insert fail case, we simply log
and drop the exception.

Closes: #806
@dmitrystas
Copy link
Author

@dmitrystas do you have the logs before the crash or do you remember if there was a rollback?

@kderme as far as I remember there was a rollback, so you're probably right

@erikd
Copy link
Contributor

erikd commented Sep 12, 2021

@MarcelKlammer Your issue is completely unrelated to this and therefore has been moved to #809 .

@IntersectMBO IntersectMBO deleted a comment from MarcelKlammer Sep 12, 2021
@erikd erikd closed this as completed in a105cb5 Sep 13, 2021
erikd added a commit that referenced this issue Sep 15, 2021
If a rollback removes the PoolMetadataRef from the database after it is queried,
the insertion of the result can fail. In this insert fail case, we simply log
and drop the exception.

Cherry picked commit a105cb5 from master.

Closes: #806
erikd added a commit that referenced this issue Sep 15, 2021
If a rollback removes the PoolMetadataRef from the database after it is queried,
the insertion of the result can fail. In this insert fail case, we simply log
and drop the exception.

Cherry picked commit a105cb5 from master.

Closes: #806
erikd pushed a commit that referenced this issue Sep 21, 2021
erikd pushed a commit that referenced this issue Sep 21, 2021
This avoids the race condition described in the ticket #806 below. It
does so by ensuring the required foreign keys exist before the insert.
If the required foreign keys do not exist the data is just dropped and
will be refetched later.

Closes: #806
Closes: #823
erikd pushed a commit that referenced this issue Sep 21, 2021
This avoids the race condition described in the ticket #806 below. It
does so by ensuring the required foreign keys exist before the insert.
If the required foreign keys do not exist the data is just dropped and
will be refetched later.

Closes: #806
Closes: #823
Closes: #831
kderme added a commit that referenced this issue Sep 21, 2021
This avoids the race condition described in the ticket #806 below. It
does so by ensuring the required foreign keys exist before the insert.
If the required foreign keys do not exist the data is just dropped and
will be refetched later.

Cherry pick of d658f38 from master.

Closes: #806
Closes: #823
Closes: #831
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants