
Handling of "violates foreign key constraint" exception still not working correctly #831

Closed
erikd opened this issue Sep 17, 2021 · 21 comments
Labels
bug Something isn't working

Comments

@erikd
Contributor

erikd commented Sep 17, 2021

The supposed "fix" in a105cb5 was not actually a fix. @kderme has a PR (#829) with a correction and a test, but even that is not sufficient.

This issue is more complex. I updated @kderme's test and it seems that if an exception occurs, the whole transaction is aborted.

@erikd erikd added the bug Something isn't working label Sep 17, 2021
@erikd
Contributor Author

erikd commented Sep 17, 2021

My updated version of @kderme's test is as follows:

insertForeignKeyMissing :: IO ()
insertForeignKeyMissing = do
  time <- getCurrentTime
  runDbNoLogging $ do
    slid <- insertSlotLeader testSlotLeader
    bid <- insertBlockChecked (blockZero slid)
    txid <- insertTx (txZero bid)
    phid <- insertPoolHash poolHash0
    pmrid <- insertPoolMetadataRef $ poolMetadataRef txid phid
    let fe = poolOfflineFetchError phid pmrid time
    void . insertAbortForeignKey nullPrint . void $ insertPoolOfflineFetchError fe

    count0 <- poolOfflineFetchErrorCount
    assertBool (show count0 ++ "/= 1") (count0 == 1)

    -- Delete the foreign key.
    delete pmrid
    -- Following insert should fail internally, but the exception should be caught
    -- and swallowed.
    void . insertAbortForeignKey nullPrint . void $ insertPoolOfflineFetchError fe

    count1 <- poolOfflineFetchErrorCount
    assertBool (show count1 ++ "/= 0") (count1 == 0)

    -- Try again to insert the PoolOfflineFetchError
    pmrid2 <- insertPoolMetadataRef $ poolMetadataRef txid phid
    void . insertAbortForeignKey nullPrint .
              void $ insertPoolOfflineFetchError (poolOfflineFetchError phid pmrid2 time)

    count2 <- poolOfflineFetchErrorCount
    assertBool (show count2 ++ "/= 1") (count2 == 1)
  where
    nullPrint :: Text -> IO ()
    nullPrint = const $ pure ()

    poolOfflineFetchErrorCount :: MonadIO m => ReaderT SqlBackend m Int
    poolOfflineFetchErrorCount = do
        ls :: [Entity PoolOfflineFetchError] <- selectList [] []
        pure $ length ls

This now fails with:

Exception: DbInsertException "PoolMetadataRef"
   ( SqlError
      { sqlState = "23503"
      , sqlExecStatus = FatalError
      , sqlErrorMsg = "insert or update on table \"pool_metadata_ref\" violates foreign key constraint\
                                      \"pool_metadata_ref_registered_tx_id_fkey\""
      , sqlErrorDetail = "Key (registered_tx_id)=(32) is not present in table \"tx\"."
      , sqlErrorHint = ""
      }
  )

which occurs on the line:

    pmrid2 <- insertPoolMetadataRef $ poolMetadataRef txid phid

The problem is that the PostgreSQL row identifiers txid and phid are no longer valid, because the whole transaction is aborted. The transaction starts at the call to runDbNoLogging at the top of the function.

This particular test can be fixed by adding a second call to runDbNoLogging after the exception and then reinserting the values that generate txid and phid. This fixes the test, but is not a general solution.

@kderme's PR #829 adds exception handlers in other places, but because of the transaction abort issue, I do not think these are useful. They will catch and log the exception, but then cause problems elsewhere due to the data lost when the transaction was aborted.

@erikd
Contributor Author

erikd commented Sep 17, 2021

I have tried adding a transactionCommit in the exception handler, but that does not help because the transaction has already been aborted. Any transaction commit would need to happen before the exception.

@erikd
Contributor Author

erikd commented Sep 17, 2021

The problem is the exception. As soon as that happens we are in a world of hurt.

I think the solution might be a custom insert function (something like insertCheckUnique) that fails without causing an exception to be thrown (and hence aborting the current transaction).

@erikd
Contributor Author

erikd commented Sep 17, 2021

Can I just say again, publicly, how much I hate exceptions.

@erikd
Contributor Author

erikd commented Sep 17, 2021

A couple of possible "simple solutions":

  • Do a transactionCommit before every insertion of these PoolOfflineData / PoolOfflineFetchError (this is a really shitty solution, because it drastically slows down sync speeds).
  • Test the existence of the required database indices and only insert if they exist (a race condition would still exist, but it's a MUCH narrower window). Maybe this could even be extended to 2160 blocks (beyond which rollbacks are not possible), so there is never any possibility of rollbacks.
  • Modify the query that selects the metadata to be fetched so that it only returns entries that are at least (say) 10 blocks old (the race condition still exists, but the window is MUCH narrower).

@kderme
Contributor

kderme commented Sep 17, 2021

Do a transactionCommit before every insertion of these PoolOfflineData / PoolOfflineFetchError (this is a really shitty solution, because it drastically slows down sync speeds).

I noticed we already do a commit per block. So this won't really have a huge performance impact.

@erikd
Contributor Author

erikd commented Sep 17, 2021

I noticed we already do a commit per block. So this won't really have a huge performance impact.

Sorry, that is not correct. We do a commit per block when we are following the chain, but only commit once every 1000 or so blocks when syncing (because committing on every block during sync is a huge performance hit).

@kderme
Contributor

kderme commented Sep 17, 2021

where is 1000 specified?

@erikd
Contributor Author

erikd commented Sep 18, 2021

Sorry, mis-remembered the 1000. I think it was that at some point, but now it's like this.

In Cardano.DbSync.Plugin.Default.handleLedgerEvents we have:

        LedgerNewEpoch en ss -> do
          lift $ do
            insertEpochSyncTime en ss (leEpochSyncTime lenv)
            adjustEpochRewards tracer (en - 2)
          finalizeEpochBulkOps lenv
          -- Commit everything in the db *AFTER* the epoch rewards have been inserted, the orphaned
          -- rewards removed and the bulk operations finalized.
          lift DB.transactionCommit

so that is once per epoch.

In Cardano.DbSync.Era.Shelley.Insert.insertShelleyBlock we have:

    when (getSyncStatus details == SyncFollowing) $
      -- Serializing things during syncing can drastically slow down full sync
      -- times (ie 10x or more).
      lift DB.transactionCommit

which commits on every new block when we are following the chain tip, but not when syncing.

@kderme
Contributor

kderme commented Sep 18, 2021

There is a new transaction for every insertDefaultBlock call. It's done in runDbIohkLogging.

@erikd
Contributor Author

erikd commented Sep 18, 2021

This code has changed so many times it's confusing.

However, the code is:

insertDefaultBlock
    :: SqlBackend -> Trace IO Text -> SyncEnv -> [BlockDetails]
    -> IO (Either SyncNodeError ())
insertDefaultBlock backend tracer env blockDetails = do
    thisIsAnUglyHack tracer (envLedger env)
    DB.runDbIohkLogging backend tracer $
      runExceptT (traverse_ insert blockDetails)

So the commit is done for every call to insertDefaultBlock, but that function inserts a list of blocks, which at least during syncing is a list of more than one.

However, this does mean that some of the other transactionCommits can probably be removed.

@kderme
Contributor

kderme commented Sep 18, 2021

Actually we don't call insertOfflineResults on every block either. The current policy is:

unBlockNo (Generic.blkBlockNo blk) `mod` offlineModBase == 0
where
    offlineModBase :: Word64
    offlineModBase =
      case getSyncStatus details of
        SyncFollowing -> 10
        SyncLagging -> 2000

That's rare enough to add a commit. If we add a transactionCommit as the first thing in insertOfflineResults, that's a simple solution that should have no performance penalty.
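
A minimal sketch of that suggestion; the signature, the FetchResult list and the insertFetchResult helper are assumptions for illustration, not the actual cardano-db-sync code:

-- Hypothetical sketch only; the real insertOfflineResults may differ.
insertOfflineResults
    :: MonadIO m
    => Trace IO Text
    -> [FetchResult]
    -> ReaderT SqlBackend m ()
insertOfflineResults tracer results = do
    -- Commit whatever the surrounding transaction already holds *before*
    -- touching the offline pool tables, so a later foreign key failure
    -- cannot abort work inserted earlier in the same transaction.
    DB.transactionCommit
    mapM_ (insertFetchResult tracer) results  -- insertFetchResult is hypothetical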

@erikd
Contributor Author

erikd commented Sep 18, 2021

But this is all getting away from the problem. The problem is the exception. We should be finding a way to avoid that rather than trying to fix the issues resulting from the exception.

@kderme
Contributor

kderme commented Sep 18, 2021

Well, for the problem I think we need to go back to #806 (comment). We have a separate thread which collects FetchResults, and they are asynchronously inserted later. Anything can happen in between, including rollbacks. If we remove the additional thread, which I think was added for performance, there won't be any issues.

@kderme
Contributor

kderme commented Sep 18, 2021

Also I don't think it's the exception that causes the issue. Postgres aborts the transaction and returns an error code. The error code is translated to an exception in postgresql-simple. We catch the exception, but the transaction is still aborted in Postgres, which causes the error "current transaction is aborted, commands ignored until end of transaction block". We need to restart the transaction.
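
To make that concrete, here is a small standalone sketch using postgresql-simple directly rather than the db-sync code (the connection string and table names are made up), showing that catching the exception in Haskell does not clear the aborted state on the PostgreSQL side:

{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (try)
import Data.Int (Int64)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction (begin, rollback)

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=fk_demo"
  begin conn
  -- This insert violates a (hypothetical) foreign key constraint and throws
  -- a SqlError, which we catch ...
  _ <- try (execute_ conn "INSERT INTO child (parent_id) VALUES (9999)")
         :: IO (Either SqlError Int64)
  -- ... but the transaction is already aborted on the server, so even this
  -- trivial query now fails with "current transaction is aborted, commands
  -- ignored until end of transaction block".
  res <- try (query_ conn "SELECT 1") :: IO (Either SqlError [Only Int])
  print res
  -- Only ending the transaction (ROLLBACK or COMMIT) clears the aborted state.
  rollback conn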

@erikd
Contributor Author

erikd commented Sep 18, 2021

We need to restart the transaction.

But restarting the transaction would require a roll forward of all the insertions that were aborted. That would be a huge bookkeeping nightmare.

If we remove the additional thread, which I think was added for performance, there won't be any issues.

No, the separate thread is there to remove the actual HTTP fetch from the main code path. We need that thread because each HTTP fetch takes at least 10 seconds (and fetches that fail can take 30 seconds or more).

I still think the cleanest and nicest solution is to avoid the exception, even if that means that the PoolOfflineData / PoolOfflineFetchError insertion first checks that the foreign key is valid.

I think we can write a custom insert that first checks whether the foreign key is valid: if it is, it does the regular insert; if it is not, it does nothing.
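
A rough sketch of what such a guarded insert could look like with persistent; the function name and the poolOfflineFetchErrorPmrId field accessor are assumptions, and the entity types are taken from the db-sync schema:

import Control.Monad.IO.Class (MonadIO)
import Control.Monad.Trans.Reader (ReaderT)
import Database.Persist.Sql (SqlBackend, get, insert)

-- Hypothetical guarded insert: look up the referenced PoolMetadataRef row
-- first and only insert when it still exists; otherwise silently drop the
-- data (it will be refetched later), so no exception is thrown and the
-- surrounding transaction is never aborted.
insertPoolOfflineFetchErrorGuarded
    :: MonadIO m
    => PoolOfflineFetchError
    -> ReaderT SqlBackend m (Maybe PoolOfflineFetchErrorId)
insertPoolOfflineFetchErrorGuarded fe = do
    mRef <- get (poolOfflineFetchErrorPmrId fe)  -- check the foreign key first
    case mRef of
      Nothing -> pure Nothing                    -- FK row missing: do nothing
      Just _  -> Just <$> insert fe              -- FK row present: normal insert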

@kderme
Contributor

kderme commented Sep 18, 2021

I still think the cleanest and nicest solution is to avoid the exception, even if that means that the PoolOfflineData / PoolOfflineFetchError insertion first checks that the foreign key is valid.

Yes, I'm starting to agree; handling the exception properly through all the layers (persistent, postgresql-simple) would be a nightmare.

Test the existence of the required database indices and only insert if they exist (a race condition would still exist, but it's a MUCH narrower window). Maybe this could even be extended to 2160 blocks (beyond which rollbacks are not possible), so there is never any possibility of rollbacks.

Actually, thinking about this again, there won't be any race condition. There is no other thread inserting into the db or doing rollbacks, so checking before inserting is 100% reliable.

@erikd
Contributor Author

erikd commented Sep 18, 2021

Threads are hard, but I like threads. Exceptions I just hate.

@kderme
Contributor

kderme commented Sep 18, 2021

Fix here: #835

@1000101
Contributor

1000101 commented Sep 20, 2021

FYI it happened again today on multiple instances running this version at block 6266650.

Here's the log in case it might help:

[db-sync-node:Info:74] [2021-09-20 01:44:18.41 UTC] insertShelleyBlock(Alonzo): epoch 291, slot 40535965, block 6266649, hash eb167c14bc1d91d90b1390340302fb0b027931af3bac47ad14401021d01e0d16
[db-sync-node:Info:74] [2021-09-20 01:45:33.66 UTC] insertShelleyBlock(Alonzo): epoch 291, slot 40536040, block 6266650, hash beba868a7e230e0f2c552116c7a974b216e4b96597aa8e535bd17ad7e5f40b7d
[db-sync-node:Info:74] [2021-09-20 01:45:33.66 UTC] Offline pool metadata fetch: 35 results, 65 fetch errors
[db-sync-node:Info:74] [2021-09-20 01:45:34.32 UTC] Rolling back to slot 40535965, hash eb167c14bc1d91d90b1390340302fb0b027931af3bac47ad14401021d01e0d16
[db-sync-node:Info:74] [2021-09-20 01:45:34.32 UTC] Deleting 1 blocks up to slot 40536040
[db-sync-node:Info:74] [2021-09-20 01:45:34.43 UTC] Blocks deleted
[db-sync-node:Info:74] [2021-09-20 01:45:34.43 UTC] Found in memory ledger snapshot at slot 40535965, hash eb167c14bc1d91d90b1390340302fb0b027931af3bac47ad14401021d01e0d16
[db-sync-node:Info:74] [2021-09-20 01:45:35.07 UTC] insertShelleyBlock(Alonzo): epoch 291, slot 40536041, block 6266650, hash 7c278acc1cf3adeb2f697e3cff72529388769f3aa2a0a2a9c1cb22dc55ae1a68
[db-sync-node:Info:74] [2021-09-20 01:45:35.07 UTC] Offline pool metadata fetch: 1 results, 0 fetch errors
[db-sync-node:Info:74] [2021-09-20 01:45:35.41 UTC] Rolling back to slot 40535965, hash eb167c14bc1d91d90b1390340302fb0b027931af3bac47ad14401021d01e0d16
[db-sync-node:Info:74] [2021-09-20 01:45:35.41 UTC] Deleting 1 blocks up to slot 40536041
[db-sync-node:Info:74] [2021-09-20 01:45:35.53 UTC] Blocks deleted
[db-sync-node:Info:74] [2021-09-20 01:45:35.53 UTC] Found in memory ledger snapshot at slot 40535965, hash eb167c14bc1d91d90b1390340302fb0b027931af3bac47ad14401021d01e0d16
[db-sync-node:Info:74] [2021-09-20 01:45:36.14 UTC] insertShelleyBlock(Alonzo): epoch 291, slot 40536040, block 6266650, hash beba868a7e230e0f2c552116c7a974b216e4b96597aa8e535bd17ad7e5f40b7d
[db-sync-node:Info:74] [2021-09-20 01:45:36.14 UTC] Offline pool metadata fetch: 12 results, 1 fetch errors
[db-sync-node:Error:74] [2021-09-20 01:45:36.15 UTC] runDBThread: DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(23435) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node:Error:69] [2021-09-20 01:45:36.15 UTC] ChainSyncWithBlocksPtcl: DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(23435) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node.Mux:Info:66] [2021-09-20 01:45:36.15 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"} event: State: Dead
[db-sync-node.Mux:Notice:66] [2021-09-20 01:45:36.15 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"} event: Miniprotocol MiniProtocolNum 5 InitiatorDir terminated with exception DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(23435) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node.Subscription:Error:65] [2021-09-20 01:45:36.15 UTC] Identity Application Exception: LocalAddress "/run/cardano-node/node.socket" DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(23435) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""})
[db-sync-node.ErrorPolicy:Error:6] [2021-09-20 01:45:36.15 UTC] IP LocalAddress "/run/cardano-node/node.socket" ErrorPolicyUnhandledApplicationException (DbInsertException "PoolOfflineFetchError" (SqlError {sqlState = "23503", sqlExecStatus = FatalError, sqlErrorMsg = "insert or update on table \"pool_offline_fetch_error\" violates foreign key constraint \"pool_offline_fetch_error_pmr_id_fkey\"", sqlErrorDetail = "Key (pmr_id)=(23435) is not present in table \"pool_metadata_ref\".", sqlErrorHint = ""}))
[db-sync-node.Subscription:Notice:64] [2021-09-20 01:45:37.56 UTC] Identity Restarting Subscription after 278087.81629785s desired valency 1 current valency 0
[db-sync-node.Subscription:Notice:64] [2021-09-20 01:45:37.56 UTC] Identity Starting Subscription Worker, valency 1
[db-sync-node.Subscription:Notice:151184] [2021-09-20 01:45:37.56 UTC] Identity Connection Attempt Start, destination LocalAddress "/run/cardano-node/node.socket"
[db-sync-node.Subscription:Notice:151184] [2021-09-20 01:45:37.56 UTC] Identity Connection Attempt End, destination LocalAddress "/run/cardano-node/node.socket" outcome: ConnectSuccessLast
[db-sync-node.Handshake:Info:151184] [2021-09-20 01:45:37.56 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"}) Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073),(NodeToClientV_4,TInt 764824073),(NodeToClientV_5,TInt 764824073),(NodeToClientV_6,TInt 764824073),(NodeToClientV_7,TInt 764824073),(NodeToClientV_8,TInt 764824073),(NodeToClientV_9,TInt 764824073)]))
[db-sync-node.Handshake:Info:151184] [2021-09-20 01:45:37.56 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"}) Recv (ServerAgency TokConfirm,MsgAcceptVersion NodeToClientV_9 (TInt 764824073))
[db-sync-node.Mux:Info:151184] [2021-09-20 01:45:37.56 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"} event: Handshake Client end, duration 0.001279094s
[db-sync-node.Mux:Info:151185] [2021-09-20 01:45:37.56 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/run/cardano-node/node.socket"} event: State: Mature
[db-sync-node:Info:151188] [2021-09-20 01:45:37.56 UTC] Starting chainSyncClient
[db-sync-node:Info:151188] [2021-09-20 01:45:37.57 UTC] Cardano.Db tip is at slot 40535965, block 6266649
[db-sync-node:Info:151193] [2021-09-20 01:45:37.57 UTC] Running DB thread
[db-sync-node.Subscription:Notice:64] [2021-09-20 01:45:37.58 UTC] Identity Required subscriptions started
[db-sync-node:Info:151193] [2021-09-20 01:45:37.58 UTC] Rolling back to slot 40348798, hash 907b0257659c817a5e3dcf2fff293dfbe30bc6c5e17955d66b3e3888afefc77d
[db-sync-node:Info:151193] [2021-09-20 01:45:37.60 UTC] Deleting 9270 blocks up to slot 40535965
[db-sync-node:Info:151193] [2021-09-20 01:55:26.94 UTC] Blocks deleted
[db-sync-node:Info:151193] [2021-09-20 01:55:41.29 UTC] Found snapshot file for slot 40348798, hash 907b0257659c817a5e3dcf2fff293dfbe30bc6c5e17955d66b3e3888afefc77d

@erikd
Contributor Author

erikd commented Sep 21, 2021

Yes @1000101, that version is known to still have this issue.

@erikd erikd closed this as completed in d658f38 Sep 21, 2021
kderme added a commit that referenced this issue Sep 21, 2021
This avoids the race condition described in the ticket #806 below. It
does so by ensuring the required foreign keys exist before the insert.
If the required foreign keys do not exist the data is just dropped and
will be refetched later.

Cherry pick of d658f38 from master.

Closes: #806
Closes: #823
Closes: #831