End to end tests are flaky in CI #1558

Closed
4 tasks done
jpraynaud opened this issue Mar 8, 2024 · 1 comment

@jpraynaud
Member

jpraynaud commented Mar 8, 2024

Why

Following the fix for the end to end test flakiness in #1147, we have noticed that flakiness still occurs on rare occasions.

Flakiness 1

Command failed: transaction build Error: Transactions can only be produced in the same era as the node. Requested era: Babbage, node era: Conway.

Flakiness 2

Minimum expected mithril stake distribution epoch not reached : XX < XX

What

Investigate the origins of the flakiness and provide fixes.

How

  • Investigate flakiness 1 (related to funds transfer)
  • Fix flakiness 1
  • Investigate flakiness 2 (related to immutable database discrepancy)
  • Fix flakiness 2
@jpraynaud
Member Author

jpraynaud commented Apr 2, 2024

After investigating the second flakiness, it appears that the error is caused by an epoch gap, which has 2 different origins:

  • Non P2P test / https://github.com/input-output-hk/mithril/actions/runs/8065925048/job/22033017314#step:6:479

After investigation, it appears that the Cardano nodes don't all have the same version of the immutable files (which should not happen in real conditions, as the immutable files must be strictly identical for all Cardano nodes on a network).

$ ls -al node-pool1/db/immutable | grep chunk | head
-rw-r--r-- 1 jp jp 85519 Feb 27 14:03 00000.chunk
-rw-r--r-- 1 jp jp 86272 Feb 27 14:03 00001.chunk
-rw-r--r-- 1 jp jp 86396 Feb 27 14:03 00002.chunk
-rw-r--r-- 1 jp jp 87216 Feb 27 14:03 00003.chunk
-rw-r--r-- 1 jp jp 86266 Feb 27 14:03 00004.chunk
-rw-r--r-- 1 jp jp 85800 Feb 27 14:03 00005.chunk
-rw-r--r-- 1 jp jp 46332 Feb 27 14:03 00006.chunk
$ ls -al node-pool2/db/immutable | grep chunk | head
-rw-r--r-- 1 jp jp 85519 Feb 27 14:03 00000.chunk
-rw-r--r-- 1 jp jp 86272 Feb 27 14:03 00001.chunk
-rw-r--r-- 1 jp jp 85688 Feb 27 14:03 00002.chunk
-rw-r--r-- 1 jp jp 85800 Feb 27 14:03 00003.chunk
-rw-r--r-- 1 jp jp 86732 Feb 27 14:03 00004.chunk
-rw-r--r-- 1 jp jp 85800 Feb 27 14:03 00005.chunk
-rw-r--r-- 1 jp jp 46332 Feb 27 14:03 00006.chunk
$ ls -al node-pool3/db/immutable | grep chunk | head
-rw-r--r-- 1 jp jp 85519 Feb 27 14:03 00000.chunk
-rw-r--r-- 1 jp jp 86272 Feb 27 14:03 00001.chunk
-rw-r--r-- 1 jp jp 85688 Feb 27 14:03 00002.chunk
-rw-r--r-- 1 jp jp 85800 Feb 27 14:03 00003.chunk
-rw-r--r-- 1 jp jp 86732 Feb 27 14:03 00004.chunk
-rw-r--r-- 1 jp jp 85800 Feb 27 14:03 00005.chunk
-rw-r--r-- 1 jp jp 46332 Feb 27 14:03 00006.chunk

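Comparing the file sizes, we can clearly see that the immutable files diverge from 00002.chunk (00002.chunk to 00004.chunk have different sizes):

  • node-pool1, which is running the aggregator, has one version of the immutable database
  • node-pool2 and node-pool3, which are both running a signer, have file sizes that match each other but differ from node-pool1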
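
The manual size comparison above could be automated with a small script. This is a minimal sketch, not part of the Mithril test tooling: the function names are hypothetical, and it assumes the `node-poolN/db/immutable` layout shown in the listing. It compares content digests rather than sizes, so it would also catch chunks that differ while having the same size.

```python
import hashlib
from pathlib import Path


def chunk_digests(immutable_dir: Path) -> dict[str, str]:
    """Map each *.chunk file name in the directory to its SHA-256 hex digest."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(immutable_dir.glob("*.chunk"))
    }


def find_divergent_chunks(immutable_dirs: list[Path]) -> list[str]:
    """Return the chunk names whose content is not identical in every directory.

    A chunk that is missing from one of the directories also counts as divergent.
    """
    per_node = [chunk_digests(directory) for directory in immutable_dirs]
    all_names = sorted(set().union(*(digests.keys() for digests in per_node)))
    return [
        name
        for name in all_names
        # More than one distinct digest (or a None for a missing file) means divergence.
        if len({digests.get(name) for digests in per_node}) > 1
    ]
```

It could be called with `find_divergent_chunks([Path(f"node-pool{i}/db/immutable") for i in (1, 2, 3)])` from the devnet directory; an empty result means all nodes agree on every chunk.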

This situation is most likely due to the Cardano parameters that we use to make the test run fast.

In that case, there is no chance that the Cardano Immutable Files Snapshot could succeed after that point.
However, we could mitigate the problem by adding a timeout on this signed entity type, which could avoid the epoch gap.

Nonetheless, the final outcome of the test would be a failure as the end to end test would detect that some immutable files have not been signed.
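
To illustrate the timeout idea, here is a minimal sketch of a signing round that is abandoned once a deadline has passed. It is only a model under stated assumptions and not Mithril's implementation: `SigningRound`, `timeout_seconds` and `next_action` are hypothetical names, and the quorum is reduced to a simple signature count.

```python
from dataclasses import dataclass


@dataclass
class SigningRound:
    """Hypothetical model of one pending signing round for a signed entity type."""

    signed_entity_type: str
    started_at: float             # seconds, taken from any monotonic clock
    timeout_seconds: float        # hypothetical setting, not an existing option
    signatures_received: int = 0
    signatures_required: int = 1  # simplified stand-in for the quorum

    def has_quorum(self) -> bool:
        return self.signatures_received >= self.signatures_required

    def is_expired(self, now: float) -> bool:
        return now - self.started_at >= self.timeout_seconds


def next_action(signing_round: SigningRound, now: float) -> str:
    """Decide what to do with the round at time `now`."""
    if signing_round.has_quorum():
        return "certify"
    if signing_round.is_expired(now):
        # Give up on this round so that later signed entities are not
        # blocked behind a round that can never reach the quorum.
        return "abandon"
    return "wait"
```

With such a rule, a round that can never complete would stop blocking the following ones, although, as noted above, the end to end test would still report the unsigned immutable files.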

Given the low probability of this flakiness occurring, the fact that it is very unlikely to correspond to a real-life situation, and the overall high difficulty of properly handling that situation, there is no possible fix to implement.

  • P2P test / https://github.com/input-output-hk/mithril/actions/runs/8078344581/job/22070604203#step:6:796

After investigation, it appears that a signature sent by one of the signers was not properly received by the aggregator relay (whereas it was successfully received by other peers on the pubsub P2P topic). The aggregator was unable to create a multi-signature with the individual signatures of only one signer, as the quorum was not reached. As there is no timeout on the Mithril stake distribution signed entity, this led to no certificate being created for the corresponding epoch (which translates to an epoch gap at the following epochs).

As this feature is still experimental, the problem occurs with low probability, and the feature is currently being modified in issue #1587, we will not implement a fix directly, but we will keep monitoring whether it still occurs while developing #1587.
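
For monitoring purposes, the epoch gap described above can be detected from the list of epochs that did get a certificate. The helper below is a minimal sketch with a hypothetical name (`find_epoch_gaps`); how the certified epochs are collected (for example from the aggregator's certificate list) is left out and would be an assumption.

```python
def find_epoch_gaps(certified_epochs: list[int]) -> list[int]:
    """Return the epochs between the first and last certified epoch
    for which no certificate exists."""
    if not certified_epochs:
        return []
    present = set(certified_epochs)
    return [
        epoch
        for epoch in range(min(present), max(present) + 1)
        if epoch not in present
    ]
```

For instance, `find_epoch_gaps([3, 4, 6, 7])` returns `[5]`, meaning that no certificate was produced for epoch 5.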
