
Hopefully stabilize test_bad_connection.py #6976

Merged: 5 commits into main from sasha_fix_test, Mar 7, 2024

Conversation

save-buffer (Contributor)

Problem

It seems that even though we retry the basebackup fetch, it still sometimes fails while the failpoint is enabled, resulting in a test error.

Summary of changes

If we fail to get the basebackup, disable the failpoint and try again.
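A minimal sketch of the pattern, using hypothetical helper and failpoint names (not the actual test-suite API):

def fetch_basebackup_with_fallback(pageserver_http, endpoint):
    # Hypothetical sketch only; names are illustrative.
    try:
        # First attempt runs with the failpoint still active, so even
        # with the basebackup retry it may fail.
        return endpoint.start()
    except Exception:
        # Turn the failpoint off so failures can no longer be injected,
        # then try once more; this attempt should succeed.
        pageserver_http.configure_failpoints(("bad-connection-failpoint", "off"))
        return endpoint.start()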

save-buffer (Contributor, Author)

Addresses #6688

github-actions bot commented Feb 29, 2024

2490 tests run: 2369 passed, 0 failed, 121 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_empty_branch_remote_storage_upload: debug

Postgres 14

  • test_compute_pageserver_connection_stress: release

Code coverage* (full report)

  • functions: 28.8% (6992 of 24312 functions)
  • lines: 47.3% (43013 of 90878 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
f8f7d2d at 2024-03-07T16:30:43.505Z

jcsp (Contributor) commented Mar 1, 2024

It would be good to look at the connection failure probability vs. the number of retries in compute_ctl -- if we're doing e.g. three retries against a 50% failure rate, then those numbers probably need adjusting (probably by retrying more times).

save-buffer (Contributor, Author)

Currently we retry 5 times with an initial timeout of 500ms, doubling it each time:

>>> sum(500 * 2**i for i in range(5))
15500

i.e. at most 15.5 seconds of waiting in total. The probability of the test failing is then 1/(2^5) ≈ 3%. We can reduce that to roughly 0.1% by doing 10 retries. Is that good enough? Or should we make it deterministically pass?
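For reference, assuming each attempt fails independently at the failpoint's 50% rate, the chance that every attempt fails is:

>>> 0.5 ** 5
0.03125
>>> 0.5 ** 10
0.0009765625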

jcsp (Contributor) commented Mar 6, 2024

> Probability of the test failing is then 1/(2^5) ≈ 3%. We can reduce that to roughly 0.1% by doing 10 retries. Is that good enough? Or should we make it deterministically pass?

That works for me.

save-buffer (Contributor, Author)

I switched it to 10 retries, multiplying the timeout by 1.5x each time, making the maximum time spent waiting for retries:

>>> sum(500 * 1.5**i for i in range(10)) / 1000.0
56.6650390625

i.e. about 57 seconds.
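The retry loop itself lives in compute_ctl (Rust); this Python sketch, with made-up names, just illustrates the schedule:

import time

def retry_with_backoff(attempt_fn, retries=10, initial_timeout_ms=500, factor=1.5):
    # Keep calling attempt_fn until it succeeds, sleeping 500ms, 750ms,
    # 1125ms, ... after each failure; the waits sum to the ~57s above.
    timeout_ms = initial_timeout_ms
    last_exc = None
    for _ in range(retries):
        try:
            return attempt_fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(timeout_ms / 1000.0)
            timeout_ms *= factor
    raise last_exc  # all retries failed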

save-buffer enabled auto-merge (squash) Mar 7, 2024 18:11
save-buffer merged commit 2fc8942 into main Mar 7, 2024 (53 checks passed)
save-buffer deleted the sasha_fix_test branch Mar 7, 2024 18:12