Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test //beacon-chain/p2p:go_default_test #10160

Open
prestonvanloon opened this issue Jan 31, 2022 · 5 comments
Open

Flaky test //beacon-chain/p2p:go_default_test #10160

prestonvanloon opened this issue Jan 31, 2022 · 5 comments
Assignees
Labels
Bug Something isn't working CI Continuous integration related items Good First Issue Good for newcomers Help Wanted Extra attention is needed

Comments

@prestonvanloon
Copy link
Member

馃悶 Bug Report

Description

p2p tests are extremely flaky. These tests typically pass around one in ten runs.

I ran this test with --test_strategy=exclusive --runs_per_test=500 overnight and the result is as follows:

//beacon-chain/p2p:go_default_test                                       FAILED in 1218 out of 1355 in 21.5s

Has this worked before in a previous version?

This has been an issue for some time. It passes in CI as this package is rarely changed and a passing test is cached between CI requests.

馃敩 Minimal Reproduction

bazel or go test

bazel test //beacon-chain/p2p:go_default_test

馃敟 Error

--- FAIL: TestVerifyConnectivity (0.00s)
    --- FAIL: TestVerifyConnectivity/Dialing_a_reachable_IP:_142.250.68.46:80 (0.00s)
        assertions.go:164: utils_test.go:33 Unexpected log found: IP address is not accessible
            Searched logs:
            [time="2022-01-31T15:45:05Z" level=warning msg="IP address is not accessible" address="142.250.68.46:80" error="dial tcp 142.250.68.46:80: connect: network is unreachable" prefix=p2p protocol=tcp
            ]
--- FAIL: TestService_BroadcastAttestationWithDiscoveryAttempts (4.09s)                                                                       
    broadcaster_test.go:363: Failed to receive pubsub within 4s                                                                               
    assertions.go:86: broadcaster_test.go:351 Unexpected error: context deadline exceeded  

馃實 Your Environment

Operating System:

  
Linux - Ubuntu
  

What version of Prysm are you running? (Which release)

Latest develop commit

Anything else relevant (validator index / public key)?

@prestonvanloon prestonvanloon added Bug Something isn't working CI Continuous integration related items Good First Issue Good for newcomers Help Wanted Extra attention is needed labels Jan 31, 2022
@patterns
Copy link

patterns commented Apr 4, 2022

I'm interested in helping on this task. I just followed the Contributer guide and got a local environment to finish bazelisk test //beacon-chain/node:go_default_test. So I think I have the baseline ready.

@patterns
Copy link

patterns commented Apr 4, 2022

Reading the test in TestVerifyConnectivity, some initial thoughts are:

  • with google.com as the destination, there may be multiple hops to reach that proxy (or load balancer) and it can implement a back-off strategy (to fend off a flood of requests)
  • the verifyConnectivity function uses a static dial-timeout of one second which may be easily exceeded under high latency
  • at port 80 destination may be redirected to 443 (TLS protocol port)
  • at --runs_per_test=500 and 137 failures, something connects (we prefer 500 or 250 failures to be consistent)

The next steps I can attempt are to gather data to narrow down what is contributing to the failures. For example, choosing a destination that expressly accepts test requests, increasing delay between test run iterations, increasing dial-timeout, or try UDP port 53 (dns).

@nisdas
Copy link
Member

nisdas commented Apr 5, 2022

Hey @patterns , thanks for taking a look at this ! Just assigned you the issue. It seems you already have made good progress on this :) , looking forward to the improvements to these tests.

@patterns
Copy link

patterns commented Apr 7, 2022

Very appreciated, thanks! I think I have the TestVerifyConnectivity changes for a unit test. I borrowed most of the the code from the dial_test.go in the standard lib's net package and realized when looking there that they distinguish between available-external-network (and "local") for the test environment. That made me wonder whether TestVerifyConnectivity is meant to exercise the net.Dial() because it shouldn't be necessary to test that net.Dial() works. It made me think that the purpose of the test is really to verify/prove that a log entry is generated under the right circumstances (unreachable addresses). This seems reasonable because when I looked for where it is called, it is to do quick "health check" on
the Host-Address config param (line#266 in service.go). And I think the param option is the p2p-host-IP:

--p2p-host-ip value The IP address advertised by libp2p. This may be used to advertise an external IP.

I need to confirm whether is correct. At the same time, I'll try to understand the second failing test (Discovery-Attempts one).

utils_test.go.txt

@nisdas
Copy link
Member

nisdas commented Apr 7, 2022

Hey @patterns ,

That is correct, the TestVerifyConnectivity is used to verify that dials to external IPs work as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working CI Continuous integration related items Good First Issue Good for newcomers Help Wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants