Intermittent test failures involving TestStartosisAddServiceTestSuite #1538

Open
omar711 opened this issue Oct 12, 2023 · 7 comments
Labels: bug, papercut

omar711 (Contributor) commented Oct 12, 2023

What's your CLI version?

0.84.3

Description & steps to reproduce

I've seen the TestStartosisAddServiceTestSuite test fail twice in CI checks with seemingly unrelated PRs. The failures can be fixed by re-running, where they usually then pass.

Examples: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7789/workflows/26e1b319-a461-455a-b790-a27fe7bcd6ae/jobs/105839 and https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7775/workflows/b40031d3-47fd-43b9-8265-94fe5d59ea5e/jobs/105604

Interesting notes:

  • the duration is relatively long (>90s), suggesting something is timing out
  • one failure involves an HTTP error
  • one failure appears to be caused by a container being destroyed, or becoming unavailable, before the test expected it

This suggests there's something awkward in this test that makes it a little brittle.

Desired behavior

Tests + CI run reliably

What is the severity of this bug?

Papercut; this bug is frustrating, but I have a workaround.

What area of the product does this pertain to?

Other: anything not covered by the above

omar711 added the bug label on Oct 12, 2023
omar711 (Contributor, Author) commented Oct 13, 2023

One more here: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163

This time it's not add service but TestStartosisReplaceTestSuite:

TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

Also looks random, and possibly something being torn down prematurely.

mieubrisse (Member) commented:

        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

That looks like a plain bug: our code being wrong, nothing to do with CI.

I really, really, really think we should prioritize digging out the cause and fixing these... flaky tests erode trust in CI across the entire team, and with merge queues flaky tests are only going to exacerbate the problem. Our CI test suite didn't use to be flaky, and in the past ~6-9 months it's gotten increasingly so.

omar711 (Contributor, Author) commented Oct 16, 2023

Cool, found the issue in the tests themselves (excuse my Go learning adventure):

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
	ctx := context.Background()
	runResult, _ := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

	t := suite.T()

	require.Nil(t, runResult.InterpretationError)
	require.Empty(t, runResult.ValidationErrors)
	require.Nil(t, runResult.ExecutionError)
	expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
	require.Regexp(t, expectedResult, string(runResult.RunOutput))
}

The error returned by suite.RunPackageWithParams is ignored, which causes the nil pointer issue at the first require. I tested this locally by overriding its return values, e.g.:

	runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
	runResult = nil
	err = fmt.Errorf("oh dear")
...

Adding a require.NoError(t, err) right after that call then reveals the actual error that's occurring (see the sketch below). I'll try to find other instances of this and apply the same fix in a PR.
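
For reference, a minimal sketch of what the fixed test looks like, using the same suite, helpers, and constants as the snippet above (the exact change that landed in #1559 may differ slightly):

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
	ctx := context.Background()
	runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

	t := suite.T()

	// Surface the real error and stop here instead of dereferencing a nil runResult below.
	require.NoError(t, err)

	require.Nil(t, runResult.InterpretationError)
	require.Empty(t, runResult.ValidationErrors)
	require.Nil(t, runResult.ExecutionError)
	expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
	require.Regexp(t, expectedResult, string(runResult.RunOutput))
}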

That will only show us the real errors, and it only affects 1 of the 3 cases above, from what I can see. There's still something weird in the test infra / CI (long-running tests, maybe things timing out, etc.).

#1559 contains the fix for this one, plus similar tests

omar711 (Contributor, Author) commented Oct 16, 2023

One more random failure (occurred locally with ./internal_testsuites/scripts/test.sh; re-running fixed it):

--- FAIL: TestIdentifiers (1.33s)
    enclave_and_service_identifiers_test.go:56:
        	Error Trace:	/Users/omar/kurtosis/kurtosis/internal_testsuites/golang/testsuite/enclave_and_service_identifiers_test/enclave_and_service_identifiers_test.go:56
        	Error:      	Received unexpected error:
        	            	An error occurred adding the datastore service
        	            	 --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:220 (AddDatastoreService) ---
        	            	Caused by: An error has occurred when running Starlark to add service
        	            	 --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:162 (AddService) ---
        	            	Caused by: Error requesting Starlark Script run
        	            	 --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:119 (EnclaveContext.RunStarlarkScriptBlocking) ---
        	            	Caused by: Unexpected error happened executing Kurtosis script.
        	            	 --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:104 (EnclaveContext.RunStarlarkScript) ---
        	            	Caused by: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: http2: frame too large"
        	Test:       	TestIdentifiers
        	Messages:   	An error occurred adding the datastore service to the enclave
FAIL
FAIL	github.com/kurtosis-tech/kurtosis-cli/golang_internal_testsuite/testsuite/enclave_and_service_identifiers_test	1.491s

omar711 self-assigned this on Oct 16, 2023
omar711 (Contributor, Author) commented Oct 16, 2023

In relation to the panic error (https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163), I've just noticed the gateway log shows:

INFO[2023-10-13T15:43:31Z] Forwarding requests to local port number '39373' to remote port with id 'grpc' 
INFO[2023-10-13T15:43:31Z] Closing connection to pod: https://0.0.0.0:40395/api/v1/namespaces/kt-idle-enclave-3eed8f5bc86342ee9def0471466a9c3a/pods/kurtosis-api/portforward 
WARN[2023-10-13T15:43:31Z] Expected to run api container gateway until stopped, but the server exited prematurely with a non-nil error: 'Expected to run API container gateway server until stopped, but the server exited with a non-nil error
 --- at /root/project/cli/cli/kurtosis_gateway/run/api_container_gateway/run_until_stopped.go:48 (RunApiContainerGatewayUntilStopped) ---
Caused by: An error occurred creating the listener on tcp/39373
 --- at /go/pkg/mod/github.com/kurtosis-tech/minimal-grpc-server/golang@v0.0.0-20230710164206-90b674acb269/server/server.go:111 (MinimalGRPCServer.RunUntilStopped) ---
Caused by: listen tcp :39373: bind: address already in use' 
E1013 15:43:31.228729    7415 portforward.go:419] error closing listener: close tcp4 127.0.0.1:39373: use of closed network connection
ERRO[2023-10-13T15:43:36Z] Expected Gateway to be reachable, instead an error was returned:
Expected to be to call `GetServices` and wait for server to be ready, instead a non-nil error was returned
 --- at /root/project/cli/cli/kurtosis_gateway/server/engine_gateway/engine_gateway_service_server.go:323 (waitForGatewayReady) ---
Caused by: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39373: connect: connection refused" 
INFO[2023-10-13T15:43:36Z] Forwarding requests to local port number '35083' to remote port with id 'grpc' 

which happens exactly when the test itself fails:

time="2023-10-13T15:43:36Z" level=info msg="Uploading and executing package 'github.com/kurtosis-tech/sample-startosis-load/replace-subpackage'"
--- FAIL: TestStartosisReplaceTestSuite (86.00s)
    --- FAIL: TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

So I'm pretty certain the true cause is something funky in the test infra / timing / stuff closing or timing out / etc.
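
For what it's worth, the usual way to sidestep that "bind: address already in use" race in Go is to bind port 0 and let the kernel hand back a currently free port, then use whatever it returned. A minimal, self-contained sketch of that pattern (this is not the gateway's actual code, just an illustration of the class of fix):

package main

import (
	"fmt"
	"net"
)

// listenOnFreePort asks the kernel for any free TCP port on loopback and
// returns the listener plus the port it actually got.
func listenOnFreePort() (net.Listener, int, error) {
	// Port 0 means "pick any currently free port".
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, fmt.Errorf("failed to grab a free port: %w", err)
	}
	port := listener.Addr().(*net.TCPAddr).Port
	return listener, port, nil
}

func main() {
	listener, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer listener.Close()
	// A server (e.g. a gRPC gateway) would now serve on this listener and
	// advertise `port` to whoever needs to dial it.
	fmt.Printf("listening on 127.0.0.1:%d\n", port)
}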

github-merge-queue bot pushed a commit that referenced this issue Oct 16, 2023
## Description:
When an unexpected error happens in
`enclave_context.RunStarlarkPackageBlocking`, it will return `(nil,
error)`. Some tests don't account for this, so they don't detect or
report said error, then panic when trying to reference the `nil` result.

This adds `require.NoError(t, err)` to all the tests I could find that
hit this codepath.

## Is this change user facing?
NO

## References (if applicable):
#1538

---------

Co-authored-by: Omar <omar@omar>
laurentluce (Contributor) commented:

@omar711 Are you still investigating/working on that?

omar711 (Contributor, Author) commented Oct 18, 2023

Hi @laurentluce, not actively. I dig in when I see something I can debug, then leave it be; e.g. #1559 fixed a bug that might hide underlying failures. I don't think I've gotten near the root of it, though.
