Intermittent test failures involving TestStartosisAddServiceTestSuite #1538

Open
omar711 opened this issue Oct 12, 2023 · 7 comments
Labels: bug, papercut

omar711 (Contributor) commented Oct 12, 2023

What's your CLI version?

0.84.3

Description & steps to reproduce

I've seen the TestStartosisAddServiceTestSuite test fail twice in CI checks with seemingly unrelated PRs. The failures can be fixed by re-running, where they usually then pass.

Examples: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7789/workflows/26e1b319-a461-455a-b790-a27fe7bcd6ae/jobs/105839 and https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7775/workflows/b40031d3-47fd-43b9-8265-94fe5d59ea5e/jobs/105604

Interesting notes:

  • the duration is relatively long (>90s), suggesting something is timing out
  • one failure involves an HTTP error
  • one failure appears to be caused by a container being destroyed, or becoming unavailable, before the test expected it

This suggests there's something awkward in this test that makes it a little brittle.

Desired behavior

Tests + CI run reliably

What is the severity of this bug?

Papercut; this bug is frustrating, but I have a workaround.

What area of the product does this pertain to?

Other: anything not covered by the above

omar711 added the bug label on Oct 12, 2023
omar711 (Contributor, Author) commented Oct 13, 2023

One more here: https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163

This time it's not add service but TestStartosisReplaceTestSuite:

TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

Also looks random, and possibly something being torn down prematurely.

mieubrisse (Member) commented:

        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

That looks like a plain bug: our code being wrong, nothing to do with CI.

I really, really, really think we should prioritize digging out the cause and fixing these... flaky tests erode trust in CI across the entire team, and with merge queues flaky tests are only going to exacerbate the problem. Our CI test suite didn't use to be flaky, and in the past ~6-9 months it's gotten increasingly so.

omar711 (Contributor, Author) commented Oct 16, 2023

Cool, found the issue in the tests themselves (excuse my Go learning adventure):

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
	ctx := context.Background()
	runResult, _ := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

	t := suite.T()

	require.Nil(t, runResult.InterpretationError)
	require.Empty(t, runResult.ValidationErrors)
	require.Nil(t, runResult.ExecutionError)
	expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
	require.Regexp(t, expectedResult, string(runResult.RunOutput))
}

The error returned by suite.RunPackageWithParams is ignored, which causes the nil pointer issue at the first require. I tested this locally by overriding its return values, e.g.:

	runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)
	runResult = nil
	err = fmt.Errorf("oh dear")
...

Adding a require.NoError(t, err) right after that call then reveals the actual error that's occurring (see the sketch below). I'll try to find other instances of this and apply the same fix in a PR.
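
For reference, a minimal sketch of what the fixed test looks like, using the same suite, helpers, and constants as the snippet above (the exact change that landed in #1559 may differ slightly):

func (suite *StartosisReplaceTestSuite) TestStartosisReplaceWithModuleInDirectory() {
	ctx := context.Background()
	runResult, err := suite.RunPackageWithParams(ctx, packageWithReplaceModuleInDirectoryRelPath, packageWithReplaceModuleInDirectoryParams)

	t := suite.T()

	// Surface the real error and stop here instead of dereferencing a nil runResult below.
	require.NoError(t, err)

	require.Nil(t, runResult.InterpretationError)
	require.Empty(t, runResult.ValidationErrors)
	require.Nil(t, runResult.ExecutionError)
	expectedResult := "Replace with module in directory sample package loaded.\nVerification succeeded. Value is '\"another-dependency-loaded-from-internal-module-in-main-branch\"'.\n"
	require.Regexp(t, expectedResult, string(runResult.RunOutput))
}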

That will only show us the real errors, and it only affects 1 of the 3 cases above, from what I can see. There's still something weird in the test infra / CI (long-running tests, maybe things timing out, etc.).

#1559 contains the fix for this one, plus similar tests

omar711 (Contributor, Author) commented Oct 16, 2023

One more random failure (occurred locally with ./internal_testsuites/scripts/test.sh; re-running fixed it):

--- FAIL: TestIdentifiers (1.33s)
    enclave_and_service_identifiers_test.go:56:
        	Error Trace:	/Users/omar/kurtosis/kurtosis/internal_testsuites/golang/testsuite/enclave_and_service_identifiers_test/enclave_and_service_identifiers_test.go:56
        	Error:      	Received unexpected error:
        	            	An error occurred adding the datastore service
        	            	 --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:220 (AddDatastoreService) ---
        	            	Caused by: An error has occurred when running Starlark to add service
        	            	 --- at /Users/omar/kurtosis/kurtosis/internal_testsuites/golang/test_helpers/test_helpers.go:162 (AddService) ---
        	            	Caused by: Error requesting Starlark Script run
        	            	 --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:119 (EnclaveContext.RunStarlarkScriptBlocking) ---
        	            	Caused by: Unexpected error happened executing Kurtosis script.
        	            	 --- at /Users/omar/kurtosis/kurtosis/api/golang/core/lib/enclaves/enclave_context.go:104 (EnclaveContext.RunStarlarkScript) ---
        	            	Caused by: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: http2: frame too large"
        	Test:       	TestIdentifiers
        	Messages:   	An error occurred adding the datastore service to the enclave
FAIL
FAIL	github.com/kurtosis-tech/kurtosis-cli/golang_internal_testsuite/testsuite/enclave_and_service_identifiers_test	1.491s

omar711 self-assigned this on Oct 16, 2023
omar711 (Contributor, Author) commented Oct 16, 2023

In relation to the panic error (https://app.circleci.com/pipelines/github/kurtosis-tech/kurtosis/7874/workflows/63979657-73a6-4919-b27b-b6fc9d0814b2/jobs/107163), I've just noticed the gateway log shows:

INFO[2023-10-13T15:43:31Z] Forwarding requests to local port number '39373' to remote port with id 'grpc' 
INFO[2023-10-13T15:43:31Z] Closing connection to pod: https://0.0.0.0:40395/api/v1/namespaces/kt-idle-enclave-3eed8f5bc86342ee9def0471466a9c3a/pods/kurtosis-api/portforward 
WARN[2023-10-13T15:43:31Z] Expected to run api container gateway until stopped, but the server exited prematurely with a non-nil error: 'Expected to run API container gateway server until stopped, but the server exited with a non-nil error
 --- at /root/project/cli/cli/kurtosis_gateway/run/api_container_gateway/run_until_stopped.go:48 (RunApiContainerGatewayUntilStopped) ---
Caused by: An error occurred creating the listener on tcp/39373
 --- at /go/pkg/mod/github.com/kurtosis-tech/minimal-grpc-server/golang@v0.0.0-20230710164206-90b674acb269/server/server.go:111 (MinimalGRPCServer.RunUntilStopped) ---
Caused by: listen tcp :39373: bind: address already in use' 
E1013 15:43:31.228729    7415 portforward.go:419] error closing listener: close tcp4 127.0.0.1:39373: use of closed network connection
ERRO[2023-10-13T15:43:36Z] Expected Gateway to be reachable, instead an error was returned:
Expected to be to call `GetServices` and wait for server to be ready, instead a non-nil error was returned
 --- at /root/project/cli/cli/kurtosis_gateway/server/engine_gateway/engine_gateway_service_server.go:323 (waitForGatewayReady) ---
Caused by: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39373: connect: connection refused" 
INFO[2023-10-13T15:43:36Z] Forwarding requests to local port number '35083' to remote port with id 'grpc' 

which happens exactly when the test itself fails:

time="2023-10-13T15:43:36Z" level=info msg="Uploading and executing package 'github.com/kurtosis-tech/sample-startosis-load/replace-subpackage'"
--- FAIL: TestStartosisReplaceTestSuite (86.00s)
    --- FAIL: TestStartosisReplaceTestSuite/TestStartosisReplaceWithModuleInDirectory (17.40s)
        suite.go:87: test panicked: runtime error: invalid memory address or nil pointer dereference

So I'm pretty certain the true cause is something funky in the test infra / timing / stuff closing or timing out / etc.
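
For what it's worth, the usual way to sidestep that "bind: address already in use" race in Go is to bind port 0 and let the kernel hand back a currently free port, then use whatever it returned. A minimal, self-contained sketch of that pattern (this is not the gateway's actual code, just an illustration of the class of fix):

package main

import (
	"fmt"
	"net"
)

// listenOnFreePort asks the kernel for any free TCP port on loopback and
// returns the listener plus the port it actually got.
func listenOnFreePort() (net.Listener, int, error) {
	// Port 0 means "pick any currently free port".
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, fmt.Errorf("failed to grab a free port: %w", err)
	}
	port := listener.Addr().(*net.TCPAddr).Port
	return listener, port, nil
}

func main() {
	listener, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer listener.Close()
	// A server (e.g. a gRPC gateway) would now serve on this listener and
	// advertise `port` to whoever needs to dial it.
	fmt.Printf("listening on 127.0.0.1:%d\n", port)
}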

github-merge-queue bot pushed a commit that referenced this issue Oct 16, 2023
## Description:
When an unexpected error happens in
`enclave_context.RunStarlarkPackageBlocking`, it will return `(nil,
error)`. Some tests don't account for this, so they don't detect or
report said error, then panic when trying to reference the `nil` result.

This adds `require.NoError(t, err)` to all the tests I could find that
hit this codepath.

## Is this change user facing?
NO

## References (if applicable):
#1538

---------

Co-authored-by: Omar <omar@omar>
laurentluce (Contributor) commented:

@omar711 Are you still investigating/working on that?

omar711 (Contributor, Author) commented Oct 18, 2023

Hi @laurentluce, not actively. I dig in when I see something I can debug, then leave it be; e.g. #1559 fixed a bug that might hide underlying failures. I don't think I've gotten near the root of it, though.
