
Occasional test failure (timing?) #1153

Closed
cam72cam opened this issue Jan 19, 2024 · 10 comments · Fixed by #1337 · May be fixed by bvand/opentofu#1
Assignees
Labels
accepted This issue has been accepted for implementation. bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@cam72cam
Contributor

Once in a while we see this in test reports. My guess is that there is some timing that could be tweaked or a test that could be restructured.

--- FAIL: TestJSONHook_create (0.01s)
    hook_json_test.go:184: unexpected number of messages. got 6, want 7
    hook_json_test.go:184: unexpected output on line 5:
          map[string]any{
          	"@level": string("info"),
          	"@message": strings.Join({
          		"test_instance.boop: ",
        - 		"Still creating... [20s elapsed",
        + 		"Creation complete after 22s [id=test",
          		"]",
          	}, ""),
          	"@module": string("tofu.ui"),
          	"hook": map[string]any{
          		"action":          string("create"),
        - 		"elapsed_seconds": float64(20),
        + 		"elapsed_seconds": float64(22),
        + 		"id_key":          string("id"),
        + 		"id_value":        string("test"),
          		"resource":        map[string]any{"addr": string("test_instance.boop"), "implied_provider": string("test"), "module": string(""), "resource": string("test_instance.boop"), ...},
          	},
        - 	"type": string("apply_progress"),
        + 	"type": string("apply_complete"),
          }
FAIL
FAIL	github.com/opentofu/opentofu/internal/command/views	3.188s
@cam72cam cam72cam added bug Something isn't working good first issue Good for newcomers accepted This issue has been accepted for implementation. help wanted Extra attention is needed labels Jan 19, 2024
@bvand
Contributor

bvand commented Jan 23, 2024

I'd like to take this one if it's available. Can you link to 1 or 2 related Action failures so I can take a deeper look?

Any suspicion that this only comes into play when running all of the tests? I haven't been able to reproduce locally. I've run all of the tests with go test ./... and the specific subset with go test ./internal/command/views -count=1 and go test -race ./internal/command/views -count=1...

@cam72cam
Contributor Author

Thanks @bvand

https://github.com/opentofu/opentofu/actions/runs/7586115064/ is the run that had issues, but I re-ran failed tests and it appears to have overwritten the previous logs?

It seems to be intermittent: from @cube2222 "Oh yeah, this test case is known and liked 😄"

I wonder if it has to do with how bogged down the host of the github runner is, as it appears to be timing related?

@janosdebugs
Contributor

@cam72cam I believe the details of test runs are not available to non-maintainers? @bvand please let us know if you can see that link.

@bvand
Contributor

bvand commented Jan 24, 2024

@janosdebugs I can access that link but don't see any job failures there (I guess because of the rerun). I do see 40 errors under annotations, but those seem to be unrelated?

@bvand
Contributor

bvand commented Jan 31, 2024

This seems strange to me, but I may just be missing some implementation detail:

  • In TestJSONHook_create, the 2s travel into the future doesn't include progress goroutine notification or sleep
  • Adding those in (PR) causes the unit tests to fail but not the race tests.

Any idea if this is a red herring or an actual issue?

@bvand
Contributor

bvand commented Feb 6, 2024

I reproduced the failure by running the Unit Tests in matrix - 1/10 of them failed: https://github.com/bvand/opentofu/actions/runs/7802165203/job/21278956021.

Comparing the test ordering and timing between the failed job and a passed job didn't yield anything interesting as far as I can tell (OpenTofu Timing Bug 1153 - Pass_Fail Job Comparison.xlsx). The largest time difference is that the github.com/opentofu/opentofu/internal/tofu test took 0.121s longer in the failed job than in the passed job. There was a small difference in the ordering of these three tests between the failed job and the passed job, but I don't think that matters, since 2/3 of them don't have test files...

  • github.com/opentofu/opentofu/internal/plugin6/convert
  • github.com/opentofu/opentofu/internal/provider-simple/main
  • github.com/opentofu/opentofu/internal/provider-simple-v6

Given that these jobs run on separate runners, I wouldn't expect resource constraints to be an issue, but that could be tested by running jobs in a matrix with larger runners a few times and seeing if the test failure stops. That leaves some sort of testing race condition or similar, but I'm not 100% sure how to proceed on debugging further - open to ideas.

@bvand
Contributor

bvand commented Feb 7, 2024

Ok, I think I've discovered the cause: I reduced the time.Sleep to 1 nanosecond, and this reproduced the failure in all 10 unit test runs (workflow). So the root cause is likely that sometimes, whether due to resource contention on public runners or otherwise, 1 millisecond isn't enough time to wait for the goroutine to execute.

Some options for fixing:

  • Increase sleep time. This is still nondeterministic if there's enough contention, but it's possible that 50ms or even 10ms would essentially eliminate these failures. Not preferred since it's also not scalable if used across tests, but could be ok as a quick bandaid. I'll test this with more unit test runs to see if it does eliminate the issue.
  • Some form of test isolation for tests like this to reduce the probability of resource contention (again not deterministic)
  • Use a deterministic method to wait for goroutine execution and logging. I'll also look into using WaitGroup or similar for this purpose.

@bvand
Contributor

bvand commented Feb 9, 2024

I have a proof-of-concept change in the merge request that deterministically waits for each applyingHeartbeat to complete. I went this route because heartbeatDone is only closed when the for-loop exits and the function returns in the progress.done case, so it's actually closed during postApply and is not useful for checking each individual heartbeat. In isolation I would also consider removing heartbeatDone since it seems to be otherwise unused, but I'm new to OpenTofu, so I think keeping this change slim makes sense.

I'm happy to either clean up the PR and move forward, go with increased sleep time (I still saw a failure with 10ms so would look to do 50ms), or implement a suggested alternate approach if there are any @cam72cam @janosdebugs

@cam72cam
Contributor Author

@bvand apologies for the late reply, we have been a bit swamped with the magnitude of community contributions as of late. Your logic seems sound and I think after a quick tidy, that PR should be good to merge. Thanks again for taking this on!

@bvand
Contributor

bvand commented Mar 1, 2024

@cam72cam no problem, will do!
