
Occasional test failure (timing?) #1153

Closed
cam72cam opened this issue Jan 19, 2024 · 10 comments · Fixed by #1337 · May be fixed by bvand/opentofu#1
Assignees
Labels
accepted This issue has been accepted for implementation. bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@cam72cam
Contributor

Once in a while we see this in test reports. My guess is that there is some timing that could be tweaked or a test that could be restructured.

--- FAIL: TestJSONHook_create (0.01s)
    hook_json_test.go:184: unexpected number of messages. got 6, want 7
    hook_json_test.go:184: unexpected output on line 5:
          map[string]any{
          	"@level": string("info"),
          	"@message": strings.Join({
          		"test_instance.boop: ",
        - 		"Still creating... [20s elapsed",
        + 		"Creation complete after 22s [id=test",
          		"]",
          	}, ""),
          	"@module": string("tofu.ui"),
          	"hook": map[string]any{
          		"action":          string("create"),
        - 		"elapsed_seconds": float64(20),
        + 		"elapsed_seconds": float64(22),
        + 		"id_key":          string("id"),
        + 		"id_value":        string("test"),
          		"resource":        map[string]any{"addr": string("test_instance.boop"), "implied_provider": string("test"), "module": string(""), "resource": string("test_instance.boop"), ...},
          	},
        - 	"type": string("apply_progress"),
        + 	"type": string("apply_complete"),
          }
FAIL
FAIL	github.com/opentofu/opentofu/internal/command/views	3.188s
@cam72cam cam72cam added bug Something isn't working good first issue Good for newcomers accepted This issue has been accepted for implementation. help wanted Extra attention is needed labels Jan 19, 2024
@bvand
Contributor

bvand commented Jan 23, 2024

I'd like to take this one if it's available. Can you link to 1 or 2 related Action failures so I can take a deeper look?

Any suspicion that this only comes into play when running all of the tests? I haven't been able to reproduce locally. I've run all of the tests with go test ./... and the specific subset with go test ./internal/command/views -count=1 and go test -race ./internal/command/views -count=1...

@cam72cam
Contributor Author

Thanks @bvand

https://github.com/opentofu/opentofu/actions/runs/7586115064/ is the run that had issues, but I re-ran failed tests and it appears to have overwritten the previous logs?

It seems to be intermittent: from @cube2222 "Oh yeah, this test case is known and liked 😄"

I wonder if it has to do with how bogged down the host of the github runner is, as it appears to be timing related?

@janosdebugs
Contributor

@cam72cam I believe the details of test runs are not available to non-maintainers? @bvand please let us know if you can see that link.

@bvand
Contributor

bvand commented Jan 24, 2024

@janosdebugs I can access that link but don't see any job failures there (I guess because of the rerun). I do see 40 errors under annotations, but those seem to be unrelated?

@bvand
Contributor

bvand commented Jan 31, 2024

This seems strange to me, but I may just be missing some implementation detail:

  • In TestJSONHook_create, the 2s travel into the future doesn't include progress goroutine notification or sleep
  • Adding those in (PR) causes the unit tests to fail but not the race tests.

Any idea if this is a red herring or an actual issue?

@bvand
Contributor

bvand commented Feb 6, 2024

I reproduced the failure by running the Unit Tests in matrix - 1/10 of them failed: https://github.com/bvand/opentofu/actions/runs/7802165203/job/21278956021.

Comparing the test ordering and timing between the failed job and a passed job didn't yield anything interesting as far as I can tell (OpenTofu Timing Bug 1153 - Pass_Fail Job Comparison.xlsx). The largest time difference is that the github.com/opentofu/opentofu/internal/tofu test took 0.121s longer in the failed job than in the passed job. There was a small difference in the ordering of these three tests between the failed job and the passed job, but I don't think that matters, since 2/3 of them don't have test files...

  • github.com/opentofu/opentofu/internal/plugin6/convert
  • github.com/opentofu/opentofu/internal/provider-simple/main
  • github.com/opentofu/opentofu/internal/provider-simple-v6

Given that these jobs run on separate runners, I wouldn't expect resource constraints to be an issue, but that could be tested by running jobs in a matrix with larger runners a few times and seeing if the test failure stops. That leaves some sort of testing race condition or similar, but I'm not 100% sure how to proceed on debugging further - open to ideas.

@bvand
Contributor

bvand commented Feb 7, 2024

Ok, I think I've discovered the cause: I reduced the time.Sleep to 1 nanosecond, and this reproduced the failure in all 10 unit test runs (workflow). So the root cause is likely that sometimes, whether due to resource contention on public runners or otherwise, 1 millisecond isn't enough time to wait for the goroutine to execute.

Some options for fixing:

  • Increase sleep time. This is still nondeterministic if there's enough contention, but it's possible that 50ms or even 10ms would essentially eliminate these failures. Not preferred since it's also not scalable if used across tests, but could be ok as a quick bandaid. I'll test this with more unit test runs to see if it does eliminate the issue.
  • Some form of test isolation for tests like this to reduce the probability of resource contention (again not deterministic)
  • Use a deterministic method to wait for goroutine execution and logging. I'll also look into using WaitGroup or similar for this purpose.

@bvand
Contributor

bvand commented Feb 9, 2024

I have a proof-of-concept change in the merge request that deterministically waits for each applyingHeartbeat to complete. I went this route because heartbeatDone is only closed when the for-loop exits and the function returns in the progress.done case, so it's actually closed during postApply and is not useful for checking each individual heartbeat. In isolation I would also consider removing heartbeatDone since it seems to be otherwise unused, but I'm new to OpenTofu, so I think keeping this change slim makes sense.

I'm happy to either clean up the PR and move forward, go with increased sleep time (I still saw a failure with 10ms so would look to do 50ms), or implement a suggested alternate approach if there are any @cam72cam @janosdebugs

@cam72cam
Contributor Author

@bvand apologies for the late reply, we have been a bit swamped with the magnitude of community contributions as of late. Your logic seems sound and I think after a quick tidy, that PR should be good to merge. Thanks again for taking this on!

@bvand
Contributor

bvand commented Mar 1, 2024

@cam72cam no problem, will do!
