
TestReconcileMachinePhases is flaky #8477

Closed
killianmuldoon opened this issue Apr 5, 2023 · 16 comments · Fixed by #8484 or #8576
Assignees
Labels
area/testing Issues or PRs related to testing kind/flake Categorizes issue or PR as related to a flaky test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon
Contributor

Which jobs are flaking?

TestReconcileMachinePhases/Should_set_Provisioned_when_there_is_a_ProviderID_and_there_is_no_Node

Which tests are flaking?

Part of the unit test run. Flaking on main and 1.4.

Since when has it been flaking?

#8044 moved these tests to be based on envtest instead of the fake client. Since then there have been a number of flakes.

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-release-1-4/1643034564504850432

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area testing

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. area/testing Issues or PRs related to testing needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 5, 2023
@killianmuldoon killianmuldoon changed the title TestReconcileMachinePhases/Should_set_Provisioned_when_there_is_a_ProviderID_and_there_is_no_Node TestReconcileMachinePhases is flaky Apr 5, 2023
@killianmuldoon
Contributor Author

/triage accepted

@chrischdi - did you want to take this one on?

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 5, 2023
@chrischdi
Member

chrischdi commented Apr 5, 2023

/assign

I'm digging into it.

I think the flake can happen when:

  • not everything has been created by the test yet
  • and the controller already reconciles before that
  • so it never reaches the target tested state

I can reproduce it locally with the following patch:

diff --git a/internal/controllers/machine/machine_controller_phases_test.go b/internal/controllers/machine/machine_controller_phases_test.go
index fbfb18bb9..bbe2c33e4 100644
--- a/internal/controllers/machine/machine_controller_phases_test.go
+++ b/internal/controllers/machine/machine_controller_phases_test.go
@@ -507,7 +507,7 @@ func TestReconcileMachinePhases(t *testing.T) {
                g.Expect(env.Create(ctx, bootstrapConfig)).To(Succeed())
                g.Expect(env.Create(ctx, infraMachine)).To(Succeed())
                g.Expect(env.Create(ctx, machine)).To(Succeed())
-
+               time.Sleep(time.Second)
                modifiedMachine := machine.DeepCopy()
                // Set NodeRef to nil.
                machine.Status.NodeRef = nil

This results in reconciliation happening before the statuses are set.
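A toy model of that race in plain Go (all names here are illustrative, not the real controller code): a "controller" goroutine reconciles the object as soon as it exists, which can be before the test's status patch lands.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// raceOutcome models the window between env.Create and the test's status
// patch. It returns what the "controller" observed in the stand-in
// status field ("" = not patched yet).
func raceOutcome() string {
	var mu sync.Mutex
	nodeRef := "" // stands in for machine.Status.NodeRef

	seen := make(chan string)
	go func() {
		// The controller reconciles immediately on create.
		mu.Lock()
		v := nodeRef
		mu.Unlock()
		seen <- v
	}()

	// The test only patches the status after a delay; the time.Sleep in
	// the reproducer patch above widens this window deterministically.
	time.Sleep(10 * time.Millisecond)
	mu.Lock()
	nodeRef = "machine-test-node"
	mu.Unlock()

	return <-seen
}

func main() {
	if raceOutcome() == "" {
		fmt.Println("controller reconciled before the status patch landed")
	}
}
```

The real test then waits for the controller to reach a state it can never reach from what it observed, so the test times out.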

Different ideas to remediate / fix:

  • set the paused annotation on the Machine and remove it at the very end, after everything is set up.
  • debug into the controller to see why, from the "previously reconciled" state, it does not reach the expected result
  • change the creation / status patch order so the Machine object is created + patched last (after the dependent objects are created + status patched)

@chrischdi
Member

chrischdi commented Apr 5, 2023

To add:

With this patch, the test does not even finish after 30s (the current timeout is 10s), so it may also point to a bigger bug.

@killianmuldoon
Contributor Author

I don't like the idea of adding pause - it seems to be papering over the cracks here 😄. I tried:

change the creation / status patch order so the Machine object is created + patched last (after the dependent objects are created + status patched)

And I'm still getting a failure - does it work for you?

@chrischdi
Member

I think so, yes, but I'm not 100% sure yet.

@chrischdi
Member

(but only with "set the paused annotation on the Machine and remove it at the very end, after everything is set up" as the change, AFAIK)

@killianmuldoon
Contributor Author

It looks like the logic for setting LastUpdated is flaky in the controller.

@killianmuldoon
Contributor Author

@chrischdi Just took another look - I think the issue here is in the test: there shouldn't be a test-side patch of the lastUpdated field on the Machine at all, i.e.

		modifiedMachine.Status.LastUpdated = &lastUpdated
		g.Expect(env.Status().Patch(ctx, modifiedMachine, client.MergeFrom(machine))).To(Succeed())

should be removed IMO.

@chrischdi
Member

chrischdi commented Apr 5, 2023

Yes, maybe. I think there are two cases for this field:

  • nil -> setting it
  • previous value -> newer value

I don't know which of the two (or both) is possible in the runtime scenario 🤔

Edit:

  • the first one is maybe a case for clusterctl move
  • the second one may be the case for provisioning to provisioned or similar.

@killianmuldoon
Contributor Author

Does clusterctl move set this field? I can't see where that happens

@chrischdi
Member

But to recap, the issue in the test could be the following scenario:

  1. The Machine object gets created:

g.Expect(env.Create(ctx, machine)).To(Succeed())

  2. Meanwhile the object gets reconciled.

  3. We patch the .status.lastUpdated field after it was reconciled:

lastUpdated := metav1.NewTime(time.Now().Add(-10 * time.Second))
machine.Status.LastUpdated = &lastUpdated
g.Expect(env.Status().Patch(ctx, modifiedMachine, client.MergeFrom(machine))).To(Succeed())

@killianmuldoon
Contributor Author

We patch the .status.lastUpdated field after it was reconciled

And I don't think this patch is necessary or really useful - we just need to check that the field is set and that lastUpdated on the Machine is later than a timestamp taken before the reconcile; there's no need for the test to update it on the Machine IMO.

@chrischdi
Member

I agree for this test case.

But in general for these envtest tests, if a test relies on the status already being in some state:

  • we could set the paused annotation on the object, remove it once the test is set up, and then wait for reconciliation
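Sketched with a plain map standing in for metadata.annotations (the key matches Cluster API's cluster.x-k8s.io/paused; the helper names are made up for illustration):

```go
package main

import "fmt"

// pausedAnnotation mirrors Cluster API's paused-annotation key.
const pausedAnnotation = "cluster.x-k8s.io/paused"

// isPaused is the check a pause-aware controller performs before
// reconciling; annotations stands in for metadata.annotations.
func isPaused(annotations map[string]string) bool {
	_, ok := annotations[pausedAnnotation]
	return ok
}

func main() {
	machine := map[string]string{}

	// 1. Create the object already paused so the controller skips it.
	machine[pausedAnnotation] = "true"
	fmt.Println("skipped during setup:", isPaused(machine))

	// 2. ...create dependent objects and patch all statuses here...

	// 3. Unpause once setup is complete; only now does the controller
	//    reconcile the fully prepared state.
	delete(machine, pausedAnnotation)
	fmt.Println("reconciled after unpause:", !isPaused(machine))
}
```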

@chrischdi
Member

chrischdi commented Apr 11, 2023

/reopen

I checked the dashboard again:

There's one more occurrence at the test (since we merged the fix):

Should_set_`Running`_when_bootstrap_and_infra_is_ready

(https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_phases_test.go#L243), which we should fix too.

@k8s-ci-robot k8s-ci-robot reopened this Apr 11, 2023
@k8s-ci-robot
Contributor

@chrischdi: Reopened this issue.

In response to this:

/reopen

I checked the dashboard again:

There's one more occurrence happening at the test Should_set_Running_when_bootstrap_and_infra_is_ready / https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_phases_test.go#L243 , which we should fix too.


@chrischdi
Member

Triage link: 655deea70b0ff4b80d8b
