
e2e: WaitForJobStopped correction #19749

Merged: 2 commits into main from b-e2e-wait-for-jobs-sleep on Jan 22, 2024

Conversation

pkazmierczak (Contributor)

We're getting errors from some tests that assert service de-registration from Consul after calling WaitForJobStopped. Before #19744, WaitForJobStopped also waited for allocs to be stopped, but since we changed the behavior of -purge there is no longer any wait, so the Consul-related assertions happen immediately.

Example of a failing test:

=== RUN   TestE2E/ConsulNamespaces/*consul.ConsulNamespacesE2ETest/TestConsulConnectSidecarsMinimalToken
    consul.go:52: 
        	Error Trace:	/home/runner/actions-runner/_work/nomad-e2e/nomad-e2e/nomad/e2e/e2eutil/consul.go:52
        	            				/home/runner/actions-runner/_work/nomad-e2e/nomad-e2e/nomad/testutil/wait.go:68
        	            				/home/runner/actions-runner/_work/nomad-e2e/nomad-e2e/nomad/e2e/e2eutil/consul.go:42
        	            				/home/runner/actions-runner/_work/nomad-e2e/nomad-e2e/nomad/e2e/consul/namespaces.go:246
        	            				/home/runner/actions-runner/_work/nomad-e2e/nomad-e2e/nomad/e2e/consul/namespaces_ent.go:297
        	Error:      	Received unexpected error:
        	            	service count-api: expected empty services but found 1 []*api.ServiceEntry{
        	            	    &api.ServiceEntry{
        	            	        Node: &api.Node{
        	            	            ID:              "d21ce1a2-7718-bc60-9153-4ecf617c338d",
        	            	            Node:            "ip-172-31-94-167",
        	            	            Address:         "172.31.94.167",
        	            	            Datacenter:      "nomad-e2e-shared-hcp-consul",
        	            	            TaggedAddresses: {"lan":"172.31.94.167", "lan_ipv4":"172.31.94.167", "wan":"172.31.94.167", "wan_ipv4":"172.31.94.167"},
        	            	            Meta:            {"consul-network-segment":"", "consul-version":"1.17.0"},
        	            	            CreateIndex:     0x5ad608,
        	            	            ModifyIndex:     0x5ad609,
        	            	            Partition:       "default",
        	            	            PeerName:        "",
        	            	            Locality:        (*api.Locality)(nil),
        	            	        },
        	            	        Service: &api.AgentService{
        	            	            Kind:              "",
        	            	            ID:                "_nomad-task-d4c013af-bc04-2fdf-a930-122c88736368-group-api-count-api-9001",
        	            	            Service:           "count-api",
        	            	            Tags:              {},
        	            	            Meta:              {"external-source":"nomad"},
        	            	            Port:              9001,
        	            	            Address:           "",
        	            	            SocketPath:        "",
        	            	            TaggedAddresses:   {},
        	            	            Weights:           api.AgentWeights{Passing:1, Warning:1},
        	            	            EnableTagOverride: false,
        	            	            CreateIndex:       0x5ad83b,
        	            	            ModifyIndex:       0x5ad83b,
        	            	            ContentHash:       "",
        	            	            Proxy:             &api.AgentServiceConnectProxyConfig{},
        	            	            Connect:           &api.AgentServiceConnect{},
        	            	            PeerName:          "",
        	            	            Namespace:         "apple",
        	            	            Partition:         "default",
        	            	            Datacenter:        "",
        	            	            Locality:          (*api.Locality)(nil),
        	            	        },
        	            	        Checks: {
        	            	            &api.HealthCheck{
        	            	                Node:        "ip-172-31-94-167",
        	            	                CheckID:     "serfHealth",
        	            	                Name:        "Serf Health Status",
        	            	                Status:      "passing",
        	            	                Notes:       "",
        	            	                Output:      "Agent alive and reachable",
        	            	                ServiceID:   "",
        	            	                ServiceName: "",
        	            	                ServiceTags: {},
        	            	                Type:        "",
        	            	                Namespace:   "default",
        	            	                Partition:   "default",
        	            	                ExposedPort: 0,
        	            	                PeerName:    "",
        	            	                Definition:  api.HealthCheckDefinition{},
        	            	                CreateIndex: 0x5ad608,
        	            	                ModifyIndex: 0x5ad608,
        	            	            },
        	            	            &api.HealthCheck{
        	            	                Node:        "ip-172-31-94-167",
        	            	                CheckID:     "_nomad-check-375f552d70b04bae196f99562ef9aa72264e530b",
        	            	                Name:        "api-health",
        	            	                Status:      "passing",
        	            	                Notes:       "",
        	            	                Output:      "HTTP GET http://172.31.94.167:26051/health: 200 OK Output: Hello, you've hit /health\n",
        	            	                ServiceID:   "_nomad-task-d4c013af-bc04-2fdf-a930-122c88736368-group-api-count-api-9001",
        	            	                ServiceName: "count-api",
        	            	                ServiceTags: {},
        	            	                Type:        "http",
        	            	                Namespace:   "apple",
        	            	                Partition:   "default",
        	            	                ExposedPort: 0,
        	            	                PeerName:    "",
        	            	                Definition:  api.HealthCheckDefinition{},
        	            	                CreateIndex: 0x5ad83b,
        	            	                ModifyIndex: 0x5ad84b,
        	            	            },
        	            	        },
        	            	    },
        	            	}
        	Test:       	TestE2E/ConsulNamespaces/*consul.ConsulNamespacesE2ETest/TestConsulConnectSidecarsMinimalToken

I welcome better suggestions on how to handle this.

// sleep for 3 seconds to make sure any related events (like Consul
// service de-registration) have enough time to happen, because there are
// tests that assert such things after stopping jobs.
time.Sleep(3 * time.Second)
gulducat (Member)

hard-coded sleep does seem pretty gross, and potential for future flakiness...

would running DeregisterOpts() instead with NoShutdownDelay help here at all? also it returns an EvalID -- any value from watching that?

pkazmierczak (Contributor, Author)

> hard-coded sleep does seem pretty gross, and potential for future flakiness...

I concur, although we have plenty of hard-coded sleeps all across our e2eutil package...

> would running DeregisterOpts() instead with NoShutdownDelay help here at all?

I don't think so. My understanding is that NoShutdownDelay will simply result in the Deregister call taking less time, but the call still blocks until the deregister message is applied via Raft.

> also it returns an EvalID -- any value from watching that?

That's an idea, maybe; I'll look into it.
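
A rough sketch of that idea, assuming the Nomad api package's Jobs().DeregisterOpts and Evaluations().Info plus the existing testutil.WaitForResultRetries helper; the function name waitForJobPurged and the exact wiring are illustrative, not the code in this PR:

// Sketch only: purge the job without shutdown delay, then block until the
// returned evaluation reaches a terminal status before callers go on to
// assert on Consul state.
// Assumed imports: fmt, testing, github.com/hashicorp/nomad/api,
// github.com/hashicorp/nomad/testutil, github.com/stretchr/testify/require.
func waitForJobPurged(t *testing.T, client *api.Client, jobID string, retries int64) {
	opts := &api.DeregisterOptions{Purge: true, NoShutdownDelay: true}
	evalID, _, err := client.Jobs().DeregisterOpts(jobID, opts, nil)
	require.NoError(t, err, "error deregistering job %q", jobID)

	testutil.WaitForResultRetries(retries, func() (bool, error) {
		eval, _, err := client.Evaluations().Info(evalID, nil)
		if err != nil {
			return false, err
		}
		if eval.Status != "complete" {
			return false, fmt.Errorf("eval %s not complete yet: %s", evalID, eval.Status)
		}
		return true, nil
	}, func(err error) {
		t.Fatalf("timed out waiting for deregistration eval: %v", err)
	})
}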

gulducat (Member) left a comment

assuming it gets our desired outcome, LGTM!

require.NoError(t, err, "error deregistering job %q", job)

testutil.WaitForResultRetries(retries, func() (bool, error) {
gulducat (Member) commented on Jan 17, 2024

I would say to use must.Wait() for the retry logic, except this whole file seems to use this testutil, so being consistent with the surrounding context seems fine.

Mainly I'm curious whether this actually does what was originally expected: since the job's being purged, does the eval not complete until the allocs are all gone too?
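
For reference, a hedged sketch of the must.Wait() style mentioned above, written as a drop-in for the WaitForResultRetries block in the earlier sketch. It assumes github.com/shoenig/test/must and github.com/shoenig/test/wait; client and evalID come from the surrounding helper, and the timeout and gap values are illustrative.

// Sketch: retry the eval-status check until it succeeds or the
// constraint times out, failing the test otherwise.
must.Wait(t, wait.InitialSuccess(
	wait.ErrorFunc(func() error {
		eval, _, err := client.Evaluations().Info(evalID, nil)
		if err != nil {
			return err
		}
		if eval.Status != "complete" {
			return fmt.Errorf("eval %s not complete yet: %s", evalID, eval.Status)
		}
		return nil
	}),
	wait.Timeout(30*time.Second),
	wait.Gap(500*time.Millisecond),
))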

pkazmierczak marked this pull request as ready for review on January 22, 2024 at 10:38
pkazmierczak merged commit 8a4bd61 into main on Jan 22, 2024
18 checks passed
pkazmierczak deleted the b-e2e-wait-for-jobs-sleep branch on January 22, 2024 at 10:38
nvanthao pushed a commit to nvanthao/nomad that referenced this pull request Mar 1, 2024