Perform remote service filtering before selecting nodes when scaling in cluster #539

Merged · 13 commits · Nov 24, 2021

Conversation

@lgfa29 (Contributor) commented Nov 13, 2021

This PR makes a few changes to the way nodes are selected for a cluster scale-in action.

Match instances in target before selecting which nodes to remove

When scaling in a cluster, we must find a set of clients that meet two criteria:

  • They match the filtering criteria specified in the policy (node_class and/or datacenter).
  • They belong to the remote service being targeted.

Selecting which nodes to remove can only be done reliably once both filters have been applied, since clients in different remote services can match the same filtering criteria (for example, clients with the same node_class but in different AWS ASGs).

Currently, the list of nodes that match the filtering criteria is reduced prematurely to the number of nodes needed to reach the next count, which increases the odds of picking nodes that don't belong to the remote service. These nodes cause the scaling action to fail since they don't exist in the remote service target.

This PR refactors RunPreScaleInTasks to apply both filters before reducing the pool of selected nodes by the desired amount.

This refactoring is done in a new function (RunPreScaleInTasksWithRemoteCheck) to avoid breaking external plugins and to keep our SDK backwards compatible. In a future release, a breaking change should rename these functions.
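
As a rough illustration of the new ordering, here is a simplified sketch (not the actual SDK code; the node type, field names, and function name below are stand-ins for the Nomad node stubs and helpers the real implementation uses):

package scaleutils

// node is a simplified stand-in for the Nomad node stubs used by the SDK.
type node struct {
	ID        string
	NodeClass string
	RemoteID  string // provider ID, e.g. an ASG instance ID or a MIG instance URL
}

// identifyScaleInNodes sketches the refactored flow: the policy filter and the
// remote service filter are both applied to the full node list, and only then
// is the pool reduced to the requested number of nodes.
func identifyScaleInNodes(nodes []node, nodeClass string, remoteIDs map[string]bool, num int) []node {
	candidates := make([]node, 0, len(nodes))
	for _, n := range nodes {
		// 1. Policy filtering criteria (node_class and/or datacenter).
		if nodeClass != "" && n.NodeClass != nodeClass {
			continue
		}
		// 2. The node must exist in the remote service target (ASG, MIG, ...).
		if !remoteIDs[n.RemoteID] {
			continue
		}
		candidates = append(candidates, n)
	}
	// 3. Only now is the pool reduced to the desired amount; the real code
	// applies the configured node selector strategy (e.g. least_busy) here.
	if len(candidates) > num {
		candidates = candidates[:num]
	}
	return candidates
}

Previously, the equivalent of step 3 ran before step 2, so the truncated pool could still contain nodes outside the remote target.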

Don't skip ineligible nodes

Originally, the Autoscaler would fail a scaling action unless the cluster nodes were stable: no node could be draining or marked as ineligible. This was done to prevent multiple scaling actions over the same set of clients from interfering with each other, and to keep the Autoscaler from overriding manual actions taken by operators.

In practice, this turned out to be a very restrictive requirement, especially when the Autoscaler crashed or failed during a scaling action, leaving behind ineligible nodes and preventing any further scaling actions.

As an automated system that is expected to fully manage the lifecycle of clients, the Autoscaler should do whatever is necessary to meet policy requirements. If the Autoscaler's actions interfere with manual operator interventions, the affected policies must be changed or temporarily disabled.

This PR changes the node filtering logic so ineligible nodes are no longer skipped. An ineligible node may not receive more workloads, but it's still present and active in the cluster, so if a policy evaluation requires nodes to be removed, ineligible nodes should be considered as well.

Check for node readiness instead of scheduling eligibility

The original flow only checked for scheduling eligibility, which can still be true when a node is down. As a result, the Autoscaler could pick nodes that had already been removed, and thus no longer existed in the remote service, but had not yet been pruned and were still registered in Nomad. This would cause scaling actions to fail repeatedly and indefinitely until those nodes were removed.

This PR changes the logic to check for readiness instead.
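
Combined with the previous change, the per-node check roughly becomes the following (a sketch, not the exact SDK code; the "ready" status string matches the values visible in the log output below, while the real implementation may use the Nomad API constants):

package scaleutils

import "github.com/hashicorp/nomad/api"

// nodeIsCandidate reports whether a node should stay in the scale-in pool.
// Readiness is checked instead of scheduling eligibility: a down node can
// still report as eligible, while an ineligible node can still be a valid
// candidate for removal.
func nodeIsCandidate(n *api.NodeListStub) bool {
	// Skip nodes that are not ready (e.g. "down" or "initializing"); they
	// likely no longer exist in the remote target and would make the
	// scaling action fail.
	if n.Status != "ready" {
		return false
	}
	// Scheduling eligibility is intentionally no longer checked: ineligible
	// nodes are still present and active in the cluster, so they are kept.
	return true
}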

Log scale in filtering process

Several steps are required to choose which nodes, out of all those registered in Nomad, should be selected for termination. It's hard to follow which nodes are being considered, which ones have been dropped, and why.

This PR adds more log lines at the DEBUG level to provide more visibility during this process. Sample output:

2021-11-16T18:56:04.822-0500 [DEBUG] policy_eval.worker.check_handler: calculating new count: check=mem_allocated_percentage id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster source=prometheus strategy=target-value target=gce-mig count=2
2021-11-16T18:56:04.822-0500 [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=mem_allocated_percentage current_count=2 new_count=1 metric_value=28.225806451612904 metric_time="2021-11-16 18:56:04 -0500 EST" factor=0.40322580645161293 direction=down
2021-11-16T18:56:04.823-0500 [TRACE] policy_eval.worker: check cpu_allocated_percentage selected: id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster target=gce-mig direction=down count=1
2021-11-16T18:56:04.823-0500 [INFO]  policy_eval.worker: scaling target: id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster target=gce-mig from=2 to=1 reason="scaling down because factor is 0.194879" meta=map[nomad_policy_id:b783aa68-ac68-6aea-a0a0-f18f33b9cbcc]
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: found healthy instance: action=scale_in instance_group=hashistack-nomad-client instance_id=3573983942526237328 instance=https://www.googleapis.com/compute/v1/projects/hashistack-integral-grouse/zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: found healthy instance: action=scale_in instance_group=hashistack-nomad-client instance_id=2973413554515448488 instance=https://www.googleapis.com/compute/v1/projects/hashistack-integral-grouse/zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: performing node pool filtering: node_class=hashistack
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff datacenter=dc1 node_class=hashistack status=ready eligibility=eligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=32acccdc-1762-02ac-5786-8d11d668075a datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=9c25dc22-795f-f848-22e5-3602f1529182 datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=aa24d314-c1a1-c525-60e9-e05c8cce8880 datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 datacenter=dc1 node_class=hashistack status=ready eligibility=eligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: node passed filter criteria: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: node passed filter criteria: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16
2021-11-16T18:56:05.428-0500 [DEBUG] internal_plugin.gce-mig: identified remote provider ID for node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: identified remote provider ID for node: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 remote_id=zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: node is part of the policy target: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: node is part of the policy target: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 remote_id=zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: performing node selection: selector_strategy=least_busy
2021-11-16T18:56:05.601-0500 [DEBUG] internal_plugin.gce-mig: node selected for removal: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.602-0500 [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff deadline=5m0s
2021-11-16T18:56:05.736-0500 [INFO]  internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="Drain complete for node b11b215a-64f4-e9f4-4b07-0a5a0f7533ff"
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="Alloc "1276d3f6-cd42-a908-dc6e-5723a4481162" status running -> complete"
2021-11-16T18:56:05.886-0500 [INFO]  internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="All allocations on node "b11b215a-64f4-e9f4-4b07-0a5a0f7533ff" have stopped"
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: node drain complete: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: pre scale-in tasks now complete
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: deleting GCE MIG instances: action=scale_in instance_group=hashistack-nomad-client instances=["{b11b215a-64f4-e9f4-4b07-0a5a0f7533ff zones/us-central1-a/instances/hashistack-nomad-client-bwvg}"]
2021-11-16T18:56:06.463-0500 [INFO]  internal_plugin.gce-mig: successfully deleted GCE MIG instances: action=scale_in instance_group=hashistack-nomad-client

Maybe they should be at the TRACE level since some clusters may have thousands of nodes? I'm not sure, but they are useful for debugging in general.

Closes #477 #511

@lgfa29 marked this pull request as ready for review November 17, 2021 19:00
@jrasell (Member) left a comment

This looks good to me and a nice improvement. The logging is a great addition as well.

@@ -114,6 +204,12 @@ func (c *ClusterScaleUtils) IdentifyScaleInNodes(cfg map[string]string, num int)
// Filter out the Nomad node ID where this autoscaler instance is running.
filteredNodes = filterOutNodeID(filteredNodes, c.curNodeID)

if c.log.IsDebug() {
Nice; this is pretty much my favourite feature of the logging lib.
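
For context, hclog exposes level predicates such as IsDebug(), so the per-node output can be built only when it will actually be emitted. A rough sketch of the pattern (not the exact PR code):

if c.log.IsDebug() {
	// Only iterate and format the node details when debug logging is on.
	for _, n := range filteredNodes {
		c.log.Debug("found node",
			"node_id", n.ID, "datacenter", n.Datacenter,
			"node_class", n.NodeClass, "status", n.Status,
			"eligibility", n.SchedulingEligibility, "draining", n.Drain)
	}
}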


Successfully merging this pull request may close these issues.

bug: scaleutils can select nodes that are down / not part of target AWS ASG