Perform remote service filtering before selecting nodes when scaling in cluster #539

Merged · 13 commits · Nov 24, 2021

Conversation

@lgfa29 (Contributor) commented Nov 13, 2021

This PR makes a few changes to the way nodes are selected for a cluster scale-in action.

Match instances in target before selecting which nodes to remove

When scaling in a cluster, we must find a set of clients that meet two criteria:

  • They match the filtering criteria specified in the policy (node_class and/or datacenter).
  • They belong to the remote service being targeted.

Selecting which nodes to remove can only be done reliably once both filters have been applied, since clients in different remote services can match the same filtering criteria (for example, clients with the same node_class but in different AWS ASGs).

Currently, the list of nodes that match the filtering criteria is reduced prematurely to the number of nodes needed to reach the next count, which increases the odds of picking nodes that don't belong to the remote service. These nodes cause the scaling action to fail since they don't exist in the remote service target.

This PR refactors RunPreScaleInTasks to apply both filters before reducing the pool of selected nodes by the desired amount.

This refactoring is done in a new function (RunPreScaleInTasksWithRemoteCheck) to avoid breaking external plugins and to keep our SDK backwards compatible. In a future release, a breaking change should rename these functions.
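
As a rough illustration of the new ordering, here is a simplified sketch (not the actual SDK code; the node type, field names, and function name below are stand-ins for the Nomad node stubs and helpers the real implementation uses):

package scaleutils

// node is a simplified stand-in for the Nomad node stubs used by the SDK.
type node struct {
	ID        string
	NodeClass string
	RemoteID  string // provider ID, e.g. an ASG instance ID or a MIG instance URL
}

// identifyScaleInNodes sketches the refactored flow: the policy filter and the
// remote service filter are both applied to the full node list, and only then
// is the pool reduced to the requested number of nodes.
func identifyScaleInNodes(nodes []node, nodeClass string, remoteIDs map[string]bool, num int) []node {
	candidates := make([]node, 0, len(nodes))
	for _, n := range nodes {
		// 1. Policy filtering criteria (node_class and/or datacenter).
		if nodeClass != "" && n.NodeClass != nodeClass {
			continue
		}
		// 2. The node must exist in the remote service target (ASG, MIG, ...).
		if !remoteIDs[n.RemoteID] {
			continue
		}
		candidates = append(candidates, n)
	}
	// 3. Only now is the pool reduced to the desired amount; the real code
	// applies the configured node selector strategy (e.g. least_busy) here.
	if len(candidates) > num {
		candidates = candidates[:num]
	}
	return candidates
}

Previously, the equivalent of step 3 ran before step 2, so the truncated pool could still contain nodes outside the remote target.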

Don't skip ineligible nodes

Originally, the Autoscaler would fail a scaling action unless the cluster nodes were stable: no node could be draining or marked as ineligible. This was done to prevent multiple scaling actions over the same set of clients from interfering with each other, and to keep the Autoscaler from overriding manual actions taken by operators.

In practice, this turned out to be a very restrictive requirement, especially when the Autoscaler crashed or failed during a scaling action, leaving behind ineligible nodes and preventing any further scaling actions.

As an automated system that is expected to fully manage the lifecycle of clients, the Autoscaler should do whatever is necessary to meet policy requirements. If the Autoscaler's actions interfere with manual operator interventions, the affected policies must be changed or temporarily disabled.

This PR changes the node filtering logic so ineligible nodes are no longer skipped. An ineligible node may not receive more workloads, but it's still present and active in the cluster, so if a policy evaluation requires nodes to be removed, ineligible nodes should be considered as well.

Check for node readiness instead of scheduling eligibility

The original flow only checked for scheduling eligibility, which can still be true when a node is down. As a result, the Autoscaler could pick nodes that had already been removed, and thus no longer existed in the remote service, but had not yet been pruned and were still registered in Nomad. This would cause scaling actions to fail repeatedly and indefinitely until those nodes were removed.

This PR changes the logic to check for readiness instead.
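
Combined with the previous change, the per-node check roughly becomes the following (a sketch, not the exact SDK code; the "ready" status string matches the values visible in the log output below, while the real implementation may use the Nomad API constants):

package scaleutils

import "github.com/hashicorp/nomad/api"

// nodeIsCandidate reports whether a node should stay in the scale-in pool.
// Readiness is checked instead of scheduling eligibility: a down node can
// still report as eligible, while an ineligible node can still be a valid
// candidate for removal.
func nodeIsCandidate(n *api.NodeListStub) bool {
	// Skip nodes that are not ready (e.g. "down" or "initializing"); they
	// likely no longer exist in the remote target and would make the
	// scaling action fail.
	if n.Status != "ready" {
		return false
	}
	// Scheduling eligibility is intentionally no longer checked: ineligible
	// nodes are still present and active in the cluster, so they are kept.
	return true
}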

Log scale in filtering process

Several steps are required to choose which nodes, out of all those registered in Nomad, should be selected for termination. It's hard to follow which nodes are being considered, which ones have been dropped, and why.

This PR adds more log lines at the DEBUG level to provide more visibility during this process. Sample output:

2021-11-16T18:56:04.822-0500 [DEBUG] policy_eval.worker.check_handler: calculating new count: check=mem_allocated_percentage id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster source=prometheus strategy=target-value target=gce-mig count=2
2021-11-16T18:56:04.822-0500 [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=mem_allocated_percentage current_count=2 new_count=1 metric_value=28.225806451612904 metric_time="2021-11-16 18:56:04 -0500 EST" factor=0.40322580645161293 direction=down
2021-11-16T18:56:04.823-0500 [TRACE] policy_eval.worker: check cpu_allocated_percentage selected: id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster target=gce-mig direction=down count=1
2021-11-16T18:56:04.823-0500 [INFO]  policy_eval.worker: scaling target: id=42a1af42-ec06-85a2-663b-ebf73884784e policy_id=b783aa68-ac68-6aea-a0a0-f18f33b9cbcc queue=cluster target=gce-mig from=2 to=1 reason="scaling down because factor is 0.194879" meta=map[nomad_policy_id:b783aa68-ac68-6aea-a0a0-f18f33b9cbcc]
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: found healthy instance: action=scale_in instance_group=hashistack-nomad-client instance_id=3573983942526237328 instance=https://www.googleapis.com/compute/v1/projects/hashistack-integral-grouse/zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: found healthy instance: action=scale_in instance_group=hashistack-nomad-client instance_id=2973413554515448488 instance=https://www.googleapis.com/compute/v1/projects/hashistack-integral-grouse/zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.311-0500 [DEBUG] internal_plugin.gce-mig: performing node pool filtering: node_class=hashistack
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff datacenter=dc1 node_class=hashistack status=ready eligibility=eligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=32acccdc-1762-02ac-5786-8d11d668075a datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=9c25dc22-795f-f848-22e5-3602f1529182 datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=aa24d314-c1a1-c525-60e9-e05c8cce8880 datacenter=dc1 node_class=hashistack status=down eligibility=ineligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: found node: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 datacenter=dc1 node_class=hashistack status=ready eligibility=eligible draining=false
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: node passed filter criteria: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff
2021-11-16T18:56:05.361-0500 [DEBUG] internal_plugin.gce-mig: node passed filter criteria: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16
2021-11-16T18:56:05.428-0500 [DEBUG] internal_plugin.gce-mig: identified remote provider ID for node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: identified remote provider ID for node: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 remote_id=zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: node is part of the policy target: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: node is part of the policy target: node_id=9d8f1719-2b61-64ae-bf26-6ecae654bf16 remote_id=zones/us-central1-a/instances/hashistack-nomad-client-lq2j
2021-11-16T18:56:05.476-0500 [DEBUG] internal_plugin.gce-mig: performing node selection: selector_strategy=least_busy
2021-11-16T18:56:05.601-0500 [DEBUG] internal_plugin.gce-mig: node selected for removal: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff remote_id=zones/us-central1-a/instances/hashistack-nomad-client-bwvg
2021-11-16T18:56:05.602-0500 [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff deadline=5m0s
2021-11-16T18:56:05.736-0500 [INFO]  internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="Drain complete for node b11b215a-64f4-e9f4-4b07-0a5a0f7533ff"
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="Alloc "1276d3f6-cd42-a908-dc6e-5723a4481162" status running -> complete"
2021-11-16T18:56:05.886-0500 [INFO]  internal_plugin.gce-mig: received node drain message: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff msg="All allocations on node "b11b215a-64f4-e9f4-4b07-0a5a0f7533ff" have stopped"
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: node drain complete: node_id=b11b215a-64f4-e9f4-4b07-0a5a0f7533ff
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: pre scale-in tasks now complete
2021-11-16T18:56:05.886-0500 [DEBUG] internal_plugin.gce-mig: deleting GCE MIG instances: action=scale_in instance_group=hashistack-nomad-client instances=["{b11b215a-64f4-e9f4-4b07-0a5a0f7533ff zones/us-central1-a/instances/hashistack-nomad-client-bwvg}"]
2021-11-16T18:56:06.463-0500 [INFO]  internal_plugin.gce-mig: successfully deleted GCE MIG instances: action=scale_in instance_group=hashistack-nomad-client

Maybe they should be at the TRACE level since some clusters may have thousands of nodes? I'm not sure, but they are useful for debugging in general.

Closes #477 #511

@lgfa29 marked this pull request as ready for review November 17, 2021 19:00
@jrasell (Member) left a comment

This looks good to me and a nice improvement. The logging is a great addition as well.

@@ -114,6 +204,12 @@ func (c *ClusterScaleUtils) IdentifyScaleInNodes(cfg map[string]string, num int)
// Filter out the Nomad node ID where this autoscaler instance is running.
filteredNodes = filterOutNodeID(filteredNodes, c.curNodeID)

if c.log.IsDebug() {
Nice; this is pretty much my favourite feature of the logging lib.
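
For context, hclog exposes level predicates such as IsDebug(), so the per-node output can be built only when it will actually be emitted. A rough sketch of the pattern (not the exact PR code):

if c.log.IsDebug() {
	// Only iterate and format the node details when debug logging is on.
	for _, n := range filteredNodes {
		c.log.Debug("found node",
			"node_id", n.ID, "datacenter", n.Datacenter,
			"node_class", n.NodeClass, "status", n.Status,
			"eligibility", n.SchedulingEligibility, "draining", n.Drain)
	}
}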


Successfully merging this pull request may close these issues.

bug: scaleutils can select nodes that are down / not part of target AWS ASG