
BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups #152

Merged

Conversation

JoelSpeed

@JoelSpeed JoelSpeed commented May 7, 2020

This allows a small tolerance in the memory capacity of nodes to allow better matching of similar node groups. There are differences in the memory values that Kubernetes interprets due to variances in the instances that a cloud provider provides.

It also adds tests using real values from a real set of nodes that would be expected to be the same (the same instance type across multiple availability zones within a given region).

E.g., in testing I saw AWS m5.xlarge nodes with capacities such as 16116152Ki and 15944120Ki, not only across availability zones but also within the same availability zone after a few cycles through machines. This is a difference of 168Mi, which is much larger than the original tolerance of 128000 bytes and was preventing BalanceSimilarNodeGroups from balancing across these availability zones.
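As a quick check of the arithmetic above, here is a minimal Go sketch that uses plain `int64` values rather than `resource.Quantity`. The helper name `capacityDiffBytes` is hypothetical and only for illustration; the two capacities and the 128000-byte tolerance are the figures quoted in this description.

```go
package main

import "fmt"

const (
	kib = int64(1024)
	mib = 1024 * kib
)

// capacityDiffBytes is a hypothetical helper: it takes two capacities
// expressed in Ki and returns their absolute difference in bytes.
func capacityDiffBytes(aKi, bKi int64) int64 {
	d := (aKi - bKi) * kib
	if d < 0 {
		d = -d
	}
	return d
}

func main() {
	// The two m5.xlarge capacities observed in testing.
	diff := capacityDiffBytes(16116152, 15944120)

	fmt.Printf("difference: %d bytes (%d Mi)\n", diff, diff/mib)
	// The original fixed tolerance was 128000 bytes.
	fmt.Println("within original tolerance:", diff <= 128000)
}
```

The difference is 172032Ki, which is exactly 168Mi, so the original tolerance is exceeded by more than three orders of magnitude.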

@JoelSpeed JoelSpeed changed the title UPSTREAM: <carry>: openshift: Use quantities for memory capacity differences BUG 1824215: Use quantities for memory capacity differences May 7, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 7, 2020
@openshift-ci-robot

@JoelSpeed: This pull request references Bugzilla bug 1824215, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

BUG 1824215: Use quantities for memory capacity differences

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JoelSpeed
Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 7, 2020
@openshift-ci-robot

@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh



@elmiko elmiko left a comment


this looks nice to me, thanks Joel!

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2020
@enxebre
Member

enxebre commented May 11, 2020

Thanks! This is purely autoscaler core, can we get a counter PR upstream?

@enxebre
Member

enxebre commented May 11, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 11, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@@ -124,8 +126,10 @@ func IsCloudProviderNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo, ignoredL
 	switch kind {
 	case apiv1.ResourceMemory:
 		// For memory capacity we allow a small tolerance
-		memoryDifference := math.Abs(float64(qtyList[0].Value()) - float64(qtyList[1].Value()))
-		if memoryDifference > MaxMemoryDifferenceInKiloBytes {
+		difference := absSub(qtyList[0], qtyList[1])
Member

Could just use math.Abs?

Author

I wanted to keep all of the quantities as resource.Quantity values. This helper lets us do that and reduces conversions to and from integers, which lowers the likelihood of a mistake being made there.
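For readers following along, the shape of such a helper can be sketched as below. This is not the code from the PR: `quantity` is a hypothetical stand-in type so the example runs with only the standard library, and its `Cmp` and `Sub` methods are simplified imitations of the methods of the same names on `resource.Quantity`.

```go
package main

import "fmt"

// quantity is a hypothetical stand-in for resource.Quantity that
// stores a plain byte count. It exists only to keep this sketch
// free of external dependencies.
type quantity struct{ bytes int64 }

// Cmp returns -1, 0, or 1 when q is less than, equal to, or greater than y.
func (q quantity) Cmp(y quantity) int {
	switch {
	case q.bytes < y.bytes:
		return -1
	case q.bytes > y.bytes:
		return 1
	}
	return 0
}

// Sub subtracts y from q in place.
func (q *quantity) Sub(y quantity) { q.bytes -= y.bytes }

// absSub returns |a - b| as a quantity. It works on copies, so
// neither argument is modified.
func absSub(a, b quantity) quantity {
	larger, smaller := a, b
	if a.Cmp(b) < 0 {
		larger, smaller = b, a
	}
	larger.Sub(smaller)
	return larger
}

func main() {
	a := quantity{bytes: 15944120 * 1024}
	b := quantity{bytes: 16116152 * 1024}
	fmt.Println("difference in bytes:", absSub(a, b).bytes)
}
```

Ordering the operands before subtracting means the result is never negative, so no separate absolute-value step or integer conversion is needed.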


var (
// MaxMemoryDifference describes how much memory capacity can differ but still be considered equal.
MaxMemoryDifference = resource.MustParse("256Mi")
Member

@enxebre enxebre May 11, 2020

How big is the diff we are seeing in real nodes?
Tolerating 256Mi seems like too much to still consider nodeGroups equal. The original intention was to tolerate 128Ki (kubernetes@e8b3c2a).
I think this should be 256Ki.

Author

Please see the test case I've added below which came from a real world test case. The values that came through the code (via much debug logging) were 16116152Ki and 15944120Ki, which is 168Mi, just over a 1% difference in this case

Please also review the attached BZ, which has more details from a customer who reported differences of a similar magnitude.

Member

Wouldn't this MaxMemoryDifference also apply to much smaller instances, to the point of making the check lose its value?
I.e., if the possible diff range increases with the instance size, should we maybe make our tolerance window a percentage of the given total size?
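To make the trade-off concrete, here is a minimal Go sketch comparing a fixed tolerance with a percentage-based one. The function names and the 1.5% ratio are assumptions chosen for illustration, not values taken from this PR; the 1Gi and 800Mi "small instance" capacities are likewise made up.

```go
package main

import "fmt"

const (
	kib = int64(1024)
	mib = 1024 * kib
)

// absDiff returns |a - b| in bytes.
func absDiff(a, b int64) int64 {
	if a < b {
		return b - a
	}
	return a - b
}

// withinFixed reports whether two capacities differ by at most maxDiff bytes.
func withinFixed(a, b, maxDiff int64) bool {
	return absDiff(a, b) <= maxDiff
}

// withinRatio reports whether two capacities differ by at most
// ratio times the larger of the two (hypothetical rule).
func withinRatio(a, b int64, ratio float64) bool {
	larger := a
	if b > larger {
		larger = b
	}
	return float64(absDiff(a, b)) <= ratio*float64(larger)
}

func main() {
	const ratio = 0.015 // assumed value, for illustration only

	// Observed m5.xlarge capacities: about 1% apart.
	big1, big2 := 16116152*kib, 15944120*kib
	// Made-up small instances: about 22% apart.
	small1, small2 := 1024*mib, 800*mib

	fmt.Println("large, fixed 256Mi:", withinFixed(big1, big2, 256*mib))
	fmt.Println("large, ratio:      ", withinRatio(big1, big2, ratio))
	fmt.Println("small, fixed 256Mi:", withinFixed(small1, small2, 256*mib))
	fmt.Println("small, ratio:      ", withinRatio(small1, small2, ratio))
}
```

With a fixed 256Mi window both pairs count as similar, whereas the ratio-based check accepts the large pair and rejects the small one, which is the behaviour the question above is asking for.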

Member

Let's keep the discussion to the upstream PR for better visibility https://github.com/kubernetes/autoscaler/pull/3124/files#r422931565

@enxebre
Member

enxebre commented May 11, 2020

/lgtm cancel
/hold
to discuss #152 (comment)

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm Indicates that a PR is ready to be merged. labels May 11, 2020
@JoelSpeed
Author

Counter PR will be created shortly

@JoelSpeed
Author

Upstream kubernetes#3124

…hen comparing nodegroups

This allows developers to better interpret how the calculations are being
done by leaving the values as "Quantities". For example, the max
difference is now a string converted to a quantity which will be easier
to reason about and update if needed in the future.
Also adds tests that match real values from a real set of nodes that
would be expected to be the same
@JoelSpeed JoelSpeed changed the title BUG 1824215: Use quantities for memory capacity differences BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups Jun 11, 2020
@openshift-ci-robot

@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups


@JoelSpeed
Author

/hold cancel

This has been updated to reflect the upstream implementation which should be merging within the next week

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2020
@openshift-ci-robot

@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure-operator 90751c4 link /test e2e-azure-operator

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@elmiko

elmiko commented Jun 11, 2020

thanks Joel!
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 11, 2020
@openshift-merge-robot openshift-merge-robot merged commit 4abdca5 into openshift:master Jun 11, 2020
@openshift-ci-robot

@JoelSpeed: All pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#152, openshift/kubernetes-autoscaler#144. Bugzilla bug 1824215 has been moved to the MODIFIED state.

In response to this:

BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups


@JoelSpeed
Author

/cherry-pick release-4.5

@openshift-cherrypick-robot

@JoelSpeed: new pull request created: #157

In response to this:

/cherry-pick release-4.5


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

7 participants