Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cloudprovider/externalgrpc): GetOptions returns MaxNodeProvisionTime as 0 seconds #5936

Conversation

vadasambar
Copy link
Member

@vadasambar vadasambar commented Jul 6, 2023

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5935

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed: cluster-autoscaler thinks newly scaled up nodegroup using externalgrpc provider has `MaxNodeProvisionTime` set as 0 seconds and expects the new node to be registered in 0-10 seconds instead of the default 15m. Check https://github.com/kubernetes/autoscaler/issues/5935 for more info.  

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 6, 2023
@k8s-ci-robot k8s-ci-robot requested a review from x13n July 6, 2023 19:20
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 6, 2023
@@ -56,8 +56,8 @@ To regenerate the gRPC code:
1. install `protoc` and `protoc-gen-go-grpc`:

```bash
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.26
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@v1.1
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.28
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to latest version at the time of creating this PR.
image
https://grpc.io/docs/languages/go/quickstart/

@@ -16,17 +16,17 @@ limitations under the License.

// Code generated by protoc-gen-go. DO NOT EDIT.
// versions:
// protoc-gen-go v1.26.0
// protoc v3.19.1
// protoc-gen-go v1.28.1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in this file is auto-generated using

protoc \
  -I ./cluster-autoscaler \
  -I ./cluster-autoscaler/vendor \
  --go_out=. \
  --go-grpc_out=. \
  ./cluster-autoscaler/cloudprovider/externalgrpc/protos/externalgrpc.proto

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/externalgrpc#code-generation

@@ -15,6 +15,10 @@ limitations under the License.
*/

// Code generated by protoc-gen-go-grpc. DO NOT EDIT.
// versions:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in this file is auto-generated using

protoc \
  -I ./cluster-autoscaler \
  -I ./cluster-autoscaler/vendor \
  --go_out=. \
  --go-grpc_out=. \
  ./cluster-autoscaler/cloudprovider/externalgrpc/protos/externalgrpc.proto

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/externalgrpc#code-generation

@@ -786,7 +786,7 @@ type GetAvailableGPUTypesResponse struct {
unknownFields protoimpl.UnknownFields

// GPU types passed in as opaque key-value pairs.
GpuTypes map[string]*anypb.Any `protobuf:"bytes,1,rep,name=gpuTypes,proto3" json:"gpuTypes,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
GpuTypes map[string]*any1.Any `protobuf:"bytes,1,rep,name=gpuTypes,proto3" json:"gpuTypes,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
Copy link
Member Author

@vadasambar vadasambar Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that any1 is imported from "github.com/golang/protobuf/ptypes/any". "github.com/golang/protobuf/ptypes/any" internally uses "google.golang.org/protobuf/types/known/anypb"'s anypb

package any
import (
protoreflect "google.golang.org/protobuf/reflect/protoreflect"
protoimpl "google.golang.org/protobuf/runtime/protoimpl"
anypb "google.golang.org/protobuf/types/known/anypb"
reflect "reflect"
)
// Symbols defined in public import of google/protobuf/any.proto.
type Any = anypb.Any

@vadasambar vadasambar force-pushed the fix/add-max-node-provision-time-to-external-grpc-provider branch from ee49e32 to 76092a2 Compare July 7, 2023 06:19
@vadasambar vadasambar marked this pull request as ready for review July 7, 2023 06:21
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 7, 2023
Copy link
Member

@dbonfigli dbonfigli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this.

There is another field that has been recently added to NodeGroupAutoscalingOptions , ZeroOrMaxNodeScaling, but it is constantly worked on and changed name many times recently, since the default is false and it is not causing issues I would not put it in the protobuf until the feature settles down.

Could you also please update the example at

with the new field?

@@ -248,6 +250,8 @@ func TestCloudProvider_GetOptions(t *testing.T) {
assert.Equal(t, 0.7, opts.ScaleDownGpuUtilizationThreshold)
assert.Equal(t, time.Minute, opts.ScaleDownUnneededTime)
assert.Equal(t, time.Hour, opts.ScaleDownUnreadyTime)
assert.Equal(t, time.Hour, opts.ScaleDownUnreadyTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated line

@vadasambar vadasambar force-pushed the fix/add-max-node-provision-time-to-external-grpc-provider branch from 289c10e to fd1a95a Compare July 11, 2023 18:48
@vadasambar
Copy link
Member Author

@dbonfigli I have addressed the review changes. Thanks for reviewing my PR 🙏

I guess I need approval from @BigDarkClown to merge the PR.

@@ -351,6 +351,7 @@ func (w *Wrapper) NodeGroupGetOptions(_ context.Context, req *protos.NodeGroupAu
ScaleDownGpuUtilizationThreshold: pbDefaults.GetScaleDownGpuUtilizationThreshold(),
ScaleDownUnneededTime: pbDefaults.GetScaleDownUnneededTime().Duration,
ScaleDownUnreadyTime: pbDefaults.GetScaleDownUnneededTime().Duration,
MaxNodeProvisionTime: pbDefaults.GetMaxNodeProvisionTime().Duration,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please also add the new field in the return value below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

…onTime` as 0 seconds

Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore(cloudprovider/externalgrpc): fix CI failing because of wrong header
- CI complains if I remove or change the copyright header
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore(cloudprovider/externalgrpc): fix spacing for license headers
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore(cloudprovider/externalgrpc): remove extra space in copyright header
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/externalgrpc): add assert for `MaxNodeProvisionTime`
- add extra assert for err test case
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: duplicate line
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add example with `MaxNodeProvisionTime`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: return `MaxNodeProvisionTime` from `NodeGroupGetOptions` in response
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@MaciekPytel
Copy link
Contributor

/approve
I'm going to leave lgtm to @dbonfigli as our general policy is for cloudprovider reviews to be done by their OWNERS and not core CA. The change does look good to me for whatever it's worth.

A separate point - I expect that this will keep happening as more options are moved into GetOptions. And I expect that there will be more fields added there over time - there seem to be a lot of interest in making CA configurable on per-nodegroup level and there isn't really any particular reason for us not to support that.
With that in mind - is there some long-term solution here? Should we expect anyone adding new fields to GetOptions to extend grpc? If so - can the steps be documented?

Note: I don't think my point above should be blocking for this PR; just a potential follow-up idea.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dbonfigli, MaciekPytel, vadasambar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2023
@dbonfigli
Copy link
Member

dbonfigli commented Jul 31, 2023

A separate point - I expect that this will keep happening as more options are moved into GetOptions. And I expect that there will be more fields added there over time - there seem to be a lot of interest in making CA configurable on per-nodegroup level and there isn't really any particular reason for us not to support that. With that in mind - is there some long-term solution here? Should we expect anyone adding new fields to GetOptions to extend grpc? If so - can the steps be documented?

Note: I don't think my point above should be blocking for this PR; just a potential follow-up idea.

I think the problem is similar for in-tree cloud providers that implement GetOption (luckily, only aws, gce, azure, ovh, civo): changing the NodeGroupAutoscalingOptions struct can result in problems if also cloud providers are not updated to handle it. For example, I am pretty sure #5649 is also breaking the ovh cloud provider, since

func (ng *NodeGroup) GetOptions(defaults config.NodeGroupAutoscalingOptions) (*config.NodeGroupAutoscalingOptions, error) {
is not defining or passing back the default for MaxNodeProvisionTime. So, the log term plan for the externalgrpc cloud provider could be the same as for the in-tree ones, either trying to make non breaking changes with new fields that have defaults values that result in working code (like for example with the new ZeroOrMaxNodeScaling), or adding the feature, or reaching out to cloudprovider maintainers putting them on notice for the upcoming breaking change.

For the externalgrcp cloud provider this problem was particular noticeable also because in practice it was mandatory to also implement GetOption, a problem that should be fixed with #5937 .

@vadasambar
Copy link
Member Author

@dbonfigli I think this PR needs your lgtm.

@dbonfigli
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link
Contributor

@dbonfigli: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dbonfigli
Copy link
Member

@dbonfigli I think this PR needs your lgtm.

unfortunately i am not a member of the kubernetes org.

@vadasambar
Copy link
Member Author

@dbonfigli I think this PR needs your lgtm.

unfortunately i am not a member of the kubernetes org.

Thanks for the response. I saw your comment in #5998 (comment)

@MaciekPytel we need an LGTM on this from you because @dbonfigli is not a Kubernetes member and cannot add LGTM label to the PR.

@x13n
Copy link
Member

x13n commented Sep 4, 2023

/lgtm

(based on @dbonfigli's comment above)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit 57f2814 into kubernetes:master Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When using externalgrpc provider, nodes are created and then immediately deleted next loop iteration
5 participants