Cleanup AWSEBSCSIDriver setup #11610

Closed
wants to merge 2 commits into from

Member

@hakman hakman commented May 27, 2021

What this does:

  • Generate AWSEBSCSIDriver model only when needed
  • Don't pre-pull AWSEBSCSIDriver images in NodeUp

/cc @olemarkus @rifelpet

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 27, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from hakman after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -239,6 +239,15 @@ func (tf *TemplateFunctions) AddTo(dest template.FuncMap, secretStore fi.SecretS
dest["EnableSQSTerminationDraining"] = func() bool { return *cluster.Spec.NodeTerminationHandler.EnableSQSTerminationDraining }
}

if cluster.Spec.CloudConfig != nil && fi.BoolValue(cluster.Spec.CloudConfig.ManageStorageClasses) {
Member

This sort of logic shouldn't live in template functions. Template functions aren't scoped to any particular component or piece of logic, which makes things harder to maintain. That is precisely why we have option builders.

Member Author

If you mean that fi.BoolValue(cluster.Spec.CloudConfig.ManageStorageClasses) part should be removed, I agree.
If you mean that we should always materialize AWSEBSCSIDriver.Enabled, I am not so sure.

Member

We have to materialize it, or make the templates more complex.

The problem with the materialization now is that it triggers an unnecessary change to the bootstrap script. I think we should instead restrict which parts of CloudConfig are added to that script.

Member Author

Referencing optional settings in templates other than the one for which they were meant is not nice and makes the templates harder to maintain.
We would trade a simple template function that checks whether the feature is enabled for extra code in multiple places, all for an addon that can only be enabled starting with k8s 1.20 and is enabled by default in 1.22.

If we move the defaulting for 1.22 into the model, that would make it better, I guess. This is why I was asking: why only for newly created clusters?
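
For readers following the thread, here is a minimal sketch of the "simple template function" side of this trade-off. The struct types, the `EnableEBSCSIDriver` function name, and the template text are hypothetical stand-ins, not the actual kOps code; `boolValue` is assumed to behave like the nil-safe `fi.BoolValue` used in the diff.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical, trimmed-down stand-ins for the cluster spec types.
type AWSEBSCSIDriver struct {
	Enabled *bool
}

type CloudConfig struct {
	AWSEBSCSIDriver *AWSEBSCSIDriver
}

type ClusterSpec struct {
	CloudConfig *CloudConfig
}

// boolValue treats a nil pointer as false (assumed fi.BoolValue behaviour).
func boolValue(b *bool) bool {
	return b != nil && *b
}

// ebsCSIDriverEnabled tolerates unset optional fields at every level,
// so the spec does not need a materialized Enabled value.
func ebsCSIDriverEnabled(spec *ClusterSpec) bool {
	if spec == nil || spec.CloudConfig == nil || spec.CloudConfig.AWSEBSCSIDriver == nil {
		return false
	}
	return boolValue(spec.CloudConfig.AWSEBSCSIDriver.Enabled)
}

// render executes a tiny template that branches on the helper function.
func render(spec *ClusterSpec) (string, error) {
	funcs := template.FuncMap{
		// Hypothetical function name, for illustration only.
		"EnableEBSCSIDriver": func() bool { return ebsCSIDriverEnabled(spec) },
	}
	tmpl, err := template.New("addon").Funcs(funcs).Parse(
		`{{ if EnableEBSCSIDriver }}csi-driver: on{{ else }}csi-driver: off{{ end }}`)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	enabled := true
	spec := &ClusterSpec{CloudConfig: &CloudConfig{
		AWSEBSCSIDriver: &AWSEBSCSIDriver{Enabled: &enabled},
	}}
	out, err := render(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The alternative argued for above would set `Enabled` explicitly while the options are built, so that templates could read the field directly without a helper.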

Member

Think this was answered in office hours. Short version is that I wanted e2e to pass before enabling by default for all clusters. That work is almost done now.

Member Author

Agreed, answered, and the PR was updated.

@hakman hakman force-pushed the cleanup-awsebscsidriver-conf branch from 4d7390a to d845db8 Compare May 27, 2021 09:42
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 27, 2021
@hakman
Member Author

hakman commented May 27, 2021

/retest

@hakman hakman force-pushed the cleanup-awsebscsidriver-conf branch from d845db8 to 2c1688d Compare May 28, 2021 11:07
@hakman hakman force-pushed the cleanup-awsebscsidriver-conf branch from 2c1688d to 6f3c72b Compare May 28, 2021 11:19
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 28, 2021
@hakman
Member Author

hakman commented May 31, 2021

@olemarkus Any other thoughts about this?

}

// Pulling CSI driver image
image := "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:" + *csi.Version
Member

There's value in doing the warmpull. We should keep this, but figure out why we're dereferencing nil here.

Member Author

I disagree in this case. This should not have been done at all for anything other than CNI and proxy, which affect startup.
A generic mechanism, based on container image assets, would be the way forward.

Member

If you have pending workloads with volumes, this will speed up pod startup. So while it does not contribute to node startup, it does contribute to pod scheduling time.

Member

The nil dereferencing works because it assumes cloudup always sets this value. That is the case now, but I wouldn't exactly call this very defensive.
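
A more defensive variant of the image lookup could look roughly like the sketch below. The `AWSEBSCSIDriver` struct here is a hypothetical stand-in with only a `Version` field, and `csiDriverImage` is an illustrative helper name, not a function from the PR; only the image prefix is taken from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the driver config; only Version is modelled.
type AWSEBSCSIDriver struct {
	Version *string
}

var errNoVersion = errors.New("AWSEBSCSIDriver version is not set")

// csiDriverImage builds the image name without assuming that an earlier
// step has already populated the version.
func csiDriverImage(csi *AWSEBSCSIDriver) (string, error) {
	if csi == nil || csi.Version == nil || *csi.Version == "" {
		return "", errNoVersion
	}
	return "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:" + *csi.Version, nil
}

func main() {
	v := "v1.0.0"
	// One populated config, one with an unset version, and one nil config.
	for _, csi := range []*AWSEBSCSIDriver{{Version: &v}, {}, nil} {
		image, err := csiDriverImage(csi)
		if err != nil {
			fmt.Println("skipping warm pull:", err)
			continue
		}
		fmt.Println("would pull:", image)
	}
}
```

Returning an error instead of panicking lets the caller decide whether a missing version should skip the warm pull or fail the task.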

Member Author

This is a general argument for any pre-pulled image. Realistically, the pull time for these images is very small.
As I said, I don't disagree with the idea, if done correctly. I am not ok with using it in this case and having extra logic in NodeUp for a special case.

Member

Not quite. There are three images pulled in succession, and one of them sometimes takes up to 20 seconds.
A random example:

FirstSeen              Count    From      Type     Reason      Message
<nil>                  <none>   <none>    Normal   Scheduled   Successfully assigned kube-system/ebs-csi-node-9sbqq to ip-172-20-98-90.eu-central-1.compute.internal
2021-06-03T17:34:17Z   1        kubelet   Normal   Pulling     Pulling image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0"
2021-06-03T17:34:20Z   1        kubelet   Normal   Pulled      Successfully pulled image "k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0" in 3.187918027s
2021-06-03T17:34:21Z   1        kubelet   Normal   Created     Created container ebs-plugin
2021-06-03T17:34:21Z   1        kubelet   Normal   Started     Started container ebs-plugin
2021-06-03T17:34:21Z   1        kubelet   Normal   Pulling     Pulling image "k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.1.0"
2021-06-03T17:34:34Z   1        kubelet   Normal   Pulled      Successfully pulled image "k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.1.0" in 12.506681986s
2021-06-03T17:34:34Z   1        kubelet   Normal   Created     Created container node-driver-registrar
2021-06-03T17:34:34Z   1        kubelet   Normal   Started     Started container node-driver-registrar
2021-06-03T17:34:34Z   1        kubelet   Normal   Pulling     Pulling image "k8s.gcr.io/sig-storage/livenessprobe:v2.2.0"
2021-06-03T17:34:35Z   1        kubelet   Normal   Pulled      Successfully pulled image "k8s.gcr.io/sig-storage/livenessprobe:v2.2.0" in 1.094773878s
2021-06-03T17:34:35Z   1        kubelet   Normal   Created     Created container liveness-probe
2021-06-03T17:34:36Z   1        kubelet   Normal   Started     Started container liveness-probe

Member Author

We can probably debate this forever. I doubt any of us will change our minds. Hopefully this will change in the near future to something more maintainable.

@hakman hakman closed this Jun 3, 2021
@hakman hakman deleted the cleanup-awsebscsidriver-conf branch October 22, 2021 07:13
Labels
area/addons area/nodeup blocks-next cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

4 participants