Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clients should not fingerprint CSI node-only plugins with GetNodeInfo #10055

Closed
tgross opened this issue Feb 21, 2021 · 4 comments
Closed

clients should not fingerprint CSI node-only plugins with GetNodeInfo #10055

tgross opened this issue Feb 21, 2021 · 4 comments

Comments

@tgross
Copy link
Member

tgross commented Feb 21, 2021

Discovered a CSI compliance bug where the CSI NodeGetInfo specification says:

A Node Plugin MUST implement this RPC call if the plugin has PUBLISH_UNPUBLISH_VOLUME controller capability.

A node-only plugin should not advertise that capability. It looks like at least some node-only plugins do implement this RPC, but it's incorrect to the spec. Regardless, Nomad hits that endpoint anyways and this can cause errors of the form:

2021-02-15T15:47:33.839-0500 [WARN] client.csi-zfs: finished client unary call: grpc.code=Unimplemented duration=357.548µs grpc.service=csi.v1.Node grpc.method=NodeGetInfo

When the Nomad client fingerprints the plugin, it calls NodeGetInfo if p.fingerprintNode is true. This value is set if the plugin is of type PluginCSITypeNode. It looks like that guard for NodeGetInfo should be for p.fingerprintController, not p.fingerprintNode.

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 21, 2021
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Feb 21, 2021
@tgross
Copy link
Member Author

tgross commented Feb 22, 2021

Upon reflection this bug is a little more subtle... the we should be calling NodeGetInfo if the plugin is a node plugin (so p.fingerprintNode == true), but has a controller required. But we don't necessarily know that until an allocation running a controller is started. Ex. if you start a job with the node plugins, and then afterwards start the controller plugins. It'll get picked up on the next fingerprint after the controller registers, but we'll want to surface that ordering in the docs.

@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@tgross
Copy link
Member Author

tgross commented Feb 3, 2022

Another piece of information I've pulled up while looking at the plugin counts is that the docstring comments suggest the intent was to use only GetProbe as the fingerprint after the initial fingerprinting of capabilities (which don't change for a given allocation), but that we're doing the full expensive fingerprinting operation every 30s. We're also throwing away the gRPC client in the plugin_supervisor_hook on each pass, which we don't need to do either.

@tgross
Copy link
Member Author

tgross commented Feb 24, 2022

Now that I've started implementation of topologies (#7669) and have a better understanding of this fingerprint, I'm going to close this issue in lieu of #12056

@tgross tgross closed this as completed Feb 24, 2022
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant