Fix nil pointer error in nodevolumelimits csi logging #115179
@sunnylovestiramisu: This issue is currently awaiting triage.
@@ -231,8 +231,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
Can a unit test with nil CSINode be added?
There is no existing test setup, but we can add a new TestCheckAttachableInlineVolume.
I think the existing test may need some tweaks to be able to pass in a nil CSINode:
https://github.com/kubernetes/kubernetes/blob/818dfc6f4147d36522dd4bf18146efec4832f1c1/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go#L483
Modified the test; now, if the node is nil, the unit test returns the "node not found" error.
Actually, I needed to modify getFakeCSINodeLister to fake returning a nil CSINode.
I guess this change is not necessary if we return the error?
818dfc6 to ce3ec84
5e730ac to 477a4ac
/assign @msau42
@@ -661,7 +674,10 @@ func getNodeWithPodAndVolumeLimits(limitSource string, pods []*v1.Pod, limit int
default:
// Do nothing.
}

nodeInfo.SetNode(node)
if isNilNode {
We want csiNode to be nil, not node.
Then we should modify getFakeCSINodeLister so that when we call Get, the returned csiNode is nil.
fe28ec2 to bbd38aa
@@ -598,8 +610,8 @@ func getFakeCSIStorageClassLister(scName, provisionerName string) fakeframework.
}
}

func getFakeCSINodeLister(csiNode *storagev1.CSINode) fakeframework.CSINodeLister {
if csiNode != nil {
func getFakeCSINodeLister(csiNode *storagev1.CSINode, isNilCsiNode bool) fakeframework.CSINodeLister {
It looks like the function already handles a nil csiNode input. Do you need to add this extra field?
Because there is a setup step that sets csiNode.spec.Something, we cannot set csiNode to nil before this function. Wouldn't the setup fail?
Actually, I think limitSource = "node" should already cause csiNode to be nil. If you run the existing tests with -v 5, do you encounter the error?
I just tried modifying the test case on the master branch to use limitSource "node":
{
newPod: inTreeOneVolPod,
existingPods: []*v1.Pod{inTreeTwoVolPod},
filterName: "csi",
maxVols: 2,
driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
limitSource: "node",
test: "should count in-tree volumes if migration is enabled",
}
The test passed without a panic.
The test case should use a pod specification using an inline volume.
Also it looks like the FakeLister doesn't return nil in a "not found" scenario: https://github.com/kubernetes/kubernetes/blob/6ec579904c6120b961ed0a1752cd9f92dfb5124c/pkg/scheduler/framework/fake/listers.go#L264
Testing with inTreeInlineVolPod on the master branch also did not trigger the panic. After I made the change to the fake lister, some other existing tests started to fail:
--- FAIL: TestCSILimits/should_count_in-tree_volumes_if_migration_is_enabled (0.00s)
--- FAIL: TestCSILimits/should_count_unbound_in-tree_volumes_if_migration_is_enabled (0.00s)
--- FAIL: TestCSILimits/should_count_in-tree_inline_volumes_if_migration_is_enabled (0.00s)
--- FAIL: TestCSILimits/should_count_in-tree_and_csi_volumes_if_migration_is_enabled_(when_scheduling_in-tree_volumes) (0.00s)
--- FAIL: TestCSILimits/should_count_in-tree,_inline_and_csi_volumes_if_migration_is_enabled_(when_scheduling_in-tree_volumes) (0.00s)
--- FAIL: TestCSILimits/should_count_in-tree_and_csi_volumes_if_migration_is_enabled_(when_scheduling_csi_volumes) (0.00s)
@@ -97,6 +97,7 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
if err != nil {
// TODO: return the error once CSINode is created by default (2 releases)
klog.V(5).InfoS("Could not get a CSINode object for the node", "node", klog.KObj(node), "err", err)
return framework.AsStatus(err)
This will cause the scheduler/cluster-autoscaler to retry immediately, as opposed to returning an UnschedulableAndUnresolvable status.
Is retrying the expected behavior?
I think the comment is incorrect as discussed in #107787 (comment).
Autoscaler doesn't simulate CSINode objects today, so we can't return an error; otherwise, autoscaler will fail forever.
Oh wait, sorry, this is my mistake. I returned the error here to check the logs I added in the test. After verifying the code path, I removed the logs but forgot to remove this line.
@@ -231,8 +231,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
I guess this change is not necessary if we return the error?
bbd38aa to 8576478
// ErrReasonNodeNotFound is used for node not found error.
ErrReasonNodeNotFound = "node not found"
leftover?
@@ -231,8 +232,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
if csiNode != nil {
The mock is not working as expected after changing to a list. Even though the test case says migration is enabled, the following code (old line 291, new line 296) returned early:
if !isCSIMigrationOn(csiNode, pluginName) {
klog.V(5).InfoS("CSI Migration of plugin is not enabled", "plugin", pluginName)
return "", ""
}
This got returned because csiNode is nil.
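Why a nil csiNode leads to the early return can be shown with a simplified stand-in for the migration check. The real isCSIMigrationOn is not reproduced in this thread, so everything below is an assumption made for illustration: the CSINode type, its MigratedPlugins field, the migrationOn function, and the plugin name are hypothetical. The sketch only demonstrates the pattern of treating a nil CSINode as "migration off".

```go
package main

import "fmt"

// CSINode is a hypothetical stand-in for storagev1.CSINode. MigratedPlugins
// is an illustrative field, not how the real object records migration state.
type CSINode struct {
	Name            string
	MigratedPlugins []string
}

// migrationOn is a simplified stand-in for the migration check: a nil
// CSINode or an empty plugin name is reported as "migration not enabled".
func migrationOn(csiNode *CSINode, pluginName string) bool {
	if csiNode == nil || pluginName == "" {
		return false
	}
	for _, p := range csiNode.MigratedPlugins {
		if p == pluginName {
			return true
		}
	}
	return false
}

func main() {
	const plugin = "example.com/in-tree-plugin" // hypothetical plugin name

	migrated := &CSINode{Name: "node-a", MigratedPlugins: []string{plugin}}
	fmt.Println("with CSINode:", migrationOn(migrated, plugin))
	fmt.Println("nil CSINode:", migrationOn(nil, plugin))
}
```

Under this pattern, a test that sets migrationEnabled but whose lister cannot find the CSINode still takes the "not enabled" branch.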
622be72 to 250497e
It turned out the original test setup did not reflect reality: the CSINode and the node should have the same name. I added extra logs locally (please see here) to test the csiNode == nil case:
/test pull-kubernetes-unit
driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
isNilCsiNode: true,
limitSource: "csinode",
Are you able to repro the error if you set limitSource to node, without needing isNilCsiNode?
Don't we need isNilCsiNode to trigger the nil pointer exception? When I commented out the csiNodeName guard and logged csiNode.Name directly, the test failed with:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13ba74d]
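The panic above is the ordinary Go behavior when a field is read through a nil pointer. A minimal sketch, using a hypothetical CSINode type in place of storagev1.CSINode and hypothetical helper names, shows both the unguarded read and the guard this PR adds before logging:

```go
package main

import "fmt"

// CSINode is a hypothetical stand-in for storagev1.CSINode.
type CSINode struct {
	Name string
}

// csiNodeNameForLog mirrors the guard in the fix: it yields "" when csiNode
// is nil instead of dereferencing it.
func csiNodeNameForLog(csiNode *CSINode) string {
	csiNodeName := ""
	if csiNode != nil {
		csiNodeName = csiNode.Name
	}
	return csiNodeName
}

// unguardedReadPanics reads csiNode.Name without a nil check and reports
// whether that read panicked.
func unguardedReadPanics(csiNode *CSINode) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = fmt.Sprint(csiNode.Name) // nil pointer dereference when csiNode is nil
	return false
}

func main() {
	fmt.Printf("guarded name for nil CSINode: %q\n", csiNodeNameForLog(nil))
	fmt.Println("unguarded read panics on nil:", unguardedReadPanics(nil))
	fmt.Println("unguarded read panics on non-nil:", unguardedReadPanics(&CSINode{Name: "node-a"}))
}
```

The guard keeps the log statement safe without changing what the surrounding code decides about the volume.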
When limitSource == node, I believe that should also cause csiNode to be nil: https://github.com/kubernetes/kubernetes/blob/250497ebca681fa8c054cac45bb5d8438b67d7a4/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go#L663
I tried the same test case again, as in last week's code review comment above:
{
newPod: inTreeOneVolPod,
existingPods: []*v1.Pod{inTreeTwoVolPod},
filterName: "csi",
maxVols: 2,
driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
limitSource: "node",
test: "should count in-tree volumes if migration is enabled",
wantStatus: framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded),
}
It did not trigger csiNode == nil; it did not go into the if !isCSIMigrationOn(csiNode, inTreeProvisionerName) code path. I do not think the changes from the master branch fixed the test setup, so changing limitSource still does not change the test result. :(
I checked out your PR, reverted csi.go, and tweaked the test inputs and was able to repro the crash:
--- a/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go
+++ b/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go
@@ -313,8 +313,8 @@ func TestCSILimits(t *testing.T) {
maxVols: 2,
driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
- isNilCsiNode: true,
- limitSource: "csinode",
+ isNilCsiNode: false,
+ limitSource: "node",
test: "nil csi node",
},
Does this mean we do not need the isNilCsiNode parameter, and that just changing csi-node-name to node-name fixes the test?
Don't we need isNilCsiNode to trigger the nil pointer exception? When I commented out the csiNodeName guard and logged csiNode.Name directly, the test failed with:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13ba74d]
The above was run with:
{
newPod: inTreeInlineVolPod,
existingPods: []*v1.Pod{inTreeTwoVolPod},
filterName: "csi",
maxVols: 2,
driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
isNilCsiNode: true,
limitSource: "csinode",
}
and without this change:
csiNodeName := ""
if csiNode != nil {
csiNodeName = csiNode.Name
}
It also generated the nil pointer exception. Do we need another test case specifically for:
isNilCsiNode: false,
limitSource: "node",
correct
Updated
250497e to 5e2f12e
Tests failed with
/lgtm
LGTM label has been added. Git tree hash: 2a1432bde8ab29d6e83b1a0bde51450313ebfff1
Can you add a release note for the fix? We'll want to cherry pick this.
LGTM from scheduling
/approve
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, msau42, sunnylovestiramisu
…ick-of-#115179-upstream-release-1.25 Automated cherry pick of #115179: Fix nil pointer error in nodevolumelimits csi logging
…ick-of-#115179-upstream-release-1.26 Automated cherry pick of #115179: Fix nil pointer error in nodevolumelimits csi logging
Fix included in accepted release 4.13.0-0.nightly-2023-11-21-212406
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix nil pointer error in logging: #115178
Which issue(s) this PR fixes:
Fixes #115178
Special notes for your reviewer:
Does this PR introduce a user-facing change?