Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nil pointer error in nodevolumelimits csi logging #115179

Merged
merged 1 commit into from Jan 26, 2023

Conversation

sunnylovestiramisu
Copy link
Contributor

@sunnylovestiramisu sunnylovestiramisu commented Jan 18, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix nil pointer error in logging: #115178

Which issue(s) this PR fixes:

Fixes #115178

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix 1.25 regression causing a nil pointer error in nodevolumelimits csi logging

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2023
@k8s-ci-robot
Copy link
Contributor

@sunnylovestiramisu: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 18, 2023
@@ -231,8 +231,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a unit test with nil CSINode be added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no existing test set up but we can add a new TestCheckAttachableInlineVolume

Copy link
Member

@msau42 msau42 Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the test and now if it is a nil node the unit test throws the node not found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I needed to modify the getFakeCSINodeLister to fake returning a nil csi node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this change is not necessary if we return the error?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 18, 2023
@sunnylovestiramisu
Copy link
Contributor Author

/assign @msau42

@@ -661,7 +674,10 @@ func getNodeWithPodAndVolumeLimits(limitSource string, pods []*v1.Pod, limit int
default:
// Do nothing.
}

nodeInfo.SetNode(node)
if isNilNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want csiNode to be nil, not node

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should modify the getFakeCSINodeLister so that when we call Get the csiNode is nil.

@@ -598,8 +610,8 @@ func getFakeCSIStorageClassLister(scName, provisionerName string) fakeframework.
}
}

func getFakeCSINodeLister(csiNode *storagev1.CSINode) fakeframework.CSINodeLister {
if csiNode != nil {
func getFakeCSINodeLister(csiNode *storagev1.CSINode, isNilCsiNode bool) fakeframework.CSINodeLister {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the function already handles a nil csiNode input. Do you need to add this extra field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we cannot set the csiNode to nil before this function, there is a set up step for csiNode.spec.Something. The setup will fail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think limitSource = "node" should already cause csiNode to be nil. If you run the existing tests with -v 5, do you encounter the error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried to modify the test case for limitSource as node in the master branch:

{
	newPod:           inTreeOneVolPod,
	existingPods:     []*v1.Pod{inTreeTwoVolPod},
	filterName:       "csi",
	maxVols:          2,
	driverNames:      []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
	migrationEnabled: true,
	limitSource:      "node",
	test:             "should count in-tree volumes if migration is enabled",
}

The test just passed without panic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case should use a pod specification using an inline volume.

Also it looks like the FakeLister doesn't return nil in a "not found" scenario: https://github.com/kubernetes/kubernetes/blob/6ec579904c6120b961ed0a1752cd9f92dfb5124c/pkg/scheduler/framework/fake/listers.go#L264

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inTreeInlineVolPod on master branch testing also did not trigger the painc. Made the change for the fake lister, some other existing tests start to fail:

    --- FAIL: TestCSILimits/should_count_in-tree_volumes_if_migration_is_enabled (0.00s)
    --- FAIL: TestCSILimits/should_count_unbound_in-tree_volumes_if_migration_is_enabled (0.00s)
    --- FAIL: TestCSILimits/should_count_in-tree_inline_volumes_if_migration_is_enabled (0.00s)
    --- FAIL: TestCSILimits/should_count_in-tree_and_csi_volumes_if_migration_is_enabled_(when_scheduling_in-tree_volumes) (0.00s)
    --- FAIL: TestCSILimits/should_count_in-tree,_inline_and_csi_volumes_if_migration_is_enabled_(when_scheduling_in-tree_volumes) (0.00s)
    --- FAIL: TestCSILimits/should_count_in-tree_and_csi_volumes_if_migration_is_enabled_(when_scheduling_csi_volumes) (0.00s)

@@ -97,6 +97,7 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
if err != nil {
// TODO: return the error once CSINode is created by default (2 releases)
klog.V(5).InfoS("Could not get a CSINode object for the node", "node", klog.KObj(node), "err", err)
return framework.AsStatus(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause the scheduler/cluster-autoscaler to retry immediately, as opposed to returning an UnschedulableAndUnresolvable status.

Is retrying the expected behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is incorrect as discussed in #107787 (comment).

Autoscaler doesn't simulate CSINode objects today, so we can't return an error otherwise autoscaler will fail forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, sorry, this is my miss, I returned the error here to checkout the logs I added in the test. But after I verified the code path I only removed the logs but forgot to remove this line.

@@ -231,8 +231,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this change is not necessary if we return the error?

Comment on lines 64 to 65
// ErrReasonNodeNotFound is used for node not found error.
ErrReasonNodeNotFound = "node not found"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

@@ -231,8 +232,12 @@ func (pl *CSILimits) checkAttachableInlineVolume(vol v1.Volume, csiNode *storage
return fmt.Errorf("looking up provisioner name for volume %v: %w", vol, err)
}
if !isCSIMigrationOn(csiNode, inTreeProvisionerName) {
csiNodeName := ""
if csiNode != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock is not working as expected after changing to a list, even the test case said that migration is enabled, on line old 291, new 296:

if !isCSIMigrationOn(csiNode, pluginName) {
  klog.V(5).InfoS("CSI Migration of plugin is not enabled", "plugin", pluginName)
  return "", ""
}

This got returned because csiNode is nil.

@sunnylovestiramisu
Copy link
Contributor Author

It turned out the original test setup did not reflect the reality, the name of csi-node and node should be the same.

Added extra logs locally( please see here ) to test the csi node == nil:

// go test -v -run TestCSILimits
=== RUN   TestCSILimits/nil_csi_node
===== haha I am a nil csi node  =====

@sunnylovestiramisu
Copy link
Contributor Author

/test pull-kubernetes-unit
/meow

@k8s-ci-robot
Copy link
Contributor

@sunnylovestiramisu: cat image

In response to this:

/test pull-kubernetes-unit
/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

driverNames: []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
migrationEnabled: true,
isNilCsiNode: true,
limitSource: "csinode",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you able to repro the error if you set limitSource to node and without needing isNilCsiNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the isNilCsiNode to generate the nil pointer exception? I commented out the csiNodeName and log the csiNode.Name the test returns error:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13ba74d]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried again the same test case as the last week's code review comment above:

{
    newPod:           inTreeOneVolPod,
    existingPods:     []*v1.Pod{inTreeTwoVolPod},
    filterName:       "csi",
    maxVols:          2,
    driverNames:      []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
    migrationEnabled: true,
    limitSource:      "node",
    test:             "should count in-tree volumes if migration is enabled",
    wantStatus:       framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded),
}

It did not trigger the csiNode == nil, it did not go into the if !isCSIMigrationOn(csiNode, inTreeProvisionerName) code path. I do not think the changes from the master branch fixed the test setup? So the change of limitSource still does not change the test result. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked out your PR, reverted csi.go, and tweaked the test inputs and was able to repro the crash:

--- a/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go
+++ b/pkg/scheduler/framework/plugins/nodevolumelimits/csi_test.go
@@ -313,8 +313,8 @@ func TestCSILimits(t *testing.T) {
                        maxVols:          2,
                        driverNames:      []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
                        migrationEnabled: true,
-                       isNilCsiNode:     true,
-                       limitSource:      "csinode",
+                       isNilCsiNode:     false,
+                       limitSource:      "node",
                        test:             "nil csi node",
                },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we do not need the isNilCsiNode params and just fix the csi-node-name to node-name should fix the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the isNilCsiNode to generate the nil pointer exception? I commented out the csiNodeName and log the csiNode.Name the test returns error:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13ba74d]

The above was ran with:

{
    newPod:           inTreeInlineVolPod,
    existingPods:     []*v1.Pod{inTreeTwoVolPod},
    filterName:       "csi",
    maxVols:          2,
    driverNames:      []string{csilibplugins.AWSEBSInTreePluginName, ebsCSIDriverName},
    migrationEnabled: true,
    isNilCsiNode:     true,
    limitSource:      "csinode",
}

without the change:

csiNodeName := ""
if csiNode != nil {
	csiNodeName = csiNode.Name
}

also generated the nil pointer exception, do we need another test case specific to

isNilCsiNode:     false,
limitSource:      "node",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@sunnylovestiramisu
Copy link
Contributor Author

Tests failed with bytes-like object, not str, which looks like a python2 vs python3 issue, a few other PRs also running into the same error:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#1116

@msau42
Copy link
Member

msau42 commented Jan 26, 2023

/lgtm
/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2a1432bde8ab29d6e83b1a0bde51450313ebfff1

@msau42
Copy link
Member

msau42 commented Jan 26, 2023

Can you add a release note for the fix? We'll want to cherry pick this.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 26, 2023
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from scheduling

@msau42
Copy link
Member

msau42 commented Jan 26, 2023

/approve

1 similar comment
@alculquicondor
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, msau42, sunnylovestiramisu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit 97ab147 into kubernetes:master Jan 26, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Jan 26, 2023
@sunnylovestiramisu sunnylovestiramisu deleted the fixPanic branch January 26, 2023 21:28
k8s-ci-robot added a commit that referenced this pull request Feb 10, 2023
…ick-of-#115179-upstream-release-1.25

Automated cherry pick of #115179: Fix nil pointer error in nodevolumelimits csi logging
k8s-ci-robot added a commit that referenced this pull request Feb 10, 2023
…ick-of-#115179-upstream-release-1.26

Automated cherry pick of #115179: Fix nil pointer error in nodevolumelimits csi logging
@openshift-merge-robot
Copy link

Fix included in accepted release 4.13.0-0.nightly-2023-11-21-212406

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seg Fault in nodeVolumeLimits of scheduler plugin causing panic
6 participants