Fix kubelet memory leak when device plugin is registered #124719
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: carlory
/ok-to-test
we need to make sure we have test coverage for the flow modified here
path1 := existing.(DevicePlugin).SocketPath()
path2 := c.(DevicePlugin).SocketPath()
do we have a way to get the path avoiding the cast?
There's no other way except using the client{} structure directly. Going through the interface is the more recommended way in the Go world.
I'd try adding a SocketPath method to the Client interface and evaluate the impacts.
or, in general, carrying this information around somehow. I'd really like to explore options to avoid the cast.
I'd try adding a SocketPath method to the Client interface and evaluate the impacts.
I added it in this PR.
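The idea in this thread, exposing the socket path through the interface instead of casting, can be sketched as follows. This is a minimal illustration, not the PR's code: the `Client` interface here is a trimmed-down, hypothetical stand-in, and `sameSocket` is a helper invented for the example.

```go
package main

import "fmt"

// Client is a hypothetical, reduced version of the plugin client interface.
// The real interface has more methods; only what the comparison needs is shown.
type Client interface {
	Resource() string
	SocketPath() string
}

// client is a stand-in implementation used only for this example.
type client struct {
	resource string
	socket   string
}

func (c *client) Resource() string   { return c.resource }
func (c *client) SocketPath() string { return c.socket }

// sameSocket compares two clients through the interface alone, so no type
// assertion such as existing.(DevicePlugin) is required.
func sameSocket(a, b Client) bool {
	return a.SocketPath() == b.SocketPath()
}

func main() {
	existing := &client{resource: "example.com/gpu", socket: "/tmp/a.sock"}
	incoming := &client{resource: "example.com/gpu", socket: "/tmp/b.sock"}
	fmt.Println("same socket:", sameSocket(existing, incoming))
}
```

With the method on the interface, a caller that only holds a `Client` value cannot hit a failed type assertion at runtime; the cost is that every implementation, including test fakes, has to provide `SocketPath`.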
err = c.handler.PluginConnected(c.resource, c)
if err != nil {
	klog.ErrorS(err, "Failed to connect to device plugin", "resource", c.resource)
	if err := conn.Close(); err != nil {
		klog.V(2).ErrorS(err, "Failed to close grcp connection", "resource", c.resource)
	}
	return err
}
I wonder (and I'm not pushing for this solution, but let's give it a fair evaluation) if we can use a defer to catch the issue instead of moving PluginConnected above. Something like
func (c *client) Connect() (rerr error) {
	// ...
	c.mutex.Lock()
	c.grpc = conn
	c.client = client
	c.mutex.Unlock()
	defer func() {
		// rerr is the named return value, so it holds whatever Connect returns.
		if rerr == nil {
			return
		}
		klog.ErrorS(rerr, "Failed to connect to device plugin", "resource", c.resource)
		if err := conn.Close(); err != nil {
			klog.V(2).ErrorS(err, "Failed to close grcp connection", "resource", c.resource)
			return
		}
		c.grpc = nil
	}()
	return c.handler.PluginConnected(c.resource, c)
}
If so, we have to lock again to reset c.grpc and c.client in the defer func.
right. On second look the order of operations should not be too critical here. I'll have a deeper look later on.
line 80: Failed to close grcp connection
"grcp"?
Fixed.
And @ffromani, I followed your suggestion.
/triage accepted
priority subject to review, like the linked issue
I will do it later. Unfortunately, there is no existing test case in this package. I need to investigate how to create a test case for this problem.
516071a to 2080bfb
c := s.getClient(name)
if c != nil {
	if c.SocketPath() != socketPath {
		return fmt.Errorf("the device plugin %s already registered with a different socket path %s", name, c.SocketPath())
Suggested change:
- return fmt.Errorf("the device plugin %s already registered with a different socket path %s", name, c.SocketPath())
+ return fmt.Errorf("The device plugin %s already registered with a different socket path %s", name, c.SocketPath())
Fantastic work. Stick with one
go-staticcheck: error strings should not be capitalized (ST1005)
got it, thx
/hold
Needs to handle the re-registration case.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Two types of memory leaks:
- (c *client) Connect() error creates a gRPC connection between the device manager and the device plugin. If the dial succeeds but c.handler.PluginConnected(c.resource, c) fails, the caller does not close the connection. This is a memory leak. This PR fixes it by closing the connection if c.handler.PluginConnected(c.resource, c) fails.
- Two different socket files share the same plugin name, which is populated from the plugin via the GetInfo interface. The old registerClient func overwrites the old client with the new client, and the old client is never closed. This is a memory leak. This PR fixes it by adding a check: in this case, it returns an error to the caller.
Which issue(s) this PR fixes:
Fixes #124716
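The second fix described above, rejecting a registration that reuses a plugin name with a different socket, might look roughly like this. It is a sketch under assumptions: the `server` and `client` types are simplified stand-ins, and keeping the existing client when the same name and socket register again is an assumption of this example (the re-registration case is still under discussion in this PR).

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in holding only what the check needs.
type client struct {
	name       string
	socketPath string
}

// server is a simplified stand-in that tracks clients by plugin name.
type server struct {
	mu      sync.Mutex
	clients map[string]*client
}

// registerClient refuses to replace an existing client that was registered
// under the same name with a different socket path, so the old client and
// its connection are never silently dropped.
func (s *server) registerClient(name, socketPath string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if c, ok := s.clients[name]; ok {
		if c.socketPath != socketPath {
			return fmt.Errorf("the device plugin %s already registered with a different socket path %s", name, c.socketPath)
		}
		// Assumption for this sketch: same name and same socket keeps the existing client.
		return nil
	}
	s.clients[name] = &client{name: name, socketPath: socketPath}
	return nil
}

func main() {
	s := &server{clients: map[string]*client{}}
	fmt.Println(s.registerClient("example.com/gpu", "/tmp/a.sock"))
	fmt.Println(s.registerClient("example.com/gpu", "/tmp/b.sock"))
}
```

The error string starts with a lowercase letter, in line with the staticcheck ST1005 rule mentioned in the review.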
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: