Fix issue when setting filesystem capacity in container manager #48567
Conversation
defer timeoutTimer.Stop()
for {
	select {
	case <-timeoutTimer.C:
Would it be possible to use wait.Until? That would make the code more readable.
Changed to use wait.Until, but this also removed the timeout.
	glog.Warningf("[ContainerManager] Fail to get rootfs information %v", err)
	continue
}
for rName, rCap := range cadvisor.StorageScratchCapacityFromFsInfo(rootfs) {
Don't you want a lock around both scratch and image fs capacity updates? Why not cache values and update them in one go?
@jingxu97 this is yet to be resolved
Is the new code OK? Or would you prefer one lock, with the two for loops (cadvisor.StorageScratchCapacityFromFsInfo and cadvisor.StorageOverlayCapacityFromFsInfo) inside it?
@@ -551,9 +533,55 @@ func (cm *containerManagerImpl) Start(node *v1.Node, activePods ActivePodsFunc)
	}, 5*time.Minute, wait.NeverStop)
}

// Local storage filesystem information from `RootFsInfo` and `ImagesFsInfo` is availalbe at a later time
// depending on the time when cadvisor manager updates container stats. Therefore use a go routine to keep
// retrieving the information until it is avaialble within 5 minutes timeout.
What if the runtimes are down for more than five minutes?
same concern here. I think it's better to keep retrying if they are still nil
Removed the timeout. But will this cause any goroutine leakage?
why would it cause leakage?
I was thinking multiple goroutines might be running without stopping (each time the kubelet restarts, a new goroutine for this will be started).
Cool. The PR should fix #48452.
@@ -551,9 +533,55 @@ func (cm *containerManagerImpl) Start(node *v1.Node, activePods ActivePodsFunc)
	}, 5*time.Minute, wait.NeverStop)
}

// Local storage filesystem information from `RootFsInfo` and `ImagesFsInfo` is availalbe at a later time
s/availalbe/available
fixed
@@ -219,30 +219,12 @@ func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.I
	var capacity = v1.ResourceList{}
	// It is safe to invoke `MachineInfo` on cAdvisor before logically initializing cAdvisor here because
	// machine info is computed and cached once as part of cAdvisor object creation.
	// But `RootFsInfo` and `ImagesFsInfo` are not avaialble at this moment so they will be called later during manager starts
s/avaialble/available
fixed
Force-pushed from 6360f55 to 9fdba6a.
/test pull-kubernetes-unit
/test pull-kubernetes-e2e-gce-etcd3
/test pull-kubernetes-federation-e2e-gce
/lgtm
/release-note-none
Force-pushed from 9fdba6a to 27b0660.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: dashpole, jingxu97, vishh.
No associated issue. Update the pull-request body to add a reference to an issue, or get approval with
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
In Container manager, we set up the capacity by retrieving information from cadvisor. However, unlike machine info, filesystem information is available at a later, unknown time. This PR uses a goroutine to keep retrieving the information until it is available or times out.
Force-pushed from 27b0660 to 9606a54.
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-kops-aws
Automatic merge from submit-queue (batch tested with PRs 47232, 48625, 48613, 48567, 39173)
@jingxu97 Should this cherry-pick to v1.7? This seems to be a serious bug fix.
Yes, this and #48636 together can fix the local storage allocatable feature |
I still see this message occasionally in localStorageAllocatableEviction test logs: It only seems to happen every minute or so, so not nearly as often as #48703. Still something we should look into.
@jingxu97 confirmed that the message I found was only during node startup, when capacity information was not yet available. |
Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
In Container manager, we set up the capacity by retrieving information from cadvisor. However, unlike machine info, filesystem information is available at a later, unknown time. This PR uses a goroutine to keep retrieving the information until it is available or times out.
This PR fixes issue #48452.