
Fix issue when setting filesystem capacity in container manager #48567

Merged: 1 commit into kubernetes:master, Jul 12, 2017

Conversation

jingxu97 (Contributor) commented Jul 6, 2017

In the container manager, we set up capacity by retrieving information from cadvisor. However, unlike MachineInfo, filesystem information becomes available at a later, unknown time. This PR uses a goroutine to keep retrieving the information until it is available or a timeout is reached.
This PR fixes issue #48452.

Fix issue with setting filesystem capacity in container manager.

@k8s-ci-robot added the cncf-cla: yes label (the PR's author has signed the CNCF CLA) Jul 6, 2017
@jingxu97 requested review from vishh and dashpole July 6, 2017 22:01
@k8s-github-robot added the size/M (changes 30-99 lines, ignoring generated files) and release-note-label-needed labels Jul 6, 2017
defer timeoutTimer.Stop()
for {
select {
case <-timeoutTimer.C:
Contributor:

Would it be possible to use wait.Until? That would make the code more readable.

Contributor (Author):

Changed to use wait.Until, but this also removed the timeout.
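
A minimal sketch of the wait.Until shape suggested above (illustrative, not the merged code; it assumes the k8s.io/apimachinery wait package that Kubernetes vendors). wait.Until re-runs the closure every period until the stop channel is closed, which is what replaces the hand-rolled timer/select loop, and also why the explicit timeout went away:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stopCh := make(chan struct{})
	go wait.Until(func() {
		// Poll cadvisor for filesystem info here; on error just return,
		// and the next tick retries.
		fmt.Println("polling for filesystem info...")
	}, 10*time.Second, stopCh)

	// Elsewhere, close stopCh once the information has been retrieved;
	// wait.Until then returns and the goroutine exits.
	time.Sleep(time.Second)
	close(stopCh)
}
```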

glog.Warningf("[ContainerManager] Fail to get rootfs information %v", err)
continue
}
for rName, rCap := range cadvisor.StorageScratchCapacityFromFsInfo(rootfs) {
Contributor:

Don't you want a lock around both scratch and image fs capacity updates? Why not cache values and update them in one go?

Contributor:

@jingxu97 this is yet to be resolved

Contributor (Author):

Is the new code OK? Or would you prefer one lock, with both for loops (cadvisor.StorageScratchCapacityFromFsInfo and cadvisor.StorageOverlayCapacityFromFsInfo) inside it?
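
For reference, one possible shape of the "single lock, one update" alternative discussed in this thread. This is a hypothetical sketch, assuming containerManagerImpl embeds a mutex and that capacity is the v1.ResourceList from the constructor; the helper name updateFsCapacity is made up for illustration:

```go
import (
	cadvisorapiv2 "github.com/google/cadvisor/info/v2"

	"k8s.io/kubernetes/pkg/kubelet/cadvisor"
)

// updateFsCapacity (hypothetical) takes the lock once and writes the scratch
// and overlay (image fs) capacities together, so readers never observe a
// half-updated capacity map.
func (cm *containerManagerImpl) updateFsCapacity(rootfs, imagesfs cadvisorapiv2.FsInfo) {
	cm.Lock()
	defer cm.Unlock()
	for rName, rCap := range cadvisor.StorageScratchCapacityFromFsInfo(rootfs) {
		cm.capacity[rName] = rCap
	}
	for rName, rCap := range cadvisor.StorageOverlayCapacityFromFsInfo(imagesfs) {
		cm.capacity[rName] = rCap
	}
}
```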

@@ -551,9 +533,55 @@ func (cm *containerManagerImpl) Start(node *v1.Node, activePods ActivePodsFunc)
}, 5*time.Minute, wait.NeverStop)
}

// Local storage filesystem information from `RootFsInfo` and `ImagesFsInfo` is availalbe at a later time
// depending on the time when cadvisor manager updates container stats. Therefore use a go routine to keep
// retrieving the information until it is avaialble within 5 minutes timeout.
Contributor:

What if the runtimes are down for more than five minutes?

Member:

same concern here. I think it's better to keep retrying if they are still nil

Contributor (Author):

removed the timeout. But will this cause any thread leakage?

Contributor:

why would it cause leakage?

Contributor (Author):

I was thinking multiple goroutines might keep running without stopping (each time the kubelet restarts, a new goroutine for this would be started).
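
A note on the leakage concern, with a hedged sketch: a goroutine parked in wait.Until lives only until its stop channel closes or the process exits. A kubelet restart is a new process, so goroutines cannot accumulate across restarts; within one process, closing the stop channel once the data is in hand ends the loop. Identifiers below follow the diff context but are not the exact merged code:

```go
stopChan := make(chan struct{})
go wait.Until(func() {
	rootfs, err := cm.cadvisorInterface.RootFsInfo()
	if err != nil {
		return // not ready yet; retry on the next tick
	}
	_ = rootfs // record capacities here, as in the diff above

	// Closing the channel makes wait.Until return, so the goroutine
	// terminates instead of polling forever.
	close(stopChan)
}, time.Second, stopChan)
```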

feiskyer (Member) commented Jul 6, 2017

Cool. The PR should fix #48452.

@@ -551,9 +533,55 @@ func (cm *containerManagerImpl) Start(node *v1.Node, activePods ActivePodsFunc)
}, 5*time.Minute, wait.NeverStop)
}

// Local storage filesystem information from `RootFsInfo` and `ImagesFsInfo` is availalbe at a later time
Member:

s/availalbe/available

Contributor (Author):

fixed

@@ -219,30 +219,12 @@ func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.I
var capacity = v1.ResourceList{}
// It is safe to invoke `MachineInfo` on cAdvisor before logically initializing cAdvisor here because
// machine info is computed and cached once as part of cAdvisor object creation.
// But `RootFsInfo` and `ImagesFsInfo` are not avaialble at this moment so they will be called later during manager starts
Member:

s/avaialble/available

Contributor (Author):

fixed

jingxu97 (Contributor, Author) commented Jul 7, 2017

/test pull-kubernetes-unit

jingxu97 (Contributor, Author) commented Jul 7, 2017

/test pull-kubernetes-e2e-gce-etcd3

jingxu97 (Contributor, Author) commented Jul 7, 2017

/test pull-kubernetes-federation-e2e-gce

dashpole (Contributor) commented Jul 7, 2017

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me") label Jul 7, 2017
@k8s-github-robot added the do-not-merge label (deprecated; indicates that a PR should not merge) Jul 7, 2017
feiskyer (Member) commented Jul 8, 2017

/release-note-none

@k8s-ci-robot added the release-note-none label and removed release-note-label-needed Jul 8, 2017
@k8s-github-robot removed the lgtm label Jul 10, 2017
vishh (Contributor) commented Jul 10, 2017

/lgtm
/approve

@k8s-ci-robot added the lgtm label Jul 10, 2017
k8s-github-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dashpole, jingxu97, vishh

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot removed the lgtm label Jul 10, 2017
@jingxu97 added the lgtm label and removed do-not-merge Jul 10, 2017
jingxu97 (Contributor, Author):

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot added the approved label Jul 11, 2017
fejta (Contributor) commented Jul 11, 2017

/test pull-kubernetes-e2e-kops-aws

k8s-github-robot:

Automatic merge from submit-queue (batch tested with PRs 47232, 48625, 48613, 48567, 39173)

@k8s-github-robot merged commit dbb4283 into kubernetes:master Jul 12, 2017
feiskyer (Member):

@jingxu97 Should this be cherry-picked to v1.7? It seems to be a serious bug fix.

@jingxu97 added this to the v1.7 milestone Jul 13, 2017
jingxu97 (Contributor, Author):

Yes, this and #48636 together fix the local storage allocatable feature.

dashpole (Contributor):

I still see this message occasionally in localStorageAllocatableEviction test logs:
helpers.go:771] Could not find capacity information for resource storage.kubernetes.io/scratch
helpers.go:782] eviction manager: no observation found for eviction signal allocatableNodeFs.available

It only seems to happen every minute or so, so not nearly as often as #48703. Still something we should look into.

dashpole (Contributor):

@jingxu97 confirmed that the message I found was only during node startup, when capacity information was not yet available.

@wojtek-t added the cherry-pick-approved label Jul 26, 2017
k8s-github-robot pushed a commit that referenced this pull request Jul 26, 2017
…67-upstream-release-1.7

Automatic merge from submit-queue

Automated cherry pick of #48567

Cherry pick of #48567 on release-1.7.

#48567: Fix issue when setting filesystem capacity in container
k8s-cherrypick-bot:

Commit found in the "release-1.7" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
