
glusterfs log files may become very large if volumes mount failed #68050

Closed
houjun41544 opened this issue Aug 30, 2018 · 28 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@houjun41544
Contributor

houjun41544 commented Aug 30, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When a pod starts and needs to mount a gluster volume, a log file is created to record the errors from the glusterfs mount. The log path is /var/lib/kubelet/plugins/kubernetes.io/glusterfs/[volName]/[podName]-glusterfs.log

We encountered a scenario where mounting the glusterfs volume kept failing because the glusterfs server was unreachable for a long time. Eventually the log file became quite large, as shown below.
[screenshot: oversized glusterfs log file]

When will these log files be cleaned up without manual deletion?

What you expected to happen:

I think we should consider removing the log file after every mount, whether the mount failed or succeeded.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.10
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 30, 2018
@houjun41544
Contributor Author

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 30, 2018
@houjun41544 houjun41544 changed the title from "glusterfs log files become very large since mount failed" to "glusterfs log files may become very large if volumes mount failed" Sep 4, 2018
@houjun41544
Contributor Author

@kubernetes/sig-storage-bugs, can you have a look at this?

@k8s-ci-robot
Contributor

@houjun41544: Reiterating the mentions to trigger a notification:
@kubernetes/sig-storage-bugs

In response to this:

@kubernetes/sig-storage-bugs, can you have a look at this?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsafrane
Member

/assign @humblec

@houjun41544
Contributor Author

houjun41544 commented Sep 19, 2018

@humblec @jsafrane We tried to solve this problem by clearing the gluster log file after it has been read: #68814
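
For reference, a minimal sketch of that approach (illustrative only, not the actual change in #68814): once the log has been read and its tail folded into the mount error, the file is truncated so repeated mount retries cannot keep growing it. The helper name and example paths below are hypothetical.

	// Sketch only: clear the per-pod glusterfs mount log after it has
	// been read, so repeated failed mounts cannot grow it without bound.
	// clearGlusterLog and the example paths are hypothetical.
	package main

	import (
		"fmt"
		"os"
		"path/filepath"
	)

	func clearGlusterLog(pluginDir, volName, podName string) error {
		logPath := filepath.Join(pluginDir, volName, podName+"-glusterfs.log")
		// Truncate instead of removing, so the next mount attempt can
		// still append to the same file.
		if err := os.Truncate(logPath, 0); err != nil && !os.IsNotExist(err) {
			return fmt.Errorf("failed to clear glusterfs log %s: %v", logPath, err)
		}
		return nil
	}

	func main() {
		err := clearGlusterLog("/var/lib/kubelet/plugins/kubernetes.io/glusterfs", "myvol", "mypod")
		if err != nil {
			fmt.Println(err)
		}
	}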

@humblec
Contributor

humblec commented Sep 19, 2018

@houjun41544 One mechanism we normally suggest here is applying log rotation to this path. It should help archive and clear the logs accordingly.
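
For example, a rule along the following lines could be dropped into /etc/logrotate.d/ on the node. The glob, size threshold and rotation count are only illustrative assumptions, not an official recommendation; copytruncate is used because the mount helper may still hold the file open.

	# Illustrative logrotate rule for the per-pod glusterfs mount logs.
	# Adjust the glob and thresholds to your environment.
	/var/lib/kubelet/plugins/kubernetes.io/glusterfs/*/*-glusterfs.log {
	    size 50M
	    rotate 4
	    compress
	    missingok
	    notifempty
	    copytruncate
	}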

I think we should consider removing the log file after every mount, whether the mount failed or succeeded.

Clearing the log after each mount, whether it succeeds or fails, may not be a good idea, as admins may want to track it later.

@houjun41544
Contributor Author

@humblec If the log file is provided by the user, they can archive and clear the logs themselves. But if it is driver-specific, users may not even know the log file exists.

@warmchang
Contributor

Clearing the log after each mount, whether it succeeds or fails, may not be a good idea, as admins may want to track it later.

@humblec
The glusterfs log info is read and stored in the kubelet log, so the admins can track it from the kubelet's log file; the glusterfs log is therefore redundant and can be emptied.

	// Failed mount scenario.
	// Since glusterfs does not return error text
	// it all goes in a log file, we will read the log file
	logErr := readGlusterLog(log, b.pod.Name)
	if logErr != nil {
		return fmt.Errorf("mount failed: %v the following error information was pulled from the glusterfs log to help diagnose this issue: %v", errs, logErr)
	}

@houjun41544
Contributor Author

@humblec In addition, it seems that the log files are never removed, even after the pods and volumes have been deleted.
Why not place the log file in the pod's plugin directory rather than the plugin directory?

@humblec
Contributor

humblec commented Sep 20, 2018

@humblec
The glusterfs log info will be read and stored in the kubelet log, the admins can track it from the kubelet's log file, so the glusterfs log is redundant and can be emptied.

@warmchang Only the last 2 lines are exposed to the kubelet. Most of the time these 2 lines give a clue, but at times the full sequence of events needs to be examined to debug the issue, so the log file can help.
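
For context, a rough sketch of the kind of tail-reading involved; this is not the actual readGlusterLog implementation, just an illustration of why only the last couple of lines end up in the mount error.

	// Illustration only: return the last n lines of a glusterfs mount log.
	// The real readGlusterLog in the glusterfs volume plugin may differ.
	package main

	import (
		"fmt"
		"os"
		"strings"
	)

	func tailLines(path string, n int) ([]string, error) {
		// Note: this reads the whole file into memory; a real
		// implementation would seek near the end of a large log.
		data, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		lines := strings.Split(strings.TrimRight(string(data), "\n"), "\n")
		if len(lines) > n {
			lines = lines[len(lines)-n:]
		}
		return lines, nil
	}

	func main() {
		// Hypothetical path, following the pattern described earlier.
		lines, err := tailLines("/var/lib/kubelet/plugins/kubernetes.io/glusterfs/myvol/mypod-glusterfs.log", 2)
		if err != nil {
			fmt.Println("could not read glusterfs log:", err)
			return
		}
		fmt.Println(strings.Join(lines, "\n"))
	}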

@humblec
Contributor

humblec commented Sep 20, 2018

@humblec In addition, it seems that the log files are never removed, even after the pods and volumes have been deleted.

It seems to me that clearing/removing the log file when the pod is deleted would be better.
@jsafrane Any thoughts on this ?

@warmchang
Contributor

It seems to me that clearing/removing the log file when the pod is deleted would be better.

@humblec The result of this modification is the same. As you described, when the volume mount fails, the admin will check the log file to debug the issue. But unfortunately, the log no longer exists, because it was cleaned/deleted when the pod died.

@admoriarty

Would love to see this issue resolved; it just caused us to start losing a node intermittently due to a DiskPressure threshold being hit, because of months-old logs of this type, the largest being 7G!

@warmchang
Contributor

@admoriarty That's why @houjun41544 submitted this issue, and PR #68814 tries to solve it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 15, 2019
@houjun41544
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 15, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 15, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 15, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dmoessne

/reopen

@k8s-ci-robot
Contributor

@dmoessne: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dmoessne

So I think this is still an issue, and I do not understand why it was abandoned. Are there at least any recommendations on how to avoid it?

@houjun41544
Contributor Author

/reopen

@k8s-ci-robot
Contributor

@houjun41544: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jul 15, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@koep
Contributor

koep commented Jan 7, 2020

I agree with dmoessne that this issue should be addressed, at least in the form of documentation that recommends steps to mitigate the issue (e.g. logrotate). CC @humblec, any thoughts?
