
Copy logs from dead containers to local files to facilitate immediate GC of dead containers #26923

Closed
vishh opened this issue Jun 7, 2016 · 22 comments
Labels
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@vishh (Contributor) commented Jun 7, 2016

One of the reasons for keeping dead containers around in the kubelet today is to retain access to logs from previous instances. Dead container instances can take up disk space via their root filesystems and cause disk pressure on the nodes. To alleviate disk pressure and improve the logging experience in Kubernetes, the kubelet can retrieve logs from dead containers and GC those containers right away, once it no longer depends on metadata associated with old containers.
Specifically,

  • Kubelet can retrieve logs from the runtime and store them in a per-pod, per-container directory under /var/log/. For the docker runtime, this can be a simple move operation; for rkt, we will have to retrieve the logs remotely. (A sketch of the move operation follows this list.)
  • The directory structure can be /var/log/<podUID>/<ContainerName>/<InstanceNumber>_stdout.log
  • Update kubectl logs -p to return logs from these files instead of relying on the runtime for previous logs.
  • These logs will be kept around on a best-effort basis and will be deleted whenever there is disk pressure.
  • Kubelet can prefer keeping the first and most recent instances of a container around and aggressively delete other log files.
  • All these logs will be accessible initially via the /logs REST endpoint. In the future, we can consider expanding the kubectl logs interface to support an instance number, or add support for the first attempt specifically.
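
A minimal sketch of what the docker-runtime "move" could look like, assuming the kubelet already knows the pod UID, container name, instance (restart) number, and the on-disk path of the container's log file. All names here are hypothetical, not actual kubelet code:

```go
// Package logrelocation is a hypothetical sketch, not kubelet code.
package logrelocation

import (
	"fmt"
	"os"
	"path/filepath"
)

// relocateDeadContainerLog moves a dead container's log into the proposed
// layout: /var/log/<podUID>/<ContainerName>/<InstanceNumber>_stdout.log.
// For the docker runtime the source is already a file on local disk, so
// this can be a cheap rename; the container itself can then be GC'd
// immediately without losing its logs.
func relocateDeadContainerLog(runtimeLogPath, podUID, containerName string, instance int) (string, error) {
	dir := filepath.Join("/var/log", podUID, containerName)
	if err := os.MkdirAll(dir, 0755); err != nil {
		return "", err
	}
	dst := filepath.Join(dir, fmt.Sprintf("%d_stdout.log", instance))
	// os.Rename only works within one filesystem; a real implementation
	// would fall back to copy-and-unlink across mounts, and would fetch
	// logs remotely for runtimes like rkt instead of renaming.
	if err := os.Rename(runtimeLogPath, dst); err != nil {
		return "", err
	}
	return dst, nil
}
```

kubectl logs -p would then read the previous instance's file from this directory instead of asking the runtime.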
vishh added the priority/backlog and sig/node labels on Jun 7, 2016
vishh added this to the next-candidate milestone on Jun 7, 2016
@resouer (Contributor) commented Jun 7, 2016

Will we do this for dead containers only, or does this apply to all /logs cases?

@vishh (Contributor, Author) commented Jun 7, 2016

As of now, logs of running containers are controlled by the runtimes. To avoid duplicated storage of log files, it's better to avoid managing logs of running instances for now.

@fejta-bot commented

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 15, 2017
@ankon (Contributor) commented Jan 11, 2018

This issue is still very much relevant for us: we need access to the logs of a dead container, but the rest of the dead container is just consuming disk space that should be reclaimed and put to more productive use.

@fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 11, 2018
@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@hzxuzhonghu (Member) commented

/reopen

@hzxuzhonghu (Member) commented

We also have a case that depends on getting container logs after the job has completed successfully.

@krancour (Member) commented

This issue is relevant for me. I use fluentd to get logs from pods and send them elsewhere, but two conditions can make me lose valuable logs:

  1. Due to log volume, the agent falls behind, and the pod (for whatever reason) is deleted before the agent catches up. With the log file ripped out from under us, the remaining logs go missing.

  2. If a pod's total lifetime is very short, the log files may be deleted before the agent even notices there is a new file to tail (a toy sketch of this race follows below).

This seems to me to be a big problem. How many folks are using the likes of fluentd specifically to preserve their logs, without realizing that in some cases the kubelet isn't even giving them a fair opportunity to do so?
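
To make case 2 concrete, here is a toy model (a stand-in polling agent, not fluentd itself; all names and intervals are made up) showing how a log file whose entire lifetime falls between two discovery scans is never seen at all:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir, err := os.MkdirTemp("", "logs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	refreshInterval := 200 * time.Millisecond // how often the agent rescans for new files

	// A "pod" whose container lives and dies between two agent scans.
	go func() {
		f := filepath.Join(dir, "short-lived-pod_stdout.log")
		os.WriteFile(f, []byte("crucial last words\n"), 0644)
		time.Sleep(50 * time.Millisecond) // pod lifetime << refreshInterval
		os.Remove(f)                      // container and its log GC'd
	}()

	// The agent's discovery loop: every scan sees zero files, so the
	// log was never even a candidate for shipping.
	for i := 1; i <= 3; i++ {
		time.Sleep(refreshInterval)
		entries, _ := os.ReadDir(dir)
		fmt.Printf("scan %d: %d file(s) found\n", i, len(entries))
	}
}
```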

@krancour (Member) commented

/reopen

@k8s-ci-robot (Contributor) commented

@krancour: Reopened this issue.

In response to this:

/reopen


k8s-ci-robot reopened this on Jan 17, 2020
@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

/close


@krancour (Member) commented

/reopen

k8s-ci-robot reopened this on Feb 16, 2020
@k8s-ci-robot (Contributor) commented

@krancour: Reopened this issue.

In response to this:

/reopen


@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

/close

@krancour (Member) commented

/reopen

@k8s-ci-robot (Contributor) commented

@krancour: Reopened this issue.

In response to this:

/reopen


k8s-ci-robot reopened this on Mar 17, 2020
@colltoaction commented Apr 3, 2020

This is relevant for me in the same way as @krancour's first case. Would it be fair to say that having a minimum-container-ttl-duration aligned with our fluentd configuration would be enough to preserve logs? I understand that if fluentd is stuck we would still lose logs, but I think we can take that risk. E.g., we could set minimum-container-ttl-duration to 10 minutes.

Also, I see that --minimum-container-ttl-duration is deprecated, with the rationale "deprecated once old logs are stored outside of container's context", but I can't find a new flag covering this use case. Does this mean the flag is deprecated because a new solution is coming ("store logs outside of container's context"), but for now this is the only way?

Thanks!


EDIT: If I read correctly, the property is compared against the container creation time, so it wouldn't help save logs if the container was long-lived. I'll keep reading the code to try to find something useful.
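
For illustration, a minimal sketch of the comparison the EDIT describes (hypothetical names, not the kubelet's actual GC code): when "age" is measured from creation time, a minimum TTL gives no grace period at all to a long-lived container that just died.

```go
package main

import (
	"fmt"
	"time"
)

// isEvictable sketches a GC check where a dead container's age is
// measured from its creation time rather than its death time.
func isEvictable(createdAt time.Time, minAge time.Duration) bool {
	return time.Since(createdAt) >= minAge
}

func main() {
	minAge := 10 * time.Minute // e.g. minimum-container-ttl-duration of 10m

	// Died a moment ago after running for two hours: already far "older"
	// than minAge, so evictable immediately. The log agent gets no
	// window to catch up on this container's logs.
	longLived := time.Now().Add(-2 * time.Hour)
	fmt.Println(isEvictable(longLived, minAge)) // true

	// Only containers created within the last ten minutes are protected.
	justCreated := time.Now().Add(-30 * time.Second)
	fmt.Println(isEvictable(justCreated, minAge)) // false
}
```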

@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

/close
