
Capture application termination messages/output #139

Closed
bgrant0607 opened this issue Jun 17, 2014 · 17 comments · Fixed by #2225

Labels
area/app-lifecycle area/docker area/kubelet sig/node

@bgrant0607
Member

When applications terminate, they may write out important information about the reason, such as assertion failure messages, uncaught exception messages, stack traces, etc. We should establish an interface for capturing such information in a first-class way for termination reporting, in addition to whatever is logged.

I suggest we pull the deathrattle message from /dev/final-log or something similar.
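For concreteness, a minimal sketch of the application side, assuming the placeholder path /dev/final-log (nothing here is standardized yet):

```go
package main

import (
	"fmt"
	"os"
)

// writeFinalLog writes a short termination message to the agreed-upon
// path before the process exits. The path /dev/final-log is only a
// placeholder; no name has been standardized yet.
func writeFinalLog(msg string) {
	f, err := os.OpenFile("/dev/final-log", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return // best effort: the runtime may not provide the file
	}
	defer f.Close()
	fmt.Fprintln(f, msg)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			writeFinalLog(fmt.Sprintf("panic: %v", r))
			os.Exit(1)
		}
	}()
	// ... application logic ...
}
```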

@vishh
Contributor

vishh commented Jun 23, 2014

Is /run a tmpfs? There is an outstanding PR for this in libcontainer.

@bgrant0607
Member Author

Good call. Looks like not.

Here's df from a google/nodejs container:
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 10188088 1639712 8007808 17% /
none 10188088 1639712 8007808 17% /
tmpfs 304556 0 304556 0% /dev
shm 65536 0 65536 0% /dev/shm
/dev/disk/by-uuid/485b0b37-5e5f-4878-85a4-2d8653315786 10188088 1639712 8007808 17% /.dockerinit
/dev/disk/by-uuid/485b0b37-5e5f-4878-85a4-2d8653315786 10188088 1639712 8007808 17% /etc/resolv.conf
/dev/disk/by-uuid/485b0b37-5e5f-4878-85a4-2d8653315786 10188088 1639712 8007808 17% /etc/hostname
/dev/disk/by-uuid/485b0b37-5e5f-4878-85a4-2d8653315786 10188088 1639712 8007808 17% /etc/hosts
/dev/disk/by-uuid/485b0b37-5e5f-4878-85a4-2d8653315786 10188088 1639712 8007808 17% /data
tmpfs 304556 0 304556 0% /proc/kcore

@thockin
Member

thockin commented Jun 24, 2014

Is /run LSB compliant?

@bgrant0607
Member Author

Re. /run: The point of tmpfs was to avoid pathological disk latency and failure problems. However, we'd need the filesystem to remain live after termination of the main process. We want that for other reasons (e.g., hooks), but it doesn't exist yet.

Solomon expressed some interest in this on #docker-dev:
https://botbot.me/freenode/docker-dev/2014-07-18/?msg=18236306&page=2

@bgrant0607
Member Author

One could also view this as simple container output. I could imagine using this for simple data-in/data-out functions, such as config generators.

One question would be whether we should make the path configurable and, if so, should we provide a means to tell the container what that path is? I could imagine allowing the user/client to specify the path and environment variable name.

However, I could also envision standardizing it for containers, potentially even beyond just Docker containers. For instance, could we use /dev/console, similar to VM console output in GCE? Or maybe another file in /dev.

Note also that /dev/stdout is linked to /dev/fd/1, /dev/stderr is linked to /dev/fd/2, and /dev/ptmx is linked to /dev/pts/ptmx.

@bgrant0607 bgrant0607 changed the title Capture application termination messages Capture application termination messages/output Sep 30, 2014
@bgrant0607 bgrant0607 added this to the v0.9 milestone Oct 4, 2014
@bgrant0607
Member Author

/dev/console is used by some images/distributions, so it probably needs to be /dev/somethingelse.

@bgrant0607
Member Author

Possible file names: /dev/stopmsg, /dev/finalstatus, /dev/deathrattle, ...

/cc @rjnagal

@bgrant0607
Member Author

/dev/log is used by syslog.

How about /dev/final-log?

@vishh
Contributor

vishh commented Nov 4, 2014

@bgrant0607 I assume we want to have a structured logging format for the death reason. If we were to define and provide a new interface, how can we promote adoption of this interface? Requiring application changes might hinder adoption.
Just capturing the last few log lines from 'docker logs' would be useful for users at this point.

@bgrant0607
Member Author

@vishh

No, I don't want a structured logging format. I want the raw output. We can capture other termination information (e.g., time and reason) separately.

With respect to usage: We should ensure that it is easy for a user to add a PreStop hook to populate it.

As for automatic extraction from Docker logs: I'd want to strip the cruft and display the raw output. But how many lines? Fatal log messages are typically one line, but stack traces and uncaught language exceptions may span many lines. Rather than building this functionality into Kubernetes, we could provide a script or program that the user can mount into their container and run as a hook, with a configurable number of lines.
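Such a hook program could be as small as the sketch below; the flag names, default paths, and line count are assumptions for illustration, not an existing tool:

```go
package main

import (
	"bufio"
	"flag"
	"os"
	"strings"
)

// Copy the last -n lines of -src into -dst. This is a sketch of the
// mountable hook program described above; all defaults are assumptions.
func main() {
	src := flag.String("src", "/var/log/app.log", "application log file")
	dst := flag.String("dst", "/dev/termination-log", "termination message path")
	n := flag.Int("n", 20, "number of trailing lines to keep")
	flag.Parse()

	f, err := os.Open(*src)
	if err != nil {
		os.Exit(1)
	}
	defer f.Close()

	// Keep a sliding window of the last n lines; fine for modest logs.
	lines := make([]string, 0, *n)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if len(lines) == *n {
			lines = lines[1:]
		}
		lines = append(lines, scanner.Text())
	}
	os.WriteFile(*dst, []byte(strings.Join(lines, "\n")+"\n"), 0644)
}
```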

In terms of promotion: I'd like to see Docker, libcontainer, and the container community more broadly adopt a mechanism like this. The "container RFC" should have proposed something like this.

@bgrant0607
Member Author

A PostStop hook would probably work better than PreStop.

@vishh
Contributor

vishh commented Nov 4, 2014

Structured logging might provide the ability to make restart decisions that the infrastructure cannot make on its own: disk-full errors versus some internal application error, for example.

Docker logs: I get that it is difficult to ascertain the number of log lines that are critical to each individual application. But I feel this feature will be very useful to users because it doesn't require any changes to their containers. We can come up with a sane default and provide an option to store the entire log file if required.

If applications were to dump their death reason to a location that is not on tmpfs, we could scrape that today, without having to rely on hooks.


@dchen1107
Member

@bgrant0607 how about /dev/termination_log? I put some thought into this issue this afternoon; here is a rough design/proposal (a minimal sketch of what it implies in code follows below):

  1. At the API level:
    i) Introduce a TerminationMessagePath field on Container. If the user doesn't specify one, it defaults to /dev/termination_log.
    ii) Introduce a string field called Message on ContainerStateTerminated to capture the application's termination reason.
  2. When the kubelet comes up, it creates /var/lib/kubelet/termination_logs on the node.
  3. When a PodSpec reaches the kubelet, the kubelet creates an empty file named $containerName_$restart_count under /var/lib/kubelet/termination_logs/$podUUID for each new container.
  4. When running the docker container, the kubelet tells docker to bind-mount the file created in (3) to container:$TerminationMessagePath.
  5. During garbage collection, we should remove the files created for such pods or containers.
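A minimal sketch of the field and host-path layout the steps above imply (names are illustrative, drawn from the proposal, not merged code):

```go
package api

import "fmt"

type Container struct {
	// ... existing fields ...

	// TerminationMessagePath is the in-container path the application
	// writes its termination message to. Per the proposal, it would
	// default to /dev/termination_log when unspecified.
	TerminationMessagePath string
}

// hostTerminationLogPath builds the per-container host file the kubelet
// creates (step 3) and bind-mounts into the container (step 4).
func hostTerminationLogPath(podUID, containerName string, restartCount int) string {
	return fmt.Sprintf("/var/lib/kubelet/termination_logs/%s/%s_%d",
		podUID, containerName, restartCount)
}
```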

@bgrant0607
Member Author

@vishh @dchen1107 and I discussed this in person.

First of all, structure: Different kinds of information need to be communicated:

  1. A concise reason string, similar to Reason in type Status, that can be used for analytics, customized behavior by ecosystem extensions, etc., similar to what is described in "More comprehensive reporting of termination reasons" #137 and "Support reason parameter on pod delete" #1462, but provided upwards from the container.
  2. A brief, arbitrary application-specific string, such as a fatal log message, assertion failure message, stack trace, or language exception message. Not structured. We'd log it, return it in Status in the API, display it in the UI, etc. Similar to Message in type Status.
  3. An arbitrary structured payload for use by ecosystem extensions, similar to Details in type Status.
  4. An explicit override of the default restart behavior: for example, don't restart and kill the pod; kill the pod and reschedule it to a different node; delay the restart; or restart when the container normally would not be restarted.

We can definitely leave affordances in the API (in ContainerStateTerminated) for returning all of this information.

We're going to punt on at least (4) and probably (3) for now. I feel (2) is most important, but (1) is also widely useful.
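For illustration, those affordances could look roughly like this; this is a sketch only, with field names mirroring type Status:

```go
package api

// Sketch of affordances in ContainerStateTerminated for items (1)-(3);
// (4) is omitted since we're punting on restart overrides for now.
type ContainerStateTerminated struct {
	ExitCode int

	// (1) Concise, machine-usable reason string, e.g. "FatalLog".
	Reason string

	// (2) Brief, unstructured application-specific text: a fatal log
	// line, assertion failure, stack trace, etc.
	Message string

	// (3) Arbitrary structured payload for ecosystem extensions.
	Details map[string]string
}
```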

Since I believe there is no clear line between infrastructure and user control components, I don't feel we should differentiate Kubelet-originated (e.g., SystemOOM) and other reasons (e.g., WatchDogTimeout) by using separate fields.

We should also generate events for eventualities the system should respond to, such as system OOM.

On the number of lines to pull from logs by default: Docker is at least planning to move to a write-oriented log model rather than a line-oriented model, which may solve this problem.

Regarding the file location:

Using a /dev location would mean that we'd need to manage the bind mounts in the user containers. Using a standard location would require application changes or adapter hooks. Regarding the specific name, termination_log is a bit long, but I agree it's more consistent with the other terminology. I'd use a hyphen rather than an underscore (/dev/termination-log), however.

Using a configurable path would require that we check whether it's in a volume or in the container's writable layer until Docker decouples the mount namespace lifetime from the main process lifetime. We'd copy from the host filesystem for the former and docker cp for the latter.

Setting the configurable path to, say, a glog Fatal log location would provide the data for (2) only. We'd have to synthesize the reason, which could, for instance, be "FatalLog". However, I don't see a good way to accurately characterize the failure reason without running some characterization code. We could provide a default characterization program for common formats, such as glog, Java exceptions, etc.
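A hypothetical characterization program might start from simple pattern heuristics like these (the reason names and patterns are invented for illustration, not a proposed standard):

```go
package main

import "strings"

// characterizeFailure maps raw termination output to a concise reason
// string, as described above. These heuristics cover glog fatal lines
// and Java/Go crash output; they are examples only.
func characterizeFailure(terminationLog string) string {
	switch {
	case strings.HasPrefix(terminationLog, "F"):
		// glog fatal lines begin with severity letter 'F', e.g. "F1104 ..."
		return "FatalLog"
	case strings.Contains(terminationLog, "Exception in thread"):
		return "UncaughtJavaException"
	case strings.Contains(terminationLog, "panic:"):
		return "GoPanic"
	default:
		return "Unknown"
	}
}
```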

Finally, we might as well capture the termination-log even for successful termination. The application might emit a brief execution summary of some sort.

Full-blown application output should be handled via a different mechanism.

@dchen1107
Member

@bgrant0607 what you commented above is aligned with my initial proposal. The only change I made is taking @vishh's suggestion to allow the user to configure the path.

@dchen1107 dchen1107 modified the milestones: v0.5, v0.9 Nov 10, 2014
@dchen1107 dchen1107 added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 4, 2015