
Catch s2i build containers that are killed due to OOM #15032

Closed
andrewklau opened this issue Jul 3, 2017 · 11 comments
Labels: area/usability, component/build, kind/bug, priority/P2

andrewklau (Contributor) commented Jul 3, 2017:

Because the s2i assemble container is launched through direct access to the docker socket, there is currently no way to catch or log an error message telling the user why their build failed.

Based on what I am seeing, a SIGKILL is being sent to the container, so it's not possible to catch anything like a SIGTERM from within the container to at least write an error message to the logs. Users are often left confused, wondering why their build suddenly died.

Instead, all they currently seem to get is "Assemble failed".

This also doesn't provide the best experience, because the web console doesn't currently provide a way to configure the build resources (it can only be done through YAML).
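For reference, a minimal sketch of that YAML configuration (names and values are illustrative); note that the limits are set on the build, so whether the separately launched assemble container is constrained the same way is exactly the open question in this issue:

```yaml
# Illustrative BuildConfig fragment: build resource limits set via YAML,
# which at the time of this issue is the only way to configure them.
apiVersion: v1
kind: BuildConfig
metadata:
  name: example-app
spec:
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: nodejs:latest
  resources:
    limits:
      memory: 512Mi
      cpu: 500m
```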

wanghaoran1988 (Member) commented:

/cc

pweil- added the area/usability, kind/bug, and priority/P2 labels on Jul 4, 2017
andrewklau (Contributor, Author) commented:

If the build container were to be started by kubernetes, would the preStop lifecycle hook be honoured, or would the container just be killed immediately?

bparees (Contributor) commented Jul 11, 2017:

> If the build container were to be started by kubernetes, would the preStop lifecycle hook be honoured, or would the container just be killed immediately?

If it were started by k8s, and if it were being killed by k8s due to resource constraints, I would expect the preStop hook to be honored (though I'm not certain).

However, in this case neither of those things is true. The container isn't managed by k8s, and it's being killed by the host operating system (I believe), which is obviously not going to invoke any hooks even if the container were part of a k8s pod.
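For context, a preStop hook is declared on a container in the pod spec roughly as below (a sketch with illustrative names); it only runs when the kubelet terminates the container gracefully, so a SIGKILL coming straight from the kernel OOM killer would bypass it anyway:

```yaml
# Illustrative pod fragment: preStop runs on graceful termination by the
# kubelet, not when the kernel OOM killer SIGKILLs the process.
apiVersion: v1
kind: Pod
metadata:
  name: example-build-pod
spec:
  containers:
  - name: assemble
    image: example.registry/builder:latest
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "echo 'container is being stopped' >> /tmp/build.log"]
```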

andrewklau (Contributor, Author) commented:

I noticed we've now got an OOMKilled status in 3.6, so I think this helps a lot.

However, the build status returns "GenericBuildFailed", whereas the web console seems to be smart enough to report OOMKilled. Should the build status also be updated to reflect that it was killed due to OOM?

bparees (Contributor) commented Aug 12, 2017:

> I noticed we've now got an OOMKilled status in 3.6, so I think this helps a lot. However, the build status returns "GenericBuildFailed", whereas the web console seems to be smart enough to report OOMKilled. Should the build status also be updated to reflect that it was killed due to OOM?

Where do you see the OOMKilled status being reported? And are you sure it was the assemble container that was OOM-killed (not the build pod container) in that case?

bparees (Contributor) commented Aug 12, 2017:

(The web console has no awareness of the manually launched assemble container, nor will the pod resource have any information about it, so I'm a bit surprised by what you say the web console is detecting/reporting.)

andrewklau (Contributor, Author) commented:

The build pod got the status OOMKilled. I guess this is not the same as the assemble container.

The web console was just reporting the build pod status.

bparees (Contributor) commented Aug 12, 2017:

> The build pod got the status OOMKilled. I guess this is not the same as the assemble container.

Ok, yeah, that makes sense. We could potentially do a better job of reporting the build failure reason when the build pod is OOM-killed, but that's separate from the main problem this issue describes, where the assemble container itself gets OOM-killed (which imho is also the more likely case, since the assemble container uses more resources than the build pod container).
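For the build-pod case, the OOM kill is visible on the pod's container status, so a rough check (pod name and container index are illustrative) could look something like:

```sh
# Illustrative build pod name; prints "OOMKilled" if the container was
# terminated by the kernel for exceeding its memory limit (exit code 137).
oc get pod mybuild-1-build \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
```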

andrewklau (Contributor, Author) commented:

If an assemble container is started for the build process, then what would be the reason the build pod gets OOMKilled?

bparees (Contributor) commented Aug 15, 2017:

> If an assemble container is started for the build process, then what would be the reason the build pod gets OOMKilled?

Well, it's still a pod like any other; under memory pressure the system could decide to OOM-kill it to free resources. Why that particular pod would be chosen, I'm not sure; it doesn't seem like the most likely candidate.

bparees assigned gabemontero and unassigned bparees on Oct 9, 2017
bparees (Contributor) commented Oct 9, 2017:

I'm not sure if we have a way to tell that the container we launched (e.g. the assemble container) got OOM-killed. This is probably worth some quick experimentation/research; if it's not clearly possible for us to know why the container died, close this as won't/can't fix.
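One possible avenue for that experimentation (a sketch, assuming the launcher can inspect the container after it exits; not necessarily what the eventual fix does): the docker daemon records an OOMKilled flag on the container state, for example:

```sh
# Illustrative container name; docker keeps an OOMKilled flag and exit code
# in the container state. Expect "true 137" if the container's cgroup memory
# limit triggered the kernel OOM killer.
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' s2i_assemble_container
```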

openshift-merge-robot added a commit that referenced this issue on Oct 13, 2017:

Automatic merge from submit-queue (batch tested with PRs 16777, 16811, 16823, 16808, 16833).

bump(github.com/openshift/source-to-image): a0e78cce863f296bfb9bf77ac5acd152dc059e32

Fixes #15032

@openshift/devex fyi / ptal