
Catch s2i build containers that are killed due to OOM #15032

Closed
andrewklau opened this issue Jul 3, 2017 · 11 comments
Labels: area/usability, component/build, kind/bug, priority/P2

andrewklau (Contributor) commented Jul 3, 2017:

Because the s2i assemble container is launched through direct access to the docker socket, there is currently no way to catch or log an error message telling the user why their build failed.

Based on what I am seeing, a SIGKILL is being sent to the container, so it's not possible to catch anything like a SIGTERM from within the container to at least write an error message to the logs. Users are often left confused, wondering why their build suddenly died.

Instead, all they currently seem to get is "Assemble failed".

This also doesn't provide the best experience, because the web console doesn't currently provide a way to configure the build resources (it can only be done through YAML).
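For reference, a minimal sketch of that YAML configuration (names and values are illustrative); note that the limits are set on the build, so whether the separately launched assemble container is constrained the same way is exactly the open question in this issue:

```yaml
# Illustrative BuildConfig fragment: build resource limits set via YAML,
# which at the time of this issue is the only way to configure them.
apiVersion: v1
kind: BuildConfig
metadata:
  name: example-app
spec:
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: nodejs:latest
  resources:
    limits:
      memory: 512Mi
      cpu: 500m
```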

wanghaoran1988 (Member) commented:

/cc

pweil- added the area/usability, kind/bug, and priority/P2 labels on Jul 4, 2017
andrewklau (Contributor, Author) commented:

If the build container were to be started by kubernetes, would the preStop lifecycle hook be honoured, or would the container just be killed immediately?

bparees (Contributor) commented Jul 11, 2017:

> If the build container were to be started by kubernetes, would the preStop lifecycle hook be honoured, or would the container just be killed immediately?

If it were started by k8s, and if it were being killed by k8s due to resource constraints, I would expect the preStop hook to be honored (though I'm not certain).

However, in this case neither of those things is true. The container isn't managed by k8s, and it's being killed by the host operating system (I believe), which is obviously not going to invoke any hooks even if the container were part of a k8s pod.
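For context, a preStop hook is declared on a container in the pod spec roughly as below (a sketch with illustrative names); it only runs when the kubelet terminates the container gracefully, so a SIGKILL coming straight from the kernel OOM killer would bypass it anyway:

```yaml
# Illustrative pod fragment: preStop runs on graceful termination by the
# kubelet, not when the kernel OOM killer SIGKILLs the process.
apiVersion: v1
kind: Pod
metadata:
  name: example-build-pod
spec:
  containers:
  - name: assemble
    image: example.registry/builder:latest
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "echo 'container is being stopped' >> /tmp/build.log"]
```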

andrewklau (Contributor, Author) commented:

I noticed we've now got an OOMKilled status in 3.6, so I think this helps a lot.

However, the build status returns "GenericBuildFailed", whereas the web console seems to be smart enough to report OOMKilled. Should the build status also be updated to reflect that it was killed due to OOM?

bparees (Contributor) commented Aug 12, 2017:

> I noticed we've now got an OOMKilled status in 3.6, so I think this helps a lot. However, the build status returns "GenericBuildFailed", whereas the web console seems to be smart enough to report OOMKilled. Should the build status also be updated to reflect that it was killed due to OOM?

Where do you see the OOMKilled status being reported? And are you sure it was the assemble container that was OOM-killed (not the build pod container) in that case?

bparees (Contributor) commented Aug 12, 2017:

(The web console has no awareness of the manually launched assemble container, nor will the pod resource have any information about it, so I'm a bit surprised by what you say the web console is detecting/reporting.)

andrewklau (Contributor, Author) commented:

The build pod got the status OOMKilled. I guess this is not the same as the assemble container.

The web console was just reporting the build pod status.

bparees (Contributor) commented Aug 12, 2017:

> The build pod got the status OOMKilled. I guess this is not the same as the assemble container.

Ok, yeah, that makes sense. We could potentially do a better job of reporting the build failure reason when the build pod is OOM-killed, but that's separate from the main problem this issue describes, where the assemble container itself gets OOM-killed (which imho is also the more likely case, since the assemble container uses more resources than the build pod container).
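For the build-pod case, the OOM kill is visible on the pod's container status, so a rough check (pod name and container index are illustrative) could look something like:

```sh
# Illustrative build pod name; prints "OOMKilled" if the container was
# terminated by the kernel for exceeding its memory limit (exit code 137).
oc get pod mybuild-1-build \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
```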

andrewklau (Contributor, Author) commented:

If an assemble container is started for the build process, then what would be the reason the build pod gets OOMKilled?

bparees (Contributor) commented Aug 15, 2017:

> If an assemble container is started for the build process, then what would be the reason the build pod gets OOMKilled?

Well, it's still a pod like any other; under memory pressure the system could decide to OOM-kill it to free resources. Why that particular pod would be chosen, I'm not sure; it doesn't seem like the most likely candidate.

bparees assigned gabemontero and unassigned bparees on Oct 9, 2017
bparees (Contributor) commented Oct 9, 2017:

I'm not sure if we have a way to tell that the container we launched (e.g. the assemble container) got OOM-killed. This is probably worth some quick experimentation/research; if it's not clearly possible for us to know why the container died, close this as won't/can't fix.
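One possible avenue for that experimentation (a sketch, assuming the launcher can inspect the container after it exits; not necessarily what the eventual fix does): the docker daemon records an OOMKilled flag on the container state, for example:

```sh
# Illustrative container name; docker keeps an OOMKilled flag and exit code
# in the container state. Expect "true 137" if the container's cgroup memory
# limit triggered the kernel OOM killer.
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' s2i_assemble_container
```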

openshift-merge-robot added a commit that referenced this issue on Oct 13, 2017:

Automatic merge from submit-queue (batch tested with PRs 16777, 16811, 16823, 16808, 16833).

bump(github.com/openshift/source-to-image): a0e78cce863f296bfb9bf77ac5acd152dc059e32

Fixes #15032

@openshift/devex fyi / ptal