Prow jobs frequently exit with "Entrypoint received interrupt: terminated" error when run on the Prow service cluster #18930
Comments
@chizhg: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area prow/peribolos
@chizhg that looks like the kubelet decided to end it, presumably because of resource starvation. You need to check the pod and/or the kubelet log to get more details. I also recommend enabling
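For the pod-side check suggested above, something along these lines is a starting point (a sketch; the pod name and namespace are placeholders for the actual ProwJob pod):

```
# Look for OOMKilled, eviction, or termination reasons in the pod status
kubectl describe pod <prowjob-pod-name> -n <namespace>

# Cluster events often record evictions and kills with a reason
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```

If the kubelet evicted the pod for resource pressure, the events usually say so explicitly; if nothing shows up here, an external delete call (the audit-log route discussed below in the thread) becomes more likely.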
Thanks for the response @alvaroaleman! We actually have I also suspected it's due to resource starvation, so I changed the resource requests and limits to be extra large, but I still got the same error:
Also, even more weird, I tried to run the same command on my local machine multiple times (without the
@chizhg Entrypoint has no knowledge whatsoever about ghproxy or anything else the job does, and its log says that it received an interrupt. Another option could be that a rogue actor deletes the pod (misconfigured sinker? Other broken automation?). You can find that out by:
@alvaroaleman Thanks for your insights! I checked the audit log, and it did say the Pod was deleted at some point, but the log did not show the reason. However, I think I fixed the issue, though I don't know what the reason is: Thanks again for your help, and have a nice weekend 😄
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
This issue seems to be recurring.
@peterfeifanchen please provide an audit log of who deleted that pod, otherwise this is not actionable. This issue is unlikely (but not impossible) to be a Prow issue.
How do I look into Stackdriver for who killed
The pod name is visible at the bottom if you click on "Pod". Please refer to the Stackdriver/GKE/GCP docs to figure out how audit logging works there, I don't know that. The fields on the audit log are pretty self-explanatory; you will quickly be able to write a query for delete calls of that pod yourself.
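As a concrete starting point for such a query, a GKE Logs Explorer filter along these lines can surface pod delete calls (a sketch; `PROJECT_ID` and `POD_NAME` are placeholders, and the exact log name depends on how audit logging is configured for the cluster):

```
resource.type="k8s_cluster"
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName="io.k8s.core.v1.pods.delete"
protoPayload.resourceName:"POD_NAME"
```

The matching entries' `protoPayload.authenticationInfo.principalEmail` field then shows which identity issued each delete.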
It's much more legible in Stackdriver, but... not sure of any other way than to dump that final part at 9:53am where the SIGTERM came. You can search for "Failed to delete pod" and "SyncLoop (PLEG", "SyncLoop (REMOVE", "SyncLoop (Delete". And in the Prow log, "killing", "Stopping", "Received signal".
@peterfeifanchen those logs are not helpful, we need the audit log. This is the first Google result for "gke audit log": https://cloud.google.com/kubernetes-engine/docs/how-to/audit-logging
There are two delete operations on the job, with a bunch of patch operations in between. Even one of these logs converted to JSON is pretty long; is there a particular section you want to look at? Not sure which of these attributes specifies the issuer of the request.
That's not in the log. These are the first 30 lines, to see if I am looking at the right thing:
@peterfeifanchen not sure what GKE does there, but that does not match the types used by kube's audit log: https://github.com/kubernetes/kubernetes/blob/f81aa3f7728c1bc7e3afc6fb1883db6423d896fe/staging/src/k8s.io/apiserver/pkg/apis/audit/v1/types.go#L72 It appears that this is using the SA
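Working from the `audit.k8s.io/v1` `Event` type linked above (`verb`, `user.username`, `objectRef`), a small script can answer "who deleted this pod" once the audit log is exported as JSON lines. This is a sketch with hypothetical sample data; the field names come from the linked types, but the sinker service-account name is made up for illustration:

```python
import json

def find_pod_deleters(audit_lines, pod_name):
    """Scan audit.k8s.io/v1 Event objects (one JSON object per line)
    and return the usernames that issued delete calls for pod_name."""
    deleters = []
    for line in audit_lines:
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if (event.get("verb") == "delete"
                and ref.get("resource") == "pods"
                and ref.get("name") == pod_name):
            # user.username identifies the issuer of the request
            deleters.append(event.get("user", {}).get("username", "<unknown>"))
    return deleters

# Hypothetical sample events for illustration:
sample = [
    json.dumps({"verb": "delete",
                "user": {"username": "system:serviceaccount:default:sinker"},
                "objectRef": {"resource": "pods", "name": "ci-peribolos-abc"}}),
    json.dumps({"verb": "patch",
                "user": {"username": "kubelet"},
                "objectRef": {"resource": "pods", "name": "ci-peribolos-abc"}}),
]
print(find_pod_deleters(sample, "ci-peribolos-abc"))
# → ['system:serviceaccount:default:sinker']
```

A username like `system:serviceaccount:<namespace>:<name>` points at in-cluster automation such as sinker, which is exactly the kind of "rogue actor" suspected earlier in the thread.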
Hmm I see. Thanks for the help. Might have to follow up with GKE to figure out why this random kill is happening in the cluster. |
The problem was that we had a misconfiguration on another Prow control plane. We ended up with two Prow instances controlling one build cluster, and sinker was sending kill signals for a ProwJob definition it didn't have.
Not sure how to close...
Happy to hear you found the culprit! /close |
@alvaroaleman: Closing this issue. In response to this:
What happened:
We have peribolos running as Prow jobs for Knative, but they fail at times, and the error message is `Entrypoint received interrupt: terminated`. Besides that, there is no error message directly related to peribolos. See https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.
What you expected to happen:
The Prow jobs should consistently pass, or, if any error happens, print actionable error logs.
How to reproduce it (as minimally and precisely as possible):
There is no consistent way to reproduce it. The Prow jobs sometimes pass but mostly do not, and every time they fail, the logs contain the `Entrypoint received interrupt: terminated` error.
Please provide links to example occurrences, if any:
See the example in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.
Anything else we need to know?:
There are no peribolos error logs even if I set `--log-level=trace`.
/area peribolos
/kind bug