
Prow jobs frequently exit with "Entrypoint received interrupt: terminated" error when run on the Prow service cluster #18930

Closed
chizhg opened this issue Aug 20, 2020 · 24 comments
Labels
area/prow/peribolos, area/prow, kind/bug

Comments

@chizhg
Contributor

chizhg commented Aug 20, 2020

What happened:
We have peribolos running as Prow jobs for Knative, but they fail at times, and the error message is Entrypoint received interrupt: terminated. Besides that, there is no error message directly related to peribolos. See https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.

What you expected to happen:
The Prow jobs should consistently pass, or, if an error does happen, they should print actionable error logs.

How to reproduce it (as minimally and precisely as possible):
There is no consistent way to reproduce it. The Prow jobs sometimes pass but mostly fail, and every time they fail the logs contain the Entrypoint received interrupt: terminated error.

Please provide links to example occurrences, if any:
See the example in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.

Anything else we need to know?:
There are no peribolos error logs even when I set --log-level=trace.

/area peribolos
/kind bug

@chizhg added the kind/bug label on Aug 20, 2020
@k8s-ci-robot
Contributor

@chizhg: The label(s) area/peribolos cannot be applied, because the repository doesn't have them


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chizhg
Contributor Author

chizhg commented Aug 20, 2020

/area prow/peribolos

@alvaroaleman
Member

alvaroaleman commented Aug 22, 2020

@chizhg that looks like the kubelet decided to end it, presumably because of resource starvation. You need to check the pod and/or the kubelet log to get more details.
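
A minimal sketch of how to check that, assuming the pod is still around and runs in the test-pods namespace (both assumptions, adjust to your setup):

# Inspect the pod's status and the events the kubelet/scheduler emitted for it
# (evictions, OOM kills, failed mounts, etc.). POD_NAME is a placeholder.
kubectl -n test-pods describe pod "$POD_NAME"
kubectl -n test-pods get events --field-selector involvedObject.name="$POD_NAME"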

I also recommend enabling --kubernetes-blob-storage-workers=1 and --blob-storage-workers=1 on crier; the former makes it upload the pod YAML to GCS and the latter the ProwJob YAML. This makes debugging issues like this much easier after the pod is gone from the cluster.
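
A quick way to check whether crier already has these flags set (a sketch, assuming crier runs as a Deployment named crier in the default namespace of the service cluster):

# Print crier's container args and look for the blob-storage worker flags.
kubectl -n default get deployment crier -o jsonpath='{.spec.template.spec.containers[0].args}'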

@chizhg
Contributor Author

chizhg commented Aug 22, 2020

Thanks for the response @alvaroaleman !

We actually have --kubernetes-blob-storage-workers=1 and --blob-storage-workers=1 enabled on crier. I'm not sure why the one I pasted above does not have the pod logs, but https://prow.knative.dev/prowjob?prowjob=fb7ad95c-e273-11ea-9baa-0242ac110002 is one that does.

I also suspected it was due to resource starvation, so I made the resource requests and limits extra large, but I still got the same error:

limits:
  cpu: "10"
  memory: 64Gi
requests:
  cpu: "6"
  memory: 32Gi

Even more weirdly, I ran the same command on my local machine multiple times (without the ghproxy endpoint) and none of the runs failed, so I also suspected it might be due to ghproxy, but I'm not sure where to start debugging.

@alvaroaleman
Member

@chizhg Entrypoint has no knowledge whatsoever of ghproxy or anything else the job does, and its log says that it received an interrupt. Another possibility is that a rogue actor deleted the pod (a misconfigured sinker? other broken automation?).

You can find that out by:

  • Looking at the audit log
  • kubectl get pod $podname -w -o yaml: if this gets a deletionTimestamp, it got deleted. This won't tell you who did it, though (see the sketch below).
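
A sketch of the second option, assuming the pod runs in the test-pods namespace (an assumption; adjust as needed):

# Watch the pod and print a line as soon as a deletionTimestamp shows up.
kubectl -n test-pods get pod "$POD_NAME" -w -o yaml | grep --line-buffered deletionTimestamp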

@chizhg
Contributor Author

chizhg commented Aug 22, 2020

@alvaroaleman Thanks for your insights! I checked the audit log, and it did say the Pod was deleted at some point, but the log did not show the reason.

However, I think I fixed the issue, though I don't know the root cause:
The Prow job was running on the Prow service cluster when I got the interrupt error, but after I moved it to a build cluster the error was gone (I ran the job 5 times and it succeeded every time). So the problem seems to be that the peribolos Prow job pod gets mysteriously deleted when it runs on the Prow service cluster.

Thanks again for your help, and have a nice weekend 😄

@BenTheElder added the area/prow label on Aug 28, 2020
@chizhg changed the title from "Peribolos Prow jobs frequently exit with 'Entrypoint received interrupt: terminated' error" to "Prow jobs frequently exit with 'Entrypoint received interrupt: terminated' error when run on the Prow service cluster" on Sep 8, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 6, 2021
@peterfeifanchen
Contributor

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Feb 1, 2021
@peterfeifanchen
Contributor

@alvaroaleman
Member

@peterfeifanchen please provide an audit log of who deleted that pod, otherwise this is not actionable. This issue is unlikely (but not impossible) to be a Prow issue.

@peterfeifanchen
Contributor

How do I look in Stackdriver for who killed ci-knative-test-infra-branchprotector? The ProwJob name is also the pod name in Prow, right?

@alvaroaleman
Member

The pod name is visible at the bottom if you click on "Pod". Please refer to the Stackdriver/GKE/GCP docs to figure out how audit logging works there; I don't know that. The fields in the audit log are pretty self-explanatory, so you'll quickly be able to write a query for delete calls on that pod yourself.
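
For reference, a rough sketch of such a query using gcloud (the protoPayload field names follow GKE's Kubernetes audit-log format; POD_NAME and PROJECT_ID are placeholders):

# List recent audit-log entries for delete calls on a specific pod.
gcloud logging read \
  'protoPayload.methodName="io.k8s.core.v1.pods.delete" AND protoPayload.resourceName:"POD_NAME"' \
  --project=PROJECT_ID --limit=10 --format=json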

@peterfeifanchen
Contributor

It's much more legible in Stackdriver, but... I'm not sure of any other way than to dump that final part at 9:53am where the SIGTERM came in. You can search for "Failed to delete pod", "SyncLoop (PLEG", "SyncLoop (REMOVE", "SyncLoop (Delete". And in the Prow log, "killing", "Stopping", "Received signal".

@alvaroaleman
Member

@peterfeifanchen those logs are not helpful, we need the audit log. This is the first Google result for "gke audit log": https://cloud.google.com/kubernetes-engine/docs/how-to/audit-logging

@peterfeifanchen
Contributor

There are two delete operations on the job, with a bunch of patch operations in between. Even one of these logs converted to JSON is pretty long; is there a particular section you want to look at? I'm not sure which of these attributes specifies the issuer of the request.

@alvaroaleman
Member

user.username is what we are interested in

@peterfeifanchen
Contributor

That's not in the log. These are the first 30 lines, to check whether I am looking at the right thing:

{
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "system:serviceaccount:default:serviceaccount-cluster-admin"
      },
      "authorizationInfo": [
        {
          "granted": true,
          "permission": "io.k8s.core.v1.pods.delete",
          "resource": "core/v1/namespaces/test-pods/pods/b2a5eb6a-64af-11eb-874b-92748067eba1"
        }
      ],
      "methodName": "io.k8s.core.v1.pods.delete",
      "request": {
        "@type": "core.k8s.io/v1.DeleteOptions",
        "apiVersion": "v1",
        "kind": "DeleteOptions"
      },
      "requestMetadata": {
        "callerIp": "35.239.4.40",
        "callerSuppliedUserAgent": "app.binary/v0.0.0 (linux/amd64) kubernetes/$Format"
      },
      "resourceName": "core/v1/namespaces/test-pods/pods/b2a5eb6a-64af-11eb-874b-92748067eba1",
      "response": {
        "@type": "core.k8s.io/v1.Pod",
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
          "annotations": {
            "prow.k8s.io/job": "post-knative-test-infra-branchprotector"
          },

@alvaroaleman
Member

@peterfeifanchen not sure what GKE does there, but that does not match the types used by kube's audit log: https://github.com/kubernetes/kubernetes/blob/f81aa3f7728c1bc7e3afc6fb1883db6423d896fe/staging/src/k8s.io/apiserver/pkg/apis/audit/v1/types.go#L72

It appears that this is using the SA system:serviceaccount:default:serviceaccount-cluster-admin, and the user agent is app.binary/v0.0.0 (linux/amd64) kubernetes/$Format, which looks like a bazel-built binary with no further info?
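
One way to narrow down which workload runs under that service account (a sketch; it assumes the caller lives in the same cluster, which may not be true):

# Print the service account of every pod and filter for the suspicious one.
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.serviceAccountName}{"\t"}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
  | grep '^serviceaccount-cluster-admin'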

@peterfeifanchen
Contributor

Hmm I see. Thanks for the help. Might have to follow up with GKE to figure out why this random kill is happening in the cluster.

@peterfeifanchen
Contributor

The problem was a misconfiguration on another Prow control plane. We ended up with two Prow instances controlling one build cluster, and sinker was sending kill signals for a ProwJob definition it didn't have.
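
For anyone hitting something similar, a rough way to spot the mismatch (a sketch; the kubeconfig context names are placeholders, service-cluster pointing at the suspect Prow control plane and build-cluster at the shared build cluster, and it assumes ProwJobs live in the default namespace and that pod names equal ProwJob names):

# ProwJobs known to this Prow instance (pod names match ProwJob names).
kubectl --context=service-cluster -n default get prowjobs \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > known-prowjobs.txt
# Pods actually running in the shared build cluster's test-pods namespace.
kubectl --context=build-cluster -n test-pods get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > build-pods.txt
# Pods with no matching ProwJob in this instance: candidates its sinker would treat as orphaned.
comm -13 known-prowjobs.txt build-pods.txt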

@peterfeifanchen
Contributor

not sure how to close...

@alvaroaleman
Member

Happy to hear you found the culprit!

/close

@k8s-ci-robot
Contributor

@alvaroaleman: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
