
Prow jobs frequently exit with "Entrypoint received interrupt: terminated" error when run on the Prow service cluster #18930

Closed
chizhg opened this issue Aug 20, 2020 · 24 comments
Labels
area/prow/peribolos, area/prow, kind/bug

Comments

@chizhg
Contributor

chizhg commented Aug 20, 2020

What happened:
We have peribolos running as Prow jobs for Knative, but they fail at times, and the error message is Entrypoint received interrupt: terminated. Besides that, there is no error message directly related to peribolos. See https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.

What you expected to happen:
The Prow jobs should consistently pass, or, if an error does happen, they should print actionable error logs.

How to reproduce it (as minimally and precisely as possible):
There is no consistent way to reproduce it. The Prow jobs sometimes pass but mostly fail, and every time they fail the logs contain the Entrypoint received interrupt: terminated error.

Please provide links to example occurrences, if any:
See the example in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-peribolos/1296496637077622784.

Anything else we need to know?:
There are no peribolos error logs even when I set --log-level=trace.

/area peribolos
/kind bug

@chizhg added the kind/bug label on Aug 20, 2020
@k8s-ci-robot
Contributor

@chizhg: The label(s) area/peribolos cannot be applied, because the repository doesn't have them


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chizhg
Contributor Author

chizhg commented Aug 20, 2020

/area prow/peribolos

@alvaroaleman
Member

alvaroaleman commented Aug 22, 2020

@chizhg that looks like the kubelet decided to end it, presumably because of resource starvation. You need to check the pod and/or the kubelet log to get more details.
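
A minimal sketch of how to check that, assuming the pod is still around and runs in the test-pods namespace (both assumptions, adjust to your setup):

# Inspect the pod's status and the events the kubelet/scheduler emitted for it
# (evictions, OOM kills, failed mounts, etc.). POD_NAME is a placeholder.
kubectl -n test-pods describe pod "$POD_NAME"
kubectl -n test-pods get events --field-selector involvedObject.name="$POD_NAME"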

I also recommend enabling --kubernetes-blob-storage-workers=1 and --blob-storage-workers=1 on crier; the former makes it upload the pod YAML to GCS and the latter the ProwJob YAML. This makes debugging issues like this much easier after the pod is gone from the cluster.
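
A quick way to check whether crier already has these flags set (a sketch, assuming crier runs as a Deployment named crier in the default namespace of the service cluster):

# Print crier's container args and look for the blob-storage worker flags.
kubectl -n default get deployment crier -o jsonpath='{.spec.template.spec.containers[0].args}'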

@chizhg
Contributor Author

chizhg commented Aug 22, 2020

Thanks for the response @alvaroaleman !

We actually have --kubernetes-blob-storage-workers=1 and --blob-storage-workers=1 enabled on crier. I'm not sure why the one I pasted above does not have the pod logs, but https://prow.knative.dev/prowjob?prowjob=fb7ad95c-e273-11ea-9baa-0242ac110002 is one that does.

I also suspected it was due to resource starvation, so I made the resource requests and limits extra large, but I still got the same error:

limits:
  cpu: "10"
  memory: 64Gi
requests:
  cpu: "6"
  memory: 32Gi

Even more weirdly, I ran the same command on my local machine multiple times (without the ghproxy endpoint) and none of the runs failed, so I also suspected it might be due to ghproxy, but I'm not sure where to start debugging.

@alvaroaleman
Member

@chizhg Entrypoint has no knowledge whatsoever of ghproxy or anything else the job does, and its log says that it received an interrupt. Another possibility is that a rogue actor deleted the pod (a misconfigured sinker? other broken automation?).

You can find that out by:

  • Looking at the audit log
  • kubectl get pod $podname -w -o yaml: if this gets a deletionTimestamp, it got deleted. This won't tell you who did it, though (see the sketch below).
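
A sketch of the second option, assuming the pod runs in the test-pods namespace (an assumption; adjust as needed):

# Watch the pod and print a line as soon as a deletionTimestamp shows up.
kubectl -n test-pods get pod "$POD_NAME" -w -o yaml | grep --line-buffered deletionTimestamp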

@chizhg
Contributor Author

chizhg commented Aug 22, 2020

@alvaroaleman Thanks for your insights! I checked the audit log, and it did say the Pod was deleted at some point, but the log did not show the reason.

However, I think I fixed the issue, though I don't know the root cause:
The Prow job was running on the Prow service cluster when I got the interrupt error, but after I moved it to a build cluster the error was gone (I ran the job 5 times and it succeeded every time). So the problem seems to be that the peribolos Prow job pod gets mysteriously deleted when it runs on the Prow service cluster.

Thanks again for your help, and have a nice weekend 😄

@BenTheElder added the area/prow label on Aug 28, 2020
@chizhg changed the title from "Peribolos Prow jobs frequently exit with 'Entrypoint received interrupt: terminated' error" to "Prow jobs frequently exit with 'Entrypoint received interrupt: terminated' error when run on the Prow service cluster" on Sep 8, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 6, 2021
@peterfeifanchen
Contributor

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Feb 1, 2021
@peterfeifanchen
Contributor

@alvaroaleman
Member

@peterfeifanchen please provide an audit log of who deleted that pod, otherwise this is not actionable. This issue is unlikely (but not impossible) to be a Prow issue.

@peterfeifanchen
Contributor

How do I look in Stackdriver for who killed ci-knative-test-infra-branchprotector? The ProwJob name is also the pod name in Prow, right?

@alvaroaleman
Member

The pod name is visible at the bottom if you click on "Pod". Please refer to the Stackdriver/GKE/GCP docs to figure out how audit logging works there; I don't know that. The fields in the audit log are pretty self-explanatory, so you'll quickly be able to write a query for delete calls on that pod yourself.
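
For reference, a rough sketch of such a query using gcloud (the protoPayload field names follow GKE's Kubernetes audit-log format; POD_NAME and PROJECT_ID are placeholders):

# List recent audit-log entries for delete calls on a specific pod.
gcloud logging read \
  'protoPayload.methodName="io.k8s.core.v1.pods.delete" AND protoPayload.resourceName:"POD_NAME"' \
  --project=PROJECT_ID --limit=10 --format=json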

@peterfeifanchen
Contributor

It's much more legible in Stackdriver, but... I'm not sure of any other way than to dump that final part at 9:53am where the SIGTERM came in. You can search for "Failed to delete pod", "SyncLoop (PLEG", "SyncLoop (REMOVE", "SyncLoop (Delete". And in the Prow log, "killing", "Stopping", "Received signal".

@alvaroaleman
Member

@peterfeifanchen those logs are not helpful, we need the audit log. This is the first Google result for "gke audit log": https://cloud.google.com/kubernetes-engine/docs/how-to/audit-logging

@peterfeifanchen
Contributor

There are two delete operations on the job, with a bunch of patch operations in between. Even one of these logs converted to JSON is pretty long; is there a particular section you want to look at? I'm not sure which of these attributes specifies the issuer of the request.

@alvaroaleman
Member

user.username is what we are interested in

@peterfeifanchen
Contributor

That's not in the log. These are the first 30 lines, to check whether I am looking at the right thing:

{
    "protoPayload": {
      "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
      "authenticationInfo": {
        "principalEmail": "system:serviceaccount:default:serviceaccount-cluster-admin"
      },
      "authorizationInfo": [
        {
          "granted": true,
          "permission": "io.k8s.core.v1.pods.delete",
          "resource": "core/v1/namespaces/test-pods/pods/b2a5eb6a-64af-11eb-874b-92748067eba1"
        }
      ],
      "methodName": "io.k8s.core.v1.pods.delete",
      "request": {
        "@type": "core.k8s.io/v1.DeleteOptions",
        "apiVersion": "v1",
        "kind": "DeleteOptions"
      },
      "requestMetadata": {
        "callerIp": "35.239.4.40",
        "callerSuppliedUserAgent": "app.binary/v0.0.0 (linux/amd64) kubernetes/$Format"
      },
      "resourceName": "core/v1/namespaces/test-pods/pods/b2a5eb6a-64af-11eb-874b-92748067eba1",
      "response": {
        "@type": "core.k8s.io/v1.Pod",
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
          "annotations": {
            "prow.k8s.io/job": "post-knative-test-infra-branchprotector"
          },

@alvaroaleman
Member

@peterfeifanchen not sure what GKE does there, but that does not match the types used by kube's audit log: https://github.com/kubernetes/kubernetes/blob/f81aa3f7728c1bc7e3afc6fb1883db6423d896fe/staging/src/k8s.io/apiserver/pkg/apis/audit/v1/types.go#L72

It appears that this is using the SA system:serviceaccount:default:serviceaccount-cluster-admin, and the user agent is app.binary/v0.0.0 (linux/amd64) kubernetes/$Format, which looks like a bazel-built binary with no further info?
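
One way to narrow down which workload runs under that service account (a sketch; it assumes the caller lives in the same cluster, which may not be true):

# Print the service account of every pod and filter for the suspicious one.
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.serviceAccountName}{"\t"}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
  | grep '^serviceaccount-cluster-admin'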

@peterfeifanchen
Contributor

Hmm I see. Thanks for the help. Might have to follow up with GKE to figure out why this random kill is happening in the cluster.

@peterfeifanchen
Contributor

The problem was a misconfiguration on another Prow control plane. We ended up with two Prow instances controlling one build cluster, and sinker was sending kill signals for a ProwJob definition it didn't have.
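
For anyone hitting something similar, a rough way to spot the mismatch (a sketch; the kubeconfig context names are placeholders, service-cluster pointing at the suspect Prow control plane and build-cluster at the shared build cluster, and it assumes ProwJobs live in the default namespace and that pod names equal ProwJob names):

# ProwJobs known to this Prow instance (pod names match ProwJob names).
kubectl --context=service-cluster -n default get prowjobs \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > known-prowjobs.txt
# Pods actually running in the shared build cluster's test-pods namespace.
kubectl --context=build-cluster -n test-pods get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | sort > build-pods.txt
# Pods with no matching ProwJob in this instance: candidates its sinker would treat as orphaned.
comm -13 known-prowjobs.txt build-pods.txt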

@peterfeifanchen
Contributor

not sure how to close...

@alvaroaleman
Member

Happy to hear you found the culprit!

/close

@k8s-ci-robot
Contributor

@alvaroaleman: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
