
Pod is deleted after job is failed, with restartPolicy: Never #83999

Closed
mofirouz opened this issue Oct 16, 2019 · 3 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments


mofirouz commented Oct 16, 2019

What happened:
A Job is created with a single init container and a single main container, and the pod's restart policy is set to "Never". If the job fails, the pod is sometimes deleted; this does not happen on every run.

Most importantly, we did not observe this issue in Kubernetes 1.12.9-gke.15, but we are observing it now in 1.14.6-gke.1 - we do not have a Kubernetes 1.13 cluster.

What you expected to happen:
The pod should remain indefinitely, for as long as the Job object remains on the system, unless it is explicitly deleted.

How to reproduce it (as minimally and precisely as possible):

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
  namespace: test
spec:
  activeDeadlineSeconds: 300
  backoffLimit: 0
  completions: 1
  parallelism: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30
      restartPolicy: Never
      automountServiceAccountToken: false
      containers:
      - image: perl
        name: pi
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      initContainers:
      - image: alpine/git:latest
        name: git
        command:
        - /bin/sh
        - -ec
        - git clone git@github.com:test/badrepo
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - kill
            - sys_chroot
            - mknod
            - net_raw
            - chown
            - dac_override
            - fowner
            - fsetid
            - setgid
            - setuid
            - setpcap
            - net_bind_service
            - audit_write
            - setfcap
          readOnlyRootFilesystem: false
          runAsNonRoot: true
          runAsUser: 1001
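
To reproduce, apply the manifest and watch the pod after the job fails. The commands below are only a sketch: they assume the manifest is saved locally as job.yaml and that the test namespace does not already exist.

# create the namespace referenced by the manifest (skip if it already exists)
kubectl create namespace test
# submit the Job, then watch the pod: after the init container fails the pod
# should remain in Init:Error, but sometimes it disappears entirely
kubectl apply -f job.yaml
kubectl -n test get pods --watch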

Anything else we need to know?:

I have a sneaking suspicion that this may be related to issue #79398 / PR #79451; hopefully I'm not completely off-base here.

Environment: GKE

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.6-gke.1", GitCommit:"61c30f98599ad5309185df308962054d9670bafa", GitTreeState:"clean", BuildDate:"2019-08-28T11:06:42Z", GoVersion:"go1.12.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@mofirouz mofirouz added the kind/bug Categorizes issue or PR as related to a bug. label Oct 16, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 16, 2019
@mofirouz (Author)

@kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 16, 2019
@k8s-ci-robot (Contributor)

@mofirouz: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

@kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


mofirouz commented Dec 1, 2019

I've figured this out: it's caused by auto-resizing of node pools in GKE. After ~15 minutes the underlying node that was hosting the pod is scaled away, and Kubernetes removes everything that was bound to that node, including the failed pod.
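
One way to confirm this (a rough sketch; exact event reasons and output vary by Kubernetes version and autoscaler behaviour) is to note which node the pod lands on, then watch the nodes and cluster events around the time the pod disappears:

# find which node the failed pod is scheduled on
kubectl -n test get pods -o wide
# watch nodes; the node hosting the pod should be removed when the pool scales down
kubectl get nodes --watch
# look for node-removal / scale-down events around the same time
kubectl get events --all-namespaces | grep -i -E 'node|scale'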

@mofirouz mofirouz closed this as completed Dec 1, 2019