Adding EKS role annotation to Master ServiceAccount causes Jenkins pod restart loop #361

Open
Nuru opened this issue May 4, 2020 · 14 comments
Labels
bug Something isn't working not-stale

@Nuru

Nuru commented May 4, 2020

Expected Behavior

On Amazon EKS, adding the eks.amazonaws.com/role-arn annotation to the Jenkins spec.serviceAccount.annotations would allow the ServiceAccount to assume an AWS IAM role.
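
For illustration, the ServiceAccount managed by the operator would be expected to end up looking roughly like the sketch below. The name and namespace are taken from the Pod spec posted later in this thread, and the role ARN is the redacted placeholder from that same spec. This is the intended result, not output captured from a cluster.

```yaml
# Sketch of the expected ServiceAccount; not captured from a real cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jenkins-operator-jenkins   # from serviceAccountName in the Pod spec
  namespace: jenkins
  annotations:
    # ACCOUNT is a redacted placeholder, as in the Pod spec
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
```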

Actual Behavior

Simply adding the annotation to an existing Jenkins resource fails because of #362. Creating a Jenkins resource from scratch containing the annotation fails with a restart loop:

Log output from operator, trimmed for brevity:

:11:49.808Z Creating a new Jenkins Master Pod jenkins/jenkins-jenkins	{"cr": "jenkins"}
:11:49.983Z Jenkins master pod restarted by operator:	{"cr": "jenkins"}
:11:49.983Z Jenkins pod volumes have changed, actual '[{aws-iam-token {nil nil nil nil nil nil nil
:11:49.983Z Env has changed to '[{Name:COPY_REFERENCE_FILE_LOG Value:/var/lib/jenkins/copy_referen
:11:49.984Z Volume mounts have changed to '[{Name:jenkins-home ReadOnly:false MountPath:/var/lib/j
:11:49.984Z Env has changed to '[{Name:BACKUP_COUNT Value:9 ValueFrom:nil} {Name:BACKUP_DIR Value:
:11:49.984Z Volume mounts have changed to '[{Name:jenkins-home ReadOnly:false MountPath:/jenkins-h
:12:01.027Z Creating a new Jenkins Master Pod jenkins/jenkins-jenkins	{"cr": "jenkins"}
:12:01.296Z Jenkins master pod restarted by operator:	{"cr": "jenkins"}
:12:01.296Z Jenkins pod volumes have changed, actual '[{aws-iam-token {nil nil nil nil nil nil nil
:12:01.296Z Env has changed to '[{Name:COPY_REFERENCE_FILE_LOG Value:/var/lib/jenkins/copy_referen
:12:01.296Z Volume mounts have changed to '[{Name:jenkins-home ReadOnly:false MountPath:/var/lib/j
:12:01.296Z Env has changed to '[{Name:BACKUP_COUNT Value:9 ValueFrom:nil} {Name:BACKUP_DIR Value:
:12:01.297Z Volume mounts have changed to '[{Name:jenkins-home ReadOnly:false MountPath:/jenkins-h
:12:10.997Z Creating a new Jenkins Master Pod jenkins/jenkins-jenkins	{"cr": "jenkins"}

Steps to Reproduce the Problem

  1. Deploy Jenkins 2.222.3 with kubernetes-operator v0.4.0 using helm chart 0.2.0 onto an AWS EKS cluster running Kubernetes 1.15.11, including the annotation in the Jenkins resource:
spec:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: some-role
  2. View the logs on the operator pod.
  3. Wait forever for Jenkins to come up.
@Nuru
Author

Nuru commented May 4, 2020

Our operator is completely non-functional now. I have tried completely deleting it via helm and re-installing it, and it still exhibits the same behavior. The kubelet log shows that approximately 0.2 seconds after starting to mount volumes in the Jenkins pod, it starts to unmount them, with no reason given in between. Kubernetes generates a warning event whose message includes "timeout expired waiting for volumes to attach or mount", but that appears to be spurious, unless the operator is somehow specifying a 0-second timeout via some mechanism outside of the Pod spec.

I can copy the Pod spec the operator creates, change the name, delete the ownerReferences section, and deploy the Pod successfully, so it has nothing to do with the Pod spec, the mounts, the networking, the file system, PVCs, etc.

It has to be the operator killing the pods because it thinks they are incorrectly configured. See in the logs that 0.2 seconds after "Creating a new Jenkins Master Pod" there is "Jenkins master pod restarted by operator".

EDIT:

Fixed now, and issue updated. Figuring this out was made much harder because of #362, which prevented changes to the Jenkins resource from fixing the problem. To fix the problem, you have to delete the Jenkins resource and re-create it without the offending annotation.

@Nuru Nuru changed the title from "After deleting operator pod, operator keeps restarting Jenkins" to "Adding EKS role annotation to Master ServiceAccount causes Jenkins pod restart loop" May 4, 2020
@tomaszsek

Hi @Nuru,

I think that when eks.amazonaws.com/role-arn is added, EKS changes the env variables and volumes of the pod. The operator doesn't know about this change and keeps restarting Jenkins forever. Could you send the output of the command kubectl get pod jenkins-jenkins -o yaml, taken while the operator is restarting the pod?

Cheers

@Nuru
Author

Nuru commented May 8, 2020

@tomaszsek Yes, that is probably right, EKS adds the aws-iam-token, but the operator should say something more useful in the logs, or give up because of the sync failure. I started on this mess because the operator gave up trying to modify the ServiceAccount after 10 failures, so I expected to see something like that in the logs here. I spent all day trying to figure out why the volume mounts were failing. Turns out it was because the operator was killing the pod before the mounts finished.

As you asked, here is the pod spec from the operator. I have redacted a few numbers that should not make a difference to you.

Pod.yaml:
kind: Pod
apiVersion: v1
metadata:
  name: jenkins-jenkins
  namespace: jenkins
  selfLink: /api/v1/namespaces/jenkins/pods/jenkins-jenkins
  uid: a6534b92-c435-470f-bcd0-c6cc3916fa51
  resourceVersion: '14798790'
  creationTimestamp: '2020-05-08T06:07:25Z'
  deletionTimestamp: '2020-05-08T06:07:55Z'
  deletionGracePeriodSeconds: 30
  labels:
    app: jenkins-operator
    jenkins-cr: jenkins
  annotations:
    kubernetes.io/psp: eks.privileged
  ownerReferences:
    - apiVersion: jenkins.io/v1alpha2
      kind: Jenkins
      name: jenkins
      uid: 7a8f1ee9-10d7-41f1-a3b7-49d7a0dc6b07
      controller: true
      blockOwnerDeletion: true
spec:
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
        defaultMode: 420
    - name: jenkins-home
      emptyDir: {}
    - name: scripts
      configMap:
        name: jenkins-operator-scripts-jenkins
        defaultMode: 511
    - name: init-configuration
      configMap:
        name: jenkins-operator-init-configuration-jenkins
        defaultMode: 420
    - name: operator-credentials
      secret:
        secretName: jenkins-operator-credentials-jenkins
        defaultMode: 420
    - name: custom-css
      configMap:
        name: custom-css
        defaultMode: 420
    - name: backup
      persistentVolumeClaim:
        claimName: jenkins-backup
    - name: jenkins-operator-jenkins-token-8hsq4
      secret:
        secretName: jenkins-operator-jenkins-token-8hsq4
        defaultMode: 420
  containers:
    - name: jenkins-master
      image: 'jenkinsci/blueocean:1.23.1'
      command:
        - bash
        - '-c'
        - >-
          /var/jenkins/scripts/init.sh && exec /sbin/tini -s --
          /usr/local/bin/jenkins.sh
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: slavelistener
          containerPort: 50000
          protocol: TCP
      env:
        - name: COPY_REFERENCE_FILE_LOG
          value: /var/lib/jenkins/copy_reference_file.log
        - name: JAVA_OPTS
          value: >-
            -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
            -XX:MaxRAMFraction=1 -Djenkins.install.runSetupWizard=false
            -Djava.awt.headless=true
        - name: JENKINS_HOME
          value: /var/lib/jenkins
        - name: AWS_ROLE_ARN
          value: >-
            arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      resources:
        limits:
          cpu: 1500m
          memory: 3Gi
        requests:
          cpu: 500m
          memory: 500Mi
      volumeMounts:
        - name: jenkins-home
          mountPath: /var/lib/jenkins
        - name: scripts
          readOnly: true
          mountPath: /var/jenkins/scripts
        - name: init-configuration
          readOnly: true
          mountPath: /var/jenkins/init-configuration
        - name: operator-credentials
          readOnly: true
          mountPath: /var/jenkins/operator-credentials
        - name: custom-css
          mountPath: /var/jenkins/jenkins/userContent/custom-css
        - name: jenkins-operator-jenkins-token-8hsq4
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        - name: aws-iam-token
          readOnly: true
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      livenessProbe:
        httpGet:
          path: /login
          port: http
          scheme: HTTP
        initialDelaySeconds: 80
        timeoutSeconds: 5
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 12
      readinessProbe:
        httpGet:
          path: /login
          port: http
          scheme: HTTP
        initialDelaySeconds: 30
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
    - name: backup
      image: 'virtuslab/jenkins-operator-backup-pvc:v0.0.8'
      env:
        - name: BACKUP_COUNT
          value: '9'
        - name: BACKUP_DIR
          value: /backup
        - name: JENKINS_HOME
          value: /jenkins-home
        - name: AWS_ROLE_ARN
          value: >-
            arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      resources:
        limits:
          cpu: 100m
          memory: 100Mi
        requests:
          cpu: 50m
          memory: 50Mi
      volumeMounts:
        - name: jenkins-home
          mountPath: /jenkins-home
        - name: backup
          mountPath: /backup
        - name: jenkins-operator-jenkins-token-8hsq4
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        - name: aws-iam-token
          readOnly: true
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  serviceAccountName: jenkins-operator-jenkins
  serviceAccount: jenkins-operator-jenkins
  nodeName: REDACTED.us-east-2.compute.internal
  securityContext: {}
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
status:
  phase: Pending
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2020-05-08T06:07:25Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2020-05-08T06:07:25Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [jenkins-master backup]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2020-05-08T06:07:25Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [jenkins-master backup]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2020-05-08T06:07:25Z'
  hostIP: REDACTED
  startTime: '2020-05-08T06:07:25Z'
  containerStatuses:
    - name: backup
      state:
        terminated:
          exitCode: 0
          startedAt: null
          finishedAt: null
      lastState: {}
      ready: false
      restartCount: 0
      image: 'virtuslab/jenkins-operator-backup-pvc:v0.0.8'
      imageID: ''
    - name: jenkins-master
      state:
        terminated:
          exitCode: 0
          startedAt: null
          finishedAt: null
      lastState: {}
      ready: false
      restartCount: 0
      image: 'jenkinsci/blueocean:1.23.1'
      imageID: ''
  qosClass: Burstable

@tomaszsek

That is what I was thinking. EKS has added a new volume:

spec:
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
        defaultMode: 420

and these env variables to each container:

        - name: AWS_ROLE_ARN
          value: >-
            arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

We will make a fix for that. We are working on moving from a Pod to a Deployment (#195). After #195 is complete, this kind of error should no longer appear.

@Nuru
Author

Nuru commented May 10, 2020

Thank you for getting this fixed. I understand why the operator rejects the changes EKS made, and I expect you are right that switching to a Deployment will provide a clean fix.

Do you have an ETA for fixing this? It is a blocker for us.

@tomaszsek tomaszsek added the bug Something isn't working label May 11, 2020
@tomaszsek tomaszsek added this to Needs triage in Triage & Progress via automation May 11, 2020
@tomaszsek tomaszsek moved this from Needs triage to In progress in Triage & Progress May 11, 2020
@johncblandii

Any thoughts on timing here @tomaszsek? This is becoming a blocker.

@saranchand

Is there any update on the fix for this issue?

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this issue is still affecting you, just comment with any updates and we'll keep it open. Thank you for your contributions.

@stale stale bot added the stale label Jul 21, 2021
@Sig00rd Sig00rd added this to the New API milestone Jul 27, 2021
@stale stale bot removed the stale label Jul 27, 2021
@tarvip

tarvip commented Nov 8, 2021

It is possible to work around this issue: the volumes and env variables that were added by EKS must also be added to the CR. It is not very elegant, but it works.
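
A rough sketch of what that could look like, mirroring the volume, env variables and volume mount that EKS injected into the Pod spec posted earlier in this thread. The CR name, namespace and role ARN placeholder are taken from that spec. Whether the operator also requires a particular ordering of these entries, or other fields to match, is not confirmed here, so treat this as an assumption to verify against your own operator logs.

```yaml
# Sketch only: declares in the Jenkins CR what EKS would otherwise inject,
# so the Pod the operator creates already matches the mutated one.
# Unrelated fields (images, resources, probes, other volumes) are omitted.
apiVersion: jenkins.io/v1alpha2
kind: Jenkins
metadata:
  name: jenkins
  namespace: jenkins
spec:
  serviceAccount:
    annotations:
      # ACCOUNT is a redacted placeholder, as in the Pod spec above
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
  master:
    volumes:
      - name: aws-iam-token
        projected:
          defaultMode: 420
          sources:
            - serviceAccountToken:
                audience: sts.amazonaws.com
                expirationSeconds: 86400
                path: token
    containers:
      - name: jenkins-master
        env:
          - name: AWS_ROLE_ARN
            value: arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
          - name: AWS_WEB_IDENTITY_TOKEN_FILE
            value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
        volumeMounts:
          - name: aws-iam-token
            readOnly: true
            mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      # Assumption: the backup container is also defined in the CR;
      # EKS injected the same entries into it in the Pod spec above.
      - name: backup
        env:
          - name: AWS_ROLE_ARN
            value: arn:aws:iam::ACCOUNT:role/eks-jenkins-operator-jenkins@jenkins
          - name: AWS_WEB_IDENTITY_TOKEN_FILE
            value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
        volumeMounts:
          - name: aws-iam-token
            readOnly: true
            mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
```

Note that #362 may prevent this change from taking effect on an existing Jenkins resource, so the resource may need to be deleted and re-created.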

@maslakov

Is there any progress on that? This is something of a blocker for those who want to develop, e.g., a custom backup to AWS using a proper IAM-based solution.

@Harguer

Harguer commented May 4, 2022

Hi!
Is this issue solved? I'm having problems configuring this for S3 bucket backups. It just does not work: if I manually add the EKS role in the annotation section "spec.serviceAccount.annotations", it starts crashing and I need to remove all of the operator CRD stuff and deploy it fresh again.
https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/

Thanks in advance!

@Harguer

Harguer commented Jun 10, 2022

Hello! I was wondering if there is any update here. I'm still having issues integrating this with jenkins-operator: https://eksctl.io/usage/iamserviceaccounts/

@mkyc

mkyc commented Oct 13, 2022

Workaround is to add jenkins.io/use-deployment=true annotation to Jenkins CR.

apiVersion: jenkins.io/v1alpha2
kind: Jenkins
metadata:
  name: master
  annotations:
    "jenkins.io/use-deployment": 'true'

This will create a Deployment between the Jenkins CR and the Pod.

I found it here

@brokenpip3 brokenpip3 modified the milestones: New API, 0.10 Mar 7, 2023
@github-actions github-actions bot added the stale label May 8, 2023
@github-actions github-actions bot closed this as not planned May 19, 2023
@brokenpip3 brokenpip3 reopened this May 19, 2023
@stale stale bot removed the stale label May 19, 2023
@bogdansuta

Workaround is to add jenkins.io/use-deployment=true annotation to Jenkins CR.

I tried this, but Configuration as Code does not work any more and the seed job worker cannot connect.
