Skip to content
This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Jenkins is not unidled for 1a cluster user #4378

Closed
ljelinkova opened this issue Oct 2, 2018 · 23 comments
Closed

Jenkins is not unidled for 1a cluster user #4378

ljelinkova opened this issue Oct 2, 2018 · 23 comments

Comments

@ljelinkova
Copy link
Collaborator

ljelinkova commented Oct 2, 2018

Issue Overview

For our both user accounts on 1a cluster, Jenkins is stuck in idled state and even the reset of environment does not help.

Expected Behaviour

The Jenkins is unidled when pipeline starts.

Current Behaviour

Jenkins is not unidled when pipeline starts. Reset of environment does not help.

Additional Information

This happens only for our 1a accounts. I do not have any personal account on 1a cluster so I cannot verify if this affects all accounts.

@ljelinkova
Copy link
Collaborator Author

It seems that Jenkins tries to start but does not pass the readiness probe on time.

jenkins-log

@jfchevrette
Copy link
Contributor

There is no "jenkins-home" PVC under that namespace, which explains why the jenkins pod refuses to start.

@chmouel
Copy link

chmouel commented Oct 2, 2018

downgradding to SEV2 since that can only occur for a subset of users in sev1

@ppitonak
Copy link
Collaborator

ppitonak commented Oct 2, 2018

Who is responsible for creating this PVC?

@sthaha
Copy link
Collaborator

sthaha commented Oct 2, 2018

@ppitonak it must be created as a part of init-tenant so I would say fabric8-tenant-service - https://github.com/fabric8-services/fabric8-tenant.

I am curious to know if we have any logs about tenant initialisation for this namespace.

@rhopp
Copy link
Collaborator

rhopp commented Oct 2, 2018

There is no "jenkins-home" PVC under that namespace, which explains why the jenkins pod refuses to start.

@jfchevrette I don't think it's solely because of that...
For example with my account rhopp-osiotest2 (id 436a9797-e818-4697-b5d2-648c2068645e), when I do env reset, the pvc is created, but jenkins deployment fails - pod is created, PVC mounted, I can also observe pod logs (see, that jenkins IS starting), but I think, that the start takes so long, that the deployment fails.

LOGS:
╭─rhopp@rhopp ~  
╰─$ oc project rhopp-osiotest2-jenkins
Now using project "rhopp-osiotest2-jenkins" on server "https://api.starter-us-east-1a.openshift.com:443".

╭─rhopp@rhopp ~  
╰─$ oc get pvc
NAME           STATUS    VOLUME                                       CAPACITY   ACCESS MODES   STORAGECLASS     AGE
jenkins-home   Bound     afbf9112-844c-4d45-b0f6-f413cd009486-02-2e   1Gi        RWO            gluster-subvol   34s

╭─rhopp@rhopp ~  
╰─$ oc get pods
NAME               READY     STATUS    RESTARTS   AGE
jenkins-1-2p5c2    0/1       Running   0          3m
jenkins-1-deploy   1/1       Running   0          3m

╭─rhopp@rhopp ~  
╰─$ oc logs jenkins-1-2p5c2  # THIS WAS DONE 5 MINUTES AFTER THE POD WAS CREATED
OPENSHIFT_JENKINS_JVM_ARCH is set to i686 so using 32 bit Java
Oct 02, 2018 02:36:22 PM INFO: Using temp directory /var/lib/jenkins/tmp.05OoUaw0dC for generating configurations
Oct 02, 2018 02:36:54 PM INFO: Generating kubernetes-plugin configuration (/opt/openshift/configuration/config.xml.tpl)
Oct 02, 2018 02:36:55 PM INFO: Generating kubernetes-plugin credentials (credentials.xml) ...
Oct 02, 2018 02:36:55 PM INFO: Copying Jenkins configuration to /var/lib/jenkins
Oct 02, 2018 02:36:56 PM INFO: Removing template files (*.tpl) from /var/lib/jenkins
Oct 02, 2018 02:36:57 PM INFO: Removed temp directory /var/lib/jenkins/tmp.05OoUaw0dC
plugins directory not found copy plugins
Copying 141 Jenkins plugins to /var/lib/jenkins ...
Creating initial Jenkins 'admin' user ...
sed: can't read /var/lib/jenkins/users/admin/config.xml: No such file or directory
Running from: /usr/lib/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
Oct 02, 2018 2:37:41 PM Main deleteWinstoneTempContents
WARNING: Failed to delete the temporary Winstone file /tmp/winstone/jenkins.war
Oct 02, 2018 2:37:41 PM org.eclipse.jetty.util.log.Log initialized
INFO: Logging initialized @2370ms to org.eclipse.jetty.util.log.JavaUtilLog
Oct 02, 2018 2:37:42 PM winstone.Logger logInternal
INFO: Beginning extraction from war file

╭─rhopp@rhopp ~  
╰─$ oc get ev              
LAST SEEN   FIRST SEEN   COUNT     NAME                                KIND                    SUBOBJECT                     TYPE      REASON                  SOURCE                                   MESSAGE
6m          6m           1         jenkins-1-2p5c2.1559d1607e21b5e6    Pod                                                   Normal    Scheduled               default-scheduler                        Successfully assigned jenkins-1-2p5c2 to ip-172-21-61-24.ec2.internal
6m          6m           1         jenkins-1-2p5c2.1559d160969548cd    Pod                                                   Normal    SuccessfulMountVolume   kubelet, ip-172-21-61-24.ec2.internal    MountVolume.SetUp succeeded for volume "jenkins-token-2jxbj" 
6m          6m           1         jenkins-1-2p5c2.1559d1609d64e2e9    Pod                                                   Normal    SuccessfulMountVolume   kubelet, ip-172-21-61-24.ec2.internal    MountVolume.SetUp succeeded for volume "jenkins-config" 
6m          6m           1         jenkins-1-2p5c2.1559d160a7b1e117    Pod                                                   Normal    SuccessfulMountVolume   kubelet, ip-172-21-61-24.ec2.internal    MountVolume.SetUp succeeded for volume "afbf9112-844c-4d45-b0f6-f413cd009486-02-2e" 
6m          6m           1         jenkins-1-2p5c2.1559d16165388af2    Pod                     spec.containers{jenkins}      Normal    Pulling                 kubelet, ip-172-21-61-24.ec2.internal    pulling image "fabric8/jenkins-openshift:v7e8de2d"
6m          6m           1         jenkins-1-2p5c2.1559d164057d26ef    Pod                     spec.containers{jenkins}      Normal    Pulled                  kubelet, ip-172-21-61-24.ec2.internal    Successfully pulled image "fabric8/jenkins-openshift:v7e8de2d"
6m          6m           1         jenkins-1-2p5c2.1559d1640c8472a2    Pod                     spec.containers{jenkins}      Normal    Created                 kubelet, ip-172-21-61-24.ec2.internal    Created container
6m          6m           1         jenkins-1-2p5c2.1559d16437e02c92    Pod                     spec.containers{jenkins}      Normal    Started                 kubelet, ip-172-21-61-24.ec2.internal    Started container
45s         5m           19        jenkins-1-2p5c2.1559d16f4ad7a423    Pod                     spec.containers{jenkins}      Warning   Unhealthy               kubelet, ip-172-21-61-24.ec2.internal    Readiness probe failed: Get http://10.131.16.23:8080/login: dial tcp 10.131.16.23:8080: getsockopt: connection refused
6m          6m           1         jenkins-1-deploy.1559d15fc7227efe   Pod                                                   Normal    Scheduled               default-scheduler                        Successfully assigned jenkins-1-deploy to ip-172-21-61-157.ec2.internal
6m          6m           1         jenkins-1-deploy.1559d15fd5b87169   Pod                                                   Normal    SuccessfulMountVolume   kubelet, ip-172-21-61-157.ec2.internal   MountVolume.SetUp succeeded for volume "deployer-token-nrqkl" 
6m          6m           1         jenkins-1-deploy.1559d16049c0944f   Pod                     spec.containers{deployment}   Normal    Pulling                 kubelet, ip-172-21-61-157.ec2.internal   pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-deployer:v3.10.9"
6m          6m           1         jenkins-1-deploy.1559d1605bcd18da   Pod                     spec.containers{deployment}   Normal    Pulled                  kubelet, ip-172-21-61-157.ec2.internal   Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/ose-deployer:v3.10.9"
6m          6m           1         jenkins-1-deploy.1559d1605f4dc589   Pod                     spec.containers{deployment}   Normal    Created                 kubelet, ip-172-21-61-157.ec2.internal   Created container
6m          6m           1         jenkins-1-deploy.1559d1606dc85787   Pod                     spec.containers{deployment}   Normal    Started                 kubelet, ip-172-21-61-157.ec2.internal   Started container
6m          6m           1         jenkins-1.1559d1607d7b0b5b          ReplicationController                                 Normal    SuccessfulCreate        replication-controller                   Created pod: jenkins-1-2p5c2
6m          6m           1         jenkins.1559d15fc426765b            DeploymentConfig                                      Normal    DeploymentCreated       deploymentconfig-controller              Created new replication controller "jenkins-1" for version 1

╭─rhopp@rhopp ~  
╰─$ oc logs jenkins-1-2p5c2  # THIS WAS DONE 8 MINUTES AFTER THE POD WAS CREATED
OPENSHIFT_JENKINS_JVM_ARCH is set to i686 so using 32 bit Java
Oct 02, 2018 02:36:22 PM INFO: Using temp directory /var/lib/jenkins/tmp.05OoUaw0dC for generating configurations
Oct 02, 2018 02:36:54 PM INFO: Generating kubernetes-plugin configuration (/opt/openshift/configuration/config.xml.tpl)
Oct 02, 2018 02:36:55 PM INFO: Generating kubernetes-plugin credentials (credentials.xml) ...
Oct 02, 2018 02:36:55 PM INFO: Copying Jenkins configuration to /var/lib/jenkins
Oct 02, 2018 02:36:56 PM INFO: Removing template files (*.tpl) from /var/lib/jenkins
Oct 02, 2018 02:36:57 PM INFO: Removed temp directory /var/lib/jenkins/tmp.05OoUaw0dC
plugins directory not found copy plugins
Copying 141 Jenkins plugins to /var/lib/jenkins ...
Creating initial Jenkins 'admin' user ...
sed: can't read /var/lib/jenkins/users/admin/config.xml: No such file or directory
Running from: /usr/lib/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
Oct 02, 2018 2:37:41 PM Main deleteWinstoneTempContents
WARNING: Failed to delete the temporary Winstone file /tmp/winstone/jenkins.war
Oct 02, 2018 2:37:41 PM org.eclipse.jetty.util.log.Log initialized
INFO: Logging initialized @2370ms to org.eclipse.jetty.util.log.JavaUtilLog
Oct 02, 2018 2:37:42 PM winstone.Logger logInternal
INFO: Beginning extraction from war file
Oct 02, 2018 2:43:02 PM org.eclipse.jetty.server.handler.ContextHandler setContextPath
WARNING: Empty contextPath
Oct 02, 2018 2:43:03 PM org.eclipse.jetty.server.Server doStart
INFO: jetty-9.4.z-SNAPSHOT
Oct 02, 2018 2:43:20 PM org.eclipse.jetty.webapp.StandardDescriptorProcessor visitServlet
INFO: NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet
Oct 02, 2018 2:43:21 PM org.eclipse.jetty.server.session.DefaultSessionIdManager doStart
INFO: DefaultSessionIdManager workerName=node0
Oct 02, 2018 2:43:21 PM org.eclipse.jetty.server.session.DefaultSessionIdManager doStart
INFO: No SessionScavenger set, using defaults
Oct 02, 2018 2:43:21 PM org.eclipse.jetty.server.session.HouseKeeper startScavenging
INFO: Scavenging every 600000ms
Jenkins home directory: /var/lib/jenkins found at: EnvVars.masterEnvVars.get("JENKINS_HOME")
Oct 02, 2018 2:43:46 PM org.eclipse.jetty.server.handler.ContextHandler doStart
INFO: Started w.@1e61854{/,file:///var/lib/jenkins/war/,AVAILABLE}{/var/lib/jenkins/war}
Oct 02, 2018 2:43:46 PM org.eclipse.jetty.server.AbstractConnector doStart
INFO: Started ServerConnector@1f417a7{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
Oct 02, 2018 2:43:46 PM org.eclipse.jetty.server.Server doStart
INFO: Started @367504ms
Oct 02, 2018 2:43:46 PM winstone.Logger logInternal
INFO: Winstone Servlet Engine v4.0 running: controlPort=disabled
Oct 02, 2018 2:44:14 PM jenkins.InitReactorRunner$1 onAttained
INFO: Started initialization

Nothing more is added to the log, until the deployment times out, because readinnes probe didn't succeed.

@chmouel
Copy link

chmouel commented Oct 2, 2018

Thanks for the investigation @rhopp,

@jfchevrette where can we look to see the gluster IO sisue ? this is IO or networking who would shows this,

@sthaha
Copy link
Collaborator

sthaha commented Oct 2, 2018

sed: can't read /var/lib/jenkins/users/admin/config.xml: No such file or directory
Running from: /usr/lib/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
Oct 02, 2018 2:37:41 PM Main deleteWinstoneTempContents
WARNING: Failed to delete the temporary Winstone file /tmp/winstone/jenkins.war
Oct 02, 2018 2:37:41 PM org.eclipse.jetty.util.log.Log initialized
INFO: Logging initialized @2370ms to org.eclipse.jetty.util.log.JavaUtilLog
Oct 02, 2018 2:37:42 PM winstone.Logger logInternal
INFO: Beginning extraction from war file
Oct 02, 2018 2:43:02 PM org.eclipse.jetty.server.handler.ContextHandler setContextPath
WARNING: Empty contextPath
Oct 02, 2018 2:43:03 PM org.eclipse.jetty.server.Server doStart
INFO: jetty-9.4.z-SNAPSHOT

corresponds to https://github.com/fabric8-jenkins/jenkins-openshift-base/blob/5ff927d8786e9e43ed629236e932fe034e93958d/2/contrib/s2i/run#L308
and java -jar /usr/lib/jenkins/jenkins.war (redacted) seems to take a long time extracting the war file which to me seems like I/O issue (perhaps glusterfs)

@rhopp
Copy link
Collaborator

rhopp commented Oct 2, 2018

Gluster monitoring in zabbix:
cluster 1a: https://zabbix.devshift.net:9443/zabbix/screens.php?elementid=55
cluster 2 (for comparison): https://zabbix.devshift.net:9443/zabbix/screens.php?elementid=65

@ppitonak
Copy link
Collaborator

ppitonak commented Oct 3, 2018

@chmouel why did you decrease severity? See the definition of SEV-1 from https://access.redhat.com/support/policy/severity

Severity 1 (urgent)

A problem that severely impacts your use of the software in a production environment (such as loss of production data or in which your production systems are not functioning). The situation halts your business operations and no procedural workaround exists.

For at least three users the production system is not functioning and no workaround exists

@chmouel
Copy link

chmouel commented Oct 3, 2018

@sthaha wasn't there some optimisations done by the che team for the slow gluster io ? maybe we can see if they can fit ?

@l0rd @ibuziuk do you have some advice maybe of what you have found ?

@chmouel
Copy link

chmouel commented Oct 3, 2018

@ppitonak I assume it's because it's only affects 1a and not everyone, I don't mind really having it in SEV-1 if you want it as is :)

@l0rd
Copy link
Collaborator

l0rd commented Oct 3, 2018

@chmouel if you refer to the problem with volumes with more than 20k files this is the PR that @benoitf has created. The fix should be available in 3.11. I don't know when it will be available on OSO honestly.

@ppitonak
Copy link
Collaborator

ppitonak commented Oct 3, 2018

It seems that there are multiple issues here... should I log persistentvolumeclaim "jenkins-home" not found error as a new issue?

@chmouel
Copy link

chmouel commented Oct 3, 2018

@ppitonak good idea,

@chmouel
Copy link

chmouel commented Oct 3, 2018

@l0rd i guess that's would improve the volume mounting slowness but unfortunately not the slow IO writes on gluster,

@rhopp
Copy link
Collaborator

rhopp commented Oct 3, 2018

@l0rd @chmouel This issue is unfortunately completely different from the che slow mount problem. In this issue, the mount is pretty quick, because the PVC is empty after env reset (unlike the che mount problem, where the PVC is populated with lot of files)

@sthaha
Copy link
Collaborator

sthaha commented Oct 4, 2018

@jmelis I am assigning this to you (need a point of contact) since does not seems like an idler or a jenkins issue but some I/O bound problem .

@hrishin
Copy link

hrishin commented Oct 4, 2018

May be related #4121

@ppitonak
Copy link
Collaborator

ppitonak commented Oct 4, 2018

I see the following event in OpenShift console for account osio-ci-e2e-001-preview on prod-preview (together with PVC jenkins-home not found)

Error creating: pods "jenkins-1-g7lbq" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

BTW, is this resource quota shared with Che workspaces?

@jfchevrette
Copy link
Contributor

The original issue was about jenkins not coming up in the osio-ci-e2e-001-jenkins namespace. The jenkins pod sometimes do come up after about 10 minutes, and sometimes will time out. That appears to be due to a known storage issue which is being worked on. We expect the fix to be deployed within the next few weeks.

Ref. https://bugzilla.redhat.com/show_bug.cgi?id=1631834
Upstream PR: kubernetes/kubernetes#68680

The reason why the jenkins-home was not present is still unknown to me. One theory is that the account may have been reset while it was still being mounted and as such caused the initialization to fail to create a new PVC because the previous one was still stuck.

If there are other issues observed in other namespaces in other areas such as quotas, please open a separate issue so we can troubleshoot and track efficiently.

@xyntrix
Copy link

xyntrix commented Oct 8, 2018

This issue is a bit all over the place. Please provide:

  • A clear summary of the symptoms (rather than interpretation of the root cause)
  • A list of accounts that are experiencing those symptoms
  • What actions were performed, to experience those symptoms.

Based on what we have observed so far, there are certainly pvc mount failures in the event log for a test account or two -- but these are related to a race condition between performing a tenant reset and attempting to spin up new resources. Tenant reset and adding objects are most certainly not synchronous steps. Having the above data will help clarify what the actual issues are here. It is not clear that this should be prioritized as a SEV1 given the current lack of information -- rather, this looks more like a SEV3 as an entire business unit is not halted.

@ppitonak
Copy link
Collaborator

Unidling is no longer the only issue in 1a cluster. Closing in favor of #4215 and #4121

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests