Skip to content
This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

[Intermittent] Workspace fails to start #4626

Closed
ljelinkova opened this issue Dec 10, 2018 · 25 comments
Closed

[Intermittent] Workspace fails to start #4626

ljelinkova opened this issue Dec 10, 2018 · 25 comments

Comments

@ljelinkova
Copy link
Collaborator

The e2e tests have failed multiple times on che tests - the workspace was not started.
For example:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1203/console

04-01-aftereach

Or:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1774/console

04-01-aftereach

This is reported already RH-Che repository: redhat-developer/rh-che#1126

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 11, 2018

@ljelinkova @ppitonak are you still able to reproduce this problem on prod ? we have changed a few che-server configs [1] that should fix the problem. I currently can not reproduce it (Note on prod-preview we have not changed anything)

[1] https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2476

@ljelinkova
Copy link
Collaborator Author

I haven't seen this issue on prod since this morning, however, we need to let more e2e tests to run to make any conclusion.

BTW, we need to fix the prod-prev to allow PR checks for SAAS to pass.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 11, 2018

ok, lemme apply the same config on preview

@ljelinkova
Copy link
Collaborator Author

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 11, 2018

@ljelinkova according to the screenshot it fails to start the bayesian lsp. Could you provide events from *-che namespace ?

@ljelinkova
Copy link
Collaborator Author

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 11, 2018

@rhopp @ljelinkova FYI new config is applied on prod-preview:

image

@rhopp
Copy link
Collaborator

rhopp commented Dec 12, 2018

@ibuziuk I'm afraid new config didn't help... The issues is still happening (for example here, from today's morning run). And I was able to reproduce it pretty often yesterday too.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp could you please provide info about the failure rate ? preferably based on periodic CI jobs. I want to understand how often does it currently happen on prod

@rhopp
Copy link
Collaborator

rhopp commented Dec 12, 2018

@ibuziuk I've checked latest 18 jobs runs (from today morning and yesterday evening).
Out of these 18 job runs, 8 failed on this error.
I've put that into this spreadsheet with links to screenshots. If you need more info (like logs that the tests are saving), just delete last portion of the url.
https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp ~50% rate is pretty high. Most, if not all, are currently happening during bayesian installation.
Also, I'm wondering about error message in e2e test e.g. [1]:

  - Failed: Number of projects in che

What does it mean ?

[1] https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/console

@ppitonak
Copy link
Collaborator

ppitonak commented Dec 12, 2018

It simply means that workspace did not start, for whatever reason. When you scroll to the end of Jenkins log, you will find a link to artifacts.ci.centos.org (http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/ for the job that you linked).

There, have a look on 04-01-che-failed.png which is probably the best clue of what happened. oc-che-logs.txt could also be helpful

@ppitonak
Copy link
Collaborator

BTW, I've just noticed that account are not reset on prod because of UI change. Probably not related to this error because I observed it multiple times in prod-preview.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

It simply means that workspace did not start, for whatever reason

@ppitonak ok, but wouldn't it be better to put Workspace failed to start instead of non-obvious Number of projects in che ?

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp does this issue reproducible equally likely on prod-preview ? Am I correct that those jobs [1] are running against prod ?

[1] https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

@ibuziuk ibuziuk changed the title Che workspace fails to start Che workspace startup fails in ~50% of cases Dec 12, 2018
@ppitonak
Copy link
Collaborator

It simply means that workspace did not start, for whatever reason

@ppitonak ok, but wouldn't it be better to put Workspace failed to start instead of non-obvious Number of projects in che ?

Fair point...I changed that and all new builds will have a more meaningful message.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp I do not see anything obvious currently:

sh-4.2$ cat bootstrapper.log
2018/12/12 11:13:54 Bootstrapper configuration
2018/12/12 11:13:54 Push endpoint: wss://che.openshift.io/api/websocket
2018/12/12 11:13:54 Push logs endpoint: wss://che.openshift.io/api/websocket
2018/12/12 11:13:54 Auth enabled: true
2018/12/12 11:13:54 Runtime ID:
2018/12/12 11:13:54 Workspace: workspacejpufc7fpmxt5thns
2018/12/12 11:13:54 Environment: default
2018/12/12 11:13:54 OwnerId: 9956255a-776f-4c8a-950e-878dc54f871a
2018/12/12 11:13:54 Machine name: dev-machine
2018/12/12 11:13:54 Installer timeout: 180seconds
2018/12/12 11:13:54 Check servers period: 3seconds
2018/12/12 11:13:54 Push logs endpoint reconnect period: 10seconds
2018/12/12 11:13:54 Planning to install
2018/12/12 11:13:54 - com.redhat.oc-login:1.0.0 - Login with oc to user project on OSO
2018/12/12 11:13:54 - com.redhat.bayesian.lsp:1.0.0 - Insights about your application dependencies
2018/12/12 11:13:54 - org.eclipse.che.ws-agent:1.0.3 - Workspace API support
2018/12/12 11:13:54 - org.eclipse.che.terminal:1.0.1 - Embedded web terminal
2018/12/12 11:13:54 - org.eclipse.che.exec:1.0.1 - Agent for command execution
2018/12/12 11:13:54 Starting installations
2018/12/12 11:13:54 Installing 'com.redhat.oc-login'
2018/12/12 11:13:55 Installation of 'com.redhat.oc-login' successfully finished
2018/12/12 11:13:55 Installing 'com.redhat.bayesian.lsp'
2018/12/12 11:13:56 Installation of 'com.redhat.bayesian.lsp' successfully finished
2018/12/12 11:13:56 Installing 'org.eclipse.che.ws-agent'
2018/12/12 11:13:59 Installation of 'org.eclipse.che.ws-agent' successfully finished
2018/12/12 11:13:59 Installing 'org.eclipse.che.terminal'
2018/12/12 11:14:02 Installation of 'org.eclipse.che.terminal' successfully finished
2018/12/12 11:14:02 Installing 'org.eclipse.che.exec'
2018/12/12 11:14:05 Installation of 'org.eclipse.che.exec' successfully finished
2018/12/12 11:14:05 All installations successfully finished

sh-4.2$ cat catalina.log
2018-12-12 11:13:59,005[main] [INFO ] [o.a.c.s.VersionLoggerListener 89] - Server version: Apache Tomcat/8.5.23
2018-12-12 11:13:59,024[main] [INFO ] [o.a.c.s.VersionLoggerListener 91] - Server built: Sep 28 2017 10:30:11 UTC
2018-12-12 11:13:59,027[main] [INFO ] [o.a.c.s.VersionLoggerListener 93] - Server number: 8.5.23.0
2018-12-12 11:13:59,029[main] [INFO ] [o.a.c.s.VersionLoggerListener 95] - OS Name: Linux
2018-12-12 11:13:59,030[main] [INFO ] [o.a.c.s.VersionLoggerListener 97] - OS Version: 3.10.0-862.11.6.el7.x86_64
2018-12-12 11:13:59,031[main] [INFO ] [o.a.c.s.VersionLoggerListener 99] - Architecture: amd64
2018-12-12 11:13:59,032[main] [INFO ] [o.a.c.s.VersionLoggerListener 101] - Java Home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre
2018-12-12 11:13:59,033[main] [INFO ] [o.a.c.s.VersionLoggerListener 103] - JVM Version: 1.8.0_191-b12
2018-12-12 11:13:59,043[main] [INFO ] [o.a.c.s.VersionLoggerListener 105] - JVM Vendor: Oracle Corporation
2018-12-12 11:13:59,044[main] [INFO ] [o.a.c.s.VersionLoggerListener 107] - CATALINA_BASE: /home/user/che/ws-agent
2018-12-12 11:13:59,044[main] [INFO ] [o.a.c.s.VersionLoggerListener 109] - CATALINA_HOME: /home/user/che/ws-agent
2018-12-12 11:13:59,045[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.util.logging.config.file=/home/user/che/ws-agent/conf/logging.properties
2018-12-12 11:13:59,045[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
2018-12-12 11:13:59,046[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxRAM=600m
2018-12-12 11:13:59,046[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxRAMFraction=1
2018-12-12 11:13:59,046[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:+UseParallelGC
2018-12-12 11:13:59,050[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MinHeapFreeRatio=10
2018-12-12 11:13:59,051[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxHeapFreeRatio=20
2018-12-12 11:13:59,051[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:GCTimeRatio=4
2018-12-12 11:13:59,052[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:AdaptiveSizePolicyWeight=90
2018-12-12 11:13:59,054[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dsun.zip.disableMemoryMapping=true
2018-12-12 11:13:59,054[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Xms50m
2018-12-12 11:13:59,055[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dfile.encoding=UTF8
2018-12-12 11:13:59,056[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.security.egd=file:/dev/./urandom
2018-12-12 11:13:59,058[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.logs.dir=/workspace_logs/ws-agent
2018-12-12 11:13:59,060[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.logs.level=INFO
2018-12-12 11:13:59,062[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djuli-logback.configurationFile=file:/home/user/che/ws-agent/conf/tomcat-logger.xml
2018-12-12 11:13:59,063[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djdk.tls.ephemeralDHKeySize=2048
2018-12-12 11:13:59,063[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
2018-12-12 11:13:59,069[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote
2018-12-12 11:13:59,070[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote.ssl=false
2018-12-12 11:13:59,070[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote.authenticate=false
2018-12-12 11:13:59,072[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.local.conf.dir=/home/user/che/ws-agent/conf/
2018-12-12 11:13:59,074[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcatalina.base=/home/user/che/ws-agent
2018-12-12 11:13:59,074[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcatalina.home=/home/user/che/ws-agent
2018-12-12 11:13:59,076[main] [INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.io.tmpdir=/home/user/che/ws-agent/temp
2018-12-12 11:13:59,156[main] [INFO ] [o.a.c.http11.Http11NioProtocol 573] - Initializing ProtocolHandler ["http-nio-4401"]
2018-12-12 11:13:59,185[main] [INFO ] [o.a.t.util.net.NioSelectorPool 67] - Using a shared selector for servlet write/read
2018-12-12 11:13:59,195[main] [INFO ] [o.a.catalina.startup.Catalina 632] - Initialization processed in 504 ms
2018-12-12 11:13:59,224[main] [INFO ] [c.m.JmxRemoteLifecycleListener 336] - The JMX Remote Listener has configured the registry on port [32002] and the server on port [32102] for the [Platform] server
2018-12-12 11:13:59,226[main] [INFO ] [o.a.c.core.StandardService 416] - Starting service [Catalina]
2018-12-12 11:13:59,227[main] [INFO ] [o.a.c.core.StandardEngine 259] - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-12-12 11:13:59,444[ost-startStop-1] [INFO ] [o.a.c.startup.HostConfig 959] - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-12-12 11:14:06,726[ost-startStop-1] [INFO ] [o.a.c.startup.HostConfig 1023] - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [7,281] ms
2018-12-12 11:14:06,737[main] [INFO ] [o.a.c.http11.Http11NioProtocol 632] - Starting ProtocolHandler ["http-nio-4401"]
2018-12-12 11:14:06,758[main] [INFO ] [o.a.catalina.startup.Catalina 683] - Server startup in 756

@rhopp do you have the same logs when workspace is failed to start for you (no obvious errors) ?

2018/12/12 11:13:55 Installing 'com.redhat.bayesian.lsp'
2018/12/12 11:13:56 Installation of 'com.redhat.bayesian.lsp' successfully finished

All installers are processed just fine without obvious errors, but workspace is still in a starting phase untill timeout occurs

@rhopp
Copy link
Collaborator

rhopp commented Dec 12, 2018

@ibuziuk That's exactly what I'm seeing and that's what perplexes me for last few days (and what I was trying to say on daily standups for last few days :-D)

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 12, 2018

@rhopp @ppitonak @ljelinkova hot-fix [1] has been rollout to prod. Would be really useful to get info about startup failure similar to what was provided today - https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

[1] redhat-developer/rh-che#1140

P.S. after update I was not able to reproduce so far

@rhopp
Copy link
Collaborator

rhopp commented Dec 13, 2018

@ibuziuk So far, so good. I've checked all the runs that were executed today (since midnight). Out of 10 runs, none of them failed on this bug.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 13, 2018

@rhopp @ljelinkova @ppitonak if the situation is stabilized, please remove P1 / SEV1 / e2e labels.

@ibuziuk ibuziuk changed the title Che workspace startup fails in ~50% of cases Intermittent - Workspace fails to start Dec 13, 2018
@ibuziuk ibuziuk changed the title Intermittent - Workspace fails to start [Intermittent] Workspace fails to start Dec 13, 2018
@ppitonak
Copy link
Collaborator

Is the redhat-developer/rh-che#1140 the final fix or are there any follow-up issues? I would prefer closing this issue and track the rest in more specific issues.

@ibuziuk
Copy link
Collaborator

ibuziuk commented Dec 13, 2018

The issue is very likely related to the fact, that currently each log entry from workspace installers is pushed to Che master via json rpc, which creates significant load and results in the fact, that che master can not create yet another native thread. This leads to the failure during workspace startup when logs from installers can not be pushed to master anymore. There is already an issue opened in the upstream for moving endpoint for workspace output processing from Che master to a separate container, but since it is an architectural issue the solution is not trivial and can not be applied in the short term.

Here is the list of issue which could cause the problem:

@ppitonak
Copy link
Collaborator

Thanks for the list of issues. Closing the issue now and will reopen when it appears again.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants