[Intermittent] Workspace fails to start #4626

ljelinkova · 2018-12-10T12:50:04Z

The e2e tests have failed multiple times on che tests - the workspace was not started.
For example:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1203/console

Or:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1774/console

This is reported already RH-Che repository: redhat-developer/rh-che#1126

ibuziuk · 2018-12-11T09:55:05Z

@ljelinkova @ppitonak are you still able to reproduce this problem on prod ? we have changed a few che-server configs [1] that should fix the problem. I currently can not reproduce it (Note on prod-preview we have not changed anything)

[1] https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2476

ljelinkova · 2018-12-11T10:40:49Z

I haven't seen this issue on prod since this morning, however, we need to let more e2e tests to run to make any conclusion.

BTW, we need to fix the prod-prev to allow PR checks for SAAS to pass.

ibuziuk · 2018-12-11T11:35:17Z

ok, lemme apply the same config on preview

ljelinkova · 2018-12-11T11:39:05Z

I've just seen this failed che test in production
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1210/04-01-che-failed.png

ibuziuk · 2018-12-11T12:06:52Z

@ljelinkova according to the screenshot it fails to start the bayesian lsp. Could you provide events from *-che namespace ?

ljelinkova · 2018-12-11T12:09:36Z

These are the logs:

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1210/oc-che-logs.txt

@rhopp is also investigating this issue

ibuziuk · 2018-12-11T12:58:44Z

@rhopp @ljelinkova FYI new config is applied on prod-preview:

rhopp · 2018-12-12T07:33:39Z

@ibuziuk I'm afraid new config didn't help... The issues is still happening (for example here, from today's morning run). And I was able to reproduce it pretty often yesterday too.

ibuziuk · 2018-12-12T09:07:29Z

@rhopp could you please provide info about the failure rate ? preferably based on periodic CI jobs. I want to understand how often does it currently happen on prod

rhopp · 2018-12-12T09:26:04Z

@ibuziuk I've checked latest 18 jobs runs (from today morning and yesterday evening).
Out of these 18 job runs, 8 failed on this error.
I've put that into this spreadsheet with links to screenshots. If you need more info (like logs that the tests are saving), just delete last portion of the url.
https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

ibuziuk · 2018-12-12T09:44:42Z

@rhopp ~50% rate is pretty high. Most, if not all, are currently happening during bayesian installation.
Also, I'm wondering about error message in e2e test e.g. [1]:

  - Failed: Number of projects in che

What does it mean ?

[1] https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/console

ppitonak · 2018-12-12T09:45:12Z

It simply means that workspace did not start, for whatever reason. When you scroll to the end of Jenkins log, you will find a link to artifacts.ci.centos.org (http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/ for the job that you linked).

There, have a look on 04-01-che-failed.png which is probably the best clue of what happened. oc-che-logs.txt could also be helpful

ppitonak · 2018-12-12T09:50:46Z

BTW, I've just noticed that account are not reset on prod because of UI change. Probably not related to this error because I observed it multiple times in prod-preview.

ibuziuk · 2018-12-12T09:51:28Z

It simply means that workspace did not start, for whatever reason

@ppitonak ok, but wouldn't it be better to put Workspace failed to start instead of non-obvious Number of projects in che ?

ibuziuk · 2018-12-12T10:19:29Z

@rhopp does this issue reproducible equally likely on prod-preview ? Am I correct that those jobs [1] are running against prod ?

[1] https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

rhopp · 2018-12-12T10:44:09Z

@ibuziuk Today runs agains prod-preview (2 jobs, running every 2 hours). 20 runs, 4 failures on this error
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4412/04-01-che-failed.png
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4408/04-01-che-failed.png
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4410/04-01-che-failed.png
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4401/04-01-che-failed.png

It fails most often on bayesian (if not every time) maybe because when this installer is starting, it is writing most stuff to the standard output ?

ppitonak · 2018-12-12T11:18:35Z

It simply means that workspace did not start, for whatever reason

@ppitonak ok, but wouldn't it be better to put Workspace failed to start instead of non-obvious Number of projects in che ?

Fair point...I changed that and all new builds will have a more meaningful message.

ibuziuk · 2018-12-12T11:21:32Z

@rhopp I do not see anything obvious currently:

sh-4.2$ cat bootstrapper.log
2018/12/12 11:13:54 Bootstrapper configuration
2018/12/12 11:13:54 Push endpoint: wss://che.openshift.io/api/websocket
2018/12/12 11:13:54 Push logs endpoint: wss://che.openshift.io/api/websocket
2018/12/12 11:13:54 Auth enabled: true
2018/12/12 11:13:54 Runtime ID:
2018/12/12 11:13:54 Workspace: workspacejpufc7fpmxt5thns
2018/12/12 11:13:54 Environment: default
2018/12/12 11:13:54 OwnerId: 9956255a-776f-4c8a-950e-878dc54f871a
2018/12/12 11:13:54 Machine name: dev-machine
2018/12/12 11:13:54 Installer timeout: 180seconds
2018/12/12 11:13:54 Check servers period: 3seconds
2018/12/12 11:13:54 Push logs endpoint reconnect period: 10seconds
2018/12/12 11:13:54 Planning to install
2018/12/12 11:13:54 - com.redhat.oc-login:1.0.0 - Login with oc to user project on OSO
2018/12/12 11:13:54 - com.redhat.bayesian.lsp:1.0.0 - Insights about your application dependencies
2018/12/12 11:13:54 - org.eclipse.che.ws-agent:1.0.3 - Workspace API support
2018/12/12 11:13:54 - org.eclipse.che.terminal:1.0.1 - Embedded web terminal
2018/12/12 11:13:54 - org.eclipse.che.exec:1.0.1 - Agent for command execution
2018/12/12 11:13:54 Starting installations
2018/12/12 11:13:54 Installing 'com.redhat.oc-login'
2018/12/12 11:13:55 Installation of 'com.redhat.oc-login' successfully finished
2018/12/12 11:13:55 Installing 'com.redhat.bayesian.lsp'
2018/12/12 11:13:56 Installation of 'com.redhat.bayesian.lsp' successfully finished
2018/12/12 11:13:56 Installing 'org.eclipse.che.ws-agent'
2018/12/12 11:13:59 Installation of 'org.eclipse.che.ws-agent' successfully finished
2018/12/12 11:13:59 Installing 'org.eclipse.che.terminal'
2018/12/12 11:14:02 Installation of 'org.eclipse.che.terminal' successfully finished
2018/12/12 11:14:02 Installing 'org.eclipse.che.exec'
2018/12/12 11:14:05 Installation of 'org.eclipse.che.exec' successfully finished
2018/12/12 11:14:05 All installations successfully finished

sh-4.2$ cat catalina.log
2018-12-12 11:13:59,005[main] 2018-12-12 11:13:59,024[main] 2018-12-12 11:13:59,027[main] 2018-12-12 11:13:59,029[main] 2018-12-12 11:13:59,030[main] 2018-12-12 11:13:59,031[main] 2018-12-12 11:13:59,032[main] 2018-12-12 11:13:59,033[main] 2018-12-12 11:13:59,043[main] 2018-12-12 11:13:59,044[main] 2018-12-12 11:13:59,044[main] 2018-12-12 11:13:59,045[main] 2018-12-12 11:13:59,045[main] 2018-12-12 11:13:59,046[main] 2018-12-12 11:13:59,046[main] 2018-12-12 11:13:59,046[main] 2018-12-12 11:13:59,050[main] 2018-12-12 11:13:59,051[main] 2018-12-12 11:13:59,051[main] 2018-12-12 11:13:59,052[main] 2018-12-12 11:13:59,054[main] 2018-12-12 11:13:59,054[main] 2018-12-12 11:13:59,055[main] 2018-12-12 11:13:59,056[main] 2018-12-12 11:13:59,058[main] 2018-12-12 11:13:59,060[main] 2018-12-12 11:13:59,062[main] 2018-12-12 11:13:59,063[main] 2018-12-12 11:13:59,063[main] 2018-12-12 11:13:59,069[main] 2018-12-12 11:13:59,070[main] 2018-12-12 11:13:59,070[main] 2018-12-12 11:13:59,072[main] 2018-12-12 11:13:59,074[main] 2018-12-12 11:13:59,074[main] 2018-12-12 11:13:59,076[main] 2018-12-12 11:13:59,156[main] 2018-12-12 11:13:59,185[main] 2018-12-12 11:13:59,195[main] 2018-12-12 11:13:59,224[main] 2018-12-12 11:13:59,226[main] 2018-12-12 11:13:59,227[main] 2018-12-12 11:13:59,444[ost-startStop-1] 2018-12-12 11:14:06,726[ost-startStop-1] 2018-12-12 11:14:06,737[main] 2018-12-12 11:14:06,758[main] [INFO ] [o.a.c.s.VersionLoggerListener 89] - Server version: Apache Tomcat/8.5.23
[INFO ] [o.a.c.s.VersionLoggerListener 91] - Server built: Sep 28 2017 10:30:11 UTC
[INFO ] [o.a.c.s.VersionLoggerListener 93] - Server number: 8.5.23.0
[INFO ] [o.a.c.s.VersionLoggerListener 95] - OS Name: Linux
[INFO ] [o.a.c.s.VersionLoggerListener 97] - OS Version: 3.10.0-862.11.6.el7.x86_64
[INFO ] [o.a.c.s.VersionLoggerListener 99] - Architecture: amd64
[INFO ] [o.a.c.s.VersionLoggerListener 101] - Java Home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre
[INFO ] [o.a.c.s.VersionLoggerListener 103] - JVM Version: 1.8.0_191-b12
[INFO ] [o.a.c.s.VersionLoggerListener 105] - JVM Vendor: Oracle Corporation
[INFO ] [o.a.c.s.VersionLoggerListener 107] - CATALINA_BASE: /home/user/che/ws-agent
[INFO ] [o.a.c.s.VersionLoggerListener 109] - CATALINA_HOME: /home/user/che/ws-agent
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.util.logging.config.file=/home/user/che/ws-agent/conf/logging.properties
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxRAM=600m
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxRAMFraction=1
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:+UseParallelGC
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MinHeapFreeRatio=10
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:MaxHeapFreeRatio=20
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:GCTimeRatio=4
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -XX:AdaptiveSizePolicyWeight=90
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dsun.zip.disableMemoryMapping=true
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Xms50m
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dfile.encoding=UTF8
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.security.egd=file:/dev/./urandom
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.logs.dir=/workspace_logs/ws-agent
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.logs.level=INFO
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djuli-logback.configurationFile=file:/home/user/che/ws-agent/conf/tomcat-logger.xml
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djdk.tls.ephemeralDHKeySize=2048
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.protocol.handler.pkgs=org.apache.catalina.webresources
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote.ssl=false
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcom.sun.management.jmxremote.authenticate=false
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dche.local.conf.dir=/home/user/che/ws-agent/conf/
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcatalina.base=/home/user/che/ws-agent
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Dcatalina.home=/home/user/che/ws-agent
[INFO ] [o.a.c.s.VersionLoggerListener 115] - Command line argument: -Djava.io.tmpdir=/home/user/che/ws-agent/temp
[INFO ] [o.a.c.http11.Http11NioProtocol 573] - Initializing ProtocolHandler ["http-nio-4401"]
[INFO ] [o.a.t.util.net.NioSelectorPool 67] - Using a shared selector for servlet write/read
[INFO ] [o.a.catalina.startup.Catalina 632] - Initialization processed in 504 ms
[INFO ] [c.m.JmxRemoteLifecycleListener 336] - The JMX Remote Listener has configured the registry on port [32002] and the server on port [32102] for the [Platform] server
[INFO ] [o.a.c.core.StandardService 416] - Starting service [Catalina]
[INFO ] [o.a.c.core.StandardEngine 259] - Starting Servlet Engine: Apache Tomcat/8.5.23
[INFO ] [o.a.c.startup.HostConfig 959] - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
[INFO ] [o.a.c.startup.HostConfig 1023] - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [7,281] ms
[INFO ] [o.a.c.http11.Http11NioProtocol 632] - Starting ProtocolHandler ["http-nio-4401"]
[INFO ] [o.a.catalina.startup.Catalina 683] - Server startup in 756

@rhopp do you have the same logs when workspace is failed to start for you (no obvious errors) ?

2018/12/12 11:13:55 Installing 'com.redhat.bayesian.lsp'
2018/12/12 11:13:56 Installation of 'com.redhat.bayesian.lsp' successfully finished

All installers are processed just fine without obvious errors, but workspace is still in a starting phase untill timeout occurs

rhopp · 2018-12-12T11:31:09Z

@ibuziuk That's exactly what I'm seeing and that's what perplexes me for last few days (and what I was trying to say on daily standups for last few days :-D)

ibuziuk · 2018-12-12T19:18:05Z

@rhopp @ppitonak @ljelinkova hot-fix [1] has been rollout to prod. Would be really useful to get info about startup failure similar to what was provided today - https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing

[1] redhat-developer/rh-che#1140

P.S. after update I was not able to reproduce so far

rhopp · 2018-12-13T07:56:04Z

@ibuziuk So far, so good. I've checked all the runs that were executed today (since midnight). Out of 10 runs, none of them failed on this bug.

ibuziuk · 2018-12-13T13:55:53Z

@rhopp @ljelinkova @ppitonak if the situation is stabilized, please remove P1 / SEV1 / e2e labels.

ppitonak · 2018-12-13T14:40:07Z

Is the redhat-developer/rh-che#1140 the final fix or are there any follow-up issues? I would prefer closing this issue and track the rest in more specific issues.

ibuziuk · 2018-12-13T14:50:43Z

The issue is very likely related to the fact, that currently each log entry from workspace installers is pushed to Che master via json rpc, which creates significant load and results in the fact, that che master can not create yet another native thread. This leads to the failure during workspace startup when logs from installers can not be pushed to master anymore. There is already an issue opened in the upstream for moving endpoint for workspace output processing from Che master to a separate container, but since it is an architectural issue the solution is not trivial and can not be applied in the short term.

Here is the list of issue which could cause the problem:

Move output endpoint from Che master container to a separate one Move output endpoint from Che master container to a separate one eclipse-che/che#11459
Che Server rejects JSON RPC messages which may lead to losing events Che Server rejects JSON RPC messages which may lead to losing events eclipse-che/che#11961
Potential memory leak in ProbeScheduler Potential memory leak in ProbeScheduler eclipse-che/che#12187

ppitonak · 2018-12-13T14:54:38Z

Thanks for the list of issues. Closing the issue now and will reopen when it appears again.

ljelinkova added area/che team/che SEV1-urgent priority/P1 Critical labels Dec 10, 2018

ibuziuk self-assigned this Dec 10, 2018

ibuziuk added the team/che/osio label Dec 10, 2018

ljelinkova mentioned this issue Dec 10, 2018

Che workspace fails to start due to 'Unauthorized!' error #4622

Closed

ibuziuk mentioned this issue Dec 11, 2018

Intermittent - Workspaces fail to start on prod redhat-developer/rh-che#1126

Closed

1 task

ppitonak mentioned this issue Dec 11, 2018

Updating rh-che to the latest version on prod openshiftio/saas-openshiftio#1160

Closed

ibuziuk added the status/pending-qa-verification label Dec 11, 2018

ppitonak added the area/e2e-tests label Dec 11, 2018

ibuziuk changed the title ~~Che workspace fails to start~~ Che workspace startup fails in ~50% of cases Dec 12, 2018

ibuziuk added env/prod-preview env/prod labels Dec 12, 2018

ibuziuk removed the status/pending-qa-verification label Dec 12, 2018

ibuziuk mentioned this issue Dec 12, 2018

Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix redhat-developer/osd-monitor-poc#43

Closed

ibuziuk mentioned this issue Dec 13, 2018

Slow mount of PVC for che workspaces - can prevent workspaces from starting. #4079

Closed

ibuziuk changed the title ~~Che workspace startup fails in ~50% of cases~~ Intermittent - Workspace fails to start Dec 13, 2018

ibuziuk changed the title ~~Intermittent - Workspace fails to start~~ [Intermittent] Workspace fails to start Dec 13, 2018

ppitonak closed this as completed Dec 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Intermittent] Workspace fails to start #4626

[Intermittent] Workspace fails to start #4626

ljelinkova commented Dec 10, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

ppitonak commented Dec 12, 2018 •

edited

Loading

ppitonak commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

rhopp commented Dec 12, 2018

ppitonak commented Dec 12, 2018

ibuziuk commented Dec 12, 2018 •

edited

Loading

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018 •

edited

Loading

rhopp commented Dec 13, 2018

ibuziuk commented Dec 13, 2018

ppitonak commented Dec 13, 2018

ibuziuk commented Dec 13, 2018

ppitonak commented Dec 13, 2018

[Intermittent] Workspace fails to start #4626

[Intermittent] Workspace fails to start #4626

Comments

ljelinkova commented Dec 10, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

ljelinkova commented Dec 11, 2018

ibuziuk commented Dec 11, 2018

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

ppitonak commented Dec 12, 2018 • edited Loading

ppitonak commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

ibuziuk commented Dec 12, 2018

rhopp commented Dec 12, 2018

ppitonak commented Dec 12, 2018

ibuziuk commented Dec 12, 2018 • edited Loading

rhopp commented Dec 12, 2018

ibuziuk commented Dec 12, 2018 • edited Loading

rhopp commented Dec 13, 2018

ibuziuk commented Dec 13, 2018

ppitonak commented Dec 13, 2018

ibuziuk commented Dec 13, 2018

ppitonak commented Dec 13, 2018

ppitonak commented Dec 12, 2018 •

edited

Loading

ibuziuk commented Dec 12, 2018 •

edited

Loading

ibuziuk commented Dec 12, 2018 •

edited

Loading