[Intermittent] Workspace fails to start #4626
@ljelinkova @ppitonak are you still able to reproduce this problem on prod? We have changed a few che-server configs [1] that should fix the problem; I currently cannot reproduce it. (Note: nothing has been changed on prod-preview.) [1] https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2476
I haven't seen this issue on prod since this morning; however, we need to let more e2e tests run before making any conclusion. BTW, we need to fix prod-preview so that the PR checks for SAAS can pass.
ok, let me apply the same config on preview
I've just seen this failed Che test in production.
@ljelinkova according to the screenshot it fails to start the bayesian LSP. Could you provide the events from the *-che namespace?
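For reference, one way to collect those events is with the oc CLI; the namespace name below is an assumption (the per-user `<username>-che` project) and may differ per tenant:

```sh
# Dump recent events from the user's *-che namespace, newest last
# (namespace name is an assumption - adjust to the affected tenant).
oc get events -n <username>-che --sort-by='.lastTimestamp'

# Pod status in the same namespace, to see whether the workspace pod is stuck.
oc get pods -n <username>-che
```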
These are the logs: @rhopp is also investigating this issue.
@rhopp @ljelinkova FYI, the new config is applied on prod-preview.
@ibuziuk I'm afraid the new config didn't help... The issue is still happening (for example here, from today's morning run), and I was able to reproduce it pretty often yesterday too.
@rhopp could you please provide info about the failure rate, preferably based on the periodic CI jobs? I want to understand how often it currently happens on prod.
@ibuziuk I've checked the latest 18 job runs (from this morning and yesterday evening).
@rhopp a ~50% failure rate is pretty high. Most, if not all, failures are currently happening during the bayesian installation.
What does it mean? [1] https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/console
It simply means that the workspace did not start, for whatever reason. When you scroll to the end of the Jenkins log, you will find a link to artifacts.ci.centos.org (http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/ for the job that you linked). There, have a look at
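If it helps, the whole artifacts directory for a run can be mirrored locally for inspection; this is just a convenience sketch using the URL of the linked job:

```sh
# Mirror the artifacts of the linked job run for offline inspection.
# -r: recursive, -np: do not ascend to the parent directory.
wget -r -np http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1796/
```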
BTW, I've just noticed that accounts are not reset on prod because of a UI change. It is probably not related to this error because I observed the error multiple times on prod-preview.
@ppitonak ok, but wouldn't it be better to put
@rhopp is this issue equally reproducible on prod-preview? Am I correct that those jobs [1] are running against prod? [1] https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing
@ibuziuk Today's runs were against prod-preview (2 jobs, running every 2 hours): 20 runs, 4 failures with this error. It fails most often on bayesian (if not every time), maybe because when this installer starts it writes most of its output to standard output?
Fair point... I changed that, and all new builds will have a more meaningful message.
@rhopp I do not see anything obvious currently:
@rhopp do you have the same logs when the workspace fails to start for you (no obvious errors)?
All installers are processed just fine without obvious errors, but the workspace is still in the starting phase until the timeout occurs.
@ibuziuk That's exactly what I'm seeing, and that's what has perplexed me for the last few days (and what I was trying to say on the daily standups for the last few days :-D)
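As a side note, the state the Che server reports for the workspace can be polled directly while it hangs in the starting phase; the host and token handling below are assumptions (hosted Che on openshift.io with a bearer token), so adjust to your environment:

```sh
# Poll the workspace status via the Che REST API
# (host, workspace id, and auth token are assumptions).
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://che.openshift.io/api/workspace/<workspace-id>" | jq -r '.status'
```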
@rhopp @ppitonak @ljelinkova hot-fix [1] has been rolled out to prod. It would be really useful to get info about startup failures similar to what was provided today - https://docs.google.com/spreadsheets/d/1SsC06QJSp4fptCzEsFrVvOHKbQKqaDROeH1oE97lUiQ/edit?usp=sharing [1] redhat-developer/rh-che#1140 P.S. I have not been able to reproduce the issue since the update.
@ibuziuk So far, so good. I've checked all the runs that were executed today (since midnight). Out of 10 runs, none of them failed on this bug.
@rhopp @ljelinkova @ppitonak if the situation has stabilized, please remove the P1 / SEV1 / e2e labels.
Is redhat-developer/rh-che#1140 the final fix, or are there any follow-up issues? I would prefer closing this issue and tracking the rest in more specific issues.
The issue is very likely related to the fact that currently every log entry from the workspace installers is pushed to the Che master via JSON-RPC. This creates significant load, and eventually the Che master cannot create yet another native thread, so the workspace startup fails once the installer logs can no longer be pushed to the master. There is already an upstream issue for moving the endpoint that processes workspace output from the Che master to a separate container, but since this is an architectural change the solution is not trivial and cannot be applied in the short term. Here is the list of issues which could cause the problem:
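If this diagnosis is right, the symptom should show up in the Che master log as the standard JVM error for native-thread exhaustion; the deployment name and namespace below are assumptions:

```sh
# Look for JVM native-thread exhaustion in the Che master log
# (deployment config name "che" and its namespace are assumptions).
oc logs dc/che -n <che-server-namespace> | grep -i "unable to create new native thread"
```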
Thanks for the list of issues. Closing this issue now; it will be reopened if the problem appears again.
The e2e tests have failed multiple times on the Che tests - the workspace was not started.
For example:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1203/console
Or:
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1774/console
This is already reported in the RH-Che repository: redhat-developer/rh-che#1126