Between midnight and ~03:00 UTC, users provisioned on cluster starter-us-east-2 cannot create Che workspaces #2544

ldimaggi · 2018-03-12T13:43:41Z

Updated March 22, 2018 - The pattern is consistent - starts at midnight UTC and continues for 3+ hours - only affects the starter-us-east-2 cluster.

The problem was first noted in #2154 (comment), where the creation of new Che workspaces started failing after 19:00 Boston time (EST).

Since then - the same pattern has been seen in running build pipelines - E2E tests that create/run build pipelines are failing at 19:00 Boston time.

Question to be investigated - Are backups or some other system maintenance actions being performed on the starter-us-east-2 clusters at midnight UTC?

The starter-us-east-2a cluster does not seem to be affected by the issue.
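
To make it easier to check log timestamps against the reported window, here is a minimal Python sketch. It assumes the window is 00:00 to 03:00 UTC as described above and treats "Boston EST" as a fixed UTC-5 offset; the name `in_failure_window` is made up for illustration.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "Boston EST" means a fixed UTC-5 offset (no daylight saving).
EST = timezone(timedelta(hours=-5))

# Assumption: the failure window is 00:00 to 03:00 UTC, per the description above.
WINDOW_START = time(0, 0)
WINDOW_END = time(3, 0)


def in_failure_window(ts: datetime) -> bool:
    """Return True if an aware timestamp falls inside the 00:00-03:00 UTC window."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc_time = ts.astimezone(timezone.utc).time()
    return WINDOW_START <= utc_time < WINDOW_END


# 19:00 EST on March 12 is midnight UTC on March 13.
boston_evening = datetime(2018, 3, 12, 19, 0, tzinfo=EST)
print(boston_evening.astimezone(timezone.utc).isoformat())  # 2018-03-13T00:00:00+00:00
print(in_failure_window(boston_evening))  # True
```

When Boston is on daylight time (UTC-4), midnight UTC corresponds to 20:00 local time, so comparing in UTC avoids that ambiguity.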

joshuawilson · 2018-03-12T14:55:46Z

tasks don't have severity, is this a bug?

ldimaggi · 2018-03-12T15:00:09Z

Not yet - still investigating - removed bug label.

ldimaggi · 2018-03-13T02:04:29Z

Seeing a series of these errors:

Mar 13, 2018 1:23:53 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure
java.net.SocketException: Connection reset

Mar 13, 2018 1:24:00 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager nextReconnectInterval
INFO: Current reconnect backoff is 8000 milliseconds (T3)
Mar 13, 2018 1:24:00 AM okhttp3.internal.platform.Platform log
INFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
Mar 13, 2018 1:24:00 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure: HTTP 403, Status: 403 - User "system:serviceaccount:username-jenkins:jenkins" cannot watch secrets in the namespace "username": User "system:serviceaccount:username:jenkins" cannot watch secrets in project "username"
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
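
The "reconnect backoff is 8000 milliseconds (T3)" line is consistent with an exponential backoff that doubles on each attempt. The sketch below is only an illustration of that arithmetic: the 1000 ms base and the cap on the exponent are assumptions chosen so that T3 gives 8000 ms, not values confirmed from the client's source, and `reconnect_backoff_ms` is a hypothetical name.

```python
def reconnect_backoff_ms(attempt: int, base_ms: int = 1000, max_exponent: int = 5) -> int:
    """Exponential backoff: base_ms * 2**attempt, with the exponent capped.

    base_ms and max_exponent are assumed values, not taken from the log.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    exponent = min(attempt, max_exponent)
    return base_ms * (2 ** exponent)


# Print the delay for the first few attempts; T3 matches the logged 8000 ms.
for attempt in range(5):
    print(f"T{attempt}: {reconnect_backoff_ms(attempt)} ms")
```

Under these assumptions, a T3 entry means the watch connection had already failed and been retried three times before this log line.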

ldimaggi · 2018-03-14T01:39:21Z

The pattern is consistent:

Build log:

EXITCODE   0
[ERROR] F8: Failed to execute the build [Unable to build the image using the OpenShift build service]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11:09 min
[INFO] Finished at: 2018-03-14T01:24:19+00:00
[INFO] Final Memory: 37M/59M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar141520989676914: Failed to execute the build: Unable to build the image using the OpenShift build service: An error has occurred. timeout: Socket closed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar141520989676914: Failed to execute the build
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)

Pod log:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromInputStream(BuildConfigOperationsImpl.java:276)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:231)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:68)
	at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.startBuild(OpenshiftBuildService.java:361)
	at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.build(OpenshiftBuildService.java:111)
	... 27 more
Caused by: java.net.SocketTimeoutException: timeout

ldimaggi · 2018-03-23T02:13:57Z

I just saw the creation of a Che workspace fail 5 times out of 5, with the error reported in #2154.

mishaone · 2018-03-27T00:09:52Z

Tried this at the appropriate time. Created a workspace before midnight UTC and then another one two minutes after. Got this error: Could not start workspace newprojectname-tvfkw. Reason: Start of environment 'default' failed. Error: Failed to get the ID of the container running in the OpenShift pod

ldimaggi · 2018-03-27T00:17:09Z

Checking again at midnight UTC - this time - found this in the Che log:

An OpenShift Pod not found

{"@timestamp":"2018-03-27T00:02:17.197+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/0dsiu' with id 'workspacemypusb6e60pgdxfv' created by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"http-nio-8080-exec-1","level":"INFO","level_value":20000,"req_id":"821c3366-ddac-443b-b8d8-3c7ae8727a66","identity_id":"20ddc23a-bb62-4834-9130-9af2f54e85b1"}
{"@timestamp":"2018-03-27T00:02:18.957+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/mpd32' with id 'workspace2ojaubk4hptm4jnc' is being stopped by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:18.959+00:00","@Version":1,"message":"Retrieving user Che tenant data","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.050+00:00","@Version":1,"message":"cheTenantData = {ldimaggi-che,https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com,8a09.starter-us-east-2.openshiftapps.com}","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.050+00:00","@Version":1,"message":"OSO proxy URL - https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found","logger_name":"org.eclipse.che.plugin.openshift.client.OpenShiftConnector","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found","logger_name":"org.eclipse.che.plugin.docker.machine.DockerInstance","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000,"stack_trace":"java.io.IOException: An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.getChePodByContainerId(OpenShiftConnector.java:1593)\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.getDeploymentName(OpenShiftConnector.java:2242)\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.removeContainer(OpenShiftConnector.java:831)\n\tat org.eclipse.che.plugin.docker.machine.DockerInstance.destroy(DockerInstance.java:282)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyMachine(CheEnvironmentEngine.java:1172)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyEnvironment(CheEnvironmentEngine.java:1147)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.stop(CheEnvironmentEngine.java:322)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stopEnvironmentAndPublishEvents(WorkspaceRuntimes.java:788)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stop(WorkspaceRuntimes.java:356)\n\tat org.eclipse.che.api.workspace.server.WorkspaceManager.lambda$stopAsync$3(WorkspaceManager.java:732)\n\tat org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:29)\n\tat java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"Could not destroy machine 'machine1q8uiabhaaccf1vd' of workspace 'workspace2ojaubk4hptm4jnc'","logger_name":"org.eclipse.che.api.environment.server.CheEnvironmentEngine","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000,"stack_trace":"org.eclipse.che.api.machine.server.exception.MachineException: An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found\n\tat org.eclipse.che.plugin.docker.machine.DockerInstance.destroy(DockerInstance.java:286)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyMachine(CheEnvironmentEngine.java:1172)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyEnvironment(CheEnvironmentEngine.java:1147)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.stop(CheEnvironmentEngine.java:322)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stopEnvironmentAndPublishEvents(WorkspaceRuntimes.java:788)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stop(WorkspaceRuntimes.java:356)\n\tat org.eclipse.che.api.workspace.server.WorkspaceManager.lambda$stopAsync$3(WorkspaceManager.java:732)\n\tat org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:29)\n\tat java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
{"@timestamp":"2018-03-27T00:02:19.168+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/mpd32' with id 'workspace2ojaubk4hptm4jnc' stopped by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
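
Since the Che log is one JSON object per line, the ERROR entries can be pulled out with a few lines of Python. This is a rough sketch, not part of any Che tooling: `error_entries` is a hypothetical helper, and the two sample lines are abbreviated from the log above (most fields trimmed).

```python
import json
from typing import Iterable, Iterator


def error_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log entries whose level is ERROR, skipping non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stack-trace fragment wrapped onto its own line
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            yield entry


# Two entries abbreviated from the log above (fields trimmed for brevity).
sample = [
    '{"@timestamp":"2018-03-27T00:02:18.959+00:00",'
    '"message":"Retrieving user Che tenant data","level":"INFO"}',
    '{"@timestamp":"2018-03-27T00:02:19.167+00:00",'
    '"message":"Could not destroy machine \'machine1q8uiabhaaccf1vd\' '
    'of workspace \'workspace2ojaubk4hptm4jnc\'","level":"ERROR"}',
]

for entry in error_entries(sample):
    print(entry["@timestamp"], entry["message"])
```

Filtering this way shows that all of the errors above share the 00:02:19 timestamp and the same `WorkspaceSharedPool-5` thread, i.e. they come from the single stop request for workspace mpd32.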

rhopp · 2018-03-27T05:41:37Z

@ldimaggi What about events in OSO (starter-us-east-2)?

ldimaggi · 2018-03-27T13:40:46Z

No events were listed/displayed.

l0rd · 2018-03-27T17:17:09Z

This really looks like a duplicate of #2154. Our plan to mitigate/investigate is to:

Have a small script that tries to mount PVs on cluster us-east-2 every 10 minutes or so (Create simple test mounting PVC on OSO, redhat-developer/che-functional-tests#200). Something that acquits Che and is easy to share/reproduce.
Improve Che to provide a better error message when a PV mount error occurs (Provide a better error message when PV failed to be attached, redhat-developer/rh-che#557).
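
As a rough idea of what such a periodic mount check could look like, here is a minimal Python sketch. It is not the test from che-functional-tests#200. It assumes the `oc` CLI is installed and logged in to the target project, and every name in it (`pv-mount-probe`, the `busybox` image, the 1Gi size, the timeouts) is a hypothetical placeholder. It treats the pod reaching the Running phase as evidence that the volume was mounted.

```python
import json
import subprocess
import time

# All names and sizes below are hypothetical placeholders.
PVC_NAME = "pv-mount-probe"
POD_NAME = "pv-mount-probe-pod"


def pvc_manifest(name: str = PVC_NAME, size: str = "1Gi") -> dict:
    """A small ReadWriteOnce claim for the probe pod to mount."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_manifest(name: str = POD_NAME, claim: str = PVC_NAME) -> dict:
    """A pod that mounts the claim at /data and writes one file to it."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "probe",
                "image": "busybox",
                "command": ["sh", "-c", "touch /data/probe && sleep 30"],
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }],
            "volumes": [{"name": "data",
                         "persistentVolumeClaim": {"claimName": claim}}],
        },
    }


def oc(args, stdin=None) -> str:
    """Run an oc command and return its trimmed stdout; raises on failure."""
    done = subprocess.run(["oc", *args], input=stdin,
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()


def probe_once(run=oc, timeout_s=120, poll_s=5, sleep=time.sleep) -> bool:
    """Create the claim and pod, then wait for the pod to start. True on success."""
    run(["apply", "-f", "-"], json.dumps(pvc_manifest()))
    run(["apply", "-f", "-"], json.dumps(pod_manifest()))
    try:
        waited = 0
        while waited <= timeout_s:
            phase = run(["get", "pod", POD_NAME, "-o", "jsonpath={.status.phase}"], None)
            if phase in ("Running", "Succeeded"):
                return True
            if phase == "Failed":
                return False
            sleep(poll_s)
            waited += poll_s
        return False
    finally:
        # Remove the pod so the next run starts clean; the claim is kept.
        run(["delete", "pod", POD_NAME, "--ignore-not-found"], None)


def run_forever(interval_s: int = 600) -> None:
    """Probe every interval_s seconds (10 minutes by default) and log the result."""
    while True:
        try:
            ok = probe_once()
        except subprocess.CalledProcessError:
            ok = False
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(stamp, "mount ok" if ok else "mount FAILED")
        time.sleep(interval_s)
```

Calling `run_forever()` starts the loop. Because the result is logged with a UTC timestamp, failures clustered after midnight UTC would stand out without involving Che at all.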

jfchevrette · 2018-03-28T16:03:56Z

Opened a ticket with the SRE team. We have a theory on what may be causing this. Will report back here once I have more info.

ldimaggi · 2018-04-24T18:12:38Z

This is the same problem as described in #2154. The situation has improved, but the problem is still present.

ldimaggi added SEV2-high type/task team/service-delivery labels Mar 12, 2018

ldimaggi self-assigned this Mar 12, 2018

ldimaggi removed the SEV2-high label Mar 12, 2018

ldimaggi added type/bug SEV2-high and removed type/task labels Mar 23, 2018

ldimaggi changed the title ~~On cluster starter-us-east-2 - consistent pattern - creating a Che workspace and running a build pipeline start failing at midnight UTC time~~ On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time Mar 23, 2018

ldimaggi changed the title ~~On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time~~ On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2 hours Mar 23, 2018

ldimaggi changed the title ~~On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2 hours~~ On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2+ hours Mar 23, 2018

ldimaggi changed the title ~~On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2+ hours~~ On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 3+ hours Mar 23, 2018

ldimaggi changed the title ~~On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 3+ hours~~ Between midnight and ~03:00 UTC, users provisioned on cluster starter-us-east-2 cannot create Che workspaces Mar 28, 2018

qodfathr added the priority/P1 Critical label Apr 23, 2018

ldimaggi closed this as completed Apr 24, 2018
