This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Between midnight and ~03:00 UTC, users provisioned on cluster starter-us-east-2 cannot create Che workspaces #2544

Closed
ldimaggi opened this issue Mar 12, 2018 · 12 comments

Comments

@ldimaggi
Collaborator

ldimaggi commented Mar 12, 2018

Updated March 22, 2018 - The pattern is consistent - starts at midnight UTC and continues for 3+ hours - only affects the starter-us-east-2 cluster.


The problem was first noted as affecting the creation of new Che workspaces starting after 19:00 EST (Boston time) here - #2154 (comment)

Since then - the same pattern has been seen in running build pipelines - E2E tests that create/run build pipelines are failing at 19:00 Boston time.

Question to be investigated - Are backups or some other system maintenance actions being performed on the starter-us-east-2 clusters at midnight UTC?

The starter-us-east-2a cluster does not seem to be affected by the issue.
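
To correlate individual failures with the suspected window, a check like the following could be applied to timestamps taken from test runs or logs. This is only a minimal sketch: the function name `in_failure_window` and the 00:00 to 03:00 UTC bounds are assumptions based on the pattern described above, not part of any tooling referenced in this issue.

```python
from datetime import datetime, timezone

# Assumed window, taken from the observed pattern: midnight to ~03:00 UTC.
WINDOW_START_HOUR = 0
WINDOW_END_HOUR = 3  # exclusive


def in_failure_window(timestamp: str) -> bool:
    """Return True if an ISO 8601 timestamp falls inside the assumed window."""
    moment = datetime.fromisoformat(timestamp)
    if moment.tzinfo is None:
        # Hypothetical convention: treat naive timestamps as already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    hour = moment.astimezone(timezone.utc).hour
    return WINDOW_START_HOUR <= hour < WINDOW_END_HOUR
```

Normalizing to UTC before comparing avoids confusion between local Boston time and the cluster's clock when the two are mixed in reports.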

@joshuawilson
Member

tasks don't have severity, is this a bug?

@ldimaggi
Collaborator Author

Not yet - still investigating - removed bug label.

@ldimaggi
Collaborator Author

ldimaggi commented Mar 13, 2018

Seeing a series of these errors:

Mar 13, 2018 1:23:53 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure
java.net.SocketException: Connection reset

Mar 13, 2018 1:24:00 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager nextReconnectInterval
INFO: Current reconnect backoff is 8000 milliseconds (T3)
Mar 13, 2018 1:24:00 AM okhttp3.internal.platform.Platform log
INFO: ALPN callback dropped: HTTP/2 is disabled. Is alpn-boot on the boot class path?
Mar 13, 2018 1:24:00 AM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure: HTTP 403, Status: 403 - User "system:serviceaccount:username-jenkins:jenkins" cannot watch secrets in the namespace "username": User "system:serviceaccount:username:jenkins" cannot watch secrets in project "username"
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
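
The `8000 milliseconds (T3)` figure in the log is consistent with an exponential backoff in which a base interval doubles after each failed reconnect. The sketch below illustrates that arithmetic only. The 1000 ms base and the cap of 5 doublings are assumptions chosen to match the logged value; they are not confirmed by this log or by the client configuration on the cluster.

```python
def reconnect_backoff_ms(attempt: int, base_ms: int = 1000, max_exponent: int = 5) -> int:
    """Illustrative exponential backoff: base_ms * 2**attempt, capped.

    base_ms and max_exponent are assumed values, picked so that attempt 3
    reproduces the "8000 milliseconds (T3)" line in the log above.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    exponent = min(attempt, max_exponent)
    return base_ms * (2 ** exponent)
```

Under these assumptions, a `T3` entry would mean the watch had already failed to reconnect several times in a row before the 403 response was logged.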

@ldimaggi
Collaborator Author

ldimaggi commented Mar 14, 2018

The pattern is consistent:

Build log:

EXITCODE   0
[ERROR] F8: Failed to execute the build [Unable to build the image using the OpenShift build service]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11:09 min
[INFO] Finished at: 2018-03-14T01:24:19+00:00
[INFO] Final Memory: 37M/59M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar141520989676914: Failed to execute the build: Unable to build the image using the OpenShift build service: An error has occurred. timeout: Socket closed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar141520989676914: Failed to execute the build
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)

Pod log:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromInputStream(BuildConfigOperationsImpl.java:276)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:231)
	at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:68)
	at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.startBuild(OpenshiftBuildService.java:361)
	at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.build(OpenshiftBuildService.java:111)
	... 27 more
Caused by: java.net.SocketTimeoutException: timeout
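
When comparing many failed runs, it can help to pull out only the innermost `Caused by:` entry from each trace. The following is a minimal sketch of that idea; the function name `root_cause` is hypothetical, and the sample text is abbreviated from the pod log above.

```python
from typing import Optional


def root_cause(log_text: str) -> Optional[str]:
    """Return the last 'Caused by:' entry in a Java stack trace, if any."""
    prefix = "Caused by:"
    cause = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            # Later entries are nested deeper, so keep overwriting.
            cause = stripped[len(prefix):].strip()
    return cause


# Abbreviated from the pod log above; stack frames are omitted.
SAMPLE_TRACE = """\
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    ... 27 more
Caused by: java.net.SocketTimeoutException: timeout
"""

print(root_cause(SAMPLE_TRACE))
```

For the trace in this comment, the innermost cause is the socket timeout rather than the wrapping `KubernetesClientException`.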

@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - creating a Che workspace and running a build pipeline start failing at midnight UTC time On cluster starter-us-east-2 - consistent pattern - creating a Che workspace and running a build pipeline start failing at midnight UTC time Mar 23, 2018
@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - consistent pattern - creating a Che workspace and running a build pipeline start failing at midnight UTC time On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time Mar 23, 2018
@ldimaggi
Collaborator Author

I just saw the creation of a Che workspace fail 5 times out of 5 - this error resulted - #2154

@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2 hours Mar 23, 2018
@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2 hours On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2+ hours Mar 23, 2018
@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 2+ hours On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 3+ hours Mar 23, 2018
@mishaone
Collaborator

Tried this at the appropriate time. Created a workspace before UTC midnight and then another one two minutes after. Got this error: Could not start workspace newprojectname-tvfkw. Reason: Start of environment 'default' failed. Error: Failed to get the ID of the container running in the OpenShift pod

[Screenshot: workspace_fail]

@ldimaggi
Collaborator Author

Checking again at midnight UTC - this time - found this in the Che log:

An OpenShift Pod not found

{"@timestamp":"2018-03-27T00:02:17.197+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/0dsiu' with id 'workspacemypusb6e60pgdxfv' created by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"http-nio-8080-exec-1","level":"INFO","level_value":20000,"req_id":"821c3366-ddac-443b-b8d8-3c7ae8727a66","identity_id":"20ddc23a-bb62-4834-9130-9af2f54e85b1"}
{"@timestamp":"2018-03-27T00:02:18.957+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/mpd32' with id 'workspace2ojaubk4hptm4jnc' is being stopped by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:18.959+00:00","@Version":1,"message":"Retrieving user Che tenant data","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.050+00:00","@Version":1,"message":"cheTenantData = {ldimaggi-che,https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com,8a09.starter-us-east-2.openshiftapps.com}","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.050+00:00","@Version":1,"message":"OSO proxy URL - https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found","logger_name":"org.eclipse.che.plugin.openshift.client.OpenShiftConnector","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found","logger_name":"org.eclipse.che.plugin.docker.machine.DockerInstance","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000,"stack_trace":"java.io.IOException: An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.getChePodByContainerId(OpenShiftConnector.java:1593)\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.getDeploymentName(OpenShiftConnector.java:2242)\n\tat org.eclipse.che.plugin.openshift.client.OpenShiftConnector.removeContainer(OpenShiftConnector.java:831)\n\tat org.eclipse.che.plugin.docker.machine.DockerInstance.destroy(DockerInstance.java:282)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyMachine(CheEnvironmentEngine.java:1172)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyEnvironment(CheEnvironmentEngine.java:1147)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.stop(CheEnvironmentEngine.java:322)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stopEnvironmentAndPublishEvents(WorkspaceRuntimes.java:788)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stop(WorkspaceRuntimes.java:356)\n\tat org.eclipse.che.api.workspace.server.WorkspaceManager.lambda$stopAsync$3(WorkspaceManager.java:732)\n\tat org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:29)\n\tat java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","@Version":1,"message":"Could not destroy machine 'machine1q8uiabhaaccf1vd' of workspace 'workspace2ojaubk4hptm4jnc'","logger_name":"org.eclipse.che.api.environment.server.CheEnvironmentEngine","thread_name":"WorkspaceSharedPool-5","level":"ERROR","level_value":40000,"stack_trace":"org.eclipse.che.api.machine.server.exception.MachineException: An OpenShift Pod with label cheContainerIdentifier=---------------------------------------------------------- could not be found\n\tat org.eclipse.che.plugin.docker.machine.DockerInstance.destroy(DockerInstance.java:286)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyMachine(CheEnvironmentEngine.java:1172)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.destroyEnvironment(CheEnvironmentEngine.java:1147)\n\tat org.eclipse.che.api.environment.server.CheEnvironmentEngine.stop(CheEnvironmentEngine.java:322)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stopEnvironmentAndPublishEvents(WorkspaceRuntimes.java:788)\n\tat org.eclipse.che.api.workspace.server.WorkspaceRuntimes.stop(WorkspaceRuntimes.java:356)\n\tat org.eclipse.che.api.workspace.server.WorkspaceManager.lambda$stopAsync$3(WorkspaceManager.java:732)\n\tat org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:29)\n\tat java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
{"@timestamp":"2018-03-27T00:02:19.168+00:00","@Version":1,"message":"Workspace 'ldimaggi@redhat.com/mpd32' with id 'workspace2ojaubk4hptm4jnc' stopped by user 'ldimaggi@redhat.com'","logger_name":"org.eclipse.che.api.workspace.server.WorkspaceManager","thread_name":"WorkspaceSharedPool-5","level":"INFO","level_value":20000}
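
Because each Che log entry is a single JSON object per line, the ERROR entries can be tallied by logger to see which component reports the failure most often. The following is a rough sketch, not the tooling used here: the function name `error_counts_by_logger` is hypothetical, and the sample entries are abbreviated from the log above, keeping only the `@timestamp`, `logger_name`, and `level` fields.

```python
import json
from collections import Counter
from typing import Iterable

# Abbreviated copies of entries from the Che log above; other fields omitted.
SAMPLE_LOG = """\
{"@timestamp":"2018-03-27T00:02:18.959+00:00","logger_name":"com.redhat.che.multitenant.Fabric8WorkspaceEnvironmentProvider","level":"INFO"}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","logger_name":"org.eclipse.che.plugin.openshift.client.OpenShiftConnector","level":"ERROR"}
{"@timestamp":"2018-03-27T00:02:19.167+00:00","logger_name":"org.eclipse.che.api.environment.server.CheEnvironmentEngine","level":"ERROR"}
"""


def error_counts_by_logger(lines: Iterable[str]) -> Counter:
    """Count ERROR-level entries per logger in JSON-lines log output."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that were wrapped or truncated when pasted.
            continue
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            counts[entry.get("logger_name", "unknown")] += 1
    return counts


for logger, count in error_counts_by_logger(SAMPLE_LOG.splitlines()).items():
    print(f"{count} {logger}")
```

Skipping lines that fail to parse keeps the tally usable even when a pasted log has been wrapped mid-entry.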

@rhopp
Collaborator

rhopp commented Mar 27, 2018

@ldimaggi What about events in OSO (starter-us-east-2)?

@ldimaggi
Collaborator Author

No events were listed/displayed.

@l0rd
Collaborator

l0rd commented Mar 27, 2018

This really looks like a duplicate of #2154. Our plan to mitigate/investigate is to:

@ldimaggi ldimaggi changed the title On cluster starter-us-east-2 - consistent pattern - creating a Che workspace starts failing at midnight UTC time for 3+ hours Between midnight and ~03:00 UTC, users provisioned on cluster starter-us-east-2 cannot create Che workspaces Mar 28, 2018
@jfchevrette
Contributor

Opened a ticket with the SRE team. We have a theory on what may be causing this. Will report back here once I have more info.

@qodfathr qodfathr added the priority/P1 Critical label Apr 23, 2018
@ldimaggi
Collaborator Author

This is the same problem as described in #2154 - the situation has improved - but the problem is still present.
