Describe the bug
I have observed a scenario where all of the C* nodes in the cluster restart while Stargate is up and running. In this example, the datacenter was scaled from cassandra.datacenters[0].size=3 down to cassandra.datacenters[0].size=1. When this happens, a full rolling restart of the Cassandra pods can be observed. After the restart completes, the Stargate node is unable to connect to the cluster or serve any requests.
Exceptions like this can be seen in the Stargate logs:
ERROR [dw-140 - GET /v2/schemas/namespaces] 2021-02-23 02:45:05,211 AuthnTableBasedService.java:276 - Failed to validate token
java.util.concurrent.ExecutionException: org.apache.cassandra.stargate.exceptions.UnavailableException: Cannot achieve consistency level ONE
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at io.stargate.auth.table.AuthnTableBasedService.validateToken(AuthnTableBasedService.java:251)
at io.stargate.auth.AuthenticationService.validateToken(AuthenticationService.java:33)
at io.stargate.web.resources.Db.getDataStoreForToken(Db.java:72)
at io.stargate.web.docsapi.resources.NamespacesResource.lambda$getAllNamespaces$1(NamespacesResource.java:95)
at io.stargate.web.resources.RequestHandler.handle(RequestHandler.java:36)
at io.stargate.web.docsapi.resources.NamespacesResource.getAllNamespaces(NamespacesResource.java:92)
at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:50)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1631)
at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1618)
at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:47)
at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:41)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1618)
at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:319)
at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:273)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1618)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:549)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1369)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:489)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1284)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:249)
at io.dropwizard.jetty.ContextRoutingHandler.handle(ContextRoutingHandler.java:37)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:767)
at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:54)
at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:501)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:556)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:272)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.cassandra.stargate.exceptions.UnavailableException: Cannot achieve consistency level ONE
at io.stargate.db.cassandra.impl.Conversion.toExternal(Conversion.java:276)
at io.stargate.db.cassandra.impl.Conversion.convertInternalException(Conversion.java:472)
at io.stargate.db.cassandra.impl.CassandraPersistence$CassandraConnection.lambda$executeRequestOnExecutor$0(CassandraPersistence.java:443)
at io.stargate.db.cassandra.impl.CassandraPersistence.lambda$runOnExecutor$1(CassandraPersistence.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113)
... 1 common frames omitted
Suppressed: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
Expected behavior
The system should be tolerant of restarts of the C* pods.
Environment (please complete the following information):
Helm charts version info
% helm ls -A
NAME       NAMESPACE  REVISION  UPDATED                               STATUS    CHART             APP VERSION
k8ssandra  default    10        2021-02-22 21:37:16.239715 -0500 EST  deployed  k8ssandra-0.55.0  3.11.10
traefik    default    1         2021-02-22 20:15:39.848483 -0500 EST  deployed  traefik-9.14.3    2.4.5
% k get nodes
NAME                     STATUS  ROLES   AGE   VERSION
k8ssandra-control-plane  Ready   master  132m  v1.17.17
k8ssandra-worker         Ready   <none>  132m  v1.17.17
k8ssandra-worker2        Ready   <none>  132m  v1.17.17
k8ssandra-worker3        Ready   <none>  132m  v1.17.17
Additional context
Forcing a restart of Stargate via something like helm upgrade k8ssandra k8ssandra/k8ssandra --reuse-values --set stargate.enabled=false followed by helm upgrade k8ssandra k8ssandra/k8ssandra --reuse-values --set stargate.enabled=true after the C* restart is complete will correct the problem.
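The workaround can be wrapped in a small script so the two upgrades always run in order. This is only a sketch: the release and chart names are the ones used in this report, while the `run` helper, the `restart_stargate` function and the `DRY_RUN` variable are hypothetical names for illustration (`DRY_RUN` is not a helm feature). It assumes the C* restart has already completed before it is invoked.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: toggle Stargate off and back on via helm.
# RELEASE and CHART default to the names used in this report.
set -euo pipefail

RELEASE="${RELEASE:-k8ssandra}"
CHART="${CHART:-k8ssandra/k8ssandra}"

# Hypothetical helper: with DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Disable Stargate, then re-enable it, keeping all other release values.
restart_stargate() {
  run helm upgrade "$RELEASE" "$CHART" --reuse-values --set stargate.enabled=false
  run helm upgrade "$RELEASE" "$CHART" --reuse-values --set stargate.enabled=true
}
```

Calling `DRY_RUN=1 restart_stargate` prints the two helm commands without touching the cluster; calling `restart_stargate` on its own runs them.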
┆Issue is synchronized with this Jira Bug by Unito
┆friendlyId: K8SSAND-118
┆priority: Medium
To be clear, this seems to be a problem only when ALL C* nodes in play go down, either because there was only a single C* node or because they were all forced down at the same time for other reasons (failure, Medusa restore with full shutdown, etc.). I saw no issues during rolling restarts while at least one C* node was up and running.
sync-by-unitobot changed the title from "When a restart of the C* nodes in a cluster occurs, Stargate is then unable to function" to "K8SSAND-118 ⁃ When a restart of the C* nodes in a cluster occurs, Stargate is then unable to function" on Apr 1, 2022
To Reproduce
Steps to reproduce the behavior:
1. Start with cassandra.datacenters[0].size=3.
2. Scale down to cassandra.datacenters[0].size=1.
Here you can see that the dc1-default pod has restarted more recently than the dc1-stargate pod.
Environment (please complete the following information):
Kind