Stargate container memory calculated incorrectly #249

Closed
jsanda opened this issue Jan 26, 2021 · 8 comments · Fixed by #333
@jsanda
Contributor

jsanda commented Jan 26, 2021

Bug Report

Describe the bug
While testing #163 I observed that my Stargate container was repeatedly getting OOM killed. The problem is that the container memory limit ends up set to the same value as the JVM heap size. The container memory is calculated as follows:

{{ mul 1.5 (.Values.stargate.heapMB | default 1024) }}Mi

The problem is that Helm's math functions operate on int64 values (see the math functions section of the Helm docs), so the 1.5 gets cast to an int64 and winds up as 1.
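As a worked example (assuming the default heapMB of 1024): mul truncates each argument to an int64 before multiplying, so

{{ mul 1.5 1024 }}Mi

renders as 1024Mi rather than the intended 1536Mi, leaving the container limit exactly equal to -Xmx with no headroom for Metaspace, thread stacks, or other off-heap memory.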

Another problem is that the stargate.heapMB property is not documented in values.yaml.

To Reproduce
Steps to reproduce the behavior:
Deploy k8ssandra with the following values:

stargate:
  enabled: true
  heapMB: 2048

Expected behavior
The total container memory must be greater than the JVM heap size. We also need to document the heapMB property in values.yaml.
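One possible way to keep the 1.5x factor using integer-only math (a sketch only, not necessarily what the eventual fix in #333 does) is to multiply by 3 and divide by 2:

{{ div (mul 3 (.Values.stargate.heapMB | default 1024)) 2 }}Mi

With heapMB: 2048 this renders as 3072Mi, giving the container roughly 1 GiB of headroom above the heap.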

Environment:

  • Helm charts version: 0.31.0
@jsanda jsanda added this to the 1.0.0 milestone Jan 26, 2021
@jsanda jsanda added this to To do in K8ssandra via automation Jan 26, 2021
@jakerobb jakerobb moved this from To do to In progress in K8ssandra Jan 26, 2021
@jakerobb
Contributor

I'm not sure how to explain this, but I cannot reproduce it. The container is up and running fine on my machine and has not been OOM killed. I configured stargate.heapMB=2048, and this excerpt of kubectl describe pod shows that the container has a heap min and max of 2048M and a resource limit of 2Gi.

Containers:
  cluster1-k8ssandra-dc1-stargate:
    Container ID:   containerd://41d79305478af96b38a8717e64681e16fc627606a3a5b592b22bc19df045932e
    Image:          stargateio/stargate-3_11:v1.0.0
    Image ID:       docker.io/stargateio/stargate-3_11@sha256:810036d9e0018e151c8ec152f72e8d2bcf55a2980cdc5f1125f035b212aa6b12
    Ports:          8080/TCP, 8081/TCP, 8082/TCP, 8084/TCP, 8085/TCP, 8090/TCP, 9042/TCP, 8609/TCP, 7000/TCP, 7001/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 22:12:10 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      200m
      memory:   2Gi
    Liveness:   http-get http://:health/checker/liveness delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://:health/checker/readiness delay=60s timeout=3s period=10s #success=1 #failure=3
    Environment:
      JAVA_OPTS:        -XX:+CrashOnOutOfMemoryError -Xms2048m -Xmx2048m
      CLUSTER_NAME:     cluster1
      CLUSTER_VERSION:  3.11
      SEED:             cluster1-seed-service.cluster1.svc.cluster.local
      DATACENTER_NAME:  dc1
      RACK_NAME:        default
      ENABLE_AUTH:      true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x7n97 (ro)

Am I missing something? When are you guys seeing the OOM killer do its thing? I assumed it would happen shortly after JVM startup, but this pod is ten minutes old, responsive to various HTTP requests, and running without issue.

I can still tweak the settings, of course, but I wanted to be able to reproduce the problem before proceeding with changes.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

When are you guys seeing the OOM killer do its thing?

Shortly after the pod is created. The container status reports an exit code of 137.

I hit this repeatedly while testing #236. The errors are what led me to investigate and discover that the container memory is being calculated incorrectly.

I was testing with a 3 node C* cluster, but I doubt that matters.

@jdonenine
Contributor

I have seen it sporadically, @jakerobb, most frequently when running a cluster with 3 Cassandra nodes and 2 Stargate nodes. I was running today with a single Stargate node and it was working.

Here's a log message that I observed yesterday when @jsanda and I were talking about it. I'm not sure how much this will help, but here ya go:

Running java -server -XX:+CrashOnOutOfMemoryError -Xms1024M -Xmx1024M -Dstargate.libdir=./stargate-lib -jar ./stargate-lib/stargate-starter-1.0.0.jar --cluster-name k8ssandra --cluster-version 3.11 --cluster-seed k8ssandra-seed-service.default.svc.cluster.local --listen 10.244.1.9 --dc dc1 --rack default --enable-auth

That's unfortunately all I had from the Slack convo. I will try to reproduce it again and get some more complete logs for you.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

I will try to reproduce again and report back with more details. With that said, the memory calculation

{{ mul 1.5 (.Values.stargate.heapMB | default 1024) }}Mi

is definitely a bug since Helm's math functions only work with int64 values.

@jakerobb
Contributor

Yep, not denying the fact that it's a bug. I am just not going to be confident in any sort of fix until I know how to reproduce the OOM.

I'm guessing you were both using kind?

@jdonenine
Contributor

Kind for me, yup.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

I also hit this with Kind. I have tried several times on GKE and have not hit it. That is interesting. I can keep increasing the memory, but I will run into a different problem. The pod won't get scheduled to run because none of my k8s worker nodes have enough memory to satisfy the resource requests. Maybe this behavior is more specific to Kind than I originally thought.

jakerobb pushed commits that referenced this issue on Jan 30 and Feb 2, 2021: "…; adjusting memory defaults and also liveness/readiness probes for improved startup reliability"
@jdonenine jdonenine linked a pull request Feb 4, 2021 that will close this issue
jakerobb pushed commits that referenced this issue on Feb 5, 2021
@jakerobb jakerobb removed a link to a pull request Feb 6, 2021
@jakerobb
Contributor

jakerobb commented Feb 6, 2021

I have unlinked #283 because my changes have not (as far as I am aware) affected John's ability to reproduce the OOM killer symptom. I think this should remain open until we can figure that out.

jakerobb pushed commits that referenced this issue on Feb 7, 2021: "…; adjusting memory defaults and also liveness/readiness probes for improved startup reliability"
@jdonenine jdonenine linked a pull request Feb 8, 2021 that will close this issue
K8ssandra automation moved this from In progress to Done Feb 9, 2021
jsanda pushed a commit that referenced this issue Feb 9, 2021
* #249: resolving mathematical error in the Stargate pod's memory specs; adjusting memory defaults and also liveness/readiness probes for improved startup reliability

* #249: corrected docs

* updated test suite to use unique+descriptive values for namespace, clusterName, and releaseName

* Split ingressroutes template into separate templates so that each can be tested; relocated each to an appropriate directory.

* Factored out a shareable renderTemplate function for template tests to reduce boilerplate (and improve behavior in failure cases).

* Restructured utils package and added several utilities.

* Added/improved unit tests for Stargate and Ingress.

* Updated all existing tests to use new renderTemplate function.

* Stargate ingress is now on by default when stargate and ingress are themselves enabled.
Stargate's default host for ingress is now * instead of localhost.

* changing file names, and import aliases to all-lowercase per code review feedback

* Updating to be compatible with latest rebased changes

* Updating to be compatible with latest rebased changes; adding a distinct test for custom releasename

* Added docs for all utility functions. Removed some debug logging. Refactored one function slightly to be more reusable.

* Refactored FindEnvVarByName to take the container instead of the array of EnvVars.

* Re-updating to use the new factored-out utility methods.

* Updating Stargate ingress to support wildcard host (by not specifying it)

* Restoring fix for issue #299 from PR #320, which was inadvertently merged over by #311

* clarifying docs, fixing some typos

* Moving Stargate deployment defaults from template to values, refactoring to simplify, revising tests for compatibility with latest rebase.

* updated chart version

* updating Stargate dashboard to correct the stat title for request rate

* updating test suite to use random-suffixed values for namespace and release name; updating Stargate tests to not be fragile under such randomness