Stargate container memory calculated incorrectly #249

Closed
jsanda opened this issue Jan 26, 2021 · 8 comments · Fixed by #333
@jsanda
Contributor

jsanda commented Jan 26, 2021

Bug Report

Describe the bug
While testing #163 I observed that my Stargate container was repeatedly getting OOM killed. The problem is that the container memory limit ends up set to the same value as the JVM heap size. The container memory is calculated as follows:

{{ mul 1.5 (.Values.stargate.heapMB | default 1024) }}Mi

The problem is that Helm's math functions operate on int64 values (see the math functions section of the Helm docs), so the 1.5 gets cast to an int64 and winds up as 1.
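As a worked example (assuming the default heapMB of 1024): mul truncates each argument to an int64 before multiplying, so

{{ mul 1.5 1024 }}Mi

renders as 1024Mi rather than the intended 1536Mi, leaving the container limit exactly equal to -Xmx with no headroom for Metaspace, thread stacks, or other off-heap memory.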

Another problem is that the stargate.heapMB property is not documented in values.yaml.

To Reproduce
Steps to reproduce the behavior:
Deploy k8ssandra with the following values:

stargate:
  enabled: true
  heapMB: 2048

Expected behavior
The total container memory must be greater than the JVM heap size. We also need to document the heapMB property in values.yaml.
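One possible way to keep the 1.5x factor using integer-only math (a sketch only, not necessarily what the eventual fix in #333 does) is to multiply by 3 and divide by 2:

{{ div (mul 3 (.Values.stargate.heapMB | default 1024)) 2 }}Mi

With heapMB: 2048 this renders as 3072Mi, giving the container roughly 1 GiB of headroom above the heap.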

Environment:

  • Helm charts version: 0.31.0
@jsanda jsanda added this to the 1.0.0 milestone Jan 26, 2021
@jsanda jsanda added this to To do in K8ssandra via automation Jan 26, 2021
@jakerobb jakerobb moved this from To do to In progress in K8ssandra Jan 26, 2021
@jakerobb
Contributor

I'm not sure how to explain this, but I cannot reproduce it. The container is up and running fine on my machine and has not been OOM killed. I configured stargate.heapMB=2048, and this excerpt of kubectl describe pod shows that the container has a heap min and max of 2048M and a resource limit of 2Gi.

Containers:
  cluster1-k8ssandra-dc1-stargate:
    Container ID:   containerd://41d79305478af96b38a8717e64681e16fc627606a3a5b592b22bc19df045932e
    Image:          stargateio/stargate-3_11:v1.0.0
    Image ID:       docker.io/stargateio/stargate-3_11@sha256:810036d9e0018e151c8ec152f72e8d2bcf55a2980cdc5f1125f035b212aa6b12
    Ports:          8080/TCP, 8081/TCP, 8082/TCP, 8084/TCP, 8085/TCP, 8090/TCP, 9042/TCP, 8609/TCP, 7000/TCP, 7001/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 22:12:10 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      200m
      memory:   2Gi
    Liveness:   http-get http://:health/checker/liveness delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://:health/checker/readiness delay=60s timeout=3s period=10s #success=1 #failure=3
    Environment:
      JAVA_OPTS:        -XX:+CrashOnOutOfMemoryError -Xms2048m -Xmx2048m
      CLUSTER_NAME:     cluster1
      CLUSTER_VERSION:  3.11
      SEED:             cluster1-seed-service.cluster1.svc.cluster.local
      DATACENTER_NAME:  dc1
      RACK_NAME:        default
      ENABLE_AUTH:      true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x7n97 (ro)

Am I missing something? When are you guys seeing the OOM killer do its thing? I assumed it would happen shortly after JVM startup, but this pod is ten minutes old, responsive to various HTTP requests, and running without issue.

I can still tweak the settings, of course, but I wanted to be able to reproduce the problem before proceeding with changes.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

When are you guys seeing the OOM killer do its thing?

Shortly after the pod is created. The container status reports an exit code of 137.

I hit this repeatedly while testing #236. The errors are what led me to investigate and discover that the container memory is being calculated incorrectly.

I was testing with a 3 node C* cluster, but I doubt that matters.

@jdonenine
Contributor

I have seen it sporadically, @jakerobb, most frequently when running a cluster with 3 Cassandra nodes and 2 Stargate nodes. I was running today with a single Stargate node and it was working.

Here's a log message that I observed yesterday when @jsanda and I were talking about it. I'm not sure how much this will help, but here ya go:

Running java -server -XX:+CrashOnOutOfMemoryError -Xms1024M -Xmx1024M -Dstargate.libdir=./stargate-lib -jar ./stargate-lib/stargate-starter-1.0.0.jar --cluster-name k8ssandra --cluster-version 3.11 --cluster-seed k8ssandra-seed-service.default.svc.cluster.local --listen 10.244.1.9 --dc dc1 --rack default --enable-auth

That's unfortunately all I had from the Slack convo. I will try to reproduce it again and get some more complete logs for you.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

I will try to reproduce again and report back with more details. With that said, the memory calculation

{{ mul 1.5 (.Values.stargate.heapMB | default 1024) }}Mi

is definitely a bug since Helm's math functions only work with int64 values.

@jakerobb
Contributor

Yep, not denying the fact that it's a bug. I am just not going to be confident in any sort of fix until I know how to reproduce the OOM.

I'm guessing you were both using kind?

@jdonenine
Contributor

Kind for me, yup.

@jsanda
Contributor Author

jsanda commented Jan 27, 2021

I also hit this with Kind. I have tried several times on GKE and have not hit it. That is interesting. I can keep increasing the memory, but I will run into a different problem. The pod won't get scheduled to run because none of my k8s worker nodes have enough memory to satisfy the resource requests. Maybe this behavior is more specific to Kind than I originally thought.

jakerobb pushed commits that referenced this issue on Jan 30 and Feb 2, 2021: "…; adjusting memory defaults and also liveness/readiness probes for improved startup reliability"
@jdonenine jdonenine linked a pull request Feb 4, 2021 that will close this issue
jakerobb pushed commits that referenced this issue on Feb 5, 2021
@jakerobb jakerobb removed a link to a pull request Feb 6, 2021
@jakerobb
Contributor

jakerobb commented Feb 6, 2021

I have unlinked #283 because my changes have not (as far as I am aware) affected John's ability to reproduce the OOM killer symptom. I think this should remain open until we can figure that out.

jakerobb pushed commits that referenced this issue on Feb 7, 2021: "…; adjusting memory defaults and also liveness/readiness probes for improved startup reliability"
@jdonenine jdonenine linked a pull request Feb 8, 2021 that will close this issue
K8ssandra automation moved this from In progress to Done Feb 9, 2021
jsanda pushed a commit that referenced this issue Feb 9, 2021
* #249: resolving mathematical error in the Stargate pod's memory specs; adjusting memory defaults and also liveness/readiness probes for improved startup reliability

* #249: corrected docs

* updated test suite to use unique+descriptive values for namespace, clusterName, and releaseName

* Split ingressroutes template into separate templates so that each can be tested; relocated each to an appropriate directory.

* Factored out a shareable renderTemplate function for template tests to reduce boilerplate (and improve behavior in failure cases).

* Restructured utils package and added several utilities.

* Added/improved unit tests for Stargate and Ingress.

* Updated all existing tests to use new renderTemplate function.

* Stargate ingress is now on by default when stargate and ingress are themselves enabled.
Stargate's default host for ingress is now * instead of localhost.

* changing file names, and import aliases to all-lowercase per code review feedback

* Updating to be compatible with latest rebased changes

* Updating to be compatible with latest rebased changes; adding a distinct test for custom releasename

* Added docs for all utility functions. Removed some debug logging. Refactored one function slightly to be more reusable.

* Refactored FindEnvVarByName to take the container instead of the array of EnvVars.

* Re-updating to use the new factored-out utility methods.

* Updating Stargate ingress to support wildcard host (by not specifying it)

* Restoring fix for issue #299 from PR #320, which was inadvertently merged over by #311

* clarifying docs, fixing some typos

* Moving Stargate deployment defaults from template to values, refactoring to simplify, revising tests for compatibility with latest rebase.

* updated chart version

* updating Stargate dashboard to correct the stat title for request rate

* updating test suite to use random-suffixed values for namespace and release name; updating Stargate tests to not be fragile under such randomness