Stargate container memory calculated incorrectly #249
Comments
I'm not sure how to explain this, but I cannot reproduce it on my machine. The container is up and running and has not been OOM killed. I configured `stargate.heapMB=2048`, and this excerpt of
Am I missing something? When are you guys seeing the OOM killer do its thing? I assumed it would happen shortly after JVM startup, but this pod is ten minutes old, responsive to various HTTP requests, and running without issue. I can still tweak the settings, of course, but I wanted to be able to reproduce the problem before proceeding with changes.
Shortly after the pod is created. The container status reports an exit code of 137. I hit this repeatedly while testing #236. The errors are what led me to investigate and discover that the container memory is being calculated incorrectly. I was testing with a 3-node C* cluster, but I doubt that matters.
I have seen it sporadically, @jakerobb, most frequently when running a cluster with 3 Cassandra nodes and 2 Stargate nodes. I was running today with a single Stargate node and it was working. Here's a log message that I observed yesterday when @jsanda and I were talking about it. I'm not sure how much this will help, but here ya go: `Running java -server -XX:+CrashOnOutOfMemoryError -Xms1024M -Xmx1024M -Dstargate.libdir=./stargate-lib -jar ./stargate-lib/stargate-starter-1.0.0.jar --cluster-name k8ssandra --cluster-version 3.11 --cluster-seed k8ssandra-seed-service.default.svc.cluster.local --listen 10.244.1.9 --dc dc1 --rack default --enable-auth` That's unfortunately all I had in the Slack convo. I will try to reproduce it again and get some more complete logs for you.
I will try to reproduce again and report back more details. With that said, the memory calculation is definitely a bug, since Helm's math functions only work with int64 values.
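For context, here is a minimal sketch of the Sprig behavior being described; the expressions and literal values are illustrative, not taken from the chart's actual template:

```yaml
# Illustrative expressions only (assuming a heap size of 2048 MB).
# Sprig's mul casts every argument to int64, so the float literal 1.5 truncates to 1:
memory-broken: {{ mul 1.5 2048 }}          # renders as 2048 -- no headroom over the heap
# Staying in integer math preserves the intended 1.5x multiplier:
memory-fixed: {{ div (mul 2048 3) 2 }}     # renders as 3072
```

Recent Sprig releases also provide float variants such as `mulf`, but whether those are available depends on the Sprig version bundled with your Helm release, so the integer-only form is the safer workaround.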
Yep, not denying the fact that it's a bug. I am just not going to be confident in any sort of fix until I know how to reproduce the OOM. I'm guessing you were both using kind?
Kind for me, yup.
I also hit this with Kind. I have tried several times on GKE and have not hit it. That is interesting. I can keep increasing the memory, but I will run into a different problem: the pod won't get scheduled because none of my k8s worker nodes have enough memory to satisfy the resource requests. Maybe this behavior is more specific to Kind than I originally thought.
I have unlinked #283 because my changes have not (as far as I am aware) affected John's ability to reproduce the OOM killer symptom. I think this should remain open until we can figure that out.
* #249: resolving mathematical error in the Stargate pod's memory specs; adjusting memory defaults and also liveness/readiness probes for improved startup reliability
* #249: corrected docs
* updated test suite to use unique+descriptive values for namespace, clusterName, and releaseName
* Split ingressroutes template into separate templates so that each can be tested; relocated each to an appropriate directory.
* Factored out a shareable renderTemplate function for template tests to reduce boilerplate (and improve behavior in failure cases).
* Restructured utils package and added several utilities.
* Added/improved unit tests for Stargate and Ingress.
* Updated all existing tests to use new renderTemplate function.
* Stargate ingress is now on by default when stargate and ingress are themselves enabled. Stargate's default host for ingress is now `*` instead of localhost.
* changing file names, and import aliases to all-lowercase per code review feedback
* Updating to be compatible with latest rebased changes
* Updating to be compatible with latest rebased changes; adding a distinct test for custom releasename
* Added docs for all utility functions. Removed some debug logging. Refactored one function slightly to be more reusable.
* Refactored FindEnvVarByName to take the container instead of the array of EnvVars.
* Re-updating to use the new factored-out utility methods.
* Updating Stargate ingress to support wildcard host (by not specifying it)
* Restoring fix for issue #299 from PR #320, which was inadvertently merged over by #311
* clarifying docs, fixing some typos
* Moving Stargate deployment defaults from template to values, refactoring to simplify, revising tests for compatibility with latest rebase.
* updated chart version
* updating Stargate dashboard to correct the stat title for request rate
* updating test suite to use random-suffixed values for namespace and release name; updating Stargate tests to not be fragile under such randomness
Bug Report
Describe the bug
While testing #163, I observed that my Stargate container was repeatedly getting OOM killed. The problem is due to the container memory being set to the same value as the JVM heap size. The container memory is calculated as follows:
The problem is that the math functions operate on int64 values (see the Helm docs), so the 1.5 gets cast to an int64, which winds up as 1.
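The chart's exact template lines are not shown above, so here is a hypothetical sketch of the pattern being described, assuming a resources block driven by `stargate.heapMB`. Because the JVM needs room beyond `-Xmx` for metaspace, thread stacks, and other off-heap allocations, a container limit equal to the heap size invites the OOM killer:

```yaml
# Hypothetical sketch -- not the chart's actual file.
resources:
  limits:
    # Intended: 1.5x the heap. Actual: mul truncates 1.5 to 1, so the limit equals the heap.
    memory: {{ mul 1.5 .Values.stargate.heapMB }}Mi
    # A possible fix that stays in integer math and keeps the 1.5x headroom:
    # memory: {{ div (mul .Values.stargate.heapMB 3) 2 }}Mi
```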
Another problem is that the `stargate.heapMB` property is not documented in `values.yaml`.
To Reproduce
Steps to reproduce the behavior:
Deploy k8ssandra with the following values:
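The values block itself is not included above. Purely as an illustration, and assuming the chart exposes `stargate.enabled` alongside `stargate.heapMB`, a minimal values file that exercises this code path might look like:

```yaml
# Hypothetical values -- not the reporter's exact configuration.
stargate:
  enabled: true
  heapMB: 1024   # with the truncation bug, the container memory limit collapses to this same value
```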
Expected behavior
The total container memory must be greater than the JVM heap size. We also need to document the `heapMB` property in `values.yaml`.
Environment (please complete the following information):
0.31.0