
manager pod has multiple restarts with oomkilled reason #1416

Closed
Tracked by #28
vbedida79 opened this issue May 12, 2023 · 18 comments · Fixed by #1429

Comments

@vbedida79

Summary

The operator's controller-manager pod restarts repeatedly with an OOMKilled termination reason.

Details

The controller-manager pod for version 0.26.1 restarts repeatedly with an OOMKilled reason on OCP 4.12. As a workaround, we increased the pod's memory limit from 50 MB to 100 MB, which stopped the restarts.
After the memory increase, we observed a slow rise in memory usage over the course of 3 days, from an initial 70 MB to 108 MB at present. Could this be an internal memory leak?
The pod logs do not show any errors:

I0512 17:44:29.821411 1 reconciler.go:233] "intel-device-plugins-manager: " controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" SgxDevicePlugin="sgxdeviceplugin-sample" namespace="" name="sgxdeviceplugin-sample" reconcileID=f73e9186-de1a-4b4c-a7c1-50e73a749a63 ="(MISSING)"
I0512 17:44:29.821583 1 reconciler.go:233] "intel-device-plugins-manager: " controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" GpuDevicePlugin="gpudeviceplugin-sample" namespace="" name="gpudeviceplugin-sample" reconcileID=0c1fac7a-7709-4dcd-b123-54f8a453ff4f ="(MISSING)"
I0512 17:44:29.821603 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=84c9d221-e5f4-4348-b7da-a2acae338b90 ="(MISSING)"
I0512 17:44:29.828747 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=c89e9d61-74b2-40f6-aa6c-3bda3437ce9e ="(MISSING)"

Possible solutions

  1. Is increasing the memory limit an effective fix? The root cause could be an internal memory leak in the application.
  2. A temporary workaround could be to increase the number of replicas so that pod restarts do not overlap, though this would not address a memory leak.
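For reference, the workaround described above amounts to raising the memory limit in the manager Deployment's container spec. A minimal sketch of that change (the request value and surrounding fields are assumptions, not the project's exact manifest):

```yaml
# Sketch: manager container resources after the workaround in this thread.
# Only the 100Mi limit comes from the discussion; other values are assumptions.
resources:
  requests:
    memory: "50Mi"
  limits:
    memory: "100Mi"   # raised from 50Mi to stop the OOMKilled restarts
```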
@mythi
Contributor

mythi commented May 15, 2023

@vbedida79 is this a regression from v0.26.0? Do you have some test case that you are running during those days?

@tkatila
Contributor

tkatila commented May 15, 2023

How is the operator deployed? I had an operator online for 11 hours with a script adding and removing a GPU CR, and the memory footprint fluctuated between 45-47 MB, but it didn't seem to increase constantly.

@vbedida79
Author

> @vbedida79 is this regression from v0.26.0? Do you have some test case that you are running during those days?

We observed restarts with 0.26.0 as well, but wonder if those could have been due to TLS handshake errors, as we hadn't increased the memory limit for that pod yet. After increasing the memory there are no restarts, but we can see the steady increase.
No specific test cases apart from workloads for SGX (SGX SDK demo) and GPU (clinfo) jobs with 0.26.1.

@vbedida79
Author

vbedida79 commented May 15, 2023

> How is the operator deployed? I had an operator online for 11 hours with a script adding and removing a gpu CR and the memory foot print fluctuated between 45-47MB, but it didn't seem to increase constantly.

We deployed the operator with operator-sdk on OCP 4.12 and changed the pod's memory limit to 100 MB 4 days ago. It started out using around 50 MB; over time usage has increased to around 105 MB currently.
Would deleting CRs cause steady fluctuations? Any other possible root cause?

@mythi
Contributor

mythi commented May 15, 2023

> No specific test cases apart from workloads for sgx(sgx sdk demo) and gpu (clinfo) jobs with 0.26.1.

Do you deploy/undeploy them in a loop?

@vbedida79
Author

No loop. Since increasing the memory limit, we have deployed these jobs once.

@vbedida79
Author

vbedida79 commented May 17, 2023

No workloads are running currently. The pod's memory usage has grown from 50 MB at deployment time to about 100 MB, and it has stayed in a consistent average range of ~107 MB since. Is this normal and expected?

@vbedida79
Author

Update: on checking, the manager container's memory spiked from 50 to 80 MB when the pod was created, after which it has held a constant average of 80-90 MB.
The solution was to increase the limit from 50 to 100 MB to avoid restarts. Is the usage the same in your environment?

@mythi
Contributor

mythi commented May 19, 2023

> Solution was to increase limit from 50 to 100 to avoid restarts. Is the usage same on your environment?

Yes, pretty much, and we are going to update the limit as well. Thanks for checking!

@vbedida79
Author

vbedida79 commented May 19, 2023

Got it, thanks. Will this change be included in a specific release, or is it immediate? We currently use 0.26.1 for OCP 4.12.

@mythi
Contributor

mythi commented May 22, 2023

Only in main, which will be in 0.27.

@tkatila can you create the PR? 100M request, 120M limit?
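The proposed values would translate to something like the following in the manager's container resources (a sketch of the eventual change only; the exact manifest layout in the repo is an assumption):

```yaml
# Sketch of the values proposed above; not the verbatim content of the fix PR.
resources:
  requests:
    memory: "100Mi"
  limits:
    memory: "120Mi"
```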

@tkatila
Contributor

tkatila commented May 22, 2023

> @tkatila can you create the PR. 100M request 120M limit?

Sure

tkatila referenced this issue in tkatila/intel-device-plugins-for-kubernetes May 22, 2023
Fixes: #1416

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
@vbedida79
Author

vbedida79 commented May 22, 2023

Thanks @mythi @tkatila.
We use the 0.26.1 release to publish the certified operator on OpenShift 4.12, and we observe frequent restarts with it.
Can the memory increase be included in the 0.26.1 release, or in a branch based on it?

@mythi
Contributor

mythi commented May 22, 2023

Is it not possible to modify the bundle before you release?

@vbedida79
Author

@uMartinXu any thoughts?

@uMartinXu

I think that if no other new problems are found in 0.26.1, we can add this small change to our bundle and release it.
@chaitanya1731 what do you think? :-)

@uMartinXu

@mythi @tkatila have you had a chance to test how much memory the operator actually consumes on vanilla K8s? I want to check whether it consumes more memory on OCP than on vanilla K8s. Thanks!

@tkatila
Contributor

tkatila commented May 25, 2023

@uMartinXu I ran two basic scenarios with the 0.26.1 operator:

  • Applied and deleted device plugin CRs in a loop
    • Memory consumption fluctuated between 38-48 MB over a ~10 hour duration. No OOM kills.
  • Applied a set of CRs and let the operator idle
    • Memory consumption started at 43 MB and decreased to 38 MB. This was after (I think) three days of idling. No OOM kills.

The operator has actually now been online for 10 days without restarts.

I deployed the operator with the YAML deployment files; you mentioned installing it via operator-sdk. Maybe that's the difference? I'm not familiar with the SDK.

tkatila referenced this issue in tkatila/intel-device-plugins-for-kubernetes Aug 8, 2023
Fixes: #1416

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>