Foreground deletion of OpenTelemetryCollector object causes objects to be created and deleted in endless loop #2364

Closed
huzhekun opened this issue Nov 17, 2023 · 4 comments · Fixed by #2383
Labels: area:collector (Issues for deploying collector), bug (Something isn't working)

Comments

@huzhekun

Component(s)

No response

What happened?

Description

When deleting an OpenTelemetryCollector object with the command kubectl delete opentelemetrycollector cluster -n observability-metrics --cascade=foreground, the object is not deleted. Instead, it gets stuck in a cycle in which the operator recreates the dependent objects and they are deleted again.

Steps to Reproduce

Delete an OpenTelemetryCollector object with --cascade=foreground

Expected Result

The object's dependent resources are deleted cleanly, then the collector is deleted

Actual Result

The collector is not deleted. Its underlying resources, such as deployments, daemonsets, and others, are stuck in a cycle of being deleted and recreated multiple times per second.

Example of watching the underlying deployment of the collector (after running the command to delete the collector object):

$ kubectl get deployment -n observability-metrics -w
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
cluster-collector   2/2     2            2           33s
cluster-collector   2/2     2            2           33s
cluster-collector   1/2     1            1           34s
cluster-collector   0/2     0            0           34s
cluster-collector   0/2     0            0           35s
cluster-collector   0/2     0            0           36s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     2            0           0s
cluster-collector   0/2     2            0           1s
cluster-collector   0/2     2            0           1s
cluster-collector   0/2     1            0           2s
cluster-collector   0/2     0            0           2s
cluster-collector   0/2     0            0           5s
cluster-collector   0/2     0            0           5s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     2            0           0s
cluster-collector   0/2     2            0           1s
cluster-collector   0/2     2            0           1s
cluster-collector   1/2     1            1           2s
cluster-collector   0/2     0            0           2s
cluster-collector   0/2     0            0           6s
cluster-collector   0/2     0            0           6s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     0            0           0s
cluster-collector   0/2     2            0           0s
...

Kubernetes Version

1.25

Operator version

0.83.0

Collector version

0.83.0

Environment information

Environment

OS: (e.g., "Amazon Linux 2")

Log output

No response

Additional context

No response

@huzhekun added the bug and needs triage labels Nov 17, 2023
@jaronoff97 self-assigned this Nov 17, 2023
@jaronoff97 added the area:collector label and removed the needs triage label Nov 17, 2023
@jaronoff97
Contributor

jaronoff97 commented Nov 17, 2023

Hey, this is something I haven't tested but can look into. We should probably write a test for this as well. My bet is that the reconciler isn't treating a deletion timestamp as a blocker to reconciliation.

@jaronoff97
Contributor

Okay, I was easily able to repro this. I think the problem has to do with the deletion timestamp and finalizers. The fix should be simple: check for a deletion timestamp on the custom resource we get. I'm not positive what to do about finalizers; I don't think we need to do anything special for them, but I'm going to check with @pavolloffay on that one.

@jaronoff97
Contributor

Yep, checking for the deletion timestamp was enough. I also found a fix for a pervasive operator issue, which I'm going to solve the same way CockroachDB does, by using the retry.

@jaronoff97
Contributor

Should be all set in the next release. I wrote a unit test to catch this and also tested manually on a kind cluster. Please let me know if you see any further issues after upgrading. Thank you!
