Long story short

Hi @nolar,

First of all, thank you for the effort that you and the contributors have put into this project and for providing such a solid framework. We recently deployed Kopf in production to manage data for our search clusters.
I want to bring up an unusual situation we've encountered, and I hope we can find a way to mitigate it. Our operator uses one timer as the main handler, which periodically checks our workflow, plus an update handler on the status.storage=[] field whose contents are persisted to S3. Recently, one of our CRD objects got stuck in a loop, re-checking the same folder for 7 days in production. The general workflow is: we store the details of each successful update as an object in status.storage=[] via patch.setdefault("status", {})["storage"], and wait for the operator to store those details in S3 with the help of the status.storage update handler.
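For context, here is a minimal sketch of this setup; the group/version/plural names and the helper functions are placeholders for illustration, not our actual code:

```python
import kopf

# Hypothetical resource coordinates -- replace with the real CRD's group/version/plural.
GROUP, VERSION, PLURAL = "example.com", "v1", "operators"


@kopf.timer(GROUP, VERSION, PLURAL, interval=60.0)
def check_workflow(spec, status, patch, **_):
    """Main handler: periodically check the workflow and record successful updates."""
    result = run_workflow_step(spec)  # placeholder for the folder-checking logic
    if result is not None:
        items = list(status.get("storage", []))
        items.append(result)
        # Record the successful update so the update handler can persist it to S3.
        patch.setdefault("status", {})["storage"] = items


@kopf.on.update(GROUP, VERSION, PLURAL, field="status.storage")
def store_to_s3(new, **_):
    """Update handler: persist the latest status.storage contents to S3."""
    upload_to_s3(new)  # placeholder for the actual S3 upload (e.g. via boto3)


def run_workflow_step(spec):
    ...  # omitted: the real workflow check


def upload_to_s3(items):
    ...  # omitted: the real upload
```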
Regarding this issue, here are some observations:
The update handler did not trigger for 7 days, even though status.storage was updated with the latest item.
Although we used patch.status within the handler, the status.storage that the timer receives via its status parameter did not include the latest item. Instead, it still showed the item from 7 days earlier, and the same workflow kept repeating for 7 days until we noticed the problem.
We are able to suspend a plan's workflow using when= filters on the handlers, but from time to time neither kubectl edit nor kubectl apply -f triggered any message such as Handler 'Operator.resume_operator_handler/spec' succeeded. Although the changes did show up in the spec, the plan was not suspended either, and I noticed that kopf.zalando.org/last-handled-configuration was still pointing to an old configuration.
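For completeness, the suspension is gated roughly like this (a sketch; the spec.suspended flag and the resource names are placeholders, not our real field names):

```python
import kopf


def is_not_suspended(spec, **_):
    # Hypothetical flag in the CR spec; our real CRD uses its own field.
    return not spec.get("suspended", False)


@kopf.timer("example.com", "v1", "operators", interval=60.0, when=is_not_suspended)
def check_workflow(spec, status, patch, **_):
    ...  # the periodic check runs only while the plan is not suspended
```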
We also use the list_cluster_custom_object method to list the CRD objects and check whether any plans overlap on specific values. In this scenario, when we delete a CRD object and create the same plan under a different name, the list still includes the old CRD object, resulting in a validation error even though the object is no longer present in the cluster. Do you think this is related to caching in the operator, or to the way the Kubernetes API is used within the operator?
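The overlap check is roughly the following pattern (a simplified sketch using the official Kubernetes Python client; the resource coordinates, the "target" field, and the error handling are placeholders, not our exact code):

```python
import kopf
from kubernetes import client, config


def validate_no_overlap(new_spec: dict) -> None:
    """Reject a new plan if an existing CRD object overlaps on specific values."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    api = client.CustomObjectsApi()
    # Hypothetical CRD coordinates -- replace with the real group/version/plural.
    plans = api.list_cluster_custom_object(group="example.com", version="v1", plural="operators")
    for item in plans.get("items", []):
        if overlaps(item["spec"], new_spec):
            # Illustrative way to surface the validation error from a handler.
            raise kopf.PermanentError(f"Plan overlaps with {item['metadata']['name']}")


def overlaps(existing_spec: dict, new_spec: dict) -> bool:
    # Hypothetical overlap rule; the real check compares several spec values.
    return existing_spec.get("target") == new_spec.get("target")
```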
The workaround is not desirable from our side, but restarting the operator, deleting the plan, and recreating it with a new metadata.name fixed the problem.
Sadly, I have no useful error or warning messages from this extended period. Please let me know if you need further details about our process.

Thanks!
Follow-up on this: I've removed the status.storage handler, because kopf.zalando.org/last-handled-configuration kept accumulating the full list of items for this field and gradually became bloated. I've moved that workflow from the handler into our timer module. For our use case, this seems to be a decent solution for now.
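Roughly, the consolidated version looks like this sketch (placeholder names again), where the timer both uploads to S3 and records the item, so no field handler on status.storage is needed:

```python
import kopf


@kopf.timer("example.com", "v1", "operators", interval=60.0)
def check_workflow(spec, status, patch, **_):
    result = run_workflow_step(spec)  # placeholder for the folder-checking logic
    if result is not None:
        upload_to_s3(result)  # upload directly from the timer instead of a field handler
        items = list(status.get("storage", []))
        items.append(result)
        patch.setdefault("status", {})["storage"] = items


def run_workflow_step(spec):
    ...  # omitted: the real workflow check


def upload_to_s3(item):
    ...  # omitted: the real upload
```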
Regarding the stale list_cluster_custom_object results mentioned above (the deleted CRD object still showing up in the list): I've added the watch=False parameter to this call. After testing this behavior in our integration tests, no stale items are returned after deleting a CRD object and recreating it with different metadata.
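Concretely, the change was just passing watch=False explicitly to the list call (same placeholder coordinates as in the sketch above):

```python
from kubernetes import client

api = client.CustomObjectsApi()
plans = api.list_cluster_custom_object(
    group="example.com", version="v1", plural="operators",  # placeholder coordinates
    watch=False,  # plain one-shot list; this is what removed the stale items in our tests
)
```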
I'm closing this issue. If there are any new hiccups, I'll bring them up here. Cheers!
Kopf version: 1.37.1
Kubernetes version: v1.27.12
Python version: 3.12