Long story short

Hi @nolar,

First of all, thank you for the effort that you and the contributors have put into this project and for providing such a solid framework. We recently deployed Kopf in production to manage data for our search clusters.
I want to bring up an unusual situation we've encountered, and I hope we can find a way to mitigate it. Our operator uses one timer as the main handler, which periodically checks our workflow, plus an update handler on the status.storage=[] field whose contents are persisted to S3. Recently, one of our CRD objects got stuck in a loop, re-checking the same folder for 7 days in production. The general workflow is: we store the details of each successful update as an object in status.storage=[] via patch.setdefault("status", {})["storage"], and wait for the operator to store those details in S3 with the help of the status.storage update handler.
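For context, here is a minimal sketch of this setup; the group/version/plural names and the helper functions are placeholders for illustration, not our actual code:

```python
import kopf

# Hypothetical resource coordinates -- replace with the real CRD's group/version/plural.
GROUP, VERSION, PLURAL = "example.com", "v1", "operators"


@kopf.timer(GROUP, VERSION, PLURAL, interval=60.0)
def check_workflow(spec, status, patch, **_):
    """Main handler: periodically check the workflow and record successful updates."""
    result = run_workflow_step(spec)  # placeholder for the folder-checking logic
    if result is not None:
        items = list(status.get("storage", []))
        items.append(result)
        # Record the successful update so the update handler can persist it to S3.
        patch.setdefault("status", {})["storage"] = items


@kopf.on.update(GROUP, VERSION, PLURAL, field="status.storage")
def store_to_s3(new, **_):
    """Update handler: persist the latest status.storage contents to S3."""
    upload_to_s3(new)  # placeholder for the actual S3 upload (e.g. via boto3)


def run_workflow_step(spec):
    ...  # omitted: the real workflow check


def upload_to_s3(items):
    ...  # omitted: the real upload
```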
Regarding this issue, here are some observations:
The update handler did not trigger for 7 days, even though status.storage was updated with the latest item.
Although we used patch.status within the handler, the status.storage that the timer receives via its status parameter did not include the latest item. Instead, it still showed the item from 7 days earlier, and the same workflow kept repeating for 7 days until we noticed the problem.
We are able to suspend a plan's workflow using when= filters on the handlers, but from time to time neither kubectl edit nor kubectl apply -f triggered any message such as Handler 'Operator.resume_operator_handler/spec' succeeded. Although the changes did show up in the spec, the plan was not suspended either, and I noticed that kopf.zalando.org/last-handled-configuration was still pointing to an old configuration.
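For completeness, the suspension is gated roughly like this (a sketch; the spec.suspended flag and the resource names are placeholders, not our real field names):

```python
import kopf


def is_not_suspended(spec, **_):
    # Hypothetical flag in the CR spec; our real CRD uses its own field.
    return not spec.get("suspended", False)


@kopf.timer("example.com", "v1", "operators", interval=60.0, when=is_not_suspended)
def check_workflow(spec, status, patch, **_):
    ...  # the periodic check runs only while the plan is not suspended
```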
We also use the list_cluster_custom_object method to list the CRD objects and check whether any plans overlap on specific values. In this scenario, when we delete a CRD object and create the same plan under a different name, the list still includes the old CRD object, resulting in a validation error even though the object is no longer present in the cluster. Do you think this is related to caching in the operator, or to the way the Kubernetes API is used within the operator?
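The overlap check is roughly the following pattern (a simplified sketch using the official Kubernetes Python client; the resource coordinates, the "target" field, and the error handling are placeholders, not our exact code):

```python
import kopf
from kubernetes import client, config


def validate_no_overlap(new_spec: dict) -> None:
    """Reject a new plan if an existing CRD object overlaps on specific values."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    api = client.CustomObjectsApi()
    # Hypothetical CRD coordinates -- replace with the real group/version/plural.
    plans = api.list_cluster_custom_object(group="example.com", version="v1", plural="operators")
    for item in plans.get("items", []):
        if overlaps(item["spec"], new_spec):
            # Illustrative way to surface the validation error from a handler.
            raise kopf.PermanentError(f"Plan overlaps with {item['metadata']['name']}")


def overlaps(existing_spec: dict, new_spec: dict) -> bool:
    # Hypothetical overlap rule; the real check compares several spec values.
    return existing_spec.get("target") == new_spec.get("target")
```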
The workaround is not desirable from our side, but restarting the operator, deleting the plan, and recreating it with a new metadata.name fixed the problem.
Sadly, I have no useful error or warning messages from this extended period. Please let me know if you need further details about our process.

Thanks!
Follow-up on this: I've removed the status.storage handler, because kopf.zalando.org/last-handled-configuration kept accumulating the full list of items for this field and gradually became bloated. I've moved that workflow from the handler into our timer module. For our use case, this seems to be a decent solution for now.
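Roughly, the consolidated version looks like this sketch (placeholder names again), where the timer both uploads to S3 and records the item, so no field handler on status.storage is needed:

```python
import kopf


@kopf.timer("example.com", "v1", "operators", interval=60.0)
def check_workflow(spec, status, patch, **_):
    result = run_workflow_step(spec)  # placeholder for the folder-checking logic
    if result is not None:
        upload_to_s3(result)  # upload directly from the timer instead of a field handler
        items = list(status.get("storage", []))
        items.append(result)
        patch.setdefault("status", {})["storage"] = items


def run_workflow_step(spec):
    ...  # omitted: the real workflow check


def upload_to_s3(item):
    ...  # omitted: the real upload
```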
Regarding the stale list_cluster_custom_object results mentioned above (the deleted CRD object still showing up in the list): I've added the watch=False parameter to this call. After testing this behavior in our integration tests, no stale items are returned after deleting a CRD object and recreating it with different metadata.
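Concretely, the change was just passing watch=False explicitly to the list call (same placeholder coordinates as in the sketch above):

```python
from kubernetes import client

api = client.CustomObjectsApi()
plans = api.list_cluster_custom_object(
    group="example.com", version="v1", plural="operators",  # placeholder coordinates
    watch=False,  # plain one-shot list; this is what removed the stale items in our tests
)
```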
I'm closing this issue. If there are any new hiccups, I'll bring them up here. Cheers!
Kopf version: 1.37.1
Kubernetes version: v1.27.12
Python version: 3.12