Research and determine the problems with the Flink operator #757
Link to subfeature: https://lightbend.productboard.com/feature-board/planning/features/5738467

The problem
What we've found
Possible solution
The first approach is building a vanilla Flink app inside a Cloudflow app and testing the state. This is done now.
It should be enough to create state in the app that checkpointing keeps. So when running the app and posting four times we get:
But when I undeploy, redeploy, and execute the same POSTs, I would expect to get:
Would you agree with that conclusion, @RayRoestenburg @andreaTP? That right now no state can be kept if an undeploy is performed, regardless of what the retainPolicy is.
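For reference, a minimal sketch of the kind of stateful logic such a test app might use: a keyed counter backed by Flink's ValueState, which a checkpoint or savepoint is expected to preserve across restarts. This is only an illustration with made-up names, not the actual test app.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Counts POSTs per key; the count lives in ValueState, so it should
// survive a restart as long as a checkpoint/savepoint is restored.
class PostCounter extends RichFlatMapFunction[String, (String, Long)] {
  @transient private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit =
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("post-count", classOf[java.lang.Long]))

  override def flatMap(key: String, out: Collector[(String, Long)]): Unit = {
    val next = Option(count.value()).map(_.longValue).getOrElse(0L) + 1
    count.update(next)
    out.collect((key, next))
  }
}

object StatefulTestApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000)          // checkpoint every 10s
    env.socketTextStream("localhost", 9999) // stand-in for the real HTTP ingress
      .keyBy(s => s)
      .flatMap(new PostCounter)
      .print()
    env.execute("stateful-test-app")
  }
}
```

After four POSTs for the same key, the counter should read 4, and it should still read 4 after a restart that restores from a checkpoint or savepoint.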
If the conclusion above is accepted, we can follow the two options @ray suggested.
@franciscolopezsancho Thanks for the details. Can you try doing a deploy, making a minor change, rebuilding, followed by another deploy with your test app (instead of undeploy/deploy)?
Yep @RayRoestenburg. Tested, works fine. Did it a couple of times and in both cases it kept the state.
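For anyone reproducing this, the flow being tested looks roughly like the sketch below, assuming the standard Cloudflow kubectl plugin; the app name, CR file, and image tag are placeholders.

```sh
# Redeploy over the running app without undeploying, so the operator
# can restore from the last savepoint (names below are placeholders).
sbt buildApp                            # rebuild the app after a minor change
kubectl cloudflow deploy my-app-cr.json # deploy over the running app (no undeploy)
kubectl cloudflow status my-app         # confirm streamlets restart with state intact
```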
Ok nice, as expected, but I was getting slightly worried :-) thanks @franciscolopezsancho!
Regarding that last question @RayRoestenburg, I can only think that the cluster is not in great shape, plus the app has trouble getting to a snapshot/savepoint in some cases because it is throwing exceptions. By the cluster not being in great shape I mean it is not deleting the previous PVCs (maybe because the garbage collector doesn't work). I've also seen some pods stuck in Pending state for a long time, waiting for resources to be created. In those cases some deployments don't work the first time but do the second.
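Both symptoms are easy to check for; a few standard kubectl commands suffice (the namespace and pod names below are placeholders):

```sh
# Leftover PVCs from a previous deploy
kubectl get pvc -n my-app

# Pods stuck waiting for resources
kubectl get pods -n my-app --field-selector=status.phase=Pending

# The Events section explains why scheduling is failing
kubectl describe pod my-pending-pod -n my-app
```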
Flink-operator problems and possible solutions

Deploy over deploy

The problem
When modifying an existing flinkapp CR, the change doesn't translate into a change of state in the cluster. The desired state does change in Kubernetes, but the current state does not.

What's happening under the hood is: some new pods get created because of the deployment of the modified flinkapp CR. After five minutes, with no reconciliation between the new pods and the old ones, the Flink operator deletes the new ones. The operator logs are quite explicit about this. These 5 minutes can be configured, but it doesn't seem very reliable; most likely the state of the flinkapp is blocking the reconciliation (more references). The symptom can be observed with the commands sketched below.
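A hedged way to watch this happen from the outside, assuming the lyft/flinkk8soperator CRD kind FlinkApplication; namespaces and names are placeholders:

```sh
# Compare spec (desired) with status (current) on the CR
kubectl get flinkapplications -n my-app -o yaml

# New pods appear after the CR change, then are deleted ~5 minutes later
kubectl get pods -n my-app -w

# The operator logs describe the failed reconcile
kubectl logs deployment/flink-operator -n flink-operator
```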
Failure recovery

The problem
What's happening under the hood is: info found through lyft/flinkk8soperator#221.

Possible solution
Undeploy the whole Cloudflow application and redeploy (after version 2.0.12, not included).

Flinkapp CR fails at savepointing

The problem

Possible solution
Undeploy the whole Cloudflow application and redeploy (after version 2.0.12, not included). A manual savepoint check is sketched below.

When undeploy/deploy doesn't work

The problem
What's happening under the hood is:

Possible solution
(more info)
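One way to narrow down the savepointing failure above is to trigger a savepoint by hand with the Flink CLI, bypassing the operator; the job id and target directory below are placeholders:

```sh
# Find the id of the running job
flink list

# Trigger a savepoint manually against the JobManager
flink savepoint a1b2c3d4e5f6 s3://my-bucket/savepoints
```

If the manual savepoint fails with the same exception, the problem is in the job itself rather than in the operator.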
JM single point of failure

The problem
What's happening under the hood is:

Possible solution
The flink-operator was created around version 0.8 of Flink, and nowadays Flink has support for native Kubernetes. While still experimental, it makes sense to start thinking about moving in this direction.
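For illustration, deploying a job with Flink's native Kubernetes integration (application mode, Flink 1.11+) looks roughly like this; the cluster id, image, and jar path are placeholders:

```sh
# Native Kubernetes application mode (Flink 1.11+); all names are placeholders.
./bin/flink run-application \
  --target kubernetes-application \
  -Dkubernetes.cluster-id=my-flink-app \
  -Dkubernetes.container.image=registry.example.com/my-flink-app:latest \
  local:///opt/flink/usrlib/my-job.jar
```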
Can this be closed? @franciscolopezsancho @jtownley
I think so
Goal: Determine what is causing Flink to flake
- Determine and verify the reported causes of the Flink failures
- Determine if this can be reported
- Determine if this is a bug in the operator
- Determine if this could have a workaround