
Flink cluster got into a bad state #13

Closed
yuchaoran2011 opened this issue Jun 3, 2019 · 3 comments

@yuchaoran2011
Contributor

I'm not able to delete the FlinkApplication CR object:

$ oc delete FlinkApplication wordcount-operator-example
flinkapplication.flink.k8s.io "wordcount-operator-example" deleted
$ oc get FlinkApplication
NAME                         AGE
wordcount-operator-example   28m

As shown above, the delete command reports success, but the resource is never actually removed.

The pods associated with the FlinkApp are not cleaned up either:
(Screenshot: Screen Shot 2019-06-03 at 16 28 02, showing the FlinkApp's pods still running)

@YuvalItzchakov
Contributor

YuvalItzchakov commented Jun 4, 2019

I ran into the same issue. The operator is unable to delete the underlying deployments if they're in a bad state. A workaround is to run:

kubectl patch flinkapplications/{APPLICATION_NAME} -p '{"metadata":{"finalizers":[]}}' --type=merge

This clears all finalizers and lets termination proceed.
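
Before patching, you can confirm that a stuck finalizer is what is blocking deletion. A hedged example, using the resource name from the report above (the exact finalizer value depends on the operator version):

kubectl get flinkapplications/wordcount-operator-example -o jsonpath='{.metadata.finalizers}'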

@anandswaminathan
Contributor

@YuvalItzchakov @yuchaoran2011 Thanks for reporting. This happens because the example in the quick start guide completes. With a streaming application that never reaches the "FINISHED" state, this would not happen. We will get a fix for this.

@mwylde The resource is stuck in deletion because we cannot take a savepoint when there is no Flink job running in the cluster, so the finalizer never gets cleared. This happens in the following cases:

  1. The underlying Flink job finishes.
  2. The underlying Flink job is cancelled from outside the operator.
  3. After issuing CancelWithSavepoint, we fail to store the triggerId in the custom resource. (Case 3 is extremely rare, but possible.)

What do you think about clearing the finalizer if jobFinished is true here: https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L529
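
For illustration, a minimal Go sketch of that idea, assuming a finalizer named job.finalizers.flink.k8s.io and a jobFinished flag; the names and surrounding types are placeholders, not the operator's actual code:

package main

import "fmt"

// Assumed finalizer name; the real operator may use a different value.
const jobFinalizer = "job.finalizers.flink.k8s.io"

// removeFinalizer returns the finalizer list without the given entry.
func removeFinalizer(finalizers []string, target string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != target {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Simulated CR metadata: deletion requested, but the job already
	// reached FINISHED, so no savepoint can be taken.
	finalizers := []string{jobFinalizer}
	jobFinished := true

	if jobFinished {
		// Clear the finalizer instead of blocking deletion forever.
		finalizers = removeFinalizer(finalizers, jobFinalizer)
	}
	fmt.Println(finalizers) // prints: []
}

With the finalizer list empty, Kubernetes can finish deleting the custom resource as usual.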

@yuchaoran2011
Contributor Author

@anandswaminathan Thanks for the reply. Actually, the example app is a streaming app (it is adapted from this example), but it still completes.

I think it completes because the input data comes from a bounded data source. In other words, all batch apps, and streaming apps with a bounded data source, will complete. Only streaming apps with an unbounded data source run indefinitely.
