
Flink cluster got into a bad state #13

Closed
yuchaoran2011 opened this issue Jun 3, 2019 · 3 comments

@yuchaoran2011
Contributor

I'm not able to delete the FlinkApplication CR object:

$ oc delete FlinkApplication wordcount-operator-example
flinkapplication.flink.k8s.io "wordcount-operator-example" deleted
$ oc get FlinkApplication
NAME                         AGE
wordcount-operator-example   28m

As shown above, the delete command reports success, but the resource is never actually removed.

The pods associated with the FlinkApp are not cleaned up either:
(Screenshot: Screen Shot 2019-06-03 at 16 28 02, showing the FlinkApp's pods still running)

@YuvalItzchakov
Contributor

YuvalItzchakov commented Jun 4, 2019

I ran into the same issue. The operator is unable to delete the underlying deployments if they're in a bad state. A workaround is to run:

kubectl patch flinkapplications/{APPLICATION_NAME} -p '{"metadata":{"finalizers":[]}}' --type=merge

This clears all finalizers and lets termination proceed.
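
Before patching, you can confirm that a stuck finalizer is what is blocking deletion. A hedged example, using the resource name from the report above (the exact finalizer value depends on the operator version):

kubectl get flinkapplications/wordcount-operator-example -o jsonpath='{.metadata.finalizers}'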

@anandswaminathan
Contributor

@YuvalItzchakov @yuchaoran2011 Thanks for reporting. This happens because the example in the quick start guide completes. With a streaming application that never reaches the "FINISHED" state, this would not happen. We will get a fix for this.

@mwylde The resource is stuck in deletion because we cannot take a savepoint when there is no Flink job running in the cluster, so the finalizer never gets cleared. This happens in the following cases:

  1. The underlying Flink job finishes.
  2. The underlying Flink job is cancelled from outside the operator.
  3. After issuing CancelWithSavepoint, we fail to store the triggerId in the custom resource. (Case 3 is extremely rare, but possible.)

What do you think about clearing the finalizer if jobFinished is true here: https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flinkapplication/flink_state_machine.go#L529
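
For illustration, a minimal Go sketch of that idea, assuming a finalizer named job.finalizers.flink.k8s.io and a jobFinished flag; the names and surrounding types are placeholders, not the operator's actual code:

package main

import "fmt"

// Assumed finalizer name; the real operator may use a different value.
const jobFinalizer = "job.finalizers.flink.k8s.io"

// removeFinalizer returns the finalizer list without the given entry.
func removeFinalizer(finalizers []string, target string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != target {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Simulated CR metadata: deletion requested, but the job already
	// reached FINISHED, so no savepoint can be taken.
	finalizers := []string{jobFinalizer}
	jobFinished := true

	if jobFinished {
		// Clear the finalizer instead of blocking deletion forever.
		finalizers = removeFinalizer(finalizers, jobFinalizer)
	}
	fmt.Println(finalizers) // prints: []
}

With the finalizer list empty, Kubernetes can finish deleting the custom resource as usual.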

@yuchaoran2011
Contributor Author

@anandswaminathan Thanks for the reply. Actually, the example app is a streaming app (it is adapted from this example), but it still completes.

I think it completes because the input data comes from a bounded data source. In other words, all batch apps, and streaming apps with a bounded data source, will complete. Only streaming apps with an unbounded data source run indefinitely.
