
Research and determine the problems with flink operator #757

Closed
jtownley opened this issue Sep 30, 2020 · 11 comments
@jtownley

jtownley commented Sep 30, 2020

Goal: Determine what is causing Flink to flake

Determine and verify the reported causes of the flakiness

Determine whether this can be reported

Determine whether this is a bug in the operator

Determine whether there is a workaround

@jtownley
Author

@franciscolopezsancho
Collaborator

The problem
I would start by defining what we mean by flaking in this case. This is:

  • Creating a new flinkapp CR doesn't translate into a change of state in the cluster. The desired state changes in Kubernetes, but the current state does not.

What we've found
What's happening under the hood is:

  • New pods get created when the new flinkapp CR is deployed. After five minutes there is still no reconciliation between the new pod and the old one, and the Flink operator deletes the new one. The logs here state this fact quite clearly.
  • The old pod gets into an invalid state that prevents it from being terminated. Any state other than 'Running' or 'Deploy Failed', as per here, makes reconciliation impossible.
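
For reference, a quick way to inspect the state the operator sees on the CR (namespace and CR name are placeholders, and the status.phase path is an assumption about the lyft flinkk8soperator status layout):

kubectl -n <app-namespace> get flinkapplications
kubectl -n <app-namespace> get flinkapplications/<cr-name> -o jsonpath='{.status.phase}'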

Possible solutions

  1. Undeploy the whole Cloudflow application and redeploy, but this has a possible caveat: when reinstalling the same app, the state may have been removed. That state is kept through checkpointing and snapshots (savepoints in Flink terms).
  2. Undeploy the specific Flink application and redeploy, but again, when reinstalling the same app the state may have been removed.
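
As a minimal sketch of option 1 (the app name and namespace are placeholders, and the exact redeploy invocation depends on the Cloudflow version in use):

kubectl cloudflow undeploy <app-name>
kubectl get pvc -n <app-namespace>    # check whether the PVCs backing checkpoint state survived
# then redeploy the application with the usual kubectl cloudflow deploy invocation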

@franciscolopezsancho
Collaborator

franciscolopezsancho commented Oct 16, 2020

The first approach is building a vanilla Flink app inside a Cloudflow app and testing the state.

This is done, and it:

  1. proved that state will not be kept if the underlying storageClass has no explicitly set reclaimPolicy, which leaves the property at its default of 'Delete'. See here.
  2. proved that even when the policy is Retain (--set storageClass.retainPolicy=Retain when installing NFS), the Flink application doesn't keep track of past state (see the reclaim-policy check at the end of this comment). Now I'm wondering if my knowledge of Flink is enough, as I'm working under the assumption that
override def createLogic() = new FlinkStreamletLogic {
  override def buildExecutionGraph = {
    readStream(in).keyBy("deviceId").sum(1).print()
  }
}

should be enough to create state in the app that checkpointing should keep. So when running the app and posting

curl -i -X POST localhost:3000 -H "Content-Type: application/json" --data '{"deviceId":"c75cb448-df0e-4692-8e06-0321b7703992","count":4}'

four times we get

1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 8}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 12}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 16}

But when I undeploy, redeploy, and execute the same POST, I would expect to get "count": 20, but I get

1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}

Would you agree with that conclusion, @RayRoestenburg @andreaTP? That right now no state can be kept if an undeploy is performed, whether the reclaim policy is Delete or Retain.
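
For reference, a quick way to check and, if needed, change the reclaim policy of the volume backing the checkpoints (the PV name is a placeholder; these are standard kubectl commands, nothing Cloudflow-specific):

kubectl get storageclass    # RECLAIMPOLICY column
kubectl get pv              # RECLAIM POLICY column of the bound volumes
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'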

@franciscolopezsancho
Collaborator

If the conclusion above is accepted, we can follow two options, as @ray suggested:

  • Recommend that users use another kind of persistence for Flink checkpointing, such as Azure Blob, instead of a PVC. See here. This is not fully tested on Azure yet, but will be soon.
  • Stop providing the PVC creation out of the box in Cloudflow and provide a way, through configuration, to reuse an existing PVC. See here.
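
As a rough sketch of the first option, these are the standard Flink configuration keys that would point checkpoint and savepoint storage at Azure Blob instead of a PVC-backed path (the container and account values are placeholders, the Azure filesystem plugin is assumed to be available, and how these keys would be passed through Cloudflow is not covered here):

state.checkpoints.dir: wasbs://<container>@<account>.blob.core.windows.net/checkpoints
state.savepoints.dir: wasbs://<container>@<account>.blob.core.windows.net/savepoints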

@RayRoestenburg
Contributor

RayRoestenburg commented Oct 21, 2020

@franciscolopezsancho Thanks for the details. Can you try to do a deploy, make a minor change, rebuild, and then deploy again with your test app (instead of undeploy/deploy)?

@franciscolopezsancho
Collaborator

Yep @RayRoestenburg. Tested. Works fine. Did it a couple of times and in both cases the state was kept.
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}
redeploy with changes
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 8}
redeploy with changes
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 12}

@RayRoestenburg
Contributor

RayRoestenburg commented Oct 21, 2020

Ok, nice, as expected, but I was getting slightly worried :-) Thanks @franciscolopezsancho!
So one of the other questions is: why does deploy-over-deploy cause such big problems for certain apps?

@franciscolopezsancho
Collaborator

Regarding that last question, @RayRoestenburg, I can only think that the cluster is not in great shape, plus the app has trouble reaching a snapshot/savepoint in some cases because it is throwing exceptions. By the cluster not being in great shape I mean it does not delete the previous PVCs (maybe because the garbage collector doesn't work). I've also seen some pods stuck in Pending state for a long time, waiting for resources to be created. In these cases some deployments don't work the first time but do the second.
Seems to me like a combination of a misbehaving cluster plus a faulty application.
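
For reference, a couple of quick checks for those symptoms (the namespace is a placeholder):

kubectl get pods -n <app-namespace> --field-selector=status.phase=Pending   # pods stuck waiting for resources
kubectl get pvc -n <app-namespace>                                          # PVCs left over from previous deployments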

@franciscolopezsancho
Collaborator

franciscolopezsancho commented Nov 10, 2020

Flink-operator problems and possible solutions

Deploy over deploy

The problem

Modifying an existing flinkapp CR doesn't translate into a change of state in the cluster. The desired state changes in Kubernetes, but the current state does not.

What's happening under the hood is:

New pods get created when the modified flinkapp CR is deployed. After five minutes there is still no reconciliation between the new pods and the old ones, and the Flink operator deletes the new ones. The logs here state this fact quite clearly.
The old pod gets into an invalid state that prevents it from being terminated. Any state other than 'Running' or 'Deploy Failed', as per here, makes reconciliation impossible.
Possible solution

The five-minute window can be configured, but that doesn't seem very reliable; most likely the state of the flinkapp is what blocks reconciliation.
Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

more references
lyft/flinkk8soperator#154 (comment)
#561

Failure recovery

The problem
When the JobManager fails and gets instantiated again, the communication between the JobManager and the TaskManagers is broken.

What's happening under the hood is:
The JobManager can't find the already running tasks:

Exception occurred in REST handler: Job ed952687752d2a5b2c60d843d7e5605f8 not found 

info found through lyft/flinkk8soperator#221
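
A quick way to confirm this symptom is to look at the JobManager pod logs for the "not found" errors (the namespace and pod name are placeholders; the pod name depends on how the operator names the JobManager deployment):

kubectl -n <app-namespace> logs <jobmanager-pod> | grep "not found"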

Possible solution

Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

Flinkapp CR fails to savepoint

The problem
In some situations, for no cause we can trace, savepointing can't get through and the "CLUSTER HEALTH" becomes unstable, i.e. PHASE -> Failed; CLUSTER_HEALTH -> Red.
This blocks any further deployment.
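
These values are visible on the CR itself; a quick check (the namespace is a placeholder, and the exact columns shown depend on the operator's CRD printer columns):

kubectl get flinkapplications -n <app-namespace>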

Possible solution

Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

When undeploy/deploy doesn't work

The problem
In some cases 'kubectl cloudflow undeploy <app-name>' doesn't delete the CR.

What's happening under the hood is:
This happens when the CR contains finalizers that aren't getting removed. Finalizers are asynchronous pre-delete hooks that must be removed before the CR that contains them can be deleted.

Possible solution

kubectl -n [ns] patch flinkapplications/[app-name] -p '{"metadata":{"finalizers":[]}}' --type=merge
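
Before clearing them, it can help to see which finalizers are actually set (same placeholders as above):

kubectl -n [ns] get flinkapplications/[app-name] -o jsonpath='{.metadata.finalizers}'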

more info
lyft/flinkk8soperator#13

JM single point of failure

The problem
The JobManager is in charge of scheduling and manages the lifecycle of resources. This makes it a single point of failure.

What's happening under the hood is:
"HA" for the Flink JobManager is done through a Kubernetes Deployment with 1 replica. If the kubelet later crashes, Kubernetes assumes the JobManager has died and deploys another one, and here's the problem: the new JobManager doesn't know about the previous TaskManagers.

Possible solution
Investigating alternatives to flink-operator by @blublinsky: https://github.com/lightbend/FlinkClusterManager/

flink-operator was created around version 0.8 of Flink, and nowadays Flink has native Kubernetes support. While still experimental, it makes sense to start thinking about moving in this direction.

@RayRoestenburg
Contributor

Can this be closed? @franciscolopezsancho @jtownley

@blublinsky

blublinsky commented Nov 18, 2020

I think so
