
Research and determine the problems with flink operator #757

Closed
jtownley opened this issue Sep 30, 2020 · 11 comments
@jtownley

jtownley commented Sep 30, 2020

Goal: Determine what is causing Flink to flake

Determine and verify the reported causes of the flakiness

Determine whether this can be reported

Determine whether this is a bug in the operator

Determine whether there is a workaround

@jtownley
Author

@franciscolopezsancho
Collaborator

The problem
I would start by defining what we mean by flaking in this case. This is:

  • Creating a new flinkapp CR doesn't translate into a change of state in the cluster. The desired state changes in Kubernetes, but the current state does not.

What we've found
What's happening under the hood is:

  • New pods get created when the new flinkapp CR is deployed. After five minutes there is still no reconciliation between the new pod and the old one, and the Flink operator deletes the new one. The logs here state this fact quite clearly.
  • The old pod gets into an invalid state that prevents it from being terminated. Any state other than 'Running' or 'Deploy Failed', as per here, makes reconciliation impossible.
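
For reference, a quick way to inspect the state the operator sees on the CR (namespace and CR name are placeholders, and the status.phase path is an assumption about the lyft flinkk8soperator status layout):

kubectl -n <app-namespace> get flinkapplications
kubectl -n <app-namespace> get flinkapplications/<cr-name> -o jsonpath='{.status.phase}'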

Possible solutions

  1. Undeploy the whole Cloudflow application and redeploy, but this has a possible caveat: when reinstalling the same app, the state may have been removed. That state is kept through checkpointing and snapshots (savepoints in Flink terms).
  2. Undeploy the specific Flink application and redeploy, but again, when reinstalling the same app the state may have been removed.
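
As a minimal sketch of option 1 (the app name and namespace are placeholders, and the exact redeploy invocation depends on the Cloudflow version in use):

kubectl cloudflow undeploy <app-name>
kubectl get pvc -n <app-namespace>    # check whether the PVCs backing checkpoint state survived
# then redeploy the application with the usual kubectl cloudflow deploy invocation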

@franciscolopezsancho
Collaborator

franciscolopezsancho commented Oct 16, 2020

The first approach is building a vanilla Flink app inside a Cloudflow app and testing the state.

This is done, and it:

  1. proved that state will not be kept if the underlying storageClass has no explicitly set reclaimPolicy, which leaves the property at its default of 'Delete'. See here.
  2. proved that even when the policy is Retain (--set storageClass.retainPolicy=Retain when installing NFS), the Flink application doesn't keep track of past state (see the reclaim-policy check at the end of this comment). Now I'm wondering if my knowledge of Flink is enough, as I'm working under the assumption that
override def createLogic() = new FlinkStreamletLogic {
  override def buildExecutionGraph = {
    readStream(in).keyBy("deviceId").sum(1).print()
  }
}

should be enough to create state in the app that checkpointing should keep. So when running the app and posting

curl -i -X POST localhost:3000 -H "Content-Type: application/json" --data '{"deviceId":"c75cb448-df0e-4692-8e06-0321b7703992","count":4}'

four times we get

1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 8}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 12}
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 16}

But when I undeploy, redeploy, and execute the same POST, I would expect to get "count": 20, but I get

1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}

Would you agree with that conclusion, @RayRoestenburg @andreaTP? That right now no state can be kept if an undeploy is performed, whether the reclaim policy is Delete or Retain.
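
For reference, a quick way to check and, if needed, change the reclaim policy of the volume backing the checkpoints (the PV name is a placeholder; these are standard kubectl commands, nothing Cloudflow-specific):

kubectl get storageclass    # RECLAIMPOLICY column
kubectl get pv              # RECLAIM POLICY column of the bound volumes
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'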

@franciscolopezsancho
Collaborator

If the conclusion above is accepted, we can follow two options, as @ray suggested:

  • Recommend that users use another kind of persistence for Flink checkpointing, such as Azure Blob, instead of a PVC. See here. This is not fully tested on Azure yet, but will be soon.
  • Stop providing the PVC creation out of the box in Cloudflow and provide a way, through configuration, to reuse an existing PVC. See here.
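
As a rough sketch of the first option, these are the standard Flink configuration keys that would point checkpoint and savepoint storage at Azure Blob instead of a PVC-backed path (the container and account values are placeholders, the Azure filesystem plugin is assumed to be available, and how these keys would be passed through Cloudflow is not covered here):

state.checkpoints.dir: wasbs://<container>@<account>.blob.core.windows.net/checkpoints
state.savepoints.dir: wasbs://<container>@<account>.blob.core.windows.net/savepoints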

@RayRoestenburg
Contributor

RayRoestenburg commented Oct 21, 2020

@franciscolopezsancho Thanks for the details. Can you try to do a deploy, make a minor change, rebuild, and then deploy again with your test app (instead of undeploy/deploy)?

@franciscolopezsancho
Collaborator

Yep @RayRoestenburg. Tested. Works fine. Did it a couple of times and in both cases the state was kept.
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 4}
redeploy with changes
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 8}
redeploy with changes
1> {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "count": 12}

@RayRoestenburg
Contributor

RayRoestenburg commented Oct 21, 2020

Ok, nice, as expected, but I was getting slightly worried :-) Thanks @franciscolopezsancho!
So one of the other questions is: why does deploy-over-deploy cause such big problems for certain apps?

@franciscolopezsancho
Collaborator

Regarding that last question, @RayRoestenburg, I can only think that the cluster is not in great shape, plus the app has trouble reaching a snapshot/savepoint in some cases because it is throwing exceptions. By the cluster not being in great shape I mean it does not delete the previous PVCs (maybe because the garbage collector doesn't work). I've also seen some pods stuck in Pending state for a long time, waiting for resources to be created. In these cases some deployments don't work the first time but do the second.
Seems to me like a combination of a misbehaving cluster plus a faulty application.
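
For reference, a couple of quick checks for those symptoms (the namespace is a placeholder):

kubectl get pods -n <app-namespace> --field-selector=status.phase=Pending   # pods stuck waiting for resources
kubectl get pvc -n <app-namespace>                                          # PVCs left over from previous deployments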

@franciscolopezsancho
Collaborator

franciscolopezsancho commented Nov 10, 2020

Flink-operator problems and possible solutions

Deploy over deploy

The problem

Modifying an existing flinkapp CR doesn't translate into a change of state in the cluster. The desired state changes in Kubernetes, but the current state does not.

What's happening under the hood is:

New pods get created when the modified flinkapp CR is deployed. After five minutes there is still no reconciliation between the new pods and the old ones, and the Flink operator deletes the new ones. The logs here state this fact quite clearly.
The old pod gets into an invalid state that prevents it from being terminated. Any state other than 'Running' or 'Deploy Failed', as per here, makes reconciliation impossible.
Possible solution

The five-minute window can be configured, but that doesn't seem very reliable; most likely the state of the flinkapp is what blocks reconciliation.
Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

more references
lyft/flinkk8soperator#154 (comment)
#561

Failure recovery

The problem
When the JobManager fails and gets instantiated again, the communication between the JobManager and the TaskManagers is broken.

What's happening under the hood is:
The JobManager can't find the already running tasks:

Exception occurred in REST handler: Job ed952687752d2a5b2c60d843d7e5605f8 not found 

info found through lyft/flinkk8soperator#221
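
A quick way to confirm this symptom is to look at the JobManager pod logs for the "not found" errors (the namespace and pod name are placeholders; the pod name depends on how the operator names the JobManager deployment):

kubectl -n <app-namespace> logs <jobmanager-pod> | grep "not found"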

Possible solution

Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

Flinkapp CR fails to savepoint

The problem
In some situations, for no cause we can trace, savepointing can't get through and the "CLUSTER HEALTH" becomes unstable, i.e. PHASE -> Failed; CLUSTER_HEALTH -> Red.
This blocks any further deployment.
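
These values are visible on the CR itself; a quick check (the namespace is a placeholder, and the exact columns shown depend on the operator's CRD printer columns):

kubectl get flinkapplications -n <app-namespace>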

Possible solution

Undeploy the whole Cloudflow application and redeploy, on versions after 2.0.12 (not including 2.0.12 itself).

When undeploy/deploy doesn't work

The problem
In some cases 'kubectl cloudflow undeploy <app-name>' doesn't delete the CR.

What's happening under the hood is:
This happens when the CR contains finalizers that aren't getting removed. Finalizers are asynchronous pre-delete hooks that must be removed before the CR that contains them can be deleted.

Possible solution

kubectl -n [ns] patch flinkapplications/[app-name] -p '{"metadata":{"finalizers":[]}}' --type=merge
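
Before clearing them, it can help to see which finalizers are actually set (same placeholders as above):

kubectl -n [ns] get flinkapplications/[app-name] -o jsonpath='{.metadata.finalizers}'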

more info
lyft/flinkk8soperator#13

JM single point of failure

The problem
The JobManager is in charge of scheduling and manages the lifecycle of resources. This makes it a single point of failure.

What's happening under the hood is:
"HA" for the Flink JobManager is done through a Kubernetes Deployment with 1 replica. If the kubelet later crashes, Kubernetes assumes the JobManager has died and deploys another one, and here's the problem: the new JobManager doesn't know about the previous TaskManagers.

Possible solution
Investigating alternatives to flink-operator by @blublinsky: https://github.com/lightbend/FlinkClusterManager/

flink-operator was created around version 0.8 of Flink, and nowadays Flink has native Kubernetes support. While still experimental, it makes sense to start thinking about moving in this direction.

@RayRoestenburg
Contributor

Can this be closed? @franciscolopezsancho @jtownley

@blublinsky

blublinsky commented Nov 18, 2020

I think so
