Stateful Failover Proposal #5116

Open
wants to merge 2 commits into master
Conversation

@Dyex719 commented Jul 1, 2024

What type of PR is this?
Proposal for stateful failover

What this PR does / why we need it:
Explained in the doc

Which issue(s) this PR fixes:
Fixes #5006, #4969

Special notes for your reviewer:
N/A

Does this PR introduce a user-facing change?:
NONE

@karmada-bot karmada-bot added the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Jul 1, 2024
@karmada-bot (Collaborator)

Welcome @Dyex719! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 1, 2024
@codecov-commenter commented Jul 1, 2024


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 40.91%. Comparing base (2271a41) to head (fd35fb4).
Report is 680 commits behind head on master.


Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5116       +/-   ##
===========================================
+ Coverage   28.21%   40.91%   +12.69%     
===========================================
  Files         632      650       +18     
  Lines       43556    55182    +11626     
===========================================
+ Hits        12291    22575    +10284     
- Misses      30368    31170      +802     
- Partials      897     1437      +540     
Flag      | Coverage Δ
unittests | 40.91% <ø> (+12.69%) ⬆️

Flags with carried forward coverage won't be shown.

Signed-off-by: Aditya Addepalli <dyex719@gmail.com>
@karmada-bot karmada-bot removed the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Jul 1, 2024
@RainbowMango (Member) left a comment


/assign
Thanks @Dyex719! Just quickly went through the proposal; I believe this is a meaningful enhancement and very inspiring.

@Dyex719 (Author) commented Jul 3, 2024

Thanks @RainbowMango! One assumption that we wanted to discuss (I wrote this in the proposal too):

If all the replicas of a stateful application are not migrated together, it is not clear when the state needs to be restored. In this proposal we focus on the use case where all the replicas of a stateful application are migrated together.

Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.

@RainbowMango (Member)

Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.

I 100% agree.

  1. Partial replica migration is not technically failover, but more like elastic scaling of replicas.
  2. The conditions under which failover triggers are fully configurable (see the snippet below): if people tolerate the failure of part of the replicas, then migration should not be triggered; if they don't, the application should be rebuilt by leveraging failover.
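For reference, these trigger conditions are expressed through the existing application failover configuration; a minimal illustrative snippet (values are made up):

# Illustrative values only: failover trigger conditions in a
# PropagationPolicy's application failover behavior.
failover:
  application:
    decisionConditions:
      tolerationSeconds: 60   # how long to tolerate the failure before migrating
    purgeMode: Graciously     # how replicas on the failed cluster are cleaned up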


We propose to add two fields to a PropagationPolicy to enable stateful failover (sketched below).
1. persistedFields.maxHistory: This sets the maximum amount of stateful failover history that is persisted before older entries are overwritten. If this is set to 5, the ResourceBinding will store at most 5 failover entries in FailoverHistory before it overwrites the older history.
2. persistedFields.fields: This is a list of the fields that the stateful application requires to be persisted during failover so it can resume processing from a particular point. It takes a list of field names as well as how to access them from the spec.
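A sketch of how these two proposed fields might look in a PropagationPolicy (the placement under failover.application is an assumption; field names follow the proposal text):

# Hypothetical sketch of the proposed fields; placement is illustrative.
failover:
  application:
    persistedFields:
      maxHistory: 5               # keep at most 5 entries in FailoverHistory
      fields:
        - LabelName: jobID        # label key used when re-injecting the value
          PersistedStatusItem: obj.status.jobStatus.jobID   # where to read the value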
Member

Is there any restriction for the field type here? Can we or should we support types other than String?

Author

@Dyex719 Jul 9, 2024

persistedFields.fields contains the name of the field we want to persist and the path to obtain the value for that label. Since both the name and value are tied to a label, which supports only strings, I think having a string is sufficient.

persistedFields.fields:
      - LabelName: jobID
        PersistedStatusItem: obj.status.jobStatus.jobID

Member

It works, I believe. But what I'm thinking here is maybe we can make room for extensibility. The info we take could be more generic: a string, integer, and so on. I'm still thinking about it and will get back to you once I have an idea.

Member

Tried to draft the API, please take a look:

// ApplicationFailoverBehavior indicates application failover behaviors.
type ApplicationFailoverBehavior struct {
	// other fields.

	// StatePreservation defines the policy for preserving and restoring state data
	// during failover events for stateful applications.
	//
	// When an application fails over from one cluster to another, this policy enables
	// the extraction of critical data from the original resource configuration.
	// Upon successful migration, the extracted data is then re-injected into the new
	// resource, ensuring that the application can resume operation with its previous
	// state intact.
	// This is particularly useful for stateful applications where maintaining data
	// consistency across failover events is crucial.
	// If not specified, no state data will be preserved.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// StatePreservation defines the policy for preserving state during failover events.
type StatePreservation struct {
	// Rules contains a list of StatePreservationRule configurations.
	// Each rule specifies a JSONPath expression targeting specific pieces of
	// state data to be preserved during failover events. An AliasName is associated
	// with each rule, serving as a label key when the preserved data is passed
	// to the new cluster.
	// +required
	Rules []StatePreservationRule `json:"rules"`

	// Note: We probably need more policies to control how to feed the new cluster with the
	// preserved state data in the future. Such as:
	// - Is it always acceptable to feed data as a label? Is there a need for annotations?
	// - Whether each label name should start with a prefix, like `karmada.io/failover-preserving-<fieldname>: <state>`
	// Sure, we can default to the label; this structure just makes room for future extensions.
	//
	// For instance, we can introduce a policy if someone wants to control how the preserved state
	// is fed to new clusters. This is probably not included this time.
	// RestorePolicy determines when and how the preserved state should be restored.
	// RestorePolicy RestorePolicy `json:"restorePolicy"`
}

// StatePreservationRule defines a single rule for state preservation.
// It includes a JSONPath expression and an alias name that will be used
// as a label key when passing state information to the new cluster.
type StatePreservationRule struct {
	// AliasName is the name that will be used as a label key when the preserved
	// data is passed to the new cluster. This facilitates the injection of the
	// preserved state back into the application resources during recovery.
	// +required
	AliasName string `json:"aliasName"`

	// JSONPath is the JSONPath query used to identify the state data
	// to be preserved from the original resource configuration.
	// Example: ".spec.template.spec.containers[?(@.name=='my-container')].image"
	// This example selects the image of a container named 'my-container'.
	// +required
	JSONPath string `json:"jsonPath"`
}
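To make the shape concrete, a policy using this draft might look like the following (a hypothetical sketch; the jobID rule mirrors the Flink example discussed in this thread, and the JSONPath value is illustrative):

# Hypothetical usage of the drafted StatePreservation API.
failover:
  application:
    decisionConditions:
      tolerationSeconds: 60
    statePreservation:
      rules:
        - aliasName: jobID               # used as a label key on the new cluster
          jsonPath: ".jobStatus.jobID"   # state data to extract and preserve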

Member

@Dyex719 @mszacillo What do you think of this API?

I haven't put the history part here, but it seems it should not be in StatePreservation or persistedFields as referenced by #5116.

Member

Good point, let me think about it.

Member

It makes 100% sense to move the StatePreservation to FailoverBehavior, so that both cluster failover and application failover can share the same configuration.

But given that the cluster failover configuration is still not provided, and it's not clear when it will come, I'm afraid it might slow down the process if we consider cluster failover in this proposal. So I tend to focus on application failover right now.

In addition, if cluster failover also needs this configuration later, we can deprecate the one in application failover in a compatible way; it's not a concern for now.

Member

By the way, now I'm going to look at the second part (as I mentioned in #5116 (comment)), which is how to grab and store the state data in the ResourceBinding.

Author

In our implementation we use the taint condition to specify that a cluster failover is taking place.

Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in the scope of this proposal, as it is important for Flink applications to recover from these failures as well.

Member

Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in the scope of this proposal, as it is important for Flink applications to recover from these failures as well.

I don't mind having cluster failover in this proposal. You might need to refresh the proposal.
I'm sure this is going to be a large proposal.


// PersistedItem is a pointer to the status item that should be persisted to the rescheduled
// replica during a failover. This should be input in the form: obj.status.<path-to-item>
PersistedStatusItem string `json:"persistedStatusItem,omitempty"`
Member

Given this persistedStatusItem will be applied as a label value, which brings a limitation that the length cannot exceed 63 characters: does Flink's checkpoint path exceed this?

Author

In our use case, we only want to persist the jobID, which is a 32-character hexadecimal string, so this limitation is not a problem for us. Both labels and annotations have a 63-character limit for the value in the key-value pair, so I don't think there is any way around this.

Author

@Dyex719 Jul 9, 2024

I should make a change here:

// persistedStatusItem is the value of the status item that should be persisted to the rescheduled replica during a failover. This should be the value at the location obj.status.<path-to-item>

In the case of Flink, the Flink process will create a jobID at obj.status.jobStatus.jobID, which would look like fd72014d4c864993a2e5a9287b4a9c5d, and the persistedStatusItem would store this value in the PersistedStatusItem field.

The FlinkDeployment status would have:

status:
  jobStatus:
    jobID: fd72014d4c864993a2e5a9287b4a9c5d

PersistedDuringFailover would store jobID as the labelName and fd72014d4c864993a2e5a9287b4a9c5d as the PersistedStatusItem (value).
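After failover, the re-injected label on the rescheduled resource would then look roughly like this (a sketch based on the example above):

# Sketch: label re-injected into the rescheduled application.
metadata:
  labels:
    jobID: fd72014d4c864993a2e5a9287b4a9c5d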

Author

Since we are storing the metadata related to the state here, I think the limitation of the value being a string of less than 63 characters is okay. The actual state of the Flink job will be stored in some durable storage and may be very big. For other stateful applications as well, I think we can make the assumption that metadata related to the state is persisted with the resourceBindingStatus and the actual data can be stored elsewhere.

I will include this in the proposal. Let me know what you think.

Member

Yeah, I think we can go with the label as the default way to provide the data.
By the way, I talked to Kyverno's maintainer and was told that Kyverno can perform filters against annotations. We can support that if there is a need.

Both labels and annotations have a 63-character limit for the value in the key-value pair, so I don't think there is any way around this.

By the way, the value length of an annotation can support up to 253 chars. :)

Author

Looks like, as you said, annotation values can support up to 253 characters, whereas label values can only be 63 characters long. Keys follow the same format for both labels and annotations: a name of up to 63 characters with an optional prefix of up to 253 characters.

@karmada-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RainbowMango (Member) commented Jul 20, 2024

Hi @Dyex719, @mszacillo, I've been thinking about this feature recently and came up with some ideas. This feature consists of 3 parts:

The first part is how to declare which state data (fields) should be preserved during failover; I posted my idea at #5116 (comment). Please take a look.

The second part is how to store the preserved state data. One approach is to store it in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.

I think it is worth considering storing it in the GracefulEvictionTask. During the failover process, just before creating the eviction task, it makes sense to grab some state (fields) or a snapshot of the scheduled cluster list. With this grabbed data, the controller would get to know which cluster is the destination by comparing the snapshot and the newly scheduled cluster.

[edit]
I don't mean I don't like the first approach, I'm just raising another idea. Maybe we can also add a snapshot of the scheduled cluster to the history item. The most challenging thing for this part is distinguishing which field should be managed by which component (scheduler or controller), and they should be decoupled.

The third part is how to feed (or inject) the preserved state data into the destination cluster. This is not going to be that complex as long as the controller can figure out the destination cluster and only feed the state data when creating the application.
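A rough sketch of what the GracefulEvictionTask idea could look like on a ResourceBinding (the preservedState and clustersBeforeFailover fields are purely illustrative, not existing API fields):

# Purely illustrative: a gracefulEvictionTasks entry extended to carry
# preserved state and a snapshot of the scheduled cluster list.
spec:
  gracefulEvictionTasks:
    - fromCluster: member1
      reason: ApplicationFailure
      preservedState:                    # hypothetical field
        jobID: fd72014d4c864993a2e5a9287b4a9c5d
      clustersBeforeFailover:            # hypothetical snapshot field
        - member1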

@Dyex719 (Author) commented Jul 20, 2024

Hi @RainbowMango,

To address your comment:

The second part is how to store the preserved state data. One approach is to store it in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.

One thing we were thinking about is to only store the cluster which the workload failed over from. This would help keep the logic simple without involving multiple components. The current cluster the workload is scheduled on can always be inferred from the ResourceBinding.

With this we would achieve:

spec:
  clusters:
  - name: member3
    replicas: 2
...
failoverHistory:
    - failoverTime: "2024-07-18T19:03:06Z"
      originCluster: member1
      reason: "applicationFailover"
    - failoverTime: "2024-07-18T19:08:45Z"
      originCluster: member2
      reason: "clusterFailover"

The entire trace of the workload can still be inferred from the current state plus failoverHistory: the workload was originally on member1, then migrated to member2, and then moved to member3, which can be inferred from spec.clusters in the ResourceBinding where it is currently scheduled.

Work in progress implementation is available here: master...Dyex719:karmada:stateful-failover-flag.

Let me know what you think about this! I will get back to your idea at #5116 (comment) in a bit. Thanks!

@Dyex719 (Author) commented Jul 20, 2024

The third part is how to feed (or inject) the preserved state data into the destination cluster. This is not going to be that complex as long as the controller can figure out the destination cluster and only feed the state data when creating the application.

Our idea was to preserve, in the ResourceBinding, only the metadata of the state rather than the actual state, which may be large and come in different formats.

As you know, we use Kyverno for fetching the actual state using the persisted metadata in the resourcebinding. My understanding is that Kyverno is well supported by Karmada so we could use Kyverno to do this injection. It also allows a lot of customization that the user can perform to suit their needs.

Is your idea to have another component (maybe the scheduler) do this injection? My only concern is that customization may become difficult.

@RainbowMango (Member)

@mszacillo @Dyex719
I tried to design the API and asked @XiShanYongYe-Chang for a demo; the demo shows it is feasible. I hope we can demonstrate it at the next meeting, and I also hope you can help take a look at it.

API Design: master...RainbowMango:karmada:api_draft_application_failover
Demo: master...XiShanYongYe-Chang:karmada:api_draft_application_failover

@RainbowMango (Member)

@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.

@XiShanYongYe-Chang (Member)

@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.

Okay, let me show you the demo during the meeting tonight.

@XiShanYongYe-Chang (Member)

Let me describe the demo in words.

Create each resource according to the following configuration file:

# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
===================================
# pp.yaml 
# propagate the target Deployment to the member1 and member2 clusters, but to only one cluster at a time by setting spreadConstraints
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Graciously
      statePreservation:
        rules:
          - aliasLabelName: cztest-replicas
            jsonPath: ".updatedReplicas"
===================================
# op.yaml 
# Modify the Deployment image distributed to the member1 cluster via OverridePolicy to simulate an application failure on member1
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-op
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        clusterNames:
          - member1
      overriders:
        imageOverrider:
          - component: "Registry"
            operator: replace
            value: "fake"

Execute the following commands to deploy the above resources to the karmada control plane:

 kubectl --context karmada-apiserver apply -f op.yaml
 kubectl --context karmada-apiserver apply -f pp.yaml
 kubectl --context karmada-apiserver apply -f deploy.yaml

The deployment will be propagated to the member1 cluster first, but due to an image error, after waiting for 1 minute it will be migrated to the member2 cluster by application failover.

Then execute the following command to watch the deployment resource on the member2 cluster:

 kubectl --kubeconfig /root/.kube/members.config --context member2 get deployments.apps -oyaml -w

You will observe that the target label cztest-replicas: "2" appears in the labels of the nginx Deployment on the member2 cluster.

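For example, the synced Deployment on member2 would look roughly like this (a sketch, not verbatim output):

# Sketch of the synced Deployment observed on member2 after failover.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
    cztest-replicas: "2"   # injected from the preserved state during failover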

@mszacillo (Contributor) commented Oct 15, 2024 via email

@RainbowMango (Member)

I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference, we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.

Thanks for letting me know!
We discussed this proposal a lot. I want to help move the feature forward, so I tried to refine the API design based on your idea. The new branch was just used to explain my thoughts. I'm looking forward to hearing your thoughts, and if we reach a consensus, they should be put into this proposal.

Signed-off-by: Aditya Addepalli <dyex719@gmail.com>