Stateful Failover Proposal #5116

Open
wants to merge 2 commits into master
Conversation

@Dyex719 commented Jul 1, 2024

What type of PR is this?
Proposal for stateful failover

What this PR does / why we need it:
Explained in the doc

Which issue(s) this PR fixes:
Fixes #5006, #4969

Special notes for your reviewer:
N/A

Does this PR introduce a user-facing change?:
NONE

@karmada-bot karmada-bot added the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Jul 1, 2024
@karmada-bot (Collaborator)

Welcome @Dyex719! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 1, 2024
@codecov-commenter commented Jul 1, 2024


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 40.91%. Comparing base (2271a41) to head (fd35fb4).
Report is 680 commits behind head on master.


Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5116       +/-   ##
===========================================
+ Coverage   28.21%   40.91%   +12.69%     
===========================================
  Files         632      650       +18     
  Lines       43556    55182    +11626     
===========================================
+ Hits        12291    22575    +10284     
- Misses      30368    31170      +802     
- Partials      897     1437      +540     
Flag      | Coverage Δ
unittests | 40.91% <ø> (+12.69%) ⬆️

Flags with carried forward coverage won't be shown.

Signed-off-by: Aditya Addepalli <dyex719@gmail.com>
@karmada-bot karmada-bot removed the do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. label Jul 1, 2024
@RainbowMango (Member) left a comment


/assign
Thanks @Dyex719! Just quickly went through the proposal; I believe this is a meaningful enhancement and very inspiring.

@Dyex719 (Author) commented Jul 3, 2024

Thanks @RainbowMango! One assumption that we wanted to discuss (I wrote this in the proposal too):

If all the replicas of a stateful application are not migrated together, it is not clear when the state needs to be restored. In this proposal we focus on the use case where all the replicas of a stateful application are migrated together.

Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.

@RainbowMango (Member)

Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly.

I 100% agree.

  1. Partial replica migration is not technically failover, but more like elastic scaling of replicas.
  2. The conditions under which failover triggers are fully configurable (see the snippet below): if people tolerate the failure of part of the replicas, then migration should not be triggered; if they don't, the application should be rebuilt by leveraging failover.
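For reference, these trigger conditions are expressed through the existing application failover configuration; a minimal illustrative snippet (values are made up):

# Illustrative values only: failover trigger conditions in a
# PropagationPolicy's application failover behavior.
failover:
  application:
    decisionConditions:
      tolerationSeconds: 60   # how long to tolerate the failure before migrating
    purgeMode: Graciously     # how replicas on the failed cluster are cleaned up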


We propose to add two fields to a PropagationPolicy to enable stateful failover (sketched below).
1. persistedFields.maxHistory: This sets the maximum amount of stateful failover history that is persisted before older entries are overwritten. If this is set to 5, the ResourceBinding will store at most 5 failover entries in FailoverHistory before it overwrites the older history.
2. persistedFields.fields: This is a list of the fields that the stateful application requires to be persisted during failover so it can resume processing from a particular point. It takes a list of field names as well as how to access them from the spec.
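A sketch of how these two proposed fields might look in a PropagationPolicy (the placement under failover.application is an assumption; field names follow the proposal text):

# Hypothetical sketch of the proposed fields; placement is illustrative.
failover:
  application:
    persistedFields:
      maxHistory: 5               # keep at most 5 entries in FailoverHistory
      fields:
        - LabelName: jobID        # label key used when re-injecting the value
          PersistedStatusItem: obj.status.jobStatus.jobID   # where to read the value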
Member

Is there any restriction for the field type here? Can we or should we support types other than String?

Author

@Dyex719 Jul 9, 2024

persistedFields.fields contains the name of the field we want to persist and the path to obtain the value for that label. Since both the name and value are tied to a label, which supports only strings, I think having a string is sufficient.

persistedFields.fields:
      - LabelName: jobID
        PersistedStatusItem: obj.status.jobStatus.jobID

Member

It works, I believe. But what I'm thinking here is maybe we can make room for extensibility. The info we take could be more generic: a string, integer, and so on. I'm still thinking about it and will get back to you once I have an idea.

Member

Tried to draft the API, please take a look:

// ApplicationFailoverBehavior indicates application failover behaviors.
type ApplicationFailoverBehavior struct {
	// other fields.

	// StatePreservation defines the policy for preserving and restoring state data
	// during failover events for stateful applications.
	//
	// When an application fails over from one cluster to another, this policy enables
	// the extraction of critical data from the original resource configuration.
	// Upon successful migration, the extracted data is then re-injected into the new
	// resource, ensuring that the application can resume operation with its previous
	// state intact.
	// This is particularly useful for stateful applications where maintaining data
	// consistency across failover events is crucial.
	// If not specified, no state data will be preserved.
	// +optional
	StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}

// StatePreservation defines the policy for preserving state during failover events.
type StatePreservation struct {
	// Rules contains a list of StatePreservationRule configurations.
	// Each rule specifies a JSONPath expression targeting specific pieces of
	// state data to be preserved during failover events. An AliasName is associated
	// with each rule, serving as a label key when the preserved data is passed
	// to the new cluster.
	// +required
	Rules []StatePreservationRule `json:"rules"`

	// Note: We probably need more policies to control how to feed the new cluster with the
	// preserved state data in the future. Such as:
	// - Is it always acceptable to feed data as a label? Is there a need for annotations?
	// - Whether each label name should start with a prefix, like `karmada.io/failover-preserving-<fieldname>: <state>`
	// Sure, we can default to the label; this structure just makes room for future extensions.
	//
	// For instance, we can introduce a policy if someone wants to control how the preserved state
	// is fed to new clusters. This is probably not included this time.
	// RestorePolicy determines when and how the preserved state should be restored.
	// RestorePolicy RestorePolicy `json:"restorePolicy"`
}

// StatePreservationRule defines a single rule for state preservation.
// It includes a JSONPath expression and an alias name that will be used
// as a label key when passing state information to the new cluster.
type StatePreservationRule struct {
	// AliasName is the name that will be used as a label key when the preserved
	// data is passed to the new cluster. This facilitates the injection of the
	// preserved state back into the application resources during recovery.
	// +required
	AliasName string `json:"aliasName"`

	// JSONPath is the JSONPath query used to identify the state data
	// to be preserved from the original resource configuration.
	// Example: ".spec.template.spec.containers[?(@.name=='my-container')].image"
	// This example selects the image of a container named 'my-container'.
	// +required
	JSONPath string `json:"jsonPath"`
}
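To make the shape concrete, a policy using this draft might look like the following (a hypothetical sketch; the jobID rule mirrors the Flink example discussed in this thread, and the JSONPath value is illustrative):

# Hypothetical usage of the drafted StatePreservation API.
failover:
  application:
    decisionConditions:
      tolerationSeconds: 60
    statePreservation:
      rules:
        - aliasName: jobID               # used as a label key on the new cluster
          jsonPath: ".jobStatus.jobID"   # state data to extract and preserve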

Member

@Dyex719 @mszacillo What do you think of this API?

I haven't put the history part here, but it seems it should not be in StatePreservation or persistedFields as referenced by #5116.

Member

Good point, let me think about it.

Member

It makes 100% sense to move the StatePreservation to FailoverBehavior, so that both cluster failover and application failover can share the same configuration.

But given that the cluster failover configuration is still not provided, and it's not clear when it will come, I'm afraid it might slow down the process if we consider cluster failover in this proposal. So I tend to focus on application failover right now.

In addition, if cluster failover also needs this configuration later, we can deprecate the one in application failover in a compatible way; it's not a concern for now.

Member

By the way, now I'm going to look at the second part (as I mentioned in #5116 (comment)), which is how to grab and store the state data in the ResourceBinding.

Author

In our implementation we use the taint condition to specify that a cluster failover is taking place.

Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in the scope of this proposal, as it is important for Flink applications to recover from these failures as well.

Member

Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in the scope of this proposal, as it is important for Flink applications to recover from these failures as well.

I don't mind having cluster failover in this proposal. You might need to refresh the proposal.
I'm sure this is going to be a large proposal.


// PersistedItem is a pointer to the status item that should be persisted to the rescheduled
// replica during a failover. This should be input in the form: obj.status.<path-to-item>
PersistedStatusItem string `json:"persistedStatusItem,omitempty"`
Member

Given this persistedStatusItem will be applied as a label value, which brings a limitation that the length cannot exceed 63 characters: does Flink's checkpoint path exceed this?

Author

In our use case, we only want to persist the jobID, which is a 32-character hexadecimal string, so this limitation is not a problem for us. Both labels and annotations have a 63-character limit for the value in the key-value pair, so I don't think there is any way around this.

Author

@Dyex719 Jul 9, 2024

I should make a change here:

// persistedStatusItem is the value of the status item that should be persisted to the rescheduled replica during a failover. This should be the value at the location obj.status.<path-to-item>

In the case of Flink, the Flink process will create a jobID at obj.status.jobStatus.jobID, which would look like fd72014d4c864993a2e5a9287b4a9c5d, and the persistedStatusItem would store this value in the PersistedStatusItem field.

The FlinkDeployment status would have:

status:
  jobStatus:
    jobID: fd72014d4c864993a2e5a9287b4a9c5d

PersistedDuringFailover would store jobID as the labelName and fd72014d4c864993a2e5a9287b4a9c5d as the PersistedStatusItem (value).
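After failover, the re-injected label on the rescheduled resource would then look roughly like this (a sketch based on the example above):

# Sketch: label re-injected into the rescheduled application.
metadata:
  labels:
    jobID: fd72014d4c864993a2e5a9287b4a9c5d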

Author

Since we are storing the metadata related to the state here, I think the limitation of the value being a string of less than 63 characters is okay. The actual state of the Flink job will be stored in some durable storage and may be very big. For other stateful applications as well, I think we can make the assumption that metadata related to the state is persisted with the resourceBindingStatus and the actual data can be stored elsewhere.

I will include this in the proposal. Let me know what you think.

Member

Yeah, I think we can go with the label as the default way to provide the data.
By the way, I talked to Kyverno's maintainer and was told that Kyverno can perform filters against annotations. We can support that if there is a need.

Both labels and annotations have a 63-character limit for the value in the key-value pair, so I don't think there is any way around this.

By the way, the value length of an annotation can support up to 253 chars. :)

Author

Looks like, as you said, annotation values can support up to 253 characters, whereas label values can only be 63 characters long. Keys follow the same format for both labels and annotations: a name of up to 63 characters with an optional prefix of up to 253 characters.

@karmada-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RainbowMango (Member) commented Jul 20, 2024

Hi @Dyex719, @mszacillo, I've been thinking about this feature recently and came up with some ideas. This feature consists of 3 parts:

The first part is how to declare which state data (fields) should be preserved during failover; I posted my idea at #5116 (comment). Please take a look.

The second part is how to store the preserved state data. One approach is to store it in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.

I think it is worth considering storing it in the GracefulEvictionTask. During the failover process, just before creating the eviction task, it makes sense to grab some state (fields) or a snapshot of the scheduled cluster list. With this grabbed data, the controller would get to know which cluster is the destination by comparing the snapshot and the newly scheduled cluster.

[edit]
I don't mean I don't like the first approach, I'm just raising another idea. Maybe we can also add a snapshot of the scheduled cluster to the history item. The most challenging thing for this part is distinguishing which field should be managed by which component (scheduler or controller), and they should be decoupled.

The third part is how to feed (or inject) the preserved state data into the destination cluster. This is not going to be that complex as long as the controller can figure out the destination cluster and only feed the state data when creating the application.
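A rough sketch of what the GracefulEvictionTask idea could look like on a ResourceBinding (the preservedState and clustersBeforeFailover fields are purely illustrative, not existing API fields):

# Purely illustrative: a gracefulEvictionTasks entry extended to carry
# preserved state and a snapshot of the scheduled cluster list.
spec:
  gracefulEvictionTasks:
    - fromCluster: member1
      reason: ApplicationFailure
      preservedState:                    # hypothetical field
        jobID: fd72014d4c864993a2e5a9287b4a9c5d
      clustersBeforeFailover:            # hypothetical snapshot field
        - member1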

@Dyex719 (Author) commented Jul 20, 2024

Hi @RainbowMango,

To address your comment:

The second part is how to store the preserved state data. One approach is to store it in the history items (this was our first idea); the challenging thing is that it's hard to maintain the history item, especially figuring out what the destination cluster is.

One thing we were thinking about is to only store the cluster which the workload failed over from. This would help keep the logic simple without involving multiple components. The current cluster the workload is scheduled on can always be inferred from the ResourceBinding.

With this we would achieve:

spec:
  clusters:
  - name: member3
    replicas: 2
...
failoverHistory:
    - failoverTime: "2024-07-18T19:03:06Z"
      originCluster: member1
      reason: "applicationFailover"
    - failoverTime: "2024-07-18T19:08:45Z"
      originCluster: member2
      reason: "clusterFailover"

The entire trace of the workload can still be inferred from the current state plus failoverHistory: the workload was originally on member1, then migrated to member2, and then moved to member3, which can be inferred from spec.clusters in the ResourceBinding where it is currently scheduled.

Work in progress implementation is available here: master...Dyex719:karmada:stateful-failover-flag.

Let me know what you think about this! I will get back to your idea at #5116 (comment) in a bit. Thanks!

@Dyex719 (Author) commented Jul 20, 2024

The third part is how to feed (or inject) the preserved state data into the destination cluster. This is not going to be that complex as long as the controller can figure out the destination cluster and only feed the state data when creating the application.

Our idea was to preserve, in the ResourceBinding, only the metadata of the state rather than the actual state, which may be large and come in different formats.

As you know, we use Kyverno for fetching the actual state using the persisted metadata in the resourcebinding. My understanding is that Kyverno is well supported by Karmada so we could use Kyverno to do this injection. It also allows a lot of customization that the user can perform to suit their needs.

Is your idea to have another component (maybe the scheduler) do this injection? My only concern is that customization may become difficult.

@RainbowMango (Member)

@mszacillo @Dyex719
I tried to design the API and asked @XiShanYongYe-Chang for a demo; the demo shows it is feasible. I hope we can demonstrate it at the next meeting, and I also hope you can help take a look at it.

API Design: master...RainbowMango:karmada:api_draft_application_failover
Demo: master...XiShanYongYe-Chang:karmada:api_draft_application_failover

@RainbowMango (Member)

@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.

@XiShanYongYe-Chang (Member)

@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here.

Okay, let me show you the demo during the meeting tonight.

@XiShanYongYe-Chang (Member)

Let me describe the demo in words.

Create each resource according to the following configuration file:

# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
===================================
# pp.yaml 
# propagate the target Deployment to the member1 and member2 clusters, but to only one cluster at a time by setting spreadConstraints
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Graciously
      statePreservation:
        rules:
          - aliasLabelName: cztest-replicas
            jsonPath: ".updatedReplicas"
===================================
# op.yaml 
# Modify the Deployment image distributed to the member1 cluster via OverridePolicy to simulate an application failure on member1
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: nginx-op
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  overrideRules:
    - targetCluster:
        clusterNames:
          - member1
      overriders:
        imageOverrider:
          - component: "Registry"
            operator: replace
            value: "fake"

Execute the following commands to deploy the above resources to the karmada control plane:

 kubectl --context karmada-apiserver apply -f op.yaml
 kubectl --context karmada-apiserver apply -f pp.yaml
 kubectl --context karmada-apiserver apply -f deploy.yaml

The deployment will be propagated to the member1 cluster first, but due to an image error, after waiting for 1 minute it will be migrated to the member2 cluster by application failover.

Then execute the following command to watch the deployment resource on the member2 cluster:

 kubectl --kubeconfig /root/.kube/members.config --context member2 get deployments.apps -oyaml -w

You will observe that the target label cztest-replicas: "2" appears in the labels of the nginx Deployment on the member2 cluster.

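For example, the synced Deployment on member2 would look roughly like this (a sketch, not verbatim output):

# Sketch of the synced Deployment observed on member2 after failover.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
    cztest-replicas: "2"   # injected from the preserved state during failover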

@mszacillo (Contributor) commented Oct 15, 2024 via email

@RainbowMango (Member)

I'm a little surprised that there has been a different branch created for the change - as this is something we had proposed and put up for review (for reference, we currently use the failover feature in our own DEV setup), but thank you for putting so much thought into this feature.

Thanks for letting me know!
We discussed this proposal a lot. I want to help move the feature forward, so I tried to refine the API design based on your idea. The new branch was just used to explain my thoughts. I'm looking forward to hearing your thoughts, and if we reach a consensus, they should be put into this proposal.

Signed-off-by: Aditya Addepalli <dyex719@gmail.com>