Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: add workflow reconciling backoff time and failed limit times #2881

Merged
merged 16 commits into from
Dec 15, 2021

Conversation

FogDong
Copy link
Member

@FogDong FogDong commented Dec 3, 2021

Description of your changes

This PR:

  1. Add workflow reconciling backoff time.
  2. Add workflow step failed limit times, defaults to 10. If the step is failed 10 times and there are no other running steps, the workflow will suspend automatically.

The workflow is now running like this:

$ kubectl get app -w
NAME    COMPONENT      TYPE   PHASE   HEALTHY   STATUS   AGE
mysql   mysql-secret   raw    rendering                               0s
mysql   mysql-secret   raw    runningWorkflow                         0s
mysql   mysql-secret   raw    runningWorkflow      true               1s
mysql   mysql-secret   raw    runningWorkflow      true               2s
mysql   mysql-secret   raw    runningWorkflow      true               3s
mysql   mysql-secret   raw    runningWorkflow      true               4s
mysql   mysql-secret   raw    runningWorkflow      true               5s
mysql   mysql-secret   raw    runningWorkflow      true               6s
mysql   mysql-secret   raw    runningWorkflow      true               9s
mysql   mysql-secret   raw    runningWorkflow      true               15s
mysql   mysql-secret   raw    runningWorkflow      true               27s
mysql   mysql-secret   raw    workflowSuspending   true               52s

I have:

  • Read and followed KubeVela's contribution process.
  • Related Docs updated properly. In a new feature or configuration option, an update to the documentation is necessary.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Special notes for your reviewer

/cc @leejanee

@codecov
Copy link

codecov bot commented Dec 6, 2021

Codecov Report

Merging #2881 (fcfc1de) into master (f47ae0e) will increase coverage by 12.78%.
The diff coverage is 86.73%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #2881       +/-   ##
===========================================
+ Coverage   47.21%   60.00%   +12.78%     
===========================================
  Files         173      239       +66     
  Lines       17669    24430     +6761     
===========================================
+ Hits         8343    14659     +6316     
+ Misses       8353     8015      -338     
- Partials      973     1756      +783     
Flag Coverage Δ
apiserver-unittests 27.09% <10.63%> (?)
core-unittests 47.33% <69.89%> (+0.11%) ⬆️
e2e-multicluster-test 25.13% <71.50%> (?)
e2e-rollout-tests 28.90% <64.76%> (?)
e2etests 37.69% <78.75%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
apis/core.oam.dev/common/types.go 79.31% <ø> (+17.24%) ⬆️
pkg/velaql/context.go 50.00% <33.33%> (-2.18%) ⬇️
pkg/workflow/workflow.go 84.30% <82.07%> (+11.68%) ⬆️
pkg/workflow/tasks/custom/task.go 76.34% <82.35%> (+3.20%) ⬆️
pkg/workflow/context/context.go 71.08% <92.59%> (+3.96%) ⬆️
...dev/v1alpha2/application/application_controller.go 79.35% <100.00%> (+28.98%) ⬆️
pkg/resourcekeeper/options.go 100.00% <0.00%> (ø)
apis/core.oam.dev/v1beta1/resourcetracker_types.go 85.55% <0.00%> (ø)
pkg/resourcekeeper/statekeep.go
pkg/apiserver/rest/usecase/workflow.go 65.72% <0.00%> (ø)
... and 160 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f47ae0e...fcfc1de. Read the comment docs.

@FogDong FogDong changed the title WIP: add workflow fail retry Feat: add workflow reconciling backoff time and failed limit times Dec 7, 2021
@FogDong FogDong marked this pull request as ready for review December 7, 2021 11:59
@wonderflow
Copy link
Collaborator

默认应该是 30或者50次失败,然后失败的重试时间1·~10 次应该在 10s内,之后可以拉长,最长可以控制到1分钟。

失败重试最后变成暂停,破坏了暂停这个状态语义。 我的建议是长时间重试不成功后应该设为终止状态。

@FogDong
Copy link
Member Author

FogDong commented Dec 8, 2021

默认应该是 30或者50次失败,然后失败的重试时间1·~10 次应该在 10s内,之后可以拉长,最长可以控制到1分钟。

你说的这种可能是 controller runtime 的 backoff 机制,系数 0.005*2^(n-1),若结果小于 1 则以 1 为最小单位

Times 2^(n-1) 0.005*2^(n-1) Requeue(s)
1 1 0.005 1
2 2 0.01 1
3 4 0.02 1
4 8 0.04 1
5 16 0.08 1
6 32 0.16 1
7 64 0.32 1
8 128 0.64 1
9 256 1.28 1.28
10 512 2.56 2.56
11 1024 5.12 5.12
12 2048 10.24 10.24

但在我们的场景里,workflow step 失败了 10 次或是 50 次的结果应该是差不多的,如果觉得 5 次重试太少,可以更改为 10 次。
同时,可以把我们的系数改成 0.05。公式为 0.05*2^(n-1),若结果小于 1 则以 1 为最小单位,若大于 60 则以 60 为最大单位

Times 2^(n-1) 0.05*2^(n-1) Requeue(s)
1 1 0.05 1
2 2 0.1 1
3 4 0.2 1
4 8 0.4 1
5 16 0.8 1
6 32 1.6 1.6
7 64 3.2 3.2
8 128 6.4 6.4
9 256 12.8 12.8
10 512 25.6 25.6
11 1024 51.2 51.2
12 2048 102.4 60

WDYT /cc @wonderflow @leejanee

@leejanee
Copy link
Member

leejanee commented Dec 8, 2021

但在我们的场景里,workflow step 失败了 10 次或是 50 次的结果应该是差不多的,如果觉得 5 次重试太少,可以更改为 10 次。 同时,可以把我们的系数改成 0.05。公式为 0.05*2^(n-1),若结果小于 1 则以 1 为最小单位,若大于 60 则以 60 为最大单位

这样可以的,不过系数可以放大一点,错误发生时没必要追求快速重试的

pkg/workflow/context/context.go Outdated Show resolved Hide resolved
pkg/workflow/workflow.go Outdated Show resolved Hide resolved
pkg/workflow/workflow.go Outdated Show resolved Hide resolved
@wonderflow
Copy link
Collaborator

This table works fine for me, please make the retry time at least to be 20 times, the last 8 times retry can be 1 min .

workflow step 失败了 10 次或是 50 次的结果应该是差不多的,如果觉得 5 次重试太少,可以更改为 10 次。 同时,可以把我们的系数改成 0.05。公式为 0.05*2^(n-1),若结果小于 1 则以 1 为最小单位,若大于 60 则以 60 为最大单位

Times 2^(n-1) 0.05*2^(n-1) Requeue(s)
1 1 0.05 1
2 2 0.1 1
3 4 0.2 1
4 8 0.4 1
5 16 0.8 1
6 32 1.6 1.6
7 64 3.2 3.2
8 128 6.4 6.4
9 256 12.8 12.8
10 512 25.6 25.6
11 1024 51.2 51.2
12 2048 102.4 60
WDYT /cc @wonderflow @leejanee

pkg/workflow/workflow.go Outdated Show resolved Hide resolved
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Copy link
Member

@leejanee leejanee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

另外,reconcile backoff不单单服务于流水线,甚至涉及到gc机制。这里一定要协调好,如果发生冲突可能需要以最短的时间为准

pkg/workflow/context/context.go Outdated Show resolved Hide resolved
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
Copy link
Collaborator

@wonderflow wonderflow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after nits

Copy link
Member

@leejanee leejanee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: FogDong <dongtianxin.tx@alibaba-inc.com>
@Somefive Somefive merged commit 655c261 into kubevela:master Dec 15, 2021
@FogDong FogDong deleted the feat-wffailover branch June 30, 2022 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants