enhance: add graceful stop timeout to avoid node stop hang under extreme cases #30320

Merged: 2 commits into milvus-io:2.3 from fixup_kill_node_with_timeout_2_3, Jan 27, 2024

@chyezh chyezh commented Jan 26, 2024

  1. Add a 5s graceful stop timeout for the coordinators and the proxy.
  2. Add a 900s graceful stop timeout for the other worker nodes; we should potentially change this to 600s once graceful stop is smooth.
  3. Change the order in which the datacoord components are stopped.
  4. LivenessCheck no longer performs a graceful shutdown.

issue: #30310
pr: #30317
also see: #30306

…eme cases

Signed-off-by: chyezh <chyezh@outlook.com>
@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Jan 26, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Jan 26, 2024

mergify bot commented Jan 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify bot commented Jan 26, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

@chyezh chyezh force-pushed the fixup_kill_node_with_timeout_2_3 branch from 5cdc0b4 to ad01a62 Compare January 26, 2024 09:44

mergify bot commented Jan 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

chyezh commented Jan 26, 2024

/run-cpu-e2e

mergify bot commented Jan 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

chyezh commented Jan 26, 2024

/run-cpu-e2e

xiaofan-luan commented

/approve
/lgtm

mergify bot commented Jan 26, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

@chyezh chyezh force-pushed the fixup_kill_node_with_timeout_2_3 branch from ad01a62 to fdb90ff Compare January 26, 2024 11:07
@sre-ci-robot sre-ci-robot removed the lgtm label Jan 26, 2024
sre-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chyezh, xiaofan-luan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jan 26, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (26df754) 82.06% compared to head (b6cf703) 82.10%.
Report is 1 commits behind head on 2.3.

Additional details and impacted files

@@            Coverage Diff             @@
##              2.3   #30320      +/-   ##
==========================================
+ Coverage   82.06%   82.10%   +0.03%     
==========================================
  Files         840      840              
  Lines      121139   121170      +31     
==========================================
+ Hits        99413    99483      +70     
+ Misses      18487    18446      -41     
- Partials     3239     3241       +2     

| Files | Coverage Δ |
|---|---|
| internal/distributed/datacoord/service.go | 87.75% <100.00%> (+0.15%) ⬆️ |
| internal/distributed/datanode/service.go | 82.19% <100.00%> (+0.33%) ⬆️ |
| internal/distributed/indexnode/service.go | 71.51% <100.00%> (+0.70%) ⬆️ |
| internal/distributed/proxy/service.go | 80.44% <100.00%> (+0.07%) ⬆️ |
| internal/distributed/querycoord/service.go | 73.87% <100.00%> (+0.23%) ⬆️ |
| internal/distributed/querynode/service.go | 80.58% <100.00%> (+0.38%) ⬆️ |
| internal/distributed/rootcoord/service.go | 79.43% <100.00%> (+0.29%) ⬆️ |
| pkg/util/paramtable/component_param.go | 98.37% <100.00%> (+0.03%) ⬆️ |
| internal/datacoord/server.go | 71.02% <91.66%> (+1.01%) ⬆️ |
| internal/datanode/data_node.go | 79.91% <0.00%> (+2.18%) ⬆️ |

... and 5 more

... and 14 files with indirect coverage changes

mergify bot commented Jan 26, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

@chyezh chyezh force-pushed the fixup_kill_node_with_timeout_2_3 branch from fdb90ff to 13e9239 Compare January 26, 2024 12:23
jaime0815 commented

/lgtm

mergify bot commented Jan 26, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

@chyezh chyezh force-pushed the fixup_kill_node_with_timeout_2_3 branch from 13e9239 to 0060e4a Compare January 26, 2024 14:17
@sre-ci-robot sre-ci-robot removed the lgtm label Jan 26, 2024

mergify bot commented Jan 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

chyezh commented Jan 26, 2024

/run-cpu-e2e

chyezh commented Jan 26, 2024

/run-cpu-e2e

chyezh commented Jan 26, 2024

/run-cpu-e2e

chyezh commented Jan 26, 2024

/run-cpu-e2e

Signed-off-by: chyezh <chyezh@outlook.com>
@chyezh chyezh force-pushed the fixup_kill_node_with_timeout_2_3 branch from 0060e4a to b6cf703 Compare January 26, 2024 15:45

chyezh commented Jan 26, 2024

rerun ut

jaime0815 commented

/lgtm

mergify bot commented Jan 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

chyezh commented Jan 27, 2024

/run-cpu-e2e

yanliang567 commented

e2e passed, but the pipeline hung at log uploading. Manually passed.

@yanliang567 yanliang567 added ci-passed manual-pass manually set pass before ci-passed labeled labels Jan 27, 2024
@sre-ci-robot sre-ci-robot merged commit 77e1237 into milvus-io:2.3 Jan 27, 2024
13 of 14 checks passed
@chyezh chyezh deleted the fixup_kill_node_with_timeout_2_3 branch March 4, 2024 02:09
Labels
approved, ci-passed, dco-passed (DCO check passed.), kind/enhancement (Issues or changes related to enhancement), lgtm, manual-pass (manually set pass before ci-passed labeled), size/L (Denotes a PR that changes 100-499 lines.)

5 participants