[Bug]: TestHAKeeperCanBootstrapAndRepairShards failed #8438

Closed
w-zr opened this issue Mar 14, 2023 · 10 comments
Labels: bug/ut, kind/bug (Something isn't working), resolved/v1.1.1, severity/s0 (Extreme impact: cause the application to break down and seriously affect the use)
Milestone: 1.2.0

Comments

w-zr (Contributor) commented Mar 14, 2023

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

[screenshot of the failing unit test output; attachment e4805a81b78c89ccdd5b423ac86dc2f]

Expected Behavior

UT should pass.

Steps to Reproduce

No response

Additional information

No response

@w-zr w-zr added kind/bug Something isn't working needs-triage labels Mar 14, 2023
volgariver6 (Contributor) commented:

The log from when this issue was first created is incomplete. The failure happened again while running the UTs, so I am uploading the log:
fail.log

gouhongshen (Contributor) commented Jun 16, 2023

The error seems to be caused by the 2s timeout when invoking state, err := store1.getCheckerState().

The log details also show communication problems between the raft nodes (and the gossip nodes).
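
A minimal sketch of the timeout-bounded polling involved, assuming a 2s deadline; pollUntil and the inline callback are hypothetical stand-ins, not the actual matrixone test code around getCheckerState():

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil retries fn every 50ms until it succeeds or the deadline expires.
func pollUntil(d time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	for {
		if err := fn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for checker state")
		case <-time.After(50 * time.Millisecond):
		}
	}
}

func main() {
	// Stand-in for state, err := store1.getCheckerState(): slow raft/gossip
	// exchanges keep this failing until the 2s deadline hits.
	err := pollUntil(2*time.Second, func(ctx context.Context) error {
		return errors.New("HAKeeper state not ready")
	})
	fmt.Println(err) // prints the timeout error, mirroring the UT failure
}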

After 1000+ runs of go test -v -run TestHAKeeperCanBootstrapAndRepairShards and a dozen runs of make ut without the error showing up, I tried traffic control by executing these shell commands:

# presumably applied in separate runs: a second 'add' on an existing root
# qdisc fails; use 'tc qdisc change' to modify one already installed
tc qdisc add dev lo root netem delay 50ms 10ms
tc qdisc add dev lo root netem loss 10%

Alternatively, I added some random sleep in dragonboat/internal/raft/raft.go::handleHeartbeatMessage().
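
A minimal sketch of that fault injection, assuming a wrapper around the real handler; handleHeartbeatWithJitter is a hypothetical name, not dragonboat's actual code:

package faultinject

import (
	"math/rand"
	"time"
)

// handleHeartbeatWithJitter sleeps a random 0-50ms before delegating to the
// real handler, emulating a slow or jittery loopback link.
func handleHeartbeatWithJitter(handle func()) {
	time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
	handle()
}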

And I got a similar timeout error, though not exactly the same.

The same:

1. the leader lost quorum
2. HAKeeper could not finish bootstrapping before the timeout
3. the same number of terms elapsed

The difference:

1. no gossip nodes were marked as failed

So I am not sure it is necessarily a network issue.

gouhongshen (Contributor) commented Jul 7, 2023

The "memberlist:" errors may be related to these issues:

  1. How to detect and react to TCP only failures hashicorp/memberlist#264
  2. Document TCP-only operation hashicorp/memberlist#226

Both mention k8s environments.
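
For reference, a sketch of the kind of tuning those issues discuss, using hashicorp/memberlist's public API; the probe values here are illustrative assumptions, not settings taken from this issue:

package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	cfg := memberlist.DefaultLANConfig()
	// Lengthen failure-detection probes so transient UDP loss (as in the tc
	// experiments above) is less likely to mark a live node as failed.
	cfg.ProbeTimeout = 3 * time.Second  // LAN default is 500ms
	cfg.ProbeInterval = 5 * time.Second // LAN default is 1s
	list, err := memberlist.Create(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("local node:", list.LocalNode().Name)
}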

gouhongshen (Contributor) commented:
It's been a long time since this last happened; it may be closed.

@gouhongshen gouhongshen assigned w-zr and unassigned gouhongshen Aug 24, 2023
@w-zr w-zr closed this as completed Aug 24, 2023
@YANGGMM YANGGMM reopened this Nov 29, 2023
YANGGMM (Contributor) commented Nov 29, 2023

@YANGGMM YANGGMM assigned gouhongshen and unassigned w-zr Nov 29, 2023

@sukki37 sukki37 added bug/ut severity/s0 Extreme impact: Cause the application to break down and seriously affect the use and removed needs-triage labels Jan 2, 2024
@sukki37 sukki37 added this to the 1.2.0 milestone Jan 2, 2024
gouhongshen (Contributor) commented:
not working on it


gouhongshen (Contributor) commented:
Increasing the hakeeperDefaultTimeout config may work.
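
A sketch of what that suggestion might look like, assuming hakeeperDefaultTimeout is a package-level constant as the name suggests; the 10s value and package name are illustrative, not matrixone's actual code:

package logservice

import "time"

// hakeeperDefaultTimeout bounds calls such as store.getCheckerState().
// Raising it from 2s gives slow raft/gossip exchanges in CI room to finish.
const hakeeperDefaultTimeout = 10 * time.Second // was 2 * time.Second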

heni02 (Contributor) commented Jan 23, 2024

@heni02 heni02 closed this as completed Jan 23, 2024
@matrix-meow matrix-meow reopened this Jan 23, 2024
@heni02 heni02 assigned w-zr and unassigned heni02 Jan 23, 2024
@w-zr w-zr closed this as completed Jan 23, 2024