Possible race condition while updating khstate #858
@adriananeci sorry for the late response! The race condition that you see in your logs is expected, hence the retry. If your check is stuck in that loop, that's an issue. But it seems like the uuid for that check (…). I'm pretty sure the error message (…). You mentioned that you can't repro this, which makes it a little bit more difficult for us, but I do think it's a big concern if the check isn't getting rescheduled / is stuck... The next time this happens, it'd be great to see more kuberhealthy logs to learn more about where it's failing to reschedule. Usually a grep of the khcheck name, i.e. (…), is a good place to start.
Hey @joshulyne,
We hit the same issue again today. Another workaround was to do a rolling restart of the kuberhealthy deployment.
Looking at the kuberhealthy logs, if my assumption is correct, at one moment in time two kuberhealthy instances simultaneously tried to adjust the khstate CRD:
After this race condition, the only logs related to this check were:
@adriananeci thanks for the quick response! So among the kuberhealthy instances, only one of them can modify a khstate resource at a time; this is generally the master kuberhealthy. A constant master calculation is done in the background to ensure that it's the only instance that can modify resources. If two kuberhealthy instances were modifying a khstate resource at the same time, you would usually see in the logs where one of them stopped being master or (…). Looking at the logs, it seems like the khcheck (…). I see that it tries to (…) here at the end of the first run (check uuid: …):
And if it had been successful, you should see a log line like (…). However, at the same time, a new run for the same khcheck is starting (check uuid: …):
Setting the new check UUID is the first thing that happens during a new check run, and this requires an update of the khstate. I THINK that's where the race condition is happening: when the previous khcheck run and the new run are trying to update the same khstate resource. What are you setting the runInterval and timeout to?
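To make the suspected race easier to picture, here is a minimal toy model in plain Go (not the Kubernetes API server or Kuberhealthy's actual code) of two writers racing on one khstate under optimistic concurrency: both read the same resourceVersion, the new run's uuid update lands first, and the previous run's status report is then rejected and would need a re-fetch before retrying.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// khstate models only what the example needs: the server-side
// resourceVersion plus the current check uuid and status.
type khstate struct {
	resourceVersion int
	uuid            string
	ok              bool
}

// store is a toy stand-in for the API server's optimistic concurrency:
// an update is rejected when the caller's copy is stale.
type store struct {
	mu  sync.Mutex
	obj khstate
}

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

func (s *store) get() khstate {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

func (s *store) update(in khstate) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if in.resourceVersion != s.obj.resourceVersion {
		return errConflict // stale copy: caller must re-fetch first
	}
	in.resourceVersion++
	s.obj = in
	return nil
}

func main() {
	s := &store{obj: khstate{uuid: "previous-run"}}

	// Both actors read the khstate at the same resourceVersion...
	oldRun := s.get()
	newRun := s.get()

	// ...the new run wins the race and sets its fresh uuid...
	newRun.uuid = "new-run"
	fmt.Println("new run update:", s.update(newRun)) // <nil>

	// ...so the previous run's final status report now conflicts.
	oldRun.ok = true
	fmt.Println("old run update:", s.update(oldRun)) // conflict error
}
```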
So I tried to replicate the scenario -- creating a bad khcheck that fails to report back to kuberhealthy within its timeout, and setting the runInterval low so that the next check runs immediately after. This is what I see in the logs:
The first khcheck run (uuid: …) …
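For anyone who wants to try this repro themselves, here is a minimal sketch of such a deliberately bad external check. It follows Kuberhealthy's documented reporting protocol (the check pod POSTs a JSON status to the URL injected via the KH_REPORTING_URL env var), but treat the exact payload field names as assumptions taken from the docs rather than verified against this codebase:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// Sleep past the khcheck's configured timeout so the report below
	// arrives too late and kuberhealthy fails the run with a timeout.
	time.Sleep(20 * time.Minute)

	// Status payload as described in the external check docs; the field
	// names here are assumptions if you adapt this.
	body, err := json.Marshal(map[string]interface{}{
		"OK":     true,
		"Errors": []string{},
	})
	if err != nil {
		log.Fatalln("failed to marshal report:", err)
	}

	// KH_REPORTING_URL is injected into check pods by kuberhealthy.
	resp, err := http.Post(os.Getenv("KH_REPORTING_URL"), "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalln("failed to report:", err)
	}
	defer resp.Body.Close()
	log.Println("reported, status:", resp.Status)
}
```

Paired with a short timeout and a low runInterval on the khcheck spec, the report should consistently arrive too late, which is roughly the scenario above.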
Hey @joshulyne, thanks for your reply. The check run interval and timeout are configured as:
Looking deeper into the logs, I can see that for the same check (pod …):

Based on the below logs, the master pod (…) …

One second later, the same pod successfully connected to the slave kuberhealthy (…).

Later on, the master succeeded in updating the khState (…).

Another interesting thing is that both kuberhealthy replicas updated the khState for the same uuid between (…) and (…).
During this timeframe, the only master was (…).

Also, no (…).
Hey @adriananeci, thanks for sharing more details about your check and the logs with us. It is interesting to see that the (…). To give you an update, we did find a bug in the way we were updating khstate resources, specifically in the error log line you were seeing here:
The khcheck easily could have gotten stuck here, because we weren't re-fetching the khstate resource when we retried (every 5s) to update it -- we were trying to modify a copy of the khstate resource that didn't have the latest modifications. @integrii plans to create a PR with a fix for this bug. We'll let you know when that gets merged in -- thanks for reporting and digging into this with us!
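For context, the usual client-go pattern for this kind of update keeps the GET inside the retry closure, so every attempt carries the latest resourceVersion. A sketch follows, using a ConfigMap on a fake clientset as a stand-in for the khstate CRD; this is an illustration of the pattern, not the actual code from the fix:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/util/retry"
)

func main() {
	// A ConfigMap stands in for the khstate custom resource here.
	client := fake.NewSimpleClientset(&corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "khstate-stand-in", Namespace: "kuberhealthy"},
	})

	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch on every attempt: the step the buggy loop skipped.
		cm, err := client.CoreV1().ConfigMaps("kuberhealthy").Get(context.TODO(), "khstate-stand-in", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["uuid"] = "new-run-uuid" // the modification we want to apply
		_, err = client.CoreV1().ConfigMaps("kuberhealthy").Update(context.TODO(), cm, metav1.UpdateOptions{})
		return err // a conflict here triggers a fresh GET + retry
	})
	fmt.Println("update result:", err) // <nil> on success
}
```

The key point is that retry.RetryOnConflict re-runs the whole closure on a conflict, so the stale-copy loop described above can't happen.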
This also may be related to #880. Kuberhealthy was designed originally to be "okay" with two masters running at the same time, in order to avoid having some external source of truth or complex election communication between pods. We could also slow down the master election process to ensure that there is never an overlap of masters. For now, though, I want to keep trying to go with the strategy of 'it's okay to run multiple masters for a period of time' to avoid all that extra complication in the codebase. That said, I think this issue involves a goroutine getting infinitely stuck trying to update the object in etcd. Because of that stuck process, the prior master never fully demotes itself and essentially goes brain dead, retrying an object update that isn't allowed. The fix in PR #872 fixes that infinite loop. If we can also track down #880, then we should be back in a place where multiple masters running at the same time isn't an issue. @joshulyne should be giving you an image that you can run to test the fix in #872 here shortly. It would be really helpful if you could switch your Kuberhealthy deployment image out to that version and see if it fixes your issue.
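To make the "multiple masters for a period of time" trade-off concrete, here is a hypothetical sketch (not Kuberhealthy's actual election code) of a lock-free master calculation: each replica deterministically recomputes the winner from the current pod list on its own schedule, which avoids any external lock at the cost of a brief window where two replicas can both believe they are master while the pod list changes.

```go
package main

import (
	"fmt"
	"sort"
)

// isMaster is a simplified illustration of a lock-free "master
// calculation": every replica runs the same deterministic function
// over the current pod list, and only the winner acts as master.
// Because each replica re-evaluates this independently, two pods can
// briefly disagree during a pod list change -- the overlap window
// discussed in this thread.
func isMaster(self string, pods []string) bool {
	if len(pods) == 0 {
		return false
	}
	sorted := append([]string(nil), pods...)
	sort.Strings(sorted)
	return sorted[0] == self // first pod in sorted order wins
}

func main() {
	pods := []string{"kuberhealthy-7d9f-aaaaa", "kuberhealthy-7d9f-zzzzz"}
	fmt.Println(isMaster("kuberhealthy-7d9f-aaaaa", pods)) // true: acts as master
	fmt.Println(isMaster("kuberhealthy-7d9f-zzzzz", pods)) // false: stays a slave
}
```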
@adriananeci latest image with the fix: (…)
Deployed it on a test cluster and it seems to have fixed the race condition. Is there any ETA for v2.4.1?
Working on the v2.4.1 release for today!
@adriananeci v2.4.1 is out! thanks for testing this out and working with us on it!
Describe the bug
Possible race condition while updating khstate.
Most probably a check failed a few days ago due to `Reporting took too long to complete and timed out.` and was never scheduled again since then. From the kuberhealthy logs:
Steps To Reproduce
Expected behavior
If a check fails, I would like kuberhealthy to be able to schedule a new check run and update the khstate accordingly.
Versions
Additional context
khstate status: (…)

The workaround was to remove the khstate and let kuberhealthy recreate it.