@fedepaol
Member

Make the interaction with FRR more robust by retrying after failures, both when calling the reloader (if the pid is not there yet) and when the interaction between the reloader and the container fails.

Fixes #1462

fedepaol added 2 commits June 29, 2022 16:47
If the reload fails, most likely because the reloader container is not
ready yet, we wait and try to reload the configuration again.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
If we detect a failure in the configuration processing, we ask the
debouncer to retry using the latest configuration.

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Comment on lines +355 to +357
if newCfg.useOld && config == nil {
	continue // just ignore the event
}
Member

Not sure I understand this :/ Why send something to the chan in the first place only to do nothing with it?

Member Author

This should never happen, because it means that we received a "reload old config" event but the old config is nil (meaning, we already tried a reload that failed).
Nonetheless, I'd prefer that the speaker ignore the event in such a corner case rather than crash.

Member

sorry, I misread it as newCfg.config == nil 😅

}
updated = <-result
if updated.Hostname != "1" {
	t.Fatal("Config was not updated")
Member

shouldn't this be "Config was updated"?

Member Author

Nope, because what writes to result is the action that updates the config. So in this case, if we don't get the value, it means that the retry mechanism did not trigger and that the update function was not called.

err := body(config)
if err != nil {
	timeOut = time.After(failureRetryInterval)
	timerSet = true
Member

Does this mean that after a failure the next config update won't happen until failureRetryInterval elapses? Shouldn't we let new updates override this timeout?

Member Author

That is correct. I thought it makes sense to go slow after a failure (think of a slow container startup: after a failure we want to give it time to recover). 5 seconds is still a bearable delay for applying configuration changes.

@fedepaol
Member Author

Also tested manually by adding a sleep first in the reloader script and then in the frr entry point; the speaker recovered.

Member

@oribon oribon left a comment

lgtm

@fedepaol fedepaol merged commit 655f7a5 into metallb:main Jun 30, 2022
fedepaol added a commit to fedepaol/metallb that referenced this pull request Jul 4, 2022
Adding a note about metallb#1463

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
fedepaol added a commit that referenced this pull request Jul 4, 2022
Adding a note about #1463

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
fedepaol added a commit to fedepaol/metallb that referenced this pull request Jul 8, 2022
Adding a note about metallb#1463

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
novad03 pushed a commit to novad03/k8s-meta that referenced this pull request Nov 25, 2023
Adding a note about metallb/metallb#1463

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Development

Successfully merging this pull request may close these issues.

FRR interaction race condition

2 participants