fix(feg): Diameter closed connection recovery #12192
fe426b7
to
c6c1491
Compare
feg/gateway/diameter/connection.go
}
if c != nil && conn != nil && c.destroyConnection(conn) {
	// if connection was closed not by connection management functions, recover it
	c.getDiamConnection()
Is it possible that this fails again and the connection is then just lost?
That is, if there is a network outage and the server is not reachable, getDiamConnection will fail and the connection will never be re-created. We may end up in the same place.
Maybe we need some kind of infinite loop that keeps retrying every n seconds?
```go
if c != nil && conn != nil && c.destroyConnection(conn) {
	for {
		// if connection was closed not by connection management functions, recover it
		_, _, err := c.getDiamConnection()
		if err == nil {
			// reconnected successfully, stop retrying
			return
		}
		glog.Error("Failed to recover closed connection. Retrying")
		time.Sleep(1 * time.Second)
	}
}
```
Or maybe this loop could be at `getDiamConnection`
I think that if there is a prolonged network outage, we'll just fall back to active recovery: when the client tries to send out a request, getDiamConnection is called again and will attempt to reconnect. Putting an infinite loop here could be a bit heavy; maybe a limited number of retries would do.
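For illustration, a limited-retry variant could look roughly like the sketch below. It is not the code from this PR: `retryWithBackoff`, `errUnreachable`, and the `connect` callback are hypothetical names standing in for a call to `getDiamConnection`, and the attempt count and wait times are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable is a hypothetical error used only in this sketch.
var errUnreachable = errors.New("server unreachable")

// retryWithBackoff calls connect up to maxAttempts times and doubles the wait
// after every failed attempt. It returns nil on the first success, or an error
// wrapping the last failure once the attempts are exhausted.
func retryWithBackoff(connect func() error, maxAttempts int, initialWait time.Duration) error {
	if maxAttempts <= 0 {
		return errors.New("maxAttempts must be positive")
	}
	var err error
	wait := initialWait
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2 // exponential backoff between attempts
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Hypothetical stand-in for c.getDiamConnection(): fails twice, then succeeds.
	connect := func() error {
		calls++
		if calls < 3 {
			return errUnreachable
		}
		return nil
	}
	err := retryWithBackoff(connect, 5, 10*time.Millisecond)
	fmt.Println("attempts:", calls, "error:", err)
}
```

Once the attempts are exhausted, the caller would simply give up and rely on the active recovery path described above, where the next outgoing request triggers a reconnect.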
Note that newConnection suffers from this issue too: it never tries to reconnect if it fails once.
Maybe we could add a loop in that goroutine and just call newConnection on this line instead of c.getDiamConnection():
http://github.com/magma/magma/blob/master/feg/gateway/diameter/connection.go#L48-L48
c6c1491
to
d41cf6a
Compare
		break
	}
	time.Sleep(retryWaitTime)
	retryWaitTime *= 2
now i am just concerned about HA functionality :D
HA disable will set the connection to disabled...
d41cf6a
to
9ee32af
Compare
9ee32af
to
c52413b
Compare
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
c52413b
to
fd269b7
Compare
Summary
Add connection recovery logic for externally closed Diameter connections.
The current connection manager only recovers errored connections. There are several scenarios in which a connection can be closed 'quietly' without the connection manager knowing about it: when the Diameter watchdog times out, it closes the connection unconditionally; the protocol stack may close an unhealthy or idle connection; etc.
The fix relies on the Diameter Close Notifier channel to detect connection closure and reconnect.
The fix can potentially resolve issue #12067.
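The close-notification pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `closeNotifier`, `fakeConn`, and `watchClose` are hypothetical names, and `fakeConn` only imitates a connection that exposes a close-notification channel.

```go
package main

import (
	"fmt"
	"sync"
)

// closeNotifier is a local stand-in for a connection that exposes a
// close-notification channel. It is not the real Diameter connection type.
type closeNotifier interface {
	CloseNotify() <-chan struct{}
}

// fakeConn is a hypothetical connection used only for this illustration.
type fakeConn struct {
	closed chan struct{}
	once   sync.Once
}

func newFakeConn() *fakeConn { return &fakeConn{closed: make(chan struct{})} }

// CloseNotify returns a channel that is closed when the connection goes away.
func (f *fakeConn) CloseNotify() <-chan struct{} { return f.closed }

// Close marks the connection as closed; it is safe to call more than once.
func (f *fakeConn) Close() { f.once.Do(func() { close(f.closed) }) }

// watchClose starts a goroutine that waits for conn to report closure and
// then calls recoverFn once. The returned channel is closed after recoverFn
// has returned, so callers can wait for the recovery attempt to finish.
func watchClose(conn closeNotifier, recoverFn func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-conn.CloseNotify()
		recoverFn()
	}()
	return done
}

func main() {
	conn := newFakeConn()
	recovered := false
	// In the real manager the callback would try to re-establish the connection.
	done := watchClose(conn, func() { recovered = true })

	conn.Close() // simulate a watchdog closing the connection behind the manager's back
	<-done
	fmt.Println("recovered:", recovered)
}
```

In the actual change, the recovery callback would first need to check that the closure did not come from the connection manager itself, which is the role the `destroyConnection(conn)` check plays in the diff.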
Test Plan
unit tests, TVM
Additional Information