Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(feg): Diameter closed connection recovery #12192

Merged
merged 1 commit into from Mar 18, 2022

Conversation

emakeev
Copy link
Contributor

@emakeev emakeev commented Mar 17, 2022

Signed-off-by: Evgeniy Makeev evgeniym@fb.com

Summary

Add connection recovery logic for externally closed diameter connection.
Current connection manager only recovers errored connections. There several scenarios when connennection can be closed 'quietly' without connection manager knowing about it - when diameter watchdog is timed out, it'll close connection unconditionally, protocol stack may closed unhealthy or idle connection, etc.

The fix is relying on diameter Close Notifier channel to detect connection closure & reconnect.

The fix can potentially resolve issue #12067

Test Plan

unit tests, TVM

Additional Information

  • This change is backwards-breaking

@emakeev emakeev requested review from uri200 and a team March 17, 2022 22:30
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines. label Mar 17, 2022
@emakeev emakeev force-pushed the diameter_connection_close_notify branch from fe426b7 to c6c1491 Compare March 17, 2022 22:33
}
if c != nil && conn != nil && c.destroyConnection(conn) {
// if connection was closed not by connection management functions, recover it
c.getDiamConnection()
Copy link
Contributor

@uri200 uri200 Mar 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible this fails again and then the connection is just lost?

Meaning if for some reason there is a network outage and the server is not reacheable getDiamConnection will fail and the connection will never be created. We may end up in the same place.

Maybe we need some kind of infinite loop to keep retrying every n seconds?

if c != nil && conn != nil && c.destroyConnection(conn) {
		for {
		// if connection was closed not by connection management functions, recover it
			_, _, err := c.getDiamConnection()
			if err != nil {
				return
			}
			glog.Error("Failed to recover closed connection. Retrying")
			time.sleep(1time.Second)
		}
		```


Or maybe this loop could be at `getDiamConnection` 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, if there is a prolonged network outage, we'll just fall back to active recovery - when the client tries to send out a request, getDiamConnection is called again and will attempt to reconnect. Putting infinite loop here could be a bit heavy, maybe a limited number of retries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note newConnection suffers of this issue. It never tries to reconnect if it fails once.
Maybe we could add a loop in that go routine and just call newConnection in this line instead of c.getDiamConnection()

http://github.com/magma/magma/blob/master/feg/gateway/diameter/connection.go#L48-L48

@emakeev emakeev force-pushed the diameter_connection_close_notify branch from c6c1491 to d41cf6a Compare March 18, 2022 00:10
break
}
time.Sleep(retryWaitTime)
retryWaitTime *= 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now i am just concerned about HA functionality :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HE disable will set connection to disabled...

Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
@emakeev emakeev force-pushed the diameter_connection_close_notify branch from c52413b to fd269b7 Compare March 18, 2022 01:44
@emakeev emakeev merged commit 28b33c9 into magma:master Mar 18, 2022
vikramgreddy pushed a commit to vikramgreddy/magma that referenced this pull request Mar 23, 2022
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
tmdzk pushed a commit that referenced this pull request Mar 23, 2022
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
ardzoht pushed a commit that referenced this pull request Mar 30, 2022
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
mattymo pushed a commit that referenced this pull request Jul 6, 2022
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
@emakeev emakeev deleted the diameter_connection_close_notify branch August 5, 2022 23:09
emakeev added a commit to emakeev/magma that referenced this pull request Aug 5, 2022
Signed-off-by: Evgeniy Makeev <evgeniym@fb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apply-v1.7 size/M Denotes a PR that changes 30-99 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants