Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exit if interface loses IP #10

Merged
merged 1 commit into from
Oct 3, 2019

Conversation

cybertron
Copy link
Member

On startup, mdns-publisher looks up which interface has the configured
IP. However, it is possible for this IP to be subsequently moved,
such as when the interface gets added to a bridge. In that case, we
need to be restarted and listen on the bridge or our records become
inaccessible.

This change adds a goroutine that loops every five seconds checking
if the detected interface still has the configured IP. If it does not,
the function sends a SIGTERM to the process to exit gracefully.
This should allow kubernetes to restart it on the correct interface.

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 27, 2019
// This shouldn't be possible, but we need to handle it anyway
log.Printf("[ERR] mdns-publish: Failed to find own process")
}
err = process.Signal(syscall.SIGTERM)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to pass shutdown channel here and close is internally? Instead of looking for your own PID and commit suicide. Would save some lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe closing the shutdown channel would only stop the listening goroutine. The main thread blocks waiting for a signal (<-sig in https://github.com/openshift/mdns-publisher/blob/master/cmd/publish.go#L93). If we only close shutdown then we'd stop listening but keep running until a signal is delivered.

I also considered just calling Exit() here, but apparently that's the nuclear option and doesn't allow any cleanup to run. I don't know whether that would cause trouble, but it seemed like a graceful shutdown was less likely to cause problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, fair enough.

func IfaceCheck(ip net.IP, iface net.Interface) {
for {
i, err := FindIface(ip)
if err != nil || i.Name != iface.Name {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we fail to find the interface since the IP is being moved and does not sit anywhere at the moment. What would happen with mdns on the next start?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that if the IP is not found mdns-publisher just exits immediately, so it would crash loop in that case until the IP gets configured.

@yboaron
Copy link
Contributor

yboaron commented Oct 2, 2019

A.
As far as I know, passing the interface name to 'JoinGroup' function [1] affects only the Tx side.

Meaning that both IGMP messages and MDNS multicast packets will be sent on the specified interface ( instead of using routing table for that purpose).

As per the Rx side, we use a UDP L4 socket with dest IP == MDNS multicast address, so we shouldn't care about the interface that the 'baremetal' IP is assigned to and neither the Rx network interface.

B. So this PR makes sure that mdns and IGMP packets sent on the interface that baremetal IP is attached to and trigger the 'cacheFlush' process (as a result of mdns-publisher restart)

C. I think that the comments in code should be updated ( e.g: 'The interface we were listening on changed').

[1] https://github.com/celebdor/zeroconf/blob/master/connection.go#L86

@@ -73,6 +76,30 @@ func FindIface(ip net.IP) (iface net.Interface, err error) {
return iface, fmt.Errorf("Couldn't find interface with IP address %s", ip)
}

// If an interface changes after we've started listening on it (say because it
// got bridged), we need to be restarted to listen on the new interface that
// contains the specified IP.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comments should be updated here(something like - verify that we join to mdns multicast on the interface holds the IP)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update this.

cmd/publish.go Outdated
@@ -89,6 +89,7 @@ var publishCmd = &cobra.Command{
}).Info("Publishing service")
go publisher.Publish(ip, iface, service, shutdownChannel, waitGroup)
}
go publisher.IfaceCheck(ip, iface)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to GOLANG, so I'm not sure if it's a good idea.
Do you think that changing [1] to something similar to what we have in [2] , could be useful?

I mean something like that:

for {
select {
case <-shutdown:
'what we have do incase of shutdown'
default:
//call to any periodic function ( IfaceCheck, and we can add other functions in future)
IfaceCheck()
time.sleep(X)
}

[1]


[2] https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/monitor.go#L49

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wasn't familiar with that pattern in Go, but I think we can use it to eliminate the pid stuff in the check function. I'm not sure we want to get rid of the goroutine, but we can use another channel to do the communication instead of signaling. I'll include that change in the next version of this patch.

On startup, mdns-publisher looks up which interface has the configured
IP. However, it is possible for this IP to be subsequently moved,
such as when the interface gets added to a bridge. In that case, we
need to be restarted and re-detect the bridge or our records become
inaccessible.

This change adds a goroutine that loops every five seconds checking
if the detected interface still has the configured IP. If it does not,
it closes a channel to indicate to the main thread that it should exit.
This should allow kubernetes to restart it on the correct interface.
@cybertron
Copy link
Member Author

Okay, I think this version addresses the comments so far.

Copy link
Contributor

@yboaron yboaron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked it on dev-scripts env + CNV - looks good

@celebdor
Copy link
Contributor

celebdor commented Oct 3, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 3, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: celebdor, cybertron, phoracek, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 8117051 into openshift:master Oct 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants