Prevent concurrent instances of dhcp6c #8467
Conversation
This avoids two instances running simultaneously due to a potential race condition where rtsold may run rtsold_script twice very close together.
I had a quick look and I couldn't see any changes in … The two execs here in interface_dhcpv6_configure could be a problem, perhaps? Starting …
src/etc/inc/interfaces.inc (outdated):

```php
$dhcp6ccommand = exec_safe('flock -nF /var/run/dhcp6c.lock /usr/local/sbin/dhcp6c -c %s -p %s', [
    '/var/etc/dhcp6c.conf',
    '/var/run/dhcp6c.pid',
```
Suggested change:

```diff
-$dhcp6ccommand = exec_safe('flock -nF /var/run/dhcp6c.lock /usr/local/sbin/dhcp6c -c %s -p %s', [
+$dhcp6ccommand = exec_safe('/usr/local/bin/flock -n -E 0 -o %s /usr/local/sbin/dhcp6c -c %s -p %s', [
+    '/var/run/dhcp6c.lock',
     '/var/etc/dhcp6c.conf',
     '/var/run/dhcp6c.pid',
```
I don't see much use in erroring out when we can't take the lock, since another process will do it for us.
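For reference, here is a sketch of roughly what the suggested exec_safe() call runs once the placeholders are filled in (paths taken from the diff above; flag descriptions per the flock(1) utility the port ships):

```sh
# -n    non-blocking: bail out immediately if the lock is already held
# -E 0  exit status to use when the lock cannot be taken (0 = not treated as an error)
# -o    close the lock file descriptor before exec, so dhcp6c itself never
#       inherits or holds the lock; the forked flock parent keeps it instead
/usr/local/bin/flock -n -E 0 -o /var/run/dhcp6c.lock \
    /usr/local/sbin/dhcp6c -c /var/etc/dhcp6c.conf -p /var/run/dhcp6c.pid
```

With -n -E 0, a second start that loses the race simply exits quietly with status 0, which matches the "don't error on lock failure" intent above.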
Hmm, the -E 0 change seems safe. I thought I did have to use -F to make it work, though, which I see you've removed here; I'll have to look again when I have a chance. I can't quite remember why I needed -F.
I think we need the fork to release the lock properly, otherwise the process keeps it (see -o)
With -F the dhcp6c process holds the lock, but it is implicitly released when the process exits, which is what we need. Although I guess your way works fine too; it just means the flock process holds the lock instead. I suppose your way is safer in case an update to dhcp6c makes it close all file descriptors when it starts, or something like that.

I think I was trying to avoid the case where someone kills the flock process, meaning the lock is released even though dhcp6c might still be running. But admins doing silly things is perhaps not something we can completely defend against.
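A small illustration of the difference being discussed, using throwaway lock files and sleep as a stand-in for a long-running dhcp6c (assumes the util-linux flock(1) semantics for -F and -o and that flock is on PATH):

```sh
# -F: flock execs the command directly, so the command process itself holds
# the lock until it exits (here, for 60 seconds).
flock -n -F /tmp/demo-F.lock sleep 60 &
sleep 1
flock -n /tmp/demo-F.lock true || echo "demo-F.lock is held by the sleep process"

# Default fork plus -o: the forked flock parent holds the lock and releases it
# as soon as the command it ran exits, even if that command left children behind.
flock -n -E 0 -o /tmp/demo-o.lock sh -c 'sleep 60 &'
flock -n /tmp/demo-o.lock true && echo "demo-o.lock already released although sleep still runs"
```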
dhcp6c shouldn't own the lock, since we just want to avoid the start race. Letting it keep the lock while going into zombie mode would be problematic, and by the time of the daemon()/fork() call in the dhcp6c process the race should no longer be there. After that it chugs along, for the most part being handled by a "safer" SIGHUP.
I have added your change, but for the record, from my testing I can still occasionally reproduce the race condition with your suggested change. I take your point about dhcp6c itself not holding the lock, but the problem with your change, from what I can see in my tests, is that dhcp6c forks into the background and exits, and flock then exits and releases the lock before the second run has tried to take it, leaving a possible race. With my original change I haven't been able to trigger the race so far.

I think this PR in its current state is a definite improvement, but it does feel like there's still a small chance of the race condition lurking. Maybe adding some locking to the opnsense/dhcp6c repo would be useful? I don't mind looking at that when I have some time.
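To make the remaining window concrete, here is a rough reproduction sketch; sh -c 'sleep 60 &' stands in for dhcp6c daemonizing itself, and the lock path and timing are hypothetical rather than the exact commands used in the tests above:

```sh
LOCK=/tmp/dhcp6c-demo.lock   # hypothetical lock path for the demonstration

# First "rtsold_script" run: flock takes the lock, the stand-in daemon forks
# into the background, the foreground command exits, and flock releases the
# lock almost immediately even though the daemonized child keeps running.
/usr/local/bin/flock -n -E 0 -o "$LOCK" sh -c 'sleep 60 &'

# A second run arriving a moment later finds the lock free again, so nothing
# stops another daemon from starting; this prints "second instance started".
/usr/local/bin/flock -n -o "$LOCK" sh -c 'sleep 60 &' && echo "second instance started"

pgrep -fl 'sleep 60'   # both stand-in daemons are running
```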
I agree that it would be safer to handle this in dhcp6c. We could also remove the immediate flag, but the race is still there either way, as you indicate, and since we have the source code that would be the best approach.
I have created opnsense/dhcp6c#40 now. If you think this is the right overall approach I'll update this PR to just use the new dhcp6c -l parameter.
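If that lands, the flock wrapper in interfaces.inc could presumably be dropped. Purely as a hypothetical illustration (the actual option name and argument syntax are whatever opnsense/dhcp6c#40 ends up implementing), the call might reduce to something like:

```sh
# Hypothetical: dhcp6c takes and holds its own lock before daemonizing.
# The -l argument form shown here is an assumption; see opnsense/dhcp6c#40
# for the real interface.
/usr/local/sbin/dhcp6c -l /var/run/dhcp6c.lock \
    -c /var/etc/dhcp6c.conf -p /var/run/dhcp6c.pid
```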
I specifically meant …
Understood. I was just meaning that because we start rtsold and the script at the same time, the change making rtsold send the solicit immediately (and thus run the script) is going to make the race condition more likely, I think.
Remove unnecessary logging of failure to start dhcp6c. Co-authored-by: Franco Fichtner <franco@lastsummer.de>
Yes, agreed.
Add suggestion from PR - don't use flock -F flag, and don't report error on lock failure. Co-authored-by: Franco Fichtner <franco@lastsummer.de>
This is needed to avoid concurrent runs of dhcp6c which can lead to IPv6 connectivity problems. Related to opnsense/core#8467 and opnsense/core#8431
No change needed on core, but thanks very much for moving this along!
After an unexpected reboot due to a power cut earlier, I noticed some dhcp6c errors in the log about an XID mismatch; however, IPv6 was working fine. On closer inspection it turns out there are two dhcp6c processes running, both started at the same time by rtsold.
The rtsold_script.sh script does check for an existing pid file and, if one is found, sends SIGHUP to the existing instance; however, there seems to be a race condition where, if two runs of rtsold_script.sh happen close enough together, two instances of dhcp6c are started.
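That boils down to a classic check-then-act race. A hedged sketch of the shape of the logic (hypothetical paths, not the actual rtsold_script.sh contents) makes the window visible:

```sh
#!/bin/sh
# Sketch only -- illustrates the described pattern, not the real script.
PIDFILE=/var/run/dhcp6c.pid

if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    # An instance is already running: ask it to re-send its requests.
    kill -HUP "$(cat "$PIDFILE")"
else
    # No instance found. Two script runs arriving close together can both
    # reach this branch before either dhcp6c has written its pid file,
    # which is exactly how a second instance gets started.
    /usr/local/sbin/dhcp6c -c /var/etc/dhcp6c.conf -p "$PIDFILE"
fi
```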
Avoiding this seems to be as simple as wrapping dhcp6c inside a `flock` command.

Log snippet showing the two instances being started:
This unsurprisingly leads to interleaving of requests, and the wrong instance reading a response to the other instance's requests, leading to the mismatches:
On my test instance I haven't seen the two instances happen for real, but I have been able to simulate it to test this fix using:
which leads to the same XID mismatch error, so it seems to be representative of what can happen for real. When rebooting my test instance I have seen logs showing that the double-instance protection is kicking in, though.
After this I do see a single dhcp6c running as expected and IPv6 connectivity comes up as normal.
I'm not sure how/why the 2 instances of dhcp6c happened and haven't dug into the rtsold source to see exactly how it runs the wrapper script, but this seems to me to be a useful protection against two instances running.