Skip to content

Conversation

@bensmithurst
Copy link
Contributor

After an unexpected reboot due to a power cut earlier I noticed some dhcp6c errors in the log about XID mismatch, however IPv6 was working fine. On closer inspection it turns out there are 2 dhcp6c processes running, both started at the same time by rtsold.

That script does check for an existing pid file and if found, sends SIGHUP to an existing instance, however it seems as if there is a race condition where if the two runs of rtsold_script.sh are close enough, two instances of dhcp6c are started.

Avoiding this seems to be as simple as wrapping dhcp6c inside a flock command.

Log snippet showing 2 instances being started:

<13>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 16941 - [meta sequenceId="280"] RTSOLD script - Starting dhcp6c daemon
<29>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 17028 - [meta sequenceId="281"] extracted an existing DUID from /var/db/dhcp6c_duid: ****
<29>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 17028 - [meta sequenceId="282"] <3>[interface] (9)
<29>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 17028 - [meta sequenceId="283"] <5>[igb0] (4)
<13>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 18091 - [meta sequenceId="284"] RTSOLD script - Starting dhcp6c daemon
<29>1 2025-03-21T10:48:53+00:00 HOSTNAME dhcp6c 17028 - [meta sequenceId="285"] <3>begin of closure [{] (1)

This unsurprisingly leads to interleaving of requests, and the wrong instance reading a response to the other instance's requests, leading to the mismatches:

<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="484"] Sending Solicit
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="485"] a new XID (f5b51) is generated
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="486"] set client ID (len 14)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="487"] set elapsed time (len 2)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="488"] set option request (len 4)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="489"] set IA_PD prefix
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="490"] set IA_PD
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="491"] send solicit to ff02::1:2%igb0
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="492"] reset a timer on igb0, state=SOLICIT, timeo=0, retrans=1052
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="493"] receive advertise from fe80::****%igb0 on igb0
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="494"] get DHCP option server ID, len 10
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="495"]   DUID: ****
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="496"] get DHCP option client ID, len 14
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="497"]   DUID: ****
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="498"] get DHCP option IA_PD, len 41
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="499"]   IA_PD: ID=0, T1=1800, T2=2880
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="500"] get DHCP option IA_PD prefix, len 25
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="501"]   IA_PD prefix: 2a0e:***::/56 pltime=3600 vltime=3600
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="502"] get DHCP option DNS, len 32
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="503"] XID mismatch
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="504"] Sending Solicit
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="505"] a new XID (ba5ecf) is generated
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="506"] set client ID (len 14)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="507"] set elapsed time (len 2)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="508"] set option request (len 4)
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="509"] set IA_PD prefix
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="510"] set IA_PD
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="511"] send solicit to ff02::1:2%igb0
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="512"] reset a timer on igb0, state=SOLICIT, timeo=0, retrans=1020
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="513"] receive advertise from fe80::****%igb0 on igb0
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="514"] get DHCP option server ID, len 10
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="515"]   DUID: ****
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="516"] get DHCP option client ID, len 14
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="517"]   DUID: ****
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="518"] get DHCP option IA_PD, len 41
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="519"]   IA_PD: ID=0, T1=1800, T2=2880
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="520"] get DHCP option IA_PD prefix, len 25
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="521"]   IA_PD prefix: 2a0e:***::/56 pltime=3600 vltime=3600
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="522"] get DHCP option DNS, len 32
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="523"] server ID: ****, pref=-1
<29>1 2025-03-21T10:48:54+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="524"] reset timer for igb0 to 0.995808
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="525"] Sending Solicit
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="526"] set client ID (len 14)
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="527"] set elapsed time (len 2)
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="528"] set option request (len 4)
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="529"] set IA_PD prefix
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="530"] set IA_PD
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="531"] send solicit to ff02::1:2%igb0
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 18518 - [meta sequenceId="532"] reset a timer on igb0, state=SOLICIT, timeo=1, retrans=2008
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="533"] receive advertise from fe80::****%igb0 on igb0
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="534"] get DHCP option server ID, len 10
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="535"]   DUID: ****
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="536"] get DHCP option client ID, len 14
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="537"]   DUID: ****
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="538"] get DHCP option IA_PD, len 41
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="539"]   IA_PD: ID=0, T1=1800, T2=2880
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="540"] get DHCP option IA_PD prefix, len 25
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="541"]   IA_PD prefix: 2a0e:***::/56 pltime=3600 vltime=3600
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="542"] get DHCP option DNS, len 32
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="543"] XID mismatch
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="544"] picked a server (ID: ****)
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="545"] Sending Request
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="546"] a new XID (68923e) is generated
<29>1 2025-03-21T10:48:55+00:00 HOSTNAME dhcp6c 19076 - [meta sequenceId="547"] set client ID (len 14)

On my test instance I haven't seen the 2 instances happen for real but I have been able to simulate it to test this fix using

/var/etc/rtsold_script.sh em0 & /var/etc/rtsold_script.sh em0 &

which leads to the same XID mismatch error so seems to be representative of what can happen for real. When rebooting my test instance I have seen logs showing the double instance protection is kicking in though.

<13>1 2025-03-21T20:00:01+00:00 HOSTNAME dhcp6c 6143 - [meta sequenceId="265"] RTSOLD script - Starting dhcp6c daemon
<13>1 2025-03-21T20:00:01+00:00 HOSTNAME dhcp6c 9650 - [meta sequenceId="334"] RTSOLD script - Starting dhcp6c daemon
<13>1 2025-03-21T20:00:01+00:00 HOSTNAME dhcp6c 15501 - [meta sequenceId="335"] RTSOLD script - Failed to start dhcp6c daemon

After this I do see a single dhcp6c running as expected and IPv6 connectivity comes up as normal.

I'm not sure how/why the 2 instances of dhcp6c happened and haven't dug into the rtsold source to see exactly how it runs the wrapper script, but this seems to me to be a useful protection against two instances running.

This avoids two instances running simultaneously due to a potential race
condition where rtsold may run rtsold_script twice very close together.
@fichtner fichtner linked an issue Mar 21, 2025 that may be closed by this pull request
2 tasks
@fichtner fichtner self-assigned this Mar 21, 2025
@fichtner
Copy link
Member

Sounds like #8431 but also that something in 25.1 caused this as we haven’t had this for a long time. The surface timing is odd, either due to FreeBSD 14.2 or the rtsold no delay option 7fe558c

@bensmithurst
Copy link
Contributor Author

I had a quick look and I couldn't see any changes in rtsold upstream.

The two execs here in interface_dhcpv6_configure could be a problem, perhaps? Starting rtsold which will likely run rtsold_script.sh almost immediately and then immediately also running rtsold_script.sh directly.

Comment on lines 2888 to 2890
$dhcp6ccommand = exec_safe('flock -nF /var/run/dhcp6c.lock /usr/local/sbin/dhcp6c -c %s -p %s', [
'/var/etc/dhcp6c.conf',
'/var/run/dhcp6c.pid',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$dhcp6ccommand = exec_safe('flock -nF /var/run/dhcp6c.lock /usr/local/sbin/dhcp6c -c %s -p %s', [
'/var/etc/dhcp6c.conf',
'/var/run/dhcp6c.pid',
$dhcp6ccommand = exec_safe('/usr/local/bin/flock -n -E 0 -o %s /usr/local/sbin/dhcp6c -c %s -p %s', [
'/var/etc/dhcp6c.conf',
'/var/etc/dhcp6c.conf',
'/var/run/dhcp6c.pid',

I don't see much use erroring when returning from not being able to lock since another process will do it for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the -E 0 change seems safe, I thought I did have to use -F to make it work though, which I see you've removed here, I'll have to look again when I have chance. I can't quite remember why I needed to use -F.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need the fork to release the lock properly, otherwise the process keeps it (see -o)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with -F the dhcp6c process holds the lock but it will be implicitly released when it exits, which is what we need. Although I guess your way works fine, just means the flock process holds the lock instead. I suppose your way is safer in case an update to dhcp6c makes it close all file descriptors when it starts or something like that.

I think I was trying to avoid the case where someone kills the flock process, meaning the lock is released, even though dhcp6c might still be running. But admins doing silly things is perhaps not something we can completely defend against.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dhcp6c shouldn't own the lock since we just want to avoid the start race. letting it keep the lock going into zombie mode would be problematic and at the time of the daemon()/fork() call in dhcp6c process the race should not be there anymore. After that it chugs along for the most part being handled by a "safer" SIGHUP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added your change but for the record from my testing I can still - occasionally - reproduce the race condition using your suggested change. I take your point about dhcp6c itself not holding the lock, but the problem with your change from what I can see in my tests that dhcp6c will fork into the background and exit, and flock then exits, which will release the lock, before the second run has tried to get the lock, leading to a possible race condition. With my original change I can't ever trigger a race, so far.

I think this PR in its current state is a definite improvement but it does feel like there's still a small chance of the race condition lurking. Maybe adding some locking into the opnsense/dhcp6c repo would be useful? I don't mind looking at that, when I have some time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be safer to handle this in dhcp6c. We could also remove the immediate flag but the race is still there either way as you indicate and since we have the source code that would be best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created opnsense/dhcp6c#40 now. If you think this is the right overall approach I'll update this PR to just use the new dhcp6c -l parameter.

@fichtner
Copy link
Member

I had a quick look and I couldn't see any changes in rtsold upstream.

The two execs here in interface_dhcpv6_configure could be a problem, perhaps? Starting rtsold which will likely run rtsold_script.sh almost immediately and then immediately also running rtsold_script.sh directly.

I specifically meant

% git grep solicit.imme
business/24.10/24.10:o interfaces: send IPv6 solicit immediately on WAN interfaces
community/24.7/24.7.6:o interfaces: send IPv6 solicit immediately on WAN interfaces

@bensmithurst
Copy link
Contributor Author

I specifically meant

% git grep solicit.imme
business/24.10/24.10:o interfaces: send IPv6 solicit immediately on WAN interfaces
community/24.7/24.7.6:o interfaces: send IPv6 solicit immediately on WAN interfaces

Understood - I was just meaning that because we start rtsold and the script at the same time, the change making rtsold send the solicit immediately (and thus run the script) is going to make the race condition more likely, I think.

Remove unnecessary logging of failure to start dhcp6c.

Co-authored-by: Franco Fichtner <franco@lastsummer.de>
@fichtner
Copy link
Member

Understood - I was just meaning that because we start rtsold and the script at the same time, the change making rtsold send the solicit immediately (and thus run the script) is going to make the race condition more likely, I think.

Yes, agreed.

bensmithurst and others added 2 commits April 15, 2025 22:41
Add suggestion from PR - don't use flock -F flag, and don't report error on lock failure.

Co-authored-by: Franco Fichtner <franco@lastsummer.de>
bensmithurst added a commit to bensmithurst/opnsense-dhcp6c that referenced this pull request Apr 16, 2025
This is needed to avoid concurrent runs of dhcp6c which can lead
to IPv6 connectivity problems.

Related to opnsense/core#8467
and opnsense/core#8431
@fichtner
Copy link
Member

No change needed on core, but thanks very much to moving this along!

@fichtner fichtner closed this Apr 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

No IPv6 connectivity after reboot / duplicate dhcp6c processes

2 participants