dhcrelay: can get stuck with 100% CPU usage in new implementation #7471

Closed
2 tasks done
alexandredulche opened this issue May 19, 2024 · 15 comments · Fixed by opnsense/dhcrelay#1

@alexandredulche

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

Since upgrading to 24.1.6, OPNsense goes to 100% CPU usage, caused by one of the DHCP relay processes.
This happens randomly after a few hours or days of uptime.

To Reproduce

Steps to reproduce the behavior:

  1. Go to 'Service > DHCRelay' *
  2. Create a list of destination DHCP servers (in my case one local, one remote over site-to-site VPN)*
  3. Create a configuration on multiple interfaces (VLAN) including management VLAN *
  4. Tick "Agent Information" *
  5. Save *
  6. Let it run (and observe CPU usage)

*In my case the new relay configuration was created automatically upon upgrading to 24.1.6 (destination list is called "Migrated IPv4 server entry")

Expected behavior

No heavy CPU usage should come from DHCP relay service.

Describe alternatives you considered

I tried disabling the DHCP relay on my management VLAN, where my destination DHCP servers reside.
I tried creating a cron job to restart the DHCP relay service every hour, but it's not working.
Now I have a cron job rebooting the VM every morning.
Upgraded to 24.1.7 today (waiting for the issue to reappear).
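
As an alternative to a blanket hourly restart, a check could react only when a relay process is actually spinning. The following is a minimal sketch, not an OPNsense feature: the function name `busy_pids`, the 90% threshold, and the sample text are all assumptions, and the input is assumed to look like the output of `ps -axo pid,%cpu,comm`. How the service is then restarted is deliberately left out.

```python
def busy_pids(ps_output, name="dhcrelay", threshold=90.0):
    """Return PIDs of `name` processes whose %CPU is at or above threshold.

    Hypothetical helper: expects three whitespace-separated columns
    (PID, %CPU, COMMAND) with one header row, as assumed above.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        pid, pcpu, comm = fields
        if comm.strip() == name and float(pcpu) >= threshold:
            pids.append(int(pid))
    return pids


# Made-up sample data for illustration only.
sample = """  PID  %CPU COMMAND
 1234 100.0 dhcrelay
 1240   0.0 dhcrelay
 2001  12.5 python3.11
"""

print(busy_pids(sample))
```

Only the relay process that is pinned at full CPU is reported; the idle relay process and unrelated processes are ignored.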

Relevant log files

I don't know where to find logs for the new DHCP relay service.

Additional context

Apparently I'm not the only one facing the issue:
https://forum.opnsense.org/index.php?topic=40126.0
https://forum.opnsense.org/index.php?topic=40284.0

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 24.1.6-amd64

My setup:

Edge sites (x2):

  • ESXi 8
  • OPNsense VM as main gateway
  • OPNsense VM as "helper" with DHCP relay for multiple VLANs
  • Multiple VLANs
  • Unifi switches
  • Windows Server VM with DHCP server (as standby)

Central site:

  • ESXi 8
  • OPNsense VM as main gateway
  • OPNsense VM as "helper" with DHCP relay for multiple VLANs
  • Multiple VLANs
  • Unifi switches
  • Windows Server VM with DHCP server (as standby)

Site-to-site Wireguard VPN

No DHCP guarding whatsoever on Unifi side.

OPNsense VMs (router and helper) all have an interface in each VLAN.
Target DHCP servers on edge sites are both the local and the central Windows DHCP server.

This setup worked flawlessly for months (if not years) before 24.1.6.

@Unit764

Unit764 commented May 23, 2024

We are seeing the same issue. We need to restart the dhcrelay service once every 2 or 3 days to get DHCP Relay functionality working again, even after the OPNsense 24.1.7_4-amd64 update.

@fichtner
Member

We are currently debugging the issue but the problem is elusive. It seems to hit an error condition in the BPF packet capture that the daemon can't recover from. We will publish updates as we encounter them.

@fichtner fichtner added the bug Production bug label May 24, 2024
@fichtner fichtner added this to the 24.7 milestone May 24, 2024
@fichtner fichtner changed the title from "DHCP relay issue (100% CPU usage) since 24.1.6" to "dhcrelay: 100% CPU usage in new implementation" May 24, 2024
@fichtner fichtner changed the title from "dhcrelay: 100% CPU usage in new implementation" to "dhcrelay: can get stuck with 100% CPU usage in new implementation" May 24, 2024
@browne-net

We are having the same issue since 24.1.6.
Restarting the DHCPv4 Relay services solves the issue for a few minutes.

Will the fix only be available in 24.7 or can we hope for a hotfix in any of the 24.1.x releases?

@fichtner
Member

A fix will be made available for all supported versions as soon as it is found.

AdSchellevis added a commit to opnsense/dhcrelay that referenced this issue May 27, 2024
…als 0 and the length of the packet off the wire (bh_datalen) doesn't equal 0, we will loop forever in receive_packet()

should fix opnsense/core#7471
fichtner pushed a commit to opnsense/dhcrelay that referenced this issue May 27, 2024
…als 0 and the length of the packet off the wire (bh_datalen) doesn't equal 0, we will loop forever in receive_packet()

should fix opnsense/core#7471
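
The failure mode named in the commit message can be illustrated with a small simulation. This is a simplified model and not the dhcrelay source: the header layout, the 4-byte alignment, and every function name below are assumptions made for the example. It shows how a buffer walk stalls when a truncated record (captured length 0, wire length non-zero) is skipped by an amount that works out to zero, and how skipping the header plus the captured bytes keeps the walk moving.

```python
import struct

# Simplified stand-in for a BPF record header: caplen, datalen, hdrlen
# plus 2 pad bytes. The real header also carries a timestamp.
HDR = struct.Struct("<IIH2x")
ALIGN = 4  # assumed alignment, standing in for BPF_WORDALIGN


def word_align(n):
    """Round n up to the next multiple of ALIGN."""
    return (n + ALIGN - 1) & ~(ALIGN - 1)


def make_record(payload, datalen=None):
    """Build one capture record; datalen defaults to the captured length."""
    caplen = len(payload)
    if datalen is None:
        datalen = caplen
    raw = HDR.pack(caplen, datalen, HDR.size) + payload
    return raw + b"\x00" * (word_align(len(raw)) - len(raw))


def walk(buf, skip_truncated, max_steps=1000):
    """Walk the buffer; return (complete_payloads, finished)."""
    offset, packets = 0, []
    for _ in range(max_steps):
        if offset + HDR.size > len(buf):
            return packets, True  # reached the end of the buffer
        caplen, datalen, hdrlen = HDR.unpack_from(buf, offset)
        if caplen != datalen:
            # Truncated capture: drop it and try the next record.
            offset = skip_truncated(offset, hdrlen, caplen)
            continue
        start = offset + hdrlen
        packets.append(buf[start:start + caplen])
        offset = word_align(start + caplen)
    return packets, False  # step limit hit: the walk made no progress


def skip_by_caplen_only(offset, hdrlen, caplen):
    # Hypothetical faulty skip: with caplen == 0 the offset never moves.
    return offset + caplen


def skip_whole_record(offset, hdrlen, caplen):
    # Skip header plus captured bytes, so the offset always advances.
    return word_align(offset + hdrlen + caplen)


# One truncated record (0 bytes captured of 300) followed by a good one.
buf = make_record(b"", datalen=300) + make_record(b"DHCPDISCOVER")

print(walk(buf, skip_by_caplen_only))
print(walk(buf, skip_whole_record))
```

In the simulation the step limit stands in for the daemon spinning at 100% CPU; a real receive loop has no such limit.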
@AdSchellevis
Member

AdSchellevis commented May 27, 2024

To test opnsense/dhcrelay#1, install using the command below and re-apply the config via the gui.

REDACTED (see below)

@fichtner
Member

While debugging and writing this, we found that FreeBSD has had three fixes in its tree since 2005/2006 for this particular code. The code is derived from dhclient, all of which originates from common ISC code, and the fixes fit the problem perfectly.

freebsd/freebsd-src@4eae015
freebsd/freebsd-src@289d89d80
freebsd/freebsd-src@ebe609b4a27

Here is a test package with the FreeBSD changes instead of the previous PR state by @AdSchellevis

# pkg add -f https://pkg.opnsense.org/FreeBSD:13:amd64/snapshots/misc/dhcrelay-0.4_4.pkg

All feedback on both binaries is welcome.

@fichtner fichtner reopened this May 27, 2024
@TheHellSite

TheHellSite commented May 27, 2024

Installed dhcrelay-0.4_4 4 hours ago and the issue seems to be gone. DHCRelay is working fine now and no high CPU usage visible.

@fichtner
Member

@TheHellSite woohoo! tentatively at least :)

@browne-net

browne-net commented May 28, 2024

@fichtner We have also been running the patch successfully for about 24h now without noticing any issues. DHCRelay is working fine again. I think this can be closed.

@fichtner
Member

@browne-net thanks we will ship in 24.1.8 tomorrow

@mileyceberus

mileyceberus commented May 29, 2024

dhcrelay seems to be dropping BOOTREPLY messages if the source IP of the REPLY does not match the destination IP specified in the UI.

Previously, I was able to specify the VIP address of my DHCP servers in the DHCP relay config. BOOTREPLY from a different source IP (e.g. physical NIC of the active server) would still be forwarded to the client.

@fichtner
Member

@mileyceberus feel free to open a new ticket, but I don't quite understand what "VIP address of my DHCP servers" means. It just takes an address. It can be any address.

@AdSchellevis
Member

When it's about the source address, source NAT is likely the place to look :)

@mileyceberus

@fichtner, no problem. Happy to open a new ticket as required.

I was referring to the virtual ip (VIP/CARP) of my dhcp servers.

In the past, I could point dhcrelay to a VIP/CARP address. dhcrelay would simply pass the OFFER messages to the clients regardless of the source addresses (as these could change depending on which server is active).

However, this behaviour seems to have changed.

@mileyceberus

mileyceberus commented May 29, 2024

@AdSchellevis Thanks for the suggestion. I have made the change on my side and it seems to have resolved the issue.

For the benefit of those who may be experiencing similar issues, this is what I did on my DHCP servers.

iptables -t nat -A POSTROUTING -o <OUTBOUND_INTERFACE> -p udp --sport 67 -j SNAT --to <VIRTUAL_IP>
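
Why the SNAT rule helps can be sketched with a toy model of the behaviour reported in this thread. This is not the dhcrelay implementation: `accept_reply` is a hypothetical function, the addresses are documentation-range placeholders, and the strict source check is only what the observations above suggest.

```python
import ipaddress


def accept_reply(source_ip, configured_servers):
    """Hypothetical check: forward a BOOTREPLY only if its source
    address is one of the destinations configured for the relay."""
    src = ipaddress.ip_address(source_ip)
    return any(src == ipaddress.ip_address(s) for s in configured_servers)


# Placeholder addresses: the relay is pointed at the virtual IP only.
virtual_ip = "192.0.2.10"
physical_ip = "192.0.2.11"
servers = [virtual_ip]

# Reply sent from the active server's physical NIC: dropped in this model.
print(accept_reply(physical_ip, servers))

# Same reply after SNAT rewrites the source to the virtual IP: forwarded.
print(accept_reply(virtual_ip, servers))
```

Under that assumption, rewriting the source address of UDP port 67 replies to the virtual IP makes them match the configured destination again.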
