Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Npcap kernel buffer can be excessively fragmented on systems with many CPU cores #1967

Open
dmiller-nmap opened this issue Mar 14, 2020 · 0 comments

Comments

@dmiller-nmap
Copy link

@dmiller-nmap dmiller-nmap commented Mar 14, 2020

The Npcap driver allocates a buffer for each open instance (handle returned by PacketOpenAdapter()) that stores packets until the user program retrieves them (via PacketReceivePacket()). The buffer is split into N segments, where N is the number of logical CPUs on the system. When a packet is processed, it's put into the segment of the buffer corresponding to the number of the processor where the filter callback is running (KeGetCurrentProcessorNumber()). If the segment is full, the packet is dropped and not captured.

On systems with a high number of processors, the segments could be too small and suffer packet drops even if the buffer as a whole is largely empty. This is even more likely if Windows just doesn't schedule NDIS threads on some processors, leaving those buffer segments entirely empty. This is probably the case many times, since the OS wants to schedule threads on the same processors they have run on before in order to avoid cache misses.

Additionally, the Nmap currently handles numbers of processors greater than sizeof(KAFFINITY) (64 on 64-bit Windows, 32 on 32-bit) very oddly. In at least one place it maps any system processor ID over 63 to 63, which means that if Npcap is running on a system with 128 processors in 2 groups, any threads which run on processors 63 through 127 will all share the same segment of the buffer. If the proposed change below is not implemented, we will have to rewrite this code anyway for NDIS 6.20, which adds support for more than 64 processors.

We need to better document the current behavior (issue to be opened later), but even better would be to limit the number of segments the buffer is split into and somehow encourage threads to use them all. I haven't worked out all the details, but here are some thoughts:

  1. The limit should be user-configurable via Packet.DLL function call, which would communicate with the driver via a new IoControl. The default value would probably be close to the number of CPU cores on a typical lower-end system, though we could set an upper limit which would be the number of logical processors on the system.
  2. If it's necessary for threads on a single processor to always use the same buffer segment, we would calculate the buffer number as processor_number % number_of_segments. However, I don't think this is even necessary: the threads could round-robin on the buffer segments to ensure they are all used. The benefit of segments at this point would be to avoid waiting on a spinlock to access the buffer; that is to say that the chances of not having to wait on a spinlock are number_of_segments times better than with a single buffer. The current scheme probably has a better performance characteristic in this sense, at the cost of wasted kernel memory and higher likelihood of dropped packets.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.