Npcap performance improvements: BPF filter processing #535

Open
dmiller-nmap opened this issue Aug 26, 2021 · 0 comments

I had a few ideas to pursue that may improve Npcap's performance. As always, we'll have to measure throughput, identify the bottlenecks, and determine whether the changes would actually improve performance. The biggest source of slowdown is usually the transfer of packets to the user program, which the application programmer can tune via capture filters, snaplen, mintocopy, timeout, and buffer sizes (see #30). These new ideas concern the throughput of the NDIS LWF component, especially in cases where the packets are uninteresting to the user (rejected by the capture filter, i.e. a low signal-to-noise ratio).
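
For reference, here is a minimal sketch of how a user program can tune those knobs through the libpcap/Npcap API; the device name, filter string, and sizes are placeholders, not recommendations:

```c
/* Sketch: user-side capture tuning with libpcap/Npcap.  The interface name,
 * filter string, and sizes are illustrative; error handling is abbreviated. */
#include <pcap.h>
#include <stdio.h>

int open_tuned_capture(const char *device)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program prog;
    pcap_t *p = pcap_create(device, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_create: %s\n", errbuf);
        return -1;
    }

    pcap_set_snaplen(p, 128);                  /* copy only the headers we need */
    pcap_set_timeout(p, 500);                  /* batch deliveries every 500 ms */
    pcap_set_buffer_size(p, 4 * 1024 * 1024);  /* larger kernel buffer */
    if (pcap_activate(p) < 0) {
        fprintf(stderr, "pcap_activate: %s\n", pcap_geterr(p));
        pcap_close(p);
        return -1;
    }

    /* Npcap/WinPcap extension: don't wake the application until at least
     * this many bytes are buffered. */
    pcap_setmintocopy(p, 64 * 1024);

    /* A tight capture filter lets the driver reject uninteresting packets. */
    if (pcap_compile(p, &prog, "tcp port 443", 1, PCAP_NETMASK_UNKNOWN) == 0) {
        pcap_setfilter(p, &prog);
        pcap_freecode(&prog);
    }
    /* ... capture loop, then pcap_close(p) ... */
    return 0;
}
```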

The general sources of performance impact within the LWF are:

  1. Acquiring spinlocks (and RWLocks), and
  2. Allocating memory and copying data from packets.

I am not sure what impact the actual filtering function has on computational load. WinPcap had a JIT compiler for x86 only, which we could research to see whether any assessment was done of the performance impact of that improvement. Since we support 3 architectures (x86, x64, ARM64) and are not experts in assembly and compilers, reviving that is not likely to be a good idea to pursue. However, we can investigate other things, such as moving the filtering code earlier in the data path so that some of the known performance drags (1 and 2 above) can be avoided in more cases.
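
For context, the filtering function's contract is simple: it runs once per packet per instance and returns 0 to reject or the number of bytes to keep. A minimal user-mode sketch using libpcap's portable interpreter (the in-driver interpreter follows the same return-value convention, but this is the user-mode API, not driver code):

```c
/* Minimal sketch of the BPF filter contract using libpcap's user-mode
 * interpreter.  'prog' would come from pcap_compile(). */
#include <pcap.h>

unsigned int run_filter(const struct bpf_program *prog,
                        const u_char *pkt, unsigned int caplen,
                        unsigned int wirelen)
{
    /* Returns 0 to reject the packet, otherwise the number of bytes of the
     * packet to keep (normally the snaplen).  This pass over the compiled
     * instructions happens once per packet per open instance. */
    return bpf_filter(prog->bf_insns, pkt, wirelen, caplen);
}
```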

Currently, here are the locks and copies that are done in the average case (no error conditions or extra startup work needed) before the BPF engine is able to reject a packet:

  1. Acquire OpenInstancesLock (Read)
  2. Acquire & release AdapterHandleLock (spinlock)
  3. Acquire & release OpenInUseLock (spinlock)
  4. Allocate an NBLCopy from a lookaside list
  5. If raw WiFi capture, allocate a RadiotapHeader from a lookaside list
  6. Allocate a SrcNB from a lookaside list
  7. Acquire MachineLock (Read) for this instance's BPF filter.

If more than one instance (capture handle) is open, steps 2, 3, and 7 are repeated for each of them. Each instance whose filter matches (BPF returns >0) will copy that many bytes of the packet (rounded up to a multiple of 0xff) into the SrcNB, allocating buffers from a lookaside list. If a subsequent instance's BPF filter has a snaplen longer than the first one's and therefore requires more bytes of the packet, additional copy operations and buffer allocations are performed. A simplified model of this path is sketched below.
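
The following self-contained C model (not driver source; the lock, lookaside, and structure types are stand-ins chosen only to mirror the numbered steps above) makes the per-packet cost concrete:

```c
/* Model of the current per-packet path described above -- not driver code.
 * Per-packet work (steps 1, 4-6) is shown once; per-instance work (steps
 * 2, 3, 7) is in the loop.  The exact interleaving in the driver differs. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int unused; } RWLOCK, SPINLOCK, LOOKASIDE;

typedef struct OpenInstance {
    SPINLOCK AdapterHandleLock;
    SPINLOCK OpenInUseLock;
    RWLOCK   MachineLock;               /* guards this instance's BPF filter */
    struct OpenInstance *Next;
} OpenInstance;

typedef struct {
    RWLOCK    OpenInstancesLock;
    LOOKASIDE NblCopyPool, RadiotapPool, SrcNbPool, BufferPool;
    OpenInstance *Instances;
} FilterModule;

/* Stand-ins for the kernel synchronization and allocation primitives. */
static void  acquire_read(RWLOCK *l)     { (void)l; }
static void  acquire_spin(SPINLOCK *l)   { (void)l; }
static void  release(void *l)            { (void)l; }
static void *pool_alloc(LOOKASIDE *pool) { (void)pool; static char b[256]; return b; }
static unsigned run_bpf(const OpenInstance *o, const unsigned char *p, unsigned len)
{ (void)o; (void)p; (void)len; return 0; }  /* 0 = reject; >0 = bytes to keep */
static void copy_bytes(void *dst, const unsigned char *src, unsigned n)
{ (void)dst; (void)src; (void)n; }

void tap_packet_current(FilterModule *fm, const unsigned char *pkt,
                        unsigned len, bool raw_wifi)
{
    acquire_read(&fm->OpenInstancesLock);               /* step 1 */
    void *nbl_copy = pool_alloc(&fm->NblCopyPool);      /* step 4 */
    if (raw_wifi) (void)pool_alloc(&fm->RadiotapPool);  /* step 5 */
    void *src_nb = pool_alloc(&fm->SrcNbPool);          /* step 6 */
    (void)nbl_copy; (void)src_nb;

    /* Steps 2, 3, and 7 repeat for every open instance, even when the
     * packet will ultimately be rejected by every filter. */
    for (OpenInstance *o = fm->Instances; o != NULL; o = o->Next) {
        acquire_spin(&o->AdapterHandleLock);            /* step 2 */
        release(&o->AdapterHandleLock);
        acquire_spin(&o->OpenInUseLock);                /* step 3 */
        release(&o->OpenInUseLock);

        acquire_read(&o->MachineLock);                  /* step 7 */
        unsigned keep = run_bpf(o, pkt, len);
        release(&o->MachineLock);

        if (keep > 0) {
            /* Copy 'keep' bytes (rounded up) into SrcNB buffers; a later
             * instance with a longer snaplen forces additional copies. */
            copy_bytes(pool_alloc(&fm->BufferPool), pkt, keep);
        }
    }
    release(&fm->OpenInstancesLock);
}
```

Even when every filter rejects the packet, step 1, the allocations in steps 4-6, and the per-instance spinlock traffic have already been paid; that is the overhead the proposal below aims to avoid.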

I think a good approach may be to keep a list of filters in the FilterModule object instead of in the OpenInstance object. A single RWLock could then be used to acquire read access to the filters, run all of them in sequence, and keep track of each one's output. A single copy operation could then be done for the maximum value returned by any of the filters, or, if none matched, the packet could be passed up the stack without any copying or allocations. Only the instances whose filters matched would need to acquire the locks in steps 2 and 3 above. The locks and copying would be done within NPF_DoTap instead of within NPF_TapExForEachOpen, which may reduce the number of loops and gotos in that long function. Only when an instance adds, modifies, or deletes its filter would the RWLock need to be locked for writing. A rough sketch of that shape follows.
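
Continuing with the stand-in primitives from the previous sketch (RWLOCK, LOOKASIDE, OpenInstance, acquire_read, release, pool_alloc, copy_bytes), here is a rough sketch of that shape; FilterEntry, FiltersLock, and run_bpf_program are assumed names for illustration, not existing driver symbols:

```c
/* Sketch of the proposed restructuring (assumed names; not a patch).
 * The compiled filters live in the FilterModule, guarded by one
 * reader/writer lock, so a rejected packet touches no per-instance locks
 * and triggers no allocations or copies. */
typedef struct FilterEntry {
    OpenInstance *Owner;          /* instance that installed this filter */
    const void   *BpfProgram;     /* stand-in for the compiled BPF program */
    struct FilterEntry *Next;
} FilterEntry;

typedef struct {
    RWLOCK       FiltersLock;     /* write-locked only on filter add/modify/delete */
    LOOKASIDE    BufferPool;
    FilterEntry *Filters;
} ProposedFilterModule;

static unsigned run_bpf_program(const void *prog, const unsigned char *pkt,
                                unsigned len)
{ (void)prog; (void)pkt; (void)len; return 0; }  /* 0 = reject; >0 = bytes to keep */

void tap_packet_proposed(ProposedFilterModule *fm, const unsigned char *pkt,
                         unsigned len)
{
    unsigned max_keep = 0;

    acquire_read(&fm->FiltersLock);
    for (FilterEntry *f = fm->Filters; f != NULL; f = f->Next) {
        unsigned keep = run_bpf_program(f->BpfProgram, pkt, len);
        /* A real implementation would record each filter's result in
         * per-packet storage so only the matching instances are visited
         * again below. */
        if (keep > max_keep)
            max_keep = keep;
    }
    release(&fm->FiltersLock);

    if (max_keep == 0)
        return;     /* no filter matched: pass the packet up untouched */

    /* One copy sized for the largest request; only the instances whose
     * filters matched then take their AdapterHandleLock/OpenInUseLock
     * (steps 2 and 3) to attach the shared copy. */
    copy_bytes(pool_alloc(&fm->BufferPool), pkt, max_keep);
}
```

In the common no-match case this takes a single read lock and returns; the write lock is needed only when a filter is installed, changed, or removed.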
