Npcap performance improvements: BPF filter processing #535

Open
dmiller-nmap opened this issue Aug 26, 2021 · 0 comments

I had a few ideas to pursue that may improve Npcap's performance. As always, we'll have to measure throughput, identify the bottlenecks, and determine whether the changes would actually improve performance. The biggest source of slowdown is usually the transfer of packets to the user program, which the application programmer can tune via capture filters, snaplen, mintocopy, timeout, and buffer sizes (see #30). These new ideas concern the throughput of the NDIS LWF component, especially in cases where the packets are uninteresting to the user (rejected by the capture filter, i.e. a low signal-to-noise ratio).
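
For reference, here is a minimal sketch of how a user program can tune those knobs through the libpcap/Npcap API; the device name, filter string, and sizes are placeholders, not recommendations:

```c
/* Sketch: user-side capture tuning with libpcap/Npcap.  The interface name,
 * filter string, and sizes are illustrative; error handling is abbreviated. */
#include <pcap.h>
#include <stdio.h>

int open_tuned_capture(const char *device)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program prog;
    pcap_t *p = pcap_create(device, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_create: %s\n", errbuf);
        return -1;
    }

    pcap_set_snaplen(p, 128);                  /* copy only the headers we need */
    pcap_set_timeout(p, 500);                  /* batch deliveries every 500 ms */
    pcap_set_buffer_size(p, 4 * 1024 * 1024);  /* larger kernel buffer */
    if (pcap_activate(p) < 0) {
        fprintf(stderr, "pcap_activate: %s\n", pcap_geterr(p));
        pcap_close(p);
        return -1;
    }

    /* Npcap/WinPcap extension: don't wake the application until at least
     * this many bytes are buffered. */
    pcap_setmintocopy(p, 64 * 1024);

    /* A tight capture filter lets the driver reject uninteresting packets. */
    if (pcap_compile(p, &prog, "tcp port 443", 1, PCAP_NETMASK_UNKNOWN) == 0) {
        pcap_setfilter(p, &prog);
        pcap_freecode(&prog);
    }
    /* ... capture loop, then pcap_close(p) ... */
    return 0;
}
```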

The general sources of performance impact within the LWF are:

  1. Acquiring spinlocks (and RWLocks), and
  2. Allocating memory and copying data from packets.

I am not sure what impact the actual filtering function has on computational load. WinPcap had a JIT compiler for x86 only, which we could research to see whether any assessment was done of the performance impact of that improvement. Since we support 3 architectures (x86, x64, ARM64) and are not experts in assembly and compilers, reviving that is not likely to be a good idea to pursue. However, we can investigate other things, such as moving the filtering code earlier in the data path so that some of the known performance drags (1 and 2 above) can be avoided in more cases.
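
For context, the filtering function's contract is simple: it runs once per packet per instance and returns 0 to reject or the number of bytes to keep. A minimal user-mode sketch using libpcap's portable interpreter (the in-driver interpreter follows the same return-value convention, but this is the user-mode API, not driver code):

```c
/* Minimal sketch of the BPF filter contract using libpcap's user-mode
 * interpreter.  'prog' would come from pcap_compile(). */
#include <pcap.h>

unsigned int run_filter(const struct bpf_program *prog,
                        const u_char *pkt, unsigned int caplen,
                        unsigned int wirelen)
{
    /* Returns 0 to reject the packet, otherwise the number of bytes of the
     * packet to keep (normally the snaplen).  This pass over the compiled
     * instructions happens once per packet per open instance. */
    return bpf_filter(prog->bf_insns, pkt, wirelen, caplen);
}
```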

Currently, here are the locks and copies that are done in the average case (no error conditions or extra startup work needed) before the BPF engine is able to reject a packet:

  1. Acquire OpenInstancesLock (Read)
  2. Acquire & release AdapterHandleLock (spinlock)
  3. Acquire & release OpenInUseLock (spinlock)
  4. Allocate an NBLCopy from a lookaside list
  5. If raw WiFi capture, allocate a RadiotapHeader from a lookaside list
  6. Allocate a SrcNB from a lookaside list
  7. Acquire MachineLock (Read) for this instance's BPF filter.

If more than one instance (capture handle) is open, steps 2, 3, and 7 are repeated for each of them. Each instance whose filter matches (BPF returns >0) will copy that many bytes of the packet (rounded up to a multiple of 0xff) into the SrcNB, allocating buffers from a lookaside list. If a subsequent instance's BPF filter has a snaplen longer than the first one's and therefore requires more bytes of the packet, additional copy operations and buffer allocations are performed. A simplified model of this path is sketched below.
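
The following self-contained C model (not driver source; the lock, lookaside, and structure types are stand-ins chosen only to mirror the numbered steps above) makes the per-packet cost concrete:

```c
/* Model of the current per-packet path described above -- not driver code.
 * Per-packet work (steps 1, 4-6) is shown once; per-instance work (steps
 * 2, 3, 7) is in the loop.  The exact interleaving in the driver differs. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int unused; } RWLOCK, SPINLOCK, LOOKASIDE;

typedef struct OpenInstance {
    SPINLOCK AdapterHandleLock;
    SPINLOCK OpenInUseLock;
    RWLOCK   MachineLock;               /* guards this instance's BPF filter */
    struct OpenInstance *Next;
} OpenInstance;

typedef struct {
    RWLOCK    OpenInstancesLock;
    LOOKASIDE NblCopyPool, RadiotapPool, SrcNbPool, BufferPool;
    OpenInstance *Instances;
} FilterModule;

/* Stand-ins for the kernel synchronization and allocation primitives. */
static void  acquire_read(RWLOCK *l)     { (void)l; }
static void  acquire_spin(SPINLOCK *l)   { (void)l; }
static void  release(void *l)            { (void)l; }
static void *pool_alloc(LOOKASIDE *pool) { (void)pool; static char b[256]; return b; }
static unsigned run_bpf(const OpenInstance *o, const unsigned char *p, unsigned len)
{ (void)o; (void)p; (void)len; return 0; }  /* 0 = reject; >0 = bytes to keep */
static void copy_bytes(void *dst, const unsigned char *src, unsigned n)
{ (void)dst; (void)src; (void)n; }

void tap_packet_current(FilterModule *fm, const unsigned char *pkt,
                        unsigned len, bool raw_wifi)
{
    acquire_read(&fm->OpenInstancesLock);               /* step 1 */
    void *nbl_copy = pool_alloc(&fm->NblCopyPool);      /* step 4 */
    if (raw_wifi) (void)pool_alloc(&fm->RadiotapPool);  /* step 5 */
    void *src_nb = pool_alloc(&fm->SrcNbPool);          /* step 6 */
    (void)nbl_copy; (void)src_nb;

    /* Steps 2, 3, and 7 repeat for every open instance, even when the
     * packet will ultimately be rejected by every filter. */
    for (OpenInstance *o = fm->Instances; o != NULL; o = o->Next) {
        acquire_spin(&o->AdapterHandleLock);            /* step 2 */
        release(&o->AdapterHandleLock);
        acquire_spin(&o->OpenInUseLock);                /* step 3 */
        release(&o->OpenInUseLock);

        acquire_read(&o->MachineLock);                  /* step 7 */
        unsigned keep = run_bpf(o, pkt, len);
        release(&o->MachineLock);

        if (keep > 0) {
            /* Copy 'keep' bytes (rounded up) into SrcNB buffers; a later
             * instance with a longer snaplen forces additional copies. */
            copy_bytes(pool_alloc(&fm->BufferPool), pkt, keep);
        }
    }
    release(&fm->OpenInstancesLock);
}
```

Even when every filter rejects the packet, step 1, the allocations in steps 4-6, and the per-instance spinlock traffic have already been paid; that is the overhead the proposal below aims to avoid.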

I think a good approach may be to keep a list of filters in the FilterModule object instead of in the OpenInstance object. A single RWLock could then be used to acquire read access to the filters, run all of them in sequence, and keep track of each one's output. A single copy operation could then be done for the maximum value returned by any of the filters, or, if none matched, the packet could be passed up the stack without any copying or allocations. Only the instances whose filters matched would need to acquire the locks in steps 2 and 3 above. The locks and copying would be done within NPF_DoTap instead of within NPF_TapExForEachOpen, which may reduce the number of loops and gotos in that long function. Only when an instance adds, modifies, or deletes its filter would the RWLock need to be locked for writing. A rough sketch of that shape follows.
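
Continuing with the stand-in primitives from the previous sketch (RWLOCK, LOOKASIDE, OpenInstance, acquire_read, release, pool_alloc, copy_bytes), here is a rough sketch of that shape; FilterEntry, FiltersLock, and run_bpf_program are assumed names for illustration, not existing driver symbols:

```c
/* Sketch of the proposed restructuring (assumed names; not a patch).
 * The compiled filters live in the FilterModule, guarded by one
 * reader/writer lock, so a rejected packet touches no per-instance locks
 * and triggers no allocations or copies. */
typedef struct FilterEntry {
    OpenInstance *Owner;          /* instance that installed this filter */
    const void   *BpfProgram;     /* stand-in for the compiled BPF program */
    struct FilterEntry *Next;
} FilterEntry;

typedef struct {
    RWLOCK       FiltersLock;     /* write-locked only on filter add/modify/delete */
    LOOKASIDE    BufferPool;
    FilterEntry *Filters;
} ProposedFilterModule;

static unsigned run_bpf_program(const void *prog, const unsigned char *pkt,
                                unsigned len)
{ (void)prog; (void)pkt; (void)len; return 0; }  /* 0 = reject; >0 = bytes to keep */

void tap_packet_proposed(ProposedFilterModule *fm, const unsigned char *pkt,
                         unsigned len)
{
    unsigned max_keep = 0;

    acquire_read(&fm->FiltersLock);
    for (FilterEntry *f = fm->Filters; f != NULL; f = f->Next) {
        unsigned keep = run_bpf_program(f->BpfProgram, pkt, len);
        /* A real implementation would record each filter's result in
         * per-packet storage so only the matching instances are visited
         * again below. */
        if (keep > max_keep)
            max_keep = keep;
    }
    release(&fm->FiltersLock);

    if (max_keep == 0)
        return;     /* no filter matched: pass the packet up untouched */

    /* One copy sized for the largest request; only the instances whose
     * filters matched then take their AdapterHandleLock/OpenInUseLock
     * (steps 2 and 3) to attach the shared copy. */
    copy_bytes(pool_alloc(&fm->BufferPool), pkt, max_keep);
}
```

In the common no-match case this takes a single read lock and returns; the write lock is needed only when a filter is installed, changed, or removed.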
