I had a few ideas to pursue that may improve Npcap's performance. As always, we'll have to measure throughput, identify the bottlenecks, and determine whether the changes would actually positively impact performance. The biggest source of slowdown is usually transfer of packets to the user program, which can be tuned by the user programmer via capture filters, snaplen, mintocopy, timeout, and buffer sizes (see #30). These new ideas are related to throughput of the NDIS LWF component, especially in cases where the packets are uninteresting to the user (rejected by capture filter, i.e. small signal-to-noise ratio).
The general sources of performance impact within the LWF are:

1. Acquiring spinlocks (and RWLocks), and
2. Allocating memory and copying data from packets.
I am not sure what impact the actual filtering function has on computational load. WinPcap had a JIT compiler for x86 only; we could research whether any assessment was ever done of that improvement's performance impact. Since we support three architectures (x86, x64, arm64) and are not experts in assembly and compilers, pursuing a JIT is not likely to be a good idea. However, we can investigate other things, such as moving the filtering code earlier in the data path so that some of the known performance drags (1 and 2 above) can be avoided in more cases.
Currently, here are the locks and copies that are done in the average case (no error conditions or extra startup work needed) before the BPF engine is able to reject a packet:
1. Acquire OpenInstancesLock (Read)
2. Acquire & release AdapterHandleLock (spinlock)
3. Acquire & release OpenInUseLock (spinlock)
4. Allocate a NBLCopy from a lookaside list
5. If raw WiFi capture, allocate a RadiotapHeader from a lookaside list
6. Allocate a SrcNB from a lookaside list
7. Acquire MachineLock (Read) for this instance's BPF filter.
If more than one instance (capture handle) is open, steps 2, 3, and 7 are repeated for each of them. Each instance whose filter matches (BPF returns >0) copies that many bytes of the packet (rounded up to a multiple of 0xff) into the SrcNB, allocating buffers from a lookaside list. If a subsequent instance's BPF has a snaplen longer than the first one's and therefore requires more bytes of the packet, additional copy operations and buffer allocations are performed.
I think a good approach may be to keep the list of filters in the FilterModule object instead of in each OpenInstance object. A single RWLock could then be acquired for read access to the filters, all of them run in sequence, and the output tracked. A single copy operation could then be done for the maximum value returned by any of the filters; if none matched, the packet could be passed up the stack without any copying or allocation. Only the instances whose filters matched would need to acquire the locks in steps 2 and 3 above. The locking and copying would happen within NPF_DoTap instead of within NPF_TapExForEachOpen, which may reduce the number of loops and gotos in that long function. The RWLock would need to be locked for writing only when an instance adds, modifies, or deletes its filter.