Npcap 0.993-0.9986 spontaneously stops capturing; all packets go to ps_drop #119
Comments
Hi Alexey. Thanks for the very detailed report. We're investigating now how this could happen and, of course, what can be done to fix it!
Thanks again for the report. I have a few questions to help us narrow down what might be happening, since we have not yet been able to reproduce the issue here:
Thanks for any additional info you can provide.
3) No, the pcap handle is used from only one thread in the process (although it's not the main thread). Additional observations which may or may not be helpful:
Capture mechanisms that do buffering of packets, with a timeout to keep packets from remaining buffered for too long (because the incoming packet rate is low, so the buffer takes a long time to fill up), have two sorts of timeout: a timer that doesn't start until a packet arrives, so a read never wakes up empty, and a timer that starts when the read begins, regardless of whether any packets have arrived.

For capture mechanisms with the second type of timer, a read can wake up with nothing to deliver. As I remember, the WinPcap and Npcap NPF driver is the second type, so you may get 0 packets from a wakeup because the timer went off and no packets had arrived since the last time the buffer was read.
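To make that concrete, here is a minimal sketch (hypothetical names; error handling trimmed) of how a user-mode loop can interpret such a wakeup, assuming the WinPcap/Npcap pcap_getevent() extension is available:

```c
#include <pcap.h>
#include <windows.h>

/* Sketch: wait on the handle's event, then dispatch.  With a driver whose
 * flush timer runs independently of packet arrival, a return of 0 packets
 * after a wakeup is normal and not a sign that the capture has stalled. */
static int poll_once(pcap_t *pcap, pcap_handler handler)
{
    HANDLE ev = pcap_getevent(pcap);
    DWORD w = WaitForSingleObject(ev, 200);   /* 200 ms fallback, as in the report */
    if (w != WAIT_OBJECT_0 && w != WAIT_TIMEOUT)
        return -1;                            /* unexpected wait failure */
    /* Only a negative return from pcap_dispatch() is an error; 0 just
     * means the timer fired with nothing buffered. */
    return pcap_dispatch(pcap, -1, handler, NULL);
}
```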
At least as I read the [...]
So in that thread, are you in a loop that just calls pcap_dispatch()?
There's a lot of information here to go through. This in particular intrigues me:
If there's some condition that is causing the buffer to fill up, or the driver to think that the buffer is filling up, then it would start dropping packets that are too large for the remaining space in the buffer. This would lead to smaller and smaller packets being captured until they, too, fill up the buffer and there is no remaining space. This is just a hunch at this point; I need to read through the code again with this idea in mind.
I haven't been able to figure out anywhere that we might be losing track of the free space remaining. I'm guessing it has something to do with calculating how much space a packet takes up when writing it to the buffer, or how much space to free up when reading a packet out of the buffer. So instead of constantly incrementing and decrementing the free-space counter, I'm going to try a change that calculates the free space based on the size of the buffer and the positions of the consumer and producer pointers. We'll have to acquire a lock on the buffer in the Read handler in order to ensure we don't calculate based on an outdated position, but that's probably best anyway, to avoid concurrency issues if some software somewhere is sharing an adapter handle between multiple threads. Please let us know if the issue persists in the next release.
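As a rough illustration of that idea (a sketch only, not Npcap's actual code; the structure and field names are hypothetical), free space can be derived from the buffer size and the producer/consumer positions instead of being tracked in a separate counter:

```c
#include <stddef.h>

/* Hypothetical ring buffer: 'head' is where the producer writes next,
 * 'tail' is where the consumer reads next; both are offsets in [0, size). */
struct ring {
    size_t size;
    size_t head;   /* producer position */
    size_t tail;   /* consumer position */
};

/* Free space computed from the positions alone.  One byte is kept in
 * reserve so that head == tail always means "empty", never "full". */
static size_t ring_free(const struct ring *r)
{
    if (r->head >= r->tail)
        return r->size - (r->head - r->tail) - 1;
    return (r->tail - r->head) - 1;
}
```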
But if what you said were the case and we thought the buffer was full, wouldn't [...]?

Again, without knowing anything else, if you are using a circular buffer as your response seems to suggest, I'd double-check the edge/overflow conditions:
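The list of conditions did not survive extraction, but as a hedged sketch of typical edge cases worth checking in such a design (reusing the hypothetical struct ring and ring_free() from the sketch above): a record that exactly fills the remaining space, a record that straddles the end of the buffer, and the head == tail ambiguity between "empty" and "full".

```c
#include <string.h>

/* Hypothetical producer path, showing where the edge cases live. */
static int ring_write(struct ring *r, unsigned char *buf,
                      const void *data, size_t len)
{
    if (len > ring_free(r))                /* would overflow: count as a drop */
        return -1;

    size_t contig = r->size - r->head;     /* contiguous space before the end */
    if (len <= contig) {
        memcpy(buf + r->head, data, len);
    } else {                               /* record straddles the end: split copy */
        memcpy(buf + r->head, data, contig);
        memcpy(buf, (const unsigned char *)data + contig, len - contig);
    }
    r->head = (r->head + len) % r->size;   /* wrap-around of the producer position */
    return 0;
}
```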
P.S. As part of troubleshooting, I also tried to remove the [...]
Npcap 0.9987, released today, includes the above commit that may address this issue. Please let us know one way or the other so we know if we need to continue to investigate this issue.
Instructed one of our users to try the new version; will confirm one way or another.
Unfortunately, a deployment of version 0.9989 on one of the affected machines indicates that the problem still persists -- with roughly the same symptoms.
@akontsevoy Thanks for letting us know. I'll take another look.
Ok, here's an interesting thing: the processors listed appear to be Intel Xeons with 56 cores each. Npcap has some weirdness surrounding the number of processors (#1967) that might explain the problem here: the kernel capture buffer for an instance is split into per-CPU segments.

HOWEVER! In order to determine the processor number, Npcap uses [...].

So why do we not see a BSoD due to buffer overrun? Well, each [...].

Unfortunately, none of this explains why the [...]
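A hedged sketch of the kind of mismatch being described (not Npcap's code; shown with the user-mode analogues of the processor APIs, and the names here are hypothetical): an array of per-CPU segments sized from the processor count sampled at startup can later be indexed with a processor number that exceeds that count, e.g. on a VM that supports CPU hot-add.

```c
#include <windows.h>

#define MAX_SEGMENTS 256                     /* hypothetical upper bound */

struct segment { int placeholder; };         /* stand-in for per-CPU capture state */

static struct segment g_segments[MAX_SEGMENTS];
static DWORD g_nsegments;                    /* processor count sampled at startup */

void init_segments(void)
{
    /* Sampled once; on a system that supports CPU hot-add, the processor
     * number seen later can be >= this count. */
    g_nsegments = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    if (g_nsegments > MAX_SEGMENTS)
        g_nsegments = MAX_SEGMENTS;
}

struct segment *segment_for_current_cpu(void)
{
    DWORD cpu = GetCurrentProcessorNumber();  /* may be >= g_nsegments */
    if (cpu >= g_nsegments)
        cpu %= g_nsegments;   /* avoids the out-of-bounds write, but funnels this
                                 CPU's packets into another CPU's segment */
    return &g_segments[cpu];
}
```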
@dmiller-nmap Yes, I thought this issue might be related too, but it doesn't appear to be. Here's the output of [...]:
This is running on VMware as far as I know. The underlying hardware may well have 56 logical CPUs, but this particular VM has only 6 logical CPUs over 2 virtual sockets -- nothing extraordinary.

Besides, if what you say were the case, the issue would have been intermittent and would have resulted in a fixed percentage of capture loss: after the capture stops because the thread was scheduled on a high-numbered logical CPU, eventually that thread would be scheduled on a low-numbered CPU again, and the capture would resume. And this never happens -- once the capture stops, it never resumes until I close and reopen the pcap handle. I'm not denying this is a problem, but it doesn't seem to be the problem.

On a side note, old WinPcap is known to BSoD on AWS machines with more than 16 or 32 cores. I wonder if that problem is analogous to what you're describing here (memory arena optimization gone sideways).

In your place, the next thing I'd turn my attention to is the fact that sometimes [...].

As for the suggested workaround, as I mentioned before, it doesn't work reliably -- even on healthy systems, sometimes the event gets set, yet [...]
Thanks for the additional insight. I'm going ahead with a fix for nmap/nmap#1967 that reorganizes a lot of the internals of the ring buffer, and I'm adding some additional assertions for our testing to check for boundary conditions when reading and writing the ring buffer. Between the two, I think we'll eliminate this issue.

One thing to note about my earlier problem description is that there is a difference between the maximum number of processors and the active number of processors on systems that support hot-add of CPU cores, like some VMware systems. There are just too many problems with the current mishmash of [...].

The new approach is pretty exciting, as I've found several ways to reduce the time spinlocks are held and to improve the utilization of the ring buffer. Still working on it, and of course it'll require significant testing.
Npcap 0.9991 changes most of the code affecting this issue. We no longer keep per-CPU state, so the number of CPU cores should not have any bearing on how Npcap functions. Only a single thread is responsible for writing to the buffer, and free space is updated using interlocked-exchange functions, so there should be no reason for it to drift. Please let us know if the problem appears fixed so that we can close this issue.
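For illustration only (user-mode Win32 shown here; a driver would use the kernel-mode equivalents, and the names are hypothetical), updating a shared free-space counter with interlocked operations looks roughly like this:

```c
#include <windows.h>

static volatile LONG g_free_space;   /* bytes of free space in the capture buffer */

/* Writer: atomically reserve room for one packet record, or fail (count a drop). */
static BOOL reserve_space(LONG record_len)
{
    /* InterlockedExchangeAdd returns the value *before* the addition. */
    LONG before = InterlockedExchangeAdd(&g_free_space, -record_len);
    if (before < record_len) {
        InterlockedExchangeAdd(&g_free_space, record_len);   /* back out */
        return FALSE;
    }
    return TRUE;
}

/* Reader: return the space once the record has been copied out to user mode. */
static void release_space(LONG record_len)
{
    InterlockedExchangeAdd(&g_free_space, record_len);
}
```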
Npcap 0.9991 had some problems, but Npcap 0.9994 seems very stable and fast, and more importantly for this issue, it changes all the code related to counting drops and free space. Any misbehavior related to capture stats in newer versions will be a completely separate issue, so we will close this one. Please let us know if there are any further problems.
We received a private report of this issue also affecting Npcap 0.9997 (and by extension Npcap 1.00, since the driver code is essentially the same between the two). We have not received any confirmation of a fix, so I am copying the diagnostic questionnaire I provided via email here. If anyone is still experiencing this issue in Npcap 0.9997 or newer, please fill out as much information as possible.

It is difficult to begin diagnosing this without more complete information about the capture. Specifically, to model the behavior of Npcap and identify a bug, I need to know the following:

- Capture parameters
- Information about how captured packets are processed
- Information about the packets being captured
- Information about the system state when packets are being dropped
- Troubleshooting steps
Thanks for any information you can provide.
I believe this issue may not have been related to the kernel driver code that was rewritten, but may instead be a bad interaction between the snaplen and user buffer size. I've opened the-tcpdump-group/libpcap#975 to address part of the issue, but we will need to investigate solutions here. Meanwhile, here is my description of the problem as I see it, along with some workarounds for programs using the Npcap API:
Potential solutions:
Potential workarounds (one possible application-side mitigation is sketched below):
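The workaround list above did not survive extraction; as a hedged sketch of the sort of application-side mitigation implied by the description (an assumption on my part, not a confirmed fix), a program can keep its snaplen well under the user-mode copy buffer, or enlarge that buffer with the WinPcap/Npcap-specific pcap_setuserbuffer() extension:

```c
#include <pcap.h>
#include <stddef.h>

/* Hedged sketch: open a capture with a snaplen small enough that any single
 * packet fits comfortably in libpcap's user-mode copy buffer, and/or grow
 * that buffer explicitly.  Values here are illustrative only. */
pcap_t *open_capture(const char *dev, char *errbuf)
{
    pcap_t *p = pcap_create(dev, errbuf);
    if (p == NULL)
        return NULL;

    pcap_set_snaplen(p, 9100);                  /* covers jumbo frames, well under
                                                   the default user buffer size */
    pcap_set_buffer_size(p, 3 * 1024 * 1024);   /* kernel buffer, as in the report */

    if (pcap_activate(p) != 0) {
        pcap_close(p);
        return NULL;
    }

    /* Alternative (WinPcap/Npcap extension): make the user-mode buffer much
     * larger than any packet the snaplen allows. */
    pcap_setuserbuffer(p, 512 * 1024);
    return p;
}
```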
Further information from the user reporting this issue in Npcap 0.9997: the issue appeared after receiving 2 frames of 65775 and 65859 bytes. This contradicts my analysis above, though I still believe that to be an issue. We will continue to investigate.
And pcap-bpf.c 1) defaults to the maximum BPF buffer size and 2) makes the user buffer the same size as the kernel buffer, so we already have user buffers bigger than 256000. An extra 6KB isn't worth worrying about here. (It used to use, as I remember, the default kernel buffer size: [...])
That made sense in the early 1990s, when BPF was first introduced - back in 1989, I seem to remember that Sun debated whether 4MB or 8MB was the right minimum memory size for a SPARCstation 1 - but memory sizes have gotten a LOT bigger since then. I cranked it up when a coworker in the remote file system group at Apple got Apple to set the default snaplen for tcpdump to the maximum - 65535 at the time - so that captures without [...]

Is there any reason not to have the user buffer be >= the kernel buffer in size? Is there any reason to have it be > the kernel buffer in size? Should it be a (fixed?) multiple of the kernel buffer size?
@akontsevoy Is this issue still happening in any of the more recent Npcap releases? We would like to close this out if it is not a problem.
Greetings,
We have an intermittent issue with Npcap running on the customer's VMware Windows Server 2016 machines. Initially everything works well and captures the traffic as expected, but eventually, despite pcap_dispatch() being called regularly and not returning errors, the traffic capture stops and all new packets seen by the adapter end up in pcap_stat::ps_drop. We have not been able to identify what triggers this problem.
Broadly, our product operates as follows:
1. Calls pcap_findalldevs(); filters out devices with the PCAP_IF_LOOPBACK flag.
2. For each remaining device: pcap = pcap_create(name, ...), pcap_set_buffer_size(pcap, 3 * 1024 * 1024), pcap_activate(pcap), pcap_set_snaplen(pcap, 0xFFFF), pcap_setnonblock(pcap, 1, ...), apply a BPF filter (where the filter looks like not host <single-IP>), pcap_setmintocopy(pcap, 8000), and finally event = pcap_getevent(pcap). All of this completes without errors.
3. Calls pcap_dispatch(pcap, -1, handler, NULL) every time WaitForMultipleObjects(events_size, events, FALSE, 200) returns, whether with WAIT_OBJECT_x or with WAIT_TIMEOUT, i.e. either by activity on the handle or after 200 ms of inactivity (a sketch of this loop follows the list). The handler function always returns locally (doesn't throw exceptions). It does grab a mutex, but we've double-checked that it's getting released properly by other threads in the application (i.e. there's no deadlock condition). Again, pcap_dispatch() returns without errors (always a number >= 0), but after a while it stops returning data (i.e. handler no longer gets called).
4. pcap_findalldevs() is called again, without closing the open pcap handles, to see if the network device list has changed. If it has, all pcap handles are closed and reopened; otherwise no action is taken.
5. pcap_stats(pcap, &ps) is called; this again returns without errors, but after a while every new packet goes into ps.ps_drop.
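For reference, here is a minimal sketch of the wait/dispatch loop from step 3 (hypothetical names; error handling and the reopen logic are trimmed). It simply drains every handle on each wakeup, which is safe because the handles are in non-blocking mode:

```c
#include <pcap.h>
#include <windows.h>

/* handles[i] and events[i] describe the same capture; events[i] came from
 * pcap_getevent(handles[i]).  handler is the application's pcap_handler. */
void capture_loop(pcap_t *handles[], HANDLE events[], DWORD count,
                  pcap_handler handler, volatile LONG *stop)
{
    while (!*stop) {
        /* Wake on activity on any handle, or after 200 ms of inactivity. */
        DWORD w = WaitForMultipleObjects(count, events, FALSE, 200);
        if (w == WAIT_FAILED)
            break;
        for (DWORD i = 0; i < count; i++) {
            /* In non-blocking mode this returns immediately with 0 when
             * nothing is buffered; only a negative value is an error. */
            if (pcap_dispatch(handles[i], -1, handler, NULL) < 0)
                *stop = 1;   /* real code would reopen the handle instead */
        }
    }
}
```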
A few observations:

Output of systeminfo.exe:

Relevant output of reg.exe query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318} /s: