Replace ringbuffer in Packet Pool with a stack for better cache locality #913
Conversation
Right, so there is a return stack per thread, and each packet 'knows' due to p->pool which pool it must be returned to. Normal ops require no locks. One thing that is possible is that T0 has packets in its stack, but T1 hasn't, then the stack lookup still fails and we can fall back to direct alloc, correct?
One thing to consider is that we currently use PacketGetFromAlloc for pseudo packets. I have observed some conditions where this led to massive amounts of allocs per second. Adding a stack for this case won't work, as it would introduce a contention point. The local stack would run dry quickly, leading to locks at lookups. As the pseudo packets are returned by the other threads by definition, the return path has lock contention too. I have experimented with this, and observed it. I wonder if we should shoot for a comprehensive solution that includes the pseudo packet issues.
Correct, but since each pool has a fixed set of packets, and those packets are always returned to that pool, it won't be the case that all the packets end up on one stack while the others have none. No stack will ever have more than its original number of packets. It could be the case that all the packets from one pool are in use (TCP reassembly and such) and then that thread would resort to malloc.
Actually, I wonder if this will work with the autofp runmode. In the common case (reading a pcap or having one pcap reading thread), the thread getting a packet will be different from the thread returning it by definition. So all stream/detect threads will fight over the return stack lock of the receive/decode thread.
If the return stack lock is a point of contention, it would be fairly easy to have N return stacks per pool, each with its own lock. The threads returning packets could pick a random stack to use for returning, or statically hash them across the return stacks. Or, the returning thread could use trylock on one return stack and, if that fails, try another one. When the local stack is empty, on allocation, it can check each return stack until it finds some packets.

Another idea would be to allocate multiple packets at once when the stack is completely empty and add them to the local stack, then only free allocated packets when the local stack goes above its original size. This would require keeping a count of the number of packets on the local packet stack, but that is trivial. Currently, AFP and PFRing and other packet sources call PacketGetFromAlloc() for each received packet, which by itself is a large number of requests per second.
For the single allocation pool case, we could also have the freeing threads keep a local stack of packets to be freed and, when that reaches some limit (8?), return them all in one lock of the return stack. This is extra complexity, since there would need to be a list of pending returns per pool on each thread. So, I think multiple return stacks in each pool is simpler.
On 04/01/2014 05:28 PM, Ken Steele wrote:

This is a bit like what I experimented with in my multiple queue test. I think it would make sense to take a multi stack approach.

I like this idea.

This is incorrect, both use PacketGetFromQueueOrAlloc, which only allocs: https://github.com/inliniac/suricata/blob/master/src/source-af-packet.c#L771
I will look at adding multiple return stacks. Should the number of return stacks be a compile time option, or a Suricata.yaml configuration option?

You are correct, I was misreading PacketGetFromAlloc() as PacketGetFromQueueOrAlloc(). Why not have pseudo packets also allocate from PacketGetFromQueueOrAlloc()?
I'd say a yaml config with an 'auto' setting, where it is set to some
Thoughts on dealing with systems without __thread? I was thinking of having one central Packet Pool with multiple locked stacks, and having threads "randomly" decide which stack to use to allocate or free a packet.
We could do several things.
Like I mentioned privately yesterday, I'm thinking about the possibility of doing all this through an API, so that specific runmodes and architectures can use their own methods. This would have the added benefit that we could introduce new methods easily.
I've currently run into the shared packet pool as a bottleneck when using workers mode with more than 10-11 "receive" threads. My current work-around is multiple Suricata instances, but obviously that is not ideal. A quick win was a thread local packet pool: drops disappeared and CPU usage was much lower. My first attempt did not use __thread but a crude implementation of a packet pool per receive thread, which performed the same as __thread and, while more code, has the benefit of not requiring __thread. Could that be an option? Refactor the packet pool to use a context object and have it initialized in the "receive" thread init? Perhaps attach it to the packet so pseudo packets can retrieve from the same thread?
Replaced by: #1005
Implements optimization 1039 (https://redmine.openinfosecfoundation.org/issues/1039#change-4093)
Using a stack for free Packet storage causes recently freed Packets to be
reused quickly, while there is more likelihood of the data still being in
cache.
The new structure has a per-thread private stack for allocating Packets
which does not need any locking. Since Packets can be freed by any thread,
there is a second stack (return stack) for freeing packets by other threads.
The return stack is protected by a mutex. Packets are moved from the return
stack to the private stack when the private stack is empty.
Returning packets back to their "home" stack keeps the stacks from getting out
of balance.
The PacketPoolInit() function is now called by each thread that will be
allocating packets. Each thread allocates max_pending_packets, which is a
change from before, where that was the total number of packets across all
threads.
This code depends on __thread to store one packet pool per thread. I need to find a solution for BSD, but wanted to get feedback on this approach first. One thought for the case of not having __thread is to simply have one central buffer stack using a lock. I looked into storing the PktPool in ThreadVars, but that was not available every place a packet is allocated.
Passes regression script:
https://buildbot.suricata-ids.org/builders/ken-tilera/builds/133
https://buildbot.suricata-ids.org/builders/ken-tilera-pcap/builds/67
@poona you might want to see if this is going to break Cuda.