
Replace ringbuffer in Packet Pool with a stack for better cache locality #913

Closed
wants to merge 1 commit

Conversation

ken-tilera
Contributor

Implements optimization 1039 (https://redmine.openinfosecfoundation.org/issues/1039#change-4093)
Using a stack for free Packet storage causes recently freed Packets to be
reused quickly, while their data is more likely to still be in cache.

The new structure has a per-thread private stack for allocating Packets
which does not need any locking. Since Packets can be freed by any thread,
there is a second stack (return stack) for freeing packets by other threads.
The return stack is protected by a mutex. Packets are moved from the return
stack to the private stack when the private stack is empty.

Returning packets back to their "home" stack keeps the stacks from getting out
of balance.

The PacketPoolInit() function is now called by each thread that will be
allocating packets. Each thread allocates max_pending_packets, which is a
change from before, where that was the total number of packets across all
threads.
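
Roughly, the structure described above looks like this (a minimal sketch for illustration; aside from PacketPoolInit() and p->pool, the type and helper names are illustrative rather than the patch's exact identifiers):

```c
/* Minimal sketch of the per-thread two-stack pool described above.
 * Type and helper names are illustrative, not the patch's exact identifiers. */
#include <pthread.h>
#include <stdlib.h>

typedef struct Packet_ {
    struct Packet_ *next;      /* intrusive stack link */
    struct PktPool_ *pool;     /* "home" pool this packet is returned to */
    /* ... packet data ... */
} Packet;

typedef struct PktPool_ {
    Packet *head;              /* private stack: only the owning thread touches it */
    pthread_mutex_t return_mutex;
    Packet *return_head;       /* return stack: used when other threads free packets */
} PktPool;

static __thread PktPool thread_pool;   /* one pool per thread, via __thread */

/* Called by each thread that will allocate packets. */
static void PacketPoolInit(void)
{
    pthread_mutex_init(&thread_pool.return_mutex, NULL);
    /* pre-allocating max_pending_packets onto thread_pool.head omitted */
}

/* Owning thread: pop the private stack; if empty, take the whole return stack
 * in one locked operation; if still empty, fall back to a direct alloc. */
static Packet *PacketPoolGet(void)
{
    PktPool *pool = &thread_pool;

    if (pool->head == NULL) {
        pthread_mutex_lock(&pool->return_mutex);
        pool->head = pool->return_head;
        pool->return_head = NULL;
        pthread_mutex_unlock(&pool->return_mutex);
    }

    Packet *p = pool->head;
    if (p != NULL) {
        pool->head = p->next;
        p->pool = pool;
        return p;
    }
    /* Pool exhausted: direct alloc (real code would mark these to be free()d). */
    return calloc(1, sizeof(Packet));
}

/* Any thread: return the packet to its home pool. */
static void PacketPoolReturn(Packet *p)
{
    PktPool *pool = p->pool;

    if (pool == &thread_pool) {
        /* Freed by the owning thread: private stack, no locking. */
        p->next = pool->head;
        pool->head = p;
        return;
    }
    /* Freed by another thread: locked return stack. */
    pthread_mutex_lock(&pool->return_mutex);
    p->next = pool->return_head;
    pool->return_head = p;
    pthread_mutex_unlock(&pool->return_mutex);
}
```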

This code depends on __thread to store one packet pool per thread. I need to find a solution for BSD, but wanted to get feedback on this approach first. One thought for the case of not having __thread is to simply have one central buffer stack using a lock. I looked into storing the PktPool in ThreadVars, but that was not available every place a packet is allocated.

Passes regression script:
https://buildbot.suricata-ids.org/builders/ken-tilera/builds/133
https://buildbot.suricata-ids.org/builders/ken-tilera-pcap/builds/67

@poona you might want to see if this is going to break Cuda.

@inliniac
Contributor

inliniac commented Apr 1, 2014

Right, so there is a return stack per thread, and each packet 'knows', via p->pool, which pool it must be returned to. Normal ops require no locks.

One thing that is possible is that T0 has packets in its stack, but T1 doesn't; then the stack lookup still fails and we fall back to direct alloc, correct?

@inliniac
Contributor

inliniac commented Apr 1, 2014

One thing to consider is that we currently use PacketGetFromAlloc for pseudo packets. I have observed some conditions where this leads to massive amounts of allocs per second. Adding a stack for this case won't work, as it would introduce a contention point. The local stack would run dry quickly, leading to locks at lookups. As the pseudo packets are returned by the other threads by definition, the return path has lock contention too. I have experimented with this and observed it. I wonder if we should shoot for a comprehensive solution that includes the pseudo packet issues.

@ken-tilera
Contributor Author

Correct, but since each pool has a fixed set of packets, and those packets are always returned to that pool, it won't be the case that all the packets end up on one stack while the others have none. No stack will ever have more than its original number of packets. It could be the case that all the packets from one pool are in use (TCP reassembly and such); then that thread would resort to malloc.


@inliniac
Contributor

inliniac commented Apr 1, 2014

Actually, I wonder if this will work with the autofp runmode. In the common case (reading a pcap, or having one pcap reading thread), the thread getting a packet will be different from the thread returning it, by definition. So all stream/detect threads will fight over the return stack lock of the receive/decode thread.

@ken-tilera
Contributor Author

If the return stack lock is a point of contention, it would be fairly easy to have N return stacks per pool, each with its own lock. The threads returning packets could pick a random stack to use for returning, or statically hash them across the return stacks. Or, the returning thread could use trylock on one return stack and, if that fails, try another one. When the local stack is empty, on allocation, it can check each return stack until it finds some packets.
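
As a rough sketch (building on the Packet type from the earlier sketch; the names and the stack count are hypothetical, not from the patch), the N-return-stack idea could look like:

```c
/* Hypothetical sketch: N locked return stacks per pool instead of one.
 * Reuses the Packet type (and <pthread.h>) from the earlier sketch. */
#define RETURN_STACKS 4

typedef struct PktReturnStack_ {
    pthread_mutex_t mutex;
    Packet *head;
} PktReturnStack;

typedef struct PktPoolN_ {
    Packet *head;                           /* private stack, lock-free for the owner */
    PktReturnStack rstack[RETURN_STACKS];   /* striped return stacks */
} PktPoolN;

/* Returning thread: trylock a stack, move on to the next on contention,
 * and only block on a lock as a last resort. 'seed' is a per-thread value
 * (random or hashed) used to spread threads across the stacks. */
static void PacketReturnStriped(PktPoolN *pool, Packet *p, unsigned seed)
{
    for (int i = 0; i < RETURN_STACKS; i++) {
        PktReturnStack *rs = &pool->rstack[(seed + i) % RETURN_STACKS];
        if (pthread_mutex_trylock(&rs->mutex) == 0) {
            p->next = rs->head;
            rs->head = p;
            pthread_mutex_unlock(&rs->mutex);
            return;
        }
    }
    /* All trylocks failed: fall back to a blocking lock on one stack. */
    PktReturnStack *rs = &pool->rstack[seed % RETURN_STACKS];
    pthread_mutex_lock(&rs->mutex);
    p->next = rs->head;
    rs->head = p;
    pthread_mutex_unlock(&rs->mutex);
}

/* Owning thread, on an empty private stack: sweep the return stacks
 * until some packets are found. */
static void PacketPoolRefill(PktPoolN *pool)
{
    for (int i = 0; i < RETURN_STACKS && pool->head == NULL; i++) {
        PktReturnStack *rs = &pool->rstack[i];
        pthread_mutex_lock(&rs->mutex);
        pool->head = rs->head;
        rs->head = NULL;
        pthread_mutex_unlock(&rs->mutex);
    }
}
```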

Another idea would be to allocate multiple packets at once when the stack is completely empty and add them to the local stack, then only free allocated packets when the local stack goes above its original size. This would require keeping a count of the number of packets on the local packet stack, but that is trivial.
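
A sketch of that counted local stack (the batch size and target size are made-up numbers, not from the patch):

```c
/* Hypothetical sketch: a counted local stack that allocates a batch when
 * empty and frees packets once it grows beyond its original size. */
#define POOL_TARGET_SIZE 1024   /* illustrative "original size" of the stack */
#define ALLOC_BATCH 16          /* illustrative batch size */

typedef struct CountedStack_ {
    Packet *head;
    unsigned cnt;               /* number of packets currently on the stack */
} CountedStack;

static Packet *CountedStackGet(CountedStack *s)
{
    if (s->head == NULL) {
        /* Completely empty: allocate several packets in one go. */
        for (int i = 0; i < ALLOC_BATCH; i++) {
            Packet *p = calloc(1, sizeof(Packet));
            if (p == NULL)
                break;
            p->next = s->head;
            s->head = p;
            s->cnt++;
        }
    }
    Packet *p = s->head;
    if (p != NULL) {
        s->head = p->next;
        s->cnt--;
    }
    return p;
}

static void CountedStackPut(CountedStack *s, Packet *p)
{
    if (s->cnt >= POOL_TARGET_SIZE) {
        free(p);                /* above the original size: hand it back */
        return;
    }
    p->next = s->head;
    s->head = p;
    s->cnt++;
}
```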

Currently, AFP and PFRing and other packet sources call PacketGetFromAlloc() for each received packet, which is itself a large number of requests per second.


@ken-tilera
Contributor Author

For the single allocation pool case, we could also have the freeing threads keep a local stack of packets to be freed, and when that reaches some limit (8?) return them all under one lock of the return stack. This is extra complexity, since there would need to be a list of pending returns per pool on each thread. So, I think multiple return stacks in each pool is simpler.
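
A sketch of what that batched return could look like (hypothetical; it reuses the PktPool fields from the first sketch and a limit of 8 as suggested):

```c
/* Hypothetical sketch: the freeing thread collects packets per destination
 * pool and flushes them under a single lock once the limit is reached. */
#define PENDING_LIMIT 8

typedef struct PendingReturn_ {
    PktPool *pool;      /* destination ("home") pool */
    Packet *head;       /* locally collected packets for that pool */
    Packet *tail;       /* oldest entry, used to splice the list in one go */
    int cnt;
} PendingReturn;

static void PendingReturnAdd(PendingReturn *pr, Packet *p)
{
    p->next = pr->head;
    pr->head = p;
    if (pr->tail == NULL)
        pr->tail = p;

    if (++pr->cnt < PENDING_LIMIT)
        return;

    /* Flush: splice the whole pending list onto the return stack in one lock. */
    pthread_mutex_lock(&pr->pool->return_mutex);
    pr->tail->next = pr->pool->return_head;
    pr->pool->return_head = pr->head;
    pthread_mutex_unlock(&pr->pool->return_mutex);

    pr->head = pr->tail = NULL;
    pr->cnt = 0;
}
```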


@inliniac
Contributor

inliniac commented Apr 1, 2014

On 04/01/2014 05:28 PM, Ken Steele wrote:

> If the return stack lock is a point of contention, it would be fairly easy to have N return stacks per pool, each with its own lock. The threads returning packets could pick a random stack to use for returning, or statically hash them across the return stacks. Or, the returning thread could use trylock on one return stack and, if that fails, try another one. When the local stack is empty, on allocation, it can check each return stack until it finds some packets.

This is a bit like what I experimented with in my multiple queue test
(#845). I found that this worked much better in certain contention
conditions caused by flow timeout pseudo packets.

I think it would make sense to take a multi stack approach. Having a
single return lock for pseudo packets or in the autofp case will
certainly cause problems.

> Another idea would be to allocate multiple packets at once when the stack is completely empty and add them to the local stack, then only free allocated packets when the local stack goes above its original size. This would require keeping a count of the number of packets on the local packet stack, but that is trivial.

I like this idea.

> Currently, AFP and PFRing and other packet sources call PacketGetFromAlloc() for each received packet, which is itself a large number of requests per second.

This is incorrect, both use PacketGetFromQueueOrAlloc, which only allocs
if packetpool is empty:

https://github.com/inliniac/suricata/blob/master/src/source-af-packet.c#L771
https://github.com/inliniac/suricata/blob/master/src/source-pfring.c#L326

@ken-tilera
Contributor Author

I will look at adding multiple return stacks. Should the number of return stacks be a compile-time option, or a Suricata.yaml configuration option?

You are correct; I was misreading PacketGetFromAlloc() as PacketGetFromQueueOrAlloc(). Why not have pseudo packets also allocate from PacketGetFromQueueOrAlloc()?


@inliniac
Contributor

inliniac commented Apr 1, 2014

On 04/01/2014 06:03 PM, Ken Steele wrote:

> I will look at adding multiple return stacks. Should the number of return stacks be a compile-time option, or a Suricata.yaml configuration option?

I'd say a yaml config with an 'auto' setting, where it is set to some
multiple of the number of threads?

@ken-tilera
Contributor Author

Thoughts on dealing with systems without __thread? I was thinking of having one central Packet Pool with multiple locked stacks, and having each thread "randomly" decide which stack to use to allocate or free a packet.

@inliniac
Contributor

inliniac commented Apr 2, 2014

We could do several things.

  • Fall back to a single queue, which will perform poorly.
  • Try the POSIX thread storage API, which may be supported (not sure though): pthread_key_create, pthread_getspecific, pthread_setspecific (see the sketch below).
  • Assign a unique id to each thread, used to index a global array of stacks.

Like I mentioned privately yesterday, I'm thinking about the possibility of doing all this through an API, so that specific runmodes and architectures can use their own methods. This would have the added benefit that we could introduce new methods easily.
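
For the POSIX thread-storage option in the list above, the fallback might look roughly like this (a sketch only; it reuses the PktPool type from the earlier sketch and skips mutex destruction in the key destructor):

```c
/* Sketch of a pthread thread-specific-storage fallback for builds
 * without __thread. Reuses the PktPool type from the earlier sketch. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t pool_key;
static pthread_once_t pool_key_once = PTHREAD_ONCE_INIT;

static void PoolKeyCreate(void)
{
    /* Destructor frees the per-thread pool when the thread exits. */
    (void)pthread_key_create(&pool_key, free);
}

/* Returns this thread's pool, creating it on first use. */
static PktPool *GetThreadPktPool(void)
{
    (void)pthread_once(&pool_key_once, PoolKeyCreate);

    PktPool *pool = pthread_getspecific(pool_key);
    if (pool == NULL) {
        pool = calloc(1, sizeof(*pool));
        if (pool != NULL) {
            pthread_mutex_init(&pool->return_mutex, NULL);
            (void)pthread_setspecific(pool_key, pool);
        }
    }
    return pool;
}
```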

@jasonish
Member

I've currently run into the shared packetpool as a bottleneck when using workers mode with more than 10-11 "receive" threads. My current work-around is multiple Suricata instances, but obviously that is not ideal.

A quick win was a thread-local packet pool: drops disappeared and CPU usage was much lower. My first attempt did not use __thread but was a crude implementation of a packet pool per receive thread; it performed the same as __thread and, while more code, has the benefit of not requiring __thread.

Could that be an option? Refactor the packet pool to use a context object and have it initialized in the "receive" thread init? Perhaps attach it to the packet so pseudo packets can retrieve from the same thread's pool?
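
As a rough sketch of that context-object idea (hypothetical names, built on the PktPool/Packet types from the first sketch; not actual Suricata code), the pool is created in the receive thread's init and stamped onto every packet it hands out, so returns find the home pool without __thread:

```c
/* Hypothetical sketch: the receive thread owns a pool in a context object
 * created in its thread-init, and records it on each packet it hands out. */
typedef struct ReceiveThreadCtx_ {
    PktPool pool;               /* two-stack pool from the first sketch */
    /* ... other per-thread capture state ... */
} ReceiveThreadCtx;

/* Called from the capture thread's init function. */
static ReceiveThreadCtx *ReceiveThreadCtxInit(void)
{
    ReceiveThreadCtx *ctx = calloc(1, sizeof(*ctx));
    if (ctx != NULL)
        pthread_mutex_init(&ctx->pool.return_mutex, NULL);
    return ctx;
}

/* Capture path: take a packet from this thread's pool and record its home,
 * so any thread (including one building a pseudo packet from it) can find
 * the right pool on return. Refill from the return stack omitted here. */
static Packet *ReceiveThreadGetPacket(ReceiveThreadCtx *ctx)
{
    Packet *p = ctx->pool.head;
    if (p != NULL)
        ctx->pool.head = p->next;
    else
        p = calloc(1, sizeof(*p));

    if (p != NULL)
        p->pool = &ctx->pool;
    return p;
}
```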

@ken-tilera
Contributor Author

Replaced by: #1005

@ken-tilera ken-tilera closed this Jul 1, 2014