Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance optimization suggestions #221

Open
powertang opened this issue Feb 21, 2020 · 1 comment
Open

Performance optimization suggestions #221

powertang opened this issue Feb 21, 2020 · 1 comment

Comments

@powertang
Copy link

  I am using hyperscan 5.2.1 version on the Intel E5-2678v3 CPU(Haswell architecture). When I analyzed the processing performance of hyperscan, I found some interesting places. In block matching mode, the hwlm/hwlm.c:noodExec () function is frequently accessed. noodExec () has a function call chain: noodExec () => scan () => scanDouble () => scanDoubleCase () => scanDoubleMain () => scanDoubleFast (). scan (), scanDouble () and other functions are forcibly inlined by noodExec. Tracking and analyzing noodExec will see the following assembly code snippet:

0x45d0ce <noodExec+1326> xchg %ax,%ax
0x45d0d0 <noodExec+1328> add $0x20,%r13
0x45d0d4 <noodExec+1332> cmp %r13,%rbx
0x45d0d7 <noodExec+1335> jbe 0x45e220 <noodExec+5760>
0x45d0dd <noodExec+1341> vpcmpeqb 0x0(%r13),%ymm1,%ymm0
0x45d0e3 <noodExec+1347> movzbl %r15b,%r15d #Performance bottleneck?lastz0 u8 to u32
0x45d0e7 <noodExec+1351> prefetcht0 0x80(%r13) #__builtin_prefetch(d + 128);
0x45d0ef <noodExec+1359> vpmovmskb %ymm0,%esi #%esi => z0
0x45d0f3 <noodExec+1363> vpcmpeqb 0x0(%r13),%ymm2,%ymm0
0x45d0f9 <noodExec+1369> lea (%rsi,%rsi,1),%edx #%edx => z1
0x45d0fc <noodExec+1372> shr $0x1f,%esi #z0 >>= 31
0x45d0ff <noodExec+1375> or %r15d,%edx # lastz0 | (z0 << 1) edx=>(z0 << 1)
0x45d102 <noodExec+1378> mov %esi,%r15d #lastz0 = z0 >> 31

Notice the instruction on line 0x45d0e3: "movzbl% r15b,% r15d"! It actually converts the data type of the value in r15 from u8 to u32. Combined with context analysis, this code is the inline result of hwlm / noodle_engine_avx2.c: scanDoubleFast () function. Refer to the code of scanDoubleFast ():

static really_inline
hwlm_error_t scanDoubleFast(const struct noodTable *n, const u8 *buf,
size_t len, bool noCase, m256 caseMask, m256 mask1,
m256 mask2, const struct cb_info *cbi, size_t start,
size_t end) {
const u8 *d = buf + start, *e = buf + end;
DEBUG_PRINTF("start %zu end %zu \n", start, end);
assert(d < e);
u8 lastz0 = 0; //lastz0 is U8

for (; d < e; d += 32) {
    m256 v = noCase ? and256(load256(d), caseMask) : load256(d);

    // we have to pull the masks out of the AVX registers because we can't
    // byte shift between the lanes
    u32 z0 = movemask256(eq256(mask1, v));
    u32 z1 = movemask256(eq256(mask2, v));
    u32 z = (lastz0 | (z0 << 1)) & z1; //lastz0 is U32
    lastz0 = z0 >> 31;

    // On large packet buffers, this prefetch appears to get us about 2%.
    __builtin_prefetch(d + 128);

    DOUBLE_ZSCAN();

}
return HWLM_SUCCESS;

}

The variable corresponding to r15 is lastz0; the function actually converts the lastz0 variable to u32 for operation! Changing the type of lastz0 from u8 to u32 can reduce one assembly instruction, which is beneficial to performance improvement!

Similarly, analyze the data structure "struct noodTable", the data type of the member variable "nocase" is u8, but the called function scan () requires "bool" type. Therefore, there are u8 to u32 type conversion instructions. Adjusting the structure definition can also eliminate redundant instructions.
#Original definition
struct noodTable {
u32 id;
u64a msk;
u64a cmp;
u8 msk_len;
u8 key_offset;
u8 nocase; //scan(n->nocase to bool)
u8 single;
u8 key0;
u8 key1;
};
#Modified definition
struct noodTable {
u64a msk;
u64a cmp;

u32 id;
u8  msk_len;
u8  key_offset;
u8  key0;
u8  key1;

u32 nocase;
u8  single;
u8  padded[3];

};

After modifying the data type definition, I use the hsbench tool to test the sample database and rule set provided on the official website. The performance of the single core can be improved by nearly one percentage point, with more processing ~20Mbps per second.

is this right?

@xiangwang1
Copy link
Contributor

Thanks for your detailed explanations! This is a good catch.

For the lastz0 case, I think this makes sense as it's at the critical path. So saving of a single instruction in each iteration could be noticeable if you are spending most of CPU cycles running scanDoubleFast function. nocase conversion isn't at the critical path, so I think it'll be very hard to make any differences.

Regarding hsbench test case, if you are using sample rules from our website, then you won't be able to trigger these functions. That being said, the performance numbers won't reflect the optimizations you've applied.

Nor7th pushed a commit that referenced this issue May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants