Add early length checks to TryParse implementations #491

psychon · 2020-06-19T13:46:30Z

Most TryParse implementations that we have contain a series of calls to
other try_parse() implementations. Most of these are for primitive
types, e.g. u32. Thus, some rough sketch of the code is this:

let (a, remaining) = u32::try_parse(remaining)?;
let (b, remaining) = u32::try_parse(remaining)?;
let (c, remaining) = u32::try_parse(remaining)?;
let (d, remaining) = u32::try_parse(remaining)?;

This is relatively readable code. We do not have to mess round with byte
offsets or something like that. However, every single call to
u32::try_parse() checks if there are another 32 bit of data available.
This leads to lots of repetitions in the assembly code, because these
length checks are not merged.

This commit changes the code generator to compute the minimum size that
some object can have on the wire. This minimum size is the sum of all
fixed-size data. Basically, this means that lists are assumed to be
empty.

Then, the generated code checks if at least the minimum number of bytes
is available. If not, an early returns is done.

This early length check allows the compiler (in release mode) to
optimise away most later length checks, thus resulting in less assembly
code.

The effect of this commit was measured in two ways.

"cargo rustc --release -- --Copt-level=s --emit asm" allows to get an
assembly file for the library. Before this commit, this assembly file
has 159662 lines. Afterwards, it has 157796 lines. This is a
reduction of about 1.2 %.
"cargo build --release --example xclock_utc" and calling "strip" on
the resulting binary previously produced a file with 503 KiB. After
this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is
saved.

Related-to: #488
Signed-off-by: Uli Schlachter psychon@znc.in

On one hand, the results for this change are a bit underwhelming. On the other hand, this is a relative simple change. Saving 4 KiB just by doing more length checks is a large reduction for little work. Dunno if this PR is worth it, hence keeping it as draft status. Hopefully I can come up with more PRs like this and we can see how much all of this brings.

Most TryParse implementations that we have contain a series of calls to other try_parse() implementations. Most of these are for primitive types, e.g. u32. Thus, some rough sketch of the code is this: let (a, remaining) = u32::try_parse(remaining)?; let (b, remaining) = u32::try_parse(remaining)?; let (c, remaining) = u32::try_parse(remaining)?; let (d, remaining) = u32::try_parse(remaining)?; This is relatively readable code. We do not have to mess round with byte offsets or something like that. However, every single call to u32::try_parse() checks if there are another 32 bit of data available. This leads to lots of repetitions in the assembly code, because these length checks are not merged. This commit changes the code generator to compute the minimum size that some object can have on the wire. This minimum size is the sum of all fixed-size data. Basically, this means that lists are assumed to be empty. Then, the generated code checks if at least the minimum number of bytes is available. If not, an early returns is done. This early length check allows the compiler (in release mode) to optimise away most later length checks, thus resulting in less assembly code. The effect of this commit was measured in two ways. 1. "cargo rustc --release -- --Copt-level=s --emit asm" allows to get an assembly file for the library. Before this commit, this assembly file has 159662 lines. Afterwards, it has 157796 lines. This is a reduction of about 1.2 %. 2. "cargo build --release --example xclock_utc" and calling "strip" on the resulting binary previously produced a file with 503 KiB. After this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is saved. Related-to: #488 Signed-off-by: Uli Schlachter <psychon@znc.in>

khuey · 2020-06-19T16:10:06Z

Frankly it looks like rustc/llvm are doing a terrible job here. Here's one I pulled at random from a build (without your PR) with nightly rustc.

_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17hb503842b081124b2E:
        movq    %rdi, %rax
        shrq    $40, %rsi
        cmpb    $5, %sil
        jne     .LBB4888_13
        cmpq    $2, %rcx
        jb      .LBB4888_16
        je      .LBB4888_16
        cmpq    $3, %rcx
        je      .LBB4888_16
        cmpq    $4, %rcx
        je      .LBB4888_16
        cmpb    $0, 4(%rdx)
        setne   %sil
        cmpq    $5, %rcx
        je      .LBB4888_16
        movb    5(%rdx), %dil
        cmpb    $4, %dil
        jae     .LBB4888_13
        cmpq    $6, %rcx
        je      .LBB4888_16
        cmpq    $7, %rcx
        je      .LBB4888_16
        cmpq    $8, %rcx
        je      .LBB4888_16
        cmpq    $9, %rcx
        je      .LBB4888_16
        andq    $-2, %rcx
        cmpq    $10, %rcx
        jne     .LBB4888_12
.LBB4888_16:
        movb    $0, 1(%rax)
        movb    $1, %cl
        movb    %cl, (%rax)
        retq
.LBB4888_13:
        movb    $1, 1(%rax)
        movb    $1, %cl
        movb    %cl, (%rax)
        retq
.LBB4888_12:
        movzwl  (%rdx), %ecx
        movb    2(%rdx), %r8b
        movb    3(%rdx), %r9b
        movb    6(%rdx), %r10b
        cmpb    $0, 9(%rdx)
        movzwl  10(%rdx), %edx
        movw    %cx, 2(%rax)
        movw    %dx, 4(%rax)
        movb    %r8b, 6(%rax)
        movb    %r9b, 7(%rax)
        movb    %sil, 8(%rax)
        movb    %dil, 9(%rax)
        movb    %r10b, 10(%rax)
        setne   11(%rax)
        xorl    %ecx, %ecx
        movb    %cl, (%rax)
        retq

There's a lot of sadness here. Things that jump out at me:

That a sequential jb <label>; je <label>; isn't condensed into a single jbe <label>; is absolutely wild.
That adjacent cmp $N, %rcx, je <label>; cmp $N+1, %rcx, je <label> aren't condensed into a single comparison is also very bad.

Less bad:

setne %sil could just write directly into 8(%rax) rather than a register.
Similarly this code could skip loading 5(%rdx) into %dil, though that may not actually be a perf win.
In the andq $-2, %rcx; cmpq $10, %rcx; jne <label> the and is totally unnecessary because we've already compared %rcx against anything below 10. This is admittedly harder for an optimizer to recognize. It's also not clear to me where this code even comes from.

So it seems like up fronts bounds checks would remove a lot of this badness, though really rustc/llvm should be doing better.

khuey · 2020-06-19T16:19:52Z

I filed rust-lang/rust#73512

psychon · 2020-06-19T16:25:41Z

Here's one I pulled at random from a build (without your PR)

My PR actually misses the all the Request structs. So, this PR does not make a difference. Whoops.

with nightly rustc.

I get basically the same with 1.42.0.

When adding an explicit if value.len() < 16 { return Err(ParseError::ParseError); } at the beginning of that function, the code improves a bit and doesn't look too bad:

	.section	.text._ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,"ax",@progbits
	.globl	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
	.type	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,@function
_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E:
	.cfi_startproc
	movq	%rdi, %rax
	cmpq	$16, %rcx
	jb	.LBB5284_2
	shrq	$40, %rsi
	cmpb	$5, %sil
	jne	.LBB5284_2
	movb	5(%rdx), %cl
	cmpb	$4, %cl
	jae	.LBB5284_2
	xorl	%r8d, %r8d
	cmpb	%r8b, 4(%rdx)
	movzwl	(%rdx), %edi
	movzwl	2(%rdx), %esi
	movb	6(%rdx), %r9b
	movb	9(%rdx), %r10b
	movzwl	10(%rdx), %edx
	setne	6(%rax)
	cmpb	%r8b, %r10b
	movw	%di, (%rax)
	movw	%dx, 2(%rax)
	movw	%si, 4(%rax)
	movb	%cl, 7(%rax)
	movb	%r9b, 8(%rax)
	setne	9(%rax)
	retq
.LBB5284_2:
	movb	$2, 6(%rax)
	retq
.Lfunc_end5284:
	.size	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E, .Lfunc_end5284-_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
	.cfi_endproc

khuey · 2020-06-21T02:48:12Z

Rather surprisingly LLVM doesn't seem to make any effort to deduplicate basic blocks. This is at the end of opt -O3 on the IR that rustc produces

https://gist.github.com/khuey/eedf7183f427323d1c373c1ead25f01d

bb55, bb63, bb72, and bb80 are all effectively identical but LLVM refuses to condense them and eliminate the extra branches.

psychon mentioned this pull request Jun 19, 2020

Can we get smaller binary code? #488

Open

khuey mentioned this pull request Jun 19, 2020

Terrible code generation with a zillion bounds checks. rust-lang/rust#73512

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add early length checks to TryParse implementations #491

Add early length checks to TryParse implementations #491

psychon commented Jun 19, 2020

khuey commented Jun 19, 2020

khuey commented Jun 19, 2020

psychon commented Jun 19, 2020

khuey commented Jun 21, 2020

Add early length checks to TryParse implementations #491

Are you sure you want to change the base?

Add early length checks to TryParse implementations #491

Conversation

psychon commented Jun 19, 2020

khuey commented Jun 19, 2020

khuey commented Jun 19, 2020

psychon commented Jun 19, 2020

khuey commented Jun 21, 2020