Rewrite cbor decoder to avoid use of fnptrs. #49

Merged
merged 1 commit into from Aug 3, 2019

@warpfork commented Aug 3, 2019

This results in substantial performance improvements.

Previously, function pointers were used in the state machines. (This is an idea I originally picked up from a talk on "Lexical Scanning in Go" -- https://talks.golang.org/2011/lex.slide#19 -- but it seems to be a fairly fragile concept, where minor variation from the pattern can result in disastrous obstructions to optimizations!)

If these function pointers were only being treated as numbers, the performance of this usage would be perfectly fine. However, they are not: disassembling the compiler output for this design reveals that the compiler generates a type to hold closure information, allocates it, and then proceeds with that pointer. All this happens in a line of source which seems to be a simple "="!
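To make the hidden allocation concrete, here is a minimal sketch of the pattern, not the actual refmt code. The names `fnDecoder`, `stepKey`, `stepValue`, and `heapDecoder` are hypothetical; `testing.AllocsPerRun` is the standard-library helper for counting heap allocations. Assigning a method value such as `d.stepValue` to a field binds the receiver into a small closure object, and because that object is stored in a heap-resident struct, it has to be heap-allocated on every transition.

```go
package main

import (
	"fmt"
	"testing"
)

// fnDecoder is a hypothetical stand-in for a decoder whose state machine
// stores its next step as a bound method value.
type fnDecoder struct {
	step  func() // next step to run
	count int
}

func (d *fnDecoder) stepKey() {
	d.count++
	// Looks like a plain assignment, but d.stepValue is a method value:
	// the receiver d is bound into a freshly created closure object.
	d.step = d.stepValue
}

func (d *fnDecoder) stepValue() {
	d.count++
	d.step = d.stepKey
}

// heapDecoder is package-level so that anything stored in it escapes to the heap.
var heapDecoder = &fnDecoder{}

// allocsPerStep reports the average number of heap allocations per transition.
func allocsPerStep() float64 {
	heapDecoder.step = heapDecoder.stepKey
	return testing.AllocsPerRun(1000, func() {
		heapDecoder.step()
	})
}

func main() {
	fmt.Printf("allocations per transition: %v\n", allocsPerStep())
}
```

The exact figure can depend on the compiler version and its escape analysis, so treat the printed number as something to measure rather than assume.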

So. Rewrite the decoder to use simple consts for statemachine state.
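As a rough illustration of the const-based approach (a sketch under assumed names, not the code in this PR): the state becomes a one-byte enum and a `switch` picks the step to run. `decoderStep`, `constDecoder`, `advance`, and the two state constants are hypothetical. A one-byte state is at least consistent with the `MOVB $5, 48(AX)` store visible in the new assembly quoted in this description.

```go
package main

import (
	"fmt"
	"testing"
)

// decoderStep is a hypothetical one-byte enum naming each state.
type decoderStep uint8

const (
	stepAcceptMapKey decoderStep = iota
	stepAcceptMapValue
)

// constDecoder keeps its state as a plain constant instead of a func value.
type constDecoder struct {
	step  decoderStep
	count int
}

// advance dispatches on the current state and runs exactly one step.
func (d *constDecoder) advance() {
	switch d.step {
	case stepAcceptMapKey:
		d.count++
		d.step = stepAcceptMapValue // a plain one-byte store, nothing to allocate
	case stepAcceptMapValue:
		d.count++
		d.step = stepAcceptMapKey
	}
}

// heapDecoder is package-level to mirror a long-lived, heap-resident decoder.
var heapDecoder = &constDecoder{}

// allocsPerStep reports the average number of heap allocations per transition.
func allocsPerStep() float64 {
	return testing.AllocsPerRun(1000, func() {
		heapDecoder.advance()
	})
}

func main() {
	fmt.Printf("allocations per transition: %v\n", allocsPerStep())
}
```

The trade-off is that every state must be listed in the `switch`, whereas the function-pointer design lets each step name its successor directly.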

With this change, step_acceptMapValue went from 93 assembler instructions to 63, a closure disappeared, a generated type disappeared, the implicit runtime.newobject disappeared, and both calls to runtime.gcWriteBarrier disappeared. All of these costs are associated with the generation of that closure, and the switch to using consts for the state machine state eliminates all of them at once.

The biggest impact is the removal of the unnecessary allocation. This substantially reduces the GC pressure generated, and thereby increases overall performance significantly.

Benchcmp has the following to say:

                                     old ns/op    new ns/op    delta
StructAlpha_UnmarshalFromCborRefmt   6203         4792         -22.75%
MapAlpha_UnmarshalFromCborRefmt      12656        11130        -12.06%

                                     old allocs   new allocs   delta
StructAlpha_UnmarshalFromCborRefmt   54           10           -81.48%
MapAlpha_UnmarshalFromCborRefmt      157          113          -28.03%

                                     old bytes    new bytes    delta
StructAlpha_UnmarshalFromCborRefmt   1044         340          -67.43%
MapAlpha_UnmarshalFromCborRefmt      4656         3952         -15.12%
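For anyone checking the arithmetic, the delta column is just `(new - old) / old`. The sketch below recomputes it from the numbers in the table; the helper names `delta` and `bytesPerAlloc` are made up for illustration. It also divides the bytes saved by the allocations saved, which comes out to 16 bytes per removed allocation in both benchmarks. That matches what one would expect for the two-word `struct { F uintptr; R *Decoder }` seen in the old assembly on amd64, though that link is an inference rather than something the benchmark output states.

```go
package main

import "fmt"

// row holds one line of the benchcmp table: a label plus old and new values.
type row struct {
	name     string
	old, cur float64
}

// delta returns the relative change from old to cur, as a percentage.
func delta(old, cur float64) float64 {
	return (cur - old) / old * 100
}

// bytesPerAlloc divides the bytes saved by the number of allocations saved.
func bytesPerAlloc(oldBytes, newBytes, oldAllocs, newAllocs float64) float64 {
	return (oldBytes - newBytes) / (oldAllocs - newAllocs)
}

func main() {
	// Values copied from the benchcmp output above.
	rows := []row{
		{"StructAlpha ns/op", 6203, 4792},
		{"MapAlpha ns/op", 12656, 11130},
		{"StructAlpha allocs", 54, 10},
		{"MapAlpha allocs", 157, 113},
		{"StructAlpha bytes", 1044, 340},
		{"MapAlpha bytes", 4656, 3952},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %6.0f -> %6.0f  %+.2f%%\n", r.name, r.old, r.cur, delta(r.old, r.cur))
	}
	fmt.Printf("StructAlpha bytes saved per alloc: %.0f\n", bytesPerAlloc(1044, 340, 54, 10))
	fmt.Printf("MapAlpha bytes saved per alloc:    %.0f\n", bytesPerAlloc(4656, 3952, 157, 113))
}
```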

The exact numbers will vary with the shape of the data in the workload, but
suffice it to say: significantly faster (>20% should be common), and
drastically less memory pressure (over 80% fewer allocations? You saw it here).

The number of allocations for unmarshalling cbor into a struct with
refmt is now half as many as the number of allocations for doing
the same unmarshal using stdlib json. Nice.
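A quick way to get a comparable allocation count for the stdlib side is `testing.AllocsPerRun` around `json.Unmarshal`. The sketch below is only an illustration of the measurement: the `sample` struct and the input document are invented here and are not the StructAlpha fixture from the benchmarks, so the number it prints is not the one being compared above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
)

// sample is a hypothetical target type, not the benchmark's StructAlpha.
type sample struct {
	Name  string `json:"name"`
	Count int    `json:"count"`
}

// input is an invented document that matches the sample struct.
var input = []byte(`{"name":"alpha","count":3}`)

// jsonAllocs reports the average heap allocations of one json.Unmarshal call.
func jsonAllocs() float64 {
	return testing.AllocsPerRun(100, func() {
		var v sample
		if err := json.Unmarshal(input, &v); err != nil {
			panic(err)
		}
	})
}

func main() {
	fmt.Printf("encoding/json allocations per unmarshal: %v\n", jsonAllocs())
}
```

Running the same harness around the refmt unmarshal call for the same data would give the other half of the comparison.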

For reference, this is the old assembly of step_acceptMapValue (93 instructions):

  0x0000 00000 (cborDecoder.go:169) TEXT	"".(*Decoder).step_acceptMapValue(SB), ABIInternal, $64-40
  0x0000 00000 (cborDecoder.go:169) MOVQ	(TLS), CX
  0x0009 00009 (cborDecoder.go:169) CMPQ	SP, 16(CX)
  0x000d 00013 (cborDecoder.go:169) JLS	266
  0x0013 00019 (cborDecoder.go:169) SUBQ	$64, SP
  0x0017 00023 (cborDecoder.go:169) MOVQ	BP, 56(SP)
  0x001c 00028 (cborDecoder.go:169) LEAQ	56(SP), BP
  0x0021 00033 (cborDecoder.go:169) FUNCDATA	$0, gclocals·5af671a95c0d19577a0fa6fa8a10967f(SB)
  0x0021 00033 (cborDecoder.go:169) FUNCDATA	$1, gclocals·7d2d5fca80364273fb07d5820a76fef4(SB)
  0x0021 00033 (cborDecoder.go:169) FUNCDATA	$3, gclocals·b3cd19c3ced5a6f764ea50d3b770f05d(SB)
  0x0021 00033 (cborDecoder.go:171) PCDATA	$2, $1
  0x0021 00033 (cborDecoder.go:171) PCDATA	$0, $0
  0x0021 00033 (cborDecoder.go:171) MOVQ	"".d+72(SP), AX
  0x0026 00038 (cborDecoder.go:171) MOVQ	8(AX), CX
  0x002a 00042 (cborDecoder.go:171) PCDATA	$2, $2
  0x002a 00042 (cborDecoder.go:171) MOVQ	16(AX), DX
  0x002e 00046 (cborDecoder.go:171) MOVQ	48(CX), CX
  0x0032 00050 (cborDecoder.go:171) PCDATA	$2, $0
  0x0032 00050 (cborDecoder.go:171) MOVQ	DX, (SP)
  0x0036 00054 (cborDecoder.go:171) CALL	CX
  0x0038 00056 (cborDecoder.go:171) MOVBLZX	8(SP), AX
  0x003d 00061 (cborDecoder.go:171) PCDATA	$2, $3
  0x003d 00061 (cborDecoder.go:171) MOVQ	24(SP), CX
  0x0042 00066 (cborDecoder.go:171) MOVQ	16(SP), DX
  0x0047 00071 (cborDecoder.go:172) TESTQ	DX, DX
  0x004a 00074 (cborDecoder.go:172) JNE	241
  0x0050 00080 (cborDecoder.go:171) PCDATA	$2, $0
  0x0050 00080 (cborDecoder.go:171) MOVB	AL, "".majorByte+55(SP)
  0x0054 00084 (cborDecoder.go:175) PCDATA	$2, $1
  0x0054 00084 (cborDecoder.go:175) LEAQ	type.noalg.struct { F uintptr; R *"".Decoder }(SB), AX
  0x005b 00091 (cborDecoder.go:175) PCDATA	$2, $0
  0x005b 00091 (cborDecoder.go:175) MOVQ	AX, (SP)
  0x005f 00095 (cborDecoder.go:175) CALL	runtime.newobject(SB)
  0x0064 00100 (cborDecoder.go:175) PCDATA	$2, $1
  0x0064 00100 (cborDecoder.go:175) MOVQ	8(SP), AX
  0x0069 00105 (cborDecoder.go:175) LEAQ	"".(*Decoder).step_acceptMapKey-fm(SB), CX
  0x0070 00112 (cborDecoder.go:175) MOVQ	CX, (AX)
  0x0073 00115 (cborDecoder.go:175) PCDATA	$2, $-2
  0x0073 00115 (cborDecoder.go:175) PCDATA	$0, $-2
  0x0073 00115 (cborDecoder.go:175) CMPL	runtime.writeBarrier(SB), $0
  0x007a 00122 (cborDecoder.go:175) JNE	204
  0x007c 00124 (cborDecoder.go:175) MOVQ	"".d+72(SP), CX
  0x0081 00129 (cborDecoder.go:175) MOVQ	CX, 8(AX)
  0x0085 00133 (cborDecoder.go:175) MOVQ	AX, 48(CX)
  0x0089 00137 (cborDecoder.go:176) PCDATA	$2, $4
  0x0089 00137 (cborDecoder.go:176) PCDATA	$0, $1
  0x0089 00137 (cborDecoder.go:176) MOVQ	"".tokenSlot+80(SP), AX
  0x008e 00142 (cborDecoder.go:176) MOVB	$0, 88(AX)
  0x0092 00146 (cborDecoder.go:177) PCDATA	$2, $1
  0x0092 00146 (cborDecoder.go:177) MOVQ	CX, (SP)
  0x0096 00150 (cborDecoder.go:177) MOVBLZX	"".majorByte+55(SP), CX
  0x009b 00155 (cborDecoder.go:177) MOVB	CL, 8(SP)
  0x009f 00159 (cborDecoder.go:177) PCDATA	$2, $0
  0x009f 00159 (cborDecoder.go:177) MOVQ	AX, 16(SP)
  0x00a4 00164 (cborDecoder.go:177) CALL	"".(*Decoder).stepHelper_acceptValue(SB)
  0x00a9 00169 (cborDecoder.go:177) MOVQ	32(SP), AX
  0x00ae 00174 (cborDecoder.go:177) PCDATA	$2, $3
  0x00ae 00174 (cborDecoder.go:177) MOVQ	40(SP), CX
  0x00b3 00179 (cborDecoder.go:178) MOVB	$0, "".done+88(SP)
  0x00b8 00184 (cborDecoder.go:178) PCDATA	$0, $2
  0x00b8 00184 (cborDecoder.go:178) MOVQ	AX, "".err+96(SP)
  0x00bd 00189 (cborDecoder.go:178) PCDATA	$2, $0
  0x00bd 00189 (cborDecoder.go:178) MOVQ	CX, "".err+104(SP)
  0x00c2 00194 (cborDecoder.go:178) MOVQ	56(SP), BP
  0x00c7 00199 (cborDecoder.go:178) ADDQ	$64, SP
  0x00cb 00203 (cborDecoder.go:178) RET
  0x00cc 00204 (cborDecoder.go:175) PCDATA	$2, $-2
  0x00cc 00204 (cborDecoder.go:175) PCDATA	$0, $-2
  0x00cc 00204 (cborDecoder.go:175) LEAQ	8(AX), DI
  0x00d0 00208 (cborDecoder.go:175) MOVQ	AX, CX
  0x00d3 00211 (cborDecoder.go:175) MOVQ	"".d+72(SP), AX
  0x00d8 00216 (cborDecoder.go:175) CALL	runtime.gcWriteBarrier(SB)
  0x00dd 00221 (cborDecoder.go:175) LEAQ	48(AX), DI
  0x00e1 00225 (cborDecoder.go:169) MOVQ	AX, DX
  0x00e4 00228 (cborDecoder.go:175) MOVQ	CX, AX
  0x00e7 00231 (cborDecoder.go:175) CALL	runtime.gcWriteBarrier(SB)
  0x00ec 00236 (cborDecoder.go:177) MOVQ	DX, CX
  0x00ef 00239 (cborDecoder.go:175) JMP	137
  0x00f1 00241 (cborDecoder.go:173) PCDATA	$2, $3
  0x00f1 00241 (cborDecoder.go:173) PCDATA	$0, $1
  0x00f1 00241 (cborDecoder.go:173) MOVB	$1, "".done+88(SP)
  0x00f6 00246 (cborDecoder.go:173) PCDATA	$0, $2
  0x00f6 00246 (cborDecoder.go:173) MOVQ	DX, "".err+96(SP)
  0x00fb 00251 (cborDecoder.go:173) PCDATA	$2, $0
  0x00fb 00251 (cborDecoder.go:173) MOVQ	CX, "".err+104(SP)
  0x0100 00256 (cborDecoder.go:173) MOVQ	56(SP), BP
  0x0105 00261 (cborDecoder.go:173) ADDQ	$64, SP
  0x0109 00265 (cborDecoder.go:173) RET
  0x010a 00266 (cborDecoder.go:173) NOP
  0x010a 00266 (cborDecoder.go:169) PCDATA	$0, $-1
  0x010a 00266 (cborDecoder.go:169) PCDATA	$2, $-1
  0x010a 00266 (cborDecoder.go:169) CALL	runtime.morestack_noctxt(SB)
  0x010f 00271 (cborDecoder.go:169) JMP	0

And contrast it with the new assembly for step_acceptMapValue (only 63 instructions):

  0x0000 00000 (cborDecoder.go:196) TEXT	"".(*Decoder).step_acceptMapValue(SB), ABIInternal, $56-40
  0x0000 00000 (cborDecoder.go:196) MOVQ	(TLS), CX
  0x0009 00009 (cborDecoder.go:196) CMPQ	SP, 16(CX)
  0x000d 00013 (cborDecoder.go:196) JLS	163
  0x0013 00019 (cborDecoder.go:196) SUBQ	$56, SP
  0x0017 00023 (cborDecoder.go:196) MOVQ	BP, 48(SP)
  0x001c 00028 (cborDecoder.go:196) LEAQ	48(SP), BP
  0x0021 00033 (cborDecoder.go:196) FUNCDATA	$0, gclocals·56d33af5d84ec1114330c1119ad93f68(SB)
  0x0021 00033 (cborDecoder.go:196) FUNCDATA	$1, gclocals·f6bd6b3389b872033d462029172c8612(SB)
  0x0021 00033 (cborDecoder.go:196) FUNCDATA	$3, gclocals·fca29e89d033ef11d64e11a599ce9bf0(SB)
  0x0021 00033 (cborDecoder.go:198) PCDATA	$2, $1
  0x0021 00033 (cborDecoder.go:198) PCDATA	$0, $0
  0x0021 00033 (cborDecoder.go:198) MOVQ	"".d+64(SP), AX
  0x0026 00038 (cborDecoder.go:198) MOVQ	8(AX), CX
  0x002a 00042 (cborDecoder.go:198) PCDATA	$2, $2
  0x002a 00042 (cborDecoder.go:198) MOVQ	16(AX), DX
  0x002e 00046 (cborDecoder.go:198) MOVQ	48(CX), CX
  0x0032 00050 (cborDecoder.go:198) PCDATA	$2, $0
  0x0032 00050 (cborDecoder.go:198) MOVQ	DX, (SP)
  0x0036 00054 (cborDecoder.go:198) CALL	CX
  0x0038 00056 (cborDecoder.go:198) PCDATA	$2, $1
  0x0038 00056 (cborDecoder.go:198) MOVQ	24(SP), AX
  0x003d 00061 (cborDecoder.go:198) MOVQ	16(SP), CX
  0x0042 00066 (cborDecoder.go:199) TESTQ	CX, CX
  0x0045 00069 (cborDecoder.go:199) JEQ	96
  0x0047 00071 (cborDecoder.go:200) PCDATA	$0, $1
  0x0047 00071 (cborDecoder.go:200) MOVB	$1, "".done+80(SP)
  0x004c 00076 (cborDecoder.go:200) PCDATA	$0, $2
  0x004c 00076 (cborDecoder.go:200) MOVQ	CX, "".err+88(SP)
  0x0051 00081 (cborDecoder.go:200) PCDATA	$2, $0
  0x0051 00081 (cborDecoder.go:200) MOVQ	AX, "".err+96(SP)
  0x0056 00086 (cborDecoder.go:200) MOVQ	48(SP), BP
  0x005b 00091 (cborDecoder.go:200) ADDQ	$56, SP
  0x005f 00095 (cborDecoder.go:200) RET
  0x0060 00096 (cborDecoder.go:202) PCDATA	$2, $1
  0x0060 00096 (cborDecoder.go:202) PCDATA	$0, $3
  0x0060 00096 (cborDecoder.go:202) MOVQ	"".d+64(SP), AX
  0x0065 00101 (cborDecoder.go:202) MOVB	$5, 48(AX)
  0x0069 00105 (cborDecoder.go:203) PCDATA	$2, $3
  0x0069 00105 (cborDecoder.go:203) PCDATA	$0, $1
  0x0069 00105 (cborDecoder.go:203) MOVQ	"".tokenSlot+72(SP), CX
  0x006e 00110 (cborDecoder.go:203) MOVB	$0, 88(CX)
  0x0072 00114 (cborDecoder.go:204) PCDATA	$2, $4
  0x0072 00114 (cborDecoder.go:204) MOVQ	AX, (SP)
  0x0076 00118 (cborDecoder.go:204) PCDATA	$2, $0
  0x0076 00118 (cborDecoder.go:204) MOVQ	CX, 16(SP)
  0x007b 00123 (cborDecoder.go:204) CALL	"".(*Decoder).stepHelper_acceptValue(SB)
  0x0080 00128 (cborDecoder.go:204) PCDATA	$2, $1
  0x0080 00128 (cborDecoder.go:204) MOVQ	40(SP), AX
  0x0085 00133 (cborDecoder.go:204) MOVQ	32(SP), CX
  0x008a 00138 (cborDecoder.go:205) MOVB	$0, "".done+80(SP)
  0x008f 00143 (cborDecoder.go:205) PCDATA	$0, $2
  0x008f 00143 (cborDecoder.go:205) MOVQ	CX, "".err+88(SP)
  0x0094 00148 (cborDecoder.go:205) PCDATA	$2, $0
  0x0094 00148 (cborDecoder.go:205) MOVQ	AX, "".err+96(SP)
  0x0099 00153 (cborDecoder.go:205) MOVQ	48(SP), BP
  0x009e 00158 (cborDecoder.go:205) ADDQ	$56, SP
  0x00a2 00162 (cborDecoder.go:205) RET
  0x00a3 00163 (cborDecoder.go:205) NOP
  0x00a3 00163 (cborDecoder.go:196) PCDATA	$0, $-1
  0x00a3 00163 (cborDecoder.go:196) PCDATA	$2, $-1
  0x00a3 00163 (cborDecoder.go:196) CALL	runtime.morestack_noctxt(SB)
  0x00a8 00168 (cborDecoder.go:196) JMP	0

(Assembly is from go version go1.12.5 linux/amd64.)

Huge thanks to @gmasgras and the folks working on IPFS infra! They provided some pprof files that were a perfect kick in the shorts for starting to look into these issues, and were fantastic data to aim with.

In the future, similar optimizations are probably possible in several other parts of refmt: the obj package also uses function pointers in several places where we might now regard that as unwise. These are not all as simple to update, though, so that work may happen in future PRs.
