proposal: image/png: Encode should allow users to specify the filter type #51982

Open
fumin opened this issue Mar 28, 2022 · 9 comments

@fumin fumin commented Mar 28, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.6 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\a3367\AppData\Local\go-build
set GOENV=C:\Users\a3367\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\a3367\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\a3367\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.17.6
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\a3367\work\misc\seg\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\a3367\AppData\Local\Temp\go-build4034705114=/tmp/go-build -gno-record-gcc-switches
gdb --version: GNU gdb (GDB) 8.1

What did you do?

Encode a 23500x45500 image.NRGBA image.
This image is around 4GB uncompressed.

What did you expect to see?

I expected the encoding to finish within 5 seconds.

What did you see instead?

It took longer than 20 seconds.
According to pprof, most time is spent on selecting the png filter type.

[image: png_bestspeed]
If the png library let us simply select ftNone, encoding would speed up dramatically.
The pprof graph below shows this by simulating ftNone with no compression.
[image: png_nocompression]
The attached file png_encode.zip is the pprof dump showing png.filter taking a lot of time.

As a side note, even with no compression, much time is still spent in runtime.memmove. Any ideas on how to eliminate this? Is it related to garbage collection?

@gopherbot gopherbot added this to the Proposal milestone Mar 28, 2022
Contributor

@ianlancetaylor ianlancetaylor commented Mar 30, 2022

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Mar 30, 2022
Contributor

@nigeltao nigeltao commented Mar 31, 2022

If this is a proposal, what's your proposed API for skipping filter selection?

Bear in mind, though, that the standard library is especially constrained by the Go 1.x backwards compatibility promise, both in terms of (1) we can't change any existing API (e.g. adding new function arguments) or behavior if it will break previous users and (2) any new API we invent will be frozen and have to be supported for a long time.

The best solution may not necessarily be "change the stdlib PNG encoder" but "create a new PNG encoder package, under a different import path" instead - one that isn't subject to these constraints. The stdlib PNG package is open source and easily forked.

Along those lines, if you're looking for a PNG encoder more focused on compression speed than compression size, you might find inspiration in https://github.com/richgel999/fpng and https://github.com/veluca93/fpnge

Author

@fumin fumin commented Mar 31, 2022

Hi Nigel

I initially thought of adding a field to the Encoder struct, https://pkg.go.dev/image/png#Encoder , something like FilterType, that hardcodes the filter for every row in the image.
However, diving deeper into the issue, I realized that the fix for the performance problem is more involved than it seems.
It turns out the reason why much time is spent in runtime.memmove is because the current row buffer is not aligned in units of 4 bytes https://cs.opensource.google/go/go/+/refs/tags/go1.18:src/image/png/writer.go;l=351 .
Writing a single byte for the filter type first, then allocating a buffer that is a multiple of the image row size, which for RGBA is always a multiple of 4, halves the time from 20 seconds to ten seconds for a single filter type. Currently, the standard library tries all 5 filter types, so the speedup is 2x times 5 == 10x, which explains why it is taking around 90 seconds!

While adding FilterType makes sense, I am not sure I want the library to support memory alignment flags, as it is too low level and only works for RGBA/NRGBA images.
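
To make "hardcoding the filter for every row" concrete, here is a minimal sketch of applying only the Sub filter to one row. The function subFilterRow is hypothetical and is not part of image/png; the filter-type value 1 and the byte-wise difference come from the PNG specification's definition of the Sub filter. It assumes len(row) >= bpp and len(dst) == len(row)+1.

```go
package main

import "fmt"

// subFilterRow is a hypothetical helper, not an image/png API. It writes the
// filter-type byte for Sub into dst[0], then the filtered row into dst[1:].
// Each output byte is the input byte minus the byte bpp positions to its left
// (modulo 256); the first bpp bytes have no left neighbour and are copied.
func subFilterRow(dst, row []byte, bpp int) {
	dst[0] = 1 // filter type 1 is Sub in the PNG specification
	out := dst[1:]
	copy(out[:bpp], row[:bpp])
	for i := bpp; i < len(row); i++ {
		out[i] = row[i] - row[i-bpp]
	}
}

func main() {
	row := []byte{10, 20, 30, 255, 12, 25, 30, 255} // two NRGBA pixels
	dst := make([]byte, len(row)+1)
	subFilterRow(dst, row, 4)
	fmt.Println(dst) // prints [1 10 20 30 255 2 5 0 0]
}
```

Because only one candidate row is computed, there is no per-row comparison between filters, which is where the savings discussed in this thread would come from.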

I agree with you that in the near term, it is perhaps better for the community to explore different ideas outside the standard library, and then incorporate them back later.
Nonetheless, after much profiling and with a custom function that I wrote, I conclude that it is possible for Go to encode as fast as C, down to the assembly level. Many thanks to the team for the entire Go toolchain! Also, thanks Nigel for pointing to the two libraries for inspiration.
Therefore, feel free to close this issue.

Contributor

@nigeltao nigeltao commented Apr 1, 2022

> It turns out the reason why much time is spent in runtime.memmove is because the current row buffer is not aligned in units of 4 bytes https://cs.opensource.google/go/go/+/refs/tags/go1.18:src/image/png/writer.go;l=351 .
> Writing a single byte for the filter type first, then allocating a buffer that is a multiple of the image row size, which for RGBA is always a multiple of 4, halves the time from 20 seconds to ten seconds for a single filter type. Currently, the standard library tries all 5 filter types, so the speedup is 2x times 5 == 10x, which explains why it is taking around 90 seconds!

I'm not sure if that 10x number is right. ftNone basically does nothing but a runtime.memmove. The other filter types do more computation, so the relative speed-up would be less. The speed-ups also don't add up like that: if you're driving 5 miles and make yourself drive each mile 2x faster, you're overall 2x faster, not overall 10x faster. If you're driving 10 miles total (e.g. at 1 mile per minute pace) and make 5 out of the 10 2x faster (2 miles per minute), you're overall only 10 minutes / 7.5 minutes = 1.33x faster.
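
The driving analogy above is an instance of a simple formula: if a fraction of the original time is made some factor faster and the rest is unchanged, the overall speedup is 1 / ((1 - fraction) + fraction/factor). The sketch below just evaluates that formula for the two cases in the analogy; the function name overallSpeedup is chosen for illustration.

```go
package main

import "fmt"

// overallSpeedup returns the end-to-end speedup when `fraction` of the
// original running time is made `factor` times faster and the remainder is
// left unchanged.
func overallSpeedup(fraction, factor float64) float64 {
	return 1 / ((1 - fraction) + fraction/factor)
}

func main() {
	// All 5 miles driven 2x faster.
	fmt.Printf("%.2fx\n", overallSpeedup(1.0, 2)) // prints 2.00x
	// 5 of 10 miles driven 2x faster: 10 minutes becomes 7.5 minutes.
	fmt.Printf("%.2fx\n", overallSpeedup(0.5, 2)) // prints 1.33x
}
```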

Nonetheless, it's interesting that runtime.memmove friendly alignment (presumably when both dst and src are 4-byte or 8-byte aligned??) can have significant performance impact. @ianlancetaylor do we have any existing issues discussing this (or Go compiler team people thinking about this)? I did a quick skim of the open Go issues but didn't find anything.

Sticking with @ianlancetaylor: absent additional language or stdlib support, is there a recommended way to get a []byte that's 8-byte aligned? Is the best workaround to allocate N+8 bytes, examine (uintptr(unsafe.Pointer(&mySlice[0])) % 8) and sub-slice based on that offset?

Contributor

@ianlancetaylor ianlancetaylor commented Apr 1, 2022

I think that on most processors we could do slightly better if src & 7 == dst & 7, regardless of whether src and dst are aligned or not. But if we don't have that, we're stuck. On any modern processor it's always going to be faster to move aligned data.
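
The src & 7 == dst & 7 condition above can be checked directly for two byte slices. The sketch below is only a diagnostic; the function name sameAlignment is made up for illustration, and both slices are assumed to be non-empty.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sameAlignment reports whether dst and src start at addresses that share
// the same low three bits, that is, src & 7 == dst & 7.
func sameAlignment(dst, src []byte) bool {
	d := uintptr(unsafe.Pointer(&dst[0]))
	s := uintptr(unsafe.Pointer(&src[0]))
	return d&7 == s&7
}

func main() {
	buf := make([]byte, 64)
	fmt.Println(sameAlignment(buf[0:], buf[8:])) // start offsets differ by 8: prints true
	fmt.Println(sameAlignment(buf[0:], buf[1:])) // start offsets differ by 1: prints false
}
```

The second call mirrors the case raised earlier in the thread, where one buffer is offset by a single filter-type byte relative to the other.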

I think that a large underlying array of a []byte should always be aligned anyhow, because that is how the current memory allocator works. But of course there is no way to ensure alignment of a subslice.

Note that I haven't tried to understand this issue.

Author

@fumin fumin commented Apr 1, 2022

> I'm not sure if that 10x number is right. ftNone basically does nothing but a runtime.memmove. The other filter types do more computation, so the relative speed-up would be less. The speed-ups also don't add up like that: if you're driving 5 miles and make yourself drive each mile 2x faster, you're overall 2x faster, not overall 10x faster. If you're driving 10 miles total (e.g. at 1 mile per minute pace) and make 5 out of the 10 2x faster (2 miles per minute), you're overall only 10 minutes / 7.5 minutes = 1.33x faster.

Sorry, the 10x was with respect to my original naive attempt of encoding with (image/png + *image.RGBA).

The below file contains 3 profiles:
prof.zip

  • stdlib_rgba.prof: using the standard library to encode an *image.RGBA
  • stdlib.prof: using the standard library to encode an *image.NRGBA (non-premultiplied)
  • custom.prof: using a custom function to encode an *image.NRGBA

The durations are:

  • stdlib_rgba.prof: 108.01s
  • stdlib.prof: 44.44s
  • custom.prof: 8.05s
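
The ratios between these three durations can be checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the function name speedup is chosen for illustration.

```go
package main

import "fmt"

// speedup returns how many times faster `after` is than `before`.
func speedup(before, after float64) float64 {
	return before / after
}

func main() {
	const (
		stdlibRGBA  = 108.01 // seconds, stdlib_rgba.prof
		stdlibNRGBA = 44.44  // seconds, stdlib.prof
		custom      = 8.05   // seconds, custom.prof
	)
	fmt.Printf("stdlib RGBA  vs custom:       %.1fx\n", speedup(stdlibRGBA, custom))      // 13.4x
	fmt.Printf("stdlib NRGBA vs custom:       %.1fx\n", speedup(stdlibNRGBA, custom))     // 5.5x
	fmt.Printf("stdlib RGBA  vs stdlib NRGBA: %.1fx\n", speedup(stdlibRGBA, stdlibNRGBA)) // 2.4x
}
```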

The difference between the standard library and the custom function is 5x, with 2x coming from byte alignment and another 2.5x coming from hardcoding the ftSub filter. Nigel, you are right that skipping the other 4 filters does not achieve a 4x speedup, since ftNone is a no-op. In fact, the majority of the savings came from ftPaeth (4.10s/11.10s in stdlib.prof), as can be seen below. Apologies for my previous mistake. Nonetheless, it still seems worthwhile to use just one filter to get a 2.5x speedup.
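
For context on why ftPaeth costs more per byte than ftSub, here is a straightforward sketch of the Paeth predictor as the PNG specification defines it, together with the signed-magnitude helper used for the sum of absolute differences. This is written from the specification and is not copied from the standard library, whose implementation may differ in detail. The argument order (left, above, upper left) matches the call that appears in the profile listing.

```go
package main

import "fmt"

// abs returns the absolute value of x.
func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// paeth returns whichever of a (left), b (above) or c (upper left) is
// closest to the estimate a+b-c, breaking ties in the order a, b, c.
func paeth(a, b, c uint8) uint8 {
	p := int(a) + int(b) - int(c)
	pa := abs(p - int(a))
	pb := abs(p - int(b))
	pc := abs(p - int(c))
	switch {
	case pa <= pb && pa <= pc:
		return a
	case pb <= pc:
		return b
	default:
		return c
	}
}

// abs8 interprets d as a signed 8-bit difference and returns its magnitude.
func abs8(d uint8) int {
	if d < 128 {
		return int(d)
	}
	return 256 - int(d)
}

func main() {
	fmt.Println(paeth(10, 20, 5)) // estimate is 25, closest input is b: prints 20
	fmt.Println(abs8(255))        // 255 is -1 as a signed byte: prints 1
}
```

The Sub filter needs one subtraction per byte, while Paeth needs three distance computations and two comparisons before the subtraction, which is consistent with the per-line costs in the listing.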

PS C:\Users\a3367\Desktop\tmp\prof> go tool pprof .\png.test.exe .\stdlib.prof
File: png.test.exe
Type: cpu
Time: Apr 1, 2022 at 3:25pm (CST)
Duration: 44.44s, Total samples = 26.97s (60.69%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list filter
Total: 26.97s
ROUTINE ======================== image/png.filter in C:\Program Files\Go\src\image\png\writer.go
    11.10s     17.16s (flat, cum) 63.63% of Total
         .          .    208:   // We try all five filter types, and pick the one that minimizes the sum of absolute differences.
         .          .    209:   // This is the same heuristic that libpng uses, although the filters are attempted in order of
         .          .    210:   // estimated most likely to be minimal (ftUp, ftPaeth, ftNone, ftSub, ftAverage), rather than
         .          .    211:   // in their enumeration order (ftNone, ftSub, ftUp, ftAverage, ftPaeth).
         .          .    212:   cdat0 := cr[0][1:]
     240ms      260ms    213:   cdat1 := cr[1][1:]
         .          .    214:   cdat2 := cr[2][1:]
         .          .    215:   cdat3 := cr[3][1:]
         .          .    216:   cdat4 := cr[4][1:]
         .          .    217:   pdat := pr[1:]
         .          .    218:   n := len(cdat0)
         .          .    219:
         .          .    220:   // The up filter.
         .          .    221:   sum := 0
     440ms      440ms    222:   for i := 0; i < n; i++ {
     1.42s      1.42s    223:           cdat2[i] = cdat0[i] - pdat[i]
     170ms      1.11s    224:           sum += abs8(cdat2[i])
         .          .    225:   }
         .          .    226:   best := sum
         .          .    227:   filter := ftUp
         .          .    228:
         .          .    229:   // The Paeth filter.
         .          .    230:   sum = 0
         .          .    231:   for i := 0; i < bpp; i++ {
         .          .    232:           cdat4[i] = cdat0[i] - pdat[i]
         .          .    233:           sum += abs8(cdat4[i])
         .          .    234:   }
     1.10s      1.10s    235:   for i := bpp; i < n; i++ {
     5.14s      9.25s    236:           cdat4[i] = cdat0[i] - paeth(cdat0[i-bpp], pdat[i], pdat[i-bpp])
     210ms      820ms    237:           sum += abs8(cdat4[i])
     150ms      150ms    238:           if sum >= best {
         .          .    239:                   break
         .          .    240:           }
         .          .    241:   }
     100ms      100ms    242:   if sum < best {
         .          .    243:           best = sum
         .          .    244:           filter = ftPaeth
         .          .    245:   }
         .          .    246:
         .          .    247:   // The none filter.
         .          .    248:   sum = 0
         .          .    249:   for i := 0; i < n; i++ {
         .       20ms    250:           sum += abs8(cdat0[i])
         .          .    251:           if sum >= best {
         .          .    252:                   break
         .          .    253:           }
         .          .    254:   }
         .          .    255:   if sum < best {
         .          .    256:           best = sum
         .          .    257:           filter = ftNone
         .          .    258:   }
         .          .    259:
         .          .    260:   // The sub filter.
         .          .    261:   sum = 0
         .          .    262:   for i := 0; i < bpp; i++ {
         .          .    263:           cdat1[i] = cdat0[i]
         .          .    264:           sum += abs8(cdat1[i])
         .          .    265:   }
     220ms      220ms    266:   for i := bpp; i < n; i++ {
     1.37s      1.37s    267:           cdat1[i] = cdat0[i] - cdat0[i-bpp]
     140ms      490ms    268:           sum += abs8(cdat1[i])
     280ms      280ms    269:           if sum >= best {
         .          .    270:                   break
         .          .    271:           }
         .          .    272:   }
         .          .    273:   if sum < best {
         .          .    274:           best = sum
         .          .    275:           filter = ftSub
         .          .    276:   }
         .          .    277:
         .          .    278:   // The average filter.
         .          .    279:   sum = 0
         .          .    280:   for i := 0; i < bpp; i++ {
      60ms       60ms    281:           cdat3[i] = cdat0[i] - pdat[i]/2
         .          .    282:           sum += abs8(cdat3[i])
         .          .    283:   }
         .          .    284:   for i := bpp; i < n; i++ {
      40ms       40ms    285:           cdat3[i] = cdat0[i] - uint8((int(cdat0[i-bpp])+int(pdat[i]))/2)
      10ms       20ms    286:           sum += abs8(cdat3[i])
      10ms       10ms    287:           if sum >= best {
         .          .    288:                   break
         .          .    289:           }
         .          .    290:   }
         .          .    291:   if sum < best {
         .          .    292:           filter = ftAverage
(pprof) disasm filter
Total: 26.97s
ROUTINE ======================== image/png.filter
    13.03s     17.56s (flat, cum) 65.11% of Total
         .          .     507680: LEAQ -0x8(SP), R12                      ;writer.go:207
         .          .     507685: CMPQ 0x10(R14), R12
         .          .     507689: JBE 0x507d0d
         .          .     50768f: SUBQ $0x88, SP
         .          .     507696: MOVQ BP, 0x80(SP)
         .          .     50769e: LEAQ 0x80(SP), BP
         .          .     5076a6: MOVQ BX, 0x98(SP)
         .          .     5076ae: MOVQ 0x8(AX), DX                        ;writer.go:212
         .          .     5076b2: MOVQ 0(AX), R8
         .          .     5076b5: MOVQ 0x10(AX), R9
         .          .     5076b9: NOPL 0(AX)
         .          .     5076c0: CMPQ $0x1, DX
         .          .     5076c4: JB 0x507cff
         .          .     5076ca: DECQ R9
         .          .     5076cd: NEGQ R9
         .          .     5076d0: SARQ $0x3f, R9
         .          .     5076d4: ANDQ $0x1, R9
         .          .     5076d8: ADDQ R9, R8
         .          .     5076db: MOVQ 0x20(AX), R9                       ;writer.go:213
         .          .     5076df: MOVQ 0x18(AX), R10
         .          .     5076e3: MOVQ 0x28(AX), R11
         .          .     5076e7: CMPQ $0x1, R9
         .          .     5076eb: JB 0x507cf2
         .          .     5076f1: DECQ R11
         .          .     5076f4: NEGQ R11
         .          .     5076f7: SARQ $0x3f, R11
         .          .     5076fb: ANDQ $0x1, R11
         .          .     5076ff: ADDQ R11, R10
         .          .     507702: MOVQ 0x38(AX), R11                      ;writer.go:214
         .          .     507706: MOVQ 0x30(AX), R12
         .          .     50770a: MOVQ 0x40(AX), R13
         .          .     50770e: CMPQ $0x1, R11
         .          .     507712: JB 0x507ce5
         .          .     507718: MOVQ R10, 0x70(SP)                      ;writer.go:213
         .          .     50771d: DECQ R13                                ;writer.go:214
         .          .     507720: NEGQ R13
         .          .     507723: SARQ $0x3f, R13
         .          .     507727: ANDQ $0x1, R13
         .          .     50772b: ADDQ R13, R12
         .          .     50772e: MOVQ 0x50(AX), R13                      ;writer.go:215
         .          .     507732: MOVQ 0x48(AX), R15
         .          .     507736: MOVQ 0x58(AX), R10
         .          .     50773a: NOPW 0(AX)(AX*1)
         .          .     507740: CMPQ $0x1, R13
         .          .     507744: JB 0x507cd6
         .          .     50774a: MOVQ R13, 0x50(SP)
         .          .     50774f: DECQ R10
         .          .     507752: NEGQ R10
         .          .     507755: SARQ $0x3f, R10
         .          .     507759: ANDQ $0x1, R10
         .          .     50775d: ADDQ R15, R10
         .          .     507760: MOVQ R10, 0x68(SP)
         .          .     507765: MOVQ 0x68(AX), R15                      ;writer.go:216
         .          .     507769: MOVQ 0x60(AX), R10
         .          .     50776d: MOVQ 0x70(AX), R13
         .          .     507771: CMPQ $0x1, R15
         .          .     507775: JB 0x507cc9
         .          .     50777b: DECQ R13
         .          .     50777e: NEGQ R13
         .          .     507781: SARQ $0x3f, R13
         .          .     507785: ANDQ $0x1, R13
         .          .     507789: ADDQ R13, R10
         .          .     50778c: CMPQ $0x1, CX                           ;writer.go:217
         .          .     507790: JB 0x507cbf
         .          .     507796: MOVQ R9, 0x48(SP)                       ;writer.go:213
         .          .     50779b: DECQ DX                                 ;writer.go:212
         .          .     50779e: DECQ R11                                ;writer.go:214
         .          .     5077a1: DECQ DI                                 ;writer.go:217
         .          .     5077a4: NEGQ DI
         .          .     5077a7: SARQ $0x3f, DI
         .          .     5077ab: ANDQ $0x1, DI
         .          .     5077af: ADDQ BX, DI
         .          .     5077b2: DECQ CX
         .          .     5077b5: XORL AX, AX
         .          .     5077b7: XORL BX, BX
         .          .     5077b9: JMP 0x5077c6                            ;writer.go:222
     250ms      250ms     5077bb: INCQ AX                                 ;image/png.filter writer.go:222
      80ms      100ms     5077be: ADDQ R9, BX                             ;image/png.filter writer.go:224
     120ms      140ms     5077c1: MOVQ 0x48(SP), R9                       ;image/png.filter writer.go:213
     190ms      190ms     5077c6: CMPQ DX, AX                             ;image/png.filter writer.go:222
         .          .     5077c9: JGE 0x50780a                            ;writer.go:222
     130ms      130ms     5077cb: MOVZX 0(AX)(R8*1), R13                  ;image/png.filter writer.go:223
     470ms      470ms     5077d0: CMPQ CX, AX
         .          .     5077d3: JAE 0x507cba                            ;writer.go:223
     110ms      110ms     5077d9: MOVZX 0(AX)(DI*1), R9                   ;image/png.filter writer.go:223
     220ms      220ms     5077de: SUBL R9, R13
     350ms      350ms     5077e1: CMPQ R11, AX
         .          .     5077e4: JAE 0x507cb2                            ;writer.go:223
     140ms      140ms     5077ea: MOVB R13, 0(R12)(AX*1)                  ;image/png.filter writer.go:223
     810ms      810ms     5077ee: CMPL $0x80, R13                         ;image/png.filter writer.go:90
         .          .     5077f2: JAE 0x5077fa                            ;writer.go:90
     110ms      110ms     5077f4: MOVZX R13, R9                           ;image/png.filter writer.go:91
      90ms       90ms     5077f8: JMP 0x5077bb                            ;image/png.filter writer.go:224
      10ms       10ms     5077fa: MOVZX R13, R13                          ;image/png.abs8 writer.go:93
      10ms       10ms     5077fe: LEAQ 0xffffff00(R13), R9
         .          .     507805: NEGQ R9                                 ;writer.go:93
         .          .     507808: JMP 0x5077bb                            ;writer.go:224
         .          .     50780a: LEAQ -0x1(R15), R11                     ;writer.go:216
         .          .     50780e: XORL AX, AX
         .          .     507810: XORL R12, R12
         .          .     507813: JMP 0x507820                            ;writer.go:231
         .          .     507815: INCQ AX
         .          .     507818: ADDQ R13, R12                           ;writer.go:233
         .          .     50781b: NOPL 0(AX)(AX*1)
         .          .     507820: CMPQ AX, SI                             ;writer.go:231
         .          .     507823: JLE 0x50786d
         .          .     507825: CMPQ DX, AX                             ;writer.go:232
         .          .     507828: JAE 0x507caa
         .          .     50782e: MOVZX 0(AX)(R8*1), R13
         .          .     507833: CMPQ CX, AX
         .          .     507836: JAE 0x507ca5
         .          .     50783c: MOVZX 0(AX)(DI*1), R15
         .          .     507841: SUBL R15, R13
         .          .     507844: CMPQ R11, AX
         .          .     507847: JAE 0x507c98
         .          .     50784d: MOVB R13, 0(R10)(AX*1)
         .          .     507851: CMPL $0x80, R13                         ;writer.go:90
         .          .     507855: JAE 0x50785d
         .          .     507857: MOVZX R13, R13                          ;writer.go:91
         .          .     50785b: JMP 0x507815                            ;writer.go:233
         .          .     50785d: MOVZX R13, R15                          ;writer.go:93
         .          .     507861: LEAQ 0xffffff00(R15), R13
         .          .     507868: NEGQ R13
         .          .     50786b: JMP 0x507815                            ;writer.go:233
         .          .     50786d: MOVQ SI, 0xb0(SP)                       ;writer.go:207
         .          .     507875: MOVQ DX, 0x20(SP)                       ;writer.go:212
         .          .     50787a: MOVQ R8, 0x78(SP)
         .          .     50787f: MOVQ R11, 0x30(SP)                      ;writer.go:216
         .          .     507884: MOVQ R10, 0x60(SP)
         .          .     507889: MOVQ CX, 0x18(SP)                       ;writer.go:217
         .          .     50788e: MOVQ DI, 0x58(SP)
         .          .     507893: MOVQ BX, 0x10(SP)                       ;writer.go:224
         .          .     507898: MOVQ SI, AX                             ;writer.go:207
         .          .     50789b: JMP 0x5078e0                            ;writer.go:235
     330ms      330ms     50789d: INCQ SI                                 ;image/png.filter writer.go:235
      20ms       20ms     5078a0: MOVQ 0x58(SP), R13
      20ms       20ms     5078a5: MOVQ 0x20(SP), R15
     120ms      120ms     5078aa: MOVQ 0xb0(SP), AX
     140ms      140ms     5078b2: MOVQ 0x18(SP), BX
      90ms       90ms     5078b7: MOVQ BX, CX                             ;image/png.filter writer.go:236
      50ms       50ms     5078ba: MOVQ R15, DX                            ;image/png.filter writer.go:235
     100ms      100ms     5078bd: MOVQ R9, BX                             ;image/png.filter writer.go:242
     160ms      160ms     5078c0: MOVQ 0x78(SP), R8                       ;image/png.filter writer.go:236
     120ms      120ms     5078c5: MOVQ 0x48(SP), R9                       ;image/png.filter writer.go:213
      70ms       70ms     5078ca: MOVQ 0x60(SP), R10                      ;image/png.filter writer.go:236
     110ms      110ms     5078cf: MOVQ 0x30(SP), R11
      50ms       80ms     5078d4: MOVQ DI, R12                            ;image/png.filter writer.go:237
     140ms      140ms     5078d7: MOVQ R13, DI                            ;image/png.filter writer.go:236
      80ms       80ms     5078da: NOPW 0(AX)(AX*1)
      80ms       80ms     5078e0: CMPQ DX, SI                             ;image/png.filter writer.go:235
         .          .     5078e3: JGE 0x5079ae                            ;writer.go:235
     290ms      290ms     5078e9: MOVQ SI, R13                            ;image/png.filter writer.go:236
      40ms       40ms     5078ec: SUBQ AX, SI
      90ms       90ms     5078ef: CMPQ DX, SI
         .          .     5078f2: JAE 0x507c8d                            ;writer.go:236
     150ms      150ms     5078f8: MOVZX 0(R8)(SI*1), R15                  ;image/png.filter writer.go:236
     800ms      800ms     5078fd: NOPL 0(AX)
      50ms       50ms     507900: CMPQ CX, R13
         .          .     507903: JAE 0x507c85                            ;writer.go:236
      20ms       20ms     507909: MOVZX 0(R13)(DI*1), R9                  ;image/png.filter writer.go:236
     190ms      190ms     50790f: CMPQ CX, SI
         .          .     507912: JAE 0x507c78                            ;writer.go:236
     340ms      340ms     507918: MOVQ R13, 0x40(SP)                      ;image/png.filter writer.go:235
      50ms       50ms     50791d: MOVQ R12, 0x38(SP)                      ;image/png.filter writer.go:237
      40ms       40ms     507922: MOVZX 0(DI)(SI*1), CX                   ;image/png.filter writer.go:236
     120ms      120ms     507926: MOVL R15, AX
     390ms      390ms     507929: MOVL R9, BX
      20ms      4.12s     50792c: CALL image/png.paeth(SB)
     480ms      480ms     507931: MOVQ 0x78(SP), DX
     170ms      170ms     507936: MOVQ 0x40(SP), SI
     100ms      110ms     50793b: MOVZX 0(SI)(DX*1), DI
     940ms      940ms     50793f: SUBL AX, DI
     230ms      230ms     507941: MOVQ 0x30(SP), CX
     170ms      170ms     507946: CMPQ CX, SI
         .          .     507949: JAE 0x507c70                            ;writer.go:236
      70ms       70ms     50794f: MOVQ 0x60(SP), R8                       ;image/png.filter writer.go:236
     130ms      130ms     507954: MOVB DI, 0(R8)(SI*1)
     580ms      580ms     507958: CMPL $0x80, DI                          ;image/png.filter writer.go:90
         .          .     50795c: JAE 0x507964                            ;writer.go:90
      20ms       20ms     50795e: MOVZX DI, DI                            ;image/png.abs8 writer.go:91
      10ms       10ms     507962: JMP 0x507972                            ;image/png.filter writer.go:237
         .          .     507964: MOVZX DI, R9                            ;writer.go:93
      10ms       10ms     507968: LEAQ 0xffffff00(R9), DI                 ;image/png.abs8 writer.go:93
         .          .     50796f: NEGQ DI                                 ;writer.go:93
      50ms       50ms     507972: MOVQ 0x38(SP), R9                       ;image/png.filter writer.go:237
      50ms       50ms     507977: ADDQ R9, DI
     100ms      100ms     50797a: MOVQ 0x10(SP), R9                       ;image/png.filter writer.go:238
         .          .     50797f: NOPL                                    ;writer.go:238
      50ms       50ms     507980: CMPQ DI, R9                             ;image/png.filter writer.go:238
         .          .     507983: JG 0x50789d                             ;writer.go:238
         .          .     507989: MOVQ 0xb0(SP), AX                       ;writer.go:262
         .          .     507991: MOVQ 0x18(SP), CX                       ;writer.go:281
         .          .     507996: MOVQ R9, BX                             ;writer.go:242
         .          .     507999: MOVQ DX, R8                             ;writer.go:250
         .          .     50799c: MOVQ 0x48(SP), R9                       ;writer.go:213
         .          .     5079a1: MOVQ DI, R12                            ;writer.go:242
         .          .     5079a4: MOVQ 0x20(SP), DX                       ;writer.go:249
         .          .     5079a9: MOVQ 0x58(SP), DI                       ;writer.go:281
         .          .     5079ae: CMPQ BX, R12                            ;writer.go:242
         .          .     5079b1: MOVQ BX, SI                             ;writer.go:251
         .          .     5079b4: CMOVL R12, BX
         .          .     5079b8: XORL R10, R10
         .          .     5079bb: XORL R11, R11
         .          .     5079be: NOPW
         .          .     5079c0: JMP 0x5079c5                            ;writer.go:242
         .          .     5079c2: INCQ R10                                ;writer.go:249
         .          .     5079c5: CMPQ DX, R10
         .          .     5079c8: JGE 0x5079e7
         .          .     5079ca: MOVZX 0(R10)(R8*1), R13                 ;writer.go:250
      10ms       10ms     5079cf: CMPL $0x80, R13                         ;image/png.filter writer.go:90
         .          .     5079d3: JB 0x5079df                             ;writer.go:90
      10ms       10ms     5079d5: ADDQ $-0x100, R13                       ;image/png.filter writer.go:93
         .          .     5079dc: NEGQ R13                                ;writer.go:93
         .          .     5079df: ADDQ R13, R11                           ;writer.go:250
         .          .     5079e2: CMPQ R11, BX                            ;writer.go:251
         .          .     5079e5: JG 0x5079c2
         .          .     5079e7: CMPQ BX, R11                            ;writer.go:255
         .          .     5079ea: MOVQ BX, R10                            ;writer.go:269
         .          .     5079ed: CMOVL R11, BX
         .          .     5079f1: CMPQ SI, R12                            ;writer.go:242
         .          .     5079f4: MOVL $0x2, SI                           ;writer.go:295
         .          .     5079f9: MOVL $0x4, R12
         .          .     5079ff: CMOVL R12, SI
         .          .     507a03: CMPQ R10, R11                           ;writer.go:255
         .          .     507a06: MOVL $0x0, R10                          ;writer.go:295
         .          .     507a0c: CMOVL R10, SI
         .          .     507a10: DECQ R9                                 ;writer.go:213
         .          .     507a13: MOVQ 0x70(SP), R10                      ;writer.go:255
         .          .     507a18: XORL R11, R11
         .          .     507a1b: XORL R12, R12
         .          .     507a1e: NOPW
         .          .     507a20: JMP 0x507a28
         .          .     507a22: INCQ R11                                ;writer.go:262
         .          .     507a25: ADDQ R13, R12                           ;writer.go:264
         .          .     507a28: CMPQ R11, AX                            ;writer.go:262
         .          .     507a2b: JLE 0x507a5f
         .          .     507a2d: CMPQ DX, R11                            ;writer.go:263
         .          .     507a30: JAE 0x507c65
         .          .     507a36: MOVZX 0(R11)(R8*1), R13
         .          .     507a3b: NOPL 0(AX)(AX*1)
         .          .     507a40: CMPQ R9, R11
         .          .     507a43: JAE 0x507c58
         .          .     507a49: MOVB R13, 0(R10)(R11*1)
         .          .     507a4d: CMPL $0x80, R13                         ;writer.go:90
         .          .     507a51: JB 0x507a22
         .          .     507a53: ADDQ $-0x100, R13                       ;writer.go:93
         .          .     507a5a: NEGQ R13
         .          .     507a5d: JMP 0x507a22                            ;writer.go:264
         .          .     507a5f: MOVQ AX, R11                            ;writer.go:207
         .          .     507a62: JMP 0x507a70                            ;writer.go:266
     150ms      150ms     507a64: LEAQ 0x1(R15), AX                       ;image/png.filter writer.go:266
      20ms      370ms     507a68: MOVQ DI, R12                            ;image/png.filter writer.go:268
      60ms       60ms     507a6b: MOVQ 0x58(SP), DI                       ;image/png.filter writer.go:281
      70ms       70ms     507a70: CMPQ DX, AX                             ;image/png.filter writer.go:266
         .          .     507a73: JGE 0x507ad7                            ;writer.go:266
     110ms      110ms     507a75: JAE 0x507c50                            ;image/png.filter writer.go:267
     130ms      130ms     507a7b: MOVZX 0(AX)(R8*1), R13
      40ms       40ms     507a80: MOVQ AX, R15
     200ms      200ms     507a83: SUBQ R11, AX
     120ms      120ms     507a86: CMPQ DX, AX
         .          .     507a89: JAE 0x507c48                            ;writer.go:267
      80ms       80ms     507a8f: MOVZX 0(R8)(AX*1), DI                   ;image/png.filter writer.go:267
     400ms      400ms     507a94: SUBL DI, R13
     200ms      200ms     507a97: NOPW 0(AX)(AX*1)
      60ms       60ms     507aa0: CMPQ R9, R15
         .          .     507aa3: JAE 0x507c3d                            ;writer.go:267
      30ms       30ms     507aa9: MOVB R13, 0(R10)(R15*1)                 ;image/png.filter writer.go:267
     290ms      290ms     507aad: CMPL $0x80, R13                         ;image/png.abs8 writer.go:90
         .          .     507ab1: JAE 0x507ab9                            ;writer.go:90
      40ms       40ms     507ab3: MOVZX R13, DI                           ;image/png.abs8 writer.go:91
     100ms      100ms     507ab7: JMP 0x507ac7                            ;image/png.filter writer.go:268
      10ms       10ms     507ab9: MOVZX R13, R13                          ;image/png.abs8 writer.go:93
      10ms       10ms     507abd: LEAQ 0xffffff00(R13), DI
         .          .     507ac4: NEGQ DI                                 ;writer.go:93
      20ms       20ms     507ac7: ADDQ R12, DI                            ;image/png.filter writer.go:268
     280ms      280ms     507aca: CMPQ DI, BX                             ;image/png.filter writer.go:269
         .          .     507acd: JG 0x507a64                             ;writer.go:269
         .          .     507acf: MOVQ DI, R12                            ;writer.go:273
         .          .     507ad2: MOVQ 0x58(SP), DI                       ;writer.go:281
         .          .     507ad7: CMPQ BX, R12                            ;writer.go:273
         .          .     507ada: CMOVL R12, BX                           ;writer.go:287
         .          .     507ade: MOVL $0x1, R9                           ;writer.go:295
         .          .     507ae4: CMOVL R9, SI
         .          .     507ae8: MOVQ 0x50(SP), R9                       ;writer.go:215
         .          .     507aed: DECQ R9
         .          .     507af0: MOVQ 0x68(SP), R10                      ;writer.go:273
         .          .     507af5: XORL AX, AX
         .          .     507af7: XORL R12, R12
         .          .     507afa: JMP 0x507b02
         .          .     507afc: INCQ AX                                 ;writer.go:280
         .          .     507aff: ADDQ R13, R12                           ;writer.go:282
         .          .     507b02: CMPQ AX, R11                            ;writer.go:280
         .          .     507b05: JLE 0x507b52
         .          .     507b07: CMPQ DX, AX                             ;writer.go:281
         .          .     507b0a: JAE 0x507c35
         .          .     507b10: MOVZX 0(AX)(R8*1), R13
         .          .     507b15: CMPQ CX, AX
         .          .     507b18: JAE 0x507c30
         .          .     507b1e: MOVZX 0(AX)(DI*1), R15
         .          .     507b23: SHRL $0x1, R15
         .          .     507b26: SUBL R15, R13
         .          .     507b29: CMPQ R9, AX
         .          .     507b2c: JAE 0x507c28
         .          .     507b32: MOVB R13, 0(R10)(AX*1)
         .          .     507b36: CMPL $0x80, R13                         ;writer.go:90
         .          .     507b3a: JAE 0x507b42
         .          .     507b3c: MOVZX R13, R13                          ;writer.go:91
         .          .     507b40: JMP 0x507afc                            ;writer.go:282
         .          .     507b42: MOVZX R13, R15                          ;writer.go:93
         .          .     507b46: LEAQ 0xffffff00(R15), R13
         .          .     507b4d: NEGQ R13
         .          .     507b50: JMP 0x507afc                            ;writer.go:282
         .          .     507b52: MOVQ SI, 0x28(SP)                       ;writer.go:295
         .          .     507b57: MOVQ R11, AX                            ;writer.go:207
         .          .     507b5a: JMP 0x507b68                            ;writer.go:284
         .          .     507b5c: LEAQ 0x1(R15), R11
         .          .     507b60: MOVQ SI, R12                            ;writer.go:286
         .          .     507b63: MOVQ 0x28(SP), SI                       ;writer.go:295
         .          .     507b68: CMPQ DX, R11                            ;writer.go:284
         .          .     507b6b: JGE 0x507be0
         .          .     507b6d: JAE 0x507c1d                            ;writer.go:285
         .          .     507b73: MOVZX 0(R11)(R8*1), R13
         .          .     507b78: MOVQ R11, R15
         .          .     507b7b: SUBQ AX, R11
         .          .     507b7e: NOPW
         .          .     507b80: CMPQ DX, R11
         .          .     507b83: JAE 0x507c12
         .          .     507b89: MOVZX 0(R8)(R11*1), R11
         .          .     507b8e: CMPQ CX, R15
         .          .     507b91: JAE 0x507c0a
         .          .     507b93: MOVZX 0(R15)(DI*1), SI
         .          .     507b98: ADDQ R11, SI
         .          .     507b9b: MOVQ SI, R11
         .          .     507b9e: SHRQ $0x3f, SI
         .          .     507ba2: ADDQ R11, SI
         .          .     507ba5: SARQ $0x1, SI
      20ms       20ms     507ba8: SUBL SI, R13                            ;image/png.filter writer.go:285
      20ms       20ms     507bab: CMPQ R9, R15
         .          .     507bae: JAE 0x507bff                            ;writer.go:285
         .          .     507bb0: MOVB R13, 0(R10)(R15*1)
      10ms       10ms     507bb4: CMPL $0x80, R13                         ;image/png.filter writer.go:90
         .          .     507bb8: JAE 0x507bc2                            ;writer.go:90
         .          .     507bba: MOVZX R13, SI                           ;writer.go:91
         .          .     507bbe: NOPW
         .          .     507bc0: JMP 0x507bd0                            ;writer.go:286
         .          .     507bc2: MOVZX R13, R11                          ;writer.go:93
         .          .     507bc6: LEAQ 0xffffff00(R11), SI
         .          .     507bcd: NEGQ SI
      10ms       10ms     507bd0: ADDQ R12, SI                            ;image/png.filter writer.go:286
      10ms       10ms     507bd3: CMPQ SI, BX                             ;image/png.filter writer.go:287
         .          .     507bd6: JG 0x507b5c                             ;writer.go:287
         .          .     507bd8: MOVQ SI, R12                            ;writer.go:291
         .          .     507bdb: MOVQ 0x28(SP), SI                       ;writer.go:295
         .          .     507be0: CMPQ BX, R12                            ;writer.go:291
         .          .     507be3: MOVL $0x3, CX                           ;writer.go:295
         .          .     507be8: CMOVL CX, SI
         .          .     507bec: MOVQ SI, AX
         .          .     507bef: MOVQ 0x80(SP), BP                       ;writer.go:291
         .          .     507bf7: ADDQ $0x88, SP
         .          .     507bfe: RET
         .          .     507bff: MOVQ R15, AX                            ;writer.go:285
         .          .     507c02: MOVQ R9, CX
         .          .     507c05: CALL runtime.panicIndex(SB)
         .          .     507c0a: MOVQ R15, AX
         .          .     507c0d: CALL runtime.panicIndex(SB)
         .          .     507c12: MOVQ R11, AX
         .          .     507c15: MOVQ DX, CX
         .          .     507c18: CALL runtime.panicIndex(SB)
         .          .     507c1d: MOVQ R11, AX
         .          .     507c20: MOVQ DX, CX
         .          .     507c23: CALL runtime.panicIndex(SB)
         .          .     507c28: MOVQ R9, CX                             ;writer.go:281
         .          .     507c2b: CALL runtime.panicIndex(SB)
         .          .     507c30: CALL runtime.panicIndex(SB)
         .          .     507c35: MOVQ DX, CX
         .          .     507c38: CALL runtime.panicIndex(SB)
         .          .     507c3d: MOVQ R15, AX                            ;writer.go:267
         .          .     507c40: MOVQ R9, CX
         .          .     507c43: CALL runtime.panicIndex(SB)
         .          .     507c48: MOVQ DX, CX
         .          .     507c4b: CALL runtime.panicIndex(SB)
         .          .     507c50: MOVQ DX, CX
         .          .     507c53: CALL runtime.panicIndex(SB)
         .          .     507c58: MOVQ R11, AX                            ;writer.go:263
         .          .     507c5b: MOVQ R9, CX
         .          .     507c5e: NOPW
         .          .     507c60: CALL runtime.panicIndex(SB)
         .          .     507c65: MOVQ R11, AX
         .          .     507c68: MOVQ DX, CX
         .          .     507c6b: CALL runtime.panicIndex(SB)
         .          .     507c70: MOVQ SI, AX                             ;writer.go:236
         .          .     507c73: CALL runtime.panicIndex(SB)
         .          .     507c78: MOVQ SI, AX
         .          .     507c7b: NOPL 0(AX)(AX*1)
         .          .     507c80: CALL runtime.panicIndex(SB)
         .          .     507c85: MOVQ R13, AX
         .          .     507c88: CALL runtime.panicIndex(SB)
         .          .     507c8d: MOVQ SI, AX
         .          .     507c90: MOVQ DX, CX
         .          .     507c93: CALL runtime.panicIndex(SB)
         .          .     507c98: MOVQ R11, CX                            ;writer.go:232
         .          .     507c9b: NOPL 0(AX)(AX*1)
         .          .     507ca0: CALL runtime.panicIndex(SB)
         .          .     507ca5: CALL runtime.panicIndex(SB)
         .          .     507caa: MOVQ DX, CX
         .          .     507cad: CALL runtime.panicIndex(SB)
         .          .     507cb2: MOVQ R11, CX                            ;writer.go:223
         .          .     507cb5: CALL runtime.panicIndex(SB)
         .          .     507cba: CALL runtime.panicIndex(SB)
         .          .     507cbf: MOVL $0x1, AX                           ;writer.go:217
         .          .     507cc4: CALL runtime.panicSliceB(SB)
         .          .     507cc9: MOVL $0x1, AX                           ;writer.go:216
         .          .     507cce: MOVQ R15, CX
         .          .     507cd1: CALL runtime.panicSliceB(SB)
         .          .     507cd6: MOVL $0x1, AX                           ;writer.go:215
         .          .     507cdb: MOVQ R13, CX
         .          .     507cde: NOPW
         .          .     507ce0: CALL runtime.panicSliceB(SB)
         .          .     507ce5: MOVL $0x1, AX                           ;writer.go:214
         .          .     507cea: MOVQ R11, CX
         .          .     507ced: CALL runtime.panicSliceB(SB)
         .          .     507cf2: MOVL $0x1, AX                           ;writer.go:213
         .          .     507cf7: MOVQ R9, CX
         .          .     507cfa: CALL runtime.panicSliceB(SB)
         .          .     507cff: MOVL $0x1, AX                           ;writer.go:212
         .          .     507d04: MOVQ DX, CX
         .          .     507d07: CALL runtime.panicSliceB(SB)
         .          .     507d0c: NOPL
         .          .     507d0d: MOVQ AX, 0x8(SP)                        ;writer.go:207
         .          .     507d12: MOVQ BX, 0x10(SP)
         .          .     507d17: MOVQ CX, 0x18(SP)
         .          .     507d1c: MOVQ DI, 0x20(SP)
         .          .     507d21: MOVQ SI, 0x28(SP)
         .          .     507d26: CALL runtime.morestack_noctxt.abi0(SB)
         .          .     507d2b: MOVQ 0x8(SP), AX
         .          .     507d30: MOVQ 0x10(SP), BX
         .          .     507d35: MOVQ 0x18(SP), CX
         .          .     507d3a: MOVQ 0x20(SP), DI
         .          .     507d3f: MOVQ 0x28(SP), SI
         .          .     507d44: JMP image/png.filter(SB)
         .          .     507d49: INT $0x3
         .          .     507d4a: INT $0x3
         .          .     507d4b: INT $0x3
         .          .     507d4c: INT $0x3
         .          .     507d4d: INT $0x3
         .          .     507d4e: INT $0x3
         .          .     507d4f: INT $0x3
         .          .     507d50: INT $0x3
         .          .     507d51: INT $0x3
         .          .     507d52: INT $0x3
         .          .     507d53: INT $0x3
         .          .     507d54: INT $0x3
         .          .     507d55: INT $0x3
         .          .     507d56: INT $0x3
         .          .     507d57: INT $0x3
         .          .     507d58: INT $0x3
         .          .     507d59: INT $0x3
         .          .     507d5a: INT $0x3
         .          .     507d5b: INT $0x3
         .          .     507d5c: INT $0x3
         .          .     507d5d: INT $0x3
         .          .     507d5e: INT $0x3
(pprof)
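For readers mapping the hot samples above back to source: nearly all of the time falls on writer.go:266-269 and on the inlined abs8 (writer.go:90-93), and the loop is followed by the selection of filter type 1, which is Sub in the PNG specification. Below is a minimal sketch of what that loop appears to do. It is reconstructed from the disassembly, not copied from the standard library, and the name `subFilterCost` and its signature are hypothetical.

```go
package main

import "fmt"

// abs8 treats d as a signed 8-bit residual and returns its magnitude.
// It corresponds to the CMPL $0x80 / LEAQ -0x100 / NEGQ sequence in the listing.
func abs8(d uint8) int {
	if d < 128 {
		return int(d)
	}
	return 256 - int(d)
}

// subFilterCost (hypothetical helper) writes the Sub-filtered row into out and
// returns the running sum of absolute residuals. It stops early once the sum
// reaches best, since this filter can then no longer win.
// Assumes bpp <= len(cur) and len(out) >= len(cur).
func subFilterCost(out, cur []byte, bpp, best int) int {
	sum := 0
	// The first bpp bytes have no left neighbour and are copied unchanged.
	for i := 0; i < bpp; i++ {
		out[i] = cur[i]
		sum += abs8(out[i])
	}
	// Remaining bytes store the difference to the byte bpp positions earlier.
	// Byte subtraction wraps modulo 256.
	for i := bpp; i < len(cur); i++ {
		out[i] = cur[i] - cur[i-bpp]
		sum += abs8(out[i])
		if sum >= best {
			break
		}
	}
	return sum
}

func main() {
	row := []byte{10, 12, 11, 200, 201}
	out := make([]byte, len(row))
	fmt.Println(subFilterCost(out, row, 1, 1<<30), out) // 81 [10 2 255 189 1]
}
```

Each iteration does two loads, a subtract, a store, a branch for abs8 and a compare against the best sum so far, plus the bounds checks visible as JAE jumps to runtime.panicIndex. That is consistent with the samples being spread fairly evenly across the loop body rather than concentrated on one instruction.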

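Regarding the mystery of runtime.memmove: comparing "stdlib.prof" and "custom.prof" suggests that the main difference is the time spent in the MOVOU instruction on line 350 of memmove_amd64.s, which accounts for 2.97s of the 6.06s attributed to memmove in stdlib.prof.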

stdlib.prof

PS C:\Users\a3367\Desktop\tmp\prof> go tool pprof .\png.test.exe .\stdlib.prof
File: png.test.exe
Type: cpu
Time: Apr 1, 2022 at 3:25pm (CST)
Duration: 44.44s, Total samples = 26.97s (60.69%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list memmove
Total: 26.97s
ROUTINE ======================== runtime.memmove in C:\Program Files\Go\src\runtime\memmove_amd64.s
     6.06s      6.06s (flat, cum) 22.47% of Total
         .          .     36:TEXT runtime·memmove<ABIInternal>(SB), NOSPLIT, $0-24
         .          .     37:#ifdef GOEXPERIMENT_regabiargs
         .          .     38:   // AX = to
         .          .     39:   // BX = from
         .          .     40:   // CX = n
      20ms       20ms     41:   MOVQ    AX, DI
         .          .     42:   MOVQ    BX, SI
         .          .     43:   MOVQ    CX, BX
         .          .     44:#else
         .          .     45:   MOVQ    to+0(FP), DI
         .          .     46:   MOVQ    from+8(FP), SI
         .          .     47:   MOVQ    n+16(FP), BX
         .          .     48:#endif
         .          .     49:
         .          .     50:   // REP instructions have a high startup cost, so we handle small sizes
         .          .     51:   // with some straightline code. The REP MOVSQ instruction is really fast
         .          .     52:   // for large sizes. The cutover is approximately 2K.
         .          .     53:tail:
         .          .     54:   // move_129through256 or smaller work whether or not the source and the
         .          .     55:   // destination memory regions overlap because they load all data into
         .          .     56:   // registers before writing it back.  move_256through2048 on the other
         .          .     57:   // hand can be used only when the memory regions don't overlap or the copy
         .          .     58:   // direction is forward.
         .          .     59:   //
         .          .     60:   // BSR+branch table make almost all memmove/memclr benchmarks worse. Not worth doing.
         .          .     61:   TESTQ   BX, BX
         .          .     62:   JEQ     move_0
      20ms       20ms     63:   CMPQ    BX, $2
         .          .     64:   JBE     move_1or2
         .          .     65:   CMPQ    BX, $4
         .          .     66:   JB      move_3
      10ms       10ms     67:   JBE     move_4
         .          .     68:   CMPQ    BX, $8
         .          .     69:   JB      move_5through7
         .          .     70:   JE      move_8
      10ms       10ms     71:   CMPQ    BX, $16
         .          .     72:   JBE     move_9through16
         .          .     73:   CMPQ    BX, $32
         .          .     74:   JBE     move_17through32
         .          .     75:   CMPQ    BX, $64
         .          .     76:   JBE     move_33through64
         .          .     77:   CMPQ    BX, $128
         .          .     78:   JBE     move_65through128
         .          .     79:   CMPQ    BX, $256
         .          .     80:   JBE     move_129through256
         .          .     81:
         .          .     82:   TESTB   $1, runtime·useAVXmemmove(SB)
         .          .     83:   JNZ     avxUnaligned
         .          .     84:
         .          .     85:/*
         .          .     86: * check and set for backwards
         .          .     87: */
         .          .     88:   CMPQ    SI, DI
         .          .     89:   JLS     back
         .          .     90:
         .          .     91:/*
         .          .     92: * forward copy loop
         .          .     93: */
         .          .     94:forward:
         .          .     95:   CMPQ    BX, $2048
         .          .     96:   JLS     move_256through2048
         .          .     97:
         .          .     98:   // If REP MOVSB isn't fast, don't use it
         .          .     99:   CMPB    internal∕cpu·X86+const_offsetX86HasERMS(SB), $1 // enhanced REP MOVSB/STOSB
         .          .    100:   JNE     fwdBy8
         .          .    101:
         .          .    102:   // Check alignment
         .          .    103:   MOVL    SI, AX
         .          .    104:   ORL     DI, AX
         .          .    105:   TESTL   $7, AX
         .          .    106:   JEQ     fwdBy8
         .          .    107:
         .          .    108:   // Do 1 byte at a time
         .          .    109:   MOVQ    BX, CX
         .          .    110:   REP;    MOVSB
         .          .    111:   RET
         .          .    112:
         .          .    113:fwdBy8:
         .          .    114:   // Do 8 bytes at a time
         .          .    115:   MOVQ    BX, CX
         .          .    116:   SHRQ    $3, CX
         .          .    117:   ANDQ    $7, BX
         .          .    118:   REP;    MOVSQ
         .          .    119:   JMP     tail
         .          .    120:
         .          .    121:back:
         .          .    122:/*
         .          .    123: * check overlap
         .          .    124: */
         .          .    125:   MOVQ    SI, CX
         .          .    126:   ADDQ    BX, CX
         .          .    127:   CMPQ    CX, DI
         .          .    128:   JLS     forward
         .          .    129:/*
         .          .    130: * whole thing backwards has
         .          .    131: * adjusted addresses
         .          .    132: */
         .          .    133:   ADDQ    BX, DI
         .          .    134:   ADDQ    BX, SI
         .          .    135:   STD
         .          .    136:
         .          .    137:/*
         .          .    138: * copy
         .          .    139: */
         .          .    140:   MOVQ    BX, CX
         .          .    141:   SHRQ    $3, CX
         .          .    142:   ANDQ    $7, BX
         .          .    143:
         .          .    144:   SUBQ    $8, DI
         .          .    145:   SUBQ    $8, SI
         .          .    146:   REP;    MOVSQ
         .          .    147:
         .          .    148:   CLD
         .          .    149:   ADDQ    $8, DI
         .          .    150:   ADDQ    $8, SI
         .          .    151:   SUBQ    BX, DI
         .          .    152:   SUBQ    BX, SI
         .          .    153:   JMP     tail
         .          .    154:
         .          .    155:move_1or2:
         .          .    156:   MOVB    (SI), AX
         .          .    157:   MOVB    -1(SI)(BX*1), CX
         .          .    158:   MOVB    AX, (DI)
         .          .    159:   MOVB    CX, -1(DI)(BX*1)
         .          .    160:   RET
         .          .    161:move_0:
         .          .    162:   RET
         .          .    163:move_4:
         .          .    164:   MOVL    (SI), AX
         .          .    165:   MOVL    AX, (DI)
         .          .    166:   RET
         .          .    167:move_3:
         .          .    168:   MOVW    (SI), AX
         .          .    169:   MOVB    2(SI), CX
         .          .    170:   MOVW    AX, (DI)
         .          .    171:   MOVB    CX, 2(DI)
         .          .    172:   RET
         .          .    173:move_5through7:
         .          .    174:   MOVL    (SI), AX
         .          .    175:   MOVL    -4(SI)(BX*1), CX
         .          .    176:   MOVL    AX, (DI)
         .          .    177:   MOVL    CX, -4(DI)(BX*1)
         .          .    178:   RET
         .          .    179:move_8:
         .          .    180:   // We need a separate case for 8 to make sure we write pointers atomically.
         .          .    181:   MOVQ    (SI), AX
         .          .    182:   MOVQ    AX, (DI)
         .          .    183:   RET
         .          .    184:move_9through16:
         .          .    185:   MOVQ    (SI), AX
         .          .    186:   MOVQ    -8(SI)(BX*1), CX
         .          .    187:   MOVQ    AX, (DI)
         .          .    188:   MOVQ    CX, -8(DI)(BX*1)
         .          .    189:   RET
         .          .    190:move_17through32:
         .          .    191:   MOVOU   (SI), X0
         .          .    192:   MOVOU   -16(SI)(BX*1), X1
         .          .    193:   MOVOU   X0, (DI)
         .          .    194:   MOVOU   X1, -16(DI)(BX*1)
         .          .    195:   RET
         .          .    196:move_33through64:
         .          .    197:   MOVOU   (SI), X0
         .          .    198:   MOVOU   16(SI), X1
         .          .    199:   MOVOU   -32(SI)(BX*1), X2
         .          .    200:   MOVOU   -16(SI)(BX*1), X3
         .          .    201:   MOVOU   X0, (DI)
         .          .    202:   MOVOU   X1, 16(DI)
         .          .    203:   MOVOU   X2, -32(DI)(BX*1)
         .          .    204:   MOVOU   X3, -16(DI)(BX*1)
         .          .    205:   RET
         .          .    206:move_65through128:
         .          .    207:   MOVOU   (SI), X0
         .          .    208:   MOVOU   16(SI), X1
         .          .    209:   MOVOU   32(SI), X2
         .          .    210:   MOVOU   48(SI), X3
         .          .    211:   MOVOU   -64(SI)(BX*1), X4
         .          .    212:   MOVOU   -48(SI)(BX*1), X5
         .          .    213:   MOVOU   -32(SI)(BX*1), X6
         .          .    214:   MOVOU   -16(SI)(BX*1), X7
         .          .    215:   MOVOU   X0, (DI)
         .          .    216:   MOVOU   X1, 16(DI)
         .          .    217:   MOVOU   X2, 32(DI)
         .          .    218:   MOVOU   X3, 48(DI)
         .          .    219:   MOVOU   X4, -64(DI)(BX*1)
         .          .    220:   MOVOU   X5, -48(DI)(BX*1)
         .          .    221:   MOVOU   X6, -32(DI)(BX*1)
         .          .    222:   MOVOU   X7, -16(DI)(BX*1)
         .          .    223:   RET
         .          .    224:move_129through256:
         .          .    225:   MOVOU   (SI), X0
         .          .    226:   MOVOU   16(SI), X1
         .          .    227:   MOVOU   32(SI), X2
         .          .    228:   MOVOU   48(SI), X3
         .          .    229:   MOVOU   64(SI), X4
         .          .    230:   MOVOU   80(SI), X5
         .          .    231:   MOVOU   96(SI), X6
         .          .    232:   MOVOU   112(SI), X7
         .          .    233:   MOVOU   -128(SI)(BX*1), X8
         .          .    234:   MOVOU   -112(SI)(BX*1), X9
         .          .    235:   MOVOU   -96(SI)(BX*1), X10
         .          .    236:   MOVOU   -80(SI)(BX*1), X11
         .          .    237:   MOVOU   -64(SI)(BX*1), X12
         .          .    238:   MOVOU   -48(SI)(BX*1), X13
         .          .    239:   MOVOU   -32(SI)(BX*1), X14
         .          .    240:   MOVOU   -16(SI)(BX*1), X15
         .          .    241:   MOVOU   X0, (DI)
         .          .    242:   MOVOU   X1, 16(DI)
         .          .    243:   MOVOU   X2, 32(DI)
         .          .    244:   MOVOU   X3, 48(DI)
         .          .    245:   MOVOU   X4, 64(DI)
         .          .    246:   MOVOU   X5, 80(DI)
         .          .    247:   MOVOU   X6, 96(DI)
         .          .    248:   MOVOU   X7, 112(DI)
         .          .    249:   MOVOU   X8, -128(DI)(BX*1)
         .          .    250:   MOVOU   X9, -112(DI)(BX*1)
         .          .    251:   MOVOU   X10, -96(DI)(BX*1)
         .          .    252:   MOVOU   X11, -80(DI)(BX*1)
         .          .    253:   MOVOU   X12, -64(DI)(BX*1)
         .          .    254:   MOVOU   X13, -48(DI)(BX*1)
         .          .    255:   MOVOU   X14, -32(DI)(BX*1)
         .          .    256:   MOVOU   X15, -16(DI)(BX*1)
         .          .    257:#ifdef GOEXPERIMENT_regabig
         .          .    258:   // X15 must be zero on return
         .          .    259:   PXOR    X15, X15
         .          .    260:#endif
         .          .    261:   RET
         .          .    262:move_256through2048:
         .          .    263:   SUBQ    $256, BX
         .          .    264:   MOVOU   (SI), X0
         .          .    265:   MOVOU   16(SI), X1
         .          .    266:   MOVOU   32(SI), X2
         .          .    267:   MOVOU   48(SI), X3
         .          .    268:   MOVOU   64(SI), X4
         .          .    269:   MOVOU   80(SI), X5
         .          .    270:   MOVOU   96(SI), X6
         .          .    271:   MOVOU   112(SI), X7
         .          .    272:   MOVOU   128(SI), X8
         .          .    273:   MOVOU   144(SI), X9
         .          .    274:   MOVOU   160(SI), X10
         .          .    275:   MOVOU   176(SI), X11
         .          .    276:   MOVOU   192(SI), X12
         .          .    277:   MOVOU   208(SI), X13
         .          .    278:   MOVOU   224(SI), X14
         .          .    279:   MOVOU   240(SI), X15
         .          .    280:   MOVOU   X0, (DI)
         .          .    281:   MOVOU   X1, 16(DI)
         .          .    282:   MOVOU   X2, 32(DI)
         .          .    283:   MOVOU   X3, 48(DI)
         .          .    284:   MOVOU   X4, 64(DI)
         .          .    285:   MOVOU   X5, 80(DI)
         .          .    286:   MOVOU   X6, 96(DI)
         .          .    287:   MOVOU   X7, 112(DI)
         .          .    288:   MOVOU   X8, 128(DI)
         .          .    289:   MOVOU   X9, 144(DI)
         .          .    290:   MOVOU   X10, 160(DI)
         .          .    291:   MOVOU   X11, 176(DI)
         .          .    292:   MOVOU   X12, 192(DI)
         .          .    293:   MOVOU   X13, 208(DI)
         .          .    294:   MOVOU   X14, 224(DI)
         .          .    295:   MOVOU   X15, 240(DI)
         .          .    296:   CMPQ    BX, $256
         .          .    297:   LEAQ    256(SI), SI
         .          .    298:   LEAQ    256(DI), DI
         .          .    299:   JGE     move_256through2048
         .          .    300:#ifdef GOEXPERIMENT_regabig
         .          .    301:   // X15 must be zero on return
         .          .    302:   PXOR    X15, X15
         .          .    303:#endif
         .          .    304:   JMP     tail
         .          .    305:
         .          .    306:avxUnaligned:
         .          .    307:   // There are two implementations of move algorithm.
         .          .    308:   // The first one for non-overlapped memory regions. It uses forward copying.
         .          .    309:   // The second one for overlapped regions. It uses backward copying
         .          .    310:   MOVQ    DI, CX
         .          .    311:   SUBQ    SI, CX
         .          .    312:   // Now CX contains distance between SRC and DEST
         .          .    313:   CMPQ    CX, BX
         .          .    314:   // If the distance lesser than region length it means that regions are overlapped
         .          .    315:   JC      copy_backward
         .          .    316:
         .          .    317:   // Non-temporal copy would be better for big sizes.
         .          .    318:   CMPQ    BX, $0x100000
         .          .    319:   JAE     gobble_big_data_fwd
         .          .    320:
         .          .    321:   // Memory layout on the source side
         .          .    322:   // SI                                       CX
         .          .    323:   // |<---------BX before correction--------->|
         .          .    324:   // |       |<--BX corrected-->|             |
         .          .    325:   // |       |                  |<--- AX  --->|
         .          .    326:   // |<-R11->|                  |<-128 bytes->|
         .          .    327:   // +----------------------------------------+
         .          .    328:   // | Head  | Body             | Tail        |
         .          .    329:   // +-------+------------------+-------------+
         .          .    330:   // ^       ^                  ^
         .          .    331:   // |       |                  |
         .          .    332:   // Save head into Y4          Save tail into X5..X12
         .          .    333:   //         |
         .          .    334:   //         SI+R11, where R11 = ((DI & -32) + 32) - DI
         .          .    335:   // Algorithm:
         .          .    336:   // 1. Unaligned save of the tail's 128 bytes
         .          .    337:   // 2. Unaligned save of the head's 32  bytes
         .          .    338:   // 3. Destination-aligned copying of body (128 bytes per iteration)
         .          .    339:   // 4. Put head on the new place
         .          .    340:   // 5. Put the tail on the new place
         .          .    341:   // It can be important to satisfy processor's pipeline requirements for
         .          .    342:   // small sizes as the cost of unaligned memory region copying is
         .          .    343:   // comparable with the cost of main loop. So code is slightly messed there.
         .          .    344:   // There is more clean implementation of that algorithm for bigger sizes
         .          .    345:   // where the cost of unaligned part copying is negligible.
         .          .    346:   // You can see it after gobble_big_data_fwd label.
         .          .    347:   LEAQ    (SI)(BX*1), CX
         .          .    348:   MOVQ    DI, R10
         .          .    349:   // CX points to the end of buffer so we need go back slightly. We will use negative offsets there.
     2.97s      2.97s    350:   MOVOU   -0x80(CX), X5
      10ms       10ms    351:   MOVOU   -0x70(CX), X6
         .          .    352:   MOVQ    $0x80, AX
         .          .    353:   // Align destination address
         .          .    354:   ANDQ    $-32, DI
         .          .    355:   ADDQ    $32, DI
         .          .    356:   // Continue tail saving.
         .          .    357:   MOVOU   -0x60(CX), X7
         .          .    358:   MOVOU   -0x50(CX), X8
         .          .    359:   // Make R11 delta between aligned and unaligned destination addresses.
         .          .    360:   MOVQ    DI, R11
         .          .    361:   SUBQ    R10, R11
         .          .    362:   // Continue tail saving.
         .          .    363:   MOVOU   -0x40(CX), X9
         .          .    364:   MOVOU   -0x30(CX), X10
         .          .    365:   // Let's make bytes-to-copy value adjusted as we've prepared unaligned part for copying.
         .          .    366:   SUBQ    R11, BX
         .          .    367:   // Continue tail saving.
         .          .    368:   MOVOU   -0x20(CX), X11
         .          .    369:   MOVOU   -0x10(CX), X12
         .          .    370:   // The tail will be put on its place after main body copying.
         .          .    371:   // It's time for the unaligned heading part.
         .          .    372:   VMOVDQU (SI), Y4
         .          .    373:   // Adjust source address to point past head.
         .          .    374:   ADDQ    R11, SI
         .          .    375:   SUBQ    AX, BX
         .          .    376:   // Aligned memory copying there
         .          .    377:gobble_128_loop:
     570ms      570ms    378:   VMOVDQU (SI), Y0
     780ms      780ms    379:   VMOVDQU 0x20(SI), Y1
     890ms      890ms    380:   VMOVDQU 0x40(SI), Y2
     630ms      630ms    381:   VMOVDQU 0x60(SI), Y3
     100ms      100ms    382:   ADDQ    AX, SI
         .          .    383:   VMOVDQA Y0, (DI)
         .          .    384:   VMOVDQA Y1, 0x20(DI)
         .          .    385:   VMOVDQA Y2, 0x40(DI)
      30ms       30ms    386:   VMOVDQA Y3, 0x60(DI)
      20ms       20ms    387:   ADDQ    AX, DI
         .          .    388:   SUBQ    AX, BX
         .          .    389:   JA      gobble_128_loop
         .          .    390:   // Now we can store unaligned parts.
         .          .    391:   ADDQ    AX, BX
         .          .    392:   ADDQ    DI, BX
(pprof)

custom.prof

PS C:\Users\a3367\Desktop\tmp\prof> go tool pprof .\custom.prof
Type: cpu
Time: Apr 1, 2022 at 2:40pm (CST)
Duration: 8.05s, Total samples = 4.40s (54.68%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list memmove
Total: 4.40s
ROUTINE ======================== runtime.memmove in C:\Program Files\Go\src\runtime\memmove_amd64.s
     540ms      540ms (flat, cum) 12.27% of Total
         .          .    194:   MOVOU   X1, -16(DI)(BX*1)
         .          .    195:   RET
         .          .    196:move_33through64:
         .          .    197:   MOVOU   (SI), X0
         .          .    198:   MOVOU   16(SI), X1
      10ms       10ms    199:   MOVOU   -32(SI)(BX*1), X2
         .          .    200:   MOVOU   -16(SI)(BX*1), X3
         .          .    201:   MOVOU   X0, (DI)
         .          .    202:   MOVOU   X1, 16(DI)
         .          .    203:   MOVOU   X2, -32(DI)(BX*1)
      10ms       10ms    204:   MOVOU   X3, -16(DI)(BX*1)
         .          .    205:   RET
         .          .    206:move_65through128:
         .          .    207:   MOVOU   (SI), X0
         .          .    208:   MOVOU   16(SI), X1
         .          .    209:   MOVOU   32(SI), X2
         .          .    210:   MOVOU   48(SI), X3
         .          .    211:   MOVOU   -64(SI)(BX*1), X4
         .          .    212:   MOVOU   -48(SI)(BX*1), X5
         .          .    213:   MOVOU   -32(SI)(BX*1), X6
         .          .    214:   MOVOU   -16(SI)(BX*1), X7
         .          .    215:   MOVOU   X0, (DI)
         .          .    216:   MOVOU   X1, 16(DI)
         .          .    217:   MOVOU   X2, 32(DI)
         .          .    218:   MOVOU   X3, 48(DI)
         .          .    219:   MOVOU   X4, -64(DI)(BX*1)
         .          .    220:   MOVOU   X5, -48(DI)(BX*1)
         .          .    221:   MOVOU   X6, -32(DI)(BX*1)
         .          .    222:   MOVOU   X7, -16(DI)(BX*1)
         .          .    223:   RET
         .          .    224:move_129through256:
         .          .    225:   MOVOU   (SI), X0
         .          .    226:   MOVOU   16(SI), X1
         .          .    227:   MOVOU   32(SI), X2
         .          .    228:   MOVOU   48(SI), X3
         .          .    229:   MOVOU   64(SI), X4
         .          .    230:   MOVOU   80(SI), X5
         .          .    231:   MOVOU   96(SI), X6
         .          .    232:   MOVOU   112(SI), X7
         .          .    233:   MOVOU   -128(SI)(BX*1), X8
         .          .    234:   MOVOU   -112(SI)(BX*1), X9
         .          .    235:   MOVOU   -96(SI)(BX*1), X10
         .          .    236:   MOVOU   -80(SI)(BX*1), X11
         .          .    237:   MOVOU   -64(SI)(BX*1), X12
         .          .    238:   MOVOU   -48(SI)(BX*1), X13
         .          .    239:   MOVOU   -32(SI)(BX*1), X14
         .          .    240:   MOVOU   -16(SI)(BX*1), X15
         .          .    241:   MOVOU   X0, (DI)
         .          .    242:   MOVOU   X1, 16(DI)
         .          .    243:   MOVOU   X2, 32(DI)
         .          .    244:   MOVOU   X3, 48(DI)
         .          .    245:   MOVOU   X4, 64(DI)
         .          .    246:   MOVOU   X5, 80(DI)
         .          .    247:   MOVOU   X6, 96(DI)
         .          .    248:   MOVOU   X7, 112(DI)
         .          .    249:   MOVOU   X8, -128(DI)(BX*1)
         .          .    250:   MOVOU   X9, -112(DI)(BX*1)
         .          .    251:   MOVOU   X10, -96(DI)(BX*1)
         .          .    252:   MOVOU   X11, -80(DI)(BX*1)
         .          .    253:   MOVOU   X12, -64(DI)(BX*1)
         .          .    254:   MOVOU   X13, -48(DI)(BX*1)
         .          .    255:   MOVOU   X14, -32(DI)(BX*1)
         .          .    256:   MOVOU   X15, -16(DI)(BX*1)
         .          .    257:#ifdef GOEXPERIMENT_regabig
         .          .    258:   // X15 must be zero on return
         .          .    259:   PXOR    X15, X15
         .          .    260:#endif
         .          .    261:   RET
         .          .    262:move_256through2048:
         .          .    263:   SUBQ    $256, BX
         .          .    264:   MOVOU   (SI), X0
         .          .    265:   MOVOU   16(SI), X1
         .          .    266:   MOVOU   32(SI), X2
         .          .    267:   MOVOU   48(SI), X3
         .          .    268:   MOVOU   64(SI), X4
         .          .    269:   MOVOU   80(SI), X5
         .          .    270:   MOVOU   96(SI), X6
         .          .    271:   MOVOU   112(SI), X7
         .          .    272:   MOVOU   128(SI), X8
         .          .    273:   MOVOU   144(SI), X9
         .          .    274:   MOVOU   160(SI), X10
         .          .    275:   MOVOU   176(SI), X11
         .          .    276:   MOVOU   192(SI), X12
         .          .    277:   MOVOU   208(SI), X13
         .          .    278:   MOVOU   224(SI), X14
         .          .    279:   MOVOU   240(SI), X15
         .          .    280:   MOVOU   X0, (DI)
         .          .    281:   MOVOU   X1, 16(DI)
         .          .    282:   MOVOU   X2, 32(DI)
         .          .    283:   MOVOU   X3, 48(DI)
         .          .    284:   MOVOU   X4, 64(DI)
         .          .    285:   MOVOU   X5, 80(DI)
         .          .    286:   MOVOU   X6, 96(DI)
         .          .    287:   MOVOU   X7, 112(DI)
         .          .    288:   MOVOU   X8, 128(DI)
         .          .    289:   MOVOU   X9, 144(DI)
         .          .    290:   MOVOU   X10, 160(DI)
         .          .    291:   MOVOU   X11, 176(DI)
         .          .    292:   MOVOU   X12, 192(DI)
         .          .    293:   MOVOU   X13, 208(DI)
         .          .    294:   MOVOU   X14, 224(DI)
         .          .    295:   MOVOU   X15, 240(DI)
         .          .    296:   CMPQ    BX, $256
         .          .    297:   LEAQ    256(SI), SI
         .          .    298:   LEAQ    256(DI), DI
         .          .    299:   JGE     move_256through2048
         .          .    300:#ifdef GOEXPERIMENT_regabig
         .          .    301:   // X15 must be zero on return
         .          .    302:   PXOR    X15, X15
         .          .    303:#endif
         .          .    304:   JMP     tail
         .          .    305:
         .          .    306:avxUnaligned:
         .          .    307:   // There are two implementations of move algorithm.
         .          .    308:   // The first one for non-overlapped memory regions. It uses forward copying.
         .          .    309:   // The second one for overlapped regions. It uses backward copying
         .          .    310:   MOVQ    DI, CX
         .          .    311:   SUBQ    SI, CX
         .          .    312:   // Now CX contains distance between SRC and DEST
         .          .    313:   CMPQ    CX, BX
         .          .    314:   // If the distance lesser than region length it means that regions are overlapped
         .          .    315:   JC      copy_backward
         .          .    316:
         .          .    317:   // Non-temporal copy would be better for big sizes.
         .          .    318:   CMPQ    BX, $0x100000
         .          .    319:   JAE     gobble_big_data_fwd
         .          .    320:
         .          .    321:   // Memory layout on the source side
         .          .    322:   // SI                                       CX
         .          .    323:   // |<---------BX before correction--------->|
         .          .    324:   // |       |<--BX corrected-->|             |
         .          .    325:   // |       |                  |<--- AX  --->|
         .          .    326:   // |<-R11->|                  |<-128 bytes->|
         .          .    327:   // +----------------------------------------+
         .          .    328:   // | Head  | Body             | Tail        |
         .          .    329:   // +-------+------------------+-------------+
         .          .    330:   // ^       ^                  ^
         .          .    331:   // |       |                  |
         .          .    332:   // Save head into Y4          Save tail into X5..X12
         .          .    333:   //         |
         .          .    334:   //         SI+R11, where R11 = ((DI & -32) + 32) - DI
         .          .    335:   // Algorithm:
         .          .    336:   // 1. Unaligned save of the tail's 128 bytes
         .          .    337:   // 2. Unaligned save of the head's 32  bytes
         .          .    338:   // 3. Destination-aligned copying of body (128 bytes per iteration)
         .          .    339:   // 4. Put head on the new place
         .          .    340:   // 5. Put the tail on the new place
         .          .    341:   // It can be important to satisfy processor's pipeline requirements for
         .          .    342:   // small sizes as the cost of unaligned memory region copying is
         .          .    343:   // comparable with the cost of main loop. So code is slightly messed there.
         .          .    344:   // There is more clean implementation of that algorithm for bigger sizes
         .          .    345:   // where the cost of unaligned part copying is negligible.
         .          .    346:   // You can see it after gobble_big_data_fwd label.
         .          .    347:   LEAQ    (SI)(BX*1), CX
         .          .    348:   MOVQ    DI, R10
         .          .    349:   // CX points to the end of buffer so we need go back slightly. We will use negative offsets there.
      20ms       20ms    350:   MOVOU   -0x80(CX), X5
         .          .    351:   MOVOU   -0x70(CX), X6
         .          .    352:   MOVQ    $0x80, AX
         .          .    353:   // Align destination address
         .          .    354:   ANDQ    $-32, DI
         .          .    355:   ADDQ    $32, DI
         .          .    356:   // Continue tail saving.
         .          .    357:   MOVOU   -0x60(CX), X7
         .          .    358:   MOVOU   -0x50(CX), X8
         .          .    359:   // Make R11 delta between aligned and unaligned destination addresses.
         .          .    360:   MOVQ    DI, R11
         .          .    361:   SUBQ    R10, R11
         .          .    362:   // Continue tail saving.
         .          .    363:   MOVOU   -0x40(CX), X9
         .          .    364:   MOVOU   -0x30(CX), X10
         .          .    365:   // Let's make bytes-to-copy value adjusted as we've prepared unaligned part for copying.
         .          .    366:   SUBQ    R11, BX
         .          .    367:   // Continue tail saving.
         .          .    368:   MOVOU   -0x20(CX), X11
         .          .    369:   MOVOU   -0x10(CX), X12
         .          .    370:   // The tail will be put on its place after main body copying.
         .          .    371:   // It's time for the unaligned heading part.
         .          .    372:   VMOVDQU (SI), Y4
         .          .    373:   // Adjust source address to point past head.
         .          .    374:   ADDQ    R11, SI
         .          .    375:   SUBQ    AX, BX
         .          .    376:   // Aligned memory copying there
         .          .    377:gobble_128_loop:
      70ms       70ms    378:   VMOVDQU (SI), Y0
     120ms      120ms    379:   VMOVDQU 0x20(SI), Y1
      80ms       80ms    380:   VMOVDQU 0x40(SI), Y2
     100ms      100ms    381:   VMOVDQU 0x60(SI), Y3
     100ms      100ms    382:   ADDQ    AX, SI
         .          .    383:   VMOVDQA Y0, (DI)
      10ms       10ms    384:   VMOVDQA Y1, 0x20(DI)
      10ms       10ms    385:   VMOVDQA Y2, 0x40(DI)
         .          .    386:   VMOVDQA Y3, 0x60(DI)
      10ms       10ms    387:   ADDQ    AX, DI
         .          .    388:   SUBQ    AX, BX
         .          .    389:   JA      gobble_128_loop
         .          .    390:   // Now we can store unaligned parts.
         .          .    391:   ADDQ    AX, BX
         .          .    392:   ADDQ    DI, BX
(pprof)

@nigeltao
Contributor

@nigeltao nigeltao commented Apr 7, 2022

@ianlancetaylor to tweak my earlier comment, the src is given to us as is, and we'd like to allocate (or allocate-and-subslice) a dst such that &src[0] & 7 == &dst[0] & 7.

So... allocate too many bytes, tooMuch := make([]byte, N+7), and subslice, dst := tooMuch[i:i+N], where i depends on &src[0] and &tooMuch[0], sprinkling in uintptr, unsafe.Pointer and & 7 as needed.

@ianlancetaylor
Contributor

@ianlancetaylor ianlancetaylor commented Apr 7, 2022

SGTM

@fumin
Author

@fumin fumin commented Apr 14, 2022

In case anyone hits the same bottleneck, here is a solution limited to image.NRGBA:
https://pkg.go.dev/github.com/fumin/png
