
[arm64] Performance regression from removal of 4x loop decoding #205

Closed
lizthegrey opened this issue Apr 11, 2023 · 2 comments · Fixed by #207
lizthegrey commented Apr 11, 2023

1cbdd81 appears to have removed the unrolled 4x copy, meaning that once the remainder of a match is less than 8 bytes, the finishing copy may perform up to 7 individual unaligned byte copies, rather than the maximum of 3 before that change. Profiling shows 10% of all time spent in the lz4 library going to the one line at https://github.com/pierrec/lz4/blob/v4/internal/lz4block/decode_arm64.s#L206

I suspect we need to put that aligned 4x unrolled access back, do you concur, @greatroar?
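For context on why the byte loop exists at all: when the match offset is smaller than the match length, the source and destination regions overlap, and the decoder must store each byte before loading the next so the preceding pattern is replicated. A minimal Go sketch of that semantics (the helper name is mine, not the repo's actual code):

```go
package main

import "fmt"

// copyMatchBytewise is an illustrative sketch of the copyMatchLoop1
// semantics: each byte is stored before the next one is loaded, so a
// match whose offset is smaller than its length replicates the
// preceding pattern rather than memmove'ing it.
func copyMatchBytewise(dst []byte, pos, offset, length int) {
	for i := 0; i < length; i++ {
		dst[pos+i] = dst[pos+i-offset]
	}
}

func main() {
	buf := []byte("ab????????")
	copyMatchBytewise(buf, 2, 2, 8) // offset (2) < length (8): repeats "ab"
	fmt.Println(string(buf))        // ababababab
}
```

This is also why the assembly gates the doubleword loop on both len and offset being at least 8: an 8-byte load/store pair over an overlapping region would not produce the repeated pattern.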

@lizthegrey commented:
Pinging @pierrec and @greatroar again for thoughts; this is quite bad IMO.

  Total:    1174.89s   1174.89s (flat, cum) 10.73%
     31            .          .           // func decodeBlock(dst, src, dict []byte) int 
     32            .          .           TEXT ·decodeBlock(SB), NOFRAME+NOSPLIT, $0-80 
[...]
    183            .          .           copyMatchTry8: 
    184            .          .           	// Copy doublewords if both len and offset are at least eight. 
    185            .          .           	// A 16-at-a-time loop doesn't provide a further speedup. 
    186       30.24s     30.24s           	CMP  $8, len 
    187        2.34s      2.34s           	CCMP HS, offset, $8, $0 
    188        250ms      250ms           	BLO  copyMatchLoop1 
    189            .          .            
    190        4.98s      4.98s           	AND    $7, len, lenRem 
    191        9.83s      9.83s           	SUB    $8, len 
    192            .          .           copyMatchLoop8: 
    193        3.85s      3.85s           	MOVD.P 8(match), tmp1 
    194       62.76s     62.76s           	MOVD.P tmp1, 8(dst) 
    195        2.55s      2.55s           	SUBS   $8, len 
    196        1.93s      1.93s           	BPL    copyMatchLoop8 
    197            .          .            
    198        910ms      910ms           	MOVD (match)(len), tmp2 // match+len == match+lenRem-8. 
    199       24.24s     24.24s           	ADD  lenRem, dst 
    200         60ms       60ms           	MOVD $0, len 
    201        440ms      440ms           	MOVD tmp2, -8(dst) 
    202        550ms      550ms           	B    copyMatchDone 
    203            .          .            
    204            .          .           copyMatchLoop1: 
    205            .          .           	// Byte-at-a-time copy for small offsets. 
    206      132.84s    132.84s           	MOVBU.P 1(match), tmp2 
    207      177.05s    177.05s           	MOVB.P  tmp2, 1(dst) 
    208        8.42s      8.42s           	SUBS    $1, len 
    209        5.27s      5.27s           	BNE     copyMatchLoop1 
    210            .          .            
    211            .          .           copyMatchDone: 
    212       18.42s     18.42s           	CMP src, srcend 
    213       12.36s     12.36s           	BNE loop 
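One possible shape for the fix, sketched in Go rather than assembly (a hypothetical illustration of the idea, not the repo's actual code): reinstating a 4-byte step between the 8-byte doubleword loop and the byte loop caps the byte-at-a-time tail at 3 iterations instead of 7.

```go
package main

import "fmt"

// copyMatch4x is a hypothetical sketch of the proposed unrolling: copy
// 4-byte chunks while both the remaining length and the offset allow it,
// then finish byte-at-a-time. The word step is gated on offset >= 4 so
// that overlapping short-offset matches still fall through to the byte
// loop, which replicates the pattern correctly.
func copyMatch4x(dst []byte, pos, offset, length int) {
	for length >= 4 && offset >= 4 {
		// Chunk does not overlap its source: safe to copy word-wise.
		copy(dst[pos:pos+4], dst[pos-offset:pos-offset+4])
		pos += 4
		length -= 4
	}
	// At most 3 iterations remain here when offset >= 4.
	for ; length > 0; length-- {
		dst[pos] = dst[pos-offset]
		pos++
	}
}

func main() {
	buf := []byte("abcd??????????")
	copyMatch4x(buf, 4, 4, 10) // 10-byte match, offset 4
	fmt.Println(string(buf))   // abcdabcdabcdab
}
```

The same trick in the assembly would mirror the existing copyMatchLoop8 structure with MOVWU.P/MOVW.P before falling into copyMatchLoop1.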

evanphx commented Jan 6, 2024

Hi all, I know this is closed but some work over the holidays led me to dig into this more and I wanted to get y'all's thoughts on #215, directly relevant to this sort of unrolling. Thanks!
