
[arm64] Performance regression from removal of 4x loop decoding #205

Closed
lizthegrey opened this issue Apr 11, 2023 · 2 comments · Fixed by #207
lizthegrey commented Apr 11, 2023

1cbdd81 appears to have removed the unrolled 4x copy, meaning that once the remainder of a match is less than 8 bytes, the finishing copy may perform up to 7 individual unaligned byte copies, rather than the maximum of 3 before that change. Profiling shows 10% of all time spent in the lz4 library going to the one line at https://github.com/pierrec/lz4/blob/v4/internal/lz4block/decode_arm64.s#L206

I suspect we need to put that aligned 4x unrolled access back, do you concur, @greatroar?
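For context on why the byte loop exists at all: when the match offset is smaller than the match length, the source and destination regions overlap, and the decoder must store each byte before loading the next so the preceding pattern is replicated. A minimal Go sketch of that semantics (the helper name is mine, not the repo's actual code):

```go
package main

import "fmt"

// copyMatchBytewise is an illustrative sketch of the copyMatchLoop1
// semantics: each byte is stored before the next one is loaded, so a
// match whose offset is smaller than its length replicates the
// preceding pattern rather than memmove'ing it.
func copyMatchBytewise(dst []byte, pos, offset, length int) {
	for i := 0; i < length; i++ {
		dst[pos+i] = dst[pos+i-offset]
	}
}

func main() {
	buf := []byte("ab????????")
	copyMatchBytewise(buf, 2, 2, 8) // offset (2) < length (8): repeats "ab"
	fmt.Println(string(buf))        // ababababab
}
```

This is also why the assembly gates the doubleword loop on both len and offset being at least 8: an 8-byte load/store pair over an overlapping region would not produce the repeated pattern.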

@lizthegrey commented:
Pinging @pierrec and @greatroar again for thoughts; this is quite bad IMO.

  Total:    1174.89s   1174.89s (flat, cum) 10.73%
     31            .          .           // func decodeBlock(dst, src, dict []byte) int 
     32            .          .           TEXT ·decodeBlock(SB), NOFRAME+NOSPLIT, $0-80 
[...]
    183            .          .           copyMatchTry8: 
    184            .          .           	// Copy doublewords if both len and offset are at least eight. 
    185            .          .           	// A 16-at-a-time loop doesn't provide a further speedup. 
    186       30.24s     30.24s           	CMP  $8, len 
    187        2.34s      2.34s           	CCMP HS, offset, $8, $0 
    188        250ms      250ms           	BLO  copyMatchLoop1 
    189            .          .            
    190        4.98s      4.98s           	AND    $7, len, lenRem 
    191        9.83s      9.83s           	SUB    $8, len 
    192            .          .           copyMatchLoop8: 
    193        3.85s      3.85s           	MOVD.P 8(match), tmp1 
    194       62.76s     62.76s           	MOVD.P tmp1, 8(dst) 
    195        2.55s      2.55s           	SUBS   $8, len 
    196        1.93s      1.93s           	BPL    copyMatchLoop8 
    197            .          .            
    198        910ms      910ms           	MOVD (match)(len), tmp2 // match+len == match+lenRem-8. 
    199       24.24s     24.24s           	ADD  lenRem, dst 
    200         60ms       60ms           	MOVD $0, len 
    201        440ms      440ms           	MOVD tmp2, -8(dst) 
    202        550ms      550ms           	B    copyMatchDone 
    203            .          .            
    204            .          .           copyMatchLoop1: 
    205            .          .           	// Byte-at-a-time copy for small offsets. 
    206      132.84s    132.84s           	MOVBU.P 1(match), tmp2 
    207      177.05s    177.05s           	MOVB.P  tmp2, 1(dst) 
    208        8.42s      8.42s           	SUBS    $1, len 
    209        5.27s      5.27s           	BNE     copyMatchLoop1 
    210            .          .            
    211            .          .           copyMatchDone: 
    212       18.42s     18.42s           	CMP src, srcend 
    213       12.36s     12.36s           	BNE loop 
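One possible shape for the fix, sketched in Go rather than assembly (a hypothetical illustration of the idea, not the repo's actual code): reinstating a 4-byte step between the 8-byte doubleword loop and the byte loop caps the byte-at-a-time tail at 3 iterations instead of 7.

```go
package main

import "fmt"

// copyMatch4x is a hypothetical sketch of the proposed unrolling: copy
// 4-byte chunks while both the remaining length and the offset allow it,
// then finish byte-at-a-time. The word step is gated on offset >= 4 so
// that overlapping short-offset matches still fall through to the byte
// loop, which replicates the pattern correctly.
func copyMatch4x(dst []byte, pos, offset, length int) {
	for length >= 4 && offset >= 4 {
		// Chunk does not overlap its source: safe to copy word-wise.
		copy(dst[pos:pos+4], dst[pos-offset:pos-offset+4])
		pos += 4
		length -= 4
	}
	// At most 3 iterations remain here when offset >= 4.
	for ; length > 0; length-- {
		dst[pos] = dst[pos-offset]
		pos++
	}
}

func main() {
	buf := []byte("abcd??????????")
	copyMatch4x(buf, 4, 4, 10) // 10-byte match, offset 4
	fmt.Println(string(buf))   // abcdabcdabcdab
}
```

The same trick in the assembly would mirror the existing copyMatchLoop8 structure with MOVWU.P/MOVW.P before falling into copyMatchLoop1.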

evanphx commented Jan 6, 2024

Hi all, I know this is closed but some work over the holidays led me to dig into this more and I wanted to get y'all's thoughts on #215, directly relevant to this sort of unrolling. Thanks!
