Second tuning for thomaswue #263

thomaswue · 2024-01-09T15:09:58Z

Check List:

- Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
- All formatting changes by the build are committed
- Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
- Output matches that of calculate_average_baseline.sh

Execution time: 0.80s
Execution time of reference implementation: 120.37s

This incrementally improves on the previous version: it now processes one long at a time, both when searching for the delimiter and when checking for hash collisions. That is ~15% faster than the previous version on my machine, and it now runs in < 800ms.

Update: Incorporated @merykitty's code for branchless number parsing; it now runs in 0.76s on my machine, a gain of another 5%.

Update 2: After adding @mukel's suggestion and an additional optimization for the name comparison, it is down to 0.70s on my machine, so an additional 9% faster.

thomaswue · 2024-01-09T15:15:04Z

@merykitty Your clever number parsing trick is the major aspect missing here I think. Let's join forces ;-).

merykitty · 2024-01-09T15:48:08Z

@thomaswue Yes I think it would be helpful, since the branches are quite unpredictable.

thomaswue · 2024-01-09T15:51:21Z

I can try to adopt it, but would like to add you as a co-author of the submission if it gains.

gunnarmorling · 2024-01-09T15:54:18Z

That's awesome, loving this spirit! Exactly what I was hoping to get out of it, folks exchanging with and inspiring each other, which IMO is way more important than who ends up on number #1. Way to go! Still working on rebootstrapping the new eval machine, will get to running this (and all the other pending entries) later this week. Thanks for your patience!

jkroepke · 2024-01-09T16:00:09Z

@gunnarmorling sorry for cross-posting here, but what do you think about running a GitHub runner on the eval machine and integrating that runner into the CI of the repo?

merykitty · 2024-01-09T16:03:05Z

@thomaswue It would be my pleasure, thanks very much

gunnarmorling · 2024-01-09T16:05:09Z

@jkroepke, I think that's an interesting idea. Could you open an issue for discussing it separately? Thx!

artsiomkorzun · 2024-01-09T17:45:47Z

src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java

+            long hash = 0;
+
+            // Search for ';', one long at a time.
+            long word = UNSAFE.getLong(scanPtr);


This read can go past the end of the file. For example, suppose the last entry in the file is "X;0.0\n". That row is only 6 bytes long, so the 8-byte read runs 2 bytes past the end of the file. If the file size is a multiple of the page size, the read touches the next page, and if that page is protected, the result is a segfault. I can't reproduce it locally, though: I always have ~136K of memory after the end of the mapped region that I can access without getting a segfault. Maybe you have an explanation?

A workaround is to simply copy the tail into a padded buffer (the equality check has the same flaw, and we still want it to read whole words).

thomaswue · Jan 9, 2024

Yes, I didn't think about that. It is very unlikely that a page boundary falls exactly there, but I will have to add the workaround for that last row.
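The padding workaround discussed above can be sketched as follows. This is only an illustration under assumptions: the PR reads mapped memory through `UNSAFE.getLong`, whereas this sketch uses a heap `byte[]` and a little-endian `ByteBuffer` so that it runs anywhere, and the names `PaddedTail` and `copyTailWithPadding` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PaddedTail {
    // Number of bytes consumed by one word-sized read.
    static final int WORD = Long.BYTES;

    // Copies the last tailLength bytes of data into a fresh buffer that has
    // WORD extra zero bytes at the end, so an 8-byte read starting at any
    // position inside the tail stays within the buffer.
    static ByteBuffer copyTailWithPadding(byte[] data, int tailLength) {
        int from = Math.max(0, data.length - tailLength);
        // Arrays.copyOfRange zero-fills positions beyond the source array.
        byte[] padded = Arrays.copyOfRange(data, from, data.length + WORD);
        return ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        byte[] file = "A;1.0\nX;0.0\n".getBytes(StandardCharsets.US_ASCII);
        // The last row is 6 bytes, shorter than one 8-byte word.
        ByteBuffer tail = copyTailWithPadding(file, 6);
        long word = tail.getLong(0); // safe: the 2 missing bytes are padding
        System.out.println(Long.toHexString(word));
    }
}
```

Only the final row (or the final few rows) would need to go through the padded copy; the bulk of the file can still be scanned in place.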

BSchneppe · 2024-01-09T17:54:58Z

src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java

-                        }
-                    }
-                    if (result) {
+                    if (((UNSAFE.getLong(existingResult.nameAddress + i) ^ UNSAFE.getLong(nameAddress + i)) << (64 - (nameLength - i) << 3)) == 0) {
                        existingResult.min = (short) Math.min(existingResult.min, number);


Looks like these casts are now unnecessary, as in lines 69-70.

thomaswue · Jan 9, 2024

Yes, agreed, thanks for spotting. They are left over from the last version, and now they might actually add an extra instruction!
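The masked comparison in the diff above compares the last, partial word of two names by XOR-ing both words and shifting out the bytes that lie beyond the name. A minimal sketch of that idea, assuming little-endian loads and using heap arrays instead of `UNSAFE` (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TailCompare {
    // Little-endian 8-byte load; buf must have 8 readable bytes at offset.
    static long loadWord(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    // Returns true if the first `remaining` bytes (1..8) at `offset` are equal
    // in both arrays, ignoring whatever follows them in the same word.
    static boolean tailEquals(byte[] a, byte[] b, int offset, int remaining) {
        long diff = loadWord(a, offset) ^ loadWord(b, offset);
        // In a little-endian word the bytes after the name sit in the high
        // bits, so shifting left by (8 - remaining) bytes discards them.
        return (diff << ((Long.BYTES - remaining) << 3)) == 0;
    }

    // Test helper: ASCII bytes followed by 8 zero bytes so word loads fit.
    static byte[] padded(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return Arrays.copyOf(bytes, bytes.length + Long.BYTES);
    }

    public static void main(String[] args) {
        byte[] a = padded("Oslo;1.2");
        byte[] b = padded("Oslo;9.9");
        System.out.println(tailEquals(a, b, 0, 4)); // names match, values differ
        System.out.println(tailEquals(a, b, 0, 8)); // whole words differ
    }
}
```

One detail worth noting about the expression in the diff: `64 - (nameLength - i) << 3` parses as `(64 - (nameLength - i)) << 3`, because `-` binds tighter than `<<` in Java. For 1 to 8 remaining bytes this still yields the same shift as `(8 - remaining) << 3`, since Java only uses the low six bits of the shift distance for `long` operands.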

thomaswue · 2024-01-09T19:38:47Z

> @thomaswue It would be my pleasure, thanks very much

@gunnarmorling The PR now includes the number parsing code from @merykitty. Please add him as a co-author of the solution. It now runs in 0.76s on my machine, so this gave another ~5%.
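The parsing code itself is not quoted in this thread. As a rough sketch of how a temperature can be parsed from one 8-byte word without branches, the following assumes values of the form `-?\d?\d\.\d` (one fractional digit, at most two integer digits) and little-endian byte order. Both are assumptions, as are the class and method names; this is not presented as the exact code in the PR.

```java
import java.nio.charset.StandardCharsets;

final class BranchlessTemperature {
    // Packs up to 8 ASCII bytes into a long the way a little-endian
    // 8-byte memory read would see them (first character in the low byte).
    static long toWord(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        long word = 0;
        for (int i = 0; i < Math.min(bytes.length, Long.BYTES); i++) {
            word |= (bytes[i] & 0xFFL) << (8 * i);
        }
        return word;
    }

    // Parses "-?d?d.d" at the start of the word and returns the value in tenths.
    static int parseTenths(long word) {
        // ASCII digits have bit 4 set, '.' does not, so the lowest set bit
        // of the masked complement marks the '.' in byte 1, 2 or 3
        // (bit position 12, 20 or 28).
        int decimalSepPos = Long.numberOfTrailingZeros(~word & 0x10101000L);
        int shift = 28 - decimalSepPos;
        // '-' also has bit 4 clear: signed is -1 for negative input, else 0.
        long signed = (~word << 59) >> 63;
        // Clear the '-' byte if present, align the digits to fixed byte
        // positions, and keep only the low nibble of each digit byte.
        long signMask = ~(signed & 0xFF);
        long digits = ((word & signMask) << shift) & 0x0F000F0F00L;
        // digits = hundreds * 2^8 + tens * 2^16 + units * 2^32. Multiplying by
        // 100 * 2^24 + 10 * 2^16 + 1 leaves 100h + 10t + u in bits 32..41.
        long absValue = ((digits * 0x640A0001L) >>> 32) & 0x3FF;
        // Conditional negation without a branch.
        return (int) ((absValue ^ signed) - signed);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "12.3\n", "-1.2\n", "0.0\n", "-99.9\n" }) {
            System.out.println(s.trim() + " -> " + parseTenths(toWord(s)));
        }
    }
}
```

The point of this style is that the sign and the number of digits are handled with masks and shifts rather than `if` statements, which matters when those branches are hard to predict.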

mukel · 2024-01-09T22:47:29Z

src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java

+    private static int findDelimiter(long word) {
+        long input = word ^ 0x3B3B3B3B3B3B3B3BL;
+        long tmp = (input & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
+        tmp = ~(tmp | input | 0x7F7F7F7F7F7F7F7FL);
+        return Long.numberOfTrailingZeros(tmp) >>> 3;
+    }


Can be micro-optimized as follows:

    private static int findDelimiter(long word) {
        long input = word ^ 0x3B3B3B3B3B3B3B3BL;
        long tmp = (input - 0x0101010101010101L) & ~input & 0x8080808080808080L;
        return Long.numberOfTrailingZeros(tmp) >>> 3;
    }

Which saves 1 instruction.
Before: 139,978,244,139 instructions:u
With this patch: 134,391,924,137 instructions:u

thomaswue · Jan 10, 2024

Yes, indeed, thanks. This gained almost another 2%.

thomaswue · Jan 10, 2024

@gunnarmorling This is already the second contribution from @mukel. Could you add him as a co-author of the submission as well, in case you re-evaluate?
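To make the two `findDelimiter` variants from this review thread easy to compare, here is a small self-contained harness. The two method bodies are taken from the snippets above; the class name, the method names `findDelimiterOriginal` and `findDelimiterShort`, and the `toWord` helper are hypothetical additions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

final class FindDelimiter {
    // ';' (0x3B) repeated in every byte; XOR turns matching bytes into zero.
    static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL;

    // Version from the PR diff: exact zero-byte detection.
    static int findDelimiterOriginal(long word) {
        long input = word ^ SEMICOLONS;
        long tmp = (input & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
        tmp = ~(tmp | input | 0x7F7F7F7F7F7F7F7FL);
        return Long.numberOfTrailingZeros(tmp) >>> 3;
    }

    // Suggested shorter version: may flag bytes above the first zero byte,
    // but the lowest flagged byte is always the first real match.
    static int findDelimiterShort(long word) {
        long input = word ^ SEMICOLONS;
        long tmp = (input - 0x0101010101010101L) & ~input & 0x8080808080808080L;
        return Long.numberOfTrailingZeros(tmp) >>> 3;
    }

    // Packs up to 8 ASCII bytes little-endian, first character in the low byte.
    static long toWord(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        long word = 0;
        for (int i = 0; i < Math.min(bytes.length, Long.BYTES); i++) {
            word |= (bytes[i] & 0xFFL) << (8 * i);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(findDelimiterShort(toWord("Oslo;1.2"))); // index of ';'
        System.out.println(findDelimiterShort(toWord("Hamburgx"))); // 8 means "not in this word"
        // Spot-check that both variants agree on random words.
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long word = random.nextLong();
            if (findDelimiterOriginal(word) != findDelimiterShort(word)) {
                throw new AssertionError(Long.toHexString(word));
            }
        }
        System.out.println("variants agree");
    }
}
```

Both variants return 8 when the word contains no `;`, because `Long.numberOfTrailingZeros(0)` is 64. A caller can use that value to advance to the next word.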

gunnarmorling · 2024-01-10T17:37:21Z

Alrighty, finally back to evaluations :)

This brings your time on the new machine (using 8 of its cores) from 00:03.911 (as per the current version on main) down to 00:03.044 as of this branch! I will do a run of the Top 10 on all 64 cores a bit later this month; it should be below 1 sec for sure.

I'll add @merykitty and @mukel as co-authors to the entry on the leaderboard.

thomaswue · 2024-01-10T23:01:03Z

Cool, thank you!

thomaswue added 4 commits January 9, 2024 14:55

Optimize checking for collisions by doing this a long at a time always.

afc80e1

Use a long at a time scanning for delimiter.

1067e20

Minor tuning. Now below 0.80s on Intel i9-13900K.

636c3e9

Merge branch 'gunnarmorling:main' into main

e97937f

artsiomkorzun reviewed Jan 9, 2024

BSchneppe reviewed Jan 9, 2024

Add number parsing code from Quan Anh Mai. Fix name length issue.

b3b8851

mukel reviewed Jan 9, 2024

thomaswue added 5 commits January 10, 2024 13:57

Merge branch 'gunnarmorling:main' into main

7d12c49

Include suggestion from Alfonso Peterssen for another 1.5%.

5598cfb

Optimize hash collision check compare for ~4% gain.

9fd6367

Add perf stats based on latest version.

bc5db32

Merge branch 'gunnarmorling:main' into main

45dda82

gunnarmorling merged commit af66ac1 into gunnarmorling:main Jan 10, 2024
1 check passed
