New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Second tuning for thomaswue #263
Conversation
@merykitty Your clever number parsing trick is the major aspect missing here I think. Let's join forces ;-). |
@thomaswue Yes I think it would be helpful, since the branches are quite unpredictable. |
I can try to adopt it, but would like to add you as a co-author of the submission if it gains. |
That's awesome, loving this spirit! Exactly what I was hoping to get out of it, folks exchanging with and inspiring each other, which IMO is way more important than who ends up on number #1. Way to go! Still working on rebootstrapping the new eval machine, will get to running this (and all the other pending entries) later this week. Thanks for your patience! |
@gunnarmorling sorry for cross-posting here, but what did you think running an GitHub Runner on the eval machine? and integrate the runner into the CI of the repo? |
@thomaswue It would be my pleasure, thanks very much |
@jkroepke, I think that's an interesting idea. Could you open an issue for discussing it separately? Thx! |
long hash = 0; | ||
|
||
// Search for ';', one long at a time. | ||
long word = UNSAFE.getLong(scanPtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This read can go over the end of the file. For example, a file contains the last entry "X;0.0\n". The read goes over 2 bytes at the end of the file. It will touch the next page, if the file has size equal to page size. If the next page is protected, that is seg fault. But I can't reproduce it locally. I always have ~136K memory after the end of the mapped region which I can access witout getting a seg fault. Maybe you have an explaination?
A workaround is to simply copy the tail with padding (equal has the same flaw and we want it read words).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, didn't think about that. It is very unlikely to exactly have a page boundary there, but will have to add the workaround for that last row.
} | ||
} | ||
if (result) { | ||
if (((UNSAFE.getLong(existingResult.nameAddress + i) ^ UNSAFE.getLong(nameAddress + i)) << (64 - (nameLength - i) << 3)) == 0) { | ||
existingResult.min = (short) Math.min(existingResult.min, number); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these cast would now be unnecessary like in lines 69-70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed, thanks for spotting. Left over from last version and now this might actually add an additional instruction!
@gunnarmorling This has now the number parsing code from @merykitty in the PR. Please add him as a co-author to the solution. It runs now in 0.76s on my machine, so this gave another ~5%. |
private static int findDelimiter(long word) { | ||
long input = word ^ 0x3B3B3B3B3B3B3B3BL; | ||
long tmp = (input & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL; | ||
tmp = ~(tmp | input | 0x7F7F7F7F7F7F7F7FL); | ||
return Long.numberOfTrailingZeros(tmp) >>> 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be micro-optimized as follows:
private static int findDelimiter(long word) {
long input = word ^ 0x3B3B3B3B3B3B3B3BL;
long tmp = (input - 0x0101010101010101L) & ~input & 0x8080808080808080L;
return Long.numberOfTrailingZeros(tmp) >>> 3;
}
Which saves 1 instruction.
Before: 139,978,244,139 instructions:u
With this patch: 134,391,924,137 instructions:u
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed, thanks. This got another almost 2%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gunnarmorling This is already the second contribution from @mukel, I would like to ask if you could add him as a co-author to the submission as well in case you will re-evaluate.
Alrighty, finally back to evaluations :) This brings your time on the new machine (using 8 of its cores) from 00:03.911 (as per current version on main) down to 00:03.044 as of this branch! Will do a run of the Top 10 on all 64 cores a bit later this months, should be below 1 sec for sure. I'll add @merykitty and @mukel as co-authors to the entry on the loeaderboard. |
Cool, thank you! |
Check List:
./test.sh <username>
shows no differences between expected and actual outputs)calculate_average_<username>.sh
(make sure to match casing of your GH user name) and is executablecalculate_average_baseline.sh
This incrementally improves upon the previous version and uses now a long value at a time searching for the delimiter and also for the hash collision check. Improves ~15% over previous version on my machine and runs now < 800ms.
Update: Incorporated the code of @merykitty to do branch-less number parsing and it runs now in 0.76s on my machine. So the gain was another 5%.
Update2: After adding suggestion of @mukel and an additional optimization for the name comparison, it is down to 0.70s on my machine, so an additional 9% faster.