Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Second tuning for thomaswue #263

Merged
merged 10 commits into from Jan 10, 2024
Merged

Conversation

thomaswue
Copy link
Contributor

@thomaswue thomaswue commented Jan 9, 2024

Check List:

  • Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • All formatting changes by the build are committed
  • Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • Output matches that of calculate_average_baseline.sh
  • Execution time: 0.80s
  • Execution time of reference implementation: 120.37s

This incrementally improves upon the previous version and uses now a long value at a time searching for the delimiter and also for the hash collision check. Improves ~15% over previous version on my machine and runs now < 800ms.

Update: Incorporated the code of @merykitty to do branch-less number parsing and it runs now in 0.76s on my machine. So the gain was another 5%.

Update2: After adding suggestion of @mukel and an additional optimization for the name comparison, it is down to 0.70s on my machine, so an additional 9% faster.

@thomaswue
Copy link
Contributor Author

@merykitty Your clever number parsing trick is the major aspect missing here I think. Let's join forces ;-).

@merykitty
Copy link
Contributor

@thomaswue Yes I think it would be helpful, since the branches are quite unpredictable.

@thomaswue
Copy link
Contributor Author

I can try to adopt it, but would like to add you as a co-author of the submission if it gains.

@gunnarmorling
Copy link
Owner

That's awesome, loving this spirit! Exactly what I was hoping to get out of it, folks exchanging with and inspiring each other, which IMO is way more important than who ends up on number #1. Way to go! Still working on rebootstrapping the new eval machine, will get to running this (and all the other pending entries) later this week. Thanks for your patience!

@jkroepke
Copy link

jkroepke commented Jan 9, 2024

@gunnarmorling sorry for cross-posting here, but what did you think running an GitHub Runner on the eval machine? and integrate the runner into the CI of the repo?

@merykitty
Copy link
Contributor

@thomaswue It would be my pleasure, thanks very much

@gunnarmorling
Copy link
Owner

@jkroepke, I think that's an interesting idea. Could you open an issue for discussing it separately? Thx!

long hash = 0;

// Search for ';', one long at a time.
long word = UNSAFE.getLong(scanPtr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This read can go over the end of the file. For example, a file contains the last entry "X;0.0\n". The read goes over 2 bytes at the end of the file. It will touch the next page, if the file has size equal to page size. If the next page is protected, that is seg fault. But I can't reproduce it locally. I always have ~136K memory after the end of the mapped region which I can access witout getting a seg fault. Maybe you have an explaination?

A workaround is to simply copy the tail with padding (equal has the same flaw and we want it read words).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, didn't think about that. It is very unlikely to exactly have a page boundary there, but will have to add the workaround for that last row.

}
}
if (result) {
if (((UNSAFE.getLong(existingResult.nameAddress + i) ^ UNSAFE.getLong(nameAddress + i)) << (64 - (nameLength - i) << 3)) == 0) {
existingResult.min = (short) Math.min(existingResult.min, number);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these cast would now be unnecessary like in lines 69-70

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed, thanks for spotting. Left over from last version and now this might actually add an additional instruction!

@thomaswue
Copy link
Contributor Author

@thomaswue It would be my pleasure, thanks very much

@gunnarmorling This has now the number parsing code from @merykitty in the PR. Please add him as a co-author to the solution. It runs now in 0.76s on my machine, so this gave another ~5%.

Comment on lines 224 to 229
private static int findDelimiter(long word) {
long input = word ^ 0x3B3B3B3B3B3B3B3BL;
long tmp = (input & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
tmp = ~(tmp | input | 0x7F7F7F7F7F7F7F7FL);
return Long.numberOfTrailingZeros(tmp) >>> 3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be micro-optimized as follows:

private static int findDelimiter(long word) {
    long input = word ^ 0x3B3B3B3B3B3B3B3BL;
    long tmp = (input - 0x0101010101010101L) & ~input & 0x8080808080808080L;
    return Long.numberOfTrailingZeros(tmp) >>> 3;
}

Which saves 1 instruction.
Before: 139,978,244,139 instructions:u
With this patch: 134,391,924,137 instructions:u

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed, thanks. This got another almost 2%.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gunnarmorling This is already the second contribution from @mukel, I would like to ask if you could add him as a co-author to the submission as well in case you will re-evaluate.

@gunnarmorling
Copy link
Owner

Alrighty, finally back to evaluations :)

This brings your time on the new machine (using 8 of its cores) from 00:03.911 (as per current version on main) down to 00:03.044 as of this branch! Will do a run of the Top 10 on all 64 cores a bit later this months, should be below 1 sec for sure.

I'll add @merykitty and @mukel as co-authors to the entry on the loeaderboard.

@gunnarmorling gunnarmorling merged commit af66ac1 into gunnarmorling:main Jan 10, 2024
1 check passed
@thomaswue
Copy link
Contributor Author

Cool, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants