Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt of an initial version by rtreffer...... #186

Closed
wants to merge 1 commit into from

Conversation

rtreffer
Copy link

@rtreffer rtreffer commented Jan 6, 2024

Check List:

  • Tests pass (./test.sh rtreffer)
  • All formatting changes by the build are committed
  • Output matches that of calculate_average_baseline.sh
  • Execution time: 0:02.00 (2s)
  • Execution time of reference implementation: 1:34.59 (1m34s)

This one will need some judgement calls 馃し

It's a mashup pulling from @thomaswue (still 75% of the code) and @royvanrijn (SIMD idea). I really hope some of this makes it into other implementations.
The tiny unique ideas are: fnv64(alike) hashing of the city name as unique identifier + prime sized hash table.
This required some annoying work of the parse loop.

Using fnv64 (or any other hash function) on 10'000 station names is not guaranteed to be collision free, but it should be up to 99.99999999999995% correct ((1-1/(2^64))^10000). (Correction: should be closer to 99.999999999728922351%, ups)

That said it's really fast.

@thomaswue
Copy link
Contributor

Very cool work! How much faster on your machine than my solution is it with those changes?

I think one can still read 8 bytes when parsing for the city length and just special case the last entry, but it adds some more complexity to the code.

The full collision check in 8 byte chunk comparisons instead of the hash compare should be quite OK. In my solution it is still on a per-byte basis and might be a bottleneck because of that.

@rtreffer
Copy link
Author

rtreffer commented Jan 7, 2024

@thomaswue yes, that's all correct, I am playing around with these alternatives, but it is getting way to late for my taste now 馃檲

So this version right now is extremely fast as it skips the whole string comparison. That's most of the speedup. I think it would take back spot 1, but it is sorta questionable as I can be sure there must be names that have a collision, it's just very very hard to find 馃槅. I noted results on my machine in the code

 * - measurements3: rtreffer=3.9s baseline=2m6s thomaswue=8.5s
 * - measurements:  rtreffer=2.6s baseline=1m47 thomaswue=4.1s royvanrijn=3.2s

Although I must have gotten a bad run on your code, it seems a bit faster at the moment.

8 byte SIMD + a single 4 byte SIMD scan work. I have that lying around. I also went for 8 byte string comparisons, with a 4 and 2 byte comparison if needed (hack: you can always compare one more character, the ';'). However I am then close to your impl in speed, so I must have degraded at a few places.

I am right now considering to drop the whole OOP results space, make it one memory block and use that for the results.
At the end I can pack each city result into ~128 bytes (8 sum, 4 count, 4 min, 4 max, 100(+1) city name).
That's just 2-3x more than my L2 at an oversized hash table. But if I can make the hashtable dynamic then all of this should fit. Similar to the whole scanning approach.

馃泴

@gunnarmorling
Copy link
Owner

Using fnv64 (or any other hash function) on 10'000 station names is not guaranteed to be collision free, but it should be up to 99.99999999999995% correct ((1-1/(2^64))^10000). (Correction: should be closer to 99.999999999728922351%, ups)

Hey @rtreffer, thanks a lot for this submission. Unfortunately, as is, it's not compliant with the requirements, specifically the rule "Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported". If there is an, albeit small, risk for hash collisions which may cause incorrect results to be emitted, that solution is not within the spec.

@rtreffer
Copy link
Author

rtreffer commented Jan 7, 2024

Imagined that, I'll do a new submission soon :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants