Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unionArrayBy: Find next 1-bits with countTrailingZeros #395

Merged
merged 6 commits into from
Mar 31, 2022

Conversation

sjakobi
Copy link
Member

@sjakobi sjakobi commented Mar 24, 2022

Closes #374.

@sjakobi
Copy link
Member Author

sjakobi commented Mar 24, 2022

The benchmark results are pretty surprising: The mean runtime of the HashMap.union benchmark has literally doubled from ~60μs to ~120μs!

I think this is due to the weird benchmark data of contiguous Ints. Of course it doesn't hurt to increment the index by 1 if the node is full! :/

So, I'll need to improve the benchmark data, maybe simply try with ByteStrings.

Also, check the generated code.

Data/HashMap/Internal.hs Outdated Show resolved Hide resolved
Data/HashMap/Internal.hs Outdated Show resolved Hide resolved
@sjakobi
Copy link
Member Author

sjakobi commented Mar 24, 2022

The benchmark results are pretty surprising: The mean runtime of the HashMap.union benchmark has literally doubled from ~60μs to ~120μs!

These numbers are from my desktop with an i7-4790K CPU.

I'm now using my old laptop with an i3-2350M CPU. For the union.Int benchmark I'm seeing a slowdown from 128 μs to 277 μs (116%), for the new union.ByteString benchmark I'm also seeing a slowdown from 167 μs to 240 μs (43%).

That's pretty terrible. I'll have to look at the generated code. At least I know that it matters!

…and bring back array indices.
@sjakobi sjakobi force-pushed the sjakobi/374-unionArrayBy-clz branch from 0f4f623 to eedc326 Compare March 25, 2022 22:13
@sjakobi
Copy link
Member Author

sjakobi commented Mar 25, 2022

The latest version (eedc326) finally brings a speed up for the union.ByteString benchmark: 167μs -> 143 μs (-14%).

The union.Int benchmark is still slower at 128μs -> 140μs (+9%), but IMHO less realistic.

(Both measurements are from the ancient laptop.)

I should try another version that computes the indices instead of having them as arguments in the loop.

It might also be interesting to combine the various bitmap checks into a single case expression. Maybe GHC can generate something like a lookup table from this?!

@sjakobi
Copy link
Member Author

sjakobi commented Mar 26, 2022

I should try another version that computes the indices instead of having them as arguments in the loop.

Much worse, even if I keep i as a loop variable.

@sjakobi
Copy link
Member Author

sjakobi commented Mar 27, 2022

The latest version (eedc326) finally brings a speed up for the union.ByteString benchmark: 167μs -> 143 μs (-14%).

The union.Int benchmark is still slower at 128μs -> 140μs (+9%), but IMHO less realistic.

(Both measurements are from the ancient laptop.)

On my relatively beefier machine with the i7-4790K I'm getting the following numbers:

  • union.Int: 60μs -> 66μs (+10%)
  • union.ByteString: 78μs -> 66μs (-15%)

@sjakobi
Copy link
Member Author

sjakobi commented Mar 29, 2022

It might also be interesting to combine the various bitmap checks into a single case expression. Maybe GHC can generate something like a lookup table from this?!

I tried this:

    let go !i !i1 !i2 !b
            | b == 0 = return ()
            | otherwise = do
                let m = 1 `unsafeShiftL` countTrailingZeros b
                let testBit :: Word -> Word
                    testBit x = fromIntegral (fromEnum (x .&. m /= 0))
                let b'' = b .&. complement m
                let t1 = testBit b1
                let t2 = testBit b2 `unsafeShiftL` 1
                case t1 .|. t2 of
                    1 -> do
                        A.write mary i =<< A.indexM ary1 i1
                        go (i+1) (i1+1)  i2    b''
                    2 -> do
                        A.write mary i =<< A.indexM ary2 i2
                        go (i+1)  i1    (i2+1) b''
                    _ -> do
                        x1 <- A.indexM ary1 i1
                        x2 <- A.indexM ary2 i2
                        A.write mary i $! f x1 x2
                        go (i+1) (i1+1) (i2+1) b''
    go 0 0 0 b'

Unfortunately it's slower (measured on the old laptop):

    union
      Int:        OK (1.78s)
        162  μs ± 1.6 μs
      ByteString: OK (10.76s)
        156  μs ± 1.5 μs

@sjakobi sjakobi marked this pull request as ready for review March 29, 2022 17:44
@sjakobi
Copy link
Member Author

sjakobi commented Mar 30, 2022

I think I'll leave it at this for now. Ultimately it would be nice to recover the perf losses for cases like the union.Int benchmark, but I'll leave that for future work.

@sjakobi sjakobi merged commit b6bde46 into master Mar 31, 2022
@sjakobi sjakobi deleted the sjakobi/374-unionArrayBy-clz branch March 31, 2022 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use countTrailingZeros to speed up iterating over bitmaps?!
2 participants