Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize IntSet.Bin #998

Merged
merged 8 commits into from
Jun 4, 2024
Merged

Optimize IntSet.Bin #998

merged 8 commits into from
Jun 4, 2024

Conversation

meooow25
Copy link
Contributor

@meooow25 meooow25 commented Apr 6, 2024

  • Replace the separate Prefix and Mask Int fields in the Bin constructor with a single Int field which contains both merged together. This reduces the memory required by a Bin from 5 to 4 words, at the cost of more computations (which are cheap bitwise ops) being necessary for certains operations. This follows a similar change done for IntMap.Bin (Optimize IntMap.Bin #995).
  • Benchmarks show that runtimes for most operations remain unchanged or decrease by a small amount (<10%). As expected, allocations are consistently lower by 11-16% for all set operations that have to make O(log n) allocations.
  • The functions and types used by both IntSet and IntMap have been moved into a IntTreeCommons module.

Fixes #991.

Memory

Concretely, this reduces the memory required by an IntSet by ~12.5%.

Calculations: For a tree with n Tips, each Tips costs 3 words and there are n-1 Bins each costing 5 words before this change and 4 words after. So we save about 1 out of 8 words.

Compatibility

This PR makes breaking changes to IntSet internals which is reflected in the exports of the internal module Data.IntSet.Internal. Due to the introduction of IntTreeCommons, some exports of Data.IntMap.Internal are moved to IntTreeCommons. There is no change in the exports of any other module.

Benchmarks

Benchmarks done with GHC 9.6.3.
Last updated: 080dde1

Benchmark command: cabal run <target> -- --csv <csv> +RTS -T

intset-benchmarks

Name                       Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                A       B     %         A       B     %         A       B     %
delete                      66 μs   65 μs   -1%    729 KB  601 KB  -17%    127 B   104 B   -18%
deleteMax                   17 ns   16 ns   -5%    175 B   159 B    -9%      0 B     0 B
deleteMin                   57 ns   51 ns  -11%    764 B   647 B   -15%      0 B     0 B
difference                 721 ns  683 ns   -5%    4.0 KB  3.5 KB  -12%      1 B     2 B   +100%
disjoint:false              33 ns   34 ns   +3%     23 B    23 B    +0%      0 B     0 B
disjoint:true              443 ns  409 ns   -7%      0 B    12 B             0 B     0 B
filter                      21 μs   21 μs   +0%     68 KB   67 KB   -1%     43 B    39 B    -9%
findMax                    6.8 ns  7.1 ns   +4%     39 B    39 B    +0%      0 B     0 B
findMin                    9.1 ns  9.0 ns   +0%     39 B    39 B    +0%      0 B     0 B
fold                       154 ns  156 ns   +0%    2.7 KB  2.7 KB   +0%      1 B     1 B    +0%
fromAscList                 17 μs   17 μs   +0%    4.0 KB  2.8 KB  -29%      2 B     2 B    +0%
fromDistinctAscList         17 μs   17 μs   +0%    4.0 KB  2.8 KB  -29%      3 B     2 B   -33%
fromList                    52 μs   46 μs  -11%    576 KB  479 KB  -16%    392 B   297 B   -24%
fromRange                  347 ns  332 ns   -4%    4.0 KB  3.6 KB  -12%      2 B     1 B   -50%
fromRange:small            9.2 ns   11 ns  +19%    119 B   111 B    -6%      0 B     0 B
insert                      52 μs   45 μs  -12%    576 KB  479 KB  -16%    409 B   308 B   -24%
instanceOrd:dense          454 ms  454 ms   +0%    2.2 GB  2.2 GB   +0%    142 MB  141 MB   +0%
instanceOrd:sparse         826 ms  818 ms   +0%    2.4 GB  2.4 GB   +0%    241 MB  229 MB   -4%
intersection               710 ns  702 ns   -1%    4.0 KB  3.5 KB  -13%      2 B     1 B   -50%
map                         85 μs   81 μs   -5%    1.0 MB  960 KB   -9%    1.1 KB  912 B   -16%
member                      50 μs   48 μs   -2%      0 B     0 B             0 B     0 B
null.intersection:false    706 ns  699 ns   +0%    4.0 KB  3.5 KB  -13%      2 B     1 B   -50%
null.intersection:true     425 ns  420 ns   -1%      0 B    12 B             0 B     0 B
partition                   24 μs   22 μs   -4%     72 KB   71 KB   -1%     79 B    69 B   -12%
spanAntitone:dense         118 ns  112 ns   -4%    581 B   527 B    -9%      0 B     0 B
spanAntitone:sparse        137 ns  120 ns  -12%    772 B   677 B   -12%      0 B     0 B
split:dense                 36 ns   32 ns  -10%    375 B   319 B   -14%      0 B     0 B
split:sparse                49 ns   48 ns   -1%    486 B   399 B   -17%      0 B     0 B
splitMember:dense           38 ns   35 ns   -7%    383 B   326 B   -14%      0 B     0 B
splitMember:sparse          56 ns   44 ns  -20%    494 B   406 B   -17%      0 B     0 B
union                      691 ns  662 ns   -4%    4.0 KB  3.5 KB  -12%      2 B     1 B   -50%
unions                     694 ns  677 ns   -2%    4.1 KB  3.6 KB  -12%      2 B     1 B   -50%

set-operations-intset

Name                           Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                    A       B     %         A       B     %         A       B     %
difference-block_nn             12 μs   11 μs   -1%     65 KB   54 KB  -16%    530 B   388 B   -26%
difference-block_nn_swap        12 μs   12 μs   +0%     65 KB   54 KB  -16%    552 B   394 B   -28%
difference-block_ns            1.6 μs  1.5 μs   -3%     11 KB  8.8 KB  -17%     17 B    11 B   -35%
difference-block_sn_swap       1.3 μs  1.3 μs   +1%    6.5 KB  5.5 KB  -15%      7 B     5 B   -28%
difference-common_nn            18 μs   17 μs   -3%     98 KB   85 KB  -12%    1.2 KB  925 B   -23%
difference-common_nn_swap       11 μs  9.7 μs  -15%      0 B     0 B             0 B     0 B
difference-common_ns            18 μs   18 μs   -3%     97 KB   85 KB  -12%    1.2 KB  932 B   -23%
difference-common_nt           6.4 μs  6.0 μs   -6%     47 KB   39 KB  -17%    291 B   202 B   -30%
difference-common_sn_swap       11 μs  9.7 μs  -15%      0 B     0 B             0 B     0 B
difference-common_tn_swap      3.2 μs  2.8 μs  -11%      0 B     0 B             0 B     0 B
difference-disj_nn              47 ns   46 ns   -2%    255 B   214 B   -16%      0 B     0 B
difference-disj_nn_swap         51 ns   49 ns   -3%    334 B   278 B   -16%      0 B     0 B
difference-disj_ns              43 ns   41 ns   -4%    255 B   214 B   -16%      0 B     0 B
difference-disj_nt              38 ns   35 ns   -8%    255 B   214 B   -16%      0 B     0 B
difference-disj_sn_swap         44 ns   43 ns   -2%    255 B   214 B   -16%      0 B     0 B
difference-disj_tn_swap         30 ns   34 ns  +14%    135 B   119 B   -11%      0 B     0 B
difference-mix_nn               38 μs   36 μs   -5%    193 KB  169 KB  -12%    4.9 KB  3.7 KB  -23%
difference-mix_nn_swap          38 μs   36 μs   -4%    193 KB  169 KB  -12%    4.8 KB  3.7 KB  -22%
difference-mix_ns               20 μs   19 μs   -4%    107 KB   94 KB  -12%    1.4 KB  1.1 KB  -22%
difference-mix_nt              6.5 μs  6.1 μs   -6%     47 KB   39 KB  -17%    290 B   204 B   -29%
difference-mix_sn_swap          20 μs   19 μs   -5%    107 KB   94 KB  -12%    1.4 KB  1.1 KB  -23%
difference-mix_tn_swap         4.4 μs  4.1 μs   -7%     20 KB   17 KB  -12%     54 B    39 B   -27%
intersection-block_nn          7.2 μs  6.9 μs   -5%      0 B     0 B             0 B     0 B
intersection-block_nn_swap     7.0 μs  6.9 μs   -1%      0 B     0 B             0 B     0 B
intersection-block_ns          891 ns  879 ns   -1%     25 B    25 B    +0%      0 B     0 B
intersection-block_sn_swap     962 ns  864 ns  -10%      0 B    25 B             0 B     0 B
intersection-common_nn          18 μs   17 μs   -2%     97 KB   85 KB  -12%    1.2 KB  904 B   -24%
intersection-common_nn_swap     18 μs   18 μs   +0%     97 KB   85 KB  -11%    1.2 KB  931 B   -21%
intersection-common_ns          18 μs   18 μs   +0%     97 KB   85 KB  -12%    1.1 KB  920 B   -21%
intersection-common_nt         4.6 μs  4.4 μs   -2%     19 KB   17 KB  -11%     50 B    38 B   -24%
intersection-common_sn_swap     18 μs   17 μs   -1%     97 KB   85 KB  -12%    1.2 KB  898 B   -24%
intersection-common_tn_swap    4.3 μs  4.3 μs   +0%     19 KB   17 KB  -11%     50 B    40 B   -19%
intersection-disj_nn            33 ns   30 ns   -9%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nn_swap       32 ns   32 ns   +0%     31 B    31 B    +0%      0 B     0 B
intersection-disj_ns            30 ns   26 ns  -13%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nt            26 ns   23 ns   -8%     31 B    31 B    +0%      0 B     0 B
intersection-disj_sn_swap       29 ns   28 ns   -3%     31 B    31 B    +0%      0 B     0 B
intersection-disj_tn_swap       25 ns   24 ns   -5%     31 B    31 B    +0%      0 B     0 B
intersection-mix_nn             22 μs   21 μs   -2%      0 B     0 B             0 B     0 B
intersection-mix_nn_swap        22 μs   21 μs   -1%      0 B     0 B             0 B     0 B
intersection-mix_ns             12 μs   12 μs   +0%      0 B     0 B             0 B     0 B
intersection-mix_nt            3.2 μs  3.3 μs   +1%      0 B     0 B             0 B     0 B
intersection-mix_sn_swap        12 μs   12 μs   +0%      0 B     0 B             0 B     0 B
intersection-mix_tn_swap       3.1 μs  3.0 μs   -3%      0 B     0 B             0 B     0 B
union-block_nn                  14 μs   13 μs   -4%     93 KB   77 KB  -16%    1.1 KB  776 B   -30%
union-block_nn_swap             14 μs   13 μs   -6%     93 KB   77 KB  -16%    1.1 KB  777 B   -29%
union-block_ns                 1.8 μs  1.7 μs   -7%     13 KB   11 KB  -17%     27 B    18 B   -33%
union-block_sn_swap            1.9 μs  1.7 μs   -9%     13 KB   11 KB  -17%     27 B    18 B   -33%
union-common_nn                 17 μs   17 μs   +1%     97 KB   85 KB  -12%    1.2 KB  965 B   -20%
union-common_nn_swap            17 μs   17 μs   +0%     97 KB   85 KB  -12%    1.2 KB  934 B   -24%
union-common_ns                 17 μs   18 μs   +1%     97 KB   85 KB  -11%    1.2 KB  969 B   -20%
union-common_nt                6.4 μs  5.7 μs   -9%     47 KB   39 KB  -16%    309 B   203 B   -34%
union-common_sn_swap            17 μs   17 μs   +0%     97 KB   85 KB  -12%    1.2 KB  942 B   -22%
union-common_tn_swap           5.9 μs  5.3 μs   -9%     47 KB   39 KB  -16%    301 B   206 B   -31%
union-disj_nn                   62 ns   47 ns  -24%    534 B   438 B   -17%      0 B     0 B
union-disj_nn_swap              61 ns   54 ns  -10%    534 B   438 B   -17%      0 B     0 B
union-disj_ns                   56 ns   41 ns  -26%    454 B   375 B   -17%      0 B     0 B
union-disj_nt                   44 ns   32 ns  -28%    334 B   279 B   -16%      0 B     0 B
union-disj_sn_swap              56 ns   48 ns  -14%    454 B   375 B   -17%      0 B     0 B
union-disj_tn_swap              47 ns   38 ns  -20%    334 B   279 B   -16%      0 B     0 B
union-mix_nn                    36 μs   37 μs   +1%    193 KB  171 KB  -11%    4.7 KB  3.7 KB  -21%
union-mix_nn_swap               36 μs   37 μs   +3%    193 KB  171 KB  -11%    4.7 KB  3.7 KB  -21%
union-mix_ns                    19 μs   20 μs   +2%    107 KB   93 KB  -12%    1.4 KB  1.1 KB  -21%
union-mix_nt                   6.4 μs  5.8 μs   -9%     47 KB   39 KB  -16%    290 B   205 B   -29%
union-mix_sn_swap               19 μs   19 μs   +1%    107 KB   94 KB  -12%    1.4 KB  1.1 KB  -23%
union-mix_tn_swap              5.9 μs  5.5 μs   -6%     47 KB   39 KB  -16%    289 B   204 B   -29%

* Replace the separate Prefix and Mask Int fields in the Bin constructor
  with a single Int field which contains both merged together. This
  reduces the memory required by a Bin from 5 to 4 words, at the cost of
  more computations (which are cheap bitwise ops) being necessary for
  certains operations. This follows a similar change done for IntMap.Bin.

* Benchmarks show that runtimes for most operations remain unchanged or
  decrease by a small amount (<10%). As expected, allocations are
  consistently lower by 11-16% for all set operations that have to
  make O(log n) allocations.

* The functions and types used by both IntSet and IntMap have been moved
  into a IntTreeCommons module.
@treeowl
Copy link
Contributor

treeowl commented Apr 6, 2024

This isn't related to your changes, but it looks like the IntSet validity test doesn't ensure that each Tip contains at least one element.

@@ -334,7 +363,7 @@ null _ = False
size :: IntSet -> Int
size = go 0
where
go !acc (Bin _ _ l r) = go (go acc l) r
go !acc (Bin _ l r) = go (go acc l) r
go acc (Tip _ bm) = acc + bitcount 0 bm
go acc Nil = acc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does bitcount take two arguments? Should we just use popCount directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does bitcount take two arguments?

This seems to be left over from when bitcount was a loop and the first parameter was the accumulator: 4952822#diff-06e022e2fe36764ea9baec24a03a8186d708cc561677a2e6583f08fe180ca073L73-L76

What I don't understand is why it is acc + bitcount 0 bm rather than bitcount acc bm 😄

Should we just use popCount directly?

Probably. Might be better in a different PR though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a totally reasonable "by the way" for this one, if you're so inclined. It strictly improves readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should get rid of the note citing a source for the highest bit algorithm that we no longer use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we were considering removing the Nat stuff in #995, I think it would a better fit with those changes. Same for the obsolete note.

@alexfmpe alexfmpe mentioned this pull request Apr 6, 2024
-- * All keys in a Bin start with the Bin's shared prefix.
-- * All keys in the Bin's left child have the Prefix's mask bit unset.
-- * All keys in the Bin's right child have the Prefix's mask bit set.
prefixOk :: IntSet -> Property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything checking the prefixes of Bins other than the root. Shouldn't this be called prefixesOk, and perform a recursive test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test performance reasons, I suspect we want to consider validity of the prefix of a Bin relative to that of its parent when making the test recursive, treating the (peculiarly simple) case of the root specially.

Copy link
Contributor Author

@meooow25 meooow25 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 🤦

For test performance reasons, I suspect we want to consider validity of the prefix of a Bin relative to that of its parent when making the test recursive, treating the (peculiarly simple) case of the root specially.

Is the trouble really worth it?

$ cabal run intset-properties
...
All 60 tests passed (0.21s)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 0.21s, no. But why does it run for such a short time? Can we make it run more tests? Larger tests? Ideally I want to use more efficient tests and also get that testing time up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, I suspect that a relative test is likely to help give more useful information in case of failure, though I don't know that for sure. Regardless, I won't block up this PR for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it run more tests?

This would be the easiest, we can increase the number of quickcheck runs.

Ideally I want to use more efficient tests and also get that testing time up.

Well, if it helps catch errors. Don't know how one would judge that, except maybe coverage. Otherwise it is a waste.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Data.Sequence, that's helpful for dealing with "special casing" in some parts of the code. Maybe it's overkill here.

t ->
valid t .&&.
toAscList t === List.sort (nub ((List.intersect) (xs) (ys)))
toAscList t === (toAscList xs List.\\ toAscList ys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data.List.Ordered.minus would be faster than \\.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean from data-ordlist? Surely this is not worth a new dependency?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, from that package. I imagine we can copy what we like, though I haven't checked its license to be sure. My only real concern about the dependency is that the package doesn't seem to be actively maintained, so it might run into annoying issues with base changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the tests are quite fast (as mentioned the other comment), this seems like something that shouldn't cause too much worry.
Besides, a new function/dependency is another chance to introduce a bug. Data.List is a little more trustworthy in that regard, I'd say.

case intersection xs ys of
t ->
valid t .&&.
toAscList t === (toAscList xs `List.intersect` toAscList ys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, Data.List.Ordered.isect should be faster than List.intersect.

@meooow25
Copy link
Contributor Author

@treeowl, anything I can do here?

| otherwise = go l r
go def (Bin p l r) | nomatch x p = if x < unPrefix p then unsafeFindMax def else unsafeFindMax r
| left x p = go def l
| otherwise = go l r
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't related to your changes, per se, but we use somewhat different algorithms for lookup in IntMap than in member for IntSet, and the reasoning doesn't seem to be documented in the source. As I recall, IntMap lookup is (as of relatively recently) optimized for the common case that the key is present in the map, whereas IntSet membership is optimized for the (presumably?) common case that the element is not in the set. What's the right reasoning for lookupLT and lookupGT here? I don't have much of a clue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall, IntMap lookup is (as of relatively recently) optimized for the common case that the key is present in the map...

Yes, appears to be so due to #800

What's the right reasoning for lookupLT and lookupGT here? I don't have much of a clue.

I'm not sure either. Note that the lookup logic (using zero, before #995) could not be directly applied to lookup{L,G}{T,E}. It checked specific bits and was not a binary search. However, the current logic with left can be applied without the nomatch check. This seems worth checking out, but we don't have to do it here I think.

@treeowl
Copy link
Contributor

treeowl commented Apr 16, 2024

I'm dealing with a family medical emergency this week. I hope to give this a final review shortly.

@meooow25
Copy link
Contributor Author

@treeowl I hope you are able to find time for this. I would really like to finish this work.

@treeowl treeowl merged commit 8562003 into haskell:master Jun 4, 2024
10 checks passed
@treeowl
Copy link
Contributor

treeowl commented Jun 4, 2024

Yes, I'm so sorry about the delay. Problems continue on my end, but that's no reason to hold you up. Merged! Will you try the IntSet now?

@meooow25
Copy link
Contributor Author

meooow25 commented Jun 4, 2024

Thank you!

Will you try the IntSet now?

This was the IntSet PR. Are you thinking of something else?

@meooow25 meooow25 deleted the intset-prefix-mask branch June 4, 2024 13:43
@treeowl
Copy link
Contributor

treeowl commented Jun 4, 2024 via email

@meooow25
Copy link
Contributor Author

meooow25 commented Jun 4, 2024

I hope things get better 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential memory optimization for IntMap and IntSet
2 participants