Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort) #3938

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

iaroslavski
Copy link
Contributor

@iaroslavski iaroslavski commented May 8, 2021

Sorting:

  • adopt radix sort for sequential and parallel sorts on int/long/float/double arrays (almost random and length > 6K)
  • fix tryMergeRuns() to better handle case when the last run is a single element
  • minor javadoc and comment changes

Testing:

  • add new data inputs in tests for sorting
  • add min/max/infinity values to float/double testing
  • add tests for radix sort

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8266431: Dual-Pivot Quicksort improvements (Radix sort)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3938/head:pull/3938
$ git checkout pull/3938

Update a local copy of the PR:
$ git checkout pull/3938
$ git pull https://git.openjdk.java.net/jdk pull/3938/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3938

View PR using the GUI difftool:
$ git pr show -t 3938

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3938.diff

Sorting:

- adopt radix sort for sequential and parallel sorts on int/long/float/double arrays (almost random and length > 6K)
- fix tryMergeRuns() to better handle case when the last run is a single element
- minor javadoc and comment changes

Testing:
- add new data inputs in tests for sorting
- add min/max/infinity values to float/double testing
- add tests for radix sort
@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented May 9, 2021

Congratulation,
what an amazing job !

DPQS is now 50% faster in average but 2.5 times faster (rms) my favorite !!

Finally OOME is now handled to let sort work under low-mem conditions !

I will work on more benchmarks for long/float/double types.

Laurent

@bridgekeeper bridgekeeper bot added the oca label May 10, 2021
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented May 10, 2021

Hi @iaroslavski, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user iaroslavski" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented May 10, 2021

@iaroslavski The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs label May 10, 2021
@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented May 10, 2021

/signed

@bridgekeeper bridgekeeper bot added the oca-verify label May 10, 2021
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented May 10, 2021

Thank you! Please allow for up to two weeks to process your OCA, although it is usually done within one to two business days. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

int[] count4 = new int[256];

for (int i = low; i < high; ++i) {
count1[ a[i] & 0xFF]--;
Copy link

@jhorstmann jhorstmann May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a reviewer, but having recently implemented a radixsort myself I'm wondering what is the logic or benefit of decrementing and counting backwards here?

One thing I did differently, and I'm not fully sure is an optimization, is remembering the last bucket for each of the 4 counts. Checking whether the data is already sorted by that digit can then be done by checking count[last_bucket] == size, which avoids the first loop in passLevel. Again, not sure whether it is actually faster, maybe the two separate simple loops like here are better.

Copy link
Contributor Author

@iaroslavski iaroslavski May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhorstmann In counting backwards we perform comparing with 0 in a loop, Comparing with 0 may be faster than comparing with other value. In general, backward or forward moving is your favor.

Keeping last_bucket and use check count[last_bucket] == size sounds good, but we need to run benchmarking and compare. I think we will not see big difference because the size of count array is too small, 256 only, whereas we scan whole array with size at least 6K. The corner case when we can see max effect of using last_bucket is array with all equal elements (4 count arrays will have only one non-zero element). but such array will be caught by tryMerge method.

*
* @author Vladimir Yaroslavskiy
* @author Jon Bentley
* @author Josh Bloch
* @author Doug Lea
*
* @version 2018.08.18
* @version 2020.06.14
Copy link
Contributor

@bourgesl bourgesl May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vladimir, I would update to 2021.05.06 (+your hash)

Copy link
Contributor Author

@iaroslavski iaroslavski May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Laurent, the date in this class is not the date of our last commit,
this date is the date when I have final idea regarding to Radix sort,
therefore, I prefer to keep 2020.06.14

Copy link
Contributor

@bourgesl bourgesl May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you want, but you should maybe indicate which version corresponds to your master code too.

It would help tracking changes between forks (iarosalvskiy/sorting master)

Copy link

@richardstartin richardstartin May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I believe this work derives from an unsigned radix sort I implemented on 10/04/2021 richardstartin/radix-sort-benchmark@ab4da23#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226 which has numerous structural similarities to this work:

  • Producing all four histograms in one pass
  • Skipping passes based on detecting the total in the histogram
  • Bailing out of the skip detection if a nonzero value not equal to the total is encountered
  • Manually unrolling the LSD radix sort loop in order to avoid array copies

My implementation from 10th April is below for reference:

  public static void unrollOnePassHistogramsSkipLevels(int[] data) {
    int[] histogram1 = new int[257];
    int[] histogram2 = new int[257];
    int[] histogram3 = new int[257];
    int[] histogram4 = new int[257];

    for (int value : data) {
      ++histogram1[(value & 0xFF) + 1];
      ++histogram2[((value >>> 8) & 0xFF) + 1];
      ++histogram3[((value >>> 16) & 0xFF) + 1];
      ++histogram4[(value >>> 24) + 1];
    }
    boolean skipLevel1 = canSkipLevel(histogram1, data.length);
    boolean skipLevel2 = canSkipLevel(histogram2, data.length);
    boolean skipLevel3 = canSkipLevel(histogram3, data.length);
    boolean skipLevel4 = canSkipLevel(histogram4, data.length);

    if (skipLevel1 && skipLevel2 && skipLevel3 && skipLevel4) {
      return;
    }
    int[] copy = new int[data.length];

    int[] source = data;
    int[] dest = copy;

    if (!skipLevel1) {
      for (int i = 1; i < histogram1.length; ++i) {
        histogram1[i] += histogram1[i - 1];
      }
      for (int value : source) {
        dest[histogram1[value & 0xFF]++] = value;
      }
      if (!skipLevel2 || !skipLevel3 || !skipLevel4) {
        int[] tmp = dest;
        dest = source;
        source = tmp;
      }
    }

    if (!skipLevel2) {
      for (int i = 1; i < histogram2.length; ++i) {
        histogram2[i] += histogram2[i - 1];
      }
      for (int value : source) {
        dest[histogram2[(value >>> 8) & 0xFF]++] = value;
      }
      if (!skipLevel3 || !skipLevel4) {
        int[] tmp = dest;
        dest = source;
        source = tmp;
      }
    }

    if (!skipLevel3) {
      for (int i = 1; i < histogram3.length; ++i) {
        histogram3[i] += histogram3[i - 1];
      }
      for (int value : data) {
        dest[histogram3[(value >>> 16) & 0xFF]++] = value;
      }
      if (!skipLevel4) {
        int[] tmp = dest;
        dest = source;
        source = tmp;
      }
    }

    if (!skipLevel4) {
      for (int i = 1; i < histogram4.length; ++i) {
        histogram4[i] += histogram4[i - 1];
      }
      for (int value : source) {
        dest[histogram4[value >>> 24]++] = value;
      }
    }
    if (dest != data) {
      System.arraycopy(dest, 0, data, 0, data.length);
    }
  }

  private static boolean canSkipLevel(int[] histogram, int dataSize) {
    for (int count : histogram) {
      if (count == dataSize) {
        return true;
      } else if (count > 0) {
        return false;
      }
    }
    return true;
  }

Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with me about doing so. On 25/04/2021 there was a new implementation of DualPivotQuicksort with a signed radix sort but the same structural similarities, and with the same method and variable names in places bourgesl/radix-sort-benchmark@90ff7e4#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609

    // TODO add javadoc
    private static void radixSort(Sorter sorter, int[] a, int low, int high) {
        int[] b;
        // LBO: prealloc (high - low) +1 element:
        if (sorter == null || (b = sorter.b) == null || b.length < (high - low)) {
            // System.out.println("alloc b: " + (high - low));
            b = new int[high - low];
        }

        int[] count1, count2, count3, count4;
        if (sorter != null) {
            sorter.resetRadixBuffers();
            count1 = sorter.count1;
            count2 = sorter.count2;
            count3 = sorter.count3;
            count4 = sorter.count4;
        } else {
            // System.out.println("alloc radix buffers(4x256)");
            count1 = new int[256];
            count2 = new int[256];
            count3 = new int[256];
            count4 = new int[256];
        }

        for (int i = low; i < high; ++i) {
            --count1[ a[i]         & 0xFF ];
            --count2[(a[i] >>>  8) & 0xFF ];
            --count3[(a[i] >>> 16) & 0xFF ];
            --count4[(a[i] >>> 24) ^ 0x80 ];
        }

        boolean skipLevel4 = canSkipLevel(count4, low - high);
        boolean skipLevel3 = skipLevel4 && canSkipLevel(count3, low - high);
        boolean skipLevel2 = skipLevel3 && canSkipLevel(count2, low - high);

        count1[255] += high;
        count2[255] += high;
        count3[255] += high;
        count4[255] += high;

        // 1 todo process LSD
        for (int i = 255; i > 0; --i) {
            count1[i - 1] += count1[i];
        }

        for (int i = low; i < high; ++i) {
            b[count1[a[i] & 0xFF]++ - low] = a[i];
        }

        if (skipLevel2) {
            System.arraycopy(b, 0, a, low, high - low);
            return;
        }

        // 2
        for (int i = 255; i > 0; --i) {
            count2[i - 1] += count2[i];
        }

        //for (int value : b) {
        //    a[count2[(value >> 8) & 0xFF]++] = value;
        for (int i = low; i < high; ++i) {
            a[count2[(b[i] >> 8) & 0xFF]++] = b[i];
        }

        if (skipLevel3) {
            return;
        }

        // 3
        for (int i = 255; i > 0; --i) {
            count3[i - 1] += count3[i];
        }

        for (int i = low; i < high; ++i) {
            b[count3[(a[i] >> 16) & 0xFF]++ - low] = a[i];
        }

        if (skipLevel4) {
            System.arraycopy(b, 0, a, low, high - low);
            return;
        }

        // 4
        for (int i = 255; i > 0; --i) {
            count4[i - 1] += count4[i];
        }

        // for (int value : b) {
        //    a[count4[ (value >>> 24) ^ 0x80]++] = value;
        for (int i = low; i < high; ++i) {
            a[count4[ (b[i] >>> 24) ^ 0x80]++] = b[i];
        }
    }

    // TODO: add javadoc
    private static boolean canSkipLevel(int[] count, int total) {
        for (int c : count) {
            if (c == 0) {
                continue;
            }
            return c == total;
        }
        return true;
    }

Later, the name of the method canSkipLevel changed to skipByte: bourgesl/radix-sort-benchmark@a693b26#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R719 - this is the name of the method in the version of this sort you committed on 05/01/2021 richardstartin/sorting@f076073#diff-4b4d68fc834c2ad12a9fb9d316a812221af7c398338ed2ee907d0a795e7aadafR601-R604

    // TODO add javadoc
//  private 
    static void radixSort(Sorter sorter, int[] a, int low, int high) {
//System.out.println("                        Radix !!!");
        int[] count1 = new int[256];
        int[] count2 = new int[256];
        int[] count3 = new int[256];
        int[] count4 = new int[256];

        for (int i = low; i < high; ++i) {
            count1[ a[i]         & 0xFF ]--;
            count2[(a[i] >>>  8) & 0xFF ]--;
            count3[(a[i] >>> 16) & 0xFF ]--;
            count4[(a[i] >>> 24) ^ 0x80 ]--;
        }
        boolean skipByte4 = skipByte(count4, low - high, high, true);
        boolean skipByte3 = skipByte(count3, low - high, high, skipByte4);
        boolean skipByte2 = skipByte(count2, low - high, high, skipByte3);
        boolean skipByte1 = skipByte(count1, low - high, high, skipByte2);

        if (skipByte1) {
//Main.check(a, low, high - 1); // todo
            return;
        }
//        int xorA = Main.getXor(a, low, high);

        int[] b; int offset = low;

        if (sorter == null || (b = (int[]) sorter.b) == null) {
            b = new int[high - low];
        } else {
            offset = sorter.offset;
//System.out.println("      !!!! offset: " + offset);
        }
        int start = low - offset;
        int last = high - offset;


        // 1 todo process LSD
        for (int i = low; i < high; ++i) {
            b[count1[a[i] & 0xFF]++ - offset] = a[i];
        }

//        if (xorA != Main.getXor(a, low, high)) System.out.println("6K_4 1 xor changed");

        if (skipByte2) {
            System.arraycopy(b, start, a, low, high - low);
//Main.check(a, low, high - 1); // todo
            return;
        }

        // 2
        for (int i = start; i < last; ++i) {
            a[count2[(b[i] >> 8) & 0xFF]++] = b[i];
        }

//        if (xorA != Main.getXor(a, low, high)) System.out.println("6K_4 2 xor changed");

        if (skipByte3) {
//Main.check(a, low, high - 1); // todo
            return;
        }

        // 3
        for (int i = low; i < high; ++i) {
            b[count3[(a[i] >> 16) & 0xFF]++ - offset] = a[i];
        }

//        if (xorA != Main.getXor(a, low, high)) System.out.println("6K_4 3 xor changed");

        if (skipByte4) {
            System.arraycopy(b, start, a, low, high - low);
//Main.check(a, low, high - 1); // todo
            return;
        }

//        if (xorA != Main.getXor(a, low, high)) System.out.println("6K_4 4 xor changed");
        // 4
        for (int i = start; i < last; ++i) {
            a[count4[(b[i] >>> 24) ^ 0x80]++] = b[i];
        }
//        if (xorA != Main.getXor(a, low, high)) System.out.println("6K_4 5 xor changed");
//Main.check(a, low, high - 1); // todo
    }

    // TODO: add javadoc
    private static boolean skipByte(int[] count, int total, int high, boolean prevSkip) {
        if (prevSkip) {
            for (int c : count) {
                if (c == 0) {
                    continue;
                }
                if (c == total) {
                    return true;
                }
                break;
            }
        }
        // todo create historgam
        count[255] += high;

        for (int i = 255; i > 0; --i) {
            count[i - 1] += count[i];
        }
        return false;
    }

skipByte was later renamed to passLevel here, which is the name used in this PR: richardstartin/sorting@22357e4#diff-4b4d68fc834c2ad12a9fb9d316a812221af7c398338ed2ee907d0a795e7aadaf

Given the structural similarities mentioned, the chronological order of these commits, and the demonstrable provenance of the method name passLevel from canSkipLevel and since you have patented sort algorithms in the past, I want to make sure that it is recognised that this work is derived from my own.

Copy link
Contributor

@bourgesl bourgesl May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You misunderstood my approach:

  • vladimir & tagir discussed radix sorts since previous work on DPQS in 2019
  • I enjoyed reading your blog post testing the performance of your radix sort vs Arrays.sort()
  • I tested and forked your radix-sort-benchmark to reproduce your experiments on openjdk16 (dpqs 19.11)
  • Vladimir proposed his own radixsort()
  • I did port DPQS impls in my fork of your benchmark to make a global comparison: dpqs 11, 19, 21 vs yours + arrays.sort(): it helped comparing implementations and decide when radix sort wins depending on dataset presortness
  • I tried many variants on my github repositories, but Vladimir never merged any of my change as he is not a regular github user (clone, fork, merge).

Our goal is not to underestimate your work (sort + benchmark) but Vladimir wrote his own code, me many experiments (tests, benchmarks) to obtain this original patch, written by Vladimir, with his radix sort implementation working on any int/long/float/double arrays, following the GPLv2 license.

You gave me motivation to make Arrays.sort() faster and we worked hard to write, test and benchmark this patch to be ready for OpenJDK 17

Copy link
Contributor

@AlanBateman AlanBateman Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @iaroslavski I'm unconvinced that this work was from 14/06/2020 - I believe this work derives from an unsigned radix sort I implemented on 10/04/2021 richardstartin/radix-sort-benchmark@ab4da23#diff-6c13d3fb74f38906677dbfa1a70a123c8e5baf4a39219c81ef121e078d0013bcR226 which has numerous structural similarities to this work:
Moreover, @bourgesl forked my repository on 11/04/2021 and communicated with me about doing so. On 25/04/2021 there was a new implementation of DualPivotQuicksort with a signed radix sort but the same structural similarities, and with the same method and variable names in places bourgesl/radix-sort-benchmark@90ff7e4#diff-397ce8fd791e2ce508cf9127201bc9ab46264cd2a79fd0487a63569f2e4b59b2R607-R609

@iaroslavski The attribution is not clear here. Can you provide a summary as to who is contributing to this patch? I can't tell if all involved have signed the OCA or not. I'm sure there will be questions about space/time trade-offs with radix sort but I think it's important to first establish the origins of this patch first.

Copy link
Contributor Author

@iaroslavski iaroslavski Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlanBateman There was discussion with Richard Startin, where Laurent and I explained the origin of my patch. There was misunderstanding based on git changes only. Finally Richard wrote comment in PR on May 13, 2021,
I duplicate it here:

"In private correspondence with Vladimir, it was explained that where Vladimir's code and Laurent's code are identical, including typos (Vladimir's code, Laurent's code) it is because Vladimir sent the code to Laurent, not the other way around, therefore Vladimir's code does not derive from Laurent's, and it does not derive from mine. I can only trust that this is the case, so please disregard my claim that this is derivative work when reviewing this PR."

@AlanBateman Vertical pipeline of PR hides comments in the middle and you have to click on "Show more..." to see all comments. There are no claims related to the origin of my patch, it doesn't violate any rights.

Copy link
Contributor

@AlanBateman AlanBateman Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlanBateman Vertical pipeline of PR hides comments in the middle and you have to click on "Show more..." to see all comments. There are no claims related to the origin of my patch, it doesn't violate any rights.

There is a comment from richardstartin suggesting that something was derived from code in his repo. Is this a benchmark that is not part of this PR? Only asking because I can't find him on OCA signatories. You can use the Skara /contributor command to list the contributors.

Copy link
Contributor Author

@iaroslavski iaroslavski Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned benchmarking is not a part of this PR. Our benchmarking was done by Laurent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlanBateman my claim was that the implementation was derived from my implementation, and demonstrated a sequence of name changes after @bourgesl forked my repository containing a structurally similar radix sort implementation and benchmarks, in order to provide circumstantial evidence for my claim. Via email @iaroslavski told me that this was not the case, which I decided to accept at face value. So please judge this PR on its merits, and disregard the claims made in these comments. I have not signed an OCA but do not want to block this PR if the space time tradeoff is deemed acceptable.

*/
if (a[e1] < a[e2] && a[e2] < a[e3] && a[e3] < a[e4] && a[e4] < a[e5]) {

/*
* Invoke radix sort on large array.
*/
Copy link
Contributor

@bourgesl bourgesl May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer testing (sorter == null) first as it is always true for sequential sort and avoid testing bits > ... in this case

Copy link
Contributor Author

@iaroslavski iaroslavski May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, I will update.

count1[ a[i] & 0xFF]--;
count2[(a[i] >>> 8) & 0xFF]--;
count3[(a[i] >>> 16) & 0xFF]--;
count4[(a[i] >>> 24) ^ 0x80]--;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that C2 can't eliminate the bounds check here because of the xor, even though this can't possibly exceed 256. The three masked accesses above are all eliminated. Maybe someone could look in to improving that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@iaroslavski iaroslavski May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Laurent (bourgesl), see his comment on May 15 regarding to xor:
using Unsafe is only 2% faster, not worth the extra complexity for few percent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small update of the XorINodes type calculation makes the bound check go away. Testing now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That bounds check shouldn't be there, it's an obvious case and will be fixed: #4136

Copy link
Contributor

@bourgesl bourgesl May 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing hotspot C2 (xor) so quickly !

iaroslavski added 2 commits May 17, 2021
Sorting:

- move radix sort out from quicksort partitioning
- rename radixSort to tryRadixSort
- minor javadoc and comment changes

Testing:
- rename radixSort to tryRadixSort in helper
Testing:
- remove @SInCE and @Date, otherwise jtreg tag parser fails
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jun 26, 2021

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Jun 26, 2021

Still waiting OCA approval

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Jul 24, 2021

Still waiting for individual OCA approval

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Jul 26, 2021

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Aug 23, 2021

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Aug 23, 2021

Still waiting OCA approval

@bridgekeeper bridgekeeper bot removed oca oca-verify labels Sep 13, 2021
@openjdk openjdk bot added the rfr label Sep 13, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 13, 2021

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Nov 15, 2021

Hi reviewers!

@jhorstmann
@tarsa
@bourgesl
@richardstartin
@AlanBateman
@neliasso

I applied and tested all ideas/comments from Laurent and Alan,
the sorting sources (3 classes) are in final state.
Could you please review and approve the PR, if you don't have comments?

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Nov 15, 2021

As I am not an official openjdk reviewer, I can not approve, but I do support

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Nov 18, 2021

Please core-libs reviewers could you have a look before jdk18 rdp0 or not ?

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Nov 18, 2021

Hi all,

Not only Dual-Pivot Quicksort has been optimized, but I also improved test classes.
I measured coverage of these tests. It was shown that all methods, all branches are invoked.
So, we have full test coverage.

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Nov 29, 2021

As jdk18 Ramp Down 0 is approaching and jdk19 is being opened soon, maybe this PR review should be planned for jdk19 build 1 ?

* Updated javadoc
* Optimized insertion sort threshold
* Refactored parallel sorting section
* Improved step for pivot candidates
* Changed condition for Radix sort
@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Dec 12, 2021

Vladimir,
I updated the benchmarking code and here are the complete test results: system vs dpqs 21.5 vs 21.11:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/DPQS-sort-1k-1M-stats.out

Full results:
https://github.com/bourgesl/nearly-optimal-mergesort-code/blob/master/sort-bench/results/2021/2021-12-final/sort-1k-1M-stats.out

All good, on 1k, 10k, 100k, 1m integer arrays.
Congratulations
🥳

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jan 9, 2022

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented Jan 12, 2022

Please review this PR, it is ready for months, or give comments, please !

* Updated javadoc
* Improved mixed insertion
* Optimized sort networking
* Improved pivot partitioning
@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Jan 12, 2022

Please review optimized sorting code.

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 9, 2022

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Feb 25, 2022

New update is coming

* Improved mixed insertion sort
* Optimized insertion sort
* Improved merging sort
* Optimized soring tests
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Apr 11, 2022

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Apr 25, 2022

waiting testing result from Laurent

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented May 6, 2022

I am improving test code (jmh + robust estimators) to have more robust results on small arrays (20 - 1000).
I do hope having new results within 1 or 2 weeks.

@tarsa
Copy link

@tarsa tarsa commented May 13, 2022

allocating extra buffers and catching OOME when sorting primitives is rather unsatisfactory. you're not giving a reliable option for sorting under low memory conditions. IMO at least the single-threaded primitives (ints, longs, floats, etc all non-objects) sorting should be frugal when it comes to memory usage.

just my 2 cents.

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented May 15, 2022

allocating extra buffers and catching OOME when sorting primitives is rather unsatisfactory. you're not giving a reliable option for sorting under low memory conditions. IMO at least the single-threaded primitives (ints, longs, floats, etc all non-objects) sorting should be frugal when it comes to memory usage.

just my 2 cents.

DPQS uses several sorting algorithms, merge sort & new radix sort need an extra buffer in contrary to isort, mixed isort, single and dual pivot quick sort.
In this PR an upper limit on the heap usage is in use min(max heap / 16, 128mb) to reduce the memory footprint.
Anyway catching OOME now permits DPQS to use in-place but slower algorithms if the extra buffer can not be allocated.
Possibly the upper limit could be made configurable using system properties if it is really critical.
To sum up, low allocations are now under control in this PR.

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented May 15, 2022

I checked the latest code: line 128:

Max length of additional buffer, limited by max_heap / 64 or 256mb elements (2gb max).

private static final int MAX_BUFFER_LENGTH =
        (int) Math.min(Runtime.getRuntime().maxMemory() >> 6, 256L << 20);

So the current upper limit concerns the max length = max_heap_bytes / 64 (max heap/8) or 2gb (if heap is huge).

@tarsa
Copy link

@tarsa tarsa commented May 15, 2022

i think it would make much more sense to have the extra buffer size limit in bytes, not in elements. for single-threaded sorting the limit should be low, e.g. 1/64 of the heap, not 1/8 (in case of sorting e.g. longs as each long is 8 byte long). multi-threaded sorting could be more aggressive when it comes to resource usage as it's less likely to be used in resource constrained environment.

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented May 20, 2022

@tarsa Hi Piotr,

Thank you for your interest to sorting and your point to allocation of buffers in sorting.

Let’s I share my thoughts.

The number of elements of any array is not more than ~2 billion
(max positive int value) and therefore the size (in bytes) of
int array is not more than ~2 GB * 4 = ~8 GB.

We need additional buffer for int, long, float and double types only.

We will have 2 constants which limit the number of int/float and
long/double elements in array, like this:

private static final int MAX_BUFFER_LENGTH_1 =              // for int/float
        (int) Math.min(Runtime.getRuntime().maxMemory() >> 10, Integer.MAX_VALUE);

private static final int MAX_BUFFER_LENGTH_2 =             // for long/double
        (int) Math.min(Runtime.getRuntime().maxMemory() >> 11, Integer.MAX_VALUE);

See, that the max number of elements for long/double is in 2 times less than for int/float,
because long/double occupies 2 times more bytes.

I take factors 10 and 11 as example, it may be other values, but the first is less than the second by 1.

Why do we need to use Math.min(…, Integer.MAX_VALUE)? We must do it in this case:
if Runtime.getRuntime().maxMemory() >> 11 is larger than max int (for example, we have power server),
casting to int will produce negative value. So, limit by 2 Gb max is required.

And the method tryAllocate will look like this:

private static Object tryAllocate(Object a, int length) {
try {
if (a instanceof int[]) {
return length > MAX_BUFFER_LENGTH_1 ? null : new int[length];
}
if (a instanceof long[]) {
return length > MAX_BUFFER_LENGTH_2 ? null : new long[length];
}
if (a instanceof float[]) {
return length > MAX_BUFFER_LENGTH_1 ? null : new float[length];
}
if (a instanceof double[]) {
return length > MAX_BUFFER_LENGTH_2 ? null : new double[length];
}
throw new IllegalArgumentException("Unknown array: " + a.getClass().getName());
} catch (OutOfMemoryError e) {
return null;
}
}

What do you think, what value of shift is the best, 6, 7, 8, 9, 10, 11, 12 etc.?
Runtime.getRuntime().maxMemory() >> 10??

Do you have actual scenarios? Actual requirements? Your feedback are welcome!

@tarsa
Copy link

@tarsa tarsa commented May 21, 2022

hi Vladimir,

the main reason i raise the issue with catching out-of-memory-error is that it's done very rarely in production code and also it's rather unpredictable when multiple things are going on (e.g. one thread can allocate a lot of memory, but other threads receive oome).

i've searched through /src/ in openjdk/jdk repository for oome (i.e. this one) and found that most times the oome is caught is in java.desktop module (probably this means the oome was related to native memory or other resources external to jvm), so i've narrowed search to java.base: https://github.com/search?q=outofmemoryerror+repo%3Aopenjdk%2Fjdk+path%3A%2Fsrc%2Fjava.base%2F+language%3AJava&type=Code&ref=advsearch&l=Java and found just 2 cases of catching oome without always rethrowing it:

overall, it seems that swallowing oome (i.e. in this case: catching and not rethrowing) is definitely not a readily used approach.

We need additional buffer for int, long, float and double types only.

We will have 2 constants which limit the number of int/float and
long/double elements in array, like this:
(...)
See, that the max number of elements for long/double is in 2 times less than for int/float,
because long/double occupies 2 times more bytes.

sounds good

Why do we need to use Math.min(…, Integer.MAX_VALUE)? (...) So, limit by 2 Gb max is required.

yep, understood

What do you think, what value of shift is the best, 6, 7, 8, 9, 10, 11, 12 etc.?
Runtime.getRuntime().maxMemory() >> 10??

Do you have actual scenarios? Actual requirements? Your feedback are welcome!

keeping the extra buffer size below 1% of max memory should be safe enough. currently the code is:

    private static final int MAX_BUFFER_LENGTH =
            (int) Math.min(Runtime.getRuntime().maxMemory() >> 6, 256L << 20);
    // ...
    private static Object tryAllocate(Object a, int length) {
        if (length > MAX_BUFFER_LENGTH) {
            return null;
        }
        try {
            if (a instanceof int[]) {
                return new int[length];
            }
            if (a instanceof long[]) {
                return new long[length];
            }
    ...

which means that in the worst case sizeof(long) * (max memory >> 6) == 12.5% of heap size limit would be used which is a lot when someone runs java code in memory constrained environment and that is rather common nowadays (microservices, containers, etc).

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented May 25, 2022

Hi Piotr,

I agree that catching out-of-memory-error is very rarely in prod.
I can say that LSD Radix sort works 3-5 times faster than Quicksort
(array of int 2M elements), merging sort works 10-30 (!) times faster
than Quicksort (structured array of int 2M elements), therefore it
is worth to use algorithms with additional buffers.

If we allocate buffer like 'new int[size]' and there is no free memory,
sorting fails with OOME. If we catch memory error, we switch to
in-place algorithms and sorting will be completed successfully.
I checked such use case, it works fine.

It doesn't matter if we run several threads or not. Some threads
may use additional buffers, some threads - not, but finally
all threads will be completed without errors.

The known problem with OOME is handling inside catch block.
If we try to create new object, try to log message with details,
we can face with new OOME. In DualPivotQuicksort we return null only,
no any actions, no new objects etc.

@bourgesl
Copy link
Contributor

@bourgesl bourgesl commented May 25, 2022

Well explained, vladimir.

I propose to adopt max_heap / 128 min to stay less than 1% footprint.

Then factors are 7 for ints, 8 for long, min.

For 1gb heap, it gives 8mb, so 2m ints or 1m long.

Laurent

@quincunx-45
Copy link

@quincunx-45 quincunx-45 commented May 25, 2022

Will the VM still exit if -XX:+ExitOnOutOfMemoryError/-XX:+CrashOnOutOfMemoryError/-XX:OnOutOfMemoryError/... was specified? I.e. can this change lead to more (avoidable) crashes?

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented May 25, 2022

This change will not lead to more crashes.

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Jun 22, 2022

@iaroslavski This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@iaroslavski
Copy link
Contributor Author

@iaroslavski iaroslavski commented Jun 22, 2022

Improvement of parallel sorting is under progress

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs rfr
8 participants