JDK-8277175 : Add a parallel multiply method to BigInteger #6409

Closed
wants to merge 14 commits

Conversation

kabutz
Contributor

@kabutz kabutz commented Nov 16, 2021

BigInteger currently uses three different algorithms for multiply: the simple quadratic "grade-school" algorithm; the slightly better Karatsuba once we exceed a bit-count threshold; and Toom-Cook 3 once we get into several thousands of bits. Since Toom-Cook 3 is a recursive algorithm, it is trivial to parallelize. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally, we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
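To illustrate the fork/join pattern involved, here is a self-contained sketch (this is not the JDK code; Karatsuba is shown because it is much shorter to write out than Toom-Cook 3, which forks five sub-products instead of three):

import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

// Sketch only: the same RecursiveTask pattern the PR applies to Toom-Cook 3,
// demonstrated with Karatsuba. The threshold value is illustrative.
final class ParallelKaratsuba extends RecursiveTask<BigInteger> {
    private static final int THRESHOLD_BITS = 2_560;
    private final BigInteger a, b;

    ParallelKaratsuba(BigInteger a, BigInteger b) { this.a = a; this.b = b; }

    @Override
    protected BigInteger compute() {
        if (a.bitLength() < THRESHOLD_BITS || b.bitLength() < THRESHOLD_BITS)
            return a.multiply(b);                          // small: sequential
        int h = Math.max(a.bitLength(), b.bitLength()) / 2;
        BigInteger aHi = a.shiftRight(h), aLo = a.subtract(aHi.shiftLeft(h));
        BigInteger bHi = b.shiftRight(h), bLo = b.subtract(bHi.shiftLeft(h));
        ParallelKaratsuba hi = new ParallelKaratsuba(aHi, bHi);
        ParallelKaratsuba lo = new ParallelKaratsuba(aLo, bLo);
        hi.fork();                                         // run in other workers
        lo.fork();
        BigInteger mid = new ParallelKaratsuba(aHi.add(aLo), bHi.add(bLo))
                .compute();                                // caller thread helps too
        BigInteger hiProd = hi.join(), loProd = lo.join();
        return hiProd.shiftLeft(2 * h)
                .add(mid.subtract(hiProd).subtract(loProd).shiftLeft(h))
                .add(loProd);
    }
}

Invoked as new ParallelKaratsuba(a, b).invoke(), forked subtasks land in the common pool; the actual change instead threads a boolean through the existing private multiply method, as described above.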

The performance improvements are as expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum-of-squares version) completes in 9.2 seconds with parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 (one socket, eight cores, two hyperthreads per core) laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084-bit number.

We have also parallelized the private square() method. Internally, square() defaults to sequential execution.

Some benchmark results, run on my 1-6-2 server:

Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op

We can see that for larger calculations (fib 100m) the execution is 2.7x faster in parallel, for medium size (fib 10m) it is 1.873x faster, and for small (fib 1m) it is roughly the same. Considering that the Fibonacci algorithm we used is itself sequential, and that the last three multiplications dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
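As context, here is a self-contained sketch of a fast-doubling ("sum of squares") Fibonacci driver in the shape used by the tests quoted later in this review (fibonacci(n, BigInteger::multiply)); this exact code is illustrative, not the PR's:

import java.math.BigInteger;
import java.util.function.BinaryOperator;

// Illustrative sketch: fast-doubling Fibonacci parameterized over the
// multiply operator, so the same driver can run with BigInteger::multiply
// or (with this PR applied) BigInteger::parallelMultiply.
public class Fib {
    static BigInteger fibonacci(long n, BinaryOperator<BigInteger> multiply) {
        return fib(n, multiply)[0];
    }

    // returns {F(n), F(n+1)} using the doubling identities:
    //   F(2k)   = F(k) * (2*F(k+1) - F(k))
    //   F(2k+1) = F(k)^2 + F(k+1)^2
    private static BigInteger[] fib(long n, BinaryOperator<BigInteger> mul) {
        if (n == 0) return new BigInteger[] {BigInteger.ZERO, BigInteger.ONE};
        BigInteger[] half = fib(n >> 1, mul);
        BigInteger a = half[0], b = half[1];                     // F(k), F(k+1)
        BigInteger c = mul.apply(a, b.shiftLeft(1).subtract(a)); // F(2k)
        BigInteger d = mul.apply(a, a).add(mul.apply(b, b));     // F(2k+1)
        return (n & 1) == 0
                ? new BigInteger[] {c, d}
                : new BigInteger[] {d, c.add(d)};
    }

    public static void main(String... args) {
        System.out.println(fibonacci(1_000, BigInteger::multiply));
    }
}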


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed
  • Change requires a CSR request to be approved

Issues

  • JDK-8277175: Add a parallel multiply method to BigInteger
  • JDK-8278886: Add a parallel multiply method to BigInteger (CSR)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6409/head:pull/6409
$ git checkout pull/6409

Update a local copy of the PR:
$ git checkout pull/6409
$ git pull https://git.openjdk.java.net/jdk pull/6409/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6409

View PR using the GUI difftool:
$ git pr show -t 6409

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6409.diff

@bridgekeeper

@bridgekeeper bridgekeeper bot commented Nov 16, 2021

👋 Welcome back kabutz! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Nov 16, 2021
@openjdk

@openjdk openjdk bot commented Nov 16, 2021

@kabutz The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs label Nov 16, 2021
@mlbridge

@mlbridge mlbridge bot commented Nov 16, 2021

@jddarcy
Member

@jddarcy jddarcy commented Nov 16, 2021

It isn't clear to me that parallelMultiply should be a public method as opposed to an implementation detail.

@kabutz
Contributor Author

@kabutz kabutz commented Nov 16, 2021

It isn't clear to me that parallelMultiply should be a public method as opposed to an implementation detail.

Hi Joe, thanks for responding. I would have preferred for it to be an implementation detail, but I thought it best to keep things consistent with other parts of the JDK. For example:

  • Arrays#sort() vs Arrays#parallelSort()
  • Collection#stream() vs Collection#parallelStream()

One could do it as an implementation detail, with a special flag to turn the functionality off. But that would mean that parallelMultiply() would be on for all multiplications in a system, and I would expect that some organizations would, for whatever reason, not want to use more cores than absolutely necessary.
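For illustration, with an explicit method the opt-in stays visible at each call site (a sketch; operand sizes are placeholders):

import java.math.BigInteger;

class OptInExample {
    public static void main(String... args) {
        BigInteger a = BigInteger.ONE.shiftLeft(10_000_000).subtract(BigInteger.ONE);
        BigInteger b = BigInteger.ONE.shiftLeft(12_000_000).subtract(BigInteger.ONE);
        BigInteger p1 = a.multiply(b);          // sequential, as before
        BigInteger p2 = a.parallelMultiply(b);  // explicit opt-in (this PR)
        System.out.println(p1.equals(p2));      // same mathematical result
    }
}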

@jddarcy
Member

@jddarcy jddarcy commented Nov 16, 2021

It isn't clear to me that parallelMultiply should be a public method as opposed to an implementation detail.

Hi Joe, thanks for responding. I would have preferred for it to be an implementation detail, but I thought it best to keep things consistent with other parts of the JDK. For example:

* Arrays#sort() vs Arrays#parallelSort()

* Collection#stream() vs Collection#parallelStream()

One could do it as an implementation detail, with a special flag to turn the functionality off. But that would mean that parallelMultiply() would be on for all multiplications in a system, and I would expect that some organizations would, for whatever reason, not want to use more cores than absolutely necessary.

Hi Heinz,

As you cite, there are a few other cases in the JDK API where a second "parallelFoo" method is exposed. However, I don't think those precedents would necessarily mandate a parallelMultiply method in BigInteger. Without a separate method there is, of course, a question of tuning.

@kabutz
Contributor Author

@kabutz kabutz commented Nov 16, 2021

As you cite, there are a few other cases in the JDK API where a second "parallelFoo" method is exposed. However, I don't think those precedents would necessarily mandate a parallelMultiply method in BigInteger. Without a separate method there is, of course, a question of tuning.

Hi Joe, I guess with sorting it makes sense to have two different methods, because it is usually quite a bit slower to parallel-sort an already-sorted array. Similarly, if the array is short, we don't want to default to trying it in parallel unnecessarily.

This is different. By the time we are doing a Toom-Cook calculation, we already have thousands of bits that we are multiplying together, thus a rather large number. In all likelihood, doing the multiplication in parallel will always be faster, provided that we have enough cores. We would, however, need a fall-back for the case where there are no threads in the common FJP: we could detect that when multiply() is called and do the calculation sequentially instead.

We probably also need to be able to turn it off entirely. Perhaps with something like -Djava.math.BigInteger.disableParallelMultiply=true.
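A sketch of what that guard could look like (hypothetical helper; the property name is the one proposed above and does not exist in the JDK; the parallelism check mirrors Arrays.parallelSort(), which falls back to sequential when the common pool parallelism is 1):

import java.util.concurrent.ForkJoinPool;

// Hypothetical guard: fall back to sequential when parallelism is disabled
// by the proposed (made-up) property, or when the common pool has no real
// parallelism to offer.
final class ParallelPolicy {
    static boolean useParallel() {
        return !Boolean.getBoolean("java.math.BigInteger.disableParallelMultiply")
                && ForkJoinPool.getCommonPoolParallelism() > 1;
    }
}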

Heinz

@mlbridge

@mlbridge mlbridge bot commented Nov 16, 2021

Mailing list message from Simon Roberts on core-libs-dev:

Is there not also an architectural consideration here? I'm inclined to think that just because execution in parallel finishes in less wall time does not mean that's the right thing to do. Those CPUs will be unavailable for other work, and in any multi-user/client type system, that effectively makes life worse for the others, doesn't it?

--
Simon Roberts
(303) 249 3613

@kabutz
Contributor Author

@kabutz kabutz commented Nov 16, 2021

Is there not also an architectural consideration here? I'm inclined to think that just because execution in parallel finishes in less wall time does not mean that's the right thing to do. Those CPUs will be unavailable for other work, and in any multi-user/client type system, that effectively makes life worse for the others, doesn't it?

Exactly, which is why we would need a kill-switch (or an on-switch) if we don't have a public parallelMultiply() method. It would then be either on or off for all multiplies, albeit only for large numbers.

@cl4es
Member

@cl4es cl4es commented Nov 16, 2021

I would be wary of making any API use multiple threads behind the scenes without the user explicitly asking for it. While the latency of the given operation might improve in isolation, parallelization always incurs some (often significant) additional cost of computation. This might reduce power efficiency, reduce CPU availability for other tasks, and play tricks with scalability.

Cost: On my system multiply burns about half as many cycles as parallelMultiply, even though it takes 2.4x longer (measured with -prof perfnorm).

Scalability: To simulate how well the solution scales you could try running the multiply and parallelMultiply micros with increasingly large -t values. (-t controls the number of threads running the benchmarks in parallel, default 1). On my system the advantage of parallelMultiply diminishes markedly as I saturate the system. On a test where I see a 2.4x speed-up at -t 1 (n = 50000000), the advantage drops to 1.5x at -t 4 and only gives me a 1.1x speed-up at -t 8.

I'd favor a public parallelMultiply(). Alternatively a flag to opt-in to parallelization.

(Nit: avoid appending flags to microbenchmarks that aren't strictly necessary for the tests, or such that can be reasonably expected to be within bounds on any test system. I didn't have 16 GB of free RAM.)
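To make the saturation experiment concrete, a micro of roughly this shape (a sketch, not necessarily the exact benchmark in the PR) can be run with JMH's -t option to vary the benchmark-thread count and -prof perfnorm to normalize cycle counts:

import java.math.BigInteger;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch of the micro under discussion; operand construction is illustrative.
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class BigIntegerParallelMultiply {
    @Param({"1000000", "10000000", "100000000"})
    int n;
    BigInteger a, b;

    @Setup
    public void setup() {
        Random r = new Random(42);      // seeded for repeatability
        a = new BigInteger(n, r);       // random n-bit operand
        b = new BigInteger(n, r);
    }

    @Benchmark
    public BigInteger multiply() { return a.multiply(b); }

    @Benchmark
    public BigInteger parallelMultiply() { return a.parallelMultiply(b); }
}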

@fisk
Contributor

@fisk fisk commented Nov 17, 2021

I would be wary of making any API use multiple threads behind the scenes without the user explicitly asking for it. While the latency of the given operation might improve in isolation, parallelization always incurs some (often significant) additional cost of computation. This might reduce power efficiency, reduce CPU availability for other tasks, and play tricks with scalability.

Cost: On my system multiply burns about half as many cycles as parallelMultiply, even though it takes 2.4x longer (measured with -prof perfnorm).

Scalability: To simulate how well the solution scales you could try running the multiply and parallelMultiply micros with increasingly large -t values. (-t controls the number of threads running the benchmarks in parallel, default 1). On my system the advantage of parallelMultiply diminishes markedly as I saturate the system. On a test where I see a 2.4x speed-up at -t 1 (n = 50000000), the advantage drops to 1.5x at -t 4 and only gives me a 1.1x speed-up at -t 8.

I'd favor a public parallelMultiply(). Alternatively a flag to opt-in to parallelization.

(Nit: avoid appending flags to microbenchmarks that aren't strictly necessary for the tests, or such that can be reasonably expected to be within bounds on any test system. I didn't have 16 GB of free RAM.)

+1 on everything you said. We tell people not to oversaturate their CPUs because latencies then go out of the window. It is not helpful when the JDK internals automagically oversaturate the machine for you even though you didn't ask for it. And I can imagine a class like BigInteger being used in latency-critical applications. It definitely ought to be an opt-in, as opposed to a global opt-out behind some flag that 99.9% of users don't know about and shouldn't have to know about, to do the expected thing. Principle of least surprise holds IMHO.

Also, this is seemingly only a performance win if the rest of the machine is idle. It is far from obvious that the idling machine is in greater need of better performance compared to the dangerously saturated machine; my gut feeling would be that it is the other way around. That might be another reason why it might not be a suitable default behaviour IMO.

@bplb
Member

@bplb bplb commented Nov 17, 2021

I also do not like potentially non-obvious default behavior, nor a command line flag, nor a (static) setting on BigInteger. Thus I think the original parallelMultiply() (or perhaps multiplyParallel()) would be preferable.

Would adding a parameter in the nature of maxProcessors make any sense?

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Nov 17, 2021

I also do not like potentially non-obvious default behavior, nor a command line flag, nor a (static) setting on BigInteger. Thus I think the original parallelMultiply() (or perhaps multiplyParallel()) would be preferable.

For the parallel supported features we added in Java 8 we made it explicit for the reasons you and others have stated.

Would adding a parameter in the nature of maxProcessors make any sense?

Kind of, but I would recommend not doing it. That's hard to express in a manner that lets developers choose appropriate values across all deployments. This is why you don't see such configuration for parallel streams or the parallel array operations. It's controlled by the common pool parallelism configuration (developers have learnt the trick of running within some constrained F/J task to limit parallelism, but that is unsupported behaviour). The closest we get to anything configurable is a parallelism threshold for methods on ConcurrentHashMap, where the developer can assess the cost of transformation per element in combination with how many elements are to be processed (likely assessed empirically from performance measurements).
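For reference, that ConcurrentHashMap precedent looks like this (illustrative example, not from the PR): the long argument is the parallelism threshold, and bulk operations run in parallel only when the map holds at least that many elements (Long.MAX_VALUE forces sequential execution).

import java.util.concurrent.ConcurrentHashMap;

class ChmThresholdExample {
    public static void main(String... args) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        // go parallel only if there are at least 10_000 elements
        map.forEach(10_000L, (k, v) -> System.out.println(k + "=" + v));
    }
}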

--

I would like to get a sense of how common it might be that developers operate on numbers large enough for this to become worthwhile to support.

The implementation looks reasonable and quite cleverly minimal, but I think it would be useful to get a sense of whether the recursion goes beyond some proportion of the number of runtime processors, after which there is likely no point in creating more recursive tasks. E.g., from the size in bits of the inputs, can we determine a useful approximate threshold?

@mlbridge

@mlbridge mlbridge bot commented Nov 17, 2021

Mailing list message from Brian Burkhalter on core-libs-dev:

On Nov 17, 2021, at 11:14 AM, Paul Sandoz <psandoz at openjdk.java.net> wrote:

Would adding a parameter in the nature of `maxProcessors` make any sense?

Kind of, but I would recommend not doing it. That's hard to express in a manner that lets developers choose appropriate values across all deployments. This is why you don't see such configuration for parallel streams or the parallel array operations. [...]

Thanks for the background.

[...] I think it would be useful to get a sense of whether the recursion goes beyond some proportion of the number of runtime processors, after which there is likely no point in creating more recursive tasks. E.g., from the size in bits of the inputs, can we determine a useful approximate threshold?

That sounds like a good idea.

@kabutz
Contributor Author

@kabutz kabutz commented Nov 17, 2021

I would be wary of making any API use multiple threads behind the scenes without the user explicitly asking for it. While the latency of the given operation might improve in isolation, parallelization always incurs some (often significant) additional cost of computation. This might reduce power efficiency, reduce CPU availability for other tasks, and play tricks with scalability.

Cost: On my system multiply burns about half as many cycles as parallelMultiply, even though it takes 2.4x longer (measured with -prof perfnorm).

Scalability: To simulate how well the solution scales you could try running the multiply and parallelMultiply micros with increasingly large -t values. (-t controls the number of threads running the benchmarks in parallel, default 1). On my system the advantage of parallelMultiply diminishes markedly as I saturate the system. On a test where I see a 2.4x speed-up at -t 1 (n = 50000000), the advantage drops to 1.5x at -t 4 and only gives me a 1.1x speed-up at -t 8.

Furthermore, since it uses the common FJP by default, any simultaneous parallel execution (parallel sort, parallel streams, etc.) would decrease the performance.

I'd favor a public parallelMultiply(). Alternatively a flag to opt-in to parallelization.

The reason I would prefer a public method to a flag is that it might be useful to enable parallel multiply on a case-by-case basis. With a flag, it is an all-or-nothing approach. If it has to be a flag, then I'd agree that it should be opt-in.

(Nit: avoid appending flags to microbenchmarks that aren't strictly necessary for the tests, or such that can be reasonably expected to be within bounds on any test system. I didn't have 16 GB of free RAM.)

Thanks, will change that. Fun fact: the book Optimizing Java has a graph in the introduction that refers to some tests I did on Fibonacci a long time ago. The GC was dominating because the spaces were too small. However, in that case I was calculating Fibonacci of 1 billion. For "just" 100m we don't need as much memory.

Here is the amount of memory that each of the Fibonacci calculations allocates. For the 100m calculation, the resident set size is about 125 MB for multiply() and about 190 MB for parallelMultiply().

BigIntegerParallelMultiply.multiply:              1_000_000    177 MB
BigIntegerParallelMultiply.multiply:             10_000_000   3411 MB
BigIntegerParallelMultiply.multiply:            100_000_000  87689 MB
BigIntegerParallelMultiply.parallelMultiply:      1_000_000    177 MB
BigIntegerParallelMultiply.parallelMultiply:     10_000_000   3411 MB
BigIntegerParallelMultiply.parallelMultiply:    100_000_000  87691 MB

@kabutz
Contributor Author

@kabutz kabutz commented Nov 17, 2021

I would be wary of making any API use multiple threads behind the scenes without the user explicitly asking for it. While the latency of the given operation might improve in isolation, parallelization always incurs some (often significant) additional cost of computation. This might reduce power efficiency, reduce CPU availability for other tasks, and play tricks with scalability.
Cost: On my system multiply burns about half as many cycles as parallelMultiply, even though it takes 2.4x longer (measured with -prof perfnorm).
Scalability: To simulate how well the solution scales you could try running the multiply and parallelMultiply micros with increasingly large -t values. (-t controls the number of threads running the benchmarks in parallel, default 1). On my system the advantage of parallelMultiply diminishes markedly as I saturate the system. On a test where I see a 2.4x speed-up at -t 1 (n = 50000000), the advantage drops to 1.5x at -t 4 and only gives me a 1.1x speed-up at -t 8.
I'd favor a public parallelMultiply(). Alternatively a flag to opt-in to parallelization.
(Nit: avoid appending flags to microbenchmarks that aren't strictly necessary for the tests, or such that can be reasonably expected to be within bounds on any test system. I didn't have 16 GB of free RAM.)

+1 on everything you said. We tell people not to oversaturate their CPUs because latencies then go out of the window. It is not helpful when the JDK internals automagically oversaturate the machine for you even though you didn't ask for it. And I can imagine a class like BigInteger being used in latency-critical applications. It definitely ought to be an opt-in, as opposed to a global opt-out behind some flag that 99.9% of users don't know about and shouldn't have to know about, to do the expected thing. Principle of least surprise holds IMHO.

Considering how much memory BigInteger allocates, I'd be surprised if latency-critical applications used it. Besides, we are talking about rather large calculations, with numbers that are thousands of bits long. But of course I agree that opt-in would be better.

Also, this is seemingly only a performance win if the rest of the machine is idle. It is far from obvious that the idling machine is in greater need of better performance compared to the dangerously saturated machine; my gut feeling would be that it is the other way around. That might be another reason why it might not be a suitable default behaviour IMO.

This is why I tried to keep the behaviour similar to what we are used to - parallel streams have to be explicitly opted into, and I hardly ever do that in real code. Same with parallelSort().

@kabutz
Contributor Author

@kabutz kabutz commented Nov 17, 2021

I also do not like potentially non-obvious default behavior, nor a command line flag, nor a (static) setting on BigInteger. Thus I think the original parallelMultiply() (or perhaps multiplyParallel()) would be preferable.

For the parallel supported features we added in Java 8 we made it explicit for the reasons you and others have stated.

Would adding a parameter in the nature of maxProcessors make any sense?

Kind of, but I would recommend not doing it. That's hard to express in a manner that lets developers choose appropriate values across all deployments. This is why you don't see such configuration for parallel streams or the parallel array operations. It's controlled by the common pool parallelism configuration (developers have learnt the trick of running within some constrained F/J task to limit parallelism, but that is unsupported behaviour).

The "unsupported behavior" is described in the fork() method of ForkJoinTask:

"Arranges to asynchronously execute this task in the pool the current task is running in, if applicable, or using the {@link ForkJoinPool#commonPool()} if not {@link #inForkJoinPool}"

So yes, there is that workaround if someone really wants more threads for the calculation. Or they can increase the common pool parallelism with the system property.
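In code, that workaround looks like this (a sketch; operand sizes are placeholders): because forked tasks stay in the pool the current task runs in, submitting the multiply to a small private pool bounds its parallelism.

import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;

class BoundedParallelMultiply {
    public static void main(String... args) throws Exception {
        BigInteger a = BigInteger.ONE.shiftLeft(4_000_000).subtract(BigInteger.ONE);
        BigInteger b = BigInteger.ONE.shiftLeft(4_000_000).subtract(BigInteger.ONE);
        ForkJoinPool pool = new ForkJoinPool(2);   // at most 2 workers
        try {
            // the recursive subtasks fork into "pool", not the common pool
            BigInteger product = pool.submit(() -> a.parallelMultiply(b)).get();
            System.out.println(product.bitLength());
        } finally {
            pool.shutdown();
        }
    }
}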

The closest we get to anything configurable is a parallelism threshold for methods on ConcurrentHashMap, where the developer can assess the cost of transformation per element in combination with how many elements are to be processed (likely assessed empirically from performance measurements).

AFAIK that concurrencyLevel is not really used anymore by CHM since Java 8, except to size the table with at least that many initial bins:

        if (initialCapacity < concurrencyLevel)   // Use at least as many bins
            initialCapacity = concurrencyLevel;   // as estimated threads

--

I would like to get a sense of how common it might be that developers operate on numbers large enough for this to become worthwhile to support.

The implementation looks reasonable and quite cleverly minimal, but I think it would be useful to get a sense of whether the recursion goes beyond some proportion of the number of runtime processors, after which there is likely no point in creating more recursive tasks. E.g., from the size in bits of the inputs, can we determine a useful approximate threshold?

I will look at this and get back to you. Usually by the time we get to Toom-Cook 3, the chunks we are working with are so large that the RecursiveTask overheads are probably insignificant in comparison. But I will check.

@kabutz
Contributor Author

@kabutz kabutz commented Nov 17, 2021

One concern I had was what would happen when the common pool parallelism is set to 0. In the parallelSort mechanism, they only sort in parallel if the parallelism is at least 2; thus a parallel sort on a dual-core machine (without hyperthreading) will run sequentially. However, the parallel BigInteger multiply seems to work OK with different common pool sizes. We always have one more thread working than the common pool parallelism, since the thread that calls parallelMultiply() also does calculations:

Showing only the cases for Fibonacci(100m):

Method            (n)           common_parallelism  ms/op
multiply          100_000_000   0                   24619.004         
parallelMultiply  100_000_000   0                   24977.280        
multiply          100_000_000   1                   24526.498
parallelMultiply  100_000_000   1                   18169.969
multiply          100_000_000   2                   24528.111         
parallelMultiply  100_000_000   2                   11073.637         
multiply          100_000_000   3                   24861.324
parallelMultiply  100_000_000   3                    9472.986
multiply          100_000_000   4                   25036.295         
parallelMultiply  100_000_000   4                    9151.940         
multiply          100_000_000   5                   25129.139
parallelMultiply  100_000_000   5                    8964.403
multiply          100_000_000  11                   24717.125         
parallelMultiply  100_000_000  11                    9024.319         
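The common-pool parallelism for these runs was set with the standard system property; the benchmark class name below is a placeholder:

$ java -Djava.util.concurrent.ForkJoinPool.common.parallelism=3 FibonacciBenchmark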

@dholmes-ora
Member

@dholmes-ora dholmes-ora commented Nov 18, 2021

To add my 2c IMO a parallel version of this type absolutely must be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David

@kabutz
Contributor Author

@kabutz kabutz commented Nov 18, 2021

To add my 2c IMO a parallel version of this type absolutely must be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David

Yes, it must be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.
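For context, those existing thresholds are plain constants in java.math.BigInteger (values as found in recent OpenJDK sources; counted in 32-bit ints):

// From java.math.BigInteger:
private static final int KARATSUBA_THRESHOLD = 80;   // 80 ints = 2560 bits
private static final int TOOM_COOK_THRESHOLD = 240;  // 240 ints = 7680 bits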

@mlbridge

@mlbridge mlbridge bot commented Nov 18, 2021

Mailing list message from Bernd Eckenfels on core-libs-dev:

What about a new API multiply method which takes a ForkJoinPool, and only if that is specified will it use the parallel mode (and only if the size threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

Regards,
Bernd
--
http://bernd.eckenfels.net
________________________________
From: core-libs-dev <core-libs-dev-retn at openjdk.java.net> on behalf of kabutz <duke at openjdk.java.net>
Sent: Thursday, November 18, 2021 11:41:40 AM
To: core-libs-dev at openjdk.java.net <core-libs-dev at openjdk.java.net>
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes <dholmes at openjdk.org> wrote:

To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David

Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409

@mlbridge

@mlbridge mlbridge bot commented Nov 18, 2021

Mailing list message from Remi Forax on core-libs-dev:

----- Original Message -----

From: "Bernd Eckenfels" <ecki at zusammenkunft.net>
To: "core-libs-dev" <core-libs-dev at openjdk.java.net>
Sent: Thursday, 18 November 2021 12:07:19
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

What about a new API multiply method which takes a ForkJoinPool, and only if that is specified will it use the parallel mode (and only if the size threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

You don't need it; here is the usual trick if you want to use a specific fork/join pool:
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream

Regards,
Bernd

regards,
Rémi

@mlbridge

@mlbridge mlbridge bot commented Nov 18, 2021

Mailing list message from Bernd Eckenfels on core-libs-dev:

Yes, but that does not help with the decision whether parallel should be used or not. But yes, if it is generally not wanted to make the pool explicit, a simple parallel signature without an argument would also work to make the decision explicit (i.e. a new API). That won't automatically tune the performance, but it does allow users to use it - the majority would be crypto anyway, where it could be used by the JCE and JSSE (maybe?).

Regards,
Bernd

--
http://bernd.eckenfels.net
________________________________
From: Remi Forax <forax at univ-mlv.fr>
Sent: Thursday, November 18, 2021 12:16:31 PM
To: Bernd Eckenfels <ecki at zusammenkunft.net>
Cc: core-libs-dev <core-libs-dev at openjdk.java.net>
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

----- Original Message -----

From: "Bernd Eckenfels" <ecki at zusammenkunft.net>
To: "core-libs-dev" <core-libs-dev at openjdk.java.net>
Sent: Thursday, 18 November 2021 12:07:19
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

What about a new API multiply method which takes a ForkJoinPool, and only if that is specified will it use the parallel mode (and only if the size threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

You don't need it; here is the usual trick if you want to use a specific fork/join pool:
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream

Regards,
Bernd

regards,
Rémi

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Jan 14, 2022

@kabutz I just finalized the CSR (the new year break meant all this is taking longer than usual). Based on that review we may need to make some tweaks to the API doc, which should be quick to process. So it should be in the bag for 19.

for (int n = 0; n <= 10; n++) {
    BigInteger fib = fibonacci(n, BigInteger::multiply);
    System.out.printf("fibonacci(%d) = %d%n", n, fib);
}
Member

@PaulSandoz PaulSandoz Jan 15, 2022


I think we can remove this and the loop block at #70-80, since we have the performance test. After that we are good.

Contributor Author

@kabutz kabutz Jan 20, 2022


Done.

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Jan 19, 2022

@kabutz please see comments from Joe on the CSR, which should be easy to address (I can update the CSR after you make changes).

@kabutz
Contributor Author

@kabutz kabutz commented Jan 28, 2022

@kabutz please see comments from Joe on the CSR, which should be easy to address (I can update the CSR after you make changes).

I'm working on some results for Joe's question about latency vs CPU usage for parallelMultiply() vs multiply(). It wasn't so easy, because measuring a single thread is easier than measuring all of the FJP threads. But I have a nice benchmark that I'm running now. I had to write my own harness rather than use JMH, because I don't think JMH can measure at that level. I'm also measuring object allocation.

Furthermore, I'm testing against all Java versions going back to Java 8, to make sure that we don't get any surprises. Here are the results for my patched build:

BigInteger.multiply()
real  0m22.616s
user  0m22.470s
sys   0m0.008s
mem   84.0GB
BigInteger.parallelMultiply()
real  0m6.283s
user  1m3.200s
sys   0m0.004s
mem   84.0GB

I will upload the results for all the Java versions later, and will also submit the benchmark.

@kabutz
Contributor Author

@kabutz kabutz commented Jan 28, 2022

I have added a benchmark for checking the performance difference between sequential and parallel multiply of very large Mersenne primes using BigInteger. We want to measure real time, user time, system time, and the amount of memory allocated. To calculate this, we create our own thread factory for the common ForkJoinPool and then use that to measure user time, CPU time, and bytes allocated.

We use reflection to discover all methods that match "*ultiply", and use them to multiply two very large Mersenne primes together.
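For the curious, here is a compressed sketch of that measurement idea (the actual submitted benchmark differs; the allocation measurement via com.sun.management.ThreadMXBean is elided):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.math.BigInteger;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;

// Sketch: register every common-pool worker via a custom factory, then sum
// the caller's and the workers' user time after a one-shot multiply.
public class MultiplyHarness {
    static final Set<Thread> WORKERS = ConcurrentHashMap.newKeySet();

    public static final class Factory
            implements ForkJoinPool.ForkJoinWorkerThreadFactory {
        @Override
        public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
            ForkJoinWorkerThread t =
                    ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
            WORKERS.add(t);
            return t;
        }
    }

    public static void main(String... args) {
        // must be set before the common pool is first touched
        System.setProperty(
                "java.util.concurrent.ForkJoinPool.common.threadFactory",
                Factory.class.getName());
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        BigInteger m1 = BigInteger.ONE.shiftLeft(57885161).subtract(BigInteger.ONE);
        BigInteger m2 = BigInteger.ONE.shiftLeft(82589933).subtract(BigInteger.ONE);
        long start = System.nanoTime();
        BigInteger product = m1.parallelMultiply(m2);
        long real = System.nanoTime() - start;
        long user = mx.getCurrentThreadUserTime();
        for (Thread t : WORKERS) {
            long u = mx.getThreadUserTime(t.getId()); // -1 if thread has died
            if (u > 0) user += u;
        }
        System.out.printf("real %.3fs user %.3fs bits %d%n",
                real / 1e9, user / 1e9, product.bitLength());
    }
}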

Results on a 1-6-2 machine running Ubuntu Linux:

Memory allocation increased from 83.9 GB to 84 GB, for both the sequential and parallel versions - an increase of just 0.1%. On this machine, the parallel version was 3.8x faster in latency (real time), but it used 2.7x more CPU resources.

Testing multiplication of the Mersenne primes 2^57885161-1 and 2^82589933-1

openjdk version "18-internal" 2022-03-15

BigInteger.parallelMultiply()
real  0m6.288s
user  1m3.010s
sys   0m0.027s
mem   84.0GB
BigInteger.multiply()
real  0m23.682s
user  0m23.530s
sys   0m0.004s
mem   84.0GB

openjdk version "1.8.0_302"

BigInteger.multiply()
real  0m25.657s
user  0m25.390s
sys   0m0.001s
mem   83.9GB

openjdk version "9.0.7.1"

BigInteger.multiply()
real  0m24.907s
user  0m24.700s
sys   0m0.001s
mem   83.9GB

openjdk version "10.0.2" 2018-07-17

BigInteger.multiply()
real  0m24.632s
user  0m24.380s
sys   0m0.004s
mem   83.9GB

openjdk version "11.0.12" 2021-07-20 LTS

BigInteger.multiply()
real  0m22.114s
user  0m21.930s
sys   0m0.001s
mem   83.9GB

openjdk version "12.0.2" 2019-07-16

BigInteger.multiply()
real  0m23.015s
user  0m22.830s
sys   0m0.000s
mem   83.9GB

openjdk version "13.0.9" 2021-10-19

BigInteger.multiply()
real  0m23.548s
user  0m23.350s
sys   0m0.005s
mem   83.9GB

openjdk version "14.0.2" 2020-07-14

BigInteger.multiply()
real  0m22.918s
user  0m22.530s
sys   0m0.131s
mem   83.9GB

openjdk version "15.0.5" 2021-10-19

BigInteger.multiply()
real  0m22.038s
user  0m21.750s
sys   0m0.003s
mem   83.9GB

openjdk version "16.0.2" 2021-07-20

BigInteger.multiply()
real  0m23.049s
user  0m22.760s
sys   0m0.006s
mem   83.9GB

openjdk version "17" 2021-09-14

BigInteger.multiply()
real  0m22.580s
user  0m22.310s
sys   0m0.001s
mem   83.9GB

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Feb 2, 2022

@kabutz thanks for the additional testing, kind of what we intuitively expected.

Can you please update the specification in response to Joe's comment?

Generally for parallel constructs we try to say as little as possible with regard to latency, CPU time, and memory. The first two are sort of obvious, the latter less so for the developer.

From your results I think we can say a little more. Here is a suggested update addressing Joe's comments:

/**
 * Returns a BigInteger whose value is {@code (this * val)}.
 * When both {@code this} and {@code val} are large, typically
 * in the thousands of bits, parallel multiply might be used. 
 * This method returns the exact same mathematical result as {@link #multiply}. 
 *
 * @implNote This implementation may offer better algorithmic
 * performance when {@code val == this}.
 * 
 * @implNote Compared to {@link #multiply}, this implementation's parallel
 * multiplication algorithm will use more CPU resources to compute the result
 * faster, with a relatively small increase in memory consumption.
 *
 * @param  val value to be multiplied by this BigInteger.
 * @return {@code this * val}
 * @see #multiply
 */

@jddarcy
Member

@jddarcy jddarcy commented Feb 2, 2022

@kabutz thanks for the additional testing, kind of what we intuitively expected.

Can you please update the specification in response to Joe's comment?

Generally for parallel constructs we try to say as little as possible with regard to latency, CPU time, and memory. The first two are sort of obvious, the latter less so for the developer.

From your results I think we can say a little more. Here is a suggested update addressing Joe's comments:

/**
 * Returns a BigInteger whose value is {@code (this * val)}.
 * When both {@code this} and {@code val} are large, typically
 * in the thousands of bits, parallel multiply might be used. 
 * This method returns the exact same mathematical result as {@link #multiply}. 
 *
 * @implNote This implementation may offer better algorithmic
 * performance when {@code val == this}.
 * 
 * @implNote Compared to {@link #multiply}, this implementation's parallel
 * multiplication algorithm will use more CPU resources to compute the result
 * faster, with a relatively small increase in memory consumption.
 *
 * @param  val value to be multiplied by this BigInteger.
 * @return {@code this * val}
 * @see #multiply
 */

Yes, my intention is that the parallelMultiply spec give some guidance to the user on when to use it and a warning about the consequences of doing so (same answer, should be in less time, but more compute and possibly a bit more memory).

@kabutz
Contributor Author

@kabutz kabutz commented Feb 3, 2022

The multiply() and parallelMultiply() methods use exactly the same amount of memory now. However, they both use a little more than the previous multiply() when the numbers are very large. We tried various approaches to keep the memory usage the same for the non-parallel multiply(), but the solutions were not elegant. Since the small memory increase only occurs when the object allocation is huge anyway, the extra memory did not make a difference. For small numbers, multiply() and parallelMultiply() behave exactly like the old multiply(); multiply() thus has the same latency and CPU consumption as before.

A question about the wording of the @implNote: in multiply() they say "An implementation may offer better algorithmic ...", but we changed this to "This implementation may offer better algorithmic ...". I've kept it as "This implementation may ...", but what is the better way of writing such implementation notes?

* @implNote Compared to {@link #multiply}, this implementation's
* parallel multiplication algorithm will use more CPU resources
* to compute the result faster, with no increase in memory
* consumption.
Member

@jddarcy jddarcy Feb 3, 2022


The implNote should cover a space of possible parallel multiply implementations so it doesn't have to be updated as often as the implementation is tuned or adjusted. So I'd prefer to have a statement like "may use more memory" even if the current implementation doesn't actually use more memory. If there are any "contraindications" on when to use the method, they could be listed here too.

Member

@PaulSandoz PaulSandoz Feb 9, 2022


@kabutz I approved, but can you address Joe's comment; then I will update the CSR.

Contributor Author

@kabutz kabutz Feb 10, 2022


Done.

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Feb 9, 2022

A question about wording of the @implNote. In multiply() they say: "An implementation may offer better algorithmic ...", but we changed this to "This implementation may offer better algorithmic ..." I've kept it as "This implementation may ...", but what is the better way of writing such implementation notes?

I usually refer to "This implementation or this method" mostly out of habit of writing a bunch of these notes in java.util. I am fine with either approach.

@kabutz
Contributor Author

@kabutz kabutz commented Feb 10, 2022

A question about the wording of the @implNote: in multiply() they say "An implementation may offer better algorithmic ...", but we changed this to "This implementation may offer better algorithmic ...". I've kept it as "This implementation may ...", but what is the better way of writing such implementation notes?

I usually write "this implementation" or "this method", mostly out of habit from writing a bunch of these notes in java.util. I am fine with either approach.

That makes sense - thank you so much for your patience :-)

@openjdk openjdk bot removed the csr label Feb 10, 2022
@openjdk

@openjdk openjdk bot commented Feb 10, 2022

@kabutz This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8277175: Add a parallel multiply method to BigInteger

Reviewed-by: psandoz

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1033 new commits pushed to the master branch:

  • 4d64076: 8047749: javadoc for getPathBounds() in TreeUI and BasicTreeUI is incorrect
  • 90939cb: 8281626: NonblockingQueue should use nullptr
  • 3a13425: 8072070: Improve interpreter stack banging
  • 8441d51: 8281419: The source data for the color conversion can be discarded
  • a037b3c: 8281460: Let ObjectMonitor have its own NMT category
  • 65831eb: 8281318: Improve jfr/event/allocation tests reliability
  • eee6a56: 8281522: Rename ADLC classes which have the same name as hotspot variants
  • 84868e3: 8281275: Upgrading from 8 to 11 no longer accepts '/' as filepath separator in gc paths
  • 58c2bd3: 8281536: JFR: Improve jdk.jfr.ContentType documentation
  • 83b6e4b: 8281294: [vectorapi] FIRST_NONZERO reduction operation throws IllegalArgumentExcept on zero vectors
  • ... and 1023 more: https://git.openjdk.java.net/jdk/compare/7906eb050d4675092536048e8e21334767e397e6...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@PaulSandoz) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready label Feb 10, 2022
@kabutz
Contributor Author

@kabutz kabutz commented Feb 11, 2022

/integrate

@openjdk openjdk bot added the sponsor label Feb 11, 2022
@openjdk

@openjdk openjdk bot commented Feb 11, 2022

@kabutz
Your change (at version 2eec2e9) is now ready to be sponsored by a Committer.

@PaulSandoz
Member

@PaulSandoz PaulSandoz commented Feb 11, 2022

/sponsor

@openjdk

@openjdk openjdk bot commented Feb 11, 2022

Going to push as commit 83ffbd2.
Since your change was applied there have been 1041 commits pushed to the master branch:

  • 0786ddb: 8281535: Create a regression test for JDK-4670051
  • c5ff6e4: 8223077: module path support for dynamic CDS archive
  • 8886839: 8281622: JFR: Improve documentation of jdk.jfr.Relational
  • e75e8cd: 8279997: check_for_dynamic_dump should not exit vm
  • e73ee0c: 8281259: MutableBigInteger subtraction could be simplified
  • f399ae5: 8202836: [macosx] test java/awt/Graphics/TextAAHintsTest.java fails
  • 4ff5824: 8281100: Spurious "variable might not have been initialized" with sealed class switch
  • d254cf2: 8281638: jfr/event/allocation tests fail with release VMs after JDK-8281318 due to lack of -XX:+UnlockDiagnosticVMOptions
  • 4d64076: 8047749: javadoc for getPathBounds() in TreeUI and BasicTreeUI is incorrect
  • 90939cb: 8281626: NonblockingQueue should use nullptr
  • ... and 1031 more: https://git.openjdk.java.net/jdk/compare/7906eb050d4675092536048e8e21334767e397e6...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label Feb 11, 2022
@openjdk openjdk bot closed this Feb 11, 2022
@openjdk openjdk bot removed ready rfr sponsor labels Feb 11, 2022
@openjdk

@openjdk openjdk bot commented Feb 11, 2022

@PaulSandoz @kabutz Pushed as commit 83ffbd2.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
core-libs integrated
7 participants