
Better precision and Performance with Runge-Kutta-4 integrator #121

Merged

fat-lobyte merged 11 commits into neuoy:master from the integrator branch on Jan 4, 2018

Conversation

fat-lobyte (Collaborator)

Up until now, Trajectories used the Verlet integration scheme. This worked reasonably well, but it required a rather small step size, which led to a hefty performance penalty.
To alleviate this, the force cache was introduced. It sped up the force calculation, but came with its own drawback of decreased precision.

Using a more advanced integration scheme like Runge-Kutta 4 allows a much larger integration step (improving performance) while maintaining precision.

There are now 4 invocations of the force function instead of one, but this allowed us to increase the step size from 0.1 s to 2.0 s, resulting in a (2.0 / 0.1) / 4 = 5-fold performance increase while maintaining about the same (cache-less) precision.
This allows users to disable the cache at little or no performance cost, in exchange for a large increase in precision.
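
For readers unfamiliar with the scheme, here is a minimal sketch of one classic RK4 step for a (position, velocity) state. This is an illustration, not the actual Trajectories code; it assumes KSP's Vector3d type, and the accelerationFunc shape mirrors the delegate discussion further down:

```csharp
using System;

static class RK4Sketch
{
    // One classic RK4 step for the state (position, velocity).
    // Illustrative only; Vector3d is KSP's double-precision vector type.
    public static void RK4Step(Func<Vector3d, Vector3d, Vector3d> accelerationFunc, // (pos, vel) -> accel
                               ref Vector3d pos, ref Vector3d vel, double dt)
    {
        // Four evaluations of the force/acceleration function per step,
        // instead of Verlet's one - but the step can be 20x larger (0.1 s -> 2.0 s).
        Vector3d k1v = accelerationFunc(pos, vel);
        Vector3d k1x = vel;

        Vector3d k2v = accelerationFunc(pos + k1x * (dt / 2), vel + k1v * (dt / 2));
        Vector3d k2x = vel + k1v * (dt / 2);

        Vector3d k3v = accelerationFunc(pos + k2x * (dt / 2), vel + k2v * (dt / 2));
        Vector3d k3x = vel + k2v * (dt / 2);

        Vector3d k4v = accelerationFunc(pos + k3x * dt, vel + k3v * dt);
        Vector3d k4x = vel + k3v * dt;

        // Weighted average of the four sampled slopes.
        pos = pos + (k1x + k2x * 2.0 + k3x * 2.0 + k4x) * (dt / 6);
        vel = vel + (k1v + k2v * 2.0 + k3v * 2.0 + k4v) * (dt / 6);
    }
}
```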

The step size is now variable: it can be changed in code as Trajectories.Settings.IntegrationStepSize, or in the settings file by adding <float name="IntegrationStepSize">2.0</float> to /GameData/Trajectories/Plugin/PluginData/Trajectories/config.xml.
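
For reference, a sketch of how that entry might sit in the config file, assuming the usual KSP PluginConfiguration XML layout (only the IntegrationStepSize entry is from this PR):

```xml
<?xml version="1.0" encoding="utf-8"?>
<config>
    <!-- Integration step size in seconds; 2.0 is the value used in the tests below.
         The surrounding layout is an assumption; only this entry is from the PR. -->
    <float name="IntegrationStepSize">2.0</float>
</config>
```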

What I think should be done before merging this PR:

  • Testing! Please download and test in various situations.
  • Determining the actual performance cost compared to the Verlet (old, in master) integrator. From my testing, using RK4 without cache is about 1.5x slower than using Verlet with cache.

What I think should be done after merging this PR:

  • Add a slider for the IntegrationStepSize in the Settings UI
  • Add settings and switching code to choose between the old (Verlet) integrator and the RK-4 integrator, with the option to add more integrators
  • In the near future, I would like to try a little optimization pass. Additionally, I would like to try limiting the refresh rate of the trajectory calculation.
    We really need the result at most 0.5 times per second (once every two seconds), unless you're doing kOS things.
  • I would like to try implementing an adaptive method like Dormand-Prince 4/5, which raises the step size when the solution changes little and lowers it when it changes a lot. This could increase performance even further, which would allow us to ditch the cache completely (see the sketch after this list).
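
To illustrate the adaptive idea: an embedded pair like Dormand-Prince 4/5 produces a 4th- and a 5th-order solution in each step, and their difference estimates the local error, which then drives the step size. A minimal sketch of such a controller, with entirely hypothetical names and bounds (not existing Trajectories code):

```csharp
using System;

static class AdaptiveStep
{
    // Hypothetical step-size controller in the Dormand-Prince 4/5 spirit.
    // errorEstimate would be the difference between the embedded 4th- and
    // 5th-order solutions of the step just taken.
    public static double NextStepSize(double dt, double errorEstimate, double tolerance,
                                      double minDt = 0.1, double maxDt = 8.0)
    {
        // Grow the step when the local error is well below tolerance,
        // shrink it when the error is too large. The 1/5 exponent matches
        // the order of the embedded pair.
        double scale = 0.9 * Math.Pow(tolerance / Math.Max(errorEstimate, 1e-12), 1.0 / 5.0);
        scale = Math.Min(Math.Max(scale, 0.2), 5.0); // damp abrupt changes
        return Math.Min(Math.Max(dt * scale, minDt), maxDt);
    }
}
```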

A compiled version of this branch can be found here:

https://github.com/fat-lobyte/KSPTrajectories/releases/download/v1.7.2-pre_integrator1/Trajectories.dll

@fat-lobyte (Collaborator, Author) commented Dec 28, 2017

Here is the precision comparison.

The columns correspond to the situations that were tested (represented by the .sfs quicksave files in the master branch).

The rows correspond to the integration methods that were tested. Verlet is the old integrator, RK-4 is the new integrator. I tried out 3 different integration step sizes: 3.0 s, 2.0 s, and 0.5 s.

The cells show the deviation in kilometers, where:

  • Overshooting is positive (further downrange than predicted)
  • Falling short is negative (less far downrange than predicted)

| Deviation [km] | kerbin-half | kerbin-flat | eve-quarter | eve-flat | duna-half |
| --- | --- | --- | --- | --- | --- |
| Verlet [dt=0.1] | 2.5 | 47 | -2.1 | 15 | 4.4 |
| Verlet [dt=0.1] w/ cache | -1.5 | -403 | -5 | -415 | 4.9 |
| RK-4 [dt=2] | 0.4 | 16.7 | -4.2 | -2.6 | 3.7 |
| RK-4 [dt=2] w/ cache | -3.7 | -513 | -7.2 | -520 | 4.1 |
| RK-4 [dt=3] | 0.2 | 14.9 | -4.2 | 3 | 3.9 |
| RK-4 [dt=0.5] | 0.3 | 18.2 | -4.2 | -2.8 | 3.7 |

As you can see, RK-4 with an IntegrationStepSize of 2.0 yields results about the same as or better than Verlet with a 0.1 step size. Decreasing the step size further did not help much with precision.

@fat-lobyte (Collaborator, Author)

Here is the performance comparison.

The columns correspond to the situations that were tested (represented by the .sfs quicksave files in the master branch).

The rows correspond to the integration methods that were tested. Verlet is the old integrator, RK-4 is the new integrator. I tried out 3 different integration step sizes: 3.0 s, 2.0 s, and 0.5 s.

The cells show the performance impact as shown in the Settings tab of the Trajectory window, in milliseconds [ms].

On an Intel Core i5 4690K with a boost clock of 3.9 GHz (4.74 GFLOPS per core), these are the results:

| Time [ms] | kerbin-half | kerbin-flat | eve-quarter | eve-flat | duna-half |
| --- | --- | --- | --- | --- | --- |
| Verlet [dt=0.1] | 20.8 | 19.7 | 20.1 | 20.3 | 19.9 |
| Verlet [dt=0.1] w/ cache | 2.9 | 2.9 | 2.8 | 2.8 | 2.9 |
| RK-4 [dt=2] | 4 | 3.9 | 3.9 | 3.7 | 3.9 |
| RK-4 [dt=2] w/ cache | 2.4 | 2.6 | 2.2 | 2.4 | 2.6 |
| RK-4 [dt=3] | 2.9 | 2.9 | 2.7 | 2.9 | 2.9 |
| RK-4 [dt=3] w/ cache | 2.5 | 2.8 | 1.5 | 2.1 | 2.6 |
| RK-4 [dt=0.5] | 16.3 | 15.8 | 15.8 | 15.5 | 15.9 |
| RK-4 [dt=0.5] w/ cache | 3 | 3 | 2.6 | 2.9 | 2.9 |

On an Intel Core i5 3320M with a maximum clock of 3.1 GHz, these are the results:

| Time [ms] | kerbin-half | kerbin-flat | eve-quarter | eve-flat | duna-half |
| --- | --- | --- | --- | --- | --- |
| Verlet [dt=0.1] | 36.2 | 34 | 32 | 33.5 | 36.2 |
| Verlet [dt=0.1] w/ cache | 3.7 | 3.2 | 3 | 3.3 | 3.2 |
| RK-4 [dt=2] | 7.8 | 7.5 | 7 | 7.5 | 7.8 |
| RK-4 [dt=2] w/ cache | 2.5 | 2.6 | 2.3 | 2.7 | 2.7 |
| RK-4 [dt=3] | 5.4 | 5 | 4.4 | 4.8 | 4.9 |
| RK-4 [dt=3] w/ cache | 2.6 | 2.7 | 2.5 | 2.3 | 3.4 |
| RK-4 [dt=0.5] | 31.5 | 25.5 | 28.5 | 29.5 | 30.8 |
| RK-4 [dt=0.5] w/ cache | 4.1 | 3.7 | 3.3 | 3.1 | 3.6 |

The performance benefit without the cache, at approximately the same precision, is between 4.4x and 5.2x.

Unfortunately, the difference between Verlet with cache and RK-4 without cache is still a bit too much for comfort: on slower systems, cache-less RK-4 can be about 2x slower than Verlet with cache.

Whether or not this is acceptable is a more difficult question, but personally I would always prefer the much higher precision of running without the cache over a small performance hit. What are your opinions?

With this PR merged, I will probably start suggesting that users turn off the cache most of the time.

fat-lobyte merged commit bd1dcaa into neuoy:master on Jan 4, 2018
@fat-lobyte (Collaborator, Author)

Merged due to lack of negative feedback, a.k.a. "works for me".

fat-lobyte deleted the integrator branch on January 4, 2018, at 14:58
@PiezPiedPy (Contributor)

I've had no problems with it and got a performance boost on my system.
I meant to post earlier, but with the new year and all I completely forgot. Sorry :(

@fat-lobyte (Collaborator, Author)

No worries, glad that you could look at it :)

@PiezPiedPy (Contributor)

Got a question regarding the accelerationFunc lambda:
would it not be better to change this to a local function (new in C# 7)? Calling a local function is cheaper, possibly increasing performance even further.

@fat-lobyte (Collaborator, Author)

> Got a question regarding the accelerationFunc lambda:
> would it not be better to change this to a local function (new in C# 7)? Calling a local function is cheaper, possibly increasing performance even further.

Thanks for bringing this feature to my attention! I checked it out, and it seems the performance is actually a tiny bit worse.

This is probably because the main advantage of a local function is avoiding the allocation of a delegate object, but we do need to create a delegate object: the RK4Step() function takes a function delegate and uses it to evaluate the acceleration at different positions.

Of course, it would theoretically be possible to manually inline everything into the AddPatch() function to save any kind of function-call overhead. But TBH, I would actually prefer to go the opposite way, namely modularizing the code more and more, to break all of the monster-sized functions into more manageable chunks.
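
To make the delegate point concrete, the call shape looks roughly like this (simplified; ComputeAccel is a hypothetical stand-in for the actual force evaluation):

```csharp
// Because RK4Step() receives the acceleration routine as a delegate
// parameter, a delegate object is allocated whether the routine is
// written as a lambda or as a C# 7 local function.
Func<Vector3d, Vector3d, Vector3d> accelerationFunc =
    (pos, vel) => ComputeAccel(pos, vel); // hypothetical helper

RK4Step(accelerationFunc, ref position, ref velocity, dt);
```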

@PiezPiedPy (Contributor)

> I checked it out, and it seems the performance is actually a tiny bit worse.

Thanks for checking.

> we do need to create a delegate object: the RK4Step() function takes a function delegate and uses it to evaluate the acceleration at different positions.

Thanks for letting me know, I didn't look too deeply into how the calculations were performed.

> I would actually prefer to go the opposite way, namely modularizing the code more and more, to break all of the monster-sized functions into more manageable chunks.

I totally agree with you.
