ObjData buffer generation performance optimizations (about 2x as faster) #30

Draft
wants to merge 1 commit into master
Conversation

@PhpXp

PhpXp commented Aug 18, 2023

It's faster to create an array with the data and then bulk put() it into a buffer. The for loops were also optimized.
The code was tested on an Android phone; after the optimizations, buffers are created about 2x faster.

Public function signatures were changed; hopefully this isn't a big problem.
For example, in getFaceVertexIndices() you shouldn't be adding one value at a time to the buffer anyway; that's slow.
Note that the code isn't final; the Javadoc still has to be updated.
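
For reference, a rough sketch of the idea (illustrative only, not the exact code in this PR; it assumes the usual ReadableObj/FloatTuple getters):

import java.nio.FloatBuffer;
import de.javagl.obj.FloatTuple;
import de.javagl.obj.ReadableObj;

class BulkPutSketch
{
    // Per-element style: one put() call per float value
    static FloatBuffer verticesPerElement(ReadableObj obj)
    {
        FloatBuffer buffer = FloatBuffer.allocate(obj.getNumVertices() * 3);
        for (int i = 0; i < obj.getNumVertices(); i++)
        {
            FloatTuple v = obj.getVertex(i);
            buffer.put(v.getX());
            buffer.put(v.getY());
            buffer.put(v.getZ());
        }
        return buffer;
    }

    // Bulk style: fill a plain float[] first, then copy it into the buffer with a single bulk put()
    static FloatBuffer verticesBulk(ReadableObj obj)
    {
        int n = obj.getNumVertices();
        float[] data = new float[n * 3];
        for (int i = 0; i < n; i++)
        {
            FloatTuple v = obj.getVertex(i);
            data[i * 3 + 0] = v.getX();
            data[i * 3 + 1] = v.getY();
            data[i * 3 + 2] = v.getZ();
        }
        return FloatBuffer.allocate(n * 3).put(data);
    }
}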

@javagl
Owner

javagl commented Aug 18, 2023

I may have to take a closer look at the exact changes. From quickly skimming over them, they seem to be related to two things that one could call "high-level implementation decisions":


1:

When dealing with such "bulk" data, it can often be useful to have two flavors of functions. One that allocates a new buffer, and one that writes the result into a given buffer. For example, as in
FloatBuffer vertices = something.getVertices();
and

FloatBuffer vertices = allocate();
something.getVertices(vertices);

An aside: It often makes sense to combine this into a "hybrid" function, like

// Fills the target and returns it
// When the target is null, then a new buffer is created, filled, and returned
FloatBuffer getVertices(FloatBuffer target) {
    FloatBuffer result = target;
    if (result == null) result = allocate();
    ...
    return result;
}

but this isn't the pattern that is used here.

You changed the function

public static void getVertices(ReadableObj obj, FloatBuffer target) { ... }
// to
public static void getVertices(ReadableObj obj, float[]     target) { ... }

First of all, that's a breaking API change and I won't accept this in this exact form.
But of course it would be easy to work around that: There could be the original getVertices function, and - following the naming pattern that is already used in general - there could be another one like getVerticesArray(... float[] target).

The change is not only breaking, but also reduces the versatility. With the FloatBuffer one, you could have

FloatBuffer verticesTwice = allocate(12 * 2);
ObjData.getVertices(obj, verticesTwice);
ObjData.getVertices(obj, verticesTwice);

e.g. for filling the buffer with vertices from multiple OBJs. In order to "emulate" this possibility, you'd have to change the method to

public static void getVerticesArray(
    ReadableObj obj, 
    float[] target,
    int offset) // Add an 'offset' parameter, saying where to start writing into the array

So the FloatBuffer one is more flexible in this regard. And this flexibility leads to the other design option:


2:

When dealing with "bulk data of primitive values", then there usually are the options of using a primitive array like float[], or a buffer like FloatBuffer. And the aforementioned flexibility is related to the question (that I always try to think about, not only in this particular case):

When there are two design options A and B, which one can emulate the other?

And as a rule of thumb:

  • When A can emulate B: Use A
  • When B can emulate A: Use B
  • When both can emulate each other: Find other criteria for the decision (and that may be 'performance' - more on that below)
  • When neither can emulate the other: These are not two 'options', but entirely different things ...

The point is (and you may have noticed this in the methods of the ObjData class that you modified):

A: When you have a method that fills a FloatBuffer target, then it's trivial to fill a float[] array with that:

float[] array = new ...;
fill(FloatBuffer.wrap(array));

B: When you have a method that fills a float[] target array, then ... it can be a little bit quirky and inconvenient when you want to fill a FloatBuffer...

FloatBuffer buffer = alloc();
float[] temp = new ...;
fill(temp, 0);
buffer.put(temp);

(allocating the data twice, just to shove it from the array into the buffer)

An aside: For buffers, there's the additional design dimension of whether you want heap-allocated buffers or 'direct' buffers. For anything that is related to OpenGL/native libraries, you usually want 'direct' buffers...
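
(Just as a generic Java NIO snippet, not specific to this library: a 'direct' buffer with native byte order, as it is usually needed for OpenGL, would be created like this:)

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// A 'direct' FloatBuffer with native byte order - the form that is
// typically handed to OpenGL or other native APIs
int numFloats = 1000; // however many values are needed
FloatBuffer direct = ByteBuffer
    .allocateDirect(numFloats * Float.BYTES)
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer();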


A detail:

The float[] getValues(); method in the FloatTuple looks innocent. After all, "it's just a getter"....

But ... it makes the state mutable. You could call
tuple.getValues()[1] = Float.NaN;
Until now, the FloatTuple has been immutable.

Beyond that: I know that it's very unlikely that there will ever be a different implementation of the FloatTuple interface than the DefaultFloatTuple. But imagine there was an implementation that did not store the values in a float[] array. If there was an implementation that was based on a FloatBuffer (or if I wanted to change the DefaultFloatTuple to store the values in a FloatBuffer for some reason), then this method would bite back, painfully:

    @Override
    public float[] getValues() {
        float[] result = new float[3]; // Potentially called millions of times...
        result[0] = data.get(0);
        result[1] = data.get(1);
        result[2] = data.get(2);
        return result;
    }

This would probably eat up many performance gains that one might achieve elsewhere...

This leads to the main point:


The main claim of this PR is to be a "performance optimization" and make things "2x as faster" (sic).

I doubt that :)

Performance tests are difficult. Always.
And they are particularly difficult on a Virtual Machine.
And they are even more difficult on a Virtual Machine that has a Just-In-Time-Compiler.

But as a very quick test (to be taken with a grain of salt for the above reasons), I created this:

import java.io.FileInputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import de.javagl.obj.Obj;
import de.javagl.obj.ObjData;
import de.javagl.obj.ObjReader;


public class ObjPerformanceTestPR30
{
    public static void main(String[] args) throws Exception
    {
        String s = "example.obj";
        Obj obj = ObjReader.read(new FileInputStream(s));
        
        double sum = 0;
        long before = System.nanoTime();
        for (int i=0; i<1000; i++)
        {
            IntBuffer indices = ObjData.getFaceVertexIndices(obj);
            FloatBuffer vertices = ObjData.getVertices(obj);
            sum += indices.get(0);
            sum += indices.get(1);
            sum += vertices.get(0);
            sum += vertices.get(1);
        }
        long after = System.nanoTime();
        System.out.println("Took "+(after-before)/1e6+" ms with sum "+sum);
    }
    
}

I ran this with some arbitrary ~10MB OBJ file that I had lying around here.

With master, the output for me was
Took 2271.6398 ms with sum -7199.090003967285

With this branch, the output for me was
Took 2494.5546 ms with sum -7199.090003967285

Of course, the results on another VM (e.g. on Android) may be completely different. But still, I'd like to see the claim that the new approach is "noticeably faster" (on most or some VMs) backed by examples...

@PhpXp
Author

PhpXp commented Aug 22, 2023

Thanks for the long write-up. I expected a response like this, completely understandable. I'm surprised by your benchmark results. I'll do my own tests on Android again, hopefully later this week.

@javagl
Owner

javagl commented Aug 22, 2023

I'm curious how large the difference will be on a different VM. And I know that the "performance test" that I created there could be questioned in many ways. It follows the simplest form of the usual pattern for such basic benchmarks, namely: repeat the operation several (1000) times, to give the JIT a chance to kick in. But one could argue here: that may not be how these functions are usually used. When someone is extracting the data from an OBJ, then this may often be something that is done once - after the OBJ is loaded, and before the data is sent to the renderer.

So if there was ...

  • approach A, where the first run takes 500ms and the 999 following runs take 50ms each
  • approach B, where the first run takes 100ms and the 999 following runs take 90ms each

then the first one would appear to be faster in such a benchmark (roughly 50.5 seconds in total, versus 90 seconds for B), but ... that initial delay of 500ms may be undesirable.

Sooo.... all this is to be taken with a grain of salt. It may make sense to switch the FloatBuffer<->float[] uses at some places, but for now, the evidence that this is worth it is just not strong enough...

@PhpXp
Author

PhpXp commented Aug 24, 2023

It seems like the performance depends on the platform (Android/Windows) and the size of the model. I ran benchmarks on multiple phones with a small file (about 500KB) and a big one (about 80MB).

The X axis shows individual consecutive runs, the Y axis shows time in nanoseconds.

Results from Google Pixel 3 XL:
[two charts: benchmark runs for the small and the big model]

Results from Asus Zephyrus G14 (AMD Ryzen 9 4900HS):
[two charts: benchmark runs for the small and the big model]

The results from the Pixel 3 XL match other phones I also tried (Samsung Galaxy S22 (Exynos variant), Sony Xperia XCompact, Samsung Galaxy XCover 5). It seems like my optimizations are better only in specific cases, but they do make a big difference there.

@javagl
Owner

javagl commented Aug 24, 2023

Impressive. I didn't expect these details.

(And while I do have some experience with much of that (Java, performance testing, working of the JIT, a bit of JMH, VisualVM, JFR...), I have to admit that I do not have any knowledge about the specifics of Android. Only knowing that ~"there's that thing (called 'Dalvik' IIRC) that serves as the VM, and is not the OpenJDK/Oracle HotSpot VM" caused me to assume that the results may vary vastly...)

I assume that the charts that do not say "big model" are done with the small model.

  • The 59% faster for the small model on Android looks good
  • The 4% "faster" for the big model on Android: Looking at the details of the chart, the line is much more noisy. A wild guess is that this could be caused by an increased GC load - meaning that it seems to be faster for some runs, but is stalled by the more frequent GC runs. (Does that make sense? It's a very wild guess for now...)
  • The 3% faster for the small model on Desktop: Doesn't seem worth the effort. That 3% is in the range that I'd call "noise", looking at the somewhat noisy chart...
  • The 26% slower for the big model on Desktop does not look so good...

Sooo... eventually the trade-off here, summarized very roughly: It could be ~60% faster for small models on Android, but ~25% slower for large models on Desktop.

One important question: Are these runs with the test that I posted above, or did you create your own test/benchmark class? As I said: That test was only quickly ~"written down", to have a ballpark estimate. I wonder if one should try to create other test cases, or try out other patterns (like the get(targetBuffer) versions of the methods instead of the buffer = get() versions or so...)

I appreciate the effort that you put into all of this, but I currently don't exactly know what the best way to proceed here is. (Iff the slowdown for large models could be reduced (and API compatibility could be ensured), that would be a big plus, but there still are many unknowns for me...)

@PhpXp
Author

PhpXp commented Aug 24, 2023

there's that thing (called 'Dalvik' IIRC) that serves as the VM

Dalvik is the old runtime. ART (Android Runtime) is the new one. Briefly: it combines an interpreter, a JIT, and AOT (ahead-of-time) compilation. There are also baseline profiles, which are lists of methods that are called frequently enough to be additionally optimized. If the device has Google Play Services installed, those profiles are shared between all devices globally. Pretty advanced stuff.

I assume that the charts that do not say "big model" are done with the small model.

That is correct.

A wild guess is that this could be caused by an increased GC load

I can confirm that this is the case. Garbage collections are reported in LogCat, and on a low-end device (XCover 5) the blocking GC can pause execution for upwards of 300ms. Clearly not optimal. I wonder if there is an easy solution to this, to avoid allocating a new array every time but still get the benefits of the bulk put. Maybe a smaller array (like 128 bytes) inserted into the IntBuffer multiple times instead of the full-size array?
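
Roughly what I have in mind (just a sketch; the method name and the chunk size are made up):

// Sketch: reuse a small scratch array and bulk-put it in chunks,
// instead of allocating one full-size array per call
static void putFaceVertexIndices(ReadableObj obj, IntBuffer target)
{
    int[] scratch = new int[128];
    int filled = 0;
    for (int f = 0; f < obj.getNumFaces(); f++)
    {
        ObjFace face = obj.getFace(f);
        for (int v = 0; v < face.getNumVertices(); v++)
        {
            scratch[filled++] = face.getVertexIndex(v);
            if (filled == scratch.length)
            {
                target.put(scratch);
                filled = 0;
            }
        }
    }
    target.put(scratch, 0, filled); // flush the remainder
}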

The 26% slower for the big model on Desktop does not look so good...

It could be that IntBuffer is more optimal on desktop VMs or it could be the GC. I'll work on this tomorrow.

Are these runs with the test that I posted above, or did you create an own test/benchmark class?

I wrote the test in Kotlin, but in principle it does the same thing:

val filenames = listOf(
    "obj_files/4506B/4506B.obj",
    "obj_files/4507B-B001-B002/4507B.obj",
    "obj_files/4507B003/4507B003.obj",
    "obj_files/4507B004-B005-B006/4507B004.obj",
    "obj_files/4508B/4508B.obj",
    "obj_files/4524B/4524B.obj",
    "obj_files/4528B/4528B.obj",
    "obj_files/4529B/4529B.obj",
    "obj_files/4535B/4535B.obj",
)

val files = filenames.map {
    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
    ObjReader.read(inputStream)
}

//val files = listOf("bugatti/bugatti.obj").map {
//    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
//    ObjReader.read(inputStream)
//}

repeat(500) {
    var totalTime = 0L
    var sum = 0f
    var time = System.nanoTime()
    for(obj in files) {
        val indices = ObjData.getFaceVertexIndices(obj)
        val vertices = ObjData.getVertices(obj)

        sum += indices.get(0)
        sum += indices.get(1)
        sum += vertices.get(0)
        sum += vertices.get(1)

        val now = System.nanoTime()
        totalTime += (now - time)
        time = now
    }
    println("TOTAL TIME $totalTime")
}

@javagl
Owner

javagl commented Aug 24, 2023

Thanks for these updates.

A few high-level points about how strongly I'd consider merging this PR:

  • The API change from get...(...IntBuffer target) to get...(...int[] target) should be straightened out. As I said, one could consider an additional get...Array(... int[] target) function there, and maybe an additional int offset (into the target array) parameter
  • That float[] getValues(); in FloatTuple is ... 😕 ... meh (for the reasons mentioned above). I'd really prefer to not have this

The latter may affect the performance, but I don't have a clue how much. I'll probably try to allocate some time (maybe during the weekend) to play with all this in more detail (although I'd only run tests on Desktop - so my focus would roughly be to keep the functionality that makes it ~60% faster for the small/Android case, and try to avoid the 25% slowdown for the large/Desktop case...)

I wonder if there is an easy solution to this, to avoid allocating a new array every time, but still get the benefits of the bulk put. Maybe a smaller array (like 128 bytes) inserted into the IntBuffer multiple times instead of the full-size array?

The GC is a tricky beast. And these GC pauses may very well be an artifact of the artificial test: Of course there's a lot of garbage generated when val vertices = ObjData.getVertices(obj) is called 500 times and the result is thrown away immediately. Maybe I'll also try to come up with a "more realistic" benchmark, but ... that's tricky (I have no idea how exactly people are using this library...).

I already tried to avoid generating garbage inside the Obj library. Iff GC pauses are caused by garbage that is generated in the library, one could try to avoid that, with some ThreadLocal<float[]> scratchBuffer or whatnot, but I don't have a place in mind where this actually happens...
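
(For completeness, the pattern that I mean would roughly look like this - purely hypothetical, since, as said, I don't have a concrete place in mind where it would help:)

// Hypothetical: a per-thread scratch array that grows on demand,
// so that repeated calls do not produce new garbage each time
private static final ThreadLocal<float[]> SCRATCH =
    ThreadLocal.withInitial(() -> new float[0]);

private static float[] getScratch(int minSize)
{
    float[] array = SCRATCH.get();
    if (array.length < minSize)
    {
        array = new float[minSize];
        SCRATCH.set(array);
    }
    return array;
}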
