ObjData buffer generation performance optimizations (about 2x as faster) #30
base: master
Conversation
I may have to take a closer look at the exact changes. From quickly skimming over them, this seems to be related to two things that one could call "high-level implementation decisions":

1: When dealing with such "bulk" data, it can often be useful to have two flavors of functions: one that allocates a new buffer, and one that writes the result into a given buffer. (An aside: It often makes sense to combine this into a "hybrid" function that allocates a new buffer only when the given target is `null` - but this isn't the pattern that is used here.) You changed the function that writes the result into a given buffer. First of all, that's a breaking API change, and I won't accept this in this exact form. The change is not only breaking, but also reduces the versatility: with a method that writes into a given buffer at its current position, the caller decides where the data goes, e.g. for filling one buffer with the vertices from multiple OBJs. In order to "emulate" this possibility, you'd have to extend the method with something like an explicit offset parameter.
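A rough sketch of the two flavors and the "hybrid" variant - the method names here are placeholders loosely modeled on the `ObjData` methods, not necessarily the actual signatures:

```java
import java.nio.FloatBuffer;
import de.javagl.obj.FloatTuple;
import de.javagl.obj.ReadableObj;

class VertexExtraction
{
    // Flavor 1: Writes the vertices into the given buffer,
    // starting at its current position
    static void getVerticesInto(ReadableObj obj, FloatBuffer target)
    {
        for (int i = 0; i < obj.getNumVertices(); i++)
        {
            FloatTuple v = obj.getVertex(i);
            target.put(v.getX());
            target.put(v.getY());
            target.put(v.getZ());
        }
    }

    // Flavor 2: Allocates a new buffer, implemented via flavor 1
    static FloatBuffer getVerticesNew(ReadableObj obj)
    {
        FloatBuffer target = FloatBuffer.allocate(obj.getNumVertices() * 3);
        getVerticesInto(obj, target);
        target.rewind();
        return target;
    }

    // The "hybrid" variant: allocates only when no target is given
    static FloatBuffer getVertices(ReadableObj obj, FloatBuffer target)
    {
        if (target == null)
        {
            target = FloatBuffer.allocate(obj.getNumVertices() * 3);
        }
        getVerticesInto(obj, target);
        return target;
    }
}
```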
So the current form is also the more versatile one.

2: When dealing with "bulk data of primitive values", there usually are the options of using a primitive array like `float[]`, or a buffer like `FloatBuffer`. When there are two design options A and B, one can ask: which one can emulate the other? And as a rule of thumb: the option that can emulate the other one is preferable.

The point is (and you may have noticed that in the methods of the `ObjData` class):

A: When you have a method that fills a `FloatBuffer`, then you can also use it to fill a `float[]`, simply by wrapping the array into a buffer.

B: When you have a method that fills a `float[]`, then in order to fill a `FloatBuffer`, you have to fill the array first and then put it into the buffer (allocating the data twice, just to shove it from the array into the buffer).
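In code, the asymmetry looks roughly like this (with `fillVertices` and `getVerticesArray` as hypothetical stand-ins for the two options):

```java
import java.nio.FloatBuffer;

class EmulationSketch
{
    static void example(int numFloats)
    {
        // A: a buffer-filling method can fill a plain array directly,
        // because the array can be wrapped without any copy:
        float[] array = new float[numFloats];
        fillVertices(FloatBuffer.wrap(array));

        // B: an array-returning method forces a detour to fill a buffer:
        float[] temp = getVerticesArray();                       // first allocation
        FloatBuffer buffer = FloatBuffer.allocate(temp.length);  // second allocation
        buffer.put(temp);                                        // plus a copy
    }

    // Placeholders standing in for the two design options:
    static void fillVertices(FloatBuffer target) { /* ... */ }
    static float[] getVerticesArray() { return new float[0]; }
}
```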
An aside: For buffers, there's the additional design dimension of whether you want heap-allocated buffers or 'direct' buffers. For anything that is related to OpenGL/native libraries, you usually want 'direct' buffers...

A detail: The new approach makes the state effectively mutable. You could call one of these methods, modify what it returned, and thereby affect the underlying data. Beyond that: I know that it's very unlikely that there will ever be a different implementation of the `ReadableObj` interface, but for such an implementation, the new access pattern could be considerably more expensive.
This would probably eat up many performance gains that one might achieve elsewhere...

This leads to the main point: The main claim of this PR is to be a "performance optimization" and to make things "2x as faster" (sic). I doubt that :) Performance tests are difficult. Always. But as a very quick test (to be taken with a grain of salt, for the above reasons), I created a small benchmark along the following lines:
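A sketch of such a quick test - the file name is a placeholder, and the loop simply repeats the extraction many times and prints the duration:

```java
import java.io.FileInputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import de.javagl.obj.Obj;
import de.javagl.obj.ObjData;
import de.javagl.obj.ObjReader;

public class ObjDataPerformance
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder name for the ~10MB OBJ file mentioned below
        Obj obj = ObjReader.read(new FileInputStream("someLargeModel.obj"));
        float sum = 0;
        for (int run = 0; run < 10; run++)
        {
            long before = System.nanoTime();
            for (int i = 0; i < 1000; i++)
            {
                IntBuffer indices = ObjData.getFaceVertexIndices(obj);
                FloatBuffer vertices = ObjData.getVertices(obj);
                // Touch the results so the work cannot be optimized away
                sum += indices.get(0) + vertices.get(0);
            }
            long after = System.nanoTime();
            System.out.println(
                "Duration: " + (after - before) / 1e6 + " ms (sum: " + sum + ")");
        }
    }
}
```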
I ran this with some arbitrary ~10MB OBJ file that I had lying around here. Comparing the output for the current master state with the output for this branch, this branch was slower for me. Of course, the results on another VM (e.g. on Android) may be completely different. But still, I'd like to see the claim that the new approach is "noticeably faster" (on most or some VMs) backed by examples...
Thanks for the long write-up. I expected a response like this - completely understandable. I'm surprised by your benchmark results. I'll do my own tests on Android again, hopefully later this week.
I'm curious how large the difference will be on a different VM. And I know that the "performance test" that I created there could be questioned in many ways. It follows the simplest form of the usual pattern for such basic benchmarks, namely: repeat the operation several (1000) times, to give the JIT a chance to kick in. But one could argue here: that may not be how these functions are usually used. When someone is extracting the data from an OBJ, this may often be something that is done once - after the OBJ is loaded, and before the data is sent to the renderer. So if there was one implementation that had an initial delay of, say, 500ms but was fast on each subsequent call, and another one that was consistently a bit slower, then the first one would appear to be faster in such a benchmark, but that initial delay of 500ms may be undesirable. Sooo.... all this is to be taken with a grain of salt. It may make sense to try other benchmark patterns as well.
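For instance, a single "cold" measurement right after loading - closer to how the functions may typically be used - could look like this (file name again a placeholder):

```java
import java.io.FileInputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import de.javagl.obj.Obj;
import de.javagl.obj.ObjData;
import de.javagl.obj.ObjReader;

public class ObjDataColdStart
{
    public static void main(String[] args) throws Exception
    {
        Obj obj = ObjReader.read(new FileInputStream("someModel.obj"));

        // Measure exactly one extraction, without any JIT warm-up
        long before = System.nanoTime();
        IntBuffer indices = ObjData.getFaceVertexIndices(obj);
        FloatBuffer vertices = ObjData.getVertices(obj);
        long after = System.nanoTime();

        System.out.println("Cold extraction: " + (after - before) / 1e6
            + " ms (" + indices.capacity() + " indices, "
            + vertices.capacity() + " floats)");
    }
}
```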
Impressive. I didn't expect these details. (And while I do have some experience with much of that (Java, performance testing, the workings of the JIT, a bit of JMH, VisualVM, JFR...), I have to admit that I do not have any knowledge about the specifics of Android. Only knowing that ~"there's that thing (called 'Dalvik' IIRC) that serves as the VM, and is not the OpenJDK/Oracle HotSpot VM" caused me to assume that the results may vary vastly...) I assume that the charts that do not say "big model" are done with the small model.
Sooo... ultimately the trade-off here, summarized very roughly: It could be ~60% faster for small models on Android, but ~25% slower for large models on Desktop. One important question: Are these runs done with the test that I posted above, or did you create a test/benchmark class of your own? As I said: That test was only quickly ~"written down", to have a ballpark estimate. I wonder if one should try to create other test cases, or try out other patterns (like the "extract once after loading" pattern mentioned above). I appreciate the effort that you put into all of this, but I currently don't exactly know what's the best way to proceed here. (Iff the slowdown for large models could be reduced (and API compatibility could be ensured), that would be a big plus, but there still are many unknowns for me...)
Dalvik is the old runtime. ART (Android Runtime) is the new one. Briefly: it combines an interpreter, a JIT, and AOT (ahead-of-time) compilation. There are also baseline profiles, which are lists of methods that are called frequently enough to be additionally optimized. If the device has Google Play Services installed, those profiles are shared between all devices globally. Pretty advanced stuff.
That is correct.
I can confirm that this is the case. Garbage collections are reported to the LogCat, and on a low-end device (XCover 5) the blocking GC can pause execution for upwards of 300ms. Clearly not optimal. I wonder if there is an easy solution to this - to avoid allocating a new array every time, but still get the benefits of the bulk put. Maybe a smaller array (like 128 bytes) could be inserted into the buffer chunk by chunk. It could be that this already avoids most of the GC pressure.
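A sketch of that idea - a small, fixed-size staging array that is bulk-put into the buffer chunk by chunk (names are placeholders; this assumes the `ReadableObj`/`FloatTuple` accessors):

```java
import java.nio.FloatBuffer;
import de.javagl.obj.FloatTuple;
import de.javagl.obj.ReadableObj;

class ChunkedFill
{
    // A small, fixed-size chunk instead of one array for the whole model
    private static final int CHUNK_FLOATS = 128 * 3;

    static void putVertices(ReadableObj obj, FloatBuffer target)
    {
        float[] chunk = new float[CHUNK_FLOATS];
        int filled = 0;
        int n = obj.getNumVertices();
        for (int i = 0; i < n; i++)
        {
            FloatTuple v = obj.getVertex(i);
            chunk[filled++] = v.getX();
            chunk[filled++] = v.getY();
            chunk[filled++] = v.getZ();
            if (filled == CHUNK_FLOATS)
            {
                target.put(chunk, 0, filled); // bulk put of one full chunk
                filled = 0;
            }
        }
        target.put(chunk, 0, filled); // remaining partial chunk
    }
}
```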
I wrote the test in Kotlin, but in principle it does the same thing:

```kotlin
// Test OBJ files (the small models)
val filenames = listOf(
    "obj_files/4506B/4506B.obj",
    "obj_files/4507B-B001-B002/4507B.obj",
    "obj_files/4507B003/4507B003.obj",
    "obj_files/4507B004-B005-B006/4507B004.obj",
    "obj_files/4508B/4508B.obj",
    "obj_files/4524B/4524B.obj",
    "obj_files/4528B/4528B.obj",
    "obj_files/4529B/4529B.obj",
    "obj_files/4535B/4535B.obj",
)
val files = filenames.map {
    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
    ObjReader.read(inputStream)
}
// Alternative: a single big model
//val files = listOf("bugatti/bugatti.obj").map {
//    val inputStream = javaClass.classLoader!!.getResourceAsStream(it)
//    ObjReader.read(inputStream)
//}
repeat(500) {
    var totalTime = 0L
    var sum = 0f
    var time = System.nanoTime()
    for (obj in files) {
        val indices = ObjData.getFaceVertexIndices(obj)
        val vertices = ObjData.getVertices(obj)
        // Touch the results so the extraction cannot be optimized away
        sum += indices.get(0)
        sum += indices.get(1)
        sum += vertices.get(0)
        sum += vertices.get(1)
        val now = System.nanoTime()
        totalTime += (now - time)
        time = now
    }
    // Print sum as well, so it cannot be eliminated as dead code
    println("TOTAL TIME $totalTime (sum $sum)")
}
```
Thanks for these updates. A few high-level points about how strongly I'd consider merging this PR: the public API should remain backwards compatible, and the question of heap-allocated vs. 'direct' buffers should be sorted out. The latter may affect the performance, but I don't have a clue how much. I'll probably try to allocate some time (maybe during the weekend) to play with all this in more detail (although I'd only run tests on Desktop - so my focus would roughly be to keep the functionality that makes it ~60% faster for the small/Android case, and try to avoid the 25% slowdown for the large/Desktop case...)
The GC is a tricky beast. And these GC pauses may very well be an artifact of the artificial test: Of course there's a lot of garbage generated when the whole extraction is repeated hundreds of times in a loop. I already tried to avoid generating garbage inside the measured loop itself.
It's faster to create an array with the data and then bulk put() it into a buffer; see the sketch below. For loops were also optimized. The code was tested on an Android phone, and after the optimisations, buffers are created about 2x faster.

Public function signatures were changed; hopefully this isn't a big problem. For example, in getFaceVertexIndices(), you shouldn't be adding one value at a time to the buffer anyway - that's slow. Note that the code isn't final; the Javadoc still has to be updated.
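Schematically, the change is from per-element puts to filling an array first and doing a single bulk put (`computeIndex()` here is just a hypothetical stand-in for however each value is obtained):

```java
import java.nio.IntBuffer;

class BulkPutSketch
{
    // Stand-in for however each value is computed
    static int computeIndex(int i) { return i; }

    static void example(int n)
    {
        // Before: one value at a time
        IntBuffer slow = IntBuffer.allocate(n);
        for (int i = 0; i < n; i++)
        {
            slow.put(computeIndex(i));
        }

        // After: fill a plain array, then do a single bulk put
        int[] data = new int[n];
        for (int i = 0; i < n; i++)
        {
            data[i] = computeIndex(i);
        }
        IntBuffer fast = IntBuffer.allocate(n);
        fast.put(data);
    }
}
```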