-
-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regressions without LTO in LDC 1.3.0-beta[1|2] #2168
Comments
|
According to your benchmark page, the numbers have been obtained using LDC 1.1, so have you compared 1.1 to the 1.3 betas, skipping 1.2? I'm asking since we switched to LLVM 4.0 (on non-Windows) starting with 1.2 and the regressions are most likely due to that, although we have only heard from general performance improvements (even > 10%) with LLVM 4.0 until now. |
|
It's possible that I skipped 1.2, don't remember off-hand. I'll test with 1.2 and update this report as soon as I can. It may be a couple days, I have some schedule conflicts at the moment. I normally test every LDC release, though without updating the benchmarks page. And I've been using 1.2 since it came out, so it's likely I tested it also. However, I don't have any record of it. |
|
I've rerun the six benchmarks I use, there is definitely a performance regression in three tests between LDC 1.2 and LDC 1.3-beta2 on OSX (xcode 8.3.3). The main deltas:
The other three tests are not materially changed. The regex version of the tsv-filter test has had a small performance decrease over several releases, this is likely due to changes in std.regex during these releases. While running the tests, I discovered there is a delta when using the I'll see what I can do to narrow this further. The tsv-filter (numeric) and tsv-summarize tests have numeric processing in common, may be a clue there. The csv2tsv test does not do numeric processing. I'll still have schedule conflicts for several days, so may be a bit before I have more. Below is a table of performance numbers over several LDC releases. Each metric was run at least 5 times to ensure consistency. The "flto" versions are built with the
LLVM and Phobos versions for each LDC release:
|
|
@jondegenhardt just a quick note: those numbers are kindof cool to report in a blog post somewhere :) Did you also try with |
|
Can you add the LLVM version to that table? Thanks. |
|
@JohanEngelen Thanks! Thought the numbers might be of general interest. I did not try |
|
I would have liked this to correlate with #2161 of course, but sadly it doesn't. It may be worth trying to recompile with LDC 1.3 and 2.072.2 druntime/Phobos (hopefully compatible) and see whether we have to blame the libs, the compiler or both. ;) |
|
[Btw, I don't know how fair that'd be for your benchmark comparison, but compiling with |
|
One specific possibility is that converting from The "csv2tsv" benchmark does not do any converting, or any numeric operations for that matter. It would be something different. |
|
I tried the easy thing - Run DMD executables for 2.072.2 and 2.073.2 and see if there are similar regressions. There weren't. The "tsv-filter (numeric)" and "tsv-summarize" tests are actually faster. Still doesn't rule out Phobos changes as the cause.
|
|
Tested it on some audio processing (30 measures, arch x86_64) Here it's a less than 1% regression so hard to tell. |
|
A 30% regression should be somewhat easy to localize by just comparing time profiles for the two programs. |
|
@klickverbot Do you have suggestions for a profiling tool to use for this purpose? I tried looking at function call counts via the LDC |
|
CodeXL is pretty great. |
|
On OS X, you can also simply use Xcode's Instruments for a rudimentary sampling profiler. |
|
Update: I've taken a few more looks at this, but so far have only eliminated things, haven't pinpointed specific causes. Main bottleneck is that I've got other engagements taking precedent and very limited time to investigate. That won't change for a few days. I did try XCode's profiler, but there wasn't enough detail to identify specifics. I also tried simpler versions of some of the functionality that seems troublesome, but these attempts were not revealing. I'll continue investigating as I have time. |
|
@jondegenhardt It's also interesting to know what you have eliminated! :) |
|
@JohanEngelen Main thing eliminated was char[] to double conversion via std.conv.to. Or least eliminated for simple uses. Background - For the pair of tests "tsv-filter (numeric)" and tsv-summarize, the obvious thing is they are slower on operations involving numbers (doubles). Beyond the published benchmarks, every operation they support that requires converting the text in the input files to a number is slower. However, for operations on strings, like string equality, there is no performance degradation. There's not a lot going on in numeric operations that does not occur in string operations, except for converting to numbers and using numeric operations like less-than or plus. So I figured that constructing simpler code involving large numbers of char[] to double conversions and simple numeric operations should see similar degradations, but I wasn't able to show this. These operations could be still be the cause, but perhaps with a more complex trigger. I've been imagining that perhaps in-lining that took place in LDC 1.2 versions is no longer taking place, so perhaps that's still the problem. Using a better profiling tool might show this, but the OS X profiler doesn't know seem to know function names produced by LDC code. Looking at the assembly might do it, I haven't gotten to this step yet. The csv2tsv does not convert strings to numbers or do comparisons. However, it does many things the other tools do not. It could easily be that it is slower due to something different than what is affecting tsv-filter and tsv-summarize. |
|
@jondegenhardt Can you try again with #2180 ? Thanks! |
|
@jondegenhardt: The commit in question is in master now. As for mangled names in profiles, what I meant is to just compare the profiles structurally (as in, just look which bits of the call stack take more time than before). If your program is sufficiently long-running so that the statistical noise is low, this should give you a pretty good starting point, pointing us to the area where things got slower (or to the fact that everything got slower uniformly). |
|
Building with latest (6c97a02, which should have #2180 ), tests are mostly unchanged. The csv2tsv test without the I ran the three tests in question, with no changes to compiler command line:
I'll run the full test suite and report any other findings. It will take a couple hours. |
|
@jondegenhardt Can you try with LLVM 3.9 + LDC master? (I can't tell from the thread here, but it looks you've been using LLVM4.0 so far with 1.3, right?) |
|
Yes, I used LLVM 4.0. I'll also try with LLVM 3.9. |
|
(Thank you very much for all the work on tracking this down and your persistence, Jon!) |
|
You're welcome. I very much appreciate that these finding are being taken seriously in release evaluation. The tsv toolkit is by itself a pretty minor piece of software, I do hope that these findings apply more widely that just the toolkit. |
|
Here are the results for current master with LLVM 4.0. I'll try LLVM 3.9 next. Each was metric run 3 times to ensure consistency. Update: Now with the LLVM 3.9 times.
|
|
I've updated the table above to include the LLVM 3.9 times. LLVM 3.9 shows only minor differences from LLVM 4.0. Most tests are unchanged, tsv-summarize is a little faster with LLVM 4.0 and csv2tsv is a little faster with LLVM 3.9. |
|
Thanks for the interesting numbers. I'd be way more interested in a comparison with 2.072 Phobos (or LDC 1.2 with 2.073 Phobos). If you managed to build LDC yourself, swapping out the druntime/Phobos sources should be simple. Of course the other option is to provide us with an easy way of reproducing the issue. I already built tsv-filter, but didn't want to download that huge dataset, so I took a smaller one (Alaska only), but with that the filter you use didn't yield any records at all... |
|
That hack shouldn't be necessary anymore, I'd test it enabled (vanilla). |
|
Well, it didn't work first try. Suggestions? |
|
This is related to #2077. I think the clang version needs to match up with the LLVM libs used to build LDC. |
|
@jondegenhardt You need to use the newer LLVM version of clang or LDC for linking. In your case, I guess we use clang for linking, but LDC is built with a newer LLVM version. You can specify which LTO plugin to use for linking (such that it uses the newer LTO plugin version): try adding |
|
Should the linker flag have been set automatically? I'm specifying the LLVM installation on the |
Nope, it was removed here. When also building the C parts with clang and LTO support, the used clang ( [The other alternative would be building LLVM incl. clang source and using that clang binary + LTO plugin.] |
Yeah, um, I'm not going to modify my xcode installation, even temporarily. I rely on it too much for other work. If there's a cmake variable I can set I'll do that, or if there's a mod I can make to one of the CMakeLists.txt files, I'll do that. The CMakeLists.txt files are just involved enough It's not immediately clear what change to make. |
|
Alright. ;) Simply restoring the previous block cmake/Modules/HandleLTOPGOBuildOptions.cmake:36-45 should do the trick. |
|
Wow! The results are remarkable. My system isn't quiet enough right now to do a reliable benchmark, but I ran the tests by hand to get a preview. Improvements in most programs, in some cases quite dramatic. I'll generate a full benchmark later, but here's the quick preview:
Executable sizes are smaller, by 2-4x. My immediate reaction is to say that if there's a reliable way to bundle this it'd be certainly worth while. |
|
@jondegenhardt The build time probably went up quite a lot too? ;) Can you try with |
|
Thx Jon, excellent results! :) |
|
Results of the full suite below (best time of seven runs). Last row is with Phobos compiled -flto=full, and the app code as well. It's a clear win. Best times for all benchmarks, significant in several cases, with no signs of degradation. I ran a number of variants so you can see how they compare. Sorry @JohanEngelen, I didn't think to do an Overall very impressive results!
|
|
Great stuff. Forum post? ;-) |
I agree, but perhaps you or the LDC team should write it :) . Myself, I find these results eye-opening and I wonder if others in the community would as well. Of course, the tsv utilities may be more amenable to this type of optimization than other apps (they typically repeat the same operations large numbers of times in tight loops). But still, this is really promising. If it's not easy to bundle Phobos LTO nicely as released feature, then perhaps a useful step would be to make it easy for others in the D community to try it and benchmark their own apps. It'd be worth getting some additional data points. |
|
Follow-up to the previous benchmark results - Executable sizes of the tsv utility tools for a couple different builds. Last row in each table is with Phobos built with
Update: Added several |
|
The large size for non-LTO builds (and when linking in the static runtime libs) is most likely also caused by the template culling mechanism. (One of?) the Phobos object(s) containing the template instantiation definition will be dragged in, and with it all its dependencies, so that you end up dragging in large parts of Phobos. I'm currently leaning towards focusing on LTO for optimal release builds instead of trying to tweak the template culling. |
Makes sense. Performance gains via template culling sounds challenging to do well, while the LTO path looks like it holds quite a bit more potential. There might be short term performance opportunities missed, but should be worth it. |
|
Regarding Compiler line used: |
|
@jondegenhardt ThinLTO is pretty new in LLVM, so there may be some bugs lurking. We could test with LLVM trunk and see if there is the same linker failure. If so, dustmite, etc.. work :( |
|
Here is a benchmark run comparing For LDC 1.2, 'thin' and 'full' had nearly identical results. For LDC 1.3, 'thin' was better than 'full' in several benchmarks. It appears to have done a better job avoiding the performance regressions that surfaced with LDC 1.3, though not all of them. Executable sizes were not materially different (updated in the table a few messages back).
Note: These runs were a touch slower than the previous benchmark, likely due to a somewhat higher level of other activities on the machine. This is why I always run all metrics needed to do a relative comparison. Run-to-run times were quite consistent so relative comparisons should be valid. |
|
Compile times for the different LTO build options. These are for the full tsv utilities library (8 executable builds). Perhaps a surprise, but the
|
|
Thanks for more data :-) |
I'm seeing performance regressions in several of my standard benchmarks (https://github.com/eBay/tsv-utils-dlang/blob/master/docs/Performance.md). Specifically the
tsv-filter,tsv-summarize, andcsv2tsvtests. Performance is off 20-30% from previous LDC releases.So far I've only tested OSX (Xcode 8.3.3).
I won't have time to diagnose further for several days, but I'll see if I can narrow this further sometime during the week of June 19. I don't think it's likely to be due to changes in Phobos, but I can check that and if it is specific to OSX. If a fix is created for issue #2161 (performance degradation with boundscheck=off) I can check that as well. Testing with this option removed did show improvement on one benchmark, but overall benchmarks were worse.
The text was updated successfully, but these errors were encountered: