Compiler performance: General observations #1380
(Per-function py-spy recordings, instead of per-line: artiq_run.functions.speedscope.zip)
See #1415 for another compile time improvement. On another performance-related note, on one of my experiment test cases (others were similar) I saw a 24% reduction in total compilation time by switching from Conda's Python 3.5 to 3.8, and a custom PGO+LTO build gave a few more percent for a total reduction of 30% compared to Python 3.5. (Haven't checked any intermediate versions.)
Can you describe how you built that?
Just checked out the 3.8.1 source tree and built it with PGO and LTO enabled. This uses some random test cases for establishing the PGO counters. I've tried manually running it on the target test case as well, but that made at most a percent of difference, so it isn't worth the effort. (This isn't very surprising, as I wouldn't expect the compiler to be bottlenecked by a particular hot path in the VM.)
NixOS/nixpkgs#43442 (comment)
Do you have any links describing how to do that? I'll have to roll out the changes in a somewhat permanent way in at least a few of our setups. (The Nix Python setup is – probably quite reasonably so – a bit complex; how do I change the default version to 3.8 from an overlay? I can certainly figure out how to do that in Nix, but if it's trivial for you to do…) |
Enabling optimizations can be done like this: …
For Python 3.8, many packages in …
Turns out, there's also the …
Maybe we should rewrite the compiler in an efficient and compiled language (probably Rust), using this and other lessons learned from the previous two iterations in Python. |
FYI, looks like these Python speedup improvements might hit …
Is it possible to provide some benchmark code for us to put in the unit tests? I need some examples for profiling and testing (otherwise I don't know where to look).
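In the absence of a shareable experiment, a compile-time benchmark for the test suite could start as a best-of-N wall-clock timer around the compiler entry point. A minimal sketch, where `compile_kernel` is a hypothetical stand-in for the real compiler entry point (not an actual ARTIQ API):

```python
import time

def bench(compile_kernel, source, repeats=5):
    """Return the best-of-N wall-clock time for compiling `source`.

    `compile_kernel` is a hypothetical stand-in for the real compiler
    entry point; taking the best of several runs damps scheduler and
    warm-up noise, which matters for sub-second compilations.
    """
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        compile_kernel(source)
        best = min(best, time.perf_counter() - t0)
    return best
```

A unit test could then assert that the measured time stays below a generous threshold, to catch order-of-magnitude regressions without being flaky.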
However, it seems that they don't use optimizations for now:

```nix
, enableOptimizations ? false

assert lib.assertMsg (reproducibleBuild -> (!enableOptimizations))
  "Deterministic builds are not achieved when optimizations are enabled.";
```

It seems that they prefer deterministic builds over the 25% speedup...
@pca006132 Yes, these optimizations were reverted in NixOS/nixpkgs#107965, which I was minimally involved in. You are correct that they seemed to prefer reproducible builds. Luckily, with Nix you can still enable those optimizations yourself and get the 25% speedup if you really care; it just requires a little bit of Nix knowledge and a decent amount of time to rebuild all the dependent packages. For reference, the Nix to produce it would look something like this (untested; ref: https://nixos.wiki/wiki/Overlays):

```nix
{ }:
let
  optimizePythonOverlay = self: super: {
    python3 = super.python3.override { enableOptimizations = true; };
  };
  nixpkgsWithOptimizedPython = import <nixpkgs-21.05> {
    overlays = [ optimizePythonOverlay ];
  };
in
nixpkgsWithOptimizedPython.mkShell {
  buildInputs = [
    (nixpkgsWithOptimizedPython.python3.withPackages (ps: [ ps.numpy ]))
  ];
}
```
NAC3 compilation time breakdown is different and seems to be dominated by LLVM. |
This is a breakout issue from #1370: Kernel compilation is slow, and we should do something about it.
To follow up on the earlier discussion, this is an overview of the compilation process on a typical piece of code in my experiment (ARTIQ commit 611bcc4):
(This was generated using `py-spy record -r 100 -f speedscope -o artiq_run.speedscope -- artiq_run ~/…/run_gate_sequence.py`. The raw data can be explored on https://speedscope.app; I've also generated an SVG flamegraph.)

In reference to the earlier discussion, note that >70% of the time is spent in the ARTIQ compiler proper. In other words, llvmlite certainly isn't our biggest problem. Just caching emitted object file fragments also wouldn't help much; we'd need to make sure we can do so without type-checking the function bodies.
Speaking of which, all the Python functions with the highest execution time are currently related to type inference/merging:
I'm not sure how this looks for other languages with HM-style type inference, but this certainly seems a bit high, and suggests optimisations here (e.g. interning well-known types) could be worth it.
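To illustrate the interning idea: if monomorphic types like `int32` are constructed afresh at every use, unification pays for allocation and structural comparison each time. Interning makes equal types share one object, so comparisons collapse to identity checks. A minimal sketch, assuming an illustrative `TInt` class (the name and shape are not taken from the ARTIQ codebase):

```python
class TInt:
    """An interned monomorphic integer type of a given bit width.

    Instances with the same width are the same object, so type
    equality during unification is a pointer comparison rather
    than a structural one.
    """
    _cache = {}

    def __new__(cls, width):
        inst = cls._cache.get(width)
        if inst is None:
            inst = super().__new__(cls)
            inst.width = width
            cls._cache[width] = inst
        return inst

# Interning in action: equal parameters yield the identical object.
assert TInt(32) is TInt(32)
assert TInt(32) is not TInt(64)
```

The same trick extends to compound types (tuples, lists) by interning on the identities of the already-interned component types.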
Also, the total time spent in C calls, i.e. including all the heavy lifting done in LLVM itself, is only about 5% of the overall compilation cost. Although the comparison isn't entirely fair, compilers for languages like D and Rust (and probably C++ too – I haven't looked at Clang in a while) typically spend the majority of their time in LLVM, or at least a much more sizeable fraction. In a way, this is good news: if LLVM were our bottleneck, improving compile times would likely be a much more elaborate undertaking.
In any case, it doesn't look like there is currently a single pathological case for us that, if fixed, would greatly speed up compilation. Consequently, enabling re-use of compilation results does seem like the way to go even in the short term, not just for long-term scalability. Transparently integrating this with the dynamic mess that is Python doesn't seem particularly straightforward, though. Ideas/opinions?
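One shape such re-use could take: key a cache on everything that may influence code generation, at minimum the function's code and the inferred argument types. A minimal sketch, assuming a hypothetical `compile_fn` stand-in for the expensive backend (not a real ARTIQ API); a real implementation would also have to account for referenced globals and closed-over values, which is exactly the "dynamic mess" part:

```python
import hashlib

_object_cache = {}

def cached_compile(fn, arg_types, compile_fn):
    """Reuse a previous compilation when nothing relevant changed.

    The key covers the function's bytecode, constants, referenced
    names, and the argument types. `compile_fn` is a hypothetical
    stand-in for the real (expensive) backend. A persistent on-disk
    cache would hash source text instead of in-memory code objects,
    and would also need to invalidate on referenced globals changing.
    """
    code = fn.__code__
    key = hashlib.sha256(
        repr((code.co_code, code.co_consts, code.co_names, tuple(arg_types))).encode()
    ).hexdigest()
    if key not in _object_cache:
        _object_cache[key] = compile_fn(fn, arg_types)
    return _object_cache[key]

# Example: the expensive backend runs only once per distinct input.
calls = []

def fake_backend(fn, arg_types):
    calls.append((fn.__name__, tuple(arg_types)))
    return b"object-code"

def kernel(x):
    return x + 1

obj1 = cached_compile(kernel, ("int32",), fake_backend)
obj2 = cached_compile(kernel, ("int32",), fake_backend)
```

The second call returns the cached object without invoking the backend; calling again with different argument types triggers a fresh compilation.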