Compiler performance: General observations #1380

Closed
dnadlinger opened this issue Oct 28, 2019 · 14 comments

Comments

@dnadlinger
Collaborator

This is a breakout issue from #1370: Kernel compilation is slow, and we should do something about it.

To follow up on the earlier discussion, this is an overview of the compilation process on a typical piece of code in my experiment (ARTIQ commit 611bcc4):

(This was generated using py-spy record -r 100 -f speedscope -o artiq_run.speedscope -- artiq_run ~/…/run_gate_sequence.py. The raw data can be explored on https://speedscope.app; I've also generated an SVG flamegraph.)

In reference to the earlier discussion, note that > 70 % of the time is spent in the ARTIQ compiler proper. In other words, llvmlite certainly isn't our biggest problem. Just caching emitted object file fragments also wouldn't help much; we'd need to make sure we can do so without type-checking the function bodies.

Speaking of which, all the Python functions with the highest execution time are currently related to type inference/merging:

I'm not sure how this looks for other languages with HM-style type inference, but this certainly seems a bit high, and suggests optimisations here (e.g. interning well-known types) could be worth it.
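To illustrate what interning well-known types could buy, here is a minimal sketch. It does not use the ARTIQ compiler's actual type classes: `TMono` and `unify` below are hypothetical stand-ins, and type variables are left out entirely. The idea is that structurally equal types are always the same object, so the common case in unification reduces to an identity check.

```python
class TMono:
    """Hypothetical stand-in for a monomorphic type such as int32 or list[int32]."""

    _interned = {}

    def __new__(cls, name, params=()):
        # Parameters are themselves interned, so hashing them by identity is enough.
        key = (name, tuple(params))
        try:
            return cls._interned[key]
        except KeyError:
            self = super().__new__(cls)
            self.name = name
            self.params = key[1]
            cls._interned[key] = self
            return self


def unify(a, b):
    """Unify two types; identical interned objects take the fast path."""
    if a is b:
        return a
    # Slow path: a structural walk, needed here only to report the mismatch.
    if a.name != b.name or len(a.params) != len(b.params):
        raise TypeError(f"cannot unify {a.name} with {b.name}")
    for x, y in zip(a.params, b.params):
        unify(x, y)
    return a
```

A real checker would still have to resolve type variables before the identity check applies, so how much this saves in practice would need to be measured against the profile above.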

Also, the total time spent in C calls, i.e. including all the heavy lifting done in LLVM itself, is only about 5% of the overall compilation cost. Although the comparison isn't entirely fair, compilers for languages like D and Rust (C++ too, probably – haven't looked at Clang in a while) typically spend the majority of time in LLVM, or at least a much more sizeable fraction. In a way, this is good news: If LLVM were our bottleneck, improving on compile times would likely be a much more elaborate undertaking.


In any case, it doesn't look like there is currently a single pathological case for us that, if fixed, would greatly speed up compilation. Consequently, enabling re-use of compilation results seems like the way to go even in the short term, not just for long-term scalability. Transparently integrating this with the dynamic mess that is Python doesn't seem particularly straightforward, though. Ideas/opinions?
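As a starting point for discussion, re-use of compilation results could be sketched as a cache keyed on a hash of the kernel source text and the argument type names. Everything here is an assumption made for illustration: `CompileCache` and the `compile_fn` callback are hypothetical names, not part of ARTIQ, and the key deliberately ignores the hard part, namely that globals, host object attributes and callees can change the result without the source text changing.

```python
import hashlib


class CompileCache:
    """In-memory cache of compiled artefacts keyed by source text and argument types."""

    def __init__(self, compile_fn):
        # compile_fn is a hypothetical callback: (source, arg_types) -> bytes
        self._compile_fn = compile_fn
        self._entries = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(source, arg_types):
        h = hashlib.sha256()
        h.update(source.encode("utf-8"))
        for name in arg_types:
            h.update(b"\0")  # separator so adjacent names cannot run together
            h.update(name.encode("utf-8"))
        return h.hexdigest()

    def get(self, source, arg_types):
        k = self.key(source, arg_types)
        if k in self._entries:
            self.hits += 1
            return self._entries[k]
        self.misses += 1
        result = self._compile_fn(source, arg_types)
        self._entries[k] = result
        return result
```

For this to avoid type-checking the function bodies on a hit, the key would have to capture every input that influences inference, which is exactly where Python's dynamism makes things difficult.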

@dnadlinger
Collaborator Author

(Per-function py-spy recordings, instead of per-line: artiq_run.functions.speedscope.zip, artiq_run.functions.svg.zip.)

@dnadlinger
Collaborator Author

See #1415 for another compile time improvement.

On another performance-related note, on one of my experiment test cases (others were similar) I saw a 24% reduction in total compilation time by switching from Conda's Python 3.5 to 3.8, and a custom PGO+LTO build gave a few more percent for a total reduction of 30% compared to Python 3.5. (Haven't checked any intermediate versions.)

@sbourdeauducq
Member

a custom PGO+LTO build gave a few more percent for a total reduction of 30% compared to Python 3.5.

Can you describe how you built that?

@dnadlinger
Collaborator Author

Can you describe how you built that?

Just checked out the 3.8.1 source tree and ./configure --prefix=/opt/python-3.8-pgo~lto --enable-optimizations --with-lto && make && make install.

This uses some random test cases for establishing the PGO counters. I've tried manually running it on the target test case as well, but that made at most a percent difference, so isn't worth the effort. (This isn't very surprising, as I wouldn't expect the compiler to be bottlenecked by a particular hot path in the VM.)
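To confirm that a given interpreter was actually configured this way, one option is to inspect the configure arguments recorded at build time. This is a small sketch, assuming a POSIX build where `sysconfig.get_config_var("CONFIG_ARGS")` returns the configure command line as a string (on Windows builds it is `None`); the helper name `optimization_flags` is made up for this example.

```python
import sysconfig


def optimization_flags(config_args):
    """Report which of the two configure flags appear in a CONFIG_ARGS string."""
    args = config_args or ""  # CONFIG_ARGS may be None on some platforms
    return {
        "pgo": "--enable-optimizations" in args,
        "lto": "--with-lto" in args,
    }


if __name__ == "__main__":
    # Inspect the interpreter that is running this script.
    print(optimization_flags(sysconfig.get_config_var("CONFIG_ARGS")))
```

Running this with the interpreter in question shows whether both flags were passed to `./configure`; it says nothing about which workloads were used to collect the PGO profile.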

@sbourdeauducq
Member

NixOS/nixpkgs#43442 (comment)
If this doesn't get accepted we can use a nixpkgs overlay to replace Python in ARTIQ installations, but it would cause annoying package rebuilds as I hinted.

@dnadlinger
Collaborator Author

dnadlinger commented Jan 4, 2020

Do you have any links describing how to do that? I'll have to roll out the changes in a somewhat permanent way in at least a few of our setups. (The Nix Python setup is – probably quite reasonably so – a bit complex; how do I change the default version to 3.8 from an overlay? I can certainly figure out how to do that in Nix, but if it's trivial for you to do…)

@sbourdeauducq
Member

Enabling optimizations can be done like this:

diff --git a/artiq-fast/default.nix b/artiq-fast/default.nix
index 638baa0..b24d743 100644
--- a/artiq-fast/default.nix
+++ b/artiq-fast/default.nix
@@ -1,4 +1,8 @@
-{ pkgs ? import <nixpkgs> {}}:
+{ pkgs ? import <nixpkgs> {
+  overlays = [ (self: super: {
+    python3 = super.python3.overrideAttrs(oa: { configureFlags = oa.configureFlags ++ ["--enable-optimizations" "--with-lto"]; }); 
+  }) ];
+}}:
 with pkgs;
 let
   pythonDeps = import ./pkgs/python-deps.nix { inherit (pkgs) stdenv fetchFromGitHub python3Packages; };

For Python 3.8, many packages in release-19.09 and master of nixpkgs are not compatible, so the simplest way seems to be to use the staging branch, which comes with 3.8.

@sbourdeauducq
Member

Turns out, there's also the libapparmor Python 3.8 issue on staging, so for now there is no simple solution if you do need 3.8 (either fix libapparmor, or use a mix of 3.7 and 3.8).

@sbourdeauducq
Member

In any case, it doesn't look like there is currently a single pathological case for us that, if fixed, would greatly speed up compilation.

Maybe we should rewrite the compiler in an efficient and compiled language (probably Rust), using this and other lessons learned from the previous two iterations in Python.

@drewrisinger
Contributor

FYI, looks like these python speedup improvements might hit nixpkgs/staging based on a new, well-tested & documented PR that hit nixpkgs today. NixOS/nixpkgs#84072

@pca006132
Contributor

Is it possible to provide some benchmark code for us to put in the unit tests? I need some examples for profiling and testing; otherwise I don't know where to look.

@pca006132
Contributor

FYI, looks like these python speedup improvements might hit nixpkgs/staging based on a new, well-tested & documented PR that hit nixpkgs today. NixOS/nixpkgs#84072

However it seems that they don't use optimizations for now:

https://github.com/NixOS/nixpkgs/blob/e9148dc1c30e02aae80cc52f68ceb37b772066f3/pkgs/development/interpreters/python/cpython/default.nix#L41

, enableOptimizations ? false

https://github.com/NixOS/nixpkgs/blob/e9148dc1c30e02aae80cc52f68ceb37b772066f3/pkgs/development/interpreters/python/cpython/default.nix#L66-L67

assert lib.assertMsg (reproducibleBuild -> (!enableOptimizations))
  "Deterministic builds are not achieved when optimizations are enabled.";

It seems that they prefer deterministic builds instead of 25% speedup...

@drewrisinger
Contributor

@pca006132 Yes, these optimizations were reverted in NixOS/nixpkgs#107965, which I was minimally involved in. You are correct that they seemed to prefer reproducible builds. Luckily, with Nix you can still enable those optimizations yourself and get the 25% speedup if you really care; it just requires a little bit of Nix knowledge and a decent amount of time to rebuild all the dependent packages.

For reference, the Nix expression to enable them would look something like this (untested; ref: https://nixos.wiki/wiki/Overlays):

shell.nix:

{  }:
let
  optimizePythonOverlay = self: super: { python3 = super.python3.override { enableOptimizations = true; }; };
  nixpkgsWithOptimizedPython = import <nixpkgs-21.05> { overlays = [ optimizePythonOverlay ]; };
in
nixpkgsWithOptimizedPython.mkShell {
  buildInputs = [ (nixpkgsWithOptimizedPython.python3.withPackages(ps: [ ps.numpy ])) ];
}

@sbourdeauducq
Member

NAC3 compilation time breakdown is different and seems to be dominated by LLVM.
