
Profiler fixes #2097

Closed
wants to merge 4 commits into from

Conversation

@Kha (Member) commented Feb 7, 2023

The "interpretation" metric is now more accurate: run times of invoked native code are subtracted from it. As a side effect, an interpreted closure is now reported as a separate task (regardless of whether the caller is the interpreter or native code), but I think the cumulative metric is the more important one here.
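The idea can be illustrated with a minimal, single-threaded sketch. This is hypothetical Python, not the actual C++ interpreter code; `InterpProfiler`, `enter_interpreter`, and `call_native` are made-up names standing in for the real hooks. Per interpreter frame, we accumulate the time spent in nested native calls and subtract it before crediting "interpretation":

```python
import time

class InterpProfiler:
    """Hypothetical sketch: credit only non-native time to 'interpretation'."""

    def __init__(self):
        self.interp_time = 0.0
        self._frames = []  # per frame: [start_time, time_spent_in_native_calls]

    def enter_interpreter(self):
        self._frames.append([time.perf_counter(), 0.0])

    def leave_interpreter(self):
        start, native = self._frames.pop()
        elapsed = time.perf_counter() - start
        # Only the non-native share of this frame counts as interpretation.
        self.interp_time += elapsed - native

    def call_native(self, fn, *args):
        t0 = time.perf_counter()
        try:
            return fn(*args)
        finally:
            # Credit the whole native call to the enclosing frame's native bucket.
            if self._frames:
                self._frames[-1][1] += time.perf_counter() - t0
```

For example, a frame that spends almost all of its wall time inside `call_native(time.sleep, 0.02)` would report an `interp_time` close to zero rather than the full 20 ms.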

@Kha (Member, Author) commented Feb 7, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 9fa36fc.
There were significant changes against commit 4b974fd:

  Benchmark                  Metric         Change
  ===========================================================
- tests/bench/ interpreted   instructions     7.6%  (922.0 σ)
- tests/bench/ interpreted   task-clock       6.9%   (14.6 σ)
- tests/bench/ interpreted   wall-clock       8.0%   (25.6 σ)
- workspaceSymbols           branches         9.6% (1360.8 σ)
- workspaceSymbols           instructions    10.2% (2190.1 σ)
- workspaceSymbols           task-clock      15.3%   (36.0 σ)
- workspaceSymbols           wall-clock      15.3%   (36.0 σ)

@Kha (Member, Author) commented Feb 7, 2023

I guess the profiling mutex is not quite insignificant!

@gebner (Member) commented Feb 7, 2023

IIUC this adds a KVMap lookup to every native function call (even Array.size?) and every pap (partial application), even if the profiler is turned off. That feels like a very high cost for typical interpreted code.

I don't think we're going to get representative results from benchmarking core (as virtually everything is compiled). IIRC you said that mathlib didn't have a high percentage of interpreted runtime either; could you benchmark mathlib with this change (leanprover-community/mathlib4#2113 fixes the panic btw)?

@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 6431b81.
The entire run failed.
Found no significant differences.

@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 42e4d74.
There were significant changes against commit 4b974fd:

  Benchmark                  Metric         Change
  ===========================================================
- tests/bench/ interpreted   instructions     7.8% (3573.2 σ)
- tests/bench/ interpreted   task-clock       9.3%   (10.9 σ)
- tests/bench/ interpreted   wall-clock      10.3%   (31.3 σ)
- workspaceSymbols           branches         4.8%  (777.8 σ)
- workspaceSymbols           instructions     5.2% (1391.1 σ)
- workspaceSymbols           task-clock      12.6%   (45.2 σ)
- workspaceSymbols           wall-clock      12.6%   (45.1 σ)

@Kha force-pushed the fix-prof branch 2 times, most recently from 5bf377e to 50b30e3 on February 9, 2023
@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 50b30e3.
The entire run failed.
Found no significant differences.

@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 50b30e3.
The entire run failed.
Found no significant differences.

@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 50b30e3.
The entire run failed.
Found no significant differences.

@Kha (Member, Author) commented Feb 9, 2023

Ohh, it's caching the failure...

@Kha (Member, Author) commented Feb 9, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 84d0252.
There were significant changes against commit 4b974fd:

  Benchmark                  Metric         Change
  ============================================================
- tests/bench/ interpreted   instructions     4.7% (16077.6 σ)
- tests/bench/ interpreted   wall-clock       5.1%    (21.7 σ)
- workspaceSymbols           branches         4.8%  (1130.8 σ)
- workspaceSymbols           instructions     5.2%  (1706.4 σ)
- workspaceSymbols           task-clock      10.5%    (36.3 σ)
- workspaceSymbols           wall-clock      10.5%    (36.2 σ)

@Kha (Member, Author) commented Feb 10, 2023

Interesting: even though the code does almost nothing when the profiler is disabled, there is a very measurable slowdown. So it seems we would either have to accept this overhead or live with wildly inaccurate interpretation metrics.

I'll do a benchmark run of mathlib4 after the 2003 changes are ready.

@Kha (Member, Author) commented Mar 23, 2023

!bench

@leanprover-bot (Collaborator)

Here are the benchmark results for commit 775ef14.
There were significant changes against commit 158d58f:

  Benchmark                  Metric          Change
  =============================================================
+ stdlib                     type checking    -1.1%   (-20.2 σ)
- tests/bench/ interpreted   instructions      4.9% (15962.7 σ)
- tests/bench/ interpreted   wall-clock        4.3%    (21.0 σ)
- workspaceSymbols           branches          4.7%   (345.1 σ)
- workspaceSymbols           instructions      5.2%   (559.4 σ)
- workspaceSymbols           task-clock       10.4%    (30.7 σ)
- workspaceSymbols           wall-clock       10.4%    (30.7 σ)

Kha added a commit to Kha/mathlib4 that referenced this pull request Mar 23, 2023
@Kha (Member, Author) commented Mar 23, 2023

!bench

@Kha (Member, Author) commented Mar 23, 2023

> IIUC this adds a KVMap-lookup to every native function call (even Array.size?) and every pap, even if the profiler is turned off.

I missed this the first time: we only pay the cost when entering the interpreter (though calling an interpreted closure from the interpreter via ap does count as entering it, yes; in theory, we could cache the value of profiler in the closure).

I'm tending towards merging this as it's better to make the interpreter a little slower than to have no idea how much time it actually takes. I can do another mathlib4 run with the optimization I just pushed though.
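The caching idea mentioned above could look roughly like this. This is a hypothetical Python sketch, not the real implementation; `Options`, `InterpretedClosure`, and the `"profiler"` key are stand-ins for Lean's actual C++ structures. The point is that the option lookup happens once, when the closure is created, instead of on every call:

```python
import time

class Options:
    """Stand-in for the KVMap of options discussed above."""
    def __init__(self, kv):
        self._kv = dict(kv)

    def get_bool(self, name, default=False):
        # This is the per-call lookup we want to avoid paying.
        return self._kv.get(name, default)

class InterpretedClosure:
    """Cache the `profiler` option at closure-creation time."""
    def __init__(self, body, opts):
        self.body = body
        self.profile = opts.get_bool("profiler")  # looked up once, not per call
        self.last_elapsed = 0.0

    def __call__(self, *args):
        if not self.profile:
            return self.body(*args)
        t0 = time.perf_counter()
        try:
            return self.body(*args)
        finally:
            # In the real interpreter this would be reported to the profiler.
            self.last_elapsed = time.perf_counter() - t0
```

The trade-off is the usual one for cached configuration: a closure created before the option changes keeps its stale value for its lifetime.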

@leanprover-bot (Collaborator)

Here are the benchmark results for commit d4bc7fc.
There were significant changes against commit 158d58f:

  Benchmark                  Metric          Change
  ============================================================
+ stdlib                     type checking    -1.2%  (-71.4 σ)
- tests/bench/ interpreted   instructions      4.3% (7660.0 σ)
- workspaceSymbols           branches          4.6%  (732.0 σ)
- workspaceSymbols           instructions      5.0% (1335.2 σ)
- workspaceSymbols           task-clock        9.4%   (27.2 σ)
- workspaceSymbols           wall-clock        9.4%   (27.1 σ)

@gebner (Member) commented Mar 24, 2023

> I'm tending towards merging this as it's better to make the interpreter a little slower than to have no idea how much time it actually takes. I can do another mathlib4 run with the optimization I just pushed though.

Yes, please do that. The last version of this PR was a pretty significant regression, and I think we should only merge this if the impact is much smaller. IIUC leanprover-community/mathlib4#3048 (comment) correctly, it was a ~10% slowdown across the board. And tactic execution was even 50% slower.

@Kha (Member, Author) commented Mar 24, 2023

I'm officially declaring this PR cursed. I noticed today that on my laptop, I don't get the ~10% slowdown from above for workspaceSymbols but instead 450%. I can only surmise that clock_gettime overhead very much depends on the hardware. CLOCK_MONOTONIC_COARSE (~4ms precision?) could help with that but is, as usual, Linux-only.
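To see why the hardware matters here, one can measure the per-read cost of the clock directly. This is a rough Python sketch under stated assumptions: the real code calls clock_gettime from C, and CLOCK_MONOTONIC_COARSE is a Linux-only clock; the helper name below is made up.

```python
import time

def clock_read_cost_ns(clock, n=200_000):
    """Estimate the average cost of a single clock read, in nanoseconds.

    On machines where the read is a cheap vDSO call this tends to be tens of
    nanoseconds; where it falls back to a real syscall it can be an order of
    magnitude more, which would explain wildly different profiler overheads
    across hardware.
    """
    t0 = time.perf_counter_ns()
    for _ in range(n):
        clock()
    return (time.perf_counter_ns() - t0) / n

# Example: clock_read_cost_ns(time.monotonic)
```

Running this on two machines with different timer implementations would make the 10% vs. 450% discrepancy plausible without any change to the profiler itself.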

@Kha closed this Mar 24, 2023
@Kha (Member, Author) commented Mar 24, 2023

> IIUC leanprover-community/mathlib4#3048 (comment) correctly, it was a ~10% slowdown across the board

2-3% on average judging from instructions/task-clock; without --profile it should definitely be smaller. Judging the individual file regressions as a set can be misleading, since those might be mostly small files, which indeed are probably dominated by interpretation in import. Sorting the commit comparison by total value shows that the regression for most big files is <1%, though interestingly the slowest file, Mathlib.GroupTheory.MonoidLocalization, leads with +25%. I wouldn't be surprised if it spent most of its time at a single location.

> And tactic execution was even 50% slower

Only superficially; it's time that was previously incorrectly attributed to the interpreter. The same presumably goes for import and linting, which are all categories with many interpreted extensions.

@Kha (Member, Author) commented Mar 25, 2023

> interestingly the slowest file, Mathlib.GroupTheory.MonoidLocalization, leads with +25%

(it's probably [to_additive])
