[WIP] Refactor workers into state machines #63

Merged
merged 12 commits into from
Dec 23, 2019

mratsim commented Dec 22, 2019

Fixes #10 and should help with #3.
It should also make the transitions and their events/triggers more explicit and easier to follow.
This would help implement nested barriers (#41) on sane foundations
and make Weave easier to maintain in the future.

Commits in this PR will progressively bubble up all the states into a single state machine, a pushdown automaton, or a set of hierarchical state machines that is only 1 level deep.
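To illustrate what "explicit transitions and events/triggers" can look like, here is a minimal table-driven sketch in Python. It is not Weave's actual state machine: the state names, event names, and transition table are all hypothetical and only show the general shape of a worker loop expressed as a finite state machine.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical worker states, not the ones defined in Weave.
    CHECK_TASK = auto()
    RUN_TASK = auto()
    STEAL = auto()
    EXIT = auto()


class Event(Enum):
    # Hypothetical triggers that move the worker between states.
    TASK_FOUND = auto()
    QUEUE_EMPTY = auto()
    TASK_DONE = auto()
    STEAL_FAILED = auto()
    SHUTDOWN = auto()


# Every legal transition is listed in one place, so the control flow
# can be read (and audited) without tracing nested loops and branches.
TRANSITIONS = {
    (State.CHECK_TASK, Event.TASK_FOUND): State.RUN_TASK,
    (State.CHECK_TASK, Event.QUEUE_EMPTY): State.STEAL,
    (State.RUN_TASK, Event.TASK_DONE): State.CHECK_TASK,
    (State.STEAL, Event.TASK_FOUND): State.RUN_TASK,
    (State.STEAL, Event.STEAL_FAILED): State.CHECK_TASK,
    (State.STEAL, Event.SHUTDOWN): State.EXIT,
}


def step(state, event):
    """Return the next state, rejecting transitions missing from the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event.name}") from None


def run(events, start=State.CHECK_TASK):
    """Feed a sequence of events to the machine and return the visited states."""
    trace = [start]
    for event in events:
        trace.append(step(trace[-1], event))
    return trace


if __name__ == "__main__":
    visited = run([Event.QUEUE_EMPTY, Event.TASK_FOUND, Event.TASK_DONE])
    print(" -> ".join(s.name for s in visited))
```

Keeping the transitions in a single table is one way to get the maintenance benefit described above: an unexpected event fails loudly instead of silently falling through a branch.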

mratsim commented Dec 23, 2019

There doesn't seem to be much of a performance impact on many workloads, but I did notice one on nqueens 15.
For some reason, the FSM seems to increase the computation required by the nqueens algorithm.
I am not sure why.

[Screenshot: 2019-12-23_18-39]

mratsim commented Dec 23, 2019

The await rewrite regains the performance on nqueens 15; the regression was probably due to nextTask not being fully inlined after the handleThieves rewrite.

However, there is an impact on the overhead measured with fibonacci, on both eager (about 5%) and lazy (about 10%), with a constant extra overhead of about 20 ms on my 36-core machine.

mratsim commented Dec 23, 2019

Performance increases on nested parallelism.

With the PR (the 3 middle runs out of 20 of the transpose benchmark):

Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     62.414
Max RSS (KB):                                 19408
Runtime RSS (KB):                             10300
# of page faults:                             12761
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    25.635
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.67
Max RSS (KB):                                 19700
Runtime RSS (KB):                             292
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.283
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     61.591
Max RSS (KB):                                 18820
Runtime RSS (KB):                             9608
# of page faults:                             9089
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    25.978
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.202
Max RSS (KB):                                 18948
Runtime RSS (KB):                             128
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.631
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     60.174
Max RSS (KB):                                 18524
Runtime RSS (KB):                             9428
# of page faults:                             8571
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    26.59
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.2
Max RSS (KB):                                 18716
Runtime RSS (KB):                             192
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.632
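
The derived fields in the reports above can be reproduced from the raw ones. The sketch below is inferred from the printed numbers, not taken from the benchmark's source: it assumes that Perf is rounds × operations divided by the elapsed time, and that Arithmetic Intensity is operations divided by bytes. The function names are hypothetical.

```python
def gmemops_per_s(operations, rounds, time_ms):
    """Giga memory operations per second, assuming time_ms covers all rounds."""
    total_ops = operations * rounds
    seconds = time_ms / 1000.0
    return total_ops / seconds / 1e9


def arithmetic_intensity(operations, n_bytes):
    """Operations per byte, as reported in the benchmark header."""
    return operations / n_bytes


if __name__ == "__main__":
    # Values copied from the first run above (400x4000 --> 4000x400).
    print(round(gmemops_per_s(1_600_000, 1000, 62.414), 3))
    print(arithmetic_intensity(1_600_000, 6_400_000))
```

Under these assumptions the formula matches the reported Perf values to three decimals for the runs checked, which suggests that Time(ms) is the total over the 1000 rounds.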

Without the PR:

Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     67.957
Max RSS (KB):                                 18820
Runtime RSS (KB):                             9688
# of page faults:                             8898
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.544
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.512
Max RSS (KB):                                 19112
Runtime RSS (KB):                             292
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.351
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     67.565
Max RSS (KB):                                 19104
Runtime RSS (KB):                             9944
# of page faults:                             9786
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.681
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.634
Max RSS (KB):                                 19364
Runtime RSS (KB):                             260
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.286
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     69.173
Max RSS (KB):                                 18800
Runtime RSS (KB):                             9548
# of page faults:                             11166
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.13
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.671
Max RSS (KB):                                 19140
Runtime RSS (KB):                             340
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.266
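
To put a number on the improvement, the short sketch below compares the median Time(ms) of the three runs with and without the PR, for each transposition direction. The timings are copied from the reports above; using the median of three runs as the summary statistic is a choice made for this illustration, and the helper name is hypothetical.

```python
from statistics import median

# Time(ms) values copied from the six reports above.
TIMES_MS = {
    "400x4000 --> 4000x400": {
        "with": [62.414, 61.591, 60.174],
        "without": [67.957, 67.565, 69.173],
    },
    "4000x400 --> 400x4000": {
        "with": [46.67, 46.202, 46.2],
        "without": [54.512, 54.634, 54.671],
    },
}


def speedup(times):
    """Ratio of median time without the PR to median time with the PR."""
    return median(times["without"]) / median(times["with"])


if __name__ == "__main__":
    for direction, times in TIMES_MS.items():
        print(f"{direction}: {speedup(times):.3f}x faster with the PR")
```

On these samples the ratio comes out at roughly 1.10 for the first direction and roughly 1.18 for the second.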

mratsim merged commit bd87a4f into master on Dec 23, 2019
mratsim deleted the state-machine branch on December 27, 2019 21:13

Successfully merging this pull request may close these issues.

Rewrite the state transitions as a Finite State Machine