[WIP] Refactor workers into state machines #63

mratsim · 2019-12-22T23:56:26Z

Fixes #10 and should help with #3.
It should also make the transitions, and the events/triggers that cause them, more explicit.
This would help implement nested barriers (#41) on sane foundations
and improve the maintainability of Weave in the future.

Commits in this PR will progressively bubble up all the states into a single state machine, a pushdown automaton, or a set of hierarchical state machines that is only one level deep.
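For readers unfamiliar with the pattern, here is a minimal sketch of what "explicit transitions and events" can look like when a worker loop is written as a table-driven finite state machine. It is written in Python purely for illustration. The state names, event names, and transition table are hypothetical and do not reflect Weave's actual states or its Nim implementation.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical worker states, chosen only for illustration.
    CHECK_TASK = auto()
    STEAL = auto()
    RUN = auto()
    EXIT = auto()


class Event(Enum):
    # Hypothetical events/triggers that drive the worker.
    TASK_FOUND = auto()
    QUEUE_EMPTY = auto()
    TASK_RECEIVED = auto()
    TASK_DONE = auto()
    SHUTDOWN = auto()


# Every legal transition is listed in one place, so the control flow
# can be read (and audited) without tracing nested loops and branches.
TRANSITIONS = {
    (State.CHECK_TASK, Event.TASK_FOUND): State.RUN,
    (State.CHECK_TASK, Event.QUEUE_EMPTY): State.STEAL,
    (State.STEAL, Event.TASK_RECEIVED): State.RUN,
    (State.STEAL, Event.SHUTDOWN): State.EXIT,
    (State.RUN, Event.TASK_DONE): State.CHECK_TASK,
}


def step(state, event):
    """Return the next state, or raise if the transition is not declared."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"no transition from {state.name} on {event.name}"
        ) from None


def run(events, start=State.CHECK_TASK):
    """Feed a sequence of events to the machine and return the visited states."""
    trace = [start]
    for event in events:
        trace.append(step(trace[-1], event))
    return trace
```

An undeclared transition fails loudly instead of silently falling through, which is the main maintenance benefit of making the table explicit.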

mratsim · 2019-12-23T17:46:51Z

There doesn't seem to be much of a perf impact on most workloads, but I did notice one on nqueens 15.
For some reason, the FSM seems to increase the computation required by the nqueens algorithm.
Not sure why.

mratsim · 2019-12-23T23:13:21Z

The await rewrite regains the perf on nqueens 15; the loss was probably due to nextTask not being fully inlined after the handleThieves rewrite.

However, there is an impact on the overhead measured with fibonacci, for both eager (about 5%) and lazy (about 10%) flowvars, with a constant extra overhead of about 20 ms on my 36-core machine.

mratsim · 2019-12-23T23:24:47Z

Performance is increased on nested parallelism:

With the PR, the 3 middle runs out of 20 of the transpose benchmark:

Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     62.414
Max RSS (KB):                                 19408
Runtime RSS (KB):                             10300
# of page faults:                             12761
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    25.635
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.67
Max RSS (KB):                                 19700
Runtime RSS (KB):                             292
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.283
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     61.591
Max RSS (KB):                                 18820
Runtime RSS (KB):                             9608
# of page faults:                             9089
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    25.978
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.202
Max RSS (KB):                                 18948
Runtime RSS (KB):                             128
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.631
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     60.174
Max RSS (KB):                                 18524
Runtime RSS (KB):                             9428
# of page faults:                             8571
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    26.59
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     46.2
Max RSS (KB):                                 18716
Runtime RSS (KB):                             192
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    34.632

Without the PR:

Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     67.957
Max RSS (KB):                                 18820
Runtime RSS (KB):                             9688
# of page faults:                             8898
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.544
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.512
Max RSS (KB):                                 19112
Runtime RSS (KB):                             292
# of page faults:                             84
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.351
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     67.565
Max RSS (KB):                                 19104
Runtime RSS (KB):                             9944
# of page faults:                             9786
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.681
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.634
Max RSS (KB):                                 19364
Runtime RSS (KB):                             260
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.286
Inverting the transpose order may favor one transposition heavily for non-tiled strategies
--------------------------------------------------------------------------
Scheduler:                                    Weave  (eager flowvars)
Benchmark:                                    Transpose - TiledNested
Threads:                                      36
# of rounds:                                  1000
# of operations:                              1600000
# of bytes:                                   6400000
Arithmetic Intensity:                         0.25
--------------------------------------------------------------------------
Transposition:                                400x4000 --> 4000x400
Time(ms):                                     69.173
Max RSS (KB):                                 18800
Runtime RSS (KB):                             9548
# of page faults:                             11166
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    23.13
--------------------------------------------------------------------------
Transposition:                                4000x400 --> 400x4000
Time(ms):                                     54.671
Max RSS (KB):                                 19140
Runtime RSS (KB):                             340
# of page faults:                             90
Perf (GMEMOPs/s ~ GigaMemory Operations/s)    29.266
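To put a number on the difference, the runs above can be summarized with a short script. This is only a sketch: it assumes the reported Perf figure is rounds × operations / time, which matches the logged values (for example 1000 × 1600000 / 62.414 ms ≈ 25.635 GMEMOPs/s), and the variable and function names are made up for this example. The 67.957 ms entry is the 67.95699999999999 value from the log, rounded.

```python
from statistics import mean

ROUNDS = 1000
OPS_PER_ROUND = 1_600_000


def gmemops_per_s(time_ms):
    """Throughput in giga memory operations per second (assumed formula)."""
    return ROUNDS * OPS_PER_ROUND / (time_ms / 1000) / 1e9


# Times in ms, copied from the three middle runs logged above.
with_pr = {
    "400x4000 --> 4000x400": [62.414, 61.591, 60.174],
    "4000x400 --> 400x4000": [46.67, 46.202, 46.2],
}
without_pr = {
    "400x4000 --> 4000x400": [67.957, 67.565, 69.173],
    "4000x400 --> 400x4000": [54.512, 54.634, 54.671],
}


def speedups():
    """Ratio of mean time without the PR to mean time with the PR."""
    return {
        shape: mean(without_pr[shape]) / mean(with_pr[shape])
        for shape in with_pr
    }


if __name__ == "__main__":
    for shape, ratio in speedups().items():
        print(f"{shape}: {ratio:.3f}x faster with the PR")
```

On these six runs the mean-time ratio comes out at roughly 1.11x for 400x4000 --> 4000x400 and roughly 1.18x for 4000x400 --> 400x4000. With only three runs per configuration, treat these as indicative rather than precise.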

mratsim force-pushed the state-machine branch from 07c2d5f to bd3ffa9 Compare December 23, 2019 00:55

mratsim added 4 commits December 23, 2019 14:15

Reimplement decline as a finite state automaton

10592c7

Use the new decline FSM

c0fe660

Add synthesis setup to CI

d1d6e9e

only ascertain dropped request on lastreq when it's > 1

239aded

mratsim force-pushed the state-machine branch from bd3ffa9 to 239aded Compare December 23, 2019 13:15

mratsim added 5 commits December 23, 2019 15:51

empty transition on a single line

1864734

rework receiving tasks as a finite state automaton

533bca5

Make apparent that recv(task) also initiates steal requests

e4c6bfc

Make apparent that dispatchTask participates in worker state transitions

daaf7fd

Extract thieves handling FSM from nextTask

a0a6b1e

mratsim added 3 commits December 23, 2019 20:02

work FSM: profile in onEntry/onExit

d13440b

Switch the sync() global barrier to a finite state automaton

1546026

Rewrite await as a finite automaton

c6138fc

mratsim merged commit bd87a4f into master Dec 23, 2019

This was referenced Dec 23, 2019

Rewrite the state transitions as a Finite State Machine #10

Closed

[WIP] Nestable barriers #41

Closed

mratsim deleted the state-machine branch December 27, 2019 21:13
