Convert averager to libp2p backend #323

borzunov · 2021-07-15T22:56:25Z

Current status: Finished.

What I have tested:

Training (1 monitor + 2 trainers) works, peers average successfully.

TODO:

Provide benchmark results (default and for large tensors)
Add experiment_prefix to handler name for averager RPCs to enable using several different averagers simultaneously
~~Follow-up PR: rename PeerID -> Endpoint, use bytes for PeerIDs in protobufs (instead of string)~~ (moved to [REFACTOR] updates to DHT internals #276 )
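To illustrate the second TODO item, here is a minimal sketch of how a prefix could namespace RPC handler names so that several averagers can coexist on one peer. The helper `make_handler_name`, the `::` separator, and the servicer/method names are assumptions made for illustration, not the naming scheme this PR actually uses.

```python
def make_handler_name(servicer_name: str, method_name: str, namespace: str = "") -> str:
    """Build an RPC handler name, optionally scoped by a namespace (e.g. experiment_prefix).

    Hypothetical helper: the real handler naming in the library may differ.
    """
    base = f"{servicer_name}.{method_name}"
    return f"{namespace}::{base}" if namespace else base


# Two averagers with different prefixes register the same RPC without colliding.
handlers = {}
for prefix in ("exp_a", "exp_b"):
    name = make_handler_name("DecentralizedAverager", "rpc_join_group", namespace=prefix)
    handlers[name] = f"handler for {prefix}"

print(sorted(handlers))
```

Without a prefix, both averagers would map to the same handler name and the second registration would overwrite (or conflict with) the first.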

borzunov · 2021-07-23T17:29:30Z

We have found that #317 is the cause of the periodic test freezes in this branch and in master.

3/30 test runs freeze for the current MPFuture implementation: report.

0/30 test runs freeze for the reverted MPFuture implementation (to the version with torch shared memory): report

We are now looking into ways to fix this.

mryab

Thanks for this monumental contribution! Before we merge, though, I'd also like to see two things:

Passing tests (maybe Resolve deadlock in MPFuture #337 and Reduce complexity of several DHT tests #334 will be useful in that regard)
Performance benchmarks comparing this one with the master branch

tests/test_training.py

tests/test_averaging.py

hivemind/dht/__init__.py

mryab · 2021-07-24T10:49:40Z

hivemind/optim/collaborative.py

@@ -354,7 +354,7 @@ def report_training_progress(self):
            with self.lock_local_progress:
                current_time = get_dht_time()
                local_state_info = TrainingState(
-                    endpoint=self.averager.endpoint,
+                    peer_id=self.averager.endpoint.to_base58(),


I'm slightly against casting to base58 in multiple places all over the code; I'd appreciate it if we could come up with a way to reduce this casting :)

Or maybe you can just call __str__ everywhere, since to_base58 is an implementation detail

We can use str(endpoint); however, we would still need to call PeerID.from_base58(value) to deserialize. Therefore, I'd suggest keeping these operations symmetric.

Also, it is actually more natural to use bytes for representing PeerIDs in protobufs (and change endpoint.to_base58()/PeerID.from_base58(value) to endpoint.to_bytes()/PeerID(value)). However, DHT already uses str for PeerIDs in protobufs, so I'd like to make the code consistent in this PR (but I don't mind changing everything to bytes in a separate PR).

Sure, let's change it in a follow-up
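The symmetry argument in this thread can be sketched with a self-contained stand-in. The `PeerID` class below is a hypothetical simplification that only mirrors the method names mentioned above (`to_base58`, `from_base58`, `to_bytes`, and a constructor taking bytes); the Base58 helpers are written by hand for the example and are not the library's implementation.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as Base58 (Bitcoin alphabet)."""
    num = int.from_bytes(data, "big")
    digits = []
    while num > 0:
        num, rem = divmod(num, 58)
        digits.append(ALPHABET[rem])
    # Each leading zero byte is represented by a leading "1".
    n_pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * n_pad + "".join(reversed(digits))


def b58decode(text: str) -> bytes:
    """Decode a Base58 string back to bytes."""
    num = 0
    for char in text:
        num = num * 58 + ALPHABET.index(char)
    n_pad = len(text) - len(text.lstrip("1"))
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    return b"\x00" * n_pad + body


class PeerID:
    """Hypothetical stand-in for the real PeerID class (illustration only)."""

    def __init__(self, peer_id_bytes: bytes):
        self._bytes = peer_id_bytes

    def to_bytes(self) -> bytes:
        return self._bytes

    def to_base58(self) -> str:
        return b58encode(self._bytes)

    @classmethod
    def from_base58(cls, value: str) -> "PeerID":
        return cls(b58decode(value))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, PeerID) and self._bytes == other._bytes

    def __hash__(self) -> int:
        return hash(self._bytes)


peer_id = PeerID(b"\x00\x12example-peer")

# String form (what the DHT protobufs use at this point): to_base58() / from_base58()
assert PeerID.from_base58(peer_id.to_base58()) == peer_id

# Bytes form (proposed for the follow-up): to_bytes() / PeerID(value)
assert PeerID(peer_id.to_bytes()) == peer_id
```

Both pairs round-trip, so either is usable for serialization; the bytes pair simply skips the encoding step.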

hivemind/averaging/allreduce.py

hivemind/p2p/p2p_daemon.py

mryab · 2021-07-24T10:54:37Z

hivemind/p2p/servicer.py

+                if len(spec.args) < 3:
+                    raise ValueError(
+                        f"{method_name} is expected to have at least three positional arguments "
+                        f"(self: TServicer, request: TInputProtobuf, context: hivemind.p2p.P2PContext)"


Thankfully, TServicer and TInputProtobuf are no more :)

I've removed the mention of TServicer from this comment. However, I'd still suggest using the T prefix for TypeVars to distinguish them from the usual types, so I am keeping TInputProtobuf for now :)
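For context, the kind of signature check discussed here can be sketched standalone with `inspect.getfullargspec`. The function `check_rpc_signature` and the class `ExampleServicer` are hypothetical names for this illustration; the actual check lives inside the servicer code shown in the diff.

```python
import inspect


def check_rpc_signature(method_name: str, method) -> None:
    """Raise ValueError unless the method takes at least (self, request, context)."""
    spec = inspect.getfullargspec(method)
    if len(spec.args) < 3:
        raise ValueError(
            f"{method_name} is expected to have at least three positional arguments "
            f"(self, request, context)"
        )


class ExampleServicer:
    async def rpc_ok(self, request, context):
        return request

    async def rpc_bad(self, request):  # missing the context argument
        return request


check_rpc_signature("rpc_ok", ExampleServicer.rpc_ok)  # passes silently

try:
    check_rpc_signature("rpc_bad", ExampleServicer.rpc_bad)
except ValueError as e:
    print(f"rejected: {e}")
```

Checking `len(spec.args)` rather than argument names is what allows arbitrary parameter names for rpc_* methods.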

borzunov · 2021-07-26T19:53:37Z

hivemind/utils/asyncio.py

@@ -59,6 +59,16 @@ async def aenumerate(aiterable: AsyncIterable[T]) -> AsyncIterable[Tuple[int, T]
        index += 1


+async def asingle(aiter: AsyncIterable[T]) -> T:


The name is inspired by Single() from LINQ (.NET functional programming functions).
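Since only the signature is visible in the diff above, here is one plausible body with LINQ-like `Single()` semantics: return the only item, and fail if the iterable is empty or has more than one item. This is a sketch under that assumption, not necessarily the code added in this PR; the `agen` helper is hypothetical and exists only to build test inputs.

```python
import asyncio
from typing import AsyncIterable, AsyncIterator, TypeVar

T = TypeVar("T")


async def asingle(aiter: AsyncIterable[T]) -> T:
    """Return the only item of ``aiter``; raise ValueError if it has zero or several items."""
    count = 0
    result = None
    async for item in aiter:
        count += 1
        if count == 2:
            raise ValueError("asingle() expected exactly one item, but got two or more")
        result = item
    if count == 0:
        raise ValueError("asingle() expected exactly one item, but got an empty iterable")
    return result


async def agen(*items: T) -> AsyncIterator[T]:
    """Hypothetical helper: turn positional arguments into an async iterator."""
    for item in items:
        yield item


if __name__ == "__main__":
    print(asyncio.run(asingle(agen(42))))  # prints 42
```

This is handy for unary RPCs implemented on top of streaming handlers, where exactly one response message is expected.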

borzunov · 2021-07-27T01:42:31Z

Benchmark Results

Setup

num_peers = 16
target_group_size = 16
request_timeout = 1
hid_size = 8192
num_layers = 1
averaging_expiration = 300

Branch `master` (`39afa97`)

Part size: 2 ** 20 bytes
Averaging step time: mean 23.3 sec (std 1.2 sec, based on 3 runs)

Branch `averager-libp2p` (`fc8d296`)

The plot shows the mean averaging time ± std (based on 3 runs).

Part size (optimal): 2 ** 19 bytes
Averaging step time: mean 25.8 sec (std 0.8 sec, based on 3 runs)
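To put the two numbers side by side, the snippet below computes the absolute and relative difference between the reported means. It uses only the figures quoted above; with 3 runs per branch, it is a rough comparison rather than a statistical test.

```python
# Mean averaging step times (seconds) as reported in the benchmark above.
master_mean = 23.3  # branch master, part size 2 ** 20 bytes
libp2p_mean = 25.8  # branch averager-libp2p, part size 2 ** 19 bytes

absolute_diff = libp2p_mean - master_mean
relative_slowdown = absolute_diff / master_mean

print(f"absolute difference: {absolute_diff:.1f} sec")  # 2.5 sec
print(f"relative slowdown: {relative_slowdown:.1%}")    # 10.7%
```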

hivemind/averaging/allreduce.py

hivemind/averaging/averager.py

hivemind/averaging/matchmaking.py

hivemind/p2p/servicer.py

hivemind/utils/asyncio.py

This PR follows #323 and does the remaining mass refactors:

1. Rename `Endpoint` to `PeerID` in averager (+ related variable names)
2. Rename the `P2P.id` field to `P2P.peer_id` (because the local peer ID is stored in the `.peer_id` fields in all other classes)
3. Serialize `PeerID`s as `bytes` instead of Base58 string
4. Remove `JoinRequest.peer_id` and `AveragingData.peer_id` fields (they duplicate `context.remote_id`)
5. Remove the `DecentralizedAveraging` gRPC interface (not used anymore)

borzunov added 5 commits July 15, 2021 18:46

Implement DHT.p2p property

e907eb2

Start converting averager to libp2p backend

ea3b56d

Support inheritance and arbitrary parameter names for rpc_* methods in ServicerBase

2795176

Make test_load_state_from_peers work

85785b9

Support calling Servicer.get_stub without having servicer instances

c89b598

borzunov force-pushed the averager-libp2p branch 2 times, most recently from 8590bd9 to 8109d93 Compare July 15, 2021 22:59

Convert AllReduceRunner, Matchmaking, and GroupKeyManager to libp2p backend

a8fcb0a

borzunov force-pushed the averager-libp2p branch from 8109d93 to a8fcb0a Compare July 15, 2021 23:38

Fix test_allreduce.py

83c5d30

borzunov force-pushed the averager-libp2p branch from 83606c1 to 83c5d30 Compare July 15, 2021 23:56

borzunov added 5 commits July 16, 2021 03:39

Fix test_allreduce_once

0384737

Fix test_averaging.py

4f5acb5

Move launch_dht_instances() to test_utils.py

7eb91a3

Continue fix test_averaging.py

36282f8

Speed up DHT swarm creation

2ae476f

borzunov force-pushed the averager-libp2p branch from 56cb777 to 2ae476f Compare July 16, 2021 02:02

borzunov added 3 commits July 16, 2021 05:07

Remove endpoint parameter of GroupKeyManager

2e51140

Merge remote-tracking branch 'origin/master' into averager-libp2p

12e8039

Fix RPC in ServicerBase derivatives for test_training_averager

20f19b1

borzunov force-pushed the averager-libp2p branch 2 times, most recently from cb09676 to 955c058 Compare July 16, 2021 18:37

Rename _get_stub to _get_peer_stub

f615693

borzunov force-pushed the averager-libp2p branch from 6c41027 to f615693 Compare July 16, 2021 19:30

borzunov added 6 commits July 16, 2021 22:35

Fix benchmark_averaging.py

1256e09

Fix bugs with misusing str and PeerID

02f1d47

Make diff smaller

9557452

Try removing timeout

df26bfc

Remove excess import

b90cef4

Remove listen_on argument

0042076

borzunov added 3 commits July 22, 2021 15:39

Merge remote-tracking branch 'origin/master' into averager-libp2p

0d683fd

Remove excess import

9314701

Unskip test_allreduce_grid()

cb22719

mryab requested changes Jul 24, 2021

borzunov added 7 commits July 26, 2021 19:29

Merge remote-tracking branch 'origin/master' into averager-libp2p

ca8b563

Fix some of @mryab's comments

79bf112

Implement asingle()

ec4cc63

Revert "call_binary_stream_handler: Retry on ControlError"

11cc1f0

This reverts commit b1a43a5.

Make some DHT methods static

2e8a6a3

Fix some of @mryab's comments

b374164

Blackify

c3a1747

borzunov commented Jul 26, 2021

borzunov added 2 commits July 27, 2021 01:40

Remove circular references between AllReduceRunner/Matchmaking and DecentralizedAverager instances

fc8d296

Avoid duplicating defaults

cdd66ad

Set DEFAULT_PART_SIZE_BYTES = 2 ** 19

3f39a62

borzunov requested a review from mryab July 27, 2021 01:44

mryab approved these changes Jul 27, 2021

borzunov and others added 4 commits July 28, 2021 05:58

Fix @mryab's comments

85fa631

Rename unused function/method args to _arg

40d60f0

Merge branch 'master' into averager-libp2p

1b2d13d

Merge branch 'master' into averager-libp2p

b9973fb

justheuristic merged commit 3f691fc into master Jul 28, 2021

justheuristic deleted the averager-libp2p branch July 28, 2021 17:56

borzunov mentioned this pull request Jul 28, 2021

Refactor naming and serialization for PeerIDs #339

Merged

This was referenced Jul 30, 2021

Set default DHT num_workers = 4 #342

Merged

Convert averager rpc_aggregate_part to P2P #188

Closed

borzunov mentioned this pull request May 23, 2022

Convert hivemind.server to libp2p backend #470

Merged

15 tasks
