This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add new hardware and software metrics #11062

Merged: 26 commits into paritytech:master on Apr 11, 2022

Conversation

@koute (Contributor) commented Mar 18, 2022

This PR adds new hardware/software telemetry to Substrate.

The following extra information about the system is gathered (Linux-only):

  • CPU name
  • CPU core count
  • RAM size
  • Linux kernel version
  • Linux distribution
  • Is the node running on a VM?

The following benchmarks are run on startup (all OSes):

  • CPU speed (hashrate of BLAKE2b-256 in MB/s)
  • Memory speed (how many MB/s it can memcpy)
  • Sequential disk write speed
  • Random disk write speed

The benchmarks are run on every startup and in total should take less than ~1s. They are deliberately kept very simple so as not to become a maintenance burden.
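
To give a sense of their shape, here's a minimal sketch of the CPU benchmark (assuming the blake2b_simd crate; the buffer size and duration are illustrative, not the values actually used):

use std::time::{Duration, Instant};

// Sketch: measure BLAKE2b-256 hashrate in MB/s by hashing a fixed buffer
// in a loop for a short, fixed amount of time. Illustrative only.
fn benchmark_cpu_hashrate_mbs() -> f64 {
    const CHUNK_SIZE: usize = 32 * 1024;
    let data = vec![0x66u8; CHUNK_SIZE];
    let run_for = Duration::from_millis(200);
    let start = Instant::now();
    let mut bytes_hashed: u64 = 0;
    while start.elapsed() < run_for {
        // A 32-byte output makes this BLAKE2b-256.
        let _hash = blake2b_simd::Params::new().hash_length(32).hash(&data);
        bytes_hashed += CHUNK_SIZE as u64;
    }
    bytes_hashed as f64 / (1024.0 * 1024.0) / start.elapsed().as_secs_f64()
}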

I've also changed how the node reports its version; previously it appended the current CPU ISA + the OS + the environment to the version and sent everything as one field (e.g. 0.9.17-75dd6c7d0-x86_64-linux-gnu), while now that field contains only the version (e.g. 0.9.17-75dd6c7d0) and the rest of the information is transmitted as separate fields.

Fixes (partially) #8944

Polkadot PR (should be mergeable independently now, so not marking as companion): paritytech/polkadot#5206

Cumulus PR (should be mergeable independently now, so not marking as companion): paritytech/cumulus#1113

substrate-telemetry PR: paritytech/substrate-telemetry#464

cc @emostov @jsdw

Questions you might have

Why is the system information Linux-only?

The majority of nodes are running on Linux, so it makes sense to start there. Besides, gathering this information on Linux is pretty trivial.
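
For illustration, here's a minimal sketch of that kind of probing (plain reads of /proc and /etc/os-release; the file locations are the usual ones, but this is not the actual implementation):

use std::fs;

// Sketch: probe a few of the Linux-only values by reading pseudo-files.
fn cpu_name() -> Option<String> {
    let cpuinfo = fs::read_to_string("/proc/cpuinfo").ok()?;
    // One "model name : ..." line per logical CPU; the first one is enough.
    let line = cpuinfo.lines().find(|line| line.starts_with("model name"))?;
    Some(line.split(':').nth(1)?.trim().to_owned())
}

fn memory_total_kb() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    // Formatted as "MemTotal:       16318004 kB".
    let line = meminfo.lines().find(|line| line.starts_with("MemTotal:"))?;
    line.split_whitespace().nth(1)?.parse().ok()
}

fn linux_distro() -> Option<String> {
    let os_release = fs::read_to_string("/etc/os-release").ok()?;
    let line = os_release.lines().find(|line| line.starts_with("PRETTY_NAME="))?;
    Some(line.trim_start_matches("PRETTY_NAME=").trim_matches('"').to_owned())
}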

Why not use an external crate to get the system information?

Apparently (or so I've heard) we used such crates in the past and had problems with them; since gathering this information is pretty simple anyway, I see no point in adding new, potentially janky dependencies.

Are these benchmarks reliable?

In general, from what I can see, yes; however, the numbers they produce obviously vary from run to run and can change depending on whether something else was also running in the background on that machine. But on average across the whole network this shouldn't matter. We could periodically rerun them in the future, but for now I propose that we just add them as-is and see whether any potential noise will actually be a problem or not.

Is this going to be useful?

I think it will. Although it's hard to tell without actually, you know, adding those metrics in and seeing the results.

How will those be displayed?

These metrics are printed out in the console (screenshot from our benchmarking machine):

[screenshot: console output]

And also in substrate-telemetry (work in progress; incomplete; also, I had only one node connected since I'm only testing it locally, so this is not completely representative of how it will actually look):

[screenshot: telemetry UI]

As you can see there's a new tab/category in the upper right corner; clicking it brings you to this screen, where you'll see a bunch of tables with aggregate statistics for a given chain, showing the most common values for each category. (Somewhat inspired by the Steam Hardware Survey.) This will display all of the information gathered here, along with the benchmark results relative to our benchmarking machine. (So we'll see what fraction of the network is running faster/slower hardware.)

@koute added the labels A0-please_review (needs code review), J0-enhancement (additional feature request), B5-clientnoteworthy, C1-low (low impact on builders), and D3-trivial 🧸 (trivial runtime changes that don't require an audit) on Mar 18, 2022
@koute requested a review from a team March 18, 2022 14:18
@wigy-opensource-developer (Contributor) left a comment


I understand you did not want to depend on an external crate to gather these data. But are you afraid that if you extracted your benchmark code into a separate crate, other people would start asking for new features unrelated to what Substrate needs?

Comment on lines 509 to 513
info!("💻 Operating system: {}", TARGET_OS);
info!("💻 CPU architecture: {}", TARGET_ARCH);
if !TARGET_ENV.is_empty() {
info!("💻 Target environment: {}", TARGET_ENV);
}
Contributor:

Just looking at these log lines, I would get the impression that the properties of the running system are listed, not those of the build target. I know it is really an edge-case, but foreign ELF formats can be loaded and run emulated on a different system. Are we okay with ignoring those fringe usages?

Contributor Author:

Hmm... well, that is a good point; I don't think we have to care about this in general though since those should mostly be really fringe cases, and detecting this will most likely not be easy. (That said, if anyone has any counterpoints here or any good ideas how to handle this in a reasonable way I'm all ears.)

I guess the most likely cases here would be either someone running the Linux binary on a BSD system, or someone running an amd64 binary on an M1 Mac (but we don't provide binaries for macOS, so they'd have to compile it themselves, and if they're compiling it themselves then why not compile a native aarch64 binary in the first place and run that?).

@koute (Contributor Author) commented Mar 18, 2022

I understand you did not want to depend on an external crate to gather these data. But are you afraid that if you extracted your benchmark code into a separate crate, other people would start asking for new features unrelated to what Substrate needs?

Well, I guess we could chuck it into a separate crate, but I'm not entirely convinced that it'd be worth it. And, yes, once you get any actual external users they do tend to start asking for new features. (: A major point of this implementation is that it's small, simple and narrow in scope. There are a gazillion other things a general-purpose sysinfo and/or benchmarking crate would have to support besides what we support here. (Just compare our ~100 lines of code which gather all the Linux sysinfo we need with the sysinfo crate, which is 10k lines of code.)

@davxy (Member) commented Mar 18, 2022

Excluding the nitpick about grouping tests in their own submodule, LGTM. It is a very cool and useful feature for getting some insight into the network's nodes.

client/service/Cargo.toml (outdated review thread, resolved)
@ggwpez (Member) commented Mar 21, 2022

We have a lot of tests that just start a --dev node for testing, also in Polkadot.
See bin/node/cli/test in Substrate. Maybe you can add a suppression flag to them as they would otherwise run that over and over.

@koute (Contributor Author) commented Mar 25, 2022

We have a lot of tests that just start a --dev node for testing, also in Polkadot. See bin/node/cli/test in Substrate. Maybe you can add a suppression flag to them as they would otherwise run that over and over.

Considering those benchmarks take less than 1s it shouldn't be too big of a deal in practice, but good point. I've added an extra flag and suppressed them in those tests. (I'll do Polkadot in a separate PR later since this is not critical, and if I don't have to put up a companion I'd rather not.)

@koute (Contributor Author) commented Mar 25, 2022

Looks like a companion is necessary now anyway since I've added the new CLI argument.

@koute added and removed the A0-please_review (needs code review) label Mar 25, 2022
@davxy self-requested a review March 25, 2022 14:19
client/telemetry/src/lib.rs (outdated review thread, resolved)
client/service/src/builder.rs (outdated review thread, resolved)
@bkchr (Member) left a comment


Mainly some nitpicks, otherwise it looks good. However, I also didn't check every benchmark in detail.

client/sysinfo/src/sysinfo.rs (outdated review thread, resolved)
}
}

positions.shuffle(&mut rng());
Member:

Don't you want to use some fixed seed here to always have this reproducible?

Member:

It probably doesn't make such a big difference.

Contributor Author:

Unless I'm missing something I am using a fixed seed? (:

fn rng() -> rand_pcg::Pcg64 {
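	// Hardcoded seeds: the generated sequence, and thus the shuffle order, is identical on every run.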
	rand_pcg::Pcg64::new(0xcafef00dd15ea5e5, 0xa02bdbf7bb3c0a7ac28fa16a64abf96)
}

Member:

Ohh fuck :D I did not realize that rng was defined locally. I did not say anything :P

client/sysinfo/src/sysinfo.rs (outdated review thread, resolved)
client/sysinfo/src/sysinfo.rs (outdated review thread, resolved)
client/sysinfo/src/lib.rs (outdated review thread, resolved)
client/telemetry/src/lib.rs (outdated review thread, resolved)
@arkpar (Member) commented Apr 5, 2022

Minor nit: for the CLI we follow the GNU convention of using --no-whatever rather than --disable-whatever

@koute (Contributor Author) commented Apr 5, 2022

I've randomly recompiled and retested this code and.... the hwbench telemetry stopped working. The message was not being sent for some reason, even though I swear it used to work.

It took a while to debug, but it turns out the connection notification stream in telemetry was totally broken: it was using send, which is async and returns a future that needs to be awaited, but the code was deliberately ignoring that future with a let _ =. I have no idea why it worked before. Anyway, I've fixed it to use the synchronous try_send instead, and made it so that it won't leak disconnected notifiers while I'm at it. (Alas, retain_mut is still unstable, so it's a little ugly.)
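
Roughly, the pruning part has this shape (hypothetical names and message type; a sketch, not the actual diff):

use futures::channel::mpsc;

// Sketch: notify every registered listener and drop the ones whose
// receiver was dropped, without relying on the unstable Vec::retain_mut.
fn notify_all(senders: &mut Vec<mpsc::Sender<()>>) {
    let mut alive = Vec::with_capacity(senders.len());
    for mut sender in senders.drain(..) {
        match sender.try_send(()) {
            // Receiver is alive and keeping up; keep the sender around.
            Ok(()) => alive.push(sender),
            // Receiver was dropped; forgetting the sender fixes the leak.
            Err(ref error) if error.is_disconnected() => {}
            // Channel is full; the receiver is lagging, so drop just this
            // one notification but keep the sender.
            Err(_) => alive.push(sender),
        }
    }
    *senders = alive;
}

Once retain_mut stabilizes this could collapse into a single retain_mut call.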

This should be good to go, I think; I'll let it marinate until tomorrow and if there are no more comments I'll merge it in.

@koute (Contributor Author) commented Apr 5, 2022

Minor nit: for the CLI we follow the GNU convention of using --no-whatever rather than --disable-whatever

We have the --disable-log-color argument, so not entirely it seems... (:

But I don't really care either way; I can change it.

@bkchr (Member) commented Apr 5, 2022

it was using send, which is async and returns a future that needs to be awaited, but the code was deliberately ignoring that future with a let _ =. I have no idea why it worked before. Anyway, I've fixed it to use the synchronous try_send instead, and made it so that it won't leak disconnected notifiers while I'm at it. (Alas, retain_mut is still unstable, so it's a little ugly.)

Okay, that is really bad, and your solution is also not really great. I can already see people complaining about the warning appearing very often. As to why it worked before: send internally first takes the same path as try_send, and if that fails it returns the future to try again. The correct solution here would be to collect these futures and poll them until they are finished.

@bkchr (Member) commented Apr 5, 2022

(If you convert your warn into a debug you can merge as is and open an issue to implement this properly)

@koute (Contributor Author) commented Apr 6, 2022

As to why it worked before: send internally first takes the same path as try_send, and if that fails it returns the future to try again. The correct solution here would be to collect these futures and poll them until they are finished.

Wait, is it? From what I can see send doesn't actually do anything besides creating a new struct. ([1], [2], [3]) I was convinced it used to work, but this definitely should not have worked. Maybe I'm just going crazy. (Or am I looking at the wrong thing?)

and your solution is also not really great. I can already see people complaining about the warning appearing very often. [..] The correct solution here would be to collect these futures and poll them until they are finished.

Unless I'm missing something, AFAIK it shouldn't appear at all under normal circumstances. The try_send should always succeed as long as there's space in the channel (that is: as long as the receiver is not stuck, if I'm reading this code right), and this is triggered only when we've succeeded in (re)connecting to the telemetry. So for this warning to trigger, all of the following needs to happen:

  1. We get disconnected from the telemetry.
  2. We get reconnected to the telemetry.
  3. The connection notifier receiver must not yet have processed a notification from before (1) happened.

We're probably not going to be disconnected from the telemetry very often (and that prints a warning on its own anyway), and on top of that, for this warning to trigger the receiver either isn't handling the notifications at all, or is handling them really slowly (slower than the time it takes to get disconnected from the telemetry and reconnected again). So I'd argue the warn! here is warranted, since if it's printed out there's probably a bug somewhere in the code (because it hasn't yet processed the notification from before we got disconnected).

Also, here's a quick test program to make sure this works as I think it works:

use futures::StreamExt;

#[tokio::main]
async fn main() {
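    // Note: a capacity-0 channel still reserves one guaranteed slot per
    // sender, so try_send succeeds whenever the receiver has drained the
    // previous message in time.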
    let (mut tx, mut rx) = futures::channel::mpsc::channel(0);
    tokio::task::spawn_blocking(move || {
        loop {
            tx.try_send(()).unwrap();
            std::thread::sleep(std::time::Duration::from_millis(1));
        }
    });

    loop {
        rx.next().await.unwrap();
        println!("Got message");
    }
}

This prints out:

Got message
Got message
Got message
Got message
...

So try_send always succeeds as long as the receiver is processing the messages fast enough, so I think in practice this warning should never actually be triggered. (And if it does I'd like to know, since either something's seriously broken, or this whole thing works entirely different than how I think it works.)

@bkchr (Member) commented Apr 6, 2022

Wait, is it? From what I can see send doesn't actually do anything besides creating a new struct. ([1], [2], [3]) I was convinced it used to work, but this definitely should not have worked. Maybe I'm just going crazy. (Or am I looking at the wrong thing?)

No, you are right. I didn't check the code yesterday and was under the impression that the model in my brain was right. Sorry! Now, after thinking about it again, this doesn't make any sense, because futures need to be polled to do something...

So try_send always succeeds as long as the receiver is processing the messages fast enough, so I think in practice this warning should never actually be triggered. (And if it does I'd like to know, since either something's seriously broken, or this whole thing works entirely different than how I think it works.)

The point here is that you don't know what is on the other side and why it is maybe not processing messages fast enough. I would argue that if you, for example, reconnected very quickly and the other side still hasn't processed the first reconnect, there is no harm in just dropping any further reconnect message, because it will process the waiting message at some point. We also can't consume too much memory, because there is only room for one element in the channel.

TL;DR: Either remove the log completely or turn it into a debug please.

@koute (Contributor Author) commented Apr 6, 2022

Okay, I've changed it to a debug log and also renamed the command-line argument as @arkpar requested. Should be good now; I'll also update the companions.

@koute (Contributor Author) commented Apr 11, 2022

bot merge

@paritytech-processbot

Error: pr-custom-review is not passing for paritytech/polkadot#5206

@koute merged commit 5597a93 into paritytech:master Apr 11, 2022
@koute (Contributor Author) commented Apr 11, 2022

Merged in manually (with the merge button on GH) since the bot was still somehow watching the companion PR (which shouldn't have to be a companion now and should be mergeable independently, so no point in making it harder than it needs to be).

DaviRain-Su pushed a commit to octopus-network/substrate that referenced this pull request Aug 23, 2022
* Add new hardware and software metrics

* Move sysinfo tests into `mod tests`

* Correct a typo in a comment

* Remove unnecessary `nix` dependency

* Fix the version tests

* Add a `--disable-hardware-benchmarks` CLI argument

* Disable hardware benchmarks in the integration tests

* Remove unused import

* Fix benchmarks compilation

* Move code to a new `sc-sysinfo` crate

* Correct `impl_version` comment

* Move `--disable-hardware-benchmarks` to the chain-specific bin crate

* Move printing out of hardware bench results to `sc-sysinfo`

* Move hardware benchmarks to separate messages; trigger them manually

* Rename some of the fields in the `HwBench` struct

* Revert changes to the telemetry crate; manually send hwbench messages

* Move sysinfo logs into the sysinfo crate

* Move the `TARGET_OS_*` constants into the sysinfo crate

* Minor cleanups

* Move the `HwBench` struct to the sysinfo crate

* Derive `Clone` for `HwBench`

* Fix broken telemetry connection notification stream

* Prevent the telemetry connection notifiers from leaking if they're disconnected

* Turn the telemetry notification failure log into a debug log

* Rename `--disable-hardware-benchmarks` to `--no-hardware-benchmarks`
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023