From 193b41dce0df11e00fa23a6f28ff7d478b25a077 Mon Sep 17 00:00:00 2001 From: Kosyrev Serge Date: Tue, 2 Jun 2020 14:16:00 +0300 Subject: [PATCH] CAD-967 bench: further documentation updates --- bench/README.md | 486 +++++++++++++++++++++++++++++++++++------------ bench/README.org | 387 ++++++++++++++++++++++++++++--------- 2 files changed, 665 insertions(+), 208 deletions(-) diff --git a/bench/README.md b/bench/README.md index ce0ec5ba..40cc9d66 100644 --- a/bench/README.md +++ b/bench/README.md @@ -5,7 +5,7 @@ 1. *Bird's eye overview* 2. *Quick action -- running a benchmark without much ado* 3. *Conceptual overview* -4. *Benchmark operation* +4. *Benchmarking environment operation notes* 5. *Parametrisation – source pins and and benchmarking profiles* 6. *Deployment state handling* 7. *The benchmarking process* @@ -17,61 +17,79 @@ The benchmarking infrastructure consists of the following components: - The `cardano-benchmarking` repository, which contains: + - log analysis scripts, in the `scripts/` subdirectory - the `cardano-tx-generator`, that generates Tx load on arbitrary nodes, - a NixOS service definition for the generator, in `nix/nixos/tx-generator-service.nix` + - The `cardano-ops` repository, with: + - the general cluster deployment infrastructure, - a specific, benchmarking-oriented deployment type, defined in `globals-bench-txgen-simple.nix`, - the `bench` script, that automates benchmark runs. - - A number of cluster deployments, on the development deployer, that - have `cardano-ops` checkouts & benchmarking configuration set up in - each of them, in the form of `benchmarking-cluster-params.json` - files. - - -1. 
Clusters + - Resources on the development deployer, available under the `dev` + user: - Benchmarking clusters are available on the development deployer, and - currently include: + - A number of cluster deployments, that have `cardano-ops` + checkouts & benchmarking configuration set up in each of them, + in the form of `benchmarking-cluster-params.json` files. These + benchmark clusters are controlled by the `bench` tool from the + benchmarking screen session. + + The clusters currently include: + + - `bench-dist-0` + - `bench-dist-1` + - `bench-dist-2` + - `bench-dist-3` - - remote-dev-cluster-3 - - remote-dev-cluster-6 - - remote-dev-cluster-9 - - remote-dev-cluster-12 + - Benchmark result storages: + + - `bench-results` + - `bench-results-bad`, for results that failed validation, + - `bench-results-old`, for legacy data. + + -2. The `bench` command +1. The `bench` tool - Benchmarking clusters are controlled using the `bench` command, - which is equivalent to an alias of `./scripts/bench.sh` in the - current development checkout. + Benchmarking clusters are controlled using the `bench` tool, which + is equivalent to running `./bench/bench.sh` in the current + development checkout. - The command automates a significant part of the benchmarking - workflow: + The tool automates a significant part of the benchmarking workflow: - definition of multiple benchmarking profiles - - generating genesis + - genesis handling - deploying the cluster - running the benchmark profiles - - collecting & doing basic analysis on the logs + - collecting & doing analysis (incl. validation) on the logs - packaging the profile run data - The command has a fairly helpful `--help` & `--help-full` flags. + The tool includes fairly helpful `--help` & `--help-full` + facilities. ### Quick action – running a benchmark without much ado 1. 
Access the cluster deployment environment:

-   - log into the development deployer
-   - `cd $CLUSTER`
-   - update the `cardano-ops` checkout to the desired version
-   - enter the `nix-shell`
+   - log into the development deployer under the `dev` user
+     - in case you don't have this set up, add the following to
+       your SSH configuration (`~/.ssh/config`):
+
+         Host staging
+         User staging
+         Hostname 18.196.206.34
+
+   - `screen -x bench`
+   - use `C-b p` / `C-b n` to select the `screen` window that contains
+     the cluster you want to operate on
+   - update the `cardano-ops` checkout to the desired version

-2. Ensure the cluster params file (`benchmarking-cluster-params.json`)
+1. Ensure the cluster params file (`benchmarking-cluster-params.json`)
    is up-to-date with the current `cardano-ops` checkout:

    ``` example
@@ -81,48 +99,58 @@ The benchmarking infrastructure consists of the following components:

    This step is only necessary when `cardano-ops` is updated in a
    major way, which is rarely the case.

-3. The easiest way to run the full suite of benchmark profiles is to
-   issue:
+2. The easiest way to run a basic benchmark for the default profile is
+   to issue:

    ``` example
-   bench all
+   bench profile default
    ```

-   That's it\!
-
-4. Otherwise, if willing to benchmark just a single profile, we might
-   need to redeploy the cluster:
+   That's it\! The default profile
+   (`dist3-50000tx-100b-100tps-1io-2000kb`, provided that the cluster
+   is currently initialised to the size of 3 nodes) runs roughly 15
+   minutes, and will deliver packaged results in `~dev/bench-results`,
+   in a file such as:

    ``` example
-   bench deploy
+   2020-05-27-12.02.dist3-50000tx-100b-100tps-1io-128kb.tar.xz
    ```

-   This is, generally speaking, only necessary in the following
-   situations:
-
-   - after genesis update (which is performed by `bench genesis`),
-     and
-   - after `cardano-node` pin changes.
+   The timestamp in the package name corresponds to the initiation of
+   the test run.

-5. 
At this point you are prepared to run a benchmark – either for a
-   single profile:
+3. Otherwise, if we would like to make a specific choice on the
+   profiles to run, we'll have to explore the available profiles and
+   express that choice.
+
+   The first exploration option lists all available profiles:

    ``` example
-   bench bench-profile [PROFILE]
+   bench list-profiles
    ```

-   The profile name is optional, and defaults to `default`, which
-   refers to a profile specified in the cluster params file (see the
-   `.meta.default_profile` JSON key).
+   Second, we can use a `jq` query to subset the profiles, based on
+   their properties. Properties possessed by profiles can be
+   observed via the `show` command:

-   Normally, the default profile is the first one listed.
-
-6. Every benchmark run provides an archive at the root of the cluster's
-   deployment checkout, f.e.:
+   ``` example
+   bench show-profile default
+   ```
+
+   These properties can then be used for subsetting, as follows:
+
+   ``` example
+   bench query-profiles '.generator.inputs_per_tx | among([1, 2])'
+   ```
+
+   Once we're satisfied with the query, we can use it as follows:

    ``` example
-   2020-05-19-22.21.dist6-50000tx-100b-1i-1o-100tps.tar.xz
+   bench profiles-jq '.generator.inputs_per_tx | among([1, 2])'
    ```
+
+   Please refer to https://stedolan.github.io/jq/manual/ for further
+   information on the `jq` query language.

### Conceptual overview

@@ -163,11 +191,12 @@ we enumerate here shortly, but will also revisit in depth later:
    and updates its summaries in the *deployment state files*.

9. The **benchmarking process**, which is defined by the *deployment
-   state*, and so, indirectly, by the *source pins* and the chosen
-   *benchmarking profile*.
+   state*, which is in turn affected by the *source pins*, *topology*
+   and the chosen *benchmarking profile*.

-   It consists of several phases: **cleanup**, **initialisation**,
-   **registration**, **generation** and **termination**. 
+   It consists of several phases: **profile activation**, **cleanup**,
+   **initialisation**, **registration**, **generation** and
+   **termination**.

10. **Benchmarking run** is a closely related concept that denotes a
    particular, parametrised instance of the *benchmarking process*,
@@ -175,7 +204,10 @@ we enumerate here shortly, but will also revisit in depth later:

    Each *benchmarking run* is assigned a unique **tag**, that coincides
    with the name of a subdirectory under `./runs` in the deployment
-   checkout.
+   checkout, and also defines the name of the run report package.
+
+   The *tag* is formed as a concatenation of the run's timestamp and
+   profile name.

11. The **benchmarking batch** is a set of **benchmarking runs** for all
    *benchmarking profiles* defined by the *benchmarking parameters* of
@@ -185,21 +217,22 @@ we enumerate here shortly, but will also revisit in depth later:
    collects and processes the post-benchmarking cluster state, and
    ultimately provides the **benchmark results**.

-   It consists of: **log fetching**, **analysis** and **packaging**.
+   It consists of: **log fetching**, **analysis**, **validation** and
+   **packaging**.

-13. **Benchmark results**, consist of the *logs* and results of their
+13. **Benchmark results** consist of run *logs* and results of their
    *analysis*.

-### Benchmark operation
+### Benchmarking environment operation notes

**WARNING 1**: it is strongly discouraged to edit the `cardano-ops`
-deployment checkout, as this severely impedes collaboration.
+deployment checkouts directly, as this severely impedes collaboration.

-It is, instead, advised to add on the developer's machine, a remote for
-the `cardano-ops` deployment checkout, and push to that. Note, that even
-the branch currently checked out on the deployer can be pushed to – the
-checkout will be magically updated, provided there were no local
-changes. 
+It is, instead, advised to add, on the developer's machine, a `git`
+remote for the `cardano-ops` deployment checkout, and use `git` to push
+to that. Note that even the branch currently checked out on the
+deployer can be pushed to – the checkout will be magically updated,
+provided there were no local changes.

**WARNING 2**: it is strongly discouraged to operate the cluster outside
of the permanent screen session on the `dev` deployer – this raises the
@@ -211,14 +244,31 @@
It's easy to join the screen session:

``` example
screen -x bench
```

-### Parametrisation – source pins and and benchmarking profiles
+In any case, please be mindful of the potential disruption to the
+ongoing benchmarks.
+
+### Parametrisation – source pins, topologies and benchmarking profiles

-TODO
+Benchmark runs are, ultimately, parametrised by a combination of
+explicit variables (we're omitting such implicit factors as network
+congestion and cloud service load), that are captured and exposed by the
+benchmarking infrastructure via concordant control mechanisms:
+
+1. versions of deployed software
+   - exposed via source pins
+2. cluster topology (incl. size)
+   - currently not parametrisable in a satisfactorily flexible way,
+     so only cluster size can be picked easily
+3. blockchain and transaction generation parameters
+   - exposed via benchmarking profiles
+
+

1. Source pins

   **Source pins** specify versions of software components deployed on
-  the benchmarking cluster.
+  the benchmarking cluster. These pins are honored by all profile
+  runs, and their values are captured in the profile run metadata.

   Following pins are relevant in the benchmarking context:

@@ -235,7 +285,49 @@

   niv -s SOURCES-JSON-FILE update REPO-NAME --branch BRANCH-OR-TAG
   ```

-2. Profiles and the benchmarking cluster parameters file
+2. Topology and size
+
+   As mentioned previously, only cluster size can be changed
+   conveniently. 
+
+   There are four pre-defined topologies, each associated with a
+   particular cluster size: `3`, `6`, `9` or `12` nodes, and that's what
+   forms the basis for the parametrisation.
+
+   These topology files reside in the `topologies` subdirectory of
+   `cardano-ops`, and are called `bench-txgen-simple-N.nix`, where N is
+   the intended cluster size.
+
+   Changes beyond mere size require direct, manual intervention into
+   one of those topology files.
+
+   Once the desired topology is prepared, switching the cluster to that
+   topology takes two steps:
+
+   1. ``` example
+      bench recreate-cluster N
+      ```
+
+      ..where `N` is the new cluster size.
+
+      This step will fail at the very end, due to a known
+      `cardano-db-sync` service definition issue, and then:
+
+   2. ``` example
+      bench deploy
+      ```
+
+      ..which ought to succeed, with the following final messages:
+
+      ``` example
+      explorer................................> activation finished successfully
+      bench-dist-0> deployment finished successfully
+      ```
+
+   This completes preparations for running benchmark profiles of the
+   new size.
+
+3. Profiles and the benchmarking cluster parameters file

   Each benchmarking cluster obtains its profile definitions and other
   metadata from a local file called
@@ -252,82 +344,235 @@ TODO
   (except for its `meta` component, which carries things like node
   names and genesis configuration). 
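+   To make the shape of this file concrete, here is a purely
+   illustrative sketch of a `benchmarking-cluster-params.json` – only
+   the `meta` component, the `.meta.default_profile` key and the
+   `.generator.inputs_per_tx` path are mentioned elsewhere in this
+   document; every other field name below is a hypothetical
+   placeholder, not the actual schema (which is owned by the `bench`
+   tool and changes over time):
+
+   ``` example
+   { "meta":
+       { "default_profile": "dist3-50000tx-100b-100tps-1io-2000kb",
+         "node_names": [ "node-0", "node-1", "node-2", "explorer" ]
+       },
+     "dist3-50000tx-100b-100tps-1io-2000kb":
+       { "generator":
+           { "tx_count": 50000, "add_tx_size": 100,
+             "tps": 100, "inputs_per_tx": 1 }
+       }
+   }
+   ```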
+   NOTE: From time to time, the JSON schema expected by the `bench`
+   tool for this file changes, and the file then has to be reinitialised
+   to restore correspondence:
+
+   ``` example
+   bench reinit-params
+   ```
+
   Benchmarking profiles serve as named sets of parameters for
   benchmarking runs, and can be listed with:

   ``` example
-  bench list-profiles # ..or just 'bench ps'
+  bench list-profiles
   ```

-  As mentioned in the *Quick action* section, we can run benchmarks
-  per-profile:
+  Note that, besides the main benchmarking profiles, this also lists a
+  number of quicker-running auxiliary profiles, such as `short`,
+  `small` and `smoke`.
+
+  The content of any particular profile can be inspected as follows:

   ``` example
-  bench bench-profile dist3-50000tx-100b-1i-1o-100tps # defaults to 'default'
+  bench show-profile dist3-50000tx-100b-100tps-16io-2000kb
   ```

-  ..or for all defined profiles:
+  This structure can be used as a basis for selecting profiles, as
+  follows:

   ``` example
-  bench bench-all
+  bench query-profiles '.generator.inputs_per_tx | among([1, 2])'
   ```

-  1. Changing the set of available profiles
-
-     It's not advised to edit the cluster parameters file directly –
-     because doing so would force us to update this file manually,
-     whenever the `bench` script changes – we should, instead, change
-     the definition of its generator.
-
-     Note that this is still currently a bit ad-hoc, but will
-     improve, once the declarative definition for the profile specs
-     is implemented.
-
-### Deployment state
+  ..or even:
+
+  ``` example
+  bench query-profiles 'matrix_blks_by_ios([32000, 64000], [1, 4, 16])'
+  ```
+
+  Once we have a satisfactory query, we can run the profiles it
+  selects:
+
+  ``` example
+  bench profiles-jq 'matrix_blks_by_ios([32000, 64000], [1, 4, 16])'
+  ```
+
+  For details on the `jq` query language, please see
+  https://stedolan.github.io/jq/manual/

-1. State handling
+4. 
Note on critical blockchain parameters

-   There is an ongoing effort to handle deployment state transparently,
-   on a minimal, as-needed basis – as implied by the *benchmarking
-   process*.
+   Some of the benchmarked protocols critically tie their genesis
+   parameters to the cluster size.

-   We'll only cover this shortly, therefore:
+   In case of PBFT, the PBFT signature threshold must not be
+   less than `1 / N`, where `N` is the producer node count.
+
+5. Changing the set of available profiles

-   1. genesis can be generated for a particular profile by:
-
-      ``` example
-      bench genesis [PROFILE=default]
-      ```
+   It's not advised to edit the cluster parameters file directly –
+   because doing so would force us to update this file manually,
+   whenever the `bench` script changes – we should, instead, change the
+   definition of its generator.

-   2. deployment can be initiated by:
-
-      ``` example
-      bench deploy [PROFILE=default]
-      ```
+   Note that this is still currently a bit ad-hoc, but will improve,
+   once the declarative definition for the profile specs is implemented
+   (it's well underway).
+
+### Deployment state
+
+The cluster deployment state is handled more-or-less transparently,
+with, for example, genesis being regenerated and deployed across the
+cluster on a minimal, as-needed basis.
+
+Whenever a need arises, deployment can be done as easily as:
+
+``` example
+bench deploy [PROFILE=default]
+```
+
+..which prepares the cluster for the execution of a particular benchmark
+profile.

### The benchmarking process

-TODO
+The following phases constitute a benchmark run:

-1. Cleanup
+1. Profile activation:
+
+   - genesis age check and potential regeneration
+   - deployment, either just on the explorer node, or across the
+     cluster, depending on circumstances

-2. Initialisation
+2. 
Cleanup of the cluster state:
+
+   - service shutdown across the cluster, including journald
+   - purging of service logs, including all journald logs
+   - purging of node databases
+   - cleanup of the `cardano-db-sync` database
+   - restart of all services, incl. nodes and the `db-sync`

-3. Registration
+3. Initialisation
+
+   - an additional genesis check, based on its actually deployed
+     value
+   - initial delay, to allow nodes to connect to each other, and to
+     generally settle into normal operation

-4. Generation
+4. Registration
+
+   This is when the benchmark run gets assigned a unique run id, or
+   "tag", and an output folder in a '`runs`' subdirectory named after
+   the run id.
+
+   The run id (or tag) consists of the human-readable timestamp, in
+   the form of `YYYY-MM-DD-HH.MM.PROFILENAME`.

-5. Termination
+5. Generation
+
+   - The `cardano-tx-generator` service is started,
+
+   - A non-empty block is then expected to appear within the
+     following `200` seconds – and if it doesn't, the benchmark run
+     is aborted and its results are marked and further processed as
+     broken. See the **"Broken run processing"** section for the
+     details.
+
+   - After a non-empty block appears, the condition changes – the
+     benchmark run is considered in progress, until either:
+
+     - no blocks arrive within reasonable time, in which case,
+       again, the benchmark run is aborted and its results are
+       marked and further processed as broken. Again, please see
+       the **"Broken run processing"** section for the details.
+     - a sequence of empty blocks of a profile-defined length
+       arrives. This is considered a success, and leads to the
+       following phase.
+
+6. Benchmark termination simply stops all key log-producing services.

### The capture process

-TODO
+The capture process deals with artifact collection and analysis
+(including validation).

1. 
Log fetching
+
+   Logs are collected from all services/nodes material to benchmarking:
+
+   - producer and observer nodes
+   - transaction generator
+   - `cardano-db-sync`
+
+   SQL extraction queries are also run against the `cardano-db-sync`
+   database. (For the queries, please look for the
+   `scripts/NN-*.sql` files in the `cardano-benchmarking` repository).
+
+   All that is collected over SSH and stored in the current benchmark
+   run directory on the deployer.
+
+   The fetch phase is separately available via the `bench fetch`
+   subcommand, and by default fetches logs from the last run.

2. Analysis
-
-3. Packaging
+
+   A number of log analyses are performed on the collected logs, some
+   of them coming from the `cardano-benchmarking` repository (in the
+   `analyses` and `scripts` directories), and some defined locally, in
+   the `bench` tool (in `bench/lib-analyses.sh`).
+
+   The analyses include, but are not limited to:
+
+   - log file inventory, incl. time spans between first/last
+     messages,
+   - per-transaction issuance/block inclusion time
+   - transaction submission statistics:
+     - announcement, sending, acknowledgement, loss and
+       unavailability
+   - submission thread trace extraction
+   - per-message analyses, such as `MsgBlock`, `AddedBlockToQueue`,
+     `AddedToCurrentChain`, `TraceForgeInvalidBlock` and
+     `TraceMempoolRejectedTx`
+   - a message type summary, which lists all encountered message types,
+     along with occurrence counts
+   - analyses derived from the above, such as:
+     - Tx rejection/invalid UTxO/missing input counters
+
+   The analysis phase is separately available via the `bench analyse`
+   subcommand, and by default analyses logs from the last run.
+
+3. Validation
+
+   Validation depends on the previous analyses to detect anomalies and
+   provides a form of automated pass/fail classification of
+   benchmarking runs based on the sanity checks defined in
+   `bench/lib-sanity.sh`, such as:
+
+   - log file start/stop spread being within a specified threshold,
+     incl. 
from benchmark run start time
+   - blocks being made at all
+   - trailing block-free gap until cluster termination being within a
+     specified threshold
+   - having any transactions in blocks at all
+   - having the submission process announce and send exactly the
+     number of transactions requested by the benchmark profile
+   - having the count of transactions seen in blocks be within a
+     specified threshold of the number of transactions sent
+   - having the chain density within the specified threshold
+
+   The sanity check phase is separately available via the `bench
+   sanity-check` subcommand, and by default checks sanity of logs from
+   the last run.
+
+4. Packaging
+
+   All the logs and the analysis results are packaged in either the
+   "good" or the "bad" run result directory, in a file whose name is
+   formed by concatenating the profile run start timestamp and the
+   profile name, optionally suffixed with a `.broken` marker in case of
+   broken runs.
+
+   The result directories are situated on the `dev` deployer:
+
+   - `bench-results`
+   - `bench-results-bad`
+
+   The package phase is separately available via the `bench package`
+   subcommand, and by default packages the run directory from the last
+   run.

### Benchmark results

Each successful benchmark run produces the following results:

1. A run output directory, such as:

   ``` example
-  ./runs/1589819135.27a0a9dc.refinery-manager.pristine.node-66f0e6d4.tx50000.l100.i1.o1.tps100
+  ./runs/2020-05-27-12.02.dist3-50000tx-100b-100tps-1io-128kblk
   ```

-  This directory (also called "tag", internally), contains:
+  This benchmark run directory contains:

   1. `meta.json` – the run's metadata, a key piece in its processing,

      generator. This also includes an extraction from the
      `cardano-db-sync` database.

-  6. `analysis/*` – some light extraction based on the available
-     logs.
+  6. 
`analysis.json` – a collection of small-output-volume analyses. + + 7. `analysis/*` – data extraction based on the available logs. - 7. `tools/*` – the tools used to perform the above extraction, + 8. `tools/*` – the tools used to perform some of the analyses, fetched from the `cardano-benchmarking` repo. 2. An archive in the deployment checkout, that contains the exact @@ -368,5 +614,5 @@ Each successful benchmark run produces the following results: user-friendly name: ``` example - ./YYYY-MM-DD.$PROFILE_NAME.tar.xz + ./YYYY-MM-DD-HH.MM.$PROFILE_NAME.tar.xz ``` diff --git a/bench/README.org b/bench/README.org index 7259dc92..7d942079 100644 --- a/bench/README.org +++ b/bench/README.org @@ -4,7 +4,7 @@ 1. [[*Bird's eye overview][Bird's eye overview]] 2. [[Quick action -- running a benchmark without much ado]] 3. [[*Conceptual overview][Conceptual overview]] - 4. [[Benchmark operation]] + 4. [[Benchmarking environment operation notes][Benchmarking environment operation notes]] 5. [[*Parametrisation -- source pins and and benchmarking profiles][Parametrisation -- source pins and and benchmarking profiles]] 6. [[Deployment state handling]] 7. [[The benchmarking process]] @@ -26,43 +26,53 @@ =globals-bench-txgen-simple.nix=, - the =bench= script, that automates benchmark runs. - - A number of cluster deployments, on the development deployer, - that have =cardano-ops= checkouts & benchmarking configuration set up - in each of them, in the form of =benchmarking-cluster-params.json= files. + - Resources on the development deployer, available under the =dev= user: -***** Clusters + - A number of cluster deployments, that have =cardano-ops= checkouts & + benchmarking configuration set up in each of them, in the form of + =benchmarking-cluster-params.json= files. These benchmark clusters + are controlled by the =bench= tool from the benchmarking screen session. 
-  Benchmarking clusters are available on the development deployer, and currently
-  include:
+    The clusters currently include:
+
+    - =bench-dist-0=
+    - =bench-dist-1=
+    - =bench-dist-2=
+    - =bench-dist-3=

-  - remote-dev-cluster-3
-  - remote-dev-cluster-6
-  - remote-dev-cluster-9
-  - remote-dev-cluster-12
+  - Benchmark result storages:
+    - =bench-results=
+    - =bench-results-bad=, for results that failed validation,
+    - =bench-results-old=, for legacy data.

-***** The =bench= command
+***** The =bench= tool

-  Benchmarking clusters are controlled using the =bench= command, which is
-  equivalent to an alias of =./scripts/bench.sh= in the current development checkout.
+  Benchmarking clusters are controlled using the =bench= tool, which is
+  equivalent to running =./bench/bench.sh= in the current development
+  checkout.

-  The command automates a significant part of the benchmarking workflow:
+  The tool automates a significant part of the benchmarking workflow:

   - definition of multiple benchmarking profiles
-  - generating genesis
+  - genesis handling
   - deploying the cluster
   - running the benchmark profiles
-  - collecting & doing basic analysis on the logs
+  - collecting & doing analysis (incl. validation) on the logs
   - packaging the profile run data

-  The command has a fairly helpful =--help= & =--help-full= flags.
+  The tool includes fairly helpful =--help= & =--help-full= facilities.

*** Quick action -- running a benchmark without much ado

  1. Access the cluster deployment environment:

-     - log into the development deployer
-     - =cd $CLUSTER=
+     - log into the development deployer under the =dev= user
+       - in case you don't have this set up, add the following to your
+         SSH configuration (=~/.ssh/config=):
+
+Host staging
+User staging
+Hostname 18.196.206.34
+
+     - =screen -x bench=
+     - use =C-b p= / =C-b n= to select the =screen= window that contains the
+       cluster you want to operate on
      - update the =cardano-ops= checkout to the desired version
-     - enter the =nix-shell=

  2. 
Ensure the cluster params file (=benchmarking-cluster-params.json=)
     is up-to-date with the current =cardano-ops= checkout:

@@ -72,37 +82,44 @@

     This step is only necessary when =cardano-ops= is updated in a major way,
     which is rarely the case.

-  3. The easiest way to run the full suite of benchmark profiles is to issue:
+  3. The easiest way to run a basic benchmark for the default profile is
+     to issue:

-     : bench all
+     : bench profile default

-     That's it!
+     That's it! The default profile (=dist3-50000tx-100b-100tps-1io-2000kb=,
+     provided that the cluster is currently initialised to the size of 3 nodes)
+     runs roughly 15 minutes, and will deliver packaged results in
+     =~dev/bench-results=, in a file such as:

-  4. Otherwise, if willing to benchmark just a single profile, we might
-     need to redeploy the cluster:
+     : 2020-05-27-12.02.dist3-50000tx-100b-100tps-1io-128kb.tar.xz

-     : bench deploy
+     The timestamp in the package name corresponds to the initiation of the test
+     run.

-     This is, generally speaking, only necessary in the following situations:
+  4. Otherwise, if we would like to make a specific choice on the profiles to
+     run, we'll have to explore the available profiles and express that choice.

-     - after genesis update (which is performed by =bench genesis=), and
-     - after =cardano-node= pin changes.
+     The first exploration option lists all available profiles:

-  5. At this point you are prepared to run a benchmark -- either for a single
-     profile:
+     : bench list-profiles

-     : bench bench-profile [PROFILE]
+     Second, we can use a =jq= query to subset the profiles, based on their
+     properties. Properties possessed by profiles can be observed
+     via the =show= command:

-     The profile name is optional, and defaults to =default=, which
-     refers to a profile specified in the cluster params file
-     (see the =.meta.default_profile= JSON key).
+     : bench show-profile default

-     Normally, the default profile is the first one listed. 
+ These properties can, then be used for subsetting, as follows: - 6. Every benchmark run provides an archive at the root of the - cluster's deployment checkout, f.e.: + : bench query-profiles '.generator.inputs_per_tx | among([1, 2])' - : 2020-05-19-22.21.dist6-50000tx-100b-1i-1o-100tps.tar.xz + Once we're satisfied with the query, we can use it as follows: + + : bench profiles-jq '.generator.inputs_per_tx | among([1, 2])' + + Please, refer to https://stedolan.github.io/jq/manual/ for further + information on the =jq= query language. *** Conceptual overview @@ -140,17 +157,21 @@ updates its summaries in the /deployment state files/. 9. The *benchmarking process*, which is defined by the /deployment state/, - and so, indirectly, by the /source pins/ and the chosen /benchmarking profile/. + which is in turn affected by the /source pins/, /topology/ and the chosen + /benchmarking profile/. - It consists of several phases: *cleanup*, *initialisation*, - *registration*, *generation* and *termination*. + It consists of several phases: *profile activation*, *cleanup*, + *initialisation*, *registration*, *generation* and *termination*. 10. *Benchmarking run* is a closely related concept that denotes a particular, parametrised instance of the /benchmarking process/, that was executed at a certain time. Each /benchmarking run/ is assigned a unique *tag*, that coincides - with the name of a subdirectory under =./runs= in the deployment checkout. + with the name of a subdirectory under =./runs= in the deployment checkout, + and also defines the name of the run report package. + + The /tag/ is formed as concatenation of the run's timestamp and profile name. 11. The *benchmarking batch* is a set of *benchmarking runs* for all /benchmarking profiles/ defined by the /benchmarking parameters/ of the @@ -160,20 +181,20 @@ and processes the post-benchmarking cluster state, and ultimately provides the *benchmark results*. - It consists of: *log fetching*, *analysis* and *packaging*. 
+      It consists of: *log fetching*, *analysis*, *validation* and *packaging*.

-  13. *Benchmark results*, consist of the /logs/ and results of their
+  13. *Benchmark results* consist of run /logs/ and results of their
      /analysis/.

-*** Benchmark operation
+*** Benchmarking environment operation notes

  *WARNING 1*: it is strongly discouraged to edit the =cardano-ops= deployment
-  checkout, as this severely impedes collaboration.
+  checkouts directly, as this severely impedes collaboration.

-  It is, instead, advised to add on the developer's machine, a remote for the
-  =cardano-ops= deployment checkout, and push to that. Note, that even the
-  branch currently checked out on the deployer can be pushed to -- the checkout
-  will be magically updated, provided there were no local changes.
+  It is, instead, advised to add, on the developer's machine, a =git= remote for
+  the =cardano-ops= deployment checkout, and use =git= to push to that. Note
+  that even the branch currently checked out on the deployer can be pushed to --
+  the checkout will be magically updated, provided there were no local changes.

  *WARNING 2*: it is strongly discouraged to operate the cluster outside of the
  permanent screen session on the =dev= deployer -- this raises the possibilty
@@ -183,14 +204,31 @@

  : screen -x bench

-*** Parametrisation -- source pins and and benchmarking profiles
+  In any case, please be mindful of the potential disruption to the ongoing
+  benchmarks.
+
+*** Parametrisation -- source pins, topologies and benchmarking profiles
+
+  Benchmark runs are, ultimately, parametrised by a combination of explicit
+  variables (we're omitting such implicit factors as network congestion and
+  cloud service load), that are captured and exposed by the benchmarking
+  infrastructure via concordant control mechanisms:
+
+  1. versions of deployed software
+     - exposed via source pins

-  TODO
+  2. cluster topology (incl. 
size)
+     - currently not parametrisable in a satisfactorily flexible way, so only
+       cluster size can be picked easily
+
+  3. blockchain and transaction generation parameters
+     - exposed via benchmarking profiles

***** Source pins

  *Source pins* specify versions of software components deployed on the
-  benchmarking cluster.
+  benchmarking cluster. These pins are honored by all profile runs,
+  and their values are captured in the profile run metadata.

  Following pins are relevant in the benchmarking context:

@@ -203,6 +241,44 @@

  : niv -s SOURCES-JSON-FILE update REPO-NAME --branch BRANCH-OR-TAG

+***** Topology and size
+
+  As mentioned previously, only cluster size can be changed conveniently.
+
+  There are four pre-defined topologies, each associated with a particular
+  cluster size: =3=, =6=, =9= or =12= nodes, and that's what forms the basis
+  for the parametrisation.
+
+  These topology files reside in the =topologies= subdirectory of
+  =cardano-ops=, and are called =bench-txgen-simple-N.nix=, where N is the
+  intended cluster size.
+
+  Changes beyond mere size require direct, manual intervention into one of
+  those topology files.
+
+  Once the desired topology is prepared, switching the cluster to that
+  topology takes two steps:
+
+  1.
+  : bench recreate-cluster N
+
+  ..where =N= is the new cluster size.
+
+  This step will fail at the very end, due to a known =cardano-db-sync=
+  service definition issue, and then:
+
+  2.
+
+  : bench deploy
+
+  ..which ought to succeed, with the following final messages:
+
+  : explorer................................> activation finished successfully
+  : bench-dist-0> deployment finished successfully
+
+  This completes preparations for running benchmark profiles of the new
+  size.
+
***** Profiles and the benchmarking cluster parameters file

  Each benchmarking cluster obtains its profile definitions and other metadata
  from a local file called
@@ -217,80 +293,213 @@

  its =meta= component, which carries things like node names and genesis
  configuration). 
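+  The profile queries shown earlier (e.g.
+  =.generator.inputs_per_tx | among([1, 2])=) rely on helper predicates
+  that ship with the =bench= tool. As an illustrative sketch only -- this
+  is a guess at the behaviour of =among=, not its actual definition --
+  such a predicate can be written in plain =jq= as:
+
+  : # hypothetical reimplementation of among/1: does the input occur in xs?
+  : def among(xs): . as $v | xs | any(. == $v);
+  :
+  : # e.g.: 1 | among([1, 2])  yields  true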
+  NOTE: From time to time, the JSON schema expected by the =bench= tool for this
+  file changes, and so the file has to be reinitialised to restore correspondence:
+
+  : bench reinit-params
+
  Benchmarking profiles serve as named sets of parameters for benchmarking
  runs, and can be listed with:

-  : bench list-profiles              # ..or just 'bench ps'
+  : bench list-profiles

-  As mentioned in the /Quick action/ section, we can run benchmarks per-profile:
+  Note, that besides the main benchmarking profiles, this also lists a number
+  of quicker-running auxiliary profiles, such as =short=, =small= and
+  =smoke=.

-  : bench bench-profile dist3-50000tx-100b-1i-1o-100tps   # defaults to 'default'
+  The content of any particular profile can be inspected as follows:

-  ..or for all defined profiles:
+  : bench show-profile dist3-50000tx-100b-100tps-16io-2000kb

-  : bench bench-all
+  This structure can be used as a basis for selecting profiles, as follows:

-******* Changing the set of available profiles
+  : bench query-profiles '.generator.inputs_per_tx | among([1, 2])'

-  It's not advised to edit the cluster parameters file directly --
-  because doing so would force us to update this file manually, whenever
-  the =bench= script changes -- we should, instead, change the
-  definition of its generator.
+  ..or even:

-  Note that this is still currently a bit ad-hoc, but will improve,
-  once the declarative definition for the profile specs is implemented.
+  : bench query-profiles 'matrix_blks_by_ios([32000, 64000], [1, 4, 16])'

-*** Deployment state
+  Once we have a satisfactory query, we can run the profiles it selects:
+
+  : bench profiles-jq 'matrix_blks_by_ios([32000, 64000], [1, 4, 16])'

-***** State handling
+  For details on the =jq= query language, please see https://stedolan.github.io/jq/manual/

-  There is an ongoing effort to handle deployment state transparently,
-  on a minimal, as-needed basis -- as implied by the /benchmarking process/. 
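To give a concrete feel for such selectors, here is a plain-=jq= analogue of the =inputs_per_tx= query above, run against a made-up two-profile map (=among= is assumed to be a helper defined by the =bench= tool itself; standard =jq='s =IN= plays a similar membership-test role here, and the profile names and values below are invented for illustration):

```shell
# Made-up profiles map; the real profiles live in benchmarking-cluster-params.json.
echo '{ "p-1io": { "generator": { "inputs_per_tx": 1 } },
        "p-8io": { "generator": { "inputs_per_tx": 8 } } }' |
  jq -r 'to_entries[]
         | select(.value.generator.inputs_per_tx | IN(1, 2))
         | .key'
# prints: p-1io
```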
+***** Note on critical blockchain parameters

-  We'll only cover this shortly, therefore:
+  Some of the benchmarked protocols critically tie their genesis parameters
+  to the cluster size.

-  1. genesis can be generated for a particular profile by:
+  In the case of PBFT, the PBFT signature threshold must not be less
+  than =1 / N=, where =N= is the producer node count.

-  : bench genesis [PROFILE=default]

+***** Changing the set of available profiles

-  2. deployment can be initiated by:
+  It's not advised to edit the cluster parameters file directly --
+  because doing so would force us to update this file manually, whenever
+  the =bench= script changes -- we should, instead, change the
+  definition of its generator.

-  : bench deploy [PROFILE=default]
+  Note that this is still currently a bit ad-hoc, but will improve,
+  once the declarative definition for the profile specs is implemented
+  (it's well underway).
+
+*** Deployment state
+
+  The cluster deployment state is handled more-or-less transparently, with, for
+  example, genesis being regenerated and deployed across the cluster on a
+  minimal, as-needed basis.
+
+  Whenever a need arises, deployment can be done as easily as:
+
+  : bench deploy [PROFILE=default]
+
+  ..which prepares the cluster for execution of a particular benchmark profile.

*** The benchmarking process

-  TODO
+  The following phases constitute a benchmark run:
+
+  1. Profile activation:
+     - genesis age check and potential regeneration
+     - deployment, either just on the explorer node, or across the cluster,
+       depending on circumstances
+
+  2. Cleanup of the cluster state:
+     - service shutdown across the cluster, including journald
+     - purging of service logs, including all journald logs
+     - purging of node databases
+     - cleanup of the =cardano-db-sync= database
+     - restart of all services, incl. nodes and the =db-sync=
+
+  3. 
Initialisation
+     - an additional genesis check, based on its actually deployed value
+     - initial delay, to allow nodes to connect to each other, and to generally
+       settle into operation
+
+  4. Registration
+
+     This is when the benchmark run gets assigned a unique run id, or "tag",
+     and an output folder in the =runs/= subdirectory, named after the run id.
+
+     The run id (or tag) consists of the human-readable timestamp, in the form
+     of =YYYY-MM-DD-HH.MM.PROFILENAME=.

-***** Cleanup
+  5. Generation

-***** Initialisation
+     - The =cardano-tx-generator= service is started,

-***** Registration
+     - A non-empty block is then expected to appear within the following =200=
+       seconds -- and if it doesn't, the benchmark run is aborted and its
+       results are marked and further processed as broken.  See the *"Broken run
+       processing"* section for the details.

-***** Generation
+     - After a non-empty block appears, the condition changes -- the benchmark
+       run is considered in progress, until either:
+       - no blocks arrive within a reasonable time, in which case, again, the
+         benchmark run is aborted and its results are marked and further
+         processed as broken.  Again, please see the *"Broken run processing"*
+         section for the details.
+       - a sequence of empty blocks of a profile-defined length arrives.
+         This is considered a success, and leads to the following phase.

-***** Termination
+  6. Benchmark termination is simply about stopping all key log-producing
+     services.

*** The capture process

-  TODO
+  The capture process deals with artifact collection and analysis
+  (including validation).

***** Log fetching

+  Logs are collected from all services/nodes material to benchmarking:
+
+  - producer and observer nodes
+  - transaction generator
+  - =cardano-db-sync=
+
+  In addition, SQL extraction queries are run against the =cardano-db-sync=
+  database.  (For the queries, please look for the =scripts/NN-*.sql= files in
+  the =cardano-benchmarking= repository.)
+
+  All of this is collected over SSH and stored in the current benchmark run
+  directory on the deployer.
+
+  The fetch phase is separately available via the =bench fetch= subcommand, and by
+  default fetches logs from the last run.
+
***** Analysis

+  A number of log analyses are performed on the collected logs, some of them
+  coming from the =cardano-benchmarking= repository (in the =analyses= and
+  =scripts= directories), and some defined locally, in the =bench= tool (in
+  =bench/lib-analyses.sh=).
+
+  The analyses include, but are not limited to:
+
+  - log file inventory, incl. time spans between first/last messages,
+  - per-transaction issuance/block inclusion time
+  - transaction submission statistics:
+    - announcement, sending, acknowledgement, loss and unavailability
+    - submission thread trace extraction
+  - per-message analyses, such as =MsgBlock=, =AddedBlockToQueue=,
+    =AddedToCurrentChain=, =TraceForgeInvalidBlock= and
+    =TraceMempoolRejectedTx=
+  - message type summary, which lists all encountered message types, along with
+    occurrence counts
+  - analyses derived from the above, such as:
+    - Tx rejection/invalid UTxO/missing input counters
+
+  The analysis phase is separately available via the =bench analyse= subcommand, and by
+  default analyses logs from the last run.
+
+***** Validation
+
+  Validation builds on the previous analyses to detect anomalies, and provides
+  a form of automated pass/fail classification of benchmarking runs based on
+  the sanity checks defined in =bench/lib-sanity.sh=, such as:
+
+  - log file start/stop spread being within a specified threshold, incl. 
from
+    benchmark run start time
+  - blocks being made at all
+  - trailing block-free gap until cluster termination being within a specified
+    threshold
+  - having any transactions in blocks at all
+  - having the submission process announce and send exactly the number of
+    transactions requested by the benchmark profile
+  - having the count of transactions seen in blocks be within a specified
+    threshold of the number of transactions sent
+  - having the chain density within the specified threshold
+
+  The sanity check phase is separately available via the =bench sanity-check=
+  subcommand, and by default checks sanity of logs from the last run.
+
***** Packaging

+  All the logs and the analysis results are packaged in either
+  the "good" or the "bad" run result directory, in a file whose name is the
+  concatenation of the profile run start timestamp and the profile name,
+  optionally suffixed with a =.broken= marker in the case of broken runs.
+
+  The result directories are situated on the =dev= deployer:
+
+  - =bench-results=
+  - =bench-results-bad=
+
+  The package phase is separately available via the =bench package=
+  subcommand, and by default packages the run directory from the last run.
+
*** Benchmark results

  Each successful benchmark run produces the following results:

  1. A run output directory, such as:

-     : ./runs/1589819135.27a0a9dc.refinery-manager.pristine.node-66f0e6d4.tx50000.l100.i1.o1.tps100
+     : ./runs/2020-05-27-12.02.dist3-50000tx-100b-100tps-1io-128kblk

-     This directory (also called "tag", internally), contains:
+     This benchmark run directory contains:

     1. =meta.json= -- the run's metadata, a key piece in its processing,
@@ -307,12 +516,14 @@
        for all the nodes (including explorer) and the Tx generator.  This
        also includes an extraction from the =cardano-db-sync= database.

-    6. =analysis/*= -- some light extraction based on the available logs.
+    6. =analysis.json= -- a collection of small-output-volume analyses.
+
+    7. 
=analysis/*= -- data extraction based on the available logs. - 7. =tools/*= -- the tools used to perform the above extraction, fetched + 8. =tools/*= -- the tools used to perform some of the analyses, fetched from the =cardano-benchmarking= repo. 2. An archive in the deployment checkout, that contains the exact /content/ of that directory, but placed in a directory with a user-friendly name: - : ./YYYY-MM-DD.$PROFILE_NAME.tar.xz + : ./YYYY-MM-DD-HH.MM.$PROFILE_NAME.tar.xz
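The user-friendly archive name can be reconstructed with plain =date=; a sketch (the real name is produced by =bench package=, and the profile name below is just an example):

```shell
# Build an archive name of the form ./YYYY-MM-DD-HH.MM.$PROFILE_NAME.tar.xz
PROFILE_NAME="dist3-50000tx-100b-100tps-1io-128kblk"   # example profile name
archive="./$(date -u +%Y-%m-%d-%H.%M).${PROFILE_NAME}.tar.xz"
echo "$archive"
```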