Define initial metrics to export to Prometheus as MVP #5

Closed
lohanspies opened this issue Jul 13, 2020 · 3 comments

@lohanspies

Prometheus MVP Metrics

  • The Sovrin Network name being monitored
    Should be able to get this from the pool being connected to

  • Node alias name

  • Detect when a node is inaccessible and produce standard output for that situation.

    Should generate a timeout when trying to pull validator_info from inaccessible nodes.

  • Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.

    • That indicates that the internal port to the node is not accessible, even though the public port is accessible.
    "Reachable_nodes": [
      [
        "Node1",
        0
      ],
      [
        "Node3",
        null
      ],
      [
        "Node4",
        null
      ]
    ],
    "Unreachable_nodes": [
      [
        "Node2",
        null
      ]
    ],
    "Reachable_nodes_count": 3,
    "Unreachable_nodes_count": 1,
  • The number of transactions per Indy ledger, especially the domain ledger.
    "transaction-count": {
      "ledger": 21,
      "pool": 4,
      "config": 0,
      "audit": 1042
    },
  • The average read and write times for the node.
    "throughput": {
      "0": 0.0017547843
    },
    "master throughput": 0.0017547843,
    "total requests": 16,
    "avg backup throughput": null,
    "master throughput ratio": null,
    "average-per-second": {
      "read-transactions": 0.0338584473,
      "write-transactions": 0.0001539895
    },
  • The average throughput time for the node.
    "throughput": {
      "0": 0.0017547843
    },
    "master throughput": 0.0017547843,
    "total requests": 16,
    "avg backup throughput": null,
    "master throughput ratio": null,
    "average-per-second": {
      "read-transactions": 0.0338584473,
      "write-transactions": 0.0001539895
    },
  • The uptime of the node (time since last restart).
      "transaction-count": {
        "ledger": 21,
        "pool": 4,
        "config": 0,
        "audit": 1042
      },
      "uptime": 103903
    },
  • The time since last freshness check (should be less than 5 minutes).
    "Freshness_status": {
      "1": {
        "Last_updated_time": "2020-07-06 23:55:07+00:00",
        "Has_write_consensus": true
      },
      "0": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
      },
      "2": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
      }
    }
  • Node IP address information
    "Node_info": {
      "Name": "Node4",
      "Mode": "participating",
      "Client_port": 9708,
      "Client_ip": "0.0.0.0",
      "Client_protocol": "tcp",
      "Node_port": 9707,
      "Node_ip": "0.0.0.0",
  • Total nodes in pool information
    "Pool_info": {
      "Read_only": false,
      "Total_nodes_count": 4,
@kiview

kiview commented Jul 13, 2020

Since Prometheus will only gather numeric metrics, there are some things to consider when modeling the metrics.

The Sovrin Network name being monitored

Should be a label attached to all metrics.

Node alias name

Should also be a label for each node we get (the nodes being the top-level objects in the dict we are currently fetching).

Detect when a node is inaccessible and produce standard output for that situation.

This would happen outside of the exporter, either in Prometheus through Alertmanager, or in Grafana.

The number of transactions per Indy ledger, especially the domain ledger.

Should work as Gauge transactions_total with a label per ledger.
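
As a rough sketch of this modeling, assuming the exporter has already parsed the "transaction-count" object from validator_info, the snippet below renders one sample per ledger in the Prometheus text exposition format. The function name `render_transaction_gauges`, the label names `network`, `node`, and `ledger`, and the network value "sandbox" are illustrative assumptions, not part of the project.

```python
def render_transaction_gauges(network, node, transaction_count):
    """Render one gauge sample per ledger in Prometheus text exposition format.

    The label names (network, node, ledger) are assumptions for this sketch.
    """
    lines = ["# TYPE transactions_total gauge"]
    for ledger, count in sorted(transaction_count.items()):
        labels = f'network="{network}",node="{node}",ledger="{ledger}"'
        # Doubled braces emit literal { and } around the label set.
        lines.append(f"transactions_total{{{labels}}} {count}")
    return "\n".join(lines)


# Values copied from the "transaction-count" fragment in this issue;
# "sandbox" is a hypothetical network name.
sample = {"ledger": 21, "pool": 4, "config": 0, "audit": 1042}
text = render_transaction_gauges("sandbox", "Node4", sample)
print(text)
```

Keeping the ledger as a label rather than part of the metric name lets a single query select or sum across ledgers. Prometheus naming conventions usually reserve the `_total` suffix for counters, so the metric name may be worth revisiting.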

The average read and write times for the node.

Here I wonder how the values are measured. Ideally, we could just record the total requests in a Gauge and let Prometheus infer the other metrics. Otherwise, histograms for throughput might be fine; we just have to be careful to avoid statistically wrong double aggregations, such as averaging values that are already averages.
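
To illustrate the double-aggregation concern, here is a small worked example under one reading of it: averaging per-node averages without weighting them by request count. The node figures are invented for illustration and do not come from a real pool; both function names are hypothetical.

```python
def mean_of_averages(nodes):
    """Naive aggregation: average the per-node averages, ignoring volume."""
    return sum(avg for avg, _ in nodes) / len(nodes)


def weighted_average(nodes):
    """Aggregate correctly by weighting each node's average by its request count."""
    total_requests = sum(count for _, count in nodes)
    return sum(avg * count for avg, count in nodes) / total_requests


# Hypothetical (average value per request, request count) pairs for two nodes.
nodes = [(0.5, 10), (2.0, 990)]

naive = mean_of_averages(nodes)
exact = weighted_average(nodes)
print(f"mean of averages: {naive:.3f}, weighted average: {exact:.3f}")
```

The two results diverge because the second node handled almost all of the requests. Exporting raw totals and letting Prometheus derive rates or averages at query time sidesteps this problem.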

The uptime of the node (time since last restart).

Clearly a gauge with a label per node.

The time since last freshness check (should be less than 5 minutes).

Diff against the current time and record as a Gauge?
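
A minimal sketch of that diff, assuming the "Last_updated_time" strings keep the format shown in this issue (an ISO-8601-like timestamp with a UTC offset). The helper name `freshness_age_seconds` and the injectable `now` parameter are assumptions made for this example.

```python
from datetime import datetime, timezone


def freshness_age_seconds(freshness_status, now=None):
    """Return seconds elapsed since Last_updated_time for each freshness entry."""
    now = now or datetime.now(timezone.utc)
    ages = {}
    for key, entry in freshness_status.items():
        # fromisoformat accepts the space separator and the "+00:00" offset.
        updated = datetime.fromisoformat(entry["Last_updated_time"])
        ages[key] = (now - updated).total_seconds()
    return ages


# Two entries copied from the Freshness_status fragment in this issue.
status = {
    "1": {"Last_updated_time": "2020-07-06 23:55:07+00:00", "Has_write_consensus": True},
    "0": {"Last_updated_time": "2020-07-06 23:57:33+00:00", "Has_write_consensus": True},
}

# A fixed reference time keeps the example reproducible.
fixed_now = datetime(2020, 7, 7, 0, 0, 0, tzinfo=timezone.utc)
ages = freshness_age_seconds(status, now=fixed_now)
print(ages)
```

Each age could then be exported as a gauge labeled with the freshness key, and an alert could fire when the value exceeds 300 seconds, matching the "less than 5 minutes" expectation.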

Node IP address information

This could be a label, same as the node name.

Total nodes in pool information

Gauge with pool name as label.

@kiview

kiview commented Jul 13, 2020

One question regarding freshness status:

When I have a test network with 4 nodes, I get 3 freshness values, as you have posted above:

          "Freshness_status": {
            "1": {
              "Last_updated_time": "2020-07-06 23:55:07+00:00",
              "Has_write_consensus": true
            },
            "0": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            },
            "2": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            }
          }

What do these numeric keys (0, 1, 2) represent, and how should we interpret them?

@kiview kiview mentioned this issue Jul 13, 2020
11 tasks
WadeBarnes pushed a commit to WadeBarnes/indy-node-monitor that referenced this issue Jun 27, 2021
@WadeBarnes
Member

These metrics should be available on the auto-provisioned dashboards supplied with the monitoring stack. If anything else is needed or anything is missing, a separate issue can be opened.
