
Conversation

jstirnaman (Contributor):

  • Add reference doc for /metrics output.
  • Add monitoring guide:
    • Core and general metrics
    • Enterprise cluster and node-specific metrics
    • Using metrics and relabeling with Prometheus or Telegraf.

Part of #6420

chore(qol): Instruction to use /version/ in shared links
@sanderson (Collaborator) left a comment:


Great info. Lots of questions 😄

```
{{% /show-in %}}

Replace {{% code-placeholder-key %}}`AUTH_TOKEN`{{% /code-placeholder-key %}} with your {{< product-name >}} {{% token-link %}} that has read access to the `/metrics` endpoint.
```
sanderson (Collaborator):

What tokens can read the /metrics endpoint? I assume it's just admin tokens since it's in both Core and Enterprise. I think we should call this out.

Reply (Contributor):

In Enterprise, you can create a non-admin fine-grained token with the `system:metrics:read` permission, and it will grant access to that endpoint.
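For example (a minimal sketch, assuming a token with that permission is stored in a `METRICS_TOKEN` environment variable and the server listens on the default port 8181):

```bash
# Verify the token can read the /metrics endpoint
curl -s http://localhost:8181/metrics \
  --header "Authorization: Bearer $METRICS_TOKEN" \
  | head -n 20  # print the first few Prometheus exposition lines
```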

Comment on lines +42 to +52
{{% show-in "enterprise" %}}
### Aggregate metrics across cluster

```bash
# Get metrics from all nodes in cluster
for node in ingester-01 query-01 compactor-01; do
echo "=== Node: $node ==="
curl -s http://$node:8181/metrics | grep 'http_requests_total.*status="ok"'
done
```
{{% /show-in %}}
sanderson (Collaborator):

So these metrics are specific to each node. Does the Prometheus schema include the node ID, or is there additional processing a user would have to do to know the source node?

Comment on lines +88 to +89
Different metrics are more relevant depending on node [mode configuration](/influxdb3/version/admin/clustering/#configure-node-modes):

sanderson (Collaborator):

Do irrelevant metrics still get reported? Do all nodes report the same metric, no matter what mode they're running in?

Comment on lines +313 to +322
```promql
# 95th percentile query latency by query node
histogram_quantile(0.95,
  sum(rate(influxdb_iox_query_log_execute_duration_seconds_bucket[5m])) by (instance, le)
)

# Average inter-node coordination time
avg(rate(influxdb_iox_query_log_ingester_latency_to_full_data_seconds_sum[5m]) /
    rate(influxdb_iox_query_log_ingester_latency_to_full_data_seconds_count[5m])) by (instance)
```
sanderson (Collaborator):

Just a thought, but why not suggest that users use Telegraf to collect these metrics and store them in another InfluxDB instance rather than Prometheus? I think we can provide PromQL queries, but they should be secondary to InfluxDB queries.

Setting up a sidecar monitoring instance is basically standard practice with v1 and v2 production deployments. I think it should be with v3 as well.

The Telegraf config would look something like:

```toml
[[inputs.prometheus]]
  urls = [
    "http://ingester-1.com/metrics",
    "http://querier-1.com/metrics",
    "http://compactor-1.com/metrics"
  ]
  metric_version = 2
  http_headers = {"Authorization" = "Bearer ${READ_AUTH_TOKEN}"}

[[outputs.influxdb_v2]]
  urls = ["http://influxdb3-monitor.com"]
  token = "${WRITE_AUTH_TOKEN}"
  organization = ""
  bucket = "DATABASE_NAME"
```

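A config like this can be sanity-checked and then run with Telegraf's standard flags (a sketch; the config path is illustrative):

```bash
# Print gathered metrics once to stdout without writing to outputs
telegraf --config /etc/telegraf/telegraf.conf --test

# Run Telegraf with the config
telegraf --config /etc/telegraf/telegraf.conf
```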
sanderson (Collaborator):

I actually see that you cover this later under "Node Labeling", but I still think this should be the first suggestion.

Comment on lines +422 to +446
Create role-specific dashboards with the following suggested metrics for each dashboard:

#### Cluster Overview Dashboard
- Node status and availability
- Request rates across all nodes
- Error rates by node and operation type
- Resource utilization summary

#### Ingest Performance Dashboard
- Write throughput by ingest node
- Snapshot creation rates
- Memory usage and pressure
- WAL-to-Parquet conversion metrics

#### Query Performance Dashboard
- Query latency percentiles by query node
- Cache hit rates and efficiency
- Inter-node coordination times
- Memory usage during query execution

#### Operations Dashboard
- Compaction progress and performance
- Object store operation success rates
- Processing engine trigger rates
- System health indicators
sanderson (Collaborator):

This section doesn't seem all that helpful unless we're going to actually provide a dashboard for them, or, at a minimum, the queries for each. But I know that depends on where they're storing the metrics.


```toml
# Add node name from URL
[inputs.prometheus.tags]
node_name = "$1"
```
sanderson (Collaborator):

I'm surprised we don't actually include the node-id as a label in the metrics.
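Until the node ID is exposed server-side, one way to attach it at collection time is a Telegraf regex processor that derives a tag from the `url` tag set by `inputs.prometheus` (a sketch; the pattern and tag name are illustrative):

```toml
# Turn the scrape URL, e.g. "http://ingester-01:8181/metrics",
# into a node_name tag: node_name = "ingester-01"
[[processors.regex]]
  [[processors.regex.tags]]
    key = "url"
    pattern = "^https?://([^.:/]+).*$"
    replacement = "${1}"
    result_key = "node_name"
```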

…uthentication

- Add system:metrics:read token creation examples to Enterprise token docs
- Document both CLI and HTTP API approaches for creating metrics tokens
- Clarify that Enterprise supports both admin and fine-grained tokens for /metrics
- Add node identification explanation to monitoring documentation

Addresses PR #6422 review comments about authentication and token permissions.
Test validation: All examples validated with InfluxDB 3.5.0 Enterprise.
Addresses @sanderson's comment: "Do irrelevant metrics still get reported?
Do all nodes report the same metric, no matter what mode they're running in?"

## Changes

### Documentation Updates

**content/shared/influxdb3-reference/metrics.md:**
- Add "Metrics reporting across node modes" section under cluster considerations
- Explain that all nodes report the same 120 metrics regardless of mode
- Clarify differences appear in values/labels, not metric availability
- Remove mention of HTTP/gRPC metrics appearing dynamically (less relevant)

**content/shared/influxdb3-admin/monitor-metrics.md:**
- Add Note callout in "Metric categories" section
- Provide same clarifications in more prominent location
- Simplify explanation for better readability

### Testing Configuration

**compose.yaml:**
- Add specialized Enterprise nodes for testing:
  - influxdb3-enterprise-write (mode: ingest, port 8183)
  - influxdb3-enterprise-query (mode: query, port 8184)
- Fix port conflicts between specialized nodes
- Enable validation of metrics behavior across node modes

## Test Results

Validated with running Enterprise nodes in different modes:
- All nodes expose same 120 unique metrics
- Metrics not filtered by node specialization
- Metric values reflect actual node activity
- Confirmed standard Prometheus behavior

See .context/issues/pr-6422-comment-responses.md for detailed test results.
…ection

Addresses @sanderson's comment: "Why not suggest users use Telegraf to collect
these metrics and store them in another InfluxDB instance rather than Prometheus?"

## Changes

### Enterprise Monitoring Setup

**Before:** Prometheus configuration appeared first
**After:** Telegraf configuration with "(recommended)" label appears first

**New Telegraf section includes:**
- Complete configuration with `outputs.influxdb_v3` for storing in monitoring instance
- `inputs.prometheus` for scraping cluster node metrics
- `processors.regex` for extracting node_name and node_role from URLs
- Start commands for running Telegraf as a service
- SQL query examples for analyzing collected metrics in InfluxDB

**Prometheus section:**
- Moved to "Alternative: Prometheus configuration"
- Retained for users preferring Prometheus ecosystem
- Includes separate "Add node identification with Prometheus" section

### Core Monitoring Setup

**Before:** Only Prometheus configuration shown
**After:** Telegraf appears first with "(recommended)" label

**New sections:**
- "Collect metrics with Telegraf (recommended)" with complete config
- "Alternative: Prometheus configuration" for Prometheus users
- SQL query examples for monitoring InfluxDB 3 Core metrics

## Benefits

1. **InfluxDB-native workflow**: Collect InfluxDB metrics → Store in InfluxDB → Query with SQL
2. **Consistent tooling**: Users already familiar with Telegraf for data collection
3. **SQL queries**: Natural fit for InfluxDB users vs learning PromQL
4. **Centralized monitoring**: Store metrics in separate InfluxDB instance
5. **Platform agnostic**: Telegraf runs anywhere without Prometheus infrastructure

## Documentation Coverage

- ✅ Complete Telegraf configurations for both Core and Enterprise
- ✅ Node identification through processor plugins
- ✅ SQL query examples for common monitoring scenarios
- ✅ Prometheus approach retained as alternative
- ✅ Clear "(recommended)" and "Alternative" labels throughout

Addresses PR #6422 comment 4.
…trics

- Split TOC into separate Core and Enterprise versions using show-in shortcodes
- Core TOC focuses on single-node monitoring workflows
- Enterprise TOC includes cluster-specific and node-specific monitoring sections
- Improves navigation by showing only relevant sections per product
- Fix: Remove duplicate "InfluxDB" word in metrics.md
@jstirnaman (Contributor, Author) left a comment:


> I'm surprised we don't actually include the node-id as a label in the metrics.

Raised influxdata/influxdb#26873 for this

@jstirnaman merged commit f14c244 into 6403-influxdb3-perf-tuning on Oct 2, 2025. 2 checks passed.