
Conversation

@petewall (Collaborator) commented Apr 1, 2025

Also, move destination configs to the end; they're typically the least interesting.

@petewall self-assigned this Apr 1, 2025
@petewall requested a review from rlankfo as a code owner April 1, 2025 22:24
@petewall linked an issue, "Place all destination components on all collectors", Apr 1, 2025 that may be closed by this pull request
@petewall merged commit 5343743 into main Apr 2, 2025
42 checks passed
@petewall deleted the fix/place-all-destination-components-on-all-collectors branch April 2, 2025 00:03
@stefanandres (Contributor) commented Apr 22, 2025

@petewall Is there any specific reason why you decided to add all destinations to all deployments?

For example, our alloy-logs deployment now also gets Prometheus and OTel/traces configurations, and one of our configurations is broken because of that:

===== /ConfigMap monitoring-dev/monitoring-dev-alloy-logs ======
4,23d3
<     // Destination: logs_service (loki)
<     otelcol.exporter.loki "logs_service" {
<       forward_to = [loki.write.logs_service.receiver]
<     }
< 
<     loki.write "logs_service" {
<       endpoint {
<         url = "http://loki-central-gateway.loki-central.svc.cluster.local/loki/api/v1/push"
<         tls_config {
<           insecure_skip_verify = false
<         }
<         min_backoff_period = "500ms"
<         max_backoff_period = "5m"
<         max_backoff_retries = "0"
<       }
<       external_labels = {
<         "cluster" = "dev-us-east-1",
<         "k8s_cluster_name" = "dev-us-east-1",
<       }
<     }
256a237,355
>       }
>     }
>     // Destination: metrics_service (prometheus)
>     otelcol.exporter.prometheus "metrics_service" {
>       add_metric_suffixes = true
>       forward_to = [prometheus.remote_write.metrics_service.receiver]
>     }
> 
>     prometheus.remote_write "metrics_service" {
>       endpoint {
>         url = "http://mimir-central-nginx.mimir-central.svc.cluster.local/api/v1/push"
>         headers = {
>         }
>         tls_config {
>           insecure_skip_verify = false
>         }
>         send_native_histograms = false
> 
>         queue_config {
>           capacity = 10000
>           min_shards = 1
>           max_shards = 50
>           max_samples_per_send = 2000
>           batch_send_deadline = "5s"
>           min_backoff = "30ms"
>           max_backoff = "5s"
>           retry_on_http_429 = true
>           sample_age_limit = "0s"
>         }
> 
>         write_relabel_config {
>           source_labels = ["cluster"]
>           regex = ""
>           replacement = "dev-us-east-1"
>           target_label = "cluster"
>         }
>         write_relabel_config {
>           source_labels = ["k8s_cluster_name"]
>           regex = ""
>           replacement = "dev-us-east-1"
>           target_label = "k8s_cluster_name"
>         }
>       }
> 
>       wal {
>         truncate_frequency = "20m"
>         min_keepalive_time = "5m"
>         max_keepalive_time = "30m"
>       }
>     }
>     // Destination: logs_service (loki)
>     otelcol.exporter.loki "logs_service" {
>       forward_to = [loki.write.logs_service.receiver]
>     }
> 
>     loki.write "logs_service" {
>       endpoint {
>         url = "http://loki-central-gateway.loki-central.svc.cluster.local/loki/api/v1/push"
>         tls_config {
>           insecure_skip_verify = false
>         }
>         min_backoff_period = "500ms"
>         max_backoff_period = "5m"
>         max_backoff_retries = "0"
>       }
>       external_labels = {
>         "cluster" = "dev-us-east-1",
>         "k8s_cluster_name" = "dev-us-east-1",
>       }
>     }
>     // Destination: traces_service (otlp)
> 
>     otelcol.processor.attributes "traces_service" {
>       output {
>         metrics = [otelcol.processor.transform.traces_service.input]
>         logs = [otelcol.processor.transform.traces_service.input]
>         traces = [otelcol.processor.transform.traces_service.input]
>       }
>     }
> 
>     otelcol.processor.transform "traces_service" {
>       error_mode = "ignore"
> 
>       trace_statements {
>         context = "resource"
>         statements = [
>           `set(attributes["cluster"], "dev-us-east-1")`,
>           `set(attributes["k8s.cluster.name"], "dev-us-east-1")`,
>         ]
>       }
> 
>       output {
>         traces = [otelcol.processor.batch.traces_service.input]
>       }
>     }
> 
>     otelcol.processor.batch "traces_service" {
>       timeout = "2s"
>       send_batch_size = 8192
>       send_batch_max_size = 0
> 
>       output {
>         traces = [otelcol.exporter.otlphttp.traces_service.input]
>       }
>     }
>     otelcol.exporter.otlphttp "traces_service" {
>       client {
>         endpoint = "http://tempo-central-gateway.tempo-central.svc.cluster.local:80"
>         tls {
>           insecure = false
>           insecure_skip_verify = false
>         }
>       }
> 
>       retry_on_failure {
>         enabled = true
>         initial_interval = "5s"
>         max_interval = "30s"
>         max_elapsed_time = "5m"
>       client {

That leads to:

Error: /etc/alloy/config.alloy:242:1: Failed to build component: building component: get segment range: segments are not sequential

241 |
242 |   prometheus.remote_write "metrics_service" {
    |  _^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
243 | |   endpoint {
244 | |     url = "https://mimir-central-internal-write.domain.com/api/v1/push"
245 | |     headers = {

246 | |       "X-Scope-OrgID" = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["tenantId"]),
247 | |     }
248 | |     basic_auth {
249 | |       username = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["user"])
250 | |       password = remote.kubernetes.secret.metrics_service.data["password"]
251 | |     }
252 | |     tls_config {
253 | |       insecure_skip_verify = false
254 | |       ca_pem = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["ca"])
255 | |       cert_pem = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["cert"])
256 | |       key_pem = remote.kubernetes.secret.metrics_service.data["key"]
257 | |     }
258 | |     send_native_histograms = false
259 | |
260 | |     queue_config {
261 | |       capacity = 35000
262 | |       min_shards = 1
263 | |       max_shards = 150
264 | |       max_samples_per_send = 10000
265 | |       batch_send_deadline = "5s"
266 | |       min_backoff = "30ms"
267 | |       max_backoff = "5s"
268 | |       retry_on_http_429 = true
269 | |       sample_age_limit = "0s"
270 | |     }
271 | |
272 | |     write_relabel_config {
273 | |       source_labels = ["cluster"]
274 | |       regex = ""
275 | |       replacement = "dev-us-east-1"
276 | |       target_label = "cluster"
277 | |     }
278 | |     write_relabel_config {
279 | |       source_labels = ["k8s_cluster_name"]
280 | |       regex = ""
281 | |       replacement = "dev-us-east-1"
282 | |       target_label = "k8s_cluster_name"
283 | |     }
284 | |   }
285 | |
286 | |   wal {
287 | |     truncate_frequency = "2h"
288 | |     min_keepalive_time = "5m"
289 | |     max_keepalive_time = "2h"
290 | |   }
291 | | }
    | |_^
292 |
interrupt received
ts=2025-04-22T08:32:02.696132819Z level=info msg="starting complete graph evaluation" controller_path=/ controller_id=pod_logs.feature trace_id=00000000000000000000000000000000
ts=2025-04-22T08:32:02.696239088Z level=info msg="finished node evaluation" controller_path=/ controller_id=pod_logs.feature trace_id=00000000000000000000000000000000 node_id=discovery.kubernetes.pods duration=73.925µs
panic: duplicate metrics collector registration attempted

My guess is that "Failed to build component: building component: get segment range: segments are not sequential" means the alloy-logs process couldn't process its local WALs, which is interesting because it didn't even have a Prometheus WAL before. And because /var/lib/alloy is host-mounted, those WAL files persist and aren't deleted when a pod is respawned. The alloy-metrics StatefulSet seems to use /tmp/, which is cleared with every pod spawn.
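
A quick way to confirm the stale WAL is to list the host-mounted directory from inside one of the alloy-logs pods (pod name below is a placeholder; add -c <container> if your pods run more than one container):

# find the alloy-logs pods, then look inside one of them
kubectl -n monitoring-dev get pods | grep alloy-logs
kubectl -n monitoring-dev exec <one-of-the-alloy-logs-pods> -- \
  ls -la /var/lib/alloy/prometheus.remote_write.metrics_service/wal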

I wish I could just disable the unused config again, but now I have to fix a problem caused by config that isn't even used.

Especially given the vision described in https://github.com/grafana/k8s-monitoring-helm/blob/039a96d76c347dd165cf70777cf5217bf8b7299d/charts/k8s-monitoring/docs/Migration.md:

In v2, all features are turned off by default, which leads your values file to better reflect your desired feature set.

@stefanandres (Contributor) commented Apr 22, 2025

A workaround for that seems to be to manually delete /var/lib/alloy/prometheus.remote_write.metrics_service/wal/* from all existing alloy-logs daemonset pods.
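
In case it helps anyone, a rough sketch of how we did that across all pods (it just greps pod names, so adjust the namespace and name filter to your release):

# remove the stale metrics WAL from every running alloy-logs pod
for pod in $(kubectl -n monitoring-dev get pods -o name | grep alloy-logs); do
  kubectl -n monitoring-dev exec "$pod" -- \
    sh -c 'rm -rf /var/lib/alloy/prometheus.remote_write.metrics_service/wal/*'
done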

Edit: I don't know how, but this only fixed about 80% of the DaemonSet's pods.

Some pods now error with:

Error: /etc/alloy/config.alloy:242:1: Failed to build component: building component: open WAL segment 0: open /var/lib/alloy/prometheus.remote_write.metrics_service/wal/00000000: no such file or directory

Edit 2: This might have been a race condition with still half-running alloy-logs instances. After I shut down the DaemonSet, ran the cleanup DaemonSet below, and re-spawned the alloy-logs DaemonSet, the error was gone.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sandres-wal-cleaner
spec:
  selector:
    matchLabels:
      app: sandres-wal-cleaner
  template:
    metadata:
      labels:
        app: sandres-wal-cleaner
    spec:
      containers:
        - name: cleaner
          image: busybox:latest
          securityContext:
            privileged: true
          command: ["/bin/sh", "-c"]
          args:
            [
              "echo Deleting WAL files...; \
               rm -rfv /host-wal/*; \
               echo Done. Sleeping...; \
               sleep 3600"
            ]
          volumeMounts:
            - name: wal-host-mount
              mountPath: /host-wal
      volumes:
        - name: wal-host-mount
          hostPath:
            path: /var/lib/alloy/prometheus.remote_write.metrics_service/wal
            type: DirectoryOrCreate
      restartPolicy: Always
      tolerations:
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          key: CriticalAddonsOnly
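
For reference, the order that ended up working, roughly (sandres-wal-cleaner.yaml is the manifest above saved to a file, and the helm command is just a placeholder for however you normally deploy the chart):

# stop the writers first so nothing re-creates WAL segments during cleanup
kubectl -n monitoring-dev delete daemonset monitoring-dev-alloy-logs
# run the cleanup DaemonSet, wait until it has finished on every node, then remove it
kubectl -n monitoring-dev apply -f sandres-wal-cleaner.yaml
kubectl -n monitoring-dev delete -f sandres-wal-cleaner.yaml
# re-create the alloy-logs DaemonSet by re-applying the chart
helm upgrade monitoring-dev grafana/k8s-monitoring -n monitoring-dev -f values.yaml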

Edit 3: Well, because of another issue with our installation we recreate the alloy-logs pods every 15 minutes, and the original issue is back now. So the pods fail with the initial error again.

Edit 4: Here is a workaround that works for us now:

alloy-logs:
  controller:
    initContainers:
      # See https://github.com/grafana/k8s-monitoring-helm/pull/1399#issuecomment-2821182928
      - name: delete-wal-workaround
        image: busybox:stable@sha256:e246aa22ad2cbdfbd19e2a6ca2b275e26245a21920e2b2d0666324cee3f15549
        command:
          - /bin/sh
          - -c
          - |
            rm -rfv /var/lib/alloy/prometheus.remote_write.metrics_service/wal/*
        volumeMounts:
          - mountPath: /var/lib/alloy
            name: agent-storage

Please let us disable the unused config :)
