
Conversation

@petewall (Collaborator) commented Apr 1, 2025

Also, move destination configs to the end; they're typically the least interesting.

@petewall self-assigned this Apr 1, 2025
@petewall requested a review from rlankfo as a code owner April 1, 2025 22:24
@petewall linked an issue, "Place all destination components on all collectors", Apr 1, 2025 that may be closed by this pull request
@petewall merged commit 5343743 into main Apr 2, 2025
42 checks passed
@petewall deleted the fix/place-all-destination-components-on-all-collectors branch April 2, 2025 00:03
@stefanandres (Contributor) commented Apr 22, 2025

@petewall Is there any specific reason why you decided to add all destinations to all deployments?

For example, our alloy-logs deployment now also gets Prometheus and OTel/traces configurations, and one of our configurations is broken because of that:

===== /ConfigMap monitoring-dev/monitoring-dev-alloy-logs ======
4,23d3
<     // Destination: logs_service (loki)
<     otelcol.exporter.loki "logs_service" {
<       forward_to = [loki.write.logs_service.receiver]
<     }
< 
<     loki.write "logs_service" {
<       endpoint {
<         url = "http://loki-central-gateway.loki-central.svc.cluster.local/loki/api/v1/push"
<         tls_config {
<           insecure_skip_verify = false
<         }
<         min_backoff_period = "500ms"
<         max_backoff_period = "5m"
<         max_backoff_retries = "0"
<       }
<       external_labels = {
<         "cluster" = "dev-us-east-1",
<         "k8s_cluster_name" = "dev-us-east-1",
<       }
<     }
256a237,355
>       }
>     }
>     // Destination: metrics_service (prometheus)
>     otelcol.exporter.prometheus "metrics_service" {
>       add_metric_suffixes = true
>       forward_to = [prometheus.remote_write.metrics_service.receiver]
>     }
> 
>     prometheus.remote_write "metrics_service" {
>       endpoint {
>         url = "http://mimir-central-nginx.mimir-central.svc.cluster.local/api/v1/push"
>         headers = {
>         }
>         tls_config {
>           insecure_skip_verify = false
>         }
>         send_native_histograms = false
> 
>         queue_config {
>           capacity = 10000
>           min_shards = 1
>           max_shards = 50
>           max_samples_per_send = 2000
>           batch_send_deadline = "5s"
>           min_backoff = "30ms"
>           max_backoff = "5s"
>           retry_on_http_429 = true
>           sample_age_limit = "0s"
>         }
> 
>         write_relabel_config {
>           source_labels = ["cluster"]
>           regex = ""
>           replacement = "dev-us-east-1"
>           target_label = "cluster"
>         }
>         write_relabel_config {
>           source_labels = ["k8s_cluster_name"]
>           regex = ""
>           replacement = "dev-us-east-1"
>           target_label = "k8s_cluster_name"
>         }
>       }
> 
>       wal {
>         truncate_frequency = "20m"
>         min_keepalive_time = "5m"
>         max_keepalive_time = "30m"
>       }
>     }
>     // Destination: logs_service (loki)
>     otelcol.exporter.loki "logs_service" {
>       forward_to = [loki.write.logs_service.receiver]
>     }
> 
>     loki.write "logs_service" {
>       endpoint {
>         url = "http://loki-central-gateway.loki-central.svc.cluster.local/loki/api/v1/push"
>         tls_config {
>           insecure_skip_verify = false
>         }
>         min_backoff_period = "500ms"
>         max_backoff_period = "5m"
>         max_backoff_retries = "0"
>       }
>       external_labels = {
>         "cluster" = "dev-us-east-1",
>         "k8s_cluster_name" = "dev-us-east-1",
>       }
>     }
>     // Destination: traces_service (otlp)
> 
>     otelcol.processor.attributes "traces_service" {
>       output {
>         metrics = [otelcol.processor.transform.traces_service.input]
>         logs = [otelcol.processor.transform.traces_service.input]
>         traces = [otelcol.processor.transform.traces_service.input]
>       }
>     }
> 
>     otelcol.processor.transform "traces_service" {
>       error_mode = "ignore"
> 
>       trace_statements {
>         context = "resource"
>         statements = [
>           `set(attributes["cluster"], "dev-us-east-1")`,
>           `set(attributes["k8s.cluster.name"], "dev-us-east-1")`,
>         ]
>       }
> 
>       output {
>         traces = [otelcol.processor.batch.traces_service.input]
>       }
>     }
> 
>     otelcol.processor.batch "traces_service" {
>       timeout = "2s"
>       send_batch_size = 8192
>       send_batch_max_size = 0
> 
>       output {
>         traces = [otelcol.exporter.otlphttp.traces_service.input]
>       }
>     }
>     otelcol.exporter.otlphttp "traces_service" {
>       client {
>         endpoint = "http://tempo-central-gateway.tempo-central.svc.cluster.local:80"
>         tls {
>           insecure = false
>           insecure_skip_verify = false
>         }
>       }
> 
>       retry_on_failure {
>         enabled = true
>         initial_interval = "5s"
>         max_interval = "30s"
>         max_elapsed_time = "5m"
>       client {

That leads to:

Error: /etc/alloy/config.alloy:242:1: Failed to build component: building component: get segment range: segments are not sequential

241 |
242 |   prometheus.remote_write "metrics_service" {
    |  _^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
243 | |   endpoint {
244 | |     url = "https://mimir-central-internal-write.domain.com/api/v1/push"
245 | |     headers = {

246 | |       "X-Scope-OrgID" = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["tenantId"]),
247 | |     }
248 | |     basic_auth {
249 | |       username = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["user"])
250 | |       password = remote.kubernetes.secret.metrics_service.data["password"]
251 | |     }
252 | |     tls_config {
253 | |       insecure_skip_verify = false
254 | |       ca_pem = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["ca"])
255 | |       cert_pem = convert.nonsensitive(remote.kubernetes.secret.metrics_service.data["cert"])
256 | |       key_pem = remote.kubernetes.secret.metrics_service.data["key"]
257 | |     }
258 | |     send_native_histograms = false
259 | |
260 | |     queue_config {
261 | |       capacity = 35000
262 | |       min_shards = 1
263 | |       max_shards = 150
264 | |       max_samples_per_send = 10000
265 | |       batch_send_deadline = "5s"
266 | |       min_backoff = "30ms"
267 | |       max_backoff = "5s"
268 | |       retry_on_http_429 = true
269 | |       sample_age_limit = "0s"
270 | |     }
271 | |
272 | |     write_relabel_config {
273 | |       source_labels = ["cluster"]
274 | |       regex = ""
275 | |       replacement = "dev-us-east-1"
276 | |       target_label = "cluster"
277 | |     }
278 | |     write_relabel_config {
279 | |       source_labels = ["k8s_cluster_name"]
280 | |       regex = ""
281 | |       replacement = "dev-us-east-1"
282 | |       target_label = "k8s_cluster_name"
283 | |     }
284 | |   }
285 | |
286 | |   wal {
287 | |     truncate_frequency = "2h"
288 | |     min_keepalive_time = "5m"
289 | |     max_keepalive_time = "2h"
290 | |   }
291 | | }
    | |_^
292 |
interrupt received
ts=2025-04-22T08:32:02.696132819Z level=info msg="starting complete graph evaluation" controller_path=/ controller_id=pod_logs.feature trace_id=00000000000000000000000000000000
ts=2025-04-22T08:32:02.696239088Z level=info msg="finished node evaluation" controller_path=/ controller_id=pod_logs.feature trace_id=00000000000000000000000000000000 node_id=discovery.kubernetes.pods duration=73.925µs
panic: duplicate metrics collector registration attempted

My guess is that "Failed to build component: building component: get segment range: segments are not sequential" means the alloy-logs process couldn't process its local WALs, which is interesting because it didn't even have a Prometheus WAL before. And because /var/lib/alloy is host-mounted, those WAL files persist and aren't deleted when a pod is respawned. The alloy-metrics StatefulSet seems to use /tmp/, which is cleared with every pod spawn.
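
A quick way to confirm the stale WAL is to list the host-mounted directory from inside one of the alloy-logs pods (pod name below is a placeholder; add -c <container> if your pods run more than one container):

# find the alloy-logs pods, then look inside one of them
kubectl -n monitoring-dev get pods | grep alloy-logs
kubectl -n monitoring-dev exec <one-of-the-alloy-logs-pods> -- \
  ls -la /var/lib/alloy/prometheus.remote_write.metrics_service/wal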

I wish I could just disable the unused config again, but now I have to fix a problem caused by config that isn't even used.

Especially given the vision described in https://github.com/grafana/k8s-monitoring-helm/blob/039a96d76c347dd165cf70777cf5217bf8b7299d/charts/k8s-monitoring/docs/Migration.md:

In v2, all features are turned off by default, which leads your values file to better reflect your desired feature set.

@stefanandres (Contributor) commented Apr 22, 2025

A workaround for that seems to be to manually delete /var/lib/alloy/prometheus.remote_write.metrics_service/wal/* from all existing alloy-logs daemonset pods.
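
In case it helps anyone, a rough sketch of how we did that across all pods (it just greps pod names, so adjust the namespace and name filter to your release):

# remove the stale metrics WAL from every running alloy-logs pod
for pod in $(kubectl -n monitoring-dev get pods -o name | grep alloy-logs); do
  kubectl -n monitoring-dev exec "$pod" -- \
    sh -c 'rm -rf /var/lib/alloy/prometheus.remote_write.metrics_service/wal/*'
done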

Edit: I don't know how, but this only fixed about 80% of the DaemonSet's pods.

Some pods now error with:

Error: /etc/alloy/config.alloy:242:1: Failed to build component: building component: open WAL segment 0: open /var/lib/alloy/prometheus.remote_write.metrics_service/wal/00000000: no such file or directory

Edit 2: This might have been a race condition with still half-running alloy-logs instances. After I shut down the DaemonSet, ran the cleanup DaemonSet below, and re-spawned the alloy-logs DaemonSet, the error was gone.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sandres-wal-cleaner
spec:
  selector:
    matchLabels:
      app: sandres-wal-cleaner
  template:
    metadata:
      labels:
        app: sandres-wal-cleaner
    spec:
      containers:
        - name: cleaner
          image: busybox:latest
          securityContext:
            privileged: true
          command: ["/bin/sh", "-c"]
          args:
            [
              "echo Deleting WAL files...; \
               rm -rfv /host-wal/*; \
               echo Done. Sleeping...; \
               sleep 3600"
            ]
          volumeMounts:
            - name: wal-host-mount
              mountPath: /host-wal
      volumes:
        - name: wal-host-mount
          hostPath:
            path: /var/lib/alloy/prometheus.remote_write.metrics_service/wal
            type: DirectoryOrCreate
      restartPolicy: Always
      tolerations:
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          key: CriticalAddonsOnly
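
For reference, the order that ended up working, roughly (sandres-wal-cleaner.yaml is the manifest above saved to a file, and the helm command is just a placeholder for however you normally deploy the chart):

# stop the writers first so nothing re-creates WAL segments during cleanup
kubectl -n monitoring-dev delete daemonset monitoring-dev-alloy-logs
# run the cleanup DaemonSet, wait until it has finished on every node, then remove it
kubectl -n monitoring-dev apply -f sandres-wal-cleaner.yaml
kubectl -n monitoring-dev delete -f sandres-wal-cleaner.yaml
# re-create the alloy-logs DaemonSet by re-applying the chart
helm upgrade monitoring-dev grafana/k8s-monitoring -n monitoring-dev -f values.yaml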

Edit 3: Well, because of another issue with our installation we recreate the alloy-logs pods every 15 minutes, and the original issue is back now. So the pods fail with the initial error again.

Edit 4: Here is a workaround that works for us now:

alloy-logs:
  controller:
    initContainers:
      # See https://github.com/grafana/k8s-monitoring-helm/pull/1399#issuecomment-2821182928
      - name: delete-wal-workaround
        image: busybox:stable@sha256:e246aa22ad2cbdfbd19e2a6ca2b275e26245a21920e2b2d0666324cee3f15549
        command:
          - /bin/sh
          - -c
          - |
            rm -rfv /var/lib/alloy/prometheus.remote_write.metrics_service/wal/*
        volumeMounts:
          - mountPath: /var/lib/alloy
            name: agent-storage

Please let us disable the unused config :)
