
loki.source.docker error: could not set up a wait request to the Docker client #4403

Closed
geekifier opened this issue Jul 11, 2023 · 16 comments · Fixed by #6055
Labels: bug (Something isn't working), frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments

geekifier commented Jul 11, 2023

What's wrong?

I have a very basic Docker log scraping config with the agent running in flow mode. The logs seem to be transmitted normally; however, the grafana-agent constantly logs errors/warnings complaining that it cannot read container logs. The messages are produced for every targeted container and repeat on a regular basis.

Steps to reproduce

Simply run AGENT_MODE=flow grafana-agent --server.http.listen-addr=0.0.0.0:12345 run /etc/grafana-agent.river with the provided config on the affected host.

The agent is running as user grafana-agent, which is a member of the following groups: grafana-agent, adm, systemd-journal, docker

System information

Linux 4.4.0-92-generic x86_64

Software version

Grafana Agent v0.34.3

Configuration

logging {
  level  = "warn"
  format = "logfmt"
}

prometheus.scrape "default" {
	targets    = [{"__address__" = "localhost:12345"}]
	forward_to = [prometheus.remote_write.prom.receiver]
}

prometheus.exporter.unix {
	include_exporter_metrics = true
	disable_collectors       = ["mdadm"]
}

prometheus.remote_write "prom" { }

discovery.docker "linux" {
	host = "unix:///var/run/docker.sock"
	filter {
		name = "status"
		values = ["running"]
	}
}

loki.source.docker "default" {
	host    = "unix:///var/run/docker.sock"
	targets = discovery.relabel.docker.output
	forward_to = [loki.process.final.receiver]
}

discovery.relabel "docker" {
	targets = discovery.docker.linux.targets

	rule {
		source_labels = ["__meta_docker_container_name"]

		target_label = "container_name"
	}

	rule {
		source_labels = ["__meta_docker_container_id"]
		target_label  = "container_id"
	}
}

loki.relabel "journal" {
	forward_to = []

	rule {
		source_labels = ["__journal__systemd_unit"]
		target_label  = "unit"
	}

	rule {
		source_labels = ["__journal_priority_keyword"]
		target_label  = "level"
	}

	rule {
		source_labels = ["__journal__transport"]
		target_label  = "transport"
	}

	rule {
		source_labels = ["__journal__boot_id"]
		target_label  = "boot_id"
	}
}

loki.source.journal "read" {
	forward_to    = [loki.process.final.receiver]
	relabel_rules = loki.relabel.journal.rules
}

loki.process "final" {
	forward_to = [loki.write.grafana_cloud.receiver]

	stage.drop {
		older_than = "1h"
	}
}

loki.write "grafana_cloud" {
	external_labels = {instance = constants.hostname}

	endpoint {
		url = "https://USER:SECRET@logs-prod3.grafana.net/loki/api/v1/push"
	}
}

Logs

# the following logs repeat many times, once for each container; then repeat every few minutes

2023-07-11 19:00:04	ts=2023-07-11T23:00:04.735108377Z component=loki.source.docker.default level=warn target=docker/4f8edb61c3d75c606ddf0159ba4892aee758a53a0a58e9b4faec13fa63457538 msg="could not transfer logs" written=0 container=4f8edb61c3d75c606ddf0159ba4892aee758a53a0a58e9b4faec13fa63457538 err="context canceled"

2023-07-11 19:00:04	ts=2023-07-11T23:00:04.735116325Z component=loki.source.docker.default level=error target=e2a20d798dbb394b802179ed42fd1001ab4a5d0ed66c69649e65df3a65be0221 msg="could not set up a wait request to the Docker client" error=
geekifier added the bug label on Jul 11, 2023

geekifier commented Jul 11, 2023

Since the original problem occurred on an older system, I have also tested the same config on another machine running Docker, with the same results. The other machine is running Linux 6.1.36-v8+ arm64 and Docker 24.0.4.

I also checked the fs.inotify kernel limits on my systems. I was not reaching them, and even after setting them ridiculously high, the errors persisted.

MatthieuBailleul commented Jul 24, 2023

I'm having exactly the same problem as you.
In addition, I see increasing memory consumption; the docker-driver seems to be responsible. I have to restart the Docker engine to recover.

mattdurham added this to the v0.38.0 milestone on Jul 25, 2023

beningodfrey4 commented Aug 5, 2023

Can confirm. Additionally, I see a CPU spike in the agent and Loki containers every minute. It goes away when loki.source.docker is removed from the agent flow config.

Interestingly, the default refresh interval for the prometheus docker SD object is 1 minute as well.

RefreshInterval: time.Minute,
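
For what it's worth, discovery.docker also exposes a refresh_interval argument, so the discovery cadence can be tuned. A minimal sketch (the "linux" label just mirrors the config from the original report, and the 5m value is an arbitrary example):

discovery.docker "linux" {
  host             = "unix:///var/run/docker.sock"
  // The underlying Docker SD defaults to 1 minute (see RefreshInterval above);
  // raising this reduces how often the target list is rebuilt.
  refresh_interval = "5m"
}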

mai-rnd commented Nov 2, 2023

discovery.docker may return several targets with the same container_id (for example, if several ports are exposed). In this case, loki.source.docker tries to fetch logs several times from the same container. I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

rfratto removed this from the v0.38.0 milestone on Nov 2, 2023
@geekifier (Author)

I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

@mai-rnd Can you share an example of a working filter setup?

mai-rnd commented Nov 2, 2023

I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

@mai-rnd Can you share an example of a working filter setup?

I have filtered by port, for example:

discovery.docker "scrapable" {
    host = "unix:///var/run/docker.sock"
}
discovery.relabel "scrapable" {
    targets = discovery.docker.scrapable.targets
    rule {
        source_labels = ["__meta_docker_port_private" ]
        regex         = "8080"
        action        = "keep"
    }
....

But it's not a universal solution; I'm still trying to find a more general approach.
I also checked how it works in Promtail with the docker_sd_configs component: it looks like Promtail filters targets internally, so only unique container_ids end up in the targets list.

@tpaschalis (Member)

Hey there, thanks for the extra datapoint @mai-rnd, will keep it in mind!

it looks like Promtail filters targets internally, so only unique container_ids end up in the targets list

Are you referring to this piece of code?

Indeed, the discovery.docker component is also meant to be used for metrics collection, so it does require the multiple ports to be on separate targets, even if that doesn't make sense for collecting logs. It's not immediately clear how, but I think we might find a way to improve this aspect of log collection.

pddg commented Nov 18, 2023

The documentation gives an example of using loki.source.docker in combination with discovery.docker. In that setup I frequently get errors similar to the ones reported in this issue.
https://grafana.com/docs/agent/latest/flow/reference/components/loki.source.docker/

It seems that the workaround can be even more complicated, because in my environment a single container belongs to multiple networks.
This did not occur when I collected logs in static mode. It is now a blocker for the migration to flow mode 😢

If a solution is not available soon, it might be a good idea to mention the drawbacks of this approach in the documentation.
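
For the multiple-networks case described above, one partial workaround might be to keep only the targets discovered through a single network before handing them to loki.source.docker. A rough sketch (the component labels and my_bridge_network are placeholders, and it only helps when every container you want logs from is attached to that one network):

discovery.relabel "single_network" {
  targets = discovery.docker.linux.targets

  // Keep only targets discovered via one specific network, so a container
  // attached to several networks yields a single target instead of one per network.
  rule {
    source_labels = ["__meta_docker_network_name"]
    regex         = "my_bridge_network"
    action        = "keep"
  }
}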

mai-rnd commented Nov 21, 2023

@tpaschalis yes, you're right, discovery.docker is not only for logs but also for metrics. Still, we need a workaround for logs. For that, I add one more label to each container that exposes ports, e.g. logging_filter: "7050". Here 7050 is just an example; it is one of the ports exposed by the container. Later this label can also be used as a filter to tell prometheus.scrape which port to scrape metrics from, but that's not the topic here.
So, the final config is like:

// ## GET CONTAINERS LIST FROM DOCKER DAEMON AND SET DEFAULT LABELS ##
discovery.docker "all_containers" {
  host = "unix:///var/run/docker.sock"
  refresh_interval = "10s"
}

discovery.relabel "extract_containers_labels" {
  targets = discovery.docker.all_containers.targets
.....
  // get port from __address__
  rule {
    source_labels = [ "__address__" ]
    regex         = "(.*:)(\\d+?)"
    target_label  = "__tmp_port_from_address"
    replacement   = "$2"
    action        = "replace"
  }
}

// ## FILTER UNIQUE CONTAINERS ##
discovery.relabel "filter_containers_without_ports" {
  targets = discovery.relabel.extract_containers_labels.output
  // drop target if port in __address__ is THE SAME as in __meta_docker_port_private
  // at the output we will have containers without published ports.
  rule {
    source_labels = [ "__tmp_port_from_address" ]
    target_label  = "__meta_docker_port_private"
    action        = "dropequal"
  }
}

discovery.relabel "filter_containers_with_ports" {
  targets = discovery.relabel.extract_containers_labels.output
  //keep target if port in __address__ is THE SAME as in __meta_docker_port_private
  //other targets will be dropped
  rule {
    source_labels = [ "__tmp_port_from_address" ]
    target_label  = "__meta_docker_port_private"
    action        = "keepequal"
  }
  //keep only targets where label "logging-filter" match __meta_docker_port_private
  rule {
    source_labels = [ "__meta_docker_port_private" ]
    target_label  = "__meta_docker_container_label_logging_filter"
    action        = "keepequal"
  }
}

// ## ENABLE LOGS COLLECTION FOR CONTAINERS ##
// We don't need logs from all containers, so we have "logging" or "promtail" labels to filter containers.
discovery.relabel "filter_containers_for_logs" {
  targets = concat(discovery.relabel.filter_containers_with_ports.output, discovery.relabel.filter_containers_without_ports.output)
  rule {
    source_labels = [ "__meta_docker_container_label_logging", "__meta_docker_container_label_promtail" ]
    separator     = ""
    regex         = "\\w+"
    action        = "keep"
  }
}

loki.source.docker "infra" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.filter_containers_for_logs.output
  forward_to = [ loki.process.preprocess_containers_logs.receiver ]
}

loki.process "preprocess_containers_logs" {
  stage.docker {}
{% include 'loki_process_common_stages.j2' %} //Just some process code.

  forward_to = [ loki.write.infra_loki_reciever.receiver ] //loki receiver
}

@MichaelSasser

It is quite unusable like this at the moment. I was testing flow mode on a non-critical development environment VM for a few days. It's a relatively small VM with just five containers. They are all in two different networks, and one of them is in three. One container shares four ports with the host, one shares two, another shares none, and the rest share one each. Additionally, IPv6 is enabled in the Docker daemon config, and the ports are shared there too. Every minute, the agent and the Docker daemon max out the CPU for around 10-15 seconds. That time is creeping up slowly, though, if I don't restart the agent from time to time.
When I dropped some of the duplicates by filtering out one network and all IPv6 addresses, the time those two max out the CPU went down quite a bit, but there are still some duplicates left. Filtering them out manually isn't feasible, though; at least not on the other hosts I was planning to switch, because a few hundred short-lived containers are created and run dynamically on some of those hosts every day.

I initially had the WAL enabled for Loki, but it was filling up the 20 GB remaining on the virtual hard drive in minutes. Log-wise, there were 250 MB of logs in the journal and shy of 30 MB in all container log files combined. I don't remember exactly, but I think the journal wasn't even in the config at that time; if I recall correctly, it was still handled by the static config in the "old" agent container. I did a step-by-step transition from the static config to flow mode, where I commented out (and restarted) something in the static config before I deployed and restarted the container with the agent in flow mode (I haven't implemented the config-reloading part in Ansible yet).

(Screenshot: cAdvisor metrics for the agent. The container with the static config, grafana-agent, was only active for a couple of hours until I finished transitioning to flow mode. Their functionality never overlapped, and the graph only shows the grafana-agent-flow container, i.e. the agent running in flow mode.)

@geekifier (Author)

One way to resolve this would be to give us a way to take the exported targets, which are a list of maps, and create a new list of maps that excludes all of the duplicate items.

For example, we could iterate over the list of maps based on the __meta_docker_container_id and keep only the first instance of each __meta_docker_container_id.

Currently the standard library is quite limited in terms of data manipulation functions, but maybe this could be implemented in another way: allow us to set one of the returned fields as a "filter" where only the first instance is returned and the others are discarded.
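
A sketch of what that could look like in River follows; note that distinct() is purely hypothetical and does not exist in the standard library today:

loki.source.docker "default" {
  host       = "unix:///var/run/docker.sock"
  // Hypothetical helper: keep only the first target for each container ID.
  targets    = distinct(discovery.docker.linux.targets, "__meta_docker_container_id")
  forward_to = [loki.process.final.receiver]
}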

fede843 commented Dec 13, 2023

I had the same issue. In my case it was related to dual IPv4 and IPv6 addresses. I checked the positions file and found two entries for each container, with the only difference being:

__meta_docker_port_public_ip="0.0.0.0"
vs 
__meta_docker_port_public_ip="::"

I did a quick test with IPv6 disabled on the system, rebooted, and had no issues after that.
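
If disabling IPv6 system-wide isn't an option, a similar effect might be achievable at discovery time by dropping the duplicate targets that carry the IPv6 public IP. A minimal sketch, assuming the only duplicates come from the "::" address and reusing the component names from the original report:

discovery.relabel "drop_ipv6" {
  targets = discovery.docker.linux.targets

  // Drop the extra targets published on the IPv6 wildcard address ("::"),
  // keeping only the IPv4 ("0.0.0.0") ones.
  rule {
    source_labels = ["__meta_docker_port_public_ip"]
    regex         = "::"
    action        = "drop"
  }
}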

mhoyer commented Dec 15, 2023

I faced this issue too, for different reasons (which are mostly covered above):

  • multiple networks attached
  • dual network stack with ipv4/ipv6
  • more than one port exposed

My hacky solution for my smaller home lab setup:

discovery.docker "default" {
  host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker" {
  targets = discovery.docker.default.targets

+  // Step 1. - generate a unique de-duplication key from common __meta labels
  rule {
    source_labels = ["__meta_docker_container_name", "__meta_docker_container_network_mode", "__meta_docker_port_public", "__meta_docker_port_public_ip"]
    target_label = "__meta_deduplication_key"
  }
}

+// Step 2. - keep only those targets where the de-dup key matches a custom label attached to the container
discovery.relabel "container_filter" {
  targets = discovery.relabel.docker.output

  rule {
    action = "keepequal"
    source_labels = ["__meta_deduplication_key"]
    target_label = "__meta_docker_container_label_deduplication_key_filter"
  }
}

The first discovery.relabel "docker" rule generates additional labels like so:

{
	__address__                                                           = "172.22.0.2:9001",
+	__meta_deduplication_key                                              = "/minio;minio_minio;9001;0.0.0.0",
	__meta_docker_container_name                                          = "/minio",
	__meta_docker_container_network_mode                                  = "minio_minio",
	...
	__meta_docker_network_name                                            = "minio_minio",
	__meta_docker_port_private                                            = "9001",
	__meta_docker_port_public                                             = "9001",
	__meta_docker_port_public_ip                                          = "0.0.0.0"
}

Considering a docker-compose.yaml like this, with two networks and two ports, but also the magic label:

version: "2"

networks:
  minio:

services:
  minio:
    image: quay.io/minio/minio:RELEASE.2023-12-07T04-16-00Z
+    labels:
+      deduplication-key-filter: "grafana-agent;grafana-ag>
    entrypoint: ["sh", "-euc", "minio server /data --console-address ':9001'"]
    networks:
      minio:
      default:
    ports:
      - "9000:9000"
      - "9001:9001"

... we see the corresponding __meta labels:

{
	__address__                                                           = "172.22.0.2:9001",
	__meta_deduplication_key                                              = "/minio;minio_minio;9001;0.0.0.0",
+	__meta_docker_container_label_deduplication_key_filter                = "/minio;minio_minio;9001;0.0.0.0",
	__meta_docker_container_name                                          = "/minio",
	__meta_docker_container_network_mode                                  = "minio_minio",
	...
	__meta_docker_network_name                                            = "minio_minio",
	__meta_docker_port_private                                            = "9001",
	__meta_docker_port_public                                             = "9001",
	__meta_docker_port_public_ip                                          = "0.0.0.0"
}

... which matches the __meta_deduplication_key. Hence, the second discovery.relabel "container_filter" rule keeps this container instance (while the three other instances of the same container that exist in the initial discovery are dropped).

It's a pretty explicit approach, as I now have to label every container with a proper deduplication-key-filter label to pass the filter, but it prevents the errors from popping up for me.

Either way, this issue really should see some kind of solution, e.g. some group-by functionality to aggregate by unique container ID.

Vallefor commented Dec 26, 2023

This bug is absolutely insane. I have containers without exposed ports, with only one network, and only an IPv4 address. The bug still appears, and the Grafana Agent Flow does not push logs from some containers.

I solved it by removing the Docker integration completely (I only need logs).

River example:

loki.write "local" {
  endpoint {
    url = "...."
  }
}

loki.process "drop_ip" {
  forward_to = [loki.write.local.receiver]
  stage.label_drop {
    values = [ "ip_addr","filename",]
  }
}

loki.process "geoip" {    
  forward_to = [loki.relabel.level.receiver]
  stage.geoip {
    source  = "ip_addr"
    db      = "/geoip/GeoLite2-City.mmdb"
    db_type = "city"
  }
  stage.labels {
    values = {
      geoip_city_name          = "",
      geoip_country_name       = "",
    }
  }
}
loki.process "docker_json" {
  forward_to = [loki.process.ip_and_drop.receiver]
  stage.json {
    expressions = {message = "log", timestamp = "time", tag = "attrs.tag" }
  }
  stage.regex {
    expression = ".*\\/(?P<container>.*)"
    source = "tag"
  }
  stage.output {
    source = "message"
  }
  stage.labels {
    values = {
      tag  = "",
      container = "",
    }
  }
}
loki.process "ip_and_drop" {
  forward_to = [loki.process.geoip.receiver]
  stage.regex {
    expression = "^(?P<ip_addr>[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+) "
  }
  stage.labels {
    values = {
      ip_addr = "",
    }
  }
  
  stage.drop {
    older_than          = "1h"
    drop_counter_reason = "too old"
  }
}


local.file_match "docker" {
  path_targets = [{"__path__" = "/var/lib/docker/containers/**/*.log", "instance" = "my_instance", "job" = "docker/my_instance"}]
}
loki.source.file "files" {
  targets    = local.file_match.docker.targets
  forward_to = [loki.process.docker_json.receiver]
}

You also need to add an option to the Docker logging driver:
tag: "my_instance/{{.Name}}"

By the way, without the Docker integration, logs appear much faster (almost instantly). With the integration I had a delay of about 8-12 seconds (even if I changed the interval to 1 second).

@tpaschalis (Member)

Hey folks, happy new year! 🎉

We've merged #6055, which allows loki.source.docker to automatically deduplicate targets with matching container IDs; based on our tests, this should alleviate the problem.

The fix is out on main and will be available in the Agent's v0.40 release in about six weeks. In the meantime, you can try one of the latest images from our Docker Hub repo, or build your own from main.

Let us know how this works for you, and thank you for your patience!

@MichaelSasser

LGTM (I picked the latest commit on main 6965ae2 for testing)

github-actions bot added the frozen-due-to-age label on Feb 21, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Feb 21, 2024