
loki.source.docker error: could not set up a wait request to the Docker client #4403

Closed
geekifier opened this issue Jul 11, 2023 · 16 comments · Fixed by #6055
Labels: bug (Something isn't working), frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.)

Comments

geekifier commented Jul 11, 2023

What's wrong?

I have a very basic Docker log scraping config with the agent running in flow mode. The logs seem to be transmitted normally; however, the grafana-agent constantly logs errors/warnings complaining that it cannot read container logs. The messages are produced for every targeted container and repeat on a regular basis.

Steps to reproduce

Simply run AGENT_MODE=flow grafana-agent --server.http.listen-addr=0.0.0.0:12345 run /etc/grafana-agent.river with the provided config on the affected host.

The agent is running as user grafana-agent, which is a member of the following groups: grafana-agent, adm, systemd-journal, docker

System information

Linux 4.4.0-92-generic x86_64

Software version

Grafana Agent v0.34.3

Configuration

logging {
  level  = "warn"
  format = "logfmt"
}

prometheus.scrape "default" {
	targets    = [{"__address__" = "localhost:12345"}]
	forward_to = [prometheus.remote_write.prom.receiver]
}

prometheus.exporter.unix {
	include_exporter_metrics = true
	disable_collectors       = ["mdadm"]
}

prometheus.remote_write "prom" { }

discovery.docker "linux" {
	host = "unix:///var/run/docker.sock"
	filter {
		name = "status"
		values = ["running"]
	}
}

loki.source.docker "default" {
	host    = "unix:///var/run/docker.sock"
	targets = discovery.relabel.docker.output
	forward_to = [loki.process.final.receiver]
}

discovery.relabel "docker" {
	targets = discovery.docker.linux.targets

	rule {
		source_labels = ["__meta_docker_container_name"]

		target_label = "container_name"
	}

	rule {
		source_labels = ["__meta_docker_container_id"]
		target_label  = "container_id"
	}
}

loki.relabel "journal" {
	forward_to = []

	rule {
		source_labels = ["__journal__systemd_unit"]
		target_label  = "unit"
	}

	rule {
		source_labels = ["__journal_priority_keyword"]
		target_label  = "level"
	}

	rule {
		source_labels = ["__journal__transport"]
		target_label  = "transport"
	}

	rule {
		source_labels = ["__journal__boot_id"]
		target_label  = "boot_id"
	}
}

loki.source.journal "read" {
	forward_to    = [loki.process.final.receiver]
	relabel_rules = loki.relabel.journal.rules
}

loki.process "final" {
	forward_to = [loki.write.grafana_cloud.receiver]

	stage.drop {
		older_than = "1h"
	}
}

loki.write "grafana_cloud" {
	external_labels = {instance = constants.hostname}

	endpoint {
		url = "https://USER:SECRET@logs-prod3.grafana.net/loki/api/v1/push"
	}
}

Logs

# the following logs repeat many times, once for each container; then repeat every few minutes

2023-07-11 19:00:04	ts=2023-07-11T23:00:04.735108377Z component=loki.source.docker.default level=warn target=docker/4f8edb61c3d75c606ddf0159ba4892aee758a53a0a58e9b4faec13fa63457538 msg="could not transfer logs" written=0 container=4f8edb61c3d75c606ddf0159ba4892aee758a53a0a58e9b4faec13fa63457538 err="context canceled"

2023-07-11 19:00:04	ts=2023-07-11T23:00:04.735116325Z component=loki.source.docker.default level=error target=e2a20d798dbb394b802179ed42fd1001ab4a5d0ed66c69649e65df3a65be0221 msg="could not set up a wait request to the Docker client" error=
geekifier added the bug label on Jul 11, 2023

geekifier commented Jul 11, 2023

Since the original problem occurred on an older system, I have also tested the same config on another machine running Docker, with the same results. The other machine is running Linux 6.1.36-v8+ arm64 and Docker 24.0.4.

I also checked the fs.inotify kernel limits on my systems. I was not reaching them, and even after setting them ridiculously high, the errors persisted.

MatthieuBailleul commented Jul 24, 2023

I'm having exactly the same problem as you.
In addition, I see increasing memory consumption; the docker-driver seems to be responsible. I have to restart the Docker engine to recover.

mattdurham added this to the v0.38.0 milestone on Jul 25, 2023

beningodfrey4 commented Aug 5, 2023

Can confirm. Additionally, I see a CPU spike in the agent and Loki containers every minute. It goes away when loki.source.docker is removed from the agent flow config.

Interestingly, the default refresh interval for the prometheus docker SD object is 1 minute as well.

RefreshInterval: time.Minute,
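
For what it's worth, discovery.docker also exposes a refresh_interval argument, so the discovery cadence can be tuned. A minimal sketch (the "linux" label just mirrors the config from the original report, and the 5m value is an arbitrary example):

discovery.docker "linux" {
  host             = "unix:///var/run/docker.sock"
  // The underlying Docker SD defaults to 1 minute (see RefreshInterval above);
  // raising this reduces how often the target list is rebuilt.
  refresh_interval = "5m"
}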

mai-rnd commented Nov 2, 2023

discovery.docker may return several targets with the same container_id (for example, if several ports are exposed). In this case, loki.source.docker tries to fetch logs several times from the same container. I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

rfratto removed this from the v0.38.0 milestone on Nov 2, 2023
@geekifier (Author)

I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

@mai-rnd Can you share an example of a working filter setup?

mai-rnd commented Nov 2, 2023

I applied a filter to make sure that loki.source.docker only has one instance of each container in its targets, and the could not set up a wait request to the Docker client errors went away.

@mai-rnd Can you share an example of a working filter setup?

I have filtered by port, for example:

discovery.docker "scrapable" {
    host = "unix:///var/run/docker.sock"
}
discovery.relabel "scrapable" {
    targets = discovery.docker.scrapable.targets
    rule {
        source_labels = ["__meta_docker_port_private" ]
        regex         = "8080"
        action        = "keep"
    }
....

But it's not a universal solution; I'm still trying to find a more general approach.
I also checked how it works in Promtail with the docker_sd_configs component: it looks like Promtail filters targets internally, so only unique container_ids end up in the targets list.

@tpaschalis (Member)

Hey there, thanks for the extra datapoint @mai-rnd, will keep it in mind!

it looks like Promtail filters targets internally, so only unique container_ids end up in the targets list

Are you referring to this piece of code?

Indeed, the discovery.docker component is also meant to be used for metrics collection, so it does require the multiple ports to be on separate targets, even if that doesn't make sense for collecting logs. It's not immediately clear how, but I think we might find a way to improve this aspect of log collection.

pddg commented Nov 18, 2023

The documentation gives an example of using loki.source.docker in combination with discovery.docker. In that setup I frequently get errors similar to the ones reported in this issue.
https://grafana.com/docs/agent/latest/flow/reference/components/loki.source.docker/

It seems that the workaround can be even more complicated, because in my environment a single container belongs to multiple networks.
This did not occur when I collected logs in static mode. It is now a blocker for the migration to flow mode 😢

If a solution is not available soon, it might be a good idea to mention the drawbacks of this approach in the documentation.
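
For the multiple-networks case described above, one partial workaround might be to keep only the targets discovered through a single network before handing them to loki.source.docker. A rough sketch (the component labels and my_bridge_network are placeholders, and it only helps when every container you want logs from is attached to that one network):

discovery.relabel "single_network" {
  targets = discovery.docker.linux.targets

  // Keep only targets discovered via one specific network, so a container
  // attached to several networks yields a single target instead of one per network.
  rule {
    source_labels = ["__meta_docker_network_name"]
    regex         = "my_bridge_network"
    action        = "keep"
  }
}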

mai-rnd commented Nov 21, 2023

@tpaschalis yes, you're right, discovery.docker is not only for logs but also for metrics. Still, we need a workaround for logs. For that, I add one more label to each container that exposes ports, e.g. logging_filter: "7050". Here 7050 is just an example; it is one of the ports exposed by the container. Later this label can also be used as a filter to tell prometheus.scrape which port to scrape metrics from, but that's not the topic here.
So, the final config is like:

// ## GET CONTAINERS LIST FROM DOCKER DAEMON AND SET DEFAULT LABELS ##
discovery.docker "all_containers" {
  host = "unix:///var/run/docker.sock"
  refresh_interval = "10s"
}

discovery.relabel "extract_containers_labels" {
  targets = discovery.docker.all_containers.targets
.....
  // get port from __address__
  rule {
    source_labels = [ "__address__" ]
    regex         = "(.*:)(\\d+?)"
    target_label  = "__tmp_port_from_address"
    replacement   = "$2"
    action        = "replace"
  }
}

// ## FILTER UNIQUE CONTAINERS ##
discovery.relabel "filter_containers_without_ports" {
  targets = discovery.relabel.extract_containers_labels.output
  // drop target if port in __address__ is THE SAME as in __meta_docker_port_private
  // at the output we will have containers without published ports.
  rule {
    source_labels = [ "__tmp_port_from_address" ]
    target_label  = "__meta_docker_port_private"
    action        = "dropequal"
  }
}

discovery.relabel "filter_containers_with_ports" {
  targets = discovery.relabel.extract_containers_labels.output
  //keep target if port in __address__ is THE SAME as in __meta_docker_port_private
  //other targets will be dropped
  rule {
    source_labels = [ "__tmp_port_from_address" ]
    target_label  = "__meta_docker_port_private"
    action        = "keepequal"
  }
  //keep only targets where label "logging-filter" match __meta_docker_port_private
  rule {
    source_labels = [ "__meta_docker_port_private" ]
    target_label  = "__meta_docker_container_label_logging_filter"
    action        = "keepequal"
  }
}

// ## ENABLE LOGS COLLECTION FOR CONTAINERS ##
// We don't need logs from all containers, so we have "logging" or "promtail" labels to filter containers.
discovery.relabel "filter_containers_for_logs" {
  targets = concat(discovery.relabel.filter_containers_with_ports.output, discovery.relabel.filter_containers_without_ports.output)
  rule {
    source_labels = [ "__meta_docker_container_label_logging", "__meta_docker_container_label_promtail" ]
    separator     = ""
    regex         = "\\w+"
    action        = "keep"
  }
}

loki.source.docker "infra" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.filter_containers_for_logs.output
  forward_to = [ loki.process.preprocess_containers_logs.receiver ]
}

loki.process "preprocess_containers_logs" {
  stage.docker {}
{% include 'loki_process_common_stages.j2' %} //Just some process code.

  forward_to = [ loki.write.infra_loki_reciever.receiver ] //loki receiver
}

@MichaelSasser

It is quite unusable like this at the moment. I was testing flow mode on a non-critical development environment VM for a few days. It's a relatively small VM with just five containers. They are all in two different networks, and one of them is in three. One container shares four ports with the host, one shares two, another shares none, and the rest share one each. Additionally, IPv6 is enabled in the Docker daemon config, and the ports are shared there too. Every minute, the agent and the Docker daemon max out the CPU for around 10-15 seconds. That time is creeping up slowly, though, if I don't restart the agent from time to time.
When I dropped some of the duplicates by filtering out one network and all IPv6 addresses, the time those two max out the CPU went down quite a bit, but there are still some duplicates left. Filtering them out manually isn't feasible, though; at least not on the other hosts I was planning to switch, because a few hundred short-lived containers are created and run dynamically on some of those hosts every day.

I initially had the WAL enabled for Loki, but it was filling up the 20 GB remaining on the virtual hard drive in minutes. Log-wise, there were 250 MB of logs in the journal and shy of 30 MB in all container log files combined. I don't remember exactly, but I think the journal wasn't even in the config at that time; if I recall correctly, it was still handled by the static config in the "old" agent container. I did a step-by-step transition from the static config to flow mode, where I commented out (and restarted) something in the static config before I deployed and restarted the container with the agent in flow mode (I haven't implemented the config-reloading part in Ansible yet).

(Screenshot: cAdvisor metrics for the agent. The container with the static config, grafana-agent, was only active for a couple of hours until I finished transitioning to flow mode. Their functionality never overlapped, and the graph only shows the grafana-agent-flow container, i.e. the agent running in flow mode.)

@geekifier (Author)

One way to resolve this would be to give us a way to take the exported targets, which are a list of maps, and create a new list of maps that excludes all of the duplicate items.

For example, we could iterate over the list of maps based on the __meta_docker_container_id and keep only the first instance of each __meta_docker_container_id.

Currently the standard library is quite limited in terms of data manipulation functions, but maybe this could be implemented in another way: allow us to set one of the returned fields as a "filter" where only the first instance is returned and the others are discarded.
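
A sketch of what that could look like in River follows; note that distinct() is purely hypothetical and does not exist in the standard library today:

loki.source.docker "default" {
  host       = "unix:///var/run/docker.sock"
  // Hypothetical helper: keep only the first target for each container ID.
  targets    = distinct(discovery.docker.linux.targets, "__meta_docker_container_id")
  forward_to = [loki.process.final.receiver]
}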

fede843 commented Dec 13, 2023

I had the same issue. In my case it was related to dual IPv4 and IPv6 addresses. I checked the positions file and found two entries for each container, with the only difference being:

__meta_docker_port_public_ip="0.0.0.0"
vs 
__meta_docker_port_public_ip="::"

I did a quick test with IPv6 disabled on the system, rebooted, and had no issues after that.
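
If disabling IPv6 system-wide isn't an option, a similar effect might be achievable at discovery time by dropping the duplicate targets that carry the IPv6 public IP. A minimal sketch, assuming the only duplicates come from the "::" address and reusing the component names from the original report:

discovery.relabel "drop_ipv6" {
  targets = discovery.docker.linux.targets

  // Drop the extra targets published on the IPv6 wildcard address ("::"),
  // keeping only the IPv4 ("0.0.0.0") ones.
  rule {
    source_labels = ["__meta_docker_port_public_ip"]
    regex         = "::"
    action        = "drop"
  }
}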

mhoyer commented Dec 15, 2023

I faced this issue too, for different reasons (which are mostly covered above):

  • multiple networks attached
  • dual network stack with ipv4/ipv6
  • more than one port exposed

My hacky solution for my smaller home lab setup:

discovery.docker "default" {
  host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker" {
  targets = discovery.docker.default.targets

+  // Step 1. - generate a unique de-duplication key from common __meta labels
  rule {
    source_labels = ["__meta_docker_container_name", "__meta_docker_container_network_mode", "__meta_docker_port_public", "__meta_docker_port_public_ip"]
    target_label = "__meta_deduplication_key"
  }
}

+// Step 2. - keep only those targets where the de-dup key matches a custom label attached to the container
discovery.relabel "container_filter" {
  targets = discovery.relabel.docker.output

  rule {
    action = "keepequal"
    source_labels = ["__meta_deduplication_key"]
    target_label = "__meta_docker_container_label_deduplication_key_filter"
  }
}

The first discovery.relabel "docker" rule generates additional labels like so:

{
	__address__                                                           = "172.22.0.2:9001",
+	__meta_deduplication_key                                              = "/minio;minio_minio;9001;0.0.0.0",
	__meta_docker_container_name                                          = "/minio",
	__meta_docker_container_network_mode                                  = "minio_minio",
	...
	__meta_docker_network_name                                            = "minio_minio",
	__meta_docker_port_private                                            = "9001",
	__meta_docker_port_public                                             = "9001",
	__meta_docker_port_public_ip                                          = "0.0.0.0"
}

Considering a docker-compose.yaml like this, with two networks and two ports, but also the magic label:

version: "2"

networks:
  minio:

services:
  minio:
    image: quay.io/minio/minio:RELEASE.2023-12-07T04-16-00Z
+    labels:
+      deduplication-key-filter: "grafana-agent;grafana-ag>
    entrypoint: ["sh", "-euc", "minio server /data --console-address ':9001'"]
    networks:
      minio:
      default:
    ports:
      - "9000:9000"
      - "9001:9001"

... we see the corresponding __meta labels:

{
	__address__                                                           = "172.22.0.2:9001",
	__meta_deduplication_key                                              = "/minio;minio_minio;9001;0.0.0.0",
+	__meta_docker_container_label_deduplication_key_filter                = "/minio;minio_minio;9001;0.0.0.0",
	__meta_docker_container_name                                          = "/minio",
	__meta_docker_container_network_mode                                  = "minio_minio",
	...
	__meta_docker_network_name                                            = "minio_minio",
	__meta_docker_port_private                                            = "9001",
	__meta_docker_port_public                                             = "9001",
	__meta_docker_port_public_ip                                          = "0.0.0.0"
}

... which matches the __meta_deduplication_key. Hence, the second discovery.relabel "container_filter" rule keeps this container instance (while the three other instances of the same container that exist in the initial discovery are dropped).

It's a pretty explicit approach, as I now have to label every container with a proper deduplication-key-filter label to pass the filter, but it prevents the errors from popping up for me.

Either way, this issue really should see some kind of solution, e.g. some group-by functionality to aggregate by unique container ID.

Vallefor commented Dec 26, 2023

This bug is absolutely insane. I have containers without exposed ports, with only one network, and only an IPv4 address. The bug still appears, and the Grafana Agent Flow does not push logs from some containers.

I solved it by removing the Docker integration completely (I only need logs).

River example:

loki.write "local" {
  endpoint {
    url = "...."
  }
}

loki.process "drop_ip" {
  forward_to = [loki.write.local.receiver]
  stage.label_drop {
    values = [ "ip_addr","filename",]
  }
}

loki.process "geoip" {    
  forward_to = [loki.relabel.level.receiver]
  stage.geoip {
    source  = "ip_addr"
    db      = "/geoip/GeoLite2-City.mmdb"
    db_type = "city"
  }
  stage.labels {
    values = {
      geoip_city_name          = "",
      geoip_country_name       = "",
    }
  }
}
loki.process "docker_json" {
  forward_to = [loki.process.ip_and_drop.receiver]
  stage.json {
    expressions = {message = "log", timestamp = "time", tag = "attrs.tag" }
  }
  stage.regex {
    expression = ".*\\/(?P<container>.*)"
    source = "tag"
  }
  stage.output {
    source = "message"
  }
  stage.labels {
    values = {
      tag  = "",
      container = "",
    }
  }
}
loki.process "ip_and_drop" {
  forward_to = [loki.process.geoip.receiver]
  stage.regex {
    expression = "^(?P<ip_addr>[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+) "
  }
  stage.labels {
    values = {
      ip_addr = "",
    }
  }
  
  stage.drop {
    older_than          = "1h"
    drop_counter_reason = "too old"
  }
}


local.file_match "docker" {
  path_targets = [{"__path__" = "/var/lib/docker/containers/**/*.log", "instance" = "my_instance", "job" = "docker/my_instance"}]
}
loki.source.file "files" {
  targets    = local.file_match.docker.targets
  forward_to = [loki.process.docker_json.receiver]
}

You also need to add an option to the Docker logging driver:
tag: "my_instance/{{.Name}}"

By the way, without the Docker integration, logs appear much faster (almost instantly). With the integration I had a delay of about 8-12 seconds (even if I changed the interval to 1 second).

@tpaschalis (Member)

Hey folks, happy new year! 🎉

We've merged #6055, which allows loki.source.docker to automatically deduplicate targets with matching container IDs; based on our tests, this should alleviate the problem.

The fix is out on main and will be available in the Agent's v0.40 release in about six weeks. In the meantime, you can try one of the latest images from our Docker Hub repo, or build your own from main.

Let us know how this works for you, and thank you for your patience!

@MichaelSasser

LGTM (I picked the latest commit on main 6965ae2 for testing)

github-actions bot added the frozen-due-to-age label on Feb 21, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Feb 21, 2024