loki.source.docker: could not set up a wait request to the Docker client is still displayed when the container belongs to multiple networks. #225

Closed
pddg opened this issue Mar 3, 2024 · 10 comments · Fixed by #1017
Labels
bug, frozen-due-to-age

Comments

@pddg

pddg commented Mar 3, 2024

What's wrong?

grafana/agent#6055 will fix grafana/agent#4403. However, the same error still occurs when the Docker container belongs to two or more networks.
The error appears to show up more often as the number of networks the container belongs to increases, or as the discovery frequency increases.

This consumes CPU, and the error logs continue to appear.

Steps to reproduce

Docker and grafana-agent-flow are required. I tested this on Ubuntu 22.04 LTS.

ubuntu@testnode:~$ grafana-agent-flow -v
agent, version v0.40.1 (branch: HEAD, revision: f222e2e08)
  build user:       root@c2af75d2b110
  build date:       2024-02-29T19:50:44Z
  go version:       go1.22.0
  platform:         linux/amd64
  tags:             netgo,builtinassets,promtail_journal_enabled
ubuntu@testnode:~$ sudo docker version
Client: Docker Engine - Community
 Version:           25.0.3
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        4debf41
 Built:             Tue Feb  6 21:13:09 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.3
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       f417435
  Built:            Tue Feb  6 21:13:09 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
ubuntu@testnode:~$

Create a Docker container and connect it to a user-defined network.

sudo docker network create testnet1
sudo docker run -d --name testnginx --network testnet1 nginx:latest

# Then, create another network and connect the container.
sudo docker network create testnet2
sudo docker network connect testnet2 testnginx

# Ensure that the container belongs to two networks
sudo docker inspect testnginx | jq '.[] | .NetworkSettings.Networks | keys[]'
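# If both networks were attached successfully, the command above should
# print both names: "testnet1" and "testnet2".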

Wait until the agent discovers the container.

System information

Linux 5.15.0-92 x86_64

Software version

Grafana Agent Flow v0.40.1

Configuration

logging {
	level = "warn"
}

loki.echo "stdout" {}

discovery.docker "linux" {
    host = "unix:///var/run/docker.sock"
}

loki.source.docker "docker" {
    host = "unix:///var/run/docker.sock"
    targets = discovery.docker.linux.targets
    forward_to = [loki.echo.stdout.receiver]
}

Logs

Mar 03 03:58:10 testnode grafana-agent-flow[17376]: ts=2024-03-03T03:58:10.254744904Z level=error msg="could not set up a wait request to the Docker client" target=5d048aeb40e70ce8356f6192ea676b9fbc0d9a339b685e657f96f54c922f8634 component=loki.source.docker.docker error=""
Mar 03 03:58:10 testnode grafana-agent-flow[17376]: ts=2024-03-03T03:58:10.254856313Z level=warn msg="could not transfer logs" component=loki.source.docker.docker target=docker/5d048aeb40e70ce8356f6192ea676b9fbc0d9a339b685e657f96f54c922f8634 written=0 container=5d048aeb40e70ce8356f6192ea676b9fbc0d9a339b685e657f96f54c922f8634 err="context canceled"
pddg added the bug label Mar 3, 2024
@fede843

fede843 commented Mar 5, 2024

I was testing the same thing, also with agent 0.40.1 (Docker). I was expecting to migrate off Promtail after the fix for grafana/agent#4403, but I still found some issues. I am getting this same error continuously in the logs too.

@fede843

fede843 commented Mar 6, 2024

FYI, in my case I am using just one user-defined network.

@pirm-in

pirm-in commented Mar 20, 2024

Experiencing exactly the same with Grafana Agent Flow v0.40.2 (running in a Docker container, with /var/run/docker.sock mounted).

Continuous logs like the following:

monitoring_grafana_agent_loki.xxx   | ts=2024-03-20T10:26:36.895262631Z level=error msg="could not set up a wait request to the Docker client" target=82781a533d3213e6be44a7f44badabe83fb2e50dd42971d2d121cd62fd673761 component=loki.source.docker.default error=""
monitoring_grafana_agent_loki.xxx   | ts=2024-03-20T10:26:36.895733428Z level=warn msg="could not transfer logs" component=loki.source.docker.default target=docker/82781a533d3213e6be44a7f44badabe83fb2e50dd42971d2d121cd62fd673761 written=0 container=82781a533d3213e6be44a7f44badabe83fb2e50dd42971d2d121cd62fd673761 err="context canceled"

... just for different targets/containers.

Most logs do arrive in Loki (Grafana Cloud), but I am not sure what the error logs actually mean.

@rfratto
Member

rfratto commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent Apr 11, 2024
@rhaidiz

rhaidiz commented Jun 7, 2024

I've experienced the same issue with Grafana Alloy v1.1.1 (branch: HEAD, revision: 2687a2d), but only when I start a container that runs caddy 🤔.

@fede843

fede843 commented Jun 10, 2024

This is still blocking us from migrating off Promtail. Still broken in Alloy 1.1.1.

grafana-alloy  | {"ts":"2024-06-10T12:14:04.358608295Z","level":"warn","msg":"could not transfer logs","component_path":"/","component_id":"loki.source.docker.docker_sd_promtail","target":"docker/30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","written":0,"container":"30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","err":"context canceled"}
grafana-alloy  | {"ts":"2024-06-10T12:14:04.358724766Z","level":"info","msg":"removing entry from positions file","component_path":"/","component_id":"loki.source.docker.docker_sd_promtail","path":"cursor-30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","labels":"{__address__=\"172.20.0.3:8080\",  __meta_docker_container_label_org_opencontainers_image_version=\"20230927-df1772f\", __meta_docker_container_label_org_springframework_boot_version=\"2.5.3\", __meta_docker_container_name=\"/eureka\", __meta_docker_container_network_mode=\"5c83f81ec790ee4054b008fd690cb3994acc85921937280f0469f64737af2b31\", __meta_docker_network_id=\"5c83f81ec790ee4054b008fd690cb3994acc85921937280f0469f64737af2b31\", __meta_docker_network_ingress=\"false\", __meta_docker_network_internal=\"false\", __meta_docker_network_ip=\"172.20.0.3\", __meta_docker_network_name=\"mainnet\", __meta_docker_network_scope=\"local\", __meta_docker_port_private=\"8080\", __meta_docker_port_public=\"28081\", __meta_docker_port_public_ip=\"0.0.0.0\"}"}
grafana-alloy  | {"ts":"2024-06-10T12:14:19.359315343Z","level":"error","msg":"error inspecting Docker container","target":"30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","component_path":"/","component_id":"loki.source.docker.docker_sd_promtail","id":"30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","error":"Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.44/containers/30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485/json\": context canceled"}
grafana-alloy  | {"ts":"2024-06-10T12:14:19.359372343Z","level":"warn","msg":"could not transfer logs","component_path":"/","component_id":"loki.source.docker.docker_sd_promtail","target":"docker/30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","written":0,"container":"30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","err":"context canceled"}
grafana-alloy  | {"ts":"2024-06-10T12:14:19.359499955Z","level":"info","msg":"removing entry from positions file","component_path":"/","component_id":"loki.source.docker.docker_sd_promtail","path":"cursor-30d76c92cf298fd0bf2886d5cf9dec6267b71e6c99f478559a1c19114e023485","labels":"{__address__=\"172.20.0.3:8761\", __meta_docker_container_label_org_opencontainers_image_version=\"20230927-df1772f\", __meta_docker_container_label_org_springframework_boot_version=\"2.5.3\", __meta_docker_container_name=\"/eureka\", __meta_docker_container_network_mode=\"5c83f81ec790ee4054b008fd690cb3994acc85921937280f0469f64737af2b31\", __meta_docker_network_id=\"5c83f81ec790ee4054b008fd690cb3994acc85921937280f0469f64737af2b31\", __meta_docker_network_ingress=\"false\", __meta_docker_network_internal=\"false\", __meta_docker_network_ip=\"172.20.0.3\", __meta_docker_network_name=\"mainnet\", __meta_docker_network_scope=\"local\", __meta_docker_port_private=\"8761\", __meta_docker_port_public=\"8761\", __meta_docker_port_public_ip=\"0.0.0.0\"}"}

As I said before, it is a simple container, using one network and exposing two ports. The only difference between the two entries above is the port.

@fede843

fede843 commented Jun 10, 2024

More to it: as soon as I comment out one of the two exposed ports in the app's compose file, the error is gone. One mapped port behaves fine; two exposed mapped ports don't.

Also, with the two ports mapped, I checked the positions.yml file, and it is jumping all around. It contains one entry for the container, which is fine, but the labels are being randomly updated, sometimes with one port and sometimes with the other.
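For reference, judging from the two log entries above, the two discovered targets for this single container presumably differ only in their port-related labels, roughly:

target A: __address__="172.20.0.3:8080", __meta_docker_port_private="8080", __meta_docker_port_public="28081", container 30d76c92…
target B: __address__="172.20.0.3:8761", __meta_docker_port_private="8761", __meta_docker_port_public="8761", container 30d76c92…

Whichever of the two wins on a given discovery refresh determines the labels written to positions.yml, which would explain the flip-flopping.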

I am happy to further assist if needed.

@wildum
Contributor

wildum commented Jun 11, 2024

Hey, thanks for reporting the bug and investigating! I had a look at it:
The discovery.docker component creates one target for every network attached to the same container. This means that loki.source.docker receives multiple targets with the same container ID, in random order. An earlier fix introduced a filter in the loki.source.docker component to avoid collecting logs multiple times from the same container, but because the order is not guaranteed, if both targetA and targetB have the same container ID, whichever happens to come first in the list is selected.
As a consequence, if a worker is collecting logs for targetA and an update arrives with targetB instead, the worker is cancelled and a new one is started for targetB. This produces the error logs about cancelled workers.
The new fix sorts the targets by their label fingerprints before filtering on the container ID. This way, if multiple targets share a container ID, the same one is always selected for log collection.
I will open a PR.
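To illustrate the idea, here is a minimal Go sketch of that approach (not the actual Alloy code); the __meta_docker_container_id label name and the fingerprint helper are assumptions made for this sketch:

// Minimal sketch: sort targets by a stable fingerprint of their labels, then keep
// the first target per container ID, so the per-container choice is deterministic.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Target map[string]string

// fingerprint builds a deterministic string from all labels (a stand-in for the
// label hash used in the real component).
func fingerprint(t Target) string {
	keys := make([]string, 0, len(t))
	for k := range t {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + t[k] + ";")
	}
	return b.String()
}

// dedupeByContainer keeps exactly one target per container ID. Sorting first makes
// the winner independent of discovery order, so the tailing worker for a container
// is no longer cancelled and restarted on every refresh.
func dedupeByContainer(targets []Target) []Target {
	sort.Slice(targets, func(i, j int) bool {
		return fingerprint(targets[i]) < fingerprint(targets[j])
	})
	seen := map[string]struct{}{}
	var out []Target
	for _, t := range targets {
		id := t["__meta_docker_container_id"] // assumed container ID label
		if _, ok := seen[id]; ok {
			continue
		}
		seen[id] = struct{}{}
		out = append(out, t)
	}
	return out
}

func main() {
	// Two targets for the same container, one per network, in arbitrary order.
	targets := []Target{
		{"__meta_docker_container_id": "5d048aeb", "__meta_docker_network_name": "testnet2"},
		{"__meta_docker_container_id": "5d048aeb", "__meta_docker_network_name": "testnet1"},
	}
	fmt.Println(dedupeByContainer(targets)) // always selects the same single target
}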

@wildum
Contributor

wildum commented Jun 12, 2024

The fix has been merged to main; it will be available in the next Alloy release (v1.2.0)!

@fede843

fede843 commented Jun 12, 2024

That's great. Thanks!

github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2024