Timeout with inputs.snmp polling squid-cache #9286

Closed
BenCastricum opened this issue May 20, 2021 · 23 comments · Fixed by #10331
Labels
area/snmp, feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin)

Comments

BenCastricum commented May 20, 2021

Relevant telegraf.conf:

[[inputs.snmp]]
  interval = "30s"
  agents = [ "udp://192.168.3.1:161" ]
  version = 2
  community = "public"
  name = "squid"

  [[inputs.snmp.field]]
    name = "cacheSoftware"
    oid = "1.3.6.1.4.1.3495.1.2.2.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "cacheVersionId"
    oid = "1.3.6.1.4.1.3495.1.2.3.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "cacheProtoClientHttpRequests"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.1.0"

System info:

Running the telegraf:latest Docker image on Ubuntu:

I have no name!@TelegrafHost:/$ telegraf --version
Telegraf 1.18.2 (git: HEAD a614372)

Docker

version: "3"

services:
  telegraf:
    image: telegraf
    hostname: TelegrafHost
    user: 201:201
    command: --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d
    ports:
      - 6514:6514/udp
      - 162:6162/udp
    volumes:
      - /home/docker/etc/telegraf:/etc/telegraf:ro
      - /proc:/host/proc:ro
      - /home/docker/.snmp/mibs:/usr/share/snmp/mibs:ro
    networks:
       influxdatanet:

Steps to reproduce:

  1. Configure the telegraf Docker image to poll SNMP statistics from squid-cache

Expected behavior:

From the CLI inside the Docker container, snmpget works as expected:

I have no name!@TelegrafHost:/$ snmpget -v2c -c public 192.168.3.1 1.3.6.1.4.1.3495.1.2.2.0
iso.3.6.1.4.1.3495.1.2.2.0 = STRING: "squid"

Actual behavior:

Docker logs show timeout errors for this specific agent:

2021-05-20T14:28:10Z E! [inputs.snmp] Error in plugin: agent udp://192.168.3.1:161: performing get on field cacheSoftware: request timeout (after 3 retries)

(lots of these)

Additional info:

Other inputs.snmp agents work; just this specific one times out.

A tcpdump capture shows this when Telegraf polls:

16:31:35.005000 IP 172.18.0.5.35773 > 192.168.3.1.161: GetRequest(31) .1.3.6.1.4.1.3495.1.2.2.0
16:31:35.005274 IP 172.18.0.1.161 > 172.18.0.5.35773: GetResponse(40) .1.3.6.1.4.1.3495.1.2.2.0="squid"
16:31:35.005313 IP 172.18.0.5 > 172.18.0.1: ICMP 172.18.0.5 udp port 35773 unreachable, length 95

but an snmpget from the CLI does not trigger the ICMP error. (Does Telegraf stop listening on the port before the answer is received?)
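
For illustration, a minimal Go sketch (not Telegraf's actual code; the agent address is taken from the config above) of why a connected UDP socket produces exactly this pattern: the kernel delivers only datagrams whose source matches the peer given to DialUDP, so a reply from any other address is refused with ICMP "port unreachable" and the client times out.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    peer, err := net.ResolveUDPAddr("udp", "192.168.3.1:161")
    if err != nil {
        panic(err)
    }

    // DialUDP connect()s the socket to 192.168.3.1:161.
    conn, err := net.DialUDP("udp", nil, peer)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Stand-in for the SNMP GetRequest seen in the tcpdump above.
    if _, err := conn.Write([]byte("request")); err != nil {
        panic(err)
    }

    // A reply arriving from any other source (e.g. the NATed 172.18.0.1)
    // never reaches this socket; the kernel refuses it, which is what
    // produces the ICMP "port unreachable" in the capture.
    _ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
    buf := make([]byte, 1500)
    if _, err := conn.Read(buf); err != nil {
        fmt.Println("read failed:", err) // i/o timeout, like Telegraf's "request timeout"
    }
}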

Test run output:

I have no name!@TelegrafHost:/$ telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d
<...>
2021-05-20T14:09:28Z E! [inputs.snmp] Error in plugin: agent udp://192.168.3.1:161: performing get on field cacheSoftware: request timeout (after 3 retries)
I have no name!@TelegrafHost:/$

BenCastricum added the bug (unexpected problem or unintended behavior) label May 20, 2021
BenCastricum (Author) commented:

Using the same Telegraf config outside the Docker container works as expected, so there seems to be some relation to Docker.

Hipska commented May 20, 2021

In that case, it seems more relevant to create an issue here: https://github.com/influxdata/influxdata-docker

But looking at your stack trace, do you also have the snmp_trap plugin configured? It seems that is the one emitting the panic. And what version of Telegraf does the container have?

BenCastricum (Author) commented:

Inside the container:

I have no name!@TelegrafHost:/$ telegraf --version
Telegraf 1.18.2 (git: HEAD a614372)

The stack trace is indeed from the snmp_trap plugin being unable to bind to the socket during the --test run, which makes sense because it is already in use by the main process.

I would agree it's a Docker-only thing if snmpget inside the container had the same issue, but snmpget works fine and Telegraf does not. Weird.

Hipska commented May 21, 2021

Could you update the output with the results of a test without snmp_trap plugin configured?

BenCastricum changed the title from "Timeout / panic runtime error with inputs.snmp polling squid-cache" to "Timeout with inputs.snmp polling squid-cache" May 21, 2021
BenCastricum (Author) commented:

Without it, no panic occurs. I updated the report to show just the timeout error. Should I file a new bug report for the snmp_trap plugin panic? It's not really bothering me.

Hipska commented May 21, 2021

That would be nice. The panic should not occur just because the port is already in use for listening; an error message would be expected in that case.

About your current issue: while I still think it is a Docker-related problem (since you said it works perfectly when not running in Docker) and it should be filed in the correct repository, I noticed something in your tcpdump that might give a clue:

So you are sending the SNMP request to 192.168.3.1, but the response comes from a totally different IP (172.18.0.1). That might be the reason it is not working. Do you have an idea where this comes from? And does that also happen when running Telegraf on the host instead of in a container?

BenCastricum (Author) commented:

You have a good point there. The reply isn't properly NATed. Telegraf does the right thing here and rejects the packet, while snmpget does not seem to care that the reply comes from a different IP address. This page more or less confirms that.

The Docker container has IP 172.18.0.5/16; its gateway to the outside world is 172.18.0.1.

This is definitely a Docker issue. Sorry for wasting your time.

Hipska commented May 21, 2021

You're welcome.

I'm closing this issue now, but please keep us in the loop so we also know what the solution or fix for this issue turns out to be.

Hipska closed this as completed May 21, 2021
Hipska added the support (Telegraf questions, may be directed to community site or slack) label and removed the bug label May 21, 2021
telegraf-tiger (bot) commented:

Hello! I recommend posting this question in our Community Slack or Community Page, we have a lot of talented community members there who could help answer your question more quickly.
Heads up, this issue will be automatically closed after 7 days of inactivity. Thank you!

BenCastricum (Author) commented:

It seems it isn't a Docker thing either. The squid-cache daemon was replying to the packet with a source IP chosen from the routing table, so it picked the docker_gwbridge interface (172.18.0.1) because of the request's source IP (172.18.0.5).

I fixed it by explicitly configuring the SNMP address in squid.conf with the snmp_incoming_address option; this IP is then also used as snmp_outgoing_address.
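
For reference, a minimal squid.conf sketch of that fix (the address is illustrative; use the address the agent is polled on):

# Pin squid's SNMP socket to one address; squid then also uses this
# address as snmp_outgoing_address for replies.
snmp_incoming_address 192.168.3.1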

The underlying question is whether an SNMP agent is required to use the IP address it received the request on as the source IP address for the reply. Squid-Cache and net-snmp don't think so; Telegraf does, and enforces it.

This seems like a larger discussion. gosnmp seems to have changed its behaviour on this recently. Does that mean Telegraf can be expected to adapt too?

Hipska commented May 21, 2021

Indeed, I was also wondering whether Telegraf is really doing the right thing here. But if gosnmp updates its behaviour on this, it will also be reflected in Telegraf when we update the module.

I could not find any trace of gosnmp actually being changed on this topic; could you point me to the exact PR where this was changed?

BenCastricum (Author) commented:

I think PR#277 is the one.

Hipska commented May 21, 2021

@reimda I think we still need to implement gosnmp's UseUnconnectedUDPSocket option to accommodate this.
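
For context, a hedged sketch of what that option looks like at the gosnmp level (UseUnconnectedUDPSocket is gosnmp's field name; the target, community, and OID values are taken from this issue):

package main

import (
    "fmt"
    "time"

    "github.com/gosnmp/gosnmp"
)

func main() {
    params := &gosnmp.GoSNMP{
        Target:    "192.168.3.1",
        Port:      161,
        Community: "public",
        Version:   gosnmp.Version2c,
        Timeout:   5 * time.Second,
        Retries:   2,
        // An unconnected UDP socket accepts replies even when they
        // arrive from a different source IP than the request target.
        UseUnconnectedUDPSocket: true,
    }
    if err := params.Connect(); err != nil {
        panic(err)
    }
    defer params.Conn.Close()

    result, err := params.Get([]string{".1.3.6.1.4.1.3495.1.2.2.0"})
    if err != nil {
        panic(err)
    }
    for _, variable := range result.Variables {
        fmt.Println(variable.Name, "=", variable.Value)
    }
}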

Hipska reopened this May 21, 2021
Hipska added the feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin) label and removed the support and area/docker labels May 21, 2021
flyinghuman commented Sep 28, 2021

Same here: snmpwalk and snmpget from the console work, but within Telegraf a timeout happens, even though the hosts are in the same subnet and no NAT is involved. It seems to be related to SNMP version 1; version 2c works as expected, but the device I must monitor only supports SNMP version 1. So what can I do? In a tcpdump I see that Telegraf sends the request and also gets a correct response, and Telegraf listens on the correct ports, so it must be a software-related error inside Telegraf :( I could also upload the pcap file if that would be helpful. Maybe this is related to #8271.
No Docker, no different answer address. I don't know where to look now. Thanks, fly.
snmp.zip

flyinghuman commented:

Maybe this is also related: gosnmp/gosnmp#47 (comment)

Hipska commented Oct 14, 2021

@reimda WDYT?

Hipska commented Dec 8, 2021

@reimda? Maybe @MyaLongmire can also be of help here?

MyaLongmire (Contributor) commented:

This PR takes @Hipska's idea of implementing UseUnconnectedUDPSocket. Can someone please test it to see if it fixes their issue?

MyaLongmire added the waiting for response label Jan 12, 2022
telegraf-tiger bot removed the waiting for response label Feb 28, 2022
sspaink reopened this Feb 28, 2022
influxdata deleted a comment from telegraf-tiger bot Feb 28, 2022
Hipska added the waiting for response label Feb 28, 2022
telegraf-tiger bot commented Mar 1, 2022

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

telegraf-tiger bot closed this as completed Mar 1, 2022
telegraf-tiger bot removed the waiting for response label Mar 1, 2022
Hipska reopened this Mar 1, 2022
Hipska added the waiting for response label Mar 1, 2022
OliverTUBAF commented Mar 2, 2022

Thank you all for addressing this problem. It works now; I'm able to query data:

[root@server01 telegraf-1.22.0]# tail -f /var/log/telegraf/telegraf.log &
[root@server01 telegraf-1.22.0]# ./usr/bin/telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/ --test
2022-03-02T12:19:45Z I! Starting Telegraf 1.22.0-ccddf3bd
2022-03-02T12:19:45Z I! Loaded inputs: snmp
2022-03-02T12:19:45Z I! Loaded aggregators: 
2022-03-02T12:19:45Z I! Loaded processors: 
2022-03-02T12:19:45Z W! Outputs are not used in testing mode!
2022-03-02T12:19:45Z I! Tags enabled: host=server.cluster
2022-03-02T12:19:45Z D! [agent] Initializing plugins
2022-03-02T12:20:24Z D! [agent] Starting service inputs
> hw_storage,agent_host=192.168.200.1,host=server.cluster hostname="PanFS" 1646223624000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3e960005300030011" 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3e960002300030011" 1646223625000000000
2022-03-02T12:20:25Z D! [agent] Stopping service inputs
2022-03-02T12:20:25Z D! [agent] Input channel closed
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=2i,DiskUtil=1i,Ops=3i,ResponseTime=0i 1646223625000000000
2022-03-02T12:20:25Z D! [agent] Stopped Successfully
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=0i,DiskUtil=1i,Ops=3i,ResponseTime=0i 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=1i,DiskUtil=1i,Ops=27i,ResponseTime=1i 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3c76000d300030001" 1646223625000000000



[root@server01 telegraf-1.22.0]# grep -v "^#" /etc/telegraf/telegraf.d/inputs.mpanfs.conf 
[[inputs.snmp]]
   agents = [ "udp://192.168.200.1:161" ]
   ## Timeout for each SNMP query.
   timeout = "5s"
   ## Number of retries to attempt within timeout.
   retries = 2
   ## SNMP version, values can be 1, 2, or 3
   version = 1

   ## SNMP community string.
   community = "public"

  name = "hw_storage"
  [[inputs.snmp.field]]
    name = "hostname"
    oid = "SNMPv2-MIB::sysName.0"
  
  [[inputs.snmp.table]]
    ## measurement name
    name = "hw_director"
  [[inputs.snmp.table.field]]
    name = "director"
    oid = "PANASAS-SYSTEM-MIB-V1::panSystemRepsetEntryNodeHwSN"
  [[inputs.snmp.table.field]]
    name = "DiskUtil"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorDiskUtil"
  [[inputs.snmp.table.field]]
    name = "ResponseTime"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorResponseTime"
  [[inputs.snmp.table.field]]
    name = "CPUUsage"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorCpuUtil"
  [[inputs.snmp.table.field]]
    name = "Ops"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorOps"

telegraf-tiger bot removed the waiting for response label Mar 2, 2022
MyaLongmire (Contributor) commented:

@OliverTUBAF Glad to hear it's working. Did you test a new version or the PR linked above?

OliverTUBAF commented:

With the latest official release (1.21.4) it did not work; then I tried the PR build (1.22.0), and there it worked.

MyaLongmire (Contributor) commented:

@OliverTUBAF Thank you for the clarification! This PR is getting cleaned up and will hopefully be in the next release :)
