Timeout with inputs.snmp polling squid-cache #9286

Closed
BenCastricum opened this issue May 20, 2021 · 23 comments · Fixed by #10331
Labels
area/snmp, feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin)

Comments

BenCastricum commented May 20, 2021

Relevant telegraf.conf:

[[inputs.snmp]]
  interval = "30s"
  agents = [ "udp://192.168.3.1:161" ]
  version = 2
  community = "public"
  name = "squid"

  [[inputs.snmp.field]]
    name = "cacheSoftware"
    oid = "1.3.6.1.4.1.3495.1.2.2.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "cacheVersionId"
    oid = "1.3.6.1.4.1.3495.1.2.3.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "cacheProtoClientHttpRequests"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.1.0"

System info:

Running the telegraf:latest Docker image on Ubuntu:

I have no name!@TelegrafHost:/$ telegraf --version
Telegraf 1.18.2 (git: HEAD a614372)

Docker

version: "3"

services:
  telegraf:
    image: telegraf
    hostname: TelegrafHost
    user: 201:201
    command: --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d
    ports:
      - 6514:6514/udp
      - 162:6162/udp
    volumes:
      - /home/docker/etc/telegraf:/etc/telegraf:ro
      - /proc:/host/proc:ro
      - /home/docker/.snmp/mibs:/usr/share/snmp/mibs:ro
    networks:
       influxdatanet:

Steps to reproduce:

  1. Configure the telegraf Docker image to poll SNMP statistics from squid-cache

Expected behavior:

From the CLI inside the Docker container, snmpget works as expected:

I have no name!@TelegrafHost:/$ snmpget -v2c -c public 192.168.3.1 1.3.6.1.4.1.3495.1.2.2.0
iso.3.6.1.4.1.3495.1.2.2.0 = STRING: "squid"

Actual behavior:

Docker logs show timeout errors for this specific agent:

2021-05-20T14:28:10Z E! [inputs.snmp] Error in plugin: agent udp://192.168.3.1:161: performing get on field cacheSoftware: request timeout (after 3 retries)

(lots of these)

Additional info:

Other inputs.snmp agents work; just this specific one times out.

A tcpdump capture shows this when Telegraf polls:

16:31:35.005000 IP 172.18.0.5.35773 > 192.168.3.1.161: GetRequest(31) .1.3.6.1.4.1.3495.1.2.2.0
16:31:35.005274 IP 172.18.0.1.161 > 172.18.0.5.35773: GetResponse(40) .1.3.6.1.4.1.3495.1.2.2.0="squid"
16:31:35.005313 IP 172.18.0.5 > 172.18.0.1: ICMP 172.18.0.5 udp port 35773 unreachable, length 95

but an snmpget from the CLI does not trigger the ICMP error. (Does Telegraf stop listening on the port before the answer is received?)
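
For illustration, a minimal Go sketch (not Telegraf's actual code; the agent address is taken from the config above) of why a connected UDP socket produces exactly this pattern: the kernel delivers only datagrams whose source matches the peer given to DialUDP, so a reply from any other address is refused with ICMP "port unreachable" and the client times out.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    peer, err := net.ResolveUDPAddr("udp", "192.168.3.1:161")
    if err != nil {
        panic(err)
    }

    // DialUDP connect()s the socket to 192.168.3.1:161.
    conn, err := net.DialUDP("udp", nil, peer)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Stand-in for the SNMP GetRequest seen in the tcpdump above.
    if _, err := conn.Write([]byte("request")); err != nil {
        panic(err)
    }

    // A reply arriving from any other source (e.g. the NATed 172.18.0.1)
    // never reaches this socket; the kernel refuses it, which is what
    // produces the ICMP "port unreachable" in the capture.
    _ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
    buf := make([]byte, 1500)
    if _, err := conn.Read(buf); err != nil {
        fmt.Println("read failed:", err) // i/o timeout, like Telegraf's "request timeout"
    }
}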

Test run output:

I have no name!@TelegrafHost:/$ telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d
<...>
2021-05-20T14:09:28Z E! [inputs.snmp] Error in plugin: agent udp://192.168.3.1:161: performing get on field cacheSoftware: request timeout (after 3 retries)
I have no name!@TelegrafHost:/$

BenCastricum added the bug (unexpected problem or unintended behavior) label May 20, 2021
BenCastricum (Author) commented:

Using the same Telegraf config outside the Docker container works as expected, so there seems to be some relation to Docker.

Hipska commented May 20, 2021

In that case, it seems more relevant to create an issue here: https://github.com/influxdata/influxdata-docker

But looking at your stack trace, do you also have the snmp_trap plugin configured? It seems that is the one emitting the panic. And what version of Telegraf does the container have?

BenCastricum (Author) commented:

Inside the container:

I have no name!@TelegrafHost:/$ telegraf --version
Telegraf 1.18.2 (git: HEAD a614372)

The stack trace is indeed from the snmp_trap plugin being unable to bind to the socket during the --test run, which makes sense because it is already in use by the main process.

I would agree it's a Docker-only thing if snmpget inside the container had the same issue, but snmpget works fine and Telegraf does not. Weird.

Hipska commented May 21, 2021

Could you update the output with the results of a test without snmp_trap plugin configured?

BenCastricum changed the title from "Timeout / panic runtime error with inputs.snmp polling squid-cache" to "Timeout with inputs.snmp polling squid-cache" May 21, 2021
BenCastricum (Author) commented:

Without it, no panic occurs. I updated the report to show just the timeout error. Should I file a new bug report for the snmp_trap plugin panic? It's not really bothering me.

Hipska commented May 21, 2021

That would be nice. The panic should not occur just because the port is already in use for listening; an error message would be expected in that case.

About your current issue: while I still think it is a Docker-related problem (since you said it works perfectly when not running in Docker) and it should be filed in the correct repository, I noticed something in your tcpdump that might give a clue:

So you are sending the SNMP request to 192.168.3.1, but the response comes from a totally different IP (172.18.0.1). That might be the reason it is not working. Do you have an idea where this comes from? And does that also happen when running Telegraf on the host instead of in a container?

BenCastricum (Author) commented:

You have a good point there. The reply isn't properly NATed. Telegraf does the right thing here and rejects the packet, while snmpget does not seem to care that the reply comes from a different IP address. This page more or less confirms that.

The Docker container has IP 172.18.0.5/16; its gateway to the outside world is 172.18.0.1.

This is definitely a Docker issue. Sorry for wasting your time.

Hipska commented May 21, 2021

You're welcome.

I'm closing this issue now, but please keep us in the loop so we also know what the solution or fix for this issue turns out to be.

Hipska closed this as completed May 21, 2021
Hipska added the support (Telegraf questions, may be directed to community site or slack) label and removed the bug label May 21, 2021
telegraf-tiger (bot) commented:

Hello! I recommend posting this question in our Community Slack or Community Page, we have a lot of talented community members there who could help answer your question more quickly.
Heads up, this issue will be automatically closed after 7 days of inactivity. Thank you!

BenCastricum (Author) commented:

It seems it isn't a Docker thing either. The squid-cache daemon was replying to the packet with a source IP chosen from the routing table, so it picked the docker_gwbridge interface (172.18.0.1) because of the request's source IP (172.18.0.5).

I fixed it by explicitly configuring the SNMP address in squid.conf with the snmp_incoming_address option; this IP is then also used as snmp_outgoing_address.
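
For reference, a minimal squid.conf sketch of that fix (the address is illustrative; use the address the agent is polled on):

# Pin squid's SNMP socket to one address; squid then also uses this
# address as snmp_outgoing_address for replies.
snmp_incoming_address 192.168.3.1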

The underlying question is whether an SNMP agent is required to use the IP address it received the request on as the source IP address for the reply. Squid-Cache and net-snmp don't think so; Telegraf does, and enforces it.

This seems like a larger discussion. gosnmp seems to have changed its behaviour on this recently. Does that mean Telegraf can be expected to adapt too?

Hipska commented May 21, 2021

Indeed, I was also wondering whether Telegraf is really doing the right thing here. But if gosnmp updates its behaviour on this, it will also be reflected in Telegraf when we update the module.

I could not find any trace of gosnmp actually being changed on this topic; could you point me to the exact PR where this was changed?

BenCastricum (Author) commented:

I think PR#277 is the one.

Hipska commented May 21, 2021

@reimda I think we still need to implement gosnmp's UseUnconnectedUDPSocket option to accommodate this.
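
For context, a hedged sketch of what that option looks like at the gosnmp level (UseUnconnectedUDPSocket is gosnmp's field name; the target, community, and OID values are taken from this issue):

package main

import (
    "fmt"
    "time"

    "github.com/gosnmp/gosnmp"
)

func main() {
    params := &gosnmp.GoSNMP{
        Target:    "192.168.3.1",
        Port:      161,
        Community: "public",
        Version:   gosnmp.Version2c,
        Timeout:   5 * time.Second,
        Retries:   2,
        // An unconnected UDP socket accepts replies even when they
        // arrive from a different source IP than the request target.
        UseUnconnectedUDPSocket: true,
    }
    if err := params.Connect(); err != nil {
        panic(err)
    }
    defer params.Conn.Close()

    result, err := params.Get([]string{".1.3.6.1.4.1.3495.1.2.2.0"})
    if err != nil {
        panic(err)
    }
    for _, variable := range result.Variables {
        fmt.Println(variable.Name, "=", variable.Value)
    }
}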

Hipska reopened this May 21, 2021
Hipska added the feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin) label and removed the support and area/docker labels May 21, 2021
flyinghuman commented Sep 28, 2021

Same here: snmpwalk and snmpget from the console work, but within Telegraf a timeout happens, even though the hosts are in the same subnet and no NAT is involved. It seems to be related to SNMP version 1; version 2c works as expected, but the device I must monitor only supports SNMP version 1. So what can I do? In a tcpdump I see that Telegraf sends the request and also gets a correct response, and Telegraf listens on the correct ports, so it must be a software-related error inside Telegraf :( I could also upload the pcap file if that would be helpful. Maybe this is related to #8271.
No Docker, no different answer address. I don't know where to look now. Thanks, fly.
snmp.zip

flyinghuman commented:

Maybe this is also related: gosnmp/gosnmp#47 (comment)

Hipska commented Oct 14, 2021

@reimda WDYT?

Hipska commented Dec 8, 2021

@reimda? Maybe @MyaLongmire can also be of help here?

MyaLongmire (Contributor) commented:

This PR takes @Hipska's idea of implementing UseUnconnectedUDPSocket. Can someone please test it to see if it fixes their issue?

MyaLongmire added the waiting for response label Jan 12, 2022
telegraf-tiger bot removed the waiting for response label Feb 28, 2022
sspaink reopened this Feb 28, 2022
influxdata deleted a comment from telegraf-tiger bot Feb 28, 2022
Hipska added the waiting for response label Feb 28, 2022
telegraf-tiger bot commented Mar 1, 2022

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

telegraf-tiger bot closed this as completed Mar 1, 2022
telegraf-tiger bot removed the waiting for response label Mar 1, 2022
Hipska reopened this Mar 1, 2022
Hipska added the waiting for response label Mar 1, 2022
OliverTUBAF commented Mar 2, 2022

Thank you all for addressing this problem. It works now; I'm able to query data:

[root@server01 telegraf-1.22.0]# tail -f /var/log/telegraf/telegraf.log &
[root@server01 telegraf-1.22.0]# ./usr/bin/telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/ --test
2022-03-02T12:19:45Z I! Starting Telegraf 1.22.0-ccddf3bd
2022-03-02T12:19:45Z I! Loaded inputs: snmp
2022-03-02T12:19:45Z I! Loaded aggregators: 
2022-03-02T12:19:45Z I! Loaded processors: 
2022-03-02T12:19:45Z W! Outputs are not used in testing mode!
2022-03-02T12:19:45Z I! Tags enabled: host=server.cluster
2022-03-02T12:19:45Z D! [agent] Initializing plugins
2022-03-02T12:20:24Z D! [agent] Starting service inputs
> hw_storage,agent_host=192.168.200.1,host=server.cluster hostname="PanFS" 1646223624000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3e960005300030011" 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3e960002300030011" 1646223625000000000
2022-03-02T12:20:25Z D! [agent] Stopping service inputs
2022-03-02T12:20:25Z D! [agent] Input channel closed
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=2i,DiskUtil=1i,Ops=3i,ResponseTime=0i 1646223625000000000
2022-03-02T12:20:25Z D! [agent] Stopped Successfully
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=0i,DiskUtil=1i,Ops=3i,ResponseTime=0i 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster CPUUsage=1i,DiskUtil=1i,Ops=27i,ResponseTime=1i 1646223625000000000
> hw_director,agent_host=192.168.200.1,host=server.cluster director="3c76000d300030001" 1646223625000000000



[root@server01 telegraf-1.22.0]# grep -v "^#" /etc/telegraf/telegraf.d/inputs.mpanfs.conf 
[[inputs.snmp]]
   agents = [ "udp://192.168.200.1:161" ]
   ## Timeout for each SNMP query.
   timeout = "5s"
   ## Number of retries to attempt within timeout.
   retries = 2
   ## SNMP version, values can be 1, 2, or 3
   version = 1

   ## SNMP community string.
   community = "public"

  name = "hw_storage"
  [[inputs.snmp.field]]
    name = "hostname"
    oid = "SNMPv2-MIB::sysName.0"
  
  [[inputs.snmp.table]]
    ## measurement name
    name = "hw_director"
  [[inputs.snmp.table.field]]
    name = "director"
    oid = "PANASAS-SYSTEM-MIB-V1::panSystemRepsetEntryNodeHwSN"
  [[inputs.snmp.table.field]]
    name = "DiskUtil"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorDiskUtil"
  [[inputs.snmp.table.field]]
    name = "ResponseTime"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorResponseTime"
  [[inputs.snmp.table.field]]
    name = "CPUUsage"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorCpuUtil"
  [[inputs.snmp.table.field]]
    name = "Ops"
    oid = "PANASAS-PERFORMANCE-MIB-V1::panPerfDirectorOps"

telegraf-tiger bot removed the waiting for response label Mar 2, 2022
MyaLongmire (Contributor) commented:

@OliverTUBAF Glad to hear it's working. Did you test a new version or the PR linked above?

OliverTUBAF commented:

With the latest official release (1.21.4) it did not work; then I tried the PR build (1.22.0), and there it worked.

MyaLongmire (Contributor) commented:

@OliverTUBAF Thank you for the clarification! This PR is getting cleaned up and will hopefully be in the next release :)
