Random integers in SNMP Input Plugin OID output #7929

Closed
surbhim7 opened this issue Jul 31, 2020 · 25 comments · Fixed by #8917
Labels: area/snmp, bug (unexpected problem or unintended behavior)

Comments

@surbhim7

  • Telegraf version - 1.15.1
  • Operating System - Redhat Linux
  • Input Plugin - SNMP

I'm using the snmp input plugin to poll certain OIDs.
Every time I run Telegraf, I get sequential integers in place of the OID values for some hosts, as can be seen below:

viptela_mem_cpu_stats,agent_host=10.14.*.*,hostname=903 memory_free=909i,uptime=904i,cpu_util_system=905i,cpu_util_user=906i,memory_total=907i,memory_used=908i 1596221521000000000

viptela_mem_cpu_stats,agent_host=10.12.*.*,hostname=TEST-VE1 memory_used="3502212",memory_free="1761120",uptime=21752169i,cpu_util_system="3.49",cpu_util_user="2.28",memory_total="6081060" 1596221521000000000

viptela_mem_cpu_stats,agent_host=10.13.*.*,hostname=924 uptime=925i,cpu_util_system=926i,cpu_util_user=927i,memory_total=928i,memory_used=929i,memory_free=930i 1596221521000000000
@ssoroka
Contributor

ssoroka commented Aug 3, 2020

What am I supposed to be looking at? Can you point out specifically what you're seeing and what you're expecting to see?

@surbhim7
Author

surbhim7 commented Aug 4, 2020

The second line is what the output should look like, with the expected values for each field and tag.

The first and third lines are the output I'm getting for some hosts. The values for the OIDs come back as 903i, 904i, 905i and so on.

@surbhim7
Author

surbhim7 commented Aug 6, 2020

Hi @ssoroka , anything on this?

@ssoroka
Contributor

ssoroka commented Aug 18, 2020

Hey @surbhim7. I'm going to phone a friend here. cc @reimda

@reimda
Contributor

reimda commented Aug 18, 2020

I haven't seen this behavior before. Could you make a packet capture of telegraf getting these values from the snmp agents and attach it to the issue along with the output of telegraf and the telegraf config you're using?

@Hipska
Contributor

Hipska commented Aug 21, 2020

Maybe also show the relevant part of your config.

@surbhim7
Author

Sorry for the delay in replying.

I managed to figure out what the issue was.
This was happening for the agent hosts for which the snmp credentials didn't work. Telegraf gave an output with random integers for those hosts.

I guess that is not the right behaviour.

@DouglasHeriot

I'm also seeing this. I have is_tag enabled on some of these OIDs, so I ran into "runaway cardinality" and a bunch of writes being dropped where they add new tags:

2020-10-01T14:00:06Z E! [outputs.influxdb] When writing to [http://X:8086]: received error partial write: max-values-per-tag limit exceeded (100001/100000): measurement="snmp" tag="snmp_hostname" value="X" dropped=2; discarding points

Any idea where these values are coming from, and why an error isn't reported?

@Hipska
Contributor

Hipska commented Oct 2, 2020

@DouglasHeriot That seems a totally different issue, could you create a new one for this please?

@surbhim7 As your issue seems to be resolved, can you close it?

@DouglasHeriot

@Hipska I'm pretty sure it's the same issue - I've just run into a different consequence of it.

With this basic input config:

[[inputs.snmp]]
  agents = [
    "10.1.1.1",
    "10.2.2.2"
  ]
  version = 3
  sec_name = "user"
  auth_protocol = "SHA"
  auth_password = "password"
  sec_level = "authPriv"
  context_name = ""
  priv_protocol = "AES"
  priv_password = "password"

  interval = "20s"

  [[inputs.snmp.field]]
    name = "snmp_hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "RFC1213-MIB::sysUpTime.0"

Returns this result:

$ ./telegraf --config-directory ~/telegraf.d --test
2020-10-02T10:25:55Z I! Starting Telegraf
2020-10-02T10:25:55Z I! Using config file: /etc/telegraf/telegraf.conf
> snmp,agent_host=10.1.1.1,snmp_hostname=6334808 uptime=6334809i 1601634356000000000
> snmp,agent_host=10.2.2.2,snmp_hostname=CORRECT-HOSTNAME uptime=2343511200i 1601634356000000000

Example switch 10.1.1.1 does not authenticate successfully, while 10.2.2.2 does. Note that for 10.1.1.1 the snmp_hostname tag and the uptime field are incrementing integers. Subsequent runs return the next numbers:

$ ./telegraf --config-directory ~/telegraf.d --test
2020-10-02T10:25:55Z I! Starting Telegraf
2020-10-02T10:25:55Z I! Using config file: /etc/telegraf/telegraf.conf
> snmp,agent_host=10.1.1.1,snmp_hostname=6334810 uptime=6334811i 1601634356000000000
> snmp,agent_host=10.2.2.2,snmp_hostname=CORRECT-HOSTNAME uptime=2343511200i 1601634356000000000

When testing using snmpget manually, it returns an error, and exits with code 1.

$ snmpget -v3 -l authPriv -u user -a SHA -A password -x AES -X password 10.1.1.1 RFC1213-MIB::sysName.0
snmpget: Authentication failure (incorrect password, community or key)
$ echo $?
1

This leads to "runaway cardinality" on snmp fields with is_tag=true as each SNMP request leads to another series being created.

The expected outcome is that an error is logged saying that authentication failed, instead of rubbish data being published.

@Hipska
Contributor

Hipska commented Oct 2, 2020

Correct, that is indeed the same issue, with the same root cause. I was confused because you initially only mentioned an InfluxDB error/problem.

The plugin should indeed give a warning/error if the agent does not return useful data.

@DouglasHeriot

I'm having a little look into this - it appears it may be an issue in the upstream gosnmp project?

Where snmpConnection Get is called on line 416, the Variables array contains these mystery numbers, and no error is returned.
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/snmp/snmp.go#L416
Variables:[{Name:.1.3.6.1.6.3.15.1.1.5.0 Type:Counter32 Value:6335216 Logger:0xc0001b00a0}]

DouglasHeriot added a commit to hillsong/telegraf that referenced this issue Oct 2, 2020
…a#7929)

gosnmp does not return any error when an SNMP get request fails due to authentication.
The SNMP agent will return a "report" PDU instead of a "get-response" PDU in this case.

A check has been added to verify the packet's PDUType is not "Report"
@DouglasHeriot

I think this only applies to SNMP V3 - with SNMP V2 if the community string is incorrect there is simply no response.

Error packet capture (note telegraf is set to retry 3 times):
[image: packet capture screenshot]

Successful packet capture:
[image: packet capture screenshot]

RFC 3411 says "Report-PDU" is of the "Internal Class", which I guess means valid messages shouldn't use this. My testing and packet captures show valid requests receive a "Response-PDU".

gosnmp Get method returns a SnmpPacket that contains the PDUType. We can check if it's a report or response.
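
The check described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: `PDUType`, `Variable`, `Packet`, `ErrReport` and `firstValue` are hypothetical local stand-ins so the example compiles without the gosnmp dependency. The two tag values are the standard SNMP wire tags for Response (0xa2) and Report (0xa8).

```go
package main

import (
	"errors"
	"fmt"
)

// PDUType is a local stand-in for the one-byte SNMP PDU tag.
// Assumption: it only mimics the idea of gosnmp's type, it is not gosnmp's API.
type PDUType byte

const (
	GetResponse PDUType = 0xa2 // normal answer to a get request
	Report      PDUType = 0xa8 // SNMPv3 engine report, e.g. on an auth failure
)

// Variable is a simplified, hypothetical stand-in for one variable binding.
type Variable struct {
	Name  string
	Value interface{}
}

// Packet is a simplified, hypothetical stand-in for the packet a get returns.
type Packet struct {
	PDUType   PDUType
	Variables []Variable
}

// ErrReport signals that the agent answered with a Report PDU.
var ErrReport = errors.New("agent returned a report PDU, possibly an authentication problem")

// firstValue returns the first variable's value only for a real response.
// A Report PDU becomes an error, so its counter value is never mistaken
// for the value of the requested OID.
func firstValue(pkt *Packet) (interface{}, error) {
	if pkt == nil {
		return nil, errors.New("no packet received")
	}
	if pkt.PDUType == Report {
		return nil, ErrReport
	}
	if len(pkt.Variables) == 0 {
		return nil, errors.New("response contained no variables")
	}
	return pkt.Variables[0].Value, nil
}

func main() {
	// Same shape as the packet quoted earlier in this thread.
	bad := &Packet{PDUType: Report, Variables: []Variable{
		{Name: ".1.3.6.1.6.3.15.1.1.5.0", Value: 6335216},
	}}
	if _, err := firstValue(bad); err != nil {
		fmt.Println("skipping field:", err)
	}

	good := &Packet{PDUType: GetResponse, Variables: []Variable{
		{Name: ".1.3.6.1.2.1.1.5.0", Value: "CORRECT-HOSTNAME"},
	}}
	if v, err := firstValue(good); err == nil {
		fmt.Println("value:", v)
	}
}
```

Returning an error instead of a value means a failing agent produces a log line rather than a new tag value, which is what avoids the cardinality growth described earlier.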

I have created a draft pull request #8215 with a fix for this that logs reports as an error that may be due to authentication. I have not yet added any tests for this.

However, I'm not an expert at the SNMP protocol - I'm not sure if it makes more sense for this to be handled within the gosnmp library. I'm not sure if gosnmp/gosnmp#172 is relevant to this or not - I think it shows you can also check the received report message for authentication flags?

As shown in my comment above, the snmpget tool detects this case and returns an error "snmpget: Authentication failure (incorrect password, community or key)". We could look into its source to see what condition it uses to determine this is the case.

@dpajin
Contributor

dpajin commented Nov 24, 2020

Hi @DouglasHeriot,

I have devices which do not work with the Telegraf SNMP plugin, as the requests fail due to the following error: "Incoming packet is not authentic, discarding".

The device answers on the first initial packet with the SNMP Report PDU and that message triggers the error and Telegraf stops at that point. As you mentioned, tools like snmpget and snmpwalk work correctly.

I have tested your pull request #8215, but I don't see any change in the behavior:

2020-11-24T15:36:01Z E! [inputs.snmp] Error in plugin: agent 172.31.9.214: performing get on field hostname: Incoming packet is not authentic, discarding
2020-11-24T15:36:01Z E! [inputs.snmp] Error in plugin: agent 172.31.9.214: gathering table snmp_inventory: performing bulk walk for field hwDescription: Incoming packet is not authentic, discarding

Any ideas how this issue can be fixed?

@dpajin
Contributor

dpajin commented Nov 24, 2020

Hi @DouglasHeriot,

If you are interested in the problem described in my previous comment, please take a look at: #3788 (comment)

@DouglasHeriot

@dpajin thanks for the info! In my case I just fixed the authentication config on our switches to resolve the error. However I'm still interested in seeing this fixed as my Influx gets full of random integers when new switches are added that are not correctly configured.

Will you make a pull request to update gosnmp in this project?

@dpajin
Contributor

dpajin commented Nov 26, 2020

@DouglasHeriot, yes I will try to make a pull request to update gosnmp. Additionally, I can confirm that I also hit the bug with the random numbers at some point. Unfortunately, I don't have any packet capture so far.

Hipska added the area/snmp and bug labels Dec 17, 2020
sjwang90 linked a pull request Dec 17, 2020 that will close this issue
@dpajin
Contributor

dpajin commented Dec 17, 2020

@DouglasHeriot, I made a pull request for gosnmp update: #8588

@Hipska
Contributor

Hipska commented Feb 10, 2021

Since gosnmp has been updated, does this issue still occur?

@henriknoerr

henriknoerr commented Feb 15, 2021

Yes I can confirm it happens:

snmpwalk -v 3 -u user -l AuthPriv -x aes -X privpass -a sha -A authpass 10.10.10.10
snmpwalk: Unknown user name

telegraf --test --config asav.conf
2021-02-15T08:31:53Z I! Starting Telegraf 1.17.2

snmp,agent_host=10.10.10.10,host=telegrafhost,hostname=378659 AnyConnectUsers=378660i,1613377914000000000

[[inputs.snmp.field]]
name = "hostname"
oid = "RFC1213-MIB::sysName.0"
is_tag = true

[[inputs.snmp.field]]
name = "AnyConnect"
oid = "CISCO-REMOTE-ACCESS-MONITOR-MIB::crasNumSessions.0"

Hipska modified the milestone: 2.0.0 Feb 15, 2021
@Hipska
Contributor

Hipska commented Feb 15, 2021

Okay, I see that some problems are solved with the new gosnmp version, but there is still no check on the response.

Where snmpConnection Get is called on line 416, the Variables array contains these mystery numbers, and no error is returned.
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/snmp/snmp.go#L416
Variables:[{Name:.1.3.6.1.6.3.15.1.1.5.0 Type:Counter32 Value:6335216 Logger:0xc0001b00a0}]

The mystery numbers are values from the 1.3.6.1.6.3.15.1.1 subtree. They are counters of how many auth problems the system has observed. I will have a look at your original PR #8215.
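
To make that concrete, here is a small sketch that maps an OID in this subtree to its counter name. The subtree is usmStats from SNMP-USER-BASED-SM-MIB (RFC 3414), and the six names below are the counters that MIB defines. The helper `usmStatsName` and the variable names are hypothetical and exist only for this illustration; they are not part of telegraf or gosnmp.

```go
package main

import (
	"fmt"
	"strings"
)

// usmStatsPrefix is the usmStats subtree of SNMP-USER-BASED-SM-MIB (RFC 3414).
const usmStatsPrefix = "1.3.6.1.6.3.15.1.1."

// usmStatsNames maps the sub-identifier under usmStats to the counter name.
var usmStatsNames = map[string]string{
	"1": "usmStatsUnsupportedSecLevels",
	"2": "usmStatsNotInTimeWindows",
	"3": "usmStatsUnknownUserNames",
	"4": "usmStatsUnknownEngineIDs",
	"5": "usmStatsWrongDigests",
	"6": "usmStatsDecryptionErrors",
}

// usmStatsName (hypothetical helper) returns the counter name for an OID in
// the usmStats subtree, and false for any other OID. A leading dot and a
// trailing ".0" instance suffix are both accepted.
func usmStatsName(oid string) (string, bool) {
	oid = strings.TrimPrefix(oid, ".")
	if !strings.HasPrefix(oid, usmStatsPrefix) {
		return "", false
	}
	rest := strings.TrimPrefix(oid, usmStatsPrefix)
	rest = strings.TrimSuffix(rest, ".0")
	name, ok := usmStatsNames[rest]
	return name, ok
}

func main() {
	// The OID from the Variables dump quoted in this thread.
	name, ok := usmStatsName(".1.3.6.1.6.3.15.1.1.5.0")
	fmt.Println(name, ok)
}
```

For the OID quoted above, the lookup gives usmStatsWrongDigests, which fits the "Authentication failure (incorrect password, community or key)" message that snmpget printed for the same agent.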

Hipska self-assigned this Feb 15, 2021
@hackery
Contributor

hackery commented Feb 16, 2021

The mystery numbers are some of 1.3.6.1.6.3.15.1.1 They are counters of how many auth problem the system has observed. I will have a look on your original PR #8215.

I don't think there's a mystery, I explained in #7746 what was going on here (SNMP target returns a Report rather than a GetResponse but the code doesn't discriminate between them, and uses the error count as a result value)

@Hipska
Contributor

Hipska commented Feb 16, 2021

Indeed, that's what I'm saying. See also the linked PR #8215 which will fix this issue.

@Hipska
Contributor

Hipska commented Apr 30, 2021

Hi all, please check out telegraf 1.18.2 which has a final fix for this. Feedback is welcome.

@henriknoerr

Works as expected as far as I can tell: no more false entries are sent to the InfluxDB output, and telegraf.log shows understandable errors.
