Telegraf writing metrics twice, but one is wrong? #2041

Closed
rosspink opened this issue Nov 14, 2016 · 6 comments
Labels
bug unexpected problem or unintended behavior

Comments

rosspink commented Nov 14, 2016

Bug report

Relevant telegraf.conf:

[[inputs.snmp]]
  agents = [ "Edi.Core1" ]
  version = 2
  community = "redacted"

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

  [[inputs.snmp.table.field]]
    name = "ifName"
    oid = "IF-MIB::ifName"
    is_tag = true

System info:

Telegraf v1.1.0 (git: release-1.1.0 8ecfe13)
InfluxDB shell version: 1.0.2
InfluxDB Relay: latest

Steps to reproduce:

I have set up a system as described here:

  • one HAProxy server, configured in round-robin
  • two InfluxDB servers, each running Relay; Relay copies any writes sent to one server to the other

Everything works as I expect, except that Telegraf appears to be writing a particular metric twice, once incorrectly.

It only happens in certain circumstances, generally on switches with particularly high counter values. In my example, we're interested in ifHCInOctets for host Edi.Core1, interface name (ifName) Vl351.

Expected behavior:

Telegraf writes a metric once, correctly. It should only write the following:

ifHCInOctets=340772969979493i

Actual behavior:

Telegraf writes a metric twice, once correctly and once with a false value:

ifHCInOctets=340772969979493i
ifHCInOctets=637409419i

Additional info:

These log files reveal the multiple writes: the same metric appears in two separate write lines, and ifHCInOctets is duplicated between them:

interface,agent_host=Edi.Core1,host=dc5-influxha01,ifName=Vl351 ifConnectorPresent=2i,ifCounterDiscontinuityTime=3736i,ifHCInOctets=637409419i,ifHCInUcastPkts=8143038i,ifHCOutOctets=140906590205473i,ifHCOutUcastPkts=333187996005i,ifHighSpeed=1000i,ifLinkUpDownTrapEnable=1i,ifName="Vl351",ifPromiscuousMode=2i 1479123813000000000
interface,agent_host=Edi.Core1,host=dc5-influxha01,ifName=Vl351 ifAlias="Link to DC3CORE1",ifConnectorPresent=2i,ifCounterDiscontinuityTime=3734i,ifHCInBroadcastPkts=31807157i,ifHCInMulticastPkts=0i,ifHCInOctets=340772969979493i,ifHCInUcastPkts=656403980609i,ifHCOutBroadcastPkts=0i,ifHCOutMulticastPkts=0i,ifHCOutOctets=271670987401320i,ifHCOutUcastPkts=411805219439i,ifHighSpeed=1000i,ifInBroadcastPkts=31807157i,ifInMulticastPkts=0i,ifLinkUpDownTrapEnable=1i,ifName="Vl351",ifOutBroadcastPkts=0i,ifOutMulticastPkts=0i,ifPromiscuousMode=2i 1479123813000000000

If I poll the relevant OID continuously, from the same machine that is running Telegraf, for however long, the counter value never deviates:

IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493

Let me know if you need any more information to help investigate this.


sparrc commented Nov 14, 2016

cc @phemmer

rosspink commented Nov 14, 2016

I've just tried removing Relay altogether, i.e. writing straight to the backends, and the issue persists regardless.

sparrc added the bug label Nov 14, 2016

phemmer commented Nov 14, 2016

These log files reveal the multiple writes

How are you generating these log files? Are you running telegraf -test, querying the data from InfluxDB, using [[outputs.file]], or something else?
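
If it helps, a minimal [[outputs.file]] section along these lines (the file path is just an example) will dump exactly what telegraf writes, in influx line protocol:

[[outputs.file]]
  ## Write the same metrics to stdout and a local file for inspection
  files = ["stdout", "/tmp/telegraf-metrics.out"]
  data_format = "influx"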

Can you do an snmpwalk of the IF-MIB::ifXTable?
The outputs of the two lines are so radically different that I can't think of anything in the SNMP plugin which could result in such behavior.
My initial suspicion is one of two things:

  1. There is more to your config than what is shown.
  2. The device you are polling is returning 2 different rows, and the row's index is more than just the ifName field.


rosspink commented Nov 14, 2016

Can you do an snmpwalk of the IF-MIB::ifXTable

Output is here

An abridged output of just the one interface I was using as an example is as follows:

[root@dc5-influxha01 ~]# cat snmpwalk.lst | grep '\.104 ='
IF-MIB::ifName.104 = STRING: Vl351
IF-MIB::ifInMulticastPkts.104 = Counter32: 0
IF-MIB::ifInBroadcastPkts.104 = Counter32: 31807157
IF-MIB::ifOutMulticastPkts.104 = Counter32: 0
IF-MIB::ifOutBroadcastPkts.104 = Counter32: 0
IF-MIB::ifHCInOctets.104 = Counter64: 340772969979493
IF-MIB::ifHCInUcastPkts.104 = Counter64: 656403980609
IF-MIB::ifHCInMulticastPkts.104 = Counter64: 0
IF-MIB::ifHCInBroadcastPkts.104 = Counter64: 31807157
IF-MIB::ifHCOutOctets.104 = Counter64: 271670987401320
IF-MIB::ifHCOutUcastPkts.104 = Counter64: 411805219439
IF-MIB::ifHCOutMulticastPkts.104 = Counter64: 0
IF-MIB::ifHCOutBroadcastPkts.104 = Counter64: 0
IF-MIB::ifLinkUpDownTrapEnable.104 = INTEGER: enabled(1)
IF-MIB::ifHighSpeed.104 = Gauge32: 1000
IF-MIB::ifPromiscuousMode.104 = INTEGER: false(2)
IF-MIB::ifConnectorPresent.104 = INTEGER: false(2)
IF-MIB::ifAlias.104 = STRING: REDACTED
IF-MIB::ifCounterDiscontinuityTime.104 = Timeticks: (3734) 0:00:37.34
  1. There is more to your config than what is shown.

Not in the config for that one device, no; that's its entire configuration. The intention is to build a resilient, load-balanced platform. I have dismantled this for the purposes of fault-finding and the setup is now pretty basic (Telegraf > InfluxDB).

  2. The device you are polling is returning 2 different rows, and the row's index is more than just the ifName field.

It looks like this may be the case. Having taken the snmpwalk output, it seems to be getting the incorrect value from the next object/branch (not sure on terminology) in the tree:

[root@dc5-influxha01 ~]# grep 637409419 snmpwalk.lst
IF-MIB::ifHCInOctets.105 = Counter64: 637409419

And when I isolate its output in snmpwalk:

[root@dc5-influxha01 ~]# cat snmpwalk.lst | grep '\.105 ='
IF-MIB::ifName.105 = STRING: Vl351
IF-MIB::ifHCInOctets.105 = Counter64: 637409419
IF-MIB::ifHCInUcastPkts.105 = Counter64: 8143038
IF-MIB::ifHCOutOctets.105 = Counter64: 140906590205473
IF-MIB::ifHCOutUcastPkts.105 = Counter64: 333187996005
IF-MIB::ifLinkUpDownTrapEnable.105 = INTEGER: enabled(1)
IF-MIB::ifHighSpeed.105 = Gauge32: 1000
IF-MIB::ifPromiscuousMode.105 = INTEGER: false(2)
IF-MIB::ifConnectorPresent.105 = INTEGER: false(2)
IF-MIB::ifAlias.105 = STRING:
IF-MIB::ifCounterDiscontinuityTime.105 = Timeticks: (3736) 0:00:37.36

ifName is identical.

Is this plugin indexing on the value of the ifName field as opposed to the OID/ID (again, unsure of terminology)?


phemmer commented Nov 14, 2016

The plugin internally indexes on the OID, so that is why both rows are present and not merged together. However, when reporting, unless you grab enough fields to ensure uniqueness, and use is_tag on them, the rows will get merged together when inserted into InfluxDB.

There is another feature request (#1948) to allow adding the OID index as a tag, which would also help differentiate two rows when all tags are identical. But for the time being, the only way to differentiate them would be either to add ifAlias as a tag (since it seems to be different between the two), or to fix the device so that you don't have 2 interfaces with the same ifName value.
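
For example, a sketch of the same table config with ifAlias added as a second tag (this assumes ifAlias differs between the two rows, which your walk suggests):

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

  [[inputs.snmp.table.field]]
    name = "ifName"
    oid = "IF-MIB::ifName"
    is_tag = true

  # Additional tag so rows sharing the same ifName stay distinct in InfluxDB
  [[inputs.snmp.table.field]]
    name = "ifAlias"
    oid = "IF-MIB::ifAlias"
    is_tag = true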

rosspink commented

I get you. Whilst reading this, I noticed the following from the SNMP RFC:

If several entries in the ifTable together represent a single interface as named by the device, then each will have the same value of ifName.

The Cisco device I was looking at is doing just this. It was presenting multiple interfaces with the same ifName.

In my specific use case, one solution is to add IF-MIB::ifType as a tag. The interfaces which share the same name are indeed related, but they are of varying interface types. I don't see any way of distinguishing the ifName values on the device itself, as it is effectively composed of more than one device behind the scenes, as we can see from SNMP.
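
For reference, the extra field would look something like this in the table config (assuming the ifType column from IF-MIB::ifTable shares the same index as the ifXTable rows, which it does in IF-MIB):

  # Tag on the interface type so same-named interfaces of different types stay separate
  [[inputs.snmp.table.field]]
    name = "ifType"
    oid = "IF-MIB::ifType"
    is_tag = true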

Thank you both for your efforts on this; I'll close this off.
