SNMP plugin hits open file limit errors #2104

Closed
kostasb opened this issue Nov 30, 2016 · 6 comments

kostasb commented Nov 30, 2016

Issue found in Telegraf v1.1.1

Config file loads approximately 390 inputs.snmp plugins - works well with Telegraf v1.0.1.

Telegraf tries to run snmptranslate on every OID defined in [[inputs.snmp.table.field]] under [[inputs.snmp.table]].

Many snmptranslate processes were found running concurrently on the system (over 200 at one point). The snmptranslate commands fail to translate the OIDs, yet their return code is 0.

The system eventually runs out of open file descriptors.

Sample output:

Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Starting Telegraf (version 1.1.1)
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Loaded outputs: influxdb
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Loaded inputs: inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp 
inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs
Nov 29 19:10:02 telegraf telegraf[16111]: .snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp 
inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp in
Nov 29 19:10:02 telegraf telegraf[16111]: puts.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp inputs.snmp
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Tags enabled: host=telegraf
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Agent Config: Interval:5m0s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Tags enabled: host=telegraf
Nov 29 19:10:02 telegraf telegraf[16111]: 2016/11/29 19:10:02 I! Agent Config: Interval:5m0s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
Nov 29 19:15:00 telegraf telegraf[16111]: 2016/11/29 19:15:00 E! ERROR in input [inputs.snmp]: translating .1.3.6.1.4.1.6141.2.60.35.1.21.7.1.1.50: pipe2: too many open files
Nov 29 19:15:00 telegraf telegraf[16111]: 2016/11/29 19:15:00 D! Input [inputs.snmp] gathered metrics, (5m0s interval) in 564.946µs

Contributor

sparrc commented Nov 30, 2016

I believe this issue has come to light because the user is using many (thousands) of snmp.table.fields in their config file:

        [[inputs.snmp.table.field]]
        name = "ifHCOutUcastPkts"
        oid = ".1.3.6.1.2.1.31.1.1.1.11"

        [[inputs.snmp.table.field]]
        name = "ifHCInBroadcastPkts"
        oid = ".1.3.6.1.2.1.31.1.1.1.9"

In 1.0 these fields were not translated, but that changed in 1.1 with #1836.

@phemmer do you have any ideas on how to solve this? A couple questions I have:

  1. Do we need to run snmptranslate when we are already given a name?
  2. Should we be retrying snmptranslate when we get an error trying to run it? AFAICT we are currently retrying on every call to Gather even if it's already failed previously.

FWIW, I think that fixing #1665 would have prevented this from being hit in the first place, as the user wouldn't have needed to specify thousands of separate fields just to parallelize their agent collections.

@sparrc sparrc added the bug unexpected problem or unintended behavior label Nov 30, 2016
Contributor

phemmer commented Nov 30, 2016

> I believe this issue has come to light because the user is using many (thousands) of snmp.table.fields in their config file:

Actually it's because of the number of [[inputs.snmp]] sections in the config. Each instance performs its snmptranslate calls serially, but telegraf launches all the instances in parallel.
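
To illustrate the point above, here is a minimal Go sketch (not Telegraf code) of why serial lookups inside each instance still add up to many simultaneous lookups. The function name `peakConcurrency` and the one-millisecond sleep standing in for a subprocess call are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency starts one goroutine per simulated plugin instance.
// Each instance performs its lookups one after another, and the function
// returns the largest number of lookups that were in flight at once.
func peakConcurrency(instances, lookups int) int64 {
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < lookups; j++ {
				n := atomic.AddInt64(&inFlight, 1)
				// Raise the recorded peak if this lookup set a new maximum.
				for {
					p := atomic.LoadInt64(&peak)
					if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for a subprocess call
				atomic.AddInt64(&inFlight, -1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	// 390 matches the number of inputs.snmp sections reported in this issue.
	fmt.Println("peak concurrent lookups:", peakConcurrency(390, 5))
}
```

The peak can never exceed the number of instances, but with hundreds of instances it can get close to it, and in the real plugin each in-flight lookup is a child process with its own pipes.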

> Do we need to run snmptranslate when we are already given a name?

snmptranslate also looks up any conversions, not just the name.

> Should we be retrying snmptranslate when we get an error trying to run it?

I would vote no. It just complicates the code for a rather uncommon edge case, one that would become nearly impossible with the change I mention below.

> AFAICT we are currently retrying on every call to Gather even if it's already failed previously.

snmptranslate is only called when the plugin first starts, and never again.

> FWIW, I think that fixing #1665 would have prevented this from being hit in the first place

Yes, #1665 (the object-instance-per-agent thing) might have prevented this, as the config probably wouldn't have had so many [[inputs.snmp]] sections. But such a change brings new problems of its own, such as running out of memory and file descriptor exhaustion due to the number of connections. Though it should consume the same amount of resources as creating X [[inputs.snmp]] sections. (Edit: never mind, it probably wouldn't.)

The proper way to fix this, I think, is to share the snmptranslate data globally.
snmptranslate foo is going to have the same result for every instance of the plugin, so there's no reason for every instance to call it separately when they could share the information.
I'll start working on that. It should be a clean and relatively simple change.
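
The shared-lookup idea described above could be sketched roughly as follows. This is an illustrative Go sketch only, not the code from the eventual PR: the names `translateCache`, `translation`, `get`, and `countLookups` are hypothetical, and the injected `lookup` function stands in for whatever actually runs snmptranslate.

```go
package main

import (
	"fmt"
	"sync"
)

// translation holds the result of one lookup. Field names are illustrative.
type translation struct {
	name       string
	conversion string
	err        error
}

// translateCache memoizes lookups so every plugin instance shares them.
type translateCache struct {
	mu      sync.Mutex
	entries map[string]translation
	lookup  func(oid string) translation // hypothetical wrapper around snmptranslate
}

func newTranslateCache(lookup func(string) translation) *translateCache {
	return &translateCache{entries: make(map[string]translation), lookup: lookup}
}

// get returns the cached result for oid and calls lookup at most once per OID.
// Holding the mutex during the lookup also means only one lookup runs at a time.
func (c *translateCache) get(oid string) translation {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.entries[oid]; ok {
		return t
	}
	t := c.lookup(oid)
	c.entries[oid] = t // failures are cached too, so they are not retried
	return t
}

// countLookups runs the given number of simulated instances in parallel,
// each asking for every OID, and reports how many real lookups happened.
func countLookups(instances int, oids []string) int {
	calls := 0
	cache := newTranslateCache(func(oid string) translation {
		calls++ // safe: only ever called while the cache mutex is held
		return translation{name: "name-for-" + oid}
	})
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, oid := range oids {
				cache.get(oid)
			}
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	oids := []string{".1.3.6.1.2.1.31.1.1.1.11", ".1.3.6.1.2.1.31.1.1.1.9"}
	fmt.Println(countLookups(390, oids)) // prints 2: one lookup per distinct OID
}
```

Serializing lookups behind a single mutex is the simplest choice for a sketch. It trades startup parallelism for a hard cap of one lookup in flight, which is what keeps the file descriptor count bounded.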

Contributor

sparrc commented Nov 30, 2016

> snmptranslate is only called when the plugin first starts, and never again.

I don't think this is happening in failure scenarios, because s.initialized = true only gets set if there are no errors returned from "Tables.init" and "Fields.init":
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/snmp/snmp.go#L141-L160
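
The behavior described above can be reproduced with a small Go sketch. It is a simplified model, not the code at the linked lines: the `plugin` type, its `initFn` field, and the `initCallsAfter` helper are hypothetical names that only mirror the pattern of setting an initialized flag after a successful init.

```go
package main

import (
	"errors"
	"fmt"
)

// plugin models an input whose setup runs lazily inside gather.
type plugin struct {
	initialized bool
	initCalls   int
	initFn      func() error // hypothetical stand-in for table and field init
}

// gather runs initialization first if it has not yet succeeded.
func (p *plugin) gather() error {
	if !p.initialized {
		p.initCalls++
		if err := p.initFn(); err != nil {
			return err // flag stays false, so the next gather tries again
		}
		p.initialized = true
	}
	return nil
}

// initCallsAfter reports how many times init ran over a number of gathers.
func initCallsAfter(gathers int, failing bool) int {
	p := &plugin{initFn: func() error {
		if failing {
			return errors.New("pipe2: too many open files")
		}
		return nil
	}}
	for i := 0; i < gathers; i++ {
		_ = p.gather()
	}
	return p.initCalls
}

func main() {
	fmt.Println(initCallsAfter(5, false)) // prints 1: init succeeds once
	fmt.Println(initCallsAfter(5, true))  // prints 5: init is retried every gather
}
```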

Contributor

phemmer commented Nov 30, 2016

Ah, good call. Its init is called every gather because there is no explicit plugin initialization call. And since telegraf kicks them all off in parallel, the issue would recur over and over.
It would be nice if we had an init call for stuff like this and #1438.

I think the global snmptranslate data is enough of a fix for this, to the point where we wouldn't otherwise need to change the way the plugin initializes itself.

Contributor

toddboom commented Dec 1, 2016

@phemmer It sounds like you're on top of this, but let me know if you need a hand or eyes on a PR. We've got a few people who have asked about this, so we can probably get a 1.1.2 release lined up whenever this is fixed. Thanks for digging in!

Contributor

phemmer commented Dec 2, 2016

PR is up: #2115

@sparrc sparrc modified the milestones: 1.2.0, 1.1.2 Dec 5, 2016
@sparrc sparrc closed this as completed in b58926d Dec 12, 2016
sparrc pushed a commit that referenced this issue Dec 12, 2016
Prevents the same data from being looked up multiple times. Also prevents multiple simultaneous lookups.

closes #2115
closes #2104
njwhite pushed a commit to njwhite/telegraf that referenced this issue Jan 31, 2017
Prevents the same data from being looked up multiple times. Also prevents multiple simultaneous lookups.

closes influxdata#2115
closes influxdata#2104