Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ethtool): Gather statistics from namespaces #11895

Merged
merged 2 commits into from
Nov 15, 2022

Conversation

zeffron
Copy link
Contributor

@zeffron zeffron commented Sep 27, 2022

Required for all PRs

resolves #11754

Added network namespace support to the ethtool plugin. By default, it continues
to operate as it always had, gathering metrics from the initial namespace only.
To gather metrics from additional namespaces, CAP_SYS_ADMIN must be added to
the telegraf process, and at least one of the new namespace filters must be
configured.

Unit tests for the new namespace support also require running with
CAP_SYS_ADMIN. They will be skipped if CAP_SYS_ADMIN is misisng. The
original unit tests have been updated to do minimal namespace support testing
and do not require CAP_SYS_ADMIN.

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Sep 27, 2022
@zeffron zeffron force-pushed the ethtool_namespaces branch 4 times, most recently from bb4b613 to d64989e Compare September 27, 2022 20:57
@zeffron
Copy link
Contributor Author

zeffron commented Sep 27, 2022

I think the macOS test failures are unrelated to anything I've done. It seems to be failures in the google_cloud_storage and graylog input plugins, which I have not touched.

Update: not sure what changed, but PR checks are all passing now.

Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, some questions and comments below!

Makefile Outdated Show resolved Hide resolved
plugins/inputs/ethtool/README.md Show resolved Hide resolved
Comment on lines 276 to 275
if err := netns.Set(initialNamespace); err != nil {
c.Log.Errorf("Could not return to initial namespace: %s", err)
return nil, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with this is if an existing user has some additional network namespaces created, they will suddenly start getting errors from Telegraf around operation not permitted, right?

As an example, I created one namespace and ran again with no elevated permissions and error out, whereas before I was collecting metrics without issue:

2022-09-28T14:39:04Z W! [inputs.ethtool] Could not switch to namespace vnet0: operation not permitted
2022-09-28T14:39:04Z E! [inputs.ethtool] Could not return to initial namespace: operation not permitted
2022-09-28T14:39:04Z E! [inputs.ethtool] Error in plugin: operation not permitted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. I did make some changes that I didn't put through my local testing to satisfy the CI linter. I wonder if that changed things. The intention is that unless at least one of the new namespace filters is configured, behavior should be unchanged from prior to this change.
Let me investigate...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found and fixed the bug. It should be working as intended now (extra warnings in the logs, but functions as before otherwise).

@zeffron zeffron force-pushed the ethtool_namespaces branch 6 times, most recently from 9861d16 to a1db0ff Compare September 28, 2022 19:08
Comment on lines 237 to 73
namespaces, err := os.ReadDir("/var/run/netns")
if err != nil {
c.Log.Warnf("Could not find namespace directory: %s", err)
return allInterfaces, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the filtering of the interfaces happen before we hit this point? This way telegraf only reads this directory and attempts to change namespaces, if we have a list of namespaces to actually review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one option.

It would move even more of the logic out of the abstraction layer and into the implementation layer, which also makes the unit tests less useful because they're testing less of the runtime logic. Which is why I didn't go with that approach initially.

But it would be overall more performant to filter the interfaces and namespaces sooner, plus, in the namespace case, would reduce annoying log warning spam.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further testing revealed that this approach has very noticeable performance impact on my workload. Going to see what can be done to improve performance.

@powersj powersj added the waiting for response waiting for response from contributor label Oct 3, 2022
@zeffron zeffron marked this pull request as draft October 3, 2022 20:29
@zeffron zeffron marked this pull request as ready for review October 7, 2022 19:51
@zeffron
Copy link
Contributor Author

zeffron commented Oct 7, 2022

Okay, I think this is ready again. More major refactor, but it now creates an OS thread per-namespace. This saves significant CPU performance at the cost of some extra memory. It also makes it easier to have cleaner logs in the case where default (non-namespaced) behavior is desired.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Oct 7, 2022
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! I have a couple minor comments.

I do want to chat with others on the team about whether this should be a possible separate plugin. I don't want the additional namespace tooling and library to get in the way in what is a rather simple plugin. But this might be fine as is.

plugins/inputs/ethtool/README.md Outdated Show resolved Hide resolved
plugins/inputs/ethtool/ethtool_linux.go Outdated Show resolved Hide resolved
plugins/inputs/ethtool/namespace_linux.go Outdated Show resolved Hide resolved
@zeffron
Copy link
Contributor Author

zeffron commented Oct 12, 2022

I do want to chat with others on the team about whether this should be a possible separate plugin. I don't want the additional namespace tooling and library to get in the way in what is a rather simple plugin. But this might be fine as is.

I think ideally, this should become a more global configuration option with an easy way for plugins to take advantage of the namespace logic to do their thing. Many more plugins than just ethtool have the same limitation/issue. This is just (currently) the only networking plugin used in my project, so it's the only one where we need namespace support.

plugins/inputs/ethtool/ethtool_test.go Outdated Show resolved Hide resolved
plugins/inputs/ethtool/ethtool_linux.go Show resolved Hide resolved
To support monitoring an entire host with one telegraf process, or any
number of other uses of namespaced network interfaces, the ethtool plugin
now supports gathering statistics from interfaces in additional
namespaces.

This functionality must be enabled by adding (at least) one of the new
namespace filters to the plugin configuration, and requires adding
CAP_SYS_ADMIN to the telegraf process.

Resolves influxdata#11754
@telegraf-tiger
Copy link
Contributor

@zeffron zeffron requested review from powersj and removed request for reimda November 10, 2022 21:30
@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Nov 14, 2022
Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for your work @zeffron!

@srebhan srebhan merged commit c7a1d9e into influxdata:master Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[[input.ethtool]] Support interfaces in the non-initial network namespace
4 participants