Host Metrics process scraper #1047
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1047 +/- ##
==========================================
+ Coverage 86.34% 86.59% +0.24%
==========================================
Files 198 204 +6
Lines 14159 14453 +294
==========================================
+ Hits 12226 12515 +289
- Misses 1477 1481 +4
- Partials 456 457 +1
Continue to review full report at Codecov.
Force-pushed from 90788b0 to 09f2789
@@ -42,7 +42,7 @@ require (
	github.com/prometheus/prometheus v1.8.2-0.20190924101040-52e0504f83ea
	github.com/rs/cors v1.6.0
	github.com/securego/gosec v0.0.0-20200316084457-7da9f46445fd
-	github.com/shirou/gopsutil v2.20.4+incompatible
+	github.com/shirou/gopsutil v0.0.0-20200517204708-c89193f22d93 // c89193f22d9359848988f32aee972122bb2abdc2
`gopsutil` needed to be updated to include this relatively recent PR that improves the performance of reading process information on Windows: shirou/gopsutil#862
receiver/hostmetricsreceiver/internal/scraper/processscraper/process.go (outdated, resolved)
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_constants.go (outdated, resolved)
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go (outdated, resolved)
metrics := pdata.NewMetricSlice()

processes, err := s.getProcesses()
if len(processes) == 0 {
It's more idiomatic to check `if err != nil` first, and the result should be empty when `err` is not nil:
if err != nil {
return nil, err
} else if len(processes) == 0 {
return metrics, nil
}
I've followed that approach where possible, but it's not so straightforward in a case like this where partial failure is possible (we failed to get some processes but succeeded in getting others), and we want to propagate any partial errors up to be reported on in aggregate.

For now I've added a comment to the function that returns partial results to document this behaviour, but I'm happy to explore other options if this is strongly unidiomatic. I guess another option would be to store partial errors in a different type (i.e. not `error`)?
Probably returning a "result" struct rather than a tuple would help to make it more idiomatic, i.e. in that case the error is part of the function result rather than an "exception".
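The suggested "result struct" pattern might look roughly like this. This is a minimal sketch, not code from the PR: `processResult`, `getProcesses`, and the fake "negative pid fails" rule are all hypothetical stand-ins for the real gopsutil-based scraping.

```go
package main

import (
	"errors"
	"fmt"
)

// processResult carries both the successfully scraped handles and any
// partial error, making the error part of the result rather than an
// exception-like second return value.
type processResult struct {
	Handles      []string // stand-in for the scraped process handles
	PartialError error    // non-nil if some processes could not be read
}

// getProcesses sketches a scrape that succeeds for some processes and
// records failures for the rest instead of aborting on the first error.
func getProcesses(pids []int) processResult {
	var res processResult
	var errs []error
	for _, pid := range pids {
		if pid < 0 { // pretend negative pids are unreadable
			errs = append(errs, fmt.Errorf("pid %d: access denied", pid))
			continue
		}
		res.Handles = append(res.Handles, fmt.Sprintf("proc-%d", pid))
	}
	// errors.Join returns nil when errs is empty (Go 1.20+).
	res.PartialError = errors.Join(errs...)
	return res
}

func main() {
	res := getProcesses([]int{1, -2, 3})
	fmt.Println(len(res.Handles), res.PartialError != nil)
}
```

The caller can then always consume the partial results and decide separately whether to report the aggregated error.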
Alright, I might leave that for a separate PR though, as this style is used throughout the existing scrapers at the moment (in particular, in the result of the `ScrapeMetrics` function).
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go (outdated, resolved; four threads)
Force-pushed from 482bff3 to b7a19b0
Force-pushed from 8692330 to 38d16e8
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_constants.go (outdated, resolved)
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go (outdated, resolved)
Force-pushed from 3918fa2 to 06c5fa1
@@ -12,6 +12,7 @@ receivers:
    disk:
    filesystem:
    network:
+   virtualmemory:
This PR is about `process`... why `virtualmemory`? Forgot to add `process`?
I forgot to add `virtualmemory` before (thus adding it now 🤦♂️).

I intentionally did not add `process` to the default/example config, as process metrics are more expensive to compute and may often not be needed.
How many processes are running on the machine you're benchmarking with? Any idea what the max we may see in production? There are optimizations we could do but I think we need some realistic workloads running to validate against.
process:
  include:
    names: ["test1", "test2"]
    match_type: "regexp"
Nothing to do for this PR but I've made a proposal #1081 for improving filtering. Would make these configs a little cleaner I think.
var metricDiskBytesDescriptor = createMetricDiskBytesDescriptor()

func createMetricDiskBytesDescriptor() pdata.MetricDescriptor {
Not worth changing here, given it's been done like this throughout and this will be generated someday soonish, but you could just use a single variable:
var metricDiskBytesDescriptor = func() pdata.MetricDescriptor {
...}()
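Spelled out, that suggestion is a package-level variable initialized by an immediately-invoked function literal. Here is a self-contained sketch; `metricDescriptor` is a hypothetical stand-in for `pdata.MetricDescriptor`, and the name/unit values are illustrative, not from the PR:

```go
package main

import "fmt"

// metricDescriptor is a stand-in for pdata.MetricDescriptor so the
// pattern compiles on its own.
type metricDescriptor struct {
	Name, Unit string
}

// The separate create* function is folded into the var declaration:
// the function literal runs once at package initialization, so the
// descriptor is still built exactly once.
var metricDiskBytesDescriptor = func() metricDescriptor {
	return metricDescriptor{Name: "process/disk/bytes", Unit: "By"}
}()

func main() {
	fmt.Println(metricDiskBytesDescriptor.Name)
}
```

This keeps the one-time construction logic next to the variable it initializes, at the cost of a slightly denser declaration.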
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_metadata.go (outdated, resolved)
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go (outdated, resolved; three threads)
Force-pushed from b1afc97 to 16aa955
The benchmark above was from running on my local machine with ~300 processes. I'll get back to you with some estimates on what workloads we expect.
Force-pushed from 16aa955 to a8cacff
Force-pushed from 9ae571f to c114078
Force-pushed from c114078 to 3738da8
I've now updated this to create a separate resource (and associated metrics) for each process. This did end up requiring some fairly significant changes: the existing Scraper API only supported returning a MetricSlice, so I created a separate API to use for the Process Scraper that returns a ResourceMetricsSlice.

Apologies for this PR getting a bit large. I've separated the refactoring around how resources are handled into a separate commit, so for those who've already taken a look, you can just review the final commit.
@@ -118,6 +123,18 @@ func (f *Factory) CustomUnmarshaler() component.CustomUnmarshaler {
	}
}

func (f *Factory) getScraperFactory(key string) (internal.BaseFactory, bool) {
	if factory, ok := f.scraperFactories[key]; ok {
		return factory, true
The logic here means that we must never have duplicate keys in `scraperFactories` and `resourceScraperFactories`. Do you have a validation that would throw an error if such a conflict occurs?
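One way to satisfy that request is an init-time check that the two maps share no keys. This is a hedged sketch with string stand-ins for the factory values (the real maps hold factory interfaces, and `validateFactoryKeys` is a hypothetical helper name):

```go
package main

import "fmt"

// Stand-ins for the two factory maps discussed above; the real maps
// hold scraper factory implementations keyed by config name.
var scraperFactories = map[string]string{
	"cpu":  "cpuFactory",
	"disk": "diskFactory",
}
var resourceScraperFactories = map[string]string{
	"process": "processFactory",
}

// validateFactoryKeys returns an error if any config key appears in
// both maps, which would make getScraperFactory's lookup ambiguous.
// It could be called from init() or from the factory constructor.
func validateFactoryKeys() error {
	for key := range scraperFactories {
		if _, ok := resourceScraperFactories[key]; ok {
			return fmt.Errorf("duplicate scraper factory key %q", key)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateFactoryKeys()) // prints <nil>: no overlap here
}
```

Running this at startup (or asserting it in a unit test) turns a silent shadowing bug into a loud failure.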
}

// Scraper gathers metrics from the host machine.
type Scraper interface {
Consider giving a more specific name to this interface, otherwise `ResourceScraper` sounds like a subcategory of `Scraper`, but it's an independent entity AFAICT. e.g. `HostScraper`
Yea, I agree this name is definitely a bit confusing. I'm not sure what to call it though, as I'd prefer to leave "Host" out of the name: it's not an explicit requirement of that interface that metrics are scraped from the host, nor is it a requirement of the other interface that metrics are not scraped from the host. Would something like `MetricsSliceScraper` & `ResourceMetricsSliceScraper` be better?

I'll create a separate PR to do this refactor as it will impact all the other scrapers.
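The two shapes under discussion can be sketched side by side. This is a toy reconstruction, not the PR's actual code: `metricSlice` and `resourceMetricsSlice` stand in for the pdata types, and `baseScraper`/`processScraper` are illustrative names only.

```go
package main

import "fmt"

// Stand-ins for the pdata types so the interfaces compile on their own.
type metricSlice []string
type resourceMetricsSlice []metricSlice

// baseScraper holds whatever both interfaces share (lifecycle, name).
type baseScraper interface {
	Name() string
}

// scraper returns a flat metric slice for one implicit resource (the host).
type scraper interface {
	baseScraper
	ScrapeMetrics() (metricSlice, error)
}

// resourceScraper returns metrics grouped per dynamically discovered
// resource, e.g. one entry per running process.
type resourceScraper interface {
	baseScraper
	ScrapeMetrics() (resourceMetricsSlice, error)
}

// processScraper is a toy resourceScraper: two fake per-process groups.
type processScraper struct{}

func (processScraper) Name() string { return "process" }
func (processScraper) ScrapeMetrics() (resourceMetricsSlice, error) {
	return resourceMetricsSlice{{"cpu"}, {"memory"}}, nil
}

func main() {
	var rs resourceScraper = processScraper{}
	groups, _ := rs.ScrapeMetrics()
	fmt.Println(rs.Name(), len(groups))
}
```

Seen this way, the naming complaint is clear: the only difference is the return type of `ScrapeMetrics`, which is why type-based names like `MetricsSliceScraper` were floated.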
type ResourceScraper interface {
	BaseScraper

	// ScrapeMetrics returns relevant scraped metrics per resource.
No suggestion, just trying to understand the problem better, since this seems to be the key change here.

AFAICT the main difference between `Scraper` and `ResourceScraper` is that `Scraper` returns metrics for a single fixed resource (the host), while `ResourceScraper` dynamically gathers a list of multiple processes, generates a resource per process, and returns metrics for each of them. So in that sense, `Scraper` is just a special case of `ResourceScraper` with exactly one (implicit) resource.

Follow-up question: why don't you need to create a separate resource per CPU in `cpuscraper` or per disk in `diskscraper`? What makes processes unique?
- For processes, the cardinality can be very high (1000s), which may cause issues when storing the metric data. If grouped by resource, the data is sharded against different root resources (iiuc).
- We may want to create lots of metrics against the same process (at the moment it's just three, but this will presumably be extended), and it's easier to group metrics against the same resource than to look up similar label values.

While they wouldn't be impacted by those issues to the same degree, having a resource per cpu, disk, or network interface could make sense.

@bogdandrutu can probably explain the reasoning in more detail.
Thanks, that makes perfect sense.
Force-pushed from 3738da8 to c805ff0
LGTM, just a couple of code style nits / checks
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go (outdated, resolved; two threads)
Force-pushed from caebde5 to 107f573
…ather than labels
Force-pushed from 107f573 to 267247e
* Initial commit of host metrics process scraper that scrapes cpu, memory & disk metrics
* Refactored the host metrics process scraper getProcessHandles func to return an interface to avoid having to convert the type of each process struct returned by gopsutil
* Changed process/cpu/usage metric to double type
* Refactor process metrics to store process information as a resource rather than labels
* Add support for filtering label sets
* Restore test
* Fix Value() bug
* Pass kv.KeyValue
* Apply suggestions from code review

Thank you @MrAlias.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Link to tracking Issue:
#847
Description:
Added process scraper to the hostmetricsreceiver, which currently scrapes cpu, memory & disk usage stats using gopsutil. Also includes the ability to include/exclude processes by name using the existing `filterset` config.

Example Metrics:

^ in most cases "command_line" is not empty, but it's missing for a few system processes on Windows
Benchmarks: