
Host Metrics process scraper #1047

Merged

Conversation

@james-bebbington (Member) commented May 29, 2020

Link to tracking Issue:
#847

Description:
Added a process scraper to the hostmetricsreceiver, which currently scrapes CPU, memory & disk usage stats using gopsutil. It also includes the ability to include/exclude processes by name using the existing filterset config.
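For illustration, a simplified sketch of how include/exclude filtering by process name works conceptually; the receiver actually uses the collector's filterset package, whose API is not reproduced here, so all names below are hypothetical.

package main

import (
	"fmt"
	"regexp"
)

// nameFilter is a stand-in for the filterset-based include/exclude config.
type nameFilter struct {
	include []*regexp.Regexp // if empty, everything is included
	exclude []*regexp.Regexp
}

// matches reports whether a process name passes the include list and is not
// caught by the exclude list.
func (f nameFilter) matches(name string) bool {
	included := len(f.include) == 0
	for _, re := range f.include {
		if re.MatchString(name) {
			included = true
			break
		}
	}
	if !included {
		return false
	}
	for _, re := range f.exclude {
		if re.MatchString(name) {
			return false
		}
	}
	return true
}

func main() {
	f := nameFilter{include: []*regexp.Regexp{regexp.MustCompile("^test")}}
	fmt.Println(f.matches("test1"), f.matches("systemd")) // true false
}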

Example Metrics:
(screenshots of example process metrics)

In most cases "command_line" is not empty in the screenshots above, but it's missing for a few system processes on Windows.

Benchmarks:
(benchmark results screenshot)

codecov bot commented May 29, 2020

Codecov Report

Merging #1047 into master will increase coverage by 0.24%.
The diff coverage is 95.18%.


@@            Coverage Diff             @@
##           master    #1047      +/-   ##
==========================================
+ Coverage   86.34%   86.59%   +0.24%     
==========================================
  Files         198      204       +6     
  Lines       14159    14453     +294     
==========================================
+ Hits        12226    12515     +289     
- Misses       1477     1481       +4     
- Partials      456      457       +1     
Impacted Files Coverage Δ
...ceiver/hostmetricsreceiver/hostmetrics_receiver.go 85.24% <86.11%> (+11.71%) ⬆️
...eceiver/internal/scraper/processscraper/process.go 88.88% <88.88%> (ø)
...internal/scraper/processscraper/process_scraper.go 98.47% <98.47%> (ø)
receiver/hostmetricsreceiver/factory.go 87.09% <100.00%> (+13.51%) ⬆️
...ceiver/internal/scraper/cpuscraper/cpu_metadata.go 100.00% <100.00%> (ø)
...eceiver/internal/scraper/cpuscraper/cpu_scraper.go 88.23% <100.00%> (ø)
...r/internal/scraper/cpuscraper/cpu_scraper_linux.go 100.00% <100.00%> (ø)
...eceiver/internal/scraper/processscraper/factory.go 100.00% <100.00%> (ø)
...nternal/scraper/processscraper/process_metadata.go 100.00% <100.00%> (ø)
...al/scraper/processscraper/process_scraper_linux.go 100.00% <100.00%> (ø)
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update caadbbc...267247e.

@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from 90788b0 to 09f2789 on May 29, 2020 03:35
@@ -42,7 +42,7 @@ require (
github.com/prometheus/prometheus v1.8.2-0.20190924101040-52e0504f83ea
github.com/rs/cors v1.6.0
github.com/securego/gosec v0.0.0-20200316084457-7da9f46445fd
-github.com/shirou/gopsutil v2.20.4+incompatible
+github.com/shirou/gopsutil v0.0.0-20200517204708-c89193f22d93 // c89193f22d9359848988f32aee972122bb2abdc2
james-bebbington (Member Author):

gopsutil needed to be updated to include this relatively recent PR that improves the performance of reading process information on Windows: shirou/gopsutil#862

metrics := pdata.NewMetricSlice()

processes, err := s.getProcesses()
if len(processes) == 0 {
Member:

It's more idiomatic to check for if err != nil first, and result should be empty when err is not nil.

if err != nil {
  return nil, err
} else if len(processes) == 0 {
  return metrics, nil
}

james-bebbington (Member Author):

I've followed that approach where possible, but it's not so straightforward in a case like this where partial failure is possible (failed to get some processes, succeeded in getting others), and we want to propagate any partial errors up to be reported on in aggregate.

For now I've added a comment to the function that returns partial results to document this behaviour, but I'm happy to explore other options if this is strongly unidiomatic. I guess another option would be to store partial errors in a different type (i.e. not error)?
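To illustrate the partial-failure pattern being discussed, here is a minimal, self-contained sketch; it is not the PR's actual code, readName is a hypothetical helper, and errors.Join from the standard library (Go 1.20+) stands in for whatever error aggregation the collector itself uses.

package main

import (
	"errors"
	"fmt"
)

// scrapeProcesses returns whatever it could scrape plus an aggregated error
// describing the processes it could not read.
func scrapeProcesses(pids []int32) ([]string, error) {
	var metrics []string
	var errs []error
	for _, pid := range pids {
		name, err := readName(pid)
		if err != nil {
			errs = append(errs, fmt.Errorf("pid %d: %w", pid, err))
			continue // keep going: a partial failure should not drop the other processes
		}
		metrics = append(metrics, "process.cpu.time for "+name)
	}
	return metrics, errors.Join(errs...) // nil when errs is empty
}

// readName simulates a per-process read that fails for some processes.
func readName(pid int32) (string, error) {
	if pid%2 == 0 {
		return "", errors.New("access denied") // stand-in for a permissions failure
	}
	return fmt.Sprintf("proc-%d", pid), nil
}

func main() {
	m, err := scrapeProcesses([]int32{1, 2, 3, 4})
	fmt.Printf("scraped %d metrics; partial error: %v\n", len(m), err)
}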

Member:

Probably returning a "result" struct rather than a tuple would help make it more idiomatic.
i.e. in that case the error is part of the function result rather than an "exception".
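A minimal sketch of what that result-struct shape could look like; all names here are illustrative, not from the PR.

package main

import "fmt"

// scrapeResult makes partial errors part of the function's return value
// rather than treating them as an exceptional error path.
type scrapeResult struct {
	Metrics []string // stand-in for per-process metric data
	Errors  []error
}

func scrapeAll(pids []int32) scrapeResult {
	var res scrapeResult
	for _, pid := range pids {
		if pid%2 == 0 { // stand-in for a process that cannot be read
			res.Errors = append(res.Errors, fmt.Errorf("pid %d: access denied", pid))
			continue
		}
		res.Metrics = append(res.Metrics, fmt.Sprintf("metrics for pid %d", pid))
	}
	return res
}

func main() {
	res := scrapeAll([]int32{1, 2, 3, 4})
	fmt.Printf("%d metrics scraped, %d errors\n", len(res.Metrics), len(res.Errors))
}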

james-bebbington (Member Author):

Alright - I might leave that for a separate PR though, as this style is used throughout the existing scrapers at the moment (in particular, in the result of the ScrapeMetrics function).

@james-bebbington force-pushed the hostmetrics-process branch 3 times, most recently from 482bff3 to b7a19b0 on May 29, 2020 05:40
@tigrannajaryan added this to In progress in Collector via automation May 29, 2020
@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from 8692330 to 38d16e8 on May 31, 2020 23:27
@james-bebbington force-pushed the hostmetrics-process branch 3 times, most recently from 3918fa2 to 06c5fa1 on June 4, 2020 03:59
@@ -12,6 +12,7 @@ receivers:
disk:
filesystem:
network:
virtualmemory:
Member:

This is process... Why virtualmem? Forgot to add process?

james-bebbington (Member Author):

I forgot to add virtualmemory before (thus adding it now 🤦‍♂️).

I intentionally did not add process to the default/example config, as process metrics are more expensive to compute and may often not be needed.

Resolved (outdated) review comment on receiver/hostmetricsreceiver/internal/testutils.go
@jrcamp (Contributor) left a comment:

How many processes are running on the machine you're benchmarking with? Any idea what the max we may see in production? There are optimizations we could do but I think we need some realistic workloads running to validate against.

Comment on lines +15 to +18
process:
  include:
    names: ["test1", "test2"]
    match_type: "regexp"
Contributor:

Nothing to do for this PR, but I've made a proposal (#1081) for improving filtering that would make these configs a little cleaner, I think.

Comment on lines +72 to +75
var metricDiskBytesDescriptor = createMetricDiskBytesDescriptor()

func createMetricDiskBytesDescriptor() pdata.MetricDescriptor {
Contributor:

Not worth changing just here, given it's been done like this throughout and that this will be generated someday soon-ish, but you could just use a single variable:

var metricDiskBytesDescriptor = func() pdata.MetricDescriptor {
...}()
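For reference, the single-variable pattern being suggested looks like this in general Go; a placeholder struct is used here instead of the real pdata.MetricDescriptor API, and the metric name/unit values are made up.

package main

import "fmt"

// descriptor is a placeholder for pdata.MetricDescriptor, used only to
// illustrate the initialization pattern.
type descriptor struct {
	name, unit string
}

// The create function is folded into the variable declaration via an
// immediately invoked function literal, so there is no separate
// createMetricDiskBytesDescriptor symbol to maintain.
var metricDiskBytesDescriptor = func() descriptor {
	return descriptor{name: "process/disk/bytes", unit: "bytes"}
}()

func main() {
	fmt.Println(metricDiskBytesDescriptor)
}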

james-bebbington (Member Author):

> How many processes are running on the machine you're benchmarking with? Any idea what the max we may see in production? There are optimizations we could do but I think we need some realistic workloads running to validate against.

The benchmark above was from running on my local machine with ~300 processes. I'll get back to you with some estimates on what workloads we expect.

@james-bebbington force-pushed the hostmetrics-process branch 8 times, most recently from 9ae571f to c114078 on June 22, 2020 03:57
@james-bebbington changed the title [WIP] Host Metrics process scraper → Host Metrics process scraper Jun 22, 2020
@james-bebbington marked this pull request as ready for review June 22, 2020 04:00
james-bebbington (Member Author):

I've now updated this to create a separate resource (and associated metrics) for each process. This did end up requiring some fairly significant changes: the existing Scraper API only supported returning a MetricSlice, so I created a separate API to use for the Process Scraper that returns a ResourceMetricsSlice.

Apologies for this PR getting a bit large - I've separated the refactoring around how resources are handled into a separate commit, so for those who've already taken a look, you can just review the final commit.

FYI @jrcamp @nilebox @bogdandrutu
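Roughly, the split between the two APIs can be pictured like this. This is a hedged sketch with stand-in types: the signatures paraphrase the diff excerpts further down and are not the exact collector API.

package main

import (
	"context"
	"fmt"
)

// metricSlice and resourceMetrics are stand-ins for pdata.MetricSlice and
// pdata.ResourceMetricsSlice.
type metricSlice []string

type resourceMetrics struct {
	resource string      // e.g. a process identified by pid/name
	metrics  metricSlice // metrics attached to that resource
}

// Scraper mirrors the existing API: metrics for one implicit resource (the host).
type Scraper interface {
	ScrapeMetrics(ctx context.Context) (metricSlice, error)
}

// ResourceScraper mirrors the new API used by the process scraper, which must
// return metrics grouped under a separate resource per process.
type ResourceScraper interface {
	ScrapeMetrics(ctx context.Context) ([]resourceMetrics, error)
}

type fakeProcessScraper struct{}

func (fakeProcessScraper) ScrapeMetrics(ctx context.Context) ([]resourceMetrics, error) {
	return []resourceMetrics{
		{resource: "process pid=1 name=init", metrics: metricSlice{"process/cpu/time"}},
	}, nil
}

func main() {
	var rs ResourceScraper = fakeProcessScraper{}
	out, err := rs.ScrapeMetrics(context.Background())
	fmt.Println(out, err)
}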

@@ -118,6 +123,18 @@ func (f *Factory) CustomUnmarshaler() component.CustomUnmarshaler {
}
}

func (f *Factory) getScraperFactory(key string) (internal.BaseFactory, bool) {
if factory, ok := f.scraperFactories[key]; ok {
return factory, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here means that we must never have duplicate keys in scraperFactories and resourceScraperFactories. Do you have a validation that would throw an error if such conflict occurs?
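A sketch of the kind of validation being asked about might look like this; the function name and the simplified map value types are hypothetical, not code from the PR.

package main

import "fmt"

// validateFactoryKeys rejects configurations where the same key is registered
// in both the scraper and resource-scraper factory maps.
func validateFactoryKeys(scraperFactories, resourceScraperFactories map[string]struct{}) error {
	for key := range scraperFactories {
		if _, ok := resourceScraperFactories[key]; ok {
			return fmt.Errorf("duplicate scraper factory key %q", key)
		}
	}
	return nil
}

func main() {
	a := map[string]struct{}{"cpu": {}, "disk": {}}
	b := map[string]struct{}{"process": {}, "cpu": {}}
	fmt.Println(validateFactoryKeys(a, b)) // duplicate scraper factory key "cpu"
}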

Resolved (outdated) review comments on:
receiver/hostmetricsreceiver/hostmetrics_receiver.go (×5)
receiver/hostmetricsreceiver/internal/scraper.go
}

// Scraper gathers metrics from the host machine.
type Scraper interface {
Member:

Consider giving a more specific name to this interface; otherwise ResourceScraper sounds like a subcategory of Scraper, but it's an independent entity AFAICT.
e.g. HostScraper

james-bebbington (Member Author):

Yeah, I agree this name is definitely a bit confusing. I'm not sure what to call it though, as I'd prefer to leave "Host" out of the name: it's not an explicit requirement of this interface that metrics are scraped from the host, nor is it a requirement of the other interface that they are not. Would something like MetricsSliceScraper & ResourceMetricsSliceScraper be better?

I'll create a separate PR to do this refactor as it will impact all the other scrapers.

type ResourceScraper interface {
BaseScraper

// ScrapeMetrics returns relevant scraped metrics per resource.
@nilebox (Member) commented Jun 22, 2020:

No suggestion, just trying to understand the problem better, since this seems to be the key change here.

AFAICT the main difference between Scraper and ResourceScraper is that Scraper returns metrics for a single fixed resource (host), while ResourceScraper dynamically gathers a list of multiple processes, generates a resource per process and returns metrics for each of them.

So in that sense, Scraper is just a special case of ResourceScraper with exactly one (implicit) resource.

Follow-up question: why don't you need to create a separate resource per CPU in cpuscraper or per disk in diskscraper? What makes the processes unique?

james-bebbington (Member Author):

1. For processes, the cardinality can be very high (1000s), which may cause issues when storing the metric data. If grouped by resource, the data is sharded against different root resources (IIUC).
2. We may want to create lots of metrics against the same process (at the moment it's just three, but this will presumably be extended), and it's easier to group metrics against the same resource than to look up similar label values.

While they wouldn't be impacted by those issues to the same degree, having a resource per cpu, disk, or network interface could make sense.

@bogdandrutu can probably explain the reasoning in more detail.

Member:

Thanks, that makes perfect sense.

@nilebox (Member) left a comment:

LGTM, just a couple of code style nits / checks

@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from caebde5 to 107f573 on June 25, 2020 01:24
Collector automation moved this from In progress to Reviewer approved Jun 30, 2020
@bogdandrutu merged commit af5bdb1 into open-telemetry:master Jun 30, 2020
Collector automation moved this from Reviewer approved to Done Jun 30, 2020
@pavolloffay mentioned this pull request Jun 30, 2020
@james-bebbington mentioned this pull request Jun 30, 2020
wyTrivail pushed a commit to mxiamxia/opentelemetry-collector that referenced this pull request Jul 13, 2020
* Initial commit of host metrics process scraper that scrapes cpu, memory & disk metrics

* Refactored the host metrics process scraper getProcessHandles func to return an interface to avoid having to convert the type of each process struct returned by gopsutil

* Changed process/cpu/usage metric to double type

* Refactor process metrics to store process information as a resource rather than labels
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
* Add support for filtering label sets

* Restore test

* Fix Value() bug

* Pass kv.KeyValue

* Apply suggestions from code review

Thank you @MrAlias.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Labels: none
Projects: Collector (Done)
4 participants