
Host Metrics process scraper #1047

Merged

Conversation

@james-bebbington (Member) commented May 29, 2020

Link to tracking Issue:
#847

Description:
Added a process scraper to the hostmetricsreceiver, which currently scrapes CPU, memory & disk usage stats using gopsutil. It also includes the ability to include/exclude processes by name using the existing filterset config.
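For illustration, a simplified sketch of how include/exclude filtering by process name works conceptually; the receiver actually uses the collector's filterset package, whose API is not reproduced here, so all names below are hypothetical.

package main

import (
	"fmt"
	"regexp"
)

// nameFilter is a stand-in for the filterset-based include/exclude config.
type nameFilter struct {
	include []*regexp.Regexp // if empty, everything is included
	exclude []*regexp.Regexp
}

// matches reports whether a process name passes the include list and is not
// caught by the exclude list.
func (f nameFilter) matches(name string) bool {
	included := len(f.include) == 0
	for _, re := range f.include {
		if re.MatchString(name) {
			included = true
			break
		}
	}
	if !included {
		return false
	}
	for _, re := range f.exclude {
		if re.MatchString(name) {
			return false
		}
	}
	return true
}

func main() {
	f := nameFilter{include: []*regexp.Regexp{regexp.MustCompile("^test")}}
	fmt.Println(f.matches("test1"), f.matches("systemd")) // true false
}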

Example Metrics:
(screenshots of example process metrics)

In most cases "command_line" is not empty in the screenshots above, but it's missing for a few system processes on Windows.

Benchmarks:
(benchmark results screenshot)

codecov bot commented May 29, 2020

Codecov Report

Merging #1047 into master will increase coverage by 0.24%.
The diff coverage is 95.18%.


@@            Coverage Diff             @@
##           master    #1047      +/-   ##
==========================================
+ Coverage   86.34%   86.59%   +0.24%     
==========================================
  Files         198      204       +6     
  Lines       14159    14453     +294     
==========================================
+ Hits        12226    12515     +289     
- Misses       1477     1481       +4     
- Partials      456      457       +1     
Impacted Files Coverage Δ
...ceiver/hostmetricsreceiver/hostmetrics_receiver.go 85.24% <86.11%> (+11.71%) ⬆️
...eceiver/internal/scraper/processscraper/process.go 88.88% <88.88%> (ø)
...internal/scraper/processscraper/process_scraper.go 98.47% <98.47%> (ø)
receiver/hostmetricsreceiver/factory.go 87.09% <100.00%> (+13.51%) ⬆️
...ceiver/internal/scraper/cpuscraper/cpu_metadata.go 100.00% <100.00%> (ø)
...eceiver/internal/scraper/cpuscraper/cpu_scraper.go 88.23% <100.00%> (ø)
...r/internal/scraper/cpuscraper/cpu_scraper_linux.go 100.00% <100.00%> (ø)
...eceiver/internal/scraper/processscraper/factory.go 100.00% <100.00%> (ø)
...nternal/scraper/processscraper/process_metadata.go 100.00% <100.00%> (ø)
...al/scraper/processscraper/process_scraper_linux.go 100.00% <100.00%> (ø)
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update caadbbc...267247e.

@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from 90788b0 to 09f2789 on May 29, 2020 03:35
@@ -42,7 +42,7 @@ require (
github.com/prometheus/prometheus v1.8.2-0.20190924101040-52e0504f83ea
github.com/rs/cors v1.6.0
github.com/securego/gosec v0.0.0-20200316084457-7da9f46445fd
-github.com/shirou/gopsutil v2.20.4+incompatible
+github.com/shirou/gopsutil v0.0.0-20200517204708-c89193f22d93 // c89193f22d9359848988f32aee972122bb2abdc2
james-bebbington (Member Author):

gopsutil needed to be updated to include this relatively recent PR that improves the performance of reading process information on Windows: shirou/gopsutil#862

metrics := pdata.NewMetricSlice()

processes, err := s.getProcesses()
if len(processes) == 0 {
Member:

It's more idiomatic to check for if err != nil first, and result should be empty when err is not nil.

if err != nil {
  return nil, err
} else if len(processes) == 0 {
  return metrics, nil
}

james-bebbington (Member Author):

I've followed that approach where possible, but it's not so straightforward in a case like this where partial failure is possible (failed to get some processes, succeeded in getting others), and we want to propagate any partial errors up to be reported on in aggregate.

For now I've added a comment to the function that returns partial results to document this behaviour, but I'm happy to explore other options if this is strongly unidiomatic. I guess another option would be to store partial errors in a different type (i.e. not error)?
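To illustrate the partial-failure pattern being discussed, here is a minimal, self-contained sketch; it is not the PR's actual code, readName is a hypothetical helper, and errors.Join from the standard library (Go 1.20+) stands in for whatever error aggregation the collector itself uses.

package main

import (
	"errors"
	"fmt"
)

// scrapeProcesses returns whatever it could scrape plus an aggregated error
// describing the processes it could not read.
func scrapeProcesses(pids []int32) ([]string, error) {
	var metrics []string
	var errs []error
	for _, pid := range pids {
		name, err := readName(pid)
		if err != nil {
			errs = append(errs, fmt.Errorf("pid %d: %w", pid, err))
			continue // keep going: a partial failure should not drop the other processes
		}
		metrics = append(metrics, "process.cpu.time for "+name)
	}
	return metrics, errors.Join(errs...) // nil when errs is empty
}

// readName simulates a per-process read that fails for some processes.
func readName(pid int32) (string, error) {
	if pid%2 == 0 {
		return "", errors.New("access denied") // stand-in for a permissions failure
	}
	return fmt.Sprintf("proc-%d", pid), nil
}

func main() {
	m, err := scrapeProcesses([]int32{1, 2, 3, 4})
	fmt.Printf("scraped %d metrics; partial error: %v\n", len(m), err)
}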

Member:

Probably returning a "result" struct rather than a tuple would help make it more idiomatic.
i.e. in that case the error is part of the function result rather than an "exception".
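A minimal sketch of what that result-struct shape could look like; all names here are illustrative, not from the PR.

package main

import "fmt"

// scrapeResult makes partial errors part of the function's return value
// rather than treating them as an exceptional error path.
type scrapeResult struct {
	Metrics []string // stand-in for per-process metric data
	Errors  []error
}

func scrapeAll(pids []int32) scrapeResult {
	var res scrapeResult
	for _, pid := range pids {
		if pid%2 == 0 { // stand-in for a process that cannot be read
			res.Errors = append(res.Errors, fmt.Errorf("pid %d: access denied", pid))
			continue
		}
		res.Metrics = append(res.Metrics, fmt.Sprintf("metrics for pid %d", pid))
	}
	return res
}

func main() {
	res := scrapeAll([]int32{1, 2, 3, 4})
	fmt.Printf("%d metrics scraped, %d errors\n", len(res.Metrics), len(res.Errors))
}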

james-bebbington (Member Author):

Alright - I might leave that for a separate PR though, as this style is used throughout the existing scrapers at the moment (in particular, in the result of the ScrapeMetrics function).

@james-bebbington force-pushed the hostmetrics-process branch 3 times, most recently from 482bff3 to b7a19b0 on May 29, 2020 05:40
@tigrannajaryan added this to In progress in Collector via automation May 29, 2020
@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from 8692330 to 38d16e8 on May 31, 2020 23:27
@james-bebbington force-pushed the hostmetrics-process branch 3 times, most recently from 3918fa2 to 06c5fa1 on June 4, 2020 03:59
@@ -12,6 +12,7 @@ receivers:
disk:
filesystem:
network:
virtualmemory:
Member:

This is process... Why virtualmem? Forgot to add process?

james-bebbington (Member Author):

I forgot to add virtualmemory before (thus adding it now 🤦‍♂️).

I intentionally did not add process to the default/example config, as process metrics are more expensive to compute and may often not be needed.

Resolved (outdated) review comment on receiver/hostmetricsreceiver/internal/testutils.go
@jrcamp (Contributor) left a comment:

How many processes are running on the machine you're benchmarking with? Any idea what the max we may see in production? There are optimizations we could do but I think we need some realistic workloads running to validate against.

Comment on lines +15 to +18
process:
  include:
    names: ["test1", "test2"]
    match_type: "regexp"
Contributor:

Nothing to do for this PR, but I've made a proposal (#1081) for improving filtering that would make these configs a little cleaner, I think.

Comment on lines +72 to +75
var metricDiskBytesDescriptor = createMetricDiskBytesDescriptor()

func createMetricDiskBytesDescriptor() pdata.MetricDescriptor {
Contributor:

Not worth changing just here, given it's been done like this throughout and that this will be generated someday soon-ish, but you could just use a single variable:

var metricDiskBytesDescriptor = func() pdata.MetricDescriptor {
...}()
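For reference, the single-variable pattern being suggested looks like this in general Go; a placeholder struct is used here instead of the real pdata.MetricDescriptor API, and the metric name/unit values are made up.

package main

import "fmt"

// descriptor is a placeholder for pdata.MetricDescriptor, used only to
// illustrate the initialization pattern.
type descriptor struct {
	name, unit string
}

// The create function is folded into the variable declaration via an
// immediately invoked function literal, so there is no separate
// createMetricDiskBytesDescriptor symbol to maintain.
var metricDiskBytesDescriptor = func() descriptor {
	return descriptor{name: "process/disk/bytes", unit: "bytes"}
}()

func main() {
	fmt.Println(metricDiskBytesDescriptor)
}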

james-bebbington (Member Author):

> How many processes are running on the machine you're benchmarking with? Any idea what the max we may see in production? There are optimizations we could do but I think we need some realistic workloads running to validate against.

The benchmark above was from running on my local machine with ~300 processes. I'll get back to you with some estimates on what workloads we expect.

@james-bebbington force-pushed the hostmetrics-process branch 8 times, most recently from 9ae571f to c114078 on June 22, 2020 03:57
@james-bebbington changed the title [WIP] Host Metrics process scraper → Host Metrics process scraper Jun 22, 2020
@james-bebbington marked this pull request as ready for review June 22, 2020 04:00
james-bebbington (Member Author):

I've now updated this to create a separate resource (and associated metrics) for each process. This did end up requiring some fairly significant changes: the existing Scraper API only supported returning a MetricSlice, so I created a separate API to use for the Process Scraper that returns a ResourceMetricsSlice.

Apologies for this PR getting a bit large - I've separated the refactoring around how resources are handled into a separate commit, so for those who've already taken a look, you can just review the final commit.

FYI @jrcamp @nilebox @bogdandrutu
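Roughly, the split between the two APIs can be pictured like this. This is a hedged sketch with stand-in types: the signatures paraphrase the diff excerpts further down and are not the exact collector API.

package main

import (
	"context"
	"fmt"
)

// metricSlice and resourceMetrics are stand-ins for pdata.MetricSlice and
// pdata.ResourceMetricsSlice.
type metricSlice []string

type resourceMetrics struct {
	resource string      // e.g. a process identified by pid/name
	metrics  metricSlice // metrics attached to that resource
}

// Scraper mirrors the existing API: metrics for one implicit resource (the host).
type Scraper interface {
	ScrapeMetrics(ctx context.Context) (metricSlice, error)
}

// ResourceScraper mirrors the new API used by the process scraper, which must
// return metrics grouped under a separate resource per process.
type ResourceScraper interface {
	ScrapeMetrics(ctx context.Context) ([]resourceMetrics, error)
}

type fakeProcessScraper struct{}

func (fakeProcessScraper) ScrapeMetrics(ctx context.Context) ([]resourceMetrics, error) {
	return []resourceMetrics{
		{resource: "process pid=1 name=init", metrics: metricSlice{"process/cpu/time"}},
	}, nil
}

func main() {
	var rs ResourceScraper = fakeProcessScraper{}
	out, err := rs.ScrapeMetrics(context.Background())
	fmt.Println(out, err)
}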

@@ -118,6 +123,18 @@ func (f *Factory) CustomUnmarshaler() component.CustomUnmarshaler {
}
}

func (f *Factory) getScraperFactory(key string) (internal.BaseFactory, bool) {
if factory, ok := f.scraperFactories[key]; ok {
return factory, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here means that we must never have duplicate keys in scraperFactories and resourceScraperFactories. Do you have a validation that would throw an error if such conflict occurs?
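A sketch of the kind of validation being asked about might look like this; the function name and the simplified map value types are hypothetical, not code from the PR.

package main

import "fmt"

// validateFactoryKeys rejects configurations where the same key is registered
// in both the scraper and resource-scraper factory maps.
func validateFactoryKeys(scraperFactories, resourceScraperFactories map[string]struct{}) error {
	for key := range scraperFactories {
		if _, ok := resourceScraperFactories[key]; ok {
			return fmt.Errorf("duplicate scraper factory key %q", key)
		}
	}
	return nil
}

func main() {
	a := map[string]struct{}{"cpu": {}, "disk": {}}
	b := map[string]struct{}{"process": {}, "cpu": {}}
	fmt.Println(validateFactoryKeys(a, b)) // duplicate scraper factory key "cpu"
}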

Resolved (outdated) review comments on:
receiver/hostmetricsreceiver/hostmetrics_receiver.go (×5)
receiver/hostmetricsreceiver/internal/scraper.go
}

// Scraper gathers metrics from the host machine.
type Scraper interface {
Member:

Consider giving a more specific name to this interface; otherwise ResourceScraper sounds like a subcategory of Scraper, but it's an independent entity AFAICT.
e.g. HostScraper

james-bebbington (Member Author):

Yeah, I agree this name is definitely a bit confusing. I'm not sure what to call it though, as I'd prefer to leave "Host" out of the name: it's not an explicit requirement of this interface that metrics are scraped from the host, nor is it a requirement of the other interface that they are not. Would something like MetricsSliceScraper & ResourceMetricsSliceScraper be better?

I'll create a separate PR to do this refactor as it will impact all the other scrapers.

type ResourceScraper interface {
BaseScraper

// ScrapeMetrics returns relevant scraped metrics per resource.
@nilebox (Member) commented Jun 22, 2020:

No suggestion, just trying to understand the problem better, since this seems to be the key change here.

AFAICT the main difference between Scraper and ResourceScraper is that Scraper returns metrics for a single fixed resource (host), while ResourceScraper dynamically gathers a list of multiple processes, generates a resource per process and returns metrics for each of them.

So in that sense, Scraper is just a special case of ResourceScraper with exactly one (implicit) resource.

Follow-up question: why don't you need to create a separate resource per CPU in cpuscraper or per disk in diskscraper? What makes the processes unique?

james-bebbington (Member Author):

1. For processes, the cardinality can be very high (1000s), which may cause issues when storing the metric data. If grouped by resource, the data is sharded against different root resources (IIUC).
2. We may want to create lots of metrics against the same process (at the moment it's just three, but this will presumably be extended), and it's easier to group metrics against the same resource than to look up similar label values.

While they wouldn't be impacted by those issues to the same degree, having a resource per cpu, disk, or network interface could make sense.

@bogdandrutu can probably explain the reasoning in more detail.

Member:

Thanks, that makes perfect sense.

@nilebox (Member) left a comment:

LGTM, just a couple of code style nits / checks

@james-bebbington force-pushed the hostmetrics-process branch 2 times, most recently from caebde5 to 107f573 on June 25, 2020 01:24
Collector automation moved this from In progress to Reviewer approved Jun 30, 2020
@bogdandrutu merged commit af5bdb1 into open-telemetry:master Jun 30, 2020
Collector automation moved this from Reviewer approved to Done Jun 30, 2020
@pavolloffay mentioned this pull request Jun 30, 2020
@james-bebbington mentioned this pull request Jun 30, 2020
wyTrivail pushed a commit to mxiamxia/opentelemetry-collector that referenced this pull request Jul 13, 2020
* Initial commit of host metrics process scraper that scrapes cpu, memory & disk metrics

* Refactored the host metrics process scraper getProcessHandles func to return an interface to avoid having to convert the type of each process struct returned by gopsutil

* Changed process/cpu/usage metric to double type

* Refactor process metrics to store process information as a resource rather than labels
MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021
* Add support for filtering label sets

* Restore test

* Fix Value() bug

* Pass kv.KeyValue

* Apply suggestions from code review

Thank you @MrAlias.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Labels: none
Projects: Collector (Done)
4 participants