-
Notifications
You must be signed in to change notification settings - Fork 489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the docs of pyroscope.scrape #5847
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments, otherwise LGTM
#### `scrape_interval` argument | ||
|
||
If `scrape_interval` is short: | ||
* Advantages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One advantage is also being able to catch events such as OOM when they happen quickly. So maybe something like "Ability to capture rapid changes in profiles".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how this is true if delta = false
, because then profiles would work similarly to Prometheus metrics.
However, if delta = true
, wouldn't a lower scrape frequency lead to less catched events?
E.g. if scrape_frequency = "15s"
, then each observation will last 14 seconds. For every 60 seconds we will lose 4 seconds of observations - this is the time between scrapes. On the other hand, for scrape_frequency = "60s"
we will only lose 1 second worth of samples for every minute.
It'd be great if someone from the Pyroscope team confirms this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is when a bug leads to e.g. a rapid growth in memory and within 20 seconds the instance OOMs. This is much harder to catch with 1m scrape interval, as you'd need to be lucky to finish the profile just a few seconds before OOM event happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that when analysing OOMs, it's better to have a short scrape frequency. However, in case the application is not crashing and duration = true
, I think that longer scrapes will actually result in more samples and will better detect rapid changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you refer to in duration = false|true
? I don't see such setting on pyroscope.scrape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant delta = true
. I'll edit my comment above.
`job_name` | `string` | The job name to override the job label with. | component name | no | ||
`params` | `map(list(string))` | A set of query parameters with which the target is scraped. | | no | ||
`scrape_interval` | `duration` | How frequently to scrape the targets of this scrape config. | `"15s"` | no | ||
`scrape_timeout` | `duration` | The timeout for scraping targets of this config. | `"18s"` | no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's best to add it right here: "It must be larger than scrape_interval
".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure about this one. We usually document constraints under the table... I'll let the comment sit for now in case someone else wants to opine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @cyriltovena curious your thoughts here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a bug to require scrape_timeout longer that scrape_interval.
It may make sense for cpu, but does not makes sense for other (not delta profiles) requests which should not last long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Shouldn't the docs therefore explain this - that CPU profiles may need different timeout & interval than the remaining profile types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I will just state "must be larger than scrape_interval
" in the table as @thampiotr suggested. We usually state such things under the table, but I think it improves readability in this case and it's worth making an exception.
Shouldn't the docs therefore explain this - that CPU profiles may need different timeout & interval than the remaining profile types?
I think for now we can just explain how the component behaves, and we can improve with such advice later on. I think what @korniltsev states applies not just for CPU, but any "delta" profile.
Btw I think the reason for timing out delta profiles before the next query is started is that there can't be more than one active pprof for a certain type at a time. However, even if we have a very slow non-delta pprof query, I suppose we still won't be able to send a new query until the old one has finished. This is just a guess though.
@@ -142,7 +230,7 @@ an `oauth2` block. | |||
The `profiling_config` block configures the profiling settings when scraping | |||
targets. | |||
|
|||
The block contains the following attributes: | |||
The following arguments are supported: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is path_prefix
even used? I can't find code which uses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a bug yes. Worth confirming it doesn't work and open an issue.
`scrape_timeout` | `duration` | The timeout for scraping targets of this config. | `"18s"` | no | ||
`scheme` | `string` | The URL scheme with which to fetch metrics from targets. | `"http"` | no | ||
`bearer_token` | `secret` | Bearer token to authenticate with. | | no | ||
`bearer_token_file` | `string` | File containing a bearer token to authenticate with. | | no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such _file
arguments should be deprecated. Normally we recommend using components like local.file
, remote.vault
, remote.s3
for such use cases.
@@ -322,6 +404,19 @@ If the agent is _not_ running in clustered mode, this block is a no-op. | |||
|
|||
[using clustering]: {{< relref "../../concepts/clustering.md" >}} | |||
|
|||
## Common configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clayton-cornell - this is another use case for a "Common configuration" section. Yesterday we discussed it in the context of documenting shared blocks. Here I use it to document shared arguments.
@rfratto - you mentioned in a previous PR that you're concerned with too much repetition, which makes users think there's more to learn than there really is. I agree, and I hope this section here helps with that.
|
||
### `delta` argument | ||
|
||
When the `delta` argument is `false`, the [pprof][] HTTP query will be instantaneous. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does a delta
of false
actually mean, e.g. in the context of CPU profiles? Does it lead to visible differences in the Pyroscope UI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Delta" to me means that the difference between the first sample and the last sample.
Is that the case here? I was under the impression that all samples for the duration of seconds=14 will be taken into account.
|
||
For example, the `job_name` of `pyroscope.scrape "local" { ... }` will be `"pyroscope.scrape/local"`. | ||
|
||
#### `targets` argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -288,7 +370,7 @@ profile.custom "PROFILE_TYPE" { | |||
Multiple `profile.custom` blocks can be specified. Labels assigned to | |||
`profile.custom` blocks must be unique across the component. | |||
|
|||
The `profile.custom` block accepts the following arguments: | |||
The following arguments are supported: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be nice to have an example with profile.custom
in the Examples section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clayton-cornell - The changes in the PR make this page more in line with our Writing documentation for Flow components guide, but I have taken a few liberties which are noted in comments.
@knylander-grafana - there is a Go (pull mode) page on the Pyroscope website which contains lots of duplicate information. Should we remove the duplications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We created that page in response to feedback that this page was too verbose and hard to get started quickly. We will definitely keep the other page for now and I'll just manage the duplicate work of making sure changes made here make it over there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could look into converting Go (pull mode) and eBPF into new task-based docs under Agent's Getting Started section called "Profiling Go Applications" and "Profiling with eBPF".
07edd31
to
71a0a29
Compare
`targets`. The scraped performance profiles are forwarded to the list of receivers passed in | ||
`forward_to`. | ||
`pyroscope.scrape` collects [pprof] performance profiles for a given set of HTTP `targets`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the picture shows the interactions between different components, I'd prefer not to place it in the reference docs for pyroscope.scrape
. It'd be great to have a new "Profiling Go Applications" task-based doc in the Getting Started section. That doc can walk people through installing the Agent for profiling and we could put the picture there.
I think having such an architecture picture would be great for most task-based docs, and it would be even better if they are all consistent.
cc @clayton-cornell
This PR has not had any activity in the past 30 days, so the |
2b066e1
to
7970f78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ptodev Does this make this one ready for merge? There's a lot of unresolved comments throughout the doc. |
@clayton-cornell I am happy to merge it. It is a good improvement over what we have atm, and most of the discussions are closed now. The only remaining thing I'm curious about is how delta vs-non-delta profiles are displayed in the UI, but we could ask about that some other time. Would you mind if I merge it please? I'm not sure if you want to review further. |
Co-authored-by: Piotr <17101802+thampiotr@users.noreply.github.com>
Co-authored-by: Ryan Perry <Rperry2174@gmail.com>
2b90bd4
to
89b7cba
Compare
89b7cba
to
d935d1b
Compare
@ptodev Lets get this one published. We can open a new PR when needed ;-) |
* Clarify some aspects of pyroscope.scrape --------- Co-authored-by: Piotr <17101802+thampiotr@users.noreply.github.com> Co-authored-by: Ryan Perry <Rperry2174@gmail.com>
Adding extra information which me and @thampiotr recently learned while tweaking Agents running
pyroscope.scrape
.The PR also cleans up the docs in general.
I am not sure who should review this PR from the Docs squad. IIUC @clayton-cornell is normally helping with Agent docs whereas @knylander-grafana usually helps with Pyroscope docs :)