
"failed to fetch any source profiles" #36

Closed
bboreham opened this issue Nov 28, 2019 · 18 comments

@bboreham

Sometimes I click on a blob in the timeline and it opens up a page saying "failed to fetch any source profiles".

I have seen three different log messages that all seem to be associated with (different variants of?) this issue:

2019/11/28 18:03:48 : decompressing profile: gzip: invalid header
2019/11/28 18:04:46 : decompressing profile: unexpected EOF
2019/11/28 18:07:49 : parsing profile: unrecognized profile format

I guess it's always possible that the scrape fails, but it would be better if the UI didn't show a blob at that time.

brancz commented Dec 3, 2019

Yes, I've noticed this as well. I haven't fully figured out why it happens, as we do parse profiles before appending them. Maybe we need to run further validations on the profile rather than just parsing it; maybe an empty profile counts as a valid one, I'm not sure.
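A rough sketch of the kind of extra validation I mean (hypothetical, not the current conprof code path; it only relies on google/pprof's profile package):

```go
package validate

import (
	"bytes"
	"errors"

	"github.com/google/pprof/profile"
)

// validateBeforeAppend parses a scraped profile and runs extra checks before
// it would be appended to storage; parsing alone may accept profiles that we
// later fail to render.
func validateBeforeAppend(raw []byte) (*profile.Profile, error) {
	p, err := profile.Parse(bytes.NewReader(raw)) // accepts gzipped or raw protobuf input
	if err != nil {
		return nil, err
	}
	// CheckValid catches structurally broken profiles that still parse.
	if err := p.CheckValid(); err != nil {
		return nil, err
	}
	// Whether an empty profile should be stored at all is the open question above.
	if len(p.Sample) == 0 {
		return nil, errors.New("profile contains no samples")
	}
	return p, nil
}
```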

brancz commented Dec 4, 2019

I just double checked, and honestly I don't understand how those samples get persisted, as parsing the profile should encounter the exact same issue judging by: https://github.com/google/pprof/blob/27840fff0d09770c422884093a210ac5ce453ea6/profile/profile.go#L167-L177
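For reference, a paraphrased sketch of the detection that linked code performs (not the actual pprof source); the three log messages from the original report correspond to its stages:

```go
package parse

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"

	"github.com/google/pprof/profile"
)

// parseProfile mirrors, in spirit, the staged format detection in pprof.
func parseProfile(data []byte) (*profile.Profile, error) {
	if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b { // gzip magic bytes
		zr, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			// e.g. "decompressing profile: gzip: invalid header"
			return nil, fmt.Errorf("decompressing profile: %v", err)
		}
		if data, err = io.ReadAll(zr); err != nil {
			// e.g. "decompressing profile: unexpected EOF"
			return nil, fmt.Errorf("decompressing profile: %v", err)
		}
	}
	p, err := profile.ParseUncompressed(data)
	if err != nil {
		// e.g. "parsing profile: unrecognized profile format"
		return nil, fmt.Errorf("parsing profile: %v", err)
	}
	return p, nil
}
```

So if corrupted bytes were appended to storage, the same staged errors would surface again at read time, which matches the log lines reported above.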

@josharian

We hit this as well. I think the stored data is getting corrupted: I've seen a particular profile go from readable (because it rendered nicely) to "failed to fetch any source profiles". A large proportion of our samples right now are failing...

brancz commented Nov 14, 2020

Thanks for reporting! I believe since the last storage rebase this started happening for data older than 8h, potentially even 2h.

I’m gonna try to build some data integrity tooling to test this over time.

inge4pres commented Nov 18, 2020

Hey, I'm using the latest version too and I'm facing this same issue for data that was just collected. How can I debug this; is there an option to enable verbose logging?
Can you please also point me to a document or the part of the code that handles the scraping? I'm not sure whether I need to configure the targets with only the HTTP host:port or whether I need to add the /debug URI as well...
Thanks for this beautiful tool 👍🏼
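For context on the /debug question, a Go target typically exposes the standard profiling endpoints under /debug/pprof/ just by importing net/http/pprof; a minimal sketch of such a target (generic, not tied to any particular setup here):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Exposes /debug/pprof/heap, /debug/pprof/goroutine, /debug/pprof/profile, etc.
	// The port is arbitrary here.
	log.Fatal(http.ListenAndServe(":10902", nil))
}
```

Assuming conprof follows its Prometheus-style scrape configuration, the target would then be listed as just host:port, with the per-profile /debug/pprof/* paths coming from the scrape config defaults rather than from the target string (worth confirming against the repo's example config).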

Additional context: I see this log line when issuing the HTTP request to visualize a single trace in pprof-ui

2020/11/18 16:54:30 : parsing profile: unrecognized profile format

inge4pres commented Nov 18, 2020

One more hint on this: when switching to version master-2020-11-04-ce50636, this log line appears:

2020/11/18 17:09:35 : decompressing profile: gzip: invalid header

But after removing the old tsdb storage and recreating it from scratch, that version works for the first data point of every input (heap, goroutines, etc.) but not for any subsequent data points 🤔
Could it be that when a profile collection times out, its format is somehow stored corrupted in the time series?

brancz commented Nov 26, 2020

My apologies for taking so long to get back to this. Re-reading your #36 (comment), @inge4pres, are you saying you are trying to view trace-type profiles? If so, I think this is our mistake, as we don't actually support viewing trace profiles; we never ended up finishing #56 / #61.

I think we should probably disable scraping trace profiles by default until we have this support implemented, to avoid confusing situations like this (if this is the case).


Regarding your last comment, do you have a sequence of steps to reproduce this? I can't reproduce it by just running a conprof instance from scratch; however, if I run conprof, collect some data, shut it down, and then start it back up, I can indeed reproduce some "failed to fetch any source profiles" errors. I'll investigate why this is happening.

@inge4pres

Hi @brancz, the steps you describe are exactly the ones needed to reproduce it; for some reason it's as if the tsdb files become unreadable?

brancz commented Nov 26, 2020

I have two hunches right now: either the chunks mmap'ed to disk are somehow corrupted, or they get corrupted in some way when WAL replay happens. The latter would be better for us, as it would mean the storage itself isn't corrupted, just the way we load it. Once I have a better idea I'll report back here! :)

Thank you so much for reporting!

brancz commented Nov 27, 2020

Quick update: it appears I have found part of the problem; there was an unsafe loading of data in the WAL. So the good news is that there is no corruption of the data on disk. The bad news is that for some reason new appends still yield this error, yet after a restart those appends can be viewed without issues. I'm continuing to investigate.
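To illustrate the general class of bug (a hypothetical sketch, not the literal conprof/db code): a WAL reader typically reuses its record buffer between iterations, so any record that is kept past the next read has to be copied first.

```go
package wal

// walReader is a hypothetical stand-in for a WAL record iterator whose
// Record() slice is only valid until the next call to Next().
type walReader interface {
	Next() bool
	Record() []byte
}

// replay collects records safely by copying them out of the reused buffer.
func replay(r walReader) [][]byte {
	var kept [][]byte
	for r.Next() {
		rec := r.Record()

		// Unsafe variant: kept = append(kept, rec) would alias the reader's
		// internal buffer, so later records would silently clobber earlier ones.
		buf := make([]byte, len(rec))
		copy(buf, rec)
		kept = append(kept, buf)
	}
	return kept
}
```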

I'll open a PR as soon as I fix the remaining issue.

@inge4pres

If you can give a pointer to a piece of code and/or a test to debug, I'd be happy to fork and see if I can help 😄

brancz commented Nov 27, 2020

I just opened a PR to fix most of the problems I've found so far in the tsdb (conprof/db#2), but the latter issue is still present. To debug, I clone both repos into the same directory and then use a replace directive in the conprof repo so that it uses the local tsdb.
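Concretely, that means a replace directive along these lines in conprof's go.mod (a sketch; the module path and relative directory are assumptions based on the layout described above):

```
// In conprof's go.mod, while debugging against a local checkout of conprof/db
// cloned next to the conprof repo:
replace github.com/conprof/db => ../db
```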

brancz commented Nov 27, 2020

I just wrote an extensive test that seems to indicate that the database is functioning just fine. I'll continue investigating further up the stack.

brancz commented Nov 30, 2020

I tried a number of things, and definitely found a couple of small problems, but none of those ended up fixing this symptom. For what it's worth I was finally able to write a test to reproduce this.

brancz commented Nov 30, 2020

I think I've fixed the last "failed to fetch" errors with #112.

There is still one remaining problem though: after a restart, previous series don't seem to be continued for some reason. At least now we're in a state where all data is viewable and queryable (with more tests to prevent this from happening again in the future)! 🎉

(unfortunately we're being heavily rate-limited by Docker in our CI environment, so it may take a couple of hours until images are available; I'll look into moving to GitHub Actions to prevent this)

edit: looks like at least the amd64 image managed to push

brancz commented Dec 1, 2020

I finally managed to find that last bug, and all new e2e tests are passing with this patch: #113

Thank you so much everyone for bearing with me!

brancz commented Dec 1, 2020

With #112 and #113 merged I think we can close this. Thank you everyone for reporting, and please open new issues if you find anything else or if you think this isn't resolved with the latest versions!

brancz closed this as completed Dec 1, 2020

@inge4pres

Thanks! Will test asap and send some feedback

brancz added a commit that referenced this issue Oct 5, 2021
pkg/storage: Create stripeSeries to improve BenchmarkHeadQuerier_Select