
"failed to fetch any source profiles" #36

Closed
bboreham opened this issue Nov 28, 2019 · 18 comments

@bboreham

Sometimes I click on a blob in the timeline and it opens up a page saying "failed to fetch any source profiles".

I have seen three different log messages that all seem to be associated with (different variants of?) this issue:

2019/11/28 18:03:48 : decompressing profile: gzip: invalid header
2019/11/28 18:04:46 : decompressing profile: unexpected EOF
2019/11/28 18:07:49 : parsing profile: unrecognized profile format

I guess it's always possible that the scrape fails, but it would be better if the UI didn't show a blob at that time.

brancz commented Dec 3, 2019

Yes, I've noticed this as well. I haven't fully figured out why it happens, as we do parse profiles before appending them. Maybe we need to run further validations on the profile rather than just parsing it; maybe an empty profile counts as a valid one, I'm not sure.
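A rough sketch of the kind of extra validation I mean (hypothetical, not the current conprof code path; it only relies on google/pprof's profile package):

```go
package validate

import (
	"bytes"
	"errors"

	"github.com/google/pprof/profile"
)

// validateBeforeAppend parses a scraped profile and runs extra checks before
// it would be appended to storage; parsing alone may accept profiles that we
// later fail to render.
func validateBeforeAppend(raw []byte) (*profile.Profile, error) {
	p, err := profile.Parse(bytes.NewReader(raw)) // accepts gzipped or raw protobuf input
	if err != nil {
		return nil, err
	}
	// CheckValid catches structurally broken profiles that still parse.
	if err := p.CheckValid(); err != nil {
		return nil, err
	}
	// Whether an empty profile should be stored at all is the open question above.
	if len(p.Sample) == 0 {
		return nil, errors.New("profile contains no samples")
	}
	return p, nil
}
```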

brancz commented Dec 4, 2019

I just double checked, and honestly I don't understand how those samples get persisted, as parsing the profile should encounter the exact same issue judging by: https://github.com/google/pprof/blob/27840fff0d09770c422884093a210ac5ce453ea6/profile/profile.go#L167-L177
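For reference, a paraphrased sketch of the detection that linked code performs (not the actual pprof source); the three log messages from the original report correspond to its stages:

```go
package parse

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"

	"github.com/google/pprof/profile"
)

// parseProfile mirrors, in spirit, the staged format detection in pprof.
func parseProfile(data []byte) (*profile.Profile, error) {
	if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b { // gzip magic bytes
		zr, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			// e.g. "decompressing profile: gzip: invalid header"
			return nil, fmt.Errorf("decompressing profile: %v", err)
		}
		if data, err = io.ReadAll(zr); err != nil {
			// e.g. "decompressing profile: unexpected EOF"
			return nil, fmt.Errorf("decompressing profile: %v", err)
		}
	}
	p, err := profile.ParseUncompressed(data)
	if err != nil {
		// e.g. "parsing profile: unrecognized profile format"
		return nil, fmt.Errorf("parsing profile: %v", err)
	}
	return p, nil
}
```

So if corrupted bytes were appended to storage, the same staged errors would surface again at read time, which matches the log lines reported above.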

@josharian

We hit this as well. I think the stored data is getting corrupted: I've seen a particular profile go from readable (because it rendered nicely) to "failed to fetch any source profiles". A large proportion of our samples right now are failing...

brancz commented Nov 14, 2020

Thanks for reporting! I believe since the last storage rebase this started happening for data older than 8h, potentially even 2h.

I’m gonna try to build some data integrity tooling to test this over time.

inge4pres commented Nov 18, 2020

Hey, I'm using the latest version too and I'm facing this same issue for data that was just collected. How can I debug this; is there an option to enable verbose logging?
Can you please also point me to a document or the part of the code that handles the scraping? I'm not sure whether I need to configure the targets with only the HTTP host:port or whether I need to add the /debug URI as well...
Thanks for this beautiful tool 👍🏼
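For context on the /debug question, a Go target typically exposes the standard profiling endpoints under /debug/pprof/ just by importing net/http/pprof; a minimal sketch of such a target (generic, not tied to any particular setup here):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Exposes /debug/pprof/heap, /debug/pprof/goroutine, /debug/pprof/profile, etc.
	// The port is arbitrary here.
	log.Fatal(http.ListenAndServe(":10902", nil))
}
```

Assuming conprof follows its Prometheus-style scrape configuration, the target would then be listed as just host:port, with the per-profile /debug/pprof/* paths coming from the scrape config defaults rather than from the target string (worth confirming against the repo's example config).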

Additional context: I see this log line when issuing the HTTP request to visualize a single trace in pprof-ui

2020/11/18 16:54:30 : parsing profile: unrecognized profile format

inge4pres commented Nov 18, 2020

One more hint on this: when switching to version master-2020-11-04-ce50636, this log line appears:

2020/11/18 17:09:35 : decompressing profile: gzip: invalid header

But after removing the old tsdb storage and recreating it from scratch, that version works for the first data point of every input (heap, goroutines, etc.) but not for any subsequent data points 🤔
Could it be that when a profile collection times out, its format is somehow stored corrupted in the time series?

brancz commented Nov 26, 2020

My apologies for taking so long to get back to this. Re-reading your #36 (comment), @inge4pres, are you saying you are trying to view trace-type profiles? If so, I think this is our mistake, as we don't actually support viewing trace profiles; we never ended up finishing #56 / #61.

I think we should probably disable scraping trace profiles by default until we have this support implemented, to avoid confusing situations like this (if this is the case).


Regarding your last comment, do you have a sequence of steps to reproduce this? I can't reproduce it by just running a conprof instance from scratch; however, if I run conprof, collect some data, shut it down, and then start it back up, I can indeed reproduce some "failed to fetch any source profiles" errors. I'll investigate why this is happening.

@inge4pres

Hi @brancz, the steps you describe are exactly the ones needed to reproduce it; for some reason it's as if the tsdb files become unreadable?

brancz commented Nov 26, 2020

I have two hunches right now: either the chunks mmap'ed to disk are somehow corrupted, or they get corrupted in some way when WAL replay happens. The latter would be better for us, as it would mean the storage itself isn't corrupted, just the way we load it. Once I have a better idea I'll report back here! :)

Thank you so much for reporting!

brancz commented Nov 27, 2020

Quick update: it appears I have found part of the problem; there was an unsafe loading of data in the WAL. So the good news is that there is no corruption of the data on disk. The bad news is that for some reason new appends still yield this error, yet after a restart those appends can be viewed without issues. I'm continuing to investigate.
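To illustrate the general class of bug (a hypothetical sketch, not the literal conprof/db code): a WAL reader typically reuses its record buffer between iterations, so any record that is kept past the next read has to be copied first.

```go
package wal

// walReader is a hypothetical stand-in for a WAL record iterator whose
// Record() slice is only valid until the next call to Next().
type walReader interface {
	Next() bool
	Record() []byte
}

// replay collects records safely by copying them out of the reused buffer.
func replay(r walReader) [][]byte {
	var kept [][]byte
	for r.Next() {
		rec := r.Record()

		// Unsafe variant: kept = append(kept, rec) would alias the reader's
		// internal buffer, so later records would silently clobber earlier ones.
		buf := make([]byte, len(rec))
		copy(buf, rec)
		kept = append(kept, buf)
	}
	return kept
}
```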

I'll open a PR as soon as I fix the remaining issue.

@inge4pres

If you can give a pointer to a piece of code and/or a test to debug, I'd be happy to fork and see if I can help 😄

brancz commented Nov 27, 2020

I just opened a PR to fix most of the problems I've found so far in the tsdb (conprof/db#2), but the latter issue is still present. To debug, I clone both repos into the same directory and then use a replace directive in the conprof repo so that it uses the local tsdb.
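Concretely, that means a replace directive along these lines in conprof's go.mod (a sketch; the module path and relative directory are assumptions based on the layout described above):

```
// In conprof's go.mod, while debugging against a local checkout of conprof/db
// cloned next to the conprof repo:
replace github.com/conprof/db => ../db
```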

brancz commented Nov 27, 2020

I just wrote an extensive test that seems to indicate that the database is functioning just fine. I'll continue investigating further up the stack.

brancz commented Nov 30, 2020

I tried a number of things, and definitely found a couple of small problems, but none of those ended up fixing this symptom. For what it's worth I was finally able to write a test to reproduce this.

brancz commented Nov 30, 2020

I think I've fixed the last "failed to fetch" errors with #112.

There is still one remaining problem though: after a restart, previous series don't seem to be continued for some reason. At least now we're in a state where all data is viewable and queryable (with more tests to prevent this from happening again in the future)! 🎉

(unfortunately we're being heavily rate-limited by Docker in our CI environment, so it may take a couple of hours until images are available; I'll look into moving to GitHub Actions to prevent this)

edit: looks like at least the amd64 image managed to push

brancz commented Dec 1, 2020

I finally managed to find that last bug, and all new e2e tests are passing with this patch: #113

Thank you so much everyone for bearing with me!

brancz commented Dec 1, 2020

With #112 and #113 merged I think we can close this. Thank you everyone for reporting, and please open new issues if you find anything else or if you think this isn't resolved with the latest versions!

brancz closed this as completed Dec 1, 2020

@inge4pres

Thanks! Will test asap and send some feedback

brancz added a commit that referenced this issue Oct 5, 2021
pkg/storage: Create stripeSeries to improve BenchmarkHeadQuerier_Select