"failed to fetch any source profiles" #36
Comments
Yes, I've noticed this as well. I haven't fully figured out why it happens, as we do parse profiles before appending them. Maybe we need to run further validations on the profile rather than just parsing it, or maybe an empty profile is considered valid; I'm not sure.
I just double-checked, and honestly I don't understand how those samples get persisted, as parsing the profile should hit the exact same issue, judging by: https://github.com/google/pprof/blob/27840fff0d09770c422884093a210ac5ce453ea6/profile/profile.go#L167-L177
We hit this as well. I think the stored data is getting corrupted: I've seen a particular profile go from readable (it rendered nicely) to failing with this error.
Thanks for reporting! I believe this started happening after the last storage rebase, for data older than 8h, potentially even 2h. I'm going to try to build some data-integrity tooling to test this over time.
Hey, additional context: I see this log line when issuing the HTTP request to visualize a single trace in pprof-ui.
One more hint on this: when switching to that version with the existing tsdb storage, the problem persists.
But after removing the old tsdb storage and recreating it from scratch, that version works on the first data point of every input (heap, goroutines, etc.) but not on subsequent data points 🤔
My apologies for taking so long to get back to this. Re-reading your #36 (comment), @inge4pres, are you saying you are trying to view trace profiles? I think we should probably disable scraping trace profiles by default until we have support for viewing them implemented, to avoid confusing situations like this (if that is the case).

Regarding your last comment, do you have a sequence of steps to reproduce this? I can't reproduce it by just running a conprof instance from scratch; however, if I run conprof, collect some data, shut it down, and then start it back up, I can indeed reproduce some of these "failed to fetch" errors.
Hi @brancz, the steps you describe are exactly the ones needed to reproduce. For some reason it's as if the tsdb files become unreadable?
I have two hunches right now: either the chunks mmap'ed to disk are somehow corrupted, or they get corrupted in some way when WAL replay happens. The latter would be better for us, as it would mean the storage isn't corrupted, just the way we load it. Once I have a better idea I'll report back here! :) Thank you so much for reporting!
Quick update: it appears I have found part of the problem; there was an unsafe loading of data in the WAL. The good news is that there is no corruption of the data on disk. The bad news is that for some reason new appends still yield this error, but after a restart those appends can be viewed without issues. I'm continuing to investigate and will open a PR as soon as I fix the remaining issue.
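The "unsafe loading of data in the WAL" described above is consistent with a classic Go pitfall: retaining sub-slices of a shared read buffer, so each subsequent record overwrites the ones already "loaded". The sketch below is a hypothetical minimal reproduction of that bug class, not conprof's actual WAL code; the function names are made up for illustration.

```go
package main

import "fmt"

// replayUnsafe simulates a WAL replay that keeps references into a
// shared read buffer. Every returned record aliases buf, so each
// subsequent read silently overwrites the previously returned records.
func replayUnsafe(records [][]byte) [][]byte {
	buf := make([]byte, 8)
	var out [][]byte
	for _, rec := range records {
		n := copy(buf, rec)       // simulate reading the next record into buf
		out = append(out, buf[:n]) // BUG: aliases the shared buffer
	}
	return out
}

// replaySafe copies each record out of the read buffer before
// retaining it, so earlier records are preserved.
func replaySafe(records [][]byte) [][]byte {
	buf := make([]byte, 8)
	var out [][]byte
	for _, rec := range records {
		n := copy(buf, rec)
		out = append(out, append([]byte(nil), buf[:n]...)) // explicit copy
	}
	return out
}

func main() {
	wal := [][]byte{[]byte("chunk-1"), []byte("chunk-2")}
	bad := replayUnsafe(wal)
	good := replaySafe(wal)
	fmt.Printf("unsafe: %s %s\n", bad[0], bad[1])  // both show the last record
	fmt.Printf("safe:   %s %s\n", good[0], good[1])
}
```

This also matches the observed symptom: the bytes on disk are fine, but the in-memory view after replay is wrong until the data is loaded again correctly.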
If you can give a pointer to a piece of code and/or a test to debug, I'd be happy to fork and see if I can help 😄
I just opened a PR to fix most of the problems I've found so far in the tsdb (conprof/db#2), but the latter issue is still present. To debug, I clone both repos into the same parent directory and then use a replace directive in the conprof repo to point at the local tsdb.
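For anyone wanting to follow the same debugging setup, the replace directive looks roughly like this in conprof's go.mod, assuming both repos are cloned side by side (the module path shown is an assumption based on the conprof/db repo name):

```
// go.mod of the conprof checkout
replace github.com/conprof/db => ../db
```

With that in place, `go build` in the conprof checkout compiles against the local tsdb fork instead of the published module.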
I just wrote an extensive test that seems to indicate that the database is functioning just fine. I'll continue investigating further up the stack. |
I tried a number of things and definitely found a couple of small problems, but none of them ended up fixing this symptom. For what it's worth, I was finally able to write a test that reproduces it.
I think I found the last "failed to fetch" errors with #112. There is still one remaining problem, though: after a restart, previous series don't seem to be continued for some reason. At least we're now in a state where all data is viewable and queryable (with more tests to prevent this from happening again in the future)! 🎉

(Unfortunately we're being heavily rate-limited by Docker in our CI environment, so it may take a couple of hours until images are available; I'll look into moving to GitHub Actions to prevent this.)

edit: looks like at least the amd64 image managed to push
I finally managed to find that last bug, and all new e2e tests are passing with this patch: #113. Thank you so much, everyone, for bearing with me!
Thanks! Will test asap and send some feedback |
pkg/storage: Create stripeSeries to improve BenchmarkHeadQuerier_Select
Sometimes I click on a blob in the timeline and it opens up a page saying "failed to fetch any source profiles".
I have seen three different log messages that all seem to be associated with (different variants of?) this issue:
I guess it's always possible that the scrape fails, but it would be better if the UI didn't show a blob at that time.