-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/pkgsite: @master failed with "could not be found" for an hour #59464
Comments
@peterhellberg mentioned that he requested |
CC @jamalc
IIRC pkgsite does that in order to link to the source. |
But the existence of the module in the first place should follow the same logic as |
Premably, yes. I was just explaining why the meta tags were queried. I looked into this a bit. It looks like the problem is that the proxy timed out around 5am london, and this timeout is not correctly treated as a retryable error: There's a missing return: the status is set to a magic value that causes re-fetch, and then subsequently overwritten to NotFound. The reason @peterhellberg's request went through is that the pseudoversion did not exist in the version map. I'll fix. |
Amazing, thanks. |
FYI, happening again today, as of 09:13 UTC. |
Change https://go.dev/cl/482162 mentions this issue: |
Hmm, in this case it seems fetching the pseudoversion didn't work. I manually fetched: https://pkg.go.dev/cuelang.org/go@master. |
A missing early return was causing retryable fetch errors not to be retried by the frontend. I spent a little while trying to test this, and it proved too difficult and invasive. Unfortunately this may be a case where it is not worthwhile to significantly refactor in order to improve testability. Updates golang/go#59464 Change-Id: I0293a71d6be5046a4efe8b7ff490e285f912af3e Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/482162 Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com> TryBot-Result: kokoro <noreply+kokoro@google.com> Run-TryBot: Robert Findley <rfindley@google.com>
This fix is deployed, so I'm going to tentatively marked this as fixed. Please ping this issue if you are unable to fetch master in the future. |
Thanks @findleyr for digging in to investigate, and for the quick fix! |
Thanks for the fix, @findleyr! I went to look at the docs again this morning, and once again I see as of 08:40 UTC: I have requested the version a couple of times and refreshed after a few minutes, but I still see the same. It's true that I no longer see the exact same "could not be found" error page from before, so something has changed for sure, but the docs still aren't loading reliably. Thinking outloud for a second. If the problem is that our vanity import URL or VCS sometimes take a while to respond or even report errors, shouldn't pkgsite continue to serve the last |
Still not loading for me thirty minutes later, so it's not a fluke :) |
Thanks for following up. Failure mode looks different: the module is being re-attempted now, but failing. I'll investigate. |
Change https://go.dev/cl/484736 mentions this issue: |
Oof, there is a lot going on here. Jotting down notes for my own recollection; Dan and Paul: don't feel obligated to follow along. The previous CL fixed only one of approximately 4 "bugs". Here's the complete picture:
So what happened is that master was successfully fetched yesterday at 1pm ET, and then at 6:30pm ET someone was browsing documentation@master, which caused master to be re-fetched asynchronously, and this fetch failed. At this point, master was in a broken state, and (1) subsequently re-fetching manually timed-out because the proxy needed to re-evaluate master (again), and then (2) couldn't be re-attempted for some amount of time due to the GCP task queue deduplication. 😮💨 To fix this properly, I think we should make proxy timeouts retryable on the queue (as done in CL 484736), and avoid updating the version map for master when a fetch fails. It is better to serve a potentially stale master version (and continue trying to fetch it in the background) than to serve a 404. |
That is some investigation :) I wonder, are we really the only ones to have run into this set of bugs? I've used Also, is there any record about our HTML with meta tags or VCS repository taking too long to respond or failing? Because if those failure rates on our end are noticeable, then that's something we would want to fix as well. |
@mvdan it could just be that the cue repo is larger than the typical modules you are browsing, and so times out more frequently. I expect that most users just browse latest, not master, so it's not that surprising to me that you are the first to have this combination of (1) frequently browsing master, (2) large modules that timeout, and (3) willingness to file an issue. |
That makes sense, thanks again for the quick and detailed responses. |
When fetching e.g. the master branch, it is common for the proxy client to timeout as it waits for the branch to be resolved. This should be a retryable error, as otherwise we end up in a state where an asynchronous refresh of master actually breaks documentation at master. There is still a problem that documentation is broken after the first failure, but this at least puts us in a recoverable state. Again, as with CL 482162 this is unfortunately not possible to test in the current setup without significant refactoring. Such refactoring may be warranted, but I do not have time to do this now. Updates golang/go#59464 Change-Id: I321cc2eaabac1d7e052d07efcaadefcac24208ac Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/484736 Reviewed-by: Jamal Carvalho <jamal@golang.org> Run-TryBot: Robert Findley <rfindley@google.com> TryBot-Result: kokoro <noreply+kokoro@google.com>
Likely related: I just tried to directly access https://proxy.golang.org/cuelang.org/go/@v/master.info, and the first time (on a cold cache?) it took a solid 15-20s to respond, which I found quite slow. Subsequent requests were instantaneous. Surely it should always return the last master version it saw, even if its cache is a few hours old? |
Yes, what you observed is the retry behavior added in https://go.dev/cl/484736. Before that CL, you would have been served a 404. The last part of this issue is to do exactly what you suggest: don't update the pointer to master if the fetch failed. That missing piece is why this issue is still open; I just haven't had time to do it (IIRC I got stuck on the testing setup). Not likely to get done this month. I have this issue cued up for next month, but it is a relatively low priority. |
What is the URL of the page with the issue?
https://pkg.go.dev/cuelang.org/go@master
What is your user agent?
Screenshot
(this screenshot was for https://pkg.go.dev/cuelang.org/go/cue/load@master, but the root of the module failed too)
What did you do?
Visit the page to view the docs at the latest master commit.
What did you expect to see?
It should work. I've used this many times in the past, and
go get cuelang.org/go@master
worked too.What did you see instead?
The error above. This was today at around 10:19 London time, or 09:19 UTC time. We tried multiple times and it kept failing until about 11:00 London time.
I've looked at the pkgsite source code, and it's interesting to see that it fetches its own meta tag via HTTP from cuelang.org, unlike
go get
which fetches https://proxy.golang.org/cuelang.org/go/@v/master.info. Why is that?Something else to note here is that our meta tag page redirects;
curl -v https://cuelang.org/go?go-get=1
does not show the meta tags, butcurl -L -v https://cuelang.org/go?go-get=1
does. This should be fine, as both cmd/go and proxy.golang.org seem to follow redirects. I would imagine and hope that pkgsite does as well, but I would also hope that it would simply talk to proxy.golang.org instead of reimplementing the "fetch meta tags" logic.The text was updated successfully, but these errors were encountered: