
all: Gerrit having availability issues #30690

Closed
peter-edge opened this Issue Mar 8, 2019 · 27 comments

peter-edge commented Mar 8, 2019

What version of Go are you using (go version)?

$ go version

go version go1.12 linux/amd64

Also reproduced in my browser at https://go.googlesource.com/text/+/master

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/travis/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/travis/gopath"
GOPROXY=""
GORACE=""
GOROOT="/home/travis/.gimme/versions/go1.12.linux.amd64"
GOTMPDIR=""
GOTOOLDIR="/home/travis/.gimme/versions/go1.12.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build489119788=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I visited https://go.googlesource.com/text/+/master and also ran go get golang.org/x/text; both failed:

go: golang.org/x/text@v0.3.0: unknown revision v0.3.0
go: golang.org/x/text@v0.3.1-0.20180807135948-17ff2d5776d2: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/go/pkg/mod/cache/vcs/5b03666c2d7b526129bad48c5cea095aad8b83badc1daa202e7b0279e3a5d861: exit status 128:
	fatal: remote error: Internal Server Error
go: error loading module requirements

What did you expect to see?

A working call to go get.

What did you see instead?

My CI builds repeatedly failing.

gopherbot added this to the Unreleased milestone Mar 8, 2019

andybons changed the title from "x/text: is down" to "all: Gerrit having availability issues" Mar 8, 2019

andybons (Member) commented Mar 8, 2019

Hi there,
Thanks for the report. Gerrit is having some availability issues that they're investigating. Thanks for your patience.

jacohend commented Mar 9, 2019

This has been happening since yesterday afternoon PST. We're also affected.

Not to beat a dead horse, but I want to emphasize that this has the same impact that npmjs.com downtime would have for nearly every Node.js project in the world.

mikioh pinned this issue Mar 9, 2019

0intro unpinned this issue Mar 10, 2019

Ark-kun commented Mar 11, 2019

Same with us. Our presubmit tests are very flaky because of that.

go: finding golang.org/x/net v0.0.0-20181106065722-10aee1819953
go: golang.org/x/tools@v0.0.0-20180828015842-6cd1fcedba52: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/b44680b3c3708a854d4c89f55aedda0b223beb8d9e30fba969cefb5bd9c1e843: exit status 128:
	fatal: remote error: Internal Server Error
go: golang.org/x/sync@v0.0.0-20181108010431-42b317875d0f: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/55179c5d8c4db2eaed9fae4682d4c84a1fd3612df666b372bef3bbb997c9601f: exit status 128:
	fatal: remote error: Internal Server Error
	fatal: The remote end hung up unexpectedly
go: golang.org/x/sync@v0.0.0-20180314180146-1d60e4601c6f: unknown revision 1d60e4601c6f
go: error loading module requirements
The command "go vet -all -shadow ./agent/..." exited with 1.
$ go vet -all -shadow ./cmd/...
go: finding golang.org/x/tools v0.0.0-20180828015842-6cd1fcedba52
go: finding golang.org/x/sync v0.0.0-20181108010431-42b317875d0f
go: finding golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f
go: golang.org/x/sync@v0.0.0-20181108010431-42b317875d0f: git fetch -f https://go.googlesource.com/sync refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/55179c5d8c4db2eaed9fae4682d4c84a1fd3612df666b372bef3bbb997c9601f: exit status 128:
	fatal: remote error: Internal Server Error
	fatal: The remote end hung up unexpectedly
go: golang.org/x/sync@v0.0.0-20180314180146-1d60e4601c6f: unknown revision 1d60e4601c6f
go: golang.org/x/tools@v0.0.0-20180828015842-6cd1fcedba52: git fetch -f https://go.googlesource.com/tools refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/b44680b3c3708a854d4c89f55aedda0b223beb8d9e30fba969cefb5bd9c1e843: exit status 128:
	fatal: remote error: Internal Server Error
go: error loading module requirements
The command "go vet -all -shadow ./cmd/..." exited with 1.

peter-edge (Author) commented Mar 11, 2019

I feel like this should be relatively high priority, right?

szuecs commented Mar 11, 2019

The same here for Go 1.11.x in our Travis CI pipelines:

go: golang.org/x/sys@v0.0.0-20180831094639-fa5fdf94c789: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/76a8992ccba6d77c6bcf031ff2b6d821cf232e4ad8d1f2362404fbd0a798d846: exit status 128:
	fatal: remote error: Internal Server Error
go: error loading module requirements
GO111MODULE=on go build 
go: finding golang.org/x/sys v0.0.0-20180831094639-fa5fdf94c789
go: golang.org/x/sys@v0.0.0-20180831094639-fa5fdf94c789: git fetch -f https://go.googlesource.com/sys refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/travis/gopath/pkg/mod/cache/vcs/76a8992ccba6d77c6bcf031ff2b6d821cf232e4ad8d1f2362404fbd0a798d846: exit status 128:
	fatal: remote error: Internal Server Error

peter-edge (Author) commented Mar 11, 2019

@andybons any updates?

szuecs referenced this issue Mar 11, 2019: Feature/redis based cluster ratelimit #981 (merged)

myitcv (Member) commented Mar 11, 2019

@bcmills @jayconrod - a drive-by thought, but given that tools is now a proper module, would it make sense to have a mod go-import meta tag so that published versions of x/tools (and others) can be served from a fast, low-cost CDN rather than hitting the VCS?

davecohrs commented Mar 11, 2019

This issue also affects Go 1.11, where modules are experimental and we are not using them.

af-engineering commented Mar 11, 2019

We are hitting this issue as well since Friday. We have had to resort to building locally and then publishing to GCR manually, but this is not an acceptable workaround.

dmitshur (Member) commented Mar 11, 2019

Thanks for the reports. We are looking into the issue with Gerrit availability, and will post updates here.

dmitshur (Member) commented Mar 11, 2019

@myitcv We can't do that at this time because there aren't tagged releases of the modules yet. See this comment that describes the current strategy for adding go.mod files to subrepositories.

myitcv (Member) commented Mar 11, 2019

@dmitshur unless I'm mistaken, this approach will work for pseudo-versions as well as non-pseudo versions (i.e. the tagged releases you are referring to).

All that's required here is a bot of some sort to publish a pseudo-version for each new commit.

I previously did something similar for my domain, with the module versions published to https://github.com/myitcv/pubx and served via https://raw.githubusercontent.com.
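For illustration only (a hypothetical sketch, not anything golang.org serves today; the proxy URL is invented), a mod-type go-import meta tag that resolves versions over plain HTTPS instead of VCS might look like:

```html
<!-- Hypothetical: resolve golang.org/x/tools via a module proxy instead of git -->
<meta name="go-import" content="golang.org/x/tools mod https://modproxy.example.com">
```

With the mod scheme, the go command fetches versions via the GOPROXY protocol (static GETs of .info/.mod/.zip files), which can sit behind a CDN.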

peter-edge (Author) commented Mar 11, 2019

Gerrit does not seem reliable here, and this is affecting much of the Go community. Perhaps an option is to return go-import and go-source tags that point to the GitHub mirrors instead of Gerrit. For example, instead of:

$ curl -sSL golang.org/x/sys?go-get=1
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta name="go-import" content="golang.org/x/sys git https://go.googlesource.com/sys">
<meta name="go-source" content="golang.org/x/sys https://github.com/golang/sys/ https://github.com/golang/sys/tree/master{/dir} https://github.com/golang/sys/blob/master{/dir}/{file}#L{line}">
<meta http-equiv="refresh" content="0; url=https://godoc.org/golang.org/x/sys">
</head>
<body>
Nothing to see here; <a href="https://godoc.org/golang.org/x/sys">move along</a>.
</body>
</html>

Return:

# SUGGESTED VALUE TO RETURN
$ curl -sSL golang.org/x/sys?go-get=1
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta name="go-import" content="golang.org/x/sys git https://github.com/golang/sys">
<meta name="go-source" content="golang.org/x/sys https://github.com/golang/sys/ https://github.com/golang/sys/tree/master{/dir} https://github.com/golang/sys/blob/master{/dir}/{file}#L{line}">
<meta http-equiv="refresh" content="0; url=https://godoc.org/golang.org/x/sys">
</head>
<body>
Nothing to see here; <a href="https://godoc.org/golang.org/x/sys">move along</a>.
</body>
</html>

theckman (Contributor) commented Mar 11, 2019

@andybons Is this the place where we can go to get the latest updates on this issue? This is impacting us internally at Netflix, and others in the larger Go community. I'd like to make sure I'm linking people to the right place.

As we move toward more critical components of the ecosystem being hosted by Google, I think it's worth calling out that the handling of issues like this is what we will use to make judgment calls about the trust we have in Google to operate future things such as the Notary.

An issue not being handled well, such as through a lack of transparent communication, is something that community members will remember going forward. Experiencing intermittent issues for days, with minimal communication from the team, does not feel like the level of operational maturity we would expect from hosted module infrastructure.

How can I help us get into a position of better communication so that we can have that trust?

peter-edge (Author) commented Mar 11, 2019

I second @theckman's statement on trust. As we move toward a centralized module repository, we need to be able to trust Google's infrastructure to be reliable for critical parts of the Go ecosystem. Gerrit sits in that critical path, so its unreliability is part of this. It's my belief we need a postmortem here, but that's just my two cents.

theckman (Contributor) commented Mar 11, 2019

@peter-edge let's aim for an Incident Review; nobody died. 😉

peter-edge (Author) commented Mar 11, 2019

Right now, as a hotfix, simply changing the go-import tag returned for HTTP GET golang.org/x/*?go-get=1 to point at the GitHub mirrors instead of the Gerrit source would help.
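In the meantime, a possible consumer-side stopgap (a sketch, not an endorsed fix; the pseudo-version shown is the one from the failing log above, and the GitHub mirror must contain that commit) is a replace directive pointing the affected module at its GitHub mirror:

```
// go.mod (excerpt)
replace golang.org/x/sys => github.com/golang/sys v0.0.0-20180831094639-fa5fdf94c789
```

The same edit can be made with go mod edit -replace=golang.org/x/sys=github.com/golang/sys@v0.0.0-20180831094639-fa5fdf94c789, and reverted once Gerrit is stable again.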

dmitshur (Member) commented Mar 11, 2019

Is this the place where we can go to get the latest updates on this issue? ... I'd like to make sure I'm linking people to the right place.

Yes, this is the issue tracking the Gerrit availability problem. It has a Soon label and we're actively working on resolving it; we'll post updates as they happen.

theckman (Contributor) commented Mar 12, 2019

@dmitshur Thank you for the update on the current progress, we all really appreciate it!

Is it possible to add a separate label for "Actively Engaged" or something similar, since "Soon" doesn't really communicate that piece (at least it didn't to me)? If we intentionally use "Soon" for both meanings, maybe we could enhance the description to make it clear that it covers imminent as well as current activity.

I know you said that you'll post updates as they happen, but I wonder what we've been able to learn in 3 days. Are you able to share any impact estimates, impact start times, etc. to help people understand when/how it may have impacted them? Is this a transient thing, where it had stabilized and then became unstable again? Do you have an ETA for when we'd expect to see stability, or at least for when we'd expect to get an update from you around an ETA?

Like with Google Cloud outages, can we get a commitment from the team to post regular updates (even if it's just to confirm it's still being worked on, and there's no ETA)? It helps us know that people are still engaged, and avoids hours of uncomfortable silence. Regular could be every 3, 6, or whatever hours.

Sorry, got my SRE hat on a little bit here. 😄

peter-edge (Author) commented Mar 12, 2019

Also, just to note again: we can immediately and effectively mitigate this problem, while Gerrit's overarching issues are investigated and fixed, by switching the go-import tag. That change should be quick, and it would mitigate this critical outage. Can we get this done while further work is performed on Gerrit?

ianlancetaylor (Contributor) commented Mar 12, 2019

I don't know what is happening, but I believe that the Gerrit team fixed the problem on Friday, and now, on Monday, they are encountering the same problem or a different one with similar symptoms.

dmitshur (Member) commented Mar 12, 2019

An update from the Gerrit team: they believe today's outage ended at 8:50 PM EST (36 minutes before this comment). If you're still experiencing problems since that time, please let us know so we can continue to look into it.

dmitshur (Member) commented Mar 12, 2019

Today's incident seems to be resolved, so I'm going to remove the Soon label. I won't close the issue yet, for visibility and because there are follow-up actions from this that we'll want to consider.

(For now, I've added "outages" to the description of the label Soon. We can see if more can be done to improve the label as part of follow-up steps.)

dmitshur removed the Soon label Mar 12, 2019

theckman (Contributor) commented Mar 12, 2019

@dmitshur @ianlancetaylor When should we anticipate seeing a review of the incident, including details like the impact window timeline, estimated impact (percentage of requests failing, for example), what happened, and what's being done to prevent a recurrence?

ianlancetaylor (Contributor) commented Mar 12, 2019

I think @andybons is planning to write something here after he gets the information from the Gerrit team. We are also encouraging the Gerrit team to have a public dashboard for status and outages.

andybons (Member) commented Mar 13, 2019

Ian is correct. I’m working with the Gerrit team and will update this thread once we have something to share.

andybons (Member) commented Mar 19, 2019

Hi all,
Thanks for your patience.

To give a broad-strokes update on what happened: an errant task run by another team within Google caused one of Gerrit's backend services to become overwhelmed, resulting in elevated error rates, including 502s and "Repository not found" errors, for a subset of users.

The underlying cause was that routing was not configured to take differences in request cost into account, which concentrated expensive requests in a small number of tasks.

The issue has been remediated and a thorough internal post-mortem has been written.

We have asked the Gerrit team to provide a public dashboard to better communicate these issues and will leave it to them to provide more extensive details on timeline, remediation steps, and follow-up items surrounding this particular outage.

/cc @jrn from the Gerrit team.

andybons closed this Mar 19, 2019
