
x/build,cmd/go: frequent connection timeouts to github.com since 2022-03-04 #52545

Open · bcmills opened this issue Apr 25, 2022 · 13 comments
Labels: Builders, GoCommand, NeedsInvestigation, release-blocker
Milestone: Go1.19

Comments

@bcmills (Member) commented Apr 25, 2022

# cd $WORK/tmp/d2/src/github.com/myitcv; git clone -- https://github.com/myitcv/vgo_example_compat $WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat
            Cloning into '$WORK/tmp/d2/src/github.com/myitcv/vgo_example_compat'...
            fatal: unable to access 'https://github.com/myitcv/vgo_example_compat/': Failed to connect to github.com port 443: Connection timed out

greplogs --dashboard -md -l -e 'github.com.* Connection timed out' --since=2021-01-01

2022-04-22T19:02:29-808d40d/linux-386-longtest
2022-04-22T15:44:57-1899472/linux-386-longtest
2022-04-22T04:42:23-1e59876/linux-386-longtest
2022-04-22T00:25:08-c9031a4/linux-amd64-longtest
2022-04-22T00:05:54-c510cd9/linux-386-longtest
2022-04-22T00:05:54-c510cd9/linux-amd64-longtest
2022-04-19T15:07:49-caa4631/linux-amd64-longtest
2022-04-15T19:02:54-df08c9a/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-386-longtest
2022-04-14T18:00:13-dd97871/linux-amd64-longtest
2022-04-13T17:48:12-517781b/linux-386-longtest
2022-04-13T01:15:22-b55a2fb/linux-386-longtest
2022-04-12T22:35:20-fc1d4c1/linux-amd64-longtest
2022-04-12T22:32:01-6183920/linux-386-longtest
2022-04-12T05:46:57-2b31abc/linux-amd64-longtest
2022-04-11T16:31:45-e4e033a/linux-386-longtest
2022-04-11T16:31:43-036b615/linux-amd64-longtest
2022-04-11T16:31:33-494b79f/linux-amd64-longtest
2022-04-08T18:30:53-3a19102/linux-386-longtest
2022-03-15T17:04:57-9b112ce/linux-amd64-longtest
2022-03-07T13:47:51-0e2f1ab/linux-386-longtest
2022-03-04T20:02:41-2b8aa2b/linux-386-longtest
[Note the 5-month gap here! 🤔]
2021-10-08T14:08:12-5b9206f/linux-386-longtest
2021-10-08T14:08:12-5b9206f/linux-amd64-longtest
2021-09-07T21:39:06-dcf3545/linux-386-longtest
2021-09-07T20:37:05-8078355/linux-386-longtest
2021-09-07T20:27:30-23f4f0d/linux-386-longtest
2021-09-07T19:39:04-d92101f/linux-386-longtest
2021-06-22T02:44:43-197a5ee/linux-amd64-longtest
2021-04-02T05:24:14-aebc0b4/linux-amd64-longtest
2021-03-18T14:43:33-e726e2a/linux-amd64-longtest
2021-03-17T17:50:50-8628bf9/linux-amd64-longtest
2021-03-17T17:13:50-0bd308f/linux-amd64-longtest
2021-03-17T16:53:00-70d54df/linux-386-longtest
2021-03-17T16:53:00-70d54df/linux-amd64-longtest
2021-03-17T16:19:21-2f3db22/linux-386-longtest
2021-03-05T18:46:36-51d8d35/linux-amd64-longtest

@golang/release: were there changes to the linux-.*-longtest image or network configuration around 2022-03-04 that might explain the increase in timeouts at that point?

@gopherbot gopherbot added the Builders label Apr 25, 2022
@gopherbot gopherbot added this to the Unreleased milestone Apr 25, 2022
@bcmills bcmills added the NeedsInvestigation label Apr 25, 2022
@heschi (Contributor) commented Apr 26, 2022

Not that I'm aware of.

@bcmills (Member, Author) commented May 4, 2022

@bcmills (Member, Author) commented May 5, 2022

@bcmills (Member, Author) commented May 6, 2022

@bcmills bcmills removed this from the Unreleased milestone May 6, 2022
@bcmills bcmills added this to the Go1.19 milestone May 6, 2022
@bcmills (Member, Author) commented May 6, 2022

This is causing a huge amount of noise in the longtest builders, and affects the most comprehensive builders for two first-class ports (linux/386 and linux/amd64).

At this failure rate, I suspect that if this issue affected end users we'd be hearing about it by now. So my hypothesis is that this is some kind of GCE networking issue that affects the builder hosts.

Marking as release-blocker for Go 1.19. We need to figure out what's happening with the network and either fix it (if it's a GCE or builder problem) or figure out a workaround (if it's a GitHub problem).

(Note that most of these failures come from the git binary itself; #46693 may explain why we're not seeing this on the Windows builders.)

@bcmills (Member, Author) commented May 6, 2022

When these timeouts occur, the tests have consistently been running for 120 seconds or longer.
That suggests that raising the connection timeout would not help.
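
For context (this sketch is not from the original thread), here is a minimal Go probe one could run on a builder host to check whether the TCP handshake to github.com:443 hangs rather than being refused. The 130-second deadline is an assumption chosen to exceed the ~120s point at which the builder tests fail:

// Hypothetical connectivity probe; not part of the Go test suite.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const addr = "github.com:443"
	for i := 0; i < 5; i++ {
		start := time.Now()
		// Use a deadline longer than the ~120s at which the builders fail,
		// so we can see whether the handshake eventually completes.
		conn, err := net.DialTimeout("tcp", addr, 130*time.Second)
		elapsed := time.Since(start).Round(time.Millisecond)
		if err != nil {
			log.Printf("dial %s failed after %v: %v", addr, elapsed, err)
			continue
		}
		log.Printf("dial %s succeeded after %v (remote %v)", addr, elapsed, conn.RemoteAddr())
		conn.Close()
	}
}

If the dial errors report a timeout after roughly the same interval on every attempt, that would point at packets being silently dropped somewhere along the route rather than at GitHub actively rejecting connections.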

@bcmills (Member, Author) commented May 6, 2022

Here's a theory, inspired by https://askubuntu.com/questions/218728/connection-timeout-when-accessing-github.

Maybe this timeout is some kind of rate-limiting countermeasure on the GitHub side, spurred by bursts of traffic from the Go builders running more SlowBots and post-commit CLs.

Maybe the 5-month gap from October 2021 to March 2022 is an artifact of the Go development cycle: the Go release freeze started 2021-11-01 and ended 2022-02-28. The last failure before the gap was a couple of weeks before the freeze began, and the first failure after the gap was about a week after it ended. During the freeze, the CL rate is naturally much lower, which might keep the builders below whatever threshold GitHub is using for rate-limiting.

@adonovan (Member) commented May 6, 2022

Hi @theojulienne, do you know who might be able to cast an eye on this from the GitHub side?

@bcmills (Member, Author) commented May 11, 2022

I filed a GitHub support ticket (#1616480), but their conclusion was that it's probably a network issue on the GCE side.

Unfortunately, the GCE network architecture is pretty opaque — traceroute doesn't even work properly (https://cloud.google.com/compute/docs/faq#networking). 😵

I think we need to escalate this on the GCE side to at least get enough routing information to identify where along the route the connection is being dropped.

@gopherbot commented May 11, 2022

Change https://go.dev/cl/405714 mentions this issue: cmd/go: add timestamps to script test output

gopherbot pushed a commit that referenced this issue May 11, 2022
Go tests don't include timestamps by default, but we would like to
have them in order to correlate builder failures with server and
network logs.

Since many of the Go tests with external network and service
dependencies are script tests for the 'go' command, logging timestamps
here adds a lot of logging value with one simple and very low-risk
change.

For #50541.
For #52490.
For #52545.
For #52851.

Change-Id: If3fa86deb4a216ec6a1abc4e6f4ee9b05030a729
Reviewed-on: https://go-review.googlesource.com/c/go/+/405714
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
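
The actual change is at the go-review link above. Purely as an illustration of the approach the commit message describes (and not the code from CL 405714), a sketch of prefixing script-test log lines with wall-clock timestamps, using a hypothetical logLine helper, might look like this:

// Illustrative only: a hypothetical helper that prefixes test log output
// with timestamps so failures can be correlated with server and network logs.
// This is not the code from CL 405714.
package main

import (
	"fmt"
	"time"
)

// logLine writes a single timestamped line. A real script test would write
// to the test's log buffer rather than to stdout.
func logLine(format string, args ...interface{}) {
	ts := time.Now().UTC().Format("2006-01-02T15:04:05.000Z")
	fmt.Printf("%s %s\n", ts, fmt.Sprintf(format, args...))
}

func main() {
	logLine("> git clone -- https://github.com/myitcv/vgo_example_compat")
	logLine("[exit status 128]")
}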
@dmitshur dmitshur added the GoCommand label May 13, 2022
@bcmills (Member, Author) commented May 19, 2022

The failures starting 2022-05-11 have timestamps, so we (at least theoretically) now have something we can correlate with network and server logs.
