
x/build/env/linux-mips: we have no Linux MIPS builders #31217

Open

bradfitz opened this issue Apr 2, 2019 · 61 comments

@bradfitz (Member) commented Apr 2, 2019

All four of our MIPS variants have lost their builders, and the email address of the old owner now bounces (they presumably changed jobs).

So we need to find a new owner, or remove the port per policy (https://github.com/golang/go/wiki/PortingPolicy#removing-a-port).

/cc @randall77 @cherrymui @ianlancetaylor @dmitshur

@gopherbot gopherbot added this to the Unreleased milestone Apr 2, 2019

@gopherbot gopherbot added the Builders label Apr 2, 2019

@gopherbot commented Apr 2, 2019

Change https://golang.org/cl/170444 mentions this issue: dashboard: update stale ownership info for now-dead MIPS builders

@cherrymui (Contributor) commented Apr 2, 2019

I have a MIPS64LE machine that I used to run as a builder (an old-style builder, not a buildlet). The machine is quite old and slow, so I stopped running it once the new builders were up. I could probably revive that machine, if desired (but probably no earlier than next month).

@bradfitz (Member, Author) commented Apr 2, 2019

Alternatively, I could run it for you if you want to ship or inter-office mail it to me.

Do you remember which board it is?

@cherrymui (Contributor) commented Apr 2, 2019

It's a Loongson 2E box.

If I remember correctly, it needs an awkward booting process, something like PXE with TFTP from another machine, because I screwed something up (probably part of the hard drive?) a long time ago. Once it boots, it should run OK.

I also have a Loongson 2F laptop, which is probably a little better. I think it can boot by itself. But the disk space is pretty small.


@willglynn commented Apr 2, 2019

@bradfitz Same board is available at a lower price and higher quantity elsewhere on eBay or from Mouser, a reputable electronics distributor.


@bradfitz (Member, Author) commented Apr 2, 2019

@willglynn, thanks for the links! I've now bought all the ones I found on eBay. We'll see how those work out before I buy any more from Mouser.

gopherbot pushed a commit to golang/build that referenced this issue Apr 3, 2019
dashboard: update stale ownership info for now-dead MIPS builders
Updates golang/go#31217

Change-Id: I9b8e64c24ed303166e609c13fa4b69d83ddcd0c1
Reviewed-on: https://go-review.googlesource.com/c/build/+/170444
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

@omac777 commented Apr 3, 2019

> Alternatively, I could run it for you if you want to ship or inter-office mail it to me.
> Do you remember which board it is?

I have a six-core Loongson 3A board with 8GB RAM that used to run an old Fedora, but I blew that install away while trying to bring Debian unstable up on it using mediatemple MIPS binaries. I didn't get much support from Loongson themselves because I'm not in China. I would need a new hard drive with bootable Loongnix installed to bring it back up well enough to run Go on it, and the Debian Loongson experts within QEMU never got back to me. It's not a typical board where you can just boot from a CD or a USB drive. I'm not willing to part with this $$$ board, but if you could help me bring the box up again, I'll give you root access to it over fiber. That would be fair.

@cetinsert commented Apr 3, 2019

Linux / MIPS is business critical for us! I can get all the necessary hardware and set up a new repository for automated builds with unreleased versions of Go.
Currently we use it for a fleet of IIoT devices built with GOOS=linux GOARCH=mips, always on the latest release of Go (1.12 as of today). The hardware is all OpenWrt routers / router boards.
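
For reference, a build like that is a plain cross-compile; a minimal sketch, where the package path is a placeholder and the GOMIPS value depends on whether the router's SoC has an FPU:

# Cross-compile for a 32-bit big-endian MIPS OpenWrt device from an amd64 host.
GOOS=linux GOARCH=mips GOMIPS=softfloat go build -o app-mips ./cmd/app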

@cetinsert commented Apr 3, 2019

@bradfitz I have a doctor's appointment for the next few hours but am ready to take on the role of running builders for all 4 MIPS variants. Please let me know what to do next; otherwise I will review the posts and documents and get in touch with a hardware / time plan. I expect getting all the necessary and future hardware (for the other 3 variants; we only use GOARCH=mips in automated builds) will take mere days, because I live in Korea, relatively close to Shenzhen.

@bradfitz (Member, Author) commented Apr 3, 2019

@cetinsert, great to hear! How much RAM and what type of storage do those devices have? We tend to require 1GB of RAM and pretty decent storage (SD cards fail very quickly, so putting the working directory on a USB SSD is better). They also need to have moderately okay network, but if you're in South Korea you're probably good. :-)

https://golang.org/wiki/DashboardBuilders has more info. I can get you the necessary keys over email.

@cetinsert commented Apr 3, 2019

@bradfitz OK, so we were not talking about cross-compiling for MIPS o__O (because that's what we do, from linux-amd64, darwin-amd64, and windows-amd64)?

Let me review https://golang.org/wiki/DashboardBuilders!

@bradfitz (Member, Author) commented Apr 3, 2019

@cetinsert, we kind of have support for that, at least partially. The builders support cross-compiling the make.bash step, but when running tests they still do on-device compiles of each test, which can still be big. We don't yet have support for cross-compiling the test binaries. Even if we did, that'd increase the network requirements a bit.
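
Roughly, the cross-compiled make.bash step amounts to something like the following on a fast amd64 host (a sketch of the general mechanism, not the coordinator's exact invocation):

# Build a Go toolchain targeting linux/mipsle on an amd64 machine;
# the target binaries end up under bin/linux_mipsle/.
cd go/src
GOOS=linux GOARCH=mipsle ./make.bash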

@cetinsert commented Apr 3, 2019

@bradfitz I see VMs listed here: https://farmer.golang.org/builders#host-linux-mips. If VMs are OK, please send me the keys over email (mips@shin.asia)!

I have several Scaleway x86-64 machines (2 amd64 cores, 2GB RAM) with good network conditions and can set up QEMU and the other essentials for MIPS emulation, or take over existing work in the form of VM images and host them from then on.
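
For reference, full-system MIPS emulation with QEMU typically looks something like this (a sketch; the Malta machine model is one common choice, and the kernel and disk image paths are placeholders):

# Emulate a little-endian MIPS64 Malta board with 1GB RAM, booting a
# target kernel and disk image over the serial console.
qemu-system-mips64el -M malta -m 1024 \
    -kernel vmlinux-malta -hda mips64el-disk.img \
    -append "root=/dev/sda1 console=ttyS0" -nographic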

@9nut commented Apr 3, 2019

@bradfitz I can contribute a Ci20; I'll order one today. If it's time critical, I have a MikroTik RB450G (256MB) and a VoCore2 (128MB) that I can send your way; they're probably only useful for testing. Incidentally, the RB450G runs Plan 9 via tftp/bootp -- if/when plan9/mips32 is available.

@FlyGoat commented Apr 3, 2019

@omac777

> Alternatively, I could run it for you if you want to ship or inter-office mail it to me.
> Do you remember which board it is?
> I have a six-core Loongson 3a board 8GB RAM that used to have an old fedora running on it, but I

I assume you have a Loongson-3B1500, since the Loongson-3A only has four cores. The 3B1500 is an octa-core model, but two cores are disabled on some boards due to a hardware erratum.

> blew it away while trying to bring debian unstable up onto it using mediatemple mips binaries onto it. I didn't get much support from loongson themselves because I'm not in China. I would need a new hard-drive with the bootable loongnix installed onto it to bring it back up and running enough to run golang on it. the debian loongson experts within qemu never got back to me. It's not a typical board where you can just boot with a cd or a usb drive. I'm not willing to part with this $$$ board, but if you could help me bring the box up again, I'll give you root access via fiber to it. That would be fair.

Could you please tell me the model of your board? It should be displayed in the firmware, or printed on the circuit board. You probably just need to update PMON or klBIOS to make it boot from a USB stick.

There is now a Fedora 28 port: http://mirror.lemote.com:8000/fedora/fedora28-live/
You can try writing it to a USB stick and booting from that, or even write it directly to your hard drive.

Anyway, though I'm not a Loongson employee, I'm familiar with Loongson devices and Loongson developers; email me if you need any help.
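
Writing the image to a USB stick is the usual dd step (a sketch; the image filename and /dev/sdX are placeholders, so double-check the device name first):

# Write the downloaded Fedora live image to the USB stick and flush caches.
dd if=fedora28-live.img of=/dev/sdX bs=4M status=progress && sync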

@bradfitz (Member, Author) commented Apr 3, 2019

@9nut, thanks, Skip, but I have a few Ci20s on the way already. No need to buy another. The rb450g could be interesting, though, as it seems to be a 32-bit BE CPU? Not much RAM, but it'd give us a MIPS CPU variant I don't think we have.

I don't think there are any plans for more Plan 9 ports (@0intro?). But really the first priority is getting Linux back.

@cetinsert commented Apr 3, 2019

@bradfitz would QEMU or another mips emulator work?
If yes, I can dedicate an x86-64 host with 2GB RAM and SSD storage.

For a true mips host, we will be reviewing hardware options shortly.
Our mips-based target devices have nowhere near 1GB of RAM.

@bradfitz (Member, Author) commented Apr 3, 2019

@cetinsert, QEMU is generally too slow and sometimes I hear it's too forgiving (e.g. accepting unaligned instructions that real hardware would reject). So we try to use real hardware when possible. If we do decide to go the emulation route we can run that on GCE where we have tons of x86 capacity.

@0intro (Member) commented Apr 3, 2019

Currently, Plan 9 runs on the MikroTik RB450G (MIPS32 BE) and the Lemote Yeeloong (MIPS64 LE). OCTEON (MIPS64 BE) and CI20 (MIPS32 LE) ports were in progress.

However, there is currently no plan to port Go to plan9/mips, though it could be interesting.

Anyway, I think these boards would likely be a little tight to run a (Linux) Go builder.

Ideally, you could try to get your hands on an OCTEON board, which usually has multiple cores and multiple gigabytes of memory.

@cetinsert commented Apr 3, 2019

@bradfitz which hardware was used by the former builders for the 4 variants for Linux?

@bradfitz (Member, Author) commented Apr 3, 2019

@cetinsert, I don't actually know. It was run by somebody at MIPS who no longer appears to be employed there.

@mengzhuo (Contributor) commented Aug 28, 2019

I have submitted CL
https://go-review.googlesource.com/c/build/+/191577

I hope it helps.

@gopherbot commented Sep 3, 2019

Change https://golang.org/cl/193017 mentions this issue: dashboard: remove nonexistent linux-mips builders

gopherbot pushed a commit to golang/build that referenced this issue Sep 6, 2019
dashboard: remove nonexistent linux-mips builders
These builders haven't existed for a while, and they consume
horizontal space on the build dashboard, pushing more interesting
columns off of the screen at moderate resolutions.

Updates golang/go#31217

Change-Id: Ib7021b699956d01d17e0ea8b213efd07b17d26f7
Reviewed-on: https://go-review.googlesource.com/c/build/+/193017
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

@gopherbot commented Sep 6, 2019

Change https://golang.org/cl/191577 mentions this issue: dashboard: add linux-mipsle-mengzhuo builder

gopherbot pushed a commit to golang/build that referenced this issue Sep 11, 2019
dashboard: add linux-mipsle-mengzhuo builder
Updates golang/go#31217

Change-Id: I7bda4f3ddf7dc76bdd974726747621f69a7acfa0
Reviewed-on: https://go-review.googlesource.com/c/build/+/191577
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

@dmitshur (Member) commented Sep 12, 2019

@mengzhuo I've submitted CL 191577, then redeployed the coordinator and the build.golang.org dashboard. Your builder appears to be running for go, net, and sys repositories as configured. It's getting this far:

Building Go cmd/dist using /usr/lib/golang.
Building Go toolchain1 using /usr/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.


Error: error copying response: unexpected EOF

@mengzhuo (Contributor) commented Sep 12, 2019

@dmitris Hi, I'm sorry. It seems to be a network issue in China.
I've changed to another proxy server in Hong Kong.
Can you stop all pending mipsle builds?

@dmitshur (Member) commented Sep 12, 2019

No problem, it's expected that new builders take some time to work out the issues before they're fully operational.

Can you stop all mipsle pending builds?

We don't have an easy way of doing this, other than to remove the builder in code and redeploy. (As far as I know? /cc @bradfitz)

If it's not a problem for you, it might be better to keep it on. Once you resolve the networking issues, you can immediately start seeing what the next issue is, if any, without waiting to re-add the builder.

If you're worried about incorrect "FAIL" results, we can easily clear those when the builder is in a working state.

Let me know how you'd prefer to proceed.

@mengzhuo (Contributor) commented Sep 12, 2019

@dmitris OK, I will fix this today

(I have to go to work now :(

@dmitshur (Member) commented Sep 12, 2019

There's no immediate rush; this is a 5-month-old issue. Take the time you need. Thank you for helping out with this!

@mengzhuo (Contributor) commented Sep 12, 2019

@dmitris The log shows

Error: writeSnapshot: local error: tls: bad record MAC

I can only find the writeSnapshot function at
https://github.com/golang/build/blob/01fd29966998a0a3ecd5d721de6bcde3ea9b9a6f/cmd/coordinator/coordinator.go#L2161
Is something wrong with the snapshot server?

      rev: e7e2b1c2b91320ef0ddf025d330061d56115dd53
 buildlet: http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  started: 2019-09-11 23:53:42.184067314 +0000 UTC m=+58.957360392
    ended: 2019-09-12 03:57:57.089056773 +0000 UTC m=+14713.862349853
  success: false

Events:
  2019-09-11T23:53:42Z checking_for_snapshot 
  2019-09-11T23:53:42Z finish_checking_for_snapshot after 191.2ms
  2019-09-11T23:53:42Z get_buildlet 
  2019-09-11T23:53:42Z wait_static_builder host-linux-mipsle-mengzhuo
  2019-09-11T23:53:42Z waiting_machine_in_use 
  2019-09-12T03:48:32Z finish_wait_static_builder after 3h54m49.4s; host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z clean_buildlet http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_clean_buildlet after 443ms; http://ls-3a3k reverse peer ls-3a3k/129.226.132.234:39892 for host type host-linux-mipsle-mengzhuo
  2019-09-12T03:48:32Z finish_get_buildlet after 3h54m50s
  2019-09-12T03:48:32Z using_buildlet ls-3a3k
  2019-09-12T03:48:32Z write_version_tar 
  2019-09-12T03:48:32Z get_source 
  2019-09-12T03:48:33Z finish_get_source after 0s
  2019-09-12T03:48:33Z write_go_src_tar 
  2019-09-12T03:50:00Z finish_write_go_src_tar after 1m27.4s
  2019-09-12T03:50:00Z make_and_test 
  2019-09-12T03:50:00Z make src/make.bash
  2019-09-12T03:56:10Z finish_make after 6m10.2s; src/make.bash
  2019-09-12T03:56:10Z clean_for_snapshot 
  2019-09-12T03:56:11Z finish_clean_for_snapshot after 178.3ms
  2019-09-12T03:56:11Z write_snapshot_to_gcs 
  2019-09-12T03:56:11Z fetch_snapshot_reader_from_buildlet 
  2019-09-12T03:56:11Z finish_fetch_snapshot_reader_from_buildlet after 345.6ms
  2019-09-12T03:57:56Z finish_write_snapshot_to_gcs after 1m45.7s; err=local error: tls: bad record MAC
  2019-09-12T03:57:56Z finish_make_and_test after 7m56.3s; err=writeSnapshot: local error: tls: bad record MAC

Build log:
linux-mips64le-mengzhuo at e7e2b1c2b91320ef0ddf025d330061d56115dd53

:: Running /tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash with args ["/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/make.bash"] and env ["LANG=en_US.UTF-8" "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin" "GO_BUILDER_ENV=host-linux-mipsle-mengzhuo" "WORKDIR=/tmp/workdir-host-linux-mipsle-mengzhuo" "HTTPS_PROXY=http://127.0.0.1:8123" "HTTP_PROXY=http://127.0.0.1:8123" "USER=root" "HOME=/root" "GO_STAGE0_NET_DELAY=800ms" "GO_STAGE0_DL_DELAY=300ms" "GOROOT_BOOTSTRAP=/tmp/workdir-host-linux-mipsle-mengzhuo/go1.4" "GO_BUILDER_NAME=linux-mips64le-mengzhuo" "GO_BUILDER_FLAKY_NET=1" "GOROOT_BOOTSTRAP=/usr/lib/golang" "GOMIPS64=hardfloat" "GOBIN=" "TMPDIR=/tmp/workdir-host-linux-mipsle-mengzhuo/tmp" "GOCACHE=/tmp/workdir-host-linux-mipsle-mengzhuo/gocache"] in dir /tmp/workdir-host-linux-mipsle-mengzhuo/go/src

Building Go cmd/dist using /usr/lib/golang.
Building Go toolchain1 using /usr/lib/golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/mips64le.
---
Installed Go for linux/mips64le in /tmp/workdir-host-linux-mipsle-mengzhuo/go
Installed commands in /tmp/workdir-host-linux-mipsle-mengzhuo/go/bin


Error: writeSnapshot: local error: tls: bad record MAC

@odeke-em (Member) commented Sep 16, 2019

@mengzhuo I've replied to your report at googleapis/google-cloud-go#1581 (comment) asking some questions.

@odeke-em (Member) commented Sep 16, 2019

@mengzhuo actually, since this involves running lots of code on your builder, what you can do to quickly verify whether the problem is an interaction between Go's TLS 1.3 and Google's TLS 1.2 is to apply this patch to your builder code in cmd/coordinator/gce.go:

diff --git a/cmd/coordinator/gce.go b/cmd/coordinator/gce.go
index e4e702d..809c647 100644
--- a/cmd/coordinator/gce.go
+++ b/cmd/coordinator/gce.go
@@ -12,6 +12,7 @@ package main
 
 import (
 	"context"
+	"crypto/tls"
 	"encoding/json"
 	"errors"
 	"fmt"
@@ -44,6 +45,7 @@ import (
 	"golang.org/x/oauth2/google"
 	compute "google.golang.org/api/compute/v1"
 	"google.golang.org/api/googleapi"
+	"google.golang.org/api/option"
 )
 
 func init() {
@@ -137,20 +139,34 @@ func initGCE() error {
 	cfgDump, _ := json.MarshalIndent(buildEnv, "", "  ")
 	log.Printf("Loaded configuration %q for project %q:\n%s", *buildEnvName, buildEnv.ProjectName, cfgDump)
 
+	opts := []option.ClientOption{
+		// Force TLS 1.2 in the HTTP client because of issues:
+		// * https://github.com/golang/go/issues/31217
+		// * https://github.com/googleapis/google-cloud-go/issues/1581
+		// in which there might be a bad interaction with Go's TLS v1.3 and Google's TLS v1.2.
+		option.WithHTTPClient(&http.Client{
+			Transport: &http.Transport{
+				TLSClientConfig: &tls.Config{
+					MaxVersion: tls.VersionTLS12,
+				},
+			},
+		}),
+	}
+
 	ctx := context.Background()
 	if *mode != "dev" {
-		storageClient, err = storage.NewClient(ctx)
+		storageClient, err = storage.NewClient(ctx, opts...)
 		if err != nil {
 			log.Fatalf("storage.NewClient: %v", err)
 		}
 
-		metricsClient, err = monapi.NewMetricClient(ctx)
+		metricsClient, err = monapi.NewMetricClient(ctx, opts...)
 		if err != nil {
 			log.Fatalf("monapi.NewMetricClient: %v", err)
 		}
 	}
 
-	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName)
+	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName, opts...)
 	if err != nil {
 		if *mode == "dev" {
 			log.Printf("Error creating datastore client: %v", err)

Or, for the full file (which is big), please see this section:

// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build go1.13
// +build linux darwin

// Code interacting with Google Compute Engine (GCE) and
// a GCE implementation of the BuildletPool interface.

package main

import (
	"context"
	"crypto/tls"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"path"
	"sort"
	"strconv"
	"strings"
	"sync"
	"time"

	"cloud.google.com/go/compute/metadata"
	"cloud.google.com/go/datastore"
	"cloud.google.com/go/errorreporting"
	monapi "cloud.google.com/go/monitoring/apiv3"
	"cloud.google.com/go/storage"
	"golang.org/x/build/buildenv"
	"golang.org/x/build/buildlet"
	"golang.org/x/build/cmd/coordinator/spanlog"
	"golang.org/x/build/dashboard"
	"golang.org/x/build/gerrit"
	"golang.org/x/build/internal/buildgo"
	"golang.org/x/build/internal/buildstats"
	"golang.org/x/build/internal/lru"
	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
	compute "google.golang.org/api/compute/v1"
	"google.golang.org/api/googleapi"
	"google.golang.org/api/option"
)

func init() {
	buildlet.GCEGate = gceAPIGate
}

// apiCallTicker ticks regularly, preventing us from accidentally making
// GCE API calls too quickly. Our quota is 20 QPS, but we temporarily
// limit ourselves to less than that.
var apiCallTicker = time.NewTicker(time.Second / 10)

func gceAPIGate() {
	<-apiCallTicker.C
}

// Initialized by initGCE:
var (
	buildEnv *buildenv.Environment

	dsClient       *datastore.Client
	computeService *compute.Service
	gcpCreds       *google.Credentials
	errTryDeps     error // non-nil if try bots are disabled
	gerritClient   *gerrit.Client
	storageClient  *storage.Client
	metricsClient  *monapi.MetricClient
	inStaging      bool                   // are we running in the staging project? (named -dev)
	errorsClient   *errorreporting.Client // Stackdriver errors client
	gkeNodeIP      string

	initGCECalled bool
)

// oAuthHTTPClient is the OAuth2 HTTP client used to make API calls to Google Cloud APIs.
// It is initialized by initGCE.
var oAuthHTTPClient *http.Client

func initGCE() error {
	initGCECalled = true
	var err error

	// If the coordinator is running on a GCE instance and a
	// buildEnv was not specified with the env flag, set the
	// buildEnvName to the project ID
	if *buildEnvName == "" {
		if *mode == "dev" {
			*buildEnvName = "dev"
		} else if metadata.OnGCE() {
			*buildEnvName, err = metadata.ProjectID()
			if err != nil {
				log.Fatalf("metadata.ProjectID: %v", err)
			}
		}
	}

	buildEnv = buildenv.ByProjectID(*buildEnvName)
	inStaging = (buildEnv == buildenv.Staging)

	// If running on GCE, override the zone and static IP, and check service account permissions.
	if metadata.OnGCE() {
		projectZone, err := metadata.Get("instance/zone")
		if err != nil || projectZone == "" {
			return fmt.Errorf("failed to get current GCE zone: %v", err)
		}

		gkeNodeIP, err = metadata.Get("instance/network-interfaces/0/ip")
		if err != nil {
			return fmt.Errorf("failed to get current instance IP: %v", err)
		}

		// Convert the zone from "projects/1234/zones/us-central1-a" to "us-central1-a".
		projectZone = path.Base(projectZone)
		buildEnv.Zone = projectZone

		if buildEnv.StaticIP == "" {
			buildEnv.StaticIP, err = metadata.ExternalIP()
			if err != nil {
				return fmt.Errorf("ExternalIP: %v", err)
			}
		}

		if !hasComputeScope() {
			return errors.New("coordinator is not running with access to read and write Compute resources. VM support disabled")
		}

		if value, err := metadata.ProjectAttributeValue("farmer-run-bench"); err == nil {
			*shouldRunBench, _ = strconv.ParseBool(value)
		}
	}

	cfgDump, _ := json.MarshalIndent(buildEnv, "", "  ")
	log.Printf("Loaded configuration %q for project %q:\n%s", *buildEnvName, buildEnv.ProjectName, cfgDump)

	opts := []option.ClientOption{
		// Force TLS 1.2 in the HTTP client because of issues:
		// * https://github.com/golang/go/issues/31217
		// * https://github.com/googleapis/google-cloud-go/issues/1581
		// in which there might be a bad interaction with Go's TLS v1.3 and Google's TLS v1.2.
		option.WithHTTPClient(&http.Client{
			Transport: &http.Transport{
				TLSClientConfig: &tls.Config{
					MaxVersion: tls.VersionTLS12,
				},
			},
		}),
	}

	ctx := context.Background()
	if *mode != "dev" {
		storageClient, err = storage.NewClient(ctx, opts...)
		if err != nil {
			log.Fatalf("storage.NewClient: %v", err)
		}

		metricsClient, err = monapi.NewMetricClient(ctx, opts...)
		if err != nil {
			log.Fatalf("monapi.NewMetricClient: %v", err)
		}
	}

	dsClient, err = datastore.NewClient(ctx, buildEnv.ProjectName, opts...)
	if err != nil {
		if *mode == "dev" {
			log.Printf("Error creating datastore client: %v", err)
		} else {
			log.Fatalf("Error creating datastore client: %v", err)
		}
	}

	// don't send dev errors to Stackdriver.
	if *mode != "dev" {
		errorsClient, err = errorreporting.NewClient(ctx, buildEnv.ProjectName, errorreporting.Config{
			ServiceName: "coordinator",
		})
		if err != nil {
			// don't exit, we still want to run coordinator
			log.Printf("Error creating errors client: %v", err)
		}
	}

	gcpCreds, err = buildEnv.Credentials(ctx)
	if err != nil {
		if *mode == "dev" {
			// don't try to do anything else with GCE, as it will likely fail
			return nil
		}
		log.Fatalf("failed to get a token source: %v", err)
	}
	oAuthHTTPClient = oauth2.NewClient(ctx, gcpCreds.TokenSource)
	computeService, _ = compute.New(oAuthHTTPClient)
	errTryDeps = checkTryBuildDeps()
	if errTryDeps != nil {
		log.Printf("TryBot builders disabled due to error: %v", errTryDeps)
	} else {
		log.Printf("TryBot builders enabled.")
	}

	if *mode != "dev" {
		go syncBuildStatsLoop(buildEnv)
	}

	go gcePool.pollQuotaLoop()
	go createBasepinDisks(context.Background())
	return nil
}

func checkTryBuildDeps() error {
	if !hasStorageScope() {
		return errors.New("coordinator's GCE instance lacks the storage service scope")
	}
	if *mode == "dev" {
		return errors.New("running in dev mode")
	}
	wr := storageClient.Bucket(buildEnv.LogBucket).Object("hello.txt").NewWriter(context.Background())
	fmt.Fprintf(wr, "Hello, world! Coordinator start-up at %v", time.Now())
	if err := wr.Close(); err != nil {
		return fmt.Errorf("test write of a GCS object to bucket %q failed: %v", buildEnv.LogBucket, err)
	}
	if inStaging {
		// Don't expect to write to Gerrit in staging mode.
		gerritClient = gerrit.NewClient("https://go-review.googlesource.com", gerrit.NoAuth)
	} else {
		gobotPass, err := metadata.ProjectAttributeValue("gobot-password")
		if err != nil {
			return fmt.Errorf("failed to get project metadata 'gobot-password': %v", err)
		}
		gerritClient = gerrit.NewClient("https://go-review.googlesource.com",
			gerrit.BasicAuth("git-gobot.golang.org", strings.TrimSpace(string(gobotPass))))
	}

	return nil
}

var gcePool = &gceBuildletPool{}

var _ BuildletPool = (*gceBuildletPool)(nil)

// maxInstances is a temporary hack because we can't get buildlets to boot
// without IPs, and we only have 200 IP addresses.
// TODO(bradfitz): remove this once fixed.
const maxInstances = 190

type gceBuildletPool struct {
	mu sync.Mutex // guards all following

	disabled bool

	// CPU quota usage & limits.
	cpuLeft   int // dead-reckoning CPUs remain
	instLeft  int // dead-reckoning instances remain
	instUsage int
	cpuUsage  int
	addrUsage int
	inst      map[string]time.Time // GCE VM instance name -> creationTime
}

func (p *gceBuildletPool) pollQuotaLoop() {
	if computeService == nil {
		log.Printf("pollQuotaLoop: no GCE access; not checking quota.")
		return
	}
	if buildEnv.ProjectName == "" {
		log.Printf("pollQuotaLoop: no GCE project name configured; not checking quota.")
		return
	}
	for {
		p.pollQuota()
		time.Sleep(5 * time.Second)
	}
}

func (p *gceBuildletPool) pollQuota() {
	gceAPIGate()
	reg, err := computeService.Regions.Get(buildEnv.ProjectName, buildEnv.Region()).Do()
	if err != nil {
		log.Printf("Failed to get quota for %s/%s: %v", buildEnv.ProjectName, buildEnv.Region(), err)
		return
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, quota := range reg.Quotas {
		switch quota.Metric {
		case "CPUS":
			p.cpuLeft = int(quota.Limit) - int(quota.Usage)
			p.cpuUsage = int(quota.Usage)
		case "INSTANCES":
			p.instLeft = int(quota.Limit) - int(quota.Usage)
			p.instUsage = int(quota.Usage)
		case "IN_USE_ADDRESSES":
			p.addrUsage = int(quota.Usage)
		}
	}
}

func (p *gceBuildletPool) SetEnabled(enabled bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.disabled = !enabled
}

func (p *gceBuildletPool) GetBuildlet(ctx context.Context, hostType string, lg logger) (bc *buildlet.Client, err error) {
	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		return nil, fmt.Errorf("gcepool: unknown host type %q", hostType)
	}
	qsp := lg.CreateSpan("awaiting_gce_quota")
	err = p.awaitVMCountQuota(ctx, hconf.GCENumCPU())
	qsp.Done(err)
	if err != nil {
		return nil, err
	}

	deleteIn, ok := ctx.Value(buildletTimeoutOpt{}).(time.Duration)
	if !ok {
		deleteIn = vmDeleteTimeout
	}

	instName := "buildlet-" + strings.TrimPrefix(hostType, "host-") + "-rn" + randHex(7)
	instName = strings.Replace(instName, "_", "-", -1) // Issue 22905; can't use underscores in GCE VMs
	p.setInstanceUsed(instName, true)

	gceBuildletSpan := lg.CreateSpan("create_gce_buildlet", instName)
	defer func() { gceBuildletSpan.Done(err) }()

	var (
		needDelete   bool
		createSpan   = lg.CreateSpan("create_gce_instance", instName)
		waitBuildlet spanlog.Span // made after create is done
		curSpan      = createSpan // either instSpan or waitBuildlet
	)

	log.Printf("Creating GCE VM %q for %s", instName, hostType)
	bc, err = buildlet.StartNewVM(gcpCreds, buildEnv, instName, hostType, buildlet.VMOpts{
		DeleteIn: deleteIn,
		OnInstanceRequested: func() {
			log.Printf("GCE VM %q now booting", instName)
		},
		OnInstanceCreated: func() {
			needDelete = true

			createSpan.Done(nil)
			waitBuildlet = lg.CreateSpan("wait_buildlet_start", instName)
			curSpan = waitBuildlet
		},
		OnGotInstanceInfo: func() {
			lg.LogEventTime("got_instance_info", "waiting_for_buildlet...")
		},
	})
	if err != nil {
		curSpan.Done(err)
		log.Printf("Failed to create VM for %s: %v", hostType, err)
		if needDelete {
			deleteVM(buildEnv.Zone, instName)
			p.putVMCountQuota(hconf.GCENumCPU())
		}
		p.setInstanceUsed(instName, false)
		return nil, err
	}
	waitBuildlet.Done(nil)
	bc.SetDescription("GCE VM: " + instName)
	bc.SetOnHeartbeatFailure(func() {
		p.putBuildlet(bc, hostType, instName)
	})
	return bc, nil
}

func (p *gceBuildletPool) putBuildlet(bc *buildlet.Client, hostType, instName string) error {
	// TODO(bradfitz): add the buildlet to a freelist (of max N
	// items) for up to 10 minutes since when it got started if
	// it's never seen a command execution failure, and we can
	// wipe all its disk content? (perhaps wipe its disk content
	// when it's retrieved, like the reverse buildlet pool) But
	// this will require re-introducing a distinction in the
	// buildlet client library between Close, Destroy/Halt, and
	// tracking execution errors.  That was all half-baked before
	// and thus removed. Now Close always destroys everything.
	deleteVM(buildEnv.Zone, instName)
	p.setInstanceUsed(instName, false)

	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		panic("failed to lookup conf") // should've worked if we did it before
	}
	p.putVMCountQuota(hconf.GCENumCPU())
	return nil
}

func (p *gceBuildletPool) WriteHTMLStatus(w io.Writer) {
	fmt.Fprintf(w, "<b>GCE pool</b> capacity: %s", p.capacityString())
	const show = 6 // must be even
	active := p.instancesActive()
	if len(active) > 0 {
		fmt.Fprintf(w, "<ul>")
		for i, inst := range active {
			if i < show/2 || i >= len(active)-(show/2) {
				fmt.Fprintf(w, "<li>%v, %s</li>\n", inst.name, friendlyDuration(time.Since(inst.creation)))
			} else if i == show/2 {
				fmt.Fprintf(w, "<li>... %d of %d total omitted ...</li>\n", len(active)-show, len(active))
			}
		}
		fmt.Fprintf(w, "</ul>")
	}
}

func (p *gceBuildletPool) String() string {
	return fmt.Sprintf("GCE pool capacity: %s", p.capacityString())
}

func (p *gceBuildletPool) capacityString() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	return fmt.Sprintf("%d/%d instances; %d/%d CPUs",
		len(p.inst), p.instUsage+p.instLeft,
		p.cpuUsage, p.cpuUsage+p.cpuLeft)
}

// awaitVMCountQuota waits for numCPU CPUs of quota to become available,
// or returns ctx.Err.
func (p *gceBuildletPool) awaitVMCountQuota(ctx context.Context, numCPU int) error {
	// Poll every 2 seconds, which could be better, but works and
	// is simple.
	for {
		if p.tryAllocateQuota(numCPU) {
			return nil
		}
		select {
		case <-time.After(2 * time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func (p *gceBuildletPool) HasCapacity(hostType string) bool {
	hconf, ok := dashboard.Hosts[hostType]
	if !ok {
		return false
	}
	numCPU := hconf.GCENumCPU()
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.haveQuotaLocked(numCPU)
}

// haveQuotaLocked reports whether the current GCE quota permits
// starting numCPU more CPUs.
//
// precondition: p.mu must be held.
func (p *gceBuildletPool) haveQuotaLocked(numCPU int) bool {
	return p.cpuLeft >= numCPU && p.instLeft >= 1 && len(p.inst) < maxInstances && p.addrUsage < maxInstances
}

func (p *gceBuildletPool) tryAllocateQuota(numCPU int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.disabled {
		return false
	}
	if p.haveQuotaLocked(numCPU) {
		p.cpuUsage += numCPU
		p.cpuLeft -= numCPU
		p.instLeft--
		p.addrUsage++
		return true
	}
	return false
}

// putVMCountQuota adjusts the dead-reckoning of our quota usage by
// one instance and cpu CPUs.
func (p *gceBuildletPool) putVMCountQuota(cpu int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cpuUsage -= cpu
	p.cpuLeft += cpu
	p.instLeft++
}

func (p *gceBuildletPool) setInstanceUsed(instName string, used bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.inst == nil {
		p.inst = make(map[string]time.Time)
	}
	if used {
		p.inst[instName] = time.Now()
	} else {
		delete(p.inst, instName)
	}
}

func (p *gceBuildletPool) instanceUsed(instName string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.inst[instName]
	return ok
}

func (p *gceBuildletPool) instancesActive() (ret []resourceTime) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for name, create := range p.inst {
		ret = append(ret, resourceTime{
			name:     name,
			creation: create,
		})
	}
	sort.Sort(byCreationTime(ret))
	return ret
}

// resourceTime is a GCE instance or Kube pod name and its creation time.
type resourceTime struct {
	name     string
	creation time.Time
}

type byCreationTime []resourceTime

func (s byCreationTime) Len() int           { return len(s) }
func (s byCreationTime) Less(i, j int) bool { return s[i].creation.Before(s[j].creation) }
func (s byCreationTime) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// cleanUpOldVMs loops forever and periodically enumerates virtual
// machines and deletes those which have expired.
//
// A VM is considered expired if it has a "delete-at" metadata
// attribute having a unix timestamp before the current time.
//
// This is the safety mechanism to delete VMs which stray from the
// normal deleting process. VMs are created to run a single build and
// should be shut down by a controlling process. Due to various types
// of failures, they might get stranded. To prevent them from getting
// stranded and wasting resources forever, we instead set the
// "delete-at" metadata attribute on them when created to some time
// that's well beyond their expected lifetime.
func (p *gceBuildletPool) cleanUpOldVMs() {
	if *mode == "dev" {
		return
	}
	if computeService == nil {
		return
	}

	// TODO(bradfitz): remove this list and just query it from the compute API?
	// http://godoc.org/google.golang.org/api/compute/v1#RegionsService.Get
	// and Region.Zones: http://godoc.org/google.golang.org/api/compute/v1#Region

	for {
		for _, zone := range buildEnv.ZonesToClean {
			if err := p.cleanZoneVMs(zone); err != nil {
				log.Printf("Error cleaning VMs in zone %q: %v", zone, err)
			}
		}
		time.Sleep(time.Minute)
	}
}

// cleanZoneVMs is part of cleanUpOldVMs, operating on a single zone.
func (p *gceBuildletPool) cleanZoneVMs(zone string) error {
	// Fetch the first 500 (default) running instances and clean
	// thoes. We expect that we'll be running many fewer than
	// that. Even if we have more, eventually the first 500 will
	// either end or be cleaned, and then the next call will get a
	// partially-different 500.
	// TODO(bradfitz): revist this code if we ever start running
	// thousands of VMs.
	gceAPIGate()
	list, err := computeService.Instances.List(buildEnv.ProjectName, zone).Do()
	if err != nil {
		return fmt.Errorf("listing instances: %v", err)
	}
	for _, inst := range list.Items {
		if inst.Metadata == nil {
			// Defensive. Not seen in practice.
			continue
		}
		var sawDeleteAt bool
		var deleteReason string
		for _, it := range inst.Metadata.Items {
			if it.Key == "delete-at" {
				if it.Value == nil {
					log.Printf("missing delete-at value; ignoring")
					continue
				}
				unixDeadline, err := strconv.ParseInt(*it.Value, 10, 64)
				if err != nil {
					log.Printf("invalid delete-at value %q seen; ignoring", *it.Value)
					continue
				}
				sawDeleteAt = true
				if time.Now().Unix() > unixDeadline {
					deleteReason = "delete-at expiration"
				}
			}
		}
		isBuildlet := strings.HasPrefix(inst.Name, "buildlet-")

		if isBuildlet && !sawDeleteAt && !p.instanceUsed(inst.Name) {
			createdAt, _ := time.Parse(time.RFC3339Nano, inst.CreationTimestamp)
			if createdAt.Before(time.Now().Add(-3 * time.Hour)) {
				deleteReason = fmt.Sprintf("no delete-at, created at %s", inst.CreationTimestamp)
			}
		}

		// Delete buildlets (things we made) from previous
		// generations. Only deleting things starting with "buildlet-"
		// is a historical restriction, but still fine for paranoia.
		if deleteReason == "" && sawDeleteAt && isBuildlet && !p.instanceUsed(inst.Name) {
			if _, ok := deletedVMCache.Get(inst.Name); !ok {
				deleteReason = "from earlier coordinator generation"
			}
		}

		if deleteReason != "" {
			log.Printf("deleting VM %q in zone %q; %s ...", inst.Name, zone, deleteReason)
			deleteVM(zone, inst.Name)
		}

	}
	return nil
}

var deletedVMCache = lru.New(100) // keyed by instName

// deleteVM starts a delete of an instance in a given zone.
//
// It either returns an operation name (if delete is pending) or the
// empty string if the instance didn't exist.
func deleteVM(zone, instName string) (operation string, err error) {
	deletedVMCache.Add(instName, token{})
	gceAPIGate()
	op, err := computeService.Instances.Delete(buildEnv.ProjectName, zone, instName).Do()
	apiErr, ok := err.(*googleapi.Error)
	if ok {
		if apiErr.Code == 404 {
			return "", nil
		}
	}
	if err != nil {
		log.Printf("Failed to delete instance %q in zone %q: %v", instName, zone, err)
		return "", err
	}
	log.Printf("Sent request to delete instance %q in zone %q. Operation ID, Name: %v, %v", instName, zone, op.Id, op.Name)
	return op.Name, nil
}

func hasScope(want string) bool {
	// If not on GCE, assume full access
	if !metadata.OnGCE() {
		return true
	}
	scopes, err := metadata.Scopes("default")
	if err != nil {
		log.Printf("failed to query metadata default scopes: %v", err)
		return false
	}
	for _, v := range scopes {
		if v == want {
			return true
		}
	}
	return false
}

func hasComputeScope() bool {
	return hasScope(compute.ComputeScope) || hasScope(compute.CloudPlatformScope)
}

func hasStorageScope() bool {
	return hasScope(storage.ScopeReadWrite) || hasScope(storage.ScopeFullControl) || hasScope(compute.CloudPlatformScope)
}

func readGCSFile(name string) ([]byte, error) {
	if *mode == "dev" {
		b, ok := testFiles[name]
		if !ok {
			return nil, &os.PathError{
				Op:   "open",
				Path: name,
				Err:  os.ErrNotExist,
			}
		}
		return []byte(b), nil
	}

	r, err := storageClient.Bucket(buildEnv.BuildletBucket).Object(name).NewReader(context.Background())
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return ioutil.ReadAll(r)
}

// syncBuildStatsLoop runs forever in its own goroutine and syncs the
// coordinator's datastore Build & Span entities to BigQuery
// periodically.
func syncBuildStatsLoop(env *buildenv.Environment) {
	ticker := time.NewTicker(5 * time.Minute)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
		if err := buildstats.SyncBuilds(ctx, env); err != nil {
			log.Printf("buildstats: SyncBuilds: %v", err)
		}
		if err := buildstats.SyncSpans(ctx, env); err != nil {
			log.Printf("buildstats: SyncSpans: %v", err)
		}
		cancel()
		<-ticker.C
	}
}

// createBasepinDisks creates zone-local copies of VM disk images, to
// speed up VM creations in the future.
//
// Other than a list call, this a no-op unless new VM images were
// added or updated recently.
func createBasepinDisks(ctx context.Context) {
	if !metadata.OnGCE() || (buildEnv != buildenv.Production && buildEnv != buildenv.Staging) {
		return
	}
	for {
		t0 := time.Now()
		bgc, err := buildgo.NewClient(ctx, buildEnv)
		if err != nil {
			log.Printf("basepin: NewClient: %v", err)
			return
		}
		log.Printf("basepin: creating basepin disks...")
		err = bgc.MakeBasepinDisks(ctx)
		d := time.Since(t0).Round(time.Second / 10)
		if err != nil {
			basePinErr.Store(err.Error())
			log.Printf("basepin: error creating basepin disks, after %v: %v", d, err)
			time.Sleep(5 * time.Minute)
			continue
		}
		basePinErr.Store("")
		log.Printf("basepin: created basepin disks after %v", d)
		return
	}
}
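
As a standalone illustration of the technique in the patch above, capping a Go HTTP client at TLS 1.2 looks roughly like this (a sketch, not part of the coordinator code):

package main

import (
	"crypto/tls"
	"net/http"
)

// newTLS12Client returns an *http.Client that will not negotiate anything
// newer than TLS 1.2 -- the same cap the patch applies to the coordinator's
// Google API clients via option.WithHTTPClient.
func newTLS12Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MaxVersion: tls.VersionTLS12},
		},
	}
}

func main() {
	_ = newTLS12Client()
}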

@dmitshur (Member) commented Sep 16, 2019

Hmm, if the issue is on the cmd/coordinator side (which I or someone else on @golang/osp-team would have to deploy), I wonder why it's seemingly not affecting other builder types.

I'll look into it more on Monday.

Edit: It seems many of the builder configurations have SkipSnapshot set to true, which means few builders are actually doing snapshots, and those that do might all be running into this error. We only started deploying cmd/coordinator with Go 1.13 recently, which is when TLS 1.3 started being on by default.

@mengzhuo (Contributor) commented Sep 17, 2019

@dmitris I found that the coordinator was restarted 13 hours ago, but my buildlet still can't upload a snapshot to GCS. Can I set SkipSnapshot for my host?

@dmitshur (Member) commented Sep 17, 2019

I didn't get to this today, but I will try tomorrow morning.

If we can't resolve the snapshot error on the coordinator side, then disabling it sounds reasonable. But I suspect it should be easy to fix. I'll post an update here after trying TLS 1.2.

Also, as a heads up, my GitHub username is @dmitshur. You're pinging the wrong Dmitri. :)

@dmitshur (Member) commented Sep 17, 2019

@mengzhuo I've deployed coordinator version "318271009812bf18ef7ef1785ecc9dffdbe7ee78-dirty-dmitshur-20190917T083031" just now with TLS 1.2, so we should be able to tell if it makes a difference with the write_snapshot_to_gcs step.

@mengzhuo (Contributor) commented Sep 17, 2019

@dmitshur

Sorry about pinging the wrong Dmitri :)

It's still not working.

  2019-09-17T14:02:15Z finish_write_snapshot_to_gcs after 3m30.9s; err=local error: tls: bad record MAC
  2019-09-17T14:02:15Z finish_make_and_test after 9m44.8s; err=writeSnapshot: local error: tls: bad record MAC

@dmitshur (Member) commented Sep 17, 2019

Indeed, it isn't working for your builder even after the coordinator has been deployed with TLS 1.2.

I noticed that it is working for other builder types. For example, I watched this trybot run, and its log on the linux-amd64 builder type said:

[...]
2019-09-17T13:49:05Z write_snapshot_to_gcs 
2019-09-17T13:49:05Z fetch_snapshot_reader_from_buildlet 
2019-09-17T13:49:05Z finish_fetch_snapshot_reader_from_buildlet after 55.2ms
2019-09-17T13:49:12Z finish_write_snapshot_to_gcs after 6.64s
[...]

We need to investigate this further.

@FiloSottile (Member) commented Sep 17, 2019

bad record MAC means network corruption, and you mentioned having to use a proxy to exit China, so is there a chance a middlebox is tampering with and corrupting the connection?

@mengzhuo (Contributor) commented Sep 17, 2019

@FiloSottile It might be. I think I should add the SkipSnapshot option to my builder.

@bradfitz (Member, Author) commented Sep 17, 2019

FWIW, the buildlet doesn't connect to GCS directly. It's the coordinator doing that.

All communications between the buildlet and farmer.golang.org are already encrypted. It's the buildlet that connects out to the farmer, and then the farmer initiates the connection to GCS.

The code returning the error is:

        wr := storageClient.Bucket(buildEnv.SnapBucket).Object(st.SnapshotObjectName()).NewWriter(ctx)
        wr.ContentType = "application/octet-stream"
        wr.ACL = append(wr.ACL, storage.ACLRule{Entity: storage.AllUsers, Role: storage.RoleReader})
        if _, err := io.Copy(wr, tgz); err != nil {
                st.logf("failed to write snapshot to GCS: %v", err)
                wr.CloseWithError(err)
                return err
        }

        return wr.Close()

If it's the io.Copy, we don't know for sure whether it's the reading part or the writing part that's failing, but yes... if the GFC is indeed killing connections, it's likely the reading part.

But in that case, this is just going to hurt elsewhere later. All the GFC can do is analysis based on SNI + certs + traffic patterns. This won't be the only bandwidth-bursty part of the build. I think we're just going to be playing whack-a-mole and crossing our fingers if you keep tweaking things.

Really, we need to get a builder on a functioning network.

@gopherbot commented Sep 17, 2019

Change https://golang.org/cl/195977 mentions this issue: dashboard: skipsnapshot for linux-mipsle-mengzhuo builder
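
That kind of change amounts to roughly the following sort of entry in x/build/dashboard (a sketch using the builder and host names from the logs above; see the linked CL for the actual change):

addBuilder(BuildConfig{
	Name:         "linux-mips64le-mengzhuo",
	HostType:     "host-linux-mipsle-mengzhuo",
	SkipSnapshot: true, // snapshot uploads to GCS keep failing over this builder's network
})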

@mengzhuo (Contributor) commented Sep 17, 2019

Thanks, everyone.
I'm using both polipo (to convert a SOCKS proxy to an HTTP proxy) and ssh (SOCKS proxy) to exit the firewall, and no unusual logs show up.
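
That setup is roughly the following (a sketch; the host name and ports are illustrative, with 8123 matching the HTTP_PROXY value in the build log above):

# ssh provides a local SOCKS tunnel to a host outside the firewall;
# polipo then exposes it as an HTTP proxy for the buildlet.
ssh -N -D 1080 user@proxy-host &
polipo socksParentProxy=127.0.0.1:1080 socksProxyType=socks5 proxyPort=8123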

gopherbot pushed a commit to golang/build that referenced this issue Sep 18, 2019
dashboard: skipsnapshot for linux-mipsle-mengzhuo builder
The connection between linux-mipsle-mengzhuo and the build farm
is not stable and failed the snapshot upload stage, so we have to
skipsnapshot for that.

Updates golang/go#31217

Change-Id: Ic01620ec12254221f93234e8b25749a09f3dbc9c
Reviewed-on: https://go-review.googlesource.com/c/build/+/195977
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>

@dmitshur (Member) commented Sep 18, 2019

@mengzhuo FYI, I've merged the change that sets SkipSnapshot to true and deployed the coordinator (about an hour ago).

@mengzhuo (Contributor) commented Sep 18, 2019

@dmitshur Thanks, there are 3 "ok" builds now. It looks good to me.

@cherrymui (Contributor) commented Sep 18, 2019

Thanks @mengzhuo for setting up the builder. I'm happy to see the MIPS64 port still works.
