x/build: update k8s clusters #33529

Closed
andybons opened this issue Aug 7, 2019 · 10 comments

@andybons andybons commented Aug 7, 2019

Placeholder issue for upgrading our k8s clusters

@gopherbot gopherbot added this to the Unreleased milestone Aug 7, 2019

@andybons andybons commented Aug 7, 2019

[Development] successfully disabled Kubernetes Dashboard

$ gcloud container clusters update buildlets --project=<dev project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED
$ gcloud container clusters update go --project=<dev project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED

@andybons andybons commented Aug 7, 2019

[Production] successfully disabled Kubernetes Dashboard add-on

$ gcloud container clusters update buildlets --project=<project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED
$ gcloud container clusters update go --project=<project name> --zone=us-central1-f --update-addons=KubernetesDashboard=DISABLED

@andybons andybons commented Aug 7, 2019

Stepping away until 4pm ET. Will begin upgrade of k8s clusters then.

@andybons andybons commented Aug 7, 2019

[Dev]
Updated Masters and nodes to 1.12.7-gke.25

No issues found so far.

Will continue with prod tomorrow morning.

@andybons andybons self-assigned this Aug 7, 2019

@andybons andybons commented Aug 12, 2019

Delayed, but upgrading prod cluster now.

@andybons andybons commented Aug 12, 2019

Updated masters and nodes to 1.12.7-gke.25

Scaleway builders are having issues. From https://farmer.golang.org:

# "scaleway" status: Scaleway linux/arm machines
# Notes: https://github.com/golang/build/tree/master/env/linux-arm/scaleway
Warn: scaleway-prod-16 missing, never seen (at least 12m7s)
Warn: scaleway-prod-17 missing, never seen (at least 12m7s)
Warn: scaleway-prod-18 missing, never seen (at least 12m7s)
Warn: scaleway-prod-20 missing, not seen for 11m20s
Warn: scaleway-prod-24 missing, not seen for 11m18s
Warn: scaleway-prod-25 missing, not seen for 11m18s
Warn: scaleway-prod-26 missing, not seen for 11m9s
Warn: scaleway-prod-27 missing, not seen for 11m20s
Warn: scaleway-prod-30 missing, not seen for 11m8s
Warn: scaleway-prod-31 missing, not seen for 11m15s
Error: 10 machines missing, 20% of capacity

Investigating

@andybons andybons commented Aug 12, 2019

From the scaleway machine (scaleway-prod-16):

systemctl status rundockerbuildlet.service
● rundockerbuildlet.service - Run Buildlets in Docker
   Loaded: loaded (/etc/systemd/user/rundockerbuildlet.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-08-12 15:41:30 UTC; 7s ago
 Main PID: 6142 (rundockerbuildl)
   Memory: 1.6M
      CPU: 1.387s
   CGroup: /system.slice/rundockerbuildlet.service
           └─6142 /usr/local/bin/rundockerbuildlet -basename=scaleway -image=gobuilder-arm-scaleway:latest -n=1

Aug 12 15:41:34 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:35 Creating scaleway-prod-16 ...
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:35 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:35 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:36 Creating scaleway-prod-16 ...
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:36 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:36 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:37 Creating scaleway-prod-16 ...
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: 2019/08/12 15:41:37 Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container 91786a55d6eace4a38e313ae0bb5e972c2a53667422ae064bc1
Aug 12 15:41:37 scw-e8738f rundockerbuildlet[6142]: See 'docker run --help'.
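
The retry loop in this log keeps hitting the same name conflict. As a rough sketch only (not what rundockerbuildlet does), a caller could recognize the conflict message and pull out the blocking container name before retrying. The function name `conflictingName` and the regular expression are hypothetical, and the message wording is assumed from the log above; it may differ between Docker versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// conflictRE matches the daemon's name-conflict message as it appears in the
// log above. The optional "container " is an assumption meant to tolerate
// wording differences between Docker versions.
var conflictRE = regexp.MustCompile(`Conflict\. The (?:container )?name "/?([^"]+)" is already in use`)

// conflictingName returns the container name from a name-conflict error
// message, or "" and false if the message is not a name conflict.
func conflictingName(msg string) (string, bool) {
	m := conflictRE.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Message shortened from the log above; the container ID is omitted.
	msg := `Error creating scaleway-prod-16: exit status 125, docker: Error response from daemon: Conflict. The name "/scaleway-prod-16" is already in use by container`
	if name, ok := conflictingName(msg); ok {
		// A caller could now inspect or remove this container before retrying.
		fmt.Println("conflicting container:", name)
	}
}
```

Matching on error text is brittle, so this would at best be a fallback next to checking container state directly.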

@andybons andybons commented Aug 12, 2019

Cleaned up stopped container noted in that output. Will continue on other machines.

@andybons andybons commented Aug 12, 2019

ssh -i ~/keys/id_ed25519_golang1 root@IP 'docker rm $(docker ps -a -q)'

All Scaleway Docker containers are back up now and talking to the coordinator.

No other issues seen. Will reopen bug if another issue comes up.
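
The cleanup command above asks Docker to remove every container on the host and relies on Docker refusing to remove running ones. A narrower variant, sketched here under assumptions, would list only non-running buildlet containers first. `listStoppedArgs` is a hypothetical helper; `--filter name=` and `--filter status=` are existing `docker ps` filters, but how they combine should be checked against the installed Docker version.

```go
package main

import (
	"fmt"
	"strings"
)

// listStoppedArgs builds arguments for `docker ps` that list only the IDs of
// containers whose name contains basename and whose state is one of states.
// Assumption: repeated status filters are alternatives, and the name filter
// further restricts the result.
func listStoppedArgs(basename string, states ...string) []string {
	args := []string{"ps", "-a", "-q", "--filter", "name=" + basename}
	for _, s := range states {
		args = append(args, "--filter", "status="+s)
	}
	return args
}

func main() {
	args := listStoppedArgs("scaleway", "exited", "created", "dead")
	// Print the command instead of running it, so the sketch has no side effects.
	fmt.Println("docker " + strings.Join(args, " "))
}
```

The IDs printed by that command could then be passed to `docker rm`, which limits the blast radius if other containers ever run on the same host.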

@andybons andybons closed this Aug 12, 2019

@toothrot toothrot commented Aug 12, 2019

rundockerbuildlet tries to clean up these containers itself, but it is failing to remove them.

It tries to remove exited containers here: https://github.com/golang/build/blob/master/cmd/rundockerbuildlet/rundockerbuildlet.go#L91-L93

It also tries to remove "Created" containers that never reach the running status here: https://github.com/golang/build/blob/master/cmd/rundockerbuildlet/rundockerbuildlet.go#L111-L114

Based on the log lines you posted, I would expect to see an error if it tried and failed to remove one of these containers. I'm guessing it's not detecting it for some reason.

It's possible the container ended up in a status we're not handling:
status | One of created, restarting, running, removing, paused, exited, or dead.

Even if that were the case, I would expect the logic in L111-L114 to handle this correctly.

It's also possible that our logic in L91-L93 isn't fetching the status properly from the formatted string (an extra space in a name? multiple names?).
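
To make the "extra space in a name, multiple names" hypothesis concrete, here is a minimal sketch of a stricter parser. It is not the code at the linked lines. It assumes the listing is produced with a tab-separated format such as `{{.ID}}\t{{.Names}}\t{{.Status}}`, that multiple names arrive comma-separated, and that a running container's status text starts with "Up". The type `containerInfo` and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// containerInfo is a hypothetical record for one line of `docker ps -a` output.
type containerInfo struct {
	ID     string
	Names  []string
	Status string
}

// parsePSLine parses one tab-separated line of the form ID, Names, Status.
// Tabs are used as the separator because the status text contains spaces.
func parsePSLine(line string) (containerInfo, bool) {
	f := strings.Split(line, "\t")
	if len(f) != 3 {
		return containerInfo{}, false
	}
	var names []string
	for _, n := range strings.Split(f[1], ",") {
		// Trim stray whitespace so "scaleway-prod-16 " still matches.
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	id := strings.TrimSpace(f[0])
	if id == "" || len(names) == 0 {
		return containerInfo{}, false
	}
	return containerInfo{ID: id, Names: names, Status: strings.TrimSpace(f[2])}, true
}

// blocksName reports whether c holds the given name without being up.
// Assumption: only running (or paused) containers have a status starting with "Up".
func blocksName(c containerInfo, name string) bool {
	for _, n := range c.Names {
		if n == name {
			return !strings.HasPrefix(c.Status, "Up")
		}
	}
	return false
}

func main() {
	// Made-up sample line; the ID and status text are illustrative only.
	line := "abc123\tscaleway-prod-16 \tExited (0) 2 hours ago"
	if c, ok := parsePSLine(line); ok && blocksName(c, "scaleway-prod-16") {
		fmt.Println("would remove container", c.ID)
	}
}
```

Treating every state other than "Up" as removable covers created, exited, and dead in one check, at the cost of also catching restarting or removing containers, which may or may not be desirable here.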

It's hard to tell the root cause at this point as all the impacted hosts have been fixed manually, so we can't inspect the conflicting container status.
