x/build/cmd/scaleway: our Scaleway arm machines are misbehaving #32229

Closed
bradfitz opened this issue May 24, 2019 · 4 comments

3 participants
@bradfitz
Member

commented May 24, 2019

According to farmer, a bunch of Scaleway machines are connected with duplicate hostnames, and 18 are missing:

* host-linux-arm-scaleway: 12/38 (18 missing)
...
scaleway-prod-01 (51.158.110.44:33360) version 23, host-linux-arm-scaleway: connected 21m20.5s, idle for 1.33s
scaleway-prod-01 (212.47.250.152:56474) version 23, host-linux-arm-scaleway: connected 1h54m38.6s, working for 35m20.6s
scaleway-prod-03 (163.172.153.92:59146) version 23, host-linux-arm-scaleway: connected 16m31.3s, idle for 2.61s
scaleway-prod-04 (163.172.142.77:44776) version 23, host-linux-arm-scaleway: connected 27m37s, working for 28.3s
scaleway-prod-04 (51.15.206.81:34456) version 23, host-linux-arm-scaleway: connected 39m19.7s, working for 5m30.4s
scaleway-prod-05 (163.172.162.189:43702) version 23, host-linux-arm-scaleway: connected 3h2m5.7s, working for 51m12.7s
scaleway-prod-05 (212.47.242.131:35300) version 23, host-linux-arm-scaleway: connected 2h1m42.7s, working for 41m8.1s
scaleway-prod-06 (163.172.181.247:57724) version 23, host-linux-arm-scaleway: connected 27m39.1s, working for 1m29.2s
scaleway-prod-06 (51.15.217.183:59258) version 23, host-linux-arm-scaleway: connected 16m48.9s, idle for 2.18s
scaleway-prod-07 (51.158.101.91:59578) version 23, host-linux-arm-scaleway: connected 3m25.6s, idle for 5.8s
scaleway-prod-08 (163.172.179.11:36634) version 23, host-linux-arm-scaleway: connected 19m19.4s, idle for 8.53s
scaleway-prod-09 (51.158.105.235:45242) version 23, host-linux-arm-scaleway: connected 6m11.3s, idle for 7.12s
scaleway-prod-10 (51.158.108.84:50156) version 23, host-linux-arm-scaleway: connected 23s, idle for 10.9s
scaleway-prod-11 (51.15.232.144:47156) version 23, host-linux-arm-scaleway: connected 23.2s, idle for 10.1s
scaleway-prod-11 (163.172.134.55:44228) version 23, host-linux-arm-scaleway: connected 43m27.3s, working for 8m27.2s
scaleway-prod-12 (51.158.98.172:52450) version 23, host-linux-arm-scaleway: connected 26m26.1s, idle for 3.37s
scaleway-prod-13 (51.158.117.187:44266) version 23, host-linux-arm-scaleway: connected 22m39s, idle for 7.32s
scaleway-prod-13 (51.15.139.155:44318) version 23, host-linux-arm-scaleway: connected 6m4.4s, idle for 206.8ms
scaleway-prod-14 (51.158.125.73:46900) version 23, host-linux-arm-scaleway: connected 6m4.2s, idle for 9.18s
scaleway-prod-15 (163.172.160.106:41050) version 23, host-linux-arm-scaleway: connected 14m45.6s, idle for 6.19s
scaleway-prod-20 (163.172.188.6:46076) version 23, host-linux-arm-scaleway: connected 12m20.4s, idle for 11s
scaleway-prod-22 (212.47.238.171:38308) version 23, host-linux-arm-scaleway: connected 5m17.7s, idle for 3.92s
scaleway-prod-24 (163.172.161.121:45332) version 23, host-linux-arm-scaleway: connected 27m37.1s, working for 30.3s
scaleway-prod-26 (163.172.129.209:35966) version 23, host-linux-arm-scaleway: connected 7m37.9s, idle for 10.9s
scaleway-prod-27 (163.172.160.70:51956) version 23, host-linux-arm-scaleway: connected 32m7.1s, working for 2m29.1s
scaleway-prod-29 (51.15.137.232:56794) version 23, host-linux-arm-scaleway: connected 4m21.7s, idle for 531ms
scaleway-prod-30 (212.47.249.235:35328) version 23, host-linux-arm-scaleway: connected 11m21.5s, idle for 11.6s
scaleway-prod-31 (163.172.184.51:53732) version 23, host-linux-arm-scaleway: connected 1h55m6.9s, working for 35m20.6s
scaleway-prod-34 (163.172.143.99:42700) version 23, host-linux-arm-scaleway: connected 2m27.3s, idle for 4.06s
scaleway-prod-36 (163.172.153.179:40536) version 23, host-linux-arm-scaleway: connected 23m30.9s, idle for 7.89s
scaleway-prod-38 (212.47.234.101:45190) version 23, host-linux-arm-scaleway: connected 1h44m27.2s, working for 25m28.8s
scaleway-prod-39 (212.47.235.116:47378) version 23, host-linux-arm-scaleway: connected 14m50.5s, idle for 876.9ms
scaleway-prod-41 (163.172.150.238:43694) version 23, host-linux-arm-scaleway: connected 29.6s, idle for 5.42s
scaleway-prod-42 (163.172.155.113:40934) version 23, host-linux-arm-scaleway: connected 8m55.8s, idle for 5.33s
scaleway-prod-43 (212.47.235.115:49636) version 23, host-linux-arm-scaleway: connected 8m43.7s, idle for 7.28s
scaleway-prod-46 (51.15.133.5:40248) version 23, host-linux-arm-scaleway: connected 6m32.3s, idle for 4.98s
scaleway-prod-49 (51.15.141.39:38538) version 23, host-linux-arm-scaleway: connected 11m40.4s, idle for 2.4s
scaleway-prod-50 (212.47.234.37:47666) version 23, host-linux-arm-scaleway: connected 1h39m7s, working for 25m28.8s
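
The duplicate and missing hostnames can be counted mechanically from a listing like the one above. The following is a minimal sketch, not part of farmer or cmd/scaleway: the `summarize` helper is hypothetical, and it assumes each line starts with the hostname and that hosts are named `scaleway-prod-01` through `scaleway-prod-NN`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize takes status lines whose first field is a hostname and returns
// the hostnames that appear more than once, plus the expected hostnames
// (scaleway-prod-01 .. scaleway-prod-<expected>) that never appear.
// This helper is hypothetical and exists only for illustration.
func summarize(lines []string, expected int) (dups, missing []string) {
	seen := make(map[string]int)
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		seen[fields[0]]++
	}
	for name, n := range seen {
		if n > 1 {
			dups = append(dups, name)
		}
	}
	sort.Strings(dups) // map iteration order is random, so sort for stable output
	for i := 1; i <= expected; i++ {
		name := fmt.Sprintf("scaleway-prod-%02d", i)
		if seen[name] == 0 {
			missing = append(missing, name)
		}
	}
	return dups, missing
}

func main() {
	// A three-line excerpt of the listing above, checked against 3 expected hosts.
	lines := []string{
		"scaleway-prod-01 (51.158.110.44:33360) version 23, host-linux-arm-scaleway: connected 21m20.5s, idle for 1.33s",
		"scaleway-prod-01 (212.47.250.152:56474) version 23, host-linux-arm-scaleway: connected 1h54m38.6s, working for 35m20.6s",
		"scaleway-prod-03 (163.172.153.92:59146) version 23, host-linux-arm-scaleway: connected 16m31.3s, idle for 2.61s",
	}
	dups, missing := summarize(lines, 3)
	fmt.Println("duplicates:", dups)
	fmt.Println("missing:", missing)
}
```

Applied to the full listing with 50 expected hosts, the same counting gives 6 duplicated hostnames and 18 missing ones.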

And in the kubectl logs for the scaleway service (that runs cmd/scaleway to keep things healthy):

bradfitz@go:~/go$ kubectl logs --since=30m scaleway-deployment-568c59c45f-qbqzl  
2019/05/24 16:04:50 rebooting old running-but-disconnected "scaleway-prod-02" server...
2019/05/24 16:04:58 reboot("scaleway-prod-02"): <nil>
2019/05/24 16:04:58 rebooting old running-but-disconnected "scaleway-prod-16" server...
2019/05/24 16:04:59 reboot("scaleway-prod-16"): <nil>
2019/05/24 16:04:59 server "scaleway-prod-17" in state "stopping"; not creating
2019/05/24 16:04:59 server "scaleway-prod-18" in state "stopping"; not creating
2019/05/24 16:04:59 server "scaleway-prod-19" in state "stopping"; not creating
2019/05/24 16:04:59 server "scaleway-prod-21" in state "stopping"; not creating
2019/05/24 16:04:59 server "scaleway-prod-23" in state "stopping"; not creating
2019/05/24 16:04:59 server "scaleway-prod-25" in state "stopping"; not creating
2019/05/24 16:04:59 rebooting old running-but-disconnected "scaleway-prod-28" server...
2019/05/24 16:04:59 reboot("scaleway-prod-28"): <nil>
2019/05/24 16:04:59 server "scaleway-prod-32" in state "stopping"; not creating
2019/05/24 16:04:59 rebooting old running-but-disconnected "scaleway-prod-33" server...
2019/05/24 16:05:00 reboot("scaleway-prod-33"): <nil>
2019/05/24 16:05:00 server "scaleway-prod-35" in state "stopping"; not creating
2019/05/24 16:05:00 server "scaleway-prod-37" in state "stopping"; not creating
2019/05/24 16:05:00 rebooting old running-but-disconnected "scaleway-prod-40" server...
2019/05/24 16:05:08 reboot("scaleway-prod-40"): <nil>
2019/05/24 16:05:08 server "scaleway-prod-44" in state "stopping"; not creating
2019/05/24 16:05:08 rebooting old running-but-disconnected "scaleway-prod-45" server...
2019/05/24 16:05:08 reboot("scaleway-prod-45"): <nil>
2019/05/24 16:05:08 server "scaleway-prod-47" in state "stopping"; not creating
2019/05/24 16:05:08 rebooting old running-but-disconnected "scaleway-prod-48" server...
2019/05/24 16:05:09 reboot("scaleway-prod-48"): <nil>
2019/05/24 16:15:11 rebooting old running-but-disconnected "scaleway-prod-02" server...
2019/05/24 16:15:12 reboot("scaleway-prod-02"): <nil>
2019/05/24 16:15:12 rebooting old running-but-disconnected "scaleway-prod-16" server...
2019/05/24 16:15:20 reboot("scaleway-prod-16"): <nil>
2019/05/24 16:15:20 server "scaleway-prod-17" in state "stopping"; not creating
2019/05/24 16:15:20 server "scaleway-prod-18" in state "stopping"; not creating
2019/05/24 16:15:20 server "scaleway-prod-19" in state "stopping"; not creating
2019/05/24 16:15:20 server "scaleway-prod-21" in state "stopping"; not creating
2019/05/24 16:15:20 server "scaleway-prod-23" in state "stopping"; not creating
2019/05/24 16:15:20 server "scaleway-prod-25" in state "stopping"; not creating
2019/05/24 16:15:20 rebooting old running-but-disconnected "scaleway-prod-28" server...
2019/05/24 16:15:20 reboot("scaleway-prod-28"): <nil>
2019/05/24 16:15:20 server "scaleway-prod-32" in state "stopping"; not creating
2019/05/24 16:15:20 rebooting old running-but-disconnected "scaleway-prod-33" server...
2019/05/24 16:15:30 reboot("scaleway-prod-33"): <nil>
2019/05/24 16:15:30 server "scaleway-prod-35" in state "stopping"; not creating
2019/05/24 16:15:30 server "scaleway-prod-37" in state "stopping"; not creating
2019/05/24 16:15:30 rebooting old running-but-disconnected "scaleway-prod-40" server...
2019/05/24 16:15:30 reboot("scaleway-prod-40"): <nil>
2019/05/24 16:15:30 server "scaleway-prod-44" in state "stopping"; not creating
2019/05/24 16:15:30 rebooting old running-but-disconnected "scaleway-prod-45" server...
2019/05/24 16:15:30 reboot("scaleway-prod-45"): <nil>
2019/05/24 16:15:30 server "scaleway-prod-47" in state "stopping"; not creating
2019/05/24 16:15:30 rebooting old running-but-disconnected "scaleway-prod-48" server...
2019/05/24 16:15:31 reboot("scaleway-prod-48"): <nil>
2019/05/24 16:25:33 rebooting old running-but-disconnected "scaleway-prod-02" server...
2019/05/24 16:25:41 reboot("scaleway-prod-02"): <nil>
2019/05/24 16:25:41 rebooting old running-but-disconnected "scaleway-prod-16" server...
2019/05/24 16:25:42 reboot("scaleway-prod-16"): <nil>
2019/05/24 16:25:42 server "scaleway-prod-17" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-18" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-19" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-21" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-23" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-25" in state "stopping"; not creating
2019/05/24 16:25:42 rebooting old running-but-disconnected "scaleway-prod-28" server...
2019/05/24 16:25:42 reboot("scaleway-prod-28"): <nil>
2019/05/24 16:25:42 server "scaleway-prod-32" in state "stopping"; not creating
2019/05/24 16:25:42 rebooting old running-but-disconnected "scaleway-prod-33" server...
2019/05/24 16:25:42 reboot("scaleway-prod-33"): <nil>
2019/05/24 16:25:42 server "scaleway-prod-35" in state "stopping"; not creating
2019/05/24 16:25:42 server "scaleway-prod-37" in state "stopping"; not creating
2019/05/24 16:25:42 rebooting old running-but-disconnected "scaleway-prod-40" server...
2019/05/24 16:25:51 reboot("scaleway-prod-40"): <nil>
2019/05/24 16:25:51 server "scaleway-prod-44" in state "stopping"; not creating
2019/05/24 16:25:51 rebooting old running-but-disconnected "scaleway-prod-45" server...
2019/05/24 16:25:59 reboot("scaleway-prod-45"): <nil>
2019/05/24 16:25:59 server "scaleway-prod-47" in state "stopping"; not creating
2019/05/24 16:25:59 rebooting old running-but-disconnected "scaleway-prod-48" server...
2019/05/24 16:26:08 reboot("scaleway-prod-48"): <nil>
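
The log shows the same pass repeating roughly every ten minutes (16:04, 16:15, 16:25) with the same servers rebooted or skipped each time. As a rough illustration of the decision those two log messages imply, here is a sketch under stated assumptions: the `server` type and `action` function are hypothetical and are not taken from cmd/scaleway, and the "old" age check hinted at in the log text is left out.

```go
package main

import "fmt"

// server is a hypothetical, simplified view of one Scaleway instance.
type server struct {
	Name      string
	State     string // for example "running" or "stopping"
	Connected bool   // whether the coordinator currently sees its buildlet
}

// action returns what one reconciliation pass would do for a server,
// mirroring the two log messages above. It is an assumption about the
// shape of the logic, not the actual cmd/scaleway code.
func action(s server) string {
	switch {
	case s.State == "running" && !s.Connected:
		return "reboot" // "rebooting old running-but-disconnected ..."
	case s.State != "running":
		return "skip" // "server ... in state "stopping"; not creating"
	default:
		return "none" // running and connected: nothing to do
	}
}

func main() {
	servers := []server{
		{Name: "scaleway-prod-02", State: "running", Connected: false},
		{Name: "scaleway-prod-17", State: "stopping", Connected: false},
		{Name: "scaleway-prod-03", State: "running", Connected: true},
	}
	for _, s := range servers {
		fmt.Printf("%s: %s\n", s.Name, action(s))
	}
}
```

A pass like this only helps if the reboot actually brings the buildlet back; when it does not, the same servers show up again on the next pass, which matches the repetition in the log.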

Somebody should investigate & fix.

/cc @dmitshur @andybons

@dmitshur

Member

commented May 24, 2019

I've logged in to the Scaleway UI and investigated.

There were some duplicate instances, which is what was causing:

"scaleway-prod-04" is connected from 2 machines
"scaleway-prod-01" is connected from 2 machines
"scaleway-prod-06" is connected from 2 machines
"scaleway-prod-13" is connected from 2 machines
"scaleway-prod-05" is connected from 2 machines
"scaleway-prod-11" is connected from 2 machines

The original instances were 2 years old, and the new ones were created on various days, within the last month or so.

That shouldn't happen in the cmd/scaleway code, but perhaps some bug or race condition occasionally makes it possible. This needs investigation. Since it happens rarely, disabling those instances mitigates the short-term problem for now.

Of the 18 missing machines, some are started and I was able to ssh into them successfully. However, for some reason, they're not running the buildlet in a docker container.

Compare the output from a healthy instance:

$ docker ps
CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS               NAMES
83daf8a89a14        gobuilder-arm-scaleway:latest   "/usr/local/bin/stage"   About an hour ago   Up About an hour                        scaleway-prod-13

Compared to a missing one:

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Why that is needs more investigation. Perhaps a good thing to try is to just re-create those instances and see if that solves the problem.

There are also some instances in the Scaleway UI that are perpetually in the "stopping" state and never actually finish stopping. I'm going to open a Scaleway ticket about that.

@dmitshur

Member

commented May 28, 2019

I got rid of all the duplicate instances last time, but there are two duplicates again today:

"scaleway-prod-04" is connected from 2 machines
"scaleway-prod-11" is connected from 2 machines

I checked the Scaleway UI, and the duplicate scaleway-prod-04 instance was created today, ~4 hours ago. I checked the cmd/scaleway logs from around then, and here's how it happened:

2019/05/28 17:42:09 Doing req "{...,\"name\":\"scaleway-prod-04\",...}"
2019/05/28 17:42:09 Create of 4: 201 Created
...
2019/05/28 17:42:29 Powering on scaleway-prod-04 (f74749ad-e3d3-4775-a579-e17be0e0be58) = <nil>

That shouldn't have happened because there's an original scaleway-prod-04 instance from 2 years ago that's connected and functional now.

After some debugging, I've found that the code isn't doing pagination when listing servers, and the default page size was just 50, which wasn't enough to list all servers (we have around 70: 50 expected + some duplicates + some that are stuck shutting down). Going to fix that first and then see what more needs to be done here.
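
To make the failure mode concrete, here is a minimal sketch of listing with and without pagination. Everything in it is an assumption for illustration: `pageFunc`, `listAll`, and `fakeAPI` are hypothetical names, and the fake API stands in for the real HTTP call rather than reproducing Scaleway's request format.

```go
package main

import "fmt"

// pageFunc is a hypothetical stand-in for one call to a paginated
// "list servers" endpoint. It returns the names on the given 1-based page.
type pageFunc func(page, perPage int) []string

// listAll requests successive pages until a short (or empty) page comes back.
func listAll(fetch pageFunc, perPage int) []string {
	if perPage <= 0 {
		return nil // avoid looping forever on a nonsensical page size
	}
	var all []string
	for page := 1; ; page++ {
		batch := fetch(page, perPage)
		all = append(all, batch...)
		if len(batch) < perPage {
			return all
		}
	}
}

// fakeAPI simulates an account that holds total servers.
func fakeAPI(total int) pageFunc {
	return func(page, perPage int) []string {
		var out []string
		start := (page - 1) * perPage
		for i := start; i < start+perPage && i < total; i++ {
			out = append(out, fmt.Sprintf("server-%02d", i+1))
		}
		return out
	}
}

func main() {
	fetch := fakeAPI(70) // roughly the number of servers mentioned above
	fmt.Println("first page only:", len(fetch(1, 50)))
	fmt.Println("all pages:", len(listAll(fetch, 50)))
}
```

With only the first page of 50, any server past that point looks absent to the caller, so a tool that creates whatever it believes is missing would create a duplicate.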

@gopherbot


commented May 28, 2019

Change https://golang.org/cl/179182 mentions this issue: cmd/scaleway: set page size to 100 when listing servers

gopherbot pushed a commit to golang/build that referenced this issue May 29, 2019

cmd/scaleway: set page size to 100 when listing servers
According to Scaleway API documentation¹, most endpoints are paginated.
Use the maximum page size of 100 when listing servers. The default page
size was 50, which meant some servers weren't included in the response.

We currently have just over 50 servers, so this is sufficient for now,
but we'll need to do proper pagination when we need to handle over 100.
At that point, need to decide if it's better to implement it ad-hoc here
or start using an existing Scaleway API client. For now, add a TODO and
check that we haven't exceeded the page size, to avoid silent problems.

¹ https://developer.scaleway.com/#header-pagination

Updates golang/go#32229

Change-Id: I254671e464e88017eee2e49e382c686c52fd8fbd
Reviewed-on: https://go-review.googlesource.com/c/build/+/179182
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
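
The guard described in the commit message can be sketched as follows. This is an assumption about its shape, not the code from the CL: `maxPageSize`, `errTruncated`, and `checkPage` are hypothetical names, and the sketch treats a completely full page as a sign that more servers may exist.

```go
package main

import (
	"errors"
	"fmt"
)

// maxPageSize is the maximum page size stated in the commit message.
const maxPageSize = 100

// errTruncated is a hypothetical error for a listing that may be incomplete.
var errTruncated = errors.New("server list filled the whole page; results may be truncated")

// checkPage reports an error when a single-page listing came back full,
// since a full page cannot be distinguished from a truncated one.
func checkPage(servers []string) error {
	if len(servers) >= maxPageSize {
		return errTruncated
	}
	return nil
}

func main() {
	fmt.Println(checkPage(make([]string, 51)))  // just over 50 servers: fits in one page
	fmt.Println(checkPage(make([]string, 100))) // full page: fail loudly instead of silently
}
```

Failing loudly at the limit trades a hard stop for the silent misbehavior seen in this issue, which is the point of the TODO mentioned in the commit message.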
@dmitshur

Member

commented May 29, 2019

Due to the aforementioned pagination issue, cmd/scaleway was constantly and needlessly restarting servers, occasionally creating new duplicate ones, and so on. With that issue fixed in CL 179182, it has stopped doing that.

I cleaned up all the duplicate instances, and also removed some stale instances that weren't connecting successfully. cmd/scaleway re-created new instances to take their place and those are functioning okay.

So we went from this state (or an even worse version thereof):

[screenshot not preserved]

To this:

[screenshot not preserved]

(@bradfitz's work in CL 178798 to add the health visualization helped a lot; thanks Brad!)

There are now exactly 51 instances on Scaleway, the 50 prod ones and 1 prep one, and all 50 prod ones are connected. 🎉 This specific issue is resolved, so closing.

@dmitshur closed this May 29, 2019
