Support Kubernetes upgrade without volume down #703

Closed
shubb30 opened this issue Aug 29, 2019 · 3 comments
Labels: area/kubernetes (Kubernetes related, like K8s version compatibility), component/longhorn-manager (Longhorn manager, control plane)


shubb30 commented Aug 29, 2019

Longhorn version: v0.5.0
Rancher version: 2.2.8

We have noticed that when we perform a Kubernetes upgrade, workloads that have a Longhorn persistent volume often get disconnected from the volume.

We just upgraded one cluster from Kubernetes 1.12.7 to 1.14.6.

We left 3 workloads with Longhorn volumes running as a test and stopped all other workloads.
Of the 3, one workload re-attached and continued working without issues.
The other 2 volumes show the state Attaching in the Longhorn UI, and the workloads do not see their data.
Both of the volume pods show an Unavailable status in Rancher.

This presents a real problem for us because we have to scale down our production workloads every time there is a Kubernetes upgrade, which seems to be every few weeks for vulnerability patches.

We are using RKE to do the cluster upgrades.
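For reference, our upgrade procedure is roughly the following; the config file name and version string here are just illustrative examples:

# bump kubernetes_version in the RKE cluster config, e.g.
#   kubernetes_version: "v1.14.6-rancher1-1"
# then re-run RKE against the existing cluster
rke up --config cluster.yml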

The first erroring workload has the PV name pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a.

Logs from Longhorn pods matching 5b6620c5-c9f1-11e9-9b4e-005056980a6a:

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-e-a5ee1852: State: Unavailable, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:22Z" level=info msg="launcher: controller 95dddbdc-d55c-40f1-a08c-a6870dad3007 started"
time="2019-08-29T17:33:22Z" level=info msg="Starting with replicas [\"tcp://10.42.2.222:9502\" \"tcp://10.42.4.46:9502\" \"tcp://10.42.1.115:9502\"]"
time="2019-08-29T17:33:22Z" level=info msg="Connecting to remote: 10.42.2.222:9502"
2019/08/29 17:33:22 Get http://10.42.2.222:9502/v1/replicas/1: dial tcp 10.42.2.222:9502: connect: connection refused

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-186a38c9: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:15Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:15Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:15Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:15Z" level=info msg="Listening on sync 0.0.0.0:9504"

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-423305da: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:21Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:21Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:21Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:21Z" level=info msg="Listening on sync 0.0.0.0:9504"

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-bc74375d: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:34Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:34Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:34Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:34Z" level=info msg="Listening on sync 0.0.0.0:9504"

The second erroring workload has the PV name pvc-d309e862-92de-11e9-9a36-0050569885a2.

Logs from Longhorn pods matching d309e862-92de-11e9-9a36-0050569885a2:

pvc-d309e862-92de-11e9-9a36-0050569885a2-e-f85f58f4: State: Unavailable, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:34Z" level=info msg="launcher: controller e5f51adf-af4c-4ae6-ac61-900c52da3693 started"
time="2019-08-29T17:33:34Z" level=info msg="Starting with replicas [\"tcp://10.42.3.105:9502\" \"tcp://10.42.1.11:9502\" \"tcp://10.42.4.195:9502\"]"
time="2019-08-29T17:33:34Z" level=info msg="Connecting to remote: 10.42.3.105:9502"
2019/08/29 17:34:04 Get http://10.42.3.105:9502/v1/replicas/1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

pvc-d309e862-92de-11e9-9a36-0050569885a2-r-f7b0acca: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:23Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:23Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:23Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:23Z" level=info msg="Listening on sync 0.0.0.0:9504"

When I look at the second workload in the list above in the Longhorn UI, it shows only one replica running, but the top section of the box is grey instead of blue. I scaled the workload down and then scaled it back up. Once the workload started, Longhorn showed the one replica as Running and then started rebuilding one of the other replicas. Once the first replica finished rebuilding, it started rebuilding the second replica.

The first workload in the list above shows 3 replicas as Running in the Longhorn UI, but the boxes are also grey instead of blue. I am certain that if I scale the workload down and bring it back up, the replicas will all start working again. I'm leaving it in this state in case anyone needs to get logs from the system.
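
For reference, the scale-down/scale-up workaround described above is roughly the following (assuming the workload is a Deployment; the name my-app is illustrative):

kubectl scale deployment my-app --replicas=0
# wait for the pod to terminate and the Longhorn volume to detach, then
kubectl scale deployment my-app --replicas=1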

@yasker yasker added area/kubernetes Kubernetes related like K8s version compatibility component/longhorn-manager Longhorn manager (control plane) labels Sep 4, 2019

yasker commented Sep 4, 2019

@shubb30 Thanks for reporting. We haven't looked into Kubernetes upgrades yet, but I assume that if the overlay networking goes down, the Longhorn volume would go down as well.

We need to look into the details of how a Kubernetes upgrade works to understand more about its impact on Longhorn. I think it will be easy to reproduce in our lab, so feel free to recover your current workloads.


shubb30 commented Sep 4, 2019

Thanks @yasker. It's RKE that's doing the upgrade, and I'm sure you are right that it is taking down the networking during the upgrade.

I would understand that if the network were to go down it might cause an interruption, but what worries me is that Longhorn does not self-heal in this scenario. In just about every other case we have encountered, Longhorn has been able to fix itself, except for this one.

@yasker yasker added this to the v0.7.0 milestone Sep 20, 2019
@yasker yasker added the kind/poc Potential feature request but need POC label Sep 23, 2019
@yasker yasker mentioned this issue Sep 24, 2019
@yasker yasker removed the kind/poc Potential feature request but need POC label Sep 24, 2019
@yasker yasker modified the milestones: v0.7.0, v0.8.0 Oct 22, 2019
@yasker yasker changed the title Longhorn volumes get disconnected if the workload is running during a Kubernetes upgrade Support Kubernetes upgrade without volume down Oct 25, 2019
shuo-wu pushed a commit to shuo-wu/longhorn that referenced this issue Oct 29, 2019
Longhorn longhorn#703

Signed-off-by: Shuo Wu <shuo@rancher.com>

yasker commented Nov 14, 2019

Done as a part of #851

@yasker yasker closed this as completed Nov 14, 2019
@yasker yasker modified the milestones: v0.8.0, v0.7.0 Nov 14, 2019