Support Kubernetes upgrade without volume down #703

Closed
shubb30 opened this issue Aug 29, 2019 · 3 comments
Labels: area/kubernetes (Kubernetes related, like K8s version compatibility), component/longhorn-manager (Longhorn manager, control plane)


shubb30 commented Aug 29, 2019

Longhorn version: v0.5.0
Rancher version: 2.2.8

We have noticed that when we perform a Kubernetes upgrade, workloads that have a Longhorn persistent volume often get disconnected from the volume.

We just upgraded one cluster from Kubernetes 1.12.7 to 1.14.6.

We left 3 workloads with Longhorn volumes running as a test and stopped all other workloads.
Of the 3, one workload re-attached and continued working without issues.
The other 2 volumes show the state Attaching in the Longhorn UI, and the workloads do not see their data.
Both of the volume pods show an Unavailable status in Rancher.

This presents a real problem for us because we have to scale down our production workloads every time there is a Kubernetes upgrade, which seems to be every few weeks for vulnerability patches.

We are using RKE to do the cluster upgrades.
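For reference, our upgrade procedure is roughly the following; the config file name and version string here are just illustrative examples:

# bump kubernetes_version in the RKE cluster config, e.g.
#   kubernetes_version: "v1.14.6-rancher1-1"
# then re-run RKE against the existing cluster
rke up --config cluster.yml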

The first erroring workload has the PV name pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a.

Logs from Longhorn pods matching 5b6620c5-c9f1-11e9-9b4e-005056980a6a:

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-e-a5ee1852: State: Unavailable, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:22Z" level=info msg="launcher: controller 95dddbdc-d55c-40f1-a08c-a6870dad3007 started"
time="2019-08-29T17:33:22Z" level=info msg="Starting with replicas [\"tcp://10.42.2.222:9502\" \"tcp://10.42.4.46:9502\" \"tcp://10.42.1.115:9502\"]"
time="2019-08-29T17:33:22Z" level=info msg="Connecting to remote: 10.42.2.222:9502"
2019/08/29 17:33:22 Get http://10.42.2.222:9502/v1/replicas/1: dial tcp 10.42.2.222:9502: connect: connection refused

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-186a38c9: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:15Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:15Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:15Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:15Z" level=info msg="Listening on sync 0.0.0.0:9504"

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-423305da: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:21Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:21Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:21Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:21Z" level=info msg="Listening on sync 0.0.0.0:9504"

pvc-5b6620c5-c9f1-11e9-9b4e-005056980a6a-r-bc74375d: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:34Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:34Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:34Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:34Z" level=info msg="Listening on sync 0.0.0.0:9504"

The second erroring workload has the PV name pvc-d309e862-92de-11e9-9a36-0050569885a2.

Logs from Longhorn pods matching d309e862-92de-11e9-9a36-0050569885a2:

pvc-d309e862-92de-11e9-9a36-0050569885a2-e-f85f58f4: State: Unavailable, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:34Z" level=info msg="launcher: controller e5f51adf-af4c-4ae6-ac61-900c52da3693 started"
time="2019-08-29T17:33:34Z" level=info msg="Starting with replicas [\"tcp://10.42.3.105:9502\" \"tcp://10.42.1.11:9502\" \"tcp://10.42.4.195:9502\"]"
time="2019-08-29T17:33:34Z" level=info msg="Connecting to remote: 10.42.3.105:9502"
2019/08/29 17:34:04 Get http://10.42.3.105:9502/v1/replicas/1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

pvc-d309e862-92de-11e9-9a36-0050569885a2-r-f7b0acca: State: Running, rancher/longhorn-engine:v0.5.0

time="2019-08-29T17:33:23Z" level=info msg="Listening on data 0.0.0.0:9503"
time="2019-08-29T17:33:23Z" level=info msg="Listening on sync agent 0.0.0.0:9504"
time="2019-08-29T17:33:23Z" level=info msg="Listening on control 0.0.0.0:9502"
time="2019-08-29T17:33:23Z" level=info msg="Listening on sync 0.0.0.0:9504"

When I look at the second workload in the list above in the Longhorn UI, it shows only one replica running, but the top section of the box is grey instead of blue. I scaled the workload down and then scaled it back up. Once the workload started, Longhorn showed the one replica as Running and then started rebuilding one of the other replicas. Once the first replica finished rebuilding, it started rebuilding the second replica.

The first workload in the list above shows 3 replicas as Running in the Longhorn UI, but the boxes are also grey instead of blue. I am certain that if I scale the workload down and bring it back up, the replicas will all start working again. I'm leaving it in this state in case anyone needs to get logs from the system.
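
For reference, the scale-down/scale-up workaround described above is roughly the following (assuming the workload is a Deployment; the name my-app is illustrative):

kubectl scale deployment my-app --replicas=0
# wait for the pod to terminate and the Longhorn volume to detach, then
kubectl scale deployment my-app --replicas=1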

@yasker yasker added area/kubernetes Kubernetes related like K8s version compatibility component/longhorn-manager Longhorn manager (control plane) labels Sep 4, 2019

yasker commented Sep 4, 2019

@shubb30 Thanks for reporting. We haven't looked into Kubernetes upgrades yet, but I assume that if the overlay networking goes down, the Longhorn volume would go down as well.

We need to look into the details of how a Kubernetes upgrade works to understand more about its impact on Longhorn. I think it will be easy to reproduce in our lab, so feel free to recover your current workloads.


shubb30 commented Sep 4, 2019

Thanks @yasker. It's RKE that's doing the upgrade, and I'm sure you are right that it is taking down the networking during the upgrade.

I would understand that if the network were to go down it might cause an interruption, but what worries me is that Longhorn does not self-heal in this scenario. In just about every other case we have encountered, Longhorn has been able to fix itself, except for this one.

@yasker yasker added this to the v0.7.0 milestone Sep 20, 2019
@yasker yasker added the kind/poc Potential feature request but need POC label Sep 23, 2019
@yasker yasker mentioned this issue Sep 24, 2019
@yasker yasker removed the kind/poc Potential feature request but need POC label Sep 24, 2019
@yasker yasker modified the milestones: v0.7.0, v0.8.0 Oct 22, 2019
@yasker yasker changed the title Longhorn volumes get disconnected if the workload is running during a Kubernetes upgrade Support Kubernetes upgrade without volume down Oct 25, 2019
shuo-wu pushed a commit to shuo-wu/longhorn that referenced this issue Oct 29, 2019
Longhorn longhorn#703

Signed-off-by: Shuo Wu <shuo@rancher.com>

yasker commented Nov 14, 2019

Done as a part of #851

@yasker yasker closed this as completed Nov 14, 2019
@yasker yasker modified the milestones: v0.8.0, v0.7.0 Nov 14, 2019