
Snapshot gets stuck and PVC turns read-only #66

Closed
muecahit94 opened this issue Feb 25, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments

muecahit94 commented Feb 25, 2021

Hi,
we have a MySQL DB running on a LINSTOR/Piraeus PVC, and we try to create snapshots of the mysql-data PVC, but it does not work reliably: sometimes it works, sometimes it does not.
Often the snapshot's READYTOUSE column never turns to "true", and after some time the PVC in use by MySQL turns read-only and the MySQL StatefulSet becomes inactive.
It then takes about 20-30 minutes until the PVC can be used again by the MySQL pod. Likewise, when we scale the StatefulSet down and up, it takes about 20-30 minutes until the PVC can be mounted.

We want to use snapshots/cloning on our MySQL service to create backups/dumps of the DB, but this issue happens too often for us to really rely on the snapshot/clone features.
We have a script that creates the snapshot; it first sends a "LOCK TABLES FOR BACKUP" to the MySQL DB so that no write operations happen while the snapshot is taken (roughly like the sketch below).
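
For reference, the script does roughly the following (a simplified sketch; the snapshot class and object names here are placeholders, not our real ones):

#!/bin/sh
# Hold a backup lock open in the background; the lock is released when
# the client session ends, so the client is killed once the snapshot
# is ready. (LOCK TABLES FOR BACKUP is Percona Server's backup-lock
# statement; stock MySQL would need FLUSH TABLES WITH READ LOCK.)
mysql -h "$MYSQL_HOST" -u root -p"$MYSQL_ROOT_PASSWORD" \
      -e "LOCK TABLES FOR BACKUP; DO SLEEP(300);" &
lock_pid=$!

# Request a CSI snapshot of the PVC (class name is a placeholder).
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mysql-data-snap
spec:
  volumeSnapshotClassName: linstor-csi-snapshot-class
  source:
    persistentVolumeClaimName: mysql-data-pvc
EOF

# Poll READYTOUSE; this is the step that sometimes never completes.
until [ "$(kubectl get volumesnapshot mysql-data-snap \
           -o jsonpath='{.status.readyToUse}')" = "true" ]; do
    sleep 5
done

# Release the backup lock by closing the MySQL session.
kill "$lock_pid"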

Our clusters are running on Rancher.
The nodes' OS is Ubuntu 18.04, and we use LVM as the storage pool.
We first tried it on K8s v1.18.14; after we hit this issue, we updated our K8s cluster to v1.19.7, but we still have the same issue.

In the dmesg output of a node we could see the following when we tried to scale the MySQL StatefulSet down and up:

[Thu Feb 25 00:23:45 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 1808454609 (2->-1 3/1)
[Thu Feb 25 00:23:47 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 1808454609 (2036ms) rv = -10
[Thu Feb 25 00:23:47 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3281288772 (2->-1 3/1)
[Thu Feb 25 00:23:49 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3281288772 (2016ms) rv = -10
[Thu Feb 25 00:23:49 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3724795384 (2->-1 3/1)
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Declined by peer node30 (id: 0), see the kernel log there
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3724795384 (2016ms) rv = -10
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 706470052 (2->-1 3/1)
[Thu Feb 25 00:23:53 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Declined by peer node30 (id: 0), see the kernel log there
[Thu Feb 25 00:23:53 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 706470052 (2016ms) rv = -10
[Thu Feb 25 00:23:54 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3245719846 (2->-1 3/1)
[Thu Feb 25 00:23:56 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3245719846 (2036ms) rv = -10
[Thu Feb 25 00:23:56 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 1156981054 (2->-1 3/1)
[Thu Feb 25 00:23:58 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 1156981054 (2016ms) rv = -10
[Thu Feb 25 00:23:58 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 4267462796 (2->-1 3/1)

WanzenBug added the bug (Something isn't working) label on Mar 9, 2021

WanzenBug (Member) commented

Sorry for the late response.

If you still see this issue: could you include the kernel log from the node that is blocking progress? See the lines:

Declined by peer node30 (id: 0), see the kernel log there

So hopefully node30 holds some clues. By the way, which versions of Piraeus and DRBD are you using?

muecahit94 (Author) commented

node30-kern.log

Hi @WanzenBug!

I uploaded the full kernel log from that timeframe.

We use the latest version of Piraeus with DRBD version 9.0.27-1.

WanzenBug (Member) commented

Thanks! It looks like a process is keeping the device open, either pvdisplay or blkid.

Is there perhaps some monitoring agent running on node30 that uses these commands? If you are running LVM commands directly on the host (like pvdisplay), you need to exclude all DRBD devices from the device scan. You can do that in /etc/lvm/lvm.conf:

devices {
    ...
    global_filter = [ "r|^/dev/drbd|" ]
    ...
}

Otherwise LVM can deadlock (as seen in the "task blkid:25334 blocked for more than 120 seconds" message in your kernel log).
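
To verify the filter is actually picked up (a quick sanity check, not something from your logs):

# Print the global_filter as the running LVM tools see it.
lvmconfig devices/global_filter

# With the filter in place, a PV listing must not open /dev/drbd*
# anymore; only the backing devices of the storage pool should appear.
pvs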

muecahit94 (Author) commented

We are using Zabbix and monitoring the mounts for free space; this could be the reason for the locking.
I will check how the Zabbix agent gets that information.
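
One quick way to see what is holding the device open (an ad-hoc check; the device path is only an example):

# List processes that currently have the DRBD device open.
lsof /dev/drbd1000

# Alternative: fuser prints the PIDs with the device open.
fuser -v /dev/drbd1000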

muecahit94 (Author) commented

We didn't have much time to test it in different ways; we needed a stable solution and solved it another way.
Closing this issue.
