
Snapshot gets stuck and PVC turns read-only #66

Closed
muecahit94 opened this issue Feb 25, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments

muecahit94 commented Feb 25, 2021

Hi,
we have a MySQL DB running on a LINSTOR/Piraeus PVC, and we try to create snapshots of the mysql-data PVC, but it does not work reliably: sometimes it works, sometimes it does not.
Often the snapshot's READYTOUSE column never turns to "true", and after some time the PVC in use by MySQL turns read-only and the MySQL StatefulSet becomes inactive.
It then takes about 20-30 minutes until the PVC can be used again by the MySQL pod. Likewise, when we scale the StatefulSet down and up, it takes about 20-30 minutes until the PVC can be mounted.

We want to use snapshots/cloning on our MySQL service to create backups/dumps of the DB, but this issue happens too often for us to really rely on the snapshot/clone features.
We have a script that creates the snapshot; it first sends a "LOCK TABLES FOR BACKUP" to the MySQL DB so that no write operations happen while the snapshot is taken (roughly like the sketch below).
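
For reference, the script does roughly the following (a simplified sketch; the snapshot class and object names here are placeholders, not our real ones):

#!/bin/sh
# Hold a backup lock open in the background; the lock is released when
# the client session ends, so the client is killed once the snapshot
# is ready. (LOCK TABLES FOR BACKUP is Percona Server's backup-lock
# statement; stock MySQL would need FLUSH TABLES WITH READ LOCK.)
mysql -h "$MYSQL_HOST" -u root -p"$MYSQL_ROOT_PASSWORD" \
      -e "LOCK TABLES FOR BACKUP; DO SLEEP(300);" &
lock_pid=$!

# Request a CSI snapshot of the PVC (class name is a placeholder).
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mysql-data-snap
spec:
  volumeSnapshotClassName: linstor-csi-snapshot-class
  source:
    persistentVolumeClaimName: mysql-data-pvc
EOF

# Poll READYTOUSE; this is the step that sometimes never completes.
until [ "$(kubectl get volumesnapshot mysql-data-snap \
           -o jsonpath='{.status.readyToUse}')" = "true" ]; do
    sleep 5
done

# Release the backup lock by closing the MySQL session.
kill "$lock_pid"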

Our clusters are running on Rancher.
The nodes' OS is Ubuntu 18.04, and we use LVM as the storage pool.
We first tried it on K8s v1.18.14; after we hit this issue, we updated our K8s cluster to v1.19.7, but we still have the same issue.

In the dmesg output of a node we could see the following when we tried to scale the MySQL StatefulSet down and up:

[Thu Feb 25 00:23:45 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 1808454609 (2->-1 3/1)
[Thu Feb 25 00:23:47 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 1808454609 (2036ms) rv = -10
[Thu Feb 25 00:23:47 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3281288772 (2->-1 3/1)
[Thu Feb 25 00:23:49 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3281288772 (2016ms) rv = -10
[Thu Feb 25 00:23:49 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3724795384 (2->-1 3/1)
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Declined by peer node30 (id: 0), see the kernel log there
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3724795384 (2016ms) rv = -10
[Thu Feb 25 00:23:51 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 706470052 (2->-1 3/1)
[Thu Feb 25 00:23:53 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Declined by peer node30 (id: 0), see the kernel log there
[Thu Feb 25 00:23:53 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 706470052 (2016ms) rv = -10
[Thu Feb 25 00:23:54 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 3245719846 (2->-1 3/1)
[Thu Feb 25 00:23:56 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 3245719846 (2036ms) rv = -10
[Thu Feb 25 00:23:56 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 1156981054 (2->-1 3/1)
[Thu Feb 25 00:23:58 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Aborting cluster-wide state change 1156981054 (2016ms) rv = -10
[Thu Feb 25 00:23:58 2021] drbd pvc-71e0f8aa-b391-4de0-aaf7-2615056aeb69: Preparing cluster-wide state change 4267462796 (2->-1 3/1)

WanzenBug added the bug (Something isn't working) label on Mar 9, 2021

WanzenBug (Member) commented

Sorry for the late response.

If you still see this issue: could you include the kernel log from the node that is blocking progress? See the lines:

Declined by peer node30 (id: 0), see the kernel log there

So hopefully node30 holds some clues. By the way, which versions of Piraeus and DRBD are you using?

muecahit94 (Author) commented

node30-kern.log

Hi @WanzenBug!

I uploaded the full kernel log from that timeframe.

We use the latest version of Piraeus with DRBD version 9.0.27-1.

WanzenBug (Member) commented

Thanks! It looks like a process is keeping the device open, either pvdisplay or blkid.

Is there perhaps some monitoring agent running on node30 that uses these commands? If you are running LVM commands directly on the host (like pvdisplay), you need to exclude all DRBD devices from the device scan. You can do that in /etc/lvm/lvm.conf:

devices {
    ...
    global_filter = [ "r|^/dev/drbd|" ]
    ...
}

Otherwise LVM can deadlock (as seen in the "task blkid:25334 blocked for more than 120 seconds" message in your kernel log).
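
To verify the filter is actually picked up (a quick sanity check, not something from your logs):

# Print the global_filter as the running LVM tools see it.
lvmconfig devices/global_filter

# With the filter in place, a PV listing must not open /dev/drbd*
# anymore; only the backing devices of the storage pool should appear.
pvs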

muecahit94 (Author) commented

We are using Zabbix and monitoring the mounts for free space; this could be the reason for the locking.
I will check how the Zabbix agent gets that information.
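
One quick way to see what is holding the device open (an ad-hoc check; the device path is only an example):

# List processes that currently have the DRBD device open.
lsof /dev/drbd1000

# Alternative: fuser prints the PIDs with the device open.
fuser -v /dev/drbd1000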

muecahit94 (Author) commented

We didn't have much time to test it in different ways; we needed a stable solution and solved it another way.
Closing this issue.
