[BUG] Failed rebuilding a large replica of 3TB #2507
Hi! Can you provide information on the disk size (NVMe)? From the log, I saw repeated messages like:
and all the PVCs are
Notice that some of the PVCs are on
Hi. They're degraded because the default replica count is 3, but I only have 2 nodes at the moment (slave-sys2/3). Most of the PVs are on SSD disks and have no issues because most PVs are smaller than 50GB. The one failing to rebuild is on NVMe. The NVMe disk is 6TB with the default overprovisioning of 200%. There were no issues creating 2 replicas when the PV was first created; I only realized it had failed to resync and then to rebuild after one node was rebooted. Even recreating a new PV with 2 replicas and then changing the replica count to 1 and back to 2 failed to rebuild this PV.
From:
"Replicas will take up space residing on host"
@cclhsu The original replica is on slave-sys2 and healthy. It's trying to schedule a replica rebuild on slave-sys3, where /var/lib/longhorn/nvme is almost empty (6TB of space).
root@slave-sys2:~# df -h /var/lib/longhorn/nvme/
root@slave-sys2:/var/lib/longhorn/nvme/replicas# du -ms *
root@slave-sys3:~# df -h /var/lib/longhorn/nvme/
root@slave-sys3:/var/lib/longhorn/nvme/replicas# du -ms *
I have also tried merging two snapshots in the single healthy replica by deleting the 17GB snapshot so that it merges into the larger 2.4TB snapshot. It has been running for 24 hours. The original creation of the 2.4TB of data took only 7 hours.
For the 1st rebuilding failure, which is also what you reported in the beginning, we saw an HTTP call
For the 2nd failure, which is the one you reported later, it's caused by a read timeout from a healthy replica. I am not sure if it was caused by network fluctuation or a CPU/disk surge.
As for the stuck snapshot deletion/merge, can you point out which volume the issue happened on? It would also help to know when, since I saw lots of snapshot deletions & merges in the logs, and most of them took only a few seconds, e.g.:
The detach you mentioned is because I had to reboot the node. The status is rebuilding, but the Longhorn process on slave-sys2 only spins at a load of 1-2 and does nothing else. There is also zero data transfer to slave-sys3 for the large snapshot. When I cancel the rebuilding, the Longhorn process gets stuck; top(1) shows the Longhorn process running at load 1 with zero VIRT/RSS memory (literally ZERO).

The snapshot merging finished after I submitted the support bundle. It completed successfully after 30 hours, but merging 17GB in 30 hours is a bit too long. I tried creating a new PV, recopying the data from the original source, and repeating the process of deleting one replica and rebuilding, to no avail.

As far as I can see, when processing a snapshot this large, the Longhorn engine does something with the source snapshot (maybe checksumming or something; whatever it's doing, it's struggling) but never transfers any data for this snapshot across the network (literally zero data transfer). It's not CPU-bound, because the node is doing little else at that moment except for the Longhorn process working on the source snapshot. It's also not network fluctuation, because there is no data transfer to the other node at all. I've used PVs as large as 400GB with no issues.
I do find that stopping replicas and then detaching takes more than 90s (from …). The snapshot merging you mentioned is abnormal. Can you provide more info about this part? I mean the related volume name/snapshot name/timestamp. BTW, I don't know why the instance manager pod …
Before snapshot creation/merge, the engine will flush the data into the replicas via … The error I mentioned means that the engine somehow encounters a read timeout from a healthy replica while receiving I/O requests from applications. There is nothing Longhorn can do to prevent this.
Normally, for snapshot replication, Longhorn on the source node starts processing within a few seconds and shows transfer progress for the corresponding snapshot on the target node. "Zero data transfer" just means Longhorn on the source node starts processing the source snapshot while the size of the snapshot on the target node stays at 0 (as shown by du). I can't make sense of the read errors above; I think they appeared because, when I thought the rebuild was stuck, I tried to terminate it and something happened. In any case, I updated the server firmware just in case.

However, I now know what the bottleneck is, after testing with a smaller data set. It's not the size of the data in the snapshot but the number of files (and correspondingly the number of holes in the snapshot sparse file). This is also most likely why the snapshot merge took so long (there was only one replica at the time, so network transfer is irrelevant). The merge was 17GB of data in one snapshot into 2.4TB of data in another, but the number of files in the 2.4TB snapshot is around 21 million.

I reran the test with a smaller set: a PV snapshot with 1.9TB of data and 20 million files in the snapshot. As you can see below, Longhorn processes for around 8 hours before starting the data transfer to the corresponding snapshot on the target node. The data transfer after this initial processing takes around 6 hours.

Source node:

Target node:

I then tested creating a new PV and writing one file of size 1.9TB using:

Longhorn took only a few seconds of processing before starting the data transfer across the network. The transfer took about the same time to finish as in the case above, on the order of 6 hours.

The difference between the two cases: both have the same 1.9TB of data in the source snapshot, but one has 20 million files and the other a single file. The initial processing of 8 hours (before any data transfer across the network) is a function of the number of files, and, after reading the issue and the Longhorn source code mentioned here, hence a function of the number of holes in the sparse file, which in my case is probably on the order of a million. At least now I know Longhorn does rebuild the replica; it's just that the initial processing was so long I couldn't tell whether it was stuck or not.

So my questions are: is this initial processing delay the result of Longhorn inefficiently enumerating extents/finding holes in the sparse file, or is it inherent to ext4 sparse-file processing for a large number of holes? If the target node has no replica, is this processing necessary? Can it be bypassed so the transfer starts immediately? Also, this is a complete rebuild of a replica; I don't know how long it would take if the target node already had a replica. My guess would be the initial 8 hours plus the time to checksum the snapshot if checksumming is parallel (8+6 hours if the checksum takes 6 hours), otherwise twice that if done sequentially for the source and target snapshots (30 hours, which is, interestingly, the time the snapshot merge above took), plus the time to actually transfer the difference between the two replicas.
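For readers reproducing this comparison, here is a scaled-down sketch of the two workloads described above (a hypothetical illustration: the mount paths, file counts, and sizes are all assumptions, and this is not necessarily how the author generated the data):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	chunk := make([]byte, 1<<20) // 1 MiB of non-zero data
	for i := range chunk {
		chunk[i] = 0xCD
	}

	// Case 1: one large sequential file -> few extents and almost no
	// holes in the underlying replica sparse file.
	big, err := os.Create("/mnt/test-single/bigfile")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 4096; i++ { // 4 GiB total, scaled down from 1.9TB
		if _, err := big.Write(chunk); err != nil {
			panic(err)
		}
	}
	big.Close()

	// Case 2: the same amount of data spread over many small files ->
	// scattered allocations, hence many extents/holes in the replica.
	small := chunk[:4096] // 4 KiB per file
	for i := 0; i < 1_000_000; i++ {
		dir := filepath.Join("/mnt/test-many", fmt.Sprintf("d%04d", i/1000))
		if err := os.MkdirAll(dir, 0o755); err != nil {
			panic(err)
		}
		name := filepath.Join(dir, fmt.Sprintf("f%07d", i))
		if err := os.WriteFile(name, small, 0o644); err != nil {
			panic(err)
		}
	}
}
```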
Thank you for reporting this! The case where massive numbers of holes in a large snapshot file greatly degrade the rebuilding speed is something the Longhorn team has not paid attention to. As for the rebuilding progress, I think there is a slight misunderstanding. Let me explain the actual process and then answer the questions:
Steps 1 & 2, which are the initial processing of the rebuilding, don't involve the internal structure (the number of holes) of the snapshot files. The snapshot metafiles mainly record info like the parent snapshot and the file size; hence a metafile is almost empty.
To minimize the checksum cost, Longhorn could check the target replica file status first. But this implies extra communication, which would make things worse when there are massive numbers of small extents in a snapshot file. For the 1st step, the Longhorn team should do some benchmarking based on the case you reported. Hopefully we can fix this in the next release.
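As a rough illustration of what parallel checksumming of a source and target snapshot could look like (a minimal sketch, not Longhorn's actual implementation; the file paths and the choice of SHA-256 are assumptions):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksumFile hashes one snapshot file and reports the result.
func checksumFile(path string, out chan<- string) {
	f, err := os.Open(path)
	if err != nil {
		out <- fmt.Sprintf("%s: %v", path, err)
		return
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		out <- fmt.Sprintf("%s: %v", path, err)
		return
	}
	out <- fmt.Sprintf("%s: %s", path, hex.EncodeToString(h.Sum(nil)))
}

func main() {
	// Hypothetical source and target snapshot paths.
	src := "/var/lib/longhorn/replicas/src/volume-snap-x.img"
	dst := "/var/lib/longhorn/replicas/dst/volume-snap-x.img"

	out := make(chan string, 2)
	go checksumFile(src, out) // hash both files concurrently instead of
	go checksumFile(dst, out) // sequentially, roughly halving wall time
	fmt.Println(<-out)
	fmt.Println(<-out)
}
```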
OK, so somehow I misunderstood the steps. Then I'm still not sure what took so long between finishing the transfer of the small prior snapshot at time="2021-04-28T18:23:21Z" and starting the transfer of the next big snapshot at time="2021-04-29T02:37:23Z". It seems this is the GetFiemapExtents step? (If so, I wonder if the XFS filesystem is better.) I always see the Longhorn process on the source node just spinning at load 1 with no transfer happening for the big snapshot (until 8 hours later in this case). This always happened for this big snapshot, whether I reran the rebuild or recreated a new PV, recopied the data, and rebuilt again.
Based on the test you reported, I did a test today:
Based on this simple test, it seems the time differs by approximately a factor of two between the best case and the worst case.
You said
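The alternating data/hole layout used in these tests can be generated with something like the following (a minimal sketch; the file path, block size, and pair count are illustrative and should be scaled to the filesystem under test):

```go
package main

import "os"

func main() {
	const block = 4096       // 4 KiB of data followed by a 4 KiB hole
	const pairs = 256 * 1024 // number of data/hole pairs; scale as needed

	// Hypothetical path on the filesystem under test (ext4 vs XFS).
	f, err := os.Create("/var/lib/longhorn/test/worst-case.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	data := make([]byte, block)
	for i := range data {
		data[i] = 0xAB // non-zero so the filesystem must allocate the block
	}

	for i := int64(0); i < pairs; i++ {
		// Writing at every other block offset leaves a hole between each
		// data block, since the skipped ranges are never written.
		if _, err := f.WriteAt(data, i*2*block); err != nil {
			panic(err)
		}
	}
	// Extend the file so the trailing hole is included in its size.
	if err := f.Truncate(pairs * 2 * block); err != nil {
		panic(err)
	}
}
```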
Rebuilding and snapshot merging showed different behaviour. For a snapshot merge, I saw progress at 20%, then hours later at 40%, etc. For rebuilding, the UI shows "Rebuilding ..." as in the picture in the previous comment, but there is no network transfer for this snapshot, so it's effectively at 0% for 8 hours.

The issue is the GetFiemapExtents function in client.go. I downloaded this fiemap.c and ran it against the source snapshot. It really took 7.5h, with load at 1 CPU, and the process is hard to kill (the same behaviour as Longhorn during the initial 8h):

NVMe disk rating with dbench is:

To show it's not the structure of the extents that is the issue, I created a sparse file using your method above (but with 4k data size), alternating between data and holes: 4k data, 4k hole, 4k data, 4k hole, etc. Then I ran fiemap on it; it took 6h. To show it's not an OS/disk issue, I did the same on CentOS 8.2 and a different SSD (the above ran on Ubuntu 20.04.2 and an NVMe disk); it took 7h.

Then I did the same on an XFS filesystem. It took 1s!! So ext4 probably has to traverse the entire extent tree of the sparse file, while XFS seems to keep this info somewhere. I don't know if mkfs.ext4 has options to make it behave like XFS. I have now retested the 1.9TB of production data on XFS: the initial write from the outside source is 30% slower, but snapshot merging now takes less than 2 hours, and rebuilding doesn't have that 8h lead time. Normal read/write performance is similar to ext4.

Parallel checksumming of the source and target snapshots would also help reduce rebuilding time when the target snapshot exists. Thanks for the help with this troubleshooting; the rebuilding steps you explained above are really helpful.
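For anyone reproducing the measurement, the extent/hole enumeration can also be timed without compiling fiemap.c by walking the file with SEEK_DATA/SEEK_HOLE (a sketch; this uses a different kernel interface than the FIEMAP ioctl, so absolute timings may differ):

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open(os.Args[1]) // path to a snapshot/sparse file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fd := int(f.Fd())
	start := time.Now()
	var offset int64
	segments := 0
	for {
		// Find the start of the next data segment...
		dataStart, err := unix.Seek(fd, offset, unix.SEEK_DATA)
		if err != nil {
			break // ENXIO: no more data past this offset
		}
		// ...and the hole that terminates it.
		holeStart, err := unix.Seek(fd, dataStart, unix.SEEK_HOLE)
		if err != nil {
			break
		}
		segments++
		offset = holeStart
	}
	fmt.Printf("%d data segments enumerated in %s\n", segments, time.Since(start))
}
```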
Hi @hendra-2020, this is an interesting finding and brilliant troubleshooting! We're interested in hearing more about it. When you have time, would it be possible for you to reach out to me at sheng.yang@suse.com? Looking forward to talking with you.
Pre Ready-For-Testing Checklist
Creation of the Test Volume. FIO jobs:
Kernel info / Update instructions: EXT4. You might also want 2 different storage classes / SSDs attached to your nodes. You want to do the benchmark on old kernels as well as a separate one on new kernels. To figure out if your kernel has slow ext4 extent retrieval you can use fiemap. This should get the fiemap implementation installed:
This is for evaluating the extents of the replica file. Then run fiemap. Example of a full run on a newer 5.8 kernel:
Benchmark / Test instructions:
Example process, we are always looking for the
Before killing the process you might want to attach kubetail; otherwise you can retrieve the logs later. You can use the UI to get a rough idea of the time required for the rebuild.
For testing scenarios you can test the
For less precise benchmarks, create a filesystem on top of the volume, then create a bunch of test files inside the workload filesystem (see the sketch below). This gives you less control over the blocks/extents but more closely represents a user's use case.
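A rough sketch of generating such a file population inside the mounted workload filesystem (the mount point, file counts, and size range are arbitrary assumptions):

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
)

func main() {
	root := "/mnt/workload" // hypothetical mount point of the test volume
	for i := 0; i < 200000; i++ {
		dir := filepath.Join(root, fmt.Sprintf("d%03d", i/1000))
		if err := os.MkdirAll(dir, 0o755); err != nil {
			panic(err)
		}
		// Random sizes between 1 KiB and 64 KiB mimic a mixed user
		// workload and scatter allocations across the volume.
		buf := make([]byte, 1024+rand.Intn(63*1024))
		rand.Read(buf)
		name := filepath.Join(dir, fmt.Sprintf("f%06d", i))
		if err := os.WriteFile(name, buf, 0o644); err != nil {
			panic(err)
		}
	}
}
```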
Verified with v1.1.2-rc1. Validation: Pass.
Cluster Config: 1 control plane, 3 worker nodes (4 vCPUs, 8GB RAM)
File System: ext4
Test with files created on the workload fs:
Test with the above fio jobs:
File System: XFS
Note: The checksum of data is compared after each rebuild across all the replicas.
@khushboo-rancher can you add the
Also, what do you mean by 10K holes after each block?
@joshimoo Updated the above comment #2507 (comment) with OS and fs details. I used the command like
Thanks for adding the additional information.
On old kernels (< 5.8), ext4 retrieval for the worst case of 4k data / 4k holes will take approximately 3 minutes per 1024 extents. For the new engine, there will be an initial 3 minutes to retrieve the first 1024 extents, and then transmission will begin, so you will have progress updates right from the start; the total time required will still be rather slow, since extent retrieval remains the bottleneck. With the new engine, terminating the process will only be stuck for between 1 and 6 minutes.
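As a rough worked example under that estimate: a snapshot file with about 100,000 extents would need roughly 100,000 / 1024 ≈ 98 retrieval calls, i.e. about 98 × 3 min ≈ 5 hours spent on extent retrieval alone, before counting any actual data transfer.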
@joshimoo Thanks for explaining the expected behavior and testing steps in detail.
Closing this issue, as the performance has improved and we can confirm that the data after rebuild is intact, based on manual and integration automation tests: there are no integration test failures related to rebuild and restore caused by data mismatch. https://ci.longhorn.io/job/public/job/v1.1.2/job/v1.1.2-longhorn-tests-ubuntu-amd64/38/testReport/
I did a quick test of this on Longhorn:master-head |
Describe the bug
I have a large PV (2 replicas) of 3TB whose data I just rsynced from another storage system; the size of the data in the PV is 2.4TB. When I rebooted one node, the replica on the rebooted node failed, and another replica was created but never managed to finish rebuilding. So I tested reducing the replica count to 1 and increasing it again to 2, but the rebuilding also failed. The CPU load on the source node was around 1, with a Longhorn process running, but no data was actually transferred over the network during this time. Since the data was originally rsynced from another storage system, I assume this snapshot is 2.4TB with no holes in the sparse file. Is the reason it fails that it's too large? I have other PVs of 50GB and they have no issues rebuilding.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Replica should rebuild successfully.
Log
If applicable, add the Longhorn managers' log when the issue happens.
time="2021-04-18T06:00:54Z" level=error msg="Failed rebuilding of replica 10.42.1.207:10075" controller=longhorn-engine error="failed to add replica address='tcp://10.42.1.207:10075' to controller 'pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0/longhorn [--url 10.42.2.65:10000 add tcp://10.42.1.207:10075], output , stderr, time=\"2021-04-18T05:13:17Z\" level=info msg=\"Adding replica tcp://10.42.1.207:10075 in WO mode\"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.2.127:10075 as the source for rebuild \"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.1.207:10075 as the target for rebuild \"\ntime=\"2021-04-18T06:00:54Z\" level=fatal msg=\"Error running add replica command: failed to sync files [{FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ActualSize:2593904701440} {FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ActualSize:0} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ActualSize:18315870208} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ActualSize:0} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ActualSize:34347081728} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ActualSize:0}] from tcp://10.42.2.127:10075: rpc error: code = Unavailable desc = transport is closing\"\n, error exit status 1" node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0
You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
longhorn-support-bundle_a34f5b3a-be96-4afd-8271-6de8b0ebaa01_2021-04-18T06-33-04Z.zip
Environment:
Additional context
Add any other context about the problem here.