
[BUG] Failed rebuilding a large replica of 3TB #2507

Closed

hendra-2020 opened this issue Apr 18, 2021 · 32 comments
Labels
area/v1-data-engine v1 data engine (iSCSI tgt) backport/1.1.2 Require to backport to 1.1.2 release branch highlight Important feature/issue to highlight kind/bug priority/1 Highly recommended to fix in this release (managed by PO) release/note Note for installation, upgrade and highlighted issues


hendra-2020 commented Apr 18, 2021

Describe the bug
I have a large 3TB PV (2 replicas) whose data I just rsynced from another storage system; the PV holds 2.4TB of data. When I rebooted one node, the replica on the rebooted node failed, and a new replica was created but never managed to finish rebuilding. I then tried reducing the replica count to 1 and increasing it back to 2, but rebuilding failed again. The CPU load on the source node was around 1 with a longhorn process running, but no data was actually transferred over the network during this time. Since the data was originally rsynced from another storage system, I assume this snapshot is 2.4TB with no holes in the sparse file. Does it fail because the snapshot is too large? I have other PVs of 50GB and have no issue rebuilding those.

To Reproduce
Steps to reproduce the behavior:

  1. Create a PV with 2 replicas of 3TB
  2. Rsync 2.4TB of data into it
  3. Reduce replica count to 1, increase it again to 2
  4. Rebuilding ran for less than 1 hour, then failed

Expected behavior
Replica should rebuild successfully.

Log
If applicable, add the Longhorn managers' log when the issue happens.

time="2021-04-18T06:00:54Z" level=error msg="Failed rebuilding of replica 10.42.1.207:10075" controller=longhorn-engine error="failed to add replica address='tcp://10.42.1.207:10075' to controller 'pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0/longhorn [--url 10.42.2.65:10000 add tcp://10.42.1.207:10075], output , stderr, time=\"2021-04-18T05:13:17Z\" level=info msg=\"Adding replica tcp://10.42.1.207:10075 in WO mode\"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.2.127:10075 as the source for rebuild \"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.1.207:10075 as the target for rebuild \"\ntime=\"2021-04-18T06:00:54Z\" level=fatal msg=\"Error running add replica command: failed to sync files [{FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ActualSize:2593904701440} {FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ActualSize:0} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ActualSize:18315870208} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ActualSize:0} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ActualSize:34347081728} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ActualSize:0}] from tcp://10.42.2.127:10075: rpc error: code = Unavailable desc = transport is closing\"\n, error exit status 1" node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0

You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
longhorn-support-bundle_a34f5b3a-be96-4afd-8271-6de8b0ebaa01_2021-04-18T06-33-04Z.zip

Environment:

  • Longhorn version: 1.1.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2 1.20.5+rke2r1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04.2
    • CPU per node: 32
    • Memory per node: 256GB
    • Disk type(e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~20


@jenting jenting added this to New in Community Issue Review via automation Apr 19, 2021
@innobead innobead added the area/v1-data-engine v1 data engine (iSCSI tgt) label Apr 19, 2021

cclhsu commented Apr 19, 2021

Hi! Can you provide information on the disk size (NVMe)? From the log, I saw repeated messages like:

2021-04-18T05:36:21.216620804+04:00 time="2021-04-18T01:36:21Z" level=error msg="There's no available disk for replica pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-r-4815d394, size 3298534883328"

and all the PVCs are degraded, which means the cluster doesn't meet the minimum resource requirements.

NODE        NAME                                      WORKLOAD             STATE     STATUS    DISK_SEL  NODE_SEL           SIZE           ACTUAL_SIZE    REPLICA
master3     pvc-1492d039-409b-42f2-8809-25bd4cac8601  mail-865489c586      attached  degraded  ssd       master             209715200      26648576       3
master3     pvc-2e8dc1b4-504f-439c-8e89-3ddba7d1d814  warewulf-7ddfcd949c  attached  degraded  ssd       master             104857600      7958528        3
master3     pvc-8f76f7ba-1204-4d13-a102-a69378a2ba54  syslog-6d5775b947    attached  degraded  ssd       master             10737418240    251097088      3
master3     pvc-bea32350-c1fe-4757-93f8-127d9c750da3  registry-5f5dcc6d4c  attached  degraded  ssd       master             21474836480    2086252544     3
            pvc-4abcdf2b-f1cf-4aca-861f-190d0c1d06c2  home-bak-5b54748c6   detached  unknown   ssd       master             10485760       0              3
            pvc-85bd81b2-7380-4834-8165-b480a2b26686  mgmt-bg-87668dbbc    detached  unknown   ssd       master             10485760       1957888        3
            pvc-d76290b8-32bc-45c0-b700-3f4459739efc                       detached  unknown   ssd       slave-sys-default  20971520       0              3
            pvc-df38c94f-ae33-428d-a72a-77a0613e01de                       detached  unknown   ssd       slave-sys-default  10485760       0              3
slave-sys2  pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0  home-57c99f66fd      attached  degraded  nvme      slave-sys-default  3298534883328  2646567763968  2
slave-sys2  pvc-da8b91d4-ef94-4e21-8a90-2048ec44fef4  home-57c99f66fd      attached  degraded  ssd       slave-sys-default  10485760       2162688        3
slave-sys3  pvc-3a30f709-1e87-4fb2-be26-ca151dbb0bcf  apps-65974bbdd8      attached  degraded  nvme      slave-sys-default  10485760       1949696        3
slave-sys3  pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7  slurm-5d8b67bfbc     attached  degraded  ssd       slave-sys-default  53687091200    1440493568     3
slave-sys3  pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be  warewulf-7ddfcd949c  attached  degraded  ssd       slave-sys-default  53687091200    4614463488     3
slave-sys3  pvc-ecf5db8c-9091-4705-b664-df17e7c1375b  apps-65974bbdd8      attached  degraded  ssd       slave-sys-default  10485760       2068480        3

Notice that some of the PVCs are on /var/lib/longhorn/nvme and others are on /var/lib/longhorn/ssd.


hendra-2020 commented Apr 19, 2021

Hi. They're degraded because the default replica count is 3, but I only have 2 worker nodes at the moment (slave-sys2/3). Most of the PVs are on the ssd disks and have no issues because most of them are smaller than 50GB. The one that fails to rebuild is on NVMe. The NVMe disk is 6TB with the default overprovisioning of 200%.

There were no issues creating 2 replicas when the PV was first created; I only realized it failed to resync and then to rebuild after one node was rebooted. Even recreating a new PV with 2 replicas and then changing the replica count to 1 and back to 2 failed to rebuild this PV.

@innobead innobead moved this from New to In progress in Community Issue Review Apr 20, 2021
@innobead innobead moved this from In progress to New in Community Issue Review Apr 20, 2021

cclhsu commented Apr 20, 2021

From longhorn-manager.log, we can see the scheduler cannot find an available disk on slave-sys2 (where pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0 resides), cleans up the failed-to-schedule replica, and this eventually leads to the error provided in the ticket description.

2021-04-18T09:12:43.148701782+04:00 time="2021-04-18T05:12:43Z" level=error msg="There's no available disk for replica pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-r-4815d394, size 3298534883328"
2021-04-18T09:12:43.148729555+04:00 time="2021-04-18T05:12:43Z" level=error msg="unable to schedule replica" accessMode=rwo controller=longhorn-volume frontend=blockdev node=slave-sys2 owner=slave-sys2 replica=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-r-4815d394 state=attached volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0
2021-04-18T09:12:43.148741673+04:00 time="2021-04-18T05:12:43Z" level=info msg="Cleaning up failed to scheduled replica pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-r-4815d394"

Replicas take up space on the host under /var/lib/longhorn/nvme/replica/pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0, which includes snapshot data, and there is already data there. Overprovisioning only helps if the volume hasn't consumed most of its requested space. The actual size of pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0 is 2646567763968 bytes, which is more than half of the requested space.

@cclhsu cclhsu moved this from New to Pending user response in Community Issue Review Apr 20, 2021

hendra-2020 commented Apr 20, 2021

@cclhsu The original replica is on slave-sys2 and healthy. Longhorn is trying to schedule the replica rebuild on slave-sys3, where /var/lib/longhorn/nvme is almost empty (6TB of space).

root@slave-sys2:~# df -h /var/lib/longhorn/nvme/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg--nvme-k8s 5.8T 2.5T 3.1T 44% /var/lib/longhorn/nvme

root@slave-sys2:/var/lib/longhorn/nvme/replicas# du -ms *
2 pvc-3a30f709-1e87-4fb2-be26-ca151dbb0bcf-9f651a77
2523972 pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-591a968c

root@slave-sys3:~# df -h /var/lib/longhorn/nvme/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg--nvme-k8s 5.8T 98M 5.5T 1% /var/lib/longhorn/nvme

root@slave-sys3:/var/lib/longhorn/nvme/replicas# du -ms *
2 pvc-3a30f709-1e87-4fb2-be26-ca151dbb0bcf-dee17aeb
8 pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-50ed201f

(Screenshots attached: Screenshot_2021-04-20_09-40-38, Screenshot_2021-04-20_09-56-34)

@hendra-2020 (Author)

I have also tried merging 2 snapshots in the single healthy replica by deleting the 17GB snapshot so that it merges into the larger 2.4TB snapshot. It has been running for 24 hours. The original creation of the 2.4TB of data only took 7 hours.

@cclhsu cclhsu moved this from Pending user response to Team review required in Community Issue Review Apr 21, 2021
@innobead innobead moved this from Team review required to In progress in Community Issue Review Apr 23, 2021

shuo-wu commented Apr 27, 2021

For the 1st rebuilding failure, which is what you reported in the beginning, we saw an HTTP detach call. If a user detaches the volume during rebuilding, the rebuilding is interrupted and fails:

2021-04-18T10:00:54.783660865+04:00 time="2021-04-18T06:00:54Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514\", UID:\"60bdf9fc-5dfb-4999-9567-c12d923974d7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"2376654\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.1.207:10075: failed to add replica address='tcp://10.42.1.207:10075' to controller 'pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0/longhorn [--url 10.42.2.65:10000 add tcp://10.42.1.207:10075], output , stderr, time=\"2021-04-18T05:13:17Z\" level=info msg=\"Adding replica tcp://10.42.1.207:10075 in WO mode\"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.2.127:10075 as the source for rebuild \"\ntime=\"2021-04-18T05:13:17Z\" level=info msg=\"Using replica tcp://10.42.1.207:10075 as the target for rebuild \"\ntime=\"2021-04-18T06:00:54Z\" level=fatal msg=\"Error running add replica command: failed to sync files [{FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img ActualSize:2593904701440} {FromFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ToFileName:volume-snap-1dfe2ff5-eb10-41be-a686-1661013710be.img.meta ActualSize:0} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img ActualSize:18315870208} {FromFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ToFileName:volume-snap-51bf61cc-0f68-4a03-848c-07b8c35a51d5.img.meta ActualSize:0} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img ActualSize:34347081728} {FromFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ToFileName:volume-snap-5e304d7a-bec7-4ec1-9ec0-598443e46306.img.meta ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img ActualSize:0} {FromFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ToFileName:volume-snap-6dcea017-1fe4-45be-9f3f-88207a3de8cb.img.meta ActualSize:0}] from tcp://10.42.2.127:10075: rpc error: code = Unavailable desc = transport is closing\"\n, error exit status 1"
2021-04-18T10:00:54.789161648+04:00 time="2021-04-18T06:00:54Z" level=info msg="Removed failed rebuilding replica 10.42.1.207:10075" controller=longhorn-engine node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0
2021-04-18T10:00:54.793428137+04:00 time="2021-04-18T06:00:54Z" level=info msg="Marked failed rebuild on replica pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-r-73048188, backoff period is now 10 seconds" controller=longhorn-engine node=slave-sys2 volume=pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0
2021-04-18T10:00:54.862331401+04:00 time="2021-04-18T06:00:54Z" level=debug msg="Detaching volume pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0 from slave-sys2"
2021-04-18T10:00:54.862842280+04:00 10.42.1.224 - - [18/Apr/2021:06:00:54 +0000] "POST /v1/volumes/pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0?action=detach HTTP/1.1" 200 3818 "" "Go-http-client/1.1"

For the 2nd failure, which you reported later, it was caused by a read timeout from a healthy replica. I am not sure if it was caused by network fluctuation or a CPU/disk surge.

2021-04-18T10:19:58.450458764+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Connecting to remote: 10.42.1.207:10075"
2021-04-18T10:19:58.451551566+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Opening: 10.42.1.207:10075"
2021-04-18T10:19:58.466642742+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Snapshot: 10.42.2.127:10075 018bc2f4-1c08-41d0-a66c-1897c779d5ff UserCreated false Created at 2021-04-18T06:19:58Z, Labels map[]"
2021-04-18T10:19:58.467845149+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Snapshot: 10.42.1.207:10075 018bc2f4-1c08-41d0-a66c-1897c779d5ff UserCreated false Created at 2021-04-18T06:19:58Z, Labels map[]"
2021-04-18T10:19:58.468976176+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Adding backend: tcp://10.42.1.207:10075"
2021-04-18T10:19:58.469172387+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Start monitoring tcp://10.42.1.207:10075"
2021-04-18T10:19:58.472966877+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Set revision counter of 10.42.1.207:10075 to : 0"
2021-04-18T10:19:58.473814013+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Set backend tcp://10.42.1.207:10075 revision counter to 0"
2021-04-18T10:19:58.476513942+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Synchronizing volume-head-005.img.meta to volume-head-001.img.meta@10.42.1.207:10078"
2021-04-18T10:19:58.478880042+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:19:58Z" level=info msg="Done synchronizing volume-head-005.img.meta to volume-head-001.img.meta@10.42.1.207:10078"
2021-04-18T10:21:38.264829875+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:21:38Z" level=error msg="Read timeout on replica 10.42.2.127:10076 seq= 147 size= 512 (kB)"
2021-04-18T10:21:38.264862736+04:00 time="2021-04-18T06:21:38Z" level=error msg="Retry  1 on replica 10.42.2.127:10076 seq= 147 size= 512 (kB)"
2021-04-18T10:21:38.265714795+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:21:38Z" level=error msg="Read timeout on replica 10.42.2.127:10076 seq= 148 size= 1024 (kB)"
2021-04-18T10:21:38.265724133+04:00 time="2021-04-18T06:21:38Z" level=error msg="Retry  1 on replica 10.42.2.127:10076 seq= 148 size= 1024 (kB)"

As for the stuck snapshot deletion/merge, can you point out which volume the issue happened on, and when? I see lots of snapshot deletions & merges in the logs, and most of them took only several seconds, e.g.:

2021-04-15T14:00:06.142707573+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Prepare removing disk: snap-31e6835b"
2021-04-15T14:00:06.142889261+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Coalescing volume-snap-snap-7d066325.img to volume-snap-snap-31e6835b.img"
2021-04-15T14:00:06.183703922+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Replace volume-snap-snap-7d066325.img with volume-snap-snap-31e6835b.img"
2021-04-15T14:00:06.184191754+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Replacing disk volume-snap-snap-31e6835b.img with volume-snap-snap-7d066325.img"
2021-04-15T14:00:06.184508342+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Old file volume-snap-snap-7d066325.img exists, deleting"
2021-04-15T14:00:06.203081121+04:00 [pvc-41d1fe78-9ea2-4307-8337-e9ea16f722e7-r-0c61c297] time="2021-04-15T10:00:06Z" level=info msg="Done replacing volume-snap-snap-7d066325.img with volume-snap-snap-31e6835b.img"


2021-04-15T14:00:10.707822206+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Prepare removing disk: snap-589c357f"
2021-04-15T14:00:10.708123054+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Coalescing volume-snap-snap-bd32d492.img to volume-snap-snap-589c357f.img"
2021-04-15T14:00:10.710401947+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Replace volume-snap-snap-bd32d492.img with volume-snap-snap-589c357f.img"
2021-04-15T14:00:10.710763397+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Replacing disk volume-snap-snap-589c357f.img with volume-snap-snap-bd32d492.img"
2021-04-15T14:00:10.711024983+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Old file volume-snap-snap-bd32d492.img exists, deleting"
2021-04-15T14:00:10.729899264+04:00 [pvc-dd15ce97-8276-4300-b26b-9bb0373ae9be-r-0fce5843] time="2021-04-15T10:00:10Z" level=info msg="Done replacing volume-snap-snap-bd32d492.img with volume-snap-snap-589c357f.img"
......

@shuo-wu shuo-wu moved this from In progress to Pending user response in Community Issue Review Apr 27, 2021

hendra-2020 commented Apr 27, 2021

The detach you mentioned was because I had to reboot the node. The status is rebuilding, but the longhorn process on slave-sys2 only spins at a load of 1-2 and does nothing else. There is also zero data transfer to slave-sys3 for the large snapshot. When I cancel the rebuilding, the longhorn process gets stuck; top(1) shows the longhorn process running at load 1 with zero VIRT/RSS memory (literally zero).

The snapshot merging was done after I submitted the support bundle. It finished successfully after 30 hours, but merging 17GB in 30 hours is a bit too long.

I tried creating a new PV, recopied the data from the original source, and repeated the process of deleting one replica and rebuilding, to no avail. As far as I can see, when processing a snapshot this large, the longhorn engine does something with the source snapshot (maybe checksumming or something, but whatever it is, it's struggling) and never transfers any data for this snapshot across the network (literally zero data transfer). It's not CPU, because the node is doing nothing much at that moment except for the longhorn process working on the source snapshot. It's also not network fluctuation, because there is no data transfer to the other node.

I've managed to use PVs as large as 400GB with no issues.

@shuo-wu shuo-wu moved this from Pending user response to In progress in Community Issue Review Apr 28, 2021

shuo-wu commented Apr 28, 2021

I do see that stopping the replicas and then detaching takes more than 90s (from 2021-04-18T06:17:03Z to 2021-04-18T06:18:35Z). I guess the replica needs some time to complete the in-flight data processing. But I don't understand what the zero data transfer is.

The snapshot merging you mentioned is abnormal. Can you provide more info about this part? I mean the related volume name / snapshot name / timestamp.

BTW, I don't know why the instance manager pod instance-manager-r-35918346 only contains logs after 2021-04-18T06:17:03Z, even though the pod has not crashed recently.

@shuo-wu shuo-wu moved this from In progress to Pending user response in Community Issue Review Apr 28, 2021

shuo-wu commented Apr 28, 2021

Before a snapshot creation/merge, the engine flushes the data to the replicas via sync (before v1.1.1), or freezes the volume (in v1.1.1), to avoid unexpected data being out of sync or lost. Then each replica is responsible for performing the actual snapshot operations by itself, and the engine won't transfer data for them.

The error I mentioned means that somehow the engine encountered a read timeout from a healthy replica while serving IO requests from applications. There is nothing Longhorn can do to prevent this.

2021-04-18T10:21:38.264829875+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:21:38Z" level=error msg="Read timeout on replica 10.42.2.127:10076 seq= 147 size= 512 (kB)"
2021-04-18T10:21:38.264862736+04:00 time="2021-04-18T06:21:38Z" level=error msg="Retry  1 on replica 10.42.2.127:10076 seq= 147 size= 512 (kB)"
2021-04-18T10:21:38.265714795+04:00 [pvc-a9bb1886-22a7-42d5-b4dc-3ecfa4fa21b0-e-f333a514] time="2021-04-18T06:21:38Z" level=error msg="Read timeout on replica 10.42.2.127:10076 seq= 148 size= 1024 (kB)"
2021-04-18T10:21:38.265724133+04:00 time="2021-04-18T06:21:38Z" level=error msg="Retry  1 on replica 10.42.2.127:10076 seq= 148 size= 1024 (kB)"


hendra-2020 commented Apr 29, 2021

Normally for a snapshot replication, longhorn on the source node starts processing within a few seconds and shows transfer progress for the corresponding snapshot on the target node. "Zero data transfer" just means longhorn on the source node starts processing the source snapshot while the size of the snapshot on the target node stays 0 (as shown by du). I can't make out the read errors above; I think they occurred because, when I thought the rebuild was stuck, I tried to terminate it and something happened. In any case, I updated the server firmware just in case.

However, I now know what the bottleneck is after testing with a smaller set of data. It's not the size of the data in the snapshot but the number of files (and correspondingly the number of holes in the snapshot sparse file). This is also most likely the reason the merging of snapshots took so long (there was only 1 replica at the time, so network transfer is irrelevant). The merge was 17GB of data in one snapshot into 2.4TB of data in another snapshot, but the number of files in the 2.4TB snapshot is around 21 million.

I reran the test with a smaller set: a PV snapshot with 1.9TB of data and 20 million files. As you can see below, longhorn processes for around 8 hours before starting the data transfer to the corresponding snapshot on the target node. The data transfer after this initial processing takes around 6 hours.

Source node:
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-edc988a8] time="2021-04-28T18:23:21Z" level=info msg="Sending file volume-snap-38d221f9-eabf-4aff-8803-5d3444062443.img.meta to 10.42.1.207:10112"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-edc988a8] time="2021-04-28T18:23:21Z" level=info msg="source file size: 178, setting up directIo: false"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-edc988a8] time="2021-04-28T18:23:21Z" level=info msg="Done sending file volume-snap-38d221f9-eabf-4aff-8803-5d3444062443.img.meta to 10.42.1.207:10112"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-edc988a8] time="2021-04-28T18:23:21Z" level=info msg="Sending file volume-snap-6087797b-f04a-4f6c-adb0-d67cc9a3306c.img to 10.42.1.207:10113"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-edc988a8] time="2021-04-28T18:23:21Z" level=info msg="source file size: 3298534883328, setting up directIo: true"

Target node:
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-28T18:23:21Z" level=info msg="open: receiving fileSize: 178, setting up directIo: false"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-28T18:23:21Z" level=info msg="Ssync server opened and ready"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-28T18:23:21Z" level=info msg="Running ssync server for file volume-snap-6087797b-f04a-4f6c-adb0-d67cc9a3306c.img at port 10113"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-28T18:23:21Z" level=info msg="Closing ssync server"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-29T02:37:23Z" level=info msg="open: receiving fileSize: 3298534883328, setting up directIo: true"
[pvc-11313189-67a6-4589-af8d-3ecee3c76604-r-dca34737] time="2021-04-29T02:37:23Z" level=info msg="Ssync server opened and ready"

I then tested creating a new PV and writing 1 file of 1.9TB using:
dd if=/dev/urandom of=test1 bs=1G count=1900 iflag=fullblock

longhorn took only a few seconds of processing before starting the data transfer across the network. The transfer took around the same time to finish as in the case above, on the order of 6 hours.

The two cases above have the same data usage of 1.9TB in the source snapshot, but one has 20 million files and the other a single file. The initial processing of 8 hours (before any data transfer across the network starts) is a function of the number of files, and, after reading the issue and the longhorn source code mentioned here, therefore a function of the number of holes in the sparse file, which in my case is probably on the order of a million.

At least now I know longhorn does perform the replica rebuild; it's just that the initial processing was so long that I didn't know whether it was stuck or not. So my question is: is this initial processing delay a result of longhorn inefficiently enumerating extents/finding holes in the sparse file, or is it inherent to ext4 sparse file processing for a large number of holes? If the target node has no replica, is this processing necessary? Can it just bypass it and start transferring instead?

Also, this is a complete rebuild of a replica. I don't know how long it would take if the target node already had a replica; my guess would be the initial 8 hours plus the time to checksum the snapshot if checksumming is done in parallel (8+6 hours if the checksum takes 6 hours), or twice that if done sequentially for the source and target snapshots (30 hours, which is interestingly the time the snapshot merge I did above took), plus the time to actually transfer the difference between the 2 replicas.
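
For reference, the number of data extents and holes that drives this cost can be counted by walking the snapshot file with lseek(SEEK_DATA)/lseek(SEEK_HOLE). Below is a minimal Go sketch of that idea; it is only an illustration, not the Longhorn implementation (which goes through fiemap in longhorn/sparse-tools), and the file path is whatever replica snapshot file you point it at.

// extent-count.go - a rough sketch that counts the data extents in a sparse
// file by walking it with lseek(SEEK_DATA)/lseek(SEEK_HOLE). Illustrative only;
// Longhorn's rebuild path uses fiemap via longhorn/sparse-tools instead.
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: extent-count <path-to-volume-snap-...img>")
	}
	path := os.Args[1]

	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fd := int(f.Fd())
	var offset int64
	extents := 0
	for {
		// Jump to the start of the next data extent at or after offset.
		dataStart, err := unix.Seek(fd, offset, unix.SEEK_DATA)
		if err == unix.ENXIO {
			break // no more data; the rest of the file is a hole (or EOF)
		}
		if err != nil {
			log.Fatal(err)
		}
		// Jump to the end of that extent (the next hole; EOF counts as a hole).
		holeStart, err := unix.Seek(fd, dataStart, unix.SEEK_HOLE)
		if err != nil {
			log.Fatal(err)
		}
		extents++
		offset = holeStart
	}
	fmt.Printf("%s has %d data extents\n", path, extents)
}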


shuo-wu commented Apr 29, 2021

Thank you for the report! The case where massive numbers of holes in a large snapshot file greatly degrade the rebuilding speed is something the Longhorn team has not paid attention to.

Regarding the rebuilding progress, I think there is a small misunderstanding. Let me explain the actual process and then answer the questions:

  1. The engine lists all snapshot files and meta files in the source replica.
  2. It launches the target (new) replica process.
  3. The target replica establishes connections (1 connection per snapshot) with the source replica and then starts the file transfer.

Steps 1 & 2, which are the initial processing of the rebuilding, don't involve the internal structure (the number of holes) of the snapshot files. The snapshot meta files mainly record info like the parent snapshot and the file size, hence the meta files are almost empty.
In step 3, when transferring the actual snapshot file, Longhorn checks the fiemap to get the locations of the data chunks/extents between the holes, then compares (calculates checksums) and transfers the chunks one by one. Each time a chunk is transferred, the rebuilding progress is updated. As you can see here:

  1. Longhorn won't calculate the checksums for all chunks first and then start the transfer. Instead, it processes the chunks sequentially.
  2. Longhorn relies on the inherent stat/fiemap of ext4 sparse files to figure out the chunks & holes. Here is the implementation: https://github.com/longhorn/sparse-tools/blob/5c51de69e0ebc2baf20d412863d416d88a2c4384/sparse/client.go#L71 (the client in the code is the source/healthy replica, i.e. the sender)

To minimize the checksum cost, Longhorn could check the target replica file status first. But this implies extra communication, which would make the case worse if there are massive numbers of small extents in a snapshot file.

For the 1st step, the Longhorn team should do some benchmarking based on the case you report. Hopefully we can fix this in the next release.
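
To make the per-chunk flow of step 3 concrete, here is a minimal, hypothetical Go sketch of the sender-side loop: walk the data extents, checksum each chunk, and send it only if the receiver reports a different checksum. The Extent type, the SHA-256 hash choice, and the remoteChecksum/sendChunk callbacks are illustrative placeholders, not the actual sparse-tools API.

// Sketch of the sender-side rebuild loop described in step 3 above.
// Hypothetical helpers; not the actual longhorn/sparse-tools code.
package rebuild

import (
	"bytes"
	"crypto/sha256"
	"os"
)

// Extent is one contiguous data region between holes in the sparse snapshot file.
type Extent struct {
	Offset int64
	Length int64
}

// syncExtents walks the data extents sequentially: hash the local chunk, ask the
// receiver for its hash of the same range, and send the bytes only on mismatch.
// remoteChecksum and sendChunk stand in for the network calls to the target replica.
func syncExtents(f *os.File, extents []Extent,
	remoteChecksum func(off, length int64) ([]byte, error),
	sendChunk func(off int64, data []byte) error) error {

	for _, ext := range extents {
		buf := make([]byte, ext.Length)
		if _, err := f.ReadAt(buf, ext.Offset); err != nil {
			return err
		}
		local := sha256.Sum256(buf) // hash choice is illustrative only

		remote, err := remoteChecksum(ext.Offset, ext.Length)
		if err != nil {
			return err
		}
		if bytes.Equal(local[:], remote) {
			continue // chunk already matches on the target; skip the transfer
		}
		if err := sendChunk(ext.Offset, buf); err != nil {
			return err
		}
		// In the real engine, the rebuilding progress is updated after each chunk here.
	}
	return nil
}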

@shuo-wu shuo-wu moved this from Pending user response to Team review required in Community Issue Review Apr 29, 2021

hendra-2020 commented Apr 29, 2021

OK, so somehow I misunderstood the steps. Then I'm still not sure what took so long between finishing the transfer of the small prior snapshot at time="2021-04-28T18:23:21Z" and starting the transfer of the next big snapshot at time="2021-04-29T02:37:23Z". It seems this is the GetFiemapExtents step? (If so, I wonder if the xfs filesystem would do better.) I always see the longhorn process on the source node spinning at load 1 with no transfer happening for the big snapshot (until 8 hours later in this case). This happens for this big snapshot no matter whether I rerun the rebuild or recreate a new PV, recopy the data, and rebuild again.


shuo-wu commented Apr 30, 2021

Based on the test you reported, I did a test today:

  1. Create a snapshot file whose data consists of 512B blocks separated by 512B holes. The total actual data size is 1Gi (2 million extents/chunks).
  2. Sync the file multiple times and calculate the average time spent. The result is 65s.
  3. Create a snapshot file that is one continuous 1Gi data block followed by a 1Gi hole.
  4. Sync the file multiple times and calculate the average time spent. The result is 30s.

Based on this simple test, it seems the time should differ by roughly a factor of two between the best case and the worst case.
BTW, here is the benchmark of the disk I used for the test:

$ fio -filename=./data1 -name=randw -ioengine=libaio -direct=1 -iodepth=32 -rw=randwrite -bs=512 -numjobs=1 -runtime=30 -size=10G -group_reporting
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=5887KiB/s][r=0,w=11.8k IOPS][eta 00m:00s]

$ fio -filename=./data2 -name=seqw -ioengine=libaio -direct=1 -iodepth=32 -rw=write -bs=512 -numjobs=1 -runtime=30 -size=10G -group_reporting
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=38.5MiB/s][r=0,w=78.8k IOPS][eta 00m:00s]

You said the source node just spins at load 1 with no transfer happening for the big snapshot (until 8 hours later in this case). Do you mean that the rebuilding/snapshot purge progress shown in the UI stays at 0 for the first 8 hours? Maybe other rebuilding or snapshot merge operations (happening on other volumes) eat up all the disk bandwidth. I am not sure if there are other possibilities that could lead to the unexpectedly long processing time.

@hendra-2020 (Author)

Rebuilding and snapshot merging show different behaviour. For the snapshot merge, I saw progress at 20%, then hours later at 40%, etc. For rebuilding, the UI shows "Rebuilding ..." as in the picture in a previous comment, but there is no network transfer for this snapshot, so it is effectively at 0% for 8 hours.

The issue is in the GetFiemapExtents function in client.go. I downloaded this fiemap.c and ran it against the source snapshot. It really took 7.5h, at a load of 1 CPU, and the process is hard to kill (the same behaviour longhorn showed during the initial 8h):
File volume-snap-7432db6c-6aef-4bd7-901d-a4cd5ccea236.img has 1216060 extents: 0.00user 26854.01system 7:28:05elapsed 99%CPU (0avgtext+0avgdata 134160maxresident)k 50368inputs+0outputs (0major+33621minor)pagefaults 0swaps

Nvme disk rating with dbench is:
Random Read/Write IOPS: 260k/233k. BW: 3210MiB/s / 2675MiB/s
Average Latency (usec) Read/Write: 99.53/25.30
Sequential Read/Write: 3394MiB/s / 2660MiB/s
Mixed Random Read/Write IOPS: 166k/55.4k

To show it's not the structure of the extents that is the issue, I created a sparse file using your method above (but with a 4k data size), alternating between data and holes: 4k data, 4k hole, 4k data, 4k hole, etc. Then I ran fiemap on it; it took 6h.
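
For anyone reproducing this, below is a minimal Go sketch of one way to build such an alternating 4k-data/4k-hole sparse file: write a 4 KiB block at every other 4 KiB offset and leave the gaps untouched so they remain holes. The output file name and total size are arbitrary test parameters.

// make-sparse.go - create an alternating 4k-data/4k-hole sparse test file as
// described above: ~1 GiB of actual data spread over ~2 GiB of apparent size.
package main

import (
	"crypto/rand"
	"log"
	"os"
)

func main() {
	const (
		blockSize  = 4096
		dataBlocks = 256 * 1024 // 256k blocks x 4 KiB = 1 GiB of actual data
	)

	f, err := os.Create("sparse-4k-data-4k-holes.img")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, blockSize)
	for i := int64(0); i < dataBlocks; i++ {
		if _, err := rand.Read(buf); err != nil {
			log.Fatal(err)
		}
		// Data lands at offsets 0, 8k, 16k, ...; the 4 KiB gaps in between are
		// never written, so the filesystem leaves them as holes.
		if _, err := f.WriteAt(buf, i*2*blockSize); err != nil {
			log.Fatal(err)
		}
	}
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}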

To show it's not an OS/disk issue, I did the same on CentOS 8.2 with a different SSD (the test above ran on Ubuntu 20.04.2 and an NVMe disk); it took 7h.

Then I did the same on an XFS filesystem. It took 1s!!

So the ext4-based sparse file apparently has to traverse all the extent trees, while xfs seems to keep this info readily available. I don't know if mkfs.ext4 has options to make it behave like xfs. I have now retested the 1.9T of production data using xfs: the initial write from the outside source is 30% slower, but snapshot merging now takes less than 2h, and rebuilding doesn't have that 8h lead time. Normal read/write performance is similar to ext4. Parallel checksumming of the source and target snapshots would also help reduce rebuilding time when the target snapshot already exists.

Thanks for the help with this troubleshooting; the rebuilding steps you explained above were really helpful.


yasker commented May 4, 2021

Hi @hendra-2020, this is an interesting finding and brilliant troubleshooting! We're interested in hearing more about it. When you have time, would it be possible for you to reach out to me at sheng.yang@suse.com? Looking forward to talking with you.

@jenting jenting moved this from Team review required to In progress in Community Issue Review May 5, 2021
@yasker yasker added this to the v1.1.2 milestone May 6, 2021
@innobead innobead added the backport/1.1.2 Require to backport to 1.1.2 release branch label May 26, 2021

longhorn-io-github-bot commented May 26, 2021

Pre Ready-For-Testing Checklist


joshimoo commented Jun 8, 2021

Creation of the Test Volume.

FIO jobs:
You can use Job10 as a template if you only want to populate a single volume.
Then update the engines between the rebuild tests, i.e. test a single rebuild at a time.
This way you only need to run fio once to populate the volume.
To use this, attach the volume (for job10 it's rebuild-test) to a random cluster node, then run fio fio.job.
This populates the volume and creates the appropriate sparse files on the replicas.

[global]
bs=4k
iodepth=128
direct=1
sync=0
end_fsync=1
ioengine=libaio
randseed=1
randrepeat=0
# buffer_pattern="X"
group_reporting

# 4k bs
[job1]
bs=4k
rw=write:4k
filename=/dev/longhorn/rebuild-4k-old
name=data4k-holes4k-old
io_size=5G

[job2]
bs=4k
rw=write:4k
filename=/dev/longhorn/rebuild-4k-new
name=data4k-holes4k-new
io_size=5G


# 128k bs
[job4]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-128k-old
name=data128k-holes4k-old
io_size=5G

[job5]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-128k-new
name=data128k-holes4k-new
io_size=5G


# 512k bs
[job7]
bs=512k
rw=write:4k
filename=/dev/longhorn/rebuild-512k-old
name=data512k-holes4k-old
io_size=5G

[job8]
bs=512k
rw=write:4k
filename=/dev/longhorn/rebuild-512k-new
name=data512k-holes4k-new
io_size=5G

# 50GB test, need to switch engines between tests
[job10]
bs=128k
rw=write:4k
filename=/dev/longhorn/rebuild-test # specify the <volume-name>
name=test-total50gb-data128k-holes4k
io_size=50G # you can specify io_size to limit the volume to only be half written for example


joshimoo commented Jun 8, 2021

Kernel info / Update instructions:

EXT4
Bad Kernel < 5.8
Good Kernel >= 5.8

You might also want 2 different storage classes / SSDs attached to your nodes, one for EXT4 and one for XFS, to compare their performance.
Generally, extent retrieval on XFS on older kernels is faster than on EXT4, but the write time to populate the disk via FIO for small block sizes is much longer.

You want to do the benchmark on old kernels as well as a separate one on new kernels.
To update the Ubuntu kernel of your test nodes:
https://linuxconfig.org/how-to-upgrade-kernel-to-latest-version-on-ubuntu-20-04-focal-fossa-linux


To figure out if your kernel has slow ext4 extent retrieval, you can run this fiemap tool against one of the larger replica files:
https://github.com/joshimoo/fiemap/archive/refs/tags/v0.2.0.tar.gz
ref: #2640

This should get the fiemap implementation installed

sudo apt install gcc make -y
wget https://github.com/joshimoo/fiemap/archive/refs/tags/v0.2.0.tar.gz
tar -xzf v0.2.0.tar.gz
cd fiemap-0.2.0
make
sudo make install

This is for evaluating the extents of the replica file.
Look for a snapshot file of a replica that actually has a bunch of data; you can use ls -lsah to see the actual used space as well as the max (allocated) space.
Example below: 10GB allocated, 5.1GB used
5.1G -rw-rw-r-- 1 jmoody jmoody 10G May 7 13:49 ssync-src-fiemap-10gb-file

Then run fiemap.
If the initial 1024-extent retrieval takes around 3 minutes, you can cancel it via ctrl+c; it will take another 3 minutes until the process returns from the kernel. This is slow EXT4 extent retrieval -> bad kernel.
If the initial 1024-extent retrieval takes 0 seconds, you are good and can let it run to completion. -> good kernel

Example of a full run on a newer 5.8 kernel

retrieved 1024 extents in 0 seconds

fiemap done retrieved 1310720 extents in 8 seconds
File /home/jmoody/test-files/ssync-src-fiemap-10gb-file has 1310720 extents:


joshimoo commented Jun 8, 2021

Benchmark / Test instructions:

  • Create a volume (size can be small or large)
  • Attach the volume to a node in the cluster via the Longhorn UI
  • Populate the volume via a FIO job (only needs to be done once, for comparison between the old/new implementation for that block size)
  • For different block sizes you need to create a new volume and populate it with a FIO job (examples above)
  • Exec into the instance-manager-r pod on a different node
  • Use ps -aux to identify the correct replica process, then kill it

Example process; we are always looking for the replica process (there will also be a sync-agent process):

root       22852  0.0  0.9 1386484 39876 ?       Sl   Jun01   7:52 /host/var/lib/longhorn/engine-binaries/joshimoo-longhorn-engine-1948-replica-rebuild-no-chksum/longhorn replica

Before killing the process you might want to attach kubetail; otherwise you can retrieve the logs later.
kubetail instance-manager-r -n longhorn-system

You can use the UI to get a rough idea of the time required for the rebuild.
For accurate numbers use the log example below:

# time 8 minutes - sender
[rebuild-perf-test-new-r-12baf306] time="2021-05-28T17:38:13Z" level=info msg="Sending file volume-snap-39f71d7b-e5de-4bd0-af0b-5c97e8835138.img to 10.42.1.41:10004"
[rebuild-perf-test-new-r-12baf306] time="2021-05-28T17:38:13Z" level=info msg="source file size: 10737418240, setting up directIO: true"
[rebuild-perf-test-new-r-12baf306] time="2021-05-28T17:45:52Z" level=info msg="Done sending file volume-snap-39f71d7b-e5de-4bd0-af0b-5c97e8835138.img to 10.42.1.41:10004"

# time 8 minutes - receiver
[rebuild-perf-test-new-r-c7bcb713] time="2021-05-28T17:38:13Z" level=info msg="Running ssync server for file volume-snap-39f71d7b-e5de-4bd0-af0b-5c97e8835138.img at port 10004"
time="2021-05-28T17:38:13Z" level=info msg="Creating Ssync service"
[rebuild-perf-test-new-r-c7bcb713] time="2021-05-28T17:38:13Z" level=info msg="open: receiving fileSize: 10737418240, setting up directIO: true"
[rebuild-perf-test-new-r-c7bcb713] time="2021-05-28T17:45:52Z" level=info msg="Done running ssync server for file volume-snap-39f71d7b-e5de-4bd0-af0b-5c97e8835138.img at port 10004"

For testing scenarios you can test:

  • worst case: 4k data, 4k holes
  • best case: 128k data, 4k holes (since we send a single interval in batches of 128k bytes)
  • other cases: 512k, 1MB, etc.

For less precise benchmarks, create a filesystem on top of the volume and then create a bunch of test files inside the workload filesystem (see the sketch after the list below). This gives you less control over the blocks/extents but more closely represents a user's use case.

  • You can create test cases for:
    • many small files (4k - 1MB)
    • a couple of big files (128MB - 1GB)
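
Below is a small, hypothetical Go generator for the "many small files" case: it fills a directory on the workload filesystem with random files in the 4 KiB-1 MiB range. The directory path, file count, and size bounds are placeholders to adjust for the test.

// fillfiles.go - populate a workload filesystem with many small random files
// (roughly the "many small files (4k - 1MB)" case above). Paths and counts are
// placeholders; run it where the Longhorn volume's filesystem is mounted.
package main

import (
	"crypto/rand"
	"fmt"
	"log"
	"math/big"
	"os"
	"path/filepath"
)

func main() {
	const (
		dir      = "/data/testfiles" // placeholder: mount point of the Longhorn volume
		numFiles = 10000
		minSize  = 4 * 1024
		maxSize  = 1024 * 1024
	)

	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}
	for i := 0; i < numFiles; i++ {
		// Pick a random size between minSize and maxSize.
		n, err := rand.Int(rand.Reader, big.NewInt(maxSize-minSize+1))
		if err != nil {
			log.Fatal(err)
		}
		size := minSize + n.Int64()

		buf := make([]byte, size)
		if _, err := rand.Read(buf); err != nil {
			log.Fatal(err)
		}
		name := filepath.Join(dir, fmt.Sprintf("file-%05d.bin", i))
		if err := os.WriteFile(name, buf, 0o644); err != nil {
			log.Fatal(err)
		}
	}
}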


innobead commented Jun 9, 2021

cc @khushboo-rancher @kaxing


khushboo-rancher commented Jun 23, 2021

Verified with v1.1.2-rc1

Validation - Pass

Cluster Config : 1 Control plane, 3 worker nodes (4 vcpus, 8GB RAM)
Host OS : Ubuntu 20.04

cat /proc/version
Linux version 5.4.0-73-generic (buildd@lcy01-amd64-019) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04))

File System : ext4


Test with files created on workload fs :
Created test files inside of the workload filesystem. Used the dd command to generate the data with the skip parameter. e.g. dd if=/dev/urandom of=file1.txt count=100 bs=1M conv=fsync skip=10k

Data / Version                                                v1.1.1               v1.1.2-rc1
2 Gi (2 files of 1 Gi) with 10K holes after each block        ~1 min 55 sec        ~1 min 42 sec
2 Gi (2000 files of 1M) with 10K holes after each block       ~2 min 30 sec        ~1 min 9 sec
10 Gi (10000 files of 1M) with 100K holes after each block    ~12 min              ~3 min 40 sec
50 Gi (50000 files of 1M) with 10k holes after each block     ~39 min              ~21 min
50 Gi (50 files of 1G)                                        ~1 hr 1 min 15 sec   ~25 min

Test with above fio jobs :
Used the above fio jobs to populate the data in the volume

  • Best case - 128k data, 4k holes - 5G:   v1.1.1: 5 min 4 sec    v1.1.2-rc1: 2 min 36 sec
  • Worst case - 4k data, 4k holes - 5G:    v1.1.1: 11 hr 58 min   v1.1.2-rc1: 2 hr 57 min

File System : XFS
Data Size : 10 Gi with 10k holes introduced after each block

v1.1.1         v1.1.2-rc1
9 min 26 sec   3 min 11 sec

Note: The checksum of data is compared after each rebuild across all the replicas.

@joshimoo (Contributor)

@khushboo-rancher can you add the following:

  • host OS (Ubuntu XYZ)
  • host filesystem where the replica data is stored (xfs/ext4)

Also, what do you mean by 10K holes after each block?
Can you describe the testing? I am guessing you are creating files inside the workload fs, which is fine.
I just want to make sure we understand what was tested and how :)

@khushboo-rancher (Contributor)

@joshimoo Updated the comment above (#2507 (comment)) with the OS and fs details.

I used a command like dd if=/dev/urandom of=file1.txt count=100 bs=1M conv=fsync skip=10k to introduce the holes.

@joshimoo (Contributor)

Thanks for adding the additional information.


joshimoo commented Jun 24, 2021

On an old kernel (< 5.8), ext4 extent retrieval for the worst case (4k data / 4k holes) takes approximately 3 minutes per 1024 extents.
So for 100MB of data (about 25,600 extents, i.e. 25 batches of 1024) the prepass of the old engine takes around 75 minutes before the first transfer.
During that time you will not see any progress of the rebuild.
With the old engine you will also run into problems when trying to stop the rebuild, since the process is stuck in kernel space during the full extent retrieval; see #2640.

For the new engine, there is an initial 3 minutes to retrieve the first 1024 extents, then transmission starts, so you will have progress updates right from the start; the time required will still be rather slow, since the extent retrieval remains the bottleneck.

With the new engine, termination of the process will only be stuck for between 1 and 6 minutes.

@khushboo-rancher (Contributor)

@joshimoo Thanks for explaining the expected behavior and testing steps in detail.
The results look good, as per the validation comment above (#2507 (comment)).

@khushboo-rancher (Contributor)

Closing this issue as the performance has improved, and we can confirm that the data after a rebuild is intact from manual tests and the integration automation tests, as there are no integration test failures related to rebuild and restore due to data mismatch. https://ci.longhorn.io/job/public/job/v1.1.2/job/v1.1.2-longhorn-tests-ubuntu-amd64/38/testReport/

@khushboo-rancher (Contributor)

I did a quick test of this on Longhorn master-head (06/25/2021) with 10 Gi; it took around 3 min 51 sec to rebuild, which is improved from v1.1.1.
