
[BUG] Backing Image Data Inconsistency if it's Exported from a Backing Image Backed Volume #6899

Closed
WebberHuang1118 opened this issue Oct 16, 2023 · 45 comments
Assignees
Labels
area/backing-image (Backing image related), backport/1.4.5, backport/1.5.4, kind/bug, priority/0 (Must be fixed in this release, managed by PO), require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/qa-review-coverage (Require QA to review coverage)
Milestone

Comments

@WebberHuang1118

WebberHuang1118 commented Oct 16, 2023

Describe the bug (🐛 if you encounter this issue)

A volume encounters a data integrity issue if it is created from a backing image, and that backing image was exported from a volume which is itself backed by another backing image.

To Reproduce

  1. Preparing a Harvester v1.2.0-head cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version), img1
  3. Creating volume vol1 from img1
  4. Creating a VM with vol1 as the root disk, and writing some data
  5. Issuing sync on the filesystem, e.g. $ sudo sync
  6. Stopping VM, and waiting until vol1 is completely detached
  7. Exporting vol1 as img2
  8. Creating vol2 from img2
  9. Comparing data content between vol1 and vol2; the data is inconsistent
  10. Running e2fsck -v on vol2; fsck reports filesystem errors and requires data repair

Expected behavior

The data of vol1 and vol2 should be consistent.

Support bundle for troubleshooting

supportbundle_066aa8ed-a12b-4ccc-971c-59404ee44cd4_2023-10-16T07-46-44Z.zip

Environment

  • Longhorn version: v1.4.3
  • Harvester version: v1.2.0-head
  • Number of Longhorn volumes in the cluster: 2
  • Impacted Longhorn resources:
    • Volume names: pvc-b8bd9013-e8e9-449d-9c25-6c6c0824d0b3, pvc-de64dd95-b0c1-4ce9-864a-88d38e06d9d3

Additional context

  • If vol1 is not backed by a backing image, there is no data inconsistency between vol1 and vol2
  • A single node Harvester cluster should be enough to reproduce this issue
@WebberHuang1118 added the kind/bug, require/qa-review-coverage, and require/backport labels Oct 16, 2023
@ChanYiLin ChanYiLin self-assigned this Oct 16, 2023
@ChanYiLin ChanYiLin added this to the v1.6.0 milestone Oct 23, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Oct 26, 2023
@innobead
Member

cc @shuo-wu

@WebberHuang1118
Author

WebberHuang1118 commented Oct 27, 2023

Some observations with the following steps:

  1. Preparing a Harvester v1.2.0-head single node cluster (for using volume as an OS root disk)
  2. Uploading Ubuntu jammy cloud image (latest version) as img1 with replica number 1
  3. Checking data usage of img1 (mapped to bi default-image-s9rfj), which is 640MB data_usage_origin_img
  4. Creating volume os-vol (mapped to volume pvc-c3b1a434-86ee-453b-afc7-a52fe113c059) from img1 with claim size 10GB
  5. Creating a VM with os-vol as the root disk
  6. Stopping VM, and waiting until os-vol is completely detached
  7. Exporting os-vol as os-vol.img (mapped to bi default-image-j8ghr) with replica number 1
  8. Checking replica data usage of os-vol, which is 523MB data_usage_os_vol
  9. Checking data usage of os-vol.img, which is 2507MB data_usage_os_vol_img
    • For the data usage, os-vol.img (2507MB) is much greater than the sum of img1 (640MB) and os-vol (523MB)
    • Shouldn't os-vol.img's data usage be no greater than the sum of img1's and os-vol's?
  10. Creating volume copy-vol from os-vol.img with claim size 10GB
  11. Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
  12. From the compare result cmp.log, there are several contiguous byte ranges overwritten with zeros in the synthesized volume copy-vol. Running fsck on copy-vol reports filesystem errors and requires data repair

Thanks

@ChanYiLin
Contributor

ChanYiLin commented Jan 3, 2024

Note: From the Longhorn side, we can't reproduce the issue.

Longhorn-side reproduce steps

  1. Create a BackingImageA
  2. Create a VolumeA with the BackingImageA
  3. Write some data to the VolumeA
  4. Get the checksum of the VolumeA
  5. Export BackingImageB from VolumeA
  6. Create the VolumeB with BackingImageB
  7. Get the checksum of the VolumeB
  8. VolumeA and VolumeB have the same checksum

@innobead
Member

innobead commented Jan 3, 2024

@WebberHuang1118 Please work with @ChanYiLin to reproduce the issue.

@WebberHuang1118
Author

WebberHuang1118 commented Jan 3, 2024

For the above reproduce steps #6899 (comment),

After step 6 in #6899 (comment),
since the VM is stopped and the volume is detached, IMHO this issue can be reduced to: "is the exported backing image content the same as the source volume's?"

As I described in step 9 of #6899 (comment),
the original backing image size is 640MB and the volume is 523MB; however, the synthesized backing image grows to 2507MB (> 640MB + 523MB), which seems abnormal.

If we create a new volume from the synthesized backing image and compare the new volume's content with the source volume, the content is inconsistent. However, if the source volume is not backed by a backing image, there is no such data inconsistency.

Thanks

@WebberHuang1118
Author

WebberHuang1118 commented Jan 8, 2024

I found a method to reproduce a similar case without Harvester-specific operations (no VM operations; only creating a PVC from a backing image and performing I/O)

Detailed steps:

  • Uploading Ubuntu jammy cloud image (latest version), img1
  • Creating PVC os-vol from img1 with claim size 10GB and replica count 1
  • Applying the following YAML to mount os-vol in block mode in the pod
    apiVersion: v1
    kind: Pod
    metadata:
      name: block-volume-test
      namespace: default
    spec:
      containers:
      - name: volume-test
        image: ubuntu
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["/bin/sleep"]
        args: ["3600"]
        volumeDevices:
        - devicePath: /dev/os-vol
          name: os-vol
      volumes:
        - name: os-vol
          persistentVolumeClaim:
            claimName: os-vol
    
  • Issuing filesystem expand and I/O on os-vol
    • Enter pod block-volume-test
    • Install fdisk
      • $ apt update; apt install -y fdisk
    • Expanding filesystem on os-vol
      • Check the device (/dev/sda for example) mapped to os-vol by its device number
      • $ fdisk /dev/sda
        • fdisk inputs: d, 1, n, 1, default, default, No, w (delete partition 1, recreate it spanning the whole disk, answer No to removing the existing signature, then write)
      • $ resize2fs -fp /dev/sda1
    • Mounting /dev/sda1 and writing I/O into it
      • $ mkdir /data; mount -t ext4 /dev/sda1 /data
      • $ dd if=/dev/zero of=/data/file bs=1M count=2048
    • Flushing data to storage for os-vol
      • $ umount /data
    • Delete pod block-volume-test, and wait until os-vol's Longhorn volume is detached
  • Exporting os-vol as image os-vol.img
  • Creating PVC copy-vol from os-vol.img with claim size 10GB
  • Comparing data content between os-vol and copy-vol with $ cmp -l os-vol copy-vol
    • The data content differs between the two PVCs
  • Finding filesystem errors on copy-vol with $ e2fsck -c /dev/sda1 (assuming PVC copy-vol is mapped to /dev/sda)
    • There is no filesystem error on os-vol

PS:

  • IMHO, it may be related to resize2fs, since it produces more complicated I/O patterns, including discards to image block ranges.

@ChanYiLin
Contributor

ChanYiLin commented Jan 8, 2024

I can reproduce the issue; I will look into how we export the BackingImage.

cc @shuo-wu

@ChanYiLin
Contributor

ChanYiLin commented Jan 9, 2024

BackingImage Export Mechanism

Volume is composed of a series of snapshots and a BackingImage

Volume = BackingImage -> snap1 -> head

When exporting a Volume to another BackingImage (code):

  1. create a snapshot
  2. create an in-memory replica via OpenSnapshot() from that snapshot
  3. preload() the block index map (including the BackingImage)
    • so we know where to find the latest data in each block
  4. use SyncContent() from sparse-tools to sync the replica content to the BackingImage datasource
    • we use GetDataLayout() to see whether each block is a hole or data

Volume Cloning Mechanism

It is essentially the same as exporting a BackingImage, except that it won't load the BackingImage when doing preload().
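
For illustration, here is a minimal Go sketch of how the preload() idea above can produce different sector-to-disk maps for cloning versus backing-image export. The function and variable names are hypothetical and only mimic the described behavior; this is not the actual longhorn-engine code:

```go
package main

import "fmt"

// buildLocation mimics preload(): it builds a per-sector map recording which
// disk holds the latest data for each sector. Index 0 means "no data / hole",
// index 1 is reserved for the backing image, and indexes >= 2 are snapshots.
func buildLocation(sectors int, snapshots [][]bool, includeBackingImage bool) []byte {
	location := make([]byte, sectors)
	if includeBackingImage {
		// BackingImage export: every sector falls back to the backing image.
		for i := range location {
			location[i] = 1
		}
	}
	// Walk snapshots from oldest to newest; newer data overrides older entries.
	for snapIdx, snap := range snapshots {
		for sector, hasData := range snap {
			if hasData {
				location[sector] = byte(snapIdx + 2)
			}
		}
	}
	return location
}

func main() {
	snap1 := []bool{true, false, false, true}
	fmt.Println(buildLocation(4, [][]bool{snap1}, false)) // clone:  [2 0 0 2]
	fmt.Println(buildLocation(4, [][]bool{snap1}, true))  // export: [2 1 1 2]
}
```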

@ChanYiLin
Contributor

ChanYiLin commented Jan 9, 2024

Volume Cloning Test - PASSED

Both Volumes are consistent

  1. Creating BackingImage img1 with Ubuntu jammy cloud image (latest version)
  2. Creating StorageClass
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880"
  backingImage: "img1"
  3. Creating PVC os-vol from img1 with claim size 5GB and 1 replica
  4. Mount os-vol as block mode in the pod
  5. Issuing filesystem expand and I/O on os-vol
    • fdisk to recreate the filesystem partition
    • resize2fs to resize the filesystem
  6. Cloning os-vol to another volume os-cloned by applying the YAML
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: os-cloned
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn-test
  resources:
    requests:
      storage: 5Gi
  dataSource:
    name: os-vol
    kind: PersistentVolumeClaim
  7. Attach both os-vol and os-cloned to the host and check that the checksums are the same

@ChanYiLin
Contributor

ChanYiLin commented Jan 9, 2024

BackingImage Export Test (without resize2fs) - FAILED

Following the same steps mentioned here, but without applying resize2fs:

  • Create BackingImageA
  • Create PVC/VolumeA and mount it to a pod
  • Recreate the partition with fdisk, but don't run resize2fs
  • Write some data
  • Export the BackingImageB and create another VolumeB with it
  • VolumeA and VolumeB are not consistent

Still inconsistent

@ChanYiLin
Contributor

BackingImage Export Test (without fdisk&resize2fs) - PASSED

Following the same steps mentioned here, but without applying fdisk and resize2fs:

  • Create BackingImageA
  • Create PVC/VolumeA and mount to pod
  • Don't recreate the partition with fdisk or resize the filesystem with resize2fs
  • Simply mount the device's original Linux filesystem (/dev/sdb1 in the following example) to /data and write some data
Disk /dev/sdb: 5 GiB, 5368709120 bytes, 10485760 sectors
Disk model: VIRTUAL-DISK    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A7AB4193-D2B4-4239-8244-0F2ABDF8C44F

Device      Start     End Sectors  Size Type
/dev/sdb1  227328 4612062 4384735  2.1G Linux filesystem
/dev/sdb14   2048   10239    8192    4M BIOS boot
/dev/sdb15  10240  227327  217088  106M EFI System
  • Export the BackingImageB and create another VolumeB with it
  • VolumeA and VolumeB are consistent

@ChanYiLin
Contributor

ChanYiLin commented Jan 9, 2024

Considering that Volume Clone (with partition recreation) works but BackingImage export (with partition recreation) doesn't,
I suspect that the partition information is somehow not correctly exported when exporting with a BackingImage.

The only difference between Volume Clone and BackingImage export is how
preload() builds replica.volume.location

replica.volume.location = array of sectorIndex -> diskIndex

Volume Clone won't consider the BackingImage index in this case, so the export won't export BackingImage data
BackingImage Export will consider the BackingImage when the sector has no new data, so it will export BackingImage data

cc @derekbit @shuo-wu

@shuo-wu
Contributor

shuo-wu commented Jan 10, 2024

Volume Clone won't consider the BackingImage index in this case, so the export won't export BackingImage data
BackingImage Export will consider the BackingImage when the sector has no new data, so it will export BackingImage data

Sorry...? Do you mean that VolumeExport won't retrieve the backing image fiemap/sector list, hence the BackingImage data won't be exported? But why does the Volume Cloning Test pass?

@ChanYiLin
Contributor

ChanYiLin commented Jan 10, 2024

Hi @shuo-wu
When doing VolumeExport here

It will create a replica and preload the replica to build the r.volume.location index map
Both Volume Clone and BackingImage Export utilize the same function

In Volume Clone, since it doesn't clone the BackingImage, preload() ignores the BackingImage by initializing the index map with "0".
In BackingImage Export, the BackingImage is considered, so preload() initializes the index map with "1", which is the index of the BackingImage.
ref: https://github.com/longhorn/longhorn-engine/blob/master/pkg/replica/replica.go#L1437

Volume Clone only compresses all the snapshots into one snapshot and syncs it to the target Volume.
BackingImage Export compresses all the snapshots + the BackingImage and syncs them to the target BackingImage.

In the test, we mount the BackingImage-backed volume to the Pod.
We recreate the partition, resize the filesystem, and write some data.

If we just do Volume Clone, then the new target Volume + Original BackingImage will be consistent
If we do BackingImage Export, then the new empty Volume + New BackingImage will be inconsistent

I am guessing something went wrong when doing preload() and SyncContent().

@ChanYiLin
Contributor

We also found that:

  1. mount the BackingImage to the Pod + write some data => Export BackingImage is consistent
  2. mount the BackingImage to the Pod + recreate partition&resizefs + write some data => Export BackingImage is inconsistent and the filesystem is corrupted

So maybe the partition information is corrupted when exporting;
that information should be in the new snapshot image and should overwrite the base BackingImage when exporting.

@shuo-wu
Contributor

shuo-wu commented Jan 10, 2024

Do the volume clone test and the failed backing image test mean that the new partitioning info can be cloned correctly via VolumeClone but cannot be exported via ExportingBackingImage?

@ChanYiLin
Contributor

ChanYiLin commented Jan 10, 2024

Yes, that is what I observed here.

And the only difference between the two is preload().

@ChanYiLin
Contributor

ChanYiLin commented Jan 10, 2024

I am wondering if it is because of the following.

assuming snapshot chain as follow
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In SyncContent() and GetDataLayout(), we check whether a sector is Data or a Hole by checking whether its DiskIndex is 0 or non-0.
If it is Data, we read the data of that interval from the replica and send it to the target.
If it is a Hole, we simply ask the target to punch a hole.

So in Case 1, we still punch holes for sectors 2 and 3.
But in Case 2, since we initialize the index map to all "1", every sector is Data and we never punch holes. Then some sectors that should be holes (because they have no data in the BackingImage either) might be filled with 0 instead of being punched as holes?
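
To make that concrete, here is a small illustrative Go sketch of the Data/Hole decision (hypothetical names; this is not the actual sparse-tools GetDataLayout() API):

```go
package main

import "fmt"

// syncPlan mimics the GetDataLayout() idea: a sector whose disk index is 0 is
// treated as a hole (the target punches a hole there); anything non-zero is
// treated as data (the source reads it and sends it to the target).
func syncPlan(location []byte) []string {
	plan := make([]string, len(location))
	for i, diskIdx := range location {
		if diskIdx == 0 {
			plan[i] = "punch-hole"
		} else {
			plan[i] = "send-data"
		}
	}
	return plan
}

func main() {
	fmt.Println(syncPlan([]byte{2, 0, 0, 2})) // clone:  [send-data punch-hole punch-hole send-data]
	fmt.Println(syncPlan([]byte{2, 1, 1, 2})) // export: [send-data send-data send-data send-data]
}
```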

GetDataLayout()

Note: If we don't recreate the partition, the exported BackingImage is consistent... but why? TestCase

cc @shuo-wu @PhanLe1010

@shuo-wu
Contributor

shuo-wu commented Jan 10, 2024

The question is, in case 2, location = [2, 1, 1, 2] means [read data from snap1, read data from BI, read data from BI, read data from snap1]. And reading data from BI should get the same result, which is a bunch of zeros. In other words, the final data set should be identical...

@ChanYiLin
Contributor

Yeah, so the question is still:
why is the partition info not exported, or corrupted, during BackingImage export…

@PhanLe1010
Contributor

PhanLe1010 commented Jan 11, 2024

This example assumes that the logical (virtual) size of the backing image is equal to the logical size of the snapshot:

assuming snapshot chain as follow
0. nil
1. BackingImage [1, 1, 0, 0]
2. Snap1 [1, 0, 0, 1]


Case 1 - Volume Clone: Export without BackingImage
first init to all 0, location = [0, 0, 0, 0]
r.volumes.preload(), location:
i = 0, nil                          skip => location = [0, 0, 0, 0]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [0, 0, 0, 0]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 0, 0, 2]

ExportVolume and GetDataLayout=> [2, 0, 0, 2] which is [Data, Hole, Hole, Data]

Case 2 - BackingImage Export: Export with BackingImage
first init to all 1, location = [1, 1, 1, 1]
r.volumes.preload(), location
i = 0, nil                          skip => location = [1, 1, 1, 1]
i = 1, BackingImage [1, 1, 0, 0],   skip => location = [1, 1, 1, 1]
i = 2, Snap1        [1, 0, 0, 1]         => location = [2, 1, 1, 2]

ExportVolume and GetDataLayout => [2, 1, 1, 2] which is [Data, Data, Data, Data]

In reality, they are not:

  • The ubuntu image has 2.2 GiB logical size:
    ➜  Downloads qemu-img info jammy-server-cloudimg-amd64.img 
    image: jammy-server-cloudimg-amd64.img
    file format: qcow2
    virtual size: 2.2 GiB (2361393152 bytes)
    disk size: 642 MiB
    cluster_size: 65536
    Format specific information:
        compat: 0.10
        compression type: zlib
        refcount bits: 16
    
  • While the Longhorn snapshot has 10GB logical size

Revisiting 2 cases:

  1. mount the BackingImage to the Pod + write some data => Export BackingImage is consistent
    -> In this case, all the writes fall into the first 2.2GB region of the backing image file
  2. mount the BackingImage to the Pod + recreate partition&resizefs + write some data => Export BackingImage is inconsistent and the filesystem is corrupted
    -> In this case, some data might fall outside of the 2.2GB region of the backing image file. So something like:
    assuming snapshot chain as follow
    0. nil
    1. BackingImage [1, 1, 0, 0]
    2. Snap1        [1, 0, 0, 1, 0, 1, 0]
    
    I am wondering if reading data from the qcow2 backing image outside of its virtual size can cause any inconsistency. Could we test with a raw backing image instead to see if we hit the same issue, @ChanYiLin?

@PhanLe1010
Contributor

PhanLe1010 commented Jan 11, 2024

Another testing idea I am thinking about is:

  1. Create an empty qcow2/raw image file of 1MB
  2. Create a volume with that backing image and the volume size is 16MB
  3. Write some data at the 2MB offset to the volume
  4. Exporting the backing image from the volume
  5. Read and compare the data from the newly exported backing image and the old volume.
    1. Which regions of the data are different?
    2. Does the newly exported backing image contain all zeros? -> If yes, it means the backing image layer overwrites the snapshot layer?

@PhanLe1010
Contributor

Thanks for the idea, I will try
But we have confirmed that using qcow2 image,
if we directly mount the filesystem, without recreating the partition, the exported BackingImage will be consistent.
Here is the test case

I think the new test idea is different from the consistent test case because the new test idea writes data outside of the backing image boundary (similar to when you recreate the partition and resize the filesystem to be bigger than the virtual size of the backing image). WDYT?

@ChanYiLin
Contributor

ChanYiLin commented Jan 11, 2024

Hi @PhanLe1010 I think you are right

Backing Image Export Test (using empty raw image with larger volume size) - FAILED

  • Create an empty raw image and upload as a BackingImage empty-raw
    • $ truncate -s 2M empty-raw
  • Create a 16MB Volume test-empty with the raw image
  • Attach to the host
  • dd 1MB of data starting at the 3MB offset
    • $ dd of=/dev/longhorn/test-empty if=/dev/urandom bs=1M count=1 oflag=seek_bytes seek=3145728
  • export the Volume to another BackingImage empty-raw-export
  • Create a 16MB Volume test-empty-export with the BackingImage empty-raw-export
  • Attach to the host

Test Result

test-empty-export is empty,
and the comparison result shows that it differs starting at the 3MB offset

brw-rw----  1 root root 8, 48 Jan 11 05:35 test-empty
brw-rw----  1 root root 8, 64 Jan 11 05:36 test-empty-export
root@kworker1:/dev/longhorn# cat test-empty-export
root@kworker1:/dev/longhorn# cmp test-empty test-empty-export
test-empty test-empty-export differ: byte 3145729, line 1
root@kworker1:/dev/longhorn# 

cc @shuo-wu

@ChanYiLin
Contributor

This actually fits the test results from @WebberHuang1118.
The cmp.log he provided also showed that the differing bytes started after 2.2GB, which is the size of the jammy Ubuntu image.
And the differences were also because those values were now "0".

@PhanLe1010
Contributor

PhanLe1010 commented Jan 11, 2024

Awesome. It means somehow the region outside of the backing image boundary (3MB offset to 4MB offset region) in the backing image layer is overwriting the snapshot layer.

Maybe the next step would be adding some logs to the preload() function to see if we can identify the faulty logic (why the backing image layer is chosen instead of the snapshot layer for the region 3MB-4MB)?

@ChanYiLin
Contributor

ChanYiLin commented Jan 11, 2024

Hi @PhanLe1010
I intentionally printed r.volume.location after the preload and also printed the segments sent during the export.
Here are the results:
Screenshot from 2024-01-11 14-00-46

As you can see, the preload location looks correct to me: there is a series of 2s, which means the data comes from the snapshot.
But during the export, it only sent one segment [D 0 4096] and then finished.

This is misleading: the 4096 here means 4096 blocks, and each block is 4096 bytes.
It is correct; it is sending 16MB in one interval because all the blocks are Data.

Note: I added the log message here; this function is executed by multiple goroutines in parallel when sending segments.
lhim_log.txt

Note:

  • VolumeSectorSize = 4096 which is the size of each sector in r.Volume.Location

@PhanLe1010
Contributor

Sorry @ChanYiLin. I will have to continue tomorrow.

@ChanYiLin
Contributor

sure no problem, thanks for your help!
I think we are pretty close to the answer

@ChanYiLin
Contributor

It seems the preload result and the sending segment intervals are correct;
now the problem might be in reading the data for the given segments?

@ChanYiLin
Contributor

ChanYiLin commented Jan 11, 2024

The sending batch size is 2MB, so for a 16MB volume
it only sends 8 batches: [0, 2MB], [2MB, 4MB], .....
I printed the read data out and found that
even when it was reading the [2MB, 4MB] range from the replica, the data was all 0.
However, I also printed out the disk index being read during ReadAt(), and it was reading from the correct disk.
Need to investigate more.

Here is the log lhim2_new2.txt

@PhanLe1010
Contributor

PhanLe1010 commented Jan 12, 2024

Interesting finding @ChanYiLin !

Even when it was reading the [2MB, 4MB] range from the replica, the data was all 0

What about when cloning? Does ReadAt() also return all 0 in the region [2MB, 4MB]?

By the way, how do you know that ReadAt() returns 0 in the region [2MB, 4MB]? From https://github.com/longhorn/longhorn/files/13905510/lhim2_new2.txt I only see newTarget 2, which indicates the snapshot layer

time="2024-01-11T09:53:54Z" level=info msg="[DEBUG in fullReadAt]: newTarget 2 for sector: 805" func="replica.(*diffDisk).fullReadAt" file="diff_disk.go:195"

@ChanYiLin
Contributor

@PhanLe1010 you can search for [DEBUG in syncDataInterval]: Sending batch: [ 512: 1024](512), dataBuffer: in the file;
that is the data sent from the source volume to the target volume.

Since the sending batch size is 2MB and the sector size is 4KB, [ 512: 1024] means [2MB: 4MB] (512 * 4KB = 2MB, 1024 * 4KB = 4MB).

@ChanYiLin
Contributor

ChanYiLin commented Jan 12, 2024

Hi @PhanLe1010
Data only exists in [3MB: 4MB].
As you can see from the previous test, the whole [2MB: 4MB] range is sent as "0".

But here is the result when using clone.
The preload location result is correct, and when sending the batch it sends Sending batch: [ 768: 1024](256), exactly 1MB, because [2MB: 3MB] is now a Hole; it didn't even try to read from that offset.
And the result is correct.
lhim1_new3.txt

I can't see where it goes wrong...

Note: the log also contains syncing of the .meta file, so just look at the image syncing.

@ChanYiLin
Contributor

ChanYiLin commented Jan 15, 2024

I think I found the bug
When reading [2MB-4MB] from the volume:

[2MB-3MB] is BackingImage
[3MB-4MB] is snap1

https://github.com/longhorn/longhorn-engine/blob/deb8b18a1558ddf4f359146b289d5469bef7cc6d/pkg/replica/diff_disk.go#L202-L241
Here is where we try to read the data:

  • If the target disk is out of bounds, it can be because of volume expansion or the BackingImage.
    • In this case, the preload location indicated that [2MB-3MB] is in the BackingImage and tried to read from it.
  • Then we directly return a count of 0, because we cannot read any data from the BackingImage into the buf in this range:
	// May read the expanded part
	if offset >= size {
		return 0, nil
	}
  • For [3MB-4MB], the target disk is correct, and we read the data from snap1 into the buf; the count would be 1MB.
  • But we truncate the returned buf to count after reading from the file here. So the returned data abandons what we read from snap1 and only keeps the first half of "0" data:
func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	return data[:n], nil
}

I wonder, do we really need to truncate it to count?
Or is the alignment here correct?
Or maybe, when reading from the expanded part, we should still return the full buffer length as the count, as if we were reading a bunch of "0"s, instead of returning a count of 0.

cc @shuo-wu @PhanLe1010 @WebberHuang1118
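
The effect of the truncation can be shown with a toy Go example (a hypothetical helper, not the actual longhorn-engine or sparse-tools code): when the first half of the requested interval falls into an out-of-bound layer, the reported count only covers the other half, and slicing the buffer to that count throws away the valid tail.

```go
package main

import "fmt"

// toyReadAt mimics the buggy interaction: the first half of the requested
// interval is "out of bound" (the backing image is smaller than the volume),
// so that part contributes nothing to the count, while the second half is
// actually read from snap1. The total count only covers the second half.
func toyReadAt(buf []byte) (int, error) {
	half := len(buf) / 2
	for i := half; i < len(buf); i++ {
		buf[i] = 0xAB // stand-in for real snapshot data
	}
	return half, nil
}

func main() {
	data := make([]byte, 8)
	n, _ := toyReadAt(data)
	fmt.Printf("full buffer: %x\n", data)     // 00000000abababab: the tail holds the real snap1 data
	fmt.Printf("data[:n]:    %x\n", data[:n]) // 00000000: truncating to n drops the snap1 data
}
```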

@ChanYiLin
Contributor

ChanYiLin commented Jan 16, 2024

TEST (By directly returning data instead of data[:n]) - PASSED

The volumes are now consistent, even when testing with the original test case:

  • cksum are the same
  • filesystem is not corrupted

Summary

I think this is the correct fix.
When sending a batch for an interval, we allocate the buf with the batch size (interval size),
then we send the data along with the interval information to the target volume so it writes the data in that interval.
We should not truncate the data to the count we actually read from the volume, because the count might be 0 in the case of volume expansion or a BackingImage.
And I don't think the count number matters in this case.
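
For reference, a sketch of that change applied to the ReadDataInterval snippet quoted earlier (same assumed imports and types as that snippet; illustrative only, not necessarily the exact upstream patch):

```go
func ReadDataInterval(rw ReaderWriterAt, dataInterval Interval) ([]byte, error) {
	data := AllocateAligned(int(dataInterval.Len()))
	n, err := rw.ReadAt(data, dataInterval.Begin)
	if err != nil {
		if err == io.EOF {
			log.Debugf("Have read at the end of file, total read: %d", n)
		} else {
			log.WithError(err).Errorf("Failed to read interval %+v", dataInterval)
			return nil, errors.Wrapf(err, "failed to read interval %+v", dataInterval)
		}
	}
	// Return the whole interval-sized buffer instead of data[:n]: a short
	// count here can simply mean part of the interval lives in an out-of-bound
	// layer (backing image or smaller snapshot) whose content is zeros, not
	// that the tail of the buffer is invalid.
	return data, nil
}
```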

@shuo-wu
Contributor

shuo-wu commented Jan 16, 2024

I agree with removing the truncation in ReadDataInterval.

On the other hand, I am wondering if we need to improve the diffDisk.fullReadAt result as well.
In the above case, we try to read the data in [2MB, 4MB], and diffDisk tells us the read count is 1MB. Intuitively, users will assume that the read content is at the beginning of the buffer and discard the latter 1MB. But actually NO...
What we can improve here is: when we read a data interval like [out-of-bound part of the backing image/smaller snapshot, data, out-of-bound part of the backing image/smaller snapshot], the returned read length should be len([out-of-bound part of the backing image/smaller snapshot, data]) rather than len([data]).

@ChanYiLin
Contributor

ChanYiLin commented Jan 16, 2024

From my perspective, I think we should still return len([data]) for diffDisk.fullReadAt
The reason is that, from the caller's perspective, it is reading from the "Volume"

func (r *Replica) ReadAt(buf []byte, offset int64) (int, error) {
	r.RLock()
	c, err := r.volume.ReadAt(buf, offset) // which will call diffDisk.fullReadAt()
	r.RUnlock()
	return c, err
}

So it has no knowledge about the snapshot chain underneath.

If it reads outside the boundary of the "Volume", we do return an io.ErrUnexpectedEOF error here.
But if it only reads outside the boundary of a specific snapshot image, and we allow it to do so, then we should treat it as reading a bunch of "0"s, and that is also true: the caller does get a bunch of "0"s.
The caller should not have to handle this situation.
It would be confusing to get no EOF error and yet a data length that doesn't match the request.

@shuo-wu
Contributor

shuo-wu commented Jan 17, 2024

After discussing with @ChanYiLin, we agree that Longhorn should return the length the caller wants to read if there is no EOF error.
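
A tiny, self-contained Go sketch of those semantics (illustrative only; the function name and layout are hypothetical, not the actual diffDisk code): a read that stays within the volume boundary always reports the full requested length, with any part beyond a smaller layer read as zeros.

```go
package main

import (
	"fmt"
	"io"
)

// readAtWithinVolume: as long as the request stays within the volume boundary,
// the returned count equals the requested length, and any part that falls
// beyond a smaller layer (backing image or smaller snapshot) simply reads as
// zeros. Only reads past the volume itself report an error.
func readAtWithinVolume(volumeSize, layerSize int64, buf []byte, offset int64) (int, error) {
	if offset+int64(len(buf)) > volumeSize {
		return 0, io.ErrUnexpectedEOF // outside the volume: real error
	}
	for i := range buf {
		if offset+int64(i) < layerSize {
			buf[i] = 0xCD // stand-in for data actually stored in the layer
		} else {
			buf[i] = 0 // beyond the layer: zeros, but still counted
		}
	}
	return len(buf), nil
}

func main() {
	buf := make([]byte, 4)
	n, err := readAtWithinVolume(16, 2, buf, 0)
	fmt.Println(n, err, buf) // 4 <nil> [205 205 0 0]
}
```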

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 18, 2024

Pre Ready-For-Testing Checklist

@chriscchien chriscchien self-assigned this Jan 18, 2024
@chriscchien
Contributor

Hi @ChanYiLin ,

Following the test steps, this issue is verified as passing on master-head, but on v1.6.x-head the data of the two backing images was still not consistent.

The master-head engine image's build time is later than v1.6.x-head's; I am not sure if this means that the v1.6.x engine does not contain your fix. Could you help to check? Thank you.

> docker images
REPOSITORY                   TAG           IMAGE ID       CREATED       SIZE
longhornio/longhorn-engine   master-head   16d097e2f0e6   4 hours ago   343MB
longhornio/longhorn-engine   v1.6.x-head   1c7202bbd25e   5 hours ago   343MB

@innobead
Member

@chriscchien The 1.6 PR was just merged, so please wait for https://drone-publish.longhorn.io/longhorn/longhorn-engine/810 to finish.

@chriscchien
Contributor

Verified as passing on longhorn master (longhorn-engine ac4748) and longhorn v1.6.x (longhorn-engine 0b553a) with the test steps.

I could reproduce this issue before the fix; after the fix, the data of the two volumes is identical.
