[BUG] timestamp or checksum not matched in test_snapshot_hash_detect_corruption test case #6145

Closed
yangchiu opened this issue Jun 17, 2023 · 8 comments
Labels
area/snapshot (Volume snapshot, in-cluster snapshot or external backup), area/volume-data-integrity (Volume data integrity related), investigation-needed (Need to identify the case before estimating and starting the development), kind/bug, priority/0 (Must be fixed in this release, managed by PO), reproduce/often (80 - 50% reproducible)

Comments

@yangchiu
Member

Describe the bug (🐛 if you encounter this issue)

In test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode, check_snapshot_checksums_and_change_timestamps checks the checksum value and the ctime of the checksum file before corrupting the snapshot:

                # Check checksums in snapshot resource and the calculated value
                # are matched
                checksum = get_checksum_from_snapshot_disk_file(data_path,
                                                                s.name)
                print(f'snapshot {s.name}: '
                      f'checksum in resource={s.checksum}, '
                      f'checksum recorded={checksum}')
                assert checksum == s.checksum

                # Check ctime in checksum file and from stat are matched
                ctime_recorded = get_ctime_in_checksum_file(disk_path)
                ctime = get_ctime_from_snapshot_disk_file(data_path, s.name)

                print(f'snapshot {s.name}: '
                      f'ctime recorded={ctime_recorded}, '
                      f'ctime={ctime}')
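
For context, the helper functions above conceptually recompute the snapshot's checksum from the disk file and read its change time from stat. The following is only a rough sketch of that idea; the hash algorithm, chunked reading, and helper names are assumptions for illustration, not the actual longhorn-tests implementation:

    import hashlib
    import os

    def compute_disk_file_checksum(snapshot_disk_path, chunk_size=1024 * 1024):
        # Hash the snapshot disk file in chunks so large files are not
        # loaded into memory at once (SHA-512 is an assumption here).
        h = hashlib.sha512()
        with open(snapshot_disk_path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_disk_file_ctime(snapshot_disk_path):
        # Change time of the disk file as reported by the filesystem.
        return os.stat(snapshot_disk_path).st_ctime

    # The test then asserts these freshly computed values equal the checksum
    # stored in the snapshot resource and the ctime recorded in the checksum file.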

But this check fails randomly. Sometimes the checksum does not match:
https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/524/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/64/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
Or the ctime of the checksum file does not match:
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/59/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-arm64/15/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/6/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/12/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/62/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/

It could be hard to reproduce manually because of the tedious and time-consuming test setup, and another issue also affects this test case: #6129. So if the test case fails, it could be due to either the issue addressed in this ticket or the one addressed in #6129.

This issue could have been introduced after v1.5.0-rc2; at least, we didn't observe it in v1.5.0-rc1.

To Reproduce

Run test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode

Expected behavior

The checksum recorded in the snapshot resource and the ctime recorded in the checksum file should match the values computed from the snapshot disk file, so check_snapshot_checksums_and_change_timestamps should pass before the snapshot is corrupted.


@derekbit
Member

IIRC, we didn't change the data plane logic, so snapshots might somehow be corrupted by other mechanisms or changes.

BTW, it is good that the checksum logic successfully caught the corrupted snapshots.

@derekbit
Member

Also found a gRPC error when instance-manager is talking to longhorn-engine:

time="2023-06-17T03:03:34Z" level=error msg="Error running ssync server" error="listen tcp :10010: bind: address already in use"
[longhorn-testvol-y0s8nr-r-ecd8355c] time="2023-06-17T03:05:04Z" level=error msg="Shutting down the server since it is idle for 1m30s"
[longhorn-testvol-y0s8nr-r-ecd8355c] time="2023-06-17T03:05:34Z" level=error msg="sync agent gRPC server failed to rebuild replica/sync files" error="replica tcp://10.42.3.6:10000 failed to send file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img to 10.42.2.7:10010: failed to send file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img to 10.42.2.7:10010: rpc error: code = Unknown desc = failed to sync content for source file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img: failed to open: failed to open server: Get \"http://10.42.2.7:10010/v1-ssync/open?begin=0&directIO=true&end=2147483648\": net/http: HTTP/1.x transport connection broken: malformed HTTP response \"\\x00\\x00\\x06\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00@\\x00\""
2023/06/17 03:05:34 ERROR: [core] [Server #1] grpc: server failed to encode response: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
[longhorn-instance-manager] time="2023-06-17T03:05:34Z" level=info msg="Removing replica" engineName=longhorn-testvol-y0s8nr-e-577e28dd replicaAddress="tcp://10.42.2.7:10000" replicaName= serviceURL="10.42.2.7:10010"
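
The first log line means the ssync server could not bind port 10010 because some other process (likely a previous ssync server that had not shut down yet) was still listening on it. A quick, hypothetical way to check that condition from Python (not a Longhorn utility; the port number is taken from the log above):

    import socket

    def tcp_port_is_free(port, host=""):
        # Try to bind the port; an OSError (EADDRINUSE) means another
        # process is still listening on it.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return True
            except OSError:
                return False

    print("port 10010 free:", tcp_port_is_free(10010))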

@innobead
Member

Describe the bug (🐛 if you encounter this issue)

In test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode, check_snapshot_checksums_and_change_timestamps checks the checksum value and the ctime of the checksum file before corrupting the snapshot:

Does this mean the test case failed just before corrupting the snapshot? If so, the corruption-detection path itself was not actually exercised yet.

/manager/integration/tests/test_snapshot.py#L328-L339

    # Step 2
    create_snapshots(client, volume, 1536, 3)

    # Step 3
    assert check_snapshot_checksums_and_change_timestamps(volume) # <---- failed here

    # Step 4
    snapshot_name = get_available_snapshot(volume)
    assert snapshot_name != ""

    assert corrupt_snapshot_on_local_host(volume, snapshot_name)

@derekbit Would it be possible that the checksum and ctime saved in the checksum file become inconsistent with the snapshot at runtime somehow? I thought the checksum file should only be updated after the snapshot becomes immutable.

@innobead innobead added flaky-test, area/snapshot, area/volume-data-integrity, investigation-needed labels Jun 17, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Jun 17, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Jun 17, 2023
@innobead
Member

innobead commented Jun 17, 2023

We probably need to review the test case, or check whether there is any chance the checksum file becomes inconsistent with the snapshot disk file at runtime.
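
If a transient inconsistency on the test side were suspected, one hypothetical hardening would be to retry the comparison for a short while before failing. The helper name below is taken from the test excerpt above and is assumed to be importable in the test module; timings are purely illustrative:

    import time

    def checksum_eventually_matches(snapshot, data_path, retries=5, delay=2):
        # Re-read the recorded checksum a few times to tolerate a checksum
        # file that might still be being written right after snapshot creation.
        for _ in range(retries):
            recorded = get_checksum_from_snapshot_disk_file(data_path, snapshot.name)
            if recorded == snapshot.checksum:
                return True
            time.sleep(delay)
        return False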

@derekbit
Member

The test case is okay. I've found the root cause. Will update later.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jun 17, 2023

Pre Ready-For-Testing Checklist

@derekbit
Member

@derekbit Would it be possible that the checksum and ctime saved in the checksum file become inconsistent with the snapshot at runtime somehow? I thought the checksum file should only be updated after the snapshot becomes immutable.

No, a snapshot's checksum and ctime should be immutable; otherwise, it indicates there are bugs in the data engine.

@yangchiu
Member Author

Verified passed on master-head (longhorn-manager a3b16b8) and v1.5.x-head (longhorn-manager 6f26265) by running the test cases test_snapshot_hash_detect_corruption_in_global_enabled_mode and test_snapshot_hash_detect_corruption_in_global_fast_check_mode.

Run test_snapshot_hash_detect_corruption_in_global_enabled_mode 30 times on master-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4264/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4265/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4266/

Run test_snapshot_hash_detect_corruption_in_global_fast_check_mode 30 times on master-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4238/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4241/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4243/

Run test_snapshot_hash_detect_corruption_in_global_enabled_mode 30 times on v1.5.x-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4267/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4269/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4270/

Run test_snapshot_hash_detect_corruption_in_global_fast_check_mode 30 times on v1.5.x-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4268/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4271/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4272/

@yangchiu yangchiu assigned yangchiu and unassigned roger-ryao Jun 20, 2023