[BUG] timestamp or checksum not matched in test_snapshot_hash_detect_corruption test case #6145

Closed
yangchiu opened this issue Jun 17, 2023 · 8 comments
Labels
area/snapshot (Volume snapshot, in-cluster snapshot or external backup), area/volume-data-integrity (Volume data integrity related), investigation-needed (Need to identify the case before estimating and starting the development), kind/bug, priority/0 (Must be fixed in this release, managed by PO), reproduce/often (80 - 50% reproducible)

Comments

@yangchiu
Member

Describe the bug (🐛 if you encounter this issue)

In test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode, check_snapshot_checksums_and_change_timestamps checks the checksum value and the ctime of the checksum file before corrupting the snapshot:

                # Check checksums in snapshot resource and the calculated value
                # are matched
                checksum = get_checksum_from_snapshot_disk_file(data_path,
                                                                s.name)
                print(f'snapshot {s.name}: '
                      f'checksum in resource={s.checksum}, '
                      f'checksum recorded={checksum}')
                assert checksum == s.checksum

                # Check ctime in checksum file and from stat are matched
                ctime_recorded = get_ctime_in_checksum_file(disk_path)
                ctime = get_ctime_from_snapshot_disk_file(data_path, s.name)

                print(f'snapshot {s.name}: '
                      f'ctime recorded={ctime_recorded}, '
                      f'ctime={ctime}')
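
For context, the helper functions above conceptually recompute the snapshot's checksum from the disk file and read its change time from stat. The following is only a rough sketch of that idea; the hash algorithm, chunked reading, and helper names are assumptions for illustration, not the actual longhorn-tests implementation:

    import hashlib
    import os

    def compute_disk_file_checksum(snapshot_disk_path, chunk_size=1024 * 1024):
        # Hash the snapshot disk file in chunks so large files are not
        # loaded into memory at once (SHA-512 is an assumption here).
        h = hashlib.sha512()
        with open(snapshot_disk_path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_disk_file_ctime(snapshot_disk_path):
        # Change time of the disk file as reported by the filesystem.
        return os.stat(snapshot_disk_path).st_ctime

    # The test then asserts these freshly computed values equal the checksum
    # stored in the snapshot resource and the ctime recorded in the checksum file.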

But this check fails randomly. Sometimes the checksum does not match:
https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/524/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/64/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
Or the ctime of the checksum file does not match:
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/59/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-arm64/15/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/6/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/12/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/62/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/

It could be hard to reproduce manually because of the tedious and time-consuming test setup, and another issue also affects this test case: #6129. So if the test case fails, it could be due to either the issue addressed in this ticket or the one addressed in #6129.

This issue could have been introduced after v1.5.0-rc2; at least, we didn't observe it in v1.5.0-rc1.

To Reproduce

Run test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode

Expected behavior

The checksum recorded in the snapshot resource and the ctime recorded in the checksum file should match the values computed from the snapshot disk file, so check_snapshot_checksums_and_change_timestamps should pass before the snapshot is corrupted.


@derekbit
Member

IIRC, we didn't change the data plane logic, so snapshots might somehow be corrupted by other mechanisms or changes.

BTW, it is good that the checksum logic successfully caught the corrupted snapshots.

@derekbit
Member

Also found a gRPC error when instance-manager is talking to longhorn-engine:

time="2023-06-17T03:03:34Z" level=error msg="Error running ssync server" error="listen tcp :10010: bind: address already in use"
[longhorn-testvol-y0s8nr-r-ecd8355c] time="2023-06-17T03:05:04Z" level=error msg="Shutting down the server since it is idle for 1m30s"
[longhorn-testvol-y0s8nr-r-ecd8355c] time="2023-06-17T03:05:34Z" level=error msg="sync agent gRPC server failed to rebuild replica/sync files" error="replica tcp://10.42.3.6:10000 failed to send file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img to 10.42.2.7:10010: failed to send file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img to 10.42.2.7:10010: rpc error: code = Unknown desc = failed to sync content for source file volume-snap-db0e2c9d-420c-4047-ac64-d1e6bf4873ea.img: failed to open: failed to open server: Get \"http://10.42.2.7:10010/v1-ssync/open?begin=0&directIO=true&end=2147483648\": net/http: HTTP/1.x transport connection broken: malformed HTTP response \"\\x00\\x00\\x06\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00@\\x00\""
2023/06/17 03:05:34 ERROR: [core] [Server #1] grpc: server failed to encode response: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
[longhorn-instance-manager] time="2023-06-17T03:05:34Z" level=info msg="Removing replica" engineName=longhorn-testvol-y0s8nr-e-577e28dd replicaAddress="tcp://10.42.2.7:10000" replicaName= serviceURL="10.42.2.7:10010"
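
The first log line means the ssync server could not bind port 10010 because some other process (likely a previous ssync server that had not shut down yet) was still listening on it. A quick, hypothetical way to check that condition from Python (not a Longhorn utility; the port number is taken from the log above):

    import socket

    def tcp_port_is_free(port, host=""):
        # Try to bind the port; an OSError (EADDRINUSE) means another
        # process is still listening on it.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return True
            except OSError:
                return False

    print("port 10010 free:", tcp_port_is_free(10010))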

@innobead
Member

Describe the bug (🐛 if you encounter this issue)

In test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode, check_snapshot_checksums_and_change_timestamps checks the checksum value and the ctime of the checksum file before corrupting the snapshot:

Does this mean the test case failed just before corrupting the snapshot? If so, the corruption-detection path itself was not actually exercised yet.

/manager/integration/tests/test_snapshot.py#L328-L339

    # Step 2
    create_snapshots(client, volume, 1536, 3)

    # Step 3
    assert check_snapshot_checksums_and_change_timestamps(volume) # <---- failed here

    # Step 4
    snapshot_name = get_available_snapshot(volume)
    assert snapshot_name != ""

    assert corrupt_snapshot_on_local_host(volume, snapshot_name)

@derekbit Would it be possible that the checksum and ctime saved in the checksum file become inconsistent with the snapshot at runtime somehow? I thought the checksum file should only be updated after the snapshot becomes immutable.

@innobead innobead added flaky-test, area/snapshot, area/volume-data-integrity, investigation-needed labels Jun 17, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Jun 17, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Jun 17, 2023
@innobead
Member

innobead commented Jun 17, 2023

We probably need to review the test case, or check whether there is any chance the checksum file becomes inconsistent with the snapshot disk file at runtime.
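
If a transient inconsistency on the test side were suspected, one hypothetical hardening would be to retry the comparison for a short while before failing. The helper name below is taken from the test excerpt above and is assumed to be importable in the test module; timings are purely illustrative:

    import time

    def checksum_eventually_matches(snapshot, data_path, retries=5, delay=2):
        # Re-read the recorded checksum a few times to tolerate a checksum
        # file that might still be being written right after snapshot creation.
        for _ in range(retries):
            recorded = get_checksum_from_snapshot_disk_file(data_path, snapshot.name)
            if recorded == snapshot.checksum:
                return True
            time.sleep(delay)
        return False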

@derekbit
Member

The test case is okay. I've found the root cause. Will update later.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jun 17, 2023

Pre Ready-For-Testing Checklist

@derekbit
Member

@derekbit Would it be possible that the checksum and ctime saved in the checksum file become inconsistent with the snapshot at runtime somehow? I thought the checksum file should only be updated after the snapshot becomes immutable.

No, a snapshot's checksum and ctime should be immutable; otherwise, it indicates there are bugs in the data engine.

@yangchiu
Member Author

Verified passed on master-head (longhorn-manager a3b16b8) and v1.5.x-head (longhorn-manager 6f26265) by running the test cases test_snapshot_hash_detect_corruption_in_global_enabled_mode and test_snapshot_hash_detect_corruption_in_global_fast_check_mode.

Run test_snapshot_hash_detect_corruption_in_global_enabled_mode 30 times on master-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4264/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4265/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4266/

Run test_snapshot_hash_detect_corruption_in_global_fast_check_mode 30 times on master-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4238/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4241/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4243/

Run test_snapshot_hash_detect_corruption_in_global_enabled_mode 30 times on v1.5.x-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4267/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4269/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4270/

Run test_snapshot_hash_detect_corruption_in_global_fast_check_mode 30 times on v1.5.x-head, all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4268/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4271/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4272/

@yangchiu yangchiu assigned yangchiu and unassigned roger-ryao Jun 20, 2023