Fast way to verify and eliminate corrupted data files #6694

@zhangyue19921010

In production, dirty files can appear (for example, from unexpected commits during distributed writes), and files can be corrupted by storage-layer problems. We need a fast, convenient way to identify such files and, if necessary, remove them from the manifest so that a Lance table can be made available again quickly.

For instance, errors like the following occasionally occur in online environments:

External error: Wrapped error: LanceError(IO): Object at location default.db/xxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance not found: 
Error performing GET https://xxxx/default.db/xxxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance in 2.891921ms 
Server returned non-2xx status code: 404 Not Found: <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
<Error><Code>NoSuchKey</Code><RequestId>xxxx</RequestId><HostId>xxxx</HostId><Message>The specified key does not exist.</Message><EC>xxxxx</EC>
<Key>default.db/xxxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance</Key></Error>, 
/rustc/f8297e351a40c1439a467bbbb6879088047f50b3/library/core/src/ops/function.rs:250:5, /home/runner/work/lance/lance/rust/lance-io/src/scheduler.rs:395:17

With the capabilities provided by the current PR, dirty files can be removed from the manifest quickly:

./target/debug/lance-tools \
dataset repair-manifest \
-s "$LANCE_TOS_DATASET_ROOT" \
--storage-options "$LANCE_STORAGE_OPTIONS_JSON" \
--remove-data-files "FILE1,FILE2"

The PR also provides data file verification, as well as the ability to quickly locate files and perform rollbacks.
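To illustrate the kind of check that precedes a repair, here is a minimal Python sketch. It is not the PR's implementation. It assumes the dataset lives on a local filesystem with data files under `<dataset_root>/data/` (matching the path layout in the error above), and that the caller already has the list of data file names referenced by the manifest. The function names `find_missing_data_files` and `remove_data_files_arg` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_data_files(dataset_root: str, referenced: Iterable[str]) -> List[str]:
    """Return referenced data file names that are absent or empty.

    Assumption: data files are stored under <dataset_root>/data/ on a
    local filesystem. An object store would need a HEAD/list call instead.
    """
    data_dir = Path(dataset_root) / "data"
    dirty = []
    for name in referenced:
        path = data_dir / name
        # A missing or zero-byte file is treated as dirty in this sketch.
        if not path.is_file() or path.stat().st_size == 0:
            dirty.append(name)
    return dirty


def remove_data_files_arg(names: Iterable[str]) -> str:
    """Join file names into the comma-separated form used by --remove-data-files."""
    return ",".join(names)
```

The string returned by `remove_data_files_arg` could then be passed as the value of `--remove-data-files` in the command above. This sketch only detects missing or empty files; detecting corrupted content would require actually reading and validating each file.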
