In production, dirty files can appear (for example, from unexpected commits during distributed writes), and files can be corrupted by storage-layer problems. We need a fast, convenient way to identify dirty files and, when necessary, remove them from the Manifest so that Lance tables become available again quickly.
For example, errors like the following occasionally occur in online environments:
External error: Wrapped error: LanceError(IO): Object at location default.db/xxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance not found:
Error performing GET https://xxxx/default.db/xxxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance in 2.891921ms
Server returned non-2xx status code: 404 Not Found: <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
<Error><Code>NoSuchKey</Code><RequestId>xxxx</RequestId><HostId>xxxx</HostId><Message>The specified key does not exist.</Message><EC>xxxxx</EC>
<Key>default.db/xxxx.lance/data/0010001011101101010000109245a64c729acf551b4a3bc022.lance</Key></Error>,
/rustc/f8297e351a40c1439a467bbbb6879088047f50b3/library/core/src/ops/function.rs:250:5, /home/runner/work/lance/lance/rust/lance-io/src/scheduler.rs:395:17
With this PR, dirty files can be removed from the Manifest quickly using the `repair-manifest` command:
./target/debug/lance-tools \
dataset repair-manifest \
-s "$LANCE_TOS_DATASET_ROOT" \
--storage-options "$LANCE_STORAGE_OPTIONS_JSON" \
--remove-data-files "FILE1,FILE2"
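The list passed to `--remove-data-files` can be collected from an error log like the one above rather than copied by hand. The following is only a sketch: `extract_missing_files` is a hypothetical helper that is not part of this PR, and it assumes the log has been saved to a file, that each missing file is reported as `Object at location <path> not found`, and that the flag accepts bare file names (if it expects a different form, adjust or drop the step that strips the directory part).

```shell
#!/bin/sh
# Hypothetical helper (not part of lance-tools): print a comma-separated,
# de-duplicated list of data files reported as "not found" in a saved log.
extract_missing_files() {
  log_file="$1"
  # 1. Keep only the "Object at location <path> not found" fragments.
  # 2. Strip the surrounding words, leaving the object path.
  # 3. Drop the directory part (assumption: bare file names are wanted).
  # 4. De-duplicate and join the names with commas.
  grep -oE 'Object at location [^ ]+\.lance not found' "$log_file" \
    | sed -E 's/^Object at location (.*) not found$/\1/' \
    | sed -E 's#.*/##' \
    | sort -u \
    | paste -sd, -
}
```

Assuming the log is saved as `error.log`, the result could be passed as `--remove-data-files "$(extract_missing_files error.log)"`. It is worth reviewing the list before running the repair, since removing a file from the Manifest changes what the table contains.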
The PR also supports verifying data files, quickly locating files, and rolling back.