Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Found when running CI integration tests for pingcap/tidb-engine-ext#457 under contrib/tiflash-proxy:
cd contrib/tiflash-proxy
make ci_test
# Or run proxy_tests only:
cargo test -p proxy_tests --test proxy
Under parallel CI or high load, the following four tests fail intermittently (process exits with SIGABRT). Two are temporarily marked #[ignore] in tests; two can still block CI.
| Test |
File |
Current status |
shared::fast_add_peer::simple::simple_normal::test_simple_from_invalid_source |
proxy_tests/proxy/shared/fast_add_peer/simple.rs |
#[ignore] |
shared::fast_add_peer::fp::test_overlap_apply_tikv_snap_in_the_middle |
proxy_tests/proxy/shared/fast_add_peer/fp.rs |
#[ignore] |
shared::fast_add_peer::fp::test_prehandle_snapshot_after_restart_reset |
proxy_tests/proxy/shared/fast_add_peer/fp.rs |
Not ignored; may still fail |
shared::snapshot::test_split_merge |
proxy_tests/proxy/shared/snapshot.rs |
Not ignored; may still fail |
1) test_simple_from_invalid_source
- Scenario: FAP selects an invalid source (
fap_mock_add_peer_from_id=100), then falls back to the slow-path snapshot flow.
- Failure:
components/raftstore/src/store/peer_storage.rs:1141
- Assertion:
assert_eq!(apply_state, last_applied_state) — when generating a snapshot, apply_state in KV does not match the in-memory peer state (e.g. KV index=10, memory index=11).
- Related failpoint:
on_pre_write_apply_state=return(true) (tests persist apply state early for leader snapshot logic).
2) test_overlap_apply_tikv_snap_in_the_middle
- Scenario: On a mono-store (raftstore v1 single RocksDB), FAP snapshot and TiKV snapshot apply overlap; store 3 previously had region 1 (
k1=v1), and after split region 1000 writes k1=v13.
- Failure:
must_get_finally / check_key_ex — expected k1=v13, got k1=v1.
- Test comment:
Raftstore v1 is mono store, so there could be v1 written by old_one_1_k3 (fp.rs, ~line 554).
3) test_prehandle_snapshot_after_restart_reset
- Scenario: 3-store cluster with FAP enabled. Store 2 joins via FAP (invalid source id 4), then
k3=v3 is written. Store 3 joins via FAP from store 2 while on_ob_pre_handle_snapshot_s3=pause delays TiKV snapshot prehandle. Test waits for FAP phase1 Persisted, resets cached_region_info.inited_or_fallback (kind=2, simulating restart), removes pause, sets fap_core_no_prehandle=panic, then checks store 3 disk has k3=v3.
- Failure:
fp.rs:289 / must_get — can't get value Some("v3") for key 6B33, actual=None.
- Observed race (from
build-log.4.txt):
- FAP
copy_data_from store 2 → store 3 only copies k1=v1, missing k3=v3, because store 2's mock kvstore does not yet contain k3 when FAP runs.
- Store 2 may catch up via TiKV snapshot idx=9 with
key_count: 1 (k1 only); leader then sends empty MsgAppend index:9 without re-sending the k3 put entry.
- TiKV snapshot idx=10 prehandle for store 3 stays paused; after FAP flush (k1 only), resume +
fap_core_no_prehandle skips TiKV prehandle when kvstore_region_exist==true, so k3 never reaches disk (no post apply snapshot for peer 3003 in failing logs).
- Contrast with passing run:
copy_data_from includes both k1 and k3; leader explicitly sends k3 entry in MsgAppend to store 2 before store 3 FAP copy.
4) test_split_merge
- Scenario: After split and merge, the test checks immediately after
must_merge that each node's mock kvstore no longer contains the source region (e.g. region 1000).
- Failure:
proxy_tests/proxy/shared/snapshot.rs:346 — node {id} should has no region {source_id}.
- Timing:
must_merge only waits for PD-side merge completion; store 1's CommitMerge / ffi_handle_destroy can finish later than PD returns (logs show store 2/3 ~11.120, store 1 ~11.147–11.150). In the same repo, normal.rs sleeps 1000ms after merge and test_handle_destroy sleeps 100ms, but test_split_merge has no wait.
2. What did you expect to see? (Required)
make ci_test / proxy_tests should pass reliably; these cases should not be flaky due to timing or mono-store stale data.
- Mock engine-store and raftstore state should stay consistent across merge, FAP fallback, overlapping snapshot apply, and FAP + paused TiKV snapshot prehandle resume.
3. What did you see instead (Required)
CI logs intermittently show failures during the proxy_tests stage (build succeeds):
test_simple_from_invalid_source: peer_storage.rs:1141 apply_state mismatch; snap-generator thread panics.
test_overlap_apply_tikv_snap_in_the_middle: can't get value Some("v13") for key 6B31 (must_get_finally).
test_prehandle_snapshot_after_restart_reset: can't get value Some("v3") for key 6B33 (must_get at fp.rs:289).
test_split_merge: node 1 should has no region 1000 (source region still present in store 1's kvstore).
Suggested fix directions (for tracking)
test_split_merge (test / mock, higher priority)
- After
must_merge, poll until all stores satisfy !kvstore.contains_key(source_id) (similar to must_wait_until_cond_generic), or add a wait like normal.rs.
- Optional: in mock, do not recreate merged/destroyed regions via the
handle_admin Vacant path.
test_prehandle_snapshot_after_restart_reset (test / FAP source readiness)
- Before adding store 3, wait until store 2 has
k3=v3 on disk (e.g. check_key(..., b"k3", b"v3", None, Some(true), Some(vec![2]))), so FAP copy is not racing store 2 apply.
- After removing
on_ob_pre_handle_snapshot_s3 pause, poll until TiKV snapshot apply finishes on store 3 before asserting disk keys.
- Short term: mark
#[ignore] if needed to unblock CI.
test_prehandle_snapshot_after_restart_reset (product / mock)
- FAP
copy_data_from should read from persisted engine state (or wait until source peer data at applied_index is fully visible), not a racing in-memory mock kvstore snapshot.
- When TiKV snapshot prehandle resumes after FAP has partially written data, merge/apply missing keys from the TiKV SST instead of skipping prehandle due to
kvstore_region_exist + fap_core_no_prehandle.
test_simple_from_invalid_source (product / FAP)
- In
fap_fallback_to_slow / fap_fallback_to_slow_with_lock (engine_store_ffi/src/core/fast_add_peer.rs), sync KV apply_state on fallback so slow-path snapshot generation matches the in-memory applied index.
test_overlap_apply_tikv_snap_in_the_middle (mock snapshot / mono-store)
- When applying a snapshot, clear overlapping stale data from other regions on the same store by key range (
write_snapshot_to_db_data / ffi_apply_prehandled_snapshot), so region 1's k1=v1 does not shadow region 1000's k1=v13.
4. What is your TiFlash version? (Required)
- Observed in
contrib/tiflash-proxy CI (make ci_test), on branches that integrate tidb-engine-ext proxy_tests (e.g. release-8.5 cherry-picks).
- TiFlash main-repo version: matches the integrated branch at build time (CI logs show commit-based git_hash, not a fixed release tag).
Reference paths (in tiflash repo)
contrib/tiflash-proxy/proxy_tests/proxy/shared/snapshot.rs — test_split_merge
contrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/simple.rs — test_simple_from_invalid_source
contrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/fp.rs — test_overlap_apply_tikv_snap_in_the_middle, test_prehandle_snapshot_after_restart_reset, prehandle_snapshot_after_restart
contrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_core.rs — copy_data_from
contrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_write_impls.rs — CommitMerge kvstore.remove
contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/fap_snapshot.rs — FAP snapshot skip / apply
contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/snapshot.rs — on_ob_pre_handle_snapshot_s3, fap_core_no_prehandle
contrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/fast_add_peer.rs — FAP fallback
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Found when running CI integration tests for pingcap/tidb-engine-ext#457 under
contrib/tiflash-proxy:Under parallel CI or high load, the following four tests fail intermittently (process exits with
SIGABRT). Two are temporarily marked#[ignore]in tests; two can still block CI.shared::fast_add_peer::simple::simple_normal::test_simple_from_invalid_sourceproxy_tests/proxy/shared/fast_add_peer/simple.rs#[ignore]shared::fast_add_peer::fp::test_overlap_apply_tikv_snap_in_the_middleproxy_tests/proxy/shared/fast_add_peer/fp.rs#[ignore]shared::fast_add_peer::fp::test_prehandle_snapshot_after_restart_resetproxy_tests/proxy/shared/fast_add_peer/fp.rsshared::snapshot::test_split_mergeproxy_tests/proxy/shared/snapshot.rs1)
test_simple_from_invalid_sourcefap_mock_add_peer_from_id=100), then falls back to the slow-path snapshot flow.components/raftstore/src/store/peer_storage.rs:1141assert_eq!(apply_state, last_applied_state)— when generating a snapshot,apply_statein KV does not match the in-memory peer state (e.g. KV index=10, memory index=11).on_pre_write_apply_state=return(true)(tests persist apply state early for leader snapshot logic).2)
test_overlap_apply_tikv_snap_in_the_middlek1=v1), and after split region 1000 writesk1=v13.must_get_finally/check_key_ex— expectedk1=v13, gotk1=v1.Raftstore v1 is mono store, so there could be v1 written by old_one_1_k3(fp.rs, ~line 554).3)
test_prehandle_snapshot_after_restart_resetk3=v3is written. Store 3 joins via FAP from store 2 whileon_ob_pre_handle_snapshot_s3=pausedelays TiKV snapshot prehandle. Test waits for FAP phase1Persisted, resetscached_region_info.inited_or_fallback(kind=2, simulating restart), removes pause, setsfap_core_no_prehandle=panic, then checks store 3 disk hask3=v3.fp.rs:289/must_get—can't get value Some("v3") for key 6B33, actual=None.build-log.4.txt):copy_data_fromstore 2 → store 3 only copiesk1=v1, missingk3=v3, because store 2's mock kvstore does not yet containk3when FAP runs.key_count: 1(k1 only); leader then sends emptyMsgAppend index:9without re-sending the k3 put entry.fap_core_no_prehandleskips TiKV prehandle whenkvstore_region_exist==true, so k3 never reaches disk (nopost apply snapshotfor peer 3003 in failing logs).copy_data_fromincludes both k1 and k3; leader explicitly sends k3 entry inMsgAppendto store 2 before store 3 FAP copy.4)
test_split_mergemust_mergethat each node's mockkvstoreno longer contains the source region (e.g. region 1000).proxy_tests/proxy/shared/snapshot.rs:346—node {id} should has no region {source_id}.must_mergeonly waits for PD-side merge completion; store 1's CommitMerge /ffi_handle_destroycan finish later than PD returns (logs show store 2/3 ~11.120, store 1 ~11.147–11.150). In the same repo,normal.rssleeps 1000ms after merge andtest_handle_destroysleeps 100ms, buttest_split_mergehas no wait.2. What did you expect to see? (Required)
make ci_test/proxy_testsshould pass reliably; these cases should not be flaky due to timing or mono-store stale data.3. What did you see instead (Required)
CI logs intermittently show failures during the
proxy_testsstage (build succeeds):test_simple_from_invalid_source:peer_storage.rs:1141apply_state mismatch; snap-generator thread panics.test_overlap_apply_tikv_snap_in_the_middle:can't get value Some("v13") for key 6B31(must_get_finally).test_prehandle_snapshot_after_restart_reset:can't get value Some("v3") for key 6B33(must_getatfp.rs:289).test_split_merge:node 1 should has no region 1000(source region still present in store 1'skvstore).Suggested fix directions (for tracking)
test_split_merge(test / mock, higher priority)must_merge, poll until all stores satisfy!kvstore.contains_key(source_id)(similar tomust_wait_until_cond_generic), or add a wait likenormal.rs.handle_adminVacant path.test_prehandle_snapshot_after_restart_reset(test / FAP source readiness)k3=v3on disk (e.g.check_key(..., b"k3", b"v3", None, Some(true), Some(vec![2]))), so FAP copy is not racing store 2 apply.on_ob_pre_handle_snapshot_s3pause, poll until TiKV snapshot apply finishes on store 3 before asserting disk keys.#[ignore]if needed to unblock CI.test_prehandle_snapshot_after_restart_reset(product / mock)copy_data_fromshould read from persisted engine state (or wait until source peer data atapplied_indexis fully visible), not a racing in-memory mock kvstore snapshot.kvstore_region_exist+fap_core_no_prehandle.test_simple_from_invalid_source(product / FAP)fap_fallback_to_slow/fap_fallback_to_slow_with_lock(engine_store_ffi/src/core/fast_add_peer.rs), sync KVapply_stateon fallback so slow-path snapshot generation matches the in-memory applied index.test_overlap_apply_tikv_snap_in_the_middle(mock snapshot / mono-store)write_snapshot_to_db_data/ffi_apply_prehandled_snapshot), so region 1'sk1=v1does not shadow region 1000'sk1=v13.4. What is your TiFlash version? (Required)
contrib/tiflash-proxyCI (make ci_test), on branches that integrate tidb-engine-ext proxy_tests (e.g. release-8.5 cherry-picks).Reference paths (in tiflash repo)
contrib/tiflash-proxy/proxy_tests/proxy/shared/snapshot.rs—test_split_mergecontrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/simple.rs—test_simple_from_invalid_sourcecontrib/tiflash-proxy/proxy_tests/proxy/shared/fast_add_peer/fp.rs—test_overlap_apply_tikv_snap_in_the_middle,test_prehandle_snapshot_after_restart_reset,prehandle_snapshot_after_restartcontrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_core.rs—copy_data_fromcontrib/tiflash-proxy/proxy_components/mock-engine-store/src/mock_store/mock_write_impls.rs— CommitMergekvstore.removecontrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/fap_snapshot.rs— FAP snapshot skip / applycontrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/forward_raft/snapshot.rs—on_ob_pre_handle_snapshot_s3,fap_core_no_prehandlecontrib/tiflash-proxy/proxy_components/engine_store_ffi/src/core/fast_add_peer.rs— FAP fallback