*: refactor proxy to hub lib for columnar#10849
Conversation
Signed-off-by: yongman <yming0221@gmail.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughReplace the old columnar proxy submodule with a new Rust-based tiflash-columnar-hub, update build/CMake wiring, introduce C++ FFI contracts, implement a Rust cdylib runtime (cloud helper, columnar reader FFI, status server, profiling, metrics), and wire the hub into disaggregated storage and proxy state machine. ChangesTiFlash Columnar Hub Proxy Integration
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
|
@yongman I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
There was a problem hiding this comment.
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/basic_ffi_impls.rs`:
- Around line 69-72: ProtoMsgBaseBuff::new currently calls
msg.write_to_bytes().unwrap() which can panic; change its signature to return
protobuf::ProtobufResult<Self> (i.e., return Ok(ProtoMsgBaseBuff{ data }) or
propagate the error from write_to_bytes()) instead of unwrapping, and update the
two FFI callers ffi_server_info and ffi_get_server_info_from_proxy to handle the
Result—map Err to a non-zero u32 error code returned over the FFI and on Ok
extract the ProtoMsgBaseBuff to continue; ensure no unwraps remain and error
paths return the designated non-zero value.
- Around line 59-62: The current BaseBuffView::to_slice calls
std::slice::from_raw_parts unconditionally which is UB when data is null even if
len == 0; modify the method to check self.data for null and return an empty
slice (&[]) when data.is_null() (or when len == 0 and data.is_null()), otherwise
call unsafe std::slice::from_raw_parts(self.data as *const _, self.len as
usize); update the implementation inside impl BaseBuffView::to_slice to perform
this null check before constructing the slice.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 638-665: The code currently calls merge_from_bytes(...).unwrap()
on DelegateResponse (delegate_resp.merge_from_bytes(&snap_data).unwrap()) and on
kvenginepb::ChangeSet (cs.merge_from_bytes(&snap_bytes).unwrap()), which panics
on malformed payloads; change those unwraps to proper error handling: replace
both unwrap() calls with match or map_err to capture the protobuf parse error,
log a descriptive message including tag/shard_id, and propagate the error to the
caller (return Err(...)) using an appropriate Error variant (or wrap the parse
error into an existing Error type) so invalid/partial responses are returned as
errors instead of panicking the worker task; update references around
DelegateResponse::merge_from_bytes and kvenginepb::ChangeSet::merge_from_bytes
accordingly.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs`:
- Around line 76-179: The ffi_make_columnar_reader should not panic on malformed
FFI input: add a null check for hub_ptr (use
RaftStoreProxyPtr::is_null()/as_ref() guard) and replace all unchecked slice
indexing and unwrap() calls in parse_table_ranges, the filter_conditions loop,
the tables_range_view loop, and the protobuf merge_from_bytes calls for columns,
table_scan, ann_query_info, and fts_query_info with validated bounds checks and
error handling; on any length/truncation/parse error (from parse_from_bytes or
merge_from_bytes) return a ColumnarReaderPtr containing crate::Error::Other(...)
converted via .into() instead of panicking so callers receive
ColumnarReaderPtr.error_type.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/domain_impls.rs`:
- Around line 35-64: The current From<u32> for RawRustPtrType uses
std::mem::transmute which can produce undefined behaviour for invalid
discriminants and ffi_gc_rust_ptr panics on unexpected tags (which will abort
across FFI); replace the unchecked transmute with a checked conversion
(implement TryFrom<u32> for RawRustPtrType or an explicit mapping function that
matches each u32 to a valid RawRustPtrType and returns a Result/Option), update
call sites to use the try conversion and handle failures gracefully, and modify
ffi_gc_rust_ptr (and any other extern "C" entrypoints using RawRustPtrType) to
avoid panic on unknown tags by either safely ignoring unknown/None variants,
logging an error, or performing the correct free for additional valid variants
(e.g., ReadIndexTask, ArcFutureWaker, TimerTask) rather than calling panic!; key
symbols to change: the impl From<u32> for RawRustPtrType, add TryFrom/TryInto
for RawRustPtrType, and update ffi_gc_rust_ptr to use the checked conversion and
non-panicking error path.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/engine_store_helper.rs`:
- Around line 29-37: Change the global pointer pattern to a one-shot, unsafe
initializer: replace the raw Atomic store with a
OnceLock<NonNull<EngineStoreServerHelper>> (or OnceCell) and make
init_engine_store_server_helper unsafe and accept
NonNull<EngineStoreServerHelper> (not *const u8); call once_lock.set(ptr) so the
pointer can only be set once. Update get_engine_store_server_helper to read the
OnceLock, unwrap_or_panic if unset, then convert the stored NonNull to &'static
EngineStoreServerHelper inside an unsafe block. Keep names
ENGINE_STORE_SERVER_HELPER_PTR, init_engine_store_server_helper, and
get_engine_store_server_helper to locate the changes.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/run.rs`:
- Around line 318-320: The code currently assigns the "--advertise-engine-addr"
value into the same field used for the bind address (config.server.engine_addr),
which clobbers the listen address when both flags are provided; instead, add or
use a separate field (e.g., config.server.advertise_engine_addr) and set that
from matches.value_of("advertise-engine-addr") so the existing
config.server.engine_addr remains the bind address; update the ServerConfig
struct to include advertise_engine_addr (or the equivalent field already
present) and wire that field where PD/store registration constructs the
advertised endpoint.
- Around line 955-987: The sanitizer in sanitize_proxy_args is dropping attached
short-option values like -A1.2.3.4:20170 because it only expands -C, -s, -L and
-f; add a branch that mirrors the others to handle strip_prefix("-A") (check
value.is_empty(), push "-A".to_owned() and value.to_owned(), increment i,
continue) so attached -A... forms are preserved for clap parsing; ensure the new
branch follows the same pattern and placement as the existing -C/-s/-L/-f
blocks.
- Around line 1125-1134: In run_proxy, guard against malformed FFI inputs: first
check argc >= 0 and only cast to usize after that check (use the validated
n_args for Vec::with_capacity), ensure argv is non-null when n_args > 0, and for
each index verify argv.offset(i) is non-null before calling CStr::from_ptr; do
not call unwrap() on CStr::to_str() — instead handle invalid UTF-8 by using
to_str().map(...) or to_string_lossy() and propagate or log errors for any
invalid C string so the function doesn’t panic; reference the run_proxy function
and the earlier run_raftstore_proxy_ffi call path when adding these validations.
In `@contrib/tiflash-columnar-hub/hub-runtime/src/status_server.rs`:
- Around line 298-313: The handlers currently detect "simplify" and "verbose" by
substring checks which mis-handle values like "?verbose=false" and match
unrelated keys; update handle_metrics and handle_ready to call the shared
parse_bool_query helper (or implement exact-query parsing) to read the boolean
query parameter by name (e.g., parse_bool_query(&req.uri(), "simplify") and
parse_bool_query(&req.uri(), "verbose")), so that exact keys and explicit false
values are respected instead of using query.contains(...).
In `@contrib/tiflash-proxy-cmake/CMakeLists.txt`:
- Around line 109-114: The current CMake uses set_property(... PROPERTY
OBJECT_DEPENDS ...) to attach ${_TIFLASH_PROXY_CARGO_LOCK} to
${_TIFLASH_PROXY_LIBRARY}, but ${_TIFLASH_PROXY_LIBRARY} is produced by
add_custom_command and OBJECT_DEPENDS will not retrigger that command; remove
the set_property(...) block and instead, inside the add_custom_command(...) that
creates ${_TIFLASH_PROXY_LIBRARY}, add "${_TIFLASH_PROXY_CARGO_LOCK}" to its
DEPENDS list (guarded with if(EXISTS ${_TIFLASH_PROXY_CARGO_LOCK})), so changes
to the Cargo.lock will correctly re-run the custom command.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1fa92c01-2a8e-4015-8a7f-e7dcdd067faa
⛔ Files ignored due to path filters (1)
contrib/tiflash-columnar-hub/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (38)
.gitmodulescmake/find_tiflash_proxy.cmakecontrib/tiflash-columnar-hub/.gitignorecontrib/tiflash-columnar-hub/Cargo.tomlcontrib/tiflash-columnar-hub/Makefilecontrib/tiflash-columnar-hub/hub-runtime/Cargo.tomlcontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/@versioncontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ColumnFamily.hcontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/Common.hcontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/EncryptionFFI.hcontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.hcontrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/VersionCheck.hcontrib/tiflash-columnar-hub/hub-runtime/src/basic_ffi_impls.rscontrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rscontrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rscontrib/tiflash-columnar-hub/hub-runtime/src/domain_impls.rscontrib/tiflash-columnar-hub/hub-runtime/src/engine_store_helper.rscontrib/tiflash-columnar-hub/hub-runtime/src/hub.rscontrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rscontrib/tiflash-columnar-hub/hub-runtime/src/lib.rscontrib/tiflash-columnar-hub/hub-runtime/src/metrics.rscontrib/tiflash-columnar-hub/hub-runtime/src/profile.rscontrib/tiflash-columnar-hub/hub-runtime/src/run.rscontrib/tiflash-columnar-hub/hub-runtime/src/server_info.rscontrib/tiflash-columnar-hub/hub-runtime/src/status_server.rscontrib/tiflash-columnar-hub/rust-toolchain.tomlcontrib/tiflash-columnar-hub/workspace-hack/Cargo.tomlcontrib/tiflash-columnar-hub/workspace-hack/src/lib.rscontrib/tiflash-proxy-cmake/CMakeLists.txtcontrib/tiflash-proxy-columnardbms/src/Interpreters/SharedContexts/Disagg.hdbms/src/Server/Server.cppdbms/src/Server/main.cppdbms/src/Storages/KVStore/ProxyStateMachine.hdbms/src/Storages/KVStore/TMTContext.cppdbms/src/Storages/Page/V3/CMakeLists.txtdbms/src/Storages/StorageDisaggregated.cppdbms/src/Storages/StorageDisaggregatedColumnar.cpp
💤 Files with no reviewable changes (2)
- contrib/tiflash-proxy-columnar
- .gitmodules
| if (EXISTS "${_TIFLASH_PROXY_CARGO_LOCK}") | ||
| set_property( | ||
| SOURCE ${_TIFLASH_PROXY_LIBRARY} | ||
| APPEND | ||
| PROPERTY OBJECT_DEPENDS "${_TIFLASH_PROXY_CARGO_LOCK}") | ||
| endif() |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In CMake, does set_property(SOURCE <generated_output_path> PROPERTY OBJECT_DEPENDS ...)affect rebuilds for anadd_custom_command(OUTPUT ...)target, or must the dependency be listed inadd_custom_command(... DEPENDS ...)?
💡 Result:
The OBJECT_DEPENDS source file property does not affect the rebuild behavior of add_custom_command(OUTPUT ...) targets [1][2]. OBJECT_DEPENDS is a specialized property intended to specify additional dependencies for compiled object files (such as those generated from C++ source files) [1][2]. When using Makefile or Ninja generators, an object file will be recompiled if any file listed in OBJECT_DEPENDS is newer than the object file itself [1][2]. This property is not intended to track dependencies for custom commands and does not influence the dependency graph of add_custom_command outputs [1][2]. To ensure that an add_custom_command output is rebuilt when a dependency changes, you must list the dependency explicitly in the DEPENDS argument of the add_custom_command itself [3]. If your custom command needs to depend on the output of another command or a specific file, include that file path or the OUTPUT of the other command directly in the DEPENDS list [3]. If you are attempting to create a dependency on a different target (like a library or executable), using the target name in the DEPENDS list will establish the necessary dependency [3]. For more complex inter-target dependencies where file-level dependencies are not automatically handled, add_dependencies may be used to establish target-level ordering, but this does not replace the need for file-level dependencies in add_custom_command [1][4].
Citations:
- 1: https://cmake.org/cmake/help/latest/prop_sf/OBJECT_DEPENDS.html
- 2: https://cmake.org/cmake/help/v3.3/prop_sf/OBJECT_DEPENDS.html
- 3: https://cmake.org/cmake/help/git-master/command/add_custom_command.html
- 4: https://cmake.org/cmake/help/latest/command/add_dependencies.html
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="contrib/tiflash-proxy-cmake/CMakeLists.txt"
# Show the relevant region around the reported lines plus the nearby variable definitions
nl -ba "$FILE" | sed -n '70,150p'Repository: pingcap/tiflash
Length of output: 102
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="contrib/tiflash-proxy-cmake/CMakeLists.txt"
if [ ! -f "$FILE" ]; then
echo "Missing $FILE" >&2
exit 1
fi
# Print a narrower window around the reported lines using cat -n
cat -n "$FILE" | sed -n '80,140p'Repository: pingcap/tiflash
Length of output: 2174
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "_TIFLASH_PROXY_CARGO_LOCK" -S contrib/tiflash-proxy-cmake/CMakeLists.txtRepository: pingcap/tiflash
Length of output: 250
Cargo.lock dependency is wired to the wrong CMake mechanism
${_TIFLASH_PROXY_CARGO_LOCK}is only added viaset_property(... PROPERTY OBJECT_DEPENDS ...)(lines 109-114), but${_TIFLASH_PROXY_LIBRARY}is the OUTPUT ofadd_custom_command(...)(lines 98-107).OBJECT_DEPENDSis for compiled object files and won’t trigger reruns of anadd_custom_commandwhen the lockfile changes.- Remove the
set_property(SOURCE ${_TIFLASH_PROXY_LIBRARY} ... OBJECT_DEPENDS ...)block and instead add"${_TIFLASH_PROXY_CARGO_LOCK}"to theDEPENDSlist of thatadd_custom_command(...)(conditionally withif(EXISTS ...)).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@contrib/tiflash-proxy-cmake/CMakeLists.txt` around lines 109 - 114, The
current CMake uses set_property(... PROPERTY OBJECT_DEPENDS ...) to attach
${_TIFLASH_PROXY_CARGO_LOCK} to ${_TIFLASH_PROXY_LIBRARY}, but
${_TIFLASH_PROXY_LIBRARY} is produced by add_custom_command and OBJECT_DEPENDS
will not retrigger that command; remove the set_property(...) block and instead,
inside the add_custom_command(...) that creates ${_TIFLASH_PROXY_LIBRARY}, add
"${_TIFLASH_PROXY_CARGO_LOCK}" to its DEPENDS list (guarded with if(EXISTS
${_TIFLASH_PROXY_CARGO_LOCK})), so changes to the Cargo.lock will correctly
re-run the custom command.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
dbms/src/Server/Server.cpp (1)
1230-1231: 💤 Low valueUse the local
is_disagg_storage_modefor consistency.A few lines above (Line 753) you already cache
is_disagg_storage_mode = global_context->getSharedContextDisagg()->isDisaggregatedStorageMode();and use it throughoutmain. CallinggetSharedContextDisagg()->isDisaggregatedStorageMode()again here is functionally equivalent but inconsistent with the surrounding style.♻️ Proposed tweak
- if (global_context->getSharedContextDisagg()->isDisaggregatedStorageMode() - && !store_ident.has_value()) + if (is_disagg_storage_mode && !store_ident.has_value())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbms/src/Server/Server.cpp` around lines 1230 - 1231, Replace the direct call to global_context->getSharedContextDisagg()->isDisaggregatedStorageMode() with the cached local boolean is_disagg_storage_mode for consistency; specifically, update the if-condition that checks disaggregated storage and store_ident (the line currently using global_context->getSharedContextDisagg()->isDisaggregatedStorageMode()) to reference is_disagg_storage_mode alongside store_ident.has_value().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@dbms/src/Server/Server.cpp`:
- Around line 1230-1231: Replace the direct call to
global_context->getSharedContextDisagg()->isDisaggregatedStorageMode() with the
cached local boolean is_disagg_storage_mode for consistency; specifically,
update the if-condition that checks disaggregated storage and store_ident (the
line currently using
global_context->getSharedContextDisagg()->isDisaggregatedStorageMode()) to
reference is_disagg_storage_mode alongside store_ident.has_value().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: db96903f-3922-4fe1-a098-b19a403b2c70
📒 Files selected for processing (1)
dbms/src/Server/Server.cpp
…nullptr or not Signed-off-by: JaySon-Huang <tshent@qq.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dbms/src/Server/Server.cpp`:
- Around line 1209-1214: The code logs "columnar proxy is ready to serve" when
proxy_machine.isColumnar() is true even if the proxy isn't runnable; update the
condition so the columnar-ready message only appears when the proxy is runnable
as well—either reorder to check proxy_machine.isProxyRunnable() first or change
the branch to require both predicates (e.g., proxy_machine.isColumnar() &&
proxy_machine.isProxyRunnable()) before calling LOG_INFO(log, "columnar proxy is
ready to serve"), ensuring the rest of the else-if logic
(proxy_machine.isProxyRunnable()) still executes correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fa478638-3aa1-448a-ba7f-be455210742d
📒 Files selected for processing (2)
dbms/src/Server/Server.cppdbms/src/Storages/KVStore/ProxyStateMachine.h
| if (proxy_machine.isColumnar()) | ||
| { | ||
| const auto store_id = tmt_context.getKVStore()->getStoreID(std::memory_order_seq_cst); | ||
| LOG_INFO(log, "columnar proxy is ready to serve"); | ||
| } | ||
| else if (proxy_machine.isProxyRunnable()) | ||
| { |
There was a problem hiding this comment.
Gate the columnar “ready” path on runnable state.
At Line 1209, isColumnar() is checked before isProxyRunnable(), so a columnar-but-not-runnable configuration can still log “ready to serve”.
Proposed fix
- if (proxy_machine.isColumnar())
+ if (proxy_machine.isColumnar() && proxy_machine.isProxyRunnable())
{
LOG_INFO(log, "columnar proxy is ready to serve");
}
+ else if (proxy_machine.isColumnar())
+ {
+ LOG_WARNING(log, "columnar mode is enabled but proxy is not runnable");
+ }
else if (proxy_machine.isProxyRunnable())
{
auto kvstore = tmt_context.getKVStore();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@dbms/src/Server/Server.cpp` around lines 1209 - 1214, The code logs "columnar
proxy is ready to serve" when proxy_machine.isColumnar() is true even if the
proxy isn't runnable; update the condition so the columnar-ready message only
appears when the proxy is runnable as well—either reorder to check
proxy_machine.isProxyRunnable() first or change the branch to require both
predicates (e.g., proxy_machine.isColumnar() && proxy_machine.isProxyRunnable())
before calling LOG_INFO(log, "columnar proxy is ready to serve"), ensuring the
rest of the else-if logic (proxy_machine.isProxyRunnable()) still executes
correctly.
What problem does this PR solve?
Issue Number: ref #10844
Problem Summary:
Proxy source code is complicated to maintain.
What is changed and how it works?
Build the tiflash server with
ENABLE_NEXT_GEN_COLUMNAR=ONandENABLE_NEXT_GEN=ONwill link the newtiflash-columnar-hublib to access kvengine.cloud-storage-enginedependencies to the crontrib directly to simplfy maintainance./debug/pprof/....Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Chores