feat(external): support Hive-style partitioned Parquet external tables #24329
iamlinjunhong wants to merge 6 commits into matrixorigin:3.0-dev from
Conversation
XuPeng-SH
left a comment
Thanks for the substantial effort here — +6347 lines with 3385 lines of tests, clean file-level decomposition, and real edge-case coverage in the BVT. A lot of it is solid. My concerns focus on whether this is ready to be announced as "Hive-style partitioned Parquet" given how several pieces interact.
Leaving as request-for-changes for the 🔴 items; the 🟡 items are open questions I'd like resolved before merge or explicitly deferred with a follow-up issue.
🔴 Blocking
1. Hive tables are scanned serially regardless of cluster size
compile.go:1675-1679 unconditionally forces param.Parallel = false for Hive tables:
```go
if param.HivePartitioning {
    param.Parallel = false
}
```

Combined with `getReadWriteParallelFlag` (:1596-1604), this routes every Hive table into `compileExternScanSerialReadWrite` at :1862: a single scope, no cross-CN fanout. A table with 10K partitions = 10K files is opened one at a time on one CN. The inline comment says it's because "the parallel loop mutates param.Filepath per file"; that's a correctness workaround, not a design decision.
This dwarfs every other performance concern. The feature technically works on small datasets but is unusable on anything Hive workloads actually look like. Please either:
- fix the `param.Filepath` mutation so the parallel paths can be reused, or
- build a file-level fanout specifically for Hive (the per-partition file set is ideal for this).
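To make the first option concrete, here is a minimal sketch of the per-file-copy idea. `fileParam` and `scanAll` are hypothetical stand-ins for the real external-scan parameter struct and read loop in compile.go; the point is only that handing each worker its own copy removes the shared `Filepath` mutation that currently forces `Parallel = false`.

```go
package main

import (
	"fmt"
	"sync"
)

// fileParam is a stand-in for the per-scan external parameter struct;
// the real type and fields in compile.go differ.
type fileParam struct {
	Filepath string
}

// scanAll sketches the fix: instead of mutating a shared param.Filepath
// inside the loop, each worker gets a shallow copy of the base param,
// so the existing parallel read path could be reused for Hive tables.
func scanAll(base fileParam, files []string, workers int) []string {
	out := make([]string, len(files))
	var wg sync.WaitGroup
	ch := make(chan int)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range ch {
				p := base            // per-file copy: no shared mutation
				p.Filepath = files[i]
				out[i] = fmt.Sprintf("scanned %s", p.Filepath)
			}
		}()
	}
	for i := range files {
		ch <- i
	}
	close(ch)
	wg.Wait()
	return out
}

func main() {
	res := scanAll(fileParam{}, []string{"a.parquet", "b.parquet"}, 2)
	fmt.Println(len(res))
}
```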
2. S3 FS rebuilt per List call
hive_partition.go:94-105 — NewListDirFunc calls plan2.GetForETLWithType(param, prefix) on every recursive directory listing. For each call the option string is re-encoded and a fresh S3FS is constructed. For N partitions at depth D that's O(N·D) S3FS constructions on top of O(N·D) LIST calls. The code already carries a TODO:
TODO: For S3 this re-creates an S3FS instance per List call; pre-build the FS once and reuse across recursive calls for better performance.
Please pre-build the FS once and reuse. The combination of #1 + #2 means S3-backed Hive tables are going to surprise users badly.
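A sketch of the pre-built-FS shape, with `etlFS`/`newETLFS` as hypothetical stand-ins for whatever fileservice type `plan2.GetForETLWithType` actually constructs:

```go
package main

import "fmt"

// etlFS is a stand-in for the fileservice built by plan2.GetForETLWithType;
// the real constructor and List signature differ. The counter only makes
// construction cost observable in this sketch.
type etlFS struct{ constructions *int }

func newETLFS(counter *int) etlFS {
	*counter++ // expensive in the real code: option re-encode + S3FS build
	return etlFS{counter}
}

func (fs etlFS) List(prefix string) []string {
	return []string{prefix + "/part-0.parquet"}
}

// listDirFactory sketches the fix: build the FS once and close over it,
// instead of reconstructing an S3FS inside every recursive List call.
func listDirFactory(counter *int) func(prefix string) []string {
	fs := newETLFS(counter) // built once
	return func(prefix string) []string {
		return fs.List(prefix) // reused across all recursive calls
	}
}

func main() {
	n := 0
	list := listDirFactory(&n)
	for i := 0; i < 100; i++ {
		list("year=2024")
	}
	fmt.Println(n) // one construction for 100 List calls
}
```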
3. Partition directory names are not sanitized against traversal
ParseHivePartitionSegment (hive_partition.go:143-151) accepts any bytes in Value, including `/`, `..`, NUL, newlines. The only rejection is `%` for URL-encoded content. Later `path.Join(prefix, entry.Name)` (:303) collapses `..`. On local FS a directory literally named `year=..` (or any attacker-plantable layout underneath a tenant-readable base path) walks out of the intended subtree.
The directory name comes from fileservice.DirEntry, which reflects the real filesystem entry — it is attacker-reachable on any writable path that a different principal can manipulate. Please reject control bytes, /, .., and NUL in both Key and Value. An explicit allow-list is safer than an ad-hoc reject list.
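Something like the following allow-list would cover the cases above. This is a sketch only; the real `ParseHivePartitionSegment` has a different signature, and the exact character set to allow is the author's call.

```go
package main

import "fmt"

// validSegment is a hypothetical allow-list check for one side of a Hive
// key=value directory name: reject empty / "." / ".." components, control
// bytes (which covers NUL), DEL, path separators, and '%' (consistent with
// the existing URL-encoding rejection).
func validSegment(s string) bool {
	if s == "" || s == "." || s == ".." {
		return false
	}
	for _, r := range s {
		if r < 0x20 || r == 0x7f || r == '/' || r == '\\' || r == '%' {
			return false
		}
	}
	return true
}

// validPartitionSegment applies the check to both Key and Value, as the
// review asks. With '/' banned, an embedded ".." inside a longer name
// (e.g. "v..1") is harmless; only the bare ".." component can traverse.
func validPartitionSegment(key, value string) bool {
	return validSegment(key) && validSegment(value)
}

func main() {
	fmt.Println(validPartitionSegment("year", "2024"))
	fmt.Println(validPartitionSegment("year", ".."))
	fmt.Println(validPartitionSegment("a/b", "x"))
}
```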
🟡 Please address or acknowledge
4. "Hive compatibility" is narrower than the title claims
- URL-encoded partition values are rejected (hive_partition.go:269-272). Spark / Hive / Trino write `country=US%2FCA` verbatim whenever a partition value contains `/`, `:`, `%` or other filesystem-unsafe bytes. Datasets using such values cannot be read. The test comment at hive_partition_external_table.sql acknowledges this as "P0 known limitation"; if it's P0 and known-broken, the feature title overclaims compatibility.
- DATE / TIMESTAMP / FLOAT / DECIMAL partitions do not prune (hive_partition.go:419-431, `canPruneType`). `WHERE dt = '2024-01-01'` lists every partition. Hive workloads are predominantly date-partitioned; this materially regresses the most common pruning case.
- Predicate pruning supports only `=` and `IN` with literals (hive_partition.go:586-595). `>`, `>=`, `BETWEEN`, `!=`, `NOT IN`, `LIKE`, `CAST(...)`, and any `OR` combination all fall back to full listing. Again, time-range queries are the typical Hive access pattern.
I'm not asking for all three to be fixed in this PR, but please either (a) scope the title/release notes to what actually works, or (b) file explicit follow-up issues and link them. Right now "support Hive-style partitioned Parquet" sets a wrong expectation.
5. Cross-partition schema evolution is untested and probably broken
Hive datasets routinely accumulate parquet files with drifting schemas (added/dropped columns across partition directories). The code caches the schema from the first opened file and findColumnIgnoreCase is per-file, so behavior depends on which file happens to be scanned first. Zero BVT cases for add-col / drop-col across partitions. Please add a test and document the behavior (fail hard / fill NULL / skip column) explicitly.
6. Rolling upgrade within 3.0-dev
parsers/tree/update.go:4511-4525 extends ExParamConst with HivePartitioning/HivePartitionCols/HivePartColTypes. These are serialized into Createsql via json.Marshal at DDL time (build_ddl.go:930) and re-parsed on every compile on the CN that runs the query (compile.go:1573). A CN on the older code inside the 3.0-dev window will json.Unmarshal successfully (unknown fields are ignored), miss the Hive flags, and route the table through generic ReadDir. For a Hive base path like /warehouse/sales/ with no glob characters, ReadDir (utils.go:2084-2133) returns an empty file list, so the query returns zero rows rather than wrong rows — still a visible break during mixed deployment. Please guard with a feature flag or version check, or document the ordering constraint for the 3.0 cutover.
7. A single bad parquet kills the whole query
parquet.go + scanParquetFile returns an error out of the Hive loop if newParquetHandler fails on any file (0-byte, truncated, _temporary residue). Production Hive datasets carry broken files as a matter of course; Spark/Trino tolerate them with per-file skip + log. Please at minimum add a continue-on-error path for the Hive reader, matching what CSV external already does in some configs.
8. Hive keywords added to DDL without a feature flag
build_ddl.go:939 + utils.go:1673-1829 (parseHiveOptionKV). Any existing external-table DDL that happens to use hive_partitioning / hive_partition_columns as option keys for unrelated reasons changes semantics silently. Previously these would be rejected at the "keyword not supported" default. Low probability but zero rollback. Either (a) gate behind a server-level feature flag for the first release cycle or (b) add a changelog entry specifically flagging this.
🟢 Follow-ups (fine to defer)
- Partition discovery is single-goroutine and serial (hive_partition.go:258-322). At depth 3 with 99-wide fanout you exhaust `maxListCalls=10000`. With a worker pool + LIST prefetching this is easily 10× better on S3. Make the two constants (`maxListCalls`, `maxPartitionCount`) configurable while you're at it.
- `matchInt`/`matchUint` strip leading zeros (hive_partition.go:434, :451) so `year=007` matches `year=7`, but `country='CN'` on varchar requires exact match. Two partition columns on the same dataset can have inconsistent pruning semantics. Documented behavior would help.
- `extractVecValues` passes nil mempool to `vec.Free(nil)` (hive_partition.go:668). Other sites typically use `proc.Mp()`. Please verify `Vector.Free` semantics on nil mempool are intended; at minimum add a comment pinning the expectation.
- `SHOW CREATE TABLE` drops case on partition column names (build_ddl.go:4625-4627 lowercases before storing). `'hive_partition_columns'='Year,Month'` round-trips as `'year,month'`, breaking case-sensitive consumers of `SHOW CREATE`.
- Float partition values have no NaN/Inf guard (hive_partition_fill.go:1911, :1918). `amount=inf/data.parquet` silently fills `+Inf`.
- `HivePartColType` embedded in the `tree` package (parsers/tree/update.go:4519). Layering workaround to avoid importing `pkg/pb/plan`, but now `tree` knows plan type codes. A TODO with the plan-side extraction target would make this less load-bearing.
- `isHivePartitionCol` does a linear scan per column (hive_partition_fill.go:1730-1741). Fine for small schemas, quadratic for wide tables.
- Placeholder imports in tests: `var _ = catalog.ExternalFilePath` / `var _ = function.EQUAL` / `var _ = math.MaxInt32` in hive_partition_test.go:3890-3891 and _coverage_test.go:1692. Just remove the imports.
- Copyright headers mix 2024 (hive_partition.go) and 2026 (hive_partition_coverage_test.go, compile/hive_partition_test.go). Cosmetic.
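On the NaN/Inf follow-up: the guard is a few lines, and note that Go's `strconv.ParseFloat` happily accepts "inf" and "NaN" spellings, which is exactly how `amount=inf` slips through today. `parseFloatPartition` is a hypothetical name, not the PR's function.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseFloatPartition sketches the NaN/Inf guard suggested for
// hive_partition_fill.go: strconv.ParseFloat parses "inf"/"NaN" without
// error, so finite-ness must be checked explicitly before filling.
func parseFloatPartition(s string) (float64, error) {
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return 0, fmt.Errorf("non-finite partition value %q", s)
	}
	return v, nil
}

func main() {
	if _, err := parseFloatPartition("inf"); err != nil {
		fmt.Println("rejected:", err)
	}
	v, _ := parseFloatPartition("1.5")
	fmt.Println(v)
}
```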
What's genuinely good (for balance)
- Split of hive_partition.go (discovery + pruning) vs hive_partition_fill.go (constant-vector fill at scan time) is clean.
- `ClassifyFilters` keeping partition filters in both partitionFilters and rowFilters for double-filtering safety (hive_partition.go:501-504) is a good defensive choice.
- Constant-vector fill path (`SetConstFixed`/`SetConstBytes`) avoids per-row work in the hot loop. `refreshPartitionValues` runs once per file open, not per batch.
- `extractVecValues` defensively bounds-checks the binary envelope before `UnmarshalBinary` (hive_partition.go:738-760).
- The filePathColSet comment explaining why `STATEMENT_ACCOUNT` is excluded is exemplary.
- BVT covers 11 sections including real negative cases with exact error strings, and the `.result` shows honest float rounding rather than pretty-printed output.
Happy to chat through any of the 🔴 items. The core approach is right; the asks above are about not shipping the announcement until the implementation matches it.
addressed the three blocking items in the latest commit.
also removed the placeholder test imports and added a comment documenting why. For the non-blocking compatibility items: URL-encoded partition values remain explicitly unsupported in P0; pruning is still limited to
Performance Benchmark Results

Benchmarked this PR (commit ab6058a) at two scales using a synthesized TPC-DS-like

Scope
All queries run against a single-CN local MatrixOne with the PR binary. Warmup = 1 run, timed = 3 runs, median reported.

Results

Key finding
Partition pruning works as advertised. Q2 latency is ~flat from SMALL to MEDIUM (26ms → 35ms) even as the total partition count went 4×. This is the main promise of the feature and it holds.

Side effects (filed separately)
While running the benchmark I hit two issues that block the advertised usage pattern:

Suggestions before merge

Reproducibility
All benchmark scripts, data generator, DDL and queries at
What type of PR is this?
Which issue(s) this PR fixes:
issue #24320
What this PR does / why we need it:
support Hive-style partitioned Parquet external tables