
feat(external): support Hive-style partitioned Parquet external tables #24329

Open

iamlinjunhong wants to merge 6 commits into matrixorigin:3.0-dev from iamlinjunhong:dp-hive

Conversation

@iamlinjunhong
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24320

What this PR does / why we need it:

support Hive-style partitioned Parquet external tables

Contributor

@XuPeng-SH left a comment


Thanks for the substantial effort here — +6347 lines with 3385 lines of tests, clean file-level decomposition, and real edge-case coverage in the BVT. A lot of it is solid. My concerns focus on whether this is ready to be announced as "Hive-style partitioned Parquet" given how several pieces interact.

Leaving as request-for-changes for the 🔴 items; the 🟡 items are open questions I'd like resolved before merge or explicitly deferred with a follow-up issue.

🔴 Blocking

1. Hive tables are scanned serially regardless of cluster size

compile.go:1675-1679 unconditionally forces param.Parallel = false for Hive tables:

if param.HivePartitioning {
    param.Parallel = false
}

Combined with getReadWriteParallelFlag (:1596-1604), this routes every Hive table into compileExternScanSerialReadWrite at :1862 — a single scope, no cross-CN fanout. A table with 10K partitions = 10K files is opened one at a time on one CN. The inline comment says it's because "the parallel loop mutates param.Filepath per file" — that's a correctness workaround, not a design decision.

This dwarfs every other performance concern. The feature technically works on small datasets but is unusable on anything Hive workloads actually look like. Please either:

  • fix the param.Filepath mutation so parallel paths can be reused, or
  • build a file-level fanout specifically for Hive (the per-partition file set is ideal for this).
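The per-partition file set makes the second option easy to sketch. Below is a minimal, hypothetical version of file-level sharding — `hiveFile` and `shardBySize` are invented names for illustration, not the PR's API. Greedy largest-first assignment keeps per-CN byte totals roughly balanced:

```go
package main

import (
	"fmt"
	"sort"
)

// hiveFile is a hypothetical record for one discovered partition file.
type hiveFile struct {
	Path string
	Size int64
}

// shardBySize distributes files across n slots with a greedy
// largest-first heuristic: each file goes to the currently lightest
// shard, so total bytes per CN stay roughly even.
func shardBySize(files []hiveFile, n int) [][]hiveFile {
	sorted := make([]hiveFile, len(files))
	copy(sorted, files)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Size > sorted[j].Size })

	shards := make([][]hiveFile, n)
	load := make([]int64, n)
	for _, f := range sorted {
		// pick the shard with the smallest accumulated size
		min := 0
		for i := 1; i < n; i++ {
			if load[i] < load[min] {
				min = i
			}
		}
		shards[min] = append(shards[min], f)
		load[min] += f.Size
	}
	return shards
}

func main() {
	files := []hiveFile{
		{"year=2024/a.parquet", 500}, {"year=2024/b.parquet", 300},
		{"year=2023/c.parquet", 200}, {"year=2023/d.parquet", 100},
	}
	for i, s := range shardBySize(files, 2) {
		fmt.Printf("shard %d: %d files\n", i, len(s))
	}
}
```

Each shard can then scan whole files serially, which sidesteps the `param.Filepath` mutation problem entirely.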

2. S3 FS rebuilt per List call

hive_partition.go:94-105 — NewListDirFunc calls plan2.GetForETLWithType(param, prefix) on every recursive directory listing. For each call the option string is re-encoded and a fresh S3FS is constructed. For N partitions at depth D that's O(N·D) S3FS constructions on top of O(N·D) LIST calls. The code already carries a TODO:

TODO: For S3 this re-creates an S3FS instance per List call; pre-build the FS once and reuse across recursive calls for better performance.

Please pre-build the FS once and reuse. The combination of #1 + #2 means S3-backed Hive tables are going to surprise users badly.
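The fix can be as small as hoisting the construction out of the recursive closure. This sketch uses an invented FS interface rather than MatrixOne's actual fileservice types:

```go
package main

import "fmt"

// FS is a minimal stand-in for the ETL/S3 file service interface;
// names here are hypothetical, not MatrixOne's actual API.
type FS interface {
	List(prefix string) []string
}

type mapFS struct{ entries map[string][]string }

func (m *mapFS) List(prefix string) []string { return m.entries[prefix] }

// newListDirFunc builds the FS once and returns a closure that reuses
// it for every recursive directory listing, instead of reconstructing
// an S3FS per call.
func newListDirFunc(build func() FS) func(prefix string) []string {
	fs := build() // one construction, shared by all recursive calls
	return func(prefix string) []string { return fs.List(prefix) }
}

func main() {
	builds := 0
	listDir := newListDirFunc(func() FS {
		builds++ // track how many times the FS is constructed
		return &mapFS{entries: map[string][]string{
			"warehouse/": {"year=2023/", "year=2024/"},
		}}
	})
	listDir("warehouse/")
	listDir("warehouse/year=2023/")
	fmt.Println("constructions:", builds) // stays 1 across calls
}
```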

3. Partition directory names are not sanitized against traversal

ParseHivePartitionSegment (hive_partition.go:143-151) accepts any bytes in Value — including /, .., NUL, newlines. The only rejection is % for URL-encoded content. Later path.Join(prefix, entry.Name) (:303) collapses .. segments. On local FS a directory literally named year=.. (or any attacker-plantable layout underneath a tenant-readable base path) walks out of the intended subtree.

The directory name comes from fileservice.DirEntry, which reflects the real filesystem entry — it is attacker-reachable on any writable path that a different principal can manipulate. Please reject control bytes, /, .., and NUL in both Key and Value. An explicit allow-list is safer than an ad-hoc reject list.
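An allow-list along these lines would cover the cases above — this is an illustrative policy, not the PR's code:

```go
package main

import (
	"fmt"
	"strings"
)

// validSegmentPart allow-lists the bytes a partition key or value may
// contain; everything else (path separators, control bytes, percent
// escapes) is rejected, as are the traversal names "." and "..".
func validSegmentPart(s string) bool {
	if s == "" || s == "." || s == ".." {
		return false
	}
	for _, r := range s {
		ok := r >= 'a' && r <= 'z' ||
			r >= 'A' && r <= 'Z' ||
			r >= '0' && r <= '9' ||
			strings.ContainsRune("._- ", r)
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	for _, v := range []string{"2024", "..", "a/b", "US%2FCA", "x\x00y"} {
		fmt.Printf("%q -> %v\n", v, validSegmentPart(v))
	}
}
```

Because it enumerates what is allowed, new dangerous byte sequences are rejected by default rather than requiring another reject-list patch.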

🟡 Please address or acknowledge

4. "Hive compatibility" is narrower than the title claims

  • URL-encoded partition values are rejected (hive_partition.go:269-272). Spark / Hive / Trino write country=US%2FCA verbatim whenever a partition value contains /, :, % or other filesystem-unsafe bytes. Datasets using such values cannot be read. The test comment at hive_partition_external_table.sql acknowledges this as "P0 known limitation" — but if it's P0 and known-broken, the feature title overclaims compatibility.
  • DATE / TIMESTAMP / FLOAT / DECIMAL partitions do not prune (hive_partition.go:419-431 canPruneType). WHERE dt = '2024-01-01' lists every partition. Hive workloads are predominantly date-partitioned; this materially regresses the most common pruning case.
  • Predicate pruning supports only = and IN with literals (hive_partition.go:586-595). >, >=, BETWEEN, !=, NOT IN, LIKE, CAST(...), and any OR combination all fall back to full listing. Again, time-range queries are the typical Hive access pattern.

I'm not asking for all three to be fixed in this PR, but please either (a) scope the title/release notes to what actually works, or (b) file explicit follow-up issues and link them. Right now "support Hive-style partitioned Parquet" sets a wrong expectation.
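For context on the first bullet: Hive's escaping is plain percent-encoding, so a future decoder could lean on the standard library. A sketch with Go's net/url (whether PathUnescape's escape handling matches Hive's exact character set would need verification):

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeHiveValue decodes a Hive-style percent-encoded partition value
// such as "US%2FCA" -> "US/CA". Writers like Spark/Hive/Trino encode
// filesystem-unsafe bytes this way in partition directory names, so a
// reader aiming for full compatibility has to reverse it.
func decodeHiveValue(raw string) (string, error) {
	// url.PathUnescape handles %XX sequences and leaves other bytes
	// untouched, which is close enough to Hive's escaping for a sketch.
	return url.PathUnescape(raw)
}

func main() {
	v, err := decodeHiveValue("US%2FCA")
	fmt.Println(v, err) // US/CA <nil>
}
```

Note that decoding must happen after segment validation, or the traversal protection from the previous item can be bypassed via encoded separators.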

5. Cross-partition schema evolution is untested and probably broken

Hive datasets routinely accumulate parquet files with drifting schemas (added/dropped columns across partition directories). The code caches the schema from the first opened file and findColumnIgnoreCase is per-file, so behavior depends on which file happens to be scanned first. Zero BVT cases for add-col / drop-col across partitions. Please add a test and document the behavior (fail hard / fill NULL / skip column) explicitly.

6. Rolling upgrade within 3.0-dev

parsers/tree/update.go:4511-4525 extends ExParamConst with HivePartitioning/HivePartitionCols/HivePartColTypes. These are serialized into Createsql via json.Marshal at DDL time (build_ddl.go:930) and re-parsed on every compile on the CN that runs the query (compile.go:1573). A CN on the older code inside the 3.0-dev window will json.Unmarshal successfully (unknown fields are ignored), miss the Hive flags, and route the table through generic ReadDir. For a Hive base path like /warehouse/sales/ with no glob characters, ReadDir (utils.go:2084-2133) returns an empty file list, so the query returns zero rows rather than wrong rows — still a visible break during mixed deployment. Please guard with a feature flag or version check, or document the ordering constraint for the 3.0 cutover.

7. A single bad parquet kills the whole query

parquet.go + scanParquetFile returns an error out of the Hive loop if newParquetHandler fails on any file (0-byte, truncated, _temporary residue). Production Hive datasets carry broken files as a matter of course; Spark/Trino tolerate them with per-file skip + log. Please at minimum add a continue-on-error path for the Hive reader, matching what CSV external already does in some configs.

8. Hive keywords added to DDL without a feature flag

build_ddl.go:939 + utils.go:1673-1829 (parseHiveOptionKV). Any existing external-table DDL that happens to use hive_partitioning / hive_partition_columns as option keys for unrelated reasons changes semantics silently. Previously these would be rejected at the "keyword not supported" default. Low probability but zero rollback. Either (a) gate behind a server-level feature flag for the first release cycle or (b) add a changelog entry specifically flagging this.

🟢 Follow-ups (fine to defer)

  • Partition discovery is single-goroutine and serial (hive_partition.go:258-322). At depth 3 with 99-wide fanout you exhaust maxListCalls=10000. With a worker pool + LIST prefetching this is easily 10× better on S3. Make the two constants (maxListCalls, maxPartitionCount) configurable while you're at it.
  • matchInt / matchUint strip leading zeros (hive_partition.go:434, :451) so year=007 matches year=7, but country='CN' on varchar requires exact match. Two partition columns on the same dataset can have inconsistent pruning semantics. Documented behavior would help.
  • extractVecValues passes nil mempool to vec.Free(nil) (hive_partition.go:668). Other sites typically use proc.Mp(). Please verify Vector.Free semantics on nil mempool are intended; at minimum add a comment pinning the expectation.
  • SHOW CREATE TABLE drops case on partition column names (build_ddl.go:4625-4627 lowercases before storing). 'hive_partition_columns'='Year,Month' round-trips as 'year,month', breaking case-sensitive consumers of SHOW CREATE.
  • Float partition values have no NaN/Inf guard (hive_partition_fill.go:1911,1918). amount=inf/data.parquet silently fills +Inf.
  • HivePartColType embedded in tree package (parsers/tree/update.go:4519). Layering workaround to avoid importing pkg/pb/plan, but now tree knows plan type codes. A TODO with the plan-side extraction target would make this less load-bearing.
  • isHivePartitionCol linear scan per column (hive_partition_fill.go:1730-1741). Fine for small schemas, quadratic for wide tables.
  • Placeholder imports in tests: var _ = catalog.ExternalFilePath / var _ = function.EQUAL / var _ = math.MaxInt32 in hive_partition_test.go:3890-3891 and _coverage_test.go:1692. Just remove the imports.
  • Copyright headers mix 2024 (hive_partition.go) and 2026 (hive_partition_coverage_test.go, compile/hive_partition_test.go). Cosmetic.
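The matchInt/varchar asymmetry from the second follow-up bullet, in miniature (both helpers are sketches of the described behavior, not the PR's functions):

```go
package main

import (
	"fmt"
	"strconv"
)

// matchIntSegment compares a partition directory value against an
// integer predicate numerically, so "007" matches 7 — leading zeros
// vanish in the parse.
func matchIntSegment(dirVal string, want int64) bool {
	n, err := strconv.ParseInt(dirVal, 10, 64)
	return err == nil && n == want
}

// matchStrSegment compares varchar partition values byte-for-byte, so
// the same directory under a string-typed column would NOT match "7".
func matchStrSegment(dirVal, want string) bool {
	return dirVal == want
}

func main() {
	fmt.Println(matchIntSegment("007", 7))   // numeric: matches
	fmt.Println(matchStrSegment("007", "7")) // exact: does not
}
```

Two partition columns on one dataset can thus prune under different equality semantics, which is worth a line in the docs.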

What's genuinely good (for balance)

  • Split of hive_partition.go (discovery + pruning) vs hive_partition_fill.go (constant-vector fill at scan time) is clean.
  • ClassifyFilters keeping partition filters in both partitionFilters and rowFilters for double-filtering safety (hive_partition.go:501-504) is a good defensive choice.
  • Constant-vector fill path (SetConstFixed / SetConstBytes) avoids per-row work in the hot loop. refreshPartitionValues runs once per file open, not per batch.
  • extractVecValues defensively bounds-checks the binary envelope before UnmarshalBinary (hive_partition.go:738-760).
  • The filePathColSet comment explaining why STATEMENT_ACCOUNT is excluded is exemplary.
  • BVT covers 11 sections including real negative cases with exact error strings, and the .result shows honest float rounding rather than pretty-printed.

Happy to chat through any of the 🔴 items. The core approach is right; the asks above are about not shipping the announcement until the implementation matches it.

@iamlinjunhong
Contributor Author

Addressed the three blocking items in the latest commit.

  1. Hive external scans are no longer forced into a single serial scope. Instead, Hive now uses a file-level fanout path after partition discovery. The discovered parquet files are sharded by file size across available CN slots for S3, and each shard scans whole files with Parallel=false internally. This avoids the generic parallel-read path mutating param.Filepath, while preserving Extern.Filepath as the Hive base path for partition value extraction.

  2. S3 partition discovery no longer rebuilds the ETL/S3 filesystem on every recursive List call. NewListDirFunc now prebuilds the S3 ETL filesystem once and derives the per-prefix read path for subsequent recursive lists.

  3. Hive partition directory segments are now validated before use. Partition keys/values reject path separators, . / .., NUL/control characters, and % URL-encoded names. I added unit coverage for parser rejection and discovery rejection.

Also removed the placeholder test imports and added a comment documenting why vec.Free(nil) is intentional after UnmarshalBinary.

For the non-blocking compatibility items: URL-encoded partition values remain explicitly unsupported in P0; pruning is still limited to = / IN / AND for the currently prunable types, with row-level filters preserving correctness for range, OR, CAST, DATE/TIMESTAMP/DECIMAL, etc. I agree the PR title/release notes should scope this as Hive-style partitioned Parquet P0 rather than full Hive compatibility. Cross-partition schema evolution is also outside P0; current behavior is “extra parquet columns ignored, missing selected physical columns fail with column not found.” Rolling upgrade and bad-parquet skip behavior should be documented or handled in follow-ups; I did not add silent bad-file skipping because that can hide data loss without an explicit option.

@Ariznawlll
Contributor

Performance Benchmark Results

Benchmarked this PR (commit ab6058a) at two scales using a synthesized TPC-DS-like catalog_returns Parquet dataset. Hardware: Apple M-series laptop, local SSD.

Scope

  • SMALL: 500 partitions × 10,000 rows = 5M rows (~150 MB on disk)
  • MEDIUM: 2,000 partitions × 50,000 rows = 100M rows (~3 GB on disk)
  • LARGE (10,000 partitions): skipped, ran out of local disk (~30 GB needed)
  • StarRocks cross-engine comparison: intended but blocked — SR's allin1-ubuntu image BE segfaults on both arm64 and rosetta-amd64 on Apple Silicon. Would need a remote SR cluster to re-attempt.

All queries run against a single-CN local MatrixOne with PR binary. Warmup = 1 run, timed = 3 runs, median reported.

Results

| Query | SMALL (500 part) | MEDIUM (2000 part) | Comment |
| --- | --- | --- | --- |
| Q1 full scan | 313 ms | 12.3 s | Baseline throughput |
| Q2 single-partition `=` | 26 ms | 35 ms | Near-constant regardless of total partitions — pruning works |
| Q3 `IN` (5 partitions) | 29 ms | 59 ms | ~2× Q2, scales linearly with N matched partitions |
| Q4 `BETWEEN` range (docs say not pruned) | 79 ms | 521 ms | Scales with total data, as documented |
| Q5 `GROUP BY` partition_col | 96 ms | 951 ms | Appears to touch files; may have room to be list-only |
| Q6 JOIN + `IN` pushdown | 55 ms | 90 ms | ✅ IN list pushed into pruning |
| Q7 `COUNT(DISTINCT partition_col)` | 99 ms | 795 ms | Similar to Q5 — opens files unnecessarily? |

Key finding

Partition pruning works as advertised. Q2 latency is ~flat from SMALL to MEDIUM (26ms → 35ms) even as total partition count went 4×. This is the main promise of the feature and it holds.

Side effects (filed separately)

While running the benchmark I hit two issues that block the advertised usage pattern:

  1. [Bug]: s3option + format='parquet' rejected by InitS3Param (blocks hive_partitioning over S3) #24341 — s3option{... 'format'='parquet' ...} is rejected by InitS3Param. The entire S3 path in the PR's user guide is currently unusable; only local infile{} + parquet works. Had to fall back to the local filesystem for this benchmark.
  2. [Bug]: PR #24329 regression — DECIMAL columns can no longer be loaded from Parquet #24343 — DECIMAL columns can no longer be loaded from Parquet on this branch. The PR's parquet.go rewrite (−1489 / +298 lines) removed the T_decimal64/128/256 mappers that exist on main. Had to rewrite queries to avoid decimal aggregates.

Suggestions before merge

Reproducibility

All benchmark scripts, data generator, DDL and queries at /tmp/mo_hive_perf/ locally. Can share if useful.


Labels

size/XXL Denotes a PR that changes 2000+ lines

7 participants