
feat(external): support Hive-style partitioned Parquet external tables #24329

Open

iamlinjunhong wants to merge 6 commits into matrixorigin:3.0-dev from iamlinjunhong:dp-hive

Conversation

@iamlinjunhong
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24320

What this PR does / why we need it:

support Hive-style partitioned Parquet external tables

Contributor

@XuPeng-SH left a comment


Thanks for the substantial effort here — +6347 lines with 3385 lines of tests, clean file-level decomposition, and real edge-case coverage in the BVT. A lot of it is solid. My concerns focus on whether this is ready to be announced as "Hive-style partitioned Parquet" given how several pieces interact.

Leaving as request-for-changes for the 🔴 items; the 🟡 items are open questions I'd like resolved before merge or explicitly deferred with a follow-up issue.

🔴 Blocking

1. Hive tables are scanned serially regardless of cluster size

compile.go:1675-1679 unconditionally forces param.Parallel = false for Hive tables:

if param.HivePartitioning {
    param.Parallel = false
}

Combined with getReadWriteParallelFlag (:1596-1604), this routes every Hive table into compileExternScanSerialReadWrite at :1862 — a single scope, no cross-CN fanout. A table with 10K partitions = 10K files is opened one at a time on one CN. The inline comment says it's because "the parallel loop mutates param.Filepath per file" — that's a correctness workaround, not a design decision.

This dwarfs every other performance concern. The feature technically works on small datasets but is unusable on anything Hive workloads actually look like. Please either:

  • fix the param.Filepath mutation so parallel paths can be reused, or
  • build a file-level fanout specifically for Hive (the per-partition file set is ideal for this).
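The per-partition file set makes the second option easy to sketch. Below is a minimal, hypothetical version of file-level sharding — `hiveFile` and `shardBySize` are invented names for illustration, not the PR's API. Greedy largest-first assignment keeps per-CN byte totals roughly balanced:

```go
package main

import (
	"fmt"
	"sort"
)

// hiveFile is a hypothetical record for one discovered partition file.
type hiveFile struct {
	Path string
	Size int64
}

// shardBySize distributes files across n slots with a greedy
// largest-first heuristic: each file goes to the currently lightest
// shard, so total bytes per CN stay roughly even.
func shardBySize(files []hiveFile, n int) [][]hiveFile {
	sorted := make([]hiveFile, len(files))
	copy(sorted, files)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Size > sorted[j].Size })

	shards := make([][]hiveFile, n)
	load := make([]int64, n)
	for _, f := range sorted {
		// pick the shard with the smallest accumulated size
		min := 0
		for i := 1; i < n; i++ {
			if load[i] < load[min] {
				min = i
			}
		}
		shards[min] = append(shards[min], f)
		load[min] += f.Size
	}
	return shards
}

func main() {
	files := []hiveFile{
		{"year=2024/a.parquet", 500}, {"year=2024/b.parquet", 300},
		{"year=2023/c.parquet", 200}, {"year=2023/d.parquet", 100},
	}
	for i, s := range shardBySize(files, 2) {
		fmt.Printf("shard %d: %d files\n", i, len(s))
	}
}
```

Each shard can then scan whole files serially, which sidesteps the `param.Filepath` mutation problem entirely.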

2. S3 FS rebuilt per List call

hive_partition.go:94-105 — NewListDirFunc calls plan2.GetForETLWithType(param, prefix) on every recursive directory listing. For each call the option string is re-encoded and a fresh S3FS is constructed. For N partitions at depth D that's O(N·D) S3FS constructions on top of O(N·D) LIST calls. The code already carries a TODO:

TODO: For S3 this re-creates an S3FS instance per List call; pre-build the FS once and reuse across recursive calls for better performance.

Please pre-build the FS once and reuse. The combination of #1 + #2 means S3-backed Hive tables are going to surprise users badly.
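The fix can be as small as hoisting the construction out of the recursive closure. This sketch uses an invented FS interface rather than MatrixOne's actual fileservice types:

```go
package main

import "fmt"

// FS is a minimal stand-in for the ETL/S3 file service interface;
// names here are hypothetical, not MatrixOne's actual API.
type FS interface {
	List(prefix string) []string
}

type mapFS struct{ entries map[string][]string }

func (m *mapFS) List(prefix string) []string { return m.entries[prefix] }

// newListDirFunc builds the FS once and returns a closure that reuses
// it for every recursive directory listing, instead of reconstructing
// an S3FS per call.
func newListDirFunc(build func() FS) func(prefix string) []string {
	fs := build() // one construction, shared by all recursive calls
	return func(prefix string) []string { return fs.List(prefix) }
}

func main() {
	builds := 0
	listDir := newListDirFunc(func() FS {
		builds++ // track how many times the FS is constructed
		return &mapFS{entries: map[string][]string{
			"warehouse/": {"year=2023/", "year=2024/"},
		}}
	})
	listDir("warehouse/")
	listDir("warehouse/year=2023/")
	fmt.Println("constructions:", builds) // stays 1 across calls
}
```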

3. Partition directory names are not sanitized against traversal

ParseHivePartitionSegment (hive_partition.go:143-151) accepts any bytes in Value — including /, .., NUL, newlines. The only rejection is % for URL-encoded content. Later path.Join(prefix, entry.Name) (:303) collapses .. segments. On local FS a directory literally named year=.. (or any attacker-plantable layout underneath a tenant-readable base path) walks out of the intended subtree.

The directory name comes from fileservice.DirEntry, which reflects the real filesystem entry — it is attacker-reachable on any writable path that a different principal can manipulate. Please reject control bytes, /, .., and NUL in both Key and Value. An explicit allow-list is safer than an ad-hoc reject list.
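An allow-list along these lines would cover the cases above — this is an illustrative policy, not the PR's code:

```go
package main

import (
	"fmt"
	"strings"
)

// validSegmentPart allow-lists the bytes a partition key or value may
// contain; everything else (path separators, control bytes, percent
// escapes) is rejected, as are the traversal names "." and "..".
func validSegmentPart(s string) bool {
	if s == "" || s == "." || s == ".." {
		return false
	}
	for _, r := range s {
		ok := r >= 'a' && r <= 'z' ||
			r >= 'A' && r <= 'Z' ||
			r >= '0' && r <= '9' ||
			strings.ContainsRune("._- ", r)
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	for _, v := range []string{"2024", "..", "a/b", "US%2FCA", "x\x00y"} {
		fmt.Printf("%q -> %v\n", v, validSegmentPart(v))
	}
}
```

Because it enumerates what is allowed, new dangerous byte sequences are rejected by default rather than requiring another reject-list patch.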

🟡 Please address or acknowledge

4. "Hive compatibility" is narrower than the title claims

  • URL-encoded partition values are rejected (hive_partition.go:269-272). Spark / Hive / Trino write country=US%2FCA verbatim whenever a partition value contains /, :, % or other filesystem-unsafe bytes. Datasets using such values cannot be read. The test comment at hive_partition_external_table.sql acknowledges this as "P0 known limitation" — but if it's P0 and known-broken, the feature title overclaims compatibility.
  • DATE / TIMESTAMP / FLOAT / DECIMAL partitions do not prune (hive_partition.go:419-431 canPruneType). WHERE dt = '2024-01-01' lists every partition. Hive workloads are predominantly date-partitioned; this materially regresses the most common pruning case.
  • Predicate pruning supports only = and IN with literals (hive_partition.go:586-595). >, >=, BETWEEN, !=, NOT IN, LIKE, CAST(...), and any OR combination all fall back to full listing. Again, time-range queries are the typical Hive access pattern.

I'm not asking for all three to be fixed in this PR, but please either (a) scope the title/release notes to what actually works, or (b) file explicit follow-up issues and link them. Right now "support Hive-style partitioned Parquet" sets a wrong expectation.
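For context on the first bullet: Hive's escaping is plain percent-encoding, so a future decoder could lean on the standard library. A sketch with Go's net/url (whether PathUnescape's escape handling matches Hive's exact character set would need verification):

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeHiveValue decodes a Hive-style percent-encoded partition value
// such as "US%2FCA" -> "US/CA". Writers like Spark/Hive/Trino encode
// filesystem-unsafe bytes this way in partition directory names, so a
// reader aiming for full compatibility has to reverse it.
func decodeHiveValue(raw string) (string, error) {
	// url.PathUnescape handles %XX sequences and leaves other bytes
	// untouched, which is close enough to Hive's escaping for a sketch.
	return url.PathUnescape(raw)
}

func main() {
	v, err := decodeHiveValue("US%2FCA")
	fmt.Println(v, err) // US/CA <nil>
}
```

Note that decoding must happen after segment validation, or the traversal protection from the previous item can be bypassed via encoded separators.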

5. Cross-partition schema evolution is untested and probably broken

Hive datasets routinely accumulate parquet files with drifting schemas (added/dropped columns across partition directories). The code caches the schema from the first opened file and findColumnIgnoreCase is per-file, so behavior depends on which file happens to be scanned first. Zero BVT cases for add-col / drop-col across partitions. Please add a test and document the behavior (fail hard / fill NULL / skip column) explicitly.

6. Rolling upgrade within 3.0-dev

parsers/tree/update.go:4511-4525 extends ExParamConst with HivePartitioning/HivePartitionCols/HivePartColTypes. These are serialized into Createsql via json.Marshal at DDL time (build_ddl.go:930) and re-parsed on every compile on the CN that runs the query (compile.go:1573). A CN on the older code inside the 3.0-dev window will json.Unmarshal successfully (unknown fields are ignored), miss the Hive flags, and route the table through generic ReadDir. For a Hive base path like /warehouse/sales/ with no glob characters, ReadDir (utils.go:2084-2133) returns an empty file list, so the query returns zero rows rather than wrong rows — still a visible break during mixed deployment. Please guard with a feature flag or version check, or document the ordering constraint for the 3.0 cutover.

7. A single bad parquet kills the whole query

parquet.go + scanParquetFile returns an error out of the Hive loop if newParquetHandler fails on any file (0-byte, truncated, _temporary residue). Production Hive datasets carry broken files as a matter of course; Spark/Trino tolerate them with per-file skip + log. Please at minimum add a continue-on-error path for the Hive reader, matching what CSV external already does in some configs.

8. Hive keywords added to DDL without a feature flag

build_ddl.go:939 + utils.go:1673-1829 (parseHiveOptionKV). Any existing external-table DDL that happens to use hive_partitioning / hive_partition_columns as option keys for unrelated reasons changes semantics silently. Previously these would be rejected at the "keyword not supported" default. Low probability but zero rollback. Either (a) gate behind a server-level feature flag for the first release cycle or (b) add a changelog entry specifically flagging this.

🟢 Follow-ups (fine to defer)

  • Partition discovery is single-goroutine and serial (hive_partition.go:258-322). At depth 3 with 99-wide fanout you exhaust maxListCalls=10000. With a worker pool + LIST prefetching this is easily 10× better on S3. Make the two constants (maxListCalls, maxPartitionCount) configurable while you're at it.
  • matchInt / matchUint strip leading zeros (hive_partition.go:434, :451) so year=007 matches year=7, but country='CN' on varchar requires exact match. Two partition columns on the same dataset can have inconsistent pruning semantics. Documented behavior would help.
  • extractVecValues passes nil mempool to vec.Free(nil) (hive_partition.go:668). Other sites typically use proc.Mp(). Please verify Vector.Free semantics on nil mempool are intended; at minimum add a comment pinning the expectation.
  • SHOW CREATE TABLE drops case on partition column names (build_ddl.go:4625-4627 lowercases before storing). 'hive_partition_columns'='Year,Month' round-trips as 'year,month', breaking case-sensitive consumers of SHOW CREATE.
  • Float partition values have no NaN/Inf guard (hive_partition_fill.go:1911,1918). amount=inf/data.parquet silently fills +Inf.
  • HivePartColType embedded in tree package (parsers/tree/update.go:4519). Layering workaround to avoid importing pkg/pb/plan, but now tree knows plan type codes. A TODO with the plan-side extraction target would make this less load-bearing.
  • isHivePartitionCol linear scan per column (hive_partition_fill.go:1730-1741). Fine for small schemas, quadratic for wide tables.
  • Placeholder imports in tests: var _ = catalog.ExternalFilePath / var _ = function.EQUAL / var _ = math.MaxInt32 in hive_partition_test.go:3890-3891 and _coverage_test.go:1692. Just remove the imports.
  • Copyright headers mix 2024 (hive_partition.go) and 2026 (hive_partition_coverage_test.go, compile/hive_partition_test.go). Cosmetic.
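The matchInt/varchar asymmetry from the second follow-up bullet, in miniature (both helpers are sketches of the described behavior, not the PR's functions):

```go
package main

import (
	"fmt"
	"strconv"
)

// matchIntSegment compares a partition directory value against an
// integer predicate numerically, so "007" matches 7 — leading zeros
// vanish in the parse.
func matchIntSegment(dirVal string, want int64) bool {
	n, err := strconv.ParseInt(dirVal, 10, 64)
	return err == nil && n == want
}

// matchStrSegment compares varchar partition values byte-for-byte, so
// the same directory under a string-typed column would NOT match "7".
func matchStrSegment(dirVal, want string) bool {
	return dirVal == want
}

func main() {
	fmt.Println(matchIntSegment("007", 7))   // numeric: matches
	fmt.Println(matchStrSegment("007", "7")) // exact: does not
}
```

Two partition columns on one dataset can thus prune under different equality semantics, which is worth a line in the docs.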

What's genuinely good (for balance)

  • Split of hive_partition.go (discovery + pruning) vs hive_partition_fill.go (constant-vector fill at scan time) is clean.
  • ClassifyFilters keeping partition filters in both partitionFilters and rowFilters for double-filtering safety (hive_partition.go:501-504) is a good defensive choice.
  • Constant-vector fill path (SetConstFixed / SetConstBytes) avoids per-row work in the hot loop. refreshPartitionValues runs once per file open, not per batch.
  • extractVecValues defensively bounds-checks the binary envelope before UnmarshalBinary (hive_partition.go:738-760).
  • The filePathColSet comment explaining why STATEMENT_ACCOUNT is excluded is exemplary.
  • BVT covers 11 sections including real negative cases with exact error strings, and the .result shows honest float rounding rather than pretty-printed.

Happy to chat through any of the 🔴 items. The core approach is right; the asks above are about not shipping the announcement until the implementation matches it.

@iamlinjunhong
Contributor Author

Addressed the three blocking items in the latest commit.

  1. Hive external scans are no longer forced into a single serial scope. Instead, Hive now uses a file-level fanout path after partition discovery. The discovered parquet files are sharded by file size across available CN slots for S3, and each shard scans whole files with Parallel=false internally. This avoids the generic parallel-read path mutating param.Filepath, while preserving Extern.Filepath as the Hive base path for partition value extraction.

  2. S3 partition discovery no longer rebuilds the ETL/S3 filesystem on every recursive List call. NewListDirFunc now prebuilds the S3 ETL filesystem once and derives the per-prefix read path for subsequent recursive lists.

  3. Hive partition directory segments are now validated before use. Partition keys/values reject path separators, . / .., NUL/control characters, and % URL-encoded names. I added unit coverage for parser rejection and discovery rejection.

Also removed the placeholder test imports and added a comment documenting why vec.Free(nil) is intentional after UnmarshalBinary.

For the non-blocking compatibility items: URL-encoded partition values remain explicitly unsupported in P0; pruning is still limited to = / IN / AND for the currently prunable types, with row-level filters preserving correctness for range, OR, CAST, DATE/TIMESTAMP/DECIMAL, etc. I agree the PR title/release notes should scope this as Hive-style partitioned Parquet P0 rather than full Hive compatibility. Cross-partition schema evolution is also outside P0; current behavior is “extra parquet columns ignored, missing selected physical columns fail with column not found.” Rolling upgrade and bad-parquet skip behavior should be documented or handled in follow-ups; I did not add silent bad-file skipping because that can hide data loss without an explicit option.

@Ariznawlll
Contributor

Performance Benchmark Results

Benchmarked this PR (commit ab6058a) at two scales using a synthesized TPC-DS-like catalog_returns Parquet dataset. Hardware: Apple M-series laptop, local SSD.

Scope

  • SMALL: 500 partitions × 10,000 rows = 5M rows (~150 MB on disk)
  • MEDIUM: 2,000 partitions × 50,000 rows = 100M rows (~3 GB on disk)
  • LARGE (10,000 partitions): skipped, ran out of local disk (~30 GB needed)
  • StarRocks cross-engine comparison: intended but blocked — SR's allin1-ubuntu image BE segfaults on both arm64 and rosetta-amd64 on Apple Silicon. Would need a remote SR cluster to re-attempt.

All queries run against a single-CN local MatrixOne with PR binary. Warmup = 1 run, timed = 3 runs, median reported.

Results

| Query | SMALL (500 part) | MEDIUM (2000 part) | Comment |
| --- | --- | --- | --- |
| Q1 full scan | 313 ms | 12.3 s | Baseline throughput |
| Q2 single-partition `=` | 26 ms | 35 ms | Near-constant regardless of total partitions — pruning works |
| Q3 `IN` (5 partitions) | 29 ms | 59 ms | ~2× Q2, scales linearly with N matched partitions |
| Q4 `BETWEEN` range (docs say not pruned) | 79 ms | 521 ms | Scales with total data, as documented |
| Q5 `GROUP BY` partition_col | 96 ms | 951 ms | Appears to touch files; may have room to be list-only |
| Q6 JOIN + `IN` pushdown | 55 ms | 90 ms | ✅ IN list pushed into pruning |
| Q7 `COUNT(DISTINCT partition_col)` | 99 ms | 795 ms | Similar to Q5 — opens files unnecessarily? |

Key finding

Partition pruning works as advertised. Q2 latency is ~flat from SMALL to MEDIUM (26ms → 35ms) even as total partition count went 4×. This is the main promise of the feature and it holds.

Side effects (filed separately)

While running the benchmark I hit two issues that block the advertised usage pattern:

  1. [Bug]: s3option + format='parquet' rejected by InitS3Param (blocks hive_partitioning over S3) #24341 — s3option{... 'format'='parquet' ...} is rejected by InitS3Param. The entire S3 path in the PR's user guide is currently unusable; only local infile{} + parquet works. Had to fall back to the local filesystem for this benchmark.
  2. [Bug]: PR #24329 regression — DECIMAL columns can no longer be loaded from Parquet #24343 — DECIMAL columns can no longer be loaded from Parquet on this branch. The PR's parquet.go rewrite (−1489 / +298 lines) removed the T_decimal64/128/256 mappers that exist on main. Had to rewrite queries to avoid decimal aggregates.

Suggestions before merge

Reproducibility

All benchmark scripts, data generator, DDL and queries at /tmp/mo_hive_perf/ locally. Can share if useful.


Labels

size/XXL Denotes a PR that changes 2000+ lines

7 participants