Add parquet metadata subcommand and refactor parquet_tools #776

Merged
thinkingfish merged 11 commits into main from claude/integrate-parquet-arrow-ZdTsS on Apr 12, 2026
thinkingfish commented on Apr 12, 2026

Summary

  • Refactors parquet_tools from a single mod.rs into a modular structure, extracting the existing annotate logic into annotate.rs
  • Adds a shared read_parquet_footer() helper for reading parquet metadata and schema
  • Adds a new parquet metadata subcommand for inspecting parquet files

parquet metadata subcommand

Displays file-level metadata, table geometry, and column schema for a parquet file.

Flags (the section flags act as filters; with none given, all sections are shown):

  • --file — file-level key-value metadata only
  • --geometry — logical table shape and per-row-group detail (ASCII table with human-readable byte sizes)
  • --schema — column-level metadata as a pipe-delimited table with metric_type as its own column
  • --json — output in JSON format for programmatic use (combinable with the above)
  • --field=KEY — extract and print a single file-level metadata key (auto-pretty-prints JSON values)
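The "default shows all sections" rule can be sketched roughly as below. This is an illustration only, not the code in metadata.rs: `SectionFlags`, `Section`, and `selected_sections` are hypothetical names, and treating several section flags given together as a union is an assumption this description does not confirm.

```rust
/// Hypothetical mirror of the three section flags; names are illustrative.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SectionFlags {
    file: bool,
    geometry: bool,
    schema: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Section {
    File,
    Geometry,
    Schema,
}

/// Returns the sections to print. With no section flag set, every section
/// is selected; otherwise only the flagged ones (assumed to combine as a union).
fn selected_sections(flags: SectionFlags) -> Vec<Section> {
    let none_set = !(flags.file || flags.geometry || flags.schema);
    let mut out = Vec::new();
    if none_set || flags.file {
        out.push(Section::File);
    }
    if none_set || flags.geometry {
        out.push(Section::Geometry);
    }
    if none_set || flags.schema {
        out.push(Section::Schema);
    }
    out
}

fn main() {
    // No flags: all three sections.
    println!("{:?}", selected_sections(SectionFlags::default()));
    // Only --geometry: just that section.
    let only_geometry = SectionFlags { geometry: true, ..Default::default() };
    println!("{:?}", selected_sections(only_geometry));
}
```

Deriving the "show everything" case from the absence of flags keeps the default free of a separate `--all` switch.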

Example outputs:

$ rezolus parquet metadata -i file.parquet --geometry
Geometry:
  Logical Table: 142 columns x 30000 rows

  Group | Rows  |    Bytes
  ------+-------+---------
  0     | 10000 | 4.6 MiB
  1     | 10000 | 4.6 MiB
  2     | 10000 | 4.6 MiB
  ------+-------+---------
  Total | 30000 | 13.8 MiB

$ rezolus parquet metadata -i file.parquet --schema
Schema (5 fields):
  Name             | Type    | Metric Type   | Metadata
  -----------------+---------+---------------+---------
  timestamp        | UInt64  | timestamp     | metric=timestamp
  cpu_cycles       | Float64 | delta_counter | metric=cpu/cycles
  cpu_instructions | Float64 | delta_counter | metric=cpu/instructions
  cgroup_memory    | Float64 | gauge         | metric=cgroup/memory/usage, name={...}
  request_latency  | List    | histogram     | metric=request/latency

Files changed

  • src/parquet_tools/mod.rs — slimmed to command definitions, dispatch, and shared read_parquet_footer() helper
  • src/parquet_tools/annotate.rs — extracted from mod.rs, unchanged behavior
  • src/parquet_tools/metadata.rs — new subcommand

claude added 10 commits April 12, 2026 20:11
Add metadata, compare, and compare-schema subcommands to the parquet
tool, porting core functionality from the parquet-arrow repository.
Extract existing annotate logic into its own module for cleaner
organization.

New subcommands:
- parquet metadata: display file-level and column-level metadata
  (--schema flag shows only column metadata)
- parquet compare: compare data values between two parquet files
- parquet compare-schema: compare schemas between two parquet files

The --geometry flag shows table geometry: logical shape (columns x rows)
and per-row-group detail (row count and byte size). Without flags, all
sections are shown. --schema and --geometry each filter to their
respective section.

The --file flag shows only file-level key-value metadata, skipping
geometry and column-level schema output.

- --json: output in pretty-printed JSON for programmatic consumption.
  File-level values that are valid JSON are nested as objects rather
  than escaped strings.
- --field=KEY: extract and print a single file-level metadata key's
  raw value. Combined with --json, pretty-prints JSON values.
- Human-readable schema output now uses an aligned table with verbose
  field values collapsed to {...}.
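A rough sketch of how an aligned, pipe-delimited table like this can be produced: compute each column's width from the widest cell, then left-pad every cell to that width. `render_table` and `pad_row` are hypothetical helpers written for illustration; the actual rendering code in this PR may differ, for example in how it sizes the last column.

```rust
/// Left-pads each cell to its column width and joins cells with " | ".
/// Trailing spaces on the final cell are trimmed.
fn pad_row(cells: &[&str], widths: &[usize]) -> String {
    let padded: Vec<String> = cells
        .iter()
        .zip(widths)
        .map(|(cell, width)| format!("{:<w$}", cell, w = *width))
        .collect();
    padded.join(" | ").trim_end().to_string()
}

/// Renders a header, a rule line, and the rows as an aligned ASCII table.
fn render_table(header: &[&str], rows: &[Vec<String>]) -> String {
    // Each column is as wide as its widest cell, header included.
    let mut widths: Vec<usize> = header.iter().map(|h| h.chars().count()).collect();
    for row in rows {
        for (width, cell) in widths.iter_mut().zip(row) {
            *width = (*width).max(cell.chars().count());
        }
    }

    let mut lines = vec![pad_row(header, &widths)];
    // The rule uses "-+-" so each '+' lines up with the '|' separators.
    let rule: Vec<String> = widths.iter().map(|w| "-".repeat(*w)).collect();
    lines.push(rule.join("-+-"));
    for row in rows {
        let cells: Vec<&str> = row.iter().map(String::as_str).collect();
        lines.push(pad_row(&cells, &widths));
    }
    lines.join("\n")
}

fn main() {
    let rows = vec![
        vec!["timestamp".to_string(), "UInt64".to_string()],
        vec!["cpu_cycles".to_string(), "Float64".to_string()],
    ];
    println!("{}", render_table(&["Name", "Type"], &rows));
}
```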

Values like systeminfo are deeply nested JSON and not useful as raw
strings. Now --field always attempts JSON pretty-printing regardless
of --json flag.

Row group details are now shown in an aligned table with a total row.
Byte sizes use human-readable units (KiB/MiB/GiB).
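For illustration, binary-unit formatting of this kind can be sketched as follows. `human_bytes` is a hypothetical name, and the one-decimal precision and the plain "B" case below 1 KiB are assumptions inferred from the sample output (e.g. 4.6 MiB), not details taken from the PR's code.

```rust
/// Formats a byte count using binary units (KiB/MiB/GiB) with one decimal.
/// Counts below 1024 are printed as plain bytes (an assumption).
fn human_bytes(bytes: u64) -> String {
    const UNITS: [&str; 3] = ["KiB", "MiB", "GiB"];
    if bytes < 1024 {
        return format!("{} B", bytes);
    }
    let mut value = bytes as f64 / 1024.0;
    let mut unit = 0;
    // Step up one unit at a time; GiB is the largest unit used here.
    while value >= 1024.0 && unit + 1 < UNITS.len() {
        value /= 1024.0;
        unit += 1;
    }
    format!("{:.1} {}", value, UNITS[unit])
}

fn main() {
    for size in [512u64, 1536, 5 * 1024 * 1024, 3 * 1024 * 1024 * 1024] {
        println!("{:>12} -> {}", size, human_bytes(size));
    }
}
```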

Schema now uses the same ASCII table style as geometry. metric_type is
lifted into its own column for easy scanning. Remaining metadata values
are truncated at 60 characters, collapsing to {...} when exceeded.
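One way to read this rule, sketched under assumptions: `metric_type` is pulled out of a column's key-value metadata into its own cell, and any remaining value longer than 60 characters is replaced by `{...}`. The helper names `collapse` and `schema_cells`, the `key=value` joining with ", ", and the interpretation that an over-long value is replaced whole (rather than cut at 60 characters) are all illustrative guesses based on the sample output.

```rust
/// Assumed threshold from the commit message: values longer than this collapse.
const MAX_VALUE_CHARS: usize = 60;

/// Replaces an over-long value with "{...}"; shorter values pass through.
fn collapse(value: &str) -> String {
    if value.chars().count() > MAX_VALUE_CHARS {
        "{...}".to_string()
    } else {
        value.to_string()
    }
}

/// Splits column metadata into (metric_type cell, remaining metadata cell).
fn schema_cells(metadata: &[(&str, &str)]) -> (String, String) {
    let mut metric_type = String::new();
    let mut rest = Vec::new();
    for &(key, value) in metadata {
        if key == "metric_type" {
            // Lifted into its own column instead of the metadata cell.
            metric_type = value.to_string();
        } else {
            rest.push(format!("{}={}", key, collapse(value)));
        }
    }
    (metric_type, rest.join(", "))
}

fn main() {
    let long_value = "x".repeat(80);
    let metadata = [
        ("metric", "cgroup/memory/usage"),
        ("metric_type", "gauge"),
        ("name", long_value.as_str()),
    ];
    let (metric_type, rest) = schema_cells(&metadata);
    println!("{} | {}", metric_type, rest);
}
```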

Schema comparison is achievable by diffing the output of
'parquet metadata --schema' or 'parquet metadata --schema --json'
for two files.
thinkingfish changed the title from "Refactor parquet tools into modular subcommands" to "Add parquet metadata subcommand and refactor parquet_tools" on Apr 12, 2026
thinkingfish merged commit 0ca0f76 into main on Apr 12, 2026
25 checks passed
thinkingfish deleted the claude/integrate-parquet-arrow-ZdTsS branch on April 12, 2026 22:12