[ON HOLD] fix(ingestion): ingest Iceberg/Delta metadata.json with real table columns #28422
harshsoni2024 wants to merge 3 commits into
… to correctly ingest table columns
Pull request overview
This PR fixes datalake JSON ingestion for Iceberg/Delta metadata files by ensuring the raw JSON text is propagated to the existing JsonDataFrameColumnParser Iceberg/Delta parsing path, so extracted columns reflect the table schema rather than the outer metadata keys.
Changes:
- Add Iceberg/Delta-shape detection (`schema.fields` is a list) in `JSONDataFrameReader._read_json_object` and forward `raw_data` when detected.
- Add unit tests verifying `raw_data` propagation behavior for Iceberg-shaped metadata, JSON Schema, and plain JSON objects.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ingestion/src/metadata/readers/dataframe/json.py | Detect Iceberg/Delta metadata shape and propagate raw_data so downstream schema parsing can run. |
| ingestion/tests/unit/utils/test_datalake.py | Add unit tests covering raw_data propagation for Iceberg-shaped JSON and JSON Schema, and non-propagation for plain objects. |
… minified metadata.json reaches raw_data path
Code Review (Gitar) ✅ Approved
Enables correct ingestion of Iceberg/Delta table columns by updating the JSON reader to correctly identify and forward metadata structures. No issues found.
🟡 Playwright Results: all passed (17 flaky)
✅ 4245 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped
The 17 flaky tests passed on retry.
How to debug locally: download the `playwright-test-results-<shard>` artifact, unzip it, and run `npx playwright show-trace path/to/trace.zip` to view the trace.
    def _is_json_lines(file_obj) -> bool:
        """Check if file is JSON Lines by reading first line."""
        first_line = file_obj.readline()
        if isinstance(first_line, bytes):
            first_line = first_line.decode(UTF_8, errors="ignore")
        first_line = first_line.strip()
        if not first_line:
            return True
        try:



Summary
fix #28423
- Ingesting Iceberg/Delta `metadata.json` files now produces the table's actual columns (e.g. `customer_id`, `customer_type_cd`, …) instead of the file's outer Iceberg keys (`format-version`, `table-uuid`, `schema`, `schemas`, `partition-specs`).
- `JSONDataFrameReader._read_json_object` only forwarded `raw_data` when the JSON had a `$schema` key, so the already-implemented `_parse_iceberg_delta_schema` path in `JsonDataFrameColumnParser` was never reached for Iceberg metadata.
- Detect the Iceberg/Delta shape (`schema.fields` is a list) at read time and forward `raw_data` in that case too. The downstream parser is unchanged.
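
For readers skimming the description, the read-time decision can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in `ingestion/src/metadata/readers/dataframe/json.py`: the helper names `looks_like_iceberg_or_delta` and `raw_data_to_forward` are hypothetical, and the sample payloads are invented for the example.

```python
import json
from typing import Any, Optional


def looks_like_iceberg_or_delta(data: Any) -> bool:
    """Hypothetical helper: True when `data` carries a `schema.fields` list."""
    if not isinstance(data, dict):
        return False
    schema = data.get("schema")
    return isinstance(schema, dict) and isinstance(schema.get("fields"), list)


def raw_data_to_forward(text: str) -> Optional[str]:
    """Hypothetical stand-in for the reader's decision.

    Forward the raw JSON text for JSON Schema documents (`$schema` key) and
    for Iceberg/Delta-shaped metadata; forward nothing for plain objects.
    """
    data = json.loads(text)
    if isinstance(data, dict) and (
        "$schema" in data or looks_like_iceberg_or_delta(data)
    ):
        return text
    return None


# Invented sample payloads, for illustration only.
iceberg_like = json.dumps(
    {
        "format-version": 2,
        "table-uuid": "example",
        "schema": {
            "type": "struct",
            "fields": [{"id": 1, "name": "customer_id", "type": "long"}],
        },
    }
)
json_schema = json.dumps({"$schema": "example", "properties": {}})
plain_object = json.dumps({"name": "x", "schema": "not-a-dict"})
```

Under these assumptions, `raw_data_to_forward` returns the original text for `iceberg_like` and `json_schema` and `None` for `plain_object`, which mirrors the three cases the unit tests in this PR are described as covering.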