[ON HOLD] fix(ingestion): ingest Iceberg/Delta metadata.json with real table columns#28422

Open
harshsoni2024 wants to merge 3 commits into main from datalake_iceberg_table_fields_fix

Conversation

@harshsoni2024
Contributor

@harshsoni2024 harshsoni2024 commented May 26, 2026

Summary

Fixes #28423

  • Datalake JSON ingestion of Iceberg/Delta metadata.json files now produces the table's actual columns (e.g. customer_id, customer_type_cd, …) instead of the file's outer Iceberg keys (format-version, table-uuid, schema, schemas, partition-specs).
  • Root cause: JSONDataFrameReader._read_json_object only forwarded raw_data when the JSON had a $schema key, so the already-implemented _parse_iceberg_delta_schema path in JsonDataFrameColumnParser was never reached for Iceberg metadata.
  • Fix: detect Iceberg/Delta shape (schema.fields is a list) at read time and forward raw_data in that case too. The downstream parser is unchanged.
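
The shape check described in the last bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `looks_like_iceberg_or_delta` and `select_raw_data` are hypothetical names, and the only facts taken from the summary are the `$schema` condition and the "`schema.fields` is a list" condition.

```python
import json
from typing import Any, Optional


def looks_like_iceberg_or_delta(data: Any) -> bool:
    """Return True when the parsed JSON has a `schema.fields` list (hypothetical helper)."""
    if not isinstance(data, dict):
        return False
    schema = data.get("schema")
    return isinstance(schema, dict) and isinstance(schema.get("fields"), list)


def select_raw_data(text: str) -> Optional[str]:
    """Return the raw text only when a downstream schema parser should see it.

    Assumption: raw text is forwarded for JSON Schema documents (`$schema` key)
    and for Iceberg/Delta-shaped metadata, and withheld for plain JSON objects.
    """
    data = json.loads(text)
    if isinstance(data, dict) and ("$schema" in data or looks_like_iceberg_or_delta(data)):
        return text
    return None
```

Under these assumptions a plain object such as `{"a": 1}` yields `None`, while a document carrying `schema.fields` as a list is passed through unchanged.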

Copilot AI review requested due to automatic review settings May 26, 2026 05:24
@harshsoni2024 harshsoni2024 requested a review from a team as a code owner May 26, 2026 05:24
@harshsoni2024 harshsoni2024 added the Ingestion, safe to test, and To release labels May 26, 2026
@harshsoni2024 harshsoni2024 changed the title from "fix(ingestion): detect Iceberg/Delta metadata in datalake JSON reader" to "fix(ingestion): ingest Iceberg/Delta metadata.json with real table columns" May 26, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes datalake JSON ingestion for Iceberg/Delta metadata files by ensuring the raw JSON text is propagated to the existing JsonDataFrameColumnParser Iceberg/Delta parsing path, so extracted columns reflect the table schema rather than the outer metadata keys.
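
To make the difference concrete, here is a small sketch contrasting the outer metadata keys with the table's own columns. The sample document is a hypothetical, trimmed-down stand-in for a real metadata.json, and both helper functions are illustrative names rather than anything from the codebase.

```python
import json
from typing import List

# Hypothetical, trimmed-down metadata.json content for illustration only.
METADATA_TEXT = json.dumps(
    {
        "format-version": 2,
        "table-uuid": "00000000-0000-0000-0000-000000000000",
        "schema": {
            "type": "struct",
            "fields": [
                {"id": 1, "name": "customer_id", "required": True, "type": "long"},
                {"id": 2, "name": "customer_type_cd", "required": False, "type": "string"},
            ],
        },
    }
)


def outer_keys(text: str) -> List[str]:
    """Top-level keys: what shows up as columns when the file is read as a plain object."""
    return list(json.loads(text))


def table_columns(text: str) -> List[str]:
    """Column names taken from schema.fields: what the table actually contains."""
    return [field["name"] for field in json.loads(text)["schema"]["fields"]]


print(outer_keys(METADATA_TEXT))     # ['format-version', 'table-uuid', 'schema']
print(table_columns(METADATA_TEXT))  # ['customer_id', 'customer_type_cd']
```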

Changes:

  • Add Iceberg/Delta-shape detection (schema.fields is a list) in JSONDataFrameReader._read_json_object and forward raw_data when detected.
  • Add unit tests verifying raw_data propagation behavior for Iceberg-shaped metadata, JSON Schema, and plain JSON objects.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ingestion/src/metadata/readers/dataframe/json.py | Detect Iceberg/Delta metadata shape and propagate raw_data so downstream schema parsing can run. |
| ingestion/tests/unit/utils/test_datalake.py | Add unit tests covering raw_data propagation for Iceberg-shaped JSON and JSON Schema, and non-propagation for plain objects. |

Comment thread ingestion/src/metadata/readers/dataframe/json.py
… minified metadata.json reaches raw_data path
@gitar-bot

gitar-bot Bot commented May 26, 2026

Code Review ✅ Approved

Enables correct ingestion of Iceberg/Delta table columns by updating the JSON reader to correctly identify and forward metadata structures. No issues found.

@github-actions
Contributor

github-actions Bot commented May 26, 2026

🟡 Playwright Results — all passed (17 flaky)

✅ 4245 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped

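| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 799 | 0 | 4 | 9 |
| 🟡 Shard 3 | 801 | 0 | 2 | 8 |
| 🟡 Shard 4 | 840 | 0 | 5 | 12 |
| 🟡 Shard 5 | 717 | 0 | 2 | 47 |
| 🟡 Shard 6 | 789 | 0 | 4 | 8 |
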
🟡 17 flaky test(s) (passed on retry)
  • Features/ColumnBulkOperations.spec.ts › should update pending changes counter when editing selected columns (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › EditAll User: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Email (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Should clear search and show all properties for dataProduct in right panel (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Create DataProducts and add remove assets (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for dashboardDataModel -> topic (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 27, 2026 04:57
@harshsoni2024 harshsoni2024 changed the title from "fix(ingestion): ingest Iceberg/Delta metadata.json with real table columns" to "[ON HOLD] fix(ingestion): ingest Iceberg/Delta metadata.json with real table columns" May 27, 2026
@harshsoni2024 harshsoni2024 removed the To release label May 27, 2026
@gitar-bot

gitar-bot Bot commented May 27, 2026

Code Review ✅ Approved

Enables correct ingestion of Iceberg/Delta table columns by updating the JSON reader to correctly identify and forward metadata structures. No issues found.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines 142 to 150
def _is_json_lines(file_obj) -> bool:
    """Check if file is JSON Lines by reading first line."""
    first_line = file_obj.readline()
    if isinstance(first_line, bytes):
        first_line = first_line.decode(UTF_8, errors="ignore")
    first_line = first_line.strip()
    if not first_line:
        return True
    try:
Comment thread ingestion/src/metadata/readers/dataframe/json.py
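
The excerpt above is cut off at `try:`, so the rest of the function is not visible here. The sketch below assumes the remainder simply attempts to parse the first line as JSON; `first_line_is_json` is a hypothetical stand-in, not the project's `_is_json_lines`. It illustrates why a first-line check may treat a minified, single-line metadata.json the same way as a JSON Lines file, while a pretty-printed one fails the check.

```python
import io
import json


def first_line_is_json(file_obj) -> bool:
    """Hypothetical first-line check: does the first line parse as standalone JSON?"""
    first_line = file_obj.readline()
    if isinstance(first_line, bytes):
        first_line = first_line.decode("utf-8", errors="ignore")
    first_line = first_line.strip()
    if not first_line:
        # Mirrors the excerpt: an empty first line is treated as JSON Lines.
        return True
    try:
        # Assumption: the truncated block parses the first line on its own.
        json.loads(first_line)
        return True
    except json.JSONDecodeError:
        return False


# A minified document is one complete JSON value on a single line.
minified = io.BytesIO(b'{"schema":{"fields":[]}}')
# A pretty-printed document starts with a lone "{", which is not valid JSON by itself.
pretty = io.BytesIO(b'{\n  "schema": {\n    "fields": []\n  }\n}')

print(first_line_is_json(minified))  # True
print(first_line_is_json(pretty))    # False
```

If the real function behaves like this sketch, a reader that routes on the first-line result would need an extra check before deciding that a single-line file is JSON Lines rather than one minified object.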
@sonarqubecloud

Labels

Ingestion, safe to test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ingestion: Iceberg/Delta metadata.json not ingesting real table columns

2 participants