
Fix col type parsing - Databricks #27610

Merged
pmbrull merged 1 commit into main from databricks_connector
Apr 22, 2026

Conversation

@harshach
Collaborator

@harshach harshach commented Apr 22, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Bug fix:
    • Generalized end-of-columns detection to terminate on any #-prefixed marker or empty col_type, preventing a TypeError.
    • Added broad try-except to column processing to ensure metadata for surviving columns is still captured.
  • Improvement:
    • Added robust handling for unparseable col_type values by skipping the column and logging a warning.
    • Implemented unit tests for Databricks get_columns to verify section boundary handling and failure recovery.


Copilot AI review requested due to automatic review settings April 22, 2026 06:41
@harshach harshach requested a review from a team as a code owner April 22, 2026 06:41
@github-actions github-actions Bot added the backend and safe to test labels Apr 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR targets Databricks ingestion robustness by preventing get_columns from crashing (and silently returning no columns) when DESCRIBE TABLE EXTENDED returns rows with empty/None col_type.

Changes:

  • Add unit/regression coverage for malformed DESCRIBE TABLE EXTENDED rows (empty/None col_type).
  • Tighten _get_column_rows filtering and add defensive parsing in get_columns to skip unparseable types instead of raising.
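
For context, a rough illustration of the kind of DESCRIBE TABLE EXTENDED output involved. The row values below are assumptions modeled on the markers discussed in this PR (for example '# Partition Information' and '# Metadata Columns'); they are not the PR's actual fixture data.

```python
# Hypothetical shape of rows handed to get_columns; only the structure is taken
# from the PR discussion, the concrete values are illustrative.
describe_rows = [
    ("id", "bigint", None),                      # ordinary column
    ("payload", "struct<a:int,b:string>", None), # complex type, still a column
    ("# Partition Information", "", ""),         # section marker with empty col_type
    ("# col_name", "data_type", "comment"),      # repeated header row
    ("id", "bigint", None),
    ("# Metadata Columns", None, None),          # Spark v2 marker; None col_type used to raise TypeError
    ("Location", "s3://bucket/path", ""),        # table metadata, not a column
]
```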

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ingestion/tests/unit/topology/database/test_databricks_get_columns.py | Adds regression tests covering None/empty col_type rows for Databricks get_columns and _get_column_rows. |
| ingestion/src/metadata/ingestion/source/database/databricks/metadata.py | Updates row filtering and adds a guard around regex parsing of col_type to avoid runtime errors. |

Comment on lines +168 to +175
# Filter out empty rows, comment markers, and rows missing a usable col_type.
# DESCRIBE TABLE EXTENDED can return rows where col_type is empty/None for
# Delta/streaming partition markers; downstream regex at line ~200 requires
# a non-empty string.
return [
    row
    for row in column_rows
    if row[0] and row[0] != "# col_name" and isinstance(row[1], str) and row[1]

Copilot AI Apr 22, 2026


The new _get_column_rows filter drops rows where col_type is empty/None. However, Databricks/Hive DESCRIBE output uses section header rows like # Partition Information / # Detailed Table Information that often have empty col_type (see existing test_databricks_ordinal_position.py which provides ("# Partition Information", "", "")). If those header rows are filtered out, get_columns will never hit the break condition and can start treating table/partition metadata rows (e.g., Location/Provider) as columns. Consider preserving header rows (e.g., allow row[0].startswith('#') except # col_name, or explicitly whitelist the break-marker values) while still dropping truly malformed column rows with missing col_type.

Suggested change

- # Filter out empty rows, comment markers, and rows missing a usable col_type.
- # DESCRIBE TABLE EXTENDED can return rows where col_type is empty/None for
- # Delta/streaming partition markers; downstream regex at line ~200 requires
- # a non-empty string.
- return [
-     row
-     for row in column_rows
-     if row[0] and row[0] != "# col_name" and isinstance(row[1], str) and row[1]
+ # Filter out empty rows and the DESCRIBE schema header row, while preserving
+ # section marker rows such as "# Partition Information" and
+ # "# Detailed Table Information". Those marker rows often have an empty
+ # col_type, but downstream parsing relies on them to stop before table
+ # metadata rows like Location/Provider are treated as columns.
+ return [
+     row
+     for row in column_rows
+     if row[0]
+     and row[0] != "# col_name"
+     and (
+         row[0].startswith("#")
+         or (isinstance(row[1], str) and row[1])
+     )

Comment thread ingestion/tests/unit/topology/database/test_databricks_get_columns.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/databricks/metadata.py Outdated
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

🔴 Playwright Results — 1 failure(s), 22 flaky

✅ 3688 passed · ❌ 1 failed · 🟡 22 flaky · ⏭️ 89 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🔴 Shard 1 | 479 | 1 | 1 | 4 |
| 🟡 Shard 2 | 654 | 0 | 2 | 7 |
| 🟡 Shard 3 | 663 | 0 | 3 | 1 |
| 🟡 Shard 4 | 644 | 0 | 4 | 27 |
| 🟡 Shard 5 | 609 | 0 | 2 | 42 |
| 🟡 Shard 6 | 639 | 0 | 10 | 8 |

Genuine Failures (failed on all attempts)

Pages/SearchIndexApplication.spec.ts › Search Index Application (shard 1)
Error: expect(received).toEqual(expected) // deep equality

Expected: StringMatching /success|activeError/g
Received: "failed"
🟡 22 flaky test(s) (passed on retry)
  • Pages/Customproperties-part1.spec.ts › no duplicate card after update (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryHierarchy.spec.ts › should cancel move operation (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show online status badge on user profile for active users (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Run Validation - Inherited contract validation uses entity-based validation (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for ApiEndpoint (shard 4, 1 retry)
  • Pages/DomainAdvanced.spec.ts › Filter assets by domain from explore page (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for table (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Pages/HyperlinkCustomProperty.spec.ts › should show No Data placeholder when hyperlink has no value (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Table (shard 6, 2 retries)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Search Index (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Directory (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for searchIndex -> mlModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Tag.spec.ts › Verify Owner Add Delete (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Comment on lines +200 to +207
raw_col_type = col_type
type_match = re.search(r"^\w+", col_type)
if type_match is None:
    logger.warning(
        f"Skipping column '{col_name}' in {schema}.{table_name}: "
        f"unparseable col_type '{col_type}'"
    )
    continue
Contributor


I am not sold on this. From research it seems that this should never happen. I added it here as a defensive way to keep the column extraction going in any case, but I also do not like the possibility of ending up with incomplete column metadata.

Not sure whether a proper failure that skips column extraction entirely would be better here.

@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

pmbrull
pmbrull previously approved these changes Apr 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread ingestion/tests/unit/topology/database/test_databricks_get_columns.py Outdated
Comment on lines +172 to +182
def test_unexpected_exception_in_row_does_not_drop_other_columns(self, mock_rows):
    """Broad per-row try/except ensures one bad column doesn't lose the
    rest of the table's columns via the outer sql_column_handler catch."""
    mock_rows.return_value = [
        ("good1", "bigint", None),
        # col_type that passes the truthy guard and the regex, but a
        # downstream failure is simulated by an unknown type mapping
        # returning NullType (already handled — should still append).
        ("exotic", "quintuplet", None),
        ("good2", "string", None),
    ]

Copilot AI Apr 22, 2026


test_unexpected_exception_in_row_does_not_drop_other_columns doesn't currently trigger an exception inside get_columns (unknown types are handled via KeyError -> NullType). As written, it won't fail if the new broad per-row exception handling is removed. Adjust the test to force an actual exception in per-row processing (e.g., patch _type_map lookup or re.search/complex-type subquery to raise) and assert that subsequent columns are still returned.
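
A minimal, self-contained sketch of the pattern this review is asking for. It is not the project's actual test: parse_type and get_columns are hypothetical stand-ins, and the real test would patch the dialect's own type-mapping or regex step instead.

```python
import unittest
from unittest.mock import patch


def parse_type(col_type):
    # Hypothetical stand-in for the dialect's per-column type parsing.
    return col_type.upper()


def get_columns(rows):
    # Mirrors the per-row try/except added in this PR: a failure on one row
    # skips that column only, instead of losing the whole table.
    result = []
    for name, col_type, _comment in rows:
        try:
            result.append({"name": name, "type": parse_type(col_type)})
        except Exception:
            continue
    return result


class TestPerRowFailureIsIsolated(unittest.TestCase):
    def test_exception_in_one_row_does_not_drop_other_columns(self):
        rows = [
            ("good1", "bigint", None),
            ("bad", "exotic", None),
            ("good2", "string", None),
        ]

        def exploding(col_type):
            # Force a genuine exception for one row, as suggested above, so the
            # test fails if the broad per-row handling is removed.
            if col_type == "exotic":
                raise RuntimeError("boom")
            return col_type.upper()

        with patch(f"{__name__}.parse_type", new=exploding):
            columns = get_columns(rows)

        self.assertEqual([c["name"] for c in columns], ["good1", "good2"])


if __name__ == "__main__":
    unittest.main()
```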

@open-metadata open-metadata deleted a comment from Copilot AI Apr 22, 2026
Comment thread ingestion/tests/unit/topology/database/test_databricks_get_columns.py Outdated
@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

…arkers

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the
hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2
DescribeTableExec), causing re.search on a None col_type to raise
TypeError and sql_column_handler to drop all columns for the table.

Treat any '#'-prefixed row, empty col_type, or col_type with no leading
word chars as end-of-columns and break. Add a per-row try/except so
unexpected failures skip one column instead of the whole table. Use
len(result) for ordinal_position to keep values contiguous.
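
Read as pseudocode for the loop shape this commit message describes; it is a hedged sketch, not the actual OpenMetadata metadata.py. parse_column is a hypothetical stand-in for the real type-mapping logic, and the exact ordinal offset may differ from the real code.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[str]]


def parse_column(col_name: str, col_type: str) -> Dict[str, Any]:
    # Hypothetical stand-in for the real per-column type mapping.
    return {"name": col_name, "type": re.match(r"^\w+", col_type).group(0)}


def get_columns(rows: List[Row]) -> List[Dict[str, Any]]:
    result: List[Dict[str, Any]] = []
    for col_name, col_type, _comment in rows:
        # End of the column section: any '#'-prefixed marker (e.g.
        # '# Partition Information', '# Metadata Columns'), an empty/None
        # col_type, or a col_type with no leading word characters.
        if (
            not col_name
            or col_name.startswith("#")
            or not col_type
            or re.search(r"^\w+", col_type) is None
        ):
            break
        try:
            column = parse_column(col_name, col_type)
        except Exception:
            # Skip just this column; don't lose the rest of the table.
            continue
        # len(result) keeps ordinal_position contiguous even after skips
        # (the real code may apply a +1 offset).
        column["ordinal_position"] = len(result)
        result.append(column)
    return result
```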
@gitar-bot

gitar-bot Bot commented Apr 22, 2026

Code Review ✅ Approved: 1 resolved / 1 finding

Corrects Databricks column type parsing and removes the unused rows parameter from the _run test helper. No issues found.

✅ 1 resolved
Quality: Test helper _run accepts rows param but never uses it

📄 ingestion/tests/unit/topology/database/test_databricks_get_columns.py:48-55
The _run(self, rows) helper accepts rows as a parameter and every test passes mock_rows to it, but the parameter is completely ignored inside the method. The mock works only because mock_rows.return_value is set on the patched global before _run is called. This is confusing for readers who might expect _run to wire up the rows somehow. Either remove the parameter or use it to set return_value inside _run to make the data flow explicit.
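
One way the second option could look: a self-contained sketch under assumed names (extract_columns and ColumnsTestHelper stand in for the real get_columns call and test class), not the actual helper in test_databricks_get_columns.py.

```python
from unittest.mock import MagicMock


def extract_columns(fetch_rows):
    # Stand-in for the get_columns call exercised by the real tests.
    return [name for name, _type, _comment in fetch_rows()]


class ColumnsTestHelper:
    def __init__(self):
        # Stand-in for the patched _get_column_rows global.
        self.mock_rows = MagicMock()

    def _run(self, rows):
        # Wire the fixture rows here, instead of relying on the caller having
        # already set return_value on a patched global elsewhere.
        self.mock_rows.return_value = rows
        return extract_columns(self.mock_rows)


helper = ColumnsTestHelper()
assert helper._run([("id", "bigint", None)]) == ["id"]
```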


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@sonarqubecloud

@pmbrull pmbrull merged commit 89de412 into main Apr 22, 2026
52 of 53 checks passed
@pmbrull pmbrull deleted the databricks_connector branch April 22, 2026 15:32
ulixius9 pushed a commit that referenced this pull request Apr 23, 2026
…arkers (#27610)

DESCRIBE TABLE EXTENDED emits '#'-prefixed section markers beyond the
hardcoded whitelist (e.g. '# Metadata Columns' from Spark v2
DescribeTableExec), causing re.search on a None col_type to raise
TypeError and sql_column_handler to drop all columns for the table.

Treat any '#'-prefixed row, empty col_type, or col_type with no leading
word chars as end-of-columns and break. Add a per-row try/except so
unexpected failures skip one column instead of the whole table. Use
len(result) for ordinal_position to keep values contiguous.

Co-authored-by: Pablo Takara <pjt1991@gmail.com>
ulixius9 pushed a commit that referenced this pull request Apr 23, 2026
…arkers (#27610)

ulixius9 pushed a commit that referenced this pull request Apr 27, 2026
…arkers (#27610)

jatinmasaram pushed a commit to jatinmasaram/OpenMetadata that referenced this pull request May 2, 2026
…arkers (open-metadata#27610)


Labels

backend, safe to test

4 participants