Fixes #27538: feat(openlineage) add AWS Glue, Kusto, and Cosmos DB dataset naming support #27533

Open
mohittilala wants to merge 1 commit into main from feat/openlineage-glue-kusto-cosmos-naming

Conversation

Contributor

@mohittilala mohittilala commented Apr 20, 2026

Describe your changes:

Fixes #27538

OpenLineage events from AWS Glue EMR, Azure Data Explorer (Kusto), and Azure Cosmos DB use non-standard dataset name formats that the connector couldn't parse, producing no lineage edges. This adds namespace-aware dispatch in _get_table_details to handle each format before falling back to the existing dot-split logic. All new parsers are sourced from OpenLineage's Naming.java and covered by unit tests.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Improvement

  • I have added tests around the new logic.
  • For connector/ingestion changes: I updated the documentation.

@mohittilala mohittilala self-assigned this Apr 20, 2026
Copilot AI review requested due to automatic review settings April 20, 2026 05:24
@mohittilala mohittilala requested a review from a team as a code owner April 20, 2026 05:24
@mohittilala mohittilala added the enhancement, Ingestion, safe to test, To release, and Openlineage labels Apr 20, 2026

gitar-bot bot commented Apr 20, 2026

Code Review ✅ Approved

Expands OpenLineage dataset naming support to include AWS Glue, Kusto, and Cosmos DB. No issues found.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves the OpenLineage ingestion connector’s ability to parse non-standard dataset naming formats emitted by AWS Glue EMR, Azure Data Explorer (Kusto), and Azure Cosmos DB, so lineage edges can be created instead of dropped due to unparseable dataset names.

Changes:

  • Add namespace-aware parsing dispatch in OpenlineageSource._get_table_details for Glue/Kusto/Cosmos naming formats.
  • Introduce dedicated parsers for Glue (table/{db}/{table}), Kusto ({db}/{table}), and Cosmos (/dbs/{db} + colls/{collection}).
  • Add unit tests covering the new parsers and namespace dispatch behavior.
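
The dispatch described in this overview could look roughly like the following. This is a minimal sketch, not the PR's implementation: `TableDetails` is a hypothetical stand-in for the connector's model, the function names are illustrative, the Glue and Kusto namespace prefixes (`arn:aws:glue:` and `azurekusto://`) are assumptions based on OpenLineage naming conventions rather than on the diff, and the dot-split fallback is a guess at the existing logic.

```python
import re
from typing import NamedTuple, Optional


class TableDetails(NamedTuple):
    """Hypothetical stand-in for the connector's TableDetails model."""

    name: str
    schema: str


def _last_two(parts: list) -> Optional[TableDetails]:
    # Treat the trailing two segments as (schema, table), lowercased.
    if len(parts) < 2:
        return None
    return TableDetails(name=parts[-1].lower(), schema=parts[-2].lower())


def parse_glue(name: str) -> Optional[TableDetails]:
    # Glue names look like table/{db}/{table}.
    if not name.startswith("table/"):
        return None
    return _last_two(name[len("table/"):].split("/"))


def parse_kusto(name: str) -> Optional[TableDetails]:
    # Kusto names look like {db}/{table}.
    return _last_two(name.split("/"))


def parse_cosmos(namespace: str, name: str) -> Optional[TableDetails]:
    # Cosmos keeps the database in the namespace (/dbs/{db}) and the
    # collection in the name (colls/{collection}).
    db = re.search(r"/dbs/([^/]+)", namespace)
    coll = re.fullmatch(r"colls/([^/]+)", name)
    if not db or not coll:
        return None
    return TableDetails(name=coll.group(1).lower(), schema=db.group(1).lower())


def get_table_details(namespace: str, name: str) -> Optional[TableDetails]:
    # Assumed namespace prefixes; only azurecosmos:// appears in this PR page.
    if namespace.startswith("arn:aws:glue:"):
        return parse_glue(name)
    if namespace.startswith("azurekusto://"):
        return parse_kusto(name)
    if namespace.startswith("azurecosmos://"):
        return parse_cosmos(namespace, name)
    # Assumed fallback: split a dotted name and keep schema.table.
    return _last_two(name.split("."))
```

Under these assumptions, `get_table_details("arn:aws:glue:us-east-1:123456789012", "table/sales/orders")` resolves to schema `sales` and table `orders`, while a Glue namespace paired with a dotted name returns `None` instead of falling through to the dot-split path.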

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| ingestion/src/metadata/ingestion/source/pipeline/openlineage/metadata.py | Adds namespace-based dataset-name parsing and new parser helpers for Glue/Kusto/Cosmos. |
| ingestion/tests/unit/topology/pipeline/test_openlineage.py | Adds unit tests validating the new parsing logic and dispatch behavior. |

Comment on lines +265 to +270
if not name.startswith("table/"):
    return None
parts = name[len("table/") :].split("/")
if len(parts) < 2:
    return None
return TableDetails(name=parts[-1].lower(), schema=parts[-2].lower())
Copilot AI Apr 20, 2026

The Glue name parser can return empty schema/table when the input has empty path segments (e.g., trailing slash table/db/table/ or double slashes). That would later build an invalid FQN and potentially create/lookup wrong entities. Consider filtering out empty segments (or stripping trailing slashes) and returning None when database/table are missing.
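
One way to address this, sketched under the assumption that empty segments should simply be discarded: split the path, drop empty pieces, and return `None` unless both a database and a table remain. `TableDetails` is a hypothetical stand-in for the connector's model, and `parse_glue_table_name` is an illustrative name rather than the PR's function.

```python
from typing import NamedTuple, Optional


class TableDetails(NamedTuple):
    """Hypothetical stand-in for the connector's TableDetails model."""

    name: str
    schema: str


def parse_glue_table_name(name: str) -> Optional[TableDetails]:
    """Parse table/{db}/{table}, ignoring empty path segments."""
    prefix = "table/"
    if not name.startswith(prefix):
        return None
    # Drop segments produced by trailing or doubled slashes.
    parts = [part for part in name[len(prefix):].split("/") if part]
    if len(parts) < 2:
        return None
    return TableDetails(name=parts[-1].lower(), schema=parts[-2].lower())
```

With this variant, `table/db/table/` and `table/db//table` both resolve to schema `db` and table `table`, while `table/db/` returns `None` because only one non-empty segment is left.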

Comment on lines +283 to +286
parts = name.split("/")
if len(parts) < 2:
    return None
return TableDetails(name=parts[-1].lower(), schema=parts[-2].lower())
Copilot AI Apr 20, 2026

_parse_slash_table_name has the same empty-segment issue as the Glue parser: inputs like db/table/ or db//table can yield an empty schema/table (since it blindly takes the last two split parts). Consider normalizing by stripping/filtering empty segments and returning None when the required parts are missing.
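
A stricter alternative to filtering is to reject any name that contains an empty segment, so that `db//table` is treated as malformed rather than silently repaired. The sketch below assumes the name has exactly two segments (`{db}/{table}`, as in the Kusto format); that limit, the function name, and the `TableDetails` stand-in are all illustrative assumptions, not the PR's code.

```python
from typing import NamedTuple, Optional


class TableDetails(NamedTuple):
    """Hypothetical stand-in for the connector's TableDetails model."""

    name: str
    schema: str


def parse_slash_table_name_strict(name: str) -> Optional[TableDetails]:
    """Parse {db}/{table}, rejecting anything with empty or extra segments."""
    parts = name.split("/")
    # Assumption: exactly two non-empty segments; anything else is malformed.
    if len(parts) != 2 or not all(parts):
        return None
    database, table = parts
    return TableDetails(name=table.lower(), schema=database.lower())
```

The trade-off is that the strict form drops lineage for slightly irregular names that the filtering approach would still resolve, in exchange for never guessing at what a malformed name meant.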

Comment on lines +2133 to +2149
def test_parse_cosmos_table_name_happy_path(self):
    """Cosmos OL naming: db from namespace /dbs/{db}, name colls/{coll} — source: Naming.java CosmosNaming."""
    result = OpenlineageSource._parse_cosmos_table_name(
        "azurecosmos://myaccount.documents.azure.com/dbs/mydb",
        "colls/mycollection",
    )
    self.assertEqual(result.name, "mycollection")
    self.assertEqual(result.schema, "mydb")

def test_parse_cosmos_table_name_normalizes_to_lowercase(self):
    """Cosmos database and collection names are normalized to lowercase for FQN matching."""
    result = OpenlineageSource._parse_cosmos_table_name(
        "azurecosmos://host/dbs/MyDB", "colls/MyCollection"
    )
    self.assertEqual(result.name, "mycollection")
    self.assertEqual(result.schema, "mydb")

Copilot AI Apr 20, 2026

The new dataset-name parsers are tested for happy paths, but there are no tests asserting they reject malformed inputs that would currently yield empty schema/table (e.g., trailing slashes) or, for Cosmos, names that don't match the documented colls/{collection} pattern. Adding these negative tests would help prevent incorrect lineage edges when events contain unexpected naming variants.

Copilot generated this review using guidance from repository custom instructions.
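
Negative tests of the kind requested in this comment might be sketched as follows. The parser here is a self-contained stand-in that already validates both the namespace and the name, so the tests have something to run against; in the real suite they would target `OpenlineageSource._parse_cosmos_table_name`. The regular expressions, the `TableDetails` stand-in, and the list of malformed inputs are assumptions chosen for illustration.

```python
import re
import unittest
from typing import NamedTuple, Optional


class TableDetails(NamedTuple):
    """Hypothetical stand-in for the connector's TableDetails model."""

    name: str
    schema: str


def parse_cosmos_table_name(namespace: str, name: str) -> Optional[TableDetails]:
    """Stand-in parser: needs /dbs/{db} in the namespace and colls/{collection} as the name."""
    db_match = re.search(r"/dbs/([^/]+)$", namespace)
    coll_match = re.fullmatch(r"colls/([^/]+)", name)
    if not db_match or not coll_match:
        return None
    return TableDetails(
        name=coll_match.group(1).lower(), schema=db_match.group(1).lower()
    )


class MalformedCosmosNameTests(unittest.TestCase):
    NAMESPACE = "azurecosmos://myaccount.documents.azure.com/dbs/mydb"

    def test_rejects_names_outside_colls_pattern(self):
        # Each of these deviates from colls/{collection} in a different way.
        for bad_name in ("mycollection", "colls/", "colls//x", "colls/a/b", "docs/x", ""):
            with self.subTest(name=bad_name):
                self.assertIsNone(parse_cosmos_table_name(self.NAMESPACE, bad_name))

    def test_rejects_namespace_without_database(self):
        for bad_namespace in ("azurecosmos://host", "azurecosmos://host/dbs/"):
            with self.subTest(namespace=bad_namespace):
                self.assertIsNone(
                    parse_cosmos_table_name(bad_namespace, "colls/mycollection")
                )
```

Using `subTest` keeps each malformed input as a separately reported case, so a single regression in the validation does not hide the others.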
Comment on lines +302 to +303
database = match.group(1).lower()
collection = name.split("/")[-1].lower() if "/" in name else name.lower()
Copilot AI Apr 20, 2026

_parse_cosmos_table_name currently returns a TableDetails for any name value (including ones not in the documented colls/{collection} format). Because _get_table_details dispatches on azurecosmos:// namespace, this can mis-parse unrelated Cosmos dataset names and produce incorrect lineage. Consider validating the name prefix/pattern (e.g., require colls/ with a non-empty collection) and returning None when it doesn't match.

Suggested change
- database = match.group(1).lower()
- collection = name.split("/")[-1].lower() if "/" in name else name.lower()
+ collection_match = re.fullmatch(r"colls/([^/]+)", name)
+ if not collection_match:
+     return None
+ database = match.group(1).lower()
+ collection = collection_match.group(1).lower()

@github-actions
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sonarqubecloud

@github-actions
Contributor

🟡 Playwright Results — all passed (22 flaky)

✅ 3665 passed · ❌ 0 failed · 🟡 22 flaky · ⏭️ 89 skipped

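| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 478 | 0 | 3 | 4 |
| 🟡 Shard 2 | 652 | 0 | 1 | 7 |
| 🟡 Shard 3 | 654 | 0 | 5 | 1 |
| 🟡 Shard 4 | 630 | 0 | 4 | 27 |
| 🟡 Shard 5 | 610 | 0 | 1 | 42 |
| 🟡 Shard 6 | 641 | 0 | 8 | 8 |
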
🟡 22 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Ml Model - customization should work (shard 1, 1 retry)
  • Pages/Customproperties-part1.spec.ts › Hyperlink (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 2 retries)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Directory (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with tags and glossary terms preserves associations (shard 4, 1 retry)
  • Pages/DomainUIInteractions.spec.ts › Add expert to domain via UI (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/HyperlinkCustomProperty.spec.ts › should display URL when no display text is provided (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/LoginConfiguration.spec.ts › update login configuration should work (shard 6, 1 retry)
  • Pages/Tag.spec.ts › Verify Owner Add Delete (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@mohittilala mohittilala changed the title feat(openlineage): add AWS Glue, Kusto, and Cosmos DB dataset naming support Fixes #27538: feat(openlineage) add AWS Glue, Kusto, and Cosmos DB dataset naming support Apr 20, 2026
Labels

enhancement, Ingestion, Openlineage, safe to test, To release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support AWS Glue-style table naming (table/<schema>/<table>) in OpenLineage ingestion

2 participants