fix(snowflake-profiler): Handle leading whitespace/comments when parsing DML queries#28280
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
1 similar comment
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR improves Snowflake system-metrics DML parsing by normalizing query text before applying the existing QUERY_PATTERN, allowing leading whitespace and SQL comments to no longer cause silent parse misses and dropped DML events.
Changes:
- Added
_normalise_dml_sql()to strip block comments,--comments, and leading/trailing whitespace prior to regex matching. - Updated
_parse_query()to run the DML regex against the normalized SQL. - Added unit test coverage for leading whitespace/comments and for ensuring DML keywords inside comments don’t affect extraction.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ingestion/src/metadata/profiler/metrics/system/snowflake/system.py |
Adds SQL normalization helper and uses it in _parse_query prior to re.match. |
ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py |
Adds parametrized tests covering whitespace/comment-prefixed DML parsing and comment-body safety cases. |
| @cache.wrap(key_func=lambda query: sha256_hash(query.strip())) | ||
| def _parse_query(query: str) -> Optional[str]: # noqa: UP045 | ||
| """Parse snowflake queries to extract the identifiers""" | ||
| match = re.match(QUERY_PATTERN, query, re.IGNORECASE) | ||
| """Parse snowflake queries to extract the identifiers. | ||
|
|
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| # Pass 1: block comments → single space (dotall so . matches \n) | ||
| query = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL) | ||
| # Pass 2: single-line comments → remove to end of line, keep the newline | ||
| query = re.sub(r"--[^\n]*", "", query) | ||
| # Pass 3: strip leading/trailing whitespace | ||
| return query.strip() |
| @cache.wrap(key_func=lambda query: sha256_hash(query.strip())) | ||
| def _parse_query(query: str) -> Optional[str]: # noqa: UP045 | ||
| """Parse snowflake queries to extract the identifiers""" | ||
| match = re.match(QUERY_PATTERN, query, re.IGNORECASE) | ||
| """Parse snowflake queries to extract the identifiers. | ||
|
|
||
| The query is first normalized (block comments, single-line comments, and | ||
| leading whitespace removed) so that re.match() can reliably find the DML | ||
| keyword at position 0 without being confused by commented-out SQL in the | ||
| query body. | ||
| """ | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) | ||
| try: |
| """Parse snowflake queries to extract the identifiers. | ||
|
|
||
| The query is first normalized (block comments, single-line comments, and | ||
| leading whitespace removed) so that re.match() can reliably find the DML | ||
| keyword at position 0 without being confused by commented-out SQL in the | ||
| query body. | ||
| """ | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) |
9b5dc13 to
a0f4610
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| FROM TABLE(RESULT_SCAN('{query_id}')); | ||
| """ | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*))([\w._\"\'()]+)(?=[\s*\n])" # pylint: disable=line-too-long | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*|COPY\s+INTO\s+))([\w._\"\'()]+)(?=[\s*\n])" # pylint: disable=line-too-long |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| FROM TABLE(RESULT_SCAN('{query_id}')); | ||
| """ | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*))([\w._\"\'()]+)(?=[\s*\n])" # pylint: disable=line-too-long | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*|COPY\s+INTO\s+))([\w._\"\'()]+)(?=[\s*\n])" # pylint: disable=line-too-long |
| Three-pass process: | ||
| 1. Replace /* ... */ block comments with a single space so they act as a | ||
| token separator but do not expose their body to the regex. | ||
| 2. Strip -- single-line comments (remove to end of line, leave the newline). | ||
| 3. Strip leading/trailing whitespace so re.match() can find the DML keyword | ||
| at position 0. | ||
|
|
||
| This prevents comment bodies (e.g. commented-out SQL snippets) from being | ||
| mistaken for actual DML operations. | ||
| """ | ||
| # Pass 1: block comments → single space (dotall so . matches \n) | ||
| query = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL) | ||
| # Pass 2: single-line comments → remove to end of line, keep the newline | ||
| query = re.sub(r"--[^\n]*", "", query) | ||
| # Pass 3: strip leading/trailing whitespace | ||
| return query.strip() | ||
|
|
||
|
|
| def test_parse_query(query, expected_identifier): | ||
| """Test that _parse_query correctly extracts the target table identifier from DML queries, | ||
| including those with leading whitespace or SQL comments.""" | ||
| # Clear the LRU cache between parametrized runs so each query is evaluated fresh | ||
| cache.clear() |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
…ndle leading whitespace and comments Before this fix, _parse_query used re.match() directly on the raw query text. re.match() only matches at the very start of the string, so queries with any leading whitespace (spaces, tabs, newlines) or SQL comments (-- or /* */) before the DML keyword would fail to parse and silently return None, causing those rows to be dropped from system profiling results. A more defensive normalisation step is now applied before matching: 1. Block comments (/* ... */) are replaced with a single space, so they act as a token separator but their body is not visible to the regex. This prevents commented-out SQL snippets (e.g. "/* INSERT INTO old_table ... */") from being mistakenly identified as the active DML operation. 2. Single-line comments (-- ...) are stripped to end-of-line, leaving the newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newlinethe newline intact so adjacent tokens are n newline intact so adjacent tokera newline intact so adjacent tokens are n newline intact so adjacent to4 l newline intact so adjacent tokens are n newline intact so adjacent ases - 3 leading /* */ comment cases - 3 non-DML (SELECT) correctly returning N- 3 non-DML (SELECT) correctly returning N- 3 nonses (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…sql (American spelling)
…COPY INTO support
53cbdb8 to
dd60b2c
Compare
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| FROM TABLE(RESULT_SCAN('{query_id}')); | ||
| """ | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*))([\w._\"\'()]+)(?=[\s*\n])" # pylint: disable=line-too-long | ||
| QUERY_PATTERN = r"(?:(INSERT\s*INTO\s*|INSERT\s*OVERWRITE\s*INTO\s*|UPDATE\s*|MERGE\s*INTO\s*|DELETE\s*FROM\s*|COPY\s+INTO\s+))([\w._\"\'()]+)(?=\s|$)" # pylint: disable=line-too-long |
| query = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL) | ||
| # Pass 2: single-line comments → remove to end of line, keep the newline | ||
| query = re.sub(r"--[^\n]*", "", query) |
| return query.strip() | ||
|
|
||
|
|
||
| @cache.wrap(key_func=lambda query: sha256_hash((query or "").strip())) |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| """Normalize a SQL query before DML pattern matching. | ||
|
|
||
| Three-pass process: | ||
| 1. Replace /* ... */ block comments with a single space so they act as a | ||
| token separator but do not expose their body to the regex. | ||
| 2. Strip -- single-line comments (remove to end of line, leave the newline). | ||
| 3. Strip leading/trailing whitespace so re.match() can find the DML keyword | ||
| at position 0. | ||
|
|
||
| This prevents comment bodies (e.g. commented-out SQL snippets) from being | ||
| mistaken for actual DML operations. | ||
| """ |
| clean_query = re.sub(strip_block_comment, "", clean_query) | ||
| clean_query = re.sub(strip_single_comment, "", clean_query) | ||
| clean_query = clean_query.lstrip() |
| strip_block_comment = re.compile(r"^/\*.*?\*/", flags=re.DOTALL) | ||
| strip_single_comment = re.compile(r"^--[^\n]*") |
| query = clean_query | ||
|
|
||
|
|
||
| @cache.wrap(key_func=lambda query: sha256_hash(_normalize_dml_sql(query or ""))) |
| """ | ||
| if not query: | ||
| return None | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) |
| @pytest.fixture | ||
| def isolated_parse_query_cache(): | ||
| cache.clear() | ||
| yield | ||
| cache.clear() |
This reverts commit 507dac4.
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| @cache.wrap(key_func=lambda query: sha256_hash(_normalize_dml_sql(query or ""))) | ||
| def _parse_query(query: Optional[str]) -> Optional[str]: # noqa: UP045 |
| if not query: | ||
| return None | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) |
| def test_parse_query(query, expected_identifier, isolated_parse_query_cache): | ||
| """Test that _parse_query correctly extracts the target table identifier from DML queries, | ||
| including those with leading whitespace or SQL comments.""" | ||
| result = _parse_query(query) | ||
| assert result == expected_identifier |
Code Review ✅ Approved 3 resolved / 3 findingsNormalizes SQL strings by stripping leading whitespace and comments before DML parsing, successfully resolving issues with table extraction and test cache stability. ✅ 3 resolved✅ Bug: isolated_parse_query_cache fixture raises TypeError on LRUCache
✅ Quality: Docstring no longer matches _normalize_dml_sql implementation
✅ Edge Case: Comments between DML keyword and table name no longer handled
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
Describe your changes:
The Snowflake system-metrics profiler parses query-log entries with a regex (
QUERY_PATTERN) to identify DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) and extract the target table name. The regex usesre.match(), which anchors at the start of the string, so any leading whitespace or SQL comment caused the match to fail silently — the profiler returnedNonefor the query, discarding the DML event entirely and producing an incomplete system-metrics profile.This was fixed by adding a
_normalize_dml_sql()helper that is called inside_parse_querybefore the regex:re.sub(r"/\*.*?\*/", " ", …, flags=re.DOTALL)so their bodies (which may contain other DML keywords) are reduced to a single space and cannot confuse the pattern.re.sub(r"--[^\n]*", "", …)(preserves the trailing newline as a token separator).str.strip()sore.match()finds the DML keyword at position 0.The LRU-cache key (
sha256_hash(query.strip())) is unchanged — the cache continues to deduplicate identical raw queries efficiently.Changes
ingestion/src/metadata/profiler/metrics/system/snowflake/system.pyingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.pyTest plan
pytest ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py— all tests passpy-testsCI workflow on branchfix/snowflake-profiling-query-parse-leading-whitespacein forkType of change:
High-level design:
N/A — small pre-processing step added to one helper function.
Tests:
Use cases covered
-- …comments are correctly parsed./* … */comments are correctly parsed.SELECT) continue to returnNone.Unit tests
test_parse_query, containing test cases that passed under the old query parser, and extra cases that correct extract the table despite comments/whitespace. This also accounts for cases where just using re.search() would incorrectly extract a table found within a comment, rather than from actual query.ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.pyBackend integration tests
Ingestion integration tests
Playwright (UI) tests
Manual testing performed
_normalise_sql_for_dml_parsingcall and confirmed parametrized leading-whitespace and comment cases fail as expected.UI screen recording / screenshots:
N/A
Checklist:
Summary by Gitar
COPY INTOsupport toQUERY_PATTERNto capture target tables in data loading operations._parse_queryto handleNoneor empty input queries by returningNonegracefully.isolated_parse_query_cachefixture to ensure clean test execution for_parse_query.This will update automatically on new commits.