Skip to content

fix(fqn): support double-quotes in fully qualified names + guard/repair corrupt FQNs#28697

Open
mohityadav766 wants to merge 6 commits into
mainfrom
fix/fqn-quote-escaping
Open

fix(fqn): support double-quotes in fully qualified names + guard/repair corrupt FQNs#28697
mohityadav766 wants to merge 6 commits into
mainfrom
fix/fqn-quote-escaping

Conversation

@mohityadav766
Copy link
Copy Markdown
Member

Problem

A name containing a double-quote (") could not be represented in a fully qualified name. The Fqn grammar had no escape mechanism for ", yet quoteName() backslash-escaped it (\") and stored an unparseable FQN segment.

Building an FQN is a pure string operation, so these corrupt values were written successfully — insert only hashes the entity's own FQN, not nested objects. They then detonated later with a 500 (ParseCancellationException) the first time a nested FQN was hashed (e.g. a tags-inclusive read, or the update path resolving the entity with tags). This was observed on real Mulesoft pipeline task names (e.g. ..._"agents"_...) and also on customer column names, and the corrupt rows could not even be deleted via the API.

Fix (three layers)

1. Grammar + quoteName — make " representable

  • NAME_WITH_RESERVED now allows any character, with " escaped by doubling it (""): QUOTE ( ~["] | '""' )* QUOTE.
  • quoteName/unquoteName encode/decode with ""-doubling and are idempotent.
  • Names without a " encode identically to before, so existing FQNs and their hashes are unchanged — no reindex/migration required for current data. Verified by the unchanged pre-existing FullyQualifiedNameTest cases.

2. Ingest guard — catch the un-representable early

  • FullyQualifiedName.validateFqnName() asserts a name round-trips through encode→parse→decode, wired into every nested-FQN setter: columns (ColumnUtil, ColumnRepository, ContainerRepository), pipeline tasks, topic/searchIndex/apiEndpoint fields, mlFeatures.
  • A name that cannot be hashed is now rejected at ingest with a clear 400 instead of being silently stored to fail on read / poison migrations.

3. Heal-on-read — recover legacy poisoned data without a migration

  • FullyQualifiedName.isValid() detects legacy-corrupt FQNs (parse fails).
  • PipelineRepository.repairTaskFqns() re-derives an unparseable task FQN from the task name on the fly, so existing poisoned rows read cleanly (200) again. The repair is in-memory (no write-amplification) and persists naturally on the next update.

Validation

  • FullyQualifiedNameTest: 16/16 green (incl. quote round-trip + validateFqnName).
  • Ingest guard verified end-to-end against a running server on the buggy grammar: the offending pipeline is rejected with 400 Invalid name ..."agents"..., nothing persisted.
  • Heal-on-read verified end-to-end: a task FQN corrupted directly in the DB to the legacy form reads back 200 with the FQN repaired to the ""-escaped form; stored blob untouched until next write.

Notes

  • Heal-on-read is wired for pipeline tasks only; the same isValid + re-derive pattern generalizes to columns/fields/features if we want it.
  • JS/Python ANTLR parsers are generated from the same Fqn.g4 at build time, so they pick up the grammar change automatically.

🤖 Generated with Claude Code

…ir corrupt FQNs

Names containing a double-quote could not be represented in an FQN: the Fqn
grammar had no escape mechanism, yet quoteName() backslash-escaped the quote and
stored an unparseable segment. Building the FQN is a pure string op, so such
values were written successfully (insert hashes only the entity's own FQN); they
then detonated later with a 500 (ParseCancellationException) the first time a
nested FQN was hashed (e.g. a tags read), and were painful to migrate.

Three layered fixes:

- Grammar + quoteName: NAME_WITH_RESERVED now allows any character with '"'
  escaped by doubling it (""). quoteName/unquoteName encode/decode accordingly
  and are idempotent. Names without a quote encode identically to before, so
  existing FQNs and their hashes are unchanged (no reindex/migration needed).

- Ingest guard: FullyQualifiedName.validateFqnName() asserts a name round-trips
  through encode->parse->decode, wired into every nested-FQN setter (columns,
  pipeline tasks, topic/searchIndex/apiEndpoint fields, mlFeatures). A name that
  cannot be hashed is now rejected at ingest with a clear 400 instead of being
  stored to fail later.

- Heal-on-read: FullyQualifiedName.isValid() detects legacy-corrupt FQNs;
  PipelineRepository repairs unparseable task FQNs on the fly by re-deriving them
  from the task name, so existing poisoned data reads cleanly (200) without a
  migration. The repair is in-memory and persists on the next update.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 09:27
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a long-standing FQN corruption path by making " representable inside a quoted FQN segment (via "" escaping), adding server-side validation to reject unhashable nested names at write time, and introducing a heal-on-read repair for legacy-corrupt pipeline task FQNs so reads no longer 500.

Changes:

  • Update the ANTLR FQN grammar to support embedded " in quoted segments using "" escaping.
  • Rework FullyQualifiedName.quoteName/unquoteName, add validateFqnName() and isValid(), and expand unit tests for quote round-trips.
  • Wire validateFqnName() into nested-FQN setters across repositories, and add a pipeline task FQN repair step during reads.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
openmetadata-spec/src/main/antlr4/org/openmetadata/schema/Fqn.g4 Extends quoted-segment grammar to allow escaped quotes via "".
openmetadata-service/src/test/java/org/openmetadata/service/util/FullyQualifiedNameTest.java Adds tests for quote escaping/idempotency, round-trip hashing, and new validation.
openmetadata-service/src/main/java/org/openmetadata/service/util/FullyQualifiedName.java Implements new quote escaping semantics plus validation and parseability checks.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TopicRepository.java Validates nested field names before deriving/setting field FQNs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SearchIndexRepository.java Validates nested field names before deriving/setting field FQNs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java Validates task names on write and repairs legacy-corrupt task FQNs on read.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/MlModelRepository.java Validates ML feature/source names before deriving/setting their FQNs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java Validates container column names before deriving/setting column FQNs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ColumnUtil.java Validates column names before deriving/setting column FQNs (used in multiple read/write paths).
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ColumnRepository.java Validates data-model column names before deriving/setting column FQNs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/APIEndpointRepository.java Validates API endpoint field names before deriving/setting field FQNs.

Comment thread openmetadata-spec/src/main/antlr4/org/openmetadata/schema/Fqn.g4
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

🔴 Playwright Results — 1 failure(s), 7 flaky

✅ 4272 passed · ❌ 1 failed · 🟡 7 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 301 0 0 4
✅ Shard 2 804 0 0 9
🔴 Shard 3 802 1 2 8
🟡 Shard 4 853 0 2 12
🟡 Shard 5 720 0 1 47
🟡 Shard 6 792 0 2 8

Genuine Failures (failed on all attempts)

Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3)
Error: Tab "topic" search total hits should match the aggregation count

�[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoBe�[2m(�[22m�[32mexpected�[39m�[2m) // Object.is equality�[22m

Expected: �[32m4�[39m
Received: �[31m24�[39m
🟡 7 flaky test(s) (passed on retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test unbookmark functionality (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Set & Update all CP types on apiCollection (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner with unsorted list (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…n-read

Heal-on-read (PipelineRepository.repairTaskFqns) ran a full ANTLR parse for
every task on every pipeline read to subsidize a finite set of already-corrupt
rows, was incomplete (the bulk/LIST/search path still 500'd), and could NPE on
a null task FQN. Replace it with a one-time migration so the corruption leaves
the stored data and reads pay no per-request cost.

- Remove repairTaskFqns and its setFields() call; keep the validateFqnName
  write-path guard that rejects un-representable names at ingest (400).
- Add migration v11211 (mysql + postgres): re-derive task FQNs where !isValid,
  persist only when changed.
- Harden FullyQualifiedName.isValid to treat null/empty as invalid (no NPE).
- Require >=1 char inside a quoted FQN segment (grammar + not *), rejecting
  empty quoted segments ("").

FullyQualifiedNameTest: 17/17.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
validateFqnName returned early when quoteName(name) was unchanged, letting
empty names through (quoteName("") == ""). An empty pipeline task name (the
schema sets no minLength on task.name) then produced an unhashable empty FQN
segment ("parent.") that 500'd on the next FQN hash -- the same failure class
as unrepresentable names. Treat null/empty as invalid so every nested-FQN
setter (columns, tasks, fields, mlFeatures) rejects them up front with a 400.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 03:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

  • No GitHub issue is linked. Add a closing reference such as Fixes #12345 to the PR description (accepted keywords: Fixes, Closes, Resolves).

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

Address review feedback on the one-time repair migration:

- Performance: scan pipelines in pages of 1000 via listAfterWithOffset instead
  of selecting every id and calling findEntityById per pipeline, dropping the
  N+1 round-trips and the full id list held in memory. Only changed rows are
  written.
- Observability: track scanned/repaired/failed counts and log a prominent WARN
  with up to 100 pipeline ids that could not be repaired, instead of swallowing
  each failure as a lone WARN, so operators get a concrete remediation list.
- Search: document (completion log + schemaChanges) that repaired task FQNs are
  reflected in the search index after the standard post-upgrade reindex, matching
  existing FQN-fix migration behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 05:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Add MigrationUtilTest for the v11211 repairPipelineTaskFqns migration:
repair correctness (re-derive unparseable/null task FQNs, leave valid ones
untouched, skip task-less pipelines) and migration-path resilience -- a single
unreadable row or a failing update must not abort the upgrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +125 to +128
private void givenPipelinePage(String... jsons) {
when(pipelineDAO.listAfterWithOffset(anyInt(), anyInt()))
.thenAnswer(inv -> (int) inv.getArgument(1) == 0 ? List.of(jsons) : List.of());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Migration pagination loop is not covered by any test

All tests return their data only when offset == 0 and an empty list otherwise, so every test exercises a single page. The production repairPipelineTaskFqns loop (MigrationUtil.java:46-57) increments offset += PAGE_SIZE and re-queries until an empty page is returned. Because the mock matches listAfterWithOffset(anyInt(), anyInt()) with arbitrary ints, a regression that swapped the limit/offset arguments, mis-incremented the offset, or terminated early would still pass these tests undetected.

Consider adding a multi-page test that stubs distinct pages by offset (e.g., offset 0 returns a full page, offset 1000 returns a second page, offset 2000 returns empty) and asserts that pipelines from the second page are scanned/repaired. This locks in correct pagination and offset advancement.

This is test-only code, so severity is minor — the existing per-row behavior coverage is good.

Was this helpful? React with 👍 / 👎

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 5, 2026

Code Review 👍 Approved with suggestions 6 resolved / 7 findings

Adds robust support for double-quoted FQNs, implements ingest-time validation, and includes a heal-on-read mechanism for corrupt legacy data. Consider adding a test case to verify migration pagination behavior beyond the initial result set.

💡 Quality: Migration pagination loop is not covered by any test

📄 openmetadata-service/src/test/java/org/openmetadata/service/migration/utils/v11211/MigrationUtilTest.java:125-128

All tests return their data only when offset == 0 and an empty list otherwise, so every test exercises a single page. The production repairPipelineTaskFqns loop (MigrationUtil.java:46-57) increments offset += PAGE_SIZE and re-queries until an empty page is returned. Because the mock matches listAfterWithOffset(anyInt(), anyInt()) with arbitrary ints, a regression that swapped the limit/offset arguments, mis-incremented the offset, or terminated early would still pass these tests undetected.

Consider adding a multi-page test that stubs distinct pages by offset (e.g., offset 0 returns a full page, offset 1000 returns a second page, offset 2000 returns empty) and asserts that pipelines from the second page are scanned/repaired. This locks in correct pagination and offset advancement.

This is test-only code, so severity is minor — the existing per-row behavior coverage is good.

✅ 6 resolved
Bug: Heal-on-read missing in bulk read path; list reads still 500

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:185 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:200-214 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:895-906
repairTaskFqns() is only invoked from the single-entity setFields() (PipelineRepository.java:185). The bulk read path setFieldsInBulk() -> fetchAndSetPipelineSpecificFields() -> fetchAndSetTaskFieldsInBulk() (lines 200-260) calls getTaskTags() directly on un-repaired task FQNs. getTaskTags() calls getTags(task.getFullyQualifiedName()), which hashes the FQN (tagUsageDAO -> FullyQualifiedName.buildHash). For a legacy-poisoned (unparseable) task FQN this still throws ParseCancellationException -> 500. So the exact scenario the PR aims to fix (a tags-inclusive read of a pipeline with a corrupt task FQN) remains broken for LIST endpoints, which commonly request fields=tags (e.g. search indexing). Apply the same repair in the bulk loop before fetching task tags.

Edge Case: isValid(null)/split(null) can throw NPE instead of healing

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/FullyQualifiedName.java:208-217 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:895-906
repairTaskFqns() calls FullyQualifiedName.isValid(task.getFullyQualifiedName()) (PipelineRepository.java:897). isValid() (FullyQualifiedName.java:208-217) only catches ParseCancellationException; it delegates to split() -> CharStreams.fromString(string). If a task's FQN is null (legacy/partially-populated data — exactly the corrupt-data situation this PR targets), CharStreams.fromString(null) throws a NullPointerException that is not caught, propagating out of setFields() as a 500 rather than triggering the repair. Treat a null FQN as invalid so it gets re-derived from the task name.

Edge Case: quoteName/isQuotedName collapses literal two-quote name to empty

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/FullyQualifiedName.java:164-171 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/FullyQualifiedName.java:233-243
A raw name consisting of exactly two double-quote characters ("") is misread by isQuotedName() (FullyQualifiedName.java:233-239) as an empty quoted segment: body = substring(1,1) = "", which contains no quote, so it is treated as already-quoted. decodeQuotedName() then returns the empty string, and quoteName("""") returns "" — i.e. a non-empty name silently becomes empty. This is an ambiguity between the encoded form "" (empty quoted segment) and a raw two-quote name. In practice validateFqnName() rejects this name at ingest (round-trip fails), so impact is limited, but quoteName/unquoteName callers that bypass validation could lose data. Worth documenting or guarding.

Bug: Migration silently skips corrupt pipelines with no read-time fallback

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11211/MigrationUtil.java:44-57 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java
This commit removes the heal-on-read repair from PipelineRepository (-14 lines) and replaces it with a one-time data migration (MigrationUtil.repairPipelineTaskFqns). The migration is best-effort: repairPipeline catches Exception and merely logs a WARN, then continues (MigrationUtil.java:53-55). Any pipeline that fails to load or update during the migration — or any corrupt row introduced/restored after the migration has already run (e.g. DB restore, backfill, replication from an older instance) — is left with an unparseable task FQN and will again throw a 500 (ParseCancellationException) on the first tag/owner-inclusive read, because the read path no longer self-heals.

Since the failures are only surfaced as WARN log lines, an operator has no easy signal for which pipelines remain broken after upgrade. Consider either (a) keeping a lightweight heal-on-read guard in PipelineRepository as a safety net (the previous behavior), or (b) collecting and reporting the skipped pipeline IDs prominently (e.g. an aggregated error count / list) so they can be remediated, rather than swallowing each failure individually.

Bug: Repair migration updates DB blob but does not reindex search

📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/utils/v11211/MigrationUtil.java:44-57
repairPipeline persists the corrected JSON via the raw collectionDAO.pipelineDAO().update(pipeline) DAO call (MigrationUtil.java:51). This writes the entity blob and the (unchanged) top-level FQN hash directly, but bypasses the repository update path, so it does not bump the entity version and does not push the repaired task FQNs to the search index. As a result, the search index can retain the old/corrupt nested task FQNs until the next real update of each pipeline. If task FQNs are used for tag/search associations this leaves search state stale post-migration. Confirm whether the 1.12.11 migration is paired with a search reindex step; if not, consider triggering reindexing for repaired pipelines or documenting that a reindex is required.

...and 1 more resolved from earlier reviews

🤖 Prompt for agents
Code Review: Adds robust support for double-quoted FQNs, implements ingest-time validation, and includes a heal-on-read mechanism for corrupt legacy data. Consider adding a test case to verify migration pagination behavior beyond the initial result set.

1. 💡 Quality: Migration pagination loop is not covered by any test
   Files: openmetadata-service/src/test/java/org/openmetadata/service/migration/utils/v11211/MigrationUtilTest.java:125-128

   All tests return their data only when offset == 0 and an empty list otherwise, so every test exercises a single page. The production `repairPipelineTaskFqns` loop (MigrationUtil.java:46-57) increments `offset += PAGE_SIZE` and re-queries until an empty page is returned. Because the mock matches `listAfterWithOffset(anyInt(), anyInt())` with arbitrary ints, a regression that swapped the limit/offset arguments, mis-incremented the offset, or terminated early would still pass these tests undetected.
   
   Consider adding a multi-page test that stubs distinct pages by offset (e.g., offset 0 returns a full page, offset 1000 returns a second page, offset 2000 returns empty) and asserts that pipelines from the second page are scanned/repaired. This locks in correct pagination and offset advancement.
   
   This is test-only code, so severity is minor — the existing per-row behavior coverage is good.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants