
RDF Knowledge Graph epic with requirements including structured RDF modeling, PROV-O lineage representation, idempotent indexing, inference engine support, and scalable indexing architecture. #24911

Merged
pmbrull merged 4 commits into main from rdf on Dec 19, 2025


@harshach (Collaborator)

Describe your changes:

Summary

This PR implements the RDF Knowledge Graph epic. It covers structured RDF modeling, PROV-O lineage representation, idempotent indexing, inference engine support, and a scalable indexing architecture.

Changes

Structured RDF Modeling for Nested Properties

Remodeled 5 properties from embedded JSON literals to proper RDF triples:

  • votes → om:hasVotes with om:Votes class containing om:upVotes, om:downVotes
  • changeDescription → om:hasChangeDescription with om:ChangeDescription class
  • lifeCycle → om:hasLifeCycle with lifecycle properties
  • customProperties → Structured custom property mappings
  • extension → Proper RDF extension handling
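
As an illustration of what "proper RDF triples" means for votes, here is a minimal sketch that renders a votes object as N-Triples lines instead of one JSON string literal. The class name, the `/votes` node naming, and the `om:` namespace IRI are assumptions made for the example, not details taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the PR): emits structured triples for a votes object.
final class VotesTriplesSketch {
    // Assumed namespace IRI for the om: prefix; the real one may differ.
    static final String OM = "https://open-metadata.org/ontology/";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final String XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

    static List<String> votesTriples(String entityUri, int upVotes, int downVotes) {
        // Assumed convention: the votes node hangs off the entity IRI.
        String votesNode = entityUri + "/votes";
        List<String> triples = new ArrayList<>();
        triples.add(iri(entityUri) + " " + iri(OM + "hasVotes") + " " + iri(votesNode) + " .");
        triples.add(iri(votesNode) + " " + iri(RDF_TYPE) + " " + iri(OM + "Votes") + " .");
        triples.add(iri(votesNode) + " " + iri(OM + "upVotes") + " " + intLiteral(upVotes) + " .");
        triples.add(iri(votesNode) + " " + iri(OM + "downVotes") + " " + intLiteral(downVotes) + " .");
        return triples;
    }

    private static String iri(String value) {
        return "<" + value + ">";
    }

    private static String intLiteral(int value) {
        return "\"" + value + "\"^^<" + XSD_INTEGER + ">";
    }

    public static void main(String[] args) {
        votesTriples("https://example.org/entity/table/1", 3, 1).forEach(System.out::println);
    }
}
```

With this shape, a query can filter or sort on the up-vote count directly rather than parsing a JSON literal.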

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SqlMappingContext.java - Added nested mappings
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SparqlBuilder.java - Nested field projection support

PROV-O Lineage Representation

Implemented W3C PROV-O vocabulary for lineage relationships:

  • prov:wasDerivedFrom for upstream relationships
  • prov:wasInfluencedBy for downstream relationships
  • prov:wasGeneratedBy for pipeline associations
  • Column-level lineage with om:hasColumnLineage
  • SQL query preservation with om:sqlQuery
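
A rough sketch of how one lineage edge might be expressed with PROV-O follows. It only covers `prov:wasDerivedFrom` and `prov:wasGeneratedBy`. The class and method names are hypothetical, and the actual `addLineageWithDetails()` implementation may differ in both shape and the triples it writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the PR): renders one lineage edge as N-Triples lines.
final class ProvLineageSketch {
    // Standard W3C PROV namespace.
    static final String PROV = "http://www.w3.org/ns/prov#";

    static List<String> lineageTriples(String upstreamUri, String downstreamUri, String pipelineUri) {
        List<String> triples = new ArrayList<>();
        // The downstream entity is derived from the upstream entity.
        triples.add(triple(downstreamUri, PROV + "wasDerivedFrom", upstreamUri));
        // Optionally record the pipeline (a PROV activity) that produced the downstream entity.
        if (pipelineUri != null) {
            triples.add(triple(downstreamUri, PROV + "wasGeneratedBy", pipelineUri));
        }
        return triples;
    }

    private static String triple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    public static void main(String[] args) {
        lineageTriples(
                "https://example.org/table/raw_orders",
                "https://example.org/table/orders",
                "https://example.org/pipeline/etl")
            .forEach(System.out::println);
    }
}
```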

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SqlMappingContext.java - Lineage table mapping with PROV-O
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java - addLineageWithDetails() method

Idempotent RDF Indexing

Implemented DELETE/INSERT SPARQL Update pattern for idempotent entity updates:

  • Deletes existing triples for entity before inserting new ones
  • Prevents duplicate triples on re-indexing
  • Named graph isolation per entity
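
The pattern can be sketched as below, assuming one named graph per entity. `buildUpdate` assembles a SPARQL 1.1 Update string that clears the entity's graph and re-inserts its triples, and `reindex` mimics the same effect on an in-memory map so the behaviour can be checked without a Fuseki server. All names here are hypothetical, and the statements that `JenaFusekiStorage` actually sends may look different.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not from the PR) of delete-then-insert indexing per named graph.
final class IdempotentIndexSketch {
    // In-memory stand-in for a triple store: named graph IRI -> triples in that graph.
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // Builds one SPARQL Update request: clear the entity's graph, then insert fresh triples.
    static String buildUpdate(String graphUri, List<String> triples) {
        StringBuilder sb = new StringBuilder();
        sb.append("DELETE WHERE { GRAPH <").append(graphUri).append("> { ?s ?p ?o } } ;\n");
        sb.append("INSERT DATA { GRAPH <").append(graphUri).append("> {\n");
        for (String triple : triples) {
            sb.append("  ").append(triple).append('\n');
        }
        sb.append("} }");
        return sb.toString();
    }

    // Same effect on the in-memory stand-in: old triples are dropped before new ones are stored.
    void reindex(String graphUri, List<String> triples) {
        graphs.remove(graphUri);
        graphs.put(graphUri, new LinkedHashSet<>(triples));
    }

    int size(String graphUri) {
        return graphs.getOrDefault(graphUri, Set.of()).size();
    }

    boolean contains(String graphUri, String triple) {
        return graphs.getOrDefault(graphUri, Set.of()).contains(triple);
    }
}
```

Re-indexing the same entity twice leaves the graph in the same state as indexing it once, and a changed value replaces the old triple instead of sitting next to it.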

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/JenaFusekiStorage.java - DELETE/INSERT pattern

Inference Engine Support

Enabled reasoning/inference capabilities:

  • Transitive property inference (e.g., upstream chain traversal)
  • Inverse relationship inference
  • Custom rule-based reasoning with Apache Jena
  • Configurable reasoning levels (NONE, RDFS, OWL, CUSTOM)
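
To make the transitive case concrete, the sketch below computes the full upstream chain of an entity from its direct upstream edges. It is a plain breadth-first traversal in standard Java, shown only to illustrate what a transitive rule derives. It does not use Apache Jena's reasoner, and the names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not from the PR): what transitive upstream inference yields.
final class TransitiveUpstreamSketch {
    // directUpstream maps an entity to the entities it is directly derived from.
    static Set<String> allUpstream(Map<String, List<String>> directUpstream, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(directUpstream.getOrDefault(start, List.of()));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            // Skip the start node and anything already visited, so cycles terminate.
            if (node.equals(start) || !seen.add(node)) {
                continue;
            }
            queue.addAll(directUpstream.getOrDefault(node, List.of()));
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "report", List.of("orders"),
            "orders", List.of("raw_orders"));
        System.out.println(allUpstream(edges, "report"));
    }
}
```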

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/InferenceEngine.java - Rule-based inference
  • openmetadata-spec/src/main/resources/json/schema/api/configuration/rdfConfiguration.json - Enabled inference by default

RdfIndexApp Scalability Optimization

Optimized for million-scale environments using SearchIndexApp patterns:

| Metric | Before | After |
| --- | --- | --- |
| DB queries per entity | ~80+ | ~2 per batch |
| Architecture | Single-threaded batches | Producer-consumer with BlockingQueue |
| Thread pool | Fixed 5 | 10 producers + 5 consumers |
| Memory management | None | Bounded by JVM memory |

Key optimizations:

  • Producer-Consumer Architecture: BlockingQueue-based task distribution with separate thread pools
  • Batch Relationship Queries: findToBatchWithRelations() and findFromBatch() replacing N+1 queries
  • Memory-Aware Queue Sizing: Dynamic queue sizing based on available JVM memory
  • WebSocket Update Throttling: 2-second intervals to reduce I/O overhead
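
The producer-consumer layout and the memory-aware queue sizing could look roughly like the sketch below. It is not the `RdfIndexApp` code. The class name, the one-quarter heap budget, the 1 MiB batch estimate, and the 10 to 1000 capacity bounds are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (not from the PR) of a bounded producer-consumer indexer.
final class ProducerConsumerSketch {
    // Unique sentinel instance that tells a consumer to stop (compared by identity).
    private static final List<String> POISON = new ArrayList<>();

    // Assumed sizing rule: spend at most a quarter of the heap on queued batches.
    static int queueCapacity(long maxMemoryBytes, long estimatedBatchBytes) {
        long capacity = (maxMemoryBytes / 4) / Math.max(1, estimatedBatchBytes);
        return (int) Math.max(10, Math.min(capacity, 1000));
    }

    // Returns the number of entities "indexed" across all batches.
    static int indexAll(List<List<String>> batches, int producers, int consumers) {
        int capacity = queueCapacity(Runtime.getRuntime().maxMemory(), 1L << 20);
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger indexed = new AtomicInteger();
        AtomicInteger next = new AtomicInteger();
        ExecutorService producerPool = Executors.newFixedThreadPool(producers);
        ExecutorService consumerPool = Executors.newFixedThreadPool(consumers);
        try {
            for (int i = 0; i < consumers; i++) {
                consumerPool.execute(() -> {
                    try {
                        while (true) {
                            List<String> batch = queue.take();
                            if (batch == POISON) {
                                return;
                            }
                            indexed.addAndGet(batch.size()); // stand-in for writing triples
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (int i = 0; i < producers; i++) {
                producerPool.execute(() -> {
                    try {
                        int idx;
                        // put() blocks when the queue is full, which gives back-pressure.
                        while ((idx = next.getAndIncrement()) < batches.size()) {
                            queue.put(batches.get(idx));
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            producerPool.shutdown();
            producerPool.awaitTermination(1, TimeUnit.MINUTES);
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON); // one stop signal per consumer
            }
            consumerPool.shutdown();
            consumerPool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("indexing interrupted", e);
        }
        return indexed.get();
    }
}
```

The bounded queue is what keeps memory in check: when consumers fall behind, producers block on `put()` instead of reading more batches from the database.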

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexApp.java

SQL-to-SPARQL Translation Enhancements

Enhanced SPARQL generation for structured/nested fields:

  • Nested field path resolution (e.g., votes.upVotes)
  • Proper prefix declarations (om, prov, rdfs, xsd, dct)
  • Object property vs data property detection
  • PROV-O property mappings for lineage queries
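
As a sketch of nested field path resolution, the helper below expands a path such as `votes.upVotes` into two SPARQL triple patterns joined through an intermediate variable. The mapping table and names are hypothetical stand-ins for whatever `SqlMappingContext` and `SparqlBuilder` actually hold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not from the PR): expands a nested field path into SPARQL patterns.
final class NestedFieldPathSketch {
    // Assumed mapping from a nested top-level field to its linking object property.
    private static final Map<String, String> OBJECT_PROPERTIES = Map.of(
        "votes", "om:hasVotes",
        "changeDescription", "om:hasChangeDescription",
        "lifeCycle", "om:hasLifeCycle");

    static List<String> patterns(String subjectVar, String fieldPath) {
        String[] parts = fieldPath.split("\\.");
        List<String> out = new ArrayList<>();
        if (parts.length == 1) {
            // Flat field: a single data property pattern.
            out.add(subjectVar + " om:" + parts[0] + " ?" + parts[0] + " .");
        } else if (parts.length == 2 && OBJECT_PROPERTIES.containsKey(parts[0])) {
            // Nested field: hop through the object property, then read the data property.
            String nodeVar = "?" + parts[0];
            out.add(subjectVar + " " + OBJECT_PROPERTIES.get(parts[0]) + " " + nodeVar + " .");
            out.add(nodeVar + " om:" + parts[1] + " ?" + parts[1] + " .");
        } else {
            throw new IllegalArgumentException("Unsupported field path: " + fieldPath);
        }
        return out;
    }

    public static void main(String[] args) {
        patterns("?entity", "votes.upVotes").forEach(System.out::println);
    }
}
```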

Files:

  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SqlMappingContext.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SparqlBuilder.java

Test Plan

  • RdfIndexAppTest - 22 tests

    • Configuration validation
    • Producer-consumer architecture
    • Entity relationship conversion
    • Memory-aware queue sizing
    • Stop functionality
    • IndexingTask record behavior
  • SqlToSparqlTranslatorTest - 10 tests

    • Simple SELECT translation
    • WHERE clause with filters
    • LIKE to REGEX conversion
    • JOIN query handling
    • Dialect support (MySQL backticks)
    • Query caching
  • SparqlBuilderNestedFieldsTest - 29 tests

    • Nested mapping configuration (votes, changeDescription, lifeCycle)
    • Lineage table with PROV-O mappings
    • Prefix configuration (prov, om, rdfs, xsd, dct)
    • Nested field detection
    • Object property detection
    • Column-level lineage mappings

Total: 61 RDF tests passing
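
For the LIKE to REGEX case covered by SqlToSparqlTranslatorTest, one plausible conversion is sketched below: `%` becomes `.*`, `_` becomes `.`, regex metacharacters are escaped, and the pattern is anchored at both ends. This is an assumption about the general technique, not the translator's actual code, and it ignores SQL `ESCAPE` clauses.

```java
// Hypothetical sketch (not from the PR): converts a SQL LIKE pattern to an anchored regex.
final class LikeToRegexSketch {
    private static final String REGEX_META = "\\.[]{}()*+?^$|";

    static String likeToRegex(String like) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : like.toCharArray()) {
            if (c == '%') {
                sb.append(".*");           // any run of characters
            } else if (c == '_') {
                sb.append('.');            // exactly one character
            } else if (REGEX_META.indexOf(c) >= 0) {
                sb.append('\\').append(c); // escape regex metacharacters
            } else {
                sb.append(c);
            }
        }
        return sb.append('$').toString();
    }

    public static void main(String[] args) {
        System.out.println(likeToRegex("cust%"));
    }
}
```

The resulting pattern would then be placed inside a SPARQL `FILTER(REGEX(...))` expression, with backslashes escaped as the string literal requires.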

Files Changed

Core Implementation

  • openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexApp.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/InferenceEngine.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/storage/JenaFusekiStorage.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SqlMappingContext.java
  • openmetadata-service/src/main/java/org/openmetadata/service/rdf/sql2sparql/SparqlBuilder.java

Configuration

  • openmetadata-spec/src/main/resources/json/schema/api/configuration/rdfConfiguration.json

Tests

  • openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/rdf/RdfIndexAppTest.java (new)
  • openmetadata-service/src/test/java/org/openmetadata/service/rdf/sql2sparql/SparqlBuilderNestedFieldsTest.java (new)
  • openmetadata-service/src/test/java/org/openmetadata/service/rdf/sql2sparql/SqlToSparqlTranslatorTest.java (updated)

@github-actions (Contributor)

TypeScript types have been updated based on the JSON schema changes in the PR

@github-actions github-actions Bot requested a review from a team as a code owner December 18, 2025 20:10

gitar-bot Bot commented Dec 18, 2025

🔍 CI failure analysis for 6509df1: Multiple CI failures: (1) RDF workflow configuration issue, (2) Widespread Playwright test failures across 3 shards (2/6, 4/6, 5/6) - all unrelated to backend RDF changes

Multiple CI Failures Detected

This PR has multiple distinct CI failures:


Issue 1: RDF Workflow Configuration Failure

maven-postgresql-rdf-ci (Jobs 58471241367, 58472281646)

Root Cause: Missing -f flag in docker compose ps command on line 108 of .github/workflows/maven-postgres-rdf-tests-build.yml

Solution: The fix is to add the -f flag:

docker compose -f ./docker/development/docker-compose-postgres-fuseki.yml ps

Status: This configuration issue should be corrected by updating the workflow file.


Issue 2: Widespread Playwright UI Test Failures

Multiple Playwright Shards Failing

Affected Jobs (3 out of 6 shards failing):

  • playwright-ci-postgresql (2, 6) - Job 58472289833 ❌
  • playwright-ci-postgresql (4, 6) - Job 58472289816 ❌
  • playwright-ci-postgresql (5, 6) - Job 58472289820 ❌

Failure Rate: 50% of Playwright shards failing (3/6)

Failed Test Patterns:

  • Certification add/remove workflows across multiple entity types
  • Data consumer permissions and domain rules
  • Table/entity description updates
  • UI element assertions and timeouts

Error Signatures:

  • expect(locator).toHaveCount(expected) - Element count mismatches
  • Timed out 30000ms waiting for expect(locator).toHaveCount(expected)
  • page.goto: Target page, context or browser has been closed

Critical Assessment: Playwright Failures NOT Related to PR

Evidence Against PR Causation:

  1. Zero Frontend Code Changes:

    • This PR exclusively modifies backend Java/RDF files
    • Changed files: openmetadata-service/src/main/java/org/openmetadata/service/rdf/
    • Test files: openmetadata-service/src/test/java/org/openmetadata/service/rdf/
    • No changes to openmetadata-ui/ or any frontend code
  2. Unrelated Test Areas:

    • Failures in UI certification workflows, permissions, entity management
    • RDF backend functionality has no connection to these UI flows
    • No API contract changes that would affect frontend
  3. Widespread Shard Failures:

    • 3 out of 6 shards failing (50% failure rate)
    • Shards 2, 4, and 5 all exhibiting similar failure patterns
    • This distribution indicates systemic issues, not PR-specific bugs
  4. Test Flakiness Indicators:

    • Timeout-based failures common across shards
    • Element count mismatches suggest timing/rendering issues
    • Browser/page closure errors indicate infrastructure instability
    • Long test suite duration (1+ hours) increases flakiness probability

Failure Characteristics Point To:

  1. Flaky UI Tests - Timing/synchronization issues in Playwright tests
  2. Infrastructure Problems - Test environment instability affecting multiple shards
  3. Pre-existing Test Issues - Failures in areas untouched by this PR
  4. Resource Contention - Multiple parallel test shards competing for resources

Recommendation

RDF CI Failure: Blocks PR - The workflow configuration issue should be corrected by updating line 108 of the workflow file

Playwright Failures: Should NOT block this PR for these reasons:

  1. ✅ PR scope is 100% backend RDF functionality
  2. ✅ Zero frontend/UI code modifications
  3. ✅ Failures in completely unrelated UI test areas
  4. ✅ 50% shard failure rate indicates systemic test instability
  5. ✅ No causal link between RDF backend changes and UI certification/permissions flows

Suggested Actions:

  1. Fix the RDF workflow configuration (add missing -f flag)
  2. Treat Playwright failures as unrelated infrastructure/flakiness issues
  3. Consider retrying failed Playwright shards to confirm flakiness
  4. Investigate Playwright test stability separately from this PR
Code Review 👍 Approved with suggestions

Substantial RDF improvements for structured properties and lineage support with comprehensive test coverage. Previous URI collision issue resolved, but atomic storage and silent exception concerns remain.

Resolved ✅ 1 resolved
Performance: Using System.nanoTime() for URI generation may cause collisions

📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java:771-772 📄 openmetadata-service/src/main/java/org/openmetadata/service/rdf/RdfRepository.java:765-772
In addLineageWithDetails (RdfRepository.java) and addLineageEdge (RdfPropertyMapper.java), URIs are generated using System.nanoTime() or System.currentTimeMillis(). On fast hardware, multiple calls in quick succession could produce identical values, leading to URI collisions.

Example (RdfRepository.java lines 771-772):

String colLineageUri = detailsUri + "/columnLineage/" + System.nanoTime();

Impact: In high-throughput scenarios, URI collisions could cause data overwrites.

Suggested fix: Use UUID.randomUUID() consistently for URI generation, which is already used elsewhere in the codebase:

String colLineageUri = detailsUri + "/columnLineage/" + UUID.randomUUID();

What Works Well

  • Comprehensive conversion of JSON literals to structured RDF triples (changeDescription, votes, lifeCycle, extension, customProperties) enables rich SPARQL queryability
  • Well-designed lineage support using PROV-O vocabulary (prov:wasDerivedFrom, prov:wasGeneratedBy) for semantic interoperability
  • Extensive test coverage added with 8 new test classes covering edge cases, idempotency, and structured property mapping
  • Inference engine rules improved with full URIs instead of prefixed names for Jena compatibility
  • SparqlBuilder enhanced with nested field support for queries like votes.upVotes

Recommendations

  • The storeRelationship and bulkStoreRelationships methods in JenaFusekiStorage still have the same issues flagged in the previous review:
    • The DELETE/INSERT pattern is not truly atomic (separate operations could fail independently)
    • Delete exceptions in bulk operations are silently caught and logged, which could mask data consistency issues
  • The addLineageEdge method still creates "unknown" entity type URIs for plain string IDs, which could lead to inconsistent RDF graphs


@github-actions (Contributor)

Jest test Coverage

UI tests summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 64% | 64.28% (50819/79056) | 41.94% (24786/59101) | 45.36% (7797/17189) |


@pmbrull pmbrull merged commit 055acb3 into main Dec 19, 2025
32 of 36 checks passed
@pmbrull pmbrull deleted the rdf branch December 19, 2025 07:10

Labels

backend, safe to test


2 participants