License database schema enhancement and query optimization #2287

Merged
mrrajan merged 10 commits into guacsec:main from mrrajan:TC-3747-license-query on Mar 17, 2026

mrrajan (Contributor) commented Mar 13, 2026

JIRA - TC-3710 and TC-3747
Key Changes:

  • Database Schema Enhancement - Added expanded_license and sbom_license_expanded entities to normalize license expression data, eliminating the need for repeated runtime expansion
  • Migration m0002120 - Created migration with up/down SQL scripts to transform existing license data into the normalized schema
  • Query Optimization - Refactored license, PURL, and SBOM services to leverage the new schema, simplifying queries and reducing database round-trips

Summary by Sourcery

Normalize expanded license expressions into dedicated tables and update services to use the new schema for querying and filtering licenses across SBOMs and PURLs.

New Features:

  • Introduce expanded_license and sbom_license_expanded entities to store pre-expanded license expressions and their SBOM-specific mappings.
  • Populate expanded license data at ingestion time for SPDX and CycloneDX SBOMs using a dedicated ExpandedLicenseCreator.

Bug Fixes:

  • Ensure license listing and filtering handle both SPDX-expanded and CycloneDX raw licenses consistently without leaving unresolved LicenseRef entries.
  • Fix license pagination and counting over UNION queries by using a robust COUNT(*) subquery approach.
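The pagination fix can be pictured without a database: build the union of both license sources, count the combined rows once, then apply offset and limit. The following is a minimal in-memory sketch, not the PR's SeaORM code. `Page`, `list_licenses`, and the slice inputs are hypothetical names, and a `BTreeSet` stands in for the duplicate removal and ascending ordering that the SQL query would perform.

```rust
use std::collections::BTreeSet;

/// One page of results plus the total row count before paging.
#[derive(Debug, PartialEq)]
pub struct Page {
    pub items: Vec<String>,
    pub total: usize,
}

/// Hypothetical stand-in for the listing query: `expanded` plays the role of
/// the expanded_license side of the UNION, `raw` the license.text side.
pub fn list_licenses<'a>(
    expanded: &[&'a str],
    raw: &[&'a str],
    offset: usize,
    limit: usize,
) -> Page {
    // UNION semantics: duplicates collapse. BTreeSet also sorts ascending,
    // which stands in for an ORDER BY on the license name.
    let union: BTreeSet<&str> = expanded.iter().chain(raw.iter()).copied().collect();

    // Counterpart of SELECT COUNT(*) FROM (<union query>): the total is taken
    // over the combined rows, before offset and limit are applied.
    let total = union.len();

    let items = union
        .into_iter()
        .skip(offset)
        .take(limit)
        .map(String::from)
        .collect();

    Page { items, total }
}
```

The point of counting over the whole union is that `total` stays the same no matter which page is requested, so a client can compute the number of pages from any response.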

Enhancements:

  • Refactor license, SBOM, and PURL queries to use pre-expanded license tables via COALESCE, simplifying logic and reducing repeated expansion work.
  • Replace complex CTE-based runtime expansion and custom SQL functions with simpler subqueries and joins against the normalized license schema.
  • Deprecate license_ref_mapping fields in SBOM package models and OpenAPI, since licenses are now expanded at ingestion time.
  • Optimize license filtering for SBOMs and PURLs by splitting SPDX and CycloneDX paths into dedicated subqueries instead of CTE-based unions.
  • Add comprehensive tests validating COALESCE-based license resolution, junction table integrity, MD5-based deduplication, filtering, sorting, and pagination behavior.
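The COALESCE-based resolution mentioned in this list amounts to "use the pre-expanded text when a row exists, otherwise fall back to the raw license text". A minimal sketch of that rule in plain Rust follows. `LicenseRow` and `display_name` are hypothetical names for illustration, and the sample values are invented; the PR itself expresses this in SQL through SeaORM.

```rust
/// A license row as it might look after LEFT JOINing the new tables
/// (hypothetical flattened shape, not an entity from the PR).
pub struct LicenseRow<'a> {
    /// license.text as ingested.
    pub raw_text: &'a str,
    /// expanded_license.expanded_text; None when the LEFT JOIN finds no row.
    pub expanded_text: Option<&'a str>,
}

/// Mirrors COALESCE(expanded_license.expanded_text, license.text):
/// the first non-null value wins.
pub fn display_name<'a>(row: &LicenseRow<'a>) -> &'a str {
    row.expanded_text.unwrap_or(row.raw_text)
}
```

Because the fallback is the raw text, rows that never received an expanded entry still produce a usable name, which is why a single projection can serve both kinds of row.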

Documentation:

  • Update API documentation to mark LicenseRefMapping fields as deprecated and explain that they are now always empty due to pre-expanded licenses.

Tests:

  • Add extensive tests for license service, SBOM package license handling, and license filtering to cover the new expanded license schema and COALESCE-based logic.

Chores:

  • Remove legacy license_filtering utilities and obsolete database functions used for on-the-fly license expansion.

sourcery-ai bot (Contributor) commented Mar 13, 2026

Reviewer's Guide

Normalize expanded licenses into new expanded_license and sbom_license_expanded tables, populate them at ingestion, and refactor license/SPDX/CycloneDX listing and filtering logic across services to query the new schema instead of runtime expansion functions/CTEs.

Sequence diagram for ingestion-time expanded license population

sequenceDiagram
    actor IngestionProcess
    participant SbomContext
    participant ComponentCreator
    participant ExpandedLicenseCreator
    participant DB

    IngestionProcess->>SbomContext: ingest_spdx_or_cyclonedx(sbom)
    SbomContext->>ComponentCreator: new(sbom_id)
    SbomContext->>ComponentCreator: create_related_entities(db)
    ComponentCreator->>DB: persist_packages_purls_cpes_licenses
    DB-->>ComponentCreator: ok

    note over ComponentCreator: After SBOM packages and licenses are stored
    ComponentCreator->>ExpandedLicenseCreator: new(sbom_id)
    ComponentCreator->>ExpandedLicenseCreator: create(db)

    ExpandedLicenseCreator->>DB: INSERT expanded_license
    DB-->>ExpandedLicenseCreator: upsert unique expanded_text rows

    ExpandedLicenseCreator->>DB: INSERT sbom_license_expanded
    DB-->>ExpandedLicenseCreator: upsert (sbom_id, license_id, expanded_license_id)

    ExpandedLicenseCreator-->>ComponentCreator: ok
    ComponentCreator-->>SbomContext: ok
    SbomContext-->>IngestionProcess: ingestion complete with normalized licenses
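
The two upserts in the sequence diagram can be sketched with in-memory maps: one map deduplicates expanded texts into integer ids, the other records which expanded id belongs to each (SBOM, license) pair. This is an illustration under stated assumptions, not the PR's implementation. `LicenseStore` and its methods are hypothetical, `u32` values stand in for the UUID keys, and plain string equality stands in for the uniqueness check on the expanded text.

```rust
use std::collections::HashMap;

/// In-memory stand-ins for the two tables (hypothetical, for illustration).
#[derive(Default)]
pub struct LicenseStore {
    /// expanded_license: expanded_text -> id, unique on the text.
    expanded: HashMap<String, i32>,
    /// sbom_license_expanded: (sbom_id, license_id) -> expanded_license_id.
    junction: HashMap<(u32, u32), i32>,
}

impl LicenseStore {
    /// Upsert into the dictionary: reuse the id of an existing text,
    /// otherwise allocate the next id.
    fn upsert_expanded(&mut self, text: &str) -> i32 {
        let next_id = self.expanded.len() as i32 + 1;
        *self.expanded.entry(text.to_string()).or_insert(next_id)
    }

    /// Record the expanded text for one (SBOM, license) pair and return
    /// the dictionary id it maps to.
    pub fn record(&mut self, sbom_id: u32, license_id: u32, expanded_text: &str) -> i32 {
        let id = self.upsert_expanded(expanded_text);
        // Re-ingesting the same pair overwrites rather than duplicates.
        self.junction.insert((sbom_id, license_id), id);
        id
    }

    pub fn expanded_count(&self) -> usize {
        self.expanded.len()
    }

    pub fn junction_count(&self) -> usize {
        self.junction.len()
    }
}
```

Two SBOMs that carry the same expanded expression end up sharing one dictionary row, so the dictionary grows with the number of distinct expressions rather than with the number of SBOMs.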

ER diagram for expanded_license normalization schema

erDiagram
    sbom {
        uuid sbom_id PK
    }

    license {
        uuid id PK
        text text
    }

    sbom_package_license {
        uuid sbom_id PK
        uuid license_id PK
        uuid node_id
        text license_type
    }

    expanded_license {
        int id PK
        text expanded_text
    }

    sbom_license_expanded {
        uuid sbom_id PK
        uuid license_id PK
        int expanded_license_id FK
    }

    licensing_infos {
        uuid sbom_id FK
        text license_id
        text name
    }

    sbom ||--o{ sbom_package_license : contains
    license ||--o{ sbom_package_license : used_in

    sbom ||--o{ sbom_license_expanded : has
    license ||--o{ sbom_license_expanded : expanded_for
    expanded_license ||--o{ sbom_license_expanded : referenced_by

    sbom ||--o{ licensing_infos : has

    sbom_package_license ||--o{ sbom_license_expanded : maps_to

Class diagram for expanded_license entities and ingestion creator

classDiagram
    class expanded_license_Model {
        +i32 id
        +String expanded_text
    }

    class sbom_license_expanded_Model {
        +Uuid sbom_id
        +Uuid license_id
        +i32 expanded_license_id
    }

    class ExpandedLicenseCreator {
        -Uuid sbom_id
        +new(sbom_id: Uuid) ExpandedLicenseCreator
        +create(db: ConnectionTrait) Result
    }

    class sbom_package_license_Model {
        +Uuid sbom_id
        +Uuid license_id
        +Uuid node_id
        +String license_type
    }

    class licensing_infos_Model {
        +Uuid sbom_id
        +String license_id
        +String name
    }

    expanded_license_Model <|-- sbom_license_expanded_Model : referenced_by
    sbom_license_expanded_Model --> sbom_package_license_Model : keyed_by
    sbom_license_expanded_Model --> licensing_infos_Model : uses_mappings

    ExpandedLicenseCreator --> expanded_license_Model : populates
    ExpandedLicenseCreator --> sbom_license_expanded_Model : populates

File-Level Changes

Change Details Files
Introduce normalized expanded license schema and migration, with ingestion-time population for SPDX and CycloneDX SBOMs.
  • Add expanded_license dictionary table and sbom_license_expanded junction table entities and relations.
  • Create migration m0002120 with SQL to backfill expanded license data and drop obsolete expansion functions, plus reversible down migration.
  • Implement ExpandedLicenseCreator to populate expanded_license and sbom_license_expanded during SPDX and CycloneDX SBOM ingestion flows.
Files:
entity/src/expanded_license.rs
entity/src/sbom_license_expanded.rs
entity/src/lib.rs
entity/src/sbom_package_license.rs
migration/src/m0002120_normalize_expanded_license.rs
migration/src/m0002120_normalize_expanded_license/up.sql
migration/src/m0002120_normalize_expanded_license/down.sql
modules/ingestor/src/graph/sbom/common/expanded_license.rs
modules/ingestor/src/graph/sbom/common/mod.rs
modules/ingestor/src/graph/sbom/spdx.rs
modules/ingestor/src/graph/sbom/cyclonedx.rs
migration/src/lib.rs
Refactor LicenseService to use pre-expanded licenses via COALESCE and a UNION-based listing query, replacing CTE-based expansion and legacy DB functions.
  • Remove old CTE-based license expansion and helper functions, keeping only the LICENSE field constant in license_filtering module.
  • Change get_all_license_info to join sbom_license_expanded → expanded_license and use Func::coalesce(expanded_text, license.text) for both selected columns and ordering.
  • Rewrite licenses() to UNION SPDX expanded_license entries with non-expanded license.text entries, apply filtering via translators on each side, handle sorting explicitly (including defaults and invalid directions), and use Func::count(Asterisk) for total counts.
  • Add an extensive test module for LicenseService covering UNION behavior, COALESCE correctness, filtering, ordering, pagination, junction integrity, MD5 deduplication, and preloaded-license visibility.
Files:
modules/fundamental/src/common/license_filtering.rs
modules/fundamental/src/license/service/mod.rs
modules/fundamental/src/license/service/test.rs
common/src/db/func.rs
Update SBOM and PURL services to query expanded licenses via joins and subqueries instead of runtime expansion CTEs, while preserving two-path (SPDX/CycloneDX) behavior.
  • Change SBOM-level license filtering to use two subqueries (SPDX via sbom_license_expanded/expanded_license, CycloneDX via license.text) and filter sbom.sbom_id by combined subqueries.
  • Change SBOM package-level license filtering to use analogous node_id subqueries and make the main query’s license field a no-op, since filtering is handled in subqueries.
  • Adjust SBOM package assembly to rely on pre-expanded license names (via COALESCE and joins) and remove LicenseRef mapping reconstruction and the get_licensing_infos helper.
  • Refactor PurlService’s license filtering to use subqueries over sbom_package_purl_ref with SPDX and CycloneDX paths, eliminating the use of CTEs and case_license_text_sbom_id/expand_license_expression.
  • Update purl details license projection to use COALESCE(expanded_license.expanded_text, license.text) and LEFT JOIN the new tables.
Files:
modules/fundamental/src/sbom/service/sbom.rs
modules/fundamental/src/sbom/service/test.rs
modules/fundamental/src/sbom/model/mod.rs
modules/fundamental/src/sbom/model/details.rs
modules/fundamental/src/purl/service/mod.rs
modules/fundamental/src/purl/model/details/purl.rs
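
The two-path filtering described in this change can be sketched in plain Rust: one pass matches on the pre-expanded text, a second pass matches on the raw license text, and the SBOM ids from both are combined. This is a simplified illustration, not the PR's subqueries. `PackageLicense`, `contains_ci`, and `sboms_matching` are hypothetical names, `u32` stands in for the UUID `sbom_id`, and it is an assumption here that the raw-text path applies only to rows with no expanded entry.

```rust
use std::collections::BTreeSet;

/// One package license row (hypothetical flattened view of the joins).
pub struct PackageLicense<'a> {
    pub sbom_id: u32,
    /// license.text as ingested.
    pub raw_text: &'a str,
    /// expanded_license.expanded_text, if a pre-expanded row exists.
    pub expanded_text: Option<&'a str>,
}

/// Case-insensitive substring match, a stand-in for the SQL filter condition.
fn contains_ci(haystack: &str, needle: &str) -> bool {
    haystack.to_lowercase().contains(&needle.to_lowercase())
}

/// Returns the ids of SBOMs with at least one license matching `needle`.
pub fn sboms_matching(rows: &[PackageLicense<'_>], needle: &str) -> BTreeSet<u32> {
    // Path 1: rows with a pre-expanded entry are matched on expanded_text.
    let expanded_path = rows
        .iter()
        .filter(|r| r.expanded_text.map_or(false, |t| contains_ci(t, needle)))
        .map(|r| r.sbom_id);

    // Path 2 (assumption): rows without one are matched on the raw text.
    let raw_path = rows
        .iter()
        .filter(|r| r.expanded_text.is_none() && contains_ci(r.raw_text, needle))
        .map(|r| r.sbom_id);

    // Combining both paths plays the role of filtering sbom_id by either subquery.
    expanded_path.chain(raw_path).collect()
}
```

Keeping the two paths separate means each one touches only the columns it needs, which is the same motivation the change list gives for replacing the CTE-based union with dedicated subqueries.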
Adjust API surface and OpenAPI docs to reflect deprecation of runtime LicenseRef mappings, now that licenses are pre-expanded.
  • Deprecate SbomPackage.licenses_ref_mapping in the SBOM model and mark it as always empty with doc comments.
  • Mark LicenseRefMapping arrays as deprecated in OpenAPI schemas for SBOM and package endpoints, explaining that pre-expanded licenses from expanded_license/sbom_license_expanded replace them.
Files:
modules/fundamental/src/sbom/model/mod.rs
openapi.yaml

Possibly linked issues

  • #(not specified): They solve the same need—storing expanded license expressions for efficient querying—using normalized tables instead of a new column.

mrrajan force-pushed the TC-3747-license-query branch 2 times, most recently from 6051d81 to be04c3c on March 13, 2026 16:45
mrrajan marked this pull request as ready for review March 13, 2026 16:57
mrrajan requested review from ctron and mrizzi March 13, 2026 16:58
sourcery-ai bot (Contributor) left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In m0002120_normalize_expanded_license/up.sql the ON CONFLICT (md5(expanded_text)) clauses are not valid because md5(expanded_text) is an expression, not a column; consider either using ON CONFLICT ON CONSTRAINT idx_expanded_license_text_hash or adding a generated column for the hash and indexing that column instead.
  • The sbom_license_expanded table only enforces a foreign key on expanded_license_id; if you want referential integrity to sbom and license to match the SeaORM relations, add FKs for (sbom_id) -> sbom.sbom_id and (license_id) -> license.id in the migration.

## Individual Comments

### Comment 1
<location path="modules/fundamental/src/sbom/service/test.rs" line_range="782-783" />
<code_context>
+        )
+        .await?;
+
+    // REFACTOR: Verify filtering works on COALESCE result
+    assert!(apache_results.total > 0, "Should find Apache licenses");
+
</code_context>
<issue_to_address>
**issue (testing):** This test passes even if filtering returns zero results, which can hide regressions

In `test_sbom_package_license_filtering_with_coalesce`, the assertion on Apache licenses is guarded by `if apache_packages.total > 0`, so a regression that returns zero results won’t fail the test. Instead, assert that `apache_packages.total > 0` and then verify that at least one returned package has an Apache license, e.g.:

```rust
assert!(apache_packages.total > 0, "Expected at least one package to match Apache filter");
let has_apache = apache_packages.items.iter().any(|p| {
    p.licenses
        .iter()
        .any(|l| l.license_name.to_lowercase().contains("apache"))
});
assert!(has_apache, "Filtered packages should contain Apache licenses");
```
</issue_to_address>

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 89.30233% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (969c979) to head (0c5a54b).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| entity/src/sbom_license_expanded.rs | 0.00% | 9 Missing ⚠️ |
| modules/fundamental/src/license/service/mod.rs | 94.20% | 0 Missing and 4 partials ⚠️ |
| entity/src/expanded_license.rs | 0.00% | 3 Missing ⚠️ |
| entity/src/sbom_package_license.rs | 0.00% | 3 Missing ⚠️ |
| ...ingestor/src/graph/sbom/common/expanded_license.rs | 96.49% | 0 Missing and 2 partials ⚠️ |
| modules/fundamental/src/license/service/test.rs | 85.71% | 0 Missing and 1 partial ⚠️ |
| modules/ingestor/src/graph/sbom/cyclonedx.rs | 50.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2287      +/-   ##
==========================================
- Coverage   68.10%   67.98%   -0.12%     
==========================================
  Files         425      430       +5     
  Lines       24886    24731     -155     
  Branches    24886    24731     -155     
==========================================
- Hits        16948    16814     -134     
+ Misses       7019     6989      -30     
- Partials      919      928       +9     

mrrajan force-pushed the TC-3747-license-query branch 3 times, most recently from 1041f18 to 7eb7bd4 on March 16, 2026 11:24
ctron (Contributor) commented Mar 16, 2026

@mrrajan This is great work. Had a few comments, but it's amazing to see the work you did here! 🥳

mrrajan (Contributor, Author) commented Mar 17, 2026

Thanks for the review @ctron and @jcrossley3. All credit goes to @mrizzi, who developed a comprehensive and detailed plan that helped implement this enhancement with Claude's assistance :).

mrrajan requested review from ctron and jcrossley3 March 17, 2026 08:01
mrizzi (Contributor) left a comment

Overall no blockers: well done @mrrajan 👏
I proposed some performance-related changes and code cleanup refactors.

mrrajan force-pushed the TC-3747-license-query branch 2 times, most recently from 32380ea to b432ce6 on March 17, 2026 12:43
mrrajan and others added 7 commits March 17, 2026 18:14
Signed-off-by: mrrajan <86094767+mrrajan@users.noreply.github.com.>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Assisted-by: Claude
jcrossley3 and others added 2 commits March 17, 2026 18:14
I think we'd like to keep the Query processing logic consistent,
i.e. DRY. So we can take a mutable reference to the Select struct and
create the union before applying the sort filtering. And we use an
expression for the license alias that effectively does the field name
translation for us.

I removed the test for an invalid sort direction as that behavior is
inconsistent with every other TPA endpoint accepting a 'q'
parameter. If we think we should accept invalid sort directions, we
should make that change in the query module, but since the API is
largely used internally, I think it's fine to expect a valid
direction from callers.
…ced ExpandedLicenseCreator with a function

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
mrrajan force-pushed the TC-3747-license-query branch from b432ce6 to 03fbc7b on March 17, 2026 12:44
mrrajan requested a review from mrizzi March 17, 2026 12:45
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ctron added the backport release/0.4.z Backport (0.4.z) label Mar 17, 2026
mrrajan force-pushed the TC-3747-license-query branch from 03fbc7b to 0c5a54b on March 17, 2026 14:22
mrizzi (Contributor) left a comment

Everything I reported has been addressed.
Well done @mrrajan, thanks 👍

mrrajan added this pull request to the merge queue Mar 17, 2026
mrrajan (Contributor, Author) commented Mar 17, 2026

Thanks @mrizzi :)

Merged via the queue into guacsec:main with commit 43685c7 Mar 17, 2026
6 checks passed
mrrajan deleted the TC-3747-license-query branch March 17, 2026 15:57
github-project-automation bot moved this to Done in Trustify Mar 17, 2026
@trustify-ci-bot

Backport failed for release/0.4.z, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/0.4.z
git worktree add -d .worktree/backport-2287-to-release/0.4.z origin/release/0.4.z
cd .worktree/backport-2287-to-release/0.4.z
git switch --create backport-2287-to-release/0.4.z
git cherry-pick -x 49185eb8b05b06fc2baec157f14ce7e4f706a95d 3fa00869c5294b3cd81cbbb2550c1d09ae0ed6c0 e24f31863b4da0d0bfa4ea99a8207ecc9b90e236 bc100e47c7dbf8608e37269ef3b3f47059212a18 23c8277d6196f85da8ae1df4a4b846353049a450 a1cf05386d6d2394f71f4a0be589016f6cf0cb84 42752adfaf9ba89af15eb8da23ad9bbac8588c0e 2d650e8a34ddf8f6c80c8804a9052fa5682aac21 8d4632b56d416850e5189336c8cc0c9b92e3e487 43685c7578e2f053797cbc6ea96e9e3875dd69d4
