Skip to content

Optimize domainscrawl domain matching#452

Merged
CorentinB merged 11 commits intomainfrom
optimize-domains-crawl
Sep 3, 2025
Merged

Optimize domainscrawl domain matching#452
CorentinB merged 11 commits intomainfrom
optimize-domains-crawl

Conversation

@CorentinB
Copy link
Copy Markdown
Collaborator

@CorentinB CorentinB commented Aug 27, 2025

This pull request introduces significant improvements to the domainscrawl postprocessor package, focusing on more efficient domain matching. It doesn't touch the URLs/regex matching part.

We now use an Adaptive Radix Tree (ART) in addition to the map. The map is still used in priority for O(1) lookup of domains, but when we go to the second step to check for subdomain match we now use the ART with prefix-based search on the reversed hostname to allow for O(k) lookup where k is the domain length.

It also adds simple e2e testing for the domains crawl feature.

@CorentinB CorentinB self-assigned this Aug 27, 2025
@CorentinB CorentinB added the enhancement New feature or request label Aug 27, 2025

This comment was marked as outdated.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.72%. Comparing base (7ba600c) to head (1bddc50).
⚠️ Report is 85 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #452      +/-   ##
==========================================
+ Coverage   55.86%   56.72%   +0.85%     
==========================================
  Files         128      130       +2     
  Lines        8053     8136      +83     
==========================================
+ Hits         4499     4615     +116     
+ Misses       3182     3153      -29     
+ Partials      372      368       -4     
Flag Coverage Δ
e2etests 39.88% <53.46%> (+0.84%) ⬆️
unittests 30.23% <100.00%> (+0.91%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@CorentinB CorentinB requested a review from Copilot August 29, 2025 10:22

This comment was marked as outdated.

@CorentinB CorentinB changed the title Optimize domainscrawl Match Optimize domainscrawl domain matching Aug 29, 2025

This comment was marked as outdated.

@CorentinB CorentinB force-pushed the optimize-domains-crawl branch from dcf6cae to 1bddc50 Compare August 29, 2025 12:26
@CorentinB CorentinB requested a review from Copilot August 29, 2025 12:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request optimizes domain matching in the domainscrawl postprocessor by introducing an Adaptive Radix Tree (ART) for more efficient subdomain lookups. The optimization maintains O(1) exact domain matches via a map while providing O(k) subdomain matching where k is the domain length, replacing the previous O(n) iteration over all stored domains.

  • Replaces map-based domain storage with a hybrid ART + map approach
  • Implements efficient subdomain matching using reversed hostnames for prefix searches
  • Adds comprehensive test coverage and benchmarks for the new functionality

Reviewed Changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
internal/pkg/postprocessor/domainscrawl/tree.go Core ART implementation with domain insertion and matching logic
internal/pkg/postprocessor/domainscrawl/reversehost.go Host reversal utility for efficient prefix matching in the ART
internal/pkg/postprocessor/domainscrawl/domainscrawl.go Updated to use ART instead of map for domain storage and matching
internal/pkg/postprocessor/domainscrawl/*_test.go Comprehensive test coverage for ART, host reversal, and integration
e2e/test/domainscrawl/ End-to-end testing for domains crawl functionality
go.mod Added dependency on go-adaptive-radix-tree library

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread internal/pkg/postprocessor/domainscrawl/tree.go
Comment thread internal/pkg/postprocessor/domainscrawl/reversehost_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Comment thread internal/pkg/postprocessor/domainscrawl/domainscrawl_bench_test.go
Copy link
Copy Markdown
Member

@equals215 equals215 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread internal/pkg/postprocessor/domainscrawl/tree.go
Copy link
Copy Markdown
Collaborator

@NGTmeaty NGTmeaty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the improvements here!

"strings"
)

// reverseHost turns "www.google.com" -> "com.google.www".
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so SURT.... 😆

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I wrote that, I was like "wait, that seems familiar" 🤣

@CorentinB CorentinB merged commit 5282aef into main Sep 3, 2025
5 checks passed
@CorentinB CorentinB deleted the optimize-domains-crawl branch September 3, 2025 08:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants