
adding ip aware dedup#2463

Merged
dogancanbakir merged 1 commit into dev from 2014-ip-aware-duplicate-filter on Mar 25, 2026

Conversation


@Mzack9999 Mzack9999 commented Mar 22, 2026

Proposed changes

Close #2014

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced duplicate detection logic to account for content source, ensuring results from different hosts are properly distinguished rather than being incorrectly filtered as duplicates.
  • Tests

    • Added comprehensive test coverage for duplicate detection scenarios across various edge cases.

@auto-assign auto-assign bot requested a review from dogancanbakir March 22, 2026 12:59

neo-by-projectdiscovery-dev bot commented Mar 22, 2026

Neo - PR Security Review

No security issues found

Highlights

  • Adds IP-aware duplicate detection to improve deduplication accuracy
  • Stores simhash → IP list mappings in bounded cache (1000 entries)
  • Comprehensive test coverage for various IP/content combinations
Hardening Notes
  • Consider adding mutex protection around simHashes cache operations in runner.go:644-654 to prevent race conditions when multiple goroutines call duplicate() concurrently
  • Add monitoring for the simHashes cache growth to detect scenarios where many IPs return similar content, which could consume memory despite the 1000-entry cap
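The first hardening note — guarding the shared simHashes cache against concurrent `duplicate()` calls — can be illustrated with a minimal sketch. The `ipCache` type and its methods below are hypothetical stand-ins (the real runner uses a bounded gcache, not a plain map); the point is only the mutex around every read and write of the hash-to-IPs mapping:

```go
package main

import (
	"fmt"
	"sync"
)

// ipCache is a hypothetical stand-in for the simHashes cache in runner.go,
// guarded by a mutex so concurrent duplicate() calls cannot race on the map.
type ipCache struct {
	mu      sync.Mutex
	entries map[uint64][]string
}

func newIPCache() *ipCache {
	return &ipCache{entries: make(map[uint64][]string)}
}

// addIP records ip under hash; safe for concurrent use.
func (c *ipCache) addIP(hash uint64, ip string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[hash] = append(c.entries[hash], ip)
}

// ips returns a copy of the IPs stored for hash, so callers
// never hold a reference into the locked map's backing array.
func (c *ipCache) ips(hash uint64) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.entries[hash]...)
}

func main() {
	c := newIPCache()
	var wg sync.WaitGroup
	// 100 goroutines appending under the same hash, as concurrent
	// duplicate() calls would; without the mutex this is a data race.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c.addIP(42, fmt.Sprintf("10.0.0.%d", n))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.ips(42))) // 100
}
```

Running this under `go run -race` is a quick way to confirm the locking is sufficient; dropping the mutex makes the race detector fire immediately.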



coderabbitai bot commented Mar 22, 2026

Walkthrough

The duplicate detection logic in the runner was enhanced to track IP addresses associated with each content hash (simhash). Instead of treating identical content as duplicates regardless of source IP, the system now only deduplicates responses that originate from the same IP, allowing the same content from different IPs to be retained.

Changes

Cohort / File(s) Summary
Duplicate Detection Logic
runner/runner.go
Modified simHashes storage from gcache.Cache[uint64, struct{}] to gcache.Cache[uint64, []string] to track IPs per simhash. Reworked duplicate detection to be IP-aware: skips responses only if they match/near-match an existing simhash AND the IP is empty or already exists in the stored IP list. Appends new IPs to existing simhashes or initializes with current IP if no match found.
Deduplication Test Suite
runner/runner_test.go
Added comprehensive TestRunner_duplicate test suite covering: identical content on same IP (duplicate), identical content on different IPs (not duplicate), near-duplicates with IP differentiation, empty HostIP fallback behavior, and scale testing with 50 subdomains across same and different IPs.
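The behavior described above can be sketched in a self-contained form. Everything here is a simplified illustration, not the PR's actual code: `simHash` is a toy word-feature simhash standing in for the simhash library the runner uses, the `dedup` struct replaces the bounded gcache with a plain map, and the threshold of 3 differing bits mirrors the `> 3` comparison in the diff:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"slices"
	"strings"
)

// simHash is a minimal word-feature simhash: each word's FNV-64 hash
// votes on every bit position; the sign of the tally sets the output bit.
func simHash(text string) uint64 {
	var votes [64]int
	for _, w := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(w))
		f := h.Sum64()
		for i := 0; i < 64; i++ {
			if f&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var out uint64
	for i := 0; i < 64; i++ {
		if votes[i] > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// dedup mirrors the PR's cache shape: simhash -> IPs seen for that content.
type dedup struct {
	simHashes map[uint64][]string
}

// duplicate reports whether (body, ip) should be dropped. Identical or
// near-identical content from a new IP is kept and the IP recorded, which
// is the behavior this PR adds.
func (d *dedup) duplicate(body, ip string) bool {
	h := simHash(body)
	for storedHash, storedIPs := range d.simHashes {
		if bits.OnesCount64(storedHash^h) > 3 { // near-duplicate threshold
			continue
		}
		if ip == "" || slices.Contains(storedIPs, ip) {
			return true // content already seen from this IP (or IP unknown)
		}
		d.simHashes[storedHash] = append(storedIPs, ip)
		return false // same content, different IP: keep it
	}
	d.simHashes[h] = []string{ip} // first sighting of this content
	return false
}

func main() {
	d := &dedup{simHashes: map[uint64][]string{}}
	body := "HTTP/1.1 200 OK default server page"
	fmt.Println(d.duplicate(body, "1.1.1.1")) // false: first sighting
	fmt.Println(d.duplicate(body, "1.1.1.1")) // true: same content, same IP
	fmt.Println(d.duplicate(body, "2.2.2.2")) // false: same content, new IP
}
```

The three calls in `main` correspond to the first three test scenarios listed above: first sighting kept, same-IP repeat dropped, different-IP repeat retained.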

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A hash with an IP in tow,
Now tracks which servers come and go—
Same content, different IPs bright,
No longer lost in dedup's sight!
One server, one result, that's right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'adding ip aware dedup' directly describes the main change, implementing IP-aware deduplication logic in the duplicate detection function.
  • Linked Issues Check — ✅ Passed: the changes implement IP-aware deduplication as required by issue #2014, storing host IPs per simhash and only marking responses as duplicates when both content and IP match or the IP is empty.
  • Out of Scope Changes Check — ✅ Passed: all changes are directly scoped to implementing IP-aware deduplication, covering the detection logic and corresponding test coverage.


@Mzack9999 Mzack9999 self-assigned this Mar 22, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@runner/runner.go`:
- Around line 644-656: The loop over r.simHashes.GetALL(false) currently returns
or mutates on the first matching storedHash, causing nondeterministic dedupe
when multiple stored hashes are within threshold; change the logic in the block
that uses simhash.Compare, sliceutil.Contains, and r.simHashes.Set so you first
scan all stored entries: track whether any stored hash is within threshold and
whether any of those already contains ip, and only after the loop either: (a)
return true if any matching storedIPs already contained ip, or (b) attach ip to
an appropriate stored hash (or create a new entry for respSimHash) and return
false; remove the early r.simHashes.Set(...) and return inside the loop to
ensure all matches are considered before deciding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2beeeee2-d2a0-41d8-8c4b-2d196e0eb71c

📥 Commits

Reviewing files that changed from the base of the PR and between 9836829 and 7d765f4.

📒 Files selected for processing (2)
  • runner/runner.go
  • runner/runner_test.go

Comment on lines +644 to +656

	for storedHash, storedIPs := range r.simHashes.GetALL(false) {
		if simhash.Compare(storedHash, respSimHash) > 3 {
			continue
		}
		if ip == "" || sliceutil.Contains(storedIPs, ip) {
			gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
			return true
		}
		_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
		return false
	}
-	_ = r.simHashes.Set(respSimHash, struct{}{})
+	_ = r.simHashes.Set(respSimHash, []string{ip})

⚠️ Potential issue | 🟠 Major

Scan all matching simhashes before deciding this response is unique.

Because GetALL(false) is iterated as a map, the order here is not stable. If two stored hashes are both within the <= 3 threshold of respSimHash, Line 652 can attach ip to the first one seen and Line 653 returns false before a later matching hash that already contains ip is checked. That makes dedupe results nondeterministic and can leak duplicates.

💡 Proposed fix
 func (r *Runner) duplicate(result *Result) bool {
 	respSimHash := simhash.Simhash(simhash.NewWordFeatureSet(converstionutil.Bytes(result.Raw)))
 	ip := result.HostIP
 
 	for storedHash, storedIPs := range r.simHashes.GetALL(false) {
 		if simhash.Compare(storedHash, respSimHash) > 3 {
 			continue
 		}
 		if ip == "" || sliceutil.Contains(storedIPs, ip) {
 			gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
 			return true
 		}
-		_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
-		return false
 	}
 
-	_ = r.simHashes.Set(respSimHash, []string{ip})
+	if storedIPs, err := r.simHashes.GetIFPresent(respSimHash); err == nil {
+		_ = r.simHashes.Set(respSimHash, append(storedIPs, ip))
+	} else {
+		_ = r.simHashes.Set(respSimHash, []string{ip})
+	}
 	return false
 }
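The reviewer's concern — that attaching the IP to the first within-threshold hash and returning early can skip a later matching entry that already contains the IP — can be demonstrated with a small order-independent rewrite. This is a sketch of the suggested scan-all pattern, not the merged code; `store`, `near`, and the hash values are made up for illustration:

```go
package main

import (
	"fmt"
	"math/bits"
	"slices"
)

// store maps a simhash to the IPs it has been seen from.
type store map[uint64][]string

// near mirrors the <= 3 bit-difference threshold from the review comment.
func near(a, b uint64) bool { return bits.OnesCount64(a^b) <= 3 }

// duplicate examines every stored hash before deciding, so Go's randomized
// map iteration order cannot change the outcome. Mutation happens only
// after the full scan proves no matching entry already covers this IP.
func (s store) duplicate(h uint64, ip string) bool {
	var match uint64
	found := false
	for stored, ips := range s {
		if !near(stored, h) {
			continue
		}
		if ip == "" || slices.Contains(ips, ip) {
			return true // some near-duplicate already covers this IP
		}
		match, found = stored, true // remember a candidate, keep scanning
	}
	if found {
		s[match] = append(s[match], ip) // known content, new IP: record it
	} else {
		s[h] = []string{ip} // brand-new content
	}
	return false
}

func main() {
	// Two stored hashes, 0 and 1, are both within threshold of an
	// incoming hash of 0. With an early return, which entry the map
	// iterates first would decide the result; here it cannot.
	s := store{0: {"1.1.1.1"}, 1: {"2.2.2.2"}}
	fmt.Println(s.duplicate(0, "2.2.2.2")) // true, regardless of map order
}
```

With the early-return version from the diff, the same input would return false whenever the map happened to yield hash 0 first, which is exactly the leaked-duplicate scenario the comment describes.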

@dogancanbakir dogancanbakir merged commit 2cc5330 into dev Mar 25, 2026
16 checks passed
@dogancanbakir dogancanbakir deleted the 2014-ip-aware-duplicate-filter branch March 25, 2026 11:11

Development

Successfully merging this pull request may close these issues.

Duplicate filter feature ideas (-fd, -filter-duplicates)

2 participants